This is a pretty cool feature, and as far as I know it’s unique to SAS and Perl. In SAS, you can have a substr to the left of an assignment.

stringDate='29MAR1984';
substr(stringDate,3,3)='FEB';

This avoids potentially awkward situations like this:

stringDate='29MAR1984';
stringDate=substr(stringDate,1,2)||'FEB'||substr(stringDate,6,4);

The bounds don’t have to be static. Lets suppose we have a dataset with 1 row per letter, and we want to combine those letters into a word.

data have;
    input letter $1.;
    datalines;
h
e
l
l
o
;
run;

My initial thought was to just retain and append each letter:

data want;
  length word $200;
  retain word;
  set have nobs=nobs;
  call cats(word,letter);
  if(nobs=_n_) then output;
run;

However this breaks if one of the letters is a space, and SAS treats the space as null, so appending a space has no effect on the string. Instead of appending, just stick the letter where it’s supposed to be.

data want;
  length word $200;
  retain word ' ';
  set have nobs=nobs;
  substr(word,_n_,1) = letter;
  if(_n_==nobs) then output;
run;

Officially SAS does not support Ubuntu Linux as a platform, in spite of it being the most popular desktop distribution. It will run, but there are two things you need to do:

First, reconfigure Ubuntu not to use dash. The SAS Deployment Wizard will freeze when you first click “Next” during the install if you don’t do this.

sudo dpkg-reconfigure dash

That being done, you should be able to install SAS Foundation.

Second, SAS requires libjpeg to run. When you first run sas, you’ll get this error:

ERROR:  BRIDGE FAILURE - ERROR LOADING IMAGE 
        MODULE: sasmotifsasvsub HyS SUBSYSTEM: 8 SLOT: 11 
ERROR: Could not load /usr/local/SASHome/SASFoundation/9.4/sasexe/sasmotif (35 images loaded) 
ERROR: libjpeg.so.62: cannot open shared object file: No such file or directory 
ERROR:  BRIDGE FAILURE - ERROR LOADING IMAGE 
        MODULE: sasmotifsasvsub HyS SUBSYSTEM: 8 SLOT: 11

Install this by running this command:

sudo apt-get install libjpeg62

SAS should now run.

Additionally, while not common, but if you have SAS/Secure installed and want to pull files via https, you’ll need to point it at SSL CA certificates. You can kick off SAS with

-SSLCALISTLOC="/etc/ssl/certs/ca-certificates.crt"

Or just create a file ~/sasv9.cfg with that in it.

Update: This works on 13.10 and 14.04

I thought this was neat and quirky. The implications of it are important, too. When you assign a macro variable in SAS, it doesn’t resolve immediately. It resolves It will also warn you if the variable doesn’t exist yet, too.

%let x = &y;
%let y = z;
%put &x;

Will output

z

Even though, when x was defined, y did not exist. We’ll get the following warning, which can be ignored.

WARNING: Apparent symbolic reference Y not resolved.

As far as I can tell (I could be wrong!), I don’t think there’s a way of turning off that warning.

Out of the box SAS 9.4 is not configured to use SSL on Linux x64. If you try to do this, you’ll get the error

ERROR: SSL Error: Missing CA trust list

This happens because when SAS gets the SSL certificate from the server, it sees that it’s signed by a Certificate Authority, but it doesn’t have the SSL certificates from those authorities. Those come with OpenSSL, so get that.

sudo apt-get install openssl

With that installed, the SSL certificates should be in /etc/ssl/certs, with ca-certificates.crt being an uber-cert that contains all of the other files in one file. To tell SAS to look at that file, add this to SASHOME/SASFoundation/9.4/sasv9_local.cfg:

-SSLCALISTLOC="/etc/ssl/certs/ca-certificates.crt"

You’ll probably need sudo to modify that file, but you probably had that if you installed OpenSSL. But maybe your admin installed it, he just didn’t know to configure it, and he’s out on holiday because he’s European. You can put that config option in ~/sasv9.cfg instead, creating it if it doesn’t exist.

NOTE: This was tested on Ubuntu Linux x64 13.10. The location of the cert file will probably differ from distro to distro and version to version. Just look for the SSL CA cert directory, and find the biggest file in it, that’s probably the uber-cert.

If you stick a file named autoexec.sas in the directory where SAS is installed, it will run automatically when SAS starts up. By default, this place is C:\Program Files\SASHome\SASFoundation\9.3. This has worked since at least SAS 8.2, probably before.

I like to use my Windows desktop as a temporary staging area, so I have my autoexec.sas automatically assign a “DESKTOP” library with this code:

%let USERPROFILE=%sysget(USERPROFILE);
libname desktop "&USERPROFILE\Desktop\";

So when SAS starts up, the library is assigned and my log says this:

NOTE: Copyright (c) 2002-2010 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.3 (TS1M2)
      Licensed to Microsoft Windows for x64 All Compatible Non-Plann, Site  XXXXXXX.
NOTE: This session is executing on the X64_7PRO  platform.
 
 
 
NOTE: Enhanced analytical products:
 
SAS/STAT 12.1, SAS/ETS 12.1, SAS/OR 12.2, SAS/IML 12.1, SAS/QC 12.1
 
NOTE: SAS initialization used:
      real time           0.65 seconds
      cpu time            0.63 seconds
 
 
NOTE: AUTOEXEC processing beginning; file is C:\Program
      Files\SASHome\SASFoundation\9.3\autoexec.sas.
 
NOTE: Libref DESKTOP was successfully assigned as follows:
      Engine:        V9
      Physical Name: C:\Users\philihp\Desktop
 
NOTE: AUTOEXEC processing completed.

I had a problem where I needed the center longitude and latitude of each US county. SAS comes with some datasets containing map data in the format of tables that trace a line around the borders of counties. I was interested in the MAPSGFK.US_COUNTIES dataset, which has data that looks like this:

The State and County numbers were FIPS codes, which were also stored in the US_COUNTIES_ATTR table, so those are easy enough to figure out. Segment seemed to be always 1. Resolution and Density were numbers that I figured had to do with how much detail the drawing would have if these points were used in a line drawing.

The longitude and latitude here are all points along the border. The simplest, and most incorrect way of getting the center of the county would be to take an average of them, which would be fine for rectangular counties, but for territories where one side is flat (and has few waypoints) and one side is jagged (such as a coastal border), the midpoint will be weighted to that side. So instead, use geometry.

In geometry, there’s a formula for the Centroid of a polygon. This is a magic formula that I just trust is correct, because everything on Wikipedia is true, without exception. Here was my first pass at an implementation of it in SAS:

proc sort data=mapsgfk.us_counties out=centroids;
  by state county;
run;
data centroids(keep=cx cy county state);
  retain yi yj xi xj a cx cy x0 y0 0;
  set centroids(keep=state county lat long rename=(lat=yj long=xj));
  by state county;
  if(first.county) then do;
    cx = 0;
    cy = 0;
    a = 0;
    x0 = xj;
    y0 = yj;
  end;
  else if(not first.county) then do;
    ta = (xi*yj - xj*yi);
    cx + ((xi+xj)*ta);
    cy + ((yi+yj)*ta);
    a + ta;
  end;
  if(last.county) then do;
    ta = (xj*y0 - x0*yj);
    cx + ((xj+x0)*ta);
    cy + ((yj+y0)*ta);
    a  = ta + a * 0.5;
    cx = cx / (6*a);
    cy = cy / (6*a);
    output;
  end;
  xi = xj;
  yi = yj;
run;

Here, xi and yi are always the coordinates from the previous point, and “if not first.county” prevents processing of the first point of a county because it didn’t have a previous point. The variables a, cx, and cy accumulate for every point. The variable ta is the area of the rectangle defined by the two points. Once all the points have been accumulated, the area is halfed so it’s the actual area, and the area is used in calculating the centroid (cx and cy).

But then I noticed the 2nd county was positioned somewhere in the mid-Atlantic. Something had to be up. Looking closer, this was Baldwin County, which had an island. Looking at the source data, this segment was drawn in two segments, which caused the centroid formula to choke, as it assumed contiguous shapes. So instead, in my second pass, I compute the centroid of every segment, then average them weighted by their areas.

proc sort data=mapsgfk.us_counties out=centroids;
  by state county segment;
run;
data centroids_temp;
  retain yi yj xi xj a cx cy x0 y0 0;
  set centroids(keep=state county segment lat long rename=(lat=yj long=xj));
  by state county segment;
  if(first.segment) then do;
    cx = 0;
    cy = 0;
    a = 0;
    x0 = xj;
    y0 = yj;
  end;
  else if(not first.segment) then do;
    ta = (xi*yj - xj*yi);
    cx + ((xi+xj)*ta);
    cy + ((yi+yj)*ta);
    a + ta;
  end;
  if(last.segment) then do;
    ta = (xj*y0 - x0*yj);
    cx + ((xj+x0)*ta);
    cy + ((yj+y0)*ta);
    a  = ta + a * 0.5;
    cx = cx / (6*a);
    cy = cy / (6*a);
    output;
  end;
  xi = xj;
  yi = yj;
run;
proc sql;
  create table centroid_weight as
  select
    state, county, sum(a) as sum
  from centroids_temp
    group by state, county;
quit;
proc sql;
  create table centroids as
  select a.state, a.county,
         sum(cx*(a / sum)) as lat,
         sum(cy*(a / sum)) as long
    from centroids_temp a
    inner join centroid_weight b
      on (a.state = b.state and a.county = b.county)
    group by a.state, a.county;
quit;
proc sql;
  drop table centroids_temp;
  drop table centroid_weight;
quit;

Here, I add two more steps. The first proc SQL block sums up the total area of each county, which is used in the second block to find the average centroid of each county’s centroids, weighted by the total area that centroid represents, which gives us correct centroids for counties drawn in multiple segments.

Hypothetical situation, lets say you’ve got a list of movies that you want to rank in a website or a report or something, and you have user-submitted ratings for them, but some are more popular than others, so your data looks like this:

  data ratings;
    input name $ rating;
    datalines;
  Lincoln 9
  Lincoln 8
  Lincoln 9
  Amour 9
  Argo 5
  Argo 10
  ;
  run;
Obs name rating
1 Lincoln 9
2 Lincoln 8
3 Lincoln 9
4 Amour 9
5 Argo 5
6 Argo 10

The easiest thing to do would be to calculate an average rating for each movie like this:

  proc sql;
    select distinct name, avg(rating) as average
      from ratings
      group by name
      order by average desc;
  run;
name average
Amour 9.00
Lincoln 8.67
Argo 7.50

But hey! That’s not cool. It looks like Amour wins, because its average rating is 9. Maybe we want to consider Lincoln as better because 3 people think it’s very high. A good way to deal with this is by instead taking a Bayesian Average.

This means we’re going to add in some “dummy” votes for each movie, who give each movie the average rating a movie gets. How many (C) is a judgement call, the more we add, the harder we make it for an obscure movie to be near the top. Likewise, if a movie’s first rating is low, it keeps it from suddenly dropping to the bottom of the list. If we expect thousands of ratings for each movie, a C=1000 might be appropriate. In this example, I use a small C of 10.

  proc sql;
    select avg(rating) into :average
      from ratings;
    select distinct
        name,
        (sum(rating) + &average * 10) / (count(*) + 10) as b_average
      from ratings
      group by name
      order by b_average;
  quit;
name b_average
Lincoln 8.41
Amour 8.39
Argo 8.19

And look! Lincoln is back on top, since its bayesian average more closely reflects a product of the number of ratings it has and what those ratings are.

I was asked recently in a Q&A, “how to select the last 5 years worth of data from a table in SAS?” One way of doing it is by selecting the data with a Proc SQL query similar to the following, and something similar could also be done in a data step:

PROC SQL;
  create table mywindow as
  select ...myvariables...
    from mylib.mytable
    where year(mydate) between year(today())-5 and year(today());
QUIT;

Caveat emptor, this is really only going to include the last 4 complete years and whatever portion of the current year is in the dataset. This is because year() truncates out the month and day of the date.

Of course, this is just one way to do it. By using the intnx function, more advanced intervals can be done.

SAS Documentation:

Below is a graph and table showing the exact odds of a dealer’s end-hand given his show card. One thing to keep in mind when playing: if the dealer is dealt a show-card of an ace and you’re being asked to hit or to stand, you know that his hole card is not a ten. If it were, he’d have a blackjack, and you’d have already lost.

This graph was computed with the following rules:

  • 6-deck shoe
  • dealer hits on a soft 17
  • zero-high-card-count (a fresh shoe)
  • No charlie rule.

These are the most common house variations in low stakes games. In higher stakes you usually see house rules that favor the house less. The data was computed with this simulator. It was rendered in SAS.

Show Card 17 18 19 20 21 Blackjack Bust
10 3.44% 3.44% 3.44% 10.46% 1.07% 2.37% 6.54%
2 1.00% 1.05% 1.01% 0.97% 0.92% 0.00% 2.74%
3 0.97% 1.01% 0.97% 0.94% 0.90% 0.00% 2.90%
4 0.94% 0.96% 0.94% 0.91% 0.87% 0.00% 3.07%
5 0.91% 0.95% 0.91% 0.86% 0.84% 0.00% 3.23%
6 0.89% 0.88% 0.88% 0.85% 0.81% 0.00% 3.38%
7 2.84% 1.06% 0.60% 0.61% 0.57% 0.00% 2.01%
8 0.99% 2.77% 0.99% 0.53% 0.53% 0.00% 1.87%
9 0.93% 0.90% 2.71% 0.93% 0.47% 0.00% 1.76%
A 0.44% 1.10% 1.10% 1.10% 0.51% 2.37% 1.07%

Interesting fact: The rarest dealer-hand is getting seven aces for soft 17, and then hitting and getting a 5 for a hard 12, and then getting 5 more aces for a hard 17. The odds of this are slightly less than 0.00000000000001509%.