Bayesian Averaging in SAS
Here's a hypothetical situation: let's say you've got a list of movies that you want to rank on a website or in a report, and you have user-submitted ratings for them. Some movies are more popular than others, so your data looks like this:
data ratings;
input name $ rating;
datalines;
Lincoln 9
Lincoln 8
Lincoln 9
Amour 9
Argo 5
Argo 10
;
run;
Obs | name | rating |
---|---|---|
1 | Lincoln | 9 |
2 | Lincoln | 8 |
3 | Lincoln | 9 |
4 | Amour | 9 |
5 | Argo | 5 |
6 | Argo | 10 |
The easiest thing to do would be to calculate an average rating for each movie like this:
proc sql;
/* plain average per movie; group by already gives one row per name */
select name, avg(rating) as average format=5.2
from ratings
group by name
order by average desc;
quit;
name | average |
---|---|
Amour | 9.00 |
Lincoln | 8.67 |
Argo | 7.50 |
But hey! That's not cool. Amour wins because its average rating is 9, but that average rests on a single rating. Maybe we want to consider Lincoln better, because three people rated it highly. A good way to deal with this is to take a Bayesian average instead.
This means we're going to add some "dummy" votes to each movie, each one equal to the average of all the ratings in the data set. How many to add (C) is a judgement call: the more we add, the harder we make it for an obscure movie to land near the top. Likewise, if a movie's first rating is low, the dummy votes keep it from suddenly dropping to the bottom of the list. If we expect thousands of ratings for each movie, C=1000 might be appropriate. In this example, I use a small C of 10.
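Written as a formula, the dummy-vote idea looks roughly like this. The symbols S (sum of a movie's ratings), n (its number of ratings), and m (the average of all ratings) are my own shorthand, not anything SAS defines:

```latex
\text{b\_average} = \frac{S + C \cdot m}{n + C}
\qquad\text{e.g. Lincoln: }
\frac{26 + 10 \cdot 8.33}{3 + 10} \approx 8.41,
\quad\text{Amour: }
\frac{9 + 10 \cdot 8.33}{1 + 10} \approx 8.39
```

Here m = 50/6 ≈ 8.33, the mean of the six ratings in the sample data.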
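proc sql;
/* overall average of all ratings, stored in the macro variable &average */
select avg(rating) into :average
from ratings;
/* add 10 dummy votes at the overall average to each movie */
select
name,
(sum(rating) + &average * 10) / (count(*) + 10) as b_average format=5.2
from ratings
group by name
order by b_average desc;
quit;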
name | b_average |
---|---|
Lincoln | 8.41 |
Amour | 8.39 |
Argo | 8.19 |
And look! Lincoln is back on top, since its Bayesian average reflects both how many ratings it has and what those ratings are.
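Since C is a judgement call, it can be handy to keep it in one place so you can try different values. Here is a minimal sketch of that, assuming the ratings data set from above. The macro variable name C and the extra n_ratings column are my own additions for illustration:

```sas
%let C = 10;  /* hypothetical macro variable: number of dummy votes per movie */

proc sql noprint;
  /* overall average of all ratings, stored in &average */
  select avg(rating) into :average trimmed
  from ratings;
quit;

proc sql;
  /* plain and Bayesian averages side by side, best Bayesian average first */
  select name,
         count(*) as n_ratings,
         avg(rating) as average format=5.2,
         (sum(rating) + &average * &C) / (count(*) + &C) as b_average format=5.2
  from ratings
  group by name
  order by b_average desc;
quit;
```

Changing the %let line and rerunning shows how strongly the dummy votes pull each movie toward the overall average.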