One of the problems with statistics is that they work really well when you have perfect data (and therefore don't really need to do statistics), but start falling apart when the real world rears its ugly head and gives you data that isn't all smooth. Consider a very specific case: you have items that people can rate, and then you want to pull out the "favorite" items based on those ratings. As a more concrete example, say you're Netflix and, based on a person's movie ratings (from 1 to 5 stars), you want to identify their favorite actors (piggybacking on the assumption that movies they like probably have actors they like).
This is a simple answer to derive: just average the ratings of every movie the actor was in, and whichever actors have the highest averages are the favorites. Here it is expressed in SQL:
```sql
select actor.name, avg(rating.stars) as avgRating
from actor
    inner join movie_actor on movie_actor.actorId = actor.id
    inner join movie on movie_actor.movieId = movie.id
    inner join rating on movie.id = rating.movieId
where rating.subscriberId = ? -- the ID of the subscriber whose favorite actors you want
group by actor.name
order by avgRating desc
```
The problem is that, to take an example, Tom Hanks was in both Sleepless in Seattle and Saving Private Ryan. Clearly those two movies appeal to different audiences, and it seems very reasonable that someone who saw both would like one far more than the other, regardless of whether or not they like Tom Hanks. The next problem is that if they've only seen one of those movies, that single rating is going to paint an unfair picture of Tom Hanks' appeal. So how can we solve this?
The short answer is that we can't. To truly solve it, we'd have to synthesize the missing data points, which obviously isn't possible. However, we can make a guess based on other data points that we do have. In particular, we know the user's average rating across all of their rated movies, so we can bias "small" actor samples towards that overall average. This helps blunt the dramatic effect of outliers in small samples, where there aren't enough other data points to balance them out.
In other words, instead of the plain per-actor average:

    average = sum(stars) / samples

we can do something like this:

    correctedAverage = average + (overallAverage - average) / 1.15^samples

where samples is the number of ratings covering that actor's movies, and overallAverage is the subscriber's average rating across everything they've rated.
This simply takes the normal average from above and "scoots" it towards the overall average, based on how many samples we have. The denominator is a constant I picked (more on that later), raised to a power equal to the number of samples; that way, as the number of samples goes up, the magnitude of the correction falls off rapidly.
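In SQL terms, the earlier query might be extended along these lines. This is only a sketch against the same made-up schema, with the subscriber's overall average pulled in via a subquery and the 1.15 factor written in directly:

```sql
select
    actor.name,
    avg(rating.stars)
        + (overall.avgStars - avg(rating.stars)) / power(1.15, count(rating.stars))
        as corrAvgRating
from actor
    inner join movie_actor on movie_actor.actorId = actor.id
    inner join movie on movie_actor.movieId = movie.id
    inner join rating on movie.id = rating.movieId
    cross join (
        -- the subscriber's overall average across every movie they've rated
        select avg(stars) as avgStars
        from rating
        where subscriberId = ? -- same subscriber as in the outer query
    ) as overall
where rating.subscriberId = ?
group by actor.name, overall.avgStars
order by corrAvgRating desc
```

As the per-actor rating count climbs, the power() term blows up and the correction contributes almost nothing, which is exactly the falloff described next.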
Here's a chart illustrating how quickly that happens (the x axis is a log scale): with only one sample, the per-actor average will be scooted 87% of the way towards the overall average. With four samples the correction will be only 57%, and by the time you get to 32 samples there will be only a 1% shift. Note that those percentages are fractions of the distance to the overall average, not absolute changes. So if a one-sample actor happens to be only 0.5 stars away from the overall average, the net correction will be about 0.435 stars; if a different one-sample actor is 1.5 stars away from the overall average, the net correction will be about 1.305 stars.
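If you want to sanity-check those numbers, the corrected fraction is just 1 / 1.15^samples, which is easy to tabulate (the query below assumes PostgreSQL's generate_series; other databases have equivalents):

```sql
-- Fraction of the distance to the overall average that gets corrected away,
-- for sample counts 1 through 32.
select samples,
       1.0 / power(1.15, samples) as fractionTowardOverall
from generate_series(1, 32) as samples;
-- samples = 1  -> ~0.87
-- samples = 4  -> ~0.57
-- samples = 32 -> ~0.01
```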
Of course, I'm not Netflix, so my data was from PotD, but the concept is identical. The 1.15 factor was derived from testing against the PotD dataset, and it demonstrated an appropriate falloff as the sample size increased. Here's a sample of the data, showing both uncorrected and corrected average ratings, along with pre- and post-correction rankings:
Model | Samples | Average | Corr. Average | Rank | Corr. Rank |
---|---|---|---|---|---|
#566 | 22 | 4.1818 | 4.1310 | 46 | 1 |
#375 | 12 | 4.1667 | 3.9640 | 47 | 2 |
#404 | 13 | 4.0000 | 3.8509 | 81 | 3 |
#1044 | 7 | 4.2857 | 3.8334 | 44 | 4 |
#564 | 5 | 4.4000 | 3.7450 | 42 | 5 |
#33 | 32 | 3.7500 | 3.7424 | 176 | 6 |
#954 | 4 | 4.5000 | 3.6895 | 40 | 7 |
#733 | 4 | 4.5000 | 3.6895 | 39 | 8 |
#330 | 7 | 4.0000 | 3.6551 | 74 | 9 |
#293 | 5 | 4.2000 | 3.6444 | 45 | 10 |
In particular, model #33 sees a huge jump upward because of the number of samples. You can't see it here, but the top 37 models using the simple average are all models with a single sample (a 5-star rating), which is obviously not a real indicator. Their corrected average is 3.3391, so not far off the leaderboard, but appreciably lower than those who have consistently received high ratings.
For datasets of a different size (both overall and in the expected number of ratings per actor/model), the factor will need to be adjusted. It must remain strictly greater than one; it is theoretically unbounded on the other end, but there is obviously a practical limit.
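To get a feel for how the choice of factor changes the behavior, the same kind of tabulation can compare a few candidates side by side (again PostgreSQL-style syntax, and the 1.05 and 1.50 values are just illustrative):

```sql
-- A factor barely above 1 keeps correcting heavily even at large sample counts,
-- while a large factor makes the correction vanish after only a handful of ratings.
select samples,
       1.0 / power(1.05, samples) as factor_1_05,
       1.0 / power(1.15, samples) as factor_1_15,
       1.0 / power(1.50, samples) as factor_1_50
from generate_series(1, 32) as samples;
```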
Is this a good correction? Hard to say. It seems to work reasonably well with my PotD dataset (both as a whole, and segmented various ways), and it makes reasonable logical sense too. The point really is that if you don't care about correctness, you can do some interesting fudging of your data to help it be useful in ways that it couldn't otherwise be.