Many Web sites let users rate stuff: products, content, even people. For example, based on user ratings, you might see that Product A has three stars and Product B has four. Yet behind such ratings there is always a methodology. And as we’ve seen with election systems, depending on how you count, the same set of user preferences can lead to different outcomes.
Here is an example about rating products from blogger Evan Miller’s How Not to Sort by Average Rating:
Average rating works fine if you always have a ton of ratings, but suppose item 1 has 2 positive ratings and 0 negative ratings. Suppose item 2 has 100 positive ratings and 1 negative rating. This algorithm puts item two (tons of positive ratings) below item one (very few positive ratings).
Miller then shows a screen-capture from Amazon.com. The first product has a single rating, which happens to be five stars. That product is ranked ahead of a product with 4.5 stars across 580 ratings. If you were evaluating the two products, and that’s all the information you had, which would you suspect is better?
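To make the failure concrete, here is a small sketch in Python that ranks two hypothetical items, with counts mirroring Miller’s example, by the naive positive-rating fraction:

```python
# Hypothetical items mirroring Miller's example (not real Amazon data).
items = [
    {"name": "Item 1", "positive": 2, "negative": 0},
    {"name": "Item 2", "positive": 100, "negative": 1},
]

def positive_fraction(item):
    """Naive 'average rating': share of ratings that are positive."""
    total = item["positive"] + item["negative"]
    return item["positive"] / total if total else 0.0

# Rank best-first by the naive score.
for item in sorted(items, key=positive_fraction, reverse=True):
    print(item["name"], round(positive_fraction(item), 3))

# Item 1 (1.0) ranks ahead of Item 2 (~0.99),
# even though Item 2 has fifty times as much evidence behind it.
```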
As an alternative, Miller suggests using a statistical technique that factors in the number of ratings as well as their magnitude. I’d prefer that technique, or something like it. However, it has a cost: explaining it to users is a lot harder than explaining a basic average.
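For reference, the score Miller’s essay recommends is the lower bound of the Wilson score confidence interval on the positive fraction. A minimal sketch of that formula, with the z-value for roughly 95% confidence hard-coded, might look like this:

```python
import math

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for the true
    positive-rating fraction, at roughly 95% confidence."""
    if total == 0:
        return 0.0
    phat = positive / total
    denom = 1 + z * z / total
    centre = phat + z * z / (2 * total)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - margin) / denom

# With this score, the heavily reviewed item wins:
print(round(wilson_lower_bound(2, 2), 3))      # ~0.342
print(round(wilson_lower_bound(100, 101), 3))  # ~0.946
```

The score rewards items only as their evidence accumulates, which is exactly what a simple average fails to do; but a shopper looking at “0.946” has little hope of reconstructing where it came from.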
In Amazon.com’s case, a user can see each product’s individual reviews and ratings, so there is value in making it obvious how those individual ratings roll up into the overall rating. Indeed, Amazon.com’s solution is to show the overall rating in stars and, next to it, the number of ratings in parentheses. The user can then weigh both the number and the magnitude of each product’s ratings when comparing products. This approach puts more responsibility on the user, but it keeps the situation easy to understand.
At the end of the day, Amazon.com’s solution may be best for its users, because displaying the two numbers together exposes the system’s key weakness whenever it occurs, inviting users to compensate as they see fit. In contrast, there are numerous statistical methods, of which Miller proposed one, that could improve the rankings when only a single aggregate rating is desired. The problem is that different methods will produce different rankings under some conditions, and only a small number of specialists would understand why.
The larger point is, aggregated ratings tend to imply objectivity that is not fully there. While aggregating many people’s ratings will lead to a more objective assessment than a single person’s rating, the process of aggregation has its own subjectivity. In other words, we see once again that the voice of the people is subject to which amplifier you use.