Wednesday, March 12, 2008

Correcting for the Human Factor in Movie Ratings

A recent Wired article, This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize, is about Gavin Potter, a retired management consultant who is singlehandedly yet effectively competing against corporate and academic research teams in the top tier of the Netflix Prize.

(The Netflix Prize is a $1 million challenge to anyone who can exceed the performance of Netflix’s movie-recommendation algorithm by 10%. Netflix provides a big database of its users’ movie ratings as grist for the contestants’ mills. It also provides a means to test contestants’ predicted ratings against users’ actual ratings, thus measuring accuracy. Although a 10% improvement may not sound like much, I’ve previously discussed why it is not easy.)
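For readers who want the scoring made concrete: the contest measures prediction error with root-mean-square error (RMSE), and the 10% target is relative to the RMSE of Netflix's own Cinematch algorithm. Here is a minimal sketch in Python, using made-up numbers rather than the actual contest figures:

    import numpy as np

    def rmse(predicted, actual):
        """Root-mean-square error between predicted and actual ratings."""
        predicted = np.asarray(predicted, dtype=float)
        actual = np.asarray(actual, dtype=float)
        return np.sqrt(np.mean((predicted - actual) ** 2))

    def percent_improvement(candidate_rmse, baseline_rmse):
        """Relative reduction in RMSE versus the baseline, as a percentage."""
        return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse

    # Illustrative numbers only -- not the actual contest figures.
    baseline = 0.95    # hypothetical RMSE for Netflix's own algorithm
    candidate = 0.88   # hypothetical RMSE for a contestant's predictions
    print(f"Improvement over baseline: {percent_improvement(candidate, baseline):.1f}%")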

The leading research teams are each exploring variations of statistical/machine-learning approaches, looking for new refinements to relatively well-understood algorithms. While Potter no doubt uses one or more of the standard algorithms, he has apparently gotten a long way with few resources by correcting for well-known behavioral quirks that affect how people rate things. As he puts it, “The fact that these ratings were made by humans seems to me to be an important piece of information that should be and needs to be used.”

The article provides an example:

One such phenomenon is the anchoring effect, a problem endemic to any numerical rating scheme. If a customer watches three movies in a row that merit four stars — say, the Star Wars trilogy — and then sees one that’s a bit better — say, Blade Runner — they’ll likely give the last movie five stars. But if they started the week with one-star stinkers like the Star Wars prequels, Blade Runner might get only a 4 or even a 3. Anchoring suggests that rating systems need to take account of inertia — a user who has recently given a lot of above-average ratings is likely to continue to do so. Potter finds precisely this phenomenon in the Netflix data; and by being aware of it, he’s able to account for its biasing effects and thus more accurately pin down users’ true tastes.
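To make that inertia idea concrete, here is a toy sketch of one way a modeler might encode it. This is my own illustration, not Potter's actual method: it discounts each rating by the user's recent drift above or below their long-run average before handing it to a recommender.

    from collections import deque

    def inertia_adjust(ratings, user_mean, window=3, strength=1.0):
        """
        Toy sketch of an inertia correction (not Potter's actual method).
        If a user's last few ratings sit above their long-run average,
        assume the next rating is anchored high and pull it down by that
        drift (and vice versa). The adjusted values are internal signals
        for a model, so they need not stay on the 1-to-5 scale.
        """
        recent = deque(maxlen=window)
        adjusted = []
        for r in ratings:
            drift = (sum(recent) / len(recent) - user_mean) if recent else 0.0
            adjusted.append(r - strength * drift)
            recent.append(r)
        return adjusted

    # A user whose long-run average is 3.5 rates three 4-star movies,
    # then gives 5 stars to a film that is only a little better:
    print(inertia_adjust([4, 4, 4, 5], user_mean=3.5))
    # -> [4.0, 3.5, 3.5, 4.5]  (the final 5 is read as roughly 4.5)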

Admirably, the article goes on to consider the obvious pushback:

Couldn’t a pure statistician have also observed the inertia in the ratings? Of course. But there are infinitely many biases, patterns, and anomalies to fish for. And in almost every case, the number-cruncher wouldn’t turn up anything. A psychologist, however, can suggest to the statisticians where to point their high-powered mathematical instruments. “It cuts out dead ends,” Potter says.

Potter’s approach reminds me of ELIZA, a computer program from the 1960s that used simple psychological tricks to impersonate a human—for example, repeating someone’s statement back as a question (“My boyfriend made me come here.” “Why did your boyfriend make you come here?”). Although ELIZA did not know what it was talking about, it often did better at engaging people than far more sophisticated programs that actually tried to understand and respond to what was being said.

While I’m not suggesting that Potter’s work is the algorithmic sleight-of-hand that ELIZA was, he is nevertheless tapping the same success factor: exploiting the humanness of the humans in the system. Not only does it work, but in a contest like the Netflix Prize it is particularly effective, because the other leading contestants apparently were not doing it.
