Wednesday, December 24, 2008

CNET Content Solutions (and Me)

You probably know CNET, but do you know CNET Content Solutions? Indirectly, you do, because CNET Content Solutions’ content appears not only on CNET but also on CDW, Dell, Insight, MSN Shopping, Yahoo! Shopping, and hundreds of other sites around the world.

CNET Content Solutions’ content is product data: detailed specs, images, descriptions, and related-product links. This is the stuff you see on product pages and comparisons all over the Web. Without it, e-commerce sites would be empty shells.

CNET Content Solutions’ database comprises more than 3 million computer and consumer-electronics products in 35 markets and 15 languages. If you’re an e-commerce site, having manufacturers spray this information at you from all sides is not an answer. The scale of data is a problem, but the show-stopper is rampant inconsistency in the content provided, specs included/omitted, and terminology used. It’s the e-commerce Tower of Babel.

And thus the opportunity for a win-win: CNET Content Solutions does the heavy lifting of acquiring, normalizing, and internationalizing a world’s worth of product data; each customer pays a small fraction of the total cost to get the full benefit.

Put another way, in an era of infinite shelf space, CNET Content Solutions allows sites to keep the cost of merchandising that space under control.

I entered the picture at the end of 2004, when CNET acquired a company I co-founded, ExactChoice. We specialized in creating software applications that did analytics and mining of complex product data. Now, as CNET Content Solutions’ Analytic Products Group, we have the largest product-data operation in the world as our foundation.

The Web site has carried forward ExactChoice’s showcase application, a personalized computer recommender. Meanwhile, as VP of Analytic Products for CNET Content Solutions, I am in charge of defining and executing products that add value to the existing database of detailed product information. Intelligent Cross-Sell is the first such product.

[This post is a revision of a post from September 18, 2005, “CNET Channel and Me.” In late 2008, CNET Channel changed its name to CNET Content Solutions.]

Saturday, December 20, 2008

Understanding 0.02 Parts per Billion

Some measures are so extreme that the numbers are hard to grasp. For example:

The chemical that provides the dominant flavor of bell pepper can be tasted in amounts as low as 0.02 parts per billion.

A fraction of one-billionth? Perhaps an analogy would help.

One drop is sufficient to add flavor to five average-size swimming pools.
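
As a rough sanity check on that analogy, here is a minimal sketch of the arithmetic. The drop volume (0.05 mL) is my assumption, not a figure from the source, and so is treating the threshold as a simple volume ratio with the drop being pure flavor compound.

```python
# Rough check of the "one drop in five swimming pools" analogy.
# THRESHOLD comes from the quote; DROP_ML is an assumption, and treating
# the threshold as a volume ratio of pure compound is also an assumption.
THRESHOLD = 0.02e-9   # 0.02 parts per billion
DROP_ML = 0.05        # assumed volume of one drop, in milliliters
POOLS = 5             # number of pools in the analogy


def water_flavored_liters(drop_ml: float, threshold: float) -> float:
    """Liters of water one drop can bring up to the threshold concentration."""
    return drop_ml / threshold / 1000.0  # convert mL to L


total_liters = water_flavored_liters(DROP_ML, THRESHOLD)
per_pool_liters = total_liters / POOLS
print(f"{total_liters:,.0f} L total, {per_pool_liters:,.0f} L per pool")
```

Under these assumptions, one drop stretches to roughly 2.5 million liters, or about 500,000 liters for each of the five pools. A smaller or larger assumed drop changes the pool size proportionally.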


[The quote is from Eric Schlosser’s Fast Food Nation, excerpted in The Atlantic magazine. The original link on The Atlantic’s site is dead, so here is an alternate. For another variant of this theme, see How Big Was That Squid?]

Sunday, December 14, 2008

Tidal Waves Are Not Tidal

I had not thought about it before, but tidal waves are not tidal.

Tidal forces, which drive the ocean’s regular rise and fall, come from the gravity of the sun and moon. This explains why the cycle of high and low tides is regular: the orbits and rotations of the sun/earth/moon system are regular.

So, tidal wave should just mean a normal wave. But most people understand tidal wave to mean a gigantic wave, a freak of nature.

The problem is, such waves have nothing to do with tides. They are caused by sudden displacements of ocean water due to earthquakes, volcanoes, or other major disruptions. Thus, scientists prefer the term tsunami to describe one or more massive waves caused by an irregular event.

I had always assumed that tidal wave and tsunami were either equivalent or subtle variants. Now I know: tidal wave is just a big misnomer. Thanks to Jacqueline for enlightening me.

Saturday, December 6, 2008

Review: Ian Ayres’ Super Crunchers

“You must be a Super Cruncher!”

Over the past year, several people have told me that. They had read Ian Ayres’ book Super Crunchers, and whatever he was talking about, that must be what I do.

From the first time I heard it, I disliked the term “Super Cruncher.” If it referred to a machine that rendered cars into pellets, that would be fine. But as a description of someone who does data mining and analytics, it doesn’t work for me unless that someone is a comic-book character.

So I avoided the book despite numerous accusations of my involvement with Super Crunchery. Yet when I eventually bought the paperback, Super Crunchers won me over.

For the general reader, it provides an engaging tour of real-world, data-driven decisions and their increasing effect in business, medicine, and government. For example:

  • An algorithm that successfully predicts the best wines of a given year before they are even shipped
  • A Web site, Farecast, that not only shows current fares but advises when to buy based on predictions of whether a fare will go up or down
  • A medical campaign to save 100,000 lives from six improved procedures, derived from a statistical analysis of hospital-related mortality data
  • An analysis of whether longer prison sentences affect whether prisoners commit crimes after their release

Such examples appear throughout the book. Ayres organizes them into a larger story about how the science of data-driven decisions works. Of course he covers how predictions can be made from existing data, but he also highlights the value of generating data specifically to answer questions. For example, “Instead of being satisfied with a historical analysis of consumer behavior, CapOne proactively intervenes in the market by running randomized experiments. In 2006, it ran more than 28,000 experiments—28,000 tests of new products, new advertising approaches, and new contract terms.”

Although he only discusses it near the end of the book, Ayres rightly raises the risk of people overrelying on algorithms. That is, much data mining occurs on data that is noisy, incomplete, inadvertently biased during collection, and otherwise on the edge of a garbage-in/garbage-out scenario. Algorithms, and the applications thereof, can have flaws. Even randomized trials can ask the wrong questions, sample the wrong audience, or otherwise put an unseen tilt on reality.

In this context, Ayres makes the point that human intuition and expertise will always have a role in sanity-checking results, as well as framing the questions to ask and choosing the methodologies for answering them. In fact, I’d agree strongly with Ayres’ statement, “The future belongs to the Super Cruncher who can work back and forth and back between his intuitions and numbers.”

Finally, regarding the book’s title that I dislike so much, I still dislike it as a label for people who do data mining and analytics. But it appears to be an effective title for selling books. To his credit, Ayres tested the title against an alternate, using a Google text-ad campaign. “Super Crunchers” got 63% more clickthroughs than the alternate, “The End of Intuition.” Although I could quibble with the quality or number of alternatives in Ayres’ test, the fact that the book made the New York Times Business Bestseller list suggests the title did its part of the job.

So I’d say Super Crunchers deserves its success, and I’d recommend it without hesitation to the general reader.

That said, even if judged by the standards of a largely anecdotal and nontechnical book, Super Crunchers will no doubt catch flak from some in the data-mining field. Most obviously, many of Ayres’ examples do not qualify for his definition of Super Crunching, which involves “really big” data sets. He never specifically quantifies “really big,” but I can all but guarantee that the wine-prediction example above, as well as many of the social-science examples in the book, are based on a few thousand to perhaps tens of thousands of records. By today’s standards, those are nowhere near “really big” data sets.

In addition, one could argue that Ayres’ coverage of randomized trials is actually about the opposite of Super Crunching. That is, the point of most randomized trials is to generate a clean and relatively small data set that answers a question. Done right, there is little need for sophisticated crunching.

Yet these objections just reinforce that Super Crunchers and Super Crunching are somewhat misleading labels for what the book is actually about. Although this will irritate insiders, I suspect general readers won’t care about the semantic distinctions but will benefit from the wider coverage beyond large-scale data mining.


Monday, December 1, 2008

Art Auctions’ Self-Serving Numbers

Writing in The Wall Street Journal, Lee Rosenbaum explains how the numbers reported for art auctions have a twist:

Contrary to what you might expect, press accounts, relying on the information released by the auction houses, don’t normally measure a sale’s success by comparing an object’s hammer price — the last amount announced by the auctioneer — with the presale estimate of hammer price. Instead, they almost invariably compare the estimate of hammer price to a figure arrived at by adding hammer price to the commission that the auction house charges the buyer.

The result is an apples-to-oranges comparison that makes the sale results look better than they actually are, because they’ve been inflated by the commission....For example, Bloomberg News reported that a work by Abstract Expressionist Arshile Gorky, one of 16 drawings consigned to Christie’s by Richard S. Fuld Jr., chief executive of the failed Lehman Brothers Holdings, and his wife, sold last Wednesday “for $2.2 million, at the low estimate.” But its hammer price was, in fact, $1.9 million — $300,000 below the low estimate of hammer price. Only after the auction house’s commission was added did the price reach the predicted amount.

Of course, the auction houses prefer media coverage about auctions that exceed the predicted outcomes. Accordingly, the auction houses’ numbers serve that interest, especially when reporters pass them along as unqualified facts.
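
To make the apples-to-oranges gap concrete, here is a small sketch using only the rounded figures from the Gorky example in the quote. The function and variable names are mine, and because the quoted amounts are rounded to the nearest $0.1 million, the percentages are approximate.

```python
# Figures from the quoted Gorky example, in US dollars (rounded as reported).
LOW_ESTIMATE = 2_200_000    # presale low estimate of hammer price
HAMMER_PRICE = 1_900_000    # last amount announced by the auctioneer
REPORTED_PRICE = 2_200_000  # hammer price plus the buyer's commission


def vs_estimate(price: int, estimate: int) -> float:
    """Return price relative to estimate as a signed fraction (0.0 = on estimate)."""
    return (price - estimate) / estimate


commission = REPORTED_PRICE - HAMMER_PRICE
like_for_like = vs_estimate(HAMMER_PRICE, LOW_ESTIMATE)  # hammer vs. hammer estimate
as_reported = vs_estimate(REPORTED_PRICE, LOW_ESTIMATE)  # commission-inflated comparison

print(f"Implied commission: ${commission:,}")
print(f"Hammer vs. low estimate: {like_for_like:+.1%}")
print(f"Reported vs. low estimate: {as_reported:+.1%}")
```

Compared like for like, the drawing came in about 14% under its low estimate. Compared the way it was reported, it landed exactly on the estimate.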

Kudos to Rosenbaum for challenging that form of business as usual.