Thursday, June 21, 2012

Stylistic Signals

As Omnivark trawls the Web for new, great writing, it has two distinct tasks. First, where does it find the candidates—the articles, essays, and blog posts—that might be great writing? My previous post, Following the Elites, was about this challenge.

Second, once Omnivark has a set of candidates, how does it know which few are great? For example, given an entire issue of The New Yorker, what is the best thing in it?

The New Yorker’s editor might say it’s all great. And different readers will surely have different opinions of what’s best. So to clarify: In this case best means most like the structure and style of other great reads. (The other great reads were classified as such by a human expert.)

Note that we are comparing texts’ forms, not their topics. So, given a great read about a boar-hunting congressman, Omnivark will try to find more pieces that are written like that, as opposed to more pieces about boar-hunting congressmen.

This is an important distinction. Most text-analytics systems do topic-matching (find more boar-hunting congressmen). Omnivark is about style-matching: it measures a new piece of writing against the characteristics of great writing that it has already modeled. Those characteristics include statistical, semantic, and structural properties of the text. Some examples, followed by a rough code sketch:

  • Simple statistical properties include the text’s total number of words, the average number of words per sentence, and the average number of sentences per paragraph. These simple metrics are better for filtering out the bad than discerning the best among the good. However, more complex metrics (such as the ratio of nouns to adjectives) correlate with certain writing styles.

  • Semantic properties refer to the meanings of the words used. This is tricky because we want to capture how word choices correlate with style but not with topic. We don’t care that boar appears a lot in the boar-hunting piece; we do care about the artful use of certain adjectives, adverbs, and other flavoring words that make the prose more expressive.

  • Structural properties include how sentences and paragraphs are put together. For example, the use of balanced or parallel phrases is an indicator of expressive writing, as is the use of similes and metaphors. Detecting these structures in a general way is hard.
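
To make these signals concrete, here is a minimal sketch in Python. It is illustrative only: the word and sentence arithmetic is straightforward, but the adverb and simile detectors are deliberately crude stand-ins for the harder, general versions described above.

```python
import re

def statistical_signals(text):
    """Simple statistics: total words, words per sentence, sentences per paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    words = text.split()
    return {
        "total_words": len(words),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "sentences_per_paragraph": len(sentences) / max(len(paragraphs), 1),
    }

def flavor_word_ratio(text):
    """Semantic stand-in: the share of -ly words, a naive proxy for adverbs."""
    words = text.split()
    flavored = [w for w in words if w.lower().rstrip(".,;:!?").endswith("ly")]
    return len(flavored) / max(len(words), 1)

SIMILE_PATTERN = re.compile(r"\b(?:like an? |as \w+ as )", re.IGNORECASE)

def simile_count(text):
    """Structural stand-in: count phrases that often mark similes."""
    return len(SIMILE_PATTERN.findall(text))
```

A real system would use part-of-speech tagging rather than the -ly heuristic, but the shape of the computation is the same.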

In the world of search engines like Google, these properties are called signals. Omnivark’s job is to know the signals that best predict great writing. As an extra twist, because great writing takes different forms, Omnivark needs to employ different configurations of signals.
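
To illustrate what a “configuration” might look like, here is one plausible shape: the same signals, weighted differently per form. The forms, weights, and signal names below are invented for illustration, not taken from Omnivark.

```python
# Hypothetical per-form weightings; the forms and numbers are invented.
# Assumes each signal has been normalized to a comparable 0-to-1 scale.
SIGNAL_CONFIGS = {
    "personal_essay":   {"flavor_word_ratio": 0.5, "simile_count": 0.3, "words_per_sentence": 0.2},
    "reported_feature": {"flavor_word_ratio": 0.3, "simile_count": 0.3, "words_per_sentence": 0.4},
}

def style_score(signals, form):
    """Combine normalized signal values into one score for the given form."""
    weights = SIGNAL_CONFIGS[form]
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())
```

The point is only the shape: one set of signal values, several ways to weigh them.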

Behind the scenes, I built a tool that makes exploring for signals relatively easy. A new signal can be tested in real time on a set of training texts diverse in style and quality.
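
The tool itself isn’t shown here, but its core loop is easy to picture: run a candidate signal over hand-labeled training texts and measure how cleanly it separates the great from the rest. A minimal sketch, in which the signal function, labels, and threshold are all assumptions:

```python
def evaluate_signal(signal_fn, labeled_texts, threshold):
    """Score one candidate signal against hand-labeled training texts.

    labeled_texts: (text, is_great) pairs, labeled by a human expert.
    Treats "signal value above threshold" as a prediction of greatness
    and reports precision and recall for that prediction.
    """
    tp = fp = fn = 0
    for text, is_great in labeled_texts:
        predicted_great = signal_fn(text) > threshold
        if predicted_great and is_great:
            tp += 1  # correctly flagged a great read
        elif predicted_great:
            fp += 1  # flagged a mediocre text as great
        elif is_great:
            fn += 1  # missed a great read
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return precision, recall
```

For example, evaluate_signal(simile_count, training_set, threshold=3) would test the simile heuristic from the earlier sketch against a hypothetical training_set.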

For me, this exploration for stylistic signals is the most interesting part of creating Omnivark. Having taught writing, I have reasonably good instincts for prose quality. However, knowing it when you see it is different from generalizing that knowledge into something a computer can apply. In practice, it’s easy to identify signals that find most of the great writing but also sweep in a lot of mediocre writing (high recall, low precision). It is much harder to find the signals that cleanly discern the best from the rest.

Wednesday, June 13, 2012

Following the Elites

In a perfect world, Omnivark’s software would read everything published on the Web each day, then pick the best three “great reads.” That perfect world is not available. But can we find a more practical path to the same results?

With Omnivark, I’ve explored several approaches. In this post, I will focus on the most obvious and, it turns out, cost-effective: embrace elitism. By that I mean track the top publications where the top writers appear. You can argue whether the list of publications should be 20 or 200 long, but either way it’s nothing compared to the millions of other entities—minor publications, blogs, Tumblrs, Quora postings, and such—that comprise “everything.”
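
Mechanically, following a whitelist like this is cheap. Here is a minimal sketch of the polling side, assuming the publications expose RSS feeds and using the feedparser library; the URLs are placeholders, not Omnivark’s actual source list.

```python
import feedparser  # third-party RSS/Atom parser: pip install feedparser

# Hypothetical whitelist; the real list might be 20 or 200 feeds long.
ELITE_FEEDS = [
    "http://example.com/nytimes.rss",    # placeholder URLs
    "http://example.com/newyorker.rss",
]

def fetch_candidates():
    """Pull the current candidate articles from the elite whitelist."""
    candidates = []
    for url in ELITE_FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            candidates.append({"title": entry.title, "link": entry.link})
    return candidates
```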

The Atlantic Wire’s “Five Best Columns” daily newsletter exemplifies this approach. It appears to draw from a short list of usual suspects: The New York Times, The Washington Post, and a handful of other top newspapers and highbrow magazines/Websites. The results are quite good.

With Omnivark, I use a much wider array of inputs, and the algorithms ignore a piece’s source. (In a similar vein, by intentionally omitting the source publication’s name from the preview quotes, the Omnivark site encourages readers to judge them by their quality, not by where they come from.)

Still, Omnivark ends up with a lot of material from that same group of usual suspects. The reason is that, true to their reputations, these are venues where superb writing appears in volume. This combination of quality and quantity is hard to beat.

As support, consider Longreads, a crowdsourced site that highlights new, long-form nonfiction. Anybody can nominate a piece from anywhere, usually via the Twitter hashtag #longreads. But despite the potentially wide spectrum of nominations, the site’s official picks are still mostly from elite publications.

I doubt the Longreads editors are suppressing non-elite stuff; if anything, I suspect they welcome the chance to boost something obscure yet worthy. But I also suspect most of the (non-spammy) nominations are for pieces in elite publications because of the quantity/quality reason above.

Plus, when nominations are an open process, another factor helps the more popular, elite publications like The New York Times or The New Yorker. They have thousands of times more readers (and Twitter followers) than smaller publications or independent bloggers. So if a piece of the same quality appears in a typical blog and in The New Yorker, the New Yorker piece will have thousands of times more potential nominators.

All this goes to say that curating just from the elite publications is a good bang-for-buck strategy. It exploits the concentration of high-quality material in relatively few places.

And if you want to take it a step further but keep the bang-for-buck efficiency, you can also track the elite writers directly, such as by following them on Twitter. That way, you can catch their work outside the elites without needing to trawl for it generally. Byliner.com seems to take this approach, as well as commissioning its own pieces.

In theory, an additional benefit of following elite writers is that they can recommend good stuff by other writers. In practice, it works a little, but writers in elite publications often just recommend other stuff in elite publications. Perhaps an apt analogy is with Major League Baseball players, who can talk all day about other MLB players but don’t think as much about what’s happening in the minor leagues.

Of course, this just makes me want to focus more on writing’s equivalent of the minor leagues—the non-elite venues where good stuff lurks deeper and more dispersed. However, if the goal is to surface great writing, today’s lesson is that much of it is already near the surface, in the elite publications where it’s expected to be. Distilling the best of that best is valuable, as the Atlantic Wire’s newsletter and Longreads show. The open question is, how much extra value is there in plumbing the depths further?