Thursday, June 21, 2012

Stylistic Signals

As Omnivark trawls the Web for new, great writing, it has two distinct tasks. First, where does it find the candidates—the articles, essays, the blog posts—that might be great writing? My previous post, Following the Elites, was about this challenge.

Second, once Omnivark has a set of candidates, how does it know which few are great? For example, given an entire issue of The New Yorker, what is the best thing in it?

The New Yorker’s editor might say it’s all great. And different readers will surely have different opinions of what’s best. So to clarify: In this case best means most like the structure and style of other great reads. (The other great reads were classified as such by a human expert.)

Note that we are comparing texts’ forms, not their topics. So, given a great read about a boar-hunting congressman, Omnivark will try to find more pieces that are written like that, as opposed to more pieces about boar-hunting congressmen.

This is an important distinction. Most text-analytics systems do topic-matching (find more boar-hunting congressmen). Omnivark is about style-matching. Omnivark will measure a new piece of writing against the characteristics of great writing that Omnivark has already modeled. Those characteristics include statistical, semantic, and structural properties of the text. Some examples:

  • Simple statistical properties include the text’s total number of words, the average numer of words per sentence, and the average number of sentences per paragraph. These simple metrics are better for filtering-out the bad than discerning the best among the good. However, more complex metrics (such as the ratio of nouns to adjectives) resonate with certain writing styles.

  • Semantic properties refer to the meanings of the words used. This is tricky because we want to capture how word choices correlate with style but not with topic. We don’t care that boar appears a lot in the boar-hunting piece; we do care about the artful usage of certain adjectives, adverbs, and other flavoring words, the use of which makes the prose more expressive.

  • Structual properties include how sentences and paragraphs are put together. For example, the use of balanced or parallel phrases is an indicator of expressive writing, as is the use of similes and metaphors. Detecting these structures in a general way is hard.

In the world of search engines like Google, these properties are called signals. Omnivark’s job is to know the signals that best predict great writing. As an extra twist, because great writing takes different forms, Omnivark needs to employ different configurations of signals.

Behind the scenes, I built a tool that makes exploring for signals relatively easy. A new signal can be tested in real time on a set of training texts diverse in style and quality.

For me, this exploration for stylistic signals is the most interesting part of creating Omnivark. Having taught writing, I have reasonably good instincts for prose quality. However, knowing it when you see it is different from generalizing that knowledge into a computer. In practice, it’s easy to identify signals that find great writing but also find a lot of mediocre writing too. It is much harder to find the signals that cleanly discern the best from the rest.

No comments:

Post a Comment