With every answer, search reshapes our worldview

If you asked Google “Did the Holocaust happen?” earlier this month, the search engine first directed you to Stormfront, a white supremacist organization that denies the genocide ever occurred.

After initially defending its algorithm, Google eventually tweaked it to bury the offending result. But the enduring lesson is a lot more complicated than a single errant search. Every time search engines change how they measure results, they change our ideas about how the world fits together – or at least the world as reflected in the information that stands in for it.

Precision or recall? Choose one

It was the late 1980s when I first encountered full-text search technology. I was working at an early electronic publishing software company, Interleaf, whose clients needed to be able to search the large document sets they were creating. The available search engines varied, but the metrics were the same. In addition to the obvious speed-per-page metric for returning results, there were the paired terms “precision” and “recall.” Precision quantifies how many false positives show up in the results list: If you search for conference papers about Oingo Boingo, does the engine also give you reviews of their albums and photos of their recording sessions? Worse, does it give you conference papers about Bingo and Bongos? There goes your precision!

Recall, on the other hand, measures how many appropriate pages your search did not find: Are there a hundred conference papers about Oingo Boingo in the index that do not show up anywhere in the results list? That’s going to drive down the engine’s recall score.
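For readers who like to see the arithmetic, here is a minimal sketch of the two metrics as they are conventionally defined: precision is the share of returned results that are relevant, and recall is the share of relevant documents that were returned. The function name and the document IDs are invented for illustration and do not come from any particular search engine.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: IDs the engine put in the results list
    relevant: IDs a human judged appropriate for the query
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually found

    # Precision falls as false positives crowd into the results list.
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall falls as appropriate documents go unfound.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for a search on Oingo Boingo conference papers.
relevant = {"paper1", "paper2", "paper3", "paper4"}
returned = ["paper1", "paper2", "album_review", "bingo_paper"]

precision, recall = precision_recall(returned, relevant)
print(precision, recall)  # 0.5 0.5
```

Two of the four results are false positives, so precision is 0.5. Two of the four appropriate papers never show up, so recall is 0.5 as well.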

The common wisdom was that if you optimized for one metric, you’d pay a price with the other: If you wanted to get every reference, you’d have to put up with some false positives, and if you wanted no false positives you wouldn’t be able to get every reference. This very real constraint showed us an information space — the set of searchable terms and their relationships — characterized by imprecision and unreliability because the space consisted of text that was created without any care for its findability when stirred into a cauldron with thousands of other texts. Our search engines tried to impose structure and find relationships using mainly unintentional clues. You therefore couldn’t rely on them to find everything that would be of help, and not because the information space was too large. Rather, it was because the space was created by us slovenly humans.

The rise of relevancy

By the mid to late 1990s, precision and recall were not enough. When you have millions and then billions of pages in your index, finding every instance of a term doesn’t much matter because users won’t know or care that your list of 100,000 hits really should have 100,001 entries. Likewise, precision becomes less important, so long as the false positives don’t show up in the first few pages of results, because, frankly, hardly anyone gets past those pages.

Instead, an existing term took on a new importance: relevancy. When a search finds 100,000 results, the user needs to be shown the most useful hits first. Of course, what’s useful depends on what the user is trying to do. And that makes judging relevancy a dark art practiced by mages and wizards. That modern Web search engines have gotten so good at it is a testament to the skill of their developers … as well as to how much data search engines have gathered about us.

When relevancy reigns, the information space shows itself as super-abundant and ambiguous. But the ambiguity has a different source than in the old precision-and-recall days, for it results not from the vice of slovenliness but from the collective human virtue of overloading language with rich and inextricably linked meanings. The information space is made up not of language-as-information but of language-as-poetry, that is, of words that are enriched by their layers of meanings and permeable borders.

But is it interesting?

As the body of the Web began to scale up from the millions to the billions and now conceivably to the trillions of pages — That’s a lot of typing, people! Good job! — relevancy began to suffer from the same problems of scale as precision and recall. When 100,000 pages are relevant, you need to provide another way of sorting. For example, after the photo-sharing site Flickr had been entrusted with its first couple of billion photos, it let us sort on interestingness. If you search for, say, “keys” at Flickr and sort by relevancy, you’ll see a collection of photos of keys. But if you sort by interestingness, you’ll see striking photos, many of which are not as clearly relevant to the search terms: sunset over Key Bridge, or a supporting column on a bridge with a gap that resembles a keyhole.

Flickr gauges interestingness by looking at a variety of factors, many of which are metrics of the community’s reaction to the photos. While Flickr doesn’t spell out its exact algorithms, a photo is judged to be more interesting if it gets lots of clicks and links, especially from people who are not in the photographer’s social circle, if it gets printed out more frequently, if lots of people leave comments, and so forth. The result is a set of photos that the community likes, with a lowered requirement for relevancy.
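Since Flickr does not publish its algorithm, what follows is only a toy sketch of the general idea: combine community-reaction signals into a score, then sort on that score instead of on relevancy. The field names, the weights, and the sample photos are all assumptions made up for this illustration and should not be read as Flickr's actual method.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    title: str
    relevancy: float     # assumed text-match score between 0 and 1
    outside_clicks: int  # clicks from people outside the photographer's circle
    inside_clicks: int   # clicks from the photographer's own contacts
    prints: int
    comments: int


def interestingness(photo):
    # Invented weights: attention from strangers counts for more than
    # attention from friends, echoing the factors described above.
    return (2.0 * photo.outside_clicks
            + 1.0 * photo.inside_clicks
            + 3.0 * photo.prints
            + 1.5 * photo.comments)


photos = [
    Photo("house keys on a table", 0.95, 10, 5, 0, 1),
    Photo("sunset over Key Bridge", 0.40, 300, 20, 12, 40),
]

by_relevancy = sorted(photos, key=lambda p: p.relevancy, reverse=True)
by_interestingness = sorted(photos, key=interestingness, reverse=True)

print(by_relevancy[0].title)
print(by_interestingness[0].title)
```

With these made-up numbers, the literal photo of keys wins on relevancy while the bridge at sunset wins on interestingness, which is the trade the two sort orders offer.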

If you’re simply trying to illustrate what a lock and key look like, sort by relevancy. If you’re using locks and keys as a metaphor for a false sense of security or for the dangers of a police state, go for interestingness.

Back in the 1980s, interestingness wouldn’t have been as useful because there wasn’t today’s super-abundance of content. But once we’re confident that we can find what’s relevant, we also want to find what is striking in its expression, or that is relevant in a non-literal meaning. We’ve always wanted that. We just didn’t know it because we couldn’t have it. Search engines in this way have become instruments of our non-literal use of language — a use that is more essential to language than mere literalism is.

Digging deeper

Now we’re hearing a cry for a fifth metric of searching. Serendipitous results are meant to do the opposite of what traditional searches have done, for they show you what you were not asking to see. But mere surprise isn’t enough. If you search for “key” and are shown pages about clown makeup or elephant toenails, the serendipity is unlikely to be useful because these have absolutely nothing to do with your search terms. To be usefully serendipitous, there should be a subtle but meaningful relationship. Perhaps you’ll be shown a page about Harry Houdini, or a paper about biological models of DNA based on lock and key relationships, or a feminist history of chastity belts. Serendipity requires an extended sense of relevancy that hits the mark between too literal and too far afield.

Search engines can deliver on serendipity now because they have more semantic information — information about meanings — to play with. Much of that information is being assembled into graphs in which relationships among ideas are analyzed and represented in terms of distances or degrees, as in six degrees of Kevin Bacon. Or we could just look lower in the relevancy stack, although that will give less meaningful results.
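One way to picture the graph idea is to treat "too literal" and "too far afield" as distances. The sketch below measures degrees of separation with a breadth-first search over a tiny, hand-made concept graph and keeps only the ideas that fall in a middle band. The graph, its edges, and the two-to-three-hop band are invented for illustration; a real semantic graph would be far larger and would likely weight its relationships.

```python
from collections import deque


def distances_from(graph, start):
    """Breadth-first search: number of hops from start to each reachable idea."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def serendipity_candidates(graph, start, min_hops=2, max_hops=3):
    """Ideas neither too literal (under min_hops) nor too far afield (over max_hops)."""
    dist = distances_from(graph, start)
    return {node for node, hops in dist.items() if min_hops <= hops <= max_hops}


# Hypothetical relationships among ideas, stored as an undirected graph.
edges = [
    ("key", "lock"),
    ("key", "keyhole"),
    ("lock", "escape artistry"),
    ("escape artistry", "Harry Houdini"),
    ("lock", "lock-and-key model"),
    ("lock-and-key model", "DNA binding"),
    ("clown makeup", "circus"),  # not connected to "key" at all
]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

candidates = serendipity_candidates(graph, "key")
print(sorted(candidates))
```

In this toy graph, "lock" and "keyhole" are dropped as too literal, "clown makeup" is never reached, and Houdini and the lock-and-key model land in the band in between.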

Serendipity turns noise into signal: Results that had been filtered out now get filtered in. Serendipity signals that we rejoice in living in a world that we cannot fully know.

The future of search

More recently, especially because of the prominence of fake news, there’s been an increasing demand for two additional ways of searching.

The first is for serendipitous results to counteract the effect of echo chambers and filter bubbles that only show us what we already believe. If this is to be effective, it will require yet another type of search result: not just serendipity but results that are just different enough from what we believe that we’ll read them, understand them, and perhaps be nudged by them. Librarians are often superb at making this sort of suggestion, but machine learning could also get good enough at the task.

“Just different enough” searches would reveal our information space as being a more human space than ever. If the early search engines revealed us humans as slovenly, disorganized scavengers lacking the discipline that information management requires, and if in the ages of interestingness and serendipity we looked like contributors to an undisciplined super-abundance of meaningful connections, the call for content that pierces our echo chambers recognizes that our new information space does not consist only of information. Rather, it reflects the biases and evil inclinations of human thought, from outright racism and sexism to the subtler ways that privilege distorts our views.

That has led us to a sort criterion that has been surprisingly lacking: truth, or what Google quite reasonably referred to as quality when it lowered the ranking of Holocaust-denial sites. Sorting by truth-quality has shown up late because when full-text indexing began, the information space had been manually curated. When the Web began to scale up, search engines like Google optimistically assumed that analyzing the Web community’s use of pages, particularly the network of links, would be a sufficient guide to quality. But, thanks to expert gamers of the system and the possibility that the Crowd isn’t as Wise as we’d hoped, filtering by truth-quality yields more reliable results than ranking by usage.

And so our information space is being revealed not as a jumble of words and phrases sorted by algorithms but as an instrument of power that reflects our biases, assumptions, ambitions, and blindnesses. Like the Internet itself, bit by bit, search technology is revealing us in our fullness.