Van Morrison, Crank and Google Scholar

In a Guardian article dated Saturday, July 8, 2006, Pico Iyer talks about how Google and other search engines have distorted the literary interview. He describes how interviewers prepare by researching their subjects online, and how search results tend to artificially highlight interesting but effectively trivial information about the interviewee. He recounts how he once, in some long-since-forgotten interview, mentioned Van Morrison as an influence on his work, and how almost every interviewer since has found this tidbit and incorporated it into their own interview. This repeated citation of the same fact has served only to exaggerate the actual importance of Van Morrison to the author’s work. Of course, as these interviews also go online, the problem only gets worse. His Guardian article will make things worse. This blog entry will make things worse. Pico Iyer and Van Morrison are becoming forever entwined.

This is just one of many examples of the peculiar side-effects of Google’s page ranking algorithms. In Google Scholar (GS), researchers can find both of GS’s ranking algorithms frustrating. The first one, based largely on the number of citations an article receives (a more scholarly version of PageRank), has the annoying habit of listing the best-known articles at the top of search results. While this might be a great default behavior for a casual user or a student, it is sometimes irritating to the specialist researcher, who presumably already knows the most important articles in their field. GS’s alternative is to list the articles in reverse chronological order, which effectively strips out any pretense of “importance.” I’m sure Google will eventually fix these GS eccentricities and introduce a ranking based on “citation velocity” or some other metric that effectively mixes currency and influence. In the meantime, Google and Google Scholar have become a sort of network-effect methamphetamine.

As we get used to the peculiarities of the Internet, we subconsciously adjust our use of it accordingly. I remember in the late 1990s a colleague showed me a site that he had recently started to consult for statistics and data of some sort. I glanced at the site and, though it looked official enough, I almost immediately told my colleague that I thought the site was bogus and that he’d better be deeply skeptical of its contents. Eventually he confirmed that the content on the site was utter bilge, and he came to ask me how I had guessed that it would be. I looked at the site again and tried to figure out what had tipped me off. As I said, the site itself looked official, and my assessment certainly wasn’t based on the data (the nature of which I’ve since forgotten, but which I certainly wasn’t qualified to assess), but something about it had made me uneasy. After a few puzzling minutes I realized what had made me suspicious: there was a tilde (~) in the URL. For those who never knew or have since forgotten, a tilde in a URL is a good indication that the URL in question is pointing to some individual’s private home directory on a *NIX-based machine. The URL “www.somewellknownorg.com/~ted/index.html” might look like official content from “somewellknownorg”, when it is actually pointing to the home directory of somebody named “Ted” who happens to have an account on the somewellknownorg machine. One doesn’t often see such URLs these days, but back then they were fairly common. Somehow I had managed to subconsciously learn that a tilde in the URL should make me pause, and since that incident I’ve confirmed with some of my geekier friends that they too had developed this unarticulated heuristic for determining the relative “authority” of content. We probably all have other such URL-based heuristics. I doubt many people trust URLs that have IP addresses in them. And we each have a notion of the relative trustworthiness of domain name endings (.COM, .CO.UK, .EDU, .NET, .RL), though we may not be actively aware of it.
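
For the curious, the two mechanical checks described above (a tilde in the path, an IP address where a domain name should be) can be written down in a few lines of Python. This is only a rough sketch of the heuristic, not a real trust test: the function name url_warning_signs and the wording of the warnings are invented for illustration, and the domain-ending judgment is left out because it is too personal to encode.

```python
import ipaddress
from urllib.parse import urlsplit


def url_warning_signs(url):
    """Return a list of reasons a URL might deserve a second look.

    Hypothetical helper: the name and the warning texts are illustrative only.
    """
    # urlsplit only recognizes the host when a scheme or "//" is present,
    # so prefix bare URLs such as "www.example.com/~ted/".
    if "//" not in url:
        url = "//" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    warnings = []

    # A path segment that starts with "~" usually maps to a user's
    # personal home directory on a *NIX web server.
    if any(segment.startswith("~") for segment in parts.path.split("/")):
        warnings.append("tilde in path (personal home directory)")

    # ip_address() raises ValueError for anything that is not a literal
    # IPv4 or IPv6 address, which is the normal case for a domain name.
    try:
        ipaddress.ip_address(host)
        warnings.append("IP address instead of a domain name")
    except ValueError:
        pass

    return warnings
```

Called on “www.somewellknownorg.com/~ted/index.html”, the function reports the tilde; called on a URL such as “http://192.0.2.10/stats.html” (a made-up address from the range reserved for documentation), it reports the IP address. An empty list means only that neither warning sign was found, not that the content can be trusted.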

A conversation at a recent conference made me realize that I’ve started to develop heuristics for dealing with the distorting effects of search engines. A colleague casually mentioned that he no longer looks at the first few search results returned by Google. He found the first three or four results to be of generally lower quality than those a little lower down in the result set. As soon as he said this, I realized that I had been doing the same thing for the past year or so. I find myself “starting” to look at Google search results about one third of the way down the page, skipping the first several results. Like my colleague, I’ve found that the first results seem to have an oddly distorted relevance ranking. I suspect that this is a side-effect of PageRank. Items that are more “interesting” filter to the top, and “interesting” is not quite the same as “accurate”, “thorough” or “authoritative”. This, of course, is what Pico Iyer has encountered as he has become inextricably linked to Van Morrison.