I was invited to speak at an internal Wiley-Blackwell seminar. Before the seminar, they interviewed me for their Publishing News. I can’t find a copy of the interview online anymore, so I have reproduced it below.
I made several predictions in this interview, and it’s interesting to see which ones played out.
As I review this (in 2024), I’m reminded of how insistent certain researchers were that publishers should make their articles easily available for text mining so that we could analyze the literature at scale. Of course, now we see researchers aghast that giant, well-funded AI companies have done precisely that to feed their LLMs.
And of course, back then, the term “Web 3.0” hadn’t yet been hijacked by CryptoBros.
Q: Looking beyond CrossRef, we’re delighted (Geoff) that you’ve agreed to be a speaker at one of our Executive Seminars again this year. In the past you’ve entertained our audience with a short history of the web and speculated on the role of social media. What are the three key messages about the web for the broader community involved in scholarly publishing in 2009?
A: We are in the scholarly communications business, not just the scholarly publishing business. “Publishing” articles and monographs might have been the most efficient and trustworthy methods for publicly communicating scholarly findings and for keeping a permanent record of those findings, but we cannot assume that this will continue to be the case. Modern tools like web-accessible databases, blogs, wikis, streaming audio/video and social citation services will continue to become increasingly important channels for scholarly communication. Whatever we do, we should not dismiss the new channels because they are not as reliable or authoritative as traditionally published content. It is precisely the fact that they are currently less reliable and trustworthy that represents such an opportunity for our industry. This leads to my second point.
Since the 1970s, Carol Tenopir and Donald King have documented the trend that researchers are reading more articles but spending less time reading each article. This trend is, by itself, unsustainable, and it is being exacerbated by the proliferation of the above-mentioned new communication channels that researchers are increasingly going to feel obliged to follow.
What this means, as my former colleague Allen Renear likes to point out, is that researchers spend the bulk of their time practicing “reading avoidance.” That is, researchers are engaged in a titanic struggle to figure out what they can safely ignore so that they can, in turn, focus their limited time and energy on reading what truly matters. This leads me to my third and final point.
Scholarly publishing [sic] is essentially about trustworthy communication. How can we help researchers invest their time wisely? The successful publisher of the future will be able to say to the researcher: “I can save you a large percentage of your time by helping you filter out what is irrelevant and focus only on what you need to do your research”.
This role will only become more relevant and more valuable as the amount of information on the web continues to explode, and this is the fundamental driving force behind much of the interest in technologies like text mining and the semantic web (aka Web 3.0).
Q: Publishers, Wiley-Blackwell included, are exploring a few different avenues in innovating with online content – one of these is text mining. What’s your sense of how this might be used and what impact it might have?
A: “Text mining” is a perfect example of researchers practicing “reading avoidance.” The basic idea behind text mining is that the researcher might be able to get the computer to extract and distill key data from the narrative in articles and books. The hope is that, once data has been extracted from a large enough collection of content, researchers will be able to “data mine” the results in order to discover (excuse the Rumsfeldian phrase) “previously unknown knowns”. In other words, to find trends and correlations in the literature that would have been impossible or impractical to discover simply by reading the articles. In essence, “text mining” is really just an information extraction process in support of “data mining”. The irony here, of course, is that when a researcher writes an article, they are essentially turning their data into human-readable form, and now we have those very same researchers hoping to turn the human-readable article back into data. This is perverse, but it conveniently leads into the next question…
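To make those two steps concrete before moving on, here is a minimal sketch in Python. The corpus, the entity list, and the extraction pattern are all invented for illustration (real systems use dictionaries, ontologies and serious NLP), but the shape of the pipeline is the same: extract structured mentions from the narrative, then mine the extracted data for trends.

```python
import re
from collections import Counter
from itertools import combinations

# A toy corpus standing in for full-text articles (entirely invented data).
articles = [
    "We observed that aspirin inhibits COX-1 and COX-2 in vitro.",
    "Ibuprofen, like aspirin, inhibits COX-2 but with different kinetics.",
    "COX-2 expression was elevated; aspirin treatment reduced it.",
]

# Step 1, "text mining": extract structured mentions of known entities
# from the narrative. A regex over a hand-picked entity list is the
# simplest possible stand-in for real information extraction.
ENTITIES = re.compile(r"\b(aspirin|ibuprofen|COX-1|COX-2)\b", re.IGNORECASE)
extracted = [{m.lower() for m in ENTITIES.findall(text)} for text in articles]

# Step 2, "data mining": look for trends in the extracted data, here the
# simplest possible signal: which entities tend to co-occur in articles.
pair_counts = Counter()
for mentions in extracted:
    for pair in combinations(sorted(mentions), 2):
        pair_counts[pair] += 1

for (a, b), count in pair_counts.most_common(3):
    print(f"{a} + {b}: co-occur in {count} article(s)")
```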
Q: What does Web 3.0 mean for you and what should it mean for us?
A: The move from “Web 1.0” to “Web 2.0” is often described as the move from the “read-only web” to the “read-write web.” That is, whereas early web tools made it very difficult for casual users to create and publish content online, Web 2.0 tools like blogs, wikis, IM, photo/video sharing sites and social bookmarking systems made it relatively easy for everybody to create and publish content online.
In this vein, “Web 3.0” has been described as the “read + write + compute web”. Web 3.0 is essentially synonymous with another phrase you might hear bandied about: “the semantic web.” The “semantic web”, in turn, is a concept designed to address the issue discussed above: the irony that, despite the fact that the web is built on computers, the content hosted on the web cannot easily be interpreted or analyzed by computers, because most of it is narrative designed for human consumption. To illustrate the issue, look at the following sentences:
- I read it in *Nature*.
- The book captured the *zeitgeist* of the time.
- I am *sure* that I turned the gas off.
In each of the above sentences, one word is italicized and, without too much thought, you can tell that in each case the word has been italicized for a different reason.
But a computer has no way of telling that the word “Nature” has been italicized because it is the title of a journal, whereas the word “zeitgeist” has been italicized because it is a foreign phrase, and the word “sure” has been italicized to indicate voiced emphasis.
The problem is that web pages are full of examples like this, where important semantic information is usable by humans but not by computers. This is particularly true in online scholarly articles, where researchers record the names of chemicals, compounds, processes, people, places, concepts, etc. in narrative form and thus effectively make the content difficult for a computer to read and process.
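What the semantic web proposes is to mark up the *reason* for the formatting, not just its appearance. A minimal sketch of the difference follows; the element names are invented for illustration and are not drawn from any particular schema:

```python
import xml.etree.ElementTree as ET

# Presentational markup: a human sees the italics, but a program cannot
# tell WHY the word is italicized.
presentational = "<p>I read it in <i>Nature</i>.</p>"

# Semantic markup: the same text, but the markup records the reason.
# (Element names are invented for illustration, not any real schema.)
semantic = """
<doc>
  <p>I read it in <journal-title>Nature</journal-title>.</p>
  <p>The book captured the <foreign-phrase>zeitgeist</foreign-phrase> of the time.</p>
  <p>I am <emphasis>sure</emphasis> that I turned the gas off.</p>
</doc>
"""

root = ET.fromstring(semantic)

# With semantic tags, a program can act on meaning directly, e.g. pull
# out every journal mentioned anywhere in the text.
for element in root.iter("journal-title"):
    print("Journal mentioned:", element.text)
```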
Q: And finally, what do you think a “journal” or a “book” will look like in 10 years and what will they do that’s different from what they do today?
A: First of all, we need to distinguish between the “book” and the “journal” as:
- Physical objects (e.g. paper, stitching, covers, etc.)
- Structural conventions for presenting content (articles, chapters, serialization, editions, etc.)
- Business models (subscription, purchase, lease, etc.)
Clearly all of these aspects of the book and journal are related, but they are also too often conflated.
So first I will tackle the “physical” aspect of books and journals.
This past Christmas I was with my family and in-laws, and at a certain point I realized that everybody in the house was reading something on an iPod Touch, an iPhone, or a laptop.
They are not geeks like me, yet they weren’t in the slightest bit self-conscious about it.
In the next five years, digital reading devices will start to dominate the publishing world, just as MP3 players now dominate the music industry.
Will paper go away? No, but I understand people still like to do calligraphy, hand-bind books, and listen to vinyl LPs despite the advent of the printing press, the CD, and the MP3 player.
And just as a warning, I will openly mock anybody who trots out the tired old “4Bs” argument: that nothing can beat the paper book for reading on the Beach, on the Bus, or in the Bath or Bed.
First of all, such reading accounts for a negligible percentage of the reading that we do in life. Secondly, I bet they can waterproof an electronic reader before they can waterproof a paper book. Have you ever fished a book out of the bathtub? It ain’t fun.
The structural aspects of books and journals are also likely to change radically. This change, I expect, will take a bit longer, but it will be at least as profound as the physical change I talked about above. I should first explain that, in this case, by “books” I am referring to monographs and reference works. I’m sure that novels and such will change as well, as evidenced by the advent of cellphone novels in Japan[1], but I have a less clear idea of what these changes are likely to look like.
On the other hand, I think that articles and monographs are going to undergo two types of structural change. The first relates to the “Web 3.0” semantic-web issues I described earlier. The second has to do with the periodicity of publication, but first let me take what may seem like a detour.
Right now, we can go into a bookstore or library and pick up either a non-fiction book or a periodical and, without reading a word of the content, we can immediately tell whether what we have picked up is targeted at a scholarly market.
How can we do this?
Simply, by looking to see whether the book or periodical contains the apparatus of a scholarly work: full bibliographic metadata, a proper table of contents, footnotes and/or endnotes, abstracts, a bibliography, figure captions, graphs and tables.
This kind of apparatus just doesn’t generally exist in non-scholarly publications. Now there are two important things to note about this apparatus. The first is that almost all of it is designed to help the researcher “avoid reading.” That is, it is designed to allow the reader to more efficiently navigate the content, locate the information that is important to them and, perhaps most importantly, ignore the rest.
We tend to take this apparatus for granted because our use of it has become so deeply entrenched in our working habits. An interesting experiment to conduct is to go into a bookstore near a college campus and play a game of “identify the academic”.
Just watch how another customer examines the books. A layperson will remove a book from the shelf, glance at the cover, perhaps glance at the back cover, but then they will generally open the book at the beginning and start flipping through it sequentially. Contrast this with an academic, who will exhibit a far more complex ritual.
They will take the book off the shelf, open it from the back, and first peruse the index, bibliography, and any other back matter. Then they will skip to the front and look at the table of contents.
Often this is enough for the academic to assess the relevance of the work and, if it is deemed irrelevant, they will re-shelve it (with relief, I might add). It is only if the preceding ritual has not given them enough information that they will finally resort to opening the book (usually toward the middle, following an index entry or TOC heading of particular interest) and skimming the introductory and concluding paragraphs. All of this is, of course, a highly optimized form of “reading avoidance.”
Now, all of this extra apparatus costs money to produce, and it is really only expected by an academic audience. An academic expects that a serious publisher will invest the time and money in creating these “reading-avoidance tools”, and I predict that they will develop the same expectation of semantic web (Web 3.0) tools. Researchers will simply expect to be able to look at a document and use tools to explore and visualize the chemical compounds discussed in the paper, the methodologies employed, and the taxonomies referred to. The researcher will expect to be able to skim this information, just as they can now skim the back matter of an article. More importantly, the researcher will expect to be able to use their tools to automatically search the semantic apparatus of many hundreds of articles and/or books in order to identify precisely those items that are relevant to their research. In short, semantic enrichment will be a sign of a serious scholarly publication.
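As a rough illustration of what searching this semantic apparatus might look like, here is a minimal sketch; the annotation records, DOIs, and field names are entirely hypothetical, standing in for whatever machine-readable apparatus a publisher might expose alongside each article:

```python
# Hypothetical semantic annotations, one record per article.
# (DOIs, field names and values are all made up for illustration.)
annotations = [
    {"doi": "10.1000/a1", "compounds": {"aspirin"}, "methods": {"HPLC"}},
    {"doi": "10.1000/a2", "compounds": {"aspirin", "ibuprofen"}, "methods": {"mass spectrometry"}},
    {"doi": "10.1000/a3", "compounds": {"caffeine"}, "methods": {"HPLC"}},
]

# "Reading avoidance" as a query: out of hundreds of articles, identify
# the handful actually worth opening.
def relevant(records, compound, method):
    return [r["doi"] for r in records
            if compound in r["compounds"] and method in r["methods"]]

print(relevant(annotations, "aspirin", "HPLC"))  # -> ['10.1000/a1']
```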
The second big structural change that I think will affect journals and books concerns the periodicity of publication. So much of the structure of journals and books is inherited from practical compromises that had to be made because of the physical nature of distribution.
Journal articles are bundled into issues, in part because it would have been too inefficient to send them out in paper form separately.
New book editions are released only when the publisher feels that there is a critical mass of updated and/or corrected content to justify the expense of producing a new release.
Even then, a new edition might be delayed so as not to eat into the sales of the older editions that are still in the sales pipeline.
Of course, to a researcher, this can mean that the publication of their article is held up in order to make a complete issue, or that an important revision of a book is delayed until after a warehouse has been emptied.
This isn’t the only perversity that we have inherited. Consider also the common practice whereby, when a researcher publishes a paper on topic A and later has slightly updated things to say about topic A, they are encouraged to write and publish an entirely new paper.
When publishing in print, this was pretty much a practical limitation. There was no realistic way for an author to provide an update to an existing print article without incurring most of the production expense of writing an entirely new article. The practice has been further enshrined in our working habits because researchers are now actually rewarded for this behavior.
Unfortunately, this behavior on the part of the researcher “as an author” makes their life “as a reader” miserable.
Readers are hugely frustrated by having to closely read several articles by an author that appear to say mostly the same thing. This practice thwarts their best efforts at reading avoidance.
The issue is similar, though less severe, with books. At least with books there is the general convention of including “an introduction to the new edition”, which outlines the differences between the editions.
In either case, researchers who are now accustomed to seeing other sorts of online information updated continually (indeed, continuously) are likely to start demanding similar speed increases from scholarly publishers.
So, I expect that the second major structural change that we will see with books and journals is that we will start to update existing publications instead of creating new publications or new editions of publications.
I’ve already experienced this shift as a consumer of technical books with the Pragmatic Programmers’ “Beta Books” program and O’Reilly’s similar “Rough Cuts” feature on the Safari service.
In each case, I am able to subscribe to a book and see each new edition as it is being edited and worked on by the author. The payoff of this system when following rapidly moving technologies is immense. Just being able to see how one edition of a book metamorphoses into the next greatly assists me in fine-tuning my reading-avoidance strategy.
Finally, I have focused above on the structural changes that “books” and “journals” might undergo. I think that it is at least equally important for publishers to consider structural formats that are not based on either the book or the journal.
One question we should all be asking ourselves is, given the apparent demand for text mining and semantic web functionality, does it make sense for us to publish “data converted into narrative”? Maybe we should be figuring out how to publish data annotations?
And this leaves us with considering the “book” and the “journal” as business models. Given how much I’ve blathered, you might be happy to hear that I don’t have much to say here, other than to address the inevitable concern that publishers raise when I bang on about these changes: that making these changes will be hard, will cost money, and that there appears to be little appetite for spending money on content these days.
My only point here is this: all of the traditional apparatus that we included in print books also cost money, and it has always made the production of scholarly works a more expensive endeavour than producing similar trade content. This is particularly true when you consider the relative economies of scale between scholarly and trade publishing. On the other hand, investing in this apparatus was important for several reasons:
- It helped researchers avoid reading.
- Its mere presence helped researchers identify works that were meant to be treated seriously.
- It was almost completely impractical for authors or amateur publishers to create this apparatus on their own.
Now, one of the problems that we face as an industry is that it is relatively easy for authors and amateur publishers to create traditional bibliographic apparatus on their own. Word processors make it trivial to create a passable table of contents, index, and bibliography (note I said “passable”, not “good”; I don’t want any indexers out there to moan at me about what we are missing by not having hand-made indexes).
Combine this with the relative ease of online distribution, and you get people actively questioning what the publisher’s value-add is.
This is precisely why I think publishers have to up their game: provide researchers with a new, digitally focused apparatus that will help them avoid reading. This is hard work that they can’t easily do on their own. Publishers can do it and add value in the process. If we can tell researchers that we can save them a significant percentage of their time by allowing them to easily determine what not to read, they’ll be happy to pay for it.
W-B: Thank you!