Be discoverable: why publishers are making their reference lists openly available and how you as a researcher can help
A recent announcement from the Initiative for Open Citations calls for all stakeholders – including researchers – to start having conversations with the publishers and learned societies they know about opening up their indexed reference lists. Adding to the already open metadata about articles increases the discoverability of the work of authors (and of publishers) and could ultimately foster a new, more open and more rigorous way to evaluate researchers – a fundamental goal of Open Science.
In the four months since the public launch of the Initiative for Open Citations (I4OC), almost 50% of indexed scholarly references have been made publicly available – about 16 million in total. This has been made possible by the growing support of publishers such as the American Association for the Advancement of Science (AAAS), which publishes Science magazine, and the Proceedings of the National Academy of Sciences (PNAS), along with several more society publishers, including the American Physical Society, the American Institute of Physics, the American Society for Cell Biology, the American Society for Biochemistry and Molecular Biology and the Electrochemical Society. Like the other publishers participating in the initiative, they have each asked Crossref – a service organisation that supports scholarly publishers – to essentially flick a switch, thus making what was once private data publicly available.
Crossref is a member organisation for anyone who publishes scholarly material (namely journals and journal articles, books and book chapters, conference proceedings and papers, reports, working papers, standards, dissertations, datasets and preprints). The organisation is a bit like a complex digital junction box (or a scholarly map, as Jennifer Lin from Crossref describes it), providing consistent digital signposts to the scholarly literature so that it is easier for researchers and others to link to and find the information they are looking for online. A publisher member registers its content with Crossref by depositing the information – the ‘metadata’ – about the relevant output in a variety of formats that Crossref converts to XML. Part of this includes registering a persistent digital object identifier – a DOI – that is unique to that output (e.g. a published article or a preprint), thus making it unequivocally identifiable. Crossref is the caretaker and curator of the metadata and, although primarily a registering organisation (of DOIs), it also provides a suite of valuable services alongside. Perhaps the one best known to researchers is the ‘cited by’ service – a really useful tool that shows you who else has cited the article. Publishers pay to be a member and also incur a small fee for each DOI registered with Crossref.
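To make the DOI idea concrete, here is a minimal Python sketch that turns the DOI of the Laemmli paper discussed further down into two web addresses: the resolver link at doi.org and the matching record in Crossref's public REST API. The API address is not mentioned in this post and is included as an assumption about how a reader might look up the deposited metadata; the function names are purely illustrative.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"
# Assumption: Crossref's public REST API, which serves one JSON record per DOI.
CROSSREF_WORKS = "https://api.crossref.org/works/"

# Prefixes people commonly paste in front of a bare DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw: str) -> str:
    """Strip a leading 'doi:' or resolver URL and lowercase the DOI."""
    doi = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip().lower()


def doi_link(raw: str) -> str:
    """Persistent link that resolves to wherever the publisher hosts the work."""
    return DOI_RESOLVER + quote(normalise_doi(raw), safe="/")


def crossref_record_url(raw: str) -> str:
    """Address of the metadata record for this DOI (assumed endpoint)."""
    return CROSSREF_WORKS + quote(normalise_doi(raw), safe="/")
```

Calling `doi_link("doi:10.1038/227680a0")` gives a link that should keep working even if the publisher moves the article, which is the whole point of a persistent identifier.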
Metadata – data about data – is crucial if you want to find information online. If you don’t have good metadata associated with your article you might as well be invisible – your best bet is to just print it out and give it to friends and colleagues or put it on a shelf in a public library. That’s because the metadata are the digital tags that tell your computer (or your text and data mining software) ‘this is the title’, ‘this is the abstract’, ‘this is the name of the author’, ‘this is their affiliation’, ‘this is a figure’, and so on.
The references (or bibliography) that authors list at the end of articles are part of the metadata submitted by publishers to Crossref. These references provide the basis of the ‘cited by’ service mentioned above. As an author, you think carefully about whose work you are going to cite in support of your evidence or argument. This is often further constrained by a limit imposed by editors on the number of references you can list. The reference list therefore provides an expertly curated filter of articles for readers to refer to if they want to understand more about the topic – it is an enormously valuable resource for science (in the broadest sense of the word) but has, until now, been locked up and therefore largely untapped.
The creation of the metadata is one of the underappreciated services of publishers (invaluable work that is paid for through subscriptions, APCs and so on). Almost all of that metadata is also publicly available via Crossref for anyone (not just Crossref) to use, mine or build services upon. Curiously, however, the metadata with the references is different – these are closed by default and not publicly available. Crucially, a publisher who is a member has to give Crossref permission to make these data public.
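What ‘open references’ means in practice can be sketched in a few lines of Python. The records below are hand-written, heavily trimmed imitations of the JSON that Crossref serves for a work. The field names (`reference`, `DOI`, `unstructured`) are assumptions about that format, and the citing DOIs are invented placeholders. The sketch also assumes that a closed reference list simply does not appear in the public record.

```python
from typing import Any


def cited_dois(record: dict[str, Any]) -> list[str]:
    """Return the DOIs found in a record's reference list, lowercased and de-duplicated."""
    seen: list[str] = []
    # A missing "reference" key is treated as "references not public".
    for ref in record.get("reference", []):
        doi = ref.get("DOI")
        if doi and doi.lower() not in seen:
            seen.append(doi.lower())
    return seen


# Illustrative record only: the citing DOI is a placeholder and the list is invented.
sample_record = {
    "DOI": "10.5555/example.citing",
    "reference": [
        {"key": "ref1", "DOI": "10.1038/227680a0"},
        {"key": "ref2", "unstructured": "A reference deposited as plain text, with no DOI."},
        {"key": "ref3", "DOI": "10.1038/227680A0"},  # same work, different letter case
    ],
}

# A record whose publisher has not opened its references (assumed shape).
closed_record = {"DOI": "10.5555/example.closed"}

open_links = cited_dois(sample_record)
closed_links = cited_dois(closed_record)
```

With the open record the function recovers one citation link; with the closed record it recovers none, even though the article itself has a reference list.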
The founders of I4OC (myself included) couldn’t understand why these metadata weren’t available. After all, most member publishers (~3000) provide them freely to Crossref and make all their other metadata public. We thought that this was largely because publishers either weren’t aware they were private or because they didn’t know they needed to grant permission to make them publicly available – all it takes is an email to Crossref. So we started to talk to the largest publishers and ask them to ‘flick the switch’.
The response has been fantastic, which is why we have already reached almost 50%. And it’s partly because it’s an easy argument to make – publishers as well as researchers will benefit from the free availability of these data, primarily because it will make the content they publish more discoverable to readers.
But the larger benefit is yet to be realised. Even though I4OC is only a few months old, we are already beginning to see the potential. The fledgling metadata are starting to be used in innovative ways. Wikidata, for example, has showcased how to visualise the citations over time to one article, a particularly influential one published in Nature in 1970 by Ulrich Laemmli, on a new method of electrophoresis that revealed previously unknown proteins in a bacteriophage (unfortunately, if you don’t have a subscription, you’ll need to pay to read the whole paper…). As Figure 1 shows, it’s a citation classic.
Figure 1. Citations accrued each year since 1970 to Laemmli, U. K. (1970) Cleavage of Structural Proteins during the Assembly of the Head of Bacteriophage T4. Nature 227, no. 5259, 680–85. doi:10.1038/227680a0. Data from Wikidata and English Wikipedia | Code from GitHub repository | Hosted on Wikimedia Tool Labs, a Wikimedia Foundation service | License for content: CC0 for data, CC-BY-SA for text and media.
More interestingly, you can begin to trace the influence of his ideas on other researchers by tracking not only who cited him but also who cited the authors who were citing him (Figure 2).
Figure 2: A partial citation graph showing some of the network of citations to Ulrich Laemmli’s classic 1970 paper. Data from Wikidata and English Wikipedia | Code from GitHub repository | Hosted on Wikimedia Tool Labs, a Wikimedia Foundation service | License for content: CC0 for data, CC-BY-SA for text and media.
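Following citations outwards, first to the papers that cite an article and then to the papers that cite those, amounts to a breadth-first walk over a ‘cited by’ graph. The sketch below uses a tiny invented graph (the paper labels are hypothetical and bear no relation to the real Laemmli network) to group citing papers by how many steps away they are.

```python
from collections import deque


def citers_by_distance(
    cited_by: dict[str, list[str]], start: str, max_steps: int
) -> dict[int, set[str]]:
    """Group papers by the length of the shortest citation chain leading to `start`."""
    distance = {start: 0}
    queue = deque([start])
    levels: dict[int, set[str]] = {}
    while queue:
        paper = queue.popleft()
        if distance[paper] == max_steps:
            continue  # do not expand beyond the requested depth
        for citer in cited_by.get(paper, []):
            if citer not in distance:  # the first visit is the shortest chain
                distance[citer] = distance[paper] + 1
                levels.setdefault(distance[citer], set()).add(citer)
                queue.append(citer)
    return levels


# Hypothetical data: keys are cited papers, values are the papers citing them.
toy_cited_by = {
    "classic-1970": ["paper-A", "paper-B"],
    "paper-A": ["paper-C", "paper-B"],
    "paper-B": ["paper-D"],
    "paper-C": ["paper-E"],
}

levels = citers_by_distance(toy_cited_by, "classic-1970", max_steps=2)
```

Here `levels[1]` holds the direct citers and `levels[2]` the papers one step further out. A walk like this is only possible when every link in the chain is available as open data.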
These types of analyses are not new per se. There are many expert researchers already working in this field (who often have to pay other platform providers hefty sums for such data) and who are doing much more sophisticated analyses. There are others still who would like to join them but don’t yet have access to the metadata.
What these illustrations begin to demonstrate, however, is the potential power of such data when independent researchers start to get hold of it. If you could access this information about your own work, for example, it might open up new collaborations with researchers you didn’t know about who are two or more steps away in the citation links to your paper (perhaps in a different field). Moreover, other organisations such as OpenCitations are building a structured database of citation links which will provide even more nuanced information – not just who cited whom but whether the citers were positive or negative about the article. After all, there are many ‘citation classics’ which are cited only because they are flawed.
This starts to take the art of understanding scholarly influence to a whole new level and calls into question, yet again, the current system of evaluation where researchers are assessed on a superficial metric of the average number of citations to a journal (the impact factor). This is why funders such as Wellcome and the Gates and Sloan Foundations are supporting I4OC as stakeholders alongside a whole host of other organisations, most recently LIBER, the British Library, Microsoft Research and the Allen Institute for Artificial Intelligence.
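A little arithmetic shows why a journal-level average says so little about any single article. The citation counts below are invented for illustration: one heavily cited paper in a hypothetical ten-article journal pulls the mean far above what a typical article receives. The real impact factor is calculated over a specific time window, and this sketch only mimics the averaging step.

```python
from statistics import mean, median

# Invented citation counts for ten articles in one hypothetical journal.
citations = [0, 1, 1, 2, 2, 3, 3, 4, 5, 179]

journal_average = mean(citations)     # what an impact-factor-style average reports
typical_article = median(citations)   # what the middle-of-the-pack article receives
above_average = sum(1 for c in citations if c > journal_average)

print(
    f"mean={journal_average}, median={typical_article}, "
    f"articles above the mean={above_average}"
)
```

In this toy journal nine of the ten articles sit well below the journal average, so judging their authors by that average would credit them with citations they never received.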
The good news is that we’re already almost 50% of the way there (at least for the metadata indexed by Crossref) but we have another 50% to go. And while some of the major publishers (regardless of their business model) have enthusiastically supported the initiative, there are thousands of publishers who are members of Crossref but don’t yet know they can do this. Most of these publishers are small but cumulatively they make up the long tail of references still to be opened up. We are a small group at I4OC and can’t do this ourselves.
The carefully crafted reference lists you produce and give to publishers should be subject to the same level of independent scrutiny and rigour that you apply to your own research. They are especially crucial if we are to understand the impact and influence of different outputs (including data if we include citations to datasets) and want to change the current evaluation system from one that creates perverse incentives harming science (more on this in another post) to one that is evidence-based. You create this resource and you should also have the ability to analyse it if you want to. Reference lists are data and a fundamental part of Open Science.
So if you’re a member of a scholarly society, or an editor, a reviewer, or an author with a favourite journal, you can help take steps to change the culture of evaluation by having a conversation with your society journal or its publisher. If they are not a member of Crossref, encourage them to become one. If they are, ask them to flick that switch – all it takes is an email to email@example.com.
Note on being ‘invisible’ without good metadata: This is, of course, an exaggeration – you can find articles online even if there is no metadata – it’s just harder to do, and the articles are harder to mine for specific bits of information (e.g. about the figures). Good metadata aids discoverability and also makes it easier to build discovery tools on top. There are also an increasing number of ways to, for example, automatically extract the references from PDFs, but these are still in their infancy.
Note on publisher support for the initiative: Some publishers such as eLife, Hindawi, PLOS and the Royal Society were already making (or were in the process of making) their reference metadata public via Crossref before the idea of I4OC came about. You’d also think this would be a slam-dunk for any Open Access publisher, but many of them just weren’t aware of the issue before I4OC started.
August 18th, 2017
David Shotton, Co-Director of OpenCitations (http://opencitations.net) comments:
One small point of clarification about citation typing. You wrote:
“Moreover, other organisations such as OpenCitations are building a structured database of citation links which will provide even more nuanced information – not just who cited whom but whether the citers were positive or negative about article.”
It is true that Silvio and I developed CiTO, the Citation Typing Ontology, which we have progressively expanded and refined since it was first described in 2010.
Journal of Biomedical Semantics 2010, 1(Suppl 1):S6
This ontology (http://www.sparontologies.net/ontologies/cito; http://www.sparontologies.net/ontologies/cito/source.html) provides terms to describe the nature of citations, both factually and rhetorically, thus permitting an author at the time of writing to describe the author’s intent when making a citation, or others retrospectively to say what they think that intent was. However, CiTO needs to be used ‘in anger’ to create such descriptions. At present, with a few exemplar exceptions that we have hand-crafted, and Egon Willighagen’s pioneering use of CiTO terms in CiteULike (e.g. http://www.citeulike.org/user/egonw/article/7901082), there is little use of CiTO terms in the wild, largely because creating such descriptions involves additional work.
Silvio Peroni and his colleagues have made initial attempts to automate the retrospective assignment of CiTO terms to references in article reference lists using a sentiment analysis software system called CiTalO:
Identifying Functions of Citations with CiTalO. Angelo Di Iorio, Andrea Giovanni Nuzzolese, Silvio Peroni. ESWC 2013. DOI: 10.1007/978-3-642-41242-4_30
Abstract: Bibliographic citation is one of the most important activities of an author in the production of any scientific work. The reasons that an author cites other publications are varied: to gain assistance of some sort, to review, critique or refute previous works, etc. In this paper we propose a tool, called CiTalO, to infer automatically the nature of citations by means of Semantic Web technologies and NLP techniques. Such a characterisation makes citations more effective for linking, disseminating, exploring and evaluating research.
Evaluating Citation Functions in CiTO: Cognitive Issues. Paolo Ciancarini, Angelo Di Iorio, Andrea Giovanni Nuzzolese, Silvio Peroni, Fabio Vitali. ESWC 2014. DOI: 10.1007/978-3-319-07443-6_39
Abstract: Networks of citations are a key tool for referencing, disseminating and evaluating research results. The task of characterising the functional role of citations in scientific literature is very difficult, not only for software agents but for humans, too. The main problem is that the mental models of different annotators hardly ever converge to a single shared opinion. The goal of this paper is to investigate how an existing reference model for classifying citations, namely CiTO (the Citation Typing Ontology), is interpreted and used by annotators of scientific literature. We present an experiment capturing the cognitive processes behind subjects’ decisions in annotating papers with CiTO, and we provide initial ideas to refine future releases of CiTO.
This field is a potentially rich one for future Artificial Intelligence initiatives.
Thus, while the OpenCitations Corpus could in principle include CiTO-based semantic descriptions of the >9 million citations it hosts, it cannot do so until such descriptions are created and made available, either by authors or retrospectively by readers or computers.
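For readers who have not met CiTO, a typed citation boils down to a subject, predicate and object statement in which the predicate says why one work cites another. The sketch below writes such statements as N-Triples lines in plain Python. It assumes the CiTO namespace http://purl.org/spar/cito/ and a handful of its property names (`cites`, `supports`, `disputes`, `usesMethodIn`). The citing DOIs are made-up placeholders and the helper function is hypothetical, not part of any OpenCitations tool.

```python
CITO = "http://purl.org/spar/cito/"  # assumed CiTO namespace
DOI = "https://doi.org/"

# A small subset of CiTO property names, assumed here for illustration.
ALLOWED_TYPES = {"cites", "supports", "disputes", "usesMethodIn"}


def typed_citation(citing_doi: str, citation_type: str, cited_doi: str) -> str:
    """Format one typed citation as a single N-Triples line."""
    if citation_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown citation type: {citation_type}")
    return f"<{DOI}{citing_doi}> <{CITO}{citation_type}> <{DOI}{cited_doi}> ."


laemmli = "10.1038/227680a0"

# Placeholder citing works; the citation types are invented examples.
triples = [
    typed_citation("10.5555/placeholder.1", "usesMethodIn", laemmli),
    typed_citation("10.5555/placeholder.2", "disputes", laemmli),
]

for line in triples:
    print(line)
```

The formatting is the easy part. As the comment above makes clear, the hard part is deciding, by hand or by machine, which citation type applies in each case.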
The text and illustration in this blog post are by Hindawi and are distributed under the Creative Commons Attribution License (CC-BY).