A radically open approach to developing infrastructure for Open Science

Hindawi’s CEO, Paul Peters, explains the problems inherent in proprietary solutions for Open Science infrastructure and presents a proposal for how things can be done differently.

Should commercial companies have a role in developing infrastructure for an Open Science future?

This question regularly echoes across the Open Science movement, particularly in the weeks following an announcement that some large publisher has acquired another innovative, community-supported startup. In spite of concerns raised time and time again, commercial companies, whose objectives are often at odds with those of the communities that they serve, control a significant amount of the infrastructure behind the Open Science movement.

I do believe that there is an important role for commercial providers to play in developing open scholarly infrastructure, but preventing private companies from owning and controlling this infrastructure requires a radically open approach to its development. What follows are my thoughts about the problems inherent in proprietary solutions for Open Science, as well as a proposal for how things can be done differently.

What are the risks of relying on proprietary scholarly communications infrastructure?

Geoffrey Bilder, Jennifer Lin, and Cameron Neylon clearly stated the issue at the heart of this question in 2015:

“Everything we have gained by opening content and data will be under threat if we allow the enclosure of scholarly infrastructures.”

Providing open scholarly infrastructure is likely to be more challenging than providing Open Access to scholarly articles. The primary challenge in moving towards Open Access publication models has been a reorganization of how publication costs are paid, rather than a fundamentally new approach to scholarly publication. In contrast, open scholarly infrastructure will require completely new models of interaction between commercial companies (publishers, technology providers, data aggregators), non-profit organizations, and the research community. Developing open infrastructure for the creation, dissemination, and assessment of scholarly outputs will require parties with disparate incentives to work together to overcome difficult challenges.

I should attempt to define what I mean by “Open Science” and “scholarly infrastructure,” since there are important differences in how people understand these terms. By “Open Science” I am referring to a system for scholarly communications that is built to maximize the dissemination and reuse of all research outputs – including data, code, protocols, methods, and publications – throughout the research lifecycle. With the “infrastructure” of scholarly communications, I am referring to the tools and metadata used to create, share, and assess these outputs of scholarly research. Importantly, this includes data about the scholarly research process itself, such as reference lists and funding information. These metadata play an increasing role in the assessment of research outputs, research institutions, and researchers themselves, and drive incentives within many areas of the research process.

Proprietary sources supply much of this critical metadata. ORCID, the Initiative for Open Citations, and Crossref’s new Event Data service are important counterexamples that provide hope for future progress. However, these open services are a small minority in an environment where most of the data needed to support Open Science is controlled by commercial companies, both big and small. This growing reliance on a handful of companies to provide proprietary analytics and decision tools for research funders and universities poses serious risks for the future.

The impact of this commercial control of scholarly infrastructure is most clearly felt when traditional publishers acquire smaller infrastructure providers, which seems to have been happening at an increasing rate in recent years. Elsevier’s acquisitions of Mendeley and SSRN raised serious concerns among many Open Science advocates, as both of these services were considered to be community-driven responses to the power of large subscription-based publishers. More recently, Elsevier’s acquisition of Bepress, one of the largest providers of open access institutional repositories, drew fierce criticism from Heather Joseph and Kathleen Shearer, two of the world’s leading digital repository advocates.

How deep does the problem go?

Although the examples above all relate to Elsevier, the risks of proprietary infrastructure for research extend well beyond any single company. Following Elsevier’s acquisition of Bepress, Roger Schonfeld (the Director of the Libraries and Scholarly Communication Program for ITHAKA) wrote a thoughtful analysis about the growing investment that large commercial companies have made in “scholarly workflow services” as a means “to pivot beyond content licensing.”

In his analysis, Schonfeld draws a comparison between Elsevier’s growing suite of scholarly workflow tools and those owned by “SpringerNature sibling Digital Science,” whose portfolio includes many of the best-known tools supporting Open Science (Altmetric, Figshare, ReadCube, Overleaf, Dimensions, and Symplectic). There are many other examples of commercially owned products and services at the forefront of Open Science, including Publons (acquired earlier this year by Clarivate Analytics), Colwiz (acquired earlier this year by Taylor and Francis), as well as ResearchGate and Academia.edu (which together have raised more than $100 million in venture capital funding).

How did we get here?

Schonfeld explains that many libraries have opted to use commercially-provided services like Bepress’ Digital Commons either because open source alternatives are lacking in functionality or because many libraries do not have the technical resources to develop, customize, and maintain these services. Schonfeld further argues that “libraries adopting standalone institutional repositories are moving in exactly the wrong direction strategically” as their focus going forward should be to support scholarly workflows rather than gathering, preserving, and making available content on a standalone basis. Schonfeld concludes by saying that a “transition to supporting workflow would almost certainly mean accepting that scholarly infrastructure will be further outsourced,” but that libraries should be at the forefront of defining the terms under which this outsourcing of scholarly workflows is managed.

Is Schonfeld right? Do commercial companies have a necessary – perhaps even a beneficial – role in the development of scholarly communications infrastructure? I believe that the answer is yes. However, this also comes with serious risks that should not be overlooked.

The success companies like Mendeley, Highwire, Figshare, Altmetric, Publons, and others have demonstrated in creating and scaling innovative products for Open Science is a clear sign of the role they can play in building infrastructure for the future. The question is whether commercial providers can participate in the creation of Open Science infrastructure in a way that does not risk the eventual privatization of this infrastructure.

What can be done?

What if there were a model in which commercial players could develop and support open infrastructure using service-based business models that didn’t involve ownership of this infrastructure or create dependencies on any single provider? What would a system like that look like? What kinds of openness would it require? Could a company working in such an environment recover the investment required to develop their products and services?

I believe that such a model is both achievable and very compelling. What follows is my attempt to outline what it would require.

The principles of openness

I believe a model where commercial providers develop and maintain open scholarly communications infrastructure requires four basic principles of openness: Open Source, Open Data, Open Integrations, and Open Contracts.

Open Source

Any open scholarly communications infrastructure needs to be licensed under permissive open source licenses in order to prevent a single provider from establishing control over that infrastructure. However, I do not believe that open source licenses on their own are sufficient, as true independence from any single provider requires that an active community of users and service providers contribute to developing and maintaining that infrastructure.

For open source projects to work, a vibrant development community that engages both commercial and not-for-profit participants is needed. A great example is Moodle, an open source Learning Management System maintained by a community of more than 90 independent service providers. There are over 80,000 Moodle sites around the world that collectively serve more than 100 million students, and the active development community behind Moodle ensures that its users have a range of service providers to choose from.

A similar community for developing open scholarly communications infrastructure is forming around the Collaborative Knowledge Foundation (known to its friends as Coko). Coko aims “to evolve how knowledge is created, produced and reported” by “building open source solutions in scholarly knowledge production that foster collaboration, integrity, and speed.” In its first two years, Coko has already built an early community of partners including the University of California Press, eLife, and Hindawi, and released an open source book production platform called Editoria. Hindawi’s initial interest in working with Coko was to develop next-generation platforms for journal peer review and hosting, which we are very excited about. However, I also believe that Coko has the potential to foster the collaborative development of open scholarly infrastructure more broadly.

Open Data

A second requirement for open scholarly communications infrastructure is that it should be built on, and contribute back to, openly available datasets. This is particularly important for metadata about the research process itself, such as funding data, publication and citation data, and “altmetrics” data, which are still largely controlled by proprietary data providers.

Proprietary databases like Scopus and the Web of Science continue to be the primary source for comprehensive citation and reference data. Even the push towards alternative metrics, a key component of the Open Science agenda, is largely dependent on proprietary data sources from companies like Altmetric (a Digital Science company) and Plum Analytics (acquired by Elsevier earlier this year).

Similarly, products like UberResearch’s Dimensions (another Digital Science company) and SciVal (from Elsevier) use proprietary aggregations of data about research funding and research outputs to develop analytics and decision tools for funders and research administrators.

As the dependence on these proprietary data providers grows, universities and research funders risk becoming completely reliant on a few large companies for critical evaluation and decision support. Fortunately, groups like Crossref and ORCID are starting to develop open alternatives, and I believe it’s essential for the future of Open Science that new services are developed on top of these openly available datasets rather than depending on proprietary data from commercial providers.

Open Integrations

Just as open scholarly infrastructure depends on open source software and open data, it must also integrate with other tools and services using standard metadata formats and open APIs. The promise of truly open scholarly infrastructure is that it will enable a wide range of systems to integrate with one another at every stage of the research process, but this is only possible if there is a shared commitment to open integrations between systems.

Even in the case of community-driven, non-profit initiatives, paid APIs are often an important source of revenue, but one that limits the potential value of these services. A telling example is bioRxiv, which recently announced that it had started to integrate with journal peer review systems. Hindawi was excited by this opportunity, as we felt it would make it easier for our authors to publish preprints of their articles in parallel with the normal review process for our journals. However, when we contacted bioRxiv, they quoted a price of more than $100,000 per year to enable this for the biomedical titles that we publish. Because we could not justify this cost, we are unfortunately unable to provide this functionality to our authors, and I’m sure the same will be true for many other open access publishers who would otherwise love to support the growth of preprints in the biomedical sciences.

While I fully appreciate that infrastructure providers need to find ways to cover their costs, I believe that any business model that is based on limiting integrations with other tools and services is inherently problematic for an Open Science future. So although I believe there may be a place for paid APIs in the case of high-volume users who require service level agreements, free APIs should always be offered to facilitate open integrations with other systems.

Open Contracts

The final principle that I believe is essential for the future of open scholarly infrastructure is that the terms under which products and services are provided should be completely open and free from unnecessary lock-in. Non-Disclosure Agreements, multi-year contract terms, and privately negotiated prices for journal subscriptions have long prevented a competitive and transparent marketplace for scholarly publications, so we must prevent a similar trend from taking hold among scholarly infrastructure providers. As Research Libraries UK argued in their statement supporting the long-awaited release of data about the UK’s spending on Elsevier’s big deal packages, “markets work best where there is pricing transparency and price non-disclosure clauses in contracts never work to the benefit of customers.”

To avoid similar dynamics among open infrastructure providers, contracts with service providers should include minimal notice periods for termination, be free from NDAs and termination fees, and be standardized wherever possible. In fact, my personal view is that companies should be willing to make their contracts freely available online for all to see, except in the rare cases where the privacy concerns of their customers would prevent doing so.

So what does Hindawi plan to do about it?

One of the great privileges of working at an amazing company like Hindawi is that we have the opportunity to take action where we see an opportunity to change how things are done. It’s still early days, but my colleagues and I are excited about the opportunity to engage with universities, research funders, and other organizations within the Open Science ecosystem to work together in building truly open infrastructure for the future. As we begin to do so, we will commit to the principles I have just outlined: open source, open data, open integrations, and open contracts. We believe a commercial company can operate profitably while honoring these commitments.

The first step on this path is likely to be the development of an open institutional repository platform, one which fully embraces the vision laid out by the Confederation of Open Access Repositories’ “Next Generation Repositories” working group. The potential for a connected network of open repositories facilitating both the sharing and machine-readability of research outputs is incredibly exciting, as this would enable so many of the tools and services needed for an Open Science future.

We plan to work closely with organizations like Coko and SPARC, alongside a pilot group of institutional repository managers, to define what such a system should include and how it should be developed. This will be done in parallel with the work that we have already started for the development of an open source peer review and publication platform in cooperation with the Coko community.

I plan to post regular updates as our work progresses, and I would encourage anyone interested in working alongside us to get in touch with me by email.

Paul Peters
Chief Executive Officer
Hindawi