An eLife filled with possibility thanks to great metadata
eLife recently won a Crossref Metadata Award for the completeness of its metadata, showing itself as the clear leader among our medium-sized members. In this post, the eLife team answers our questions about how and why they produce such high-quality open metadata. For eLife, the work of creating and sharing excellent metadata aligns with their mission to foster open science and supports their preprint-centred publication model, but it also lays the groundwork for all kinds of exciting potential uses.
Having complete and rich metadata puts you in the best position to fulfil future, as-yet-undetermined requirements.
– Fred Atherden, eLife
eLife is a mission-driven organisation tasked by its founders to help scientists accelerate discovery and encourage responsible behaviours in science. As such, we’re passionate about open science and metadata, and we’re vocal advocates of the benefits these provide to academic communities and beyond.
Given Crossref’s position as a hub at the centre of scholarly communication, providing Crossref with complete metadata furthers our mission. It facilitates the discovery and reuse of research and enables linkage to key but often overlooked outputs such as datasets and software. As signatories of DORA and supporters of the Barcelona Declaration, we are keenly aware of the wider context: these efforts enable research assessment and policy decisions to be derived from open and transparent information, moving beyond closed systems that have enabled the proliferation of damaging, anachronistic metrics.
There are plenty of existing guidelines that provide a great skeleton to follow. For example, we follow the FAIR data and FORCE11 software citation principles, which ensure the capture of metadata for supporting datasets and software packages. We haven’t prioritised any one particular element; rather, we’re keen to follow best practices while also exploring the bleeding edge.
We’ve collaborated with and relied on the advice of many organisations over the years, including (but not limited to) Crossref, Research Organization Registry (ROR), JATS4R, FORCE11, Software Heritage, openRxiv, and our production vendors Exeter Premedia.
We’ve developed our own open source Crossref metadata generation library. Keeping this process in-house has proven really fruitful. It allows us to quickly and continuously improve upon the metadata we provide.
And we have a data team that has created a centralised data hub, serving as a really useful authoritative resource that can be queried, instead of always making use of disparate systems.
At submission, we collect ROR IDs for (a subset of) affiliations, and structured data for funding, datasets, and other information. Our publication model is centred around preprints, so it’s necessary to capture related information such as the preprint DOI, preprint posted date, the version that pertains to each specific revision (and so on). Without this information, we could not post public reviews to the correct preprint version on the preprint server, or indeed ensure the article we publish is the correct iteration of that work.
The systems that enable the publication of eLife Reviewed preprints are dependent on DocMaps, a framework for a machine-readable representation of the processes involved in the creation of a document. These are provided by our Data Hub and enable us to capture structured information about the peer review process and accompanying metadata for each article.
Our proofing system for journal articles only permits login via ORCID authentication, and we don’t capture unauthenticated ORCID IDs that have been copied or keyed (see ‘What’s So Special About Signing In?’). It also makes use of both the Crossref API and the PubMed Central API to ensure we have persistent identifiers where possible for references. We have an in-house content validator, which uses ROR’s API to ensure we have ROR IDs for affiliations and funders where possible. We use Software Heritage to archive author-generated code, and include their persistent ID (SWHID) in software references.
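As a sketch of the kind of validation described above, the snippet below picks a confident match out of a response from ROR's affiliation-matching query (the v1 API's `?affiliation=` search returns candidate organisations, with at most one flagged as `chosen`). The sample payload and the ROR IDs in it are invented for illustration, and this is not eLife's actual validator — just a minimal illustration of the pattern.

```python
def pick_ror_match(response: dict):
    """Return the ROR ID the API marked as a confident ("chosen") match,
    or None when no candidate is safe to accept automatically."""
    for item in response.get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None

# Invented sample response for a messy author-supplied affiliation string;
# the ROR IDs below are hypothetical placeholders, not real records.
sample = {
    "items": [
        {"chosen": True,
         "organization": {"id": "https://ror.org/0abcdef00",
                          "name": "Example University"}},
        {"chosen": False,
         "organization": {"id": "https://ror.org/0ghijkl00",
                          "name": "Example Institute"}},
    ]
}

print(pick_ror_match(sample))  # the single "chosen" candidate's ID
```

Only accepting the API's own `chosen` flag, rather than the top-scored candidate, is what keeps a pipeline like this from silently attaching the wrong institution to an article.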
All our published content is captured as JATS XML (the industry standard format for journal articles), which our metadata generation library uses as its input.
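To make the JATS-as-input idea concrete, here is a minimal sketch of pulling basic metadata out of a JATS document with Python's standard library. The fragment below is a tiny, invented example (real eLife XML is far richer), but the `article-id`/`article-title` elements are standard JATS; the function itself is illustrative, not eLife's actual library.

```python
import xml.etree.ElementTree as ET

# Minimal, invented JATS fragment; the DOI is a placeholder.
jats = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.7554/eLife.000001</article-id>
      <title-group><article-title>An example title</article-title></title-group>
    </article-meta>
  </front>
</article>"""

def extract_basic_metadata(xml_text: str) -> dict:
    """Read the DOI and title from a JATS document."""
    root = ET.fromstring(xml_text)
    return {
        "doi": root.findtext(".//article-id[@pub-id-type='doi']"),
        "title": root.findtext(".//article-title"),
    }

print(extract_basic_metadata(jats))
```

Because JATS captures everything in one structured document, a metadata generation library can walk the same tree for authors, affiliations, funding, and references without touching any other system.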
Persistent identifiers are very useful for reporting. Creating a report that, for example, includes publication volumes from a particular institution is trivial when content is enriched with persistent identifiers. It’s more complex when all you have are messy author-supplied strings of text. They’re also useful for content validation. For example, when we have a persistent ID and a method to retrieve the related metadata, we can confirm that the information we’ve been provided is complete and correct.
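The reporting point above can be shown in a few lines: once records carry ROR IDs, counting publications per institution is a dictionary lookup rather than fuzzy string matching. The records, field names, and IDs below are all invented for illustration.

```python
from collections import Counter

# Hypothetical article records enriched with ROR IDs
# (field names and IDs are illustrative, not eLife's schema).
records = [
    {"doi": "10.7554/eLife.000001",
     "ror_ids": ["https://ror.org/0abcdef00"]},
    {"doi": "10.7554/eLife.000002",
     "ror_ids": ["https://ror.org/0abcdef00", "https://ror.org/0ghijkl00"]},
]

def publications_by_institution(records):
    """Count articles per institution, identified by ROR ID."""
    counts = Counter()
    for rec in records:
        # A set so an institution is counted once per article,
        # even when several co-authors share an affiliation.
        for ror_id in set(rec["ror_ids"]):
            counts[ror_id] += 1
    return counts

counts = publications_by_institution(records)
print(counts.most_common())
```

Doing the same with raw affiliation strings would require normalising spelling variants, translations, and department-level noise first — exactly the "messy author-supplied strings" problem the identifiers avoid.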
There are, of course, many other benefits, some of which are “unknown unknowns.” Having complete and rich metadata puts you in the best position to fulfil future, as-yet-undetermined requirements.
In 2024, we started introducing persistent grant IDs for our content. While we updated our submission system to collect these from authors, it’s apparent that many authors aren’t aware of whether funders have registered them, and they still provide us with internal grant numbers instead.
Our workaround was to pull grant data from Crossref and then replace the grant numbers with the persistent IDs when we’re confident of a match. Since the grant number registered at Crossref might not exactly match the grant number the authors have given us, potential matches are confirmed by a team member or our production vendors. Since many organisations do a great job of creating informative landing pages (for example, EuropePMC for Wellcome funding), this is feasible, but we’re investigating ways we can make this less manual while remaining careful that we don’t introduce false positives.
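The matching step described above might look something like the sketch below: normalise away formatting differences between the author-supplied grant number and the numbers registered at Crossref, auto-accept only an unambiguous match, and otherwise hand the case to a human. The grant DOIs and numbers are invented, and the normalisation rule is an assumption for illustration — not eLife's actual workflow.

```python
import re

def normalise_grant_number(raw: str) -> str:
    """Collapse case, spaces, hyphens, and slashes so that
    'RG/18-003' and 'rg 18 003' compare equal."""
    return re.sub(r"[^0-9a-z]", "", raw.lower())

def match_grant(author_number: str, registered_grants: dict):
    """registered_grants maps a grant DOI to the grant number as
    registered at Crossref (both hypothetical here). Return the DOI
    only on a single unambiguous normalised match; otherwise return
    None, signalling that a person should review the candidates."""
    target = normalise_grant_number(author_number)
    hits = [doi for doi, number in registered_grants.items()
            if normalise_grant_number(number) == target]
    return hits[0] if len(hits) == 1 else None

# Invented examples: placeholder DOIs and grant numbers.
registered = {
    "10.99999/grant-1": "RG/18-003",
    "10.99999/grant-2": "ABC-123",
}

print(match_grant("rg 18 003", registered))  # unambiguous match
print(match_grant("XYZ-999", registered))    # no match -> manual review
```

Returning `None` for anything short of a single clean match mirrors the caution in the text: a false positive here would attach funding metadata to the wrong grant, so ambiguity is escalated rather than guessed at.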
Yes, I think this is something that is becoming increasingly visible. Authors are very mindful of the benefits that good metadata can bring for discoverability and promotion. And much is lost without the increased interoperability it brings, both for publishers themselves and for the wider ecosystem. For example, we’ve had some great feedback from numerous organisations that appreciate that the outputs we publish link directly to the preprints they are based on.
In recent years, there’s been an increased focus on research integrity, and this is likely to remain the case. Metadata has an obvious and key role in providing trust and transparency, whether that’s through the presence of trust markers like ORCID IDs or through the inclusion of complete post-publication metadata such as correction, retraction, or withdrawal information.
Several years ago, we introduced a “publish, review, curate” model of publishing, where we publish ‘Reviewed preprints’ following each stage of review. We don’t collect the same level of structured information from authors at submission for these as we do for Versions of Record. This presents a challenge for retrieving and disseminating complete metadata for Reviewed preprints. We aim to start moving this forward so that comprehensive metadata is available at earlier stages of the publication process. For example, we recently started depositing (some) funding metadata for these.
We’re also keen to explore the ways in which we can make our eLife Assessments more discoverable. Our Editors use a common vocabulary to describe the significance of the findings and strength of evidence in a paper. Other publishers moving beyond accept/reject publication models use different rubrics and taxonomies, so a single restrictive field in a schema for the entire corpus of research won’t cut it. Nevertheless, making these terms more discoverable and interoperable would be preferable.
We’ve found the integration of public APIs and data within our systems (such as ROR’s, Crossref’s, PubMed’s, and OpenAlex’s) to be really helpful in validating the correctness and completeness of content and metadata. The effort of adding these integrations will pay dividends in the future.
Time to enjoy Fred’s acceptance video.
Metadata Awards video - eLife