Data Science @Crossref
To address the growing scale and complexity of scholarly data, we’ve launched a new data science function at Crossref. In April, we were excited to welcome our first data scientists, Jason Portenoy and Alex Bédard-Vallée, to the team. With their arrival, the Data Science team is now fully up and running. In this blog post, we’re sharing our vision and what’s ahead for data science at Crossref.
New approach to achieve our mission
Over the last few years, we have witnessed substantial growth of the scholarly community in general, and Crossref in particular. This has been reflected in the increase in the volume and variety of the data we collect, store and process, including scholarly metadata and Crossref operational data related to membership, DOI registrations, billing, usage measurement, and other activities.
On the one hand, this growth opens new possibilities for using the data to better understand the scholarly landscape, serve our community, develop services, and make informed decisions. On the other hand, it forces us to address a set of challenges related to the scale and complexity of the data.
The new Data Science team, created as part of last year’s broader organisational changes, will address these challenges and fulfil our data-related ambitions. As part of our strategic mission, we created the following vision for the Data Science team within Crossref and our community:
The Data Science team uses scientific research and data science to deliver, assess, improve, and enrich scholarly metadata.
The work of the Data Science team broadly entails two types of projects: 1) data analysis & insights; and 2) data services & workflows.
Data analysis & insights: The goal of these kinds of projects is to broaden our understanding of the scholarly record and our community and help Crossref make decisions in a data-driven way, without trying to create any specific application or product. They will help Crossref explore new strategic directions, make more informed decisions, monitor the trends and outcomes of certain decisions and policies, and discover and share new insights with the community. This category also involves large and small data assessments and analyses, measuring and monitoring certain metrics, verifying hypotheses, answering questions using data, monitoring trends in the metadata, forecasting, data visualisation, reporting, and interpreting results.
Data services & workflows: The goal of these kinds of projects is to apply scientific knowledge and data analysis to build and maintain Crossref services, tools, and workflows. The Data Science team collaborates with other Crossref teams on the research, design and implementation of the Crossref system and its various components. This will involve modelling across different data stores and APIs, as well as designing efficient and robust data workflows for various processes, including metadata deposit, validation, and dissemination. Furthermore, the team will investigate and implement modern tools and techniques for efficient data processing, storage and analysis, and strategies for data enrichment. Finally, the Data Science team is involved in planning and implementing comprehensive monitoring and reporting for various features and services.
Crossref exists as part of a diverse, global community of 22,000 members from 160 countries, plus countless systems that rely on our metadata. Launching the new Data Science function gives us a great opportunity to connect more deeply and in new ways with the wider scholarly community. We’re keen to engage with Crossref members, users of our services, and partner organisations to better understand trends and needs, and to contribute to others’ community initiatives and awareness.
One area we’re particularly interested in is the growing range of initiatives in the metascience space. We’re looking to expand and solidify our understanding of how researchers use our data and services, and to learn more about their needs and perspectives. These insights will help inform the design and functionality of our data workflows and APIs over the long term.
We’re also committed to supporting the scholarly community’s efforts to preserve the integrity of the scholarly record (ISR). By applying modern, scalable data processing techniques, we aim to help detect and investigate potential issues affecting metadata quality, including both intentional manipulation and unintentional errors or inconsistencies.
More broadly, we’re looking forward to engaging with our community on scalable data processing approaches, as well as best practices and standards for processing and enriching scholarly metadata.
Introducing new members of the team
We couldn’t pursue our ambitious goals without the dedication and passion of our team. In April, we were thrilled to welcome two data scientists, Jason Portenoy and Alex Bédard-Vallée, to the Crossref team.
Alex Bédard-Vallée brings over six years of experience extracting meaningful insights from large-scale bibliometric data in the research and scholarly publishing sector, with the aim of better serving the scholarly community. Prior to Crossref, during his tenure at Elsevier, he was instrumental in modernising data infrastructure, significantly enhancing the efficiency of massive research data pipelines. His contributions included developing automated data quality checks, creating reusable Python tools to streamline data access, and leveraging machine learning techniques to uncover research trends. Alex provided key insights for major reports, contributing to evaluations for the Canada Research Chairs Program and the NSF Science and Engineering Indicators between 2020 and 2024. Alex holds an M.Sc. in Quantum Physics (2018) and a B.Sc. in Physics (2016) from the Université de Sherbrooke.
Jason Portenoy is a New York-based data scientist with a background in bibliometric research and building applications using scholarly data. Through his work, he has become a passionate advocate for the maintenance and improvement of high-quality scholarly metadata. He holds a PhD in Information Science from the University of Washington where he studied how scholarly metadata can offer insights into scientific activity and help develop tools to address information overload. He brings experience working at OpenAlex, Semantic Scholar, and other organisations concerned with scholarly communication. Most recently, he was the Senior Data Engineer at OpenAlex, and he is now excited to continue his work using data science to support and strengthen crucial open scholarly infrastructure.
What’s next for us?
In the short term, we are focusing on two main projects: analysing how reliably DOIs resolve, and detecting discrepancies in bibliographic references at scale.
DOI resolutions: DOIs are persistent identifiers and links that are meant to consistently resolve to landing pages representing the objects they identify. Crossref members have certain obligations to adhere to, one of which is that if the location of a landing page changes, the member is responsible for updating the metadata so the DOI continues to resolve correctly. Some prior work has suggested this doesn’t always happen, leaving gaps in the scholarly record. We’re now analysing metadata from a broad sample of members to better understand the scale of the issue, and to identify cases where members may need to update their metadata records.
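As a rough illustration of what a resolution check might involve, the sketch below classifies the outcome of resolving a single DOI. It is not Crossref's actual analysis: the function names, the outcome labels, and the injected `fetch` callable are all hypothetical, chosen so the logic can be run without network access. The only detail taken from the DOI system itself is that a DOI is resolved by requesting `https://doi.org/` followed by the DOI.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import quote

# Hypothetical fetcher type: takes a URL, follows redirects, and returns
# (final HTTP status, final URL), or None if the request failed outright.
Fetch = Callable[[str], Optional[Tuple[int, str]]]

RESOLVER = "https://doi.org/"


def resolver_url(doi: str) -> str:
    """Build the resolver URL for a DOI, percent-encoding unsafe characters."""
    return RESOLVER + quote(doi.strip(), safe="/")


def classify_resolution(doi: str, fetch: Fetch) -> str:
    """Return an illustrative outcome label for one DOI (labels are assumptions)."""
    result = fetch(resolver_url(doi))
    if result is None:
        return "unreachable"          # network error, timeout, etc.
    status, final_url = result
    if final_url.startswith(RESOLVER):
        return "not_redirected"       # the resolver never handed off to a landing page
    if 200 <= status < 300:
        return "resolved"             # landing page answered successfully
    return "broken_landing_page"      # redirected, but the target returned an error
```

In practice `fetch` would wrap a real HTTP client, and a successful status alone would not prove that the landing page represents the right object, so a check like this can only flag candidates for closer inspection.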
Detecting discrepancies in bibliographic references: Following last year’s reports of discrepancies between bibliographic references in metadata records and those found in full-text PDFs, we’ve explored ways to run broader, systematic checks across a larger set of members and metadata records. The goal is to understand how widespread these inconsistencies are and to identify cases where members may need support in correcting references in their metadata records. Ultimately, we aim to create a collaborative process that improves the accuracy and reliability of bibliographic references across the scholarly record, enhancing research discovery and reproducibility and ensuring impact assessments are reliable.
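To make the idea of a reference discrepancy concrete, here is a minimal sketch that compares the DOIs cited in a metadata record with those found in the full text. It assumes, purely for illustration, that both reference lists have already been reduced to DOI strings; real references often lack DOIs, and extracting them from PDFs is a much harder problem that this example does not attempt. The function names and result keys are hypothetical.

```python
from typing import Dict, Iterable, List

# Common ways a DOI is written; stripped so that equivalent DOIs compare equal.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw: str) -> str:
    """Lower-case a DOI (DOIs are case-insensitive) and strip a known prefix."""
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def compare_references(
    metadata_dois: Iterable[str], fulltext_dois: Iterable[str]
) -> Dict[str, List[str]]:
    """Report which cited DOIs appear in only one of the two sources."""
    meta = {normalise_doi(d) for d in metadata_dois}
    full = {normalise_doi(d) for d in fulltext_dois}
    return {
        "only_in_metadata": sorted(meta - full),
        "only_in_fulltext": sorted(full - meta),
        "in_both": sorted(meta & full),
    }
```

A non-empty `only_in_metadata` or `only_in_fulltext` list would not by itself show that anything is wrong, but it could mark a record as worth a closer look.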
Look out for forthcoming blog posts with more details on these projects!
Looking further ahead, Crossref has two big projects in which the Data Science team will play central roles: developing dashboards, and improving metadata matching.
Data dashboards: We are planning to develop a series of dashboards to monitor the state of the scholarly record over time. These will include both work-level statistics (e.g., how many works of a given type have been registered?) and more detailed insights at the relationship level (e.g., how many bibliographic references have been automatically matched? How often are ROR IDs included in funder assertions?). Upstream, this will require us to build an environment where all relevant data sources can be combined, as well as adopting a suite of scalable tools and data processing techniques.
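The two kinds of statistics mentioned above could be computed along the following lines. This is a sketch under stated assumptions, not a description of the planned dashboards: it operates on an in-memory list of records shaped like works returned by the Crossref REST API (a `type` field and an optional `reference` list), and it treats a reference whose `doi-asserted-by` value is `crossref` as automatically matched. That interpretation, and the function names, are assumptions made for the example.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

Work = Mapping[str, Any]


def works_by_type(works: Iterable[Work]) -> Counter:
    """Work-level statistic: number of registered works per type."""
    return Counter(work.get("type", "unknown") for work in works)


def auto_matched_reference_share(works: Iterable[Work]) -> float:
    """Relationship-level statistic: share of references matched automatically.

    Assumption: a reference counts as automatically matched when it has a DOI
    and its 'doi-asserted-by' field equals 'crossref'.
    """
    total = 0
    matched = 0
    for work in works:
        for ref in work.get("reference", []):
            total += 1
            if ref.get("DOI") and ref.get("doi-asserted-by") == "crossref":
                matched += 1
    return matched / total if total else 0.0
```

At the scale of the full scholarly record, the same aggregations would need to run in a distributed or database-backed environment rather than over a Python list, which is the upstream work the paragraph above refers to.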
Metadata matching: In April, we commenced the matching project. It is a major effort to rebuild Crossref’s metadata matching workflows using modern software development and data science practices. The goal is to create a dedicated consolidated matching workflow that will eventually replace all existing production matching processes, with results made available through the REST API. This project covers six matching tasks: bibliographic reference matching, funder name matching, preprint matching, affiliation matching, grant matching, and title matching.
(In the meantime, as we do not have a good mechanism to add matching results to the REST API yet, we separately released two datasets with relationships discovered by automated matching strategies: a dataset of relationships between preprints and journal articles, and a dataset of relationships involving research organisations.)
As you can tell, we are very excited about Crossref’s role in the modern, open, community-focused future of scholarly infrastructure. The new Data Science team is a crucial component of this vision. If you’re interested in collaborating or learning more about data science at Crossref, we’d love to hear from you!