Dominika Tkaczyk

Dominika joined Crossref in 2018 as a Principal R&D Developer, where she focused on metadata matching research aimed at enriching the scholarly record through the discovery of new relationships. In 2024, she became Crossref’s Director of Data Science and established the Data Science team, with a mission to explore innovative ways of using data to support the scholarly community, enrich the Research Nexus with more metadata and relationships, and develop collaborations with like-minded community initiatives. Since 2025, Dominika has served as Director of Technology, leading a unified technology team that integrates infrastructure, software development, and data science functions. Dominika holds a PhD in Computer Science from the Polish Academy of Sciences. Prior to joining Crossref, she was a researcher and a data scientist at the University of Warsaw, Poland, and a postdoctoral researcher at Trinity College Dublin, Ireland.

Read more about Dominika Tkaczyk on their team page.

The myth of perfect metadata matching

In our previous instalments of the blog series about matching (see part 1 and part 2), we explained what metadata matching is, why it is important and described its basic terminology. In this entry, we will discuss a few common beliefs about metadata matching that are often encountered when interacting with users, developers, integrators, and other stakeholders. Spoiler alert: we are calling them myths because these beliefs are not true! Read on to learn why.

The anatomy of metadata matching

In our previous blog post about metadata matching, we discussed what it is and why we need it (tl;dr: to discover more relationships within the scholarly record). Here, we will describe some basic matching-related terminology and the components of a matching process. We will also pose some typical product questions to consider when developing or integrating matching solutions.

Basic terminology

Metadata matching is a high-level concept, with many different problems falling into this category. Indeed, no matter how much we like to focus on the similarities between different forms of matching, matching affiliation strings to ROR IDs and matching preprints to journal papers still differ in several important ways. At Crossref and ROR, we call these problems matching tasks.

Metadata matching 101: what is it and why do we need it?

At Crossref and ROR, we develop and run processes that match metadata at scale, creating relationships between millions of entities in the scholarly record. Over the last few years, we’ve spent a lot of time diving into details about metadata matching strategies, evaluation, and integration. It is quite possibly our favourite thing to talk and write about! But sometimes it is good to step back and look at the problem from a wider perspective. In this blog, the first one in a series about metadata matching, we will cover the very basics of matching: what it is, how we do it, and why we devote so much effort to this problem.

Discovering relationships between preprints and journal articles

In the scholarly communications environment, the evolution of a journal article can be traced by the relationships it has with its preprints. Those preprint–journal article relationships are an important component of the research nexus. Some of those relationships are provided by Crossref members (including publishers, universities, research groups, funders, etc.) when they deposit metadata with Crossref, but we know that a significant number of them are missing. To fill this gap, we developed a new automated strategy for discovering relationships between preprints and journal articles and applied it to all the preprints in the Crossref database. We made the resulting dataset, containing both publisher-asserted and automatically discovered relationships, publicly available for anyone to analyse.

The more the merrier, or how more registered grants means more relationships with outputs

One of the main motivators for funders registering grants with Crossref is to simplify the process of research reporting with more automatic matching of research outputs to specific awards. In March 2022, we developed a simple approach for linking grants to research outputs and analysed how many such relationships could be established. In January 2023, we repeated this analysis to see how the situation changed within ten months. Interested? Read on!

Follow the money, or how to link grants to research outputs

The ecosystem of scholarly metadata is filled with relationships between items of various types: a person authored a paper, a paper cites a book, a funder funded research. Those relationships are absolutely essential: an item without them is missing the most basic context about its structure, origin, and impact. No wonder that finding and exposing such relationships is considered very important by virtually all parties involved. Probably the most famous instance of this problem is finding citation links between research outputs. Lately, another instance has been drawing more and more attention: linking research outputs with grants used as their funding source. How can this be done and how many such links can we observe?

Double trouble with DOIs

Dominika Tkaczyk – 2020 March 10

In R&D, Metadata

Detective Matcher stopped abruptly behind the corner of a short building, praying that his loud heartbeat wouldn’t give away his presence. This missing DOI case was unlike any other before, keeping him awake for many seconds already. It took a great effort and a good amount of help from his clever assistant Fuzzy Comparison to make sense of the sparse clues provided by Miss Unstructured Reference, an elegant young lady with a shy smile, who begged him to take up this case at any cost.

Crossref metadata for bibliometrics

Our paper, Crossref: the sustainable source of community-owned scholarly metadata, was recently published in Quantitative Science Studies (MIT Press). The paper describes the scholarly metadata collected and made available by Crossref, as well as its importance in the scholarly research ecosystem.

Crossref: The sustainable source of community-owned scholarly metadata

Dominika Tkaczyk, Friday, Jun 26, 2026

The foundational paper describing Crossref’s metadata — its scale, breadth, and role in the scholarly ecosystem. Published in Quantitative Science Studies in 2020 by Hendricks, Tkaczyk, Lin, and Feeney.

What’s your (citations’) style?

Bibliographic references in scientific papers are the end result of a process typically composed of: finding the right document to cite, obtaining its metadata, and formatting the metadata using a specific citation style. This end result, however, does not preserve the information about the citation style used to generate it. Can the citation style be somehow guessed from the reference string only?

TL;DR

  • I built an automatic citation style classifier. It classifies a given bibliographic reference string into one of 17 citation styles or “unknown”.
  • The classifier is based on supervised machine learning. It uses TF-IDF feature representation and a simple Logistic Regression model.
  • For training and testing, I used datasets generated automatically from Crossref metadata.
  • The accuracy of the classifier estimated on the test set is 94.7%.
  • The classifier is open source and can be used as a Python library or REST API.
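The approach summarised above (TF-IDF features feeding a Logistic Regression model) can be sketched roughly as follows. This is a minimal illustration and not the released classifier: it assumes scikit-learn, character n-gram features, only two style labels, a handful of made-up training strings, and a hypothetical `min_confidence` cut-off for returning "unknown". The real classifier covers 17 styles and is trained on data generated from Crossref metadata.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, written by hand for illustration only (assumption).
# Each item is (reference string, style label).
TRAIN = [
    ("Threadgill-Sowder, J. (1983). Question Placement in Mathematical Word "
     "Problems. School Science and Mathematics, 83(2), 107-111", "apa"),
    ("Doe, J. (2001). An example title. Example Journal, 1(2), 3-4", "apa"),
    ("Roe, A., & Poe, B. (2010). Another example. Sample Review, 12(4), 50-61",
     "apa"),
    ('J. Threadgill-Sowder, "Question Placement in Mathematical Word '
     'Problems," School Science and Mathematics, vol. 83, no. 2, '
     "pp. 107-111, 1983.", "ieee"),
    ('J. Doe, "An example title," Example Journal, vol. 1, no. 2, '
     "pp. 3-4, 2001.", "ieee"),
    ('A. Roe and B. Poe, "Another example," Sample Review, vol. 12, no. 4, '
     "pp. 50-61, 2010.", "ieee"),
]


def train_style_classifier(examples):
    """Fit a TF-IDF + Logistic Regression pipeline on (text, label) pairs."""
    texts, labels = zip(*examples)
    model = make_pipeline(
        # Character n-grams keep punctuation and casing, which carry most of
        # the style signal (the feature choice here is an assumption).
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False),
        LogisticRegression(max_iter=1000),
    )
    model.fit(list(texts), list(labels))
    return model


def classify_style(model, reference, min_confidence=0.5):
    """Return the most probable style, or "unknown" below the cut-off.

    The probability cut-off is a hypothetical way to produce "unknown";
    the post does not say how the released classifier handles it.
    """
    probabilities = model.predict_proba([reference])[0]
    best = probabilities.argmax()
    if probabilities[best] < min_confidence:
        return "unknown"
    return str(model.classes_[best])


model = train_style_classifier(TRAIN)
print(classify_style(model, "Poe, B. (1999). Some title. Some Journal, 5(1), 9-18"))
```

With so few training strings the predictions are not meaningful; the point is only the shape of the pipeline: vectorise the raw reference string, fit a linear model, and read off the most probable style.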

Introduction

Threadgill-Sowder, J. (1983). Question Placement in Mathematical Word Problems. School Science and Mathematics, 83(2), 107-111

This reference is the end result of a process that typically includes: finding the right document, obtaining its metadata, and formatting the metadata using a specific citation style. Sadly, the intermediate reference forms or the details of this process are not preserved in the end result. In general, just by looking at the reference string we cannot be sure which document it originates from, what its metadata is, or which citation style was used.