Follow the money, or how to link grants to research outputs
The ecosystem of scholarly metadata is filled with relationships between items of various types: a person authored a paper, a paper cites a book, a funder funded research. Those relationships are absolutely essential: an item without them is missing the most basic context about its structure, origin, and impact. No wonder that finding and exposing such relationships is considered very important by virtually all parties involved. Probably the most famous instance of this problem is finding citation links between research outputs. Lately, another instance has been drawing more and more attention: linking research outputs with grants used as their funding source. How can this be done and how many such links can we observe?
- We looked for links between research outputs and grants registered with Crossref.
- Grant DOIs alone are not enough for linking research outputs with grants, because the funding information in research outputs typically does not contain grant DOIs (yet). Award numbers alone are also not enough because they are not globally unique.
- We used either grant DOIs (if available) or the combination of award number and funder information to match grants to research outputs.
- In total, we found 20,834 links between research outputs and registered grants, involving 17,082 research outputs and 3,858 grants (10% of all registered grants).
- Erroneous and incomplete metadata, especially involving award numbers, is the main factor that prevents linking research outputs to grants.
The ecosystem of scholarly metadata is filled with relationships between items of various types: a person authored a paper, an author works at a university, a paper cites a book, a book contains a chapter, a funder funded research. Those relationships are absolutely essential: an item without them is missing the most basic context about its structure, origin, and impact.
No wonder that finding and exposing relationships between items in the scientific ecosystem is considered very important by virtually all parties involved. Probably the most famous instance of this problem is finding citation links between research outputs. Another, relatively new example is linking research outputs with the grants that funded them.
At Crossref, for some time now we have been seeing a steady growth of funder membership and grant registration. We are aware that the possibility of finding relationships between grants and research outputs is a big reason why funders are registering grants with us in the first place. Being able to see which research outputs are being supported by which grants helps reduce the reporting burden on researchers, funders, and institutions alike, especially now with the addition of ROR IDs to help complete the picture. Exposing relationships between research outputs and grants also increases the transparency of funding sources of the research, making it easier to assess and trust scientific findings.
But how can we find those relationships and how many of them can we already observe? Thankfully our REST API, recently equipped with the grant metadata, can help us answer these questions.
The perfect scenario
Imagine a world where the metadata of any scientific output states all relationships with other items existing in the scientific ecosystem, and those related items are always referred to by their persistent identifiers, allowing all this information to be accessed in a fully machine-readable way… Lovely, right?
In the case of citations, in such a perfect world every bibliographic reference has a DOI of the cited item. And in the case of funding information, a scientific paper contains grant DOIs, stating the funded-by relationships between the paper and the grants.
But, as the last two years have painfully taught us all, life is not all rainbows and unicorns.
The reality kicks in
We know that around 71% of bibliographic references are deposited with Crossref without a DOI of the cited item. This means that if we want to establish citation links between items, we need to match the bibliographic references using the provided metadata, which is not a trivial task and can potentially introduce errors.
And the situation with the funding information and grant DOIs is even worse.
Problem #1: our schema does not allow the publishers to attach grant DOIs to research outputs
This issue is 100% on us. Because grant DOIs are relatively new, our deposit schema does not yet allow a grant DOI to be specified in the funding information of a research output, even if the publisher wants to provide one. We are working on changing this.
Interestingly, it looks like persistent identifiers always find a way. Within over 7.4 million research outputs with funding information, we noticed 6 cases where a grant DOI was provided as an award number. For example in 10.1093/nar/gkaa994 we have the following:
name: "Wellcome Trust",
This may not be 100% correct from the schema perspective, but it is very useful when one is interested in linking grants to research outputs!
But those cases are extremely rare outliers. For the vast majority of the outputs, grant DOIs are not present in the metadata. This means that, just like in the case of bibliographic references, we have to use the metadata to match funding information to grants.
Funding information is typically given as a pair: award number, funder information. Grants contain similar metadata. One might be tempted to use only the award number for linking, as in some cases it can look like a grant identifier.
Let’s consider an example. We want to find all papers funded by grant 10.37807/gbmf7622. The award number is GBMF7622. A simple approach might be to search for items with this award number in Crossref’s REST API, which returns 12 results. However, one of the resulting items is the grant itself. So excluding that, it seems like there are 12 - 1 = 11 research outputs funded by this grant.
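As a minimal sketch of this naive search, the Python below builds the query URL with the award.number filter and drops grant records from a parsed response. The helper names and the tiny sample response are illustrative assumptions (the two output DOIs are placeholders). The sketch assumes the usual REST API response layout, with results under message.items and grant records carrying the type "grant".

```python
from urllib.parse import urlencode

API_URL = "https://api.crossref.org/works"


def award_search_url(award_number: str, rows: int = 100) -> str:
    """Build a works query that filters on the award number."""
    params = {"filter": f"award.number:{award_number}", "rows": rows}
    return f"{API_URL}?{urlencode(params)}"


def research_outputs(response: dict) -> list:
    """Return the items of a parsed response that are not grants."""
    items = response.get("message", {}).get("items", [])
    return [item for item in items if item.get("type") != "grant"]


# Illustrative response only: one grant plus two placeholder outputs.
sample = {
    "message": {
        "items": [
            {"DOI": "10.37807/gbmf7622", "type": "grant"},
            {"DOI": "10.9999/placeholder-1", "type": "journal-article"},
            {"DOI": "10.9999/placeholder-2", "type": "posted-content"},
        ]
    }
}

print(award_search_url("GBMF7622"))
print(len(research_outputs(sample)))
```

Fetching the URL and parsing the JSON is left out on purpose, so the sketch runs without network access.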
Simple and easy, right? Well, think again.
Problem #2: award numbers are not unique
Let’s look at another example grant: 10.25585/60000600. Its award number is 2817 and the funder is the US Department of Energy.
When we search for this award we get 10 results. Like before, one of them is our grant. After examining the remaining 9 we will see that:
So among only 9 items mentioning the same award number we have in fact 5 different grants. Our input grant should probably be linked only to the three items mentioning Joint Genome Institute. The main problem illustrated here is that the award numbers are not globally unique, and thus should not be treated like identifiers.
Indeed, within 38,326 grants registered so far, we have 37,608 distinct award numbers, and among those, there are 716 award numbers, each of which appears in multiple grants. This issue comes in two flavours: conflicts between and within funders.
Between-funder award number conflicts
A conflict between funders occurs when more than one funder uses the same award number for one of their grants. This is expected: award numbers are assigned by funders internally and are not designed to be globally unique identifiers.
Out of 716 award numbers that appear in multiple grants, 12 are numbers that appear in grants of different funders. For example, there are two grants with the award number
Because of those conflicts, we cannot simply rely on the award numbers for linking grants to research outputs. Instead, we have to use more information to be sure that the links are correctly established.
Within-funder award number conflicts
To our big surprise, it turns out that the majority of the award number conflicts happen not between different funders, but within the grants of a single funder. Out of 716 award numbers that appear in multiple grants, 704 appear in multiple grants of a single funder only. Such situations are not expected and could indicate an error or some other systematic issue with the data.
Interestingly, out of those 704 award numbers, 700 are associated with the US Department of Energy. We’ve followed up with them in order to clarify or resolve this. The US Department of Energy pointed out a fundamental issue with the data model: currently a grant deposited with Crossref has to have at least one funder DOI, and no other way of identifying the associated organisation is allowed. At the same time, some of the facilities that should appear in their grants’ metadata are not funders at all and thus cannot be identified by a funder DOI. In the future, they plan to identify those facilities in their grant metadata by providing ROR IDs.
Because of within-funder award number conflicts, in some cases it might be difficult to distinguish between two grants with the same award number and funder. A solution might be to use additional information or simply not accept any links if a research output cannot be reliably linked to one grant only.
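The two kinds of conflict can be told apart with a simple grouping. The sketch below is only an illustration: it assumes each grant is reduced to a (grant DOI, award number, funder DOI) tuple with a single funder, and all identifiers in the sample are made up.

```python
from collections import defaultdict


def classify_award_conflicts(grants):
    """Split award numbers shared by several grants into two groups.

    grants: iterable of (grant_doi, award_number, funder_doi) tuples,
    a simplified stand-in for real grant metadata.
    Returns (between_funder, within_funder) sets of award numbers.
    """
    by_award = defaultdict(list)
    for grant_doi, award_number, funder_doi in grants:
        by_award[award_number].append((grant_doi, funder_doi))

    between, within = set(), set()
    for award_number, entries in by_award.items():
        if len(entries) < 2:
            continue  # award number used by one grant only: no conflict
        funders = {funder for _, funder in entries}
        if len(funders) > 1:
            between.add(award_number)  # shared by grants of different funders
        else:
            within.add(award_number)  # repeated within a single funder
    return between, within


# Made-up grants: "42" is shared by two funders, "1931" repeats within one.
sample_grants = [
    ("10.9999/g1", "42", "funder-a"),
    ("10.9999/g2", "42", "funder-b"),
    ("10.9999/g3", "1931", "funder-c"),
    ("10.9999/g4", "1931", "funder-c"),
    ("10.9999/g5", "unique-7", "funder-a"),
]
between, within = classify_award_conflicts(sample_grants)
print(sorted(between), sorted(within))
```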
Our linking approach
Based on all those observations, we adopted the following approach:
- We iterated over all registered grants, and for each grant we performed the following steps:
  - We used the award.number:<grant DOI> filter in the REST API to find all items listing the grant’s DOI as the award number. Because this is based on the grant’s persistent identifier, we recorded those links without any further verification.
  - We used the award.number:<grant award number> filter in the REST API to find all items listing the grant’s award number in the funding information. Each resulting item was then verified by comparing the funder information in the item to the funder information in the grant. We recorded the link between the grant and the candidate item only if the verification succeeded.
- In the final step, we examined all recorded links to make sure that each pair (research output, award number) is linked to at most one grant. Links violating this rule were flagged as not reliable.
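The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind the numbers in this post. The search callable stands in for the REST API query and is assumed to return research outputs only, verify_funder is a placeholder for the funder verification, and every identifier in the example is made up.

```python
from collections import defaultdict


def link_grants(grants, search, verify_funder):
    """Link grants to research outputs and flag ambiguous links.

    grants: iterable of dicts with "doi", "award" and "funder" keys.
    search: callable mapping an award.number value to candidate items
        (dicts with "doi" and "funder" keys).
    verify_funder: callable(item, grant) -> bool.
    Returns (reliable, unreliable) lists of
    (item DOI, award number, grant DOI) tuples.
    """
    links = []
    for grant in grants:
        # Items listing the grant DOI as the award number: no verification.
        for item in search(grant["doi"]):
            links.append((item["doi"], grant["doi"], grant["doi"]))
        # Items listing the award number: keep only verified funders.
        for item in search(grant["award"]):
            if verify_funder(item, grant):
                links.append((item["doi"], grant["award"], grant["doi"]))

    # Each (item, award number) pair may be linked to at most one grant.
    grants_per_pair = defaultdict(set)
    for item_doi, award, grant_doi in links:
        grants_per_pair[(item_doi, award)].add(grant_doi)
    reliable = [link for link in links if len(grants_per_pair[link[:2]]) == 1]
    unreliable = [link for link in links if len(grants_per_pair[link[:2]]) > 1]
    return reliable, unreliable


# Made-up data: g1 and g2 share an award number and a funder.
grants = [
    {"doi": "10.9999/g1", "award": "1931", "funder": "F1"},
    {"doi": "10.9999/g2", "award": "1931", "funder": "F1"},
    {"doi": "10.9999/g3", "award": "A-7", "funder": "F2"},
]
index = {
    "1931": [{"doi": "10.9999/p1", "funder": "F1"}],
    "A-7": [{"doi": "10.9999/p2", "funder": "F2"},
            {"doi": "10.9999/p3", "funder": "F9"}],
    "10.9999/g3": [{"doi": "10.9999/p4", "funder": "F2"}],
}
reliable, unreliable = link_grants(
    grants,
    search=lambda value: index.get(value, []),
    verify_funder=lambda item, grant: item["funder"] == grant["funder"],
)
print(len(reliable), len(unreliable))
```

In this example the output p1 matches both g1 and g2, so both of its links are flagged, while p3 is dropped because its funder does not match.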
We used different techniques to verify the funder information between the research output (item) and the grant, depending on what information is available. Grants always have the funder DOI. The item, however, can have the funder DOI, the funder name, or both.
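If the funder DOI was available on both sides, the following rules were used for the funder verification (ordered by decreasing confidence):
- The funder DOIs are the same.
- The funder DOIs are related with a replaced/was replaced by relationship.
- The funder DOIs are related with an ancestor/descendant relationship.
If the funder DOI was not available in the item, the following rules were used for the funder verification (ordered by decreasing confidence):
- The funder names are the same.
- The name of the funder in the item is the same as the name of the funder that replaced/was replaced by the funder in the grant.
- The name of the funder in the item is the same as the name of the ancestor or descendant of the funder in the grant.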
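A minimal sketch of such a verification is given below, using the six rules that appear in the breakdown table in the results section. The function name, the dictionary-based inputs and the sample identifiers are all hypothetical, and the exact string comparison of names is an assumption. In practice the relationships would come from the Funder Registry.

```python
def verify_funder(item, grant, replaced, hierarchy, names):
    """Return the name of the first rule that matches, or None.

    item: dict with optional "funder_doi" and "funder_name" keys.
    grant: dict with a "funder_doi" key.
    replaced: funder DOI -> set of DOIs linked by replaced/was replaced by.
    hierarchy: funder DOI -> set of ancestor and descendant DOIs.
    names: funder DOI -> funder name.
    """
    grant_doi = grant["funder_doi"]
    item_doi = item.get("funder_doi")
    if item_doi is not None:
        # Funder DOI on both sides: compare identifiers only.
        if item_doi == grant_doi:
            return "same DOI"
        if item_doi in replaced.get(grant_doi, set()):
            return "replaced DOI"
        if item_doi in hierarchy.get(grant_doi, set()):
            return "ancestor/descendant DOI"
        return None

    # No funder DOI in the item: fall back to exact name comparison.
    item_name = item.get("funder_name")
    if not item_name:
        return None
    if item_name == names.get(grant_doi):
        return "same name"
    if any(item_name == names.get(d) for d in replaced.get(grant_doi, set())):
        return "replaced name"
    if any(item_name == names.get(d) for d in hierarchy.get(grant_doi, set())):
        return "ancestor/descendant name"
    return None


# Made-up registry data for illustration.
names = {"F1": "Example Trust", "F2": "Example Fund", "F3": "Example Agency"}
replaced = {"F1": {"F2"}}
hierarchy = {"F1": {"F3"}}
grant = {"funder_doi": "F1"}

print(verify_funder({"funder_doi": "F2"}, grant, replaced, hierarchy, names))
print(verify_funder({"funder_name": "Example Agency"}, grant, replaced, hierarchy, names))
```

Returning the rule name rather than a plain boolean makes it easy to count links per rule afterwards. Note that in this sketch an item with a non-matching funder DOI is rejected without consulting its funder name.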
Note that this is in fact very similar to our reference matching approach. In both cases, first we search for candidate items, and then verify the candidates by comparing the metadata. The actual metadata used for the verification varies, because different information is typically given in the bibliographic reference and the funding information.
What we found
This procedure applied to the entire Crossref dataset resulted in 20,846 links between research outputs and grants. Of those, 12 were flagged as unreliable, because they involved more than one grant linked to the same item and award number. The rest of this section focuses on the remaining 20,834 links.
Within the 20,834 links, we have 17,082 research outputs and 3,858 (10.1%) grants.
Here is the breakdown into the verification approaches used:
| Verification approach | #links | % of links |
|---|---|---|
| The item contains grant DOI - no verification | 6 | <0.1% |
| Funder DOIs are the same | 8,364 | 40.1% |
| Funder DOIs are related with a replaced/was replaced by relationship | 3,704 | 17.8% |
| Funder DOIs are related with an ancestor/descendant relationship | 7,718 | 37.0% |
| Funder names are the same | 591 | 2.8% |
| The name of the funder in the item is the same as the name of the funder that replaced/was replaced by the funder in the grant | 364 | 1.7% |
| The name of the funder in the item is the same as the name of the ancestor or descendant of the funder in the grant | 87 | 0.4% |
In most cases, just using the funder DOIs for the verification was enough. Verifying by the funder name added 1,042 links, which is 5% of all links.
And here are statistics for individual funders. Only funders with at least 10 deposited grants are listed in the table. The table shows the number of detected links, the number of distinct research outputs linked, the total number of outputs mentioning the given funder DOI, and the number of grants.
| Funder | #links | #linked research outputs | #total outputs with funder DOI | #grants |
|---|---|---|---|---|
| Japan Science and Technology Agency | 11,922 | 10,411 | 25,779 | 9,383 |
| Wellcome Trust (including both funder DOIs 10.13039/100004440 and 10.13039/100010269) | 8,001 | 6,246 | 49,492 | 17,534 |
| James S. McDonnell Foundation | 463 | 457 | 2,534 | 557 |
| Melanoma Research Alliance | 152 | 150 | 894 | 392 |
| Asia-Pacific Network for Global Change Research | 100 | 100 | 838 | 539 |
| U.S. Department of Energy | 56 | 52 | 97,482 | 8,462 |
| Gordon and Betty Moore Foundation | 51 | 50 | 5,928 | 94 |
| American Cancer Society | 3 | 3 | 7,276 | 107 |
| Children’s Tumor Foundation | 1 | 1 | 759 | 630 |
| American Parkinson Disease Association | 0 | 0 | 181 | 12 |
| Neurofibromatosis Therapeutic Acceleration Program | 0 | 0 | 101 | 68 |
| International Anesthesia Research Society | 0 | 0 | 94 | 34 |
| Australian National Data Service | 0 | 0 | 92 | 67 |
Note that the fourth column reports the total number of outputs registered with Crossref and mentioning the given funder DOI, including grants, journal papers and all other content types.
It is interesting to compare the number of linked research outputs for a given funder with the total number of research outputs mentioning a given funder DOI. In general, for a funder that registers grants, the more research outputs mentioning this funder, the more links we should be able to find.
And for some funders (Japan Science and Technology Agency, Melanoma Research Alliance, Asia-Pacific Network for Global Change Research, Wellcome Trust, James S. McDonnell Foundation), the number of linked outputs is indeed high, as compared with how many outputs mention the funder in the first place. This suggests our procedure was quite successful in linking outputs funded by these funders, meaning that in general the metadata in their grants and the funding information in the research outputs match.
On the other hand, we have a few funders for which we managed to link only a very small fraction of research outputs. There are several potential explanations here. A simple one is that not all relevant grants have been deposited yet. For example, a funder might be registering new grants only, whereas many research outputs mention older, not yet registered grants. It is also possible that there are systematic differences in how the publishers deposit the funding information in articles and other outputs, and how it is given in grants. Such differences might prevent us from establishing links, contributing to the overall low percentage of linked grants.
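To make this comparison concrete, the share of linked outputs can be computed directly from the table of funders. The snippet below is a small illustration that copies a few rows from that table. Because the totals also count grants and other content types that mention the funder DOI, the resulting shares should be read as rough indicators only.

```python
# (linked research outputs, total outputs mentioning the funder DOI),
# copied from the funder table in this post.
FUNDERS = {
    "Japan Science and Technology Agency": (10_411, 25_779),
    "Wellcome Trust": (6_246, 49_492),
    "James S. McDonnell Foundation": (457, 2_534),
    "U.S. Department of Energy": (52, 97_482),
    "Gordon and Betty Moore Foundation": (50, 5_928),
    "American Cancer Society": (3, 7_276),
}


def linked_share(linked: int, total: int) -> float:
    """Fraction of outputs mentioning the funder that were linked to a grant."""
    return linked / total if total else 0.0


# Print funders from the highest to the lowest share of linked outputs.
ranked = sorted(FUNDERS.items(), key=lambda kv: linked_share(*kv[1]), reverse=True)
for name, (linked, total) in ranked:
    print(f"{name}: {linked_share(linked, total):.2%}")
```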
The importance of being precise
Here are some examples of existing links that should’ve been found, but were not.
The award number in grant 10.48105/pc.gr.93156 is CTF-2020-01-004. The article 10.3390/ijms22094716 mentions award number 2020‐01‐004 and the same funder (Children’s Tumor Foundation). It is very probable that this is the same grant, but our procedure expects exactly the same award number, and so the two were not linked.
Paper 10.1128/genomea.00159-18 contains award number 1931 and U.S. Department of Energy as the funder. There are two grants with the same award number and funder: 10.46936/10.25585/60001053 and 10.46936/genr.proj.2000.1931/60002530. It is difficult to choose between them, and these links were marked as unreliable.
These examples could be signs of systematic errors and/or discrepancies that effectively prevent linking of those funders’ grants.
In problems such as linking grants to research outputs, there are typically two key ingredients of success, which are also the main areas for improvement: the quality of the metadata, and the strength of the linking approach.
The metadata could be improved greatly by addressing existing discrepancies between grants and research outputs and allowing (and encouraging!) the publishers to provide grant DOIs in the funding information. Thankfully, we are not alone in those efforts. Both this recent Upstream blog from Alexis-Michel Mugabushaka, and this Scholarly Kitchen post from Robert Harrington call for the development and adoption of grant DOIs in scholarly metadata.
In terms of the linking approach, there are some ideas that could be used to further improve the linking accuracy and completeness:
- The verification by funder name could be fuzzy and allow for minor variations like typos or additional words.
- Apart from replaced/replaced by and ancestor/descendant, there are other relationships between funders in the Funder Registry: continuation of, incorporates/incorporated into, merged with, renamed as, split into/split from. We could also consider those relationships during the funder validation.
- Apart from the funder information, there is other information that could be potentially used for verification, for example, the names of the authors and the investigators, the domain, or keywords.
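As an illustration of the first idea, fuzzy name verification could look roughly like the sketch below. It uses difflib.SequenceMatcher from the Python standard library. The function name and the 0.9 threshold are arbitrary assumptions, not something we have evaluated, and a real implementation would need testing against actual funder names.

```python
from difflib import SequenceMatcher


def names_match(name_a: str, name_b: str, threshold: float = 0.9) -> bool:
    """Compare two funder names, tolerating small differences.

    Names are lower-cased and whitespace is collapsed before comparison.
    The threshold is an arbitrary illustrative value.
    """
    a = " ".join(name_a.casefold().split())
    b = " ".join(name_b.casefold().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(names_match("Wellcome Trust", "Welcome Trust"))            # typo
print(names_match("Melanoma Research Alliance",
                  "The Melanoma Research Alliance"))             # extra word
print(names_match("Wellcome Trust", "American Cancer Society"))  # different funder
```

A looser threshold would find more links but also risks matching different funders with similar names, so any such change would have to be weighed against the rate of false links.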
If you have any questions, do get in touch!