Tips for working with Crossref public data files and Plus snapshots
What is this?
About once a year, Crossref releases a public data file containing all of Crossref’s public metadata. We typically release this as a tar file and distribute it via Academic Torrents.
Users of Crossref’s Plus service can access similar data snapshots that we update monthly. These are also tar files, but we distribute them via the Plus service API, and you need a Plus API token to access them.
In either case, these files are large and unwieldy. This document provides you with tips that should make your life easier when handling Crossref public metadata files and Plus snapshots.
Downloading the public data file directly from AWS
The first three public data files were only accessible via torrent download to keep costs manageable and to enable anonymous downloads. As an alternative, we are also making the 2023 file available via a “Requester Pays” option.
A copy of the public data file is stored on AWS S3 in a bucket configured with the “Requester Pays” option. This means that rather than the bucket owner (Crossref) paying for bandwidth and transfer costs when downloading objects, the requester pays instead. The cost is expected to vary slightly year to year depending on variables like file size and end-user setups. The 2024 file is approximately 200 GB, and plugging that into this calculator results in an estimated cost of $18 USD. More information about “Requester Pays” can be found in the AWS documentation.
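That figure is consistent with AWS’s standard data-transfer-out rate (an assumption; actual pricing varies by region and usage tier) of roughly $0.09 USD per GB: 200 GB × $0.09/GB ≈ $18 USD.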
The bucket is called api-snapshots-reqpays-crossref. You can use either the AWS CLI or the AWS REST API to access it. There are code examples in the AWS documentation.
For example, using the AWS CLI, after authenticating you could run:
# List the objects in the bucket
aws s3 ls --request-payer requester s3://api-snapshots-reqpays-crossref
# Download the public data file
aws s3api get-object --bucket api-snapshots-reqpays-crossref --request-payer requester --key April-2023-public-data-file-from-crossref.tar ./April-2023-public-data-file-from-crossref.tar
Note that the key part of the command is --request-payer requester, which is mandatory. Without that flag, the command will fail.
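If you prefer to work from Python, here is a minimal sketch of the same download using boto3, assuming boto3 is installed and your AWS credentials are configured. The bucket and key match the CLI example above; the RequestPayer extra argument plays the same role as the --request-payer flag:

import boto3

# Download the public data file from the "Requester Pays" bucket.
# RequestPayer="requester" is required, just like --request-payer
# in the CLI examples above.
s3 = boto3.client("s3")
s3.download_file(
    Bucket="api-snapshots-reqpays-crossref",
    Key="April-2023-public-data-file-from-crossref.tar",
    Filename="./April-2023-public-data-file-from-crossref.tar",
    ExtraArgs={"RequestPayer": "requester"},
)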
Handling tar files
Q: The tar file contains many files that, in turn, contain the individual DOI records. Some of these files are very large and hard to process. Could you break them out into separate files per DOI instead?
A: Yes, we could. But that creates its own set of problems. Standard filesystems on Linux/macOS/Windows really, really don’t like it when you create hundreds of millions of small files. Even standard command-line tools like ls choke on directories with more than a few thousand files in them. Unless you are using a specialized filesystem formatted with custom inode settings optimized for hundreds of millions of files, saving each DOI as an individual record will bring you a world of hurt.
Q: Gah! The tar file is large, and uncompressing it takes up a ton of room and generates a huge number of files. What can we do to make this easier? Can you split the tar file so we can manage it in batches?
A: Don’t uncompress or extract the tar file. You can read the files straight from the compressed tar file.
Q: But won’t reading files straight from the tar file be slow?
A: We did three tests, all done on the same machine using the same tar file, which, at the time of this writing, contained 42,210 files that, in turn, contained records for 127,574,634 DOIs.
Test 1: Decompressing and untarring the file took about 82 minutes.
On the other hand…
Test 2: A Python script iterating over each filename in the tar file (without extracting and reading the file into memory) completed in just 29 minutes.
Test 3: A Python script iterating over each filename in the tar file and extracting and reading the file into memory completed in just 61 minutes.
Both of the above scripts ran in a single process. However, you could almost certainly optimize further by parallelizing reading the files from the tar file.
In short, the tar file is a lot easier to handle if you don’t decompress or extract it. Instead, read directly from the compressed tar file, as in the sketch below.
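For instance, here is a minimal Python sketch of reading records straight from the tar file. It assumes each member is a JSON file, possibly gzipped, shaped like {"items": [record, ...]}; adjust the parsing if your copy of the file differs:

import gzip
import json
import tarfile

def iter_records(path):
    # "r|*" streams members sequentially, auto-detecting compression,
    # so nothing is ever extracted to disk.
    with tarfile.open(path, "r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            if member.name.endswith(".gz"):
                fh = gzip.open(fh)  # some dumps gzip each member file
            # each member is assumed to hold {"items": [record, ...]}
            yield from json.load(fh).get("items", [])

count = sum(1 for _ in iter_records("all.json.tar.gz"))
print(count, "records")

To parallelize, a single reader could stream the raw bytes of each member and hand them off to a process pool for JSON parsing.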
Downloading and using Plus snapshots
Q: How should I best use the snapshots? Can we get them more frequently than once a month?
A: The monthly snapshots include all public Crossref metadata up to and including data for the month before they were released. We make them available to seed, and occasionally refresh, a local copy of the Crossref database in any system you are developing that requires Crossref metadata. In most cases, you should just keep this data current by using the Crossref REST API to retrieve new or modified records (see the sketch at the end of this answer). Typically, only a small percentage of the snapshot changes from month to month, so if you are downloading it repeatedly, you are just downloading the same unchanged records time and time again. Occasionally, there will be a large number of changes in a month. This typically happens when:
A large Crossref member adds or updates a lot of records at once.
We add a new metadata element to the schema.
We change the way we calculate something (e.g. citation counts) and that affects a lot of records.
In these cases, it makes sense to refresh your metadata from the newly downloaded snapshot instead of using the API.
In short, if you are downloading the snapshot more than a few times a year, you are probably doing something very inefficient.
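As an illustration, here is a minimal Python sketch of that keep-current pattern, using the REST API’s from-index-date filter and cursor-based deep paging. The date and mailto value are placeholders to replace with your own:

import requests

params = {
    "filter": "from-index-date:2024-06-01",  # placeholder: your last sync date
    "rows": 1000,
    "cursor": "*",  # "*" starts a deep-paging session
    "mailto": "you@example.org",  # placeholder: routes you to the polite pool
}
while True:
    message = requests.get(
        "https://api.crossref.org/works", params=params, timeout=60
    ).json()["message"]
    if not message["items"]:
        break
    for record in message["items"]:
        pass  # upsert the record into your local copy here
    params["cursor"] = message["next-cursor"]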
Q: The snapshot is large and difficult to download. I keep having it fail and have to start the download again. Can you split the snapshot so that I can download smaller parts instead?
A: If your download gets interrupted, you can resume it from the point where it was interrupted instead of starting over. This is easiest to do using something like wget, but you can also do it with curl.
You can try it yourself:
> export TOKEN='<insert-your-token-here>'
> curl -o "all.json.tar.gz" --progress-bar -L https://api.crossref.org/snapshots/monthly/latest/all.json.tar.gz -H "Crossref-Plus-API-Token: ${TOKEN}"
Wait a few minutes, then press ctrl-c to interrupt the download.
Then, to resume it from where it left off, include the switch -C -:
> curl -o "all.json.tar.gz" --progress-bar -L -C - https://api.crossref.org/snapshots/monthly/latest/all.json.tar.gz -H "Crossref-Plus-API-Token: ${TOKEN}"
curl will calculate the byte offset from where it left off and continue the download from there.
Supplementary tools and alternative file formats
In late 2023 we started experimenting with supplementary tools and alternative file formats meant to make our public data files easier to use by broader audiences.
The Crossref Data Dump Repacker is a Python application that allows you to repack the Crossref data dump into the JSON Lines format.
doi2sqlite is a tool for loading Crossref metadata into a SQLite database.
And for finding the record of a particular DOI, we’ve published a Python API for interacting with the annual public data files. This tool can create an index of the DOIs in the file, enabling easier record lookups without having to iterate over the entire file, which can take hours. A torrent is available for the 2024 index in SQLite format if you do not wish to generate it yourself.