Public data files and snapshots
Once a year we release a public data file that includes the metadata for all Crossref-registered DOIs. We typically release it as a tar file and distribute it via Academic Torrents, with the metadata in JSON format. You can see details of the latest release on our blog.
On a monthly basis, we release snapshots that are available to Metadata Plus subscribers. The monthly snapshots are available in both JSON and XML formats and are accessed using an API token. A new snapshot is created each month, is available by the 5th day of the month, and includes all records up to and including the end of the previous month. Snapshots remain available until the end of the following quarter, after which they are removed (for example, the October snapshot remains available until the end of the following March).
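The retention rule can be worked through as a small date calculation. This is only a sketch of one reading of the rule: it assumes "the following quarter" means the calendar quarter after the one containing the snapshot's month, and the function name `removal_date` is made up for illustration.

```python
import calendar
from datetime import date


def removal_date(year: int, month: int) -> date:
    """Return the last day of the calendar quarter after the one containing (year, month).

    Assumption: quarters are calendar quarters (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec).
    """
    quarter = (month - 1) // 3           # 0 for Q1 ... 3 for Q4
    next_quarter = quarter + 1           # may roll over into the next year
    end_year = year + next_quarter // 4
    end_month = (next_quarter % 4) * 3 + 3
    last_day = calendar.monthrange(end_year, end_month)[1]
    return date(end_year, end_month, last_day)


# The October example from the text: available until the end of the following March.
print(removal_date(2024, 10))  # 2025-03-31
```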
The snapshots are very large (over 200 GB), so they may take a long time to download on slow connections. This document explains how to access them and offers tips for handling them.
For applications that only require a small amount of the data, you should use the REST API instead. For applications where you want to keep a copy of our metadata records current, use the REST API or OAI-PMH to query for new records at your preferred interval.
Accessing the public data file
Access via Academic Torrents
The most recent public data file is accessible via Academic Torrents using the DOI https://doi-org.turing.library.northwestern.edu/10.13003/87bfgcee6g.
Download from AWS
Since 2023, the public data file has also been available via a “Requester Pays” option, primarily to provide access for organisations that don’t permit downloads via torrent services.
A copy is stored on AWS S3 in a bucket configured with the “Requester Pays” option. This means that rather than the bucket owner (Crossref) paying for bandwidth and transfer costs when downloading objects, the requester pays instead. The cost is expected to vary slightly year to year depending on variables like file size and end-user setups. The 2025 file is approximately 200 GB, and plugging that into this calculator results in an estimated cost of $18 USD. More information about “Requester Pays” can be found in the AWS documentation.
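As a rough check on the figure above, the estimate is just file size multiplied by a per-GB transfer rate. The rate used here is an assumption inferred from the numbers in the text ($18 for roughly 200 GB), not a quoted AWS price, so check current pricing before relying on it.

```python
# Assumption: per-GB rate inferred from the text ($18 / 200 GB), not an official AWS price.
ASSUMED_RATE_USD_PER_GB = 0.09


def estimated_transfer_cost(size_gb: float, rate: float = ASSUMED_RATE_USD_PER_GB) -> float:
    """Estimate the requester's transfer cost in USD, rounded to cents."""
    return round(size_gb * rate, 2)


print(estimated_transfer_cost(200))
```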
The bucket is called api-snapshots-reqpays-crossref. You can use the AWS CLI or the AWS REST API to access it. There are code examples in the AWS documentation.
Using the AWS CLI for example, after authenticating:
# List the objects in the bucket
aws s3 ls --request-payer requester s3://api-snapshots-reqpays-crossref
# Download the public data file
aws s3api get-object --bucket api-snapshots-reqpays-crossref --request-payer requester --key March-2025-public-data-file-from-crossref.tar ./March-2025-public-data-file-from-crossref.tar
Note that --request-payer requester is mandatory. Without that flag, the command will fail.
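The same two steps can be sketched in Python with the third-party boto3 library, assuming it is installed and your AWS credentials are already configured. The bucket and key are the ones from the CLI example above; the function names are made up for illustration. boto3 expresses the requester-pays flag as `RequestPayer="requester"`.

```python
BUCKET = "api-snapshots-reqpays-crossref"
KEY = "March-2025-public-data-file-from-crossref.tar"


def request_payer_args() -> dict:
    """Arguments that mark the caller as the paying requester."""
    return {"RequestPayer": "requester"}


def download_public_data_file(dest: str = KEY) -> None:
    """List the bucket, then download the public data file to `dest`."""
    import boto3  # third-party; imported here so the helpers above work without it

    s3 = boto3.client("s3")

    # Equivalent of `aws s3 ls --request-payer requester ...`
    listing = s3.list_objects_v2(Bucket=BUCKET, **request_payer_args())
    for obj in listing.get("Contents", []):
        print(obj["Key"], obj["Size"])

    # Equivalent of `aws s3api get-object ... --request-payer requester`
    s3.download_file(BUCKET, KEY, dest, ExtraArgs=request_payer_args())
```

Calling `download_public_data_file()` starts a transfer of roughly 200 GB that is billed to your AWS account.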
Accessing monthly snapshots
Snapshots are available to Metadata Plus subscribers via a /snapshots route in the REST API, which serves each snapshot as a compressed tar file (.tar.gz). These links always lead to the most recent snapshot:
- JSON output: https://api-crossref-org.turing.library.northwestern.edu/snapshots/monthly/latest/all.json.tar.gz
- UNIXSD (XML) output: https://api-crossref-org.turing.library.northwestern.edu/snapshots/monthly/latest/all.xml.tar.gz
To see if the snapshot from a particular month is available, make an HTTP HEAD request using the following URL patterns:
- JSON output: https://api-crossref-org.turing.library.northwestern.edu/snapshots/monthly/{YYYY/MM}/all.json.tar.gz
- XML output: https://api-crossref-org.turing.library.northwestern.edu/snapshots/monthly/{YYYY/MM}/all.xml.tar.gz
Include your API token in the request header using the format Crossref-Plus-API-Token: Bearer [API token].
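A minimal sketch of the availability check, using only the Python standard library. The host is copied from the links above, and the function names are made up for illustration. The function returns the raw HTTP status code rather than a yes/no answer, because which code the server sends for a snapshot that has been removed is not stated here.

```python
import urllib.error
import urllib.request

# Base URL copied from the links in this section.
BASE = "https://api-crossref-org.turing.library.northwestern.edu/snapshots/monthly"


def snapshot_url(year: int, month: int, fmt: str = "json") -> str:
    """Build the URL for one month's snapshot; fmt is 'json' or 'xml'."""
    if fmt not in ("json", "xml"):
        raise ValueError("fmt must be 'json' or 'xml'")
    return f"{BASE}/{year:04d}/{month:02d}/all.{fmt}.tar.gz"


def auth_header(token: str) -> dict:
    """Header format described above: Crossref-Plus-API-Token: Bearer [API token]."""
    return {"Crossref-Plus-API-Token": f"Bearer {token}"}


def snapshot_status(year: int, month: int, token: str, fmt: str = "json") -> int:
    """Send an HTTP HEAD request and return the status code."""
    request = urllib.request.Request(
        snapshot_url(year, month, fmt), headers=auth_header(token), method="HEAD"
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. an error status when the snapshot is not available
```

For example, `snapshot_status(2025, 1, token)` would check the January 2025 JSON snapshot.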
Tips and tricks
Don’t decompress the file: you can work directly with the tar file. This saves space and, in our tests, was quicker than decompressing the file before analysing it.
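One way to read records straight from the archive is Python's `tarfile` module in streaming mode. This sketch assumes each member of the archive is a JSON document (optionally gzip-compressed) holding its records in a top-level `items` list; check this against the file you downloaded. The function name `iter_records` is made up for illustration.

```python
import gzip
import json
import tarfile


def iter_records(archive_path: str):
    """Yield metadata records one at a time without unpacking the archive to disk."""
    # "r|*" reads the archive as a forward-only stream and auto-detects compression,
    # so the same code handles a plain .tar and a .tar.gz.
    with tarfile.open(archive_path, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            raw = archive.extractfile(member).read()
            if member.name.endswith(".gz"):
                raw = gzip.decompress(raw)
            payload = json.loads(raw)
            # Assumption: records live under an "items" key in each member.
            yield from payload.get("items", [])
```

Because the archive is read as a stream, only one member is held in memory at a time.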
Check what’s changed from month to month using the REST API. This can help you decide whether to download the next monthly snapshot or update from the REST API. The number of new or updated records deposited by members can be found using a request such as the following:
https://api-crossref-org.turing.library.northwestern.edu/v1/works?filter=from-update-date:2025-01-01,until-update-date:2025-01-31&rows=0
The number of records with any changes, including new relationships and changes in Cited-by counts, can be found using a request such as the following:
https://api-crossref-org.turing.library.northwestern.edu/v1/works?filter=from-index-date:2025-01-01,until-index-date:2025-01-31&rows=0
You can use these requests (with rows=1000 and a cursor) to retrieve all items with changes.
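The two counting requests above can be sketched as follows. With `rows=0` the response carries no items, only the total, which the REST API reports as `total-results` inside the `message` object. The host is copied from the URLs above, and the function names are made up for illustration.

```python
import json
import urllib.request

# Endpoint copied from the example URLs in this section.
WORKS = "https://api-crossref-org.turing.library.northwestern.edu/v1/works"


def count_url(kind: str, start: str, end: str) -> str:
    """Build a rows=0 request; kind is 'update' (member changes) or 'index' (any change)."""
    if kind not in ("update", "index"):
        raise ValueError("kind must be 'update' or 'index'")
    date_filter = f"from-{kind}-date:{start},until-{kind}-date:{end}"
    return f"{WORKS}?filter={date_filter}&rows=0"


def total_results(body: dict) -> int:
    """Extract the record count from a parsed works response."""
    return body["message"]["total-results"]


def count_changes(kind: str, start: str, end: str) -> int:
    """Fetch the count from the REST API (performs a network request)."""
    with urllib.request.urlopen(count_url(kind, start, end)) as response:
        return total_results(json.load(response))
```

For example, `count_changes("index", "2025-01-01", "2025-01-31")` would return the number of records with any change in January 2025.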
Keep your snapshot updated by retrieving works using a created, update, or index date filter. The created date filters retrieve only newly created metadata records; the update date filters retrieve items that are new or have been updated by a member; and the index date filters retrieve any items created or modified by the member, Crossref, or other sources. See information about using filters on queries and learn about using cursors to scroll through large datasets. Note that this can be done on an hourly basis (or even per minute):
https://api-crossref-org.turing.library.northwestern.edu/v1/works?filter=from-index-date:2025-01-01T12,until-index-date:2025-01-31T12&rows=1000&cursor=*
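A minimal sketch of the cursor loop behind a request like the one above. It starts with `cursor=*`, reads `next-cursor` from each response's `message`, and stops when a page comes back with no items. The page fetcher is passed in as a parameter so the loop can be tried without network access; both function names are made up for illustration, and the host is copied from the URL above.

```python
import json
import urllib.parse
import urllib.request

WORKS = "https://api-crossref-org.turing.library.northwestern.edu/v1/works"


def fetch_page(filter_expr: str, rows: int, cursor: str) -> dict:
    """Fetch one page and return its 'message' object (performs a network request)."""
    query = urllib.parse.urlencode({"filter": filter_expr, "rows": rows, "cursor": cursor})
    with urllib.request.urlopen(f"{WORKS}?{query}") as response:
        return json.load(response)["message"]


def iter_changed_works(fetch, filter_expr: str, rows: int = 1000):
    """Yield every work matching filter_expr, following cursors page by page."""
    cursor = "*"  # "*" asks the API to start a new cursor
    while True:
        message = fetch(filter_expr, rows, cursor)
        items = message.get("items", [])
        if not items:
            break  # an empty page means the result set is exhausted
        yield from items
        cursor = message["next-cursor"]
```

For example, `iter_changed_works(fetch_page, "from-index-date:2025-01-01T12,until-index-date:2025-01-31T12")` would walk through every record changed in that window.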
Typically the number of changes per month is low. Occasionally, a member updates a very large number of records, for example to add a new piece of metadata or to register historical content. Refreshing the snapshot several times a year and retrieving changes via the REST API in the interim period should be sufficient to keep your local cache in sync.
If your download is interrupted, you can resume from the point it stopped instead of starting over. This is easiest to do using something like wget, but you can also do it with curl. For example:
export TOKEN='<insert-your-token-here>'
curl -o "all.json.tar.gz" --progress-bar -L -X GET https://api-crossref-org.turing.library.northwestern.edu/snapshots/monthly/latest/all.json.tar.gz -H "Crossref-Plus-API-Token: Bearer ${TOKEN}"
Wait a few minutes, then press Ctrl-C to interrupt the download. To resume, include the switch -C -:
curl -o "all.json.tar.gz" --progress-bar -L -X GET https://api-crossref-org.turing.library.northwestern.edu/snapshots/monthly/latest/all.json.tar.gz -H "Crossref-Plus-API-Token: Bearer ${TOKEN}" -C -
The curl command will calculate the byte offset from where it left off and continue the download from there.
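The same idea can be sketched in Python: use the size of the partial file as the byte offset and ask the server for the rest with an HTTP `Range` header. This assumes the server honours range requests, which the curl behaviour above suggests. The function names are made up for illustration.

```python
import os
import urllib.request


def resume_headers(path: str, token: str):
    """Return (offset, headers) for resuming a download into `path`."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Crossref-Plus-API-Token": f"Bearer {token}"}
    if offset:
        headers["Range"] = f"bytes={offset}-"  # request only the missing tail
    return offset, headers


def resume_download(url: str, path: str, token: str, chunk_size: int = 1 << 20) -> None:
    """Download `url` to `path`, continuing from a partial file if one exists."""
    _, headers = resume_headers(path, token)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        # 206 means the server sent only the requested range, so append;
        # any other success status means it sent the whole file, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as out:
            while True:
                block = response.read(chunk_size)
                if not block:
                    break
                out.write(block)
```

If the local file is already complete, the server may reject the range request with an error status, which `urlopen` raises as an `HTTPError`.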
We have developed several tools to support handling the snapshots and public data file. These are not currently maintained but may make it easier to handle the data.
The Crossref Data Dump Repacker is a Python application that allows you to repack the Crossref data dump into JSON Lines format.
doi2sqlite is a tool for loading Crossref metadata into a SQLite database.
To help you find the record for a particular DOI, we’ve published a Python API for interacting with the annual public data files. This tool can create an index of the DOIs in the file, enabling easier record lookups without having to iterate over the entire file, which can take hours. A torrent is available for the 2024 index in SQLite format if you do not wish to generate it yourself.