We've migrated to the cloud; we hope you didn't notice (but maybe you did)
TLDR: We've successfully moved the main Crossref systems to the cloud! We have more to do: several bugs have been identified and fixed, and a few are still being worked on. Still, it's a step in the right direction and a significant milestone; whilst the cloud is a much larger financial investment, it addresses several risks and limitations and shores up the Crossref infrastructure for the future.
Some background
We have been doing a lot of thinking, planning, and working on paying down our technical debt and modernising our systems. It’s not fun and flashy work, but it is vital for sustaining our infrastructure, meeting the demand on existing services, and developing new services.
Just about a year ago, we completed phase one: migrating our main database from Oracle to PostgreSQL, an open-source database. This move brought us more in line with our commitment to the POSI principles, reduced our dependence on costly proprietary licences, and opened up the possibility of using and offering more contemporary features. With the transition to PostgreSQL we upgraded the operating system, the database software, and the underlying hardware, resulting in significant improvements to the overall throughput and capacity of the deposit system. Previously, the queue typically held more than 10,000 deposits waiting to be processed; now it holds fewer than 100 on average. Consequently, the average latency (the elapsed time from submission to deposit) has dropped from hours to seconds.
During phase one, a total of 35 new servers were created, and for the first time, the entire system configuration was defined through infrastructure-as-code, enabling the infrastructure to be recreated as necessary. This effort not only enabled the migration but also established a solid foundation for our cloud migration strategy, as the code was leveraged to configure our infrastructure on AWS. Additionally, it serves as a critical component of our disaster recovery planning.
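Crossref's actual setup is declarative (Terraform and Ansible, described further below), but as a rough illustration of what "infrastructure defined in code" means in practice, here is a minimal, hypothetical Python sketch that launches a server on AWS programmatically with boto3. Every identifier in it (AMI ID, instance type, tag) is a placeholder, not part of Crossref's configuration.

```python
import boto3  # assumes boto3 is installed and AWS credentials are configured

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch one server from a machine image; all values are placeholders,
# not Crossref's real configuration.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "deposit-system-example"}],
    }],
)
print("launched:", instances[0].id)
```

The point is that servers become reproducible artefacts of code rather than hand-configured machines, which is what made both the disaster recovery planning and the later cloud migration feasible.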
Most importantly, phase one set us up for phase two and our next migration: moving the system into the cloud.
Why we moved to the cloud
We had been running most of our services in a physical data centre near Boston, MA, USA (with a few exceptions: the REST API and our test system (test.crossref.org) were already in the cloud, as was the Crossref website). We had been planning to move to the cloud for, ahem, quite some time, but as always, competing priorities and limited resources had thwarted us, and the data centre was mostly serving us well.
But… with staff across 12 countries and increased global use of our system, operating our own hardware in a physical data centre was becoming increasingly challenging and risky, not to mention frustrating.
Moving to the cloud has solved several pain points for us:
- Physical access to the data centre was required for various tasks (e.g., hardware upgrades, troubleshooting, general maintenance), but as Crossref grew as an organisation and became more distributed, we had fewer staff in the area. Hosting services in the cloud means staff around the world can access our servers remotely (and we can leave the hardware upgrades to our vendor).
- Scaling up in the data centre required installing new hardware or upgrading connections, which took a good amount of time. In the cloud, we can scale up almost instantly.
- We can maintain copies of our databases and services in distributed locations, providing insurance against natural or other disasters (a small monitoring sketch follows this list).
- Upgrades no longer involve buying and installing physical hardware; the process is much quicker and more straightforward.
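To make the distributed-copies point concrete: on a PostgreSQL primary, replica health can be checked from the pg_stat_replication view. The sketch below is a generic illustration, not Crossref's monitoring; the hostname, database name, and credentials are hypothetical, and the psycopg2 driver is an assumption.

```python
import psycopg2  # assumption: psycopg2 driver installed; connection details are hypothetical

# Connect to the primary and list attached replicas with their replay lag.
conn = psycopg2.connect(
    host="primary.example.internal",
    dbname="deposits",
    user="monitor",
    password="change-me",
)
with conn.cursor() as cur:
    cur.execute("SELECT client_addr, state, replay_lag FROM pg_stat_replication;")
    for client_addr, state, replay_lag in cur.fetchall():
        print(f"replica {client_addr}: {state}, lag {replay_lag}")
conn.close()
```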
Moving from a physical data centre to the cloud also has trade-offs; for instance, the cost will be approximately five times higher than running the system in the data centre, and based on initial data, the annual cost may be as much as 2,000,000 USD. We aim to optimise and control this cost going forward.
What we did
The size of the undertaking was partly due to leaving it so long; technical debt had accumulated over many years of running the system in the data centre.
The whole plan was hugely detailed, but we can distil it to a few bullets:
- We conducted an analysis of components, considered risks and sequencing, and created a test plan and timeline, including comms.
- While most of the drive and work was on the shoulders of two infrastructure services colleagues, our software engineers were heavily involved too, and we had weekly check-ins with a cross-team group to review progress, reassess risks, and adjust timelines as we got closer to the migration date (or decided to move it once or twice).
- We first created the deposit system in the cloud.
- We then created other parts of our services that aren’t in the deposit system code base, but run alongside it, such as reports, querying, and other tools.
- We replicated our databases (of which there are several, in a few different flavours: PostgreSQL and MySQL).
- We gave 14 days’ notice to our members, via email, and kept this maintenance notice up to date.
- We commenced the migration on 8th July, which involved taking the whole system down and rejecting deposits for up to 24 hours.
- Along the way, we scripted the creation of CS and the other services using Terraform and Ansible, so that bringing up a whole new instance of CS (should we need to) won't be a manual process going forward (see the sketch after this list).
- We moved the DNS to point at our new system in the cloud, rather than the data centre. We brought the system back up on 9th July, after 14 hours of downtime, and watched the first few deposits come in, while testing thoroughly.
- Alongside the technical team, the membership and support team was at the ready to work through the testing in the new live production environment.
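As mentioned in the Terraform and Ansible point above, the goal is that standing up a new instance is a scripted, repeatable step rather than a manual one. A minimal sketch of that idea, assuming a hypothetical repository layout (an infrastructure/ directory of Terraform configuration and a configuration/ directory with an Ansible playbook), might look like this; it is not Crossref's actual tooling.

```python
import subprocess

def provision():
    """Create cloud resources with Terraform, then configure them with Ansible."""
    # Initialise and apply the Terraform configuration (directory name is hypothetical).
    subprocess.run(["terraform", "init"], cwd="infrastructure", check=True)
    subprocess.run(["terraform", "apply", "-auto-approve"], cwd="infrastructure", check=True)

    # Run the Ansible playbook against the freshly created servers
    # (inventory and playbook names are hypothetical).
    subprocess.run(
        ["ansible-playbook", "-i", "inventory/production", "site.yml"],
        cwd="configuration",
        check=True,
    )

if __name__ == "__main__":
    provision()
```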
The message we sent to members, Metadata Plus subscribers, and key integrators like PKP and Turnitin listed which services would be down and described what changes they might see, such as:
- The system timezone shifted from EST to UTC (Coordinated Universal Time), which would be noticeable in the timestamps reported back to members after metadata deposits.
- Our IP address became dynamic and is no longer static. If members had hardcoded our previous static IP address to connect to our services, that would no longer work.
- We previously allowed connections using the HTTP/1.0 protocol, but now require HTTP/1.1.
- Likewise, we previously allowed TLS version 1.1, but now require at least version 1.2; older ciphers will not work. The accepted ciphers are those listed in the AWS documentation for the "ELBSecurityPolicy-TLS13-1-2-2021-06" policy. (A quick way to check your connection is sketched below.)
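For integrators who want to confirm their client copes with these changes, here is a small self-contained Python check: it resolves the hostname at connection time (rather than relying on a pinned IP address), reports the negotiated TLS version (which should be 1.2 or 1.3), and shows how a UTC timestamp can be converted locally. The hostname and the example timestamp are just illustrations.

```python
import socket
import ssl
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

host = "doi.crossref.org"  # example hostname; use whichever Crossref endpoint you call

# Resolve dynamically at connection time rather than hardcoding an IP address.
print("resolves to:", socket.gethostbyname(host))

# Negotiate TLS and report the protocol version (expect TLSv1.2 or TLSv1.3).
ctx = ssl.create_default_context()
with socket.create_connection((host, 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        print("TLS version:", tls.version())

# Timestamps are now reported in UTC; convert to a local timezone if needed.
utc_ts = datetime(2025, 7, 9, 14, 0, tzinfo=timezone.utc)  # example timestamp
print("in US Eastern time:", utc_ts.astimezone(ZoneInfo("America/New_York")))
```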
How it went and what’s next
We still have more to do, with both expected and unexpected issues arising from the migration. There are a couple of functions that still route through the data centre, configuration changes to wrangle, and processes to iron out, so we'll be keeping the data centre open for another couple of months.
Those were the known issues…
…we also uncovered a few bugs along the way, and we've been reporting those (and our progress toward fixing them) on our status page; see the incident history there.
A few diligent members also alerted us to problems they were having. In some cases, we could tell why, and in many cases, their systems needed to be upgraded to work with ours. Thanks go to mEDRA, Spandidos Publications, and Stichting SciPost who helped us identify gaps that resulted in configuration improvements and lessons learned (that we then shared with other members).
There were three issues that we were contacted about more than others:
- Delayed delivery of notification emails, partly due to the volume of backlogged notification emails in the system.
- Mostly solved: We have repaired delivery of notification emails for all metadata deposits and are working on a fix for the delivery of messages associated with very large queries.
- A small percentage of registered records not being indexed in the REST API. This can cause downstream issues for a number of other services (e.g., Crossref metadata search at search.crossref.org, Participation Reports, ORCID auto-update, and external services that make use of the metadata from our REST API). A quick way to check whether a record is indexed is sketched after this list.
- Mostly solved: All records registered in July are now indexed in the REST API, although we have new reports of a few records missing in the last week, which we are actively investigating.
- Delayed delivery of July’s resolution reports.
- Solved: not only has July's resolution report run completed, but we also completed August's ahead of schedule.
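For members who want to spot-check indexing themselves, a registered record's presence in the REST API can be tested by requesting /works/{doi}: a 200 response means the record is indexed, and a 404 means it is not (yet). The sketch below uses only the Python standard library; the DOI shown is a placeholder.

```python
import urllib.error
import urllib.parse
import urllib.request

def is_indexed(doi: str) -> bool:
    """Return True if the DOI appears in the Crossref REST API, False on a 404."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise

print(is_indexed("10.5555/12345678"))  # placeholder DOI
```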
This migration was a significant effort, and 2025’s top priority project for the Open and Sustainable Operations (OSO) program team. Overall, we’re happy with our progress toward making Crossref infrastructure more robust, reliable, and future-proof. And judging by the messages of support we received, you are too! Onwards to the next infrastructure project… check out our roadmap to see what’s up next.
References
- ‘Infrastructure as code’ (2025) Wikipedia, 12 August. Available at: https://en.wikipedia.org/wiki/Infrastructure_as_code (Accessed: 12 August 2025).
- 'The programs approach: our experiences during the first quarter of 2025' (2025) Crossref. Available at: https://doi.org/10.64000/4s2ee-wkr84 (Accessed: 12 August 2025).
Further reading
- Jul 1, 2024 – Celebrating five years of Grant IDs: where are we with the Crossref Grant Linking System?
- Jun 4, 2024 – Rebalancing our REST API traffic
- Mar 24, 2022 – Outage of March 24, 2022
- Oct 27, 2021 – Update on the outage of October 6, 2021
- Oct 6, 2021 – Outage of October 6, 2021
- Apr 30, 2021 – Open-source code: giving back
- Oct 4, 2019 – Accidental release of internal passwords, & API tokens for the Crossref system