From storage closet to metadata champions: ASM’s journey toward a smarter scholarly infrastructure
The American Society for Microbiology (ASM) has earned recognition in Crossref’s Participation Reports for its exceptional metadata coverage among large publishing members––an achievement built on intentional change, technical investment, and collaborative work. In this Q&A, the ASM team shares what that journey looked like, the challenges they’ve tackled, and how centering metadata has helped them better connect research with the global scientific community.
A key lesson we learned is that meaningful progress doesn’t require perfection from day one. Start small, find manageable wins, refine as you go, and build a shared understanding across all your teams.
– David Haber, ASM
Once we completed our initial metadata cleanup of our backfile and made sure that we were producing good, clean, and consistent Crossref metadata (no small feat), we realized that each new policy, process, or even style change should be viewed through a metadata capture lens. By looking at our publishing goals through that lens, we are better able to see the right time and method to help enrich and “grow” both our article metadata breadth and depth. Much of the metadata work is invisible or an afterthought. But the recognition of ASM’s coverage in the participation reports has affirmed that our change in perspective — shifting from viewing Crossref metadata as something produced as an afterthought to centering our processes around the creation of that metadata — has put us on the right path.
When we first started on our various metadata cleanup projects, it felt like there were just a few of us, arguing, agreeing, and arguing some more about obscure tagging structures and proper XML modeling in a closet––literally… My office actually was an old storage closet, and my pre-pandemic whiteboard still has that ghostly blue haze of angle brackets scribbled with dry-erase markers.
Since then, our goals have shifted significantly. Early on, we just wanted all our content mapped to DOIs; then we thought, “Oh wait. Let’s include as many abstracts as possible. And references. If we have the data, let’s send it.” Now that we have a strong metadata foundation, we can think proactively about what to capture and transmit, how we want to prioritize our efforts, and how to make research we publish more discoverable to those who need it.
Looking back, were there any changes in internal collaboration or external partnerships that influenced your progress?
Over the past three to four years, we have made some significant changes to our partnerships. We migrated to a new online platform (Atypon), a new production partner (Kriyadocs), a new submission platform (Chronoshub), and a new billing system (RLSC). Each of these partnerships allowed us to evaluate how we were capturing metadata, when that capture occurred, and how best to improve the QC process to ensure accuracy and quality. These partnerships accelerated all our efforts to improve hidden metadata and finally brought them out of the storage closet into the light.
Have you adopted any new tools, standards, or technologies since your last blog?
Our production software (Kriyadocs) has centered metadata capture as a core function. We have processes and procedures that match all affiliations to Ringgold and ROR IDs. We have invested heavily in partnerships with organizations like Chronoshub to use natural language processing to automate the identification of authors and affiliations, so that users no longer have to fill out tedious forms. We have embraced ORCID and strongly encourage all authors to register for an iD if they don’t already have one. We have also adopted the CRediT taxonomy as a contributor framework and have built processes to make it easy for authors to stay within that taxonomy.
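To make the affiliation-matching step concrete, here is a minimal Python sketch that asks the public ROR affiliation-matching endpoint for its best guess at an organization ID. The endpoint and response fields are used as ROR documents them; the function name, sample string, and fallback logic are illustrative assumptions rather than ASM’s actual pipeline.

```python
import requests

ROR_MATCH_URL = "https://api.ror.org/organizations"

def match_affiliation(affiliation_text):
    """Ask ROR's affiliation-matching service for its best organization match.

    Returns the ROR ID, name, and score of the candidate ROR flags as
    'chosen', or None when the service is not confident enough to pick one.
    """
    resp = requests.get(ROR_MATCH_URL, params={"affiliation": affiliation_text}, timeout=30)
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        if item.get("chosen"):  # ROR flags at most one candidate, and only when confident
            org = item["organization"]
            return {"ror_id": org["id"], "name": org["name"], "score": item["score"]}
    return None  # no confident match; leave the record for manual curation

if __name__ == "__main__":
    print(match_affiliation("Dept. of Microbiology, Johns Hopkins Bloomberg School of Public Health"))
```

Anything that comes back as None still needs a human eye, which is exactly the gap discussed further down.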
The core problem (from our perspective) has always been the difference between author profile information and what is actually submitted in manuscripts. Auto-extraction of manuscript data into submission forms is one small step toward unifying author identity with manuscript data. One of our biggest pain points now is reconciling the chaotic data on author affiliations in manuscripts with institutional identifiers. Over the next year, this will be one of our main initiatives.
The capture of ORCID IDs has improved our ability to match papers to editors and identify hidden conflicts of interest. ORCID IDs have also helped us expand our reviewer pool, as they enable us to better disambiguate individuals with similar names.
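As a small illustration of why ORCID iDs help with disambiguation, the hypothetical sketch below groups reviewer records by iD when one is present and only falls back to a normalized name string when it is not, so two different people who share a name stay separate. The record shape and the iDs are placeholders, not ASM’s reviewer database.

```python
from collections import defaultdict

def group_reviewers(records):
    """Group reviewer records by ORCID iD when available, otherwise by normalized name."""
    groups = defaultdict(list)
    for rec in records:
        key = rec.get("orcid") or "name:" + " ".join(rec["name"].lower().split())
        groups[key].append(rec)
    return dict(groups)

# Placeholder records: two people share a name but carry different iDs.
reviewers = [
    {"name": "Jordan Smith", "orcid": "0000-0002-1825-0097"},
    {"name": "Jordan Smith", "orcid": "0000-0001-5109-3700"},
    {"name": "Jordan Smith", "orcid": None},  # no iD: grouped by name string only
]
print(len(group_reviewers(reviewers)))  # 3 distinct people, not 1
```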
Because we now capture CRediT roles in a controlled manner (rather than as loose text in the acknowledgments section), we are better able to identify when authors contribute equally and how author order in the byline is determined when that happens. One of our Editors-in-Chief undertook exactly this kind of analysis to study gender bias in cases where authors contributed equally to a work; now that we capture CRediT roles as structured data, we can build on his research.
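“Controlled” here means validated against a fixed vocabulary rather than accepted as free text. The sketch below checks submitted roles against the fourteen published CRediT roles; the role list is the taxonomy itself (dashes simplified), while the function and example data are illustrative.

```python
# The fourteen roles of CRediT (Contributor Roles Taxonomy); dashes simplified here.
CREDIT_ROLES = {
    "Conceptualization", "Data curation", "Formal analysis", "Funding acquisition",
    "Investigation", "Methodology", "Project administration", "Resources",
    "Software", "Supervision", "Validation", "Visualization",
    "Writing - original draft", "Writing - review & editing",
}

def invalid_roles(contributions):
    """Return any submitted role that is not part of the controlled CRediT vocabulary."""
    return sorted(set(contributions) - CREDIT_ROLES)

# Free-text phrasing is rejected; the author (or a production editor)
# has to choose the controlled term instead.
print(invalid_roles(["Conceptualization", "wrote the paper"]))  # ['wrote the paper']
```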
In the last two years, we have also begun capturing Data Availability Statements and Ethics Statements in dedicated metadata fields (rather than as unstructured text in the body of an article or in the acknowledgments section), because some of our editors are curious about compliance with open data policies and whether uptake of open science initiatives is higher in certain microbiology fields.
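As a sketch of how such statements can be lifted out of the article body into discrete fields, the snippet below walks a JATS-style XML file and collects sections whose sec-type marks them as availability or ethics statements. The sec-type values follow a common JATS convention and are assumptions here; ASM’s actual production markup may differ.

```python
from lxml import etree

def extract_statements(jats_path, sec_types=("data-availability", "ethics-statement")):
    """Collect the text of sections tagged as availability or ethics statements.

    Assumes JATS-style markup such as <sec sec-type="data-availability">;
    the exact sec-type values vary between journals and vendors.
    """
    tree = etree.parse(jats_path)
    statements = {}
    for sec_type in sec_types:
        texts = [
            " ".join(" ".join(sec.itertext()).split())  # flatten and normalize whitespace
            for sec in tree.iter("sec")
            if sec.get("sec-type") == sec_type
        ]
        if texts:
            statements[sec_type] = texts
    return statements
```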
RC: These are very interesting and quite profound results, especially for integrity and equality in the publishing process! Good to see how useful you find this information as we’re approaching our schema updates to include contributor roles, among other things. I see that editors are already on board and taking advantage of high-quality metadata. Are authors more engaged with metadata now than before?
Our authors likely are engaged too––though we have tried to build author metadata QC into our proofing and typesetting process in such a way that they wouldn’t even notice.
In the realm of metadata, there are two standard solutions: 1) hire vendors to clean data at the end (the throw-people-at-the-problem philosophy); or 2) trust a black-box technical solution. The problem with the first method is that it is inefficient and can become expensive. The issue with the second is that, in my experience, most technical solutions have an 80% success rate. That may be acceptable for certain types of data, but it can fail spectacularly at the worst possible moment.
For example, let’s say you find a technical solution that parses affiliation data in such a way as to assign a PID. Great, wonderful. Let’s say your parser is the best natural language processor in the world and makes matches 90% of the time (if you have one that does this, I’m all ears). You announce that you are including these IDs. Everyone cheers. It is great, right? Now, imagine you want to use those IDs to identify subscribing institutions to offer discounts or fee-less publishing for authors. You also want to use those IDs to send alerts to institutional admins of publishing activity. In both situations, achieving 90% accuracy simply won’t work. What we’ve learned is that black-box technology and ‘throw people at it’ philosophies cannot work alone. Metadata curation must be a collaborative effort among authors, publishers, funders, and institutions, where the information grows throughout the research process.
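The arithmetic is what makes this unforgiving: at 90% accuracy, one record in ten is wrong, so a publisher handling even a few tens of thousands of author-affiliation pairs a year would mis-route thousands of discounts or alerts. Below is a hedged sketch of the hybrid approach this points to: auto-accept only near-certain matches and queue everything else for a human curator. The threshold and record shape are illustrative assumptions, not a description of ASM’s systems.

```python
AUTO_ACCEPT_SCORE = 0.98  # illustrative threshold: only near-certain matches skip review

def triage_matches(candidate_matches):
    """Split automated affiliation matches into auto-accepted and needs-review piles.

    candidate_matches: list of dicts shaped like
        {"affiliation": str, "ror_id": str or None, "score": float}
    Anything below the threshold, or with no match at all, goes to a curator,
    so downstream uses such as billing or institutional alerts never depend on
    an unverified guess.
    """
    accepted, review_queue = [], []
    for match in candidate_matches:
        if match.get("ror_id") and match["score"] >= AUTO_ACCEPT_SCORE:
            accepted.append(match)
        else:
            review_queue.append(match)
    return accepted, review_queue
```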
Over the next year, we will focus on CRediT identifiers and pass them to Crossref, along with institutional PIDs (ROR, Ringgold, and ISNI). We are also exploring various ways to capture peer reviewer activity and contributions, which will inevitably lead us down new and interesting paths.
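One lightweight way to verify that identifiers actually land in the registered record is to read them back from the public Crossref REST API. The sketch below reports, per author, whether an ORCID iD and affiliation data are present; the endpoint and field names follow the API’s public JSON, the DOI is a placeholder, and newer fields such as contributor roles will only appear as schema support rolls out.

```python
import requests

def contributor_report(doi):
    """Fetch a registered work from the Crossref REST API and report which
    contributor identifiers (ORCID iDs, affiliations) made it into the record."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    work = resp.json()["message"]
    return [
        {
            "name": f"{author.get('given', '')} {author.get('family', '')}".strip(),
            "has_orcid": "ORCID" in author,
            "affiliations": [aff.get("name") for aff in author.get("affiliation", [])],
        }
        for author in work.get("author", [])
    ]

# Placeholder DOI; substitute one of your own registered articles.
print(contributor_report("10.1128/example-doi"))
```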
Anything else you want to share?
Here’s the thing about metadata that I wish I’d known when I started: it’s not a project with a finish line. It’s more like tending a garden that keeps growing in unexpected directions. Every time you think you’ve got it figured out, someone invents a new identifier, or your authors start doing something creative with their affiliations, or a funder changes their requirements, and suddenly you’re back to the drawing board.
But what I’ve also learned from our journey out of that metaphorical (and literal) storage closet: the best metadata work happens when you start thinking of it as infrastructure. Good metadata is like good plumbing; when it’s working, nobody notices it, but when it’s not, everything backs up and gets messy fast.
If you’re just starting this journey, my advice is this: don’t try to boil the ocean (gosh, I still need to remember that one). Pick one thing. Perhaps it could be ORCID IDs or institutional identifiers. Do it really, really well. Then build on that success. And please, for the love of all that is holy, invest in good partnerships. We couldn’t have done any of this without partners who understood that metadata isn’t just data entry; it’s the connective tissue of scholarly communication.
Of course, even with the best partners and aligned teams, there will still be moments when you sit dumbfounded in front of a screen on which an author’s affiliation listed as "Bloomberg School of Public Health" has somehow been matched to the identifier for the "Escuela Nacional de Sanidad." On those days, just remember: at least you’re not still working in a storage closet with a haunted whiteboard.
Good metadata is more than just a technical specification, and it’s not just for those XML wonks and nerds. It’s a service to science, and its core mission is to help us understand the world around us.
– David Haber, ASM
ASM’s story is a reminder that building a strong metadata infrastructure isn’t just about meeting technical requirements—it’s about aligning people, tools, and values around the idea that clean, connected, and consistent metadata is foundational to open and discoverable research. Whether you’re starting small or overhauling major systems, their experience shows what’s possible when you treat metadata not as a checkbox, but as a core part of scholarly publishing.
Thank you, David, for taking the time to share your insights. Again, congratulations!