Thoughts

April 7, 2026

The Future of Science Depends on Open Source Engineers

The processes of modern science, from how research gets funded and conducted to how it gets published, evaluated, and preserved, were never engineered. They emerged over hundreds of years, built by researchers and institutions one workaround at a time. What we call "the scientific system" is really a patchwork of customs, formats, and platforms that were never designed to work together, never stress-tested at scale, and never built to evolve.

And now that system is failing in ways that matter.

What's actually broken

A published paper gets a DOI, a semi-persistent, citable address. But the dataset underneath it? The code that produced the analysis? The protocol that generated the data? The peer reviews that shaped the conclusions? Those get nothing. They live on a grad student's laptop, on an institutional server with no preservation guarantee, or nowhere at all. The paper is the only artifact the system treats as real. Everything that actually produced the result, everything you'd need to verify it, reproduce it, or build on it, is structurally invisible.

Worse, the published paper is treated as a finished product rather than one artifact of an ongoing research process. A team spends three years refining a method, generating data, and iterating on analysis. The world sees none of it until a paper drops at the end. A methodological breakthrough in month four that could save another lab six months of dead ends stays locked in a single group's workflow. Intermediate datasets that could anchor entirely new studies sit on local drives until the "final" version gets published, if it ever does. The infrastructure has no concept of work-in-progress. There is no mechanism to share a partial or null result, get credit for it, and let others build on it. So science's most valuable outputs, the accumulating work of the research process itself (all the way down to the discussion and notes level), are simply lost.

The data and artifacts that do get shared are shockingly fragile. When a single agency loses funding, entire repositories go dark, and the datasets inside them vanish. We're watching this happen in real time. Across U.S. federal agencies right now, research data that took decades and hundreds of millions of dollars to collect is disappearing because the preservation model was centralized, underfunded, and architecturally brittle. One budget decision, one political shift, and years of scientific work ceases to exist. There is no redundancy. There is no fallback.

Then there are the invisible dependencies. Modern research runs on open source software, like the rest of digital society. NumPy, R, and the deeper software infrastructure and domain-specific packages are often maintained by one or two people in their spare time. And, in a cruel twist worth its own discussion, research grants rarely fund software maintenance at all.

Perhaps most fundamentally: there is no infrastructure for coordination. Researchers working on related problems across labs, institutions, and disciplines have no way to find each other. Pipelines, methodologies, and practices that could transform adjacent fields never propagate, because the technical infrastructure behind them is built independently for immediate, one-off uses. FAIR workflows, storage solutions, and other technical systems are created from scratch again and again, solving the same engineering problems over and over, simply because nothing connects the needs and efforts of different research endeavors. Each of these one-off, entity-centric systems is incompatible with the rest and adds to the fog, burying the shared solutions science needs under technical debt and sunk costs.

Infrastructure shapes practice

Infrastructure doesn't fail passively. When the only thing that gets a persistent identifier is a paper, publishing papers becomes the only thing that matters. When there's no infrastructure for sharing incremental results, researchers have no choice but to sit on their work until it's "complete." When there's no technical substrate for collaboration, the same problems get solved independently in lab after lab.

Even researchers who want to work more openly, sharing data, releasing code, publishing work in progress, struggle to do so. The infrastructure exists for finished products, not ongoing processes, and the will to work openly can't overcome that alone.

Our scientific system isn't failing because of a breakdown in scientific thinking or culture. It's failing because of compounding failures of engineering.

Open source already solved this

This should sound familiar to anyone in open source, because open source was built by distributed communities precisely to coordinate living, evolving systems. Version control. CI/CD. Dependency management. Package registries. Distributed architecture. Governance models. These are tools for managing software development, an ongoing, collaborative, cumulative process, and science is, fundamentally, the same kind of process. A hypothesis evolves. Data accumulates. Code gets revised. Results build on each other. Open source built infrastructure for a system you operate and improve. Science built infrastructure for a product.

AI makes this urgent

Now layer on AI. The research apparatus is generating results, datasets, analyses, and code at volumes the current infrastructure never imagined. Every AI-assisted study produces more artifacts, more dependencies, and more downstream connections that need to be tracked, identified, and preserved. The infrastructure that was already failing at human speed is about to face machine speed and machine volume. Without intervention, the next decade of research will produce unprecedented volumes of work that can't be reproduced, verified, or preserved.

What needs to be built

The next generation of scientific infrastructure is an engineering challenge. Here's what it requires:

  • Cryptographically permanent, free, hyper-granular identifiers for every research artifact. Not just papers, but datasets, code, protocols, reviews, claims, figures, and so much more. Every component, line, and symbol must be independently findable, verifiable, reproducible, and citable.
  • Modular, independently reviewable components. Research broken into pieces that can be executed, reviewed, funded, and attributed separately.
  • Interoperable, flexible schemas and tooling that can adapt on the fly. Labs working on the same problem must be able to combine their data without losing meaning, and cross-domain reuse must be possible when the opportunity arises.
  • Integrated provenance tracking. An automatically generated audit trail from funding to published claim, supported by rigorous evidence.
  • Reproducible execution environments. Anyone must be able to re-run an analysis and get the same result.
  • Distributed preservation and compute. No single point of failure, no single funding cut that wipes out a field's data, and compute access for researchers that scales with the compute potential of society.
  • Large-network governance models. Community governance that works across disciplines, institutions, and borders.
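To make the first and fourth requirements concrete, here is a minimal sketch of what content-addressed identifiers and provenance records could look like, assuming artifacts are treated as opaque byte blobs. The function names and record fields are illustrative, not a proposed standard.

```python
# Sketch: content-addressed artifact IDs plus chained provenance records.
# All names here are hypothetical; this illustrates the idea, not a spec.
import hashlib
import json


def artifact_id(data: bytes) -> str:
    """Derive a permanent identifier from the artifact's own content.

    Because the ID is a hash of the bytes themselves, it is free to mint,
    verifiable by anyone, and identical no matter which mirror serves it.
    """
    return "sha256:" + hashlib.sha256(data).hexdigest()


def provenance_record(output: bytes, inputs: list[str], step: str) -> dict:
    """Link an output artifact to the identified inputs that produced it.

    Chaining these records yields an audit trail from raw data to a
    published claim, with every hop independently checkable.
    """
    return {
        "artifact": artifact_id(output),
        "derived_from": sorted(inputs),  # IDs of datasets, code, protocols
        "step": step,                    # human-readable description
    }


# Example: a dataset and the script that analyzed it, each identified,
# then a result artifact that records exactly what produced it.
dataset = b"sample,reading\nA,0.91\nB,0.87\n"
script = b"print('mean reading')"
result = b"mean reading: 0.89"

record = provenance_record(
    result,
    inputs=[artifact_id(dataset), artifact_id(script)],
    step="compute mean reading",
)
print(json.dumps(record, indent=2))
```

The point of the sketch is the property, not the code: because identifiers derive from content rather than from a central registrar, preservation can be distributed across mirrors with no single point of failure, and anyone holding a copy can verify it bit for bit.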

The opportunity

Governments and funding agencies worldwide are mandating open access, FAIR data, and reproducibility. Billions in research funding now come with these requirements attached. But the infrastructure to actually comply doesn't exist. The engineering brilliance to build it does… in the open source community.

This is a once-in-a-generation infrastructure buildout. It needs software architects, maintainers, DevOps practitioners, OSPO leads, community builders, and governance designers: everyone who knows how to build and sustain systems that work as ongoing processes.

The field is growing. The funding is there. The mandates are real. And the open source community is uniquely positioned to help build the infrastructure that will power science for the next century.

If you've ever thought science wasn't your domain of expertise, think again. It might be the single domain that needs you the most.

Get involved.