Thoughts

April 7, 2026

The Future of Science Depends on Open Source Engineers

The processes of modern science, from how research gets funded and conducted to how it gets published, evaluated, and preserved, were never engineered. They emerged over hundreds of years, built by researchers and institutions one workaround at a time. What we call "the scientific system" is really a patchwork of customs, formats, and platforms that were never designed to work together, never stress-tested at scale, and never built to evolve.

And now that system is failing in ways that matter.

What's actually broken

A published paper gets a DOI, a semi-persistent, citable address. But the dataset underneath it? The code that produced the analysis? The protocol that generated the data? The peer reviews that shaped the conclusions? Those get nothing. They live on a grad student's laptop, on an institutional server with no preservation guarantee, or nowhere at all. The paper is the only artifact the system treats as real. Everything that actually produced the result, everything you'd need to verify it, reproduce it, or build on it, is structurally invisible.

Worse, the published paper is treated as a finished product rather than one artifact of an ongoing research process. A team spends three years refining a method, generating data, and iterating on analysis. The world sees none of it until a paper drops at the end. A methodological breakthrough in month four that could save another lab six months of dead ends stays locked in a single group's workflow. Intermediate datasets that could anchor entirely new studies sit on local drives until the "final" version gets published, if it ever does. The infrastructure has no concept of work-in-progress. There is no mechanism to share a partial or null result, get credit for it, and let others build on it. So science's most valuable outputs, the accumulating work of the research process itself (all the way down to the discussion and notes level), are simply lost.

The data and artifacts that do get shared are shockingly fragile. When a single agency loses funding, entire repositories go dark, and the datasets inside them vanish. We're watching this happen in real time. Across U.S. federal agencies right now, research data that took decades and hundreds of millions of dollars to collect is disappearing because the preservation model was centralized, underfunded, and architecturally brittle. One budget decision, one political shift, and years of scientific work ceases to exist. There is no redundancy. There is no fallback.

Then there are the invisible dependencies. Modern research runs on open source software, like the rest of digital society. NumPy, R, and the deeper software infrastructure and domain-specific packages are often maintained by one or two people in their spare time. And, in a cruel twist worth its own discussion, research grants rarely fund software maintenance at all.

Perhaps most fundamentally: there is no infrastructure for coordination. Researchers working on related problems across labs, institutions, and disciplines have no way to find each other. Pipelines, methodologies, and practices that could transform adjacent fields never propagate, because the technical infrastructure behind them is built independently for immediate, one-off uses. FAIR workflows, storage solutions, and other technical systems are created from scratch again and again, solving the same engineering problems over and over, simply because nothing connects the needs and efforts of different research endeavors. Each of these one-off, entity-centric systems is incompatible with the rest and adds to the fog, burying the shared solutions science needs under technical debt and sunk costs.

Infrastructure shapes practice

Infrastructure doesn't fail passively. When the only thing that gets a persistent identifier is a paper, publishing papers becomes the only thing that matters. When there's no infrastructure for sharing incremental results, researchers have no choice but to sit on their work until it's "complete." When there's no technical substrate for collaboration, the same problems get solved independently in lab after lab.

Even researchers who want to work more openly, sharing data, releasing code, publishing work in progress, struggle to do so. The infrastructure exists for finished products, not ongoing processes, and the will to work openly can't overcome that alone.

Our scientific system isn't failing because of a breakdown in scientific thinking or culture. It's failing because of compounding failures of engineering.

Open source already solved this

This should sound familiar to anyone in open source, because open source was built by distributed communities precisely to coordinate living, evolving systems. Version control. CI/CD. Dependency management. Package registries. Distributed architecture. Governance models. These are tools for managing software development, an ongoing, collaborative, cumulative process, and science is, fundamentally, the same kind of process. A hypothesis evolves. Data accumulates. Code gets revised. Results build on each other. Open source built infrastructure for a system you operate and improve. Science built infrastructure for a product.

AI makes this urgent

Now layer on AI. The research apparatus is generating results, datasets, analyses, and code at volumes the current infrastructure never imagined. Every AI-assisted study produces more artifacts, more dependencies, and more downstream connections that need to be tracked, identified, and preserved. The infrastructure that was already failing at human speed is about to face machine speed and machine volume. Without intervention, the next decade of research will produce unprecedented volumes of work that can't be reproduced, verified, or preserved.

What needs to be built

The next generation of scientific infrastructure is an engineering challenge. Here's what it requires:

  • Cryptographically permanent, free, hyper-granular identifiers for every research artifact. Not just papers, but datasets, code, protocols, reviews, claims, figures, and so much more. Every component, line, and symbol must be independently findable, verifiable, reproducible, and citable.
  • Modular, independently reviewable components. Research broken into pieces that can be executed, reviewed, funded, and attributed separately.
  • Interoperable, flexible schemas and tooling that can adapt on the fly. Labs working on the same problem must be able to combine their data without losing meaning, and cross-domain reuse must be possible when the opportunity arises.
  • Integrated provenance tracking. An automatically generated audit trail from funding to published claim, supported by rigorous evidence.
  • Reproducible execution environments. Anyone must be able to re-run an analysis and get the same result.
  • Distributed preservation and compute. No single point of failure, no single funding cut that wipes out a field's data, and compute access for researchers that scales with the compute potential of society.
  • Large-network governance models. Community governance that works across disciplines, institutions, and borders.
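To make the first and fourth requirements concrete, here is a minimal sketch of what content-addressed identifiers and provenance records could look like, assuming artifacts are treated as opaque byte blobs. The function names and record fields are illustrative, not a proposed standard.

```python
# Sketch: content-addressed artifact IDs plus chained provenance records.
# All names here are hypothetical; this illustrates the idea, not a spec.
import hashlib
import json


def artifact_id(data: bytes) -> str:
    """Derive a permanent identifier from the artifact's own content.

    Because the ID is a hash of the bytes themselves, it is free to mint,
    verifiable by anyone, and identical no matter which mirror serves it.
    """
    return "sha256:" + hashlib.sha256(data).hexdigest()


def provenance_record(output: bytes, inputs: list[str], step: str) -> dict:
    """Link an output artifact to the identified inputs that produced it.

    Chaining these records yields an audit trail from raw data to a
    published claim, with every hop independently checkable.
    """
    return {
        "artifact": artifact_id(output),
        "derived_from": sorted(inputs),  # IDs of datasets, code, protocols
        "step": step,                    # human-readable description
    }


# Example: a dataset and the script that analyzed it, each identified,
# then a result artifact that records exactly what produced it.
dataset = b"sample,reading\nA,0.91\nB,0.87\n"
script = b"print('mean reading')"
result = b"mean reading: 0.89"

record = provenance_record(
    result,
    inputs=[artifact_id(dataset), artifact_id(script)],
    step="compute mean reading",
)
print(json.dumps(record, indent=2))
```

The point of the sketch is the property, not the code: because identifiers derive from content rather than from a central registrar, preservation can be distributed across mirrors with no single point of failure, and anyone holding a copy can verify it bit for bit.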

The opportunity

Governments and funding agencies worldwide are mandating open access, FAIR data, and reproducibility. Billions in research funding now come with these requirements attached. But the infrastructure to actually comply doesn't exist. The engineering brilliance to build it does… in the open source community.

This is a once-in-a-generation infrastructure buildout. It needs software architects, maintainers, DevOps practitioners, OSPO leads, community builders, and governance designers: everyone who knows how to build and sustain systems that work as ongoing processes.

The field is growing. The funding is there. The mandates are real. And the open source community is uniquely positioned to help build the infrastructure that will power science for the next century.

If you've ever thought science wasn't your domain of expertise, think again. It might be the single domain that needs you the most.

Get involved.