
Why biopharma data is different - and what it means for data management in life sciences
For the last 4 years I've been working on life sciences data management. This letter shares what we've learned about bounded vs unbounded domains and why Office tools persist in scientific workflows.
Hi - my name is Angelina, I’m CEO at sci2sci, and I’ve spent the last 4 years working on life sciences data management. Here’s what we’ve learned.
Bounded vs unbounded domains
In IT or finance, data lives in what I call “bounded domains”:
- All possible states are known upfront
- $100 is $100, regardless of server room temperature
- Edge cases are just technical failures
Life sciences operate in unbounded domains, where reality keeps teaching us new ways in which context matters.
Here’s what I mean: two proteomics experiments. Same protocol, same day. Different results. Six months later you’re debugging why. Your LIMS shows both were stored at “-80°C”. But then you find it - buried in the protocol, a freetext note: “Use Freezer B for longer storage, as it’s less frequently opened and temperature fluctuates less.” One researcher read this note. The other didn’t.
That freetext instruction? That IS important. But it doesn’t fit in any dropdown menu or structured field.
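To make the problem concrete, here’s a minimal sketch (the field names and records are hypothetical, not from any real LIMS) of how two records can be indistinguishable in every structured field while the decisive context lives only in freetext:

```python
# Hypothetical sketch: two LIMS records that look identical in every
# structured field, even though the underlying experiments diverged.
record_a = {
    "sample_id": "PX-001",
    "protocol": "proteomics_v2",
    "storage_temp_c": -80,   # what the dropdown captured
}
record_b = {
    "sample_id": "PX-002",
    "protocol": "proteomics_v2",
    "storage_temp_c": -80,   # same value, different freezer in reality
}

# The context that explained the divergent results lived only in a
# freetext protocol note, with no structured field to hold it:
protocol_note = ("Use Freezer B for longer storage, as it's less "
                 "frequently opened and temperature fluctuates less.")

def structured(record):
    """Everything a structured query can see, minus the identifier."""
    return {k: v for k, v in record.items() if k != "sample_id"}

# Any query over the structured fields sees the experiments as identical.
print(structured(record_a) == structured(record_b))  # True
```

Six months later, the only way to debug the discrepancy is to find that note - which is exactly the kind of information dropdown menus discard.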
Why those Office tools won’t die
This explains a pattern every data manager knows: you implement a million-dollar LIMS, but researchers still use Excel. You build a data warehouse, but critical context lives in PowerPoint slides.
It’s not only that scientists resist change. These tools persist because they handle unbounded reality well:
- Word docs preserve those critical freetext notes
- Excel sheets capture relationships that don’t fit in predefined schemas
- PowerPoint presentations explain the “why” behind experiments
The history keeps teaching us
Look at how our field evolved:
- Sanger sequencing → NGS → long-read sequencing
- “One gene, one protein” → alternative splicing → who knows what’s next
Each transition revealed that our previous “gold standard” was incomplete. When we standardized genetic data pre-CRISPR, we didn’t have fields for guide RNAs. We couldn’t - we hadn’t discovered them yet.
Here’s what makes this especially challenging: our product development cycles span 10+ years. Your Phase 1 trial uses 2015’s best practices. By Phase 3 in 2025, those methods look primitive. But regulators need to see the full story - from those “primitive” early experiments through modern standards.
A single drug’s journey witnesses multiple technological revolutions. The data management system that started your project will be obsolete before you file your NDA. Remember “junk DNA”? Turns out it wasn’t junk. But labs that threw away that “noise” can’t go back and reanalyze it now.
We can’t predict what 2030’s methods will reveal about today’s data. But we CAN predict they’ll reveal something. They always do.
What this means for data management
If life sciences is fundamentally unbounded, we need additional approaches beyond traditional data management:
Recognize that unbounded data needs unbounded thinking. Standard approaches work for the structured parts, but we need more for the contextual complexity.
Build for change, not just for scale. Your schema will be obsolete before it’s fully implemented. Plan for it.
Preserve context like a packrat. That “irrelevant” detail in today’s experiment might be the key insight in tomorrow’s Nature paper.
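One way to act on “build for change” and “preserve context” - offered as a sketch under assumptions, not a prescription - is to keep a small structured core for the fields today’s standards agree on, plus an open-ended context map and verbatim freetext notes. Fields nobody has anticipated yet (guide RNAs, in a pre-CRISPR world) can then be attached later without a schema migration. The class and field names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class SampleRecord:
    # Structured core: the fields current standards agree on.
    sample_id: str
    protocol: str
    storage_temp_c: float
    # Open-ended context: anything today's schema has no column for.
    context: dict = field(default_factory=dict)
    # Verbatim freetext, preserved packrat-style for future reanalysis.
    notes: list = field(default_factory=list)

rec = SampleRecord(sample_id="PX-001",
                   protocol="proteomics_v2",
                   storage_temp_c=-80.0)

# The "irrelevant" detail travels with the record instead of being lost.
rec.notes.append("Use Freezer B for longer storage, as it's less "
                 "frequently opened and temperature fluctuates less.")

# Years later, a field nobody anticipated can be added without migration.
rec.context["guide_rna"] = "<sequence>"
```

The trade-off is real: a catch-all map is harder to query than a fixed schema. The point is that the unstructured layer exists at all, so tomorrow’s standards can be retrofitted onto today’s data.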
Those Office tools are telling us something. They survive because they handle aspects of discovery that structured systems miss - the messy, contextual, constantly evolving parts.
The lesson I learned
The problem of managing data is more complex in life sciences. In addition to all the challenges IT faces - volume, velocity, variety - we have to manage unbounded context and constant evolution.
The goal isn’t to abandon structure or standards. It’s to build systems that handle both the structured AND the unbounded - systems that expect the unexpected, preserve the “irrelevant,” and evolve as fast as our understanding does.
Next time: specific strategies for managing “unbounded” data. For now, I’m curious - what “noise” in your old data turned out to be a signal?