
Beyond chats - LLMs as data management assistants
Can AI help in making data "AI-ready"? Yes, but there are nuances to it.
Agentic AI shows up in every other post on social media, yet practical questions often go unanswered. How do we deal with AI that confidently generates nonsense or fakes citations? What if our data isn’t as clean as it should be - and how can we improve it? These critical issues are lost in the current marketing hype.
That’s why I want to focus on real-world use cases - how AI can assist with and automate routine data management tasks. These are problems we can tackle today. Automating them not only improves data quality but also unlocks more advanced capabilities. I’ll also share best practices for guardrailing LLM implementations to prevent failures and ensure they are ready for regulatory scrutiny.
General Principles
I’d recommend approaching AI automation by focusing on small but high-impact tasks where LLMs offer a clear advantage over humans. People aren’t great at reading 50-page documents and instantly classifying them, or extracting a list of 100 terms by a specific criterion - LLMs are. With proper oversight and expert validation, this capability can significantly reduce the time needed for:
- Curating important data and metadata
- Checking documents for retention policies, compliance and PII at scale
- Registering and organizing internal and collaboration data
- Performing data harmonization tasks
Automatic Curation and Metadata Extraction
We interviewed dozens of biopharma organizations on their approach to organizing data. Most of them start with a folder-based approach. It works initially, but as they scale, finding and aggregating data becomes hard. This happens because folder structures encode only a few aspects of metadata, such as which project a study belonged to and the study's date and type. What they can’t capture are the granular metadata parameters useful for historical exploration or data aggregation (e.g. which compounds were tested, how and under what conditions).
To address this, companies often set up separate metadata repositories. But populating them usually involves manual entry - copying information from protocols and lab notes into forms. It’s tedious, error-prone and difficult to maintain through version changes.
LLMs can drastically reduce this burden. They can read associated documents and pre-fill hundreds of metadata fields in minutes, turning hours of manual labor into a quick validation step. Because they are fast, consistent and tireless, keeping metadata aligned with the latest version of the standard becomes much easier.
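The pre-fill-then-validate flow can be sketched roughly as below. This is a minimal illustration, not a production pipeline: the field names are invented for the example, and `call_llm` is a hypothetical stand-in for whichever model client you use, stubbed here with a canned response so the snippet runs on its own.

```python
import json

# Example metadata schema; in practice this comes from your metadata standard.
METADATA_FIELDS = ["study_type", "compound", "species", "dose_unit"]

def build_prompt(document_text: str) -> str:
    fields = ", ".join(METADATA_FIELDS)
    return (
        "Extract the following metadata fields from the study document "
        f"as a JSON object with exactly these keys: {fields}. "
        "Use null when a field is not stated.\n\n" + document_text
    )

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call your model provider here.
    return '{"study_type": "tox", "compound": "ABC-123", "species": "rat", "dose_unit": null}'

def prefill_metadata(document_text: str) -> dict:
    raw = call_llm(build_prompt(document_text))
    data = json.loads(raw)
    # Guardrail: reject responses that don't match the expected schema,
    # so reviewers only ever see well-formed drafts.
    if set(data) != set(METADATA_FIELDS):
        raise ValueError(f"unexpected keys: {sorted(data)}")
    return data
```

The key design choice is that the model's output is treated as a draft: it must parse as JSON and match the schema exactly, and a human still validates the values before they land in the repository.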
Compliance, Policies and PII at Scale
In biopharma, we handle large volumes of documents that require careful review - whether it’s for PII, retention policies or adverse event reporting. Manual processing is slow and expensive, while rule-based systems are brittle and often miss edge cases.
LLMs offer a scalable, cost-effective alternative. They can flag the presence of sensitive data, identify policy references or annotate retention requirements automatically. This enables fast, consistent document checks across massive corpora.
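One way to keep such checks cheap at corpus scale is to layer them: a deterministic regex pre-screen catches obvious PII first, and only the remaining documents need the more expensive LLM pass. The sketch below shows the pre-screen half; the patterns are illustrative, not an exhaustive PII taxonomy.

```python
import re

# Illustrative PII patterns only; real deployments need a much broader set
# (names, addresses, national IDs, medical record numbers, etc.).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def prescreen_pii(text: str) -> dict:
    """Return a mapping of PII category -> matched snippets (empty if clean)."""
    return {
        name: pattern.findall(text)
        for name, pattern in PII_PATTERNS.items()
        if pattern.search(text)
    }
```

Documents with regex hits can be routed straight to review, while the ambiguous cases (e.g. free-text mentions of a person's condition) go to the LLM for semantic flagging.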
Registering and Organizing Internal and Collaboration Data
External data brings unique challenges. While internal assets may follow consistent formats and metadata standards, collaborations rarely do. We’ve seen cases where key metadata was emailed separately or omitted entirely, leading to the loss of critical context. Information gets scattered across systems and informal channels.
LLM agents can help validate incoming data packages against internal standards, flag missing or inconsistent elements, and even reconstruct metadata from available documentation. This reduces manual work, improves data reliability and offers a safety net for registering both historical and new datasets.
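A validation step like this can be very simple at its core, as in the sketch below. The required fields and allowed formats here are hypothetical placeholders; a real check would pull them from your internal standard, and an LLM agent would sit in front of this, extracting candidate metadata from whatever documentation arrived with the package.

```python
# Hypothetical internal standard for incoming collaboration data packages.
REQUIRED_FIELDS = {"study_id", "partner", "transfer_date", "data_format"}
ALLOWED_FORMATS = {"csv", "parquet", "sas7bdat"}

def validate_package(metadata: dict) -> list[str]:
    """Return human-readable issues; an empty list means the package passes."""
    issues = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - metadata.keys())]
    fmt = metadata.get("data_format")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        issues.append(f"unknown data_format: {fmt}")
    return issues
```

Returning a list of named issues (rather than a pass/fail boolean) matters: it gives the reviewer, or a downstream agent, something concrete to chase down with the collaborator.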
Accelerating Data Harmonization
LLMs excel at understanding semantics when given the right context. This makes them highly effective for suggesting harmonization strategies. For example, they can easily detect that a dataset column labeled “Sex” with values 0/1 should map to an internal standard like “Participant Gender” with “Male”/“Female.”
Used this way, LLMs can drive automated suggestions for harmonization pipelines, making integration across datasets faster and more accurate.
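Applying such a suggestion might look like the sketch below. Note that the mapping itself (which raw code corresponds to which standard value) is exactly the part a human must confirm before use; the 0 → "Female", 1 → "Male" assignment here is an illustrative assumption, not a universal convention.

```python
# A harmonization suggestion as an LLM might propose it, pending expert review.
SUGGESTED_MAPPING = {
    "column": ("Sex", "Participant Gender"),   # (source column, standard column)
    "values": {0: "Female", 1: "Male"},        # assumed coding; must be verified
}

def harmonize_column(rows: list[dict]) -> list[dict]:
    src, dst = SUGGESTED_MAPPING["column"]
    value_map = SUGGESTED_MAPPING["values"]
    out = []
    for row in rows:
        row = dict(row)  # copy so the caller's data isn't mutated
        if src in row:
            row[dst] = value_map[row.pop(src)]  # KeyError surfaces unmapped codes
        out.append(row)
    return out
```

Letting an unmapped code raise an error, instead of passing it through silently, keeps surprises visible during review rather than buried in the integrated dataset.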
Implementation Challenges
Despite these opportunities, biopharma implementations must meet several key requirements to be viable:
- Contextual grounding: LLMs must have access to relevant source documents (protocols, lab notes, QC reports).
- Human-in-the-loop validation: All outputs should be treated as drafts and reviewed by experts.
- Transparent reasoning: LLMs must clearly document their evidence and assumptions.
- Hallucination detection: Suspect outputs should be flagged for deeper review.
- Reproducibility: Identical inputs should produce consistent results.
- Full audit trails: Each run must log input sources, reasoning steps and outputs for traceability.
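The last two guardrails reinforce each other: hashing the exact inputs of each run makes it trivial to verify later that two runs saw identical prompts and source documents, which is the precondition for any reproducibility check. A minimal sketch of such an audit record (field names are my own, not a standard):

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(prompt: str, sources: list[str], output: str, model: str) -> dict:
    """Build one audit-trail entry for a single LLM run."""
    # Canonical JSON (sorted keys) so the same inputs always hash identically.
    payload = json.dumps({"prompt": prompt, "sources": sources}, sort_keys=True)
    return {
        "input_hash": hashlib.sha256(payload.encode()).hexdigest(),
        "model": model,
        "output": output,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

With records like these persisted per run, "identical inputs, different outputs" becomes a queryable condition rather than an anecdote, and each output can be traced back to the documents it was grounded in.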
These guardrails ensure AI outputs are reliable and auditable. In regulated industries, transparency and predictability matter more than creativity or novelty. While tech trends often prioritize “magical” user experiences, that tradeoff doesn’t work for biopharma. Here, we must prove - clearly and repeatedly - that a promising new idea actually delivers results.
For AI to become part of daily operations, it must follow the same principle. Otherwise, it will remain confined to discovery-stage proofs of concept.
Takeaway
Advanced capabilities require a stable foundation. Poor data quality and a lack of AI readiness are among the most frequently cited reasons AI initiatives fail. That’s especially true for biopharma, where the stakes and transparency requirements are higher than in other sectors.
That’s why I wanted to focus on small data management use cases over advanced analytics, drug discovery or autonomous agents. As it turns out, the most important practice during the AI revolution is “thinking through the task step-by-step”. Making data trustworthy and complete is Step 1.