Data and knowledge management in the LLM era

By Angelina, CEO at sci2sci · 6 min read
AI · Knowledge management · Biopharma · Scientific Data

LLMs and research data integration: what we learned working with life-sciences data.

Last time, we established that life sciences data is unbounded. We have our LIMS, but the critical "why" behind an experiment often lives in a PowerPoint slide, a spreadsheet or a free-text note.

While there are many resources on managing data warehouses and LIMS/ELN systems, guidance on how to connect them to each other - and to other sources of context - is scarce. You won’t see sophisticated knowledge graphs or data fabrics in this letter. They are great, but they need a foundation - and that foundation, easy as it is to implement, is surprisingly rare to see established.

I’ll explain what AI has solved - and what it hasn’t. LLMs have changed how we work with unstructured data. But if the data is missing or unfindable, they won’t offer much help.

Contextual Identifiers and Connectivity

Identifiers are important - and tricky to get right. Most companies have identifiers by entity type: experiment ID, order ID, compound ID. That’s good, but it’s critical to establish IDs for processes too.

Example:

Consider this workflow: prepare samples → send to a CRO → analyze results.

Company A has IDs for the preparation protocol, CRO order and analysis results.

Company B has all the same IDs, but also consistently mentions “Project-XXX” in every system.

When the coordinator leaves:

  • Company A spends weeks connecting the dots. Even with AI, it takes hours or days.
  • Company B connects everything in hours without AI. With AI integrated - minutes or seconds.

Why the difference?

Company B preserved the information that existed only in the coordinator’s head. Company A has to restore it through painful archaeology.

Same data. Same process. The only difference: Company B had a rule of mentioning internal project IDs everywhere - from documents and spreadsheets to databases.
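
To make this concrete, here is a minimal sketch of Company B’s situation: once a project ID appears consistently everywhere, reconnecting the dots is a plain text search. The directory layout, the `PROJ-0042` ID and the assumption that documents were exported to plain text are all hypothetical, not a prescribed setup.

```python
import re
from pathlib import Path

PROJECT_ID = "PROJ-0042"        # hypothetical internal project ID
EXPORT_ROOT = Path("exports")   # hypothetical dump of slides, sheets and notes as text

def find_project_mentions(root: Path, project_id: str) -> list[tuple[Path, int, str]]:
    """Return (file, line number, line) for every mention of the project ID."""
    pattern = re.compile(re.escape(project_id), re.IGNORECASE)
    hits = []
    for path in root.rglob("*.txt"):  # assumes documents were exported to plain text
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern.search(line):
                hits.append((path, lineno, line.strip()))
    return hits

for path, lineno, line in find_project_mentions(EXPORT_ROOT, PROJECT_ID):
    print(f"{path}:{lineno}: {line}")
```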

Templates and Versioned Metadata Capture

Templates exist to make decisions like “include the project ID” automatic rather than an additional cognitive burden.

Implementation strategies:

  • PowerPoint: Add a template slide with relevant metadata as the last slide.
  • Excel: Include an additional sheet with required metadata.
  • Documents: Add metadata in headers.
  • ELN: Make it part of protocols or additional fields.
  • Databases: Establish metadata best practices in documentation.
  • Naming conventions: Create standard formats for files, samples and experiments (e.g. ProjectXXX_YYYYMMDD_ExperimentType_Version; see the sketch after this list).
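
A naming convention is easiest to keep alive when a small script can both generate and validate names. Here is a minimal sketch; the exact pattern (including the `v1` version suffix) and the helper names are assumptions, not a prescribed standard:

```python
import re
from datetime import date

# Assumed convention: ProjectXXX_YYYYMMDD_ExperimentType_Version
NAME_PATTERN = re.compile(
    r"^(?P<project>Project\d{3})_"
    r"(?P<date>\d{8})_"
    r"(?P<experiment>[A-Za-z]+)_"
    r"(?P<version>v\d+)$"
)

def make_name(project: str, experiment: str, version: int, day: date | None = None) -> str:
    """Build a file/sample name that follows the convention."""
    day = day or date.today()
    return f"{project}_{day:%Y%m%d}_{experiment}_v{version}"

def parse_name(name: str) -> dict[str, str]:
    """Validate a name and return its metadata fields, or raise ValueError."""
    match = NAME_PATTERN.match(name)
    if not match:
        raise ValueError(f"Name does not follow the convention: {name!r}")
    return match.groupdict()

print(make_name("Project042", "Western", 1))          # e.g. Project042_20250101_Western_v1
print(parse_name("Project042_20250101_Western_v1"))   # fields as a dict
```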

Templates are crucial for change management. They bridge the gap between “good idea, forgotten in a week” and “established practice” - but only if they’re extremely easy to apply.

Key requirements for effective templates:

  • Non-limiting: Templates must be convenient to fill in, or they will be filled in poorly.
  • Include free-text space: With LLMs, we can finally process this easily, making it valuable for capturing deviations and non-standard situations.
  • Versioned: Easy to evolve over time - since it’s hard to predict what data you’ll need in the future.
  • Fill once: Avoid duplicate data entry. Use existing information wherever possible.
  • Consistent naming: Naming conventions should be simple enough to remember and apply without needing to look them up.
  • Start simple: Make everything traceable to the project, owner email and dates. A clear structure is a great starting point for automation, and LLMs make free text an accessible, first-class data source. Then add the payload flexibly - branch the templates from a single core, as sketched below.
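
To illustrate, here is a minimal sketch of a versioned template core: a small required set of fields plus free text, with per-experiment payload branched from it. The field names, the version scheme and the `SequencingMetadata` branch are assumptions for illustration only.

```python
from dataclasses import dataclass

TEMPLATE_VERSION = "1.0"  # bump when the template evolves; old records keep their version

@dataclass
class MetadataCore:
    """Minimal core every record carries: traceable to project, owner and dates."""
    project_id: str            # e.g. "Project042", mentioned in every system
    owner_email: str
    created: str               # ISO date, e.g. "2025-01-01"
    template_version: str = TEMPLATE_VERSION
    notes: str = ""            # free text: deviations, non-standard situations

@dataclass
class SequencingMetadata(MetadataCore):
    """Hypothetical branch of the core template for one experiment type."""
    instrument: str = ""
    run_id: str = ""

record = SequencingMetadata(
    project_id="Project042",
    owner_email="someone@example.com",
    created="2025-01-01",
    notes="Sample 3 was re-prepared after a pipetting error.",
    instrument="NovaSeq",
    run_id="RUN-17",
)
print(record)
```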

Visibility and Easy Interface for Cloud Data

Researchers won’t learn yet another system. If you have automated data pipelines doing BLAST over sequencing data, provide an easy search and chat interface on top.

This dramatically reduces “Where is the data?” questions. Engineering teams should do engineering, not act as chatbots for data requests. Turning natural-language questions into query results is a solved problem with LLMs - let chatbots handle chatbot tasks.

Implementation approaches:

  • Make it accessible via MCP to the company LLM
  • Integrate with company chat systems
  • Establish interfaces that can be extended over time
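
As a minimal sketch of the pattern: translate the question into a structured query with the LLM, then hand it to the existing search backend. Both `llm_complete` and `search_index` are hypothetical placeholders here, stubbed with canned answers so the flow is runnable; in practice they would wrap the company LLM and the real index.

```python
import json

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to the company LLM; stubbed with a canned answer."""
    return json.dumps({"entity": "sequence", "project_id": "Project042", "text": "BLAST hits"})

def search_index(query: dict) -> list[dict]:
    """Placeholder for the existing search backend over pipeline outputs."""
    return [{"file": "Project042/blast/results.tsv", "score": 0.97}]

def answer(question: str) -> list[dict]:
    """Translate a natural-language question into a structured query, then search."""
    prompt = (
        "Convert this question into a JSON search query with keys "
        f"'entity', 'project_id', 'text':\n{question}"
    )
    query = json.loads(llm_complete(prompt))
    return search_index(query)

print(answer("Where are the BLAST results for Project042?"))
```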

The Takeaway

LLMs have provided great capabilities - structuring free-text documents, enabling conversational interfaces and simplifying interaction - but they’ve also highlighted the cracks in our general information management.

Key principles that remain constant:

  • If data connectivity is lost, it will always be hard to recover
  • If critical metadata wasn’t captured, nothing will restore it
  • If interfaces aren’t easy, people won’t use them

Which gives us three strategic directions:

  1. Preserve cross-system data connectivity
  2. Develop processes to evolve metadata standards
  3. Implement lightweight interfaces that reduce manual work