What a scientist could learn from an engineer

By Angelina, CEO at sci2sci · 6 min read
Data Management · Life Sciences · Software Practices

Can we make data collaboration-ready by applying lessons from software development?

Hi there, I hope you’re having a great summer! I’ve just come back from vacation with a great idea: to share what I’ve learnt from software engineers about making data collaboration-ready.

I come from a neuroscience background and have classical training in the “wet” lab. However, while building sci2sci, I’ve had the chance to work with brilliant colleagues with a software engineering background.

There’s something quietly admirable about them. Sure, they break production systems at 2 a.m. and spend three hours optimizing code that saves two milliseconds, but they do one thing consistently well: they write their code as if someone else will have to use it, read it or fix it later. Sometimes that someone else is future-them - after a weekend, a vacation or a job change. Often it’s a teammate. Either way, the code is expected to survive a handoff.

Scientists, on the other hand, tend to assume their data will never leave the lab. That it will never need to be interpreted by someone else. Or worse: that they’ll remember everything about it when they come back to it in six months.

Spoiler: they (we) won’t.

Engineers are taught - early and often - to design with reusability and maintainability in mind. For me, watching how they work has been a crash course in a completely different mindset - one that I keep finding surprisingly helpful for scientists, especially when it comes to making data easier to share, re-use and build on.

This letter shares some of the lessons about writing code that I’ve learned from them and translated into data management principles for the life sciences.

✅ 1. Write for someone else to make it clear to yourself

Software engineers don’t start out being good at writing readable code. They get good at it through years of code reviews - showing their work to colleagues who haven’t seen it before and getting immediate feedback on what’s confusing, unclear or impossible to follow.

Scientists rarely have this feedback loop. We write our analysis scripts and organize our data in isolation, then wonder why nobody else can make sense of it six months later.

The fix isn’t just better documentation - it’s building in a review step. Hand your dataset to someone unfamiliar with it and watch where they get stuck. What assumptions did you make that aren’t obvious? What seems clear to you but cryptic to them?

If they can look at your folder, read your README and understand what “data_final_v2b.csv” contains without you having to explain it - congratulations, you’ve passed the test. Now you’ll have a good chance of figuring it out too when you return to it in six months. If they’re still guessing what the “v2b” means or which version is actually the one to use, you’ve got more work to do.
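To make that concrete, here is roughly the kind of README that passes the test. It’s a sketch, not a template handed down from anywhere: the file names, the meaning of “v2b” and the script name clean_data.py are all made up for illustration.

```text
# README - dataset B (hypothetical example)

What this is:    cleaned measurements, one row per sample; outliers (z > 3) removed
Files:
  data_final_v2b.csv  - current version to use ("v2b" = second cleaning pass)
  data_final_v1.csv   - kept for reference only, do not analyze
Units:           temperature in Celsius (not Kelvin)
Missing values:  empty cell = not measured (not zero)
Processing:      see CHANGELOG.txt and clean_data.py in this folder
Contact:         your.name@lab.example
```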

✅ 2. Keep a simple changelog

Engineers use version control tools like Git to keep track of what changed, when and why. Scientists often have folders full of half-dated file versions and no real idea what makes them different.

If you’re not using Git (and no judgment if you aren’t), even a lightweight changelog can help:

  • Aug 3 – Removed outliers from dataset B based on z > 3
  • July 30 – Normalized columns 3-6 to unit variance

It’s not fancy. But it works.
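If you want to make it even lower-friction, a few lines of Python can timestamp the entries for you. This is just a sketch, not a real tool - the helper and the file name CHANGELOG.txt are my own assumptions:

```python
from datetime import date
from pathlib import Path

def log_change(message: str, changelog: str = "CHANGELOG.txt") -> None:
    """Append a dated, one-line entry to a plain-text changelog kept next to the data."""
    entry = f"{date.today().isoformat()} - {message}\n"
    with Path(changelog).open("a", encoding="utf-8") as f:
        f.write(entry)

log_change("Removed outliers from dataset B based on z > 3")
log_change("Normalized columns 3-6 to unit variance")
```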

✅ 3. Build interfaces, not just outputs

Engineers use APIs - clearly defined interfaces that tell other people how to interact with their systems. Scientists could benefit from something similar: define what each dataset is, what it isn’t, and how it should (or shouldn’t) be used.

For example: is this raw data, cleaned data or normalized data? Are missing values zeros, nulls or “not measured”? What units are we in? Celsius? Kelvin? What should be handled with caution?

The goal isn’t to scare people off - just to keep surprises to a minimum.
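A lightweight way to start is a small, machine-readable “dataset card” that travels with the files. The sketch below uses a plain Python dataclass; the field names and example values are my assumptions, not a standard - use whatever vocabulary your lab already shares.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetCard:
    """A minimal 'interface' for a dataset: what it is, what it isn't, how to use it."""
    name: str
    stage: str               # "raw", "cleaned" or "normalized"
    units: dict              # e.g. {"temperature": "Celsius"}
    missing_values: str      # how gaps are encoded
    caveats: list = field(default_factory=list)

card = DatasetCard(
    name="dataset_B_cleaned.csv",
    stage="cleaned",
    units={"temperature": "Celsius"},
    missing_values="empty cell = not measured (it is not a zero)",
    caveats=["columns 3-6 were normalized to unit variance"],
)
print(card)
```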

✅ 4. Automate what you can, or at least make it reproducible

If your data workflow involves six undocumented copy-paste steps and one mysterious R script with no comments - that’s not a workflow. That’s a ritual. And it won’t survive contact with another human.

Engineers automate not because it’s fun (although sometimes it is) - but because it makes errors less likely and handoffs smoother. Even a basic script or checklist can help make your process less fragile and more shareable.
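Here is a sketch of what “at least make it reproducible” can look like: one short script that goes from the raw file to the cleaned file, with the steps spelled out in order. The file names, the “measurement” column and the use of pandas are assumptions for illustration, not a prescription.

```python
import pandas as pd

RAW = "dataset_B_raw.csv"        # hypothetical input file
CLEAN = "dataset_B_cleaned.csv"  # the file collaborators actually use

def clean(raw_path: str = RAW, out_path: str = CLEAN) -> pd.DataFrame:
    df = pd.read_csv(raw_path)

    # Step 1: drop rows whose 'measurement' is more than 3 standard deviations from the mean
    z = (df["measurement"] - df["measurement"].mean()) / df["measurement"].std()
    df = df[z.abs() <= 3].copy()

    # Step 2: scale columns 3-6 to unit variance
    cols = df.columns[2:6]
    df[cols] = df[cols] / df[cols].std()

    df.to_csv(out_path, index=False)
    return df

if __name__ == "__main__":
    clean()
```

Rerunning the script regenerates the cleaned file from the raw one, so the “workflow” is the code itself rather than a memory of which steps were done by hand.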

Conclusion

Science is increasingly a team sport. Multidisciplinary, multi-institutional and often multilingual - not in the poetic sense, but in the “why are you using commas as decimal points?” sense.

That means our data needs to be more than just correct - it needs to be understandable. Reusable. Friendly. And - fundamentally - considerate of the next user.

Adopting these habits isn’t about adding bureaucratic overhead. It’s a small upfront investment in clarity that pays off later in faster, more reliable science.

You don’t need to become a programmer to think like one. You just need to start treating your data like it has a future.