
"Data Plumbing" for the Digital Lab

Let’s talk about data as the plumbing of the Digital Lab.

John Conway of 20/15 Visioneers draws an analogy between R&D organizations and a builDing. The capital D is not a typo: he argues the word is active (builDing) rather than finished (builT) because an R&D organization is never really done. There will always be some kind of upgrade, improvement, enhancement, or repair. He talks about power: the electricity and networks that allow for communication. He talks about virtual capabilities like software and advanced analytics. He talks about infrastructure as the critical foundation that protects the building and allows it to function. And he talks about plumbing as the processes and instruments that produce and move data. Here at TetraScience, we think this analogy is spot on, and we are inspired to dive deeper into the concept of "data plumbing."

“Data Plumbing” for the Digital Lab

When you think about plumbing for your home, running water, dishwashers, and washing machines are what pop into your head. At face value, it doesn't seem too complicated. However, plumbing can be incredibly complex when you think about how it needs to work in a NYC skyscraper, a sports stadium, or the New England Aquarium. For biotech and pharmaceutical companies today, data flow in the lab is just as complex as - if not more complex than - the water plumbing in those buildings. In the lab, there are heterogeneous instruments: different brands and models producing disparate file formats. There are distributed partners, such as CROs and CDMOs, doing outsourced research. There are different applications, such as registration, inventory, ELN, LIMS, and home-grown tools. And there are the many workflows that scientists run, different kinds of experiments and assays, and, increasingly, different visualization and analysis tools that data scientists are introducing into the ecosystem.

We define lab data plumbing as the collection, cleansing, harmonization, and movement of all lab data.

Just like the plumbing in a building, it can be messy and dirty. It involves the unrestricted flow of a valuable substance (information or water) through a complex system that starts out simple but quickly becomes complicated. Implementing it properly requires special skill sets, tools, and careful design.

Let’s consider some examples of data plumbing in today’s lab:

  • A CRO sends a huge volume of Excel files and PDFs via file share. Scientists then have to manually download the files and perform a quality check - for example, to make sure each barcode exists in the registration system (a minimal automated version of this check is sketched after this list).
  • The next step involves scientists copying and pasting the values into lab notebooks, then manually transcribing them into an ELN. In a GxP-compliant environment, this process often includes additional review steps.
  • Results from multiple batches or samples at different time points are manually compiled and aggregated into an Excel spreadsheet to align the curves and illustrate trends and anomalies.
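None of this has to stay manual. As an illustration, here is a minimal sketch, in Python, of how the barcode quality check from the first example could be automated. The folder, file names, and column names are hypothetical stand-ins for whatever your CRO actually delivers; the point is simply that this check does not require a scientist clicking through files.

```python
from pathlib import Path

import pandas as pd

# Hypothetical inputs: a folder where CRO files land, plus an export of registered barcodes.
incoming_dir = Path("cro_dropbox")
registered = set(pd.read_csv("registered_barcodes.csv")["barcode"])

for result_file in sorted(incoming_dir.glob("*.xlsx")):
    results = pd.read_excel(result_file)            # reading .xlsx requires openpyxl
    unknown = set(results["barcode"]) - registered  # barcodes missing from registration
    if unknown:
        print(f"{result_file.name}: unregistered barcodes {sorted(unknown)}")
    else:
        print(f"{result_file.name}: all barcodes found in registration")
```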

This type of primitive data plumbing is happening every day in every biopharma lab around the world. It isn’t automated, but it is already present - like manually pumping water out of a well instead of simply turning on the faucet.


Requirements for a modern lab data plumbing system

Using building plumbing as inspiration, let’s think about some design requirements for a modern Digital Lab Data Plumbing system:

  1. Prevent dirt and use filtration. We do not want dirty water coming into our buildings, so incoming water often undergoes additional sanitization and filtration. Similarly, we do not want dirty or untagged data going into the data lake, ELN, visualizations, or reports. It is crucial to collect as much data as possible from different sources. It is equally crucial to attach the right metadata, perform validation checks, and harmonize the data.
  2. Fix leaks and clogs. There will always be leaks, broken pipes, and clogs causing drips and floods in buildings. This can be very annoying, or it can be disastrous. Similarly, in data plumbing there will be processing errors. We need an alerting and notification system that proactively detects missing data, processing failures, and throughput bottlenecks or fluctuations so we can fix them before they become disasters. Imagine a file that is much larger than expected and causes memory issues; this needs to be tracked and proactively flagged (a minimal example of such a check is sketched after this list). Trying to identify the source of a pipe leak would be a nightmare without a building floor plan that maps out how the pipes are connected. It is the same for data flow: it is important to view and manage it from a central dashboard.
  3. Configurability. We need to be able to swap out a sink or a shower head without impacting or changing the rest of the system. In a data plumbing system, we need the same, or perhaps an even higher, degree of configurability, enabling labs to change an instrument or application without impacting the rest of the system. We also need to be able to plug in multiple ways to consume data - Spotfire, Tableau, Jupyter notebooks, R, visualization tools, or your own applications - in a plug-and-play fashion. The lab instruments and applications should enable the science; the science shouldn't be limited by the available lab instruments and applications.
  4. Manage pressure. Water enters the building under pressure; that's how it travels upstairs and around corners. In data plumbing, when you connect a new data source, the ingestion pressure builds up. Maybe you want to ingest two years' worth of experiments, or data from a new CRO, a new type of experiment, or a new robotic system. In each case, the volume of new data builds up, and you will need proper load balancing, throttling, and auto-scaling to handle the pressure.
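To make the second requirement a little more concrete, here is a minimal sketch of the kind of proactive check an alerting layer might run: flagging files that are much larger than expected and data sources that have gone quiet. The thresholds, folder name, and notify function are hypothetical placeholders for whatever monitoring and notification channels you actually use.

```python
import time
from pathlib import Path

MAX_FILE_MB = 500        # hypothetical threshold: a file much larger than expected
MAX_SILENCE_HOURS = 24   # hypothetical threshold: a source that has stopped delivering

def notify(message: str) -> None:
    """Placeholder: route to email, chat, or a central monitoring dashboard."""
    print(f"ALERT: {message}")

def check_source(source_dir: Path) -> None:
    files = [f for f in source_dir.glob("*") if f.is_file()]
    if not files:
        notify(f"{source_dir}: no data has arrived at all")
        return
    newest = max(f.stat().st_mtime for f in files)
    if time.time() - newest > MAX_SILENCE_HOURS * 3600:
        notify(f"{source_dir}: no new files in over {MAX_SILENCE_HOURS} hours")
    for f in files:
        size_mb = f.stat().st_size / (1024 * 1024)
        if size_mb > MAX_FILE_MB:
            notify(f"{f.name}: {size_mb:.0f} MB exceeds the expected size")

check_source(Path("instrument_landing_zone"))  # hypothetical landing folder for one instrument
```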

These are just some of the parallels we can draw from plumbing in a building, and they can inspire the proper design of a lab data plumbing system.

Of course, labs have their own unique challenges to consider:

  • Collecting data from instruments is very difficult - this can be a fight, since many instruments do not consider the downstream use cases of the data they produce.
  • Data processing logs must be comprehensive and immutable - much of the transformed experimental data will be used in FDA submissions or quality processes.
  • Data flows need to be rerunnable or replayable - you may want to extract new metadata from the raw data, merge the data, or direct it to new places.
  • Pipelines must be rapidly customizable and upgradable - for new types of studies or new types of insights.
  • Changes and upgrades to the data processing tools must be tracked - a thorough audit log is needed for 21 CFR Part 11.

Hopefully by now you agree that data flow in the Digital Lab is quite complex and quite important, and deserves a renovation. Just to be safe, let’s put some numbers to the concept.

The impact of not taking action to improve your lab data plumbing

Let’s go back to the three data plumbing examples we talked about: 1) CROs sending a large volume of files that need to be quality checked manually, 2) scientists copy-pasting into notebooks and then manually transcribing into ELNs, and 3) data scientists manually aggregating results. Customers tell us it is common for scientists to spend 10-20 hours per week on these types of manual data wrangling activities.

If we do some quick, back-of-the-envelope math to scale this up to a large organization with thousands of scientists, it could look something like this:

  • 15 hours per week x 44 working weeks x 1500 scientists = almost 1,000,000 hours per year on manual data wrangling.

  • If we assume $125/hour, that translates to $125 million per year wasted on manual data processes.
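If you want to plug in your own organization's numbers, the same back-of-the-envelope estimate takes only a few lines; the values below are simply the assumptions from the bullets above.

```python
# Back-of-the-envelope estimate of time and money spent on manual data wrangling.
# All inputs are the illustrative assumptions from the text; replace them with your own.
hours_per_week = 15    # hours each scientist spends on manual data wrangling
working_weeks = 44     # working weeks per year
scientists = 1500      # number of scientists in the organization
hourly_cost = 125      # fully loaded cost per scientist-hour, in USD

hours_per_year = hours_per_week * working_weeks * scientists
cost_per_year = hours_per_year * hourly_cost

print(f"{hours_per_year:,} hours per year")  # 990,000 hours per year
print(f"${cost_per_year:,} per year")        # $123,750,000 per year
```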

This is a meaningful, sizable, fundamental problem in the lab. In terms of opportunity cost, what could your scientists do with those 1 million hours of productive, uninterrupted time? Could they run 10% or even 20% more experiments?

What could your data scientists do with clean, accessible, prepared data at their fingertips? And what is the ripple effect of transcription errors or missing information propagating downstream?

Recommendations to get started on “home improvements” for your lab data plumbing

First, plan the data plumbing system for your lab as a first-level architecture consideration, not as an afterthought. This does not mean you have to be building a new lab. In existing labs, the typical approach is to buy the instrument, buy the ELN, and then figure out how to connect them. In your building, you design the plumbing knowing that there will be a dishwasher and a sink, even if the exact fixtures haven't been purchased yet. It's the same for the lab: you know you will need to move data around, so plan a configurable "message bus" (for our IT readers) that is capable of plugging into a variety of instruments and applications - and then plug them in.
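To make the "message bus" idea concrete, here is a minimal, illustrative sketch of a publish/subscribe pattern in which instrument connectors publish harmonized records and downstream consumers (a data lake writer, an ELN loader, a dashboard) subscribe to them. The topic names and records are hypothetical; a real deployment would sit on an actual broker and on connectors for your specific instruments and applications.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class LabMessageBus:
    """Minimal in-memory publish/subscribe bus, for illustration only."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        # Register a downstream consumer (data lake writer, ELN loader, dashboard, ...).
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: dict) -> None:
        # Instrument or CRO connectors publish harmonized records to a topic.
        for handler in self._subscribers[topic]:
            handler(record)

# Hypothetical usage: consumers can be added or swapped without touching the producers.
bus = LabMessageBus()
bus.subscribe("plate_reader.results", lambda record: print("-> data lake:", record))
bus.subscribe("plate_reader.results", lambda record: print("-> ELN entry:", record))

bus.publish("plate_reader.results", {"barcode": "PLT-0001", "od600": 0.42})
```

The design point is the decoupling: you can add or remove a consumer (swap the shower head) without touching the instruments and partners producing the data.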

Second, START NOW! Don’t wait. Don’t let scientists and data scientists waste their time on data wrangling. Every little bit helps make a dent in that 1 million hours and $125 million wasted every year. If the job seems daunting, just pick off one small component at a time. You are not in this alone - we are here to help. Follow our blog and our social media for tips, or check out our website to learn more about our data engineering platform that provides the data plumbing for the Digital Lab.

Watch our "Data Plumbing for the Digital Lab" presentation from the Discovery IT Digital Week 2020 to learn more!

Follow TetraScience for ongoing updates about R&D data in life sciences and other related topics.

Spin Wang

Cornell Applied Physics and MIT EECS. Co-founder and CEO of TetraScience. Forbes 30 Under 30 in Science.
