Advanced Data Engineering Improves Data Integrity Using an Allotrope-Compatible, Data Science Ready File Format
The pharmaceutical industry generates experimental data every day that is stored on local PCs or on instrument vendor servers, with vendor-dependent data schemas. These practices create data silos across pharma that do not adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) data principles and prevent companies from using all the information in their raw data sets.
“The lab of the future is built on data. Right now, our cell counter data is largely inaccessible, and the heterogeneous nature of the information makes it difficult to analyze without significant manual manipulation. My team needs to make this data accessible and actionable for our scientists and data scientists,” says Len Blackwell, Associate Director of Strategic Analytics at Biogen.
This blog post highlights how our recent collaboration with Biogen knocks down the data silos associated with their Beckman Coulter Vi-CELL cell counter results, making the data readily available for further analysis using standard data science tools.
Cell counters generate valuable, but heterogeneous, data
Cell counters enable process development scientists to distinguish viable from non-viable cells, and to count each population, using the trypan blue exclusion cell counting method. These counts are then used to calculate overall cell density and cell viability percentage. Cell density measurements help monitor cell culture feed requirements, and the cell viability percentage indicates the overall health of a cell culture.
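The two derived values described above follow directly from the viable and non-viable counts. Here is a minimal sketch of that arithmetic in Python. The class name, field names, and the `sample_volume_ml` parameter are assumptions made for illustration; they are not taken from the Vi-CELL software or its report format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellCount:
    """Counts from one trypan blue exclusion measurement (illustrative only)."""

    viable_cells: int         # unstained cells, which excluded the dye
    nonviable_cells: int      # stained cells, which took up the dye
    sample_volume_ml: float   # assumed: volume the counts correspond to, in mL

    def __post_init__(self) -> None:
        if self.viable_cells < 0 or self.nonviable_cells < 0:
            raise ValueError("cell counts must be non-negative")
        if self.sample_volume_ml <= 0:
            raise ValueError("sample volume must be positive")

    @property
    def total_cells(self) -> int:
        return self.viable_cells + self.nonviable_cells

    @property
    def viability_percent(self) -> float:
        # Viability is the viable fraction of all counted cells, as a percentage.
        if self.total_cells == 0:
            raise ValueError("no cells counted")
        return 100.0 * self.viable_cells / self.total_cells

    @property
    def viable_cell_density(self) -> float:
        # Viable cells per mL of the counted volume.
        return self.viable_cells / self.sample_volume_ml


count = CellCount(viable_cells=450, nonviable_cells=50, sample_volume_ml=0.5)
print(count.viability_percent, count.viable_cell_density)
```

With 450 viable cells out of 500 counted in 0.5 mL, this gives a viability of 90% and a viable cell density of 900 cells per mL.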
“Cell counting is a critical step in the biomanufacturing process that provides information about the density and viability of the mammalian cells that produce our protein products. In the process development laboratories, daily cell counts inform process development engineers of the results of experimental conditions with the goal of optimizing cell health and productivity,” says Brandon Moore, Cell Culture Engineer at Biogen.
A cell culture study typically lasts 14 days, with one sample analyzed from each condition per day. Each sample analysis on Biogen’s cell counter produces 50 images, which are analyzed with the trypan blue exclusion cell counting method. The resulting measurements are stored in a single text file as aggregated values and raw data arrays. The image files used for the measurements are exported and stored separately from the numerical data.
This presents a few challenges:
- Multiple file types (image and text files)
- 51 files generated for each sample (50 images plus one text report), multiplied by the number of days in a study; a 14-day study yields 714 files per condition
- Major data integrity risk due to the separate file storage
Image: The Beckman Coulter Vi-CELL exports images and a .txt report for each experiment.
Automating cell counter data movement and conversion
TetraScience addresses these data integrity challenges with our Data Integration Platform. The platform automatically moves cell counter files from the instrument PC to an AWS data lake. Once the files are in the data lake, they pass through two conversion pipelines. First, the instrument data schema is mapped into an intermediate data schema JSON file (IDS-JSON). Then the IDS-JSON is mapped into an Allotrope Data Format (ADF) file, a pharmaceutical industry standard. The ADF file provides a single output that captures both the image files and the numerical data report generated for each cell counter sample.
Image: an excerpt of the cell counter report as a JSON in the IDS format.
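To make the first conversion step more concrete, here is a rough sketch of mapping a vendor text report into a JSON record. Everything specific in it is an assumption: the "key : value" report layout, the report keys, and the output field names are invented for illustration and do not reproduce the actual Vi-CELL export or the TetraScience IDS schema. The sketch also stops before the second step; it does not produce an ADF file.

```python
import json

# Hypothetical report text; the real instrument export will differ.
SAMPLE_REPORT = """\
Sample ID : CHO-01
Viability (%) : 95.2
Viable cells/ml (x10^6) : 3.40
"""


def parse_report(text: str) -> dict:
    """Split 'key : value' lines into a dict of raw strings."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and anything that is not a key/value pair
        key, _, value = line.partition(":")  # split on the first colon only
        fields[key.strip()] = value.strip()
    return fields


def report_to_json_record(text: str) -> dict:
    """Map vendor-specific keys onto vendor-neutral, typed fields (names assumed)."""
    fields = parse_report(text)
    return {
        "sample_id": fields["Sample ID"],
        "viability_percent": float(fields["Viability (%)"]),
        "viable_cell_density_1e6_per_ml": float(fields["Viable cells/ml (x10^6)"]),
    }


record = report_to_json_record(SAMPLE_REPORT)
print(json.dumps(record, indent=2))
```

The design point is the separation between parsing the vendor layout and renaming fields into a common schema, so that downstream tools only ever see the common field names.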
Harnessing Cell Counter Data
Once the cell counter data is moved to an AWS data lake and packaged as either an IDS-JSON or an ADF file, data scientists can begin analyzing these files with Python notebook environments such as Google Colab, or with other common data science tools. Scientists can retrieve files related to cell growth studies by querying the IDS-JSON files with Elasticsearch. The cell density data can then be plotted versus date using the pandas and seaborn Python libraries. ADF files enhance cell counter data integrity by combining numerical and image data into one file output; scientists can access this information with the h5py Python library. The combined data outputs enable new types of data analysis, such as cell contamination monitoring with cell image analysis. For more information on the ADF file format, check out our blog post about the ADF graph model and leaf node model.
Image: Converting cell counter data into an ADF wraps up the standardized data along with associated images and ontology specified by the Allotrope Foundation
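The plotting step mentioned above, cell density versus date, might look roughly like the following once the records have been retrieved. This is a sketch under stated assumptions: the records, their field names, and the dates are made-up stand-ins for query results, not real Biogen data or the actual IDS-JSON layout. seaborn and matplotlib are imported inside the plotting function, so the data preparation part runs with pandas alone.

```python
import pandas as pd

# Hypothetical records standing in for the results of an IDS-JSON query.
RECORDS = [
    {"sample_id": "cond-B", "measured_at": "2020-03-02", "viable_cell_density": 1.1},
    {"sample_id": "cond-A", "measured_at": "2020-03-02", "viable_cell_density": 1.2},
    {"sample_id": "cond-A", "measured_at": "2020-03-01", "viable_cell_density": 0.5},
    {"sample_id": "cond-B", "measured_at": "2020-03-01", "viable_cell_density": 0.5},
]


def records_to_frame(records: list) -> pd.DataFrame:
    """Build a tidy DataFrame with real datetimes, ordered by condition and date."""
    frame = pd.DataFrame.from_records(records)
    frame["measured_at"] = pd.to_datetime(frame["measured_at"])
    return frame.sort_values(["sample_id", "measured_at"]).reset_index(drop=True)


def plot_growth_curves(frame: pd.DataFrame, out_path: str) -> None:
    """Draw one growth curve per condition and save the figure to out_path."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots()
    sns.lineplot(data=frame, x="measured_at", y="viable_cell_density",
                 hue="sample_id", marker="o", ax=ax)
    ax.set_xlabel("Date")
    ax.set_ylabel("Viable cell density")
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    fig.savefig(out_path)
    plt.close(fig)


growth = records_to_frame(RECORDS)
print(growth)
```

Calling `plot_growth_curves(growth, "growth_curves.png")` would then write the figure to disk, with one line per condition.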
We can also use the TetraScience REST API to import data into interactive Python notebooks for more detailed image analysis.
Image: Use the TetraScience REST API to access your data using popular data science tools, like Jupyter (IPython) notebooks
“There are two really important improvements that come out of this process for Biogen. First, analysis is fully automated; once someone reads a sample in the cell counter, the data for the growth curve is visualized in the BI tool. This was previously a manual process with a lot of data movement. Second, data integrity is improved by aggregating multiple files from each sample into a single file and automating storage of the data. These are key points and should not be overlooked,” says Blackwell.
We are delighted to have the opportunity to work with innovators at Biogen like Len Blackwell, Associate Director, and George Van Den Driessche, Scientist I. This collaboration will save their scientists time, make their cell counter data truly accessible and actionable, and negate the risk presented by separately storing files.
Learn more about how we automatically harmonize and centralize experimental data, connecting disparate silos to activate the flow of data across your R&D ecosystem. Contact us at www.TetraScience.com/contact-us.