Accelerating Discovery: High-Throughput Science for Data & Computer Scientists
How high-throughput science is creating a large number of value-creation opportunities for data and computer scientists.
Computer science (CS) has deeply impacted every area of human endeavor in the last 50 years through a process we now know as digital transformation. The arrival of the Internet in the late 90s and smartphones in the late 2000s accelerated this process to the logical culmination we now experience. It has been an exhilarating journey for everyone involved. As a popular movie song asks, what’s next?
The Internet created “big” datasets. Big data, combined with advances in machine learning and AI in the early 2010s, gave birth to modern data science (DS), an even more powerful instrument of transformation than CS. Like CS, DS is touching every field of human endeavor. This revolution has been led primarily by big tech, which smartly invested in DS and monetized big data.
As we entered the 2020s, COVID shocked us into accelerated digital transformation. Adoption of digital processes became mandatory (e.g., Zoom, e-commerce), and we saw a continued tech boom as a result. As the pandemic winds down, thanks to incredibly fast vaccine design, testing, manufacturing, and rollout, we are now experiencing post-pandemic changes in business activity. Global e-commerce revenue is expected to go down this year, for the first time ever. The next hype cycle, around the Metaverse, is taking shape, but there’s a silent revolution that’s passing under the radar of most CS/DS professionals.
High-throughput Science: A new revolution taking shape
Scientific Computing disciplines such as Computational Chemistry and Biology have been around for a long time, and the interactions between fundamental sciences and DS have also been around (e.g., quantum physics motivated the development of many statistical models and modern deep learning architectures are inspired by biological neural networks). However, a few developments of late are clearly indicating the arrival of a new High-throughput Science (HTS) era, providing great value creation opportunities (once again) for CS/DS professionals. Let me illustrate with a few examples.
Opportunities for Data Scientists
There’s a virtuous cycle of innovation currently underway between advances in Science and Data Science. Look at the schedules of any top AI conference or flip through any reputed science journal and you’ll see what I’m talking about. On one hand, advances in high-throughput experimentation (HTE) in Science are generating large datasets which are being analysed with the help of AI/ML to understand the secrets of nature. On the other hand, advances in AI/ML are being inspired by needs and ideas in Science.
The latest in AI/ML is helping solve high-value problems in Science and, in the process, enriching itself
Of course, AlphaFold2 (AF2) is a widely known example. AI/ML helped solve the long-standing, high-value problem of protein structure prediction. While this unlocked immediate value-creation opportunities for Science, what’s instructive is how many good AI/ML ideas had to be generated and integrated in AF2 to achieve the success it did. Had the science of proteins not posed the challenges it did, some of these innovative AI/ML ideas may not have been discovered or become popular.
Another very impactful application of AI/ML, one that may not be as widely known in the broader data science community as AF2, is the analysis of gene expression datasets produced at the granularity of a single cell by scRNA-seq experiments. The curse of dimensionality, which all data scientists are very familiar with, presents the foremost challenge in analyzing about 20,000 gene expression values from each of thousands or even millions of cells.
Questions in biology and medicine carry high value, and many of them can be answered only if we are able to de-noise, visualize, and meaningfully cluster scRNA-seq data. As a result, AI/ML advances such as diffusion maps and graph clustering techniques have come to the fore in the last few years and have been widely adopted.
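To make the dimensionality challenge concrete, here is a minimal sketch of the first steps of such an analysis. It is an illustration under stated assumptions, not the method used by any particular scRNA-seq toolkit: the count matrix is synthetic, the function names `pca_reduce` and `knn_indices` are hypothetical, and plain PCA is swapped in for diffusion maps because it needs only NumPy. The nearest-neighbor lists it produces are the kind of input that graph clustering techniques start from.

```python
import numpy as np


def pca_reduce(counts, n_components=10):
    """Project a cells x genes count matrix onto its top principal components."""
    # log1p tames the heavy right tail typical of count data.
    x = np.log1p(counts.astype(np.float64))
    # Center each gene (column) before the SVD.
    x -= x.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def knn_indices(embedding, k=5):
    """Return, for each cell, the indices of its k nearest cells."""
    sq = (embedding ** 2).sum(axis=1)
    # Squared Euclidean distances between all pairs of cells.
    d2 = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    # A cell must not count as its own neighbor.
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]


# Synthetic stand-in for real data: 200 cells x 2,000 genes (assumption).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(200, 2000))

embedding = pca_reduce(counts, n_components=10)
neighbors = knn_indices(embedding, k=5)
print(embedding.shape, neighbors.shape)
```

The all-pairs distance matrix used here grows quadratically with the number of cells, so a real dataset with millions of cells would need approximate nearest-neighbor search instead.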
If you are a serious data science practitioner coding with PyTorch, TensorFlow, JAX, sklearn, Python, and R in Jupyter/RStudio to extract insights using AI/ML, science presents some of the best opportunities for value creation with your skills.
Opportunities for Data Engineers
The scale, variety, and richness of scientific datasets pose some of the toughest challenges for data engineers.
Each of our cells holds the blueprint, the human genome, for how our bodies function. The human genome consists of about 6 billion nucleotides (A, C, G, or T). One might wonder:
- How much memory/storage space do you need if you need to compare, contrast, and associate patterns in the genomes of 1 million individuals?
- How do we tabulate the results of such analysis?
- Can we leverage known constraints we understand from a priori knowledge in biology and chemistry to optimize this analysis?
- How do we even capture the vast biochemical and biomedical knowledge we may already have, and factor all of that in analysis?
These are the kinds of questions you tackle as a data engineer processing scientific data. Because scientific datasets are usually high-dimensional arrays, you will also often deal with questions of how to distribute data across storage nodes, optimise transport, and accelerate data processing and integration.
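A back-of-the-envelope calculation gives a feel for the first question in the list above. This is a rough sketch only: it uses the figure of about 6 billion nucleotides per genome quoted earlier, assumes each genome is stored in full, and ignores compression, quality scores, and metadata. The function name `raw_storage_bytes` is hypothetical.

```python
NUCLEOTIDES_PER_GENOME = 6_000_000_000  # figure quoted in the text
INDIVIDUALS = 1_000_000
PETABYTE = 10 ** 15  # decimal petabyte


def raw_storage_bytes(bits_per_nucleotide,
                      nucleotides=NUCLEOTIDES_PER_GENOME,
                      individuals=INDIVIDUALS):
    """Bytes needed to store every genome in full at a fixed bit width."""
    total_bits = bits_per_nucleotide * nucleotides * individuals
    return total_bits // 8


# Four letters (A, C, G, T) fit in 2 bits; plain text uses 8 bits per letter.
packed = raw_storage_bytes(2)
as_text = raw_storage_bytes(8)

print(f"2-bit packed: {packed / PETABYTE:.1f} PB")   # 1.5 PB
print(f"8-bit text:   {as_text / PETABYTE:.1f} PB")  # 6.0 PB
```

Because any two human genomes are overwhelmingly identical, storing only the differences from a reference sequence can shrink these numbers considerably, which is one example of using a priori biological knowledge to optimize the analysis.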
You will also work with rich, ever-evolving ontologies and face the challenge of capturing knowledge in efficient data structures for storage, lookup, integration, and analysis.
Distributed data processing platforms like Spark and knowledge graph datastores like Neo4j are quite popular in data engineering communities dealing with scientific data.
Opportunities for Algo Developers
You would’ve heard of mRNA vaccines for COVID having to be stored and transported at very low temperatures in order to keep the mRNA stable. If an expert in RNA stability works alongside you, can you develop an efficient algorithm to search the vast space (on the order of 10^632) of possible mRNA sequences that can encode the SARS-CoV-2 spike protein and come up with more stable alternatives?
Science offers tremendous opportunities for algorithm developers by providing a wide variety of problems that are too difficult to solve using naive approaches.
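To see why naive enumeration of mRNA sequences is hopeless, the sketch below counts how many distinct coding sequences encode a given protein. Each amino acid can be written with between one and six synonymous codons in the standard genetic code, so the count is the product of those numbers along the protein. The peptide `MKLV` is a made-up toy example, and the function names are hypothetical; this only sizes the search space and says nothing about how to search it.

```python
import math

# Number of codons per amino acid in the standard genetic code
# (one-letter amino acid codes; stop codons are not included).
CODON_COUNT = {
    "M": 1, "W": 1,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2,
    "K": 2, "D": 2, "E": 2, "C": 2,
    "I": 3,
    "V": 4, "P": 4, "T": 4, "A": 4, "G": 4,
    "L": 6, "S": 6, "R": 6,
}


def count_synonymous_sequences(protein):
    """Exact number of mRNA coding sequences that encode the protein."""
    return math.prod(CODON_COUNT[aa] for aa in protein)


def log10_synonymous_sequences(protein):
    """Base-10 logarithm of the count, convenient for long proteins."""
    return sum(math.log10(CODON_COUNT[aa]) for aa in protein)


toy_peptide = "MKLV"  # hypothetical 4-residue peptide, for illustration only
print(count_synonymous_sequences(toy_peptide))  # 1 * 2 * 6 * 4 = 48
print(round(log10_synonymous_sequences(toy_peptide), 2))
```

Every additional residue multiplies the count by up to 6, so the search space grows exponentially with protein length, and an efficient algorithm has to exploit the structure of the problem rather than enumerate candidates.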
Opportunities for DevOps & Cloud engineers
- What’s the largest cluster you ever had to provision and manage?
- Can you design and stand up large compute and storage clouds to process scientific workloads that are likely to take days to weeks to complete, taking advantage of opportunities available in the workload for parallel and distributed processing?
- Can you do this while ensuring reproducibility and robustness?
The demands of scientific computing test your understanding of Linux, Docker, Kubernetes, and cloud automation.
Opportunities for UX designers and UI developers
Science provides incredible opportunities for anyone working in data visualization, UX design, or UI development. The domains involved are extremely rich in detail and complexity, which is a challenge but also a great opportunity to improve the productivity of researchers and empower them with well-designed visualizations and user interfaces.
Scientific data, of course, tends to be large and complex. Hence, UI development for researchers often involves server-side programming along with modern JavaScript-powered visualization frameworks such as D3.js.
There is already clear evidence of CS and DS making an impact in fundamental sciences. For example, see the recent exponential rise in the number of biomedical journal articles explicitly mentioning AI as a method of study.
Now is a very exciting time for all of us
At Aganitha, we are excited to leverage and participate in these innovations and are inspired by the opportunity we have to contribute to the alleviation of suffering and disease through our work with R&D teams at global biopharma companies. While entering rich and complex scientific fields may seem daunting to many CS/DS professionals, the rewards more than justify the effort needed.
MIT is championing what it calls multilingualism – blending every field of science and engineering with data science. Big tech companies are making their own deep investments as can be seen in DeepMind and CZI. Most importantly, an era of open science, accelerated by COVID, has made convergence possible. Are you ready to contribute to this revolution?