Virtual Drug Discovery

Evaluating how different data modalities can be combined to computationally predict assay results

It can take 10 to 15 years and cost $2.5 billion to develop a new drug for a given disease. I want to make drug discovery faster and cheaper.

- Anne Carpenter


Discovering new drugs is an expensive and challenging process. I learned this early on in my research career while working at the UW Cancer Vaccine Institute. In one of the projects I worked on, we ran some of our assays at the Quellos High-Throughput Screening Core, which completely blew my mind. I realized that drug discovery is an incredibly difficult search problem, requiring intensive laboratory automation and robotics to screen an enormous number of compounds in a wide variety of experimental assays.

I became really interested in this general problem, and wondered about ways that it might be made more efficient. Around that time, I read a paper from Bob Murphy’s lab at Carnegie Mellon that described a machine learning framework for predicting the results of a large experimental screen while only doing 29% of the experiments.1 I also became deeply inspired by a startup called Transcriptic (now Strateos), which was offering an API to do experiments over the Internet in a “cloud robotic lab.”

Based on these two data points, I wildly extrapolated early on that over the course of my career all biologists would be computationally modeling cells and tissues, and sending experimental designs off to cloud labs to validate them. While this world didn’t quite come to fruition, and we aren’t (yet) all cloud biologists running our assays via HTTP requests, computational modeling has become a foundational guiding force for experimental design in some fields, including drug discovery.2

One of the pioneering groups in this space has been the Carpenter-Singh Lab3 at the Broad Institute. This group is best known for the development of the Cell Painting assay and the CellProfiler image analysis software (now maintained by the Cimini Lab), which have both been hugely influential. For this post, I’ll be highlighting the updated version of their exciting preprint entitled “Predicting compound activity from phenotypic profiles and chemical structures,” which was led by Juan C. Caicedo.

Key Advances

The challenge of discovering a new therapeutic fundamentally consists of identifying a chemical compound that treats a disease of interest.4 The core challenge in solving this type of problem is that chemical space is enormous, making it akin to finding a needle in a haystack. It isn’t tractable (or even possible) to experimentally assay this entire space, because companies and academic labs have finite time and resources.

One strategy to mitigate this bottleneck is to use machine learning to build models that predict the outcomes of these assays instead of running them, and then only pursue empirical validation of the most promising predicted results. Prior work has made these types of predictions from chemical structures alone, with promising results. But what other information could be incorporated to improve our ability to computationally predict the results of drug discovery assays?

In this work, the Carpenter-Singh Lab explored the idea that they could integrate three different data modalities to better predict assay results: 1) chemical structures, 2) transcriptomic profiles (L1000 assays), and 3) Cell Painting morphology profiles. The core hypothesis of this study is that “data representations of compounds and their experimental effects in cells have complementary strengths to predict assay readouts accurately, and that they can be integrated productively to improve compound prioritization in drug-discovery projects.”


One of the first aspects of the analysis in this paper was to understand how well each data modality performed independently, and whether there was overlap in what types of assays each modality could predict. They attempted to predict the results of 314 assays with each modality separately, and found that “all three data modalities can predict compound activity with high accuracy in 7-8% of assays tested” but that there wasn’t substantial overlap in which assays were predicted correctly:

This was a promising result, because it indicates that each type of data might be representing a different aspect of biology not captured by the other modalities. Ideally, we would want to determine how to best incorporate these signals into a model that could outperform any given data type on its own.

To see if they could accomplish this, the authors evaluated how well models fusing the different data modalities performed. They used two different types of assessments: 1) directly fusing the data5 and assessing predictive performance, and 2) a retrospective assessment, which “estimates the performance of an ideal data fusion method that perfectly synergizes all modalities.” The results were interesting:

It turned out that the best predictor was the model using chemical structures and Cell Painting morphology profiles (CS+MO), which could accurately predict 27 assays. This was only one more assay than cell morphology alone was able to accurately predict. It’s also noteworthy that the model with all three data modalities (CS+GE+MO) actually performed worse!

Why is this the case? Why is there such a large gap between the theoretically possible results of Figure 2d and what was observed in Figure 2c? I think that this represents how challenging data fusion is. The authors provide the following explanation:

Our results indicate that the three data modalities only predict a small fraction of the assays in common (Figure 2B, only three assays are predicted by all modalities), suggesting that in most cases, at least one of the data modalities will effectively introduce noise for predicting a given assay. When one of the data modalities cannot signal the bioactivity of interest, the noise-to-signal ratio in the feature space increases, making it more challenging for predictive models to succeed. This explains why late fusion, which independently looks at each modality, tends to produce better performance.

I think that this establishes a meaningful research direction for computational scientists to develop new methods to better fuse different data modalities in order to mitigate these challenges.
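To make the gap concrete, the paper’s “retrospective” assessment can be thought of as an oracle that, for every assay, gets to pick whichever single modality scores best. Here is a minimal sketch of that idea; the AUROC values below are invented for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical per-assay AUROC scores for the three modalities
# (rows: assays; columns: CS, GE, MO). Values are made up.
auroc = np.array([
    [0.62, 0.55, 0.91],
    [0.93, 0.60, 0.58],
    [0.70, 0.88, 0.72],
])

# An "ideal fusion" oracle always uses the best modality per assay.
ideal = auroc.max(axis=1)

# Count assays that clear a high-accuracy threshold under the oracle.
n_well_predicted = int((ideal > 0.9).sum())
```

An actual fusion method has to approach this oracle without knowing in advance which modality carries the signal for a given assay, which is exactly where the noise the authors describe creeps in.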

Fundamentally, these types of models hold the promise of accelerating drug discovery. When looking at how useful the models in this paper might be, the authors “found that predictors meeting AUROC > 0.9 in our experiments produce on average a 50 to 70-fold improvement in hit rate (i.e., compounds with the desired activity, see Supplementary Figure 7) for assays with a baseline hit rate below 1%.” It is still early days for this field, but this indicates the type of potential gains from using models to guide the search for new therapeutics.
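The quoted fold-improvement is straightforward arithmetic: it compares the hit rate among model-prioritized compounds to the baseline hit rate of an unguided screen. A toy calculation with made-up numbers (not the paper’s data):

```python
# Illustrative hit-rate enrichment arithmetic; both rates are invented.
baseline_hit_rate = 0.005  # 0.5% of compounds are active in a blind screen

# Suppose the slice of compounds a model prioritizes for testing
# turns out to contain hits at this rate:
model_hit_rate = 0.30      # 30% of model-picked compounds are hits

fold_improvement = model_hit_rate / baseline_hit_rate
print(fold_improvement)    # roughly a 60-fold enrichment
```

In other words, the same experimental budget spent on model-ranked compounds surfaces dozens of times more hits than screening at random.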

Final Thoughts

Machine learning is a highly enabling tool for biotechnology. It is being used to design new DNA, optimize AAVs, understand gene regulation, and much more. Research groups like the Carpenter-Singh Lab are also driving forward the application of machine learning to the challenging problem of discovering new drugs more efficiently.

In “Predicting compound activity from phenotypic profiles and chemical structures,” this group has shown the promise of using information beyond just chemical structures to computationally predict the results of experimental assays. These types of predictive models could ultimately be used to better prioritize experimental efforts on the most promising potential compounds.

If you’ve enjoyed this research highlight, you should consider subscribing so that the next post will arrive in your email inbox.

Until next time! 🧬


This was one of the primary papers that motivated me to become a computational biologist.


Two of the most prominent examples of biotech startups using machine learning to accelerate drug discovery are Insitro and Recursion.


This lab was formerly the Carpenter Lab, and the change was announced recently. This is an interesting move, because academic research so heavily focuses on labs with a single Principal Investigator. Would we be better off with more jointly led laboratories? It is reflective of the fact that biomedicine is increasingly interdisciplinary.


I’ll note that this is how the drug discovery paradigm works for small molecules. A different problem to focus on is developing entirely new therapeutic modalities, such as cell-based therapies.


They tried two approaches for data fusion: 1) concatenating the data before making predictions, or 2) integrating the probabilities after predictions are made independently with each data type. They found better success with the latter, which they call “late data fusion.”