fastsolv

developing accurate cheminformatics models for solubility prediction

Most of my thesis is motivated by the poor aqueous solubility of hydrophobic drugs. However, solubility in organic solvents is also a crucial property, particularly for synthetic and process chemistry and for formulation processing. Measuring solubility accurately is extremely challenging, and experimental variability is high. Despite its importance, solubility is also notoriously difficult to predict. A great deal of work has gone into predicting aqueous solubility (due to its relevance in drug function), and some tools have been quite successful. Organic solubility, however, has been comparatively neglected. We sought to create an accurate and accessible tool for the community to use.

We used standard deep learning cheminformatics models (chemprop and fastprop) and trained them on large datasets from the literature. One cool training technique: we trained on both the solubility loss and the loss on the gradient of solubility with respect to temperature (see equation below). This is a very general approach called Sobolev training, which can be adapted to any DL problem where you care about the accuracy of the gradient of your target value with respect to an input (temperature, pH, concentration, etc.).

\[\mathcal{L}(\hat S, S) = \frac{1}{n} \sum_{i=1}^n \left[ \left( \log S_i - \log \hat{S}_i \right)^2 + \alpha \left( \frac{d\, \log S_i}{d T_i} - \frac{d\, \log \hat{S}_i}{d T_i} \right)^2 \right].\]
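
To make the loss concrete, here is a minimal sketch in plain Python. It is not the fastsolv training code: the function name `sobolev_loss`, the toy model `make_model`, and the default values of `alpha` and `eps` are all assumptions chosen for illustration. It reads the hatted quantities as predictions and the unhatted ones as reference values, and it assumes reference temperature gradients are already available for each data point. In real training an autodiff framework would supply the derivative of the prediction; a central finite difference stands in for it here so the example needs no dependencies.

```python
from typing import Callable, Sequence


def sobolev_loss(
    predict_log_s: Callable[[float], float],
    temperatures: Sequence[float],
    log_s_true: Sequence[float],
    dlog_s_dt_true: Sequence[float],
    alpha: float = 0.1,  # weight on the gradient term (illustrative value)
    eps: float = 1e-3,   # finite-difference step in temperature (illustrative value)
) -> float:
    """Mean over points of squared value error plus alpha * squared dT-gradient error."""
    total = 0.0
    for T, y, dy in zip(temperatures, log_s_true, dlog_s_dt_true):
        y_hat = predict_log_s(T)
        # Central finite difference stands in for autodiff in this sketch.
        dy_hat = (predict_log_s(T + eps) - predict_log_s(T - eps)) / (2 * eps)
        total += (y - y_hat) ** 2 + alpha * (dy - dy_hat) ** 2
    return total / len(temperatures)


def make_model(a: float, b: float) -> Callable[[float], float]:
    """Hypothetical toy model: log S = a - b / T, so d(log S)/dT = b / T**2."""
    return lambda T: a - b / T


# Synthetic reference data generated from the toy model itself.
temps = [280.0, 300.0, 320.0]
truth = make_model(2.0, 1000.0)
log_s = [truth(T) for T in temps]
dlog_s_dt = [1000.0 / T**2 for T in temps]

loss_exact = sobolev_loss(truth, temps, log_s, dlog_s_dt)
loss_offset = sobolev_loss(make_model(2.5, 1000.0), temps, log_s, dlog_s_dt)
wrong_slope = make_model(2.0, 500.0)
loss_values_only = sobolev_loss(wrong_slope, temps, log_s, dlog_s_dt, alpha=0.0)
loss_with_gradient = sobolev_loss(wrong_slope, temps, log_s, dlog_s_dt, alpha=1.0)
```

In this toy setup, a model that is shifted by a constant is penalized only through the value term, while a model with the wrong temperature dependence is penalized by both terms. The second term is what pushes a trained model toward the right solubility-temperature trend rather than just the right values.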

The combination of this training approach, established model architectures, and compiled literature datasets gave us state-of-the-art performance. We rigorously show that our models achieve average accuracy approaching the limit set by experimental variability, i.e., the noise floor of the measurements themselves. Excitingly, our model (fastsolv) is quickly being adopted in industry, most notably by Rowan Scientific (docs here). We’ve also released an open-source deployment of fastsolv, making it accessible to the broader community for both research and practical applications.

Deployed model GUI

This project started as a class project in an MIT ML course, so it’s been awesome to see it develop into a useful tool for the community.

References

2024

  1. Organic Solubility Prediction at the Limit of Aleatoric Uncertainty
    Lucas Attia, Jackson W Burns, Patrick S Doyle, and 1 more author
    ChemRxiv, Nov 2024