Soluble

A custom Graph Neural Network that predicts molecular solubility directly from chemical topology.

Overview

Soluble is a Graph Neural Network (GNN) that predicts molecular solubility directly from chemical structure. Instead of relying on precomputed molecular fingerprints, it models atoms as nodes and bonds as edges, so the network learns its own representations of chemical properties. The core architecture is a 3-layer GINE network, a variant of the Graph Isomorphism Network (GIN) that adds edge features to every message. I selected it because it can use bond information (such as double bonds) that graph convolution layers without edge inputs ignore.
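As a rough illustration of how a GINE-style layer uses bond information, the sketch below implements one update step in plain NumPy: each atom sums ReLU(neighbor features + bond features) over its incoming bonds, adds its own features scaled by (1 + eps), and passes the result through a small MLP. This is not the project's code. The function name `gine_layer`, the toy 3-atom graph, the identity weights, and the assumption that bond features already have the same width as atom features are all illustrative.

```python
import numpy as np

def gine_layer(x, edge_index, edge_attr, w1, w2, eps=0.0):
    """One GINE-style update: x_i' = MLP((1 + eps) * x_i + sum_j ReLU(x_j + e_ji))."""
    src, dst = edge_index                           # messages flow src -> dst
    messages = np.maximum(x[src] + edge_attr, 0.0)  # ReLU(x_j + e_ji) per edge
    agg = np.zeros_like(x)
    np.add.at(agg, dst, messages)                   # sum messages per target atom
    h = (1.0 + eps) * x + agg
    return np.maximum(h @ w1, 0.0) @ w2             # two-layer MLP, biases omitted

# Hypothetical toy graph: atoms 0-1 joined by a single bond, atoms 1-2 by a double bond.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edge_index = np.array([[0, 1, 1, 2],                # source atoms
                       [1, 0, 2, 1]])               # target atoms
single, double = [1.0, 0.0], [0.0, 1.0]             # made-up bond encodings
edge_attr = np.array([single, single, double, double])

identity = np.eye(2)                                # identity weights keep the arithmetic visible
out = gine_layer(x, edge_index, edge_attr, identity, identity)
print(out)
```

In PyTorch Geometric the corresponding layer is `GINEConv`, which can also project bond features to the node width through its `edge_dim` argument.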

The model converts raw SMILES strings into graph data, featurizing each atom with a 16-dimensional vector of chemical properties including hybridization, aromaticity, and formal charge. To stabilize training and counter vanishing gradients, I added residual skip connections and layer normalization. For the final prediction, I built a custom "hybrid pooling" layer that combines Sum, Mean, and Max operations over the atom embeddings. This lets the model account for both the total size of the molecule (Sum) and the intensity of specific functional groups (Max). It achieves an RMSE of ~0.59 (log mol/L) on the ESOL dataset, comparable to published graph-based baselines.
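The hybrid pooling idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's layer: it assumes the three pooled vectors are concatenated (the description only says they are combined), that every molecule has at least one atom, and the name `hybrid_pool` and the toy embeddings are hypothetical.

```python
import numpy as np

def hybrid_pool(node_emb, batch, num_graphs):
    """Concatenate sum-, mean-, and max-pooled atom embeddings for each molecule."""
    dim = node_emb.shape[1]
    out = np.zeros((num_graphs, 3 * dim))
    for g in range(num_graphs):
        nodes = node_emb[batch == g]        # rows belonging to molecule g
        out[g] = np.concatenate([
            nodes.sum(axis=0),              # grows with molecule size
            nodes.mean(axis=0),             # size-independent average
            nodes.max(axis=0),              # strongest activation per feature
        ])
    return out

# Toy batch: molecule 0 has two atoms, molecule 1 has one atom.
node_emb = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 5.0]])
batch = np.array([0, 0, 1])                 # molecule index of each atom
pooled = hybrid_pool(node_emb, batch, num_graphs=2)
print(pooled)
```

With PyTorch Geometric, the same three reductions are available as `global_add_pool`, `global_mean_pool`, and `global_max_pool`, and their outputs can be concatenated before the final regression head.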

Highlights

Created a custom GNN using PyTorch Geometric and GINEConv layers.
Engineered a domain-specific featurizer that encodes each atom as a 16-dimensional vector covering hybridization, aromaticity, and formal charge.
Achieved an R² of 0.92 and an RMSE of 0.59 on ESOL, demonstrating the model's viability.
Used residual connections to mitigate "oversmoothing".
Created an end-to-end system that parses SMILES strings and runs inference in real time.
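For readers unfamiliar with the two reported metrics, here is a small pure-Python sketch of how RMSE and R² are computed from true and predicted values. The function names are illustrative and the four values are made-up toy numbers, not ESOL data or results from this model.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up toy values standing in for log-solubility targets and predictions.
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]
print(rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

RMSE reports the typical error in the units of the target, while R² reports the fraction of target variance the predictions explain, which is why the two numbers are usually quoted together.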

Links

View on GitHub ↗
Visit Website ↗

Stack

Python, PyTorch, PyTorch Geometric, RDKit, GINE Architecture, Matplotlib, NumPy, Pandas