The Data Problem in AI Drug Discovery: From Biased Inputs to Biological Truth
- Highlight the data bottlenecks, not the algorithm bottlenecks, in AI-driven drug discovery. Humans and models alike train on narrow chemical and biological spaces shaped by convenience or funding. Pro-Phet designs efficient libraries for popular proteins and reuses biased datasets
- Build models that reflect biology, not bias, by starting with sequence data rather than pre-structure data. Learn from biological language to normalize inputs across diverse sources, creating a shared information standard. Embrace negative data (failures, non-binders, and weak affinities) so that AI learns boundaries and context
- Create confidence through cleaner, broader, more representative data. Couple sequence-level modeling, normalization, and large-scale screening. Apply the law of large numbers to biology: across billions of protein–molecule interactions, patterns stabilize and outliers become clear. Build reproducibility so that AI solves biology instead of mirroring it
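
The law-of-large-numbers point in the last bullet can be illustrated with a small simulation. This is a minimal sketch on synthetic data only, not a description of any real screening pipeline: the function name `mean_spread`, the "true" affinity of 6.0, and the Gaussian noise model are all assumptions made for illustration. It shows how the spread of an averaged measurement shrinks as the number of noisy observations grows, which is what makes stable patterns and genuine outliers easier to tell apart at scale.

```python
import random
import statistics


def mean_spread(true_value, noise_sd, sample_sizes, trials=200, seed=0):
    """Estimate how much an averaged measurement varies for each sample size.

    For every n in sample_sizes, simulate `trials` experiments, each averaging
    n noisy readings of the same (hypothetical) true value, and return the
    standard deviation of those averages.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    spread = {}
    for n in sample_sizes:
        means = []
        for _ in range(trials):
            readings = [rng.gauss(true_value, noise_sd) for _ in range(n)]
            means.append(statistics.fmean(readings))
        spread[n] = statistics.pstdev(means)
    return spread


# Hypothetical numbers: a true affinity score of 6.0 measured with unit noise.
spread = mean_spread(true_value=6.0, noise_sd=1.0, sample_sizes=[10, 100, 1000])
for n, sd in spread.items():
    print(f"n={n:>5}  spread of the mean ~ {sd:.3f}")
```

Under these assumptions the spread falls roughly as one over the square root of the sample size, so a hundredfold increase in observations narrows the uncertainty about tenfold. Real assay noise is rarely this well behaved, so the sketch shows only the direction of the effect.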