
The Enlace Bio virtual screening system: a state-of-the-art in silico HTVS system for protein–ligand affinity prediction

Updated: 2 minutes ago

This blog post outlines an earlier version of the offering. For details on the latest offering, please refer to https://www.enlacebio.com/virtual-screen.

Executive summary

  • The Enlace Bio virtual screening system improves performance by more than 20% over the next best in silico high-throughput method for protein–ligand affinity prediction, while still processing over one million compounds in a week.

  • It integrates cutting-edge deep learning algorithms for docking and affinity assessment trained on 500,000 curated data points.

  • This results in faster, more accurate hit identification, uncovering a wider variety of strong binders and increasing the likelihood of success in subsequent stages of drug discovery.

Figure 1. Performance vs speed (virtual assay throughput). Performance is measured as the average Spearman correlation between predicted and experimental values in 8 unseen assays released by Merck (paper). The speed calculation assumes an 8-GPU compute cluster.


Introduction

Protein–ligand affinity assessment plays a key role in small molecule drug discovery, especially during hit identification, where it ranks small molecules by their predicted binding strength to a target protein. This ranking can be done with in vitro methods, in silico methods, or a combination of both, resulting in a shortlist of top-binding molecules for the next phase of drug development.


In this white paper we introduce the docking part of the Enlace Bio virtual screening system for protein–ligand affinity prediction. As shown in Figure 1, it improves on other in silico methods that are viable at high scale by over 20% (0.39 to 0.47 average Spearman correlation). The benchmark is a hard one, based on a dataset published by Merck (paper). The system achieves this while still enabling high-throughput virtual screening (HTVS): a single 8-GPU compute cluster can process over one million compounds in a week. Combined with active learning, up to billions of compounds can be processed in one screening campaign; active learning is outside the scope of this white paper.
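The throughput claim can be turned into a rough per-compound compute budget. The sketch below is only back-of-the-envelope arithmetic on the quoted figures. It assumes the cluster is fully utilised for the whole week, and the function name is illustrative rather than part of the system.

```python
# Back-of-the-envelope compute budget implied by the quoted throughput.
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 s
N_GPUS = 8                        # cluster size stated in the text
N_COMPOUNDS = 1_000_000           # "over one million compounds" per week


def gpu_seconds_per_compound(n_compounds: int, n_gpus: int, wall_seconds: int) -> float:
    """Average GPU time available per compound, assuming full utilisation."""
    return n_gpus * wall_seconds / n_compounds


budget = gpu_seconds_per_compound(N_COMPOUNDS, N_GPUS, SECONDS_PER_WEEK)
print(f"GPU-seconds per compound: {budget:.2f}")  # prints 4.84
```

Because the text says "over" one million compounds, about 4.8 GPU-seconds is an upper bound on the average time spent per compound, covering every stage of the pipeline.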


Results

The benchmark was selected to reflect real-world HTVS hit identification, focusing on ranking strong binders for specific protein targets. It includes 8 unseen assays for different protein targets released by Merck, with known protein holo structures and experimentally determined binding strengths for 264 ligands for which only unbound structures are provided (paper). The task is to rank the ligands by binding strength within each assay. Performance is measured as the average Spearman correlation between predicted and experimental affinity values within the assays. Our method achieves 0.47 on this benchmark, compared with 0.35 to 0.39 for other high-throughput methods (paper), for example TankBind, which achieves 0.38 (paper, code).
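As an illustration of the metric, the following minimal sketch computes a Spearman correlation separately for each assay and then averages across assays. It is not the benchmark's official evaluation script. The record layout `(assay_id, predicted, experimental)` and the toy values are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_per_assay_spearman(records):
    """records: iterable of (assay_id, predicted, experimental) tuples."""
    by_assay = defaultdict(lambda: ([], []))
    for assay_id, predicted, experimental in records:
        by_assay[assay_id][0].append(predicted)
        by_assay[assay_id][1].append(experimental)
    return mean(spearman(p, e) for p, e in by_assay.values())


# Toy data: assay "A" is ranked perfectly, assay "B" is partly inverted.
toy = [
    ("A", 0.1, 5.0), ("A", 0.4, 6.0), ("A", 0.9, 7.5),
    ("B", 0.2, 6.0), ("B", 0.8, 5.0), ("B", 0.5, 7.0),
]
print(mean_per_assay_spearman(toy))  # prints 0.25
```

Averaging per-assay correlations, rather than pooling all ligands into one correlation, means a method gets no credit for merely telling targets apart. Only the ordering of ligands within each assay counts.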


This benchmark is more challenging and realistic than the frequently used PDBBind benchmark (link) for two reasons: (1) it does not assume that the crystallised structures of the protein–ligand complexes are known, mirroring actual HTVS conditions, and (2) it evaluates the ability to rank ligands within each assay separately, aligning with the goal of identifying strong binders for one protein target per assay. The difficulty of this benchmark is evident: PLAPT (paper, code), the state-of-the-art method on PDBBind, achieves an average Spearman correlation of only 0.20 on it, compared with 0.47 for our method.


The AI-driven virtual screening system

Figure 2. Virtual screening system design.


Figure 2 outlines the system, which takes as input a protein with a 3D holo structure and an unbound ligand structure. The process has four steps:

  1. The most suitable high-quality holo protein structure is selected from the PDB, and the ligand is sanitised and preprocessed.

  2. DiffDock (paper, code) docks the ligand into the protein, generating 16 possible docked poses with different random seeds.

  3. A confidence subsystem filters out non-physical poses with steric clashes and selects the most plausible remaining one.

  4. A graph neural network inspired by PAMNet (paper, code) assesses binding strength by analysing the bonds present and how well the ligand fits into the protein pocket. This model is trained on a highly curated dataset, discussed further in the next section.

The system outputs an estimated affinity value range, allowing molecules to be ranked and potential binders separated from non-binders.
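Step 3 of the process above can be sketched as a filter followed by a selection. The source does not describe how the confidence subsystem detects clashes or scores plausibility. The distance cutoff, the `Pose` class, and its `confidence` field below are therefore hypothetical stand-ins, not the actual implementation.

```python
from dataclasses import dataclass
from math import dist
from typing import Optional, Sequence, Tuple

Coord = Tuple[float, float, float]

# Assumed clash threshold in angstroms; not taken from the source.
CLASH_CUTOFF = 2.0


@dataclass
class Pose:
    ligand_coords: Sequence[Coord]  # heavy-atom positions of the docked ligand
    confidence: float               # hypothetical plausibility score, higher is better


def has_steric_clash(pose: Pose, protein_coords: Sequence[Coord],
                     cutoff: float = CLASH_CUTOFF) -> bool:
    """True if any ligand atom lies closer than `cutoff` to any protein atom."""
    return any(
        dist(lig, prot) < cutoff
        for lig in pose.ligand_coords
        for prot in protein_coords
    )


def select_pose(poses: Sequence[Pose], protein_coords: Sequence[Coord]) -> Optional[Pose]:
    """Drop clashing poses, then return the most confident survivor (or None)."""
    clash_free = [p for p in poses if not has_steric_clash(p, protein_coords)]
    if not clash_free:
        return None
    return max(clash_free, key=lambda p: p.confidence)


# Toy example: the most confident pose clashes, so the next best one is chosen.
protein = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
poses = [
    Pose([(0.5, 0.0, 0.0)], confidence=0.9),  # 0.5 A from a protein atom: clash
    Pose([(2.0, 2.0, 0.0)], confidence=0.4),
    Pose([(2.0, 3.0, 0.0)], confidence=0.7),
]
print(select_pose(poses, protein).confidence)  # prints 0.7
```

Returning `None` when every pose clashes makes the failure explicit, so a caller can decide whether to re-dock or skip the compound. The all-pairs distance check is quadratic and would need a spatial index for realistic protein sizes.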


The training data for the affinity model

To help the model handle diverse molecular configurations, a large dataset of over 500,000 docked protein–ligand structures was created using PDB structures, ChEMBL assay data, and DiffDock. This dataset far surpasses the 19,000 structures in the PDBBind dataset, which is frequently used to train protein–ligand binding affinity models.


Several data filtering and cleaning strategies were employed to ensure high-quality data. The most important noise sources addressed were:

  1. ChEMBL batch effect: Variation between results from similar assays performed in different labs is a large, well-known source of noise in ChEMBL. This was mitigated by framing the learning task as ranking within assays rather than regression on individual data points, reducing distortions in the training signal from inconsistent data (inspired by this paper).

  2. PDB data quality issues: Structures with mutations unaligned with the ChEMBL data and other data quality issues were identified and removed.

  3. Protein pocket conformation: The most suitable PDB structure for each ligand was selected to ensure the protein pocket matched the ligand structure, in line with the induced-fit model of protein–ligand binding.

  4. Steric clashes in docked structures: DiffDock-generated poses containing steric clashes were filtered out wherever possible to prevent misleading the affinity model.

These steps were critical to creating a reliable dataset for effective model training.
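The ranking formulation in item 1 can be illustrated with a pairwise loss computed inside a single assay. The source does not state which ranking loss was used. The RankNet-style logistic loss below is an assumption chosen for illustration, and the function name and toy values are hypothetical.

```python
from itertools import combinations
from math import exp, log1p


def pairwise_ranking_loss(predictions, affinities):
    """Mean logistic loss over ligand pairs from ONE assay.

    Only the ordering of `affinities` is used, so a constant per-assay
    offset (one simple form of batch effect) does not change the loss.
    """
    losses = []
    for i, j in combinations(range(len(affinities)), 2):
        if affinities[i] == affinities[j]:
            continue  # tied measurements carry no ordering information
        sign = 1.0 if affinities[i] > affinities[j] else -1.0
        margin = sign * (predictions[i] - predictions[j])
        # Numerically stable form of log(1 + exp(-margin)).
        losses.append(max(-margin, 0.0) + log1p(exp(-abs(margin))))
    return sum(losses) / len(losses) if losses else 0.0


# Toy assay: predictions are in the right order.
preds = [0.2, 1.0, 2.5]
measured = [5.0, 6.0, 7.5]
shifted = [a + 0.7 for a in measured]  # same assay with a lab-specific offset

print(pairwise_ranking_loss(preds, measured) == pairwise_ranking_loss(preds, shifted))  # prints True
print(pairwise_ranking_loss(preds, measured) < pairwise_ranking_loss(preds[::-1], measured))  # prints True
```

A regression loss on the raw values would penalise the model for the 0.7 offset even though the ligand ordering is identical. A within-assay ranking loss ignores it, which is the motivation given above for framing the task as ranking.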


Learn more

  • To learn more about what protein targets the system has been applied to, visit this page.

  • Contact us if you want to discuss how we can help you or if you have any questions!

  • Stay tuned for upcoming blogs, case studies, and papers with deeper details. Let us know what topics you'd like us to cover.
