Paper review: PIGNet

Updated: October 21, 2022

Paper information

Seokhyun Moon, Wonho Zhung, Soojung Yang, Jaechang Lim, and Woo Youn Kim. “PIGNet: a physics-informed deep learning model toward generalized drug–target interaction predictions”. Chem. Sci., 2022, 13, 3661.

Abstract

Recently, deep neural network (DNN)-based drug–target interaction (DTI) models were highlighted for their high accuracy with affordable computational costs. Yet, the models’ insufficient generalization remains a challenging problem in the practice of in silico drug discovery. We propose two key strategies to enhance generalization in the DTI model. The first is to predict the atom–atom pairwise interactions via physics-informed equations parameterized with neural networks and provides the total binding affinity of a protein–ligand complex as their sum. We further improved the model generalization by augmenting a broader range of binding poses and ligands to training data. We validated our model, PIGNet, in the comparative assessment of scoring functions (CASF) 2016, demonstrating the outperforming docking and screening powers than previous methods. Our physics-informing strategy also enables the interpretation of predicted affinities by visualizing the contribution of ligand substructures, providing insights for further ligand optimization.

Purpose

In biology and pharmacology, DNNs have not performed as well as they have in other fields, owing to the nature of the domain and its biased data distributions. Predicting interactions between drugs and proteins is crucial for filtering drug candidates, yet it remains a challenging problem.

Before DNNs, the following approaches were used in this area.

Docking method: The most commonly used method. It is computationally cheap, but its accuracy is low.
Thermodynamic integration: More rigorous, but computationally expensive.

Physics-based approaches thus have a clear tradeoff between computational cost and accuracy. Data-driven approaches, by contrast, can improve accuracy at almost the same computational cost by using more data, so they are seen as a way to overcome this limitation.

The spatial arrangement of the drug and the protein plays a very important role in their interaction, so 3D CNN models, GNNs, atomic environment vectors, and similar representations have been studied.

However, because data are scarce, these models have not generalized well, and overfitting to the training data has been a common problem. Chen et al.¹ reported that, when trained on the DUD-E dataset, a model using both receptor and ligand information and a model using only ligand information performed very similarly. This suggests that the model was making its predictions from ligand information alone, which is why performance dropped sharply for proteins that differ greatly from the training structures.

This paper proposes two key strategies to enhance the generalization of DTI models. The first is to predict the atom–atom pairwise interactions via physics-informed equations parameterized with neural networks, and to provide the total binding affinity of a protein–ligand complex as their sum. The second is to augment the training data with a broader range of binding poses and ligands.

Contribution

PIGNet

PIGNet is a physics-informed graph neural network that predicts the binding affinity of a protein–ligand complex by summing van der Waals, hydrogen bond, metal–ligand, and hydrophobic interaction terms. Because the terms are computed separately, the model can be used to find the regions that contribute most to binding, or to suggest directions for improving affinity.

Data augmentation

In real use, the exact crystal structure of the drug–protein complex is not known, so the model must cope with imperfect input structures. Training data containing such errors, however, hardly exist. In this paper, docked structures are used as a contrast set so that the model learns to assign the crystal structure a lower energy, and additional training is performed so that arbitrary non-binding compounds are filtered out well.

Epistemic uncertainty

Because the model is imperfectly trained, it may produce many false positives. PIGNet uses Monte Carlo dropout to estimate the uncertainty of each prediction and averages the sampled results to reduce false positives.

Network characteristics

A 3D CNN cannot explicitly represent interatomic interactions, so PIGNet uses a GNN, which can encode both interaction and position information and is invariant to translation and rotation.

Greydanus et al.² and Pun et al.³ showed that building physical priors into neural networks, as in Hamiltonian neural networks, improves generalization when predicting physical systems. Following this idea, PIGNet predicts the interactions between atoms directly with physics-informed neural networks.

Input / Output

The network takes the protein–ligand graph $G$ as input. A graph is usually defined by its node features and a single adjacency matrix, but this model splits the adjacency matrix in two: one holds covalent bond information, and the other holds distance-based interaction information. An entry of the second matrix is 1 only if the interatomic distance is between 0.5 Å and 5.0 Å and 0 otherwise, so that pairs that are too close or too far apart are ignored.
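The two adjacency matrices can be sketched in a few lines of plain Python. This is only an illustration: the function names, the toy coordinates, and the choice to apply the distance rule to every atom pair (rather than, say, only protein–ligand pairs) are assumptions of this sketch, not details taken from the paper.

```python
import math

def covalent_adjacency(n_atoms, bonds):
    """Symmetric 0/1 matrix with a 1 for every covalently bonded atom pair."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1
    return adj

def interaction_adjacency(coords, d_min=0.5, d_max=5.0):
    """0/1 matrix with a 1 where the interatomic distance lies in [d_min, d_max] (in Å)."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and d_min <= math.dist(coords[i], coords[j]) <= d_max:
                adj[i][j] = 1
    return adj

# Toy system: four atoms on a line (coordinates in Å), one covalent bond.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (4.0, 0.0, 0.0), (9.5, 0.0, 0.0)]
A_cov = covalent_adjacency(len(coords), bonds=[(0, 1)])
A_int = interaction_adjacency(coords)
```

Here the last atom is more than 5.0 Å from every other atom, so its row and column in `A_int` stay zero.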

The network output is the predicted binding free energy, given by the sum of the four interaction terms divided by the rotor penalty. The prediction is the average of 30 stochastic forward passes with a dropout rate of 0.1, and the spread of those passes serves as the epistemic uncertainty.
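The averaging step can be illustrated with a minimal Monte Carlo dropout sketch. A toy linear model stands in for PIGNet here, and `dropout_forward`, `mc_dropout_predict`, the feature values, and the use of the variance as the uncertainty measure are all assumptions made for illustration. Only the 30 passes and the 0.1 dropout rate come from the text above.

```python
import random
import statistics

def dropout_forward(features, weights, rate, rng):
    """One stochastic pass of a toy linear model with inverted dropout on the inputs."""
    keep = 1.0 - rate
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:       # keep this input with probability `keep`
            total += (x / keep) * w   # rescale so the expected output is unchanged
    return total

def mc_dropout_predict(features, weights, n_samples=30, rate=0.1, seed=0):
    """Return (mean prediction, variance across passes) over n_samples dropout passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(features, weights, rate, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pvariance(samples)

# Hypothetical features and weights for a single complex.
mean_energy, uncertainty = mc_dropout_predict([1.0, 2.0, 3.0], [-0.5, -1.0, -0.2])
```

A prediction with a large variance across passes is one the model is unsure about, which is the signal used to flag likely false positives.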

Network Architecture

The network learns the parameters of each interaction.

Gated graph attention networks
Interaction networks
Force computing
- van der Waals
- Hydrogen bonding
- Metal bonding
- Hydrophobic interaction
- Rotor penalty: A penalty for the entropy lost when rotatable bonds of the ligand can no longer rotate freely after binding.
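How the pieces combine into one number can be sketched as follows. Each of the four interaction terms listed above is treated as a list of atom–atom pairwise energies, and their total is divided by the rotor penalty. The form `1 + c_rotor * n_rotatable_bonds` for the penalty, the value of `c_rotor`, and all numbers are assumptions of this sketch.

```python
def total_binding_energy(pairwise_terms, n_rotatable_bonds, c_rotor=0.1):
    """Sum all pairwise energies of all interaction terms, then divide by the rotor penalty.

    pairwise_terms: one list of atom-atom pairwise energies per interaction term.
    The penalty form 1 + c_rotor * n_rotatable_bonds is an assumed, Vina-style choice.
    """
    interaction_sum = sum(sum(pair_energies) for pair_energies in pairwise_terms)
    rotor_penalty = 1.0 + c_rotor * n_rotatable_bonds
    return interaction_sum / rotor_penalty

# Hypothetical pairwise energies for the four interaction terms, in the order listed above.
terms = [[-1.2, -0.8], [-2.0], [0.0], [-1.0, -1.0]]
energy = total_binding_energy(terms, n_rotatable_bonds=5, c_rotor=0.1)
```

With these numbers the interaction terms sum to -6.0 and the penalty is 1.5, so a flexible ligand ends up with a weaker predicted affinity than a rigid one with the same contacts.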

Training

PIGNet’s training is a bit unusual. The model is trained by minimizing a weighted sum of three types of loss.

$L_{energy}$: Penalizes the difference between the predicted energy and the experimental binding energy.
$L_{derivative}$: Energy values alone are not enough to identify stable binding structures, so this loss constrains the shape of the energy function around the true pose. It combines the square of the first derivative of the energy with respect to atomic position, which should vanish at a minimum, with the second derivative, which should be large so that the minimum is sharp.
$L_{augmentation}$: Few “correct answers” are available, and experimental values for cases where binding does not occur are even scarcer. In real use, however, the input may be a compound that does not bind, or a structure far from the correct one. Three losses address such cases.
- Docking: Trains the model so that the experimental structure is more stable, that is, has lower energy, than docking-generated poses.
- Random screening: Prevents the predicted binding energy from accidentally becoming low for random compounds taken from the IBS molecule library.
- Cross screening: Encodes the assumption that a compound bound to a specific pocket in PDBbind does not bind to the pockets of other proteins.
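The idea behind $L_{derivative}$ can be illustrated with a one-dimensional toy example. The form used below, the squared first derivative minus a capped second derivative, is my reading of the description above and should be treated as an assumption. The harmonic `toy_energy`, the cap value, and the finite differences (a stand-in for automatic differentiation) are likewise only illustrative.

```python
def toy_energy(q, q0=3.5, depth=1.0):
    """Hypothetical 1D energy: a harmonic well with its minimum at q0."""
    return depth * (q - q0) ** 2 - depth

def derivative_loss(energy_fn, q, h=1e-3, second_derivative_cap=20.0):
    """Squared first derivative minus capped second derivative, via central differences."""
    first = (energy_fn(q + h) - energy_fn(q - h)) / (2 * h)
    second = (energy_fn(q + h) - 2 * energy_fn(q) + energy_fn(q - h)) / (h * h)
    return first ** 2 - min(second, second_derivative_cap)

loss_at_minimum = derivative_loss(toy_energy, 3.5)  # slope is zero here
loss_off_minimum = derivative_loss(toy_energy, 4.0)  # nonzero slope is penalized
```

The loss is lower when the input position sits at a minimum of the energy, and it would drop further if the well were made steeper, up to the cap.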

The weights for each loss function are given as 1, 10.0, 0.0, 5.0, 5.0 (in order of energy, derivative, docking, random screening, cross screening).
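As a small sketch, the weighted sum with the weights quoted above looks like this. The dictionary keys, the function name, and the individual loss values are hypothetical. Only the five weights come from the text.

```python
# Weights as quoted above, in the order energy, derivative, docking,
# random screening, cross screening.
LOSS_WEIGHTS = {
    "energy": 1.0,
    "derivative": 10.0,
    "docking": 0.0,
    "random_screening": 5.0,
    "cross_screening": 5.0,
}

def total_loss(losses, weights=LOSS_WEIGHTS):
    """Weighted sum of the individual loss values; every weighted term must be present."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)

# Hypothetical per-batch loss values.
batch_losses = {
    "energy": 1.5,
    "derivative": 0.2,
    "docking": 0.7,
    "random_screening": 0.1,
    "cross_screening": 0.3,
}
loss = total_loss(batch_losses)
```

Note that with a weight of 0.0 the docking term does not contribute to the total in this configuration.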

Results and Discussion

Model performance / Generalization ability

The model was tested on the CASF-2016 benchmark using four criteria:

Scoring: Linear correlation between predicted and measured binding affinities. Metric: Pearson’s correlation coefficient $R$.
Ranking: How well the predicted affinities reproduce the ranking of the measured affinities. Metric: Spearman’s correlation coefficient $\rho$.
Docking: How well the model picks out the true binding pose from a set mixed with incorrect poses. Metric: success rate within the top $N$ candidates.
Screening: How well the model finds ligands that bind to the target protein among random molecules. Metrics: success rate and enhancement factor within the top $\alpha$ percent of candidates.
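Three of these metrics are simple enough to write out directly. The sketch below is a plain-Python illustration, not the CASF-2016 evaluation code. It assumes that a lower score means a stronger predicted binder, and it defines the enhancement factor as the number of true binders found in the top fraction divided by the number expected from random selection. All function names and data are made up.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))

def enhancement_factor(scores, is_binder, top_fraction):
    """True binders in the top fraction (lowest score first) relative to random picking."""
    n_top = max(1, round(len(scores) * top_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    hits = sum(is_binder[i] for i in ranked[:n_top])
    return hits / (sum(is_binder) * top_fraction)

# Hypothetical screening result: 10 molecules, 2 true binders.
scores = [-9.0, -3.0, -8.0, -2.0, -1.0, -4.0, -2.5, -3.5, -1.5, -0.5]
is_binder = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ef_top20 = enhancement_factor(scores, is_binder, top_fraction=0.2)
```

In this toy case both binders are ranked in the top 20 percent, so the enhancement factor reaches its maximum possible value for that fraction.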

Several criteria are used because, in the CASF-2016 benchmark, even models with high scoring power often had low screening or docking power. Docking and screening power were therefore used as indicators of the model’s generalization ability.

[Figure: Model performance]

Effect of the physics-informed parametrized function

Apart from its physics-informed parametric equations, PIGNet is structurally identical to a 3D GNN-based model. The effectiveness of this strategy can therefore be gauged by comparing its performance with that of other models.

Effect of the DTI-adapted data augmentation strategy

Another key feature of PIGNet is data augmentation. While the PDBbind dataset is an excellent collection of both binding structures and binding affinities, it has often been criticized for an inherent bias: it lacks chemical diversity among ligands and provides binding structures only at the energy minimum.¹ PIGNet therefore adds docked poses and non-binding molecules to the training data to cover a wider chemical space.

[Figure: Parametric equation]

Virtual screening

For practical use in DTI prediction, screening must work well even for previously unseen target proteins and ligands. PIGNet was benchmarked against an existing docking method (SMINA) and against deep learning-based methods.

[Figure: Screening power]

Interpretation: dominant contributing substructure

Because PIGNet computes each energy contribution separately, it can be used to find the substructures that contribute most to the binding energy. This can be valuable information in drug design.
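One way to picture this interpretation step: if the model exposes an energy for each ligand-atom/protein-atom pair, those energies can be summed per ligand atom and then per substructure. The sketch below assumes exactly that. The data layout, the function names, the atom indices, and the substructure definitions are all hypothetical.

```python
def ligand_atom_contributions(pair_energies, n_ligand_atoms):
    """Sum pairwise energies over protein atoms to get one value per ligand atom.

    pair_energies maps (ligand_atom_index, protein_atom_index) -> energy.
    """
    contributions = [0.0] * n_ligand_atoms
    for (ligand_atom, _protein_atom), energy in pair_energies.items():
        contributions[ligand_atom] += energy
    return contributions

def substructure_contributions(atom_contributions, substructures):
    """Sum per-atom contributions over the atoms of each named substructure."""
    return {
        name: sum(atom_contributions[i] for i in atoms)
        for name, atoms in substructures.items()
    }

# Hypothetical pairwise energies for a ligand with four atoms.
pair_energies = {
    (0, 10): -0.5,
    (0, 11): -0.25,
    (1, 10): -2.0,
    (2, 12): -0.25,
    (3, 12): 0.5,
}
per_atom = ligand_atom_contributions(pair_energies, n_ligand_atoms=4)
per_group = substructure_contributions(per_atom, {"ring": [0, 1], "tail": [2, 3]})

# The most negative total marks the substructure that stabilizes binding the most.
dominant = min(per_group, key=per_group.get)
```

In this toy ligand the ring accounts for essentially all of the favorable energy, which would suggest keeping the ring and modifying the tail during optimization.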

[Figure: Interpretation]

Epistemic uncertainty

Because the model is imperfectly trained, it may produce many false positives. PIGNet tries to reduce them by estimating uncertainty with Monte Carlo dropout and averaging the multiple predictions obtained this way.

[Figure: Epistemic uncertainty]

Reference

1. Chen et al., “Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening”, PLoS ONE, 2019, 14, e0220113.
2. S. Greydanus et al., “Hamiltonian Neural Networks”, NeurIPS, 2019, vol. 32, pp. 15379–15389.
3. G. P. Pun et al., “Physically informed artificial neural networks for atomistic modeling of materials”, Nat. Commun., 2019, 10, 2339.


Myeongseon Choi
