Restore Missing Atoms In PDB File Using Modeller

Updated:

What is comparative modelling

There are only 20 amino acids that exist in nature. The chemical structure of each amino acid is well known. However, predicting the structure of proteins, which are made up of hundreds of these amino acids linked together, is very challenging. This is because the structure can vary greatly depending on how the connections between amino acids fold, and protein structures can change through interactions with other substances. This is known as the Protein Folding problem. In fact, Google’s DeepMind’s AlphaFold, which predicted the structures of 2 million human proteins using neural networks in July 2021, gets its name from this Protein Folding problem.

When AlphaFold was first released, there were expectations that most protein-related problems would be solved. However, over time it has become clear that AlphaFold’s accuracy is not that high. While it predicts core structures well, it does not predict the detailed positions of secondary structures very accurately. Of course, since secondary structures can easily change, it may have been difficult to use them as grading criteria in “correct answer” tests. However, many biochemical properties of proteins are determined by these secondary structures, which leaves room for improvement.

The Protein Folding problem that even DeepMind, considered the world’s best at solving complex problems using neural networks, couldn’t fully solve - does this mean previous scientists couldn’t predict protein structures at all? That’s not the case. Since the atomic composition and bonds are determined once the protein sequence is set, structures can be predicted by finding energy-minimizing configurations using molecular dynamics simulations. However, this approach has the drawback of being very time-consuming and complex due to systems containing hundreds to thousands of atoms.

This is where comparative modeling comes in as a simpler method. Even if overall protein sequences differ, since they are made up of only 20 amino acids, different proteins often share partial sequences. These shared sequence portions must have similar actual structures. Therefore, it’s possible to predict the 3D structure of a desired amino acid sequence piece by piece using predetermined protein structures as templates. This comparative prediction method is called comparative modeling.

In this post, we’ll follow a tutorial on predicting protein 3D structures using Modeller, a leading program that provides comparative modeling functionality. While the tutorial explains how to search for possible structures to predict TvLDH’s structure and select the one showing highest similarity by trying several simultaneously, I will use Modeller slightly differently - to restore the complete structure of the protein Cyclin-dependent kinase 2 (Uniprot id: P24941) using its APO structure 5IF1 as a template. (While the APO structure itself is structural data captured from the protein, since it’s experimentally measured, it’s missing many elements and amino acids, which is why we need this process.)

Basic Tutorial

Prepare target sequence

First, we need to create the desired amino acid sequence in PIR format. 5IF1 is a protein with multiple chains, but let’s restore just chain A.

>P1;5IF1
sequence:5IF1:::::::0.00: 0.00
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLL
DVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLL
INTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRA
LFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLH
YDPNKRISAKAALAHPFFQDVTKPVPHLRL*

The first line is the identifier of the amino acid sequence.

>P1;5IF1

The next line contains various metadata about the amino acid sequence, which is omitted here.

sequence:5IF1:::::::0.00: 0.00

An example of all information being included is as follows.

structureX:5fd1:1    :A:106  :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19

The first column is information about the measurement method of the amino acid sequence, and the subsequent columns are PDB id, the start and end numbers of the portion to be used in the sequence, chain, protein name, measurement resolution, and R-factor value, respectively.

Finally, the amino acid sequence is written, which is the result of conversion using the following rules.

GLY: G  ALA: A  LEU: L  MET: M  PHE: F
TRP: W  LYS: K  GLN: Q  GLU: E  SER: S
PRO: P  VAL: V  ILE: I  CYS: C  TYR: Y
HIS: H  ARG: R  ASN: N  ASP: D  THR: T

Alignment

The tutorial explains how to use the Profile.build() function to find candidate protein structures to use as templates, and the malign3d function to select the final template from among the candidates. However, since we are using already existing protein data to restore a more precise protein structure, we can simply use the given protein data as a template without going through this process. However, since the apo data used as a template may have missing residues or irregular numbering, we cannot proceed directly to model building.

Model building requires an alignment file containing sequences of both the template and target, and the code below generates this file.

from modeller import *

env = Environ()
aln = Alignment(env)

# 원하는 template 이름과 사용할 서열 정보
mdl = Model(env, file='5if1', model_segment=('FIRST:A','LAST:A'))

# apo 구조의 pdb 파일
aln.append_model(mdl, align_codes='5if1A', atom_files='5if1.pdb')

# Target sequence alignement file
aln.append(file='5if1.ali', align_codes='5if1')

# 가능한 gap의 최대 길이
aln.align2d(max_gap_length=50)

# Output
aln.write(file='5if1-ideal-apo.ali', alignment_format='PIR')
aln.write(file='5if1-ideal-apo.pap', alignment_format='PAP')

We need to change the pdb code of the protein we want to create (here, 5if1), the pdb file path (5if1.pdb), and the target sequence alignment file (5if1.ali). max_gap_length is the maximum length that can be skipped when aligning the pdb file and the target sequence. Here, we set it to 50 as in the tutorial.

Model building

Now, we only need to create the 3D structure of the target sequence using the given template. This is implemented as follows.

from modeller import *
from modeller.automodel import *

env = Environ()

a = AutoModel(env, alnfile='5if1-ideal-apo.ali',
              knowns='5if1A', sequence='5if1',
              assess_methods=(assess.DOPE,
                              assess.GA341))
a.starting_model = 1
a.ending_model = 5
a.make()

alnfile should be the alignment file between the template and target sequence created earlier, knowns is the name of the template, and sequence is the name of the target sequence. assess_methods is a box for inputting the method for evaluating the 3D structure prediction result, which is used to compare the results of the models and select the final result. This part is the same as the DOPE and GA341 used in the tutorial.

Model evaluation

Since we are using 5 models, the results will be for 5 models.

>> Summary of successfully produced models:
Filename                          molpdf     DOPE score    GA341 score
----------------------------------------------------------------------
5if1.B99990001.pdb            1650.57690   -36916.17969        1.00000
5if1.B99990002.pdb            1680.34778   -36939.61719        1.00000
5if1.B99990003.pdb            1609.59802   -36872.38672        1.00000
5if1.B99990004.pdb            1679.17847   -36643.48828        1.00000
5if1.B99990005.pdb            1546.36304   -37145.25781        1.00000

We can select one of them, but there are several ways to do this. We can simply select the model with the lowest DOPE score, or the model with the highest GA341 score. Here, since the GA341 score is all 1, we can select the 5th model with the lowest DOPE score.

Comparison between before and after modelling

Let’s check how much information has been added through this restoration. Of course, since 5if1 was originally data without missing residues, not much information was added. We checked whether the missing parts were restored at the atom level.

Residue 39 of Threonine should be composed of N, C, CA, CB, CG2, O, OG1 in the original pdb file. However, only N, CA, C, O, and CB were present in the 5if1 pdb file. We were able to confirm that this part was restored well.

# 5if1.pdb
ATOM    312  N   THR A  39     -72.852 -28.579  12.797  1.00161.25           N
ATOM    313  CA  THR A  39     -71.986 -29.787  12.911  1.00133.43           C
ATOM    314  C   THR A  39     -71.912 -30.593  14.234  1.00134.00           C
ATOM    315  O   THR A  39     -71.735 -31.807  14.102  1.00101.58           O
ATOM    316  CB  THR A  39     -70.584 -29.588  12.336  1.00114.37           C

# 5if1.B99990005.pdb
ATOM    304  N   THR A  39     -75.235 -28.513  13.811  1.00275.90           N
ATOM    305  CA  THR A  39     -74.457 -29.710  14.008  1.00275.90           C
ATOM    306  CB  THR A  39     -73.265 -29.832  13.091  1.00275.90           C
ATOM    307  OG1 THR A  39     -72.823 -31.182  13.052  1.00275.90           O
ATOM    308  CG2 THR A  39     -72.120 -28.947  13.587  1.00275.90           C
ATOM    309  C   THR A  39     -74.025 -30.133  15.389  1.00275.90           C
ATOM    310  O   THR A  39     -73.631 -31.294  15.497  1.00275.90           O