Restore Missing Atoms In PDB File Using Modeller
What is comparative modelling
There are only 20 amino acids that exist in nature. The chemical structure of each amino acid is well known. However, predicting the structure of proteins, which are made up of hundreds of these amino acids linked together, is very challenging. This is because the structure can vary greatly depending on how the connections between amino acids fold, and protein structures can change through interactions with other substances. This is known as the Protein Folding problem. In fact, Google’s DeepMind’s AlphaFold, which predicted the structures of 2 million human proteins using neural networks in July 2021, gets its name from this Protein Folding problem.
When AlphaFold was first released, there were expectations that most protein-related problems would be solved. However, over time it has become clear that AlphaFold’s accuracy is not that high. While it predicts core structures well, it does not predict the detailed positions of secondary structures very accurately. Of course, since secondary structures can easily change, it may have been difficult to use them as grading criteria in “correct answer” tests. However, many biochemical properties of proteins are determined by these secondary structures, which leaves room for improvement.
The Protein Folding problem that even DeepMind, considered the world’s best at solving complex problems using neural networks, couldn’t fully solve - does this mean previous scientists couldn’t predict protein structures at all? That’s not the case. Since the atomic composition and bonds are determined once the protein sequence is set, structures can be predicted by finding energy-minimizing configurations using molecular dynamics simulations. However, this approach has the drawback of being very time-consuming and complex due to systems containing hundreds to thousands of atoms.
This is where comparative modeling comes in as a simpler method. Even if overall protein sequences differ, since they are made up of only 20 amino acids, different proteins often share partial sequences. These shared sequence portions must have similar actual structures. Therefore, it’s possible to predict the 3D structure of a desired amino acid sequence piece by piece using predetermined protein structures as templates. This comparative prediction method is called comparative modeling.
In this post, we’ll follow a tutorial on predicting protein 3D structures using Modeller, a leading program that provides comparative modeling functionality. While the tutorial explains how to search for possible structures to predict TvLDH’s structure and select the one showing highest similarity by trying several simultaneously, I will use Modeller slightly differently - to restore the complete structure of the protein Cyclin-dependent kinase 2 (Uniprot id: P24941) using its APO structure 5IF1 as a template. (While the APO structure itself is structural data captured from the protein, since it’s experimentally measured, it’s missing many elements and amino acids, which is why we need this process.)
Basic Tutorial
Prepare target sequence
First, we need to create the desired amino acid sequence in PIR format.
5IF1 is a protein with multiple chains, but let’s restore just chain A.
>P1;5IF1
sequence:5IF1:::::::0.00: 0.00
MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKIRLDTETEGVPSTAIREISLLKELNHPNIVKLL
DVIHTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKPQNLL
INTEGAIKLADFGLARAFGVPVRTYTHEVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRA
LFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQDFSKVVPPLDEDGRSLLSQMLH
YDPNKRISAKAALAHPFFQDVTKPVPHLRL*
The first line is the identifier of the amino acid sequence.
>P1;5IF1
The next line contains various metadata about the amino acid sequence, which is omitted here.
sequence:5IF1:::::::0.00: 0.00
An example of all information being included is as follows.
structureX:5fd1:1 :A:106 :A:ferredoxin:Azotobacter vinelandii: 1.90: 0.19
The first column is information about the measurement method of the amino acid sequence, and the subsequent columns are PDB id, the start and end numbers of the portion to be used in the sequence, chain, protein name, measurement resolution, and R-factor value, respectively.
Finally, the amino acid sequence is written, which is the result of conversion using the following rules.
GLY: G ALA: A LEU: L MET: M PHE: F
TRP: W LYS: K GLN: Q GLU: E SER: S
PRO: P VAL: V ILE: I CYS: C TYR: Y
HIS: H ARG: R ASN: N ASP: D THR: T
Alignment
The tutorial explains how to use the Profile.build() function to find candidate protein structures to use as templates,
and the malign3d function to select the final template from among the candidates.
However, since we are using already existing protein data to restore a more precise protein structure,
we can simply use the given protein data as a template without going through this process.
However, since the apo data used as a template may have missing residues or irregular numbering, we cannot proceed directly to model building.
Model building requires an alignment file containing sequences of both the template and target, and the code below generates this file.
from modeller import *
env = Environ()
aln = Alignment(env)
# 원하는 template 이름과 사용할 서열 정보
mdl = Model(env, file='5if1', model_segment=('FIRST:A','LAST:A'))
# apo 구조의 pdb 파일
aln.append_model(mdl, align_codes='5if1A', atom_files='5if1.pdb')
# Target sequence alignement file
aln.append(file='5if1.ali', align_codes='5if1')
# 가능한 gap의 최대 길이
aln.align2d(max_gap_length=50)
# Output
aln.write(file='5if1-ideal-apo.ali', alignment_format='PIR')
aln.write(file='5if1-ideal-apo.pap', alignment_format='PAP')
We need to change the pdb code of the protein we want to create (here, 5if1), the pdb file path (5if1.pdb), and the target sequence alignment file (5if1.ali).
max_gap_length is the maximum length that can be skipped when aligning the pdb file and the target sequence.
Here, we set it to 50 as in the tutorial.
Model building
Now, we only need to create the 3D structure of the target sequence using the given template. This is implemented as follows.
from modeller import *
from modeller.automodel import *
env = Environ()
a = AutoModel(env, alnfile='5if1-ideal-apo.ali',
knowns='5if1A', sequence='5if1',
assess_methods=(assess.DOPE,
assess.GA341))
a.starting_model = 1
a.ending_model = 5
a.make()
alnfile should be the alignment file between the template and target sequence created earlier,
knowns is the name of the template, and sequence is the name of the target sequence.
assess_methods is a box for inputting the method for evaluating the 3D structure prediction result,
which is used to compare the results of the models and select the final result.
This part is the same as the DOPE and GA341 used in the tutorial.
Model evaluation
Since we are using 5 models, the results will be for 5 models.
>> Summary of successfully produced models:
Filename molpdf DOPE score GA341 score
----------------------------------------------------------------------
5if1.B99990001.pdb 1650.57690 -36916.17969 1.00000
5if1.B99990002.pdb 1680.34778 -36939.61719 1.00000
5if1.B99990003.pdb 1609.59802 -36872.38672 1.00000
5if1.B99990004.pdb 1679.17847 -36643.48828 1.00000
5if1.B99990005.pdb 1546.36304 -37145.25781 1.00000
We can select one of them, but there are several ways to do this. We can simply select the model with the lowest DOPE score, or the model with the highest GA341 score. Here, since the GA341 score is all 1, we can select the 5th model with the lowest DOPE score.
Comparison between before and after modelling
Let’s check how much information has been added through this restoration.
Of course, since 5if1 was originally data without missing residues, not much information was added.
We checked whether the missing parts were restored at the atom level.
Residue 39 of Threonine should be composed of N, C, CA, CB, CG2, O, OG1 in the original pdb file.
However, only N, CA, C, O, and CB were present in the 5if1 pdb file.
We were able to confirm that this part was restored well.
# 5if1.pdb
ATOM 312 N THR A 39 -72.852 -28.579 12.797 1.00161.25 N
ATOM 313 CA THR A 39 -71.986 -29.787 12.911 1.00133.43 C
ATOM 314 C THR A 39 -71.912 -30.593 14.234 1.00134.00 C
ATOM 315 O THR A 39 -71.735 -31.807 14.102 1.00101.58 O
ATOM 316 CB THR A 39 -70.584 -29.588 12.336 1.00114.37 C
# 5if1.B99990005.pdb
ATOM 304 N THR A 39 -75.235 -28.513 13.811 1.00275.90 N
ATOM 305 CA THR A 39 -74.457 -29.710 14.008 1.00275.90 C
ATOM 306 CB THR A 39 -73.265 -29.832 13.091 1.00275.90 C
ATOM 307 OG1 THR A 39 -72.823 -31.182 13.052 1.00275.90 O
ATOM 308 CG2 THR A 39 -72.120 -28.947 13.587 1.00275.90 C
ATOM 309 C THR A 39 -74.025 -30.133 15.389 1.00275.90 C
ATOM 310 O THR A 39 -73.631 -31.294 15.497 1.00275.90 O