pdb-l: Machine detection of sequence microheterogeneity

Wed Dec 7 10:21:07 PST 2005

What is the best way for a program, reading a PDB file, to distinguish 
sequence microheterogeneity from insertions in the sequence numbering? Both 
can involve "insertion codes" in the ATOM records (column 27). In the 
former case, the records represent alternate residues at the same position 
in the chain (sometimes marked with insertion codes, sometimes just sharing 
the same residue number), while in the latter they represent sequential 
residues.
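
To make the ambiguity concrete, here is a minimal Python sketch that reads 
the residue number (columns 23-26) and insertion code (column 27) from the 
coordinate records and reports every residue number that is used by more 
than one residue. The function names are made up for illustration. It 
assumes well-formed fixed-column records and treats HETATM like ATOM so that 
modified residues are included, which also means ligands or waters that 
happen to share a number would be reported.

```python
from typing import Dict, Iterable, List, Tuple

# (chain ID, residue number, insertion code, residue name)
Residue = Tuple[str, int, str, str]


def residues_in_file_order(pdb_lines: Iterable[str]) -> List[Residue]:
    """Return each distinct residue once, in the order it first appears."""
    seen = set()
    ordered: List[Residue] = []
    for line in pdb_lines:
        if line[:6] not in ("ATOM  ", "HETATM"):
            continue
        residue = (
            line[21:22],           # column 22: chain ID
            int(line[22:26]),      # columns 23-26: residue number
            line[26:27].strip(),   # column 27: insertion code, "" if blank
            line[17:20].strip(),   # columns 18-20: residue name
        )
        if residue not in seen:
            seen.add(residue)
            ordered.append(residue)
    return ordered


def shared_number_groups(
    residues: Iterable[Residue],
) -> Dict[Tuple[str, int], List[Tuple[str, str]]]:
    """Map (chain, number) to its residues, keeping only numbers used twice or more."""
    groups: Dict[Tuple[str, int], List[Tuple[str, str]]] = {}
    for chain, number, icode, name in residues:
        groups.setdefault((chain, number), []).append((icode, name))
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Every group returned by shared_number_groups() is a candidate for either 
interpretation; the records alone do not say which one applies.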

The format guide at http://www.rcsb.org/pdb/docs/format/pdbguide2.2/part_35.html 
states for SEQRES:
"In case of microheterogeneity, only one of the sequences is presented. A 
REMARK is generated to explain this and a SEQADV is also generated."
PDBv3 (as of 2009.05.24) says at http://www.wwpdb.org/documentation/format32/sect3.html#SEQRES:
"Microheterogeneity is to be represented as a variant with one of the possible 
residues in the site being selected (arbitrarily) as the primary residue. 
The residues which do not match to the UNP reference will be listed in 
SEQADV records with the explanation of “microheterogeneity”."

1H9H describes microheterogeneity in SEQADV and REMARK 999, but has none in 
its ATOM records. Instead, it has three sequence insertions!
I don't see that now (maybe due to PDB remediation?). The current file has 
ATOM records for residues C & S at several positions along the chains, 
sharing the same residue number without an insertion code.
The pre-2007 remediation file* is in the same state.

(*) obtained from http://www.umass.edu/microbio/chime/pe_beta/pe/protexpl/unremed.htm

1DIN specifies microheterogeneity in REMARK 6, but gives no ATOM 
coordinates for the alternate residue.
No; it has both CSD 123 and Cys 123 in the ATOM section, and only CSD in 
SEQRES. Even in the pre-2007 remediation file*, REMARK 6 says that 
coordinates are provided for both residues.

1AL4, 1CBN, and 1ETA have microheterogeneity in their ATOM records, but no 
mention of it in SEQADV. Instead, 1AL4 describes it in COMPND 
OTHER_DETAILS, while 1ETA and 1CBN describe it in REMARK 4, and 1ETA also 
in FTNOTE 1.

1TAB describes microheterogeneity in REMARK 4 for three positions, 184, 
188, and 221. In the ATOM records, GLY 184A precedes(!) TYR 184, but both 
are in the SEQRES, as though an insertion rather than microheterogeneity. 
The alpha carbons have different positions, and the two residues are 
peptide-bonded. The same pattern occurs at the other two positions. I don't 
understand why this is described as microheterogeneity!
Agreed. This is still so in current PDB, and they are processed and 
displayed correctly as insertions. Same happens at Gly188a/Lys188 
and Ala221a/Gln221.

Clearly, SEQADV cannot be relied upon to indicate the presence of 
microheterogeneity in the ATOM records.

Possible Method I: Compare the sequence in SEQRES with the sequence in ATOM 
records. In the few cases I have examined, residues with insertion codes 
representing microheterogeneity do not appear in SEQRES. In contrast, for 
sequence insertions (e.g. 1QKZ, 1H9H) the SEQRES contains all the residues 
with insertion codes.
Implemented. Microheterogeneity is inferred when a residue in ATOM 
is absent from SEQRES, and that residue is grouped inside [] with the 
previous (and next) residue(s) that have the identical residue number, 
with or without an insertion code present.
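
One possible reading of Method I, as a hedged Python sketch (this is not the 
implementation mentioned above, and the function name is invented). It 
aligns the SEQRES residue names of one chain against the residue names seen 
in the coordinate records using difflib.SequenceMatcher from the standard 
library, which is a swapped-in general-purpose sequence alignment, not a 
method named in this thread. Any coordinate residue that the alignment 
cannot place in SEQRES, and that shares its residue number with another 
residue, is reported together with the residues sharing that number. The 
input is assumed to be a single chain, already parsed into tuples.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Sequence, Tuple

# (residue number, insertion code, residue name), one chain only
Residue = Tuple[int, str, str]


def microheterogeneity_sites(
    seqres: Sequence[str], atom_residues: Sequence[Residue]
) -> List[List[Residue]]:
    """Group residues that share a number when one of them is absent from SEQRES."""
    atom_names = [name for _number, _icode, name in atom_residues]
    # autojunk=False turns off difflib's "popular element" heuristic, which
    # would otherwise ignore common residue names in chains of 200+ residues.
    matcher = SequenceMatcher(None, list(seqres), atom_names, autojunk=False)

    # Residue numbers of coordinate residues that have no SEQRES counterpart.
    absent_numbers = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            absent_numbers.update(atom_residues[j][0] for j in range(j1, j2))

    # Collect residues by number, ignoring the insertion code.
    by_number: Dict[int, List[Residue]] = {}
    for residue in atom_residues:
        by_number.setdefault(residue[0], []).append(residue)

    return [
        group
        for number, group in by_number.items()
        if number in absent_numbers and len(group) > 1
    ]
```

With invented input such as seqres = ["THR", "PRO", "GLY"] and coordinate 
residues THR 21, PRO 22, SER 22A, GLY 23, the SER has no SEQRES counterpart, 
so PRO 22 and SER 22A come back as one group. If SEQRES lists every residue, 
as in a true insertion, nothing is reported.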

Possible Method II: Compare the coordinates for the alpha carbon atoms. In 
1CBN, they are identical for 22 vs. 22A, and 25 vs. 25A. But in 1ETA, they 
are slightly different at position 30, 0.484 Angstroms. So, if the alpha 
carbon distance is less than 1 Angstrom, consider it microheterogeneity?
Not implemented.
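
For comparison, a minimal Python sketch of Method II, using the 1 Angstrom 
cutoff proposed above. The function name and the tuple layout are 
assumptions for illustration, and the alpha carbon coordinates are taken as 
already extracted for a single chain. When more than two residues share a 
number, the sketch uses the closest pair, which is a design choice of this 
sketch and not something specified in the thread. It needs Python 3.8 or 
later for math.dist.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
# (residue number, insertion code, residue name, alpha carbon coordinates)
CaAtom = Tuple[int, str, str, Point]


def classify_shared_numbers(
    ca_atoms: Sequence[CaAtom], cutoff: float = 1.0
) -> Dict[int, Tuple[str, float]]:
    """Label each residue number used by two or more residues.

    Returns {number: (label, closest CA-CA distance in Angstroms)}.
    """
    by_number: Dict[int, List[Point]] = {}
    for number, _icode, _name, xyz in ca_atoms:
        by_number.setdefault(number, []).append(xyz)

    result: Dict[int, Tuple[str, float]] = {}
    for number, points in by_number.items():
        if len(points) < 2:
            continue  # the number is used only once, nothing to decide
        closest = min(math.dist(a, b) for a, b in combinations(points, 2))
        label = "microheterogeneity" if closest < cutoff else "insertion"
        result[number] = (label, closest)
    return result
```

Alpha carbons of consecutive, peptide-bonded residues are normally about 
3.8 Angstroms apart, so a 1 Angstrom cutoff leaves a wide margin on both 
sides of the 0.484 Angstrom separation seen in 1ETA.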

Method II seems simpler to implement, and likely more robust.
Don't know, but fulfilling the other specifications led me to implement 
Method I.

One can also wonder how to determine the number of entries in the PDB that 
have sequence microheterogeneity. Searching for the word gives 24 hits, but 
some of the hits lack actual sequence heterogeneity in the coordinates 
(e.g. 1CN4, 180D, 1UCS, 1BGN).
A simple search for 'microheterogeneity' now returns 65 entries.

Advice will be appreciated.

Thanks, -Eric

/* - - - - - - - - - - - - - - - - - - - - - - - - - - -
Eric Martz, Professor Emeritus, Dept Microbiology
U Mass, Amherst -- http://www.umass.edu/molvis/martz

Protein Explorer - 3D Visualization: http://proteinexplorer.org
FirstGlance in Jmol - http://firstglance.jmol.org
Workshops: http://www.umass.edu/molvis/workshop
Biochem 3D Education Resources http://MolviZ.org
World Index of Molecular Visualization Resources: http://molvisindex.org
ConSurf - Find Conserved Patches in Proteins: http://consurf.tau.ac.il
Atlas of Macromolecules: http://molvis.sdsc.edu/atlas/atlas.htm
PDB Lite Macromolecule Finder: http://pdblite.org
Molecular Visualization EMail List (molvis-list):
       http://bioinformatics.org/mailman/listinfo/molvis-list
- - - - - - - - - - - - - - - - - - - - - - - - - - - */