Proteins: Structure, Function and Genetics 40, 662-674,

Recurrent oligomers in proteins -- an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies

Cristian Micheletti, Flavio Seno and Amos Maritan

A novel scheme is introduced to capture the spatial correlations of consecutive amino acids in naturally occurring proteins. This knowledge-based strategy optimally and automatically subdivides protein fragments into similarity classes. The goal is to provide the minimal set of protein oligomers (termed "oligons" for brevity) that is able to represent any other fragment. In contrast to previous studies, which classified recurrent local motifs, our concern is to provide simplified protein representations optimised for use in automated folding and/or design attempts. In such contexts it is paramount to limit the number of degrees of freedom per amino acid without incurring a loss of accuracy in the structural representation. The suggested method finds, by construction, the optimal compromise between these needs. Several possible oligon lengths are considered. It is shown that meaningful classifications cannot be made for lengths greater than 6 or smaller than 4. Different contexts are considered in which oligons of length 5 or 6 are recommended. With only a few dozen oligons of such lengths, virtually any protein can be reproduced within typical experimental uncertainties. Structural data for the oligons are made publicly available.
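
As a rough illustration of representing a fragment by a small library of oligons, the sketch below cuts a chain into consecutive fragments and assigns a fragment to its closest representative. It is not the authors' procedure: the similarity measure used here (distance RMSD over internal pairwise distances, which needs no superposition but cannot tell mirror images apart) is an assumption, and the names `windows`, `drmsd` and `nearest_oligon` are hypothetical.

```python
import math
from itertools import combinations


def windows(coords, length):
    """Return all consecutive fragments of `length` residues from a list of (x, y, z) points."""
    return [coords[i:i + length] for i in range(len(coords) - length + 1)]


def drmsd(a, b):
    """Distance RMSD between two equal-length fragments (assumed similarity measure)."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    pairs = list(combinations(range(len(a)), 2))
    # Compare the internal distance of every residue pair in the two fragments.
    sq = sum((math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2 for i, j in pairs)
    return math.sqrt(sq / len(pairs))


def nearest_oligon(fragment, oligons):
    """Return (index, dRMSD) of the representative closest to `fragment`."""
    scores = [drmsd(fragment, o) for o in oligons]
    best = min(range(len(oligons)), key=scores.__getitem__)
    return best, scores[best]


# Toy library of two made-up length-5 representatives (not real oligon data).
straight = [(3.8 * i, 0.0, 0.0) for i in range(5)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0), (0.0, 3.8, 3.8)]
library = [bent, straight]

# A translated copy of the straight fragment should map back to `straight`.
shifted = [(x + 1.0, y - 2.0, z + 5.0) for x, y, z in straight]
index, score = nearest_oligon(shifted, library)
```

In this toy setting, the per-fragment score plays the role of the representation error that has to be traded off against the size of the library; a real application would use the published oligon coordinates and whichever similarity measure the paper defines.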