Amino acid substitution matrices from an information theoretic perspective
1 Abstract
Protein sequence alignments have become an important tool for molecular biologists. Local alignments are frequently constructed with the aid of a "substitution score matrix" that specifies a score for aligning each pair of amino acid residues. Over the years, many different substitution matrices have been proposed, based on a wide variety of rationales. Statistical results, however, demonstrate that any such matrix is implicitly a "log-odds" matrix, with a specific target distribution for aligned pairs of amino acid residues. In the light of information theory, it is possible to express the scores of a substitution matrix in bits and to see that different matrices are better adapted to different purposes. The most widely used matrix for protein sequence comparison has been the PAM-250 matrix. It is argued that for database searches the PAM-120 matrix generally is more appropriate, while for comparing two specific proteins with suspected homology the PAM-200 matrix is indicated. Examples discussed include the lipocalins, human alpha 1 B-glycoprotein, the cystic fibrosis transmembrane conductance regulator and the globins.
2 Two main assumptions
Two crucial assumptions about substitution scores:
- There is at least one positive score
- The expected score is negative: $E = \sum_{i,j} p_i p_j s_{ij} < 0$
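Both assumptions can be checked directly for any score matrix once background frequencies are fixed. The sketch below is only an illustration: the function name `check_score_assumptions`, the two-letter alphabet, and the numeric values are all made up for the example and do not come from the paper.

```python
def check_score_assumptions(p, s):
    """Return (has_positive, expected_score) for background
    frequencies p[i] and substitution scores s[i][j]."""
    n = len(p)
    # Assumption 1: at least one score is positive.
    has_positive = any(s[i][j] > 0 for i in range(n) for j in range(n))
    # Assumption 2: E = sum_{i,j} p_i p_j s_ij must be negative.
    expected = sum(p[i] * p[j] * s[i][j] for i in range(n) for j in range(n))
    return has_positive, expected


# Toy two-letter alphabet with illustrative values (not from the paper).
p = [0.5, 0.5]
s = [[1, -2],
     [-2, 1]]

has_positive, expected = check_score_assumptions(p, s)
print(has_positive, expected)
```

A matrix that fails either check is not suitable for local alignment in the sense described here: with no positive score nothing would ever align, and with a non-negative expected score long random alignments would tend to score well.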
3 Relative entropy
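$H = \sum_{i,j} q_{ij} s_{ij} = \sum_{i,j} q_{ij} \log_2 \frac{q_{ij}}{p_i p_j}$

Here $q_{ij}$ is the target frequency of the aligned pair $(i, j)$, $p_i$ is the background frequency of residue $i$, and the scores $s_{ij} = \log_2 \frac{q_{ij}}{p_i p_j}$ are expressed in bits.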
- H is the relative entropy of the target and background distributions. It measures the average information available per position to distinguish alignment from chance.
- The higher the value of the relative entropy of the target, the more easily alignments are distinguished from random chance
- A high value of H means that relatively short alignments can be distinguished from chance
- A low value of H means that longer alignments are required to distinguish them from chance
- "distinguishing an alignment from chance in a search of a typical current protein database using an average length protein requires about 30 bits of information", so the minimum length of a distinguishable alignment is roughly 30 / H positions (with H in bits per position)
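The relative entropy and the resulting length estimate can be computed in a few lines. This is a minimal sketch under stated assumptions: the function names, the two-letter alphabet, and the target frequencies are hypothetical, and rounding the length up to a whole number of positions is a choice made for the example. Only the 30-bit figure comes from the quotation above.

```python
import math


def relative_entropy_bits(q, p):
    """H = sum_{i,j} q_ij * log2(q_ij / (p_i * p_j)), in bits per position."""
    n = len(p)
    h = 0.0
    for i in range(n):
        for j in range(n):
            if q[i][j] > 0:  # a zero target frequency contributes nothing
                h += q[i][j] * math.log2(q[i][j] / (p[i] * p[j]))
    return h


def min_alignment_length(h_bits, required_bits=30.0):
    """Approximate number of aligned positions needed to accumulate
    required_bits of information, rounded up (an assumption here)."""
    return math.ceil(required_bits / h_bits)


# Toy two-letter alphabet with illustrative values (not from the paper).
p = [0.5, 0.5]
q = [[0.4, 0.1],
     [0.1, 0.4]]

h = relative_entropy_bits(q, p)
length = min_alignment_length(h)
print(round(h, 3), length)
```

Because the length scales as 30 / H, halving the relative entropy of a matrix doubles the alignment length needed to reach the same significance.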
| Date published | 5 June 1991 |
| Has author | Stephen F. Altschul |
| Paper topic | Substitution matrices |
| PubMed ID | 2051488 |
| Published in | Journal of Molecular Biology |
| Title | Amino acid substitution matrices from an information theoretic perspective |