PAM250 extrapolations (old)
We have been using this Darwin recipe to do extrapolations of matrices to PAM250.
1 The Problem
DARWIN calculates the amino acid (aa) frequencies from a count matrix in which the counts of nonsymmetric substitutions are split in half between the two directions (i.e., A->C and C->A). For disordered matrix classes with a percent identity below 85%, these calculated frequencies diverge wildly from the empirical frequencies counted from the sequences.
Using the empirical frequencies in the log odds calculation produces an asymmetric log odds matrix, but using the calculated frequencies produces numbers that can't be trusted for disordered matrices below the 85% identity level.
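The two kinds of frequencies can be contrasted with a minimal sketch. It assumes that "split in half" means averaging each pair of mirrored counts, and that the calculated frequencies are row totals of the symmetrized matrix divided by the grand total. The function names and the three-letter toy alphabet are hypothetical and are not taken from the Darwin recipe.

```python
def symmetrize(counts):
    """Average mirrored cells so that A->C and C->A share their counts equally.

    Assumption: this is what "split in half" means in the text above.
    """
    n = len(counts)
    return [[(counts[i][j] + counts[j][i]) / 2 for j in range(n)]
            for i in range(n)]


def frequencies_from_counts(counts):
    """Calculated frequencies: each row total divided by the grand total."""
    total = sum(sum(row) for row in counts)
    return [sum(row) / total for row in counts]


def empirical_frequencies(sequences, alphabet):
    """Empirical frequencies: residues counted directly from the sequences."""
    tally = {residue: 0 for residue in alphabet}
    for seq in sequences:
        for residue in seq:
            if residue in tally:  # skip gaps and unknown symbols
                tally[residue] += 1
    total = sum(tally.values())
    return [tally[residue] / total for residue in alphabet]


# Toy data on a hypothetical three-letter alphabet "ACD".
toy_counts = [[10, 4, 0],
              [2, 6, 2],
              [0, 0, 4]]
calculated = frequencies_from_counts(symmetrize(toy_counts))
empirical = empirical_frequencies(["AAC", "CD"], "ACD")
```

Comparing `calculated` with `empirical` element by element gives a quick check of how far the two frequency vectors diverge for a given class.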
Additionally, the initially calculated percent mutation differed significantly between the ordered and disordered classes below 85% identity.
| % Id. Class | Disordered | Ordered | Difference |
|---|---|---|---|
| 85% | 6.24 | 6.07 | 0.17 |
| 60% | 18.41 | 26.10 | 7.69 |
| 40% | 34.34 | 43.38 | 9.04 |
The 85% classes were fairly close, but the 60% and 40% disordered mutation rates were much lower than the comparable ordered mutation rates. This could be because disordered sequences often evolve through repeat expansion, so fewer disordered sequences than ordered sequences pass the gap limits (4 for 60% and 10 for 40%). That would skew the results toward a lower percent mutation than a similar set of ordered sequences, because a greater proportion of the counted sequences would have a higher percent identity.
So, in order to see if the comparisons were valid, I had to do two things:
- Find a way of calculating the extrapolated PAM250 matrices using empirical amino acid frequencies.
- See how the difference of mutation rate between the ordered and disordered sets could affect the results.
2 Solution
2.1 AA Frequencies
The PAM250 mutation matrix (where all columns sum to 1) was calculated as before: the PAM1 mutation matrix was determined and then raised to the 250th power.
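The extrapolation step can be sketched as a matrix power. This is only an illustration that assumes a column-stochastic PAM1 matrix stored as nested lists. The helper names and the two-state toy matrix are hypothetical, and repeated squaring is used here purely for convenience in place of 250 successive multiplications.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in matrix]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy two-state "PAM1" matrix; each column sums to 1.
toy_pam1 = [[0.99, 0.02],
            [0.01, 0.98]]
toy_pam250 = mat_pow(toy_pam1, 250)
```

Because a product of column-stochastic matrices is itself column-stochastic, the columns of `toy_pam250` still sum to 1, up to floating-point error.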
Instead of using DARWIN's recipe to calculate the log odds, I converted the matrix back into a count matrix by multiplying each matrix column by the original count matrix's total aa sum. I then converted the counts into an upper triangular matrix by adding each value below the diagonal to the equivalent position above the diagonal and then setting the below-diagonal value to 0. I calculated the log odds the same way as for the other matrices:
$$ s_{ij} = 2 \log_2 \frac{p(i,j)}{p(i)\,p(j)} $$
where p(i,j) is taken from the normalized matrix of counts converted from PAM250, and p(i) and p(j) are the empirical amino acid frequencies.
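The conversion and scoring steps can be sketched as follows. The sketch makes two assumptions that the text does not confirm: each column j is scaled by the total count of amino acid j in the original count matrix, and p(i,j) is the folded count divided by the grand total. All function names and the two-state toy numbers are hypothetical.

```python
import math


def to_counts(mutation, column_totals):
    """Scale column j of the mutation matrix by the total count for amino acid j.

    Assumption: "total aa sum" is read here as one total per column.
    """
    n = len(mutation)
    return [[mutation[i][j] * column_totals[j] for j in range(n)]
            for i in range(n)]


def fold_upper(counts):
    """Add each below-diagonal value to its mirror cell and leave 0 below."""
    n = len(counts)
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            row, col = (i, j) if i <= j else (j, i)
            upper[row][col] += counts[i][j]
    return upper


def log_odds(upper, freqs):
    """Compute s_ij = 2 * log2(p(i,j) / (p(i) * p(j))) on the upper triangle."""
    n = len(upper)
    total = sum(sum(row) for row in upper)
    scores = [[None] * n for _ in range(n)]  # below-diagonal cells stay None
    for i in range(n):
        for j in range(i, n):
            p_ij = upper[i][j] / total
            if p_ij == 0:
                scores[i][j] = float("-inf")  # pair never observed
            else:
                scores[i][j] = 2 * math.log2(p_ij / (freqs[i] * freqs[j]))
    return scores


# Toy two-state example with made-up empirical frequencies.
toy_mutation = [[0.9, 0.2],
                [0.1, 0.8]]
toy_upper = fold_upper(to_counts(toy_mutation, [200, 100]))
toy_scores = log_odds(toy_upper, [0.6, 0.4])
```

Only the upper triangle is scored, so the result is symmetric by construction even though the empirical frequencies are used in the denominator.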
2.2 Mutation rates
To try to build matrices based on higher mutation rates for the lower percent identity classes, I did "band" iterations, in which alignments were counted only if their percent identity fell within a certain band. The 60% band covered 60-85% identity and the 40% band covered 40-60%.
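A band filter of this kind might look like the sketch below. The percent identity definition (matches over aligned, ungapped columns) and the half-open band boundaries are assumptions, since the text does not say how identity was measured or which band a boundary value such as exactly 60% belongs to. The names `BANDS`, `percent_identity`, and `in_band` are hypothetical.

```python
# Band limits from the text, as (lower, upper) percent identity.
BANDS = {"60%": (60.0, 85.0), "40%": (40.0, 60.0)}


def percent_identity(seq_a, seq_b):
    """Percent of ungapped aligned columns in which the residues match.

    Assumption: both strings come from one pairwise alignment, have the
    same length, and use '-' for gaps.
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def in_band(identity, band):
    """True if identity lies in the half-open interval [lower, upper)."""
    lower, upper = BANDS[band]
    return lower <= identity < upper


# Keep only the toy alignments that fall inside the 60% band.
toy_alignments = [("ACDE", "ACDF"), ("ACDE", "ACDE"), ("ACDE", "AFGH")]
kept = [pair for pair in toy_alignments
        if in_band(percent_identity(*pair), "60%")]
```

With these toy alignments, only the first pair (75% identical) lands in the 60% band; the identical pair is above it and the 25% pair is below it.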
2.3 Class stats
| % Id. Class | Sequences | Alignments | AA total | Substitutions |
|---|---|---|---|---|
| 85%, ordered | 3,631 | 42,081 | 755,019 | 8,103,744 |
| 85%, disordered | 3,121 | 28,753 | 327,275 | 2,256,981 |
| 60%, ordered | 29,097 | 3,763,579 | 5,774,090 | 668,422,787 |
| 60%, band ordered | 28,830 | 3,509,685 | 5,719,456 | 624,653,940 |
| 60%, disordered | 30,596 | 772,264 | 5,874,199 | 60,447,047 |
| 60%, band disordered | 29,010 | 669,180 | 5,376,513 | 50,670,772 |
| 40%, ordered | 65,166 | 13,529,611 | 14,079,000 | 2,594,638,217 |
| 40%, band ordered | 63,753 | 9,319,290 | 13,743,688 | 1,833,344,030 |
| 40%, disordered | 46,300 | 2,966,007 | 8,598,584 | 245,347,452 |
| 40%, band disordered | 42,111 | 2,093,332 | 5,376,513 | 189,437,335 |
2.4 Mutation rates comparison
| % Id. Class | Disordered, scale | Ordered, scale | Difference | Disordered, band | Ordered, band | Difference |
|---|---|---|---|---|---|---|
| 85% | 6.24% | 6.09% | 0.17% | N/A | N/A | N/A |
| 60% | 17.84% | 26.17% | 8.33% | 24.68% | 30.01% | 5.33% |
| 40% | 34.12% | 43.34% | 9.22% | 48.21% | 50.28% | 2.07% |
The difference between the banded class and the original class:
| % Id. Class | Disordered band difference | Ordered band difference |
|---|---|---|
| 60% | 6.84% | 3.84% |
| 40% | 14.09% | 6.94% |
3 Results
Difference numbers are the total sum of the absolute values in the upper triangular difference matrix obtained by subtracting one matrix from another.
The matrices behind the numbers in the tables below were calculated with the log odds method explained above.
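The difference number described above can be sketched in a few lines. Including the diagonal in the sum is an assumption, and `upper_abs_difference` is a hypothetical helper name.

```python
def upper_abs_difference(a, b):
    """Sum |a[i][j] - b[i][j]| over the upper triangle, diagonal included."""
    n = len(a)
    return sum(abs(a[i][j] - b[i][j])
               for i in range(n)
               for j in range(i, n))


# Toy 2x2 matrices; the below-diagonal cells are ignored.
toy_a = [[4, -1],
         [9, 5]]
toy_b = [[2, 1],
         [0, 3]]
toy_difference = upper_abs_difference(toy_a, toy_b)
```

Restricting the sum to the upper triangle keeps each amino acid pair from being counted twice when the matrices are symmetric.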
3.1 PAM250 Log Odds sums
| % Id. Class | Disordered sum | Ordered sum | Disordered absolute sum | Ordered absolute sum |
|---|---|---|---|---|
| 85% | 204 | 180 | 500 | 492 |
| 60% | 493 | 197 | 733 | 493 |
| 60% banded | 459 | 192 | 739 | 492 |
| 40% | 413 | 235 | 703 | 489 |
| 40% banded | 338 | 231 | 688 | 487 |
3.2 Original class differences between percent identity classes
| First class | Second class | Disordered difference | Ordered difference | Difference between ordered and disordered differences |
|---|---|---|---|---|
| 60% | 85% | 457 | 131 | 326 |
| 40% | 85% | 419 | 175 | 244 |
| 40% | 60% | 128 | 80 | 48 |
3.3 Banded class differences between percent identity classes
The original compare column is calculated by subtracting the equivalent value in the original table from the banded value.
| First class | Second class | Disordered difference | Original compare | Ordered difference | Original compare | Difference between ordered and disordered differences | Original compare |
|---|---|---|---|---|---|---|---|
| 60%, banded | 85% | 465 | 8 | 140 | 9 | 325 | -1 |
| 40%, banded | 85% | 394 | -25 | 179 | 4 | 215 | -29 |
| 40%, banded | 60%, banded | 179 | 51 | 89 | 9 | 90 | 42 |
3.4 Differences between original and banded extrapolated log odds
Differences are the banded PAM250 log odds minus the original PAM250 log odds.
| Class | Disordered difference | Ordered difference |
|---|---|---|
| 60% | 54 | 11 |
| 40% | 95 | 18 |
4 Discussion
- The gap between ordered and disordered mutation rates shrinks with the banded classes, suggesting that the disordered proteins used for the original matrices include a higher proportion of sequences with high percent identity. I suspect this is because disordered sequences are more likely to accrue gaps at lower percent identities and thus disqualify themselves from consideration. That would explain why the disordered and ordered mutation rates are very close at the 85% level.
- The gap between the ordered and disordered banded mutation rates is smaller at 40% identity than at 60% identity. Maybe 4 gaps at 60% is a harsher limit than 10 gaps at 40%.
- The smaller change in mutation rate between the original and banded ordered classes shows up as a smaller difference in the extrapolated log odds matrices than for the disordered classes. The disordered mutation rate change is roughly twice that of ordered, and the difference between the banded and original extrapolated log odds is roughly five times greater for disordered than for ordered.
- The difference between the disordered extrapolated matrices at lower percent identities and those at higher percent identities is markedly larger than for the ordered ones. Banding does not change this. Why? For reference, the differences of the 40% and 60% classes with the 85% class can rival those between ordered and disordered matrices:
    > sum(abs(dpi85_log_odds - opi85_log_odds));
    [1] 244
    > sum(abs(dpi60_log_odds - opi60_log_odds));
    [1] 524
    > sum(abs(dpi40_log_odds - opi40_log_odds));
    [1] 450
    > sum(abs(dpi60b_log_odds - opi60b_log_odds));
    [1] 527
    > sum(abs(dpi40b_log_odds - opi40b_log_odds));
    [1] 433
And the difference between the disordered 85% matrix and the ordered matrices of lower percent identity is smaller than the difference between the disordered matrices themselves:
    > sum(abs(dpi85_log_odds - opi60_log_odds))
    [1] 249
    > sum(abs(dpi85_log_odds - opi40_log_odds))
    [1] 263
- Most of the differences occur between 85% and 60%; the differences between 60% and 40% are much smaller. This holds whether the difference in mutation rate between the two classes is smaller or larger, and whether the matrices are ordered or disordered. This suggests that most of the long-term constraints kick in below 85% identity, or when we start allowing gaps in alignments?