PAM250 extrapolations (old)

We have been using this DARWIN recipe to extrapolate matrices to PAM250.

1 The Problem

DARWIN calculates the aa frequencies from a count matrix in which the counts of nonsymmetric substitutions are split in half between the two possible directions (i.e., A->C and C->A). For disordered matrix classes with a percent identity below 85%, these aa frequencies diverge widely from the empirical frequencies counted from the sequences.
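The splitting step can be sketched roughly as follows. This is an illustration of the idea as described here, not DARWIN's actual code: the function names, the reading of "split in half" as averaging the two directions, and the toy counts on a 3-letter alphabet are all assumptions.

```python
def symmetrize(counts):
    """Split each pair's counts evenly between the two directions (assumed reading)."""
    n = len(counts)
    return [[(counts[i][j] + counts[j][i]) / 2 for j in range(n)] for i in range(n)]


def frequencies_from_counts(sym):
    """Row totals of the symmetrized matrix, normalized so they sum to 1."""
    total = sum(sum(row) for row in sym)
    return [sum(row) / total for row in sym]


# Toy directed counts; the numbers are made up for illustration only.
toy_counts = [
    [50, 8, 2],
    [4, 30, 6],
    [0, 2, 20],
]
toy_sym = symmetrize(toy_counts)
toy_freqs = frequencies_from_counts(toy_sym)
```

Comparing `toy_freqs` with frequencies counted directly from the sequences is the comparison that this section reports as diverging for the disordered classes.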

The differences between aa frequencies calculated empirically and using DARWIN's method.

Using the empirical frequencies in the resulting log odds calculation leads to an asymmetric log odds matrix, but using the calculated frequencies gives numbers that can't be trusted for disordered matrices below the 85% identity level.

Additionally, the initial calculated percent mutation differed significantly between the ordered and disordered classes below 85% identity.

% Id. Class   Disordered (%)   Ordered (%)   Difference
85%           6.24             6.07          0.17
60%           18.41            26.10         7.69
40%           34.34            43.38         9.04

The 85% class was fairly close, but the 60% and 40% disordered mutation rates were much lower than the comparable ordered mutation rates. This could be because disordered sequences often evolve through repeat expansion, so fewer disordered sequences than ordered sequences pass the gap limits (4 for 60% and 10 for 40%). That would skew the disordered results toward a lower percent mutation than a similar set of ordered sequences, because a greater proportion of the counted sequences would have a higher percent identity.

So, in order to see if the comparisons were valid, I had to do two things:

  1. Find a way of calculating the extrapolated PAM250 matrices using empirical amino acid frequencies.
  2. See how the difference of mutation rate between the ordered and disordered sets could affect the results.

2 Solution

2.1 AA Frequencies

The PAM250 mutation matrix (where all columns sum to 1) was calculated as before: the PAM1 mutation matrix was determined and then raised to the 250th power.
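As a minimal sketch of the extrapolation step, the following raises a small column-stochastic matrix to the 250th power by repeated squaring. The 3x3 `toy_pam1` matrix and the helper names are hypothetical; a real PAM1 matrix is 20x20 and comes from the count data.

```python
def mat_mul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy PAM1-like matrix: each column sums to 1 (values are made up).
toy_pam1 = [
    [0.990, 0.004, 0.002],
    [0.006, 0.990, 0.008],
    [0.004, 0.006, 0.990],
]
toy_pam250 = mat_pow(toy_pam1, 250)

# Columns of the extrapolated matrix should still sum to 1.
column_sums = [sum(toy_pam250[i][j] for i in range(3)) for j in range(3)]
```

Because a product of column-stochastic matrices is column-stochastic, checking `column_sums` is a cheap sanity test on the extrapolated matrix.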

Instead of using DARWIN's recipe to calculate log odds, I converted the matrix back into a count matrix by multiplying each matrix column by the original count matrix's total aa sum. I then converted the counts into an upper triangular matrix by adding each value below the diagonal to the equivalent position above the diagonal and then setting the below-diagonal value to 0. I calculated the log odds the same way as for the other matrices:

s_{ij} = 2 \log_2 \frac{p(i,j)}{p(i)\,p(j)}


Here p(i,j) comes from the normalized matrix of counts converted from PAM250, and p(i) and p(j) are the empirical amino acid frequencies.
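The folding and log odds steps might look roughly like this. It is a sketch under assumptions: the function names, the toy counts, and the toy empirical frequencies are hypothetical, and the formula is applied literally to every upper-triangular cell, diagonal included.

```python
import math


def fold_upper(counts):
    """Add each below-diagonal count to its mirror cell above the diagonal."""
    n = len(counts)
    folded = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i <= j:
                folded[i][j] += counts[i][j]
            else:
                folded[j][i] += counts[i][j]
    return folded


def log_odds(folded, freqs):
    """s_ij = 2 * log2(p(i,j) / (p(i) * p(j))) for the upper triangle."""
    total = sum(sum(row) for row in folded)
    n = len(folded)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            p_ij = folded[i][j] / total  # normalized count
            scores[i][j] = 2 * math.log2(p_ij / (freqs[i] * freqs[j]))
    return scores


# Toy counts and toy "empirical" frequencies, made up for illustration.
toy_counts = [
    [40, 6, 2],
    [4, 30, 5],
    [2, 3, 8],
]
toy_freqs = [0.5, 0.3, 0.2]
toy_folded = fold_upper(toy_counts)
toy_scores = log_odds(toy_folded, toy_freqs)
```

Folding preserves the total count, so the normalized upper triangle still sums to 1. A cell with a zero count would make the logarithm undefined; the sketch does not handle that case.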

2.2 Mutation rates

To build matrices based on higher mutation rates for the lower percent identity classes, I did "band" iterations, in which only alignments whose percent identity fell within a given band were counted. The 60% band covered 60-85% identity and the 40% band covered 40-60%.
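A band filter of this kind can be sketched as below. The record layout (`id`, `identity`), the helper names, and the choice of a half-open interval (lower bound included, upper bound excluded) are assumptions; the text does not say how the band edges were handled.

```python
# Band limits in percent identity, taken from the text above.
BANDS = {
    "60% band": (60.0, 85.0),
    "40% band": (40.0, 60.0),
}


def in_band(percent_identity, lower, upper):
    """True if identity lies in [lower, upper). Edge handling is an assumption."""
    return lower <= percent_identity < upper


def select_band(alignments, band_name):
    """Keep only the alignments whose identity falls inside the named band."""
    lower, upper = BANDS[band_name]
    return [a for a in alignments if in_band(a["identity"], lower, upper)]


# Hypothetical alignment records for illustration.
toy_alignments = [
    {"id": "a1", "identity": 91.0},
    {"id": "a2", "identity": 72.5},
    {"id": "a3", "identity": 60.0},
    {"id": "a4", "identity": 45.0},
]
band60 = select_band(toy_alignments, "60% band")
band40 = select_band(toy_alignments, "40% band")
```

With the half-open convention an alignment at exactly 60% identity is counted once, in the 60% band, rather than in both bands.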

2.3 Class stats

% Id. Class            Sequences   Alignments   AA total     Substitutions
85%, ordered           3,631       42,081       755,019      8,103,744
85%, disordered        3,121       28,753       327,275      2,256,981
60%, ordered           29,097      3,763,579    5,774,090    668,422,787
60%, band ordered      28,830      3,509,685    5,719,456    624,653,940
60%, disordered        30,596      772,264      5,874,199    60,447,047
60%, band disordered   29,010      669,180      5,376,513    50,670,772
40%, ordered           65,166      13,529,611   14,079,000   2,594,638,217
40%, band ordered      63,753      9,319,290    13,743,688   1,833,344,030
40%, disordered        46,300      2,966,007    8,598,584    245,347,452
40%, band disordered   42,111      2,093,332    5,376,513    189,437,335

2.4 Mutation rates comparison

% Id. Class   Disordered, scale   Ordered, scale   Difference, scale   Disordered, band   Ordered, band   Difference, band
85%           6.24%               6.09%            0.17%               N/A                N/A             N/A
60%           17.84%              26.17%           8.33%               24.68%             30.01%          5.33%
40%           34.12%              43.34%           9.22%               48.21%             50.28%          2.07%

The difference between the banded class and the original class:

% Id. Class   Disordered band difference   Ordered band difference
60%           6.84%                        3.84%
40%           14.09%                       6.94%
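The band differences above are plain subtractions of the rates in the comparison table, which the following sketch recomputes for the 60% and 40% classes. The dictionary layout and variable names are just one convenient arrangement; the numbers are copied from the table.

```python
# (original rate, banded rate) in percent, copied from the comparison table.
rates = {
    "60%": {"disordered": (17.84, 24.68), "ordered": (26.17, 30.01)},
    "40%": {"disordered": (34.12, 48.21), "ordered": (43.34, 50.28)},
}

# Banded rate minus original rate, per class and per ordered/disordered set.
band_shift = {
    cls: {kind: round(banded - original, 2)
          for kind, (original, banded) in kinds.items()}
    for cls, kinds in rates.items()
}

# Ordered minus disordered gap, for the original and for the banded classes.
gap_original = {cls: round(k["ordered"][0] - k["disordered"][0], 2)
                for cls, k in rates.items()}
gap_banded = {cls: round(k["ordered"][1] - k["disordered"][1], 2)
              for cls, k in rates.items()}
```

Rounding to two decimals keeps floating-point noise out of the comparison with the tabulated values.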

3 Results

Difference numbers are the total sum of the absolute values in the upper triangular difference matrix obtained by subtracting one matrix from another.

The matrices used to make the numbers in this table have been calculated from the log odds matrix method explained above.
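The difference number can be sketched as a sum of absolute differences over the upper triangle. The function name and toy matrices are hypothetical, and including the diagonal in the sum is an assumption.

```python
def upper_abs_diff(a, b):
    """Sum of |a[i][j] - b[i][j]| over the upper triangle, diagonal included."""
    n = len(a)
    return sum(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(i, n))


# Toy upper-triangular log odds matrices (values are made up).
toy_a = [
    [4, -1, 0],
    [0, 5, -2],
    [0, 0, 6],
]
toy_b = [
    [3, 1, 0],
    [0, 5, -3],
    [0, 0, 4],
]
toy_difference = upper_abs_diff(toy_a, toy_b)
```

When both matrices hold zeros below the diagonal, this gives the same number as summing the absolute differences over the whole matrix.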

3.1 PAM250 Log Odds sums

% Id. Class   Disordered sum   Ordered sum   Disordered absolute sum   Ordered absolute sum
85%           204              180           500                       492
60%           493              197           733                       493
60% banded    459              192           739                       492
40%           413              235           703                       489
40% banded    338              231           688                       487

3.2 Original class differences between percent identity classes

First class   Second class   Disordered difference   Ordered difference   Difference between ordered and disordered differences
60%           85%            457                     131                  326
40%           85%            419                     175                  244
40%           60%            128                     80                   48

3.3 Banded class differences between percent identity classes

Each "Original compare" column is the banded value to its left minus the equivalent value in the original table.

First class   Second class   Disordered difference   Original compare   Ordered difference   Original compare   Difference between ordered and disordered differences   Original compare
60%, banded   85%            465                     8                  140                  9                  325                                                     -1
40%, banded   85%            394                     -25                179                  4                  215                                                     -29
40%, banded   60%, banded    179                     51                 89                   9                  90                                                      42

3.4 Differences between original and banded extrapolated log odds

Differences are the banded PAM250 log odds minus the original PAM250 log odds.

Class   Disordered difference   Ordered difference
60%     54                      11
40%     95                      18

4 Discussion

  • The gap between ordered and disordered mutation rates shrinks with banded classes, suggesting that the disordered proteins used for the original matrices have a higher proportion of sequences with higher percent identity. I suspect this is because disordered sequences are more likely to accrue gaps at lower percent identities and thus disqualify themselves from consideration. That would also explain why the disordered and ordered mutation rates are very close at the 85% level.
  • The gap between ordered and disordered banded mutation rates is smaller at 40% identity than at 60% identity; maybe the limit of 4 gaps at 60% is a harsher penalty than the limit of 10 at 40%.
  • The smaller change in mutation rate between the original and banded ordered classes shows up as a smaller difference in the ordered extrapolated log odds matrices than in the disordered ones. The disordered mutation rate change is roughly twice that of ordered, and the disordered difference between the banded and original extrapolated log odds is about five times the ordered difference.
  • The difference between disordered extrapolated matrices at lower percent identities and those at higher percent identities is markedly higher than for the ordered ones. Banding does not change this. Why? For reference, the differences of the 40% and 60% classes with 85% can rival those between ordered and disordered matrices:

> sum(abs(dpi85_log_odds - opi85_log_odds));
[1] 244
> sum(abs(dpi60_log_odds - opi60_log_odds));
[1] 524
> sum(abs(dpi40_log_odds - opi40_log_odds));
[1] 450
> sum(abs(dpi60b_log_odds - opi60b_log_odds));
[1] 527
> sum(abs(dpi40b_log_odds - opi40b_log_odds));
[1] 433

The difference between the disordered 85% matrix and the ordered matrices at lower percent identities is also lower than the differences among the disordered matrices:

> sum(abs(dpi85_log_odds - opi60_log_odds))
[1] 249
> sum(abs(dpi85_log_odds - opi40_log_odds))
[1] 263
  • Most of the differences occur between 85% and 60%; the differences between 60% and 40% are much smaller. This holds whether the difference in mutation rate between the two classes is smaller or larger, and whether the matrices are ordered or disordered. Does this suggest that most of the long-term constraints kick in after 85%, or when we start allowing gaps in alignments?