Working with amino acid compositions in R

To do: Remove last vestiges of matrix-based amino acid composition calculations in favor of vectors. Make sure Working with substitution matrices in R uses vectors instead, too.


Contents

[edit] 1 Loading data

[edit] 1.1 Tab delineated file

A simple tab delineated file:

A       1352842
R       684674
N       607751
D       842419
C       191133
Q       468387
E       857909
G       1212549
H       322473
I       834462
L       1202104
K       796555
M       292392
F       505608
P       605857
S       826026
T       806058
W       159438
Y       420648
V       1085109

Amino acids in the first column, their counts in the second.

aa_counts <- read.table(file="aa_counts.txt", row.names = 1,
  col.names=c("","count"));

As a function:

read_aa_table <- function ( file, colname = "count" ) {
  aa_table <- read.table(file = file, row.names = 1,
    col.names=c("",colname), comment.char = '#');
  return(aa_table);
}

# Example
aa_counts <- read_aa_table <- function( file = "FILENAME" );

You can use a class name for col.names if you want something other than "count". That way, later on when you bind different count compositions into matrices, you don't have to manually set the column names.

aa_counts will now be a list that looks like:

    count
A 1352842
R  684674
N  607751
D  842419
C  191133
Q  468387
E  857909
G 1212549
H  322473
I  834462
L 1202104
K  796555
M  292392
F  505608
P  605857
S  826026
T  806058
W  159438
Y  420648
V 1085109

This can also be done with aa files that contain percentages:

# Example
aa_counts <- read_aa_table <- function( file = "FILENAME",
  colname = "percent" );

[edit] 1.1.1 Load as a vector instead of a matrix

read_aa_list <- function ( file, colname = "count" ) {
  aa_table <- read.table(file = file, row.names = 1,
    col.names=c("",colname), comment.char = '#');
  aa_vector <- aa_table[,colname]
  names(aa_vector) <- row.names(aa_table)
  return(aa_vector);
}

[edit] 1.2 DARWIN array

Warning: This will only work on Unix based computers where tr is installed.


For amino acid lists, counts or percentages, made by using lprint in the DARWIN environment (Format: [#, #, ... #];). It will only read in as many numbers as exist in the aa_order given, default 20.

# Reads in a DARWIN array made using
# OpenWriting('filename');
# lprint(aa_freqs);
# OpenWriting('terminal');
read_darwin_aas <- function(filename,
  aa_order = c( 'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
  'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V' )) {
  num_aas = length(aa_order);
  # filters out unneeded characters from the file, replacing them
  # with white space that will then get stripped out with the Unix
  # command 'tr'
  file_filter <- paste("tr '[],;[:alpha:]:=' ' ' < ", filename)
  # scans in all the numbers of the file
  input <- pipe(file_filter);
  darwin_aas <- scan(input, what=double(0),
    nmax=num_aas, strip.white=TRUE, quiet=TRUE, comment.char='#');
  close(input);
  # makes a new matrix from those numbers
  darwin_aas <- matrix( darwin_aas, num_aas, 1,
    dimnames=list( aa_order, "percent"));
  return(darwin_aas);
}

#Example
mmatrix <- read_darwin_aas("filename");

[edit] 2 Manipulation

[edit] 2.1 Order AA matrix by aa composition

Get the order of the amino acids according to their composition:

# ascending order
order <- row.names(aa_percent)[order(aa_percent[,1])];
# descending order
order <- row.names(aa_percent)[order(aa_percent[,1]),
  decreasing = TRUE)];

Get a matrix using that order:

ordered_freqs <- matrix( aa_percent[order(aa_percent[,1]),],
  length(aa_percent[,1]), 1,
  dimnames=list( row.names(aa_percent)[order(aa_percent[,"percent"])],
  colnames(aa_percent)));

[edit] 2.2 Order AA vector by composition

ordered_freqs <- aa_percent[order(aa_percent)]

[edit] 2.3 Order AA vector by AA list order

aa_reorder <- aa_percent[aa_order]

[edit] 2.4 Trim an AA vector

This will trim off B, X, Z, and * from an EMBOSS set of aas:

trim_emboss_aas <- function( aa_list ) {
  return(aa_list[1:20]);
}

[edit] 3 Math

[edit] 3.1 Access an individual item

To access an individual item:

aa_counts["A"]

To get:

      A
1352842

[edit] 3.2 AA percent composition from counts

To get the percentages of each amino acid from a set of counts:

aa_percent <- aa_counts/sum(aa_counts);

[edit] 3.3 Difference between two compositions

aa_diff <- aa_percent_a - aa_percent_b;

[edit] 4 Graphing

[edit] 4.1 Graph a single composition

barplot( aa_freq, main = "Amino acid compositions",
  ylab = "% composition", xlab = "Amino Acids, listed in EMBOSS order",
  names.arg = names(matrix));

[edit] 4.2 Graphing multiple compositions

First, bind several loaded aa compositions into a matrix with the amino acids as the columns and the composition class as the rows, multiplied by 100 to get percent values from the decimals:

# bind the columns together
aa_percents <- cbind(aa_percent_a, aa_percent_b, aa_percent_c);
# give the proper names to the columns if they don't have them already
colnames(aa_percents) <- c("A", "B", "C");
# transpose the matrix and scale the percentages
aa_percents <- t(aa_percents*100);

Now, making a bar graph of the matrix:

# specify three colors for the different classes
bar_colors = c("navyblue", "blue2", "skyblue2");
# make the bar plot
barplot( as.matrix(aa_percents), beside = TRUE, col = bar_colors,
  main = "Amino acid compositions", ylab = "% composition",
  xlab = "Amino Acids, listed in EMBOSS order" );
# make the legend that links the classes to the colors
legend(x = "topleft", rownames(aa_percents), xjust = 0.5, fill=bar_colors,
  ncol = 1, inset = .01, title="Class Key" );
XHTML 1.1 CSS 2 Sec 508