Working with amino acid compositions in R
Contents |
[edit] 1 Loading data
[edit] 1.1 Tab delineated file
A simple tab delineated file:
A 1352842 R 684674 N 607751 D 842419 C 191133 Q 468387 E 857909 G 1212549 H 322473 I 834462 L 1202104 K 796555 M 292392 F 505608 P 605857 S 826026 T 806058 W 159438 Y 420648 V 1085109
Amino acids in the first column, their counts in the second.
aa_counts <- read.table(file="aa_counts.txt", row.names = 1, col.names=c("","count"));
As a function:
read_aa_table <- function ( file, colname = "count" ) { aa_table <- read.table(file = file, row.names = 1, col.names=c("",colname), comment.char = '#'); return(aa_table); } # Example aa_counts <- read_aa_table <- function( file = "FILENAME" );
You can use a class name for col.names if you want something other than "count". That way, later on when you bind different count compositions into matrices, you don't have to manually set the column names.
aa_counts will now be a list that looks like:
count A 1352842 R 684674 N 607751 D 842419 C 191133 Q 468387 E 857909 G 1212549 H 322473 I 834462 L 1202104 K 796555 M 292392 F 505608 P 605857 S 826026 T 806058 W 159438 Y 420648 V 1085109
This can also be done with aa files that contain percentages:
# Example aa_counts <- read_aa_table <- function( file = "FILENAME", colname = "percent" );
[edit] 1.1.1 Load as a vector instead of a matrix
read_aa_list <- function ( file, colname = "count" ) { aa_table <- read.table(file = file, row.names = 1, col.names=c("",colname), comment.char = '#'); aa_vector <- aa_table[,colname] names(aa_vector) <- row.names(aa_table) return(aa_vector); }
[edit] 1.2 DARWIN array
tr is installed.
For amino acid lists, counts or percentages, made by using lprint in the DARWIN environment (Format: [#, #, ... #];). It will only read in as many numbers as exist in the aa_order given, default 20.
# Reads in a DARWIN array made using # OpenWriting('filename'); # lprint(aa_freqs); # OpenWriting('terminal'); read_darwin_aas <- function(filename, aa_order = c( 'A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V' )) { num_aas = length(aa_order); # filters out unneeded characters from the file, replacing them # with white space that will then get stripped out with the Unix # command 'tr' file_filter <- paste("tr '[],;[:alpha:]:=' ' ' < ", filename) # scans in all the numbers of the file input <- pipe(file_filter); darwin_aas <- scan(input, what=double(0), nmax=num_aas, strip.white=TRUE, quiet=TRUE, comment.char='#'); close(input); # makes a new matrix from those numbers darwin_aas <- matrix( darwin_aas, num_aas, 1, dimnames=list( aa_order, "percent")); return(darwin_aas); } #Example mmatrix <- read_darwin_aas("filename");
[edit] 2 Manipulation
[edit] 2.1 Order AA matrix by aa composition
Get the order of the amino acids according to their composition:
# ascending order order <- row.names(aa_percent)[order(aa_percent[,1])]; # descending order order <- row.names(aa_percent)[order(aa_percent[,1]), decreasing = TRUE)];
Get a matrix using that order:
ordered_freqs <- matrix( aa_percent[order(aa_percent[,1]),], length(aa_percent[,1]), 1, dimnames=list( row.names(aa_percent)[order(aa_percent[,"percent"])], colnames(aa_percent)));
[edit] 2.2 Order AA vector by composition
ordered_freqs <- aa_percent[order(aa_percent)]
[edit] 2.3 Order AA vector by AA list order
aa_reorder <- aa_percent[aa_order]
[edit] 2.4 Trim an AA vector
This will trim off B, X, Z, and * from an EMBOSS set of aas:
trim_emboss_aas <- function( aa_list ) { return(aa_list[1:20]); }
[edit] 3 Math
[edit] 3.1 Access an individual item
To access an individual item:
aa_counts["A"]
To get:
A 1352842
[edit] 3.2 AA percent composition from counts
To get the percentages of each amino acid from a set of counts:
aa_percent <- aa_counts/sum(aa_counts);
[edit] 3.3 Difference between two compositions
aa_diff <- aa_percent_a - aa_percent_b;
[edit] 4 Graphing
[edit] 4.1 Graph a single composition
barplot( aa_freq, main = "Amino acid compositions", ylab = "% composition", xlab = "Amino Acids, listed in EMBOSS order", names.arg = names(matrix));
[edit] 4.2 Graphing multiple compositions
First, bind several loaded aa compositions into a matrix with the amino acids as the columns and the composition class as the rows, multiplied by 100 to get percent values from the decimals:
# bind the columns together aa_percents <- cbind(aa_percent_a, aa_percent_b, aa_percent_c); # give the proper names to the columns if they don't have them already colnames(aa_percents) <- c("A", "B", "C"); # transpose the matrix and scale the percentages aa_percents <- t(aa_percents*100);
Now, making a bar graph of the matrix:
# specify three colors for the different classes bar_colors = c("navyblue", "blue2", "skyblue2"); # make the bar plot barplot( as.matrix(aa_percents), beside = TRUE, col = bar_colors, main = "Amino acid compositions", ylab = "% composition", xlab = "Amino Acids, listed in EMBOSS order" ); # make the legend that links the classes to the colors legend(x = "topleft", rownames(aa_percents), xjust = 0.5, fill=bar_colors, ncol = 1, inset = .01, title="Class Key" );