UW Interview Code Samples

First of all, I would like to thank everyone for the pleasant visit I had while interviewing this Monday. I've set up this page for anyone interested in seeing code samples or other work I've done. Since you all have different specialties and interests, I've tried to include something for everyone--the included code spans Java, R, Perl, Bash, C, SQL, PHP, and Python. I have also linked to my thesis and the presentation slides from my thesis defense.

If you have any further requests or questions about anything here, you can contact me at audrakjohnson@gmail.com. I am often signed into Jabber/GTalk with that email, and I can also be reached over AIM as AudraKJohnson.

1 HiTSA and Statgen

To describe the structure and function of microbial communities, microbial ecologists often analyze sequences of microbial genes obtained from environmental samples, usually using the 16S ribosomal gene. These two programs streamline sequence data analysis and presentation for these large-scale studies on the composition and dynamics of microbial communities.

1.1 HiTSA

A flowchart of HiTSA's process.

The High Throughput Sequence Analysis (HiTSA) pipeline identifies and groups closely related sequences. HiTSA first culls incorrect and low-quality sequences. It then compares the valid sequences to those in databases using BLAST, aligns the valid sequences and their best matches using ClustalW, and clusters the sequences by similarity using the neighbor-joining algorithm. HiTSA uses freely available software and databases, and runs under UNIX on either single-processor or cluster computers. Although it was made with the 16S ribosomal gene in mind, it's generalized enough to use with any set of homologous sequences.

HiTSA download
This download contains all of the scripts and reference files needed to run HiTSA, a README, and some example data to run it on. Instructions on making RDP databases for use with HiTSA can be found here.
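
For readers unfamiliar with the clustering step named above, here is a minimal Python sketch of the textbook neighbor-joining algorithm. It is an illustration only and is not taken from the HiTSA download; the function name `neighbor_joining`, the `nodeN` labels for internal nodes, and the dictionary layout of the distance matrix are all assumptions made for this example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree from pairwise distances (textbook neighbor joining).

    names: list of sequence labels (hypothetical input layout).
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Returns a list of (node, node, branch_length) edges.
    """
    nodes = list(names)
    d = dict(dist)  # work on a copy so the caller's matrix is untouched
    edges = []
    next_id = 0

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to every other remaining node.
        r = {a: sum(get(a, b) for b in nodes if b != a) for a in nodes}
        # Join the pair that minimizes the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]],
        )
        u = f"node{next_id}"  # label for the new internal node
        next_id += 1
        dij = get(i, j)
        # Branch lengths from the joined pair to the new node.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        edges.append((i, u, li))
        edges.append((j, u, lj))
        # Distances from the new node to every other remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (get(i, k) + get(j, k) - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    # Connect the last two nodes directly.
    a, b = nodes
    edges.append((a, b, get(a, b)))
    return edges


if __name__ == "__main__":
    labels = ["a", "b", "c", "d", "e"]
    matrix = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    distances = {
        frozenset((labels[x], labels[y])): matrix[x][y]
        for x, y in combinations(range(len(labels)), 2)
    }
    for edge in neighbor_joining(labels, distances):
        print(edge)
```

An unrooted tree over n sequences has 2n - 3 branches, so the five-sequence example prints seven edges. When the input distances are exactly tree-like, as they are here, the path lengths in the resulting tree reproduce the input matrix.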

1.2 Statgen

A labeled screenshot of Statgen.

A second program, Statgen, uses information from a genetic distance matrix and/or a phylogenetic tree, both of which are generated by HiTSA, to summarize the sequence variation within or between groups of sequences. Statgen produces a table with the mean, standard deviation, minimum, maximum, and quartile values for the pairwise genetic distances among the sequences being compared, and a box plot of these values. Statgen is written in Java to run on multiple platforms.

Statgen source
This contains Statgen's source, as well as data and examples. The project is built with ant and is best compiled under Java 1.5. Mac OS X apps can be built by running ant app. All generated JAR and Mac app files will be found in the dist folder.
Statgen JAR
This is a compiled JAR file of the Statgen program.
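
The kind of summary table described above can be sketched in a few lines of Python. This is not Statgen's Java source; it is a minimal illustration that covers only the within-group case. The function name `summarize_distances`, the dictionary layout of the distance matrix, and the choice of the "inclusive" quartile method are assumptions, and Statgen may compute quartiles differently.

```python
from itertools import combinations
from statistics import mean, quantiles, stdev


def summarize_distances(distances, group):
    """Summarize pairwise distances among the sequences in one group.

    distances: dict mapping frozenset({a, b}) -> genetic distance
               (hypothetical layout for this sketch).
    group:     list of sequence names; needs at least three names so that
               there are at least two pairwise distances to summarize.
    """
    values = [distances[frozenset(pair)] for pair in combinations(group, 2)]
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "n_pairs": len(values),
        "mean": mean(values),
        "sd": stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


if __name__ == "__main__":
    # Made-up distances for three sequences, for illustration only.
    example = {
        frozenset(("seq1", "seq2")): 0.02,
        frozenset(("seq1", "seq3")): 0.10,
        frozenset(("seq2", "seq3")): 0.09,
    }
    for key, value in summarize_distances(example, ["seq1", "seq2", "seq3"]).items():
        print(key, value)
```

The five values min, q1, median, q3, and max are also the numbers a box plot draws, so one summary can feed both the table and the graphic.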

2 Thesis Code

Below is some of the code for my thesis, which can be found here. The slides from my defense are also available here--a much shorter overview of the project!

Thesis analysis -- code only
This contains all of the R analysis code for my thesis, along with a couple of Python scripts used to parse AAIndex's XML files into something more easily readable by simple programs. In general, these aren't runnable by themselves--the data they need is not included.
Thesis analysis -- code and data
This is the code together with the data needed to run all of the analyses. Warning: this download is almost 500MB! If you only want to read the code without running it, use the download above.
Thesis functions
Perl functions used in the substitution matrix pipeline; see highlighted version on the wiki here.
Alignment compression
This is a technique I used to save disk space when storing all of the pairwise alignments I made while calculating substitution matrices; code included.
Pairwise (EMBOSS)
This is a modified version of the EMBOSS program needle, designed to do a large number of pairwise comparisons within a set of sequences much faster than running the needle program over and over. EMBOSS is written in C.
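
To illustrate the all-against-all idea behind the Pairwise entry above, here is a small Python sketch that scores every pair in a set of sequences inside a single process. It is not the modified C program: the names `nw_score` and `all_pairs_scores` are hypothetical, and the scoring is deliberately simplified to match/mismatch values with a linear gap penalty, whereas needle uses a substitution matrix and separate gap opening and extension penalties.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty.

    Only the score is computed, so two rows of the table are enough.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def all_pairs_scores(seqs):
    """Score every unordered pair of sequences in one pass.

    seqs: dict mapping sequence name -> sequence string.
    """
    return {
        (x, y): nw_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


if __name__ == "__main__":
    sequences = {"s1": "GATTACA", "s2": "GATCACA", "s3": "GTTACA"}
    for pair, score in all_pairs_scores(sequences).items():
        print(pair, score)
```

A set of n sequences has n(n-1)/2 unordered pairs, so 1,000 sequences already mean 499,500 comparisons. That is why doing them all in one run matters more than it might first appear.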

2.1 MATTs

MATTs is a Java program for making visualizations of substitution matrices. These visualizations help make sense of an otherwise unwieldy mass of numbers--with color coding, it's easy to tell where likely and unlikely substitutions are happening and detect patterns.

A substitution matrix graphic generated by MATTs.
MATTs source
This contains MATTs' source. The project is built with ant and is best compiled under Java 1.5. Mac OS X apps can be built by running ant app. All generated JAR and Mac app files will be found in the dist folder.
MATTs JAR
This is a compiled JAR of the MATTs program. Note that for much of its recent development MATTs has only been tested on Mac OS X, so running it on other platforms might reveal GUI quirks.

3 Wish Submission

This is some non-sensitive code from my recent project with Creature, done in PHP and MySQL. All the files below are syntax highlighted on this page for easier viewing.

Wish submission functions
The functions that drive the site.
Spreadsheet maker
A simple administrative page that generates spreadsheets, using functions in the file listed above.
Wish submission SQL
The SQL to create the wish submission database.

4 Set Solver

This is just a fun little problem I solved to help me learn Python better this summer. The script solves the puzzle card game Set.

Set Solver
Highlighted code on the wiki.
Set Solver download
Folder with script, README, and example files.
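
For anyone who hasn't played Set: each card has four attributes (number, color, shading, and shape), and three cards form a set when, for every attribute, the three values are either all the same or all different. The sketch below shows one brute-force way to find sets in Python. It is not the script in the download above; the tuple encoding of a card and the names `is_set` and `find_sets` are assumptions made for this example.

```python
from itertools import combinations


def is_set(a, b, c):
    """Return True if three cards form a set.

    Each card is a tuple of attribute values, e.g.
    ("two", "red", "striped", "oval") (hypothetical encoding).
    For every attribute the values must be all equal (1 distinct value)
    or all different (3 distinct values); exactly 2 distinct values fails.
    """
    return all(len({x, y, z}) != 2 for x, y, z in zip(a, b, c))


def find_sets(cards):
    """Check every combination of three cards and keep the valid sets."""
    return [trio for trio in combinations(cards, 3) if is_set(*trio)]


if __name__ == "__main__":
    table = [
        ("one", "red", "solid", "oval"),
        ("two", "red", "solid", "oval"),
        ("three", "red", "solid", "oval"),
        ("one", "green", "striped", "diamond"),
    ]
    for trio in find_sets(table):
        print(trio)
```

A standard layout of 12 cards has only 220 three-card combinations, so checking them all is effectively instant and needs no cleverer search.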