UW Interview Code Samples
First of all, I would like to thank everyone for the pleasant visit I had while interviewing this Monday. I've set up this page for people interested in seeing code samples or other work I've done. Since you all have different specialties and interests, I've tried to include something for everyone: the included code spans Java, R, Perl, Bash, C, SQL, PHP, and Python. I have also linked my thesis and the presentation slides from my thesis defense.
If you have any further requests or questions about anything here, you can contact me at audrakjohnson@gmail.com. I am often signed into Jabber/GTalk with that email, and I can also be reached over AIM as AudraKJohnson.
1 HiTSA and Statgen
To describe the structure and function of microbial communities, microbial ecologists often analyze sequences of microbial genes obtained from environmental samples, usually using the 16S ribosomal gene. These two programs streamline sequence data analysis and presentation for these large-scale studies on the composition and dynamics of microbial communities.
1.1 HiTSA
The High Throughput Sequence Analysis (HiTSA) pipeline identifies and groups closely related sequences. HiTSA first culls incorrect and low-quality sequences. It then compares the remaining valid sequences to those in databases using BLAST, aligns the valid sequences and their best matches using ClustalW, and clusters the sequences by similarity using the neighbor-joining algorithm. HiTSA uses freely available software and databases, and runs under UNIX on either single-processor or cluster computers. Although it was written with the 16S ribosomal gene in mind, it is general enough to use with any set of homologous sequences.
- HiTSA download
- This download contains all of the scripts and reference files needed to run HiTSA, a README, and some example data to run it on. Instructions on making RDP databases for use with HiTSA can be found here.
1.2 Statgen
A second program, Statgen, uses information from a genetic distance matrix and/or a phylogenetic tree, both of which are generated by HiTSA, to summarize the sequence variation within or between groups of sequences. Statgen produces a table with the mean, standard deviation, minimum, maximum, and quartile values for the pairwise genetic distances among the sequences being compared, and a box plot of these values. Statgen is written in Java so that it runs on multiple platforms.
- Statgen source
- This contains Statgen's source, as well as data and examples. The project is managed by ant and is best compiled under Java 1.5. Mac OS X apps can be made by running ant app. All generated JAR and Mac app files will be found in the dist folder.
- Statgen JAR
- This is a compiled JAR file of the Statgen program.
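The kind of summary table described above can be sketched in a few lines of Python. This is only an illustration, not Statgen's Java code: the function names, the list-of-lists matrix layout, and the choice of quartile method are all assumptions made for the example.

```python
from itertools import combinations
import statistics


def within_group_distances(labels, matrix, group):
    """Collect the distance for every unordered pair of sequences in `group`.

    `labels` names the rows/columns of the symmetric `matrix` (hypothetical layout).
    """
    index = {label: i for i, label in enumerate(labels)}
    return [matrix[index[a]][index[b]] for a, b in combinations(group, 2)]


def summarize_distances(distances):
    """Return mean, standard deviation, min, max, and quartiles of the distances."""
    values = sorted(distances)
    # The "inclusive" quartile method is an assumption; other conventions exist.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": values[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values[-1],
    }


if __name__ == "__main__":
    labels = ["a", "b", "c", "d"]
    matrix = [
        [0.0, 0.1, 0.2, 0.5],
        [0.1, 0.0, 0.3, 0.6],
        [0.2, 0.3, 0.0, 0.7],
        [0.5, 0.6, 0.7, 0.0],
    ]
    distances = within_group_distances(labels, matrix, ["a", "b", "c"])
    for name, value in summarize_distances(distances).items():
        print(f"{name}\t{value:.3f}")
```

The five values min, q1, median, q3, and max are the same ones a box plot draws, so one summary can feed both the table and the graphic.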
2 Thesis Code
Below is some of the code for my thesis, which can be found here. The slides from my defense are also available here--a much shorter overview of the project!
- Thesis analysis -- code only
- This contains all of the R analysis code for my thesis, along with a couple of Python scripts used to parse AAIndex's XML files into something more easily readable by simple programs. In general, these aren't runnable by themselves--the data they need is not included.
- Thesis analysis -- code and data
- This contains the code along with the data needed to run all of the analyses. Warning: this download is almost 500MB! If you only want to see the code without trying it, use the above download.
- Thesis functions
- Perl functions used in the substitution matrix pipeline; see highlighted version on the wiki here.
- Alignment compression
- This is a technique, with code, that I used to reduce the disk space taken up by all of the pairwise alignments I made while calculating substitution matrices.
- Pairwise (EMBOSS)
- This is a modified version of the EMBOSS program needle, designed to do large numbers of pairwise comparisons within a set of sequences much faster than running the needle program over and over. EMBOSS is written in C.
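The general idea of scoring every pair of sequences in a single run, instead of launching one program per pair, can be sketched as follows. This is not the modified EMBOSS code: it is a minimal Python illustration of Needleman-Wunsch global alignment scoring (the algorithm needle implements), and it assumes a simple match/mismatch score with a linear gap penalty rather than a substitution matrix with affine gaps. All names and default scores are hypothetical.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping only two rows in memory."""
    # Row 0: aligning an empty prefix of `a` against each prefix of `b` costs only gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        current = [i * gap]  # column 0: prefix of `a` against an empty prefix of `b`
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def all_pairs_scores(sequences):
    """Score every unordered pair in `sequences` (a dict of name -> sequence)."""
    return {
        (first, second): global_alignment_score(sequences[first], sequences[second])
        for first, second in combinations(sorted(sequences), 2)
    }


if __name__ == "__main__":
    example = {"x": "GATTACA", "y": "GATTTCA", "z": "GCATGCA"}
    for pair, score in all_pairs_scores(example).items():
        print(pair, score)
```

Looping over the pairs inside one process means the sequences are read once and no per-pair startup cost is paid.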
2.1 MATTs
MATTs is a Java program for making visualizations of substitution matrices. These visualizations help make sense of an otherwise unwieldy mass of numbers--with color coding, it's easy to tell where likely and unlikely substitutions are happening and detect patterns.
- MATTs source
- This contains MATTs' source. The project is managed by ant and is best compiled under Java 1.5. Mac OS X apps can be made by running ant app. All generated JAR and Mac app files will be found in the dist folder.
- MATTs JAR
- This is a compiled JAR of the MATTs program. Note that for much of its recent development MATTs has only been tested on Mac OS X, so running it on other platforms might reveal GUI quirks.
3 Wish Submission
This is some non-sensitive code from my recent project with Creature, done in PHP and MySQL. All the files below are syntax highlighted on this page for easier viewing.
- Wish submission functions
- The functions that drive the site.
- Spreadsheet maker
- A simple administrative page that generates spreadsheets, using functions in the file listed above.
- Wish submission SQL
- The SQL to create the wish submission database.
4 Set Solver
This is just a fun little problem I solved to help me learn Python better this summer. The script solves the puzzle card game Set.
- Set Solver
- Highlighted code on the wiki.
- Set Solver download
- Folder with script, README, and example files.
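For readers who don't know the game, the core check is small enough to sketch here. This is a minimal illustration written for this page, not the downloadable script: the card representation (a tuple of four attribute values) and the function names are assumptions. Three cards form a set when, for every attribute, the three values are either all the same or all different.

```python
from itertools import combinations


def is_set(a, b, c):
    """True if, for each attribute, the three cards all match or all differ."""
    # Exactly two distinct values means two cards match and one doesn't: not a set.
    return all(len({x, y, z}) != 2 for x, y, z in zip(a, b, c))


def find_sets(cards):
    """Return every three-card combination on the table that forms a set."""
    return [trio for trio in combinations(cards, 3) if is_set(*trio)]


if __name__ == "__main__":
    # Each card: (number, color, shading, shape)
    table = [
        (1, "red", "solid", "oval"),
        (2, "red", "solid", "oval"),
        (3, "red", "solid", "oval"),
        (1, "green", "striped", "diamond"),
    ]
    for trio in find_sets(table):
        print(trio)
```

Checking every combination is fine at this scale: a standard 12-card layout has only 220 possible trios.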