Welcome to Alternate Allele

It seemed fitting to spend some time this Thanksgiving holiday to finally write a post about the launch of Alternate Allele Consulting.  We officially opened doors on September 1, 2012 and haven't slowed down since.  Like most small businesses, we've relied on friends, family and our extended networks to get things off the ground.  I wanted to take a moment to recognize those folks now.

Alternate Allele's guiding principal is to do useful work and to do it thoughtfully and honestly.  This blog reflects that mentality.  We won't post unless we have something useful to say, which means we might not post that often.  We hope that hearing about the decisions and people who helped us get started might help those considering a similar endeavor.

Starting a company (particularly in California) comes with a lot of paperwork and as is true in most endeavors, it helps to have a very good lawyer.  We were fortunate to have the counsel of a long-time college friend Joel Westbrook, of Miles and Westbrook.  We were similarly fortunate with our accountant, hiring family-friend Jim Crofoot at Crofoot Accountancy Corp to get us through the tax maze.

Curtis works his magic.

Curtis works his magic.

For our logo design, we used 99designs to crowd-source a design.  Hasto Anggoro designed the winning logo and business cards.  Keeping with our "be useful" mantra, we printed an IUPAC table on the back of our cards and Hasto made it look like art.  Speaking of art, Curtis Kautzer is responsible for the awesome labware photos that you see across the site.  We are to blame for any clipping and filtering of the photos - sorry Curtis.  

Finally, we have to give credit to Katie Kong of Kong and Associates and Andrew Boudreau of Gentelligent - two good friends who also struck out on their own.  Their sage advice and encouragement convinced us that Alternate Allele was an idea worth pursuing.  Also, our thanks to 5AM Solutions where J learned the ins and outs of the consulting business.  We still actively collaborate with 5AM and recommend them highly for developing rock solid applications for the life sciences.  

Some of the tools we've used to launch Alternate Allele:

Genome Coordinate Cheat Sheet

The following table is a quick cheat sheet identifying the genomic coordinate conventions used at different sites, file formats, etc.  I welcome corrections and additions in the comments. I'll add them to the companion sheet as time allows. A more detailed description of genome coordinate systems can be found in my earlier post. Note that in this table, I consider zero-based, half-open coordinate conventions to be equivalent to space-counted, zero-start and so do not distinguish them in the table.

Name Resource Type Chromosome 0 vs 1 Space vs Base Notes
UCSC Genome Browser Genome Browser chr1, chr2, .. chrX, chrY, chrM 1 Base Note that when zooming in on the genome browser, the positioning of the tick marks appears to use a space-counted, 0-start convention. As discussed here, this is not the intention. To get the 1-based position, the base coordinate corresponds to the tick mark to its immediate right.
NCBI Map Viewer Genome Browser 1, 2, .. X, Y, MT 1 Base
Ensembl Location Viewer Genome Browser 1, 2, .. X, Y, MT 1 Base
UCSC BLAT Web Tool chr1, chr2, .. chrX, chrY, chrM 1 Base
NCBI BLAST Web Tool chr1, chr2, .. chrX, chrY, chrMT 1 Base
UCSC Table Browser Web Tool chr1, chr2, .. chrX, chrY, chrM 0 Space Output formats are all 0-space formats. However, when specifying the region field is 1-base format.
BED Format File Format chr1, chr2, .. chrX, chrY, chrM 0 Space
WIG Format File Format chr1, chr2, .. chrX, chrY, chrM 1 Base UCSC's Wiggle Track Format
Galaxy Interval Format File Format chr1, chr2, .. chrX, chrY, chrM 0 Space
GFF/GTF/GFF3 Format File Format Depends on context 1 Base Chromosome names depend on resource using the file format
VCF Format File Format Depends on context 1 Base Chromosome just needs to refer to an identifier in a reference genome or can be a contig id. Position 0 and N+1 (where N = chrom length) are used to refer to telomeres.
UCSC Annotation Files Data File chr1, chr2, .. chrX, chrY, chrM 0 Space
Special thanks to John DiCarlo's insights and suggestions!

Genome Coordinate Conventions

Denoting a contiguous region in a reference genome seems straightforward enough. However, I’ve seen many a bioinformatician get tripped up by the different conventions and formats used on web sites, file formats and genomic tools. Here are some things to keep in mind.

Chromosome

Taking the human genome as our example, chromosome names may or may not use a prefix so chromosome 12 may be “chr12”, “ch12” or simply “12”. Also, the mitochondrial chromosome may be “MT” or simply “M” and the sex chromosomes might be denoted as “X” and “Y” or “23” and “24”.

Strand

Typical conventions for denoting the plus vs minus strand orientation of a sequence include using “1” and “-1” or “+” and “-“.  It is worth noting that for most of the genomic resources I use, genomic positions are almost always given relative to the plus strand even if the feature is said to be on the minus strand.  For example, BLAT-ing a minus strand sequence against the human genome at UCSC will return a search result with "-" strand but the start and stop positions will be given relative to the plus strand.  Minus strand positions would start counting from 0 at the opposite end of the chromosome (q-arm).

Position

Several conventions exist which differ in (1) what number they start with when numbering bases, (2) whether they number the bases themselves or the spaces between bases and (3) whether the interval is considered "open" or "closed".  The notion of "closed" and "open" intervals is a mathematical concept which in this context implies either that the start and stop of the interval should be included (a closed interval) or they should not be included (an open interval).  Sometimes you will see square brackets used to denote closed intervals (e.g. [12,20]) and parentheses used to denote open intervals (e.g. (12,20)).

Below are examples of three of the more common conventions. In each case I show the coordinates of an ATG subsequence (in red) and a cut site (marked by a red triangle).

Base-counted, one-start (a.k.a. one-based, fully-closed)

ATG location: 7-9 or [7,9]
Cut site: 11^12 or (11,12)
Interval length = stop - start + 1
Notes:
  • This is by far the most common convention used by the major genome browsers, tools like BLAST and BLAT, etc.
  • Using a base-counted system is problematic for describing features that occur between bases such as insertions or enzyme cut sites.  To deal with this, some conventions replace the "-" usually used to separate the start and stop position with an alternative notation such as "^".  Alternatively, one could use the parentheses notation. 
  • Base-counted systems can short-hand a one base interval with just the start (or stop) location.  This is useful for denoting the location of SNPs, for example.


Space-counted, zero-start

ATG location: 6-9
Cut site: 11-11
Interval length = stop - start


Notes:
  • Less common, this convention is attractive because of its natural way of denoting features between bases such as insertions, etc.  Length calculations are also simple.  


Zero-based, half-open

ATG location: 6-9 or [6,9)
Cut site: 11-11 or [11,11)
Interval length = stop - start


Notes:
  • In this convention, the start base is included in the interval but not the stop base.
  • This convention is used in data formats (especially at UCSC) such as BED.
  • Although conceptually different, space-counted, zero-start and zero-based, half-open give the same start and stop coordinates for intervals.
  • Python programmers will find this convention familiar as Python indexes arrays in the same manner.


Conclusions

As you can see, the differences between conventions are subtle, which makes it very easy to make undetected errors. In my experience, programmers prefer to use one of the zero-start conventions since most computer languages use zero-based indexing where as biologists are more fond of the one-start conventions. This commonly leads to a juggling of conventions where software using a zero-start convention is switched to one-start for displaying results to the user. The UCSC Genome Browser is a good example of a site with straddling conventions between the user interface and the underlying data tables and it must cause some confusion because they've devoted a FAQ entry to the issue.

My personal favorite convention is the space-counted, zero-start convention. All intervals require a start and stop location (even for single base features like SNPs), which makes up for in consistency what it might lack in conciseness. Denoting a location between two bases, such as for an insertion or enzyme cut site, is conceptually clear and does not require syntatic trickery like the base-counted methods. My understanding is that one of the many differences between the public and Celera human genome sequencing efforts was the public effort used base-counted, one-start while Celera used space-counted, zero-start although I don't have a reference to confirm this.

I haven't touched on the coordinate conventions used in the human variant world which give positions relative to cDNA. The HGVS variant nomenclature is a good example of this.  Among the interesting features of this convention are starting counting (at 1) with the A of the ATG initiation codon, using negative coordinates and skipping the 0 base altogether. Good times.

Until our evil bioinformatics overlords descend upon us and force us to standardize to one system, I've started a cheat sheet to help me remember who's using what convention. Please have a look, correct me where I'm wrong and point out what I'm missing.

References

  • A great blog post from the Bergman lab dealing with genome coordinate conventions with an emphasis on transposable elements annotation.
  • UCSC's description of the zero-based, half-open convention.

Switching Between One and Three-Letter Amino Acid Codes


Below is a simple tool for switching between one and three letter amino acid codes in the context of a sequence variant. Enter a protein sequence variant such as A79P or NP_000305.3:p.Ala126Pro, select whether a 3 to 1 or a 1 to 3 translation is desired and then hit the submit button. Variants that adhere to HGVS nomenclature standards should work.
Variants
Translation Type




DIY Bioinformatics: A Whole New Galaxy

This blog post was first published by my company on June 16, 2011

Inspired by Will’s recent

book review

of

Biopunk

and the

analysis crowd sourcing

of the European

E. Coli

outbreak, I thought I would take another look at DIY (Do It Yourself) biology in this week’s post. Unlike some, I have no interest in trying to run molecular biology experiments out of my kitchen. As anyone who has had the misfortune of trying my cooking would tell you, if there is a way to make a PCR reaction dry and tasteless I'm sure I'd find it. DIY bioinformatics I find more intriguing. I'm not a practitioner as I'm too busy with PSTDIFY (Pay Somebody To Do It For You) bioinformatics, but I like the vision of the lone, amateur scientist, sitting amongst a pile of empty pizza boxes and Red Bull cans finding unknown biological treasure with just their laptop, curiosity and some serendipity. This vision is not so unlikely. Large biological data sets are readily available including thousands of

microarray experiments

,

genotypes

and even

full genomes

. Someone modestly adept at programming or a package like R can interrogate, correlate and mine this data - and indeed this is happening all the time. What about the true amateur, however, who even lacks programming skills? Can the Excel Warrior or my web savvy grandma participate in their own DIY bioinformatics adventure? That’s what I set out to discover this week. As a test, I went back to a favorite paper of mine by Majewski and Ott (

Genome Research, 2004

). What I like about the paper is the number of insights made simply through careful mining of genomic databases. For example, even with inherently noisy data sets like

dbSNP

and the annotated human genome, the authors were able to clearly see the extent and importance (for splice regulation) of sites near exon-intron boundaries simply by looking at the overall frequency of SNPs discovered in these positions compared to other sites. This figure (F2) from the paper shows the low SNP frequency in the immediate 5' and 3' positions of the intron where it meets the neighboring exons.

My test was to see if I could reproduce at least a part of this analysis by simply using free public tools and without programming. I settled on the web-based analysis tool

Galaxy

as it seemed to have a lot of the functionality I would need and I wasn’t very familiar with it - making me a better stand in for the Red Bull-intoxicated amateur scientist. After some time poking around, I settled on these steps in Galaxy:

  1. Get introns from chromosome 12 via UCSC’s Table Browser (I just did chromosome 12 to keep my data sets manageable for this example).
  2. Get all SNPs from chromosome 12
  3. Join the introns and SNPs producing a table of only those SNPs that fall within an intron
  4. Calculate the position of the SNP relative to the 5’ end of the intron
  5. Count up number of SNPs found at each 5’ position
  6. Sort results by position (probably not necessary)
  7. Limit results to just positions within 50 bp of the exon-intron boundary
  8. Plot the SNP frequency vs SNP location

For more detail, you can see my workflow

here

. And this was my final result:

This corresponds pretty well to the left portion of Majewski and Ott's own intron plot and was finished before cracking my second can of Red Bull. Score one for DIY! Before I quit my day (and night) job to make room for the waves of empowered amateurs, it's worth pointing out a few minor details. First, this rudimentary analysis glosses over many important details (such as normalizing for intron lengths) and any publication ready workflow would be much more complex. Second, like all tools of this type, Galaxy walks a fine line to balance functionality and usability. It took me quite a bit of exploring to find the right functions and many of these functions probably only made sense to me because I knew a lot about programming, databases, genomics, etc. Third, it's near impossible to match the power, speed and flexibility a programmer has to analyze data with a web based tool like Galaxy. And finally, although I am empowered by Galaxy to do the steps, the know-how of what questions to ask and the science to understand the observations I make comes from many years of experience - unfortunately there's no short cut around that. With that said, Galaxy has some very nice features and is a powerful addition to the DIY's tool box. Stock up on your Red Bull now.

Excel Hacks: Calculating Oligo Temperature with the Wallace Rule

Calculating oligonucleotide melting temperatures using the Wallace-Itakura formula (2*(number of A's and T's) + 4*(number of G's and C's)) in Excel is a piece of cake. I rarely use it myself as I typically use nearest neighbor with SantaLucia thermodynamic parameters for oligo design, but on occasion it does come up. Assuming your oligo sequence is in A1, this formula will calculate the Wallace TM:
=2*(LEN(A1)-LEN(SUBSTITUTE(SUBSTITUTE(A1,"A",""),"T","")))+4*(LEN(A1)-LEN(SUBSTITUTE(SUBSTITUTE(A1,"G",""),"C","")))
The formula counts the numbers of A's and T's by making a new oligo sequence with A's and T's removed. It then compares the length of the original oligo to the new one - the difference is the number of A's and T's. Counting G's and C's is done in the same manner. Any base other than A,C, G or T will be ignored in the calculation.

Excel Hacks: Reverse Complement a Nucleotide Sequence

Reverse complementing a nucleotide sequence in Excel turns out to be really ugly - in fact, this may be one case where it's better to do some cut and pasting into a web tool or finally go pick up that Perl book that's been sitting on your shelf. However, if you’re desperate and absolutely have to do it in Excel, here's how you can reverse complement short (~30 base) nucleotide sequences. 



Complementing


Let's break reverse complementing into its two steps - reversing the DNA sequence and then taking its complement. We'll start with complementing:

= SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( SUBSTITUTE( A1, "A", 1), "C", 2), "G", 3),"T", 4), 1, "T"), 2, "G"), 3, "C"), 4, "A")

As in previous posts, we assume the sequence to be complemented is in cell A1.

This rather lengthy formula is performing a two-step substitution. First, nucleotides are converted to numbers, e.g. “A” to “1”, “C” to “2”, etc. Next, the numbers are then replaced with the complementary nucleotide, so “1” goes to “T”, “2” to “G” and so on. Doing the substitution in two steps allows us to distinguish which bases have been complemented from those that have not. The formula works correctly for nucleotides “A”, “C”, “G”, “T” and “N”. Other degenerate IUPAC codes could be handled, but we’ll leave this as an exercise for the reader.

Reversing



Surprisingly, reversing the DNA sequence turns out to be even trickier. Most programming languages have a function to reverse text, but not Excel. Breaking out a macro in Excel might be worthwhile since writing a little Visual Basic to reverse text isn't too bad. Unfortunately, this is not possible for folks with Macs using Office 2008 since macro functionality was removed.

For a general method to reverse our sequence, our options are limited and ugly. Perhaps one of the less awful methods is repeated use of the MID function. Assuming that our complemented nucleotide sequence is in cell B1, we would do something like this:

= MID(B1,30,1) & MID(B1,29,1) & MID(B1,28,1) & MID(B1,27,1) & MID(B1,26,1) & MID(B1,25,1) & MID(B1,24,1) & MID(B1,23,1) & MID(B1,22,1) & MID(B1,21,1) & MID(B1,20,1) & MID(B1,19,1) & MID(B1,18,1) & MID(B1,17,1) & MID(B1,16,1) & MID(B1,15,1) & MID(B1,14,1) & MID(B1,13,1) & MID(B1,12,1) & MID(B1,11,1) & MID(B1,10,1) & MID(B1,9,1) & MID(B1,8,1) & MID(B1,7,1) & MID(B1,6,1) & MID(B1,5,1) & MID(B1,4,1) & MID(B1,3,1) & MID(B1,2,1) & MID(B1,1,1)

The formula constructs the reverse sequence base-by-base up to 30 bases. Make certain that your sequence to reverse is 30 or fewer nucleotides, otherwise your sequence will be truncated without warning. You could extend the function to handle larger sequences, but clearly this method is only practical for short sequences.

Excel Hacks: Calculating GC%

As a quick hack, you can calculate the GC% of a DNA sequence in Excel without breaking out the macros. Here’s how you do it assuming your sequence is in Excel cell A1.

= (1 - LEN(SUBSTITUTE(SUBSTITUTE(A1,"G",""),"C",""))/LEN(A1)) * 100

Although the formula is a bit daunting, the principle is straightforward. First, use the SUBSTITUTE command to get a version of your sequence with all G’s and C’s removed. Determine the length of this new sequence using the LEN command and compare this to the length of the original sequence. This gives you the fraction of sequence that is NOT a G or a C. To get the GC%, subtract this fraction from 1 and multiply by 100% to go from a fraction to a percent.

This will probably satisfy most folks, but it’s not perfect. For an accurate calculation, we need to take into account ambiguous IUPAC base codes. For example, a "N" should be treated as 0.5 of a G-C base. A "D" should count as 0.33. Although it’s possible to extend the hack above to handle these cases, it might be a good time to break out Perl or Python.

Excel Hacks: Plate Well to Number Conversions

Here’s a simple but useful transformation for going from a well “address” on a plate (e.g. “B5”) to a numbered location. For a 96 well plate, the numbered location will be a number from 1 to 96. Typically, this numbering goes first from left to right across the 12 columns and then top to bottom down the 8 rows. So, A1 would map to 1, B1 would map to well 13 and H12 would map to 96.

Let's assume the plate address you want to convert is in the Excel cell A1. The conversion is done in Excel as shown below:

= (CODE(LEFT(A1,1))-65)*12 + RIGHT(A1,LEN(A1)-1)

The formula assumes the well format is one character denoting the row followed by one or more integers which denote the column. The left half of the formula before the '+' converts the character into an integer using the CODE function. Since "A" has a code of 65, we subtract this off before multiplying by the row length. The portion of the formula after the '+' simply extracts the integer portion of the well address and adds it to get the well number. The formula works with padded (A01)and unpadded (A1) representations.

To go back from a numbered address to the letter-number format, use the following:

= CHAR(INT((A1-1)/12)+65) & (MOD(A1-1, 12)+1)

The portion of the formula left of the '&' handles the row calculation. The CHAR function performs the opposite duty of CODE, taking a number and converting it back to a character. The portion to the right of the '&' calculates the integer (column) portion of the well address.

It's quite easy to do 384-well or other formats. Just replace the red 12 with the correct number of columns. In the case of a 384-well format, you'd replace it with 24.

It is less common to see plate wells converted into numbers from top to bottom and then left to right, although it does happen. In this case, B1 maps to well 2 and H2 to 16 for a 96 well plate.

The conversion from plate well to number in this case is:

= (CODE(LEFT(A1,1))-65) + 1 + (RIGHT(A1,LEN(A1)-1)-1)*8

And to go back:

= CHAR(MOD((A1-1),8)+65) & (INT((A1-1)/8)+1)

Note that instead of using the number of columns (12), we're now using the number of rows (8). It's left as an exercise to the student to tweak the formula for 384-well formatted plates ;)



Identity Crisis: How to Pick Good Identifiers

What makes a good identifier? It seems like a simple question, but making the wrong decision can create mistakes that you'll live with for a long time. Unlike other tasks in bioinformatics, choosing how to identify reagents, samples, microarrays, etc is one area where you'll have lots of opinions and lots of "help". For example, in the lab I've often had scientists want to cram sample names, day of the week, batch number, favorite flavor of ice cream, etc all into the name used to identify a tube or plate. The only limit to their imagination was label real estate and font size.

When you get into a situation of identifier overload, it's good to take a step back and re-evaluate. What's the primary function of an identifier? By its very definition, an identifier should be a unique and unambiguous way to identify something (in math speak, this is called a 1-to-1 mapping). When I give you an identifier you should know exactly what I'm referring to - that's the unambiguity. On the flip side, I shouldn't have a whole bunch of identifiers referring to the same thing - that's the uniqueness.

It's not hard to find identifiers in genomics that fail to live up to these properties. Entrez gene identifiers are simply integers, which makes these identifiers completely ambiguous without context. Genes also often go by several names which breaks the uniqueness and is why you should stick to official HUGO nomenclature. Although Genbank sequence identifiers include a version number (the part of the name after the decimal point), many people neglect it and the identifier becomes ambiguous. Mutations have historically been identified in a format such as E7V. Again, without context it's impossible to know which gene sequence is being referenced or sometimes even whether the mutations are denoting a nucleotide or amino acid change. The Human Genome Variation Society is trying to replace this last system with a new nomenclature. It couldn't come too soon.

How about in your own lab? The most common issue with home grown identifiers is the unintended breaking of the uniqueness or the unambiguity rule due to identifier overloading. "Identifier overloading" is when an identifier is layered with added information beyond what's needed for unique and unambiguous identification. There's almost always a good reason - "I really want the label to contain the date that it's run" or "using the sample name as the name for my microarray will be convenient". This all sounds good, but inevitably an unforeseen circumstance leads to a break down in the system. What if an array needs to be run again? What if a label was generated on one date but not run until the next day?

As a concrete example, I once worked with a group that used an internal, automatically generated database primary key as an identifier for reagents in the lab. This worked great until a software upgrade forced a dump and re-population of the database. The internal database identifier no longer matched the labels pasted on tubes in the lab. The identifiers' double function of reagent name and internal database key were at odds.

So, how do you come up with a good identifier? As a bioinformatician, I know that having unique, unambiguous identifiers is of the utmost importance and the best way to achieve this is through a simple, incremental naming system with no additional encumbrances. As a pragmatist, I know that an identifier that is this anonymous requires the user to do some sort of look-up to know anything useful, and until augmented reality becomes mainstream, this will be a problem.

In general, I try to stick to the follow rules:
  1. Begin with a short prefix (2-5 letters). The sole purpose of the prefix is to clearly distinguish what type of object is being identified. Examples in the wild include "rs" (SNPs in dbSNP), "ENST" (transcript at Ensembl), etc
  2. Follow with a simple, incremental integer. Do not pad with 0's to make the identifiers a constant width. It may look pretty, but people are bad at counting 0's and you've capped the number of identifiers at your disposal. Do not get into a habit of trying to always start a new day's experiment at a thousand or similar. You're starting to overload your identifier.
  3. If you must use versioning and won't settle for just taking the next available ID, do it just once using a simple decimal point followed by an integer. Decide to do this from the start even if it's not obvious you need it.
  4. If your user demands more, make a deal with them. Stick with identifiers as above, but where space, time and convenience allow, let the user to come up with a small amount of information that will make their lives easier. Include this when labels are printed, identifiers are shown on web pages, etc. The key here is the user's data doesn't have to be unique or even well thought out. It's your simple ID that's important.
Have your own rules? Would love to hear them!