Using liftover

The LiftOver tool is essentially a way for you to take a location or set of locations in one build of a genome and map it to a different build or genome. For instance, say you have a locus chr17:1000000-1000100 in human genome build hg17 and you want to know where it resides in the hg18 build. You can use LiftOver for this.

If you have a relatively small number of locations you can use the web version of LiftOver (http://genome.ucsc.edu/cgi-bin/hgLiftOver LiftOver).

Here we’ll use the command line tool to lift over coordinates.

Grab the executable you need from http://hgdownload.cse.ucsc.edu/admin/exe/, browser over to your operating system (for instance linux.x86_64) and download the file liftOver to your local machine. This file is an executable, so you don’t need to compile and it should just work. You’ll probably need to make it executable by changing the permissions

chmod 744 liftOver

The basic synatax is this:

liftOver fileOfOldPositions map.chain fileOfNewPositions fileOfUnmappedPositions

The map.chain files are the ones that allow you to get from one build (or species) to another. These are found on the UCSC sequence and annotation downloads page (http://hgdownload.cse.ucsc.edu/downloads.html#terms) for the organism and genome build of interest. For each species and build, there’s a “LiftOver files” link. Click that to get to the chain files.

The chain files are named as explained below:

The file names reflect the assembly conversion data contained within
in the format To.over.chain.gz. For example, a file named
hg15Tohg16.zip file contains the liftOver data needed to convert hg15
(Human Build 33) coordinates to hg16 (Human Build 34). If no file is
available for the assembly in which you’re interested, please send a
request to the genome mailing list (genome@soe.ucsc.edu) and we will
attempt to provide you with one.

So if you want to liftOver hg17 positions to hg18 positions in human, you’ll want a file named hg17ToHg18.over.chain.gz. Chain files are basically the result of whole genome-to-genome alignments.

Unzip the chain file
gunzip hg17ToHg18.over.chain.gz

You’ll also need your list of positions in BED format, which is basically a tab separated file of positions like so:

chrN startpos endpos

Then all you have to do is run liftOver. Not every position will be found in the newer build because merges, changes or revisions of the build so unmapped locations will be written out to the unmapped file. You can also use gff/gtf files as input. The only option I’ve really messed with is the minMatch parameter. For human to human, the recommended setting is 0.95 (95% of bases in your given interval match). For cross-species, say human to mouse, I commonly used 0.10 (10%) which is the recommended setting.

merging data frames

Great resource here on merging data frames:
http://rwiki.sciviews.org/doku.php?id=tips%3adata-frames%3amerge

Suppose you had two data frames, df1 and df2, each with an “id” column. To merge we just
merge(df1, df2, by.x="id", by.y="id", all=TRUE)
The all=TRUE keeps all rows of both data frames.

You can also do a cross join using data frames. Another way to think of this is that one data frame is a look up table for another. This will create all of df1 with the values column filled in from df2.

df1 = data.frame(id=c(1,1,2,3,3,3,4,4))
df2 = data.frame(id=c(1,2,3,4), value=c("blue", "red", "green", "white"))
merge(df1, df2)

There’s also the Reduce function. Basically it takes two arguments at a time from a longer list and processes them from left to right and results in a single value.

It’s syntax
Reduce( function, list, arguments)

Reduce(function(x,y), merge(x, y, all=T, by.x="var1", by.y="var2"), my.list, accumulate=F)

The reduce function is pretty cool and works with lists!
Reduce(sum, 1:10))
Reduce(sum, list(1:10))

You can learn more about functional programming ?funprog