Accessing biomaRt from R

A good use for Ensembl’s BioMart is pulling a list of gene names linked to Ensembl transcript IDs, as when using Sleuth.

You will need Bioconductor and the biomaRt library if you don’t have them already.
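The standard Bioconductor install looks something like this (sketch; BiocManager is the current installer):

```r
# Install biomaRt from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)
```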

The usual bioC way of doing things like this is to create an object that encapsulates all your desired parameters. So create a biomart object

Basically, you are connecting to the Ensembl backend, accessing the human data table (defaults to the current version) at the specified host. One note: many tutorials mention using “www.ensembl.org” as the host, but things like URL redirection and Ensembl’s server location may affect your success rate. If you are in the US, there is a “useast.ensembl.org” mirror, presumably to make our lives a little easier, and it works well (so far).
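A minimal sketch of creating the mart object (the useast host follows from the US mirror mentioned above):

```r
library(biomaRt)
# Connect to the Ensembl backend, human gene dataset, US-East mirror
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                dataset = "hsapiens_gene_ensembl",
                host    = "useast.ensembl.org")
```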

Then we specify what attributes we want. For some reason, ‘external_gene_name’ is the attribute name for gene symbol.
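A sketch of the query, assuming a mart object created as described above; getBM() returns a data frame in the shape Sleuth expects for its target_mapping:

```r
# Map Ensembl transcript IDs to gene IDs and gene symbols
t2g <- getBM(attributes = c("ensembl_transcript_id",
                            "ensembl_gene_id",
                            "external_gene_name"),
             mart = mart)
head(t2g)
```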

If you are looking for a specific build of the genome, such as the hg19/GRCh37 human build, there is a stable archive of the Ensembl site at grch37.ensembl.org. Dated archives are also available, with the format <month><year>.archive.ensembl.org, but they have a limited lifespan.
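For example, pointing the mart at the stable GRCh37 archive host instead of the current release (a sketch):

```r
# Query the stable GRCh37/hg19 archive instead of the current release
mart37 <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",
                  dataset = "hsapiens_gene_ensembl",
                  host    = "grch37.ensembl.org")
```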

How to view function definitions in R (S3/S4)

Typically, when you type the name of a function in R (without parentheses), it will show you the source code of that function. For example, to see the definition of the ls() function, simply type ls at the R prompt.

Viewing S3 methods
Some functions work by dispatch. An example is print(). When you call print() with some parameter, R will check the type (class) of the parameter (object) and then call the correct print function, which might be something like print.table() or print.dates(). That will be obvious because when you type print, R will show you something like UseMethod("print"), which is the dispatcher.

Use methods("print") to see all methods named print.*; you can view a function definition by typing the name of the function. For example, print.table will display the source code. Another way to get the same info is getS3method(function_name, class_name).

If the print method has an asterisk after its name, that means it is hidden. To see hidden function definitions, try something like getS3method("print", "aov") or getAnywhere("print.aov").

Viewing S4 methods

For S4 methods, it’s a bit more work.
Using a method from cummeRbund as an example, type showMethods(csHeatmap) to figure out the parent object. The object="CuffFeatureSet" entry is the clue to the class we need to pass to getMethod().

getMethod("csHeatmap", "CuffFeatureSet")

Note that csHeatmap gets its method from CuffGeneSet, which inherited it from CuffFeatureSet, so call the parent class to get the function definition.


ipython notebooks

IPython notebooks are a useful way to run/save a session of Python code/analysis. I can see it as a useful way to share code and promote reproducible research. In several respects, it’s similar to knitr in R, except that it runs in your browser.

I’ve been using ipython notebook on our cluster, which turns out to be fairly easy. You really need only a few things:

Server Side

  • IP address of a machine on the cluster
  • port (of your choosing) on that machine

Typically, I log into our system and qlogin or ssh onto a node where I know the IP address, and then run
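The launch command is something like this (a sketch; port 8888 is an example, matching the tunnel below):

```shell
# Start the notebook server on the node, without opening a browser,
# listening on the node's IP at the chosen port
ipython notebook --no-browser --ip=$(hostname -i) --port=8888
```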

This will start up ipython notebook on that node.

Client Side (Windows)

Then, from my laptop, I tunnel into the cluster node using PuTTY for SSH. I ssh into the cluster and then use an SSH tunnel for port forwarding. So, under SSH > Tunnels:

Source port: 8888 (for example)

Destination: <node IP>:8888 (the cluster node’s IP and the notebook’s port, for example)

Now I open my browser and point it to:

http://localhost:8888
and I will have an ipython notebook session! One thing to note is that whatever directory on the server you start ipython notebook in is the directory where your notebooks will be saved (by default). I actually have a jupyter (the new name for ipython notebooks, I guess) directory in my home folder where I keep my notebooks, but you can imagine putting them on git and doing pull requests, etc.


D3 for visualization

I’ve seen very few scientific journal articles make use of interactive visualizations the way, for example, the New York Times does. Datasets are large and complex, and giving scientists a way to browse the data can add a lot to understanding it.

I spent some time learning D3.js as a way to do this. D3 is a JavaScript-based data visualization library commonly used for making online charts, plots, choropleths, etc. Among its strengths is its ability to let users interact with the data and animate datasets.

As an example, I looked at our dataset of DMD boys who had been exome sequenced in our Genetic Modifier study. Using R/Bioconductor, I computed the average depth of coverage for each of the 79 exons of the muscle form of the DMD gene. The goal was to create an interactive, heatmap-style rendering of the data. Rows are samples, columns are exons, and each cell represents the average sequencing coverage for that exon. Most interesting (and easy to see) are the exonic deletions, where coverage is 0 because that region is missing.

The visualization is here mutation explorer.

I’ll try to explain the code in a later post.

Bash shortcuts

Useful but not well-known shortcuts for editing previous commands in bash:

!! expands to the last command and all arguments
!* all arguments of the last command, but not the command itself
!$ last argument of the last command
!foo last command beginning with “foo”
!?bar last command containing “bar”
^foo^bar last command with the first occurrence of “foo” replaced with “bar”
!:gs/foo/bar last command with all occurrences of “foo” replaced with “bar”
<any_above>:p print the command but do not execute it

Other useful things

Brace Expansion

cp file.txt{,.bak}

this basically says cp file.txt file.txt.bak

mkdir -p {src,test}/com/eriwen/{data,view}

This will expand every combination, so we get src/com/eriwen/data, src/com/eriwen/view, test/com/eriwen/data, and so forth. Being able to create a template directory structure in one line is a big time saver!
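Brace expansion is easy to experiment with safely using echo, which just prints what the shell would expand without touching the filesystem:

```shell
# Show the expansion instead of running cp/mkdir
echo file.txt{,.bak}
# -> file.txt file.txt.bak
echo {src,test}/com/eriwen/{data,view}
# -> src/com/eriwen/data src/com/eriwen/view test/com/eriwen/data test/com/eriwen/view
```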



First scratch at spark

This follows from the previous post, where I tried out Hadoop. This uses Spark on the same Comet cluster.

To run interactively, the best thing to do first off is add this to my .bashrc
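I don’t have the exact lines, but the gist is to put Spark’s binaries on the path; something like this (the paths are assumptions for the Comet install):

```shell
# Hypothetical .bashrc additions: point at the Spark install
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
```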

Then in the folder with the spark code (get it here), run this
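The submission boils down to reserving nodes and sleeping; a sketch (node count, time, and partition are placeholders):

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=04:00:00
#SBATCH --partition=compute
# Hold the allocation for 4 hours; we ssh in separately to do the actual work
sleep 14400
```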

Basically, it sleeps for 4 hours because we need to reserve multiple nodes that we will use for Spark. This reserves the machines for our use, but we will ssh directly in to do our work.

First, source the SLURM env variables file (auto created)
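Something like this (the filename of the auto-created env file is an assumption):

```shell
# Pull the dumped SLURM variables into the current shell
source ~/slurm.env
```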

Then identify the namenode. You can either:

  1. examine ~/mycluster.conf/hdfs-site.xml and look for the namenode,
  2. do squeue -u $USER and identify the first machine, which will be your namenode, or
  3. check the env variable $SLURMD_NODENAME, which will be set to the namenode.

SSH into the namenode and then
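Presumably just the Spark Python shell:

```shell
# Launch Spark's interactive Python REPL on the namenode
pyspark
```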

This will start a terminal that looks like python IDLE and you can type interactively.

The first example:
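I’m reconstructing the example from the description below; a minimal sketch using parallelize(), collect(), and glom(), assuming the pyspark shell has already created the SparkContext sc:

```python
# In the pyspark shell, `sc` (the SparkContext) already exists.
# Create a local list and distribute it across the cluster as an RDD.
data = sc.parallelize(range(16), 4)   # 4 partitions

# collect() fetches the pieces back from the workers and assembles them
print(data.collect())         # [0, 1, ..., 15]

# glom() turns each partition into a list, one list per partition
print(data.glom().collect())  # [[0, 1, 2, 3], [4, 5, 6, 7], ...]
```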

Most of it is straightforward. The most confusing part is that the data is created and then distributed across HDFS, but that is invisible to us. The data is split across machines with the parallelize() function. Spark calls this an RDD – a resilient distributed dataset. The data cannot be displayed directly (i.e., by typing ‘data’) because part of it exists on one machine and part on another. To get it back, you have to tell Spark to fetch the data from the various machines and assemble it so that it can be displayed. That’s what collect() does. glom(), as far as I understand, creates one object/entity per partition, which can then be joined.

Spark can read from local files, HDFS, and probably other sources. The main advantage of Spark over Hadoop is that all the file manipulations are invisible. The code you would use on your personal computer to analyze one dataset is almost the same as you would use for Spark; Spark just handles the sharding, managing, and copying of data for you. Talking to a colleague, there’s no reason to use Hadoop unless you have legacy code/data.

In Hadoop/Spark parlance, be wary whenever you generate a ‘shuffle’ operation. Those are the ones that require communication between nodes, such as when generating final output, since this creates a lot of network traffic, which is the very thing you’re trying to avoid.

Caching is also recommended immediately after cleaning a dataset to your liking. This makes reading it back super fast, instead of hitting all the disks on all the machines and waiting for a response.
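In code, that’s just a .cache() call on the cleaned RDD (sketch; the names are made up):

```python
# Hypothetical: keep the cleaned RDD in cluster memory for fast re-reads
cleaned = raw.filter(lambda line: line.strip() != "").cache()
```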

Lots of machine learning libraries and SQL-like syntax are available, so this looks like fun stuff. All with the convenience (I guess) of Python.

IPython notebooks can also be viewable on Comet with the program running locally on your computer and communicating back to the cluster.

First scratch at hadoop

Trying out Comet at SDSC, thanks to the XSEDE folks. I’m not familiar with SLURM, but it was fun, especially since I’ve been interested in Hadoop and Spark for a while.

This is a run-through of Hadoop MapReduce using a Java program that looks for anagrams in a list of words (link to data/code provided by the SDSC guys). The details of Hadoop are broad, but the basic idea here is:

  • allocate some machines for a Hadoop run
  • set up a Hadoop environment (simplified through the use of Comet’s scripts provided by SDSC)
  • create an HDFS storage environment (sort of like a virtual partition, but following the Hadoop specification: data replicated 3 times, distributed across different machines, etc.)
  • copy our source data into HDFS
  • do our computation
  • copy our output data from HDFS back to our local file system (i.e., from HDFS-space back to user-space)
  • kill/shutdown/clean up the Hadoop environment using the handy scripts on Comet

First we log on to Comet and start an interactive job. I’m not familiar with SLURM, but this was a good resource.
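The interactive request looks something like this (node count and time are examples):

```shell
# Request 6 nodes for 2 hours on the general-purpose 'compute' partition
srun --partition=compute --nodes=6 --time=02:00:00 --pty /bin/bash
```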

This basically requests a number of nodes for an amount of time. The -p option specifies a partition, with ‘compute’ being a general-purpose partition. There are specialty partitions for GPUs, etc. I think partitions basically mean a group of computers/processors.

Once a SLURM allocation has been created, you can dump the SLURM environment variables to a file. We do this because we need to source them in once we log into the master node.
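One way to dump them (the filename is an assumption):

```shell
# Save all SLURM_* variables so they can be sourced on other nodes later
env | grep SLURM > ~/slurm.env
```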

Now find the first node of our allocated pool using squeue and SSH into that machine.
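Checking the queue for my jobs:

```shell
# List my running jobs and the nodes allocated to them
squeue -u $USER
```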

This tells me that comet-10-65 to comet-10-70 are the nodes that are allocated to me. Our first node is then comet-10-65.

Now we need to source all the SLURM environment variables that we just dumped so they are available on our master node (the first node in the pool).
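Again, the env file name is an assumption:

```shell
source ~/slurm.env
```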

Then cd to the directory with my example code

The interactive commands look like this
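I don’t have the original commands, but based on the steps described above the session looks roughly like this (the script, jar, and class names follow Comet’s myHadoop convention and are assumptions):

```shell
# Stand up the Hadoop environment and HDFS for this allocation
myhadoop-configure.sh
start-all.sh

# Copy input data into HDFS-space
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -put SINGLE.TXT /user/$USER/

# Run the anagram MapReduce job (jar/class names are placeholders)
hadoop jar anagram.jar AnagramJob /user/$USER/SINGLE.TXT /user/$USER/out

# Copy results back to user-space, then tear down
hdfs dfs -get /user/$USER/out/part-00000 .
stop-all.sh
myhadoop-cleanup.sh
```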

The file part-00000 contains the results, which are basically the anagrams of all the words in the SINGLE.TXT input file.

Of course, there’s a script to do all this since you don’t often want to run interactively. To kill a squeue job, use scancel <jobID>.

The Hadoop config files are in the cometcluster directory under your home directory.

SLURM seems similar to SGE. The basic commands:
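The rough SGE equivalents, for reference:

```shell
sbatch job.sh     # submit a batch job   (SGE: qsub job.sh)
squeue -u $USER   # list my jobs         (SGE: qstat -u $USER)
scancel <jobID>   # kill a job           (SGE: qdel <jobID>)
```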


aligning text in vim

Lately, I’ve been pulling out RNA-seq reads and trying to build contigs by hand in a quick and dirty manner.
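The pulling-out step was along these lines: grep the reads for a seed sequence, then eyeball the overlaps (the filenames and seed sequence here are made up):

```shell
# Toy example: pull reads containing a seed k-mer from a file of read sequences
printf 'ACGTACCATGGA\nTTTACCATGGCA\nGGGGGGGG\n' > reads.txt
grep TACCAT reads.txt
```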

A nifty trick is to use vim to line up your strings on MYSTRING. I did this using the tabularize.vim plugin.
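The command was something like this (the r/c/l format specifiers follow Tabular’s alignment syntax; treat the exact incantation as a sketch):

```vim
" Right-align text before TACCAT, the token itself, then left-align after it
:Tabularize /TACCAT/r1c1l0
```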

This basically says, line up all lines on TACCAT. Everything before the token is right aligned, then the token, then everything after the token is left aligned. Voila!

Viewing source code for S3/S4 objects

It’s not always easy to see the source code for functions in R/BioC. For S4 method dispatch, you need to figure out the function signature, then call selectMethod() or getMethod().

getMethod() and selectMethod() are nearly equivalent here; selectMethod() will also find methods inherited from a parent class.


For example:
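Reusing the cummeRbund example from the earlier post:

```r
library(cummeRbund)
# Signature discovered via showMethods(csHeatmap)
getMethod("csHeatmap", "CuffFeatureSet")
# selectMethod() would also find it via inheritance from CuffGeneSet
selectMethod("csHeatmap", "CuffGeneSet")
```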


For S3 dispatch, you can call
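For example, for the print generic:

```r
# List all S3 methods registered for the print generic
methods("print")
# Then type a full method name (e.g. print.table) to see its source
```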


This gives a list of the dispatch methods; then type the full function name to see the underlying code.
