Setting up AWS CLI

AWS provides a command line interface (CLI), which may be easier for programmers to use than the web console.

To set it up on a Linux machine, install it with pip and then run the interactive configuration:

pip install awscli
aws configure

The Access Key ID and Secret Access Key are created in the AWS console with the IAM tool. You need to create a user and give them an access key. The secret key is only viewable once, so you need to download and save that credential. The pair is analogous to a username/password, but works more like an API access key. If you lose it or can’t find it, the AWS admin will need to delete the old key and issue a new one.

For the region name, I use us-west-2 (Oregon); see the AWS docs for the full list of regions.

The configure command stores your settings in plain-text INI-style files under ~/.aws/. If you have more than one profile you can run

aws configure --profile user2

AWS CLI will look first at command line options, then environment variables (AWS_ACCESS_KEY_ID, etc), and then ~/.aws//config
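For reference, the files created by aws configure look something like this (the keys and the second profile here are placeholders):

```
# ~/.aws/credentials
[default]
aws_access_key_id = AKIAEXAMPLEKEYID
aws_secret_access_key = exampleSecretAccessKey

[user2]
aws_access_key_id = AKIAEXAMPLEKEYID2
aws_secret_access_key = exampleSecretAccessKey2

# ~/.aws/config
[default]
region = us-west-2
output = json

[profile user2]
region = us-west-2
```

A non-default profile is then selected per command with --profile user2.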

Amazon presentation on AWS for genomics

getting lists of rRNA, miRNA in R

Recently I had to remove all small RNAs from some Cufflinks data. There are lots of ways to do this, but this was relatively painless (aside from figuring it out).

The HUGO website has subcategories of genes annotated, and we can grab the data from there. It’s available as both JSON and text.

There was a bit of fiddling to figure out the structure of the returned JSON file, hence the smallRNA[[2]][[2]].

We can do the same for pseudogenes using the same site.

Then we can filter against our genes.fpkm_tracking file from the cuffdiff output.

Enable X on AWS EC2 instance

Spin up an EC2 instance. For this demo, I used a t2.micro instance with the base Amazon Linux AMI.

First, connect to the machine. You need to find your EC2 instance’s public DNS name (shown in the EC2 console). The user is always ec2-user.

ssh -i /path/to/pem_file ec2-user@<ec2-public-dns>

Once on the EC2 machine, I needed to install some packages:

sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64

Check that X11 forwarding is enabled for ssh (X11Forwarding yes in the server’s /etc/ssh/sshd_config).

Check that the DISPLAY variable is set on the EC2 instance

export DISPLAY=:0.0
# or
export DISPLAY=:10.0

Log in again using ssh with X11 forwarding

ssh -X -i /path/to/pem_file ec2-user@<ec2-public-dns>

To test:
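Any small X program works as a quick check; xclock is a common choice (assuming the xorg-x11-apps package provides it on Amazon Linux):

```
sudo yum install xorg-x11-apps.x86_64   # provides xclock
xclock                                  # a clock window should open on your local display
```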


And you should be good to go. I could use R remotely from my Linux desktop with plots showing up.

Accessing biomart from R

A good use for Ensembl’s biomart is pulling a list of gene names linked to Ensembl transcript IDs, for example when using Sleuth.

You will need Bioconductor and the biomaRt library, if you don’t have them already.

The usual BioC way of doing things like this is to create an object that encapsulates all your desired parameters, so create a biomart object.

Basically you are connecting to the Ensembl backend and accessing the human data table (which defaults to the current version) at the specified host. One note: many tutorials name a specific host, but URL redirection and Ensembl’s server locations may affect your success rate. If you are in the US, there is a US mirror host, presumably to make our lives a little easier, and it works well (so far).

Then we specify which attributes we want. For some reason, ‘external_gene_name’ is the attribute name for the gene symbol.

If you are looking for a specific build of the genome, such as the hg19/GRCh37 human build, there are stable archives of the Ensembl site. Dated archive hosts are also available, but they have a limited lifespan.

How to view function definitions in R (S3/S4)

Typically, when you type the name of a function in R, it will show you the source code of that function. For example, to see the definition of the ls() function, simply type ls (without parentheses) at the R prompt.

Viewing S3 methods
Some functions work by dispatch. An example is print(). When you call print() with some parameter, R will check the class of the parameter (object) and then call the correct print function, which might be something like print.table() or print.Date(). The dispatch is obvious because when you type print, R will show you something like UseMethod("print"), which is the dispatcher.

Use methods("print") to see all the print.* methods; you can view a function definition by typing the name of the function. For example, print.table will display the source code. Another way to get the same info is getS3method(function_name, class_name).

If a print method has an asterisk after its name, that means it is hidden. To see hidden function definitions, try something like getS3method("print", "aov") or getAnywhere("print.aov").

Viewing S4 methods

For S4 methods, it’s a bit more work.
Using a method from cummeRbund as an example, type showMethods(csHeatmap) to figure out the parent object. The object="CuffFeatureSet" entry is the clue to the class we need to pass to getMethod()

getMethod("csHeatmap", "CuffFeatureSet")

Note that csHeatmap gets its method from CuffGeneSet, which inherited it from CuffFeatureSet, so call the parent class to get the function definition.


ipython notebooks

ipython notebooks are a useful way to run and save a session of python code/analysis. I can see it as a useful way to share code and promote reproducible research. In several respects, it’s similar to knitr in R, except that it runs in your browser.

I’ve been using ipython notebook on our cluster, which turns out to be fairly easy. You really need only a few things:

Server Side

  • IP address of a machine on the cluster
  • port (of your choosing) on that machine

Typically, I log in to our system, then qlogin or ssh onto a node whose IP address I know, and run
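Assuming the classic ipython notebook interface, the launch command looks something like this (the port is your choice; the node IP is a placeholder):

```
# bind to the node's IP so your laptop can reach it; pick any free port
ipython notebook --no-browser --ip=<node-ip> --port=8888
```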

This will start up ipython notebook on that node.

Client Side (Windows)

Then, from my laptop, I tunnel into the cluster node using PuTTY for SSH: I ssh into the cluster and set up an SSH tunnel for port forwarding. So under Connection → SSH → Tunnels:

Source port: 8888 (for example)

Destination: the cluster node’s IP address and notebook port, for example <node-ip>:8888

Now I open my browser and point it to:

http://localhost:8888
and I will have an ipython notebook session! One thing to note: whatever directory on the server you start ipython notebook in is the directory where your notebooks will be saved (by default). I actually have a jupyter (the new name for ipython notebooks, I guess) directory in my home folder where I keep my ipython notebooks, but you can imagine putting them on git and doing pull requests, etc.


D3 for visualization

I’ve seen very few scientific journal articles make use of interactive visualizations the way that, for example, the New York Times does. Datasets are large and complex, and giving scientists a way to browse the data can add a lot to understanding it.

I spent some time learning D3.js as a way to do this. D3 is a javascript-based data visualization library commonly used for making online charts, plots, choropleths, etc. Among its strengths are its ability to let users interact with the data and to animate datasets.

As an example, I looked at our dataset of DMD boys who had been exome sequenced in our Genetic Modifier study. Using R/Bioconductor, I computed the average depth of coverage for each of the 79 exons of the muscle form of the DMD gene. The goal was to create an interactive heatmap-style rendering of the data. Rows are samples, columns are exons, and each cell represents the average sequencing coverage for that exon. Most interesting (and easy to see) are the exonic deletions, where coverage is 0 because that region is missing.

The visualization is here: mutation explorer.

I’ll try to explain the code in a later post.

Bash shortcuts

Useful but not well known shortcuts for editing previous commands in bash:

!! expands to the last command and all arguments
!* all arguments of the last command, but not the command itself
!$ last argument of the last command
!foo last command beginning with “foo”
!?bar last command containing “bar”
^foo^bar last command with the first occurrence of “foo” replaced with “bar”
!!:gs/foo/bar last command with all occurrences of “foo” replaced with “bar”
<any_above>:p print the command but do not execute it

Other useful things

Brace Expansion

cp file.txt{,.bak}

This basically says cp file.txt file.txt.bak

mkdir -p {src,test}/com/eriwen/{data,view}

This will expand every combination, so we get src/com/eriwen/data, src/com/eriwen/view, test/com/eriwen/data, and so forth. Being able to create a template directory structure with one line is a big time saver!
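A handy way to preview an expansion before running it is to echo it first:

```shell
# brace expansion happens in the shell before the command runs,
# so echo shows exactly what mkdir would receive
echo {src,test}/com/eriwen/{data,view}
# prints: src/com/eriwen/data src/com/eriwen/view test/com/eriwen/data test/com/eriwen/view
```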



first scratch at spark

This follows from the previous post, where I tried out Hadoop. This uses Spark on the same Comet cluster.

To run interactively, the best thing to do first is add this to your .bashrc

Then, in the folder with the Spark code (get it here), run this

Basically, it sleeps for 4 hours because we need to reserve multiple nodes to use for Spark. This reserves the machines for our use, and we will ssh directly in to do our work.
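A minimal sketch of such a reservation job (the node count and script name are hypothetical) looks like:

```
#!/bin/bash
#SBATCH --nodes=4            # hypothetical: however many nodes you want for Spark
#SBATCH --time=04:00:00      # matches the 4-hour sleep below
sleep 14400                  # hold the allocation while we ssh in and work
```

Submitted with something like sbatch spark-reserve.sh.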

First, source the SLURM environment variables file (created automatically).

Then identify the namenode. You can either

  1. examine ~/mycluster.conf/hdfs-site.xml and look for the namenode,
  2. run squeue -u $USER and identify the first machine, which will be your namenode, or
  3. use the env variable $SLURMD_NODENAME, which will be set to the namenode.

SSH into the namenode and then
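Assuming a standard Spark installation with the binaries on your PATH, the interactive Python shell is started with:

```
pyspark
```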

This will start a terminal that looks like python IDLE and you can type interactively.

The first example:
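Based on the description below of parallelize(), glom(), and collect(), a minimal sketch of that example (the numbers are hypothetical) would be:

```
# the same lines can be typed interactively at the pyspark prompt,
# where sc (the SparkContext) is predefined
pyspark <<'EOF'
data = sc.parallelize(range(12), 4)   # split the data into 4 partitions across the nodes
print(data.glom().collect())          # glom: one list per partition; collect: gather back to the driver
EOF
```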

Most of it is straightforward. The most confusing part is that the data is created and then distributed across the cluster, but that is invisible to us. The data is split across machines with the parallelize() function. Spark calls this an RDD, a Resilient Distributed Dataset. The data cannot be displayed directly (i.e. by typing data) because part of it lives on one machine and part on another. So to get it back, you have to tell Spark to fetch the data from the various machines and assemble it so that it may be displayed. That’s what collect() does. glom(), as far as I understand, creates one object/entity per machine, which can then be joined.

Spark can read from local files, HDFS, and probably other sources. The main advantage of Spark over Hadoop is that all the file manipulations are invisible. The code you would use on your personal computer to analyze one dataset is almost the same as what you would use with Spark; Spark just handles the sharding, managing, and copying of data for you. Talking to a colleague, there’s no reason to use Hadoop unless you have legacy code/data.

In Hadoop/Spark parlance, be wary whenever you generate a ‘shuffle’ operation. Those are the ones that require communication between nodes, such as when generating final output. Shuffles produce a lot of network traffic, which is the very thing you’re trying to avoid.

Caching is also recommended immediately after cleaning a dataset to your liking. This makes the read-back very fast, instead of hitting all the disks on all the machines and waiting for responses.

There are lots of machine learning libraries and SQL-like syntax, so this looks like fun stuff. All with the convenience (I guess) of python.

IPython notebooks can also be used with Comet, with the browser running locally on your computer and communicating back to the cluster.