Rstudio-server over ssh

Rstudio is a great tool, but suppose you need to access your work computer to do some work from home? Use Rstudio-server, which lets you connect to a remote machine through a browser and do your development there.

I can’t find a definitive answer whether the free version of Rstudio-server encrypts the login information. Their website is not clear.

But we can always tunnel using SSH and things should be just fine.

In PuTTY, set up an SSH tunnel (under Connection > SSH > Tunnels) with the following settings:

  1. Source port: 8787
  2. Destination: localhost:8787

Once you connect to the remote host using putty, you can use “http://localhost:8787” in your browser to access your Rstudio-server session. The plus is that the connection to the remote host is going over an SSH tunnel and therefore encrypted.
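On Linux or macOS the same tunnel can be opened with OpenSSH instead of PuTTY. A minimal sketch, assuming Rstudio-server listens on port 8787 on the remote machine as above; the function name, user and host are placeholders:

```shell
# Open an SSH tunnel from a local port to the same port on the remote host.
# -N: do not run a remote command, just forward; -L: local port forward.
# Usage: open_rstudio_tunnel USER HOST [PORT]
open_rstudio_tunnel() {
  user="$1"
  host="$2"
  port="${3:-8787}"
  ssh -N -L "${port}:localhost:${port}" "${user}@${host}"
}

# Example (placeholder user and host):
# open_rstudio_tunnel me work.example.org
```

While the command is running, http://localhost:8787 in the local browser reaches the remote session; Ctrl-C closes the tunnel.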

RNAseq aligners – The Next Generation

There was a time when Tophat/Cufflinks was the only game in town. That has changed. By a lot. Some of the newest aligners include

  • STAR. Using a suffix array and the idea of compatible reads makes this a very fast aligner. Whereas Tophat/Cufflinks requires 8+ hours, STAR will require <2. The downside: the index requires a lot of RAM. I think I’ve heard 32 GB of RAM works, but I run it on our 256 GB machine.
  • GMAP/GSNAP – actually used since the EST days. I haven’t used it yet, but according to bake-off comparisons like this, it is a relatively fast aligner.
  • Tophat/Cufflinks. Probably the most cited of all aligners. Straightforward to use, but seems to underalign compared to newer software.

Some of the new tools in town are specialized for certain tasks, particularly aspects of splicing:

  • Leafcutter is designed to study variation in intron splicing
  • MapSplice2 – splice junction discovery
  • SpliceTrap – differential splice usage
  • DEXseq – differential exon usage

New wave differential gene expression can now be done in minutes using these tools! For one thing, they use “transcripts per million” (TPM) instead of the confusing FPKM/RPKM that no one seems to understand.

  • Kallisto (Pachter lab). From the good folks who brought us Tophat/Cufflinks and eXpress. I like the interactive UI for playing with your data through Sleuth.
  • Salmon/Sailfish. Runs extremely fast and makes use of the idea of compatible alignments, which is more like STAR.
  • RSEM – haven’t looked at it yet.
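For reference, TPM for a transcript is its read count divided by its length, rescaled so that all values in a sample sum to one million. A rough sketch of that arithmetic in shell, assuming a hypothetical headerless, tab-separated file with the columns id, read count and length in bases (all lengths positive). Kallisto and Salmon work with estimated counts and effective lengths, so this only illustrates the idea and is not their implementation:

```shell
# Compute TPM from a 3-column file: id <TAB> read_count <TAB> length_in_bases.
# Usage: tpm COUNTS_FILE
tpm() {
  awk -F'\t' '
    # per-transcript rate = reads per base; keep input order by line number
    { id[NR] = $1; rate[NR] = $2 / $3; total += rate[NR] }
    # rescale the rates so they sum to one million
    END { for (i = 1; i <= NR; i++) printf "%s\t%.1f\n", id[i], rate[i] / total * 1e6 }
  ' "$1"
}

# Example (placeholder file name):
# tpm counts.tsv
```

Because every sample sums to the same total, TPM values are easier to compare between samples than FPKM/RPKM.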

Workflows. There are a few workflows that seem popular for analysis:

  • Tophat/Cufflinks. You do the splice-aware alignment, I’ll generate a ton of output for you.
  • subread/edgeR/limma(voom)/Glimma. Align/count, normalize, DEG analysis and visualize. All thanks to Bioconductor.
  • your favorite aligner + DESeq2/edgeR/EBSeq

Of course, there’s not a lot of consensus on what works best or is “right” so there’s still room to add to this list.

 

nodejs for genomics

Some interesting sites to peruse later…

  • http://www.bionode.io/
  • http://biojs.io/
  • http://devpost.com/software/genesis-computational-genomic-sequence-analysis-z47x9t
  • http://thejackalofjavascript.com/dna-analysis-node-js/
  • https://www.npmjs.com/package/ntseq

Amazon CLI for S3

Some useful commands

EC2


aws ec2 describe-instances
aws ec2 describe-instances --region us-east-1
aws ec2 describe-instances --region us-east-1 --output table

Any aws cli command can use the following flags:

  • --profile – name of a profile to use, or “default” to use the default profile.
  • --region – AWS region to call.
  • --output – output format.
  • --endpoint-url – the endpoint to make the call against. The endpoint can be the address of a proxy or an endpoint URL for the in-use AWS region. Specifying an endpoint is not required for normal use, as the AWS CLI determines which endpoint to call based on the in-use region.
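These flags can be combined on one call. A small sketch, with user2 standing in for whatever profile name has been configured; the wrapper function is hypothetical:

```shell
# List EC2 instances as a table for a given profile and region.
# Usage: list_instances PROFILE REGION
list_instances() {
  aws ec2 describe-instances --profile "$1" --region "$2" --output table
}

# Example (placeholder profile name):
# list_instances user2 us-east-1
```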

S3


# make a bucket
aws s3 mb s3://bucket-name
# remove a bucket
aws s3 rb s3://bucket-name
# by default will not remove non-empty buckets, use force
aws s3 rb s3://bucket-name --force
# list buckets
aws s3 ls
# list contents of a bucket
aws s3 ls s3://bucket-name

Since the Bioconductor image is hosted on the east coast (us-east-1), we should have our S3 bucket in the same region.


aws s3 mb s3://bioarchive.1 --region us-east-1

Using the bioconductor AMI

The current BioC release is 3.3, for which there is an Amazon AMI (ami-64d43409). I’m not sure why, but the AMI is only in the us-east-1 region, which isn’t convenient for me, and we do not have permission to copy it. So everything has to live in us-east-1 (N. Virginia), with the EBS volume in the same availability zone (AZ) as the instance. If there’s time, I’ll have to roll my own to have something closer to home.

I’m not sure whether my analysis is more memory or compute heavy, so I’m going with a memory-optimized instance:

  • r3.2xlarge $0.66/hour
  • EBS $0.05/hour
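To get a feel for what a session costs, the two hourly rates above can simply be added and multiplied by the running time. A back-of-the-envelope sketch; the function name is made up and the rates are the ones quoted in this post, which will change over time:

```shell
# Estimated cost in dollars for HOURS of r3.2xlarge plus EBS at the rates above.
# Usage: estimate_cost HOURS
estimate_cost() {
  awk -v hours="$1" 'BEGIN { printf "%.2f\n", hours * (0.66 + 0.05) }'
}

estimate_cost 8    # one working day
```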

Security group: enable SSH and HTTP (ports 22 and 80)

Connect to instance (the user on this AMI is ubuntu)
ssh -i /path/to/key.pem ubuntu@ip-from-aws-console

Attach an EBS (console)

  1. Go to EC2, select Volumes, and create a 250 GB gp2 volume.
  2. Make sure it’s in the same availability zone as the EC2 instance.
  3. Select the volume, then select Attach (instance must be stopped or not running).

If the EBS volume is new (never used), you can see the device name, size and mount point with:

lsblk

# if this returns only "data", there is no filesystem and we need to create one
sudo file -s /dev/xvdf
# format (only for a new, empty volume: this erases anything on the device)
sudo mkfs -t ext4 /dev/xvdf
# create mount point
sudo mkdir /data
# mount the volume
sudo mount /dev/xvdf /data
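Because mkfs wipes whatever is on the device, it can help to guard the format step. A cautious sketch, assuming the new volume shows up as /dev/xvdf (check lsblk, since the name can differ) and using a hypothetical helper name:

```shell
# Succeeds only when `file -s` reports bare "data", i.e. no filesystem was found.
# Usage: needs_format "$(sudo file -s /dev/xvdf)"
needs_format() {
  case "$1" in
    *": data") return 0 ;;
    *) return 1 ;;
  esac
}

# Example (assumed device name):
# if needs_format "$(sudo file -s /dev/xvdf)"; then
#   sudo mkfs -t ext4 /dev/xvdf
# fi
```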

Details are here.

To upload data

sftp -i /path/to/key.pem ubuntu@ip-from-aws-console
cd /my/dir
put, etc

Doing it better

Since the data has to be uploaded first, there are two options I can think of. The first is to upload to S3 and then copy over to the instance or EBS volume. The other possibility, to save $$$, would be to use a low-cost instance with EBS, upload the data to it, then spin up a larger instance and reattach the same volume.
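The S3 route might look like the following sketch. The bucket is the one created earlier in these notes; the local directory, prefix and function names are placeholders:

```shell
BUCKET="s3://bioarchive.1"

# On the local machine: copy a directory of data up to the bucket.
# Usage: stage_to_s3 LOCAL_DIR PREFIX
stage_to_s3() {
  aws s3 sync "$1" "${BUCKET}/$2" --region us-east-1
}

# On the EC2 instance: pull the same prefix down onto the mounted EBS volume.
# Usage: fetch_from_s3 PREFIX DEST_DIR
fetch_from_s3() {
  aws s3 sync "${BUCKET}/$1" "$2" --region us-east-1
}

# Example (placeholder paths):
# stage_to_s3 ./fastq run1        # local machine
# fetch_from_s3 run1 /data/run1   # EC2 instance
```

The upload then happens while no instance is running, so the hourly instance charge only starts once the data is already in place.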

Using RStudio
Point browser to the public IP for the AMI instance. Login with user/pass ubuntu/bio.

Permissions problems
Within RStudio, I had problems loading dplyr, etc., even though the packages were present in the /home/ubuntu/R-libs directory.
I tried two things. In /etc/rstudio/rsession.conf:
r-libs-user=~/R-libs
And in /home/ubuntu/R-libs:
sudo chown -R ubuntu:ubuntu *
sudo chmod -R 755 *

Not sure which one worked, but RStudio now seems to work okay.

Setting up AWS CLI

AWS has a command line interface which may be easier for programmers to use.

To set up on a linux machine, use pip and configure

pip install awscli
aws configure

The Access Key ID and Secret Access Key are created in the IAM tool in the AWS console. You need to create a user and give him/her an access key. The secret key is only viewable once, so you need to download and save that credential. The keys are analogous to a username/password, but more like an API access key. If you lose the secret key or can’t find it, the AWS admin will need to delete the old key and provide a new one.

For region name, I use us-west-2 (Oregon). Docs are here.

The configure command stores these settings in plain-text files under ~/.aws/ (credentials and config). If you have more than one profile, you can configure each one by name:


aws configure --profile user2

AWS CLI will look first at command line options, then environment variables (AWS_ACCESS_KEY_ID, etc.), and then the files ~/.aws/credentials and ~/.aws/config.
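For example, environment variables can override the saved configuration for a single shell session. A sketch using variable names the AWS CLI documents; the profile name and region values are only examples:

```shell
# Use the "user2" profile from ~/.aws/ for every command in this shell.
export AWS_PROFILE=user2
# Override the region stored for that profile.
export AWS_DEFAULT_REGION=us-east-1

# A command-line flag still wins over both settings, e.g.:
# aws ec2 describe-instances --region us-west-2
```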

Amazon presentation on AWS for genomics

getting lists of rRNA, miRNA in R

Recently I had to remove all small RNAs from some cufflinks data. There are lots of ways to do this, but this was relatively painless (aside from figuring it out).

The HUGO website has annotated subcategories of genes and we can grab the data from there. It’s available as either JSON or text. I found it here: http://www.genenames.org/cgi-bin/statistics

There was a bit of fiddling to figure out the structure of the returned JSON file, hence the smallRNA[[2]][[2]].

We can do the same for pseudogenes using the same site.

Then we can filter against our genes.fpkm_tracking file from cuffdiff output
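As one possible version of that last filtering step outside R, the sketch below drops rows whose gene symbol appears in a list. It assumes a hypothetical file of symbols, one per line, saved from the download above, and that gene_short_name is the fifth tab-separated column of genes.fpkm_tracking, which should be checked against the file's header line:

```shell
# Print the tracking file minus any gene whose symbol is in the drop list.
# Usage: drop_genes SYMBOLS_FILE TRACKING_FILE
drop_genes() {
  awk -F'\t' '
    # first file: remember every symbol to drop
    FILENAME == ARGV[1] { drop[$1]; next }
    # second file: keep the header line and any gene not in the list
    FNR == 1 || !($5 in drop)
  ' "$1" "$2"
}

# Example (placeholder file names):
# drop_genes smallRNA_symbols.txt genes.fpkm_tracking > genes.filtered.fpkm_tracking
```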

Enable X on AWS EC2 instance

Spin up an EC2 instance. For this demo, I used a t2.micro instance with the base Amazon Linux AMI.

First connect to our machine. You need to find your EC2 instance's public DNS name. It will be something like:
ec2-1-2-3-4.us-west-1.compute.amazonaws.com. On the Amazon Linux AMI the user is always ec2-user.


ssh -i /path/to/pem_file ec2-user@ec2-1-2-3-4.us-west-1.compute.amazonaws.com

Once on the EC2 machine, I needed to install some packages:


sudo yum install xorg-x11-xauth.x86_64 xorg-x11-server-utils.x86_64

Check that X11 forwarding is enabled for ssh. This was a good reference:

http://www.zorranlabs.com/blog/?p=4
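Concretely, the server side needs X11Forwarding set to yes in the sshd configuration. A small sketch of that check, assuming the usual /etc/ssh/sshd_config location; the function name is made up:

```shell
# Succeeds if the given sshd config has an uncommented "X11Forwarding yes" line.
# Usage: x11_forwarding_enabled [CONFIG_FILE]
x11_forwarding_enabled() {
  grep -Eiq '^[[:space:]]*X11Forwarding[[:space:]]+yes' "${1:-/etc/ssh/sshd_config}"
}

# Example, run on the EC2 instance:
# x11_forwarding_enabled && echo "X11 forwarding is on"
```

The file may only be readable by root, in which case the check has to be run with sudo. If it fails, set the option and restart the SSH daemon.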

Check that the DISPLAY variable is set on the EC2 instance. If it is not, set it:

export DISPLAY=:0.0
# or
export DISPLAY=:10.0

Login again using ssh with X11 forwarding

ssh -X -i /path/to/pem_file ec2-user@ec2-1-2-3-4.us-west-1.compute.amazonaws.com

To test:


xeyes
xclock

And you should be good to go. I could use R remotely from my linux desktop with plots showing up.