Assembling reads with velvet

Velvet is a hash-based de Bruijn graph assembler. I occasionally use it to build contigs from short-read RNAseq data, although it can also be used for long reads, metagenomics, etc.

There are two parts to the program:
1. velveth – prepare the sequences
2. velvetg – build the graph and output contigs.fa and some summary statistics

# prepare sequences
velveth contig/ 21 -raw dmd_kmer_matches.txt
# build the graph
velvetg contig/ -min_contig_lgth 100

There are several parameters you can tune:
– k-mer length (the hash length passed to velveth)
– coverage cutoff (cov_cutoff)
– minimum contig length (min_contig_lgth)
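The best k is data-dependent, so a common approach is to sweep several values and compare the resulting assemblies. A rough sketch under my own directory naming, guarded so it is a no-op when velvet is not installed; note velvet expects odd k:

```shell
# sweep odd k-mer lengths from 19 to 31, one output directory per k
for k in $(seq 19 2 31); do
  outdir="contig_k${k}"
  if command -v velveth >/dev/null; then
    velveth "$outdir/" "$k" -raw dmd_kmer_matches.txt
    velvetg "$outdir/" -cov_cutoff auto -min_contig_lgth 100
  fi
done
```

You can then compare the N50 reported by each velvetg run to pick a k.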

It accepts paired-end or single-end reads in FASTQ, FASTA, and other formats; it is not restricted to Illumina output.

velvetg will produce a FASTA file of contigs (contigs.fa) and summary statistics.

GMAP for long read RNA

GMAP is an old (circa 2005) piece of software for long-read alignment. Its use case is mapping RNA reads back to a genome. It has found new life in the world of long-read RNAseq, such as from PacBio reads.

Perhaps because of its age and architecture, it has some quirks and dependencies that seem odd to the modern bioinformatician.

Download & untar

Get the code from the GMAP download page; releases are versioned by date.

Extract the files using

tar zxvf gmap-2018-05-30.tar.gz

Change into the source directory and edit the configuration file. It sets some system-wide parameters; the most important will be the path to your
GMAP database, which is where the index, once you create it, will be placed.

Then configure and compile from the command line

./configure
make
make install

It will help to read the README and INSTALL notes for particular issues with different architectures, operating systems, etc. Because of the many optimizations, some deep diving may be required.

Building an index

There might be some prebuilt indexes, but I couldn’t find them. To build your own, you need to supply FASTA file(s): one per chromosome, one for the whole genome, one per contig, or whatever.

# -d names the database; in this case, I called it hg19
gmap_build -d hg19 /path/to/genome_fasta_files
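If you would rather not depend on the path baked in at compile time, gmap_build also accepts -D to set the database directory explicitly. A sketch; the directory location here is my own choice, and the guard makes it a no-op when gmap is absent:

```shell
# hypothetical database directory; gmap_build -D overrides the compiled-in path
GMAPDB="${TMPDIR:-/tmp}/gmapdb"
mkdir -p "$GMAPDB"
if command -v gmap_build >/dev/null; then
  gmap_build -D "$GMAPDB" -d hg19 /path/to/genome_fasta_files
fi
```

gmap itself takes the same -D flag at alignment time, so the index does not have to live in the compiled-in location.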

The index files will be placed in the database directory you configured at build time, or wherever you specify in the gmap_build command-line options.

Align a cDNA

To align, there are a few options.

  1. map
  2. map and align (show alignment)
  3. align only
  4. batch mode

Since I had several reads in one FASTQ file, I needed to convert FASTQ to FASTA.

Using seqtk
seqtk seq -a input.fastq > output.fasta
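If seqtk is not at hand, the same conversion can be done with plain awk, assuming standard 4-line FASTQ records (the function name and demo file are my own):

```shell
# FASTQ -> FASTA: line 1 of each 4-line record is the header, line 2 the sequence
fastq_to_fasta() {
  awk 'NR % 4 == 1 { sub(/^@/, ">"); print } NR % 4 == 2 { print }' "$1"
}

# tiny demo record
printf '@read1\nACGT\n+\nIIII\n' > /tmp/demo.fastq
fastq_to_fasta /tmp/demo.fastq   # prints ">read1" then "ACGT"
```

Note this silently drops the per-base qualities, which is exactly what seqtk seq -a does too.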

And align

# this aligns
gmap -d hg19 myfile.fasta
# this maps and aligns (shows the alignment)
gmap -d hg19 -A myfile.fasta
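For downstream tools that expect SAM/BAM, gmap can also emit SAM with -f samse (single-end) and run multi-threaded with -t. A sketch; the command is built in a variable first and only executed when gmap is actually installed:

```shell
# -f samse: SAM output for single-end reads; -t 4: four worker threads
db="hg19"
cmd="gmap -d $db -t 4 -f samse myfile.fasta"
if command -v gmap >/dev/null; then
  $cmd > myfile.sam   # ready for samtools sort/index
fi
echo "$cmd"
```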

Examining output

The output has the genome sequence on the top row and the cDNA on the bottom, with the cDNA in its forward direction.
Numbers below introns indicate their lengths in nucleotides.
GMAP uses these symbols:

| match
> canonical intron
- gap
= non-canonical intron

Figure 4 in the Wu et al paper is helpful in understanding the output. [Wu & Watanabe. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics May 2005]

Serving BAM files from S3

It’s handy to have a webserver where you can put up BAM files for sharing with collaborators. But if you have tons of files or no available webserver, you can use Amazon S3 to serve up your files. While S3 is pretty simple, setting it up for serving BAM (and index) files has a few tricky aspects.

Setting up an S3 bucket

You can do this pretty simply from the AWS dashboard. Pick a region (close to you) and create a folder or two. You can specify an encryption type so that data in S3 is encrypted (I used server-side encryption with Amazon S3-managed keys).

Remember, S3 bucket names are globally unique! Yours cannot be the same as any other existing bucket. Also, it’s a good idea to make the bucket private so that the BAM files are not open to the public.

Copy files to S3

I’ll assume you have the AWS CLI installed and configured for copying files. To list your buckets

aws s3 ls
# list a specific bucket
aws s3 ls s3://mybucket

We want to copy our files into S3. We need both the BAM file and the BAM index file.

aws s3 cp mybamfile.bam s3://mybucket/bamfiles
aws s3 cp mybamfile.bam.bai s3://mybucket/bamfiles
# alternatively
aws s3 sync myfolder s3://mybucket/bamfiles

More details about using the CLI are in the AWS CLI documentation.

S3 buckets do not support simple username/password authentication for clients like IGV. So if your bucket is private, you will need some mechanism to expose your data. There are a few options.

AWS CLI to presign a URL

Using the awscli, you can generate a signed URL. Basically, this generates a URL that will work for anyone who has the link, and it expires (default 3600 seconds). We actually need two signed URLs: one for the BAM and one for the BAM index file.

# this will make one signed URL for the BAM file
aws s3 presign s3://mybucket/path/to/mybam.bam
# this makes a signed URL for the BAM index file
aws s3 presign s3://mybucket/path/to/mybam.bam.bai
# to specify an expiry time in seconds (here, one week)
aws s3 presign s3://mybucket/path/to/mybam.bam --expires-in 604800
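Since the BAM and its index always travel together, a small helper (my own naming) can presign both in one go, assuming the index sits next to the BAM with the standard .bam.bai name:

```shell
# derive the standard index name (mybam.bam -> mybam.bam.bai)
index_of() { printf '%s.bai\n' "$1"; }

# presign a BAM and its index for one week (604800 seconds)
presign_pair() {
  if command -v aws >/dev/null; then
    aws s3 presign "$1" --expires-in 604800
    aws s3 presign "$(index_of "$1")" --expires-in 604800
  fi
}

# || true: don't abort a script if aws is missing or unconfigured
presign_pair "s3://mybucket/path/to/mybam.bam" || true
```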

View in IGV

Then open up IGV and use File > Load from URL, inputting the signed URLs for both the BAM and BAM index. Note that my standalone binary distribution of IGV did not have two places to input the BAM and BAM index signed URLs (there was only one input box). Instead, I had to go to the Broad website and launch IGV via Java Web Start. That version of IGV (2.4.10, 03/20/2018) did have two input boxes for the BAM and BAM index.

Generate a presigned URL using Java (Eclipse)

There is a code example in the AWS documentation. A few steps need to be completed:

  1. download the AWS Java SDK
  2. add the SDK jar files to your build path
  3. You must specify these things:
    clientRegion (such as us-west-2)
    bucketName (mybucket)
    objectKey (name of the object stored in S3). Note that if you have a subfolder, that goes into the objectKey name (examples/bamfiles/mybam.bam). This is because S3 doesn’t really have folders, so the object name actually includes the whole path.
  4. Change HttpMethod.GET to HttpsMethod.GET
  5. Build and run, and it will output the signed URL



Rstudio-server over ssh

Rstudio is a great tool, but suppose you need to access your work computer to do some work from home? Use Rstudio-server, which allows you to connect to a remote machine and do your development there.

I can’t find a definitive answer whether the free version of Rstudio-server encrypts the login information. Their website is not clear.

But we can always tunnel using SSH and things should be just fine.

In putty, set up an SSH tunnel as follows:

  1. Source port: 8787
  2. Destination: localhost:8787

Once you connect to the remote host using putty, you can use “http://localhost:8787” in your browser to access your Rstudio-server session. The plus is that the connection to the remote host is going over an SSH tunnel and therefore encrypted.
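Outside of putty (Mac, Linux, or Windows with OpenSSH), the same tunnel is a single ssh command; user@work-host is a placeholder for your own account and machine:

```shell
# -N: no remote shell, just forwarding; -L maps local port 8787 to remote 8787
cmd="ssh -N -L 8787:localhost:8787 user@work-host"
echo "$cmd"   # run this, then point a browser at http://localhost:8787
```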

RNAseq aligners – The Next Generation

There was a time when Tophat/Cufflinks was the only game in town. That has changed. By a lot. Some of the newest aligners include

  • STAR. Using a suffix array and the idea of compatible reads makes this a very fast aligner. Whereas a Tophat run requires 8+ hours, STAR will require <2. The downside: the index requires a lot of RAM. I think I’ve heard 32 GB of RAM works, but I run it on our 256 GB machine.
  • GMAP/GSNAP – actually in use since the EST days. I haven’t used it yet, but according to bake-off comparisons it is a relatively fast aligner.
  • Tophat/Cufflinks. Probably the most cited of all aligners. Straightforward to use, but seems to under-align compared to newer software.

Some of the new tools in town are specialized for certain tasks, particularly aspects of splicing:

  • Leafcutter is designed to study variation in intron splicing
  • MapSplice2 – splice junction discovery
  • SpliceTrap – differential splice usage
  • DEXseq – differential exon usage

New-wave differential gene expression can now be done in minutes using these tools! They use “transcripts per million” (TPM) instead of the confusing FPKM/RPKM that no one seems to understand.

  • Kallisto (Pachter lab). From the good folks who brought us Tophat/Cufflinks and eXpress. I like the interactive UI for playing with your data through Sleuth.
  • Salmon/Sailfish. Runs extremely fast and makes use of the idea of compatible alignments, which is more like STAR.
  • RSEM – haven’t looked at it yet

Workflows. There are a few workflows that seem popular for analysis:

  • Tophat/Cufflinks. You do the splice aware alignment, I’ll generate a ton of output for you
  • subread/edgeR/limma(voom)/Glimma. Align/count, normalize, run DEG analysis, and visualize, all thanks to Bioconductor
  • your favorite aligner + DESeq2/edgeR/ebSeq

Of course, there’s not a lot of consensus on what works best or is “right” so there’s still room to add to this list.


nodejs for genomics

Some interesting sites to peruse later…


Amazon CLI for S3

Some useful commands


aws ec2 describe-instances
aws ec2 describe-instances --region us-east-1
aws ec2 describe-instances --region us-east-1 --output table

Any AWS CLI command can use the following flags:
  • --profile – name of a profile to use, or “default” to use the default profile.
  • --region – AWS region to call.
  • --output – output format.
  • --endpoint-url – the endpoint to make the call against. The endpoint can be the address of a proxy or an endpoint URL for the in-use AWS region. Specifying an endpoint is not required for normal use, as the AWS CLI determines which endpoint to call based on the in-use region.


# make a bucket
aws s3 mb s3://bucket-name
# remove a bucket
aws s3 rb s3://bucket-name
# by default rb will not remove a non-empty bucket; use --force
aws s3 rb s3://bucket-name --force
# list buckets
aws s3 ls
# list contents of a bucket
aws s3 ls s3://bucket-name

Since the Bioconductor image is hosted on the east coast (us-east-1), we should have our S3 bucket in the same region.

aws s3 mb s3://bioarchive.1 --region us-east-1

Using the bioconductor AMI

The current BioC release is 3.3, for which there is an Amazon AMI (ami-64d43409). I’m not sure why, but the AMI is in the us-east-1 region, which isn’t convenient for me, and we do not have permission to copy it. So all services have to be in the same region, us-east-1 (Virginia). If there’s time, I’ll have to roll my own to have something closer to home.

Not sure if my analysis is more memory- or compute-heavy; going with memory optimized.

  • r3.2xlarge $0.66/hour
  • EBS $0.05/hour

Security group: enable SSH and HTTP (ports 22 and 80)

Connect to instance
ssh -i /path/to/key.pem ec2-user@ip-from-aws-console

Attach an EBS volume (console)

  1. Go to EC2, select Volumes, and create a 250 GB gp2 volume
  2. Make sure it’s in the same availability zone as the EC2 instance
  3. Select the volume, then select Attach
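The same console steps can be scripted with the AWS CLI; the availability zone, volume ID, and instance ID below are placeholders you would take from your own account:

```shell
az="us-east-1a"   # must match the instance's availability zone
if command -v aws >/dev/null; then
  # create a 250 GB gp2 volume
  aws ec2 create-volume --availability-zone "$az" --size 250 --volume-type gp2 || true
  # attach it to the instance (placeholder IDs)
  aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf || true
fi
```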

If the EBS volume is new (never used), it won’t have a filesystem yet; check, format, and mount it:


# if this returns just "data", there is no filesystem and we need to create one
sudo file -s /dev/xvdf
# format
sudo mkfs -t ext4 /dev/xvdf
# create mount point and mount
sudo mkdir /data
sudo mount /dev/xvdf /data
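To have the volume come back after a reboot, the mount can go in /etc/fstab. A sketch of the entry, matching the ext4 filesystem and /data mount point above (nofail keeps the instance booting even if the volume is detached):

```
# /etc/fstab entry: device, mount point, fs type, options, dump, fsck order
/dev/xvdf  /data  ext4  defaults,nofail  0  2
```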

Details are in the AWS EBS documentation.

To upload data

sftp -i /path/to/key.pem ubuntu@ip-from-aws-console
cd /my/dir
put myfile, etc.

Doing it better. With the need to upload data first, there are two options I can think of. First, upload to S3, then copy over to the machine instance or EBS volume. The other possibility, to save $$$, would be to use a low-cost instance with EBS, upload the data to it, then spin up a larger instance and reattach the same volume.

Using RStudio
Point browser to the public IP for the AMI instance. Login with user/pass ubuntu/bio.

Permissions problems
Within RStudio, I had problems loading dplyr, etc., despite the fact that the packages were present in the /home/ubuntu/R-libs directory.
I tried:
in /etc/rstudio/rsession.conf: set r-libs-user=~/R-libs
And in /home/ubuntu/R-libs
sudo chown -R ubuntu:ubuntu *
sudo chmod -R 755 *

Not sure which one worked, but RStudio now seems to work okay.