With its long single molecule reads, Pacbio IsoSeq offers the tantalizing possibility of reading entire full-length RNAs. The library prep is not for the faint of heart and a recent provider asked me for 5ug of RNA to start – not an easy task! Moreover, a thorough interrogation of a transcriptome would require at least 4 SMRT cells, which are the “chips” that the Sequel instrument currently uses with 1 million wells. The newer version is 8M wells at was released recently.
The raw data came back as unaligned BAM files and (BAM.pbi which presumably is pacbio index) and a bunch of other files. The subreads.bam is the file of interest since that is the “sequenced” data with adapters removed. For a useful glossary of Pacbio terms see here or a bit about version3 file formats. The scraps.bam file is basically whatever remains after removing adapter and barcode sequences.
I used the IsoSeq3 workflow which basically does this:
- Build a circular consensus sequence (CCS) from your subreads.bam
- Classify the reads. This means finding and removing 5′ and 3′ primers from each read. Limma is the tool that does this. It produces a file named unpolished.flnc.bam.
- Cluster the FLNC (full length non-chimeric) reads, which basically groups the reads into clusters of identical/similar reads (your isoforms).
- Polish the final reads. Not exactly sure what this does.
This is all nicely summed up here and here. Once you have your polished reads, you can then align them to a reference genome (eg minimap2), see what genes/isoforms are expressed, and visualize in IGV.
This is the software provided by Pacbio. It’s an end-to-end tool that handles sample setup, run design, data management and analysis. As such its complicated and a pain to install. Base requirements were 32 CPUs and 64GB of RAM. I got it to install on my desktop which had 8 CPUs and 24Gb of RAM but it was not fun using this. The viewer only works with Google Chrome.
I couldn’t get SMRTLink to start using the services-start command, so I ended up doing this:
ulimit -n 8019
env PATH=/home/rwang/projects/pacbio-barthelemy/apps/smrtlink/install/smrtlink-release_22.214.171.124975/admin/bin/../../bundles/smrttools/smrtcmds/bin:/home/rwang/projects/pacbio-barthelemy/apps/smrtlink/install/smrtlink-release_126.96.36.199975/admin/bin/../../bundles/smrtlink-analysisservices-gui/current/private/thirdparty/java/jre/jre_8u192b12-hotspot/bin:/home/rwang/projects/pacbio-barthelemy/apps/smrtlink/install/smrtlink-release_188.8.131.52975/admin/bin/../../bundles/smrtinub/current/private/thirdparty/python/python_2.7.9/binwrap:/home/rwang/projects/pacbio-barthelemy/apps/smrtlink/install/smrtlink-release_184.108.40.206975/admin/bin/../../bundles/smrtlink-analysisservices-gui/current/private/thirdparty/postgresql/postgresql_9.6.1/binwrap:/usr/bin:/bin SMRT_PIPELINE_BUNDLE_DIR=/home/rwang/projects/pacbio-barthelemy/apps/smrtlink/install/smrtlink-release_220.127.116.11975/admin/bin/../../bundles/smrttools/current/private/pacbio/pbpipeline-resources CATALINA_PID=/home/rwang/projects/pacbio-barthelemy/apps/smrtlink/install/smrtlink-release_18.104.22.168975/admin/bin/../../rwdir/run/smrtlink_gui.pid JAVA_HOME=/home/rwang/projects/pacbio-barthelemy/apps/smrtlink/install/smrtlink-release_22.214.171.124975/admin/bin/../../bundles/smrtlink-analysisservices-gui/current/private/thirdparty/java/jre/jre_8u192b12-hotspot JAVA_OPTS=-Djava.library.path= ./../../bundles/smrtlink-analysisservices-gui/current/private/pacbio/smrtlink-analysisservices-gui/bin/start
Then point chrome to http://servername:9090 and you should see
Import your data
Specify SMRT Link Server, and pick Sequel Data (XML).
You need to find the XML file that describes everything. In my case it was m54098_190719_203808.sts.xml. If everything goes well, your data will be imported and some QC data will be available to see.
- Overview|Status|Number of Records: number of “reads”
- Loading Report|Statistics| Productivity: Number of wells (ZMWs), number of empty wells (0), wells with 1 molecule (1), wells with 2 molecules (2). Obviously a high number of Productivity 1 wells means your chip was well utilized.
- Raw Data Report | Polymerase Read Length: how many bases “read”
- Raw Data Report | Subread Length: how many bases “read” that we actually care about (RNA without adapters, 1x pass)
- Raw Data Report | Insert Length vs Read Length: In my case, all insert reads were about 1000-2000 bases despite longer and longer polymerase reads.
You can only analyze data that has been imported into SMRTLink.
- Give the analysis a name
- Use “Analysis” not “Auto Analysis”
- Select Sequel Data
- Pick a data set
- Click Next
For IsoSeq there are three options
- Iso-Seq: CCS, Classify, Cluster. Generate full-length transcript isoforms.
- Iso-Seq Classify only: classify reads into full-length or non-full length reads
- Iso-Seq with Mapping. Runs the first option, then uses GMAP to align the outputted transcripts to a reference genome. You need to provide a reference genome.
On my modest computer (8 core, 24Gb), Iso-Seq took about 12 hours per sample (1M ZMW).
Aligning with Minimap2
From the SMRTLink Iso-Seq output, I retrieved the hq_transcripts.fastq file.
minimap2 -t 30 -ax splice -uf --secondary=no -C5 -O6,24 -B4 \
hg38.fasta hq_isoforms.fasta \
> hq_isoforms.fasta.sam \
samtools -bS hq_isoforms.fasta.sam > hq_isoforms.bam
samtool sort hq_isoforms.bam > hq_isoforms.sorted.bam
samtools index hq_isoforms.sorted.bam
From here, you can view the results in IGV, figure out which transcripts were detected, etc. At the moment, quantifying pacbio Iso-Seq is not recommended.