Plotting dense data

Handling high density data

In genomics, it’s often the case that the amount of data will overwhelm your screen or printer. What is one to do when trying to image a million data points?

Here are a few simple recommendations for R, followed by some more sophisticated strategies.

Change your printing character

The default R printing character is an empty circle that looks like ‘o’. You can set the printing character (pch) to a number, which indicates which symbol to use, but you can also just give a character directly, for example pch = ".".
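As a minimal sketch of the idea, assuming simulated data (the variables n, x and y are placeholders, not a real data set):

```r
# Simulated data standing in for a large data set (an assumption for illustration).
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)

# pch = "." draws each point as a small dot instead of the default open circle.
plot(x, y, pch = ".")
```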

In this case, we’ve used a very small character, a period, as our plot symbol. This allows us to cram lots of points together, making it easier to view a pattern if there are lots of data.

Downsample your data

Do you really need to plot all 1 million points to see a pattern? Often, you just need enough points to see the pattern. In this case you can sample from your data.
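One way to do this, sketched with simulated data (the sample size of 10,000 is an arbitrary choice and the variable names are placeholders):

```r
# Simulated data standing in for a million real data points (an assumption).
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- x + rnorm(n)

# Pick 10,000 indices at random, without replacement, and plot only those points.
keep <- sample(n, 1e4)
plot(x[keep], y[keep], pch = 20, cex = 0.5)
```

Sampling indices rather than values keeps each x paired with its own y.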

smoothScatter

This is a great function. Basically it computes a smoothed two-dimensional density estimate of a set of points and plots that instead of the points themselves, giving you an indication of how many points are in a given region. Think of it as averaging (in 2 dimensions) a regular x-y plot.

I learned about this here: http://www.biostars.org/p/47288/
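A minimal sketch with simulated data (the variable names are placeholders; smoothScatter is in the graphics package and relies on the KernSmooth package, which ships with most R installations):

```r
# Simulated data standing in for a large data set (an assumption for illustration).
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)

# Plot a smoothed 2D density of the points; by default a small number of
# points from the sparsest regions are also drawn on top (see nrpoints).
smoothScatter(x, y)
```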

Using alpha values

A little-known feature is the ability to specify colors in RGB with an alpha value. Alpha values are a measure of transparency, so you can set each point to print at 10% of its normal darkness. When multiple points land near each other, that region will get darker. Originally, this seemed to work only with PDF output, and you needed to specify version 1.4. It seems newer versions of R will let you do this with the regular graphics devices.
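A minimal sketch of this, assuming simulated data and black points at 10% opacity (the variable names are placeholders):

```r
# Simulated data standing in for a large data set (an assumption for illustration).
set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)

# rgb() accepts an optional alpha argument; on the default 0-1 scale,
# alpha = 0.1 makes each point 10% opaque.
faint_black <- rgb(0, 0, 0, alpha = 0.1)

# Overlapping points add up, so dense regions come out darker.
plot(x, y, pch = 20, col = faint_black)
```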

bigvis package

I haven’t had a chance to check this out yet, but Hadley Wickham has a new package for visualizing large data sets.

Revolutions bigvis

Advice for learning R

There are no easy ways to learn R except by
  1. Typing examples in yourself! It makes a difference (muscle memory, syntax)
  2. Persisting!
  3. Iterating!
There is a steep learning curve for learning R, but with a few commands it can become very useful very quickly. Also, keep trying things out – you will not break your data!
The best thing for the determined beginner is to start with a small project and work on it until you can make it do what you want. Do it every day until it works, then start a new project!

Y.A.B.B.

I forget a lot of things, so this is basically a reminder to myself of useful R/Bioconductor and bioinformatics code and ideas. But maybe this will be useful to someone else along the line. I deal mostly with genomic data analysis and my tools of choice are R/Bioconductor, Perl and C.