# Handling high density data

In genomics, it’s often the case that the amount of data will overwhelm your screen or printer. What is one to do when trying to image a million data points?

Here are a few simple recommendations I can make in R, followed by some more sophisticated strategies.

**Change your printing character**

The default R plotting character is an open circle that looks like ‘o’. You can set the plotting character (pch) with a number, which selects the symbol to use, or you can pass a character directly:

    y = rnorm(100)
    x = rnorm(100)
    plot(x, y, pch=".")

In this case, we’ve used a very small character, a period, as our plot symbol. This allows us to cram lots of points together, making it easier to view a pattern if there are lots of data.
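
To get a feel for the difference, here is a rough sketch comparing a few plot symbols on a larger sample. The sample size, the `cex` value, and the three-panel layout are arbitrary choices for illustration.

```r
# Illustrative data: the size and the relationship between x and y are assumptions
set.seed(1)
n <- 20000
x <- rnorm(n)
y <- x + rnorm(n)

op <- par(mfrow = c(1, 3))                                      # three panels side by side
plot(x, y, main = "pch = 1 (default)")                          # open circles
plot(x, y, pch = 16, cex = 0.3, main = "pch = 16, cex = 0.3")   # small filled dots
plot(x, y, pch = ".", main = 'pch = "."')                       # one tiny rectangle per point
par(op)                                                         # restore the previous layout
```

If the period turns out to be too faint, increasing `cex` scales it up.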

**Downsample your data**

Do you really need to plot all 1 million points to see a pattern? Often, you just need enough points to see the pattern. In this case you can sample from your data:

    # make a table of x and y
    xy = cbind(x,y)
    # randomly pick 25 row numbers
    someindices = sample(1:100, 25)
    # plot
    plot(xy[someindices, 1], xy[someindices, 2])
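
The same idea scales to a genuinely large data set. The following is a minimal sketch, assuming a simulated million-row matrix as a stand-in for real data. The names `big`, `keep`, and `sub` and the subset size of 5,000 are made up for illustration.

```r
set.seed(42)                               # make the random subset reproducible
n <- 1e6
big <- cbind(x = rnorm(n), y = rnorm(n))   # stand-in for a large data set

keep <- sample(nrow(big), 5000)            # 5,000 distinct row numbers (no replacement)
sub <- big[keep, ]                         # keep only the sampled rows

plot(sub[, "x"], sub[, "y"], pch = ".", xlab = "x", ylab = "y")
```

Using `nrow(big)` instead of a hard-coded range keeps the sampling correct if the size of the data changes.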

**smoothScatter**

This is a great function. It computes a 2D kernel density estimate of a set of points and draws it as a shaded image, giving you an indication of how many points fall in a given region. Think of it as smoothing (in 2 dimensions) a regular x-y plot.

    x1 = rnorm(100000)
    y1 = rnorm(100000)
    x2 = rnorm(100000, mean=2, sd=1.5)
    y2 = rnorm(100000, mean=3, sd=1.5)
    xy = rbind(cbind(x1, y1), cbind(x2, y2))
    smoothScatter(xy)

I learned about this here: http://www.biostars.org/p/47288/
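
smoothScatter also takes a few arguments that are worth experimenting with. This sketch shows three of them. The specific values and the color ramp are arbitrary choices for illustration, not recommendations.

```r
set.seed(1)
# Two overlapping clouds of points (means and sds are arbitrary)
xy <- rbind(cbind(rnorm(50000), rnorm(50000)),
            cbind(rnorm(50000, mean = 2, sd = 1.5),
                  rnorm(50000, mean = 3, sd = 1.5)))

smoothScatter(xy,
              nbin = 256,       # finer density grid than the default of 128
              nrpoints = 200,   # draw 200 individual points from the sparsest regions
              colramp = colorRampPalette(c("white", "steelblue", "darkblue")),
              xlab = "x", ylab = "y")
```

Setting `nrpoints = 0` drops the individual points entirely, which can look cleaner when outliers are not of interest.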

**Using alpha values**

A little-known feature is the ability to specify colors in RGB with an alpha value. The alpha value controls opacity, so you can set each point to print at 10% of its normal darkness. When multiple points land near each other, that region gets darker. Originally, this seemed to work only with PDF output, and you needed to specify version 1.4. It seems newer versions of R will let you do this with the regular graphics devices.

    # the PDF version is passed as a string
    pdf(file="output.pdf", version="1.4")
    # colors are given as amount of Red, Green, Blue and Alpha with range [0,1].
    # you can specify maxColorValue=255 if you want to use [0,255]
    plot(x, y, col=rgb(0,0.5,0,0.1))
    dev.off()
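
If you would rather start from a named color than from RGB components, `adjustcolor()` from grDevices can add the transparency for you. Here is a small sketch drawn on the default device. The data, the colors, and the variable names `faint_green` and `faint_blue` are made up for illustration, and whether the transparency renders depends on the graphics device in use.

```r
set.seed(1)
x <- rnorm(20000)
y <- x + rnorm(20000)

# rgb() with a fourth argument returns an "#RRGGBBAA" color string
faint_green <- rgb(0, 0.5, 0, 0.1)
# adjustcolor() applies an alpha factor to an existing named color
faint_blue <- adjustcolor("steelblue", alpha.f = 0.1)

plot(x, y, pch = 16, cex = 0.6, col = faint_green)
# overlay a second, shifted cloud; dense overlaps appear darker
points(x + 2, y, pch = 16, cex = 0.6, col = faint_blue)
```

Filled symbols such as `pch = 16` tend to show the darkening effect more clearly than open circles.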

**bigvis package**

I haven’t had a chance to check this out yet, but Hadley Wickham has a new package, bigvis, for visualizing large data sets.