Power analysis for rare variant studies

I recently came across SEQPower, a tool which “provides statistical power analysis and sample size estimation for sequence-based association studies”.

For my interests, one of the more interesting applications is trying to figure out the power from a study that samples from the extreme values of a quantitative trait distribution. Here I’m looking at rare variants (MAF < 0.01). The packaging of SEQPower could use a little more explanation, but basically you can download one of two formats: a heavy duty version and a lite version. The heavy duty version gives you the tools to generate random samples. The lite version will run using pre-generated random samples. These are the files that look like

  • AfricanAmericanEVS6500.sfs
  • EuropeanAmericanEVS6500.sfs
  • EuropeanGazaveModel.sfs

I think the .sfs designates site specific frequencies or something like that. We could probably figure that out by looking through the authors other works. Incidentally, this project has Susan Leal’s name on it, who gets credit for an early foray into these collapsing methods (ie CMC).

There are several illustrations in the documentation once you figure them out. But taking the top/bottom 10% this is what I used

~/bin/spower ELNR data/EuropeanAmericanEVS6500.sfs
--sample_size 1000 -p1 0.5
--meanshift_rare_detrimental 0.5
--QT_threshold 0.1 0.9
--method "CollapseQt --alternative 2"
-r 5000 -j 24 -l 1 -o extreme10

This is explained as follows:

  • run the extreme phenotyping option (ELNR)
  • use the EuropeanAmericanEVS6500 file in the data/ subdir
  • sample 1000 indiv from the 2 extremes (equiv to 500 from top 10%, 500 from bottom 10%)
  • meanshift_rare_detrimental how much does the rare variant move the phenotypic mean from the baseline. Units are interms of standard deviations, so we’re asking for at 0.5 SD shift here. Is that big?
  • p1 means assume we can sample until we get 500 ppl per tail. 50% means 50% from each tail. . There’s a different way to sample from tails (ie. given 1000, take top and bottom 10% but that’s different than this).
  • generate 5000 replicates. Default is 1.
  • -j 24 means to use 24 CPUs. I am using a heavy duty machine for this example
  • -l 1 means to limit to the first observation. In the data file, there might be more than one observation per variant, so tells us to use the first
  • -o extreme10 will create output files named extreme10.csv, extreme10.loci.csv, …

There are some defaults to this, for instance alpha 0.05, rare alleles are MAF<0.01, but that can be specified if different. The paper actually has a table of all the parameters or you can get them by
spower ELNR -h


I forget a lot of things, so this is basically a reminder to myself of useful R/Bioconductor and bioinformatics code and ideas. But maybe this will be useful to someone else along the line. I deal mostly with genomic data analysis and my tools of choice are R/Bioconductor, Perl and C.