CNVtools

Written by Chris Barnes, Vincent Plagnol and David Clayton

Download page

Introduction

CNVtools is an R package for robust case-control and quantitative trait association analysis of copy number variants (CNVs). The methods are described in the paper:

A robust statistical method for case-control association testing with Copy Number Variation.
Barnes C, Plagnol V, Fitzgerald T, Redon R, Marchini J, Clayton D, Hurles ME.
Nature Genetics, 2008 Oct;40(10):1245-52.

The package implements a robust association framework by unifying genotyping and association testing in a single model. It does this by incorporating a disease model into the mixture model for the signal: a logistic regression for a dichotomous disease variable, or a standard regression for a quantitative trait. Association is assessed via a likelihood ratio test. The procedure is independent of assay and platform and can be applied whenever there is a univariate measure of diploid copy number, e.g. SNP genotyping assays (R coordinate), Array-CGH or quantitative PCR.
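To make the likelihood ratio test step concrete, here is a minimal sketch in base R. It is a deliberate simplification and not the CNVtools implementation: it assumes the copy number `cn` is observed directly, whereas CNVtools treats it as latent and fits the disease model jointly with the mixture model for the signal. All data are simulated and all variable names are illustrative.

```r
# Simplified illustration of a likelihood ratio test for association.
# Assumption: copy number is observed here; CNVtools instead treats it as
# latent and fits it jointly with the signal mixture model.
set.seed(1)
n  <- 500
cn <- sample(0:2, n, replace = TRUE, prob = c(0.1, 0.4, 0.5))  # simulated copy number
disease <- rbinom(n, 1, plogis(-0.5 + 0.4 * cn))               # simulated case/control status

fit.null <- glm(disease ~ 1,  family = binomial)  # H0: no association
fit.alt  <- glm(disease ~ cn, family = binomial)  # H1: log-odds linear in copy number

# Twice the log-likelihood difference, compared with a chi-squared on 1 df
lrt.stat <- as.numeric(2 * (logLik(fit.alt) - logLik(fit.null)))
p.value  <- pchisq(lrt.stat, df = 1, lower.tail = FALSE)
```

The degrees of freedom equal the number of extra parameters in the alternative model, which is one here because the effect of copy number is modelled as a single trend.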

Often an assay will have multiple probes within a CNV. CNVtools also includes methods to extract the CNV signal in an optimal way using principal component analysis and a linear discriminant function.
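As a rough illustration of summarising several probes by principal component analysis, the sketch below uses base R's `prcomp` on simulated intensities. It is not the routine that ships with CNVtools; the matrix `raw.signal` and the simulation settings are assumptions made for this example.

```r
# Sketch: summarise several probes within one CNV by their first principal component.
set.seed(2)
n.samples <- 300
n.probes  <- 6
cn <- sample(1:3, n.samples, replace = TRUE, prob = c(0.2, 0.5, 0.3))  # simulated copy number

# Simulated probe intensities: each probe tracks copy number plus independent noise
raw.signal <- sapply(seq_len(n.probes),
                     function(j) cn + rnorm(n.samples, sd = 0.3))

pca <- prcomp(raw.signal, center = TRUE, scale. = FALSE)
pca.signal <- pca$x[, 1]  # one value per sample; the sign of a component is arbitrary
```

The resulting one-dimensional `pca.signal` plays the role of the univariate signal that the association test expects.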

Installation

Download the tarball from the download page, then on a Linux/Unix machine run:

> R CMD INSTALL CNVtools_1.34.6.tar.gz

Note that the website may have a more recent version. As usual with R package installation, you may have to specify your own library path. For more details, see the standard procedure for installing an R package from source.
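If you need a personal library path, one option is to install from within R. The sketch below uses the standard `install.packages` arguments for a local source tarball; the library directory is a hypothetical location, so substitute your own. On the command line, `R CMD INSTALL` accepts a `--library=` option for the same purpose.

```r
# Sketch: install the downloaded tarball into a personal library from within R.
tarball <- "CNVtools_1.34.6.tar.gz"                     # file name from the command above
lib.dir <- file.path(path.expand("~"), "R", "library")  # hypothetical library location

if (file.exists(tarball)) {
  dir.create(lib.dir, recursive = TRUE, showWarnings = FALSE)
  install.packages(tarball, repos = NULL, type = "source", lib = lib.dir)
  library(CNVtools, lib.loc = lib.dir)  # load from the personal library
}
```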

Common questions/problems

It won't install
CNVtools has been checked with various versions of Linux (32- and 64-bit) and Mac with R version 2.6.0. If you are having problems with installation, we recommend at least version 2.6.0 of R. If you are still having problems, email the mailing list.
Can CNVtools output the location of CNVs in my data?
No. CNVtools is for performing tests of association in known CNV regions. Use a CNV finding algorithm to locate regions in your samples that show copy number variation. Then use CNVtools to perform association within those regions.
My data looks like ...

If a histogram of the signal within the CNV region you want to test resembles the distribution on the left, you will not be able to use CNVtools for your association study. CNVtools requires a minimum data quality in order to fit a mixture model. In practical terms, you must be able to see individuals clustering clearly into copy number classes, as in the right plot. If this is not the case, it becomes impossible to distinguish a true association signal from plate/batch effects between cases and controls.
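A quick way to run this check is to histogram your summarised signal. The sketch below does so for two simulated signals, one with overlapping clusters and one with clear clusters; the data and the standard deviations are invented for illustration, so replace them with your own signal vector.

```r
# Sketch: compare a signal with clear copy number clusters to one without.
set.seed(3)
n  <- 1000
cn <- sample(1:3, n, replace = TRUE, prob = c(0.2, 0.5, 0.3))  # simulated copy number

good.signal <- rnorm(n, mean = cn, sd = 0.15)  # clusters well separated
poor.signal <- rnorm(n, mean = cn, sd = 0.60)  # clusters overlap heavily

h.good <- hist(good.signal, breaks = 50, plot = FALSE)
h.poor <- hist(poor.signal, breaks = 50, plot = FALSE)

# Draw the two histograms side by side when running interactively
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(h.poor, main = "Poorly separated", xlab = "signal")
  plot(h.good, main = "Clearly clustered", xlab = "signal")
  par(op)
}
```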
Which summarized signal from Illumina/Affy/Agilent.... shall I use?
Any signal that is a proxy for the number of copies. In Array-CGH data you might use the log2 ratio. In SNP data you might use R=X+Y, R=log(X+Y), R=log(1+X+Y) or the Log R Ratio. Any transformation that gives good separation of the copy number clusters, without distorting the distributions too much, is suitable. In qPCR data you would use the raw copy number signal.
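The SNP transformations listed above can be computed side by side and compared by eye. In this sketch `X` and `Y` are simulated allele intensities; the gamma distribution used to generate them is only a placeholder for real array output.

```r
# Sketch: candidate one-dimensional summaries from SNP-array allele intensities.
set.seed(4)
n <- 200
X <- rgamma(n, shape = 5, rate = 5)  # simulated A-allele intensity (hypothetical)
Y <- rgamma(n, shape = 5, rate = 5)  # simulated B-allele intensity (hypothetical)

candidates <- data.frame(
  sum       = X + Y,
  log.sum   = log(X + Y),
  log1p.sum = log1p(X + Y)   # log(1 + X + Y)
)
# Inspect each candidate, e.g. hist(candidates$log.sum, breaks = 50),
# and keep the one with the clearest copy number clusters.
```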
Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") : contrasts can be applied only to factors with 2 or more levels
CNVtools was designed to handle batch effects, but it can also fit data where none are present or where batches do not apply. Assign every sample the same batch label and remove "batches" from the model specification, so that batch is no longer a covariate in the model. For example, with the A112 data:
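batches <- rep(1, nrow(A112))  # one label for every sample: no batch structure
# Model formulas without "batches": means and variances depend on copy number only
no.batch.model.means <- "~ strata(cn)"
no.batch.model.var <- "~ strata(cn)"
fit.pca <- CNVtest.binary(signal = pca.signal, sample = sample, batch = batches,
                          ncomp = 3, n.H0 = 1, n.H1 = 0,
                          model.means = no.batch.model.means,
                          model.var = no.batch.model.var)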
This will solve the error that you see.
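For background, R raises this message whenever a model formula contains a factor with only one level. The base R sketch below reproduces it outside CNVtools, on the assumption that the same mechanism is at work inside the fitting code when every sample shares one batch label; the data frame `dat` is made up for the example.

```r
# Sketch: reproduce the error with a one-level factor in base R.
batches <- factor(rep(1, 10))  # every sample in the same batch
dat <- data.frame(batches = batches)

# A formula that includes the one-level factor fails ...
msg <- tryCatch({
  model.matrix(~ batches, data = dat)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# ... whereas a formula without it builds a design matrix as usual.
ok <- model.matrix(~ 1, data = dat)
```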
I get a worse signal using PCA and/or LDF than using the mean
PCA and LDF are most effective at extracting signal from poor-quality data and from data where the CNV breakpoints are over-specified. It may be that you do not need these transformations and that a simple mean summary of the data is more appropriate.
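To see why over-specified breakpoints matter, the sketch below simulates a region in which only three of ten probes lie inside the CNV and compares the mean summary with the first principal component. It uses base R only, not the CNVtools routines, and every number in the simulation is an assumption chosen for illustration.

```r
# Sketch: mean summary versus first principal component when the CNV
# boundaries are over-specified (only some probes carry signal).
set.seed(5)
n  <- 300
cn <- sample(1:3, n, replace = TRUE, prob = c(0.2, 0.5, 0.3))  # simulated copy number

inside  <- sapply(1:3, function(j) cn + rnorm(n, sd = 0.3))  # probes inside the CNV
outside <- sapply(1:7, function(j) rnorm(n, sd = 0.3))       # probes outside: noise only
probes  <- cbind(inside, outside)

mean.signal <- rowMeans(probes)
pca.signal  <- prcomp(probes)$x[, 1]

# Absolute correlation with the simulated copy number (sign of a PC is arbitrary)
cor.mean <- abs(cor(mean.signal, cn))
cor.pca  <- abs(cor(pca.signal, cn))
```

In this setting the principal component down-weights the uninformative probes, so it tracks copy number more closely than the mean. When every probe is informative the two summaries behave much more alike, which is when the simpler mean can be the better choice.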
What can I do about status=P?
This indicates that the posterior probability for a particular cluster has become non-zero far away from the cluster centre. Various scenarios can cause this, and it should become less of a problem when mixtures of t distributions are implemented.
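One way this can arise is when one Gaussian component is much broader than its neighbour. The sketch below computes the posterior probability of the broad component in a two-component mixture; the weights, means and standard deviations are made up, and the code is independent of CNVtools.

```r
# Sketch: posterior probability of a broad component in a two-component
# Gaussian mixture. All parameter values are made up for illustration.
w    <- c(0.5, 0.5)    # mixture weights
mu   <- c(0, 1)        # component means
sdev <- c(0.5, 0.15)   # component 1 is much broader than component 2

posterior1 <- function(x) {
  d1 <- w[1] * dnorm(x, mu[1], sdev[1])
  d2 <- w[2] * dnorm(x, mu[2], sdev[2])
  d1 / (d1 + d2)
}

x <- c(0, 1, 2.5)
round(posterior1(x), 3)
# Near its own centre (x = 0) component 1 dominates, near the centre of
# component 2 (x = 1) its posterior is close to zero, and far to the right
# (x = 2.5) it dominates again because its tails are heavier.
```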
Here are some suggestions:
- Is the number of components you are fitting reasonable?
  Look at the histogram/density plot and make sure you are fitting a reasonable number of components. The output of CNVtest.select.model can sometimes under- or overestimate the number of clusters.
- Try constraining/relaxing models.
  Experiment with the models you are fitting. For example, consider the following models:
  model.means = "~ strata(cn)"
  model.means = "~ cn"
  model.var = "~ 1"
  The first model leaves the component means free, whereas the second constrains them to be proportional to the copy number, essentially making them equally spaced. If the components in your data appear to have roughly the same variance, try the third model for the variance. The number of parameters in the model, and in particular the number of variances being estimated, greatly affects the stability of the fit.
- Look at the output of cnv.plot
  You should be able to identify the offending component by the posterior probability for that component falling to zero and then increasing again. Is this increase due to an outlier? If so, try removing it.
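To see how the model formulas in the second suggestion change the number of parameters, the sketch below builds the corresponding design matrices in base R. As an assumption for illustration, `factor(cn)` stands in for `strata(cn)`, that is, one free level per copy number class; this is not how CNVtools parses its formulas internally.

```r
# Sketch: how the formulas for the component means differ in parameter count.
# Assumption: factor(cn) stands in for strata(cn), i.e. one free level per copy number.
cn <- c(0, 1, 2, 0, 1, 2)  # copy number classes for a three-component fit

free.means   <- model.matrix(~ factor(cn))  # one mean per component
linear.means <- model.matrix(~ cn)          # intercept + slope: equally spaced means
common.var   <- model.matrix(~ 1, data = data.frame(cn = cn))  # a single shared variance

c(free = ncol(free.means), linear = ncol(linear.means), common = ncol(common.var))
```

With three components, the free model estimates three means, the linear model two parameters, and the common-variance model one variance. Fewer parameters generally give a more stable fit, at the cost of a stronger assumption about the data.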