CNVtools is an R package for performing robust case control and quantitative trait association analyses of Copy Number Variants.
The methods are described in the paper:
A robust statistical method for case-control association testing with Copy Number Variation.
Barnes C, Plagnol V, Fitzgerald T, Redon R, Marchini J, Clayton D, Hurles ME.
Nature Genetics, 2008 Oct;40(10):1245-52.
The package implements a robust association framework by unifying genotyping and association testing into a single model.
This is done by incorporating a disease model, which is either a logistic regression disease model for a dichotomous disease variable or a standard regression for a quantitative trait, into the mixture model for the signal. Association is assessed via a likelihood ratio test.
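As a rough illustration of the likelihood ratio test for a dichotomous trait, the sketch below uses base R only. It makes a strong simplifying assumption: copy number is treated as known, whereas CNVtools keeps it latent and fits it jointly with the signal mixture. The simulated data and all variable names are hypothetical.

```r
set.seed(1)
n <- 400

# Hypothetical data: integer copy number (assumed known here) and case/control status
cn     <- sample(1:3, n, replace = TRUE, prob = c(0.2, 0.5, 0.3))
status <- rbinom(n, 1, plogis(-0.5 + 0.4 * cn))

# Null model: disease status does not depend on copy number
fit0 <- glm(status ~ 1, family = binomial)
# Alternative model: log-odds of disease linear in copy number
fit1 <- glm(status ~ cn, family = binomial)

# Likelihood ratio statistic, compared to a chi-squared with 1 degree of freedom
lrt.stat <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
p.value  <- pchisq(lrt.stat, df = 1, lower.tail = FALSE)

print(c(statistic = lrt.stat, p.value = p.value))
```

The same comparison can be obtained with anova(fit0, fit1, test = "Chisq"). In the package itself the two likelihoods come from the full mixture model fitted with and without the association term.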
The procedure is assay- and platform-independent and can be applied whenever there is a univariate measurement of diploid copy number, e.g. SNP genotyping assays (R coordinate), array-CGH or quantitative PCR.
Often an assay will have multiple probes within a CNV. CNVtools also includes methods to extract the CNV signal in an optimal way using principal component analysis and a linear discriminant function.
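The idea of collapsing several probes into one signal can be sketched with base R. This is not the package's own implementation: it uses prcomp on simulated data, the matrix raw.signal and the probe loadings are hypothetical, and the linear discriminant step is not shown.

```r
set.seed(2)
n        <- 300
n.probes <- 6

# Hypothetical copy number per sample and probe-level signal
# (rows are samples, columns are probes; the last two probes are barely informative)
cn         <- sample(1:3, n, replace = TRUE)
loadings   <- c(1.0, 0.9, 0.8, 0.7, 0.2, 0.1)
raw.signal <- outer(cn, loadings) +
  matrix(rnorm(n * n.probes, sd = 0.3), n, n.probes)

# First principal component as a one-dimensional summary of the probes
pca        <- prcomp(raw.signal, center = TRUE, scale. = TRUE)
pca.signal <- pca$x[, 1]

# Simple alternative: the mean across probes
mean.signal <- rowMeans(raw.signal)

# Both summaries should track the underlying copy number (the sign of a PC is arbitrary)
print(c(pca = abs(cor(pca.signal, cn)), mean = abs(cor(mean.signal, cn))))
```

Whichever summary is chosen, the result is a single value per sample that can then be passed on to the mixture model.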
Simply download the tarball at this address.
Then, on a Linux/Unix machine, run:
> R CMD INSTALL CNVtools_1.34.6.tar.gz
(note that the website may have a more recent version).
As usual with R package installation, you may have to specify your own path for libraries.
For more details, see the standard procedures for installing an R package from source.
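If you prefer to install from within an R session, a minimal sketch using the standard install.packages function is given below. The library directory is a hypothetical example, and the tarball name is the one from the command above; adjust both to your setup.

```r
tarball <- "CNVtools_1.34.6.tar.gz"                 # file name from the command above
lib.dir <- file.path(Sys.getenv("HOME"), "Rlibs")   # hypothetical private library path

if (file.exists(tarball)) {
  # Create the private library if needed, then install from the local source tarball
  dir.create(lib.dir, recursive = TRUE, showWarnings = FALSE)
  install.packages(tarball, repos = NULL, type = "source", lib = lib.dir)
  library(CNVtools, lib.loc = lib.dir)
} else {
  message("Tarball not found in ", getwd(), "; download it first.")
}
```

Passing lib.loc to library() tells R to look in the private directory when loading the package.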
It won't install
CNVtools has been checked with various versions of Linux (32- and 64-bit) and Mac with R version 2.6.0. If you are having problems with installation, we recommend at least version 2.6.0 of R. If you are still having problems, email the mailing list.
Can CNVtools output the location of CNVs in my data?
No. CNVtools is for performing tests of association in known CNV regions. Use a CNV finding algorithm to locate regions in your samples that show copy number variation. Then use CNVtools to perform association within those regions.
My data looks like ...
If, when you histogram the signal within the CNV region that you want to test, your data resembles the distribution on the left then you will not be able to use CNVtools to perform your association study. CNVtools requires a minimum data quality in order to fit a mixture model. In practical terms this means that you must be able to see clear clustering of individuals into copy number classes like in the right plot. If this is not the case then it becomes impossible to distinguish a true association signal from plate/batch effects between cases and controls.
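A quick way to make this check is to plot the histogram of your summarized signal. The sketch below uses simulated signals, so the cluster positions and noise levels are hypothetical and only illustrate the two situations.

```r
set.seed(4)
n  <- 500
cn <- sample(1:3, n, replace = TRUE, prob = c(0.1, 0.6, 0.3))

# Hypothetical signals: same copy numbers, different amounts of noise
poor.signal <- cn + rnorm(n, sd = 0.8)    # clusters overlap completely
good.signal <- cn + rnorm(n, sd = 0.12)   # clusters are clearly separated

op <- par(mfrow = c(1, 2))
hist(poor.signal, breaks = 50, main = "Poorly separated", xlab = "signal")
hist(good.signal, breaks = 50, main = "Clear clusters",   xlab = "signal")
par(op)
```

With your own data, replace the simulated vectors by the signal you intend to pass to CNVtools and look for distinct peaks.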
Which summarized signal from Illumina/Affy/Agilent.... shall I use?
Any signal that is a proxy for the number of copies. In array-CGH data you might use the log2 ratio. In SNP data you might use R=X+Y or R=log(X+Y) or R=log(1+X+Y) or Log R Ratio. Any transformation that gives good separation of the copy number clusters is suitable, provided it does not distort the distributions too much. In qPCR data you would use the raw copy number signal.
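The SNP transformations mentioned above can be computed as follows. X and Y stand for the two allele intensities of one probe; the values here are simulated and purely hypothetical.

```r
set.seed(3)
n  <- 200
cn <- sample(1:3, n, replace = TRUE)

# Hypothetical allele intensities: total intensity grows with copy number
total <- cn + rnorm(n, sd = 0.15)
frac  <- runif(n)            # arbitrary split of the total between the two alleles
X <- total * frac
Y <- total * (1 - frac)

R.sum   <- X + Y             # R = X + Y
R.log   <- log(X + Y)        # R = log(X + Y)
R.log1p <- log1p(X + Y)      # R = log(1 + X + Y)

# Compare how the candidate signals relate to copy number
print(sapply(list(sum = R.sum, log = R.log, log1p = R.log1p), cor, y = cn))
```

All three are monotone functions of X + Y, so they order the samples identically; they differ only in how they stretch or compress the clusters.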
Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") : contrasts can be applied only to factors with 2 or more levels
CNVtools was designed to handle batch effects, but the fitting also works if you think none are present or if batches do not apply to your type of data. Assign all samples a single batch label and remove "batches" from the model specification, which removes batch as a covariate from the model.
For example, with the A112 data:
# Assign every sample to the same batch
batches <- rep(1, nrow(A112))
# Model specification without a batch term
no.batch.model.means <- "~ strata(cn)"
fit.pca <- CNVtest.binary(signal = pca.signal, sample = sample, batch = batches,
model.means = no.batch.model.means, model.var = no.batch.model.means)
This will solve the error that you see.
I get a worse signal using PCA and/or LDF than using the mean
PCA and LDF are really effective at extracting signal from poor quality data and from data where the CNV breakpoints are over-specified. It may be that you do not need these transformations and a simple mean summary of the data is more appropriate.
What can I do about status=P?
This indicates that the posterior probability for a particular cluster has become non-zero far away from the cluster centre. Various scenarios can cause this, and hopefully it will be less of a problem once mixtures of t distributions are implemented.
Here are some suggestions: