Normtools

Written by Chris Barnes, updates by Sam Robson

Introduction

A collection of C++ normalization and preprocessing tools. Most of the tools were designed with CNV analyses in mind, but some may be useful in more general applications. The emphasis is on flexibility: the programs mimic Unix tool behaviour, taking input from stdin and printing to stdout, which allows efficient piping of the data.

Installation

1) To check out the package use anonymous CVS access:

cvs -z3 -d:pserver:anonymous@cnv-tools.cvs.sourceforge.net:/cvsroot/cnv-tools co -P Normtools

2) Change into the Normtools directory and run make:

cd Normtools
make

This should build the applications univariate_quantile_norm, medianIQR, make_matrix and population_medianIQR.

3) Run the tests to make sure everything is working:

make test

You should see "Test completed successfully". See the script test/test.sh for examples of the data formats and of how to run the programs.

univariate_quantile_norm

Performs quantile normalization on a univariate signal. Quantile normalization is described in this paper:

A comparison of normalization methods for high density oligonucleotide array data based on variance and bias.
Bolstad BM, Irizarry RA, Astrand M, Speed TP.
Bioinformatics. 2003 Jan 22;19(2):185-93.

Input

Quantile normalisation is essentially a two-step process: first the target distribution is generated, then each sample is normalized to the target.
In generation mode (-gen, see below) the input via stdin is a list of full paths to the data files whose distributions are used to estimate the target distribution. The data files are assumed to contain tab-delimited columns of data.
In correction mode the input via stdin is the data file containing the distribution to normalize to the target distribution.
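
The two steps can be illustrated with a minimal C++ sketch. This is not the code used by univariate_quantile_norm; the function names make_target and normalise_to_target are hypothetical, and the sketch assumes that all samples have the same length, contain no missing values, and that the target is the rank-by-rank mean of the sorted samples, as in the Bolstad et al. approach.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Generation step (hypothetical helper): average the sorted values of every
// sample, rank by rank. Assumes equal-length samples with no missing values.
std::vector<double> make_target(const std::vector<std::vector<double>>& samples) {
    if (samples.empty()) return {};
    const std::size_t n = samples.front().size();
    std::vector<double> target(n, 0.0);
    for (std::vector<double> s : samples) {  // copy, so the caller's data stay unsorted
        std::sort(s.begin(), s.end());
        for (std::size_t i = 0; i < n; ++i) target[i] += s[i];
    }
    for (double& t : target) t /= static_cast<double>(samples.size());
    return target;
}

// Correction step (hypothetical helper): replace each value with the target
// value of the same rank. Ties are broken by input order (stable sort).
// Assumes sample and target have the same length.
std::vector<double> normalise_to_target(const std::vector<double>& sample,
                                        const std::vector<double>& target) {
    std::vector<std::size_t> order(sample.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return sample[a] < sample[b]; });
    std::vector<double> out(sample.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank) {
        out[order[rank]] = target[rank];
    }
    return out;
}
```

After correction every sample shares the value distribution of the target, while the rank order of the values within each sample is preserved.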

Output

In generation mode the target distribution will be written to -targetfile [file].
In correction mode the corrected distribution (along with any columns specified by -ext_columns [n1 n2 n3..]) will be written to stdout.

Options

-gen
Boolean. If specified generate target distribution (default false).
-targetfile [file]
Full path to the target file (must exist if -gen is not specified).
-column [n]
Column from input file to process.
-impute
Boolean. Impute missing data? (default false)
-ext_columns [n1 n2 n3..]
Extract these additional columns and output them before the normalised variable (default false). Correction stage only.
-header
Boolean. Do input files contain headers? (default false)
-v
Boolean. Should progress be printed? (default false)

medianIQR

Performs median and/or interquartile normalisation within samples.

Options

-med
Boolean. If specified, use median for normalisation (true by default).
-iqr
Boolean. If specified, use the interquartile range for normalisation (false by default).
-column [n]
Column from input file to process.
-ext_columns [n1 n2 n3..]
Extract these additional columns and output them before the normalised variable.
-header
Boolean. Do input files contain headers? (default false)
-log
Boolean. Specify this flag if your data are log-transformed. Non-logged data are assumed by default.
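
As an illustration of what median and interquartile range normalisation within a sample means, here is a minimal C++ sketch. It is not the medianIQR source: the function names are hypothetical, the quantile convention (linear interpolation between order statistics) is an assumption, and the sketch subtracts the median as one would for log-scale data. How the tool treats non-logged data is not covered here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Quantile by linear interpolation between order statistics, p in [0, 1].
// One common convention; the tool itself may use a different one.
// Assumes v is non-empty.
double quantile(std::vector<double> v, double p) {  // taken by value: sorted locally
    std::sort(v.begin(), v.end());
    const double pos = p * static_cast<double>(v.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, v.size() - 1);
    return v[lo] + (pos - static_cast<double>(lo)) * (v[hi] - v[lo]);
}

// Centre a sample on its median and, optionally, scale it by its
// interquartile range. Assumes a non-empty sample with a non-zero IQR.
std::vector<double> median_iqr_normalise(const std::vector<double>& x,
                                         bool use_med, bool use_iqr) {
    const double med = use_med ? quantile(x, 0.5) : 0.0;
    const double iqr = use_iqr ? quantile(x, 0.75) - quantile(x, 0.25) : 1.0;
    std::vector<double> out;
    out.reserve(x.size());
    for (double value : x) out.push_back((value - med) / iqr);
    return out;
}
```

For the sample 1 2 3 4 5 the median is 3 and, under this quantile convention, the interquartile range is 2, so median-only normalisation gives -2 -1 0 1 2 and median plus IQR normalisation gives -1 -0.5 0 0.5 1.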

make_matrix

Combines data from individuals to make matrices of intensities.

Options

-annotfile [file]
The full path name for the probe annotation file.
-chromosome [c]
The chromosome to process.
-cid [n]
The column number in the probe annotation file containing the Probe ID.
-cchr [n]
The column number in the probe annotation file containing the chromosome number.
-cpos [n]
The column number in the probe annotation file containing the probe starting coordinates.
-did [n]
The column in the data file containing the Probe ID number.
-title
Boolean. If specified, the title row is output in the resulting file.
-ext_columns [n1 n2 n3..]
Columns in the data file to extract. If more than one is specified, they are repeated for each sample.
For example, if the sample files contain three columns (id A B), specifying -ext_columns 2 3 will produce a matrix with this structure: s1_A s1_B s2_A s2_B...

-column_titles
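
The column layout produced by -ext_columns (extracted columns grouped per sample, samples in order) can be sketched in C++ as follows. The function matrix_titles is hypothetical and only illustrates the ordering s1_A s1_B s2_A s2_B... given in the example above; the underscore-joined titles are taken from that example, not from the make_matrix source.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build the data column titles of the combined matrix: for each sample,
// one title per extracted column, in the order the columns were requested.
std::vector<std::string> matrix_titles(const std::vector<std::string>& samples,
                                       const std::vector<std::string>& extracted) {
    std::vector<std::string> titles;
    titles.reserve(samples.size() * extracted.size());
    for (const std::string& sample : samples) {
        for (const std::string& column : extracted) {
            titles.push_back(sample + "_" + column);
        }
    }
    return titles;
}
```

Calling matrix_titles with samples s1, s2 and extracted columns A, B yields s1_A, s1_B, s2_A, s2_B.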

population_medianIQR

Applies a cross-population median and/or interquartile range normalization to matrices of intensities.

Options

-nannot [n]
Number of annotation columns to ignore at start of input data file.
-ndata [n]
Number of repeated data columns.
E.g. in a matrix containing s1_A s1_B s2_A s2_B... the number of repeated data columns is 2.
-med
Boolean. If specified, will perform median normalisation (true by default).
-iqr
Boolean. If specified, will perform interquartile range normalisation (false by default).
-column [n]
Column number within ndata to normalise.
E.g. in a matrix containing s1_A s1_B s2_A s2_B... specify 1 to normalize the A columns.
-save
Boolean. If specified, outputs both the input data and the normalised data. The headings of the normalised data columns have _n appended. False by default.
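
How -nannot, -ndata and -column together pick out the columns to normalise can be shown with a short C++ sketch. The function selected_columns is hypothetical and is not taken from the population_medianIQR source; it assumes the annotation columns come first, followed by ndata columns per sample, and that -column is counted from 1 as in the example above.

```cpp
#include <cstddef>
#include <vector>

// Return the 0-based positions, within a row of the matrix, of the data
// column chosen by `column` (1-based within each sample's block).
// Assumes 1 <= column <= ndata.
std::vector<std::size_t> selected_columns(std::size_t nannot, std::size_t ndata,
                                          std::size_t column, std::size_t nsamples) {
    std::vector<std::size_t> positions;
    positions.reserve(nsamples);
    for (std::size_t sample = 0; sample < nsamples; ++sample) {
        // Skip the annotation columns, then whole sample blocks, then
        // step to the requested column inside this sample's block.
        positions.push_back(nannot + sample * ndata + (column - 1));
    }
    return positions;
}
```

With three annotation columns followed by s1_A s1_B s2_A s2_B, the call selected_columns(3, 2, 1, 2) returns positions 3 and 5 (the A columns), and selected_columns(3, 2, 2, 2) returns 4 and 6 (the B columns).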