Experimental Design

To rerun my analysis from the exome pipeline, I decided to try several experiments and evaluate which one best fits my analysis. So far I have come up with these ideas:

Experiment 1

Use the FASTX-Toolkit to clean the reads, align with BWA, post-process with GATK, and evaluate the genotype of the individual.

Experiment 2

Use Trimreads to clean the reads, align with BWA, post-process with GATK, and evaluate the genotype of the individual.

Experiment 3

Skip trimming, align with BWA, post-process with GATK, and evaluate the genotype of the individual.

Experiment 4

Use SOAP to align and call SNPs

I’m planning to use one individual from the 1000 Genomes Project to compare the genotypes.
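One possible way to compare the genotypes from each experiment against the 1000 Genomes individual is a simple concordance score. The sketch below is only an illustration: the function name `genotype_concordance` and the data layout (dictionaries keyed by `(chromosome, position)` with a tuple of alleles as the value) are assumptions of mine, not part of any of the tools listed above.

```python
def genotype_concordance(called, truth):
    """Fraction of shared sites where both genotypes carry the same alleles.

    `called` and `truth` are hypothetical dicts mapping (chrom, pos) to a
    tuple of alleles, e.g. ("A", "G"). Allele order is ignored.
    Returns None when the two call sets share no sites.
    """
    shared_sites = set(called) & set(truth)
    if not shared_sites:
        return None
    matches = sum(
        1 for site in shared_sites
        if sorted(called[site]) == sorted(truth[site])
    )
    return matches / len(shared_sites)


# Toy data, made up purely to show the call.
called = {("1", 100): ("A", "G"), ("1", 200): ("C", "C"), ("2", 50): ("T", "T")}
truth = {("1", 100): ("G", "A"), ("1", 200): ("C", "T"), ("3", 5): ("A", "A")}
print(genotype_concordance(called, truth))
```

Only sites present in both call sets are scored, so sites missed by an experiment would need to be counted separately.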

Fastq – Guessing between 4 formats

The FASTQ format is poorly standardized, and it is easy to get lost trying to work out which quality encoding your file uses. To help with that, I found a script at https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py that you can use to find out which encoding your files use, with the command:

awk 'NR % 4 == 0' your.fastq | python guess-encoding.py [options]

So far this supports the following encodings:

'Sanger': (33, 73),
'Solexa': (59, 104),
'Illumina-1.3': (64, 104),
'Illumina-1.5': (67, 104)

For my file, it output:

Solexa  Illumina-1.3    (66, 104)

There is also this script:

./SolexaQA.pl reads1.fastq
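To make the encoding guess less of a black box, here is a minimal sketch of the range comparison. It is not the linked script, only my own illustration of the idea: take the smallest and largest ASCII codes seen in the quality lines and keep every encoding whose range contains them. The ranges are the ones listed above; the function names are assumptions.

```python
# ASCII code ranges per encoding, copied from the list above.
RANGES = {
    "Sanger": (33, 73),
    "Solexa": (59, 104),
    "Illumina-1.3": (64, 104),
    "Illumina-1.5": (67, 104),
}


def quality_range(quality_lines):
    """Return (min, max) ASCII code over all quality lines, or None if empty."""
    lo, hi = None, None
    for line in quality_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        codes = [ord(ch) for ch in line]
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    return (lo, hi) if lo is not None else None


def candidate_encodings(lo, hi):
    """Names of all encodings whose range fully contains [lo, hi]."""
    return [
        name for name, (rmin, rmax) in RANGES.items()
        if lo >= rmin and hi <= rmax
    ]


# The observed range for my file was (66, 104).
print(candidate_encodings(66, 104))
```

With the observed range (66, 104), both Solexa and Illumina-1.3 remain possible, which is consistent with the ambiguous output I got. To feed it real data, `quality_range` can be given `sys.stdin` after the same `awk` filter.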


Datasets: references for SIFT, PolyPhen, ANNOVAR

OMIM variants extracted by Omicia and provided as a track (OMICIA_auto) on the next release of UCSC tables (http://genome-preview.ucsc.edu/…)

COSMIC rev54 (rev55 as of a couple of days ago), downloaded as a text table that I had to convert to BED with some Perl magic (ftp://ftp.sanger.ac.uk/pub/CGP/cosmic)
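The main trap in that kind of conversion is the coordinate system: BED uses 0-based, half-open intervals, so a 1-based inclusive start has to be decreased by one. Below is a rough Python sketch of that single step, not the Perl I actually used. It assumes the position is available as a `chrom:start-end` string with 1-based inclusive coordinates and that a `chr` prefix is wanted; both are assumptions, and the function name is hypothetical.

```python
def position_to_bed(position, name="."):
    """Convert a 'chrom:start-end' string (assumed 1-based, inclusive)
    into a tab-separated BED line (0-based start, exclusive end)."""
    chrom, span = position.split(":")
    start, end = (int(value) for value in span.split("-"))
    fields = ["chr" + chrom, str(start - 1), str(end), name]
    return "\t".join(fields)


# Made-up coordinates, only to show the off-by-one shift.
print(position_to_bed("1:100-100", "example_variant"))
```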

dbSNP was not an easy catch, and I am still struggling to get the full information from their difficult batch download system (only feasible through Ensembl BioMart so far; tip: the hg18 BioMart is at http://may2009.archive.ensembl.org/biomart/martview/). For dbSNP, I searched for records with a phenotype (thanks to another colleague), which is the only available annotation for picking disease variants but in fact includes many association results that are far from being causative.

Cancer Datasets


Breast Cancer Datasets




Copy-number variations (CNVs), a form of structural variation, are alterations of the DNA of a genome that result in the cell having an abnormal number of copies of one or more sections of the DNA. CNVs correspond to relatively large regions of the genome that have been deleted (fewer than the normal number) or duplicated (more than the normal number) on certain chromosomes. For example, a chromosome that normally has sections in the order A-B-C-D might instead have sections A-B-C-C-D (a duplication of “C”) or A-B-D (a deletion of “C”). Source: http://en.wikipedia.org/wiki/Copy-number_variation
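The A-B-C-D example can be turned into a toy sketch that counts how often each section appears in a sample compared with the reference. This is only an illustration of the definition, with sections written as single letters; it is not how real CNV callers work, and the function name is my own.

```python
from collections import Counter


def copy_number_changes(reference, sample):
    """Label each reference section as duplicated or deleted in the sample.

    Both arguments are strings of single-letter section names, e.g. "ABCD".
    Sections with an unchanged copy count are left out of the result.
    """
    ref_counts = Counter(reference)
    sample_counts = Counter(sample)
    changes = {}
    for section, ref_count in ref_counts.items():
        difference = sample_counts[section] - ref_count
        if difference > 0:
            changes[section] = "duplication"
        elif difference < 0:
            changes[section] = "deletion"
    return changes


print(copy_number_changes("ABCD", "ABCCD"))
print(copy_number_changes("ABCD", "ABD"))
```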

Synonymous and Non-Synonymous SNPs

To be a synonymous or a non-synonymous SNP, the SNP must fall inside a protein-coding region of the DNA (otherwise it is a noncoding SNP). A synonymous SNP is a coding SNP that does not change the protein sequence. A non-synonymous SNP is one that changes the protein sequence. So what you have to check is whether the SNP changes a codon to a different codon for the same amino acid, in which case it is a synonymous SNP, or to a codon that codes for a different amino acid, in which case it is a non-synonymous SNP.
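The codon check described above can be sketched in a few lines of Python. This assumes the standard genetic code and that you already know the reference codon (on the coding strand) and the position of the SNP within it; the function name and arguments are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon, position, alt_base):
    """Classify a SNP inside a codon as synonymous or non-synonymous.

    ref_codon: reference codon on the coding strand, e.g. "GAG"
    position:  0-based index of the changed base within the codon (0, 1, 2)
    alt_base:  the alternative base at that position
    """
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"


print(classify_coding_snp("CTT", 2, "C"))  # CTT and CTC both code for leucine
print(classify_coding_snp("GAG", 1, "T"))  # GAG is glutamate, GTG is valine
```

In this sketch a change to or from a stop codon (`*`) is also reported as non-synonymous, since it changes the protein.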






Cluster Commands and Tricks 06/09/11

Every node has 8 cores and 16 GB of RAM.

Useful Commands for the cluster:

List Queues and times of execution

scontrol show partitions

List Jobs

squeue

Cancel Job (jobid)

scancel -v 179

List nodes status

sinfo

Submit a process

srun -p long|medium|short -w veredas17 -o output_file COMMAND
Example of a command:
srun -p medium -o piccard java -jar ../../bin/picard-tools-1.52/SamFormatConverter.jar INPUT=../alignment/aln.sam OUTPUT=exome.bam VALIDATION_STRINGENCY=LENIENT TMP_DIR=/tmp