Experimental Design

To rerun my analysis from the exome pipeline, I decided to try several experiments and evaluate which one fits my analysis best. So far I have come up with these ideas:

Experiment 1

Use the FASTX-Toolkit to clean the reads, align with BWA, postprocess with GATK and evaluate the genotype of the individual.

Experiment 2

Use Trimreads to clean the reads, align with BWA, postprocess with GATK and evaluate the genotype of the individual.

Experiment 3

Do no trimming, align with BWA, postprocess with GATK and evaluate the genotype of the individual.

Experiment 4

Use SOAP to align and call SNPs.

I’m planning to use one individual from the 1000 Genomes Project to compare the genotypes.
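One simple way to compare genotypes between an experiment and the reference individual is a concordance fraction over the sites both call sets share. The sketch below is only an illustration of that idea, not a method I have settled on: the function name genotype_concordance, the dict layout keyed by (chromosome, position), and the example sites are all made up for the example.

```python
def genotype_concordance(called, truth):
    """Fraction of shared sites whose genotypes agree.

    Both arguments are dicts mapping (chrom, pos) -> tuple of alleles.
    Allele order is ignored, so ("A", "G") matches ("G", "A").
    Returns None when the two call sets share no site.
    """
    shared = set(called) & set(truth)
    if not shared:
        return None
    matches = sum(
        1 for site in shared
        if sorted(called[site]) == sorted(truth[site])
    )
    return matches / len(shared)


# Hypothetical genotypes, invented purely for illustration.
called = {
    ("chr1", 100): ("A", "G"),
    ("chr1", 200): ("C", "C"),
    ("chr2", 5): ("T", "T"),   # not present in the reference set, so ignored
}
truth = {
    ("chr1", 100): ("G", "A"),
    ("chr1", 200): ("C", "T"),
}
concordance = genotype_concordance(called, truth)
```

Running the same comparison for each experiment would give one number per pipeline, which makes the four setups easy to rank against the same reference individual.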

Fastq – Guessing between 4 formats

The FASTQ format description is a real mess, and it is easy to get lost trying to work out which quality encoding your file uses. Because of that I looked around and found a script at https://github.com/brentp/bio-playground/blob/master/reads-utils/guess-encoding.py that you can use to find out which encoding your files use, with the command:

awk 'NR % 4 == 0' your.fastq | python guess-encoding.py [options]

So far it supports the following encodings, each given as the (minimum, maximum) ASCII code of the quality characters:

'Sanger': (33, 73),
'Solexa': (59, 104),
'Illumina-1.3': (64, 104),
'Illumina-1.5': (67, 104)
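For my file the output was: Solexa  Illumina-1.3    (66, 104). In other words, the quality characters in my file span ASCII codes 66 to 104, which fits both the Solexa and the Illumina-1.3 ranges.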
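To make the logic of the range check concrete, here is a minimal sketch of the same idea in Python. It is not the code of guess-encoding.py, just my own illustration: the ranges are copied from the list above, and the function names (quality_lines, observed_range, compatible_encodings, guess) are invented for the example.

```python
# (min, max) ASCII codes per encoding, copied from the list above.
RANGES = {
    "Sanger": (33, 73),
    "Solexa": (59, 104),
    "Illumina-1.3": (64, 104),
    "Illumina-1.5": (67, 104),
}


def quality_lines(fastq_lines):
    """Yield every fourth line, like awk 'NR % 4 == 0'."""
    for number, line in enumerate(fastq_lines, start=1):
        if number % 4 == 0:
            yield line.rstrip("\n")


def observed_range(qualities):
    """Return (min, max) ASCII code seen, or None if there is no data."""
    lo = hi = None
    for quality in qualities:
        for char in quality:
            code = ord(char)
            lo = code if lo is None else min(lo, code)
            hi = code if hi is None else max(hi, code)
    return None if lo is None else (lo, hi)


def compatible_encodings(lo, hi):
    """Names of all encodings whose range contains the observed range."""
    return [name for name, (rmin, rmax) in RANGES.items()
            if rmin <= lo and hi <= rmax]


def guess(fastq_lines):
    """Return (compatible encoding names, observed range) for FASTQ lines."""
    seen = observed_range(quality_lines(fastq_lines))
    if seen is None:
        return [], None
    return compatible_encodings(*seen), seen
```

To try it on a real file, something like "with open('your.fastq') as handle: print(guess(handle))" should be enough. Note that a file whose qualities happen to cover only part of a range can match more than one encoding, which is exactly what happened with mine.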
There is also this script:
./SolexaQA.pl reads1.fastq