Broad releases FASTG reference format that contains variation

The FASTG Format Specification Working Group is pleased to announce version 1.0 of the FASTG specification.

FASTG is a format for faithfully representing genome assemblies in the face of allelic polymorphism and assembly uncertainty. Currently, genome assemblies are represented linearly, as sequences of bases recorded in FASTA files. Since chromosomes are in fact linear or circular, this makes sense, so long as one has complete knowledge of the genome. However, many genomes contain polymorphisms that cannot be represented in a simple linear sequence, and almost all assemblies contain errors and omissions, which can result in incorrect biological inferences.

The FASTG format aims to address this problem using a flexible graph-based approach to encode any variability in the sequence, along with metadata to score and annotate the source of those variations. Assembly graphs in FASTG can be easily translated into linear FASTA sequences to support current analysis tools for read mapping, annotation, visualization, etc., but our hope is to develop a next generation of assembly and genome analysis algorithms that can work with the graph structure directly.

For the complete specification and additional information on FASTG, please visit:

http://fastg.sourceforge.net

http://fastg.sourceforge.net/FASTG_Spec_v1.00.pdf
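
To make the graph idea concrete, here is a minimal Python sketch of how a sequence with a variable site might be held in memory and then flattened to FASTA. This is not FASTG syntax and not part of the specification: the list-of-segments representation, the function names (linearize, all_sequences, to_fasta), and the toy contig are all assumptions made up for illustration. Flattening here simply takes the first listed allele at each variable site, which is one possible convention among many.

```python
from itertools import product
from typing import List, Union

# Hypothetical in-memory model (NOT the FASTG syntax): a segment is either
# a fixed run of bases (str) or a list of alternative alleles at one site.
Segment = Union[str, List[str]]


def linearize(segments: List[Segment]) -> str:
    """Collapse the toy graph to one linear sequence by taking the
    first listed allele at every variable site."""
    return "".join(seg if isinstance(seg, str) else seg[0] for seg in segments)


def all_sequences(segments: List[Segment]) -> List[str]:
    """Enumerate every sequence the toy graph can spell out."""
    choices = [[seg] if isinstance(seg, str) else seg for seg in segments]
    return ["".join(path) for path in product(*choices)]


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record, wrapping lines at `width` bases."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Made-up example: a single-base variant (C or G) between two fixed flanks.
toy_assembly: List[Segment] = ["ACGT", ["C", "G"], "TTGA"]

fasta_text = to_fasta("toy_contig", linearize(toy_assembly))
variants = all_sequences(toy_assembly)
```

The linear FASTA record keeps only one of the two possible sequences, while all_sequences shows what the graph still knows. That lost information is what a graph-based format is meant to preserve.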

If you are interested in discussing this further, please subscribe to the assemblathon-file-format mailing list:

http://assemblathon.org/pages/mailing-list

The immediate plan is to enlist help in developing a reference library and command-line suite for parsing, transforming, and querying assemblies in FASTG format, similar to the widely used SAM/SAMtools suite.

source: http://www.biostars.org/p/59370/

 

Raony Guimaraes