Sequence Formats

There are a lot of them. FASTA comes to mind. GenBank is another. Clustal, EMBL, GCG, and the list goes on. I’d say FASTA is one of the most commonly used or accepted, but I could be wrong. Still, many databases and software programs have their own format that they accept and generate, and some will accept or generate files in several formats. It can get a bit confusing. So, you’ve got a sequence file in PAUP format but you need it in FASTA? Don’t even know what format it is, what the formats look like, or what information they contain?
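If you don’t know what format a file is in, the first non-blank line is often a good hint. Here is a rough Python sketch of that idea. The function name guess_format is made up for illustration, the checks only cover a handful of common formats, and a real file can easily fool it, so treat the result as a guess and not a verdict.

```python
import re


def guess_format(text):
    """Guess a sequence format from the first non-blank line (heuristic only)."""
    first = ""
    for line in text.splitlines():
        if line.strip():
            first = line.strip()
            break
    if not first:
        return "unknown"  # empty input
    if first.startswith(">"):
        return "fasta"  # FASTA records begin with a ">" header line
    if first.startswith("LOCUS"):
        return "genbank"  # GenBank flat files open with a LOCUS line
    if first.startswith("ID "):
        return "embl"  # EMBL flat files open with an ID line
    if first.upper().startswith("CLUSTAL"):
        return "clustal"  # Clustal alignments open with a CLUSTAL banner
    if first.upper().startswith("#NEXUS"):
        return "nexus"  # NEXUS files (used by PAUP) open with #NEXUS
    if re.match(r"\d+\s+\d+(\s|$)", first):
        return "phylip"  # PHYLIP opens with sequence count and length
    return "unknown"


if __name__ == "__main__":
    print(guess_format(">seq1 example\nACGTACGT\n"))
```

Formats without a distinctive first line will simply come back as "unknown" here.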

Here are some links that could help. I have gathered them over time, most recently while working with a PHYLIP file:
Oxford’s CGRB’s examples of sequence formats.

EMBOSS’s explanation of sequence formats.

EBI’s help section on sequence formats.

Here are two programs that will convert one format to another:

Readseq (home URL and downloadable code here)
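For a simple case you can also script a conversion yourself. The following is a minimal Python sketch, not how Readseq works, that turns aligned FASTA text into sequential PHYLIP. The function names parse_fasta and fasta_to_phylip are hypothetical, and the sketch assumes the sequences are already aligned to the same length and that names are cut or padded to 10 characters, as strict PHYLIP expects.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            # Keep only the first word of the header as the name.
            name, chunks = (fields[0] if fields else "unnamed"), []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def fasta_to_phylip(text):
    """Convert aligned FASTA text to sequential PHYLIP text."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("PHYLIP needs aligned sequences of equal length")
    # Header line: number of sequences, then alignment length.
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Strict PHYLIP: name occupies exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = ">alpha first sequence\nACGT\nAC\n>beta\nAC--\nGT\n"
    print(fasta_to_phylip(example), end="")
```

Be aware that cutting names to 10 characters can make two different sequences end up with the same name, so check the output before feeding it to another program.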


Hopefully that will get you started in making sense of sequence formats. Have any other help pages or conversion programs to suggest?

3 thoughts on “Sequence Formats”

  1. gsgs

    my favourite format is one
    header line with several fields,
    separated by comma, followed by the
    nucleotides, aligned for the
    corresponding group of sequences.
    One long line for the nucleotides.
    Conversion program to convert to /from FASTA.

    I’m only interested in influenza
    and my favourite database contains
    aligned records with one header line
    and 8 nucleotide lines for the segments (some of the 8 lines may be empty) in one big file of all sequences.

    then command-line utilities to extract, merge sub-bases from this
    other utilities for analysis,

