What’s the Answer? (align hundreds of genomes)

BioStar is a site for asking, answering, and discussing bioinformatics questions. We are members of the community and find it very useful. Questions and answers often arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will highlight one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

Today’s featured issue: What do you do with hundreds of genomes?

Question: Help with multiple whole genome alignment. Aligning over 400 whole genomes

ClustalW is extremely limited when it comes to multiple whole genome alignment. I have recently looked at Mugsy, which claims to be able to align a little over 30 whole genomes.

Is there software that can align 400 whole genomes? This would be over a Gb of data.

Any help would be enormously appreciated.


It wasn’t clear at first, but it turned out this was a set of bacterial genomes. However, more and more researchers are going to want to align, analyze, and visualize enormous sets of newly sequenced genomes of all sorts, using different strategies. The number of genomes coming out every week continues to astound me. Just yesterday I was looking at that paper on the 10,000 birds and it boggled my mind, but not all of those genomes are fully available now, and that could affect the ultimate conclusions at this point. Of course, there are a lot of issues to consider about how to do analyses of this sort, and there is debate. It’s a debate we need to have now, though: the healthy, scientific kind of debate. Those genomes are coming.

Anyway, check out the answers to this question, and if you have other strategies to suggest, be sure to add your voice to the discussion.

2 thoughts on “What’s the Answer? (align hundreds of genomes)”

  1. gsgs

    I often have ~10000 flu sequences to align.
    Most of these sequences align without gaps, so I wrote a simple,
    fast program for that (O(n^2)). The few that don’t align well can be done
    with other programs; I also use MAFFT.
    Recently I thought alignment should be fast: there should be a program
    that looks up 16-subsequences and aligns it to the best match so far.
    This is O(n). I haven’t implemented it yet, but it should work.
    Similar to “blast”?

  2. gsgs

    Well, I used 12-subsequences, not 16-subsequences. Assume ~1000 consensus nucleotide sequences
    of length ~2000,
    so ~2M 12-subsequences in total, stored in a table of 16M pointers going to 16M lists of indices and positions,
    most (~90%) of which are empty. The others contain the numbers of the matching
    consensus sequences. That part is already implemented (256 consensus sequences only so far)
    and works well (100 sec for the 207000 flu-A sequences, 330MB in GenBank).
    It gives a quick (O(n)) characterization of each test sequence from the big unaligned file:
    which of the 256 consensus sequences it is most similar to.
    Now, align only the 256-consensus list, also store the positions of the subsequences in
    that 256-alignment, compute the best positioning, and align each test sequence just like its
    assigned best-matching consensus sequence!?
    Someone should have done it already …
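
For readers who want to try the idea in the second comment, here is a minimal Python sketch. It is not gsgs's program. The names `kmer_codes`, `build_table`, and `classify`, the 2-bit base encoding, and the choice to restart the window at ambiguous bases are all assumptions made for illustration. Each 12-subsequence is packed into an integer below 4^12 = 16,777,216 (the "16M" table size), and a dictionary stands in for the mostly empty pointer table. With roughly 1000 x 2000 = 2M entries, at most about 12% of the slots can be occupied, which fits the ~90% empty figure.

```python
from collections import Counter, defaultdict

K = 12  # 4**12 = 16,777,216 possible 12-subsequences
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def kmer_codes(seq, k=K):
    """Yield (start position, integer code) for each k-mer of plain A/C/G/T."""
    mask = (1 << (2 * k)) - 1
    code, valid = 0, 0
    for i, base in enumerate(seq.upper()):
        if base in CODE:
            code = ((code << 2) | CODE[base]) & mask
            valid += 1
        else:
            code, valid = 0, 0  # ambiguous base (e.g. N): restart the window
        if valid >= k:
            yield i - k + 1, code


def build_table(consensus, k=K):
    """Map k-mer code -> list of (consensus index, position)."""
    table = defaultdict(list)  # sparse stand-in for the 16M-slot table
    for idx, seq in enumerate(consensus):
        for pos, code in kmer_codes(seq, k):
            table[code].append((idx, pos))
    return table


def classify(seq, table, k=K):
    """Return (best consensus index, most common offset), or None if no hit."""
    per_consensus = Counter()
    per_offset = Counter()
    for qpos, code in kmer_codes(seq, k):
        for idx, pos in table.get(code, ()):
            per_consensus[idx] += 1
            per_offset[(idx, pos - qpos)] += 1
    if not per_consensus:
        return None
    best = per_consensus.most_common(1)[0][0]
    offsets = [off for (idx, off) in per_offset if idx == best]
    return best, max(offsets, key=lambda off: per_offset[(best, off)])
```

A test sequence is assigned to the consensus sequence that shares the most 12-subsequences with it, and the most common position difference gives a gapless placement against that consensus. Sequences with no hits, or with votes split across several offsets, would still need a conventional aligner such as MAFFT.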
