I'll Play! Word Search

So, Carl Zimmer asks the question (from his reader):

What actually is the longest word (in any language) encoded by the reference human genome? If I had the time and computer power I’d have a look…
Guesstimate – it’ll be somewhere in the 4-5 letter range, depending on letter frequency in the target language.

Well, 6-letter words are actually fairly easy to find in the database of all known proteins. When I was doing my graduate research, I had to do it the old-fashioned way… I read gels. To pass the time when analyzing, I’d see if I could find words in the translated amino acid sequences. I found a few 6- and 7-letter words… and if my memory doesn’t fail me, an 8-letter word. But I don’t remember what it was! Doing a quick BLAST on two words, I found SEARCH (in Plesiocystis pacifica, a bacterium) and CHANGE (in Danio rerio, zebrafish). I’m sure there are many others.

Wanna play? (Who’s done this before?) The question above is of course about the human genome, but we could do subcategories: all genomes, human only, and so on :). I suspect you could parse a simple dictionary into FASTA format and BLAST it against the genome :).
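Here is a rough sketch of that dictionary-to-FASTA idea in Python. The function name `words_to_fasta` and the `min_length` cutoff are my own inventions for illustration, and I’m assuming we only want words spelled entirely with the 20 standard one-letter amino acid codes (so anything containing B, J, O, U, X or Z gets dropped).

```python
# The 20 standard amino acid one-letter codes. B, J, O, U, X and Z are
# left out on purpose (assumption: we only want unambiguous residues).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def words_to_fasta(words, min_length=6):
    """Return FASTA text with one record per word spellable in amino acids.

    `min_length` is a hypothetical cutoff; 6 matches the word lengths
    discussed in the post.
    """
    records = []
    seen = set()
    for word in words:
        candidate = word.strip().upper()
        # Skip blanks, short words and duplicates.
        if not candidate or len(candidate) < min_length or candidate in seen:
            continue
        # Keep the word only if every letter is a valid residue code.
        if set(candidate) <= AMINO_ACIDS:
            seen.add(candidate)
            # Use the word itself as the FASTA header.
            records.append(f">{candidate}\n{candidate}")
    return "\n".join(records) + ("\n" if records else "")


if __name__ == "__main__":
    sample = ["search", "change", "Calvin", "genome", "jurassic", "word"]
    print(words_to_fasta(sample), end="")
```

With the sample list, only SEARCH, CHANGE and CALVIN survive: “genome” contains an O, “jurassic” contains J and U, and “word” is too short. You would feed it a real word list (one word per line) and save the result as the query file. Queries this short will probably need relaxed search settings to return any hits at all.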

Update: I’ve really got to get back to work. I looked for my full name and I’ve found every part, “Warren”, “Calvin” and “Lathe” (and III is everywhere), and NO, the fact that the first is in Neisseria gonorrhoeae and the last is in Salmonella enterica is of NO significance whatsoever. Hey, “Calvin” is in platypus, so… oh, never mind.

3 thoughts on “I'll Play! Word Search”

  1. Mary

    Have you heard about Mark Boguski’s contribution to Michael Crichton’s The Lost World? First, you need to know the background–how Mark ran the sequence he found in Jurassic Park with BLAST to see what the sequence was. You can find that story here (pdf): http://www.markboguski.net/publications_PDFs/BioTechniques%201992.pdf

    The PDF link comes from this page: http://www.markboguski.net/publications.htm. The citation is: Boguski, M. S. (1992). “A molecular biologist visits Jurassic Park.” Biotechniques 12(5): 668-669.

    Then you take the new sequence from The Lost World that Mark provided. You can get it here: http://www.inf.fu-berlin.de/lehre/WS05/aldabi/aufgabe5_12.html Be sure to take the second one–the one from Lost World.

    You run it with blastx at NCBI. Run against the nr (non-redundant) database. Find the top hit. Read the gaps.

    P.S. Mark was at the NIH at the time.

