I'll Play! Word Search

1 February, 2008 (13:13) | General Science | By: Trey

So, Carl Zimmer asks the question (from his reader):

What actually is the longest word (in any language) encoded by the reference human genome? If I had the time and computer power I’d have a look…
Guesstimate – it’ll be somewhere in the 4-5 letter range, depending on letter frequency in the target language.

Well, 6-letter words are actually fairly easy to find in the database of all known proteins. When I was doing my graduate research, I had to do it the old-fashioned way… I read gels. To pass the time while analyzing them, I’d see if I could find words in the translated amino acid sequences. I found a few 6- and 7-letter words… and, if my memory doesn’t fail me, an 8-letter word. But I don’t remember what it was! Doing a quick BLAST on two words, I found SEARCH (in Plesiocystis pacifica, a bacterium) and CHANGE (in Danio rerio, zebrafish). I’m sure there are many others.
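For anyone who would rather not read gels, the by-eye word hunt described above can be sketched in a few lines of Python. This is only a rough illustration, not what was actually done back then: the function names, the toy sequence, and the word list are all made up for the example. The one hard fact it relies on is that only 20 letters are standard amino acid codes, so words containing B, J, O, U, X, or Z are skipped.

```python
# The 20 standard one-letter amino acid codes. B, J, O, U, X and Z are
# left out, so any word containing them cannot be "spelled" by a protein.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_spellable(word):
    """Return True if every letter of the word is a standard amino acid code."""
    w = word.upper()
    return bool(w) and set(w) <= AMINO_ACIDS


def find_words(protein, words, min_len=4):
    """Return (word, position) pairs for every occurrence of each word.

    Results are sorted longest word first, then by position in the sequence.
    """
    protein = protein.upper()
    hits = []
    for word in words:
        w = word.upper()
        if len(w) < min_len or not is_spellable(w):
            continue
        start = protein.find(w)
        while start != -1:
            hits.append((w, start))
            start = protein.find(w, start + 1)
    return sorted(hits, key=lambda h: (-len(h[0]), h[1]))


# Toy sequence invented for this example; it is not a real protein.
toy_protein = "MKTSEARCHLLVCHANGEA"
toy_words = ["search", "change", "hang", "zebra", "arch"]

for word, pos in find_words(toy_protein, toy_words):
    print(word, "at position", pos)
```

A plain substring scan like this is fine for one sequence and a short word list; for a whole protein database, BLAST (or an indexed search) is the more sensible tool.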

Wanna play? (Who’s done this before?) The question above is of course about the human genome, but we could do subcategories… all genomes, human only… :) . I suspect you could parse a simple dictionary into FASTA format and BLAST it against the genome :) .
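Here is a minimal sketch of the dictionary-to-FASTA idea, assuming a plain word list with one word per line. The function name `words_to_fasta` and the 6-letter cutoff are my own choices for illustration. Words that use letters outside the 20 standard amino acid codes are dropped, since they could never match a translated sequence.

```python
# The 20 standard one-letter amino acid codes (no B, J, O, U, X or Z).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def words_to_fasta(words, min_len=6):
    """Turn an iterable of words into FASTA text, one record per word.

    Each word becomes both the record header and the "protein" sequence.
    Words that are too short, duplicated, or not spellable with standard
    amino acid letters are skipped.
    """
    records = []
    seen = set()
    for word in words:
        w = word.strip().upper()
        if len(w) < min_len or w in seen or not set(w) <= AMINO_ACIDS:
            continue
        seen.add(w)
        records.append(f">{w}\n{w}")
    return "\n".join(records) + ("\n" if records else "")


# Small inline word list for the example; a real run would read a
# dictionary file instead (the path would depend on your system).
print(words_to_fasta(["search", "change", "genome", "Search", "cat"]), end="")
```

The resulting file could then be used as the protein query set for a search against translated nucleotide sequence (tblastn is the BLAST flavor for that). Very short queries tend to score poorly under default settings, so expect to loosen the significance threshold before any hits show up.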

Update: I’ve really got to get back to work. I looked for my full name and I’ve found every part, “Warren,” “Calvin,” “Lathe” (and III is everywhere), and NO, the fact that the first is in Neisseria gonorrhoeae and the last is in Salmonella enterica is of NO significance whatsoever. Hey, “Calvin” is in platypus, so… oh, never mind.

Comments

Comment from Mary
Time February 1, 2008 at 9:06 PM

Have you heard about Mark Boguski’s contribution to Michael Crichton’s The Lost World? First, you need to know the background–how Mark ran the sequence he found in Jurassic Park with BLAST to see what the sequence was. You can find that story here (pdf): http://www.markboguski.net/publications_PDFs/BioTechniques%201992.pdf

PDF link comes from this page: http://www.markboguski.net/publications.htm Boguski, M. S. (1992). “A molecular biologist visits Jurassic Park.” Biotechniques 12(5): 668-669.

Then you take the new sequence from The Lost World that Mark provided. You can get it here: http://www.inf.fu-berlin.de/lehre/WS05/aldabi/aufgabe5_12.html Be sure to take the second one–the one from Lost World.

You run it with blastx at NCBI. Run against the nr (non-redundant) database. Find the top hit. Read the gaps.

P.S. Mark was at the NIH at the time.

Pingback from Fun with word searching genomes | The OpenHelix Blog
Time April 16, 2008 at 3:12 PM

[...] for non infectious diseases” and “Protein word search,” apparently due to this earlier post about searching AA sequences for real words. So we thought we’d run with it :) . Using this [...]
