So, Carl Zimmer asks this question (from one of his readers):
What actually is the longest word (in any language) encoded by the reference human genome? If I had the time and computer power I’d have a look…
Guesstimate – it’ll be somewhere in the 4-5 letter range, depending on letter frequency in the target language.
Well, 6-letter words are actually fairly easy to find in the database of all known proteins. When I was doing my graduate research, I had to do it the old-fashioned way… I read gels. To pass the time while analyzing, I’d see if I could find words in the translated amino acid sequences. I found a few 6- and 7-letter words… and if my memory doesn’t fail me, an 8-letter word. But I don’t remember what it was! Doing a quick BLAST on two words, I found SEARCH (in Plesiocystis pacifica, a bacterium) and CHANGE (in Danio rerio, zebrafish). I’m sure there are many others.
Wanna play? (Who’s done this before?) The question above is of course about the human genome, but we could do subcategories… all genomes, human only… :) I suspect you could parse a simple dictionary into FASTA format and BLAST it against the genome :).
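Here is a rough sketch of the dictionary-to-FASTA idea in Python, in case anyone wants to try it. It assumes only the 20 standard one-letter amino acid codes count, so any word containing B, J, O, U, X or Z gets dropped. The function names and the `min_len` cutoff are made up for illustration, not part of any existing tool.

```python
# Sketch: turn a word list into FASTA records that a protein search could use.
# Assumption: only the 20 standard amino acid one-letter codes are allowed.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def encodable(word):
    """Return True if every letter of the word is a standard amino acid code."""
    w = word.upper()
    return len(w) > 0 and set(w) <= STANDARD_AMINO_ACIDS


def words_to_fasta(words, min_len=6):
    """Build FASTA text with one record per unique encodable word.

    min_len is a hypothetical cutoff; shorter words are skipped.
    """
    records = []
    seen = set()
    for word in words:
        w = word.strip().upper()
        if len(w) >= min_len and encodable(w) and w not in seen:
            seen.add(w)
            # The word is used as both the record header and the sequence.
            records.append(f">{w}\n{w}")
    return "\n".join(records) + ("\n" if records else "")


if __name__ == "__main__":
    sample = ["search", "change", "genome", "warren", "calvin", "platypus"]
    # "genome" (O) and "platypus" (U) are filtered out.
    print(words_to_fasta(sample), end="")
```

The resulting file could then go in as the query for a protein-level search. Against raw genomic DNA that would have to be a translated search (tblastn-style) rather than a plain protein-vs-protein one, and queries this short would probably need the search settings loosened before anything shows up.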
Update: I’ve really got to get back to work. I looked for my full name and I’ve found every part, “Warren”, “Calvin” and “Lathe” (and III is everywhere), and NO, the fact that the first is in Neisseria gonorrhoeae and the last is in Salmonella enterica is of NO significance whatsoever. Hey, “Calvin” is in platypus, so… oh, never mind.