“We BLATted the Internet!”

Best sentence I’ve seen today. Heh.

I’d be interested in the answer to Laura’s question too!

Here’s more detail from the “announcement” mailing list:

All the DNA on the internet now at your fingertips!

Hello everyone!

We’re pleased to announce the release of the Web Sequences track on the UCSC Genome Browser. This track, produced in collaboration with Microsoft Research, contains the results of a 30-day scan for DNA sequences from over 40 billion different webpages. The sequences were then mapped with Blat to the human genome (hg19) and numerous other species including mouse (mm9), rat (rn4), and zebrafish (danRer7). The data were extracted from a variety of sources including patents, online textbooks, help forums, and any other webpages that contain DNA sequence. In essence, this track displays the Blat alignments of nearly every DNA sequence on the internet! The Web Sequences track description page contains more details on how the track was generated.

We would like to acknowledge Max Haeussler and Matt Speir from the UCSC Genome Browser staff and Bob Davidson from Microsoft Research for their hard work in creating this track.


Matthew Speir
UCSC Genome Bioinformatics Group

If you are looking for the track, it’s in the Phenotype and Literature section in human.

I took a quick look and it’s definitely a mixed bag: patents and homework sites, and journals and such. But I think it will be interesting to see what turns up.

Edit: some other finds–lots of non-English pages, so I can’t tell what they are. I have seen Japanese, Chinese, and Korean so far. Saw a link to Fark.com (heh). Slideshare. Some pages are borked and don’t load. Some require logins (medscape). Could be a good source of PDFs that you can’t get elsewhere (*cough*).
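
For anyone curious what "scanning webpages for DNA sequences" might look like at its very simplest, here is a toy Python sketch. To be clear, this is not how the UCSC and Microsoft Research crawl worked (the track description page has the real details). The function names and the 20-base minimum length are my own assumptions, picked only so that ordinary English words don't get mistaken for sequence.

```python
import re

# Assumed cutoff: shorter runs of A/C/G/T are too likely to be ordinary text.
MIN_LEN = 20


def find_dna_like(text, min_len=MIN_LEN):
    """Return uppercase runs of at least min_len consecutive A/C/G/T letters."""
    pattern = re.compile(r"[ACGT]{%d,}" % min_len, re.IGNORECASE)
    return [m.group(0).upper() for m in pattern.finditer(text)]


def to_fasta(seqs):
    """Format sequences as FASTA records named seq1, seq2, ... (hypothetical names)."""
    return "".join(">seq%d\n%s\n" % (i, s) for i, s in enumerate(seqs, 1))


if __name__ == "__main__":
    page_text = "Forward primer: acgtacgtacgtacgtacgtacgt, see the cat."
    print(to_fasta(find_dna_like(page_text)))
```

The FASTA output is the kind of file you could then hand to an aligner such as Blat. A real crawl would also have to cope with sequences broken up by whitespace, line numbers, and HTML markup, which this sketch ignores.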

3 thoughts on “We BLATted the Internet!”

  1. Neil

    My quick-and-dirty analysis indicates that GAPDH is the most frequent gene symbol in the UCSC database table. Now to write the blog post…

    1. Mary Post author

      Oh–that makes sense. I think that was the winner when I was testing the Publications track for Max way back too. Because it’s so often a control in various experiments….

      I found the Fark link in the BRCA1 region if anyone is interested. It actually made sense–it was a complaint about patents.

  2. Max

    Yes, GAPDH is the winner in both tracks, as it’s the generic RT-PCR control gene.

    As for the pages that don’t load anymore: yes, that’s too bad. The crawl was done a few months ago, and some webpages have disappeared since then. If this track turns out to be popular, I hope we can do more frequent updates.
