I’d be interested in the answer to Laura’s question too!
Here’s more detail from the “announcement” mailing list:
All the DNA on the internet now at your fingertips!
We’re pleased to announce the release of the Web Sequences track on the UCSC Genome Browser. This track, produced in collaboration with Microsoft Research, contains the results of a 30-day scan for DNA sequences from over 40 billion different webpages. The sequences were then mapped with Blat to the human genome (hg19) and numerous other species including mouse (mm9), rat (rn4), and zebrafish (danRer7). The data were extracted from a variety of sources including patents, online textbooks, help forums, and any other webpages that contain DNA sequence. In essence, this track displays the Blat alignments of nearly every DNA sequence on the internet! The Web Sequences track description page contains more details on how the track was generated.
We would like to acknowledge Max Haeussler and Matt Speir from the UCSC Genome Browser staff and Bob Davidson from Microsoft Research for their hard work in creating this track.
UCSC Genome Bioinformatics Group
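The announcement doesn’t say how DNA was recognized in page text, so what follows is only a guess at the general idea, not the method UCSC and Microsoft Research actually used. It's a minimal sketch that assumes a "DNA sequence" is any unbroken run of A/C/G/T letters above some length. The 20-base threshold, the function names, and the `web_seq_N` record names are all made up for illustration. The candidates are written out as FASTA, which is the input format an aligner such as Blat accepts.

```python
import re

# Hypothetical threshold: shorter runs are too likely to be ordinary words.
MIN_LEN = 20

# Unbroken runs of A/C/G/T (either case) at least MIN_LEN characters long.
DNA_RUN = re.compile(r"[ACGTacgt]{%d,}" % MIN_LEN)


def find_dna_candidates(text):
    """Return uppercased DNA-like runs of at least MIN_LEN bases found in text."""
    return [m.group(0).upper() for m in DNA_RUN.finditer(text)]


def to_fasta(seqs):
    """Format sequences as FASTA records named web_seq_1, web_seq_2, ... (made-up names)."""
    return "".join(f">web_seq_{i}\n{s}\n" for i, s in enumerate(seqs, 1))


if __name__ == "__main__":
    page = "Forward primer: acgtacgtacgtacgtacgtacgt, ordered from a catalog."
    print(to_fasta(find_dna_candidates(page)))
```

A real scan would also have to cope with sequences broken up by spaces, line numbers, or HTML markup, which this sketch ignores.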
If you are looking for the track, it’s in the Phenotype and Literature section of the human browser.
I took a quick look and it’s definitely a mixed bag: patents, homework sites, journals, and such. But I think it will be interesting to see what turns up.
Edit: some other finds. There are lots of non-English pages, so I can’t tell what they are; I have seen Japanese, Chinese, and Korean so far. Saw a link to Fark.com (heh), and some to Slideshare. Some pages are borked and don’t load. Some require logins (Medscape). Could be a good source of PDFs that you can’t get elsewhere (*cough*).