Tag Archives: databases

Ensembl Updating

Ensembl has been teasing about an update for a while now, version #51. It’s not out yet :). But today on the Ensembl blog they do have a bit of a preview. It has a new design, some new stuff under the hood to make it run better, a new configuration panel, and some of the tracks will behave differently (more options). They give a screenshot of the new interface; it’s not particularly large, but you get the gist of it at least. No word other than “the web team are working hard to tidy up the loose ends of the release!,” so we assume that will be soon. Looking forward to it.

Treasure Hunts

Thought I’d recommend a little treasure hunt using GenBank. It’s a fun (if you are a biology and database geek like me) project that will hone some skills using GenBank, introduce you to a nice tool called “Blink” and maybe find some interesting anomalies. It’s all outlined (in several blog posts) by Sandra at her blog “Discovering Biology in a Digital World” (great blog btw) in a post entitled “A general method and good student project for finding interesting anomalies in GenBank.” I’m having fun with it; I’ll tell you (and her) if I find something.

Speaking of finding things, here are some interesting things I’ve come across randomly lately.

I came across this interesting book the other day. Isn’t ‘databases’ or ‘genomics’ per se (ok, so not at all really), but it’s a look at what geological footprint humans and human civilization will have on the Earth some 100 million years from now. What some alien traveler might find from this “Anthropocene” period we are in. Read more about it here and listen to the interview.

PLoS Computational Biology has a paper introducing a new database you might want to check out: mouseNet. As stated at the database site, it is a “functional network for laboratory mouse based on integration of diverse genetic and genomic data…to… predict novel functional assignments and network components.”

Database "openness"

We train on publicly available databases and resources. For our purposes in deciding when to develop training, the definition is relatively straightforward: Can the academic researcher access the data without cost or license restriction? If the answer is yes, our next step is to determine whether we can develop training materials based on the resource without cost or license restriction, and to ask the providers specifically for permission to do so. We ask permission for several reasons: to let the developer know what we are doing, to verify the restrictions or lack thereof, to build good relationships, etc.

That first decision, “is it publicly available?”, would seem a relatively clear-cut criterion, but we have found that it isn’t always. There are several problems. Often, the ‘terms of use’ or copyright documentation is difficult to find on the web site, or non-existent. Even when it is available, the terms, language and restrictions can vary quite a bit across databases, countries and at times even within a resource. Determining what “publicly available” means and which resources fit that definition can be less than simple, to say the least.

There is an attempt to offer a definition of “open” using the Creative Commons license.

Another Wiki, WikiPathways

PLoS Biology reports today on WikiPathways. The paper, entitled “WikiPathways: Pathway editing for the people,” announces a new wiki for the ‘public curation’ of pathway data. The authors argue that

 The exponential growth of diverse types of biological data presents the research community with an unprecedented challenge to keep the flood of biological data as accessible, up-to-date, and integrated as possible.

I agree with this. We’ve seen it here and mentioned it many times: the growth of data is exponential and difficult to keep track of. The proposed solution for pathway data, as for other data types and curation efforts that I’ve written about lately, is a wiki: WikiPathways, to be exact. The authors have high hopes for this wiki, as they state:

WikiPathways will be a powerful resource for the research community and a vital forum for pathway curation. And we are hopeful that it will serve as an example for how the continuing flood of biological data can be managed and utilized by the community to irrigate future hypotheses and discoveries.

I’ve already made my “skeptical optimism” about wikis for biological data known in a previous post; having read this latest paper, that would still apply here. But right now I’m not going to write beyond that; I’m just going to point you to this paper and wiki. Later (this week, next at the latest) I’ll be critiquing this paper more fully and looking more generally at the current trend of using wikis for community curation and documentation of biological data and databases.

Pico, A.R., Kelder, T., van Iersel, M.P., Hanspers, K., Conklin, B.R., Evelo, C. (2008). WikiPathways: Pathway Editing for the People. PLoS Biology, 6(7), e184. DOI: 10.1371/journal.pbio.0060184

Sequence Formats

There are a lot of them. FASTA comes to mind. GenBank is another. Clustal, EMBL, GCG, and the list goes on. I’d say FASTA is one of the most commonly used or accepted, but I could be wrong. Still, many databases and software programs have their own format that they accept and generate. Some of these programs and databases will accept several formats or generate files in several formats. It can get a bit confusing. So, you’ve got a sequence file in PAUP but you need it in FASTA? Don’t even know what format it is? Or what they look like or the information that they contain?
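If you don’t even know what format a file is in, the first line is often enough to tell. Here is a rough Python sketch of that idea. The function name `guess_sequence_format` is my own invention for illustration, and the checks cover only a handful of common signatures (the `>` of FASTA, the `LOCUS` line of GenBank, the `ID` line of EMBL, the `CLUSTAL` banner, the `#NEXUS` header that PAUP files use, and the two counts that open a PHYLIP file). It is a first guess, not a validator, and it makes no attempt at GCG.

```python
import re


def guess_sequence_format(text):
    """Guess a sequence file format from its first non-blank line.

    Hypothetical helper for illustration: it checks a few well-known
    header signatures and returns "unknown" for everything else.
    """
    first = None
    for line in text.splitlines():
        if line.strip():
            first = line
            break
    if first is None:
        return "unknown"  # empty or whitespace-only input

    stripped = first.strip()
    if stripped.startswith(">"):
        return "fasta"      # header lines begin with '>'
    if stripped.startswith("LOCUS"):
        return "genbank"    # flat files open with a LOCUS line
    if re.match(r"ID\s+\S", first):
        return "embl"       # flat files open with an ID line
    if stripped.upper().startswith("CLUSTAL"):
        return "clustal"    # alignment files open with a CLUSTAL banner
    if stripped.upper().startswith("#NEXUS"):
        return "nexus"      # the format PAUP reads and writes
    if re.match(r"\d+\s+\d+(\s|$)", stripped):
        return "phylip"     # first line: number of taxa, number of sites
    return "unknown"


if __name__ == "__main__":
    print(guess_sequence_format(">seq1 example\nACGTACGT\n"))
    print(guess_sequence_format(" 2 8\nseqA      ACGTACGT\nseqB      ACGTACGA\n"))
```

Anything that comes back as "unknown" is a good candidate for comparing by eye against the format examples in the links that follow.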

Here are some links that could help. I have gathered them over time, and more lately as I was working with a phylip file:
Oxford’s CGRB’s examples of sequence formats.

EMBOSS’s explanation of sequence formats.

EBI’s help section on sequence formats.

Here are two programs that will convert one format to another:

Readseq (home URL and downloadable code here)


Hopefully that will get you started in making sense of sequence formats. Have any other help pages or conversion programs to suggest?
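To give a feel for what a conversion involves, here is a minimal Python sketch that turns aligned FASTA into sequential PHYLIP. It is not how Readseq or any other program does it; the function names `parse_fasta` and `fasta_to_phylip` are mine, and I am assuming the strict PHYLIP layout (a header line with the number of sequences and their length, then each name padded or cut to ten characters followed by the sequence).

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the identifier.
            words = line[1:].split()
            name = words[0] if words else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def fasta_to_phylip(text):
    """Convert aligned FASTA text to strict sequential PHYLIP text."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("PHYLIP needs aligned sequences of equal length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Strict PHYLIP names occupy exactly ten characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = ">alpha first sequence\nACGT\nAC\n>beta\nACGTAA\n"
    print(fasta_to_phylip(example), end="")
```

Two things to watch for: cutting names to ten characters can make two different sequences end up with the same name, and unaligned sequences are rejected because PHYLIP expects every sequence to be the same length.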

Quick intro to Viral Bioinformatics Resource Center

I just spent a bit over an hour getting some one-on-one time with Chris Upton at Viral Bioinformatics Resource Center. He was showing me the tools and resources they have (we used the new screensharing iChat feature of Leopard OS 10.5, that alone was worth the cost of upgrading to 10.5 this week, or even dumping your PC and getting a Mac ;-)… but I digress…) and they look quite useful. You can analyze a large number of viral genomes in their database (or upload your own, or a bacterial genome for that matter) in many different ways. Their webpage navigation and look is going to change soon (and I’ll inform you when it does), but the software and tools will remain the same (they are mainly Java programs), so if you want, you can go check them out. I suggest starting with VOCS, which is a sort of advanced search/filter/browser for viral genomes and from which you can access several of the other tools. We’ll be looking at these tools more in depth (I have a couple of tips and a tutorial planned for the near future), but thought I’d point it out to you now. Quite nice.

Future of genome sequencing

We’ve written before about the feel of ‘a genome a day’ around here. RPM at Evolgen points to a paper that he suggests supports his prediction (from last year) that “de novo sequencing of whole eukaryotic genomes may be a thing of the past.” Perhaps he is correct, though we do have quite a large number of de novo sequencing projects for eukaryotic genomes in the pipeline for the moment. He suggests that, as this paper has done, sequencing projects will “use 454 to sequence cDNA libraries.” There is a loss of data in not sequencing the non-transcriptome part of the genome, but as the abstract of the paper he points to says:

We conclude that 454 sequencing, when performed to provide sufficient coverage depth, allows de novo transcriptome assembly and a fast, cost-effective, and reliable method for development of functional genomic tools for nonmodel species. This development narrows the gap between approaches based on model organisms with rich genetic resources vs. species that are most tractable for ecological and evolutionary studies.

There is a lot of interesting discussion in the comments to his post.

Eh, enter your own damn data….

I was looking over the Eurekalert announcements and came across one that I have been percolating about now for some time. It is an effort I fully support and encourage. But I worry about a few aspects of it. The alert is entitled: Controlling a sea of information. The Arabidopsis Information Resource (TAIR) has partnered with the journal Plant Physiology to ensure data from Plant Physiology papers will get into the TAIR database. The longer story is available from the alert and from the associated Editorial. The short story is: there aren’t enough curators to keep up with all the data coming out. This prevents a lot of information from getting into the databases. The TAIR and PlantPhysiol folks have teamed up to create a way for the authors themselves to get this information into TAIR with a simple form.

Allen Brain Atlas, part deux

The Allen Institute for Brain Science is a great institution that was founded just under 5 years ago with $100 million in seed money from billionaire Paul Allen (of Microsoft fame). The Institute is,

… dedicated to performing innovative basic research on the brain and distributing its discoveries to researchers around the world. Through its efforts, the Institute aims to advance a new understanding of brain diseases and disorders.

The results of this research are disseminated through some excellent tools at the Allen Brain Atlas. The research and tools focus on the mouse brain and on determining which genes are expressed in different parts of the brain.

Well, it was recently announced that not only are they planning to extend this map to the mouse spinal cord and to build another atlas of brain development from fetus to adult mouse, they have also launched a project to do a similar atlas of the human brain. This project is expected to take four years.

btw, the “brain explorer” tool is just cool. My expertise isn’t mouse or brain science, but I like roaming around the brain as much as the next guy :).

We’ll keep you up-to-date on the progress :).

Open Access Evolution

Dr. Eisen at UC Davis has started a new blog theme on his “Tree of Life” blog called “Open Evolution” (open access publications, open source programs, etc) and has started with open access journals. He has listed a few open access journals (and there’s a good discussion in the comments about the difference between ‘open access’ and ‘free online access’ journals) and is asking if anyone knows of any others. He hasn’t asked for it yet, but I’ve got some ideas for open source/access phylogeny analysis programs and/or databases. I’ll post a few of those in the coming week or so, but for now here is a link to a list of such programs (some on this list I’m not sure are open source, I’ll cull these later too).