Tag Archives: EBI

Video Tip of the Week: MapMi, automated mapping of microRNA loci

Today’s video tip of the week is on MapMi. This tool is hosted at EBI and was developed by the Enright lab. MapMi is a computational system for mapping miRNAs within and across species. As the abstract of their recent paper says:

 Currently miRBase is their primary repository, providing annotations of primary sequences, precursors and probable genomic loci… However, miRBase focuses on those species for which miRNAs have been directly confirmed. Secondly, specific miRNAs or their loci are sometimes not annotated even in well-covered species. We sought to address this problem by developing a computational system for automated mapping of miRNAs within and across species. Given the sequence of a known miRNA in one species it is relatively straightforward to determine likely loci of that miRNA in other species. Our primary goal is not the discovery of novel miRNAs but the mapping of validated miRNAs in one species to their most likely orthologues in other species.

In today’s tip I walk you through MapMi to search for miRNAs.


Quick link to MapMi: http://www.ebi.ac.uk/enright-srv/MapMi

Related Links:

miRBase (uses miRBase IDs, our tutorial on miRBase here)
Ensembl  (uses Ensembl genomes, our tutorial on Ensembl here)
MapMi uses the next two algorithms for predictions:


Guerra-Assuncao, J., & Enright, A. (2010). MapMi: automated mapping of microRNA loci. BMC Bioinformatics, 11 (1). DOI: 10.1186/1471-2105-11-133

Video Tip of the Week: The New Database of Genomic Variants – DGV2 (edited)

In today’s tip I will briefly introduce you to the beta version of the updated DGV resource. The Database of Genomic Variants, or DGV, was created in 2004 at a time early in the understanding of human structural variation, or SV, which is defined by DGV as genomic variation larger than 50bp. DGV has historically provided public access to SV data in humans who are non-diseased. In the past it both accepted direct data submissions on SV and also provided high quality curation and analysis of the data such that it was appropriate for use in biomedical studies.

We’ve had an introductory tutorial on using DGV for years, and we’ve posted on changes at DGV in the past, so we were quite interested to read in their recent newsletter that there is a newly updated beta version of the DGV resource. The increase in SV data being generated by many large-scale sequencing projects as well as individual labs has made it difficult for the DGV to continue to collect SV data, to provide a stable and comprehensive data archive AND to manually curate it at the level they have in the past. Therefore the DGV team is now partnering with DGVa at EBI and dbVar at NCBI. DGVa and dbVar will accept SV data submissions and will function as public data archives (PDA). According to the publication cited below, DGVa and dbVar will:

 “...provide stable and traceable identifiers and allow for a single point of access for data collections, facilitating download and meta-analysis across studies.”

DGV will no longer accept data submissions, but will instead use accessioned SV data from the archives and focus on providing the scientific community and public at-large with a subset of the data. Again quoting from the paper referenced below:

The main role of DGV going forward will be to curate and visualize selected studies to facilitate interpretation of SV data, including implementing the highest-level quality standards required by the clinical and diagnostic communities.

The original DGV resource is still available while comments are collected on the updated beta site. For more information on the updated DGV, I suggest you check out this documentation from the DGV team: from their FAQ, “What is the data model used for DGV2?”, and from a link in their top navigation area, “DGV Beta User Tutorial”. Be sure to check out the new displays & data that are available and, most importantly, to send your comments & suggestions to the group so that they can design a resource best suited to your needs.

Quick Links:

Original Database of Genomic Variants: http://projects.tcag.ca/variation/

New beta version of the Updated DGV: http://dgvbeta.tcag.ca/dgv/app/home

Introductory OpenHelix on Original DGV: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=88

DGV Beta User Tutorial from DGV: http://dgvbeta.tcag.ca/dgv/docs/20111019-DGV_Beta_User_Tutorial.pdf

Church, D., Lappalainen, I., Sneddon, T., Hinton, J., Maguire, M., Lopez, J., Garner, J., Paschall, J., DiCuccio, M., Yaschenko, E., Scherer, S., Feuk, L., & Flicek, P. (2010). Public data archives for genomic structural variation. Nature Genetics, 42 (10), 813-814. DOI: 10.1038/ng1010-813
(Free access from PubMed Central here)

Edit, March 5, 2012 – I wanted to add a clarification that we received through our contact link. I am pasting it in full, with permission from Margie:

“Hi Jennifer
We at TCAG think you did a great job on your video blog of the New Database of Genomic Variants.
I wanted to make a correction to one of your statements: “The increase in SV data (…) at the level they have in the past.”
We, the DGV team, have built a system that CAN handle the new volumes and types of SV data now being published, and we are able to curate all of these data. The reason we partnered with DGVa and dbVar was primarily to provide stable, “universal” accessions for SV data. We also work with DGVa and dbVar to define standard terminology, data types, and data exchange formats.
I just wanted to make sure it was clear that we are fully capable to handle the SV data being published now. Our reason for partnership was to foster standardized data and open data sharing across systems.
Thanks again for your blog post!
Margie Manker”

(re)Funding Databases II

So, I wrote about defunding resources and briefly mentioned a paper in Database about funding (or ‘re’funding) databases and resources. I’d like to discuss this a bit further. The paper, by Chandras et al., discusses how databases and, to use their term, Biological Resource Centers (BRCs) are to maintain financial viability.

Let me state first that I completely agree with their premise that databases and resources have become imperative. The earlier model of “publication of experimental results and sharing of the related research materials” needs to be extended. As they state:

It is however no longer adequate to share data through traditional modes of publication, and, particularly with high throughput (‘-omics) technologies, sharing of datasets requires submission to public databases as has long been the case with nucleic acid and protein sequence data.

The authors state, factually, that the financial model for most biological databases (we are talking about the thousands that exist) has often been 3-5 years of development funding; once that runs out, the infrastructure needs to be supported by another source. In fact, this has led to the defunding of databases such as TAIR and VBRC (and many others), excellent resources with irreplaceable data and tools, which then must struggle to find funding to cover the considerable costs of infrastructure and continued development.

The demands of scientific research, open, shared data, require a funding model that maintains the publicly available nature of these databases. And thus the problem as they state:

If, for financial reasons, BRCs are unable to perform their tasks under conditions that meet the requirements of scientific research and the demands of industry, scientists will either see valuable information lost or being transferred into strictly commercial environment with at least two consequences: (i) blockade of access to this information and/or high costs and (ii) loss of data and potential for technology transfer for the foreseeable future. In either case the effect on both the scientific and broader community will be detrimental.

Again, I agree.

They discuss several possible solutions for maintaining the viability of publicly available databases, including a private-public dual-tier system in which for-profit users pay an annual fee and academic researchers have free access. They mention UniProt, which underwent a funding crisis over a decade ago, as an example. UniProt (then Swiss-Prot) went back to complete public funding in 2002. Several other databases are still attempting to fund themselves with such a model. BioBase, into which several databases have been folded, is one example, and TransFac is one of its databases. A free, reduced-functionality version is available to academics through gene-regulation.com, and the fuller version is available by subscription at BioBase. The free version allows some data to be shared, as one can see at VISTA or UCSC. I am not privy to the financials of BioBase and other similar models, and I assume that will work for some, but I agree with the authors that many useful databases and resources would be hard-pressed to be maintained this way.

Other possibilities include bringing databases fully under a single public institution’s funding mechanism. The many databases of NCBI and EBI fit this model. In fact, there is even a recent case of a resource being folded into this model at NCBI. Again, this works for some, but not all, useful resources.

Most will have to find varied methods for funding their databases. Considering the importance of doing so, it is imperative that viable models are found. The authors reject advertising out of hand. As they mention, most advertisers will not be drawn to website advertising without a visibility of at least 10,000 visitors per month. There might be some truth to this (and I need to read the reference they cite to back that up).

But the next model they suggest seems to me to have the same drawback. In this model, the database or resource would have a ‘partnership of core competencies.’ An example they cite is MMdb (not to be confused with MMDB). This virtual mutant mouse repository provides direct trial links to Invitrogen from its gene information to the product page. They mention that though 6 companies were approached, only one responded. It would seem that this model has the same issues as directly selling advertising.

They also mention that, at least for their research community of mouse functional genomics, “Institutional Funding” seems the best solution for long-term viability and open access. Unfortunately, until institutions like NIH and EMBL are willing or able to fund these databases, I’m not sure that’s a solution.

As they mention in the paper, the rate of growth of the amounts and types of data that is being generated is exponential. I am not sure that government or institutional funding can financially keep up with housing the infrastructure needed to maintain and further develop these databases so that all the data generated can remain publicly and freely accessible.

Information should be free, but unfortunately it is not without cost. It will be interesting to see how funding of databases and resources evolves in this fast-growing genomics world (and it is imperative that we figure out solutions).

PS: On a personal note, the authors use their resource, EMMA (European Mouse Mutant Archive), as an example in the paper. I like the name since it’s the name of my daughter, but it just goes to prove that names come in waves. We named our daughter thinking few would give their daughters the same name. When even databases share the name, you know that’s not the case.

Chandras, C., Weaver, T., Zouberakis, M., Smedley, D., Schughart, K., Rosenthal, N., Hancock, J., Kollias, G., Schofield, P., & Aidinis, V. (2009). Models for financial sustainability of biological databases and resources. Database, 2009. DOI: 10.1093/database/bap017

Teaching and annotating at the same time

A recent paper (a couple of weeks ago) in PLoS Biology from Hingamp et al. had me intrigued. In the paper, entitled Metagenome Annotation Using a Distributed Grid of Undergraduate Students, the lecturers describe a system for teaching bioinformatics to undergraduates that uses new, unannotated sequences from metagenome projects. As stated in the announcement,

This method asks students to randomly pick and analyze unknown metagenomic DNA fragments from a real research sequence stockpile. The student’s mission, using Internet tools only, is to figure out from which organism the DNA comes from, and what biological function it might have. As well as gaining confidence and proficiency in bioinformatics, students experience the authentic research process of weighing the arguments, establishing prediction reliability, building hypotheses, and maintaining rigorous discourse.

The lecturers have put together a teaching-annotation procedure in a publicly accessible “annotation environment” they call “Annotathon.” This web interface walks the student through the annotation process, following the procedure you see in the figure here. Since you can join and use this interface, I thought I’d give it a test drive.


New and Updated Online Tutorials for ASTD, Entrez Protein and MMDB

Comprehensive tutorials on the ASTD, Entrez Protein, and MMDB databases enable researchers to quickly and effectively use these invaluable variation resources.

Seattle, WA September 24, 2008 - OpenHelix today announced the availability of new tutorial suites on the Alternative Splicing and Transcript Diversity (ASTD) database, Entrez Protein and the Molecular Modeling Database (MMDB). ASTD is a European Bioinformatics Institute (EBI) resource for alternative splice events and transcripts for the human, mouse, and rat systems. Entrez Protein is a comprehensive database of protein information brought to you by the National Center for Biotechnology Information (NCBI). MMDB is another NCBI resource, which contains an extensive collection of three-dimensional protein structures with detailed annotation that can be used to learn about the structure and function of many proteins. Together these three tutorials give the researcher an excellent set of resources to carry their research from transcript to 3D protein structure.

The tutorial suites, available for single purchase or through a low-priced yearly subscription to all OpenHelix tutorials, contain a narrated, self-run, online tutorial, slides with full script, handouts and exercises. With the tutorials, researchers can quickly learn to effectively and efficiently use these resources. These tutorials will teach users:


ASTD

  • to perform Quick and Advanced searches
  • to navigate gene and transcript report pages
  • to predict intron/exon boundaries and likely regulatory protein binding sites
  • to search manually curated data regarding alternate splicing

Entrez Protein

  • to perform basic and advanced searches utilizing the many available tools and options
  • to understand the protein records and exploit the many internal and external links you are provided with
  • to explore some of the resources provided by the NCBI network of databases, such as “My NCBI”


MMDB

  • to search MMDB using both basic and advanced query techniques
  • to understand the detailed results you obtain
  • to visualize and manipulate structures using NCBI’s Cn3D structural viewer
  • to locate and view structurally aligned homologs

To find out more about these and other tutorial suites visit the OpenHelix Tutorial Catalog and OpenHelix or visit the OpenHelix Blog for up-to-date information on genomics.

About OpenHelix
OpenHelix, LLC, provides the genomics knowledge you need when you need it. OpenHelix currently provides online self-run tutorials and on-site training for institutions and companies on the most powerful and popular free, web-based, publicly accessible bioinformatics resources. In addition, OpenHelix is contracted by resource providers to provide comprehensive, long-term training and outreach programs.

Tip of the Week: Gene Expression Data by Condition at ArrayExpress

In today’s tip, I want to show you how to use a great-looking beta tool that I just found at EBI’s ArrayExpress Gene Expression repository (AE). The tool’s name is the ArrayExpress Atlas. You may have retrieved expression data from the ArrayExpress Warehouse, which is a carefully curated collection of expression data. The Warehouse is a wonderful resource, and a great way to obtain expression data sets, but the information retrieved is organized by gene name and sample values. The ArrayExpress Atlas appears to be the next generation of the Warehouse, and it provides gene expression data as a table, with genes corresponding to rows and experimental conditions corresponding to columns. The tool is easy to use, provides easy-to-interpret results, and looks like its capabilities are growing fast. Check out this tip, check out the Atlas blog spot, check out the tool, and send any feedback for improving the tool to AE.
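
If it helps to picture the difference between the two layouts, here is a minimal Python sketch of turning per-sample records into a genes-by-conditions table. It is only an illustration of the table shape described above, not how the Atlas itself is built; the gene names, condition names, values, and the `pivot_by_condition` helper are all made up for the example.

```python
from collections import defaultdict

# Hypothetical records in a "gene name and sample value" layout:
# one (gene, condition, expression value) tuple per measurement.
records = [
    ("GENE_A", "control", 1.0),
    ("GENE_A", "treated", 2.5),
    ("GENE_B", "control", 0.8),
    ("GENE_B", "treated", 0.7),
]


def pivot_by_condition(rows):
    """Return (conditions, table): table maps each gene to a list of
    values, one per condition, in the same order as `conditions`."""
    conditions = sorted({cond for _, cond, _ in rows})
    lookup = defaultdict(dict)
    for gene, cond, value in rows:
        lookup[gene][cond] = value
    # Missing gene/condition combinations are left as None.
    table = {
        gene: [values.get(cond) for cond in conditions]
        for gene, values in sorted(lookup.items())
    }
    return conditions, table


conditions, table = pivot_by_condition(records)

# Print genes as rows and conditions as columns.
print("gene\t" + "\t".join(conditions))
for gene, values in table.items():
    print(gene + "\t" + "\t".join("NA" if v is None else str(v) for v in values))
```

Each row of the printed table is one gene and each column is one condition, which is the arrangement that makes it easy to scan across conditions for a gene of interest.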