A very timely post, related to some of the things I was grousing about in my long software piece.
— Titus Brown (@ctitusbrown) October 7, 2014
This is Table 1 that accompanies the full blog post: Bioinformatics tools extracted from a typical mammalian genome project. See the main post for the details and explanation. The table is too long to keep in the post, but I wanted it to be web-searchable. A copy also resides at FigShare: http://dx.doi.org/10.6084/m9.figshare.1194867
In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.
In the field of bioinformatics, there is a lot of discussion about data and code availability, and reproducibility or replication of research using the resources described in previous work. To explore the scope of the problem, I used the recent publication of the well-documented gibbon genome sequence project as a launching point to assess the tools, repositories, data sources, and other bioinformatics-related items that had been in use in a current project. Details of the named bioinformatics items were extracted from the publication, and location and information about the tools was then explored.
Only a small fraction of the bioinformatics items from the project were denoted in the main body of the paper (~16%). Most of them were found in the supplementary materials. As we’ve noted in the past, neither the data nor the necessary tools are published in the traditional paper structure any more. Among the over 100 bioinformatics items described in the work, availability and usability vary greatly. Some reside on faculty or student web sites, some on project sites, some in code repositories. Some are published in the traditional literature, some are student thesis publications, and some are never published at all; only a web site or a software documentation manual provides the required details. This means that information about how to use the tools is very uneven, and support is often non-existent. Access to different software versions poses an additional challenge, whether for open source tools or commercial products.
New publication and storage strategies, new technological tools, and broad community awareness and support are beginning to change these things for the better, and will certainly help going forward. Strategies for consistently referencing tools, versions, and information about them would be extremely beneficial. The bioinformatics community may also want to consider the need to manage some of the historical, foundational pieces that are important for this field, some of which may need to be rescued from their current status in order to remain available to the community in the future.
From the Nature website, I obtained a copy of the recently published paper: Gibbon genome and the fast karyotype evolution of small apes (Carbone et al, 2014). From the text of the paper and the supplements, I manually extracted all the references to named database tools, data source sites, file types, programs, utilities, or other computational moving parts that I could identify. There may be some misses in this process, for example, names that I didn’t recognize or didn’t connect with some existing tool (or some image generated from a tool, perhaps). References to “in house Perl scripts” or other “custom” scenarios were generally not included unless the code had been made available. Pieces described as being done “in a manner similar to that already described” in some other reference were included, but I did not go upstream to prior papers to extract those details. Software associated with laboratory equipment, such as sequencers (located at various institutions) or PCR machines, was not included. So this likely represents an under-count of the software items in use. I also contacted the research team for a couple of additional things, and quickly received help and guidance. Using typical internet search engines or internal searches at publisher or resource sites, I tried to match the items to sources of software or citations for the items.
What I put in the bucket included specific names of items or objects that would likely be necessary for, and/or unfamiliar to, students or researchers outside the bioinformatics community. Some are related, but distinct. For example, you need to understand what “Gene Ontology” is as a whole, but you also need to know what “GOslim” is: a conceptual difference, and a separate object in my designation system here. Some are sub-components of other tools, but important aspects to understand (GOTERM_BP_FAT at DAVID, or randomBed from BEDTools), and are individual named items in the report, as these might be obscure to non-practitioners. Other bioinformatics professionals might disagree with their assignment to this collection; inclusion or removal of individual items can be revisited in future iterations of the list.
After creating a master list of references to bioinformatics objects or items, the list was checked and culled for duplicates or untraceable aspects. References to “in house Perl scripts” or other “custom” scripts were usually eliminated, unless special reference to a code repository was provided. This resulted in 133 items remaining.
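For anyone wanting to semi-automate this kind of survey, here is a minimal Python sketch of scanning text for a hand-curated vocabulary of tool names. The vocabulary and sample sentence are illustrative only, not the actual extraction process used here, which was manual:

```python
import re

# Hypothetical hand-curated vocabulary of tool names; a real list
# would be built up during close reading, as in this exercise.
VOCAB = ["RepeatMasker", "BEDTools", "ENSEMBL", "GOslim", "Reactome"]

def find_tool_mentions(text, vocab=VOCAB):
    """Return {tool: count} for whole-word, case-sensitive matches."""
    counts = {}
    for name in vocab:
        hits = re.findall(r"\b" + re.escape(name) + r"\b", text)
        if hits:
            counts[name] = len(hits)
    return counts

sample = "Repeats were annotated with RepeatMasker; intersections used BEDTools (randomBed)."
print(find_tool_mentions(sample))  # {'RepeatMasker': 1, 'BEDTools': 1}
```

A dictionary scan like this only finds names you already know about, which is exactly why manual curation was still needed for unfamiliar or ambiguously-named items.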
How are they referenced? Where in the work?
Both the main publication (14 PDF pages) and the first Supplementary Information file (133 PDF pages) provided the names of bioinformatics objects in use for this project. All of the items referenced in the main paper were also referenced in the supplement. The number of named objects in the main paper was 21 of the 133 listed components (~16%). This is consistent with other similar types of consortium or “big data” papers that I’ve explored before: the bulk of the necessary information about software tools, data sources, methods, parameters, and features has been in the extensive supplemental materials.
The items are referenced in various ways. Sometimes they are named in the body of the main text, or the methods. Sometimes they are included as notes. Sometimes tools are mentioned only in figure legends, or only in references. In this case, some details were found in the “Author information” section.
As noted above, most were found in the supplemental information. And in this example, this could be in the text or in tables. This is quite typical of these large project papers, in our experience. Anyone attempting to text-mine publications for this type of information should be aware of this variety of locations for this information.
Which bioinformatics objects are involved in this paper?
Describing bioinformatics tools, resources, databases, files, etc., has always been challenging. These are analogous to the “reagents” that I would have put in my benchwork biology papers years ago. They may matter to the outcome, just as enzyme vendors, mouse strain versions, or antibody species details do. They constitute things you would need to reproduce or extend the work, or to appropriately understand the context. But in the case of bioinformatics, this can mean file formats, such as FASTQ or the axt format from UCSC Genome Browser. It can mean repository resources like the SRA. It can mean various versioned downloaded data sets from ENSEMBL (versions 67, 69, 70, and 73 here, counted only once as ENSEMBL). It might be a reference to Reactome in a table.
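As an illustration of why file formats belong on the “reagents” list, here is a minimal sketch of reading FASTQ, assuming the common four-line-per-record layout (real-world files can be messier, with wrapped sequence lines):

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ-formatted lines.
    Assumes the simple four-line-per-record layout."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                 # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

records = list(read_fastq([
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "FFFF",
]))
print(records)  # [('read1', 'ACGT', 'IIII'), ('read2', 'TTGA', 'FFFF')]
```

Knowing what each of the four lines means is exactly the kind of background knowledge a newcomer needs before any of the downstream tools make sense.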
With this broad definition in mind, Table 1 provides the list of named bioinformatics objects extracted from this project. The name or nickname or designation, the site at which it can be found (if available), and a publication or some citation is included when possible. Finally, a column designates whether it was found in the main paper as well.
What is not indicated is that some items are referenced multiple times, in different contexts and usages, which might cause people to underestimate how frequently they are used. For example, RepeatMasker was, ironically, referenced so many times that at one point I stopped marking it up.
Table 1. Software tools, objects, formats, files, and resources extracted from a typical mammalian genome sequencing project. See the web version supplement to this blog post: http://blog.openhelix.eu/?p=20002, or access at FigShare: http://dx.doi.org/10.6084/m9.figshare.1194867
What can we learn about the source or use of these items?
Searches for information about the source code, data sets, file types, repositories, and associated descriptive material reveal a wide range of access situations. Some objects are associated with traditional scientific publications and have valid and current links to software or data (but are also sometimes incorrectly cited). These may be paywalled in certain publications, or described in unavailable meeting papers. Some have no associated publications at all, or are described as submitted or in preparation. Some tools remain unpublished in the literature long after they’ve gone into wide use, and their documentation or manual is cited instead. Some reside on faculty research pages; some are student dissertations. Some tools are found on project-specific pages. Some exist in code repositories—sometimes deprecated ones that may disappear. A number of them have moved from their initial publication locations, without forwarding addresses. Some are allusions to procedures in other publications. Some are like time travel right back to the 1990s, with pages that appear to be original for the time. Some may be at risk of disappearing completely the next time an update to a university web site changes site access.
Other tools include commercial packages that may have unknown details, versions, or questionable sustainability and future access.
When details of data processing or software implementations are provided, the amount varies. Sometimes parameters are included, sometimes not.
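A simple way to take a snapshot of link availability like this is to probe each URL from the master list and bin the result. This is a rough sketch; the status categories and helper names are my own invention, and a real survey should respect sites’ rate limits:

```python
import urllib.request
import urllib.error

def classify(code):
    """Bin an HTTP status code into a coarse availability category."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "moved"
    return "client-error" if code < 500 else "server-error"

def check_link(url, timeout=10):
    """Probe a tool's home page and return its coarse status.
    Note: urlopen follows redirects automatically, so 'moved' mostly
    surfaces when classifying raw status codes directly."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.getcode())
    except urllib.error.HTTPError as err:
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"

print(classify(404))  # client-error
```

Even a crude checker like this, re-run yearly, would quantify the link rot described above.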
Missing tool I wanted to have
One of my favorite data representations in the project results was Figure 2 in the main paper, Oxford grids of the species comparisons organized in a phylogenetic tree structure. This conveyed an enormous amount of information in a small area very effectively. I had hoped that this was an existing tool somewhere, but upon writing to the team I found it’s an R script by one of the authors, with a subsequent tree arrangement in the graphics program “Illustrator” by another collaborator. I really liked this, though, and hope it becomes available more broadly.
The most fun citation I came across was the page for PHYLIP, and the FAQ and credits were remarkable. Despite the fact that there is no traditional publication available to me, a lengthy “credits” page offers some interesting insights about the project. The “No thanks to” portion was actually a fascinating look at the tribulations of getting funding to support software development and maintenance. The part about “outreach” was particularly amusing to us:
“Does all this “outreach” stuff mean I have to devote time to giving workshops to mystified culinary arts students? These grants are for development of advanced methods, and briefing “the public or non-university educators” about those methods would seem to be a waste of time — though I do spend some effort on fighting creationists and Intelligent Design advocates, but I don’t bring up these methods in doing so.”
Even the idea of “outreach” and support for use of the tools is certainly unclear to the tool providers, apparently. Training? Yeah, not in any formal way.
The gibbon genome sequencing project provided an important and well-documented example of a typical project in this arena. In my experience, this was a more detailed collection and description than many other projects I’ve explored, and some tools that were new and interesting to me were provided. Clearly an enormous number and range of bioinformatics items, tools, repositories, and concepts are required for the scope of a genome sequencing project. Tracing the provenance of them, though, is uneven and challenging, and this is not unique to this project—it’s a problem across the field. Current access to bioinformatics objects is also uneven, and future access may be even more of a hurdle as aging project pages may disappear or become unusable. This project has provided an interesting snapshot of the state of play, and a good overview of the scope of awareness, skills, resources, and knowledge that researchers, support staff, or students would need to accomplish projects of similar scope.
It used to be simpler. We used to use the small number of tools on the VAX, uphill, in the snow, both ways, of course. When I was a grad student, one day in the back of the lab in the early 1990s, my colleague Trey and I were poking around at something we’d just heard about—the World Wide Web. We had one of those little funny Macs with the teeny screens, and we found people were making texty web pages with banal fonts and odd colors, and talking about their research.
Although we had both been using a variety of installed programs or command lines for sequence reading and alignment, manipulation, plasmid maps, literature searching and storage, image processing, phylogenies, and so on—we knew that this web thing was going to break the topic wide open.
Not long after, I was spending more and more time in the back room of the lab, pulling out sequences from this NCBI place (see a mid-1990s interface here), and looking for novel splice variants. I found them. Just by typing—no radioactivity and gels required by me! How cool was that? We relied on Pedro’s List to locate more useful tools (archive of Pedro’s Molecular Biology Search and Analysis Tools).
Both of us then went off into postdocs and jobs that were heavily into biological software and/or database development. We’ve had a front seat to the changes over this period, and it’s been really amazing to watch. And it’s been great for us—we developed our interests into a company that helps people use these tools more effectively, and it has been really rewarding.
At OpenHelix, we are always trying to keep an eye on what tools people are using. We regularly trawl through the long, long, long supplementary materials from the “big data” sorts of projects, using a gill net to extract the software tools that are in use in the community. What databases and sites are people relying on? What are the foundational things everyone needs? What are the cutting-edge things to keep a lookout for? What file formats or terms would people need to connect with a resource?
But as I began to do it, I thought: maybe I should use this as a launching point to discuss some of the issues of software tools and data in genomics. If you were new to the field and had to figure out how a project like this goes, or what knowledge, skills, and tools you’d need, could you establish some idea of where to aim? So I used this paper to analyze the state of play: what bioinformatics sites/tools/formats/objects/items are included in a work of this scope? Can you locate them? Where are the barriers or hazards? Could you learn to use them and replicate the work, or drive forward from here?
It was illuminating to me to actually assemble it all in one place. It took quite a bit of time to track the tools down and locate information about them. But it seemed to be a snapshot worth taking. And I hope it highlights some of the needs in the field, before some of the key pieces become lost to the vagaries of time and technology. And also I hope the awareness encourages good behavior in the future. Things seem to be getting better—community pressure to publish data sets and code in supported repositories has increased. We could use some standardized citation strategies for the tools, sources, and parameters. The US NIH getting serious about managing “big data” and ensuring that it can be used properly has been met with great enthusiasm. But there are still some hills left to climb before we’re on top of this.
Carbone L., Harris R.A., Gnerre S., Veeramah K.R., Lorente-Galdos B., Huddleston J., Meyer T.J., Herrero J., Roos C., Aken B., Anaclerio F. et al. (2014). Gibbon genome and the fast karyotype evolution of small apes, Nature, 513 (7517) 195-201. DOI: http://dx.doi.org/10.1038/nature13679
FigShare version of this post: http://dx.doi.org/10.6084/m9.figshare.1194879
Drama surrounding the $1000 genome erupts every so often, and earlier this year when the HiSeq X Ten setup was unveiled there was a lot of chatter–and questions: Is the $1,000 genome for real? And some push-back on the cost analysis: That “$1000 genome” is going to cost you $72M. A piece that offers a nice framework for the field of play is here: Welcome to the $1,000 genome: Mick Watson on Illumina and next-gen sequencing. Aside from the media flurry, though, what matters is the data. And not many people have had access to the data yet.
Via Gholson Lyon, I heard about access to some:
— Gholson Lyon (@GholsonLyon) July 30, 2014
The Garvan Institute of Medical Research, DNAnexus and AllSeq have teamed up to offer the genomics community open access to the first publicly available test data sets generated using Illumina’s HiSeq X Ten, an extremely powerful sequencing platform. Our goal is to provide sample data that will allow you to gain a deeper understanding of what this technological advancement means for your work today and in the future.
My focus won’t be this data itself–but if you are interested in many of the technical aspects of this system and their process, have a listen to this informative presentation by Warren Kaplan from Garvan:
The sample data is derived from a cell line, the GM12878 cells. These cells are from the Coriell Repository here: Catalog ID: GM12878. Conveniently, this is one of the Tier 1 cell lines from the ENCODE project too, so there is other public data out there on this cell line–which I have explored in the past and knew some things about.
There are 2 different data sets of the sequence in the download files, and one of them is available to view in the browser. I’m sure the Genoscenti will be all over the downloadable files. But because I’m always interested in new visualizations, I wanted to explore the genome browser they made available. Although I had heard of Biodalliance before, we hadn’t highlighted it as a tip, so I thought that would be interesting to explore. Biodalliance is a flexible, embeddable, extensible system that’s worth a look on its own, besides delivering this test data. And if you come by at a later date and the X Ten data is no longer available, go over to their site for nice sample data sets. Their “getting started” page has a nice intro to the features.
In the video, I’ll just take a quick test drive around some of the visualization features with the X-Ten GM12878 data. I’ll look at a couple of sample regions, one with the SOD1 gene just to illustrate the search and the tracks. And I’ll look at a region that I knew from the previous ENCODE CNV data had a homozygous deletion to see how that looked in this data set. (If you want to look for deletions later, search for the genes OR2T10 or UGT2B17).
Note: the data is time-sensitive–apparently it’s only available until September 30 2014. So get it while it’s hot, or browse around now.
Test data site: http://allseq.com/x-ten-test-data
Biodalliance browser software details: http://www.biodalliance.org/
Down T.A. & T. J. P. Hubbard (2011). Dalliance: interactive genome viewing on the web, Bioinformatics, 27 (6) 889-890. DOI: http://dx.doi.org/10.1093/bioinformatics/btr020
Check Hayden E. (2014). Is the $1,000 genome for real?, Nature, DOI: http://dx.doi.org/10.1038/nature.2014.14530
Dunham I., Aldred S.F., Collins P.J., Davis C.A., Doyle F., Epstein C.B., Frietze S., Harrow J., Kaul R., Khatun J. et al. (2012). An integrated encyclopedia of DNA elements in the human genome, Nature, 489 (7414) 57-74. DOI: http://dx.doi.org/10.1038/nature11247
Garvan NA12878 HiSeqX datasets by The Garvan Institute of Medical Research, DNAnexus and AllSeq is licensed under a Creative Commons Attribution 4.0 International License
There’s a workshop running today and tomorrow, called:
July 28-29, 2014
It’s live streaming here: http://www.genome.gov/GenomeTVLive/
I’m sure the recordings will be available later, though, if you come across this at a later date.
Edit after the sessions were done: I really enjoyed this. Having all these wicked smaht folks discussing ways to get to the future was really useful. I’ll post an additional note when I see the videos are up.
— OpenHelix Staff (@OpenHelix) July 29, 2014
Yeah, I know you know. There’s a lot of genomics and proteomics data coming out every day–some of it via the traditional publication route, but some of it isn’t–and it’s only getting harder to wrangle the useful information and separate the signal from the noise. I can remember when merely looking through the (er, paper-based) table of contents of Cell and Nature would get me up to speed for a week. But increasingly, the data I need isn’t even coming through the papers.
Like everyone else, I have a variety of strategies to keep notified of different things I need to see. I use the MyNCBI stored searches to keep me posted on things that come via the NCBI system. I signed up for OMIM’s new “MIM-Match” service as well. But there’s still a lot of room for new ways to collect and filter new data and information. Today’s tip focuses on a service to do that: Nowomics. This is a freely available tool to help you keep track of important new data. Here’s a quick video overview of how to see what’s going on with Nowomics.
The goal of Nowomics is to offer you an actively updated feed of relevant information on genes or topics of interest, using text mining and ontology term harvesting from a range of sources. What makes them different from MyNCBI or OMIM is the range and types of data sources they use. The user sets up some genes or Gene Ontology terms to “follow”, and the software regularly checks for changes in the source sites. You can go in and look at your feed, you can filter it for different types of data, and you can see what’s new (“latest”) or what’s being hotly chattered about (“popular”) using Altmetric strategies. For example, here’s a paper that people seemed to find worth talking about, based on the tweets and the Mendeley occurrences.
This tool is in early stages of development–if there are features you’d like to see or other sources you’d think are useful, the Nowomics team is eager for feedback. You can find a link to contact them over at their site, or locate them on Facebook and Twitter. You can also learn more from their blog. You can also learn more about the philosophy and foundations of Nowomics from their slide presentation below.
Example gene feed: http://nowomics.com/gene/human/BRCA2
Acland A., Barrett T., Beck J., Benson D.A., Bollin C., Bolton E., Bryant S.H., Canese K., Church D.M., Clark K. et al. (2014). Database resources of the National Center for Biotechnology Information, Nucleic Acids Research, 42 (D1) D7-D17. DOI: http://dx.doi.org/10.1093/nar/gkt1146
Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD), July 22 2014. World Wide Web URL: http://omim.org/
Pathway databases, like Reactome, are uniquely suited for interpreting the results of high-throughput functional genomics data sets such as microarray-based expression profiles, protein interaction sets, and chromatin IP. In response to user feedback and new feature requests, we have released a new Reactome Pathway Browser with an integrated suite of tools for pathway analysis. Using these improved features, you can map protein lists to Reactome pathways, perform pathway overrepresentation analysis for a set of genes, colourize pathway diagrams with gene expression data, and compare model organism and human pathways. To support third-party tool integration, the Reactome Pathway Analysis Portal is also available via RESTful web services. Further details about the new pathway analysis tool can be found in our User Guide.…[see more details and contact info at the mailing list page]
Mapping gene and protein lists to pathways is a frequently-requested feature in pretty much every workshop we give–so have a look and see if it would help you to manage lists and do some discovery on them.
Quick link: http://www.reactome.org/PathwayBrowser/
This week’s video tip shows you a new way to look at the multiWig track data at the UCSC Genome Browser. A new option, a “stacked” view, was recently released (see 06 May 2014), and it offers a handy new way to look at the data. But I’ll admit it took me a little while of working with it to understand the details. So in this tip I hope you’ll see what the new visualization offers.
I won’t go into the background on the many types of annotation tracks available–if you need to be introduced to the idea of the basic track views, start out with our introduction tutorial that touches on the different types of graphical representations. Custom tracks are touched on in the advanced tutorial. For guidance specifically on how to create the different track types, see the UCSC documentation. The type of track I’m illustrating in the video today, a MultiWig track, has its own section over there too. Basically, if you are completely new to this, the “wiggle” style is a way to show a histogram display across a region. MultiWig lets you overlay several of these histograms in one space. In the example I’ll show here, the results of looking at 7 different cell lines are shown for some histone mark signals (Layered H3K27Ac track).
When I saw the announcement, I thought this was a good way to show all of the data simultaneously. When we do basic workshops, we don’t always have time to go into the details of this view, although we do explore it in the ENCODE material, because the track I’m using is one of the ENCODE data sets. I’ll use the same track in the same region as the announcement, which is shown here:
But when I first looked at this, I wasn’t sure if the peak–focus on the pink peak that represents the NHLF cell line–was meant to cover the whole area underneath or not. What I was trying to figure out is essentially this (a graphical representation of my thought process follows):
By trying out the various styles I was pretty sure I had the idea of what was really being shown, but I confirmed that with one of the track developers. The value is only the pink band segment, not the whole area below it. And Matthew also noted to me that they are sorting the tracks in reverse alphabetical order (so NHLF is the highest in the stack). That was an aspect I hadn’t realized yet. They are not sorting based on the values at that spot. This makes sense, of course, but it wasn’t obvious to me at first.
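The stacked behavior can be made concrete with a tiny numeric sketch (the cell-line values here are invented for illustration): each band occupies only its own value’s worth of height, drawn on top of the cumulative sum of the bands beneath it, with the stack ordered reverse-alphabetically as the track is:

```python
# Hypothetical signal values for three cell lines at one position.
values = {"HUVEC": 4.0, "K562": 2.5, "NHLF": 7.0}

# Reverse alphabetical order, as the UCSC track sorts, so NHLF
# ends up at the top of the stack.
order = sorted(values, reverse=True)   # ['NHLF', 'K562', 'HUVEC']

bottom = 0.0
bands = []
for name in reversed(order):  # draw from the bottom of the stack up
    bands.append((name, bottom, bottom + values[name]))
    bottom += values[name]

# Each band's height is its own value only: the NHLF band spans
# (6.5, 13.5), not (0, 13.5); the area beneath belongs to the others.
print(bands)
```

So the pink NHLF segment in the announcement figure represents only NHLF’s value, exactly as the track developer confirmed, even though it sits at the top of the stacked area.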
I like this option very much–but I figured if I had to do some noodling on what it actually meant others might have the same questions.
In the video I’ll show you how this segment looks with the different “Overlay method” settings on that track page. I’ll be looking at the SOD1 area, like the announcement example. I tweaked a couple of the other settings from the defaults so it would be easier to see on the video (see arrowheads for my changes). But I hope this conveys the options you have now to look at this type of track data effectively.
So here is the video with the SOD1 5′ region in the center, using the 4 different choices of overlay method, illustrating the histone mark data in the 7 cell lines. I’m not going into the details of the data here, but I’ll point you to a reference associated with this work for more on how it’s done–see the Bernstein lab paper below. I wanted to just demonstrate this new type of viewing options that will be available on wiggle tracks. Some tracks will have too much data for one type or another, or will be clearer with one or another style. But now you have an additional way to consider it.
UCSC Genome Browser: genome.ucsc.edu
UCSC Intro tutorial: http://openhelix.com/ucscintro
UCSC Advanced tutorial: http://openhelix.com/ucscadv
These tutorials are freely available because UCSC sponsors us to do training and outreach on the UCSC Genome Browser.
Kent W.J., Zweig A.S., Barber G., Hinrichs A.S. & Karolchik D. (2010). BigWig and BigBed: enabling browsing of large distributed datasets., Bioinformatics (Oxford, England), PMID: 20639541
Karolchik D., Barber G.P., Casper J., Clawson H., Cline M.S., Diekhans M., Dreszer T.R., Fujita P.A., Guruvadoo L., Haeussler M. et al. (2013). The UCSC Genome Browser database: 2014 update., Nucleic acids research, PMID: 24270787
Ram O., Goren A., Amit I., Shoresh N., Yosef N., Ernst J., Kellis M., Gymrek M., Issner R., Coyne M. et al. Combinatorial patterning of chromatin regulators uncovered by genome-wide location analysis in human cells., Cell, PMID: 22196736
The ENCODE Project Consortium, Bernstein B.E., Birney E., Dunham I., Green E.D., Gunter C., Snyder M. et al. (2012). An integrated encyclopedia of DNA elements in the human genome., Nature, 489 PMID: 22955616
Also see the Nature special issue on ENCODE data, especially the chromatin accessibility and histone modification subset (section 02): http://www.nature.com/encode/
Just got the news via the mailing list, I haven’t had a chance to kick the tires yet:
We are pleased to announce the release of BioMart version 0.9.
The latest version of BioMart includes support for data analysis and visualisation tools. The first of the BioMart tools has already been implemented and is accessible from www.biomart.org. This tool enables enrichment analysis of genes in all Ensembl species and a broad range of gene identifiers for each species are also available. Furthermore, the tool supports cross-species analysis using Ensembl homology data. Finally, the enrichment tool facilitates analysis of BED files containing genomic features such as Copy Number Variations (CNVs) or Differentially Methylated Regions (DMRs).
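For orientation, BED files like the CNV or DMR inputs mentioned above are simple to read; here is a minimal parser sketch, assuming well-formed tab-delimited lines (BED coordinates are 0-based and half-open):

```python
def parse_bed(lines):
    """Parse minimal BED lines into (chrom, start, end) tuples,
    skipping track/browser header lines and comments."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions

print(parse_bed(["track name=CNVs", "chr1\t1000\t5000", "chrX\t200\t900"]))
```

Knowing this layout helps you sanity-check a file before handing it to an enrichment tool like the one described here.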
The latest BioMart release comes with the new version of the REST and SOAP APIs. These APIs are available for testing at central.biomart.org. Third party developers who are currently using REST or SOAP version 0.7 are encouraged to start testing and transitioning to 0.9. The two servers providing access to BioMart data through REST and SOAP (version 0.7 and version 0.9) will be running in parallel to provide support for easy transition. The Enrichment tool is also accessible programmatically through 0.9 REST/SOAP interface.
Finally, the BioMart website has been completely redesigned to cater for a better user experience. The re-organised layout, incorporation of new functionality, such as the “quick tool access” and the use of subtle animation makes for clearer navigation and greater site interactivity.
Your feedback is welcome and appreciated.
On behalf of the BioMart developers
Check it out: www.biomart.org