Tag Archives: big data

Genomics England is responsible for the 100,000 Genomes Project

Video Tip of the Week: 100,000 Genomes Project

Software tools are certainly the focus of most of our tips of the week. But a key aspect of using software and data repositories is that they rely on quality data. So sometimes we’ll highlight specific projects that will provide data to researchers, and this tip is one of those cases. Researchers should be aware of the data, understand the project’s goals, and know that they may benefit from access to this information in the future. So this week, the 100,000 Genomes Project is our video tip of the week.

Genomics England is the organization that is in charge of this project. From their “about” page, you can learn more about the organization and structure. But I’ll post their specific goals here:

  • to bring benefit to patients
  • to create an ethical and transparent programme based on consent 
  • to enable new scientific discovery and medical insights
  • to kickstart the development of a UK genomics industry

Oh, how I yearn for a national health system that offers this kind of opportunity and security for patients. I hope they can keep it. (Aside over.)

I saw their outreach video the other day, and I thought it was well done. So to get an overview of their efforts, that video is this week’s Tip of the Week:

Another reason I wanted to highlight this project: the deadline for researchers to apply for access is coming up soon.

So if you are in a position to help with this project, consider this a call to action and sign up. The project seems very valuable for getting personalized medicine to patients in specific ways, while also advancing the genomics-to-clinic transition that all of us will need. You can learn more about the need for researcher input on the data in a second video. They have also emphasized training people in this area, which is music to my ears. Young researchers who get involved with this work should benefit for many years as the data accumulates and patients are followed over time.

Quick links:

Genomics England: http://www.genomicsengland.co.uk/

Twitter: https://twitter.com/genomicsengland


Siva, N. (2015). UK gears up to decode 100 000 genomes from NHS patients The Lancet, 385 (9963), 103-104 DOI: 10.1016/S0140-6736(14)62453-3

Caulfield M., Davies J., Dennys M., et al “The 100,000 Genomes Project Protocol” London: Genomics England (2014) Available at: genomicsengland.co.uk/?wpdmdl=5168 (Accessed: Oct 8 2015)

Video Tip of the Week: BioProject, it’s where to start finding data (hint: not the papers so much anymore)

A few months ago, Jennifer did a nice tip on NCBI’s Genome Resources and the changes there. In it she briefly mentioned that the Genome Project resource had moved to a new home, BioProject, about a year ago. Today, I’d like to give you a quick overview of BioProject. It was described in this year’s NAR database issue: “BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata.” From the abstract:

As the volume and complexity of data sets archived at NCBI grow rapidly, so does the need to gather and organize the associated metadata. Although metadata has been collected for some archival databases, previously, there was no centralized approach at NCBI for collecting this information and using it across databases. The BioProject database was recently established to facilitate organization and classification of project data submitted to NCBI, EBI and DDBJ databases.

This is just one step the biological science community must take to cope with the data deluge. If scientists are to get a handle on the projects and data being generated at breakneck speed, the key is knowing what data is being produced and organizing it by project.

As Mary (and the rest of us here at OpenHelix) keep not-so-gently reminding you, the data isn’t in the papers any more. Huge projects like 1000 Genomes and ENCODE, together with falling sequencing costs, produce enough data that simply finding it is difficult.

BioProject grew out of a need to better organize these large projects’ datasets and metadata and replaces NCBI’s Genome Project resource. These projects produce data which is then deposited in several repositories. BioProject “provides an organizational framework to access metadata about research projects and the data from those projects which is deposited, or planned for deposition, into archival databases.”
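Since BioProject is an NCBI database, it can also be queried programmatically through NCBI’s E-utilities. Below is a minimal sketch of building such a query; the search term and result count are illustrative assumptions, not anything specified by the resource itself.

```python
# Sketch: building an E-utilities (esearch) query against the BioProject
# database. Fetching the resulting URL returns XML listing matching
# BioProject record IDs. The search term below is purely illustrative.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def bioproject_search_url(term, retmax=20):
    """Build an esearch URL for the BioProject database."""
    params = {"db": "bioproject", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

url = bioproject_search_url("Homo sapiens[Organism] AND transcriptome")
print(url)
# The XML response can then be fetched with, e.g.,
# urllib.request.urlopen(url).read()
```

From there, the IDs link out to the archival databases (SRA, GEO and so on) where the project’s actual data lives.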

Quick Links:

BioProject Help
BioSample (descriptions of biological source materials used in experimental assays)
ENCODE (sponsored tutorial)
1000 Genomes 


Barrett, T., Clark, K., Gevorgyan, R., Gorelenkov, V., Gribov, E., Karsch-Mizrachi, I., Kimelman, M., Pruitt, K., Resenchuk, S., Tatusova, T., Yaschenko, E., & Ostell, J. (2011). BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata Nucleic Acids Research, 40 (D1) DOI: 10.1093/nar/gkr1163

Video tip of the week: OpenHelix App on SciVerse to Extend Research

We’ve all seen the discussions – on twitter, in journals, lots of places – on how to collect, store, find and use all the data that is and will be generated. Here at OpenHelix we believe that there is a gold mine of bioscience data that is being vastly underutilized, and our goal is to help make that data more accessible to researchers, clinicians, librarians, students and anyone else who is interested in science.

We go at our goal in a variety of ways, including: this blog with its weekly tips, answers and other posts; with our online tutorial materials on over 100 different biological databases and resources; and with our live trainings, many of which are sponsored by resource providers such as the UCSC Genome Browser group.

In today’s tip I will introduce you to another one of our efforts to “extend research” by showing you a glimpse of an OpenHelix app that we designed for the SciVerse platform, which Elsevier has described as an “ecosystem providing workflow solutions to improve scientist productivity and help them in their research process”. This app scans a ScienceDirect journal article for any database names or URLs that we train on, and then displays a list of such resources in the window of the app. A researcher can use this list to go from a research article to our training on how to use the resource, and to the resource itself. We believe this type of integration will help extend research by making it easier to find, access and use data associated with a paper. If you have access to articles through ScienceDirect, and you try out our app, please comment here & let us know what you think, or suggest future enhancements. Also you could consider reviewing it for the app gallery. Thanks!
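For readers curious how that kind of scan might work under the hood, here is a rough sketch of the idea: match known resource names in article text and map each hit to a training link. The catalog of names and the training URLs below are hypothetical placeholders, not the app’s actual implementation.

```python
# Illustrative sketch of the app's core idea: scan article text for known
# resource names and return the matching training links. The catalog and
# URLs here are hypothetical placeholders.
import re

TRAINING_CATALOG = {
    "UCSC Genome Browser": "https://example.org/training/ucsc",
    "BioMart": "https://example.org/training/biomart",
    "Galaxy": "https://example.org/training/galaxy",
}

def find_resources(article_text):
    """Return catalog entries whose names appear in the article text."""
    hits = {}
    for name, training_url in TRAINING_CATALOG.items():
        if re.search(re.escape(name), article_text, re.IGNORECASE):
            hits[name] = training_url
    return hits

text = "We extracted variants with the UCSC Genome Browser and BioMart."
print(find_resources(text))
```

The real app presumably works against our full list of trained resources, but the principle is the same: from a paper, straight to the resource and the training for it.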

Quick links:

SciVerse Hub http://www.hub.sciverse.com

SciVerse Application Gallery http://www.applications.sciverse.com

OpenHelix SciVerse App Description http://bit.ly/xtGcco

Reference shown in Tip (subscription required): Mortensen, H., & Euling, S. (2011). Integrating mechanistic and polymorphism data to characterize human genetic susceptibility for environmental chemical risk assessment in the 21st century Toxicology and Applied Pharmacology DOI: 10.1016/j.taap.2011.01.015

OpenHelix Reference (free from PMC here): Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010). OpenHelix: bioinformatics education outside of a different box Briefings in Bioinformatics, 11 (6), 598-609 DOI: 10.1093/bib/bbq026

SciVerse Reference (subscription required): Bengtson, J. (2011). ScienceDirect Through SciVerse: A New Way To Approach Elsevier Medical Reference Services Quarterly, 30 (1), 42-49 DOI: 10.1080/02763869.2011.541346

Real bioinformaticians write code, real scientists…

Just over a week ago, Neil Saunders wrote a post I agreed with: Real bioinformaticians write code. The post was in response to a tweet conversation that started:

Many #biostar questions begin “I am looking for a resource..”. The answer is often that you need to code a solution using the data you have.

He’s right, and that’s very true for bioinformaticists to whom he’s talking. My concern is for the rest of biological researchers. He states in the post:

In other words: know the data sources, know the right tools and you can always sculpt a solution for your own situation.

This is very true and I wholeheartedly agree. So many solutions already exist in thousands of databases and analysis tools. That’s what we do here at OpenHelix: help experimental biologists, genomics researchers and bioinformaticists find the right data sources and tools, and then go and “sculpt a solution for their situation.”

The last part of my comment read:

BioMart, UCSC Genome Browser, Galaxy, etc, etc are excellent tools and data sources and could probably answer about 80% of most posed questions :). But my caveat would be that knowing the data sources and right tools can be a bit of a daunting task.

And it is, despite the somewhat dismissive response :). We’ve all seen the graphs: exponentially rising amounts of data over time. It’s an issue, as the Chronicle of Higher Education article title states:

Dumped on by Data: Scientists Say a Deluge is Drowning Research

The journal Science also ran an entire ten-article section on the issue. It’s not a problem that will go away.

Along with that deluge of data has come a deluge of databases and data analysis tools (created, for the most part, by bioinformaticists!), and finding the right data and tool within even one of them can be daunting. There are thousands of such databases and tools; I’ve lost count.

Neil Saunders is correct. The solution is out there: find the right tools and data, sculpt a solution. He responded to my comment with “Learning what you need to know in bioinformatics can certainly be daunting. But then, science isn’t for the easily daunted :-).” In other words, “if you are daunted, you aren’t a scientist?”

We give workshops to researchers around the world, from Singapore to the US to Morocco, and at institutions as varied as Harvard, Stanford, the University of Missouri, Mt. Sinai, Stowers and HudsonAlpha. The researchers we’ve trained and answered questions from are just as varied: developmental biologists, evolutionary biologists, medical researchers, bioinformaticists, researchers quite well versed in genomics and those who are not.

The overriding theme: finding and knowing the data and the tools is not only daunting, but sometimes not possible. Not because they don’t exist, but because finding and learning them drains personal and lab resources, given the sheer, growing number of things to find and know. I refer you again to the Chronicle article: drowning in data.

They are real scientists, not easily daunted, but daunted just the same by what’s in front of them. And yes, many of those specific research questions can be answered by existing tools. We come across many questions on Biostar that a well-crafted database search or analysis step would answer beautifully, without reinventing the wheel with more code (yet the answers are often code).

I suspect that most scientists out there who call themselves “bioinformaticists” should have a grasp of the tools and databases available to them (though I can tell you, even the brightest of them sometimes don’t). So, the advice and final words of the linked blog post above…

In other words: know the data sources, know the right tools and you can always sculpt a solution for your own situation…. real bioinformaticists write code

Yes, real bioinformaticists write code, but this advice is insufficient for the other 90% of real scientists, who don’t. Perhaps Biostar is not the solution (I suspect many of the questions he points out are asked by non-bioinformaticists who have only a basic knowledge of coding, if any, and no access to those who do). Perhaps it, or something like it, can be.

Big data specialists…yeah, but…

There is a great discussion on Big Data today that I found on the Twittersphere. Hat tip to Paul Blaser for the tweet that got my attention. I have posted a comment over there, but as I was writing it I decided I wanted to bring it over here as well. (I also added some links here that I couldn’t add over there, since without a preview I hate not being able to test them.)

Deepak has a post up on the blog business|bytes|genes|molecules called The Biological Data Scientist.  It speaks to big data projects, and the need to have specialists in biological data to handle it.

I suspect that we do actually agree on much of the concept. But like a lot of things, I think more about the downstream implementation of the idea on the ground. And my thoughts on that are below, which I posted as a comment over there.


Hmmm…I certainly agree with large chunks of this. But I don’t agree that this should be solely the domain of some kind of data scientist. Or, more specifically, it does need their hands at some point, but I think it still needs to be accessible to the handful-of-genes bench biologists.

The idea of the multi-functional team is terrific, when it is possible. But we see a lot of people who are not getting that kind of support from their local “bioinformatics” club, for a couple of reasons: if you have some big-data folks on site, they have their own projects to worry about. They are not eager to hand-hold others on the way into the data. It’s not their job, it’s not what they are supported to do, and it doesn’t help them with their next grant.

If you have some kind of dedicated bioinformatics core support, the quality of the support varies widely: the kinds of things they do, the skills they have, the interest in actual support.

We have seen some great examples.  For example, it seems to me the team at CHOP in Philly provides this kind of support: in house tools to support the researchers, bringing in the right tools to add more support, training everyone up to some level so they are at least aware of what the tools can do. (Samples of CHOP tools, team, and training.)

On the other hand, we’ve been to some major institutions, many with “big data” projects, where researchers are getting next to zero interaction with anyone who could help them. You’d be stunned if I told you who these people are.

Then there are those who don’t even have a shot at this: people trying to keep up and write new grants with hot new data, on some mid-western campus that just doesn’t have anyone to ask. I once talked to a woman who needed a really simple thing out of the UCSC Genome Browser. It took me roughly five minutes to build the right query, pull the data out of the Table Browser, and hand it to her. I thought she was going to kiss me. She told me she had expected it to take her six months of benchwork.
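For readers in that situation, UCSC also exposes its data programmatically through a public REST API (api.genome.ucsc.edu), so a “five-minute query” can often be scripted. This is a minimal sketch of building such a request; the genome assembly, track name and region below are illustrative assumptions, not the query from the anecdote.

```python
# Sketch: building a UCSC REST API request for track data over a region.
# Fetching the resulting URL returns JSON. The assembly, track and
# coordinates here are illustrative only.
from urllib.parse import urlencode

UCSC_API = "https://api.genome.ucsc.edu/getData/track"

def ucsc_track_url(genome, track, chrom, start=None, end=None):
    """Build a UCSC REST API URL to fetch track data for a region."""
    params = {"genome": genome, "track": track, "chrom": chrom}
    if start is not None and end is not None:
        params.update({"start": start, "end": end})
    return UCSC_API + "?" + urlencode(params)

url = ucsc_track_url("hg38", "knownGene", "chr1", 100000, 200000)
print(url)
# The JSON response can then be fetched with, e.g.,
# urllib.request.urlopen(url).read()
```

A few lines like this instead of six months of benchwork, if you know the tool exists.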

I would hate to see this strategy create a tier of biologists who are nearly locked out of the data. Because it is also eminently clear that we can throw a lot of big data at a project, but the crucial details require the “small people” to look closely at them. And many of them feel excluded from the club already.

Heads Up, More Data, Epigenome

As this Nature editorial says, as the human genome (and a few hundred others) were completed, the amount of data became daunting (we know that well here at OpenHelix; we deal with it every day, and daily make it more accessible to scientists through training :). But also, importantly, even with all that data, it’s been found that we need more. As the editorial states:

By 2004, large-scale genome projects were already indicating that genome sequences, within and across species, were too similar to be able to explain the diversity of life. It was instead clear that epigenetics — those changes to gene expression caused by chemical modification of DNA and its associated proteins — could explain much about how these similar genetic codes are expressed uniquely in different cells, in different environmental conditions and at different times.

Thus was born the Human Epigenome Consortium (Nature paper, subscription required, here). You can find some of the data from the pilot project at the Sanger Institute site.

These are the beginning stages, but I believe it will prove to be quite a treasure trove of data (as if we don’t have a huge unmined dataset now). It was this last comment in the editorial that struck me:

…given that epigenetic coding will be orders of magnitude more complex than genetic coding, its requirement for data crunching may be similar…

Get ready for a lot more resources and tools of greater complexity :).

Data and how to handle it – biocuration and beyond

I was enjoying a wonderfully wet, gray autumn day – you know the kind – just perfect for curling up and reading a good book with a hot cup of tea. I figured I’d just indulge in a little break from writing & revising drafts of tutorials and publications. I was going to allow myself one Nature article – “The future of biocuration“, which I’ve been meaning to read since it came out. The article was written by several biocurators; it describes the exponential growth in the amount of available biological data and proposes three urgent actions:

1. collaboration among authors, journals and curators to expedite the exchange of data between databases and journal publications

2. development of a recognition structure that encourages community curation

3. establishment of scientific curation as an accepted professional career

The article makes a lot of good points, and I highly recommend you read it if you are interested in the future of databases at all. But as I began reading, I couldn’t stop. The special feature of this whole issue of Nature is ‘Big Data: Science in the petabyte era’. I really think Nature did a great job of finding and presenting many, many points of view on the subject of big data – some that I’ve been thinking about as I register for upcoming meetings, and some I’ve never considered but can now see make so much sense…

Continue reading