Tag Archives: bioinformatics

TIL: There’s a chief data scientist for the US. DJ Patil.

I know there’s lots of hype and drama over “big data”, some of which is over-the-top. But there are real needs and real opportunities in all sorts of data we are generating as well. So we now have a chief data scientist in the US. I found the news on the NIH Data Science blog, where they have more links and include this video in which DJ Patil explains more about the role and the reasons for it.

Highlights of the video in case you can’t listen right now:

At ~6 min he calls out “bioinformatics” as an area of emphasis.

At ~10 min he specifically talks about working with Phil Bourne and the NIH on bringing data science and bioinformatics together.

The White House release about Patil references the Precision Medicine efforts:

Precision medicine. Medical and genomic data provides an incredible opportunity to transition from a “one-size-fits-all” approach to health care towards a truly personalized system, one that takes into account individual differences in people’s genes, environments, and lifestyles in order to optimally prevent and treat disease. We will work through collaborative public and private efforts carried out under the President’s new Precision Medicine Initiative to catalyze a new era of responsible and secure data-based health care.

He asks for your help. They are building out teams. He wants everyone to check out the site and see if they can contribute.

US Data Service: http://whitehouse.gov/USDS

Follow @dpatil on twitter: https://twitter.com/dpatil

Hat tip to Beth Russell at the NIH Data Science blog, Input | Output: https://nihdatascience.wordpress.com/2015/02/24/dj-patil-is-the-new-chief-data-scientist-of-the-united-states/

Cambridge Healthtech Institute Announces the Acquisition of OpenHelix

Cambridge Healthtech Institute (CHI) announced the purchase of Washington-based OpenHelix, the provider of online and onsite training on some of the most popular and powerful open-access bioinformatics resources on the web.

“Knowing how to use the latest bioinformatics tools is critical to genomics research, which will only grow in importance,” said Phillips Kuhl, President of Cambridge Healthtech Institute. “With an over-ten-year track record of developing and presenting training on open-access bioinformatics databases and programs, OpenHelix is an instrumental service to researchers and a key addition to CHI’s family of conference and training products.”

OpenHelix will join the Cambridge Healthtech Institute as a division of Bio-IT World, a leading source of news and opinion on technology and strategic innovation in the life sciences, including drug discovery and development. “OpenHelix brings Bio-IT World an extensive and solid audience in the academic research community, as well as the opportunity to extend to our existing audience a valuable training product line,” said Lisa Scimemi, Publisher of Bio-IT World, “training that many of our readers need for themselves or their staff or students but may not be aware of.”

“We are proud of the success we have had in the past, with some of the top universities and medical schools subscribing to OpenHelix,” said Scott Lathe, CEO of OpenHelix. “Working with Bio-IT World will bring us the infrastructure, resources, and market reach we need to further grow our tutorials, subscriptions, and product offerings.”

As part of the acquisition, Scott Lathe, CEO and co-founder of OpenHelix, will become General Manager of the OpenHelix unit, and Mary Mangan, President and co-founder of OpenHelix, will become Director, Product and Content of the OpenHelix unit.

About Bio-IT World (www.Bio-ITWorld.com)
Bio-IT World provides outstanding coverage of cutting-edge trends and technologies that impact the management and analysis of life sciences data, including next-generation sequencing, drug discovery, predictive and systems biology, informatics tools, clinical trials, and personalized medicine. Through a variety of sources, including Bio-ITWorld.com, the Weekly Update Newsletter, and the Bio-IT World News Bulletins, Bio-IT World is a leading source of news and opinion on technology and strategic innovation in the life sciences, including drug discovery and development.

About Cambridge Healthtech Institute (www.chicorporate.com)
Cambridge Healthtech Institute (CHI), founded in 1992, is the industry leader in providing superior-quality scientific information to eminent researchers and business experts from top pharmaceutical, biotech, and academic organizations. Delivering an assortment of resources such as events, reports, publications and eNewsletters, CHI’s portfolio of products includes Cambridge Healthtech Institute Conferences, Barnett Educational Services, Insight Pharma Reports, Cambridge Marketing Consultants, Cambridge Meeting Planners, Knowledge Foundation and Cambridge Healthtech Media Group, which includes Bio-IT World and Clinical Informatics News.

About OpenHelix (www.openhelix.com)
OpenHelix, a Washington State company, was founded in 2003 to provide training on what was then a fledgling but quickly growing market of open-access, web-based bioinformatics resources. OpenHelix has provided training and outreach services for many providers of resources, such as the UCSC Genome Browser, OMIM, and the Protein Data Bank (RCSB PDB). OpenHelix received a $1.2 million grant in 2007 to create a search engine for bioinformatics resources and to expand its tutorial suites. In 2009, it launched its subscription service, offering access to over 100 tutorial suites.

Video Tip of the Week: MetaboAnalyst 2.0

In looking through the 2012 Web Server Issue of Nucleic Acids Research (NAR), I couldn’t help noticing resource names that revealed a bit about the developers’ sense of humor, such as “TaxMan” and “XXmotif“.  There were others on the list (“MAGNET“, “GENIES” and “VIGOR“, for example) whose names made me cringe imagining someone trying to find them with the average search engine. [Our family’s favorite such resource is iHOP, or Information Hyperlinked Over Proteins - I gotta think that the developers aimed at that name in honor of the other IHOP :) and breakfasts everywhere.]

I scrolled through many such names until I found a resource to feature in today’s tip. I wanted something dealing with a current topic – they all pretty much fit that criterion – and one that I was interested in, but that was outside my “normal area of expertise”. I decided on “MetaboAnalyst 2.0“, which is described in the article “MetaboAnalyst 2.0—a comprehensive server for metabolomic data analysis” as follows:

“MetaboAnalyst is a web-based suite for high-throughput metabolomic data analysis. It was originally released in 2009… MetaboAnalyst 2.0 now includes a variety of new modules for data processing, data QC and data normalization. It also has new tools to assist in data interpretation, new functions to support multi-group data analysis, as well as new capabilities in correlation analysis, time-series analysis and two-factor analysis. We have also updated and upgraded the graphical output to support the generation of high resolution, publication quality images.”

As I often do, I began “exploring” MetaboAnalyst 2.0 by reading their NAR article. It is well written and describes how the goal of the interface is to be user friendly and intuitive, so I headed over to MetaboAnalyst 2.0 to “kick some tires”, so to speak. I found that the interface is quite easy & intuitive to use. And to really help users understand the resource before launching into uploading their own data, the developers provide a wide range of example data sets that users can play with, as well as step-by-step guides (pdf, PowerPoint, & two articles that require journal subscriptions; no videos yet). In my video I use one of their datasets & show a quick example of some analysis steps. Of course there isn’t time to fully cover MetaboAnalyst 2.0, but hopefully I show you enough to tempt you to try it out on your own.

*Please note that the developers suggest you download results immediately: although all user data is treated as private and confidential by MetaboAnalyst 2.0, it will remain on the server for only 72 hours before being automatically deleted.

Quick Link:

MetaboAnalyst 2.0 – http://www.metaboanalyst.ca/

Jianguo Xia, Rupasri Mandal, Igor V. Sinelnikov, David Broadhurst, & David S. Wishart (2012). MetaboAnalyst 2.0—a comprehensive server for metabolomic data analysis. Nucleic Acids Research, 40 (W1). DOI: 10.1093/nar/gks374

Jianguo Xia, Nick Psychogios, Nelson Young, & David S. Wishart (2009). MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Research, 37 (suppl 2), W652–W660. DOI: 10.1093/nar/gkp356

What’s the answer? (how to stay current)

This week is a bit different from the usual “What’s the Answer?” post, where we highlight a question from a forum that our readers might be interested in. However–in this post, one of the answers includes BioStar–so it sort of comes back around!

Stephen Turner (aka @genetics_blog) wrote up a blog post recently in response to people asking him how he stays current in bioinformatics/genomics. I know a lot of people retweeted that post at the time, and his very kind inclusion of OpenHelix led to some good traffic to this blog and new twitter followers. And just last night I saw this as well, confirming my impression of this post:

There are blogs, forums, automated searches, Twitter, and literature–of course. A lot of people might know about some of these, but it’s nice to see someone assemble a collection. It is also interesting to see how similar it is to my strategy.

So it seemed like this might be a fun item to highlight as a good source of answers on a number of things, and a good way to find useful sites and folks to be aware of in this field. Check it out.

How to Stay Current in Bioinformatics/Genomics

So much data and information. You gotta have some strategies. And you have to have more than the literature.

Video Tip of the Week: Variation Data from Ensembl

Trey introduced me to this “decent collection of video tutorials” from Ensembl, but he and Mary are currently in Morocco teaching a 3-day bioinformatics workshop & then attending the conference (yes, I am envious!). I am therefore creating this week’s tip based on the tutorials that Trey pointed me to. In today’s tip I am going to parallel a tutorial available from Ensembl on SNP information in order to both: 1) show you how you can access variation information from Ensembl, and 2) compare doing these steps using Ensembl 64 (here in this video) and using Ensembl 54 (archived) (in the Ensembl video).

Bioscience resources are often under continuous development and improvement, and it can be difficult to keep videos and documentation up to date. That’s why here at OpenHelix we work continuously to keep our materials current, with weekly tips on new features and updated tutorials as updated sites become stable.

The Ensembl video (SNPs and other Variations – 1 of 2) is quite nice & provides more detail about the actual Ensembl data than I can in my short movie, but it was done a few years ago on an older version of Ensembl. Since then the resource has been updated, and gone through several new versions of the data. I’m going to follow the same steps that are done in part one of the Ensembl SNP tutorial so that you can see examples of what’s changed & what is pretty much the same. I’d suggest you watch both videos back-to-back to get a good idea of what’s changed, and what types of variation information are available from Ensembl. From that basis I’m sure you’ll be able to watch Ensembl’s second SNP video & apply it to using the current version of Ensembl without much trouble. For more details you can refer to the most recent Ensembl paper in the NAR database issue, which describes not just variation information but Ensembl as a whole.

Quick links:

Ensembl Browser: http://www.ensembl.org/index.html

Legacy Ensembl Browser (release 54): http://may2009.archive.ensembl.org/index.html

Ensembl tutorial, part 1 of 2: http://useast.ensembl.org/Help/Movie?id=208

Ensembl tutorial, part 2 of 2: http://useast.ensembl.org/Help/Movie?id=211

OpenHelix Ensembl tutorial materials: http://www.openhelix.eu/cgi/tutorialInfo.cgi?id=95

Ensembl Tutorial List: http://useast.ensembl.org/common/Help/Movie?db=core

Flicek, P., Aken, B., Ballester, B., Beal, K., Bragin, E., Brent, S., Chen, Y., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Fernandez-Banet, J., Gordon, L., Graf, S., Haider, S., Hammond, M., Howe, K., Jenkinson, A., Johnson, N., Kahari, A., Keefe, D., Keenan, S., Kinsella, R., Kokocinski, F., Koscielny, G., Kulesha, E., Lawson, D., Longden, I., Massingham, T., McLaren, W., Megy, K., Overduin, B., Pritchard, B., Rios, D., Ruffier, M., Schuster, M., Slater, G., Smedley, D., Spudich, G., Tang, Y., Trevanion, S., Vilella, A., Vogel, J., White, S., Wilder, S., Zadissa, A., Birney, E., Cunningham, F., Dunham, I., Durbin, R., Fernandez-Suarez, X., Herrero, J., Hubbard, T., Parker, A., Proctor, G., Smith, J., & Searle, S. (2009). Ensembl’s 10th year Nucleic Acids Research, 38 (Database) DOI: 10.1093/nar/gkp972

Friday SNPpets

Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…

Sage Bioinformatics Advice, But…

Bioinformatics analysis is a powerful technique applicable to a wide variety of fields, and the subject of many a blog post here at OpenHelix. I’ve had two particular bioinformatics articles on my desk for a couple of months now, waiting for me to be able to articulate my thoughts on them. They both offer great information about their particular area of interest – predicting either SNV impacts or protein identities – and sage bioinformatics advice.

The first article, “Using bioinformatics to predict the functional impact of SNVs”, is a great review of bioinformatics techniques for picking out functionally important single nucleotide variants (SNVs, also sometimes variously referred to as SNPs, or small, simple or single nucleotide polymorphisms) from the millions of candidate variants being identified every day. In the introduction the authors do a great job of explaining the many ways in which SNVs can have an impact, as well as how these basic philosophies of impact can be used for bioinformatics analyses. The paper then goes on to describe both classic and bioinformatics techniques for predicting the impact of such variations. It is a phenomenal read for the list of resources alone, with many valuable and important algorithms and resources mentioned. We’ve got tutorials (ENCODE, OMIM, the UCSC Genome Browser, UniProtKB, Blosum and PAM, HGMD, JASPAR, Principal Components Analysis, relative entropy, SIFT score, TRANSFAC) and blog posts (the Catalog of Published Genome-Wide Association Studies) describing many of the same resources. In fact this paper inspired at least one of our weekly posted tips (Tip of the Week: SKIPPY predicting variants w/ splicing affects). The paper then concludes with a “BUYER BEWARE” section that offers some sage advice – know the weaknesses and assumptions of the resources you use for your predictions.

The second article is an open access article from BioTechniques entitled “Mistaken identities in proteomics“. It offers a romp through the history of mass spectrometry (MS) technology and rising standards for documenting techniques used for protein identification in journals. The article also concludes with sage bioinformatics advice, including this quote:

Proteomic researchers should be able to answer key questions, according to Giddings. “What are you actually getting out of a search engine?” she says. “When can you believe it? When do you need to validate?”

Both papers suggest that researchers who wish to use bioinformatics resources in their research should investigate the theoretical underpinnings and assumptions of each tool before deciding on one to use, and should then approach every analysis with a level of disbelief in the tool’s results. That just sounds like common sense, and makes good theoretical advice.

HOWEVER, the level of investigation required to truly know each tool and algorithm is prohibitively huge. As for me, my “practical” suggestion for researchers is a bit of a “filtering shortcut”. Before diving into all the publications on all possible tools, just spend a few minutes with some documentation – the resource’s FAQ, or an intro tutorial – we’ve got a few we can offer you :) – to get an idea of what the tool is about & what you might be able to get from it. Once you’ve got a general idea of how to approach the resource, begin “banging” on it lightly.

An initial kick-the-tires test of an algorithm, database, or other resource can be as easy as keeping a “test set” on hand at all times & running it through any new tool you want to use. Make sure that the set includes a partial list of some very well-known proteins/pathways/SNPs/etc. (whatever you work on & will be interested in analyzing) and that it has some of your field’s ‘flukes’. Think about what you expect to get back from your set. Then run your tester set through any new tool you are considering using in your research, and look at your results – are they what you know they should be? Can they handle the flukes, or do they break?

As an example, when I approach a new protein interaction resource, I’ll use a partial parts list for some aspect of the yeast cell cycle, and include one or two of the hyphenated gene names. If the tool is good, I get a completed list with no bogging on the “weird” names. If it bogs, I know the resource may not be 100% worked out for yeast & may have issues with other species as well. If the full list of interactors comes back with a bunch of space-junk proteins, I begin investigating what data is included in the resource and whether settings can be tweaked to get better answers. Then, if things still look promising, I am likely to dig deep into the literature for the tool – just to be sure – because the authors of these articles are absolutely right: chasing false leads is expensive, frustrating & time-consuming. It is amazing how many lemons & jalopies you can weed out with a 5-minute bioinformatics tire kick! :)
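The kick-the-tires idea above can even be scripted. Here is a minimal sketch in Python – everything in it is hypothetical (the `query_tool` wrapper, the gene names, and the expected interactors stand in for whatever resource and test set you actually use); the point is simply to keep known inputs with known expected outputs and compare:

```python
# A "test set" of queries with known expected results. The well-known entry
# checks correctness; the hyphenated name is a deliberate "fluke" that
# under-tested resources sometimes choke on. Names and interactors here are
# illustrative placeholders, not real curated data.
TEST_SET = {
    "CDC28": {"CLB2", "CLN1"},   # expect at least these interactors back
    "SWI4-SWI6": set(),          # hyphenated name: must not crash the tool
}

def query_tool(gene):
    """Hypothetical wrapper around the resource being evaluated.

    In practice this would call the tool's API or parse its output;
    here it just fakes a lookup so the sketch is runnable.
    """
    known = {"CDC28": {"CLB2", "CLN1", "CLN2"}}
    return known.get(gene, set())

def kick_the_tires(tool, test_set):
    """Run every test query through the tool and report the outcome."""
    report = {}
    for gene, expected in test_set.items():
        try:
            result = tool(gene)
        except Exception as exc:          # the tool "bogs" on a weird name
            report[gene] = f"FAILED: {exc}"
            continue
        missing = expected - result       # expected hits the tool didn't return
        report[gene] = "ok" if not missing else f"missing: {sorted(missing)}"
    return report

print(kick_the_tires(query_tool, TEST_SET))
# → {'CDC28': 'ok', 'SWI4-SWI6': 'ok'}
```

Swap in a real wrapper and your own field’s test set, and a five-minute run like this flags a tool that drops known results or breaks on odd identifiers before you invest in it.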

I also don’t think the responsibility should rest solely on the back of each end user – the resource developer has some responsibility for making their tool rigorous and for accurately representing its capabilities in publications and documentation. Calls for open-source code can help improve some bioinformatics tools, as can education & outreach – but that discussion will have to wait for another day…


Explore Open Access Bioinformatics Tools with the Free “World Tour of Genomics Resources” Tutorial Suite

Online tutorial gives researchers and scientists a place to learn about the many biology resources available to them.

“These links assist scientists by guiding them to relevant technical tutorials on resources which may be unfamiliar to them. Thanks to this partnership with OpenHelix, BioMed Central journals are able to make their scientific content more useful and accessible.”

Bellevue, WA (PRWEB) April 6, 2011

The science community now has a valuable launching point to explore and find the many bioinformatics and genomics resources available to them through the “World Tour of Genomics Resources” tutorial suite by OpenHelix.

The free tutorial suite includes a sampling of resources organized by categories such as algorithms and analysis tools, expression resources, genome browsers (both eukaryotic and prokaryotic/microbial), literature and text mining resources, and resources focused on nucleotides, proteins, pathways, disease and variation.

In each category, the tutorial explores not only the most popular resources, but also some lesser known ones that fill unique scientific needs or are especially helpful to researchers.

The tour also shows easy ways of accomplishing the difficult task of finding and learning about other resources with the free OpenHelix search tool, tutorial suites, and other tools.

“With the ever-expanding data sets and resources of the genomics era,” said Warren (Trey) Lathe, Chief Science Officer at OpenHelix, “this tutorial suite fills the critical need of giving scientists an overview of resources and showing them ways to find them and learn how to use them.”

The online narrated tutorial, which runs in just about any browser, can be viewed from beginning to end or navigated using chapters and forward and backward sliders.

Included in the tutorial suite are animated PowerPoint slides used as a basis for the tutorial, a suggested script for the slides, slide handouts, and a list of the resources and tutorial landing pages mentioned in the tutorial. This saves a tremendous amount of time and effort for teachers and professors who want to give this tour to others.

A companion piece to this free tutorial, exploring ways to find and learn about online biology computational tools, is the paper “OpenHelix: bioinformatics education outside of a different box”, published in a special issue of Briefings in Bioinformatics entitled “Special Issue: Education in Bioinformatics“. This paper describes a wide range of repositories where researchers can find informal educational materials on publicly available bioinformatics resources. These come in a wide variety of formats and strategies, including lists of resources, journals that regularly feature tool descriptions, and eLearning sources such as the MIT OpenCourseWare effort.

Tip of the Week: World Tour of Genomics Resources

Most weeks our tip is a five-minute movie that quickly introduces you to a new resource, or a cool new function at an established resource. Occasionally we feature one of our full resource tutorials that is being made freely available through resource sponsorship of our training suite. In this week’s tip we provide access to one of our tutorials that is especially near and dear to our heart. It is a World Tour of Genomics Resources, in which we explore a variety of publicly-available biomedical, bioinformatics and bioscience databases and other resources.

This tutorial is quite different from our usual ones. Generally we focus on a specific software resource and describe step-by-step how to use its functions, such as how to do basic and advanced searches, how to understand and modify displays, where to find specific types of data such as FASTA sequences, etc., and even provide tips on ‘hidden features’ that even power users find useful and informative. This type of software training is absolutely critical.

But many people need an even earlier step: just the *awareness* that resources are available that might serve their needs. This tutorial fills that niche. We present a sampling of resources, all free to use, from each of 9 categories: Analysis & Algorithms, Expression, Genome Browsers (for Eukaryotes and for Prokaryotes and Viruses), Genome Variation, Literature, Nucleotides, Pathways and Proteins. After the World Tour, which is the majority of the tutorial, we describe how to use OpenHelix’s free search and learn portal to find the bioscience resources most appropriate for your research needs. From there the tour transitions into a brief discussion of the format of our training materials and how to use them, and then ends with information about other learning resources that we provide.

This tutorial has been wildly popular whenever we’ve done it as a live seminar. At the NIH they actually had to lock the doors because we’d hit the capacity of the room, and people were turned away. In fact, it has been so popular that we decided to produce it as a full tutorial suite and release it as one of our free trainings so that anyone and everyone could learn about the breadth of great public software options available for free use.

In addition to this free tutorial, we also have published a paper entitled “OpenHelix: bioinformatics education outside of a different box” in a special issue of Briefings in Bioinformatics entitled “Special Issue: Education in Bioinformatics“. This paper describes a plethora of places where researchers can access informal educational materials on publicly available bioinformatics resources, in a wide variety of formats including lists of resources, journals that regularly feature tool descriptions, and eLearning sources such as the MIT OpenCourseWare effort. If you know of other such resources that aren’t covered in our tour or paper, comment & let us know about them – we love to learn as much as we love to teach! :)

Quick link to World Tour of Genomics Resources tutorial here.

  • Williams, J., Mangan, M., Perreault-Micale, C., Lathe, S., Sirohi, N., & Lathe, W. (2010). OpenHelix: bioinformatics education outside of a different box Briefings in Bioinformatics, 11 (6), 598-609 DOI: 10.1093/bib/bbq026

Real bioinformaticians write code, real scientists…

Just over a week ago, Neil Saunders wrote a post I agreed with: Real bioinformaticians write code. The post was in response to a tweet conversation that started:

Many #biostar questions begin “I am looking for a resource..”. The answer is often that you need to code a solution using the data you have.

He’s right, and that’s very true for the bioinformaticists he’s talking to. My concern is for the rest of biological researchers. He states in the post:

In other words: know the data sources, know the right tools and you can always sculpt a solution for your own situation.

This is very true and I wholeheartedly agree. So many solutions exist already in thousands of databases and analysis tools. It’s what we do here at OpenHelix: help experimental biologists, genomics researchers and bioinformaticists find the right data sources and tools and then go and “sculpt a solution for their situation.”

In the last part of my comment, I wrote:

BioMart, UCSC Genome Browser, Galaxy, etc, etc are excellent tools and data sources and could probably answer about 80% of most posed questions :). But my caveat would be that knowing the data sources and right tools can be a bit of a daunting task.

And it is, despite the somewhat dismissive response :). We’ve all seen the graphs: exponentially rising amounts of data over time. It’s an issue, as the Chronicle of Higher Education article title states:

Dumped on by Data: Scientists Say a Deluge is Drowning Research

The journal Science also had an entire ten-article section on the issue. It’s not a problem that will go away.

Along with that deluge of data has come a deluge of databases and data analysis tools (created for the most part by bioinformaticists!), many of which are, on their own, quite daunting to find the right data and tool within. There are thousands of such databases and tools. I’ve lost count.

Neil Saunders is correct. The solution is out there: find the right tools and data, sculpt a solution. He responds to my comment with “Learning what you need to know in bioinformatics can certainly be daunting. But then, science isn’t for the easily daunted :-).” In other words, “if you are daunted, you aren’t a scientist?”

We give workshops to researchers around the world from Singapore to the US to Morocco and at institutions as varied as Harvard, Stanford, University of Missouri, Mt. Sinai, Stowers and Hudson-Alpha. The researchers we’ve given workshops and answered questions from were also varied, developmental biologists, evolutionary, medical researchers, bioinformaticists, researchers quite well versed in genomics and those not.

The overriding theme is that finding and knowing the data and the tools is not only daunting, but sometimes not possible. Not because they don’t exist, but because finding and knowing them is a drain on personal and lab resources, considering the sheer, growing volume of things to find and know. I refer you to the Chronicle article… drowning in data.

They are real scientists, not easily daunted, but daunted just the same by what’s in front of them. And yes, many of those specific questions about specific research needs can be answered by existing tools. We come across many questions on BioStar that a well-crafted database search or analysis step will answer beautifully, without the need for reinventing the wheel with more code (and yet the answers are often code).

I suspect that most of those scientists out there who call themselves “bioinformaticists” should have a grasp of the tools and databases available to them (but I can tell you, even the brightest of them sometimes don’t). So, the advice and final words of the linked blog post above…

In other words: know the data sources, know the right tools and you can always sculpt a solution for your own situation…. real bioinformaticists write code

Yes, real bioinformaticists write code, but this advice is insufficient for the other 90% of real scientists who don’t. Perhaps BioStar is not the solution (I suspect a lot of the questions he points out are asked by non-bioinformaticists who have only a basic knowledge of coding, if any, and no access to those who do). Perhaps it, or something like it, can be.