Category Archives: Genomics Research

A History of Bioinformatics (told from the Year 2039)

A week or so back I was watching the chatter around the #ISMB / #BOSC2014 meeting, and saw a number of amusing and intriguing comments about Titus Brown’s keynote talk.

You can see a lot of chatter about it in the Storify. I was delighted to soon see this follow up tweet:

I didn’t have time to watch it right away, but when I did, I really enjoyed it. It’s worth your time if you have some interest in the directions of this field. It’s not easy to pull off a talk as if you are 25 years into the future. It’s also rife with danger–as later people might use pieces of it against you. Lincoln Stein wrote an amusing follow-up to a prediction talk he gave in 2003, entitled: Bioinformatics: Gone in 2012 (follow up piece linked below). Or it could just end up so embarrassingly off-target that you’ll look like some of the folks that Titus highlights in the talk, whose predictions about future technologies were pretty…um…well, you’ll see. But it’s a clever way to think about the future that we want, and how the path could look to get us there.

SPOILERS: Here are some of my favorite tidbits, mostly for my own notes:

  • Bioinformatics sweatshops [I fear this too]
  • California has disappeared [egads, but...]
  • MicrosoftElsevier [snicker]
  • Universities have collapsed [hmm, not convinced on this]
  • Pioneering appointment of Phil Bourne: “NIH finally realized that training was important” [~20min; oh, please let this come true]
  • the problems of “Glam Data” [contrast to "glam journals" today]
  • in the future, because of better education, 80% of the US will accept evolution [from your lips to...wait...]
  • ~33min, interesting look at the actual outcomes of techno-progress and how they diverged from predictions; via Heinlein’s “Where To?” with 4 curves of predicted human progress (linked below). [Heh, I'm in this argument a lot, this could be handy--piece + chart linked below]
  • “I have no idea what I’m doing, but I’m trying new things.” [~38min, about forging uncharted directions in a young field]
  • At the end, ~56min: “Let the crazy people do the crazy things. See what happens.” [Testify.]

Boy, the pressure is on Phil Bourne to solve everything. This is a recurring theme at every genomics and bioinformatics event I see lately…I wish him luck sorting this out. Good news from this talk is that he seems to have done it.

And the slides are here, with Talk notes for the Bioinformatics Open Source Conference (2014) at Titus’ blog.

References:

Stein L.D. (2008). Bioinformatics: alive and kicking, Genome Biology, 9 (12) 114. DOI: http://dx.doi.org/10.1186/gb-2008-9-12-114

Heinlein R. (1952). Where to?, Galaxy Magazine, February 13-22. ["Your personal telephone will be small enough to carry in your handbag." Well, he nailed that one.]

{sorry,  had to republish to get it in to the ResearchBlogging queue. RB was down yesterday.}

Video Tip of the Week: New UCSC “stacked” wiggle track view

This week’s video tip shows you a new way to look at the multiWig track data at the UCSC Genome Browser. A new option has recently been released (see 06 May 2014), a “stacked” view, and it’s a handy way to look at the data with a new strategy. But I’ll admit it took me a little while of working with it to understand the details. So in this tip I hope you’ll see what the new visualization offers.

I won’t go into the background on the many types of annotation tracks available–if you need to be introduced to the idea of the basic track views, start out with our introduction tutorial that touches on the different types of graphical representations. Custom tracks are touched on in the advanced tutorial. For guidance specifically on how to create the different track types, see the UCSC documentation. The type of track I’m illustrating in the video today, a MultiWig track, has its own section over there too. Basically, if you are completely new to this, the “wiggle” style is a way to show a histogram display across a region. MultiWig lets you overlay several of these histograms in one space. In the example I’ll show here, the results of looking at 7 different cell lines are shown for some histone mark signals (Layered H3K27Ac track).

Annotation track cell lines

When I saw the announcement, I thought this was a good way to show all of the data simultaneously. When we do basic workshops, we don’t always have time to go into the details of this view, although we do explore it in the ENCODE material, because the track I’m using is one of the ENCODE data sets. I’ll use the same track in the same region as the announcement, which is shown here:

stack announcement

But when I first looked at this, I wasn’t sure if the peak–focus on the pink peak that represents the NHLF cell line–was meant to cover the whole area underneath or not. What I was trying to figure out is essentially this (a graphical representation of my thought process follows):

stackedMultiWig_screenshot_v2

By trying out the various styles I was pretty sure I had the idea of what was really being shown, but I confirmed that with one of the track developers. The value is only the pink band segment, not the whole area below it. And Matthew also noted to me that they are sorting the tracks in reverse alphabetical order (so NHLF is the highest in the stack). That was an aspect I hadn’t realized yet. They are not sorting based on the values at that spot. This makes sense, of course, but it wasn’t obvious to me at first.
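To make the point about the bands concrete, here is a minimal Python sketch of how I read the stacked view. It is not the browser's actual drawing code. The function name `stacked_bands` and the signal values are made up for illustration; only the ordering (last name alphabetically ends up on top) and the "value is the band, not the area beneath it" idea come from the description above.

```python
from typing import Dict, List, Tuple


def stacked_bands(values: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (name, bottom, top) for each track at one position, bottom band first.

    Assumption: bands are stacked upward in alphabetical order of track name,
    so reading from the top down gives reverse alphabetical order.
    """
    bands: List[Tuple[str, float, float]] = []
    bottom = 0.0
    for name in sorted(values):
        top = bottom + values[name]
        bands.append((name, bottom, top))
        bottom = top
    return bands


# Hypothetical signal values at a single position (not real ENCODE data).
example_signal = {"GM12878": 10.0, "K562": 5.0, "NHLF": 20.0}

# Print from the top of the stack down.
for name, bottom, top in reversed(stacked_bands(example_signal)):
    print(f"{name}: drawn from {bottom} to {top}, value = {top - bottom}")
```

In this toy case the NHLF band sits highest on the plot, but its value is only the height of its own band, not the height of its upper edge above the baseline.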

I like this option very much–but I figured if I had to do some noodling on what it actually meant others might have the same questions.

In the video I’ll show you how this segment looks with the different “Overlay method” settings on that track page. I’ll be looking at the SOD1 area, like the announcement example.  I tweaked a couple of the other settings from the defaults so it would be easier to see on the video (see arrowheads for my changes). But I hope this conveys the options you have now to look at this type of track data effectively.

Track settings for video

So here is the video with the SOD1 5′ region in the center, using the 4 different choices of overlay method, illustrating the histone mark data in the 7 cell lines. I’m not going into the details of the data here, but I’ll point you to a reference associated with this work for more on how it’s done–see the Bernstein lab paper below. I just wanted to demonstrate this new type of viewing option that will be available on wiggle tracks. Some tracks will have too much data for one type or another, or will be clearer with one or another style. But now you have an additional way to consider it.

Quick links:

UCSC Genome Browser: genome.ucsc.edu

UCSC Intro tutorial: http://openhelix.com/ucscintro

UCSC Advanced tutorial: http://openhelix.com/ucscadv

These tutorials are freely available because UCSC sponsors us to do training and outreach on the UCSC Genome Browser.

References:

Kent W.J., Zweig A.S., Barber G., Hinrichs A.S. & Karolchik D. (2010). BigWig and BigBed: enabling browsing of large distributed datasets., Bioinformatics (Oxford, England), PMID:

Karolchik D., Barber G.P., Casper J., Clawson H., Cline M.S., Diekhans M., Dreszer T.R., Fujita P.A., Guruvadoo L., Haeussler M. et al. (2013). The UCSC Genome Browser database: 2014 update., Nucleic acids research, PMID:

Ram O., Goren A., Amit I., Shoresh N., Yosef N., Ernst J., Kellis M., Gymrek M., Issner R., Coyne M. et al. Combinatorial patterning of chromatin regulators uncovered by genome-wide location analysis in human cells., Cell, PMID:

The ENCODE Project Consortium, Bernstein B.E., Birney E., Dunham I., Green E.D., Gunter C. & Snyder M. et al. (2012). An integrated encyclopedia of DNA elements in the human genome., Nature, 489 PMID:

Also see the Nature special issue on ENCODE data, especially the chromatin accessibility and histone modification subset (section 02): http://www.nature.com/encode/

Participate in an NSF “IDEAS LAB” (generate research agendas and proposals)

Greetings!
The short link: IUSE IDEAS LAB: http://www.nsf.gov/pubs/2014/nsf14033/nsf14033.jsp

IUSE:
NSF’s education directorate has a funding opportunity called “Improving Undergraduate STEM Education” (IUSE).

The IUSE program description [PD 14-7513] http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504976 outlines a broad funding opportunity to support projects that address immediate challenges and opportunities facing undergraduate science, technology, engineering, and math (STEM) education.
To generate research agendas and proposals for this, NSF is holding an… 

Ideas Lab:
Ideas Labs are meetings that bring together researchers, educators and others in an “intensive, interactive and free-thinking environment, where participants immerse themselves in a collaborative dialog in order to construct bold and innovative approaches and develop research projects.” More often than not, these “Ideas Labs” produce new collaborations and research project proposals that go on to be funded. The Ideas Lab is patterned after the Ideas Factory process, the aim being “to make new connections, which are frequently cross disciplinary, and also generate novel research projects coupled with real-time peer review.”
This NSF Ideas Lab has several purposes, but the one most pertinent to this community is finding new ways, and developing research proposals, to infuse computational thinking, literacy and competency into the core curriculum for undergraduate education.
Individuals apply to the Ideas Lab with a 2-page proposal, which is DUE FEBRUARY 4 (next Tuesday). Funding is provided for the trip. These Ideas Labs are excellent ways to meet and discuss genomics, biology and education, to build new collaborations and to develop new research proposals.
The letter and more information (read the link):
A Dear Colleague Letter on the topic of “Preparing Applications to
Participate in Phase I Ideas Labs on Undergraduate STEM Education” [NSF
14-033] has been posted on the NSF web site.
If you have any questions, you can ask here or by email (wlathe AT openhelix.com ). I am _not_ a project officer at NSF and don’t have all the answers, but I can direct you to the places you might find answers.
PLEASE feel free to disseminate!

The Thanksgiving Genomes

Happy Thanksgiving to those who celebrate. For those of you who don’t, have a nice Thursday.

Light posting this week due to the holiday, but this might be fun for you to keep in your back pocket for dinner discussion–genome information of the traditional foods.

The Genome of Your Thanksgiving Supper

The genetic sequences of the turkey, apple, potato, and other traditional Thanksgiving ingredients are providing bountiful lessons for scientists.

Public service announcement: CAFA2 for protein functional annotations

Just got this email on the Biocurators mailing list, wanted to spread the word:

Announcing CAFA 2: The Second Critical Assessment of protein Function Annotations

Friends and Colleagues,

We are pleased to announce the Second Critical Assessment of protein Function Annotation (CAFA) challenge. The goal of the challenge is to predict functional annotations of genes/proteins. In CAFA, the organizers provide a set of about 100,000 protein sequences, of which most are completely unannotated and some are partially annotated with respect to their function. The participants are asked to predict functional annotation of these proteins before January 15, 2014. At that time, all predictions will be stored and we will wait for 6-12 months until new annotations are available in the biomedical literature and/or major databases. The initial evaluation will be provided in July 2014, during the ISMB conference (Boston, MA). Anyone in the world is welcome to participate.

In brief:

Web site: biofunctionprediction.org

Prediction submission deadline: January 15, 2014

Initial evaluation: July 12, 2014 in Boston

All targets can be downloaded from http://biofunctionprediction.org/node/12. The web site also contains training data; however, the participants are *not* required to use it and even if they do, they can use any additional data of their choice, including the literature. The CAFA challenge is different from many other similar challenges because not even the organizers know which of the original target sequences will be functionally annotated after the submission deadline.

The CAFA 1 experiment is described in the following paper:

P. Radivojac et al. A large-scale evaluation of computational protein function prediction. Nature Methods (2013) 10(3): 221-227.

A brief introduction to the problem for computer scientists is provided at:
http://biofunctionprediction.org/sites/default/files/IntroductionCAFA_pedja.pdf

The mission of the Automated Function Prediction Special Interest Group (AFP-SIG) is to bring together computational biologists who are dealing with the important problem of gene and gene product function prediction, to share ideas and create collaborations. We also aim to facilitate interactions with experimental biologists and biocurators.

We hope that AFP-SIG serves an important role in stimulating research in annotation of biological macromolecules, but also related fields.

New in CAFA 2:

In CAFA 2, we would like to evaluate the performance of protein function prediction tools/methods and also expand the CAFA challenge to include prediction of human phenotypes associated with genes and gene products. As the last time, CAFA will be a part of the Automated Function Prediction Special Interest Group (AFP-SIG) meeting that will be held alongside the ISMB conference. AFP-SIG will be held as a two-day meeting in July 2014 in Boston.

About the CAFA experiment:

The problem: There are far too many proteins for which the sequence is known, but the function is not. The gap between what we know and what we do not know is growing. A major challenge in the field of bioinformatics is to predict the function of a protein from its sequence (and all other data one can find). At the same time, how can we judge how well these function prediction algorithms are performing and whether we are making progress over time?

The solution: The Critical Assessment of protein Function Annotation algorithms (CAFA) is an experiment designed to provide a large-scale assessment of computational methods dedicated to predicting protein function. We will evaluate methods in predicting the Gene Ontology (GO) terms in the categories of Molecular Function, Biological Process, and Cellular Component. In addition, predictors may use the Human Phenotype Ontology (HPO) for the human dataset. A set of protein sequences is provided by the organizers, and participants are expected to submit their predictions by the submission deadline, January 15, 2014. The predictions will be evaluated during the Automated Function Prediction (AFP) meeting, which has been approved as a Special Interest Group (SIG) meeting, at the ISMB 2014 conference (Boston, USA).

History: The first CAFA experiment was conducted in 2010-2011. Twenty-three groups submitted fifty-four algorithms for assessment. The results and most methods were published in Nature Methods and in a special supplement in BMC Bioinformatics. CAFA 1 has brought together a large group of computational predictors and, for the first time, provided us with a clear picture of the state of this important field. As with other critical assessment experiments, the aim of CAFA is to improve protein function prediction by continuously challenging groups to develop more accurate methods.

How to participate in CAFA 2?

1. Go to http://biofunctionprediction.org

2. Download target proteins, already available

3. Submit predictions on or before January 15, 2014

4. Join us at the AFP-SIG, July 11-12, 2014 in Boston for the eighth protein function prediction meeting, to hear the CAFA 2 results, to present your work, and to learn about the latest research in computational protein function prediction

More details at: http://biofunctionprediction.org

Confirmed keynote speakers:

Fiona Brinkman, Simon Fraser University, Canada

Mark Gerstein, Yale University, USA

We look forward to hearing from you!

The CAFA organizing Team: Predrag Radivojac, Michal Linial, Sean Mooney and Iddo Friedberg
Contact: CAFA.2014@gmail.com
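The time-delayed setup in the announcement (predictions are frozen at the deadline, then compared with annotations that appear later) can be sketched roughly as below. This is only an illustration under my own assumptions: the function names, the protein names, and the plain precision/recall scoring are hypothetical, and the GO identifiers are just placeholders. The announcement does not say which metrics the organizers use.

```python
from typing import Dict, Set, Tuple

Annotations = Dict[str, Set[str]]  # protein name -> set of ontology term IDs


def newly_annotated(before: Annotations, after: Annotations) -> Annotations:
    """Keep only the terms each target gained after the submission deadline."""
    gained: Annotations = {}
    for protein, terms_after in after.items():
        new_terms = terms_after - before.get(protein, set())
        if new_terms:
            gained[protein] = new_terms
    return gained


def precision_recall(predicted: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Simple set-overlap scores; an assumption, not the official CAFA metric."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical targets: P1 and P3 unannotated at the deadline, P2 partially annotated.
at_deadline = {"P1": set(), "P2": {"GO:0003677"}, "P3": set()}
months_later = {"P1": {"GO:0005515"}, "P2": {"GO:0003677", "GO:0006355"}, "P3": set()}

benchmark = newly_annotated(at_deadline, months_later)
print(sorted(benchmark))  # only targets that gained annotations can be scored

# Score one hypothetical frozen prediction for P1.
print(precision_recall({"GO:0005515", "GO:0003677"}, benchmark["P1"]))
```

Note that P3 never enters the benchmark here, which mirrors the point in the email that nobody knows in advance which targets will end up being evaluable.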

UCSC’s new Variant Annotation Integrator

In case you aren’t on the UCSC announcement mailing list, and you don’t go to the site via their homepage with the posted news–you should know about this new tool at the UCSC Genome Browser. It will take variations that you are exploring and make a prediction about whether the variant is associated with a function, and potentially if it is damaging to a protein. It’s under active development, so try it out. And if there are features you could use, suggest them. See the VAI page for more.

Here are the details via their email, but sign up for the “announce” mailing list if you’d like to get news like this in your inbox too:

[Link to the original at the mailing list site]

Hello all,

In order to assist researchers in annotating and prioritizing thousands
of variant calls from sequencing projects, we have developed the Variant
Annotation Integrator (VAI). Given a set of variants uploaded as a
custom track (in either pgSnp or VCF format), the VAI will return the
predicted functional effect (e.g., synonymous, missense, frameshift,
intronic) for each variant. The VAI can optionally add several other
types of relevant information, including: the dbSNP identifier if the
variant is found in dbSNP, protein damage scores for missense variants
from the Database of Non-synonymous Functional Predictions (dbNSFP), and
conservation scores computed from multi-species alignments. The VAI also
offers filters to help narrow down results to the most interesting variants.

Future releases of the VAI will include more input/upload options,
output formats, and annotation options, and a way to add information
from any track in the Genome Browser, including custom tracks.

There are two ways to navigate to the VAI: (1) From the “Tools” menu,
follow the “Variant Annotation Integrator” link. (2) After uploading a
custom track, hit the “go to variant annotation integrator” button. The
user’s guide is at the bottom of the page, under “Using the Variant
Annotation Integrator.”

As always, we welcome questions and feedback on our public mailing list:
genome@soe.ucsc.edu.


Brooke Rhead
UCSC Genome Bioinformatics Group
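Since the email says the VAI takes variants uploaded as a custom track in pgSnp or VCF format, here is a minimal Python sketch of writing a handful of variants as VCF text. The helper name `to_vcf` and the coordinates and alleles are hypothetical, chosen only for illustration. The eight column names are the standard VCF ones. The email doesn't say whether the custom track upload wants plain text or a compressed, indexed file, so check the UCSC custom track documentation before relying on this.

```python
from typing import Iterable, Tuple

Variant = Tuple[str, int, str, str]  # chromosome, 1-based position, ref allele, alt allele


def to_vcf(variants: Iterable[Variant]) -> str:
    """Format variants as minimal VCF text: a header plus one line per variant."""
    lines = [
        "##fileformat=VCFv4.1",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by chromosome name, then position, so records are in a stable order.
    for chrom, pos, ref, alt in sorted(variants, key=lambda v: (v[0], v[1])):
        # "." marks fields we are leaving empty (ID, QUAL, FILTER, INFO).
        lines.append("\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


# Made-up example variants, not real calls.
example_variants = [("chr21", 33040000, "A", "G"), ("chr21", 33032000, "C", "T")]
vcf_text = to_vcf(example_variants)
print(vcf_text)
```

Leaving the ID column as "." seems reasonable for this purpose, since the email says the VAI can optionally add the dbSNP identifier for you.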

 

“Most viewed” item in figshare is….software training?

So if you go to visit figshare today, and you click the “Browse” link at the top, and then you select to sort by “most viewed” from the menu, what do you get?

figshare_mostviewed

Yes, for reasons I cannot explain, work that we’ve created or uploaded appears right at the top–the GenoCAD training we are developing, and a copy of the UCSC Genome Browser intro slides. Honestly–how we are beating “World Beer Consumption and Scientific Productivity” completely stumps me. I am rather pleased to see that the herring transcriptome is ranking so high too though.

I was joking on twitter the other night, though, that a #1 viewed rank and $3 will get me a cup of coffee at Dunkin’ Donuts. I’d love to see if this has any value in a grant situation, but I have no idea if it would. But it does make me wonder how and why this has happened. Is it really reflecting interest, or a need? Or is there some other way to interpret this?

Software training on genomics tools is a curious thing. A lot of people tell us how much they need this, and they appreciate the training which saves them lots of time in their work. We know we improve their awareness of what’s available, and their efficiency. At the last workshop we did at WashU, a woman in the back of the room emitted a huge sigh during Trey’s advanced UCSC section. Trey was worried that he’d confused her, but instead she said that in fact what he had just shown her saved her a ton of work. She was actually just incredibly relieved to learn what we could show her. And we see this a lot. But we have no way to measure that really.

But other times we find–say in grant situations–that software training isn’t scoring very high in the priority list. Yeah, it’s not novel and innovative enough I suppose. The people who need the training have no mechanism to push upwards really and express the need or quantify it. It’s kind of individual–you need what you need, when you need it. But it’s not an organized demand that we can point to in any way. Yet just a couple of weeks ago I attended a Software Carpentry training with 120 women who wanted better knowledge of software tools. Demand is there. I wish it was better recognized how important and useful it is.

I’m gonna go get a cup of coffee. And then make some more training. Go figure.

Citations:

GenoCAD Tutorials. Mary Mangan, Mandy Wilson, Laura Adam, Jean Peccoud. figshare.
http://dx.doi.org/10.6084/m9.figshare.153827 Retrieved 16:33, Jul 08, 2013 (GMT).

World beer consumption & scientific productivity. Christopher Lortie. figshare.
http://dx.doi.org/10.6084/m9.figshare.664162 Retrieved 16:34, Jul 08, 2013 (GMT).

Introduction to the UCSC Genome Browser. Mary Mangan. figshare.
http://dx.doi.org/10.6084/m9.figshare.96258 Retrieved 16:42, Jul 08, 2013 (GMT).

Who’s your daddy?

A new article in Slate describes a case of non-paternity unearthed as a result of a 23andme scan.

Who’s Your Daddy?

The perils of personal genomics.

I expect a bit of chatter from the genoscenti. I’ll collect responses below if I see them. I agree that the actual studies of non-paternity show values that are all over the map. But I suspect that there are going to be a lot of people affected by this who didn’t see it coming. And many of those stories will be quiet and private, and won’t be widely known. Some will turn into Jerry Springer, perhaps.

But I know of cases where this has already had serious impact, like the woman who was thrown out of her tribe as a result of her DNA test. This is a very heated topic in some circles: Tribal Enrollment and Genetic Testing Resources.

Interesting times.

All I could think of was this: