Category Archives: What’s the Answer?

What’s the Answer? (electronic lab notebooks)

Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the Biostars community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.

This week’s highlighted question actually started on Twitter, and led me back to Biostar. I saw this question come across:

And I was interested in several of the answers. But one of the great things was the answer from Pierre–links to Biostar–with several different discussions of this.

This is a resource with history and depth! And although those answers were some time ago, they offer useful thoughts about the features to consider when making a choice. So that kind of institutional memory can be really helpful.

But I was also interested in the other answers–including DokuWiki, “universal open-source Electronic Laboratory Notebook” (referenced below), Labguru, and other people’s less formal solutions and suggestions.

Reference:

Voegele C., N. Robinot, J. McKay, P. Damiecki & L. Alteyrac (2013). A universal open-source Electronic Laboratory Notebook, Bioinformatics, 29 (13) 1710-1712. DOI: http://dx.doi.org/10.1093/bioinformatics/btt253

What’s the Answer? (free + useful protein tools)

One of the things we still don’t really have a handle on is the “lists of tools” problem. I think this leads to some really unfortunate duplication of effort. A lot of folks have attempted to create lists of tools for certain purposes, but the lists are hard to maintain and their focus varies. Sometimes useful tools are found in unusual or informal places, sometimes they are hard to categorize, and the support…well…yeah. So I keep tabs on various lists that I find, because sometimes there are some gems in there which are new to me. And to have active practitioners describing what’s useful to them is particularly helpful.

This week’s highlighted post is from someone focusing on protein tools, who is collecting a list of them.

Tool: A growing collection of “Free and useful protein-science tools”

I thought that it might be useful to put together a list of the tools that I am currently using with a short description and usage example.

I will add to it in future, and I am also looking forward to contributions: Please feel free to add your favorite tools if you like:

https://github.com/rasbt/protein-science/blob/master/scripts-and-tools/more_protein-science_tools.md

se.raschka

Check out the current list, and suggest others if you have some.

What’s The Answer? (data sharing with Bittorrent)

This week’s highlighted Biostar item is a new feature–and they are looking for your input and testing if it is a feature you might use.

Forum: Data sharing via Bittorrent is coming to Biostar

Hello Everyone,

We are adding bittorrent data sharing to Biostars.  Help us identify bugs and issues by creating a few torrents and adding them to posts on the test site. Also feel free to comment and provide suggestions and feedback. The description of how it works is at:

http://test.biostars.org/info/data/

An example post with data can be seen at:

http://test.biostars.org/p/101/

A few details on how it works:

  1. Torrents can get attached to posts, answers or comments
  2. A post may have multiple torrents attached.
  3. Biostars will attempt to connect the IP number of the Bittorrent peer connection to the IP number of the Biostar user account. This allows you to see who the person that shares the data is.
  4. Anonymous users cannot create torrents but they may share existing datasets.
  5. Data may be shared without making it visible on Biostar (although this should not be considered a secure way to share data)

(note: the test site will not log you into your old account since the emails are protected so don’t report that as an issue)

Istvan Albert

Although it seems to be well received, people have issues with some institutions that don’t allow Bittorrent access due to some past bad behaviors…so people have raised that issue. So if you want to try it out, or have concerns, let ‘em know over there.

What’s the Answer? (non-PhD bioinformatics job skills)

This week’s highlighted post was popular, and offers some chatter on the state of the field with regard to employment opportunities. And this is the kind of question that’s hard to answer from the literature.

Question: I’m applying for a non-PhD bioinformatics position in your lab. What do you look for?

I’ve been lurking here for years and I’d like to cover a topic that isn’t covered that much.

Bioinformatics is a tough field to not have a PhD. Nonetheless, research positions do exist where only a bachelors is required and research experience is also stated as between 0-2 years. I’d like to give a hypothetical situation that describes a good percent of such applicants to these positions. The motivation here is to survey what are ultimately core requirements for these positions and what is maybe considered “bells and whistles”.

I’m fresh out of college and I have a BS and/or Masters in Bioinformatics along with ~two years research in a lab. I’m applying to your lab, what are you looking for? And what requirement(s) can you excuse or not weight that heavily?

Edit. Sort of a related question, is requiring knowing hadoop and also the biochemistry/biophysics behind RNA-seq at the same time an outrageous expectation for a non-Phd?

scical

Everyone has been following the drama (and the graphs) about how many PhDs vs how many academic jobs there are. Certainly not everyone needs to have a PhD, and this seems a valid and useful question. It got some thoughtful answers from potential employers too. Check out the discussion.

What’s the Answer? (mutation nomenclature)

When I touched on the variation tools at NCBI for this week’s tip, I didn’t go into detail on how the specific variations are designated. But I happened to be looking through the Biostar questions for this week’s highlight, and noticed that someone was not familiar with how the ClinVar mutations are denoted. So I thought maybe others would find that useful information as well.

Question: ClinVar Mutation representations and Descriptions

I was looking into ClinVar data for getting mutation lists. There were mutations which were in the form GENE:c.*** representing they are CDS mutations and GENE:p.*** representing the amino acid changes.

What are those in the following forms represent?

  1. m.***
  2. GENE:n.***
  3. GENE:g.***
  4. nsv***

Example:

TBC1D24:c.1143-6C>T – CDS mutation

NP_002760.1:p.Cys139Ser –  Protein mutation

m.1606G>A ??

U43746.1:n.2241A>G ??

NC_000023.11:g.53254331_53296102dup41772 ??

nsv513787 ??

vigprasud

Have a look at the answers at Biostar. Zhaorong’s answer is correct. This nomenclature is certainly a bit cryptic if you aren’t familiar with the Human Genome Variation Society (HGVS) system. It’s worth looking into the background and framework for this if this is data you are likely to be working with. The history of this strategy goes back quite a ways as you can see from their publication list. But below I’ll add a reference that I think helps to understand the structure if you are new to it.
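
To make the prefixes in the question a bit more concrete, here is a minimal Python sketch that sorts the example names by their reference type. The one-letter prefixes (g, c, n, m, r, p) are the HGVS reference-sequence types, and nsv identifiers are dbVar structural variant accessions rather than HGVS names. The function name and the regular expressions are my own illustration and only look at the prefix; this is not a validator for the full HGVS grammar.

```python
import re

# One-letter reference-sequence types used in HGVS variant names.
HGVS_PREFIXES = {
    "g": "genomic DNA reference",
    "c": "coding DNA reference",
    "n": "non-coding DNA reference",
    "m": "mitochondrial DNA reference",
    "r": "RNA reference",
    "p": "protein reference",
}

# Optional "REFERENCE:" part, then the type letter, a dot, and the change.
# Illustrative pattern only; it does not check the change description itself.
HGVS_PATTERN = re.compile(r"^(?:(?P<ref>[^:]+):)?(?P<kind>[gcnmrp])\.(?P<change>.+)$")

# "nsv" followed by digits: a dbVar structural variant accession, not HGVS.
DBVAR_PATTERN = re.compile(r"^nsv\d+$")


def describe_variant(name):
    """Return a short label for the kind of reference a variant name uses."""
    name = name.strip()
    if DBVAR_PATTERN.match(name):
        return "dbVar structural variant accession"
    match = HGVS_PATTERN.match(name)
    if match is None:
        return "unrecognized"
    return HGVS_PREFIXES[match.group("kind")]


if __name__ == "__main__":
    examples = [
        "TBC1D24:c.1143-6C>T",
        "NP_002760.1:p.Cys139Ser",
        "m.1606G>A",
        "U43746.1:n.2241A>G",
        "NC_000023.11:g.53254331_53296102dup41772",
        "nsv513787",
    ]
    for example in examples:
        print(example, "->", describe_variant(example))
```

Running it on the six examples from the question labels each one by its prefix, which is usually enough to decide which coordinate system you are looking at before going any further.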

For even more help in understanding why getting nomenclature right is so crucial–check out the paper below that came out recently, on naming just the TP53 variations. This is a gene that has clinical relevance–and if you are aiming treatments at mutated TP53 you have to be sure you are getting the right one. It’s not just a trivial nuisance to understand how to define mutations–it can matter at the clinic and this will only become increasingly important as we get sequence from more tumors and other clinical situations. And I think this paper makes the point about the complexity and the needs for standardization.

References:
Laros J.F.J., den Dunnen J.T. & Taschner P.E.M. (2011). A formalized description of the standard human variant nomenclature in Extended Backus-Naur Form, BMC Bioinformatics, 12 (Suppl 4) S5. DOI: http://dx.doi.org/10.1186/1471-2105-12-s4-s5

Soussi T. & Taschner P.E.M. (2014). Recommendations for Analyzing and Reporting TP53 Gene Variants in the High-Throughput Sequencing Era, Human Mutation, 35 (6) 766-778. DOI: http://dx.doi.org/10.1002/humu.22561

What’s the Answer? (clinical cancer genomics)

This week’s highlighted question sort of pairs with my post from yesterday about the cancer genomics challenge. Despite all the chatter about what will be possible with tumor sequencing, one of the Biostars community members wants to know what we really know, and what we’ve really seen so far on this topic for treating specific individuals. It’s a nice summary type of question that brings together a bunch of the knowledge from folks who are on the front lines and thinking/reading/working on this issue. Biostars works well in this type of chatter, kind of like an international lab meeting.

Forum: Publications for individualized medicine in cancer by whole genome, exome or transcriptome sequencing

What papers are you aware of that attempt the following: (1) Whole genome, exome and/or transcriptome sequencing of (2) live patient tumor samples in an attempt to (3) guide clinical decision making for cancer. The omic events could provide diagnostic, prognostic or treatment response predictions. This approach is widely referred to as personalized medicine, individualized medicine, precision medicine, or precision oncology. There are many reviews describing this idea and many examples that make use of targeted panels (one to hundreds of molecular events). I’m looking for proof-of-principle papers, describing the paradigm where researchers (or tumor genome boards) attempt to use omic NGS data to alter or inform clinical care. These could be N-of-1 case reports or overviews describing experiences with small to large cohorts.

Here is a prototypical example in which an oral adenocarcinoma was sequenced by whole genome and transcriptome sequencing and analysis done to suggest a particular target/pathway for therapy that might not otherwise be considered in this disease type at the time. http://genomebiology.com/content/11/8/R82

[then the post is updated with many examples of the types of papers]

Obi Griffith

In a fast-moving area, with so much literature and/or data getting published in places that might be outside of the PubMed reach at this point, it’s a nice way to collect useful input. Go have a look at the ensuing discussion.

What’s The Answer? (explosion of careers)

This week’s highlighted post was popular, and offers some amusing chatter on the state of the field with regard to employment opportunities.

Forum: “An Explosion Of Bioinformatics Careers” – from Science Careers

I’m not sure if that was already posted, but I had found it interesting, especially with the growing amount of questions “should I get to bioinformatics”. Last paragraph of this piece from Science:

Data scientists can expect the field to change and evolve in novel ways in the near future. But the bottom line is that “companies are growing their bioinformatics,” says Kaleck. “There are 100% more job opportunities opening up in bioinformatics than ever before,” much of which is driven by an increase in venture capital investment.

Given that big data “is the hottest field on the planet,” says Agrafiotis, those who have the requisite skills and expertise often have their pick of opportunities. “I have to fight Google, Amazon, LinkedIn, and hedge funds to hire the top people. They are valuable in any industry.”

In particular, the future of big data in big pharma and biotech sectors is bright and exciting. “Bring your expertise to health care,” says Telthorst, “and you’ll know you’re going to make a difference, at the patient level and at the societal level.”

–Pawel Szczesny

Istvan’s comment and description of our current “caveman era” is something I wouldn’t have thought of, but I can see where he’s going with that. Heh.

Referenced article:

Levine A.G. (2014). An Explosion Of Bioinformatics Careers, Science, DOI: http://dx.doi.org/10.1126/science.opms.r1400143

What’s the Answer? (aligning isoforms)

This week’s highlight from Biostar is a tool that was new to me, and seems to have useful differentiation from other tools in this arena. I put it in my drafts folder a while ago and forgot to get back to it at the time. Anyway–check out PALO for isoform matching in multiple sequence alignments.

Forum: Palo: The Importance (And Impact) Of Aligning Matching Isoforms In Multiple Sequence Alignments

Protein ALignment Optimiser (PALO) is an algorithm for the selection of the best combination of protein isoforms among orthologous genes in the construction of a multiple alignment. You can easily upload your files from ENSEMBL and this tool will tell you which is the most suitable combination for you to align.

Large-scale evolutionary studies often require the automated construction of alignments of a large number of homologous gene families. The majority of eukaryotic genes can produce different transcripts due to alternative splicing or transcription initiation, and many such transcripts encode different protein isoforms. As analyses tend to be gene centered, one single-protein isoform per gene is selected for the alignment, with the de facto approach being to use the longest protein isoform per gene (Longest), presumably to avoid including partial sequences and to maximize sequence information. Here, we show that this approach is problematic because it increases the number of indels in the alignments due to the inclusion of nonhomologous regions, such as those derived from species-specific exons, increasing the number of misaligned positions. With the aim of ameliorating this problem, we have developed a novel heuristic, Protein ALignment Optimizer (PALO), which, for each gene family, selects the combination of protein isoforms that are most similar in length.

Take a look to the Tutorial section. You can either use this online version (section Run) or download the raw code (python-github) and run it in your local machine.

Biojl
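
The core idea in that description, picking one isoform per gene so that the chosen set is as similar in length as possible, can be sketched in a few lines of Python. This is not the PALO code: the function name, the input layout, and the score (longest minus shortest length) are assumptions of mine for illustration, and the brute-force search over every combination would be far too slow for large families.

```python
from itertools import product


def pick_similar_length_isoforms(family):
    """Pick one isoform per gene so that the chosen lengths are closest.

    family maps gene name -> {isoform id: protein length}.
    The score (max length - min length) is an assumed stand-in for
    whatever similarity measure the real tool uses.
    Returns (chosen isoform per gene, length spread of that choice).
    """
    if not family or any(not isoforms for isoforms in family.values()):
        raise ValueError("every gene needs at least one isoform")
    genes = sorted(family)
    best_choice, best_spread = None, None
    # Brute force: try every combination of one isoform per gene.
    for combo in product(*(sorted(family[gene].items()) for gene in genes)):
        lengths = [length for _, length in combo]
        spread = max(lengths) - min(lengths)
        if best_spread is None or spread < best_spread:
            best_choice, best_spread = combo, spread
    chosen = {gene: isoform for gene, (isoform, _) in zip(genes, best_choice)}
    return chosen, best_spread


if __name__ == "__main__":
    # Made-up lengths for a three-gene family.
    family = {
        "human": {"h1": 500, "h2": 350},
        "mouse": {"m1": 360, "m2": 620},
        "chicken": {"c1": 340},
    }
    print(pick_similar_length_isoforms(family))
```

With these made-up lengths, taking the longest isoform per gene would give h1, m2 and c1 (lengths 500, 620, 340), while the similar-length choice is h2, m1 and c1 (350, 360, 340), which is the kind of difference the abstract is describing.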

Quick link to PALO: http://evolutionarygenomics.imim.es/palo

And their paper has more details as well.

Villanueva-Cañas J.L., Laurie S. & Albà M.M. (2013). Improving genome-wide scans of positive selection by using protein isoforms of similar length, Genome Biology and Evolution, 5 (2) 457-467.

What’s The Answer? (23andMe to other formats)

This week’s highlighted question was from someone who has personal genomics data in their hands, but doesn’t know what to do next.

Question: How to I convert 23andMe Raw Genome to GenBank or FASTA?

I used 23andMe to download my raw genome. I have it in a .txt file but you can’t use the format for real bio programs. i want to make my own library for further analysis. Does anyone know how i can convert .TXT to FASTA, GenBank, or any other usable file type?

someashole

Although Biostar usually hosts questions from folks who are a bit more advanced in their grasp of file formats, this question struck me as interesting for a couple of reasons. The needs of folks who are not practitioners, but who find themselves with data in their hands, will only increase going forward. And although the companies will offer some tools, there’s a niche for some lighter-weight public tools. I discussed this before on the issue of genome browsers, which are currently too much heavy lifting for intro-level users. I know there are some open data communities forming around this data too, but so far what I’ve seen has been more sophisticated early adopter types.

But I imagine it would be difficult to get funding for such intro-level tools. They probably wouldn’t score well on “innovation” and some of the other traditional grant criteria because–well, because that’s not what the system does.

Maybe it would make some good class projects for some coders who are learning to build tools, and to work with this type of data. Make some gentle 23andme to X-format converters. A browser that’s not too hard to load your data up and look around without too many tracks. These folks are going to need more hand-holding. They don’t know what formats they need, or what is available for them to do.
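
As a rough idea of how small such a converter could be, here is a minimal Python sketch that turns 23andMe-style raw data into BED lines that a genome browser can load. It assumes the raw file is tab-separated with four columns (rsid, chromosome, position, genotype), that comment lines start with "#", and that "--" marks a no-call. The function name, the "chr" prefix, and the rsid_genotype naming of each feature are my own choices, and the sample rows are made up.

```python
def raw_to_bed(lines):
    """Yield BED lines (chrom, start, end, name) from 23andMe-style rows.

    Assumed input: tab-separated rsid, chromosome, position, genotype,
    with '#' comment lines. Positions are treated as 1-based, so the
    BED start is position - 1. Coordinates stay on whatever genome
    build the input file used.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        rsid, chrom, position, genotype = line.split("\t")
        if genotype == "--":
            continue  # assumed no-call marker: nothing to display
        # UCSC-style chromosome names; an illustrative choice.
        chrom = "chrM" if chrom == "MT" else "chr" + chrom
        end = int(position)
        yield f"{chrom}\t{end - 1}\t{end}\t{rsid}_{genotype}"


if __name__ == "__main__":
    # Made-up rows in the assumed layout.
    sample = [
        "# rsid\tchromosome\tposition\tgenotype",
        "rs0000001\t1\t82154\tAA",
        "rs0000002\tX\t2700157\tAG",
        "rs0000003\tMT\t16470\tG",
        "rs0000004\t2\t100\t--",
    ]
    for bed_line in raw_to_bed(sample):
        print(bed_line)
```

Going to FASTA or VCF is a bigger job, because a genotype list alone carries neither the surrounding sequence nor the reference allele, which is part of why newcomers need guidance on which format they actually want.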

But have a look at the answers, and if you have other guidance for this newbie, drop some comments over there.

What’s The Answer? (resource support via Biostar)

In my last Biostars highlight, I noted that my suggestion-box item was that it would be neat if tool providers had a channel over at Biostar. That way they could talk about their tool, announce the updates, and offer some support for users with this existing infrastructure. No need to run a mailing list, or set up their own forum, etc. So I was delighted to see a development team do exactly that.

This!

News: GATB will provide user support via Biostar

The Genome Assembly & Analysis Tool Box (GATB) provides an easy way to develop efficient and fast NGS tools. GATB is a C++ library of high level functions that leverage state-of-the-art data structures for handling huge NGS datasets.

To developers of NGS applications: GATB allows you to re-use components (fasta/fastq reading, k-mer counting, de Bruijn graph construction and traversal) from Minia/DSK. It’s well-documented.

We will answer Biostar questions regarding:

  • Library usage
  • Bugs
  • Feedback
  • Support for any of the tools

Post a question: https://www.biostars.org/p/new/post/?tag_val=GATB

See all posts: https://www.biostars.org/t/GATB/

– GATB authors

Love it. Yay to the GATB team for this. And they got a good response in upvotes for the effort. I hope it plays out well.