What’s the Answer? (tidy data format)

Biostars is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the Biostars community and find it very useful. Often questions and answers arise at Biostars that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at Biostars.

This week’s highlighted post at Biostars is about “tidy data”. Ah, quite the concept. The day when data becomes tidy will be one to celebrate. Anyway, I think it’s a worthwhile discussion to have, and I’m looking forward to the comments as this develops. If you have thoughts, please bring them over there too.

Usually I highlight most of the question here, but this time some pieces (examples of format issues) are too large, so I’ll just give you the bullet points and send you over to Biostars to read the whole thing.

Forum: Principles of Tidy Data (Hadley Wickham) and the VCF format

Hadley Wickham, the author of ggplot and many other popular R packages, has recently published a very good paper regarding the principles of tidy data. This article introduces a new library called tidyr, and also proposes a standard for formatting and organizing data before data analysis.

I personally think that the principles proposed in the article are very good, and that they help a lot in data analysis. Some of these are already adopted by many ggplot2/plyr users, as you need a data frame in a long format in order to produce most of the plots.

My question is whether it would make sense to apply these principles to bioinformatics. In particular, if we look at the VCF format, it fails at least two of the rules mentioned in the paper:

- “3.1. Column headers are values, not variable names”  (because individuals are encoded on distinct columns)

- “3.2. Multiple variables stored in one column” (because each genotype column contains the status of one or more alleles, plus its coverage etc…)

For example, let’s take the example from the 4.0 specs of VCF:

[examples here]

[More discussion of the issues within samples, so go read over there]

What do you think? Will we all convert to tidy VCF in the far future?

–Giovanni M Dall’Olio

So, tidy VCF. What do you think? Some people are already musing about it. Discuss over there.
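
To make the two objections concrete, here is a minimal Python sketch of what "tidying" a VCF-like table could look like. It is only an illustration under stated assumptions, not a proposed format and not anything taken from the Biostars thread: the sample names (`sampleA`, `sampleB`), the genotype values, and the `tidy_vcf` helper are all made up for this example, and a real VCF has many details (multi-allelic sites, missing fields, the INFO column) that this ignores. It unpivots the per-sample columns into one row per variant and sample (rule 3.1), and splits each colon-separated genotype cell into separate fields named by the FORMAT column (rule 3.2).

```python
# Hypothetical VCF-like text: made-up values, not copied from the VCF spec.
VCF_TEXT = """\
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB
1\t100\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:8
1\t200\t.\tT\tC\t30\tPASS\t.\tGT:DP\t0/0:20\t0/1:15
"""

# CHROM through FORMAT are the nine fixed columns; each later column is a sample.
FIXED_COLUMNS = 9


def tidy_vcf(text):
    """Return one dict per (variant, sample) pair.

    Sample names move from the header into a 'sample' value, and each
    FORMAT key (GT, DP, ...) becomes its own field.
    """
    rows = []
    header = None
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#"):
            header = [fields[0].lstrip("#")] + fields[1:]
            continue
        if header is None:
            raise ValueError("data line found before the #CHROM header line")
        record = dict(zip(header, fields))
        format_keys = record["FORMAT"].split(":")
        for sample in header[FIXED_COLUMNS:]:
            row = {
                "CHROM": record["CHROM"],
                "POS": int(record["POS"]),
                "REF": record["REF"],
                "ALT": record["ALT"],
                "sample": sample,
            }
            # Split "0/1:12" into {"GT": "0/1", "DP": "12"}.
            row.update(zip(format_keys, record[sample].split(":")))
            rows.append(row)
    return rows


if __name__ == "__main__":
    for row in tidy_vcf(VCF_TEXT):
        print(row)
```

The trade-off is visible even at this scale: the long form repeats the variant columns once per sample, which is convenient for plotting and grouping but much bulkier than the original layout.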

Reference:
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10). http://www.jstatsoft.org/v59/i10