So last week I treated myself to my first vacation in a long time. It was my birthday, and I wanted to disconnect a bit and recharge. Mostly it worked, although the hundreds of emails I’m facing this morning are a bit daunting. But just before I left I got an email from a colleague who asked me a really great question:
….I would love to know where you would start when you get back a personal genome sequence….
And I couldn’t shake this out of my head. I was sitting on a bridge outside Windsor Castle thinking about it as the sun set on my first day. (On subsequent days I found that the far superior ciders in the UK were able to push this question out of my head for some periods of time. And also pie.)
I’ve spent some significant time thinking about the onslaught of personal genomics, of course. It’s all been very theoretical, because I would have refused to even begin the process of obtaining my personal genome sequence until the GINA legislation fully kicked in. But now that barrier is down. I’m still not ready to get mine done for a variety of reasons (cost, quality, informative value). But it’s still worth thinking about what I would do with it if it were handed to me, in specific terms, with concrete actions. So here’s what I decided I would do. Your mileage may vary. And I’d love to hear what others might do with theirs.
This assumes a fully-sequenced genome. I don’t want a SNP check. I want all the nucleotides.
Step 1: Assessment and QC
Assessment: The first thing I would do is open the files and figure out the formats. I have no idea what the sequence might look like. Do you get the read data? Do you get summary FASTA files? Do you get processed/annotated stuff? What’s in there? So I’d assess the files. Here’s what I’d want: access to the raw output data (I’m assuming short-read or somewhat longer-read data. Technology may change, though). We’ve seen in the past that there can be issues with the raw data, and if there was anything I was going to base any sort of health-related decision on, I would want to go all the way back to the read data and check the quality. It may also be that software improvements come along and you’d want to re-build the summary data. But mostly I’d like to work with a summary/consensus file in FASTA format for my own analysis. I’d also like more processed output from the provider, with annotations. But I would use those as a starting point and probably verify it all myself anyway. Although as I say this now, I think I would have asked about the output files before committing to a service provider for this. And I would have requested all of the files in my contract.
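A minimal sketch of that first "what's in there?" pass, assuming the provider hands over plain-text or gzipped files. The function names are hypothetical, and the check only looks at the first non-empty line, so treat the answer as a hint rather than a verdict (binary formats such as BAM would need a different approach).

```python
import gzip
import re

def guess_format(first_line):
    """Guess a genomics text format from the first non-empty line of a file."""
    line = first_line.strip()
    if line.startswith("##fileformat=VCF"):
        return "vcf"      # variant calls
    if re.match(r"@(HD|SQ|RG|PG|CO)\t", line):
        return "sam"      # alignment file with a header section
    if line.startswith(">"):
        return "fasta"    # summary/consensus sequence
    if line.startswith("@"):
        return "fastq"    # raw reads with quality scores
    return "unknown"

def sniff_file(path):
    """Open a plain or gzipped text file and guess its format."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.strip():
                return guess_format(line)
    return "empty"
```

Running `sniff_file` over every delivered file would give a quick inventory of what is read data, what is consensus sequence, and what is annotation.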
QC: My next step would be to take chunks of the data and look at sequence alignments of my sequence vs the reference sequence. For example, I would take maybe a chunk of chromosome 21 and look closely at it. Not just the known genes, but all the other pieces as well. According to the GRCh37 view of the Chr21 reference sequence I just called up on the UCSC Genome Browser, there are over 48 million bases. I’d take chunks that had known genes, and regions with no genes, maybe avoid the centromere and the very ends, but I’d use that to look at the data nearly letter-by-letter, probably with BLAT. If there were variations every 10 bases I’d be very concerned. If my variations appeared to correspond with known SNPs, I’d be more confident in the quality. I’d pull up some regions with known repeats and have a hard look at how those had been handled. (I’d also do a second check using BLAST at NCBI on the official reference sequence to make sure the conclusions were about the same. Can’t help myself, I love to QC things.) I’d also probably spot-check some other chromosomes and other regions. I should also say that I mean not only single-nucleotide polymorphisms (SNPs) but also copy-number variations (CNVs). I’d be extra-curious whether CNV-sized chunks were observed in my own genome, where they were, and how they were handled.
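The two sanity checks above (how dense are the differences, and how many of them land on known SNPs) can be sketched roughly like this. It assumes the alignment has already been done by a tool such as BLAT or BLAST and exported as two equal-length strings, with `-` for gaps; the function names, the window size, and the set of known positions are all hypothetical.

```python
def mismatch_positions(ref, sample):
    """Return 0-based alignment columns where the two sequences differ.

    Columns where either sequence has an N (unknown base) are skipped;
    gap characters count as differences.
    """
    if len(ref) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(ref.upper(), sample.upper())
    return [i for i, (r, s) in enumerate(pairs)
            if r != s and r != "N" and s != "N"]

def suspicious_windows(positions, length, window=1000, max_rate=0.1):
    """Flag windows whose difference rate is at least max_rate.

    The default max_rate of 0.1 corresponds to one variation every 10 bases.
    Returns (start, end, count) tuples with half-open coordinates.
    """
    counts = {}
    for p in positions:
        counts[p // window] = counts.get(p // window, 0) + 1
    flagged = []
    for w, count in sorted(counts.items()):
        start = w * window
        size = min(window, length - start)  # last window may be shorter
        if count / size >= max_rate:
            flagged.append((start, start + size, count))
    return flagged

def known_fraction(positions, known_positions):
    """Fraction of observed differences that coincide with known SNP positions."""
    if not positions:
        return 0.0
    return sum(p in known_positions for p in positions) / len(positions)
```

A region with many flagged windows and a low known fraction would be the kind of result that sends me back to the raw reads.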
But let’s assume the quality is respectable. Then I’d start to look at more targeted pieces. I would look at well-known genes. Not necessarily disease genes (although those are certainly in the pool). I would look at how my collection of variations compares across highly studied genes. I might also look across the deeply examined ENCODE data, since that’s been re-examined now as well. Probably I’d look at the genomes of some of the other individuals who have been sequenced for comparison.
I’d probably have to stop at this point and get back to my real work. I also have mixed feelings about how much I want to know about my own “disease” variations right now. So I’d need some more thought about how to process that psychologically before I went to look into the regions that are outside of the flashlight areas.
Step 2: Get someone to build me a genome browser
Yeah, I might be able to follow the GBrowse instructions and build my own. Maybe build my own UCSC Browser, since I know that one so intimately. It might be possible to get away with just creating my own DAS or custom track to load into an existing GBrowse or UCSC Browser, which I could definitely accomplish on my own. But it seems to me that to be most useful over the long term, I’d want my own coordinates and my own database that I could curate as I go along and as new data comes out on variations, etc. I’d want to add my own personal notes to some pieces. So I’d lean toward a full browser, with a curation pipeline, with my own genome as the reference sequence/coordinates. And it strikes me that it would be most time- and cost-effective to have someone do it for me. I’d pay for that as a service.
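For the lighter do-it-yourself option, a custom track is just a small text file. Here is a rough sketch that writes personal notes as a BED-format custom track of the kind the UCSC Browser accepts: a `track` line followed by tab-separated chrom, start, end, and name, with 0-based starts. The function name, track name, and example coordinates are made up for illustration, and the coordinates would be on the reference assembly, not on my own genome.

```python
def bed_custom_track(notes, name="my_notes", description="Personal annotations"):
    """Build BED custom-track text from (chrom, start, end, label) tuples.

    Starts are 0-based and ends are exclusive, as BED expects.
    """
    lines = [f'track name="{name}" description="{description}" visibility=pack']
    for chrom, start, end, label in sorted(notes):
        if start < 0 or end <= start:
            raise ValueError(f"bad interval for {label!r}: {start}-{end}")
        # Whitespace can be read as a field separator, so keep labels as one token.
        lines.append(f"{chrom}\t{start}\t{end}\t{label.replace(' ', '_')}")
    return "\n".join(lines) + "\n"

# Hypothetical example notes; these coordinates are placeholders, not real findings.
example = bed_custom_track([
    ("chr21", 1000, 1001, "recheck in raw reads"),
    ("chr21", 5000, 5600, "possible CNV"),
])
```

The resulting text can be saved to a file and loaded as a custom track, which covers the note-taking part without building a whole browser.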
Step 3: Look closer, and ongoing monitoring
Ok, here’s where I hesitate a bit. I’m honestly not quite sure what I want to know from my genome at this point. I suspect I’d be unable to resist looking at longevity genes–we’ve seen some good long lives and healthy seniors in my family, and I’d want to see if I can hope for that. I’d look at the tone-deaf genes as this is a long-standing problem in my family and probably good for a laugh. I’d check on that curly hair gene to be sure mine was the same variation they’ve seen. But here’s the hard part: do I want to know about the Alzheimer’s genes? Do I want to know about the cancer ones? I’m not sure that I do. Eventually I probably wouldn’t be able to resist this either. I’d look at the NHGRI catalog of GWAS studies. I’d check SNPedia.
And what about the other variations that aren’t in known genes? I’d scan the regions. I’d probably run GRAIL. I’d probably set up a MyNCBI saved search to run regularly for papers that come out on either variations, regions, or diseases I’m concerned about.
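A MyNCBI saved search is set up through the web interface, so this is not that. As a scripted stand-in, here is a small sketch that builds a PubMed query URL for NCBI's E-utilities `esearch` endpoint, limited to records added in the last N days. The function name and the example search terms are hypothetical; the code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def recent_papers_url(terms, days=30, max_results=20):
    """Build an esearch URL for PubMed records matching any of the given terms."""
    query = " OR ".join(f'"{t}"' for t in terms)
    params = {
        "db": "pubmed",
        "term": query,
        "reldate": days,      # only records from the last `days` days
        "datetype": "edat",   # measured by the date the record entered PubMed
        "retmax": max_results,
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"

# Hypothetical watch list: swap in the variations, regions, or diseases of interest.
url = recent_papers_url(["copy number variation", "chromosome 21"])
```

Fetching that URL on a schedule would return the matching PubMed IDs as XML, which is roughly what the saved search would email me.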
But then what? Do I make changes based on what I’m seeing? Do I alter my diet? Do I check my genome before getting prescriptions filled? Do I discuss any findings with my siblings? Do I drive myself insane with genome minutiae? Do I start hanging out with the people doing recreational genomics? Quite frankly, I don’t know.
So that’s what I did on my vacation–planned the analysis of my genome. It was actually a rather fun exercise.
What have I missed? What would you do?