Tag Archives: CNV

Guest Post: CHOP’s new tool, CNV Workshop – Xiaowu Gai

This next post in our continuing semi-regular Guest Post series is from Xiaowu Gai, the Bioinformatics Core Director at CHOP. If you are a provider of a free, publicly available genomics tool, database or resource and would like to convey something to users on our guest post feature, please feel free to contact us at wlathe AT openhelix DOT com.

Thanks to Mary for running a Tip of the Week, “CHOP CNV database”, a couple of months back. The CHOP CNV database is a high-resolution genome-wide survey of copy number variations in a large number (2,026) of apparently healthy individuals. It is publicly accessible and has been widely used by research groups world-wide. I am now pleased to announce the public release of the software system behind it: CNV Workshop. CNV Workshop is a suite of software tools that we have developed over the last few years. It provides a comprehensive workflow for analyzing, managing, and visualizing genome copy number variation (CNV) data.

It can be used for almost any CNV research or clinical project by offering the following capabilities for both individual samples and cohort studies:

CNV identification
Implements a modified circular binary segmentation algorithm that reduces false positives
Fully configurable parameters for sensitivity/specificity management
Individual locus-specific annotations such as position, type of variation, call metrics, and overlap with CNVs of other data sets, including the Database of Genomic Variants.
Functional gene annotations such as genes affected and known disease associations
Accepts user-provided annotations
GBrowse-enabled visuals for querying, browsing, interpreting, and reporting CNVs
Export of results into Excel, XML, CSV, and BED files
Direct links to public resources such as the UCSC Genome Browser, NCBI Entrez, Entrez Gene, and FABLE
Project and Account Management
Authentication and permission scheme that is especially useful for clinical diagnostic settings
Analysis result sharing within and between projects
Simple Web-based administrative interface
Remote access and administration enabled

CNV Workshop currently accepts genotyping array data from Illumina’s 550k, 610- and 660-Quad, and Omni arrays, along with Affymetrix’s 5.0 and 6.0 arrays, and can be easily configured to accept data from other platforms. The package comes preloaded with publicly available reference data from more than 2,000 healthy control subjects (the CHOP CNV Database). CNV Workshop also allows the user to upload already processed CNV calls for annotation and presentation.

The software package is freely available at http://sourceforge.net/projects/cnv/. It is also described in more detail in our recent paper in BMC Bioinformatics.

-Xiaowu Gai

Corn: 85% not corn, and missing big pieces

So I’m all excited about the genome festival that I’m seeing, related to the publication of the new sequence version of corn. You can access the main paper in Science, and there’s a very neat diagram in figure 1 that is like looking across time at the sequence data and into the corn nebula.  But the thing that cracked me up was this line from the abstract:

Nearly 85% of the genome is composed of hundreds of families of transposable elements, dispersed nonuniformly across the genome.

That means 85% of corn isn’t corn!!  And what business do those elements have messing with the genomes??  I am told all the time that messing with plant genomes is wrong and unnatural.  Heh.

For full coverage of the big news today I’ll point you to James and the Giant Corn (appropriately enough) who seems to be the CNN (Corn News Network) of 24-hour coverage of many aspects of the work.

I spent my morning looking over the PLoS Maize Special Collection papers, including the intriguing appetizer:  10 Reasons to be Tantalized by the B73 Maize Genome.  But I spent longer looking at the CNVs and PAVs paper.  I’ve been thinking about CNVs a lot  lately, and was interested to see this covered in a non-mammalian species.

Figure 1 is a nice example of how to use VISTA for effective displays in comparative genomics.  (If you haven’t used VISTA before you might check out our sponsored free tutorial on that–we are currently working with the VISTA team to update that with their new features too.)

There’s a really striking segment of chromosome 6 that appears to be present in one of the strains they examine and absent in the other (illustrated in figure 4).  And it looks like it has genes that are expressed and active in the B73 strain.  The ongoing investigation of that is pretty intriguing as well.

The structural variations are not evenly distributed across the genomes.  Some places have large occurrences, and some are untouched.  It’s clear that just in these two strains there’s a lot more structural diversity than in other species that have been examined:

In the human, rat, dog, mouse, macaque and chimpanzee genomes the average number of CNVs between two individuals is between 15 and 75 [43]–[48]. A high resolution study of eight human genomes [49] revealed only several hundred insertions and deletions, including CNV and PAV sequences, in the comparison of any two human genomes. In contrast, even after very stringent filtering we identified >3,700 CNV or PAV sequences that represent at least 2,000 events between these two maize genomes.

Emphasis mine.  Plants are so much more flexible, apparently….

This is going to lead to some neat clues on heterosis (or hybrid vigor) as the research proceeds with these new tools.  What a great time to be a plant scientist.  There are some very exciting projects coming along with the tools of genomics.

What I couldn’t locate was any reference to a CNV database (like DGV or CHOP CNV) where you can examine the whole set.  I’ll dig through the supplement data to see if I can find out more on that.  But I wanted to get this post out to celebrate the very nice work and collection of papers on this project. Congrats to the teams involved!


Springer, N., Ying, K., Fu, Y., Ji, T., Yeh, C., Jia, Y., Wu, W., Richmond, T., Kitzman, J., Rosenbaum, H., Iniguez, A., Barbazuk, W., Jeddeloh, J., Nettleton, D., & Schnable, P. (2009). Maize Inbreds Exhibit High Levels of Copy Number Variation (CNV) and Presence/Absence Variation (PAV) in Genome Content PLoS Genetics, 5 (11) DOI: 10.1371/journal.pgen.1000734

Schnable, P., Ware, D., Fulton, R., Stein, J., Wei, F., Pasternak, S., Liang, C., Zhang, J., Fulton, L., Graves, T., Minx, P., Reily, A., Courtney, L., Kruchowski, S., Tomlinson, C., Strong, C., Delehaunty, K., Fronick, C., Courtney, B., Rock, S., Belter, E., Du, F., Kim, K., Abbott, R., Cotton, M., Levy, A., Marchetto, P., Ochoa, K., Jackson, S., Gillam, B., Chen, W., Yan, L., Higginbotham, J., Cardenas, M., Waligorski, J., Applebaum, E., Phelps, L., Falcone, J., Kanchi, K., Thane, T., Scimone, A., Thane, N., Henke, J., Wang, T., Ruppert, J., Shah, N., Rotter, K., Hodges, J., Ingenthron, E., Cordes, M., Kohlberg, S., Sgro, J., Delgado, B., Mead, K., Chinwalla, A., Leonard, S., Crouse, K., Collura, K., Kudrna, D., Currie, J., He, R., Angelova, A., Rajasekar, S., Mueller, T., Lomeli, R., Scara, G., Ko, A., Delaney, K., Wissotski, M., Lopez, G., Campos, D., Braidotti, M., Ashley, E., Golser, W., Kim, H., Lee, S., Lin, J., Dujmic, Z., Kim, W., Talag, J., Zuccolo, A., Fan, C., Sebastian, A., Kramer, M., Spiegel, L., Nascimento, L., Zutavern, T., Miller, B., Ambroise, C., Muller, S., Spooner, W., Narechania, A., Ren, L., Wei, S., Kumari, S., Faga, B., Levy, M., McMahan, L., Van Buren, P., Vaughn, M., Ying, K., Yeh, C., Emrich, S., Jia, Y., Kalyanaraman, A., Hsia, A., Barbazuk, W., Baucom, R., Brutnell, T., Carpita, N., Chaparro, C., Chia, J., Deragon, J., Estill, J., Fu, Y., Jeddeloh, J., Han, Y., Lee, H., Li, P., Lisch, D., Liu, S., Liu, Z., Nagel, D., McCann, M., SanMiguel, P., Myers, A., Nettleton, D., Nguyen, J., Penning, B., Ponnala, L., Schneider, K., Schwartz, D., Sharma, A., Soderlund, C., Springer, N., Sun, Q., Wang, H., Waterman, M., Westerman, R., Wolfgruber, T., Yang, L., Yu, Y., Zhang, L., Zhou, S., Zhu, Q., Bennetzen, J., Dawe, R., Jiang, J., Jiang, N., Presting, G., Wessler, S., Aluru, S., Martienssen, R., Clifton, S., McCombie, W., Wing, R., & Wilson, R. (2009). 
The B73 Maize Genome: Complexity, Diversity, and Dynamics Science, 326 (5956), 1112-1115 DOI: 10.1126/science.1178534

Tip of the Week: Fable, text mining for literature on human genes

A couple of weeks ago we brought you a tip of the week about the CHOP CNV Database. The same people who bring you that database also do FABLE (Fast Automated Biomedical Literature Extraction), a literature mining tool. The tool uses an advanced algorithm to find human genes that are directly related to the keywords you search on, and then finds literature on those genes. The tool has some great features and is a great way to quickly find the literature on a gene of interest. Today’s tip will give you a quick intro to the tool.

Tip of the Week: CHOP CNV database

One of the hottest searches we see all the time is for more information on CNVs, or copy number variations.  These intriguing structural variants in our genomes explain a lot of the reason that SNP hunting for complex diseases like schizophrenia and autism wasn’t able to elucidate the problems as most people expected.  These spectrum sorts of conditions were just not going to turn out as straightforward as the sickle-cell variation or the cystic fibrosis stories.

Resources to catalog and look at CNVs have developed.  We have had a tutorial on DGV, the Database of Genomic Variants for some time (subscription required for tutorial).  Just the other day I was looking around at the NCBI tool called dbVar, which has a nice diagrammatic overview of the kinds of structural variations CNVs represent (but I’m not sure I understand how to use the database yet–I’ll keep you posted :) ). Now there is also CHOP CNV.

Today I’ll be introducing you to the CHOP CNV resource.  I heard about it at ASHG a couple of weeks ago, and decided to look into it.  I had remembered hearing about the tool at one of the trainings we did at CHOP, but I wasn’t sure it was publicly available.  Now I’m sure it is!

The publication associated with the CHOP CNV resource provides an overview of the  strategy. The authors highlight the reason they developed this one–to use a uniform technology (Illumina chips to start, and then subsequent validation with other techniques) and to have a large sample set.  They examine the genomes of over 2000 healthy individuals.  The point of looking at healthy folks is that they form the reference set essentially: you can now take the samples from affected patients and subtract the things that healthy folks appear to share.  This helps to narrow down your search for CNVs that might cause disease conditions.  They offer various statistics on the types and sizes of the structural variants observed in the healthy population.  It reminded me of another talk I heard at ASHG called “The first map of dispensable regions in the human genome” by Terry Vrijenhoek et al–which was a cool talk that began with a Facebook chat that had us all giggling–but the serious message was there’s a lot of missing genome healthy people appear to tolerate just fine….

The paper goes on to describe the creation of their web interface.  Although I couldn’t find it mentioned in the paper, I asked one of the authors and my suspicion that it was based on GBrowse was confirmed–I thought the tracks and controls appeared “GBrowsy” to me.  It shows the variations on the graphical display.  The deletions are red, the duplications are blue.  There is also a table that contains the data which you can color code to indicate uniqueness with green.  And the table provides a column that summarizes the genes in that region (if there are some), and links to the UCSC Genome Browser in that region so you can choose to go there and examine the other genomic features in that region.  When you have that loaded at UCSC, the data becomes a custom track that you can then examine with all the UCSC tools, including detailed queries with the table browser.  It’s a nice example of a big data set from a publication getting displayed at UCSC for further query options.
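For readers who want to try the same trick with their own calls, here is a minimal sketch of writing CNVs out as a UCSC custom track in BED format, with deletions red and duplications blue to mirror the display described above. The function name, the tuple layout of the calls, and the track name are my own illustrative assumptions; I don't know what file the CHOP site actually sends to UCSC.

```python
# Sketch only: writes CNV calls as a BED9 custom track for the UCSC Genome Browser.
# The colors mirror the post (deletions red, duplications blue); the input layout
# (chrom, start, end, kind) and all names here are hypothetical.

COLORS = {"del": "255,0,0", "dup": "0,0,255"}


def to_custom_track(calls, name="my_cnvs"):
    """Return BED9 text with a track line that switches on per-item colors."""
    lines = [f'track name="{name}" description="CNV calls" itemRgb="On"']
    for i, (chrom, start, end, kind) in enumerate(calls, 1):
        fields = [
            chrom,
            str(start),            # chromStart (0-based)
            str(end),              # chromEnd (exclusive)
            f"{kind}_{i}",         # item name
            "0",                   # score
            ".",                   # strand not applicable
            str(start),            # thickStart
            str(end),              # thickEnd
            COLORS[kind],          # itemRgb
        ]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


example_calls = [("chr6", 100, 500, "del"), ("chr6", 900, 1500, "dup")]
track_text = to_custom_track(example_calls)
```

The resulting text can be pasted into the custom track box at UCSC, after which the table browser and other tools work on it like any other track.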

Another nice feature of the tabular display is that it also links to FABLE.  FABLE is a literature mining tool (Fast Automated Biomedical Literature Extraction) that will search for papers relating to the genes you find in that region–so you can quickly assess what’s known about a given gene in a CNV region.

They also include a compelling “application” as a way to illustrate how you can use the CHOP CNV resource to make discoveries.  There was a clinical sample of a patient with a number of congenital anomalies.  The CNV detection of the genomic sample indicated that 32 of the 35 variations this patient had existed in the healthy controls–which means that targeting the remaining 3 for further study provides a much more helpful focus on the likely issues.  There were a couple of other examples of utility as well.
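The "subtract what healthy folks share" idea can be sketched in a few lines. This is only an illustration under my own assumptions: the post does not say how the paper decides that a patient CNV matches a control CNV, so the 50% reciprocal-overlap rule, the `CNV` record, and the toy coordinates below are all hypothetical.

```python
# Sketch only: keep patient CNVs that are NOT seen in a healthy-control set.
# The matching rule (same chromosome, same type, >= 50% reciprocal overlap)
# is an assumption for illustration, not the method from the paper.
from typing import NamedTuple


class CNV(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    kind: str   # "del" or "dup"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Shared length as a fraction of each CNV; return the smaller fraction."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def novel_cnvs(patient, controls, min_overlap=0.5):
    """Return the patient CNVs with no sufficiently overlapping control CNV."""
    return [
        p for p in patient
        if not any(reciprocal_overlap(p, c) >= min_overlap for c in controls)
    ]


# Toy data, not real coordinates.
controls = [CNV("chr1", 1000, 5000, "del"), CNV("chr2", 200, 900, "dup")]
patient = [
    CNV("chr1", 1200, 5100, "del"),  # mostly overlaps a control deletion
    CNV("chr2", 200, 900, "del"),    # same region, but different type
    CNV("chr7", 50, 400, "dup"),     # nothing comparable in controls
]
candidates = novel_cnvs(patient, controls)
```

With the toy data, the first patient CNV is filtered out and the other two remain as candidates, which is the same narrowing step as going from 35 variations to 3 in the example above.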

When I asked the CHOP CNV team some questions about their Figure 1 in the paper (it showed what appeared to be lab group names with data sets), I was told that new versions will be coming that will offer some new features–including an option to upload your own samples to compare them to their data set.

If you are interested in structural variations in the genome you should check out the CHOP CNV database.  You might find some helpful information for your project!  I almost forgot to note–you can download all the data as well, and use it with other data you may have or for other analysis tools.

Direct to the site: http://cnv.chop.edu/

Shaikh, T., Gai, X., Perin, J., Glessner, J., Xie, H., Murphy, K., O’Hara, R., Casalunovo, T., Conlin, L., D’Arcy, M., Frackelton, E., Geiger, E., Haldeman-Englert, C., Imielinski, M., Kim, C., Medne, L., Annaiah, K., Bradfield, J., Dabaghyan, E., Eckert, A., Onyiah, C., Ostapenko, S., Otieno, F., Santa, E., Shaner, J., Skraban, R., Smith, R., Elia, J., Goldmuntz, E., Spinner, N., Zackai, E., Chiavacci, R., Grundmeier, R., Rappaport, E., Grant, S., White, P., & Hakonarson, H. (2009). High-resolution mapping and analysis of copy number variations in the human genome: A data resource for clinical and research applications Genome Research, 19 (9), 1682-1690 DOI: 10.1101/gr.083501.108

UCSC Genome Browser on TV

Sometimes I complain about my Tivo.  It thinks I like to cook and that I speak Portuguese.  Neither of these things is correct (not that there’s anything wrong with that).  But it must also know that I like science because it does find some nuggets that are suitable.  And when I got back from a recent road trip of trainings on the UCSC Genome Browser it cracked me up to see the browser on my TV!

The segment is about finding genes related to autism.  As a popular press sort of story it doesn’t quite get all the science right.  There’s some phrasing of the description of autism that I think is incorrect–or misleading as it was described.  But they talk about the importance of collections of DNA from affected families, they interview Mark Daly and Rudy Tanzi, and they show some software that identifies CNVs (copy number variations) and Rudy shows genes on the UCSC Genome Browser.

I like to see working scientists doing this kind of outreach.  I think it is something we need more of.  Click the image to go to the site and watch the piece (about 15 minutes long).


Specific episode and segment link:  http://www.pbs.org/wgbh/nova/sciencenow/0402/04.html

DGV releases a pre-publication data set

I got my newsletter for May from the Database of Genomic Variants, or DGV.  They announce the availability of a large data set of variants from HapMap individuals.  There are more than 8000 variations available in this set.

It’s not peer-reviewed at this point, so keep that in mind.  But if you are eager for new CNVs (copy number variations), you may want to have a look.

This data are released in DGV pre-publication, and we will therefore not incorporate these regions with the rest of the data in DGV (which has all gone through peer-review).
At this stage, the data will be made available through DGV in two ways. The entire data set will be available as a text file for download on the DGV download page, and it will be shown as a separate track in the DGV browser under the heading “Provisional data release from the Genome Structural Variation Consortium”, in a track with the name “NG42M_CNV (CNVE)”.

The data is subject to the “Fort Lauderdale” non-scoop rules:  you can use the data, but the data’s owners reserve the right to publish on global aspects of the data set first. You can see more on the details of use here: http://projects.tcag.ca/variation/ng42m_cnv.php

You can access DGV here: http://projects.tcag.ca/variation/

The newsletter with the details links from the bottom of the homepage.  Here’s a link to that (warning, PDF): http://projects.tcag.ca/variation/DGV_Newsletter.pdf

How do you represent genomes?

Not just the genome, but genomeS. As Jan at Saaien Tist has mentioned, human (and other species) genomes are quite variable. Though the linear representation of genome browsers (like the UCSC Genome Browser, Ensembl, GBrowse and MapViewer among others) makes perfect sense for much annotated data of the genome, structural variations are not so well visualized in a linear representation. And, as we are finding that the human and other species’ genomes are quite variable, we might need to come up with another way to visualize these genomic data beyond the ‘reference genome’ linear model. Jan suggests deBruijn graphs, pictured here. I find some difficulty in ‘visualizing’ how these are going to work for the _other_ annotations in the data. Though this representation looks like it might work great for CNV and the like, it seems to make viewing other types of data (expression, SNP, etc) more complicated. I’m looking forward to seeing how this develops.
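To make the graph idea a little more concrete, here is a toy sketch of a de Bruijn graph built from two short made-up sequences that differ at one base. It is not Jan's code and not how any genome browser works; the function name, the sequences, and the choice of k=4 are assumptions for illustration only.

```python
# Sketch only: a toy de Bruijn graph over (k-1)-mers, built from two sequences.
# Where the sequences differ, a node gets more than one successor (a "bubble").
from collections import defaultdict


def de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


reference = "ACGTTGCA"  # made-up "reference" sequence
variant = "ACGTAGCA"    # same sequence with one base changed
graph = de_bruijn([reference, variant], k=4)

# Nodes with more than one outgoing edge mark where the two genomes diverge.
branch_points = sorted(node for node, nxt in graph.items() if len(nxt) > 1)
```

Here the two sequences share a path until the node `CGT`, split into two branches, and rejoin at `GCA`. That works nicely for the variation itself, but it also shows the concern above: an annotation tied to a linear coordinate has no obvious home once the path forks.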

Or perhaps we’ll be looking at genomes like this (ok, maybe not, but it’s geeky cool).

Summary of webinar “CNVs vs. SNPs: Understanding Human Structural Variation in Disease”

Do you still believe that monozygotic, or identical, twins are really genetically identical?

Or that we are all 99.9% genetically similar to each other? Well I certainly did, and boy was I wrong!

It turns out that CNVs (Copy Number Variations) are causing the “facts” some of us learned in Molecular Biology 101 to be rewritten. If you, like me, thought that what you learned years ago was still true, then there is a great webinar you may want to watch. It is brought to you by Science/AAAS, and it features three prominent experts in genetic variability, Drs. Charles Lee, Lars Feuk and Alexandra Blakemore.

The moderator is Dr. Sean Sanders, who is the Commercial Editor of Science. Even those of you that are up to speed on the current research can find many interesting facts and learn about the new techniques used to study CNVs, or just genetic variability in general. It turns out that CNVs are much more prevalent than was previously thought. You hear so much about SNPs that it seems like they are the source of genetic variability that we should be most concerned about, but CNVs are catching up real fast. This new field is rapidly advancing because of major technology breakthroughs.

All of the panelists present a short talk highlighting the prevalence, importance and experimental limitations of studying CNVs and their role in normal human variability, as well as in disease. They present some of their own data and discuss the future direction of this young field. This is followed by a very interesting question and answer session where they allowed listeners to email their questions. It may even turn out that CNVs are the reason that your personality, IQ, height and weight differ from your colleagues, friends and family. So not only is this an exciting new field, but it is certainly one we can all relate to!

Tip of the Week: Human Genome Structural Variation Viewer

Looking at the NHGRI News feed recently, I noticed this story (below) about a new genomic data collection that intrigued me. I found out about a new resource that I wanted to share as this week’s Tip of the Week. So this ~4 minute movie discusses my path to the Human Genome Structural Variation resource and a quick look at some of the data. But the paper was so influential on my thinking about the genome that I wanted to cover that in more detail in text form as well. So for a quick hit, watch the movie. For more detail, check out the text and links below.  Quick trip to the database: http://hgsv.washington.edu

Researchers Produce First Sequence Map of Large-Scale Structural Variation in Human Genome


….Other recently created maps, such as the HapMap, have catalogued the patterns of small-scale variations in the genome that involve single DNA letters, or bases. However, the scientific community has been eagerly awaiting the creation of additional types of maps in light of findings that larger scale differences account for a great deal of the common genetic variation among individuals and between populations, and may account for a significant fraction of disease. While previous work has identified structural variation in the human genome, a sequence-based map provides much finer resolution and location information….

I spend a lot of time thinking about the official or “reference” human genome sequence. This sequence–the one that was released to all that fanfare a few years back–is a composite of several people. Rather like a “generic” genome.
