Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
RT @bbglab: IGV 2.3 (Integrative Genomics Viewer) can export data to Gitools heat-maps. https://t.co/UAOCIkhZa6 http://t.co/XEcTvx9OIq
RT @NESCent: Sign up for a course in genetic & genomic software! Apply by June10 for the GMOD Summer School, July19-23 in DurhamNC http://t.co/YE6PEPAFCe
RT @MikeTaylor: Very positive steps by Nature Group towards ensuring that more of what they publish is good science. http://t.co/sEVgks7vf7
RT @Chris_Evelo: Amen RT @ldtimmerman: @wilbanks One co-worker at that bioinformatics co once said, ‘Dude, biology is complicated.’ Yeah. #sagecon
RT @jennomics: You know that feeling when the last big publication from your PhD work is finally out? Yep, feeling that right now. http://t.co/aKaX1UMjCQ
BioStar is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at BioStar.
This week’s question was different than many of the questions over at BioStar. But it led me to a neat tool that was new to me for text mining and curation.
I want to annotate all the enzyme names and their kinetic data from given text. Is there any enzyme dictionary that I can use it directly for annotation? How should I annotate kinetic expressions from the text?
See the answers to learn more about the goals of this, and the one about Annotator from NCBO. I can see the utility of that and I’m musing on some projects with that now. But there may be other answers to help out the poster of the question–so if you have other ideas on the enzyme dictionary features do bring them over to the answers.
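For anyone musing on the same kind of project, here is a minimal sketch of what dictionary-based annotation could look like. It is not the Annotator from NCBO or any method from the BioStar answers. The enzyme list, the regular expression for kinetic expressions, and the `annotate` function are all hypothetical and only illustrate the idea of matching names and values in free text.

```python
import re

# Hypothetical mini-dictionary; a real project would load a curated enzyme list.
ENZYME_NAMES = ["hexokinase", "alcohol dehydrogenase", "lactate dehydrogenase"]

# Longest names first so a longer name wins over a shorter overlapping one.
_enzyme_re = re.compile(
    r"\b("
    + "|".join(re.escape(n) for n in sorted(ENZYME_NAMES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# Illustrative pattern for phrases such as "Km of 0.5 mM" or "kcat = 12 s-1".
# This is an assumption about how kinetic values are written, not a standard.
_kinetic_re = re.compile(
    r"\b(Km|kcat|Vmax|Ki)\b\s*(?:=|of|was|is)\s*"
    r"(\d+(?:\.\d+)?)\s*([A-Za-zµ/]+(?:\^?-?\d+)?)"
)


def annotate(text):
    """Return enzyme and kinetic annotations sorted by position in the text."""
    annotations = []
    for m in _enzyme_re.finditer(text):
        annotations.append(
            {"type": "enzyme", "text": m.group(1), "start": m.start(), "end": m.end()}
        )
    for m in _kinetic_re.finditer(text):
        annotations.append(
            {
                "type": "kinetic",
                "parameter": m.group(1),
                "value": float(m.group(2)),
                "unit": m.group(3),
                "start": m.start(),
                "end": m.end(),
            }
        )
    return sorted(annotations, key=lambda a: a["start"])


if __name__ == "__main__":
    for ann in annotate("Hexokinase showed a Km of 0.5 mM for glucose."):
        print(ann)
```

A plain regular expression like this will miss synonyms, abbreviations, and unusual unit notation, which is exactly why a curated dictionary or an ontology-backed annotator is attractive for real curation work.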
I’ve talked a lot about how much I am interested in seeing new visualization strategies for working with the volumes of data we have today–which are certainly not going to stop flowing in. But a more basic problem is simply locating and navigating to the data sets you might want to visualize.
TCGA–The Cancer Genome Atlas–collects large numbers of data sets on various cancers. They collect different types of data: GWAS, expression, protein, and more. But it can be a challenge to keep up with the huge amounts of data that are coming in. They have a portal where you can query the underlying data sets, with many features that you might be interested in. But another group has developed another strategy to access data sets–their roadmap offers a quicker and easier way to assess, and then access, what’s available, as well as providing a more general strategy for organizing access to the files.
I’ll let the team explain with their own video:
But be sure to check out their paper where they explain their strategy in more detail. They provide links to the queries they generate so you can explore that too. And you can consider this method for other types of data sets you might want to navigate as well.
There may be other ways you want to interact with TCGA data, and you can still access their portal for other types of queries. But this offers another way to quickly locate subsets of data sets that you might be interested in exploring with other tools.
Hat tip to Bell Eapen for the notice:
A self-updating road map of The Cancer Genome Atlas. feedly.com/k/XYsYT0 Good read
Robbins, D., Gruneberg, A., Deus, H., Tanik, M., & Almeida, J. (2013). A self-updating road map of The Cancer Genome Atlas Bioinformatics DOI: 10.1093/bioinformatics/btt141
The Cancer Genome Atlas (TCGA) Research Network (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways Nature, 455 (7216), 1061-1068 DOI: 10.1038/nature07385
In our workshops around the world on the UCSC Genome Browser, we talk at the very beginning about the framework for the organization of the data in the graphical representation. We describe that the reference genome–the official released genome–for a species provides the genome coordinates, or positions, that allow the rest of the data to be placed in the viewer at the correct spots. We don’t spend time on how a reference genome comes into existence–we just accept that it’s there for the purposes of the tutorials and focus on how to work with it using the site and the software.
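To make the coordinate idea concrete, here is a small sketch of how annotations could be selected for display once everything shares one coordinate system. It is not how the UCSC software is implemented. The `Feature` type, the feature names, the positions, and the choice of 0-based half-open intervals are all assumptions for illustration.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (an assumption for this sketch)
    end: int    # exclusive
    name: str


def features_in_window(
    features: List[Feature], chrom: str, win_start: int, win_end: int
) -> List[Feature]:
    """Return features on `chrom` that overlap the window [win_start, win_end)."""
    return [
        f
        for f in features
        if f.chrom == chrom and f.start < win_end and f.end > win_start
    ]


if __name__ == "__main__":
    # Made-up features; the names and positions are purely illustrative.
    tracks = [
        Feature("chr17", 100, 200, "geneA"),
        Feature("chr17", 500, 800, "geneB"),
        Feature("chr1", 150, 250, "geneC"),
    ]
    for f in features_in_window(tracks, "chr17", 150, 600):
        print(f.name, f.start, f.end)
```

Because every data set is expressed against the same reference positions, one overlap test is enough to decide what belongs in a given view, whatever the source of the data.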
If you go over and look at the UCSC Genome Browser Gateway page, you’ll see the current default assembly (or version) of the reference genome has several nicknames. One of them is GRCh37. And you can see how that’s a change from the previous nicknames. The Genome Reference Consortium (GRC) for the human genome was assembled (get it?? har har) after the end of the Human Genome Project. Some people may only realize the change from the menu at the UCSC Genome Browser:
We don’t spend a lot of time on the process of getting the reference genome. But if you’ve ever wondered who is responsible for creating the human reference genome–and a bit about how that’s done, and some of the complexities, you should read this interview with Deanna Church in Bio-IT World:
You should also check out that paper they link about the variations on human chromosome 17–it’s a fascinating case study of the challenges of creating a reference from a section that has evolutionary and medical consequences. It might make you think differently about what the reference genome really means. It means it’s the official one that we all agree to use to provide the map coordinates–it doesn’t necessarily mean the one that everyone is walking around with. Some of the other places that have complicated structural features, and that could have real medical implications, are also mentioned in the piece.
As we see more personal genomes come along, that will affect our understanding of genome structure in other important ways too. The article touches on that as well.
Also–you can get a heads-up on when the next assembly is expected. So eventually you’ll see that the UCSC team will offer another menu choice, and the new coordinates will drive what you see in their viewer. It doesn’t happen right away; it takes some time to recreate the mappings. And some annotation tracks take longer to come along from their providers as other groups also have to re-map to the new assembly. But that interview will help you to understand why new assemblies are still coming along, and how that happens.
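If you are wondering why re-mapping takes effort, a toy sketch may help. Real conversions between assemblies are done with dedicated tools such as UCSC's liftOver and its chain files. The code below does not use that tool or its file format. The `BLOCKS` table and the `remap` function are hypothetical, and the numbers are invented to show an insertion and a region with no counterpart in the newer assembly.

```python
from bisect import bisect_right

# Hypothetical aligned blocks between an old and a new assembly:
# (old_start, old_end, new_start), with 0-based half-open old intervals.
BLOCKS = [
    (0, 1000, 0),        # unchanged region
    (1000, 5000, 1200),  # 200 bases were added upstream in the new assembly
    (5500, 9000, 5700),  # old positions 5000-5500 have no counterpart
]
_STARTS = [block[0] for block in BLOCKS]


def remap(pos):
    """Convert an old-assembly position to the new assembly, or None if unmapped."""
    i = bisect_right(_STARTS, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    old_start, old_end, new_start = BLOCKS[i]
    if pos >= old_end:
        return None  # pos falls in a gap between aligned blocks
    return new_start + (pos - old_start)


if __name__ == "__main__":
    for p in (500, 1000, 5200, 5500):
        print(p, "->", remap(p))
```

Every annotation track has to go through a conversion like this, and anything that lands in an unmapped region needs a decision from its provider, which is one reason tracks trail behind a new assembly.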
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
RT @HankGreelyLSJU: This week’s JAMA is on Genomic Medicine. Lots of stuff, much of interest (to me). http://t.co/vXrs6X9Z2I McGruie, McCullough, & Evans esp.
RT @tharris: Best hashtag: #workingwithfrancis; would add @timhortons MT “@bffo: New #job WebDev Team Content Editor http://t.co/WalrwCfbZA #oicr”
RT @sangerinstitute: We’re running an online survey that is trying to capture public attitudes on the use of #genetic information. http://t.co/6CT4p58xzD
Yes.
Am I allowed to be offended when a job vacancy mail (for a professorship in Bioinformatics) starts with “Dear sirs”?
BioStar is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at BioStar.
This week’s highlighted question is my favorite kind–the one that reveals a resource that is new to me.
miRWalk was new to me. I was familiar with miRBase and have turned to that for miRNA information. But I didn’t know there was one that specifically examined the CDS-located targets. So check out the answers–and if you know of any other resources do add them.
This week’s video Tip of the Week offers you a quick tour of GISAID’s resources and their EpiFlu™ database. This is the database you might be hearing about in the news—the one to which researchers submit the new H7N9 influenza sequence data that they are collecting. Originally this initiative was seeded as the “Global Initiative on Sharing Avian Influenza Data” but it has evolved to become the “Global Initiative on Sharing All Influenza Data” to describe a broadened mission to collect any flu data. Researchers around the world can quickly share their discoveries of any types of influenza viruses and make the sequence details as well as epidemiological and clinical data available to other researchers, who then explore and analyze that information. Researchers from various disciplines, including veterinary, virology, bioinformatics, epidemiology, immunology, and clinical analysis access this information. Currently the US CDC and others are using the data to explore the development of new vaccines, antiviral drugs, and diagnostic kits and to learn about the characteristics of the virus isolates which could affect public health policy making. Even before they had access to physical virus samples to test, they could begin assessments using the EpiFlu data. In this tip I’ll show you how researchers submit to and access this global resource. My goal is to show researchers who might want to use the information some of the details, but other people might be curious to have a look under-the-hood too.
For today’s tip I focus on how researchers would use the EpiFlu database, with a quick tour of some features. Recently I signed up for access to the site, which was quickly approved. And then I asked the GISAID team for permission to make the video, which they also quickly granted, and I have asked them to review the movie to make sure I didn’t go out of bounds of the Data Access Agreement. As a registered user, I’m not allowed to show the general public the sequence data itself, but I will show you how researchers would obtain the details they need to take their analyses further. And I did get permission to open one record to illustrate a key point in the record. (Information in that record has been published in the New England Journal of Medicine by the submission team. Reference below.) Later, you can sign up for access to see the details yourself. The data is publicly accessible, as long as you identify yourself and agree to the terms. The terms of use are not designed to be a barrier to access and research; they are in place to give us the freedom and responsibility to use the data appropriately. There have been objections to this sharing model, but going into detail on the history and development of GISAID is not the subject of this post.
The site details
You can learn more about the issues that were a catalyst to the development of GISAID from an editorial published in Nature, cited below. GISAID continues to evolve, and you can learn more about the state of the current initiative and its scientific advisors by visiting their website. Since 2010, the German government, represented by the Federal Ministry of Food, Agriculture and Consumer Protection, has been the official host of the site, and the Federal Research Institute for Animal Health is responsible for the quality of the data in GISAID.
When you are logged into the GISAID site, you’ll have access to a range of features. Related news items are posted. You can see the list of all the other registered users, and you can easily contact them from within the system for questions and collaborations. Most importantly, though, you have access to the relational database component called EpiFlu. This is where researchers can submit new sequences that they isolate. There are many fields for storing crucial metadata. The entry form offers different fields depending on the type of isolate and host. EpiFlu is where other researchers can query for the types of strains, hosts, or submission details they are interested in. These sequences and metadata can be downloaded for use with other tools. There are also some analysis tools provided in the EpiFlu interface. Sequences can be submitted for BLAST analysis or used to generate a multiple-sequence alignment with an installation of Jalview.
In speaking with folks at GISAID last week about their philosophy I learned about upcoming new software they are working on. The GISAID group plans an EpiFlu 2.0 which they are building from scratch. That version will have additional features that enhance connectivity with other resources and support collaborations, along with better scalability. As we continue to see the deluge of sequence data coming in from all kinds of sources in the future, this will really be necessary. I don’t know what the target date is for the next version, but I’ll be keeping an eye out for that as a future tip.
For non-researchers
If you are a member of the public curious about information sources on the flu, please read this excellent guidance on sources of flu information by Maryn McKenna: The New Bird Flu, and How to Read the News About It. Not all news is the same. Let’s be careful out there.
An example of sequence being submitted to GISAID:
The Hangzhou Center for Disease Control and Prevention, has submitted a new case of the novel H7N9 strain to GISAID tiny.cc/ojj2uw
BTW: I also asked how they pronounce their name—it’s like “jees-aid”, if you want to know. I had only seen it written in text form and hadn’t heard it from any sources so I wanted to be sure.
References for the video and post:
Editorial (2006). Boosting access to disease data Nature, 442 (7106), 957-957 DOI: 10.1038/442957a
Butler, D. (2013). Urgent search for flu source Nature, 496 (7444), 145-146 DOI: 10.1038/496145a
Gao, R., Cao, B., Hu, Y., Feng, Z., Wang, D., Hu, W., Chen, J., Jie, Z., Qiu, H., Xu, K., Xu, X., Lu, H., Zhu, W., Gao, Z., Xiang, N., Shen, Y., He, Z., Gu, Y., Zhang, Z., Yang, Y., Zhao, X., Zhou, L., Li, X., Zou, S., Zhang, Y., Li, X., Yang, L., Guo, J., Dong, J., Li, Q., Dong, L., Zhu, Y., Bai, T., Wang, S., Hao, P., Yang, W., Zhang, Y., Han, J., Yu, H., Li, D., Gao, G., Wu, G., Wang, Y., Yuan, Z., & Shu, Y. (2013). Human Infection with a Novel Avian-Origin Influenza A (H7N9) Virus New England Journal of Medicine DOI: 10.1056/NEJMoa1304459
++++++++++++++++++++++++++++++++++
Disclosure: OpenHelix has no financial or scientific relationship with the GISAID Foundation or EpiFlu. I merely approached them as I do many other resources to ask for permission to do a movie. I offered them the opportunity to review my materials because of the sensitivity of this issue and the desire to NOT cause any kind of international public health incident. The goal of this is to show people the insides of a database or resource that they may not be familiar with.
It’s really crucial for scientists of all stripes to have some computational skills in their toolbelts. In genomics, the deluge of data that needs to be sorted, sifted, and analyzed is not going to stop–so it’s even more urgent that everyone gets some comfort and capability to work with the data, do some scripting to solve some issues you might have, and assess the tools you might want to be using.
For various reasons, women in science may not have pursued much coding. But there are new efforts to bring women researchers up to speed on some of these skills by the Software Carpentry team. There’s a workshop coming up in Boston in June that is a great opportunity if you’ve been thinking about tackling some of the basics.
Welcome to our Friday feature link collection: SNPpets. During the week we come across a lot of links and reads that we think are interesting, but don’t make it to a blog post. Here they are for your enjoyment…
RT @gauravjain49: Ewan Birney`s current stand on #ENCODE findings http://t.co/vzEW0GSM1F BBC interview with Ewan Birney @ENCODE_NIH @EncodeDCC @FakeEncode
And check out the interesting discussion at the tweet link (@takluyver). RT @takluyver: The paper on my species name matching software has been published! http://t.co/IRsLrw4nAe (and it’s open access)
Scientists–they are watching you: RT @carlzimmer: Here’s a free program for statistical power. Journalists, plug in values & hold feet to the fire! http://t.co/XRSbm6BYek via @Keith_Laws
RT @mendelspod: Antony Evans of the Glowing Plants Project talks about ethical/policy hurdles in today's show. http://t.co/Ey142V4O7e #GMOMonday