What’s Your Problem? Open Thread

Welcome to the “What’s Your Problem?” (WYP) open thread. The purpose of this entry is to allow the community to ask questions on the use of genomics resources. Think of us as a virtual help desk. If you have a question about how to access a certain kind of data, or how to use a database, or what kind of resources there are for your particular research problem, just ask in the comments. OpenHelix staff will keep watch on the comment threads and answer those questions to the best of our knowledge. Additionally, we encourage readers to answer questions in the comments too. If you know the answer to another reader’s question, please chime in! The “WYP” thread will be posted every Thursday and remain at the top of the blog for 24 hours. Questions or problems asked on Thursday will be answered on Thursday to the best of our ability. You can leave questions on other days of the week, but the answer might not come that day.

We’d also like to invite resource providers to let us know if they have something new to talk about, or something they want to mention to the bioinformatics community. We’ve had some people email us because they weren’t sure if they should post something, and we want to say that’s fine.

So What’s Your Problem? And What’s Your Solution? :)

You can keep up with this thread by remembering to check back, by subscribing to the RSS comments feed to this WYP post or by subscribing to be notified by email of new comments to the post (use checkbox at end of comment form, you can unsubscribe later). If you want to be notified of future WYP posts (every Thursday), you can subscribe to the WYP feed.

10 thoughts on “What’s Your Problem? Open Thread”

  1. Bob

    I was looking for info on two things. One, I would like to know how (or if) I can enter a query sequence and my email address into BLAST and have it email me when new homologous proteins are found in newly sequenced genomes. Also, I am looking for a tool/program that can find silent restriction sites if given a DNA sequence. I have one that finds them, but only when a single base is changed in my DNA sequence, not if two or three silent mutations need to be made to make the restriction site.

    Any help will be appreciated.

    1. Jennifer Post author

      Hi Bob,

      Thanks for stopping by with your questions! I’ll give them my best ‘quick shot’ now, but will also continue to think & ask Mary & Trey to chime in. Any readers with ideas are also more than welcome to share their ideas too!

      My first thought for your BLAST question is saving your BLAST search to your My NCBI account & requesting that it be rerun on some regular basis – you can do that with searches for a LOT of NCBI’s databases. I’m not seeing a way to rerun BLAST searches through My NCBI, but when I went over to BLAST to check into this I noticed a couple of usage tips that I think also might be of interest to you. Check out the ‘How to save custom search pages.’ and ‘How to Search Custom Databases in Web-Blast Using Entrez Queries.’ sections on this page: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastTips#0 . If you are signed into My NCBI, and save a BLAST search, under the ‘Saved Strategies’ tab there is an option to ‘download the search’. It appears the resulting text file gives you the syntax of your saved search. I’d guess with a bit of coding you could use the provided syntax to create an automatic query of your dreams. This level of coding is out of my league, but you might be able to get help from your local bioinformatics support team, or from NCBI itself, if you need.
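The “bit of coding” mentioned above might look something like the following minimal Python sketch. It assumes you rerun the search yourself on some schedule and save the hits in BLAST’s 12-column tabular format (the `-outfmt 6` layout of BLAST+). The function names and the e-value cutoff are illustrative only and are not part of any NCBI tool. The sketch just reports subject IDs that were not seen in an earlier run; sending the email is left out.

```python
def parse_tabular_hits(text, max_evalue=1e-5):
    """Return the set of subject IDs in BLAST tabular output that pass the e-value cutoff."""
    hits = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column hit line
        subject_id = fields[1]      # column 2: subject (hit) ID
        evalue = float(fields[10])  # column 11: e-value
        if evalue <= max_evalue:
            hits.add(subject_id)
    return hits


def new_hits(previous_ids, current_text, max_evalue=1e-5):
    """Return the sorted subject IDs in the current run that were not seen before."""
    current = parse_tabular_hits(current_text, max_evalue)
    return sorted(current - set(previous_ids))


if __name__ == "__main__":
    # Made-up example rows, purely for illustration.
    old_run = "query1\tprotA\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t400\n"
    new_run = old_run + "query1\tprotB\t45.0\t180\t90\t3\t5\t184\t10\t190\t2e-20\t95\n"
    seen = parse_tabular_hits(old_run)
    print(new_hits(seen, new_run))
```

In practice the set of previously seen IDs would be stored in a file between runs, and anything returned by `new_hits` would trigger the notification.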

      Another thought is that a lot of resources have incorporated BLAST utilities. It might be easier to go to your favorite resource & run your BLAST search from there, rather than the NCBI BLAST site itself. For example, I can imagine you could program the UCSC Table Browser to do a BLAT search that meets your needs, and perhaps the Galaxy workflow resource would be useful in such analyses as well. We’ve got full tutorials on all the resources that I’ve mentioned (NCBI BLAST, overview of NCBI, UCSC intro & advanced & Galaxy) and then some, which might help you out with some of the usage details.

      On your restriction site question: again a UCSC Table Browser query comes to mind because I know they have restriction enzyme information over there. So do several other genome browsers, including Ensembl, GBrowse, IMG, etc. But I wonder if some sort of motif finder would be a creative way to specifically find silent sites. Many RE tools that I’ve used are based on exact sequences, but motif tools, such as those provided by the MEME suite of tools (http://meme.sdsc.edu/meme/), often can handle motifs that are slightly ‘off exact’. The Database of Transcriptional Start Sites (DBTSS, http://dbtss.hgc.jp/) also has some motif finding tools that might allow you to do what you want. We are just updating our tutorial on that now, so expect to see it soon in our catalog.
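For readers comfortable with a little scripting, the multi-base silent-site search Bob describes can also be sketched directly. The Python below is a rough illustration, not an existing tool: it assumes an in-frame coding sequence containing only A, C, G and T, uses the standard genetic code, and checks only the strand given (which is enough for palindromic sites such as EcoRI’s GAATTC). The function name `silent_sites` and the `max_changes` parameter are made up for this example.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def silent_sites(cds, site, max_changes=3):
    """Find places where synonymous codon changes would create `site`.

    Returns a list of (0-based start, number of base changes, mutated sequence).
    """
    cds, site = cds.upper(), site.upper()
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of 3")
    results = []
    for start in range(len(cds) - len(site) + 1):
        end = start + len(site)
        first_codon, last_codon = start // 3, (end - 1) // 3
        new_codons, total = [], 0
        for ci in range(first_codon, last_codon + 1):
            codon = cds[3 * ci:3 * ci + 3]
            best = None
            for cand, aa in CODON_TABLE.items():
                if aa != CODON_TABLE[codon]:
                    continue  # would change the protein, so not silent
                # The candidate must match the site wherever the two overlap.
                fits = all(
                    cand[k] == site[3 * ci + k - start]
                    for k in range(3)
                    if start <= 3 * ci + k < end
                )
                if fits:
                    changes = sum(a != b for a, b in zip(cand, codon))
                    if best is None or changes < best[0]:
                        best = (changes, cand)
            if best is None:
                break  # this codon cannot be made to fit silently
            total += best[0]
            new_codons.append(best[1])
        else:
            # Skip sites that already exist (0 changes) or need too many edits.
            if 0 < total <= max_changes:
                mutated = (cds[:3 * first_codon] + "".join(new_codons)
                           + cds[3 * last_codon + 3:])
                results.append((start, total, mutated))
    return results


if __name__ == "__main__":
    # Met-Glu-Phe-stop; an EcoRI site needs two silent changes here.
    print(silent_sites("ATGGAGTTTTAA", "GAATTC"))
```

Because each codon is handled independently, the search stays fast even when two or three silent changes are needed to build one site.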

      Good luck, and please let us know what solutions you find that work for you!

  2. Fred Mills

    I have a specific question re UCSC custom tracks. I’d like to make tracks that are in color, but which also show changes in amplitude, or intensity, like the ENCODE data for histone mods, CTCF, etc. How to?

    Many thanks,

  3. Mary

    @Bob on restriction enzymes: Yeah, UCSC has a restriction enzyme track. You can go to Human 2006 for example, look at the Mapping and Sequencing tracks, and there’s one called Restr Enzymes. You can see the details of that here: http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=165395019&c=chrX&g=cutters And you can look at their comments on the displays for the ambiguous bases in the description. Does that help? If not, tell me more about the program you have used and I can see if I can think of something else more like that.

  4. Mary

    @Fred: The basic foundations of custom tracks can be found here–you need to have some of the aspects of this to do any track. http://genome.ucsc.edu/goldenPath/help/customTrack.html

    But the basic track may not be right for your data. If you want the type that displays like the histograms you want one of the WIG or wiggle track styles, or maybe the BAM/SAM for some of your data.
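As a rough illustration of the wiggle option, the Python sketch below builds the text of a small fixedStep wiggle custom track with a colored signal. The track name, description, coordinates and values are made-up placeholders, and the helper function is hypothetical; only the general layout (a `track type=wiggle_0` line with a `color` attribute, then a `fixedStep` declaration followed by one value per line) follows the UCSC wiggle format. Check the format documentation before relying on it.

```python
def wiggle_track(name, description, chrom, start, step, values, color=(200, 0, 0)):
    """Build the text of a fixedStep wiggle custom track (start is 1-based)."""
    r, g, b = color
    lines = [
        # Track line: the color attribute sets the color of the signal graph.
        f'track type=wiggle_0 name="{name}" description="{description}" '
        f"visibility=full color={r},{g},{b}",
        # Data declaration: one value every `step` bases, each covering `step` bases.
        f"fixedStep chrom={chrom} start={start} step={step} span={step}",
    ]
    lines.extend(str(v) for v in values)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder signal values, purely for illustration.
    print(wiggle_track("mysignal", "my signal track", "chrX", 1000, 100,
                       [0.5, 1.2, 3.8, 2.1]), end="")
```

The resulting text can be pasted into the custom track form or saved to a file and uploaded.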

    Another genome-wide display style that is suitable for some data is the Genome Graphs: http://genome.ucsc.edu/goldenPath/help/hgGenomeHelp.html

    You might also check out the community custom tracks to see if anyone did your same type of data, and you could see how they structured it. http://genome.ucsc.edu/goldenPath/customTracks/custTracks.html

    Check those out for options. If you have some additional questions after reading a bit, search the mailing list to see if there has been useful discussion on that: http://genome.ucsc.edu/contacts.html And the folks on that UCSC mailing list are hugely helpful if you have additional questions about how to structure the tracks if it’s not available from those docs.

  5. Trey

    @fred, to point you directly to what I think you are talking about in the custom track documentation,

    you can change the color of your individual items using a track attribute: itemRgb=On

    for example: track name=mytrack description="my track" itemRgb=On

    Then color the individual items as you need with the RGB color system (e.g. 220,0,220).

    This works for the BED format, but might be different if you need a different format like those suggested by Mary.
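One way to combine color with intensity in BED is to fade each item’s color according to its signal. The Python sketch below is only an illustration under stated assumptions: the helper name, the 0-to-1 intensity scale and the example regions are invented here, while the nine-column BED layout with `itemRgb` in the last column and the `itemRgb=On` track attribute come from the custom track documentation.

```python
def bed_with_colors(items, base_color=(220, 0, 220),
                    name="mytrack", description="my track"):
    """Build a BED9 custom track whose item colors fade with intensity.

    items: iterable of (chrom, start, end, label, intensity), intensity in 0..1.
    """
    lines = [f'track name={name} description="{description}" itemRgb=On']
    for chrom, start, end, label, intensity in items:
        # Blend from white (intensity 0) to the base color (intensity 1).
        rgb = ",".join(str(round(255 - (255 - c) * intensity)) for c in base_color)
        score = round(1000 * intensity)  # BED scores run from 0 to 1000
        # Columns: chrom, start, end, name, score, strand,
        #          thickStart, thickEnd, itemRgb
        lines.append(
            f"{chrom}\t{start}\t{end}\t{label}\t{score}\t.\t{start}\t{end}\t{rgb}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder regions, purely for illustration.
    regions = [("chrX", 1000, 1500, "strong", 1.0),
               ("chrX", 2000, 2500, "absent", 0.0)]
    print(bed_with_colors(regions), end="")
```

BED items are drawn as blocks rather than as a graph, so this shows intensity by shade only; a wiggle track is the better fit when the height of the signal matters.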

  6. gsgs

    I want one full genome, all the 3G of nucleotides, from the 1000 genomes project, downloaded in a file.

    Then 999 files (or how many are available)
    for the other genomes, all aligned to the first,
    but not all the nucleotides but only the
    positions where they differ from the first one.
    In computer-readable format.

    Or maybe only one chromosome is sufficient, e.g. Y

    Only SNPs are sufficient, positions with no
    other differences in the neighborhood.
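If aligned sequences were already in hand, the “differences only, isolated SNPs” output described above could be sketched roughly as follows. This Python example is hypothetical: it assumes two sequences already aligned to the same length, and the function name and the `window` parameter (how far away another difference must be) are invented for illustration. It does not download or align any 1000 Genomes data.

```python
def isolated_snps(reference, sample, window=5):
    """List mismatches that have no other mismatch within `window` bases.

    Returns (1-based position, reference base, sample base) tuples.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    # 0-based positions of every difference between the two sequences.
    diffs = [i for i, (r, s) in enumerate(zip(reference, sample)) if r != s]
    result = []
    for k, pos in enumerate(diffs):
        far_from_prev = k == 0 or pos - diffs[k - 1] > window
        far_from_next = k == len(diffs) - 1 or diffs[k + 1] - pos > window
        if far_from_prev and far_from_next:
            result.append((pos + 1, reference[pos], sample[pos]))
    return result


if __name__ == "__main__":
    ref = "ACGTACGTACGTACGTACGT"
    alt = "ACTTACGTACGTGGGTACGT"  # one isolated change, then two adjacent ones
    print(isolated_snps(ref, alt))
```

Only the first change is reported here, because the other two sit next to each other and so fail the neighborhood test.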

  7. Mary

    Well, you can go over to the 1000 genomes site to the data area to access the information: http://www.1000genomes.org/page.php?page=data

    If you go to the browser you should be able to visualize all the comparative data that is released: http://browser.1000genomes.org/index.html But be sure to note their warning on the SNP ids over on the upper right:
    Ensembl-based browser provides early access to 1000genomes data

    In order to facilitate immediate analysis of the 1000genomes data by the whole scientific community, this browser (based on Ensembl) integrates the SNP calls from the March 2010 release. All of this data has been submitted to dbSNP, and once rsid’s have been allocated, will be absorbed into the UCSC and Ensembl browsers according to their respective release cycles. Until that point any SNP id’s on this site are temporary and will NOT be maintained.

  8. gsgs

    what I have so far:

    when I click on
    and I choose CHRY, I get this subdirectory:

    FTP directory /genomes/H_sapiens/ARCHIVE/BUILD.36.3/CHR_Y/ at http://ftp.ncbi.nih.gov

    Up one level

    03/14/2008 12:00 153,864 hs_alt_chrY_Celera.asn.gz
    03/14/2008 12:00 2,874,402 hs_alt_chrY_Celera.fa.gz
    03/14/2008 12:00 4,074,665 hs_alt_chrY_Celera.gbk.gz
    03/14/2008 12:00 48,082 hs_alt_chrY_Celera.gbs.gz
    03/17/2008 12:00 3,057,950 hs_alt_chrY_Celera.mfa.gz
    03/14/2008 12:00 156,296 hs_alt_chrY_HuRef.asn.gz
    03/14/2008 12:00 5,491,065 hs_alt_chrY_HuRef.fa.gz
    03/14/2008 12:00 7,785,563 hs_alt_chrY_HuRef.gbk.gz
    03/14/2008 12:00 100,498 hs_alt_chrY_HuRef.gbs.gz
    03/17/2008 12:00 5,842,179 hs_alt_chrY_HuRef.mfa.gz
    03/04/2008 12:00 299,047 hs_ref_chrY.asn.gz
    03/04/2008 12:00 7,534,471 hs_ref_chrY.fa.gz
    03/04/2008 12:00 10,703,507 hs_ref_chrY.gbk.gz
    03/04/2008 12:00 176,727 hs_ref_chrY.gbs.gz
    03/05/2008 12:00 8,008,160 hs_ref_chrY.mfa.gz

    but apparently none of these files are aligned, they are only ~26MB,
    but hapmap-files refer to positions >57M

    I also got hg18 from UCSC, 60MB, but the positions don’t match (?)

    genbank has info on the builds,

    so build 36, CHRY has length
    57772954 , NC_000024.8
    but the link has only the info, not the nucleotides.

    they have build 36 there and build 37, but not build 36.3

    I’m trying to match the positions of the files in

    I don’t understand the meaning of column 2 in those files “allele”
    it doesn’t match hg18 at those positions nor does it match the letter-pairs

    the many letter-pairs at the end of the line are presumably different
    people in that group samples on 2 different machines ?
