Tip of the Week: Getting flanking sequence

In an earlier What’s Your Problem thread, a researcher had hundreds of SNP locations and wanted an easy way to obtain the flanking sequence for each one without visiting every location in the UCSC Genome Browser and eyeballing it. There are probably a few ways to do this, but I found that Galaxy was a good place to start. So this week’s tip takes two SNP locations on the human genome, obtains the flanking sequence for those locations, and returns a file that can be saved as a spreadsheet or as text, or turned back into a UCSC Genome Browser custom track that can be uploaded, viewed and searched at UCSC. The process will differ a bit for individual researchers depending on the data and how the Excel worksheet or file is laid out, but hopefully you’ll get the idea. The steps are:
1. Upload your file (tab-delimited text)
2. Convert file to the ‘interval’ format
3. Cut out any columns of data from original file to save for later use.
4. Get flanking chromosomal locations (then merge upstream and downstream records into one record)
5. Get flanking sequence
6. Paste data columns from step 3 to the data columns (chromosomal location and sequence) from step 5.

Voilà, now you have a tab-delimited text file that can be opened in Excel, made into a custom track (in Galaxy), etc.
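For readers who would rather script steps 4 through 6 than click through them, here is a minimal Python sketch of the same idea. It is not what Galaxy runs internally, just an illustration under some assumptions: SNP positions are 1-based, the output interval is 0-based and half-open (BED-style), and the genome is already in memory as a plain dictionary of chromosome name to sequence string. The function names (`flank_interval`, `add_flanks`, `to_tsv`) and the toy data are made up for this example.

```python
import csv
import io


def flank_interval(pos, flank, chrom_len):
    """Return a 0-based, half-open (start, end) interval covering the SNP at
    1-based position `pos` plus `flank` bases on each side, clipped to the
    ends of the chromosome."""
    start = max(0, pos - 1 - flank)
    end = min(chrom_len, pos + flank)
    return start, end


def add_flanks(rows, genome, flank):
    """For each input row [chrom, pos, *extra_columns], return
    [chrom, start, end, sequence, *extra_columns].

    `genome` is assumed to be a dict mapping chromosome name to its sequence."""
    out = []
    for chrom, pos, *extra in rows:
        seq = genome[chrom]
        start, end = flank_interval(int(pos), flank, len(seq))
        out.append([chrom, start, end, seq[start:end], *extra])
    return out


def to_tsv(rows):
    """Serialise rows as tab-delimited text, ready for a spreadsheet."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Toy example: a made-up 20-base "chromosome" and one made-up SNP record.
toy_genome = {"chr1": "ACGTACGTACGTACGTACGT"}
toy_snps = [["chr1", "10", "rs_demo"]]
print(to_tsv(add_flanks(toy_snps, toy_genome, flank=3)), end="")
```

With a flank of 3, the toy SNP at position 10 yields the interval 6 to 13 and the sequence GTACGTA, with the SNP base in the middle and the extra column carried along at the end of the row. For a real genome you would need to load the sequence from a file rather than hold it in a literal string, which this sketch deliberately leaves out.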

Any suggestions on other methods for doing this?

(OpenHelix does training on Galaxy and UCSC Genome Browser).

4 thoughts on “Tip of the Week: Getting flanking sequence”

  1. Lon Phan


    dbSNP has various report formats (FASTA, XML, and ASN.1) with flanking sequences that can be retrieved through web searches or as FTP downloads. The easiest to use is probably the FASTA format, which can be selected from the “Display” option on the EntrezSNP web search (http://www.ncbi.nlm.nih.gov/sites/snp/limits) or retrieved programmatically through the Eutils API (http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=coursework&part=eutils).

    Various dbSNP report formats can also be retrieved in bulk with dbSNP Batch Query (http://www.ncbi.nlm.nih.gov/SNP/batchquery.html) using variant IDs, or downloaded from the FTP site (ftp://ftp.ncbi.nih.gov/snp/organisms/) by organism and chromosome.

    Please feel free to contact dbSNP (snp-admin@ncbi.nlm.nih.gov) if you have any questions or suggestions.



  2. Trey Post author

    Thank you Lon. It is indeed a great way to get batch data and flanking regions. I’ve done it, and it works perfectly well. Thank you!

    In this particular case, the researcher wanted a specific number of bps upstream and downstream, in tab-delimited form (for a spreadsheet) alongside the other data he had. Not sure if that’s possible with dbSNP, but I should try.

    Anyway, that is a great way to get batch flanking region data. I might make it my next tip actually. I could do a series on getting sequence data of various forms; flanking regions of SNPs from dbSNP could be a good first one.


  3. Ria

    This was a very helpful tip, thanks! I’m in the middle of doing this myself, for over 7 million potential SNP locations. I’ve actually written some Perl scripts to help with this, using Bio::Perl. If one has access to enterprise-grade server or cluster computational resources, that’s likely to be the fastest option.

