BioStar is a site for asking, answering and discussing bioinformatics questions. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday* we will be highlighting one of those questions and answers here in this thread. You can ask questions in this thread, or you can always join in at BioStar.

BioStar Question of the Week:

Dump upstream sequence. I am looking for transcription factor motifs. I have a list of RefSeq IDs of genes that I am interested in. How would I export a multi-FASTA of all sequences from TSS to 1000bp before TSS?…

Highlighted Answer:

Answer from Ian: I definitely endorse the use of Galaxy due to its flexibility in handling genome coordinate based data. If you would like the coordinates of a particular RefSeq transcript (NM_xxxxxx), you can also extract them from the UCSC Table Browser.

  • http://genome.ucsc.edu/
  • select ‘Table Browser’ from the left-hand side panel
  • select mammal/human/hg18 from the top row of options
  • group: ‘genes and gene prediction tracks’; track: ‘RefSeq genes’
  • get output

You can load the resulting file into Galaxy and retrieve the lines of information you want by comparing your RefSeq IDs to the second column of the table browser data.
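If you would rather do this comparison outside Galaxy, it can be sketched in a few lines of Python. This is only an illustration, assuming the Table Browser output is tab-separated with a header line starting with "#"; the function name filter_by_refseq_id and the rows in EXAMPLE_TABLE are made up for the example and are not real annotations.

```python
import csv

# Made-up rows in the column order of the Table Browser output
# (bin, name, chrom, strand, txStart, txEnd); real rows have more columns.
EXAMPLE_TABLE = (
    "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd\n"
    "585\tNM_000001\tchr1\t+\t5000\t9000\n"
    "586\tNM_000002\tchr2\t-\t12000\t15000\n"
    "587\tNM_000003\tchr3\t+\t300\t900\n"
)


def filter_by_refseq_id(table_lines, wanted_ids):
    """Return the rows whose second column is one of the wanted RefSeq IDs."""
    wanted = set(wanted_ids)
    kept = []
    for row in csv.reader(table_lines, delimiter="\t"):
        # Skip blank lines and the "#..." header line.
        if not row or row[0].startswith("#"):
            continue
        # The second column (index 1) holds the transcript ID.
        if len(row) > 1 and row[1] in wanted:
            kept.append(row)
    return kept


if __name__ == "__main__":
    my_ids = ["NM_000001", "NM_000003"]
    for row in filter_by_refseq_id(EXAMPLE_TABLE.splitlines(), my_ids):
        print("\t".join(row))
```

For a real download you would pass an open file handle instead of EXAMPLE_TABLE.splitlines(), and read your own ID list from a file.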

Just remember that txStart = TSS if the gene is on the + strand, and txEnd = TSS if the gene is on the – strand.
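To make the strand rule concrete, here is a minimal sketch of how the 1000bp upstream window could be computed from a row's strand, txStart and txEnd. It assumes the 0-based, half-open coordinates that UCSC tables use; the function name upstream_window and the numbers in the usage lines are invented for illustration.

```python
def upstream_window(strand, tx_start, tx_end, length=1000):
    """Return (start, end) of the region upstream of the TSS.

    Coordinates are treated as 0-based, half-open, as in UCSC tables.
    """
    if strand == "+":
        # TSS is at txStart; upstream lies at smaller coordinates.
        # Clamp at 0 so the window never runs off the chromosome start.
        return max(0, tx_start - length), tx_start
    if strand == "-":
        # TSS is at txEnd; upstream lies at larger coordinates.
        # Not clamped to the chromosome length, which is unknown here.
        return tx_end, tx_end + length
    raise ValueError("strand must be '+' or '-', got %r" % (strand,))


if __name__ == "__main__":
    print(upstream_window("+", 5000, 9000))    # (4000, 5000)
    print(upstream_window("-", 12000, 15000))  # (15000, 16000)
    print(upstream_window("+", 300, 900))      # (0, 300)
```

Note that a window taken on the – strand is still reported in forward-strand coordinates, so the extracted sequence would need to be reverse-complemented to read in the direction of transcription.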

I just want to point out that we go through this in both the tutorial and the exercises for Galaxy (a sponsored, free tutorial). Galaxy is excellent for this.

Check out the other answers, or provide one if you have insights into the problem.

