BioStar is a site for asking, answering and discussing bioinformatics questions and issues. We are members of the community and find it very useful. Often questions and answers arise at BioStar that are germane to our readers (end users of genomics resources). Every Thursday we will be highlighting one of those items or discussions here in this thread. You can ask questions in this thread, or you can always join in at BioStar.
This week’s question is interesting to me because I’m increasingly aware of the challenges of using tools that focus on humans or mammals on other types of genomes. There are some trickier genomes out there, and some of them are important food crops. But there are also just other types of genome organizations that could be challenging to assess with human-centric tools and assumptions.
I’m currently working on a project involving SNP calling of resistance genes (R-genes) in 96 potato cultivars. We are interested in identifying SNPs (and INDELs) in the NB domain. The NB domains were enriched using PCR and Illumina paired-end read libraries were created for each cultivar. After quality checking (adapter trimming and read quality trimming) the reads were mapped against the potato DM v4.03 genome using NextGenMap. SNP calling is (to be) performed on the known NB-region coordinates.
And now it is time for my question: how should we do the SNP calling? Potato is a tetraploid organism, so theoretically using samtools mpileup should not cover all SNPs/alleles, because (correct me if I’m wrong) samtools is designed for diploid organisms. After a Google search I found three SNP callers (QualitySNPng, freebayes and UNEAK) which should be able to call SNPs in polyploid organisms. My question to the community is whether anyone has experience in polyploid (tetraploid) SNP calling and whether there is a recommended SNP caller (or whether they all behave similarly), or whether we should perhaps only call SNPs which are called by multiple SNP callers.
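To make the "called by multiple SNP callers" idea concrete, here is a minimal Python sketch. It assumes each caller's output has already been normalized to the same reference coordinates and allele representation and reduced to `(chrom, pos, ref, alt)` tuples. The function name `consensus_calls`, the `min_callers` parameter, and the toy variants are all hypothetical and are not taken from the original question.

```python
from collections import Counter


def consensus_calls(call_sets, min_callers=2):
    """Return variants reported by at least `min_callers` of the call sets.

    Each call set is an iterable of (chrom, pos, ref, alt) tuples.
    """
    counts = Counter()
    for calls in call_sets:
        # set() ensures a caller is counted at most once per variant
        counts.update(set(calls))
    return sorted(v for v, n in counts.items() if n >= min_callers)


# Hypothetical toy calls for illustration only, not real data
freebayes_calls = [("chr01", 1200, "A", "G"), ("chr01", 1350, "C", "T")]
qualitysnpng_calls = [("chr01", 1200, "A", "G"), ("chr01", 1500, "G", "A")]
uneak_calls = [("chr01", 1200, "A", "G"), ("chr01", 1350, "C", "T")]

all_sets = [freebayes_calls, qualitysnpng_calls, uneak_calls]
shared_by_two = consensus_calls(all_sets, min_callers=2)
shared_by_all = consensus_calls(all_sets, min_callers=3)
```

Requiring agreement trades sensitivity for specificity: a strict intersection will probably drop some real low-dosage variants that only one caller is tuned to detect, so the value of `min_callers` is itself a judgment call.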
Furthermore, we are uncertain about which parameters to use in SNP calling and filtering. The major problem we face is that we are uncertain when we can actually call a SNP; what allele fraction should we use? Or should we call a SNP if it has at least x (high quality) reads supporting it? And what quality score (QUAL as given in the VCF output format, or any other quality measure) is high enough to call a SNP with high confidence?
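One way to reason about the allele-fraction and "at least x reads" questions is a simple binomial model. In a tetraploid, a heterozygous site carries the alternate allele on 1, 2, or 3 of the 4 copies, so the expected alternate-read fractions are 0.25, 0.5, and 0.75. The sketch below estimates how much depth is needed to see a given number of supporting reads at the lowest dosage. It is only a rough guide under stated assumptions: reads are treated as independent draws, and sequencing error, PCR enrichment bias, and mapping bias are ignored. The function names and the 0.95 power target are illustrative choices, not recommendations from the original thread.

```python
from math import comb


def expected_alt_fractions(ploidy=4):
    """Expected alternate-allele fractions for heterozygous dosages 1..ploidy-1."""
    return [dosage / ploidy for dosage in range(1, ploidy)]


def prob_at_least(k, depth, fraction):
    """P(X >= k) for X ~ Binomial(depth, fraction)."""
    return sum(
        comb(depth, i) * fraction**i * (1 - fraction) ** (depth - i)
        for i in range(k, depth + 1)
    )


def min_depth_for_detection(min_alt_reads, fraction, power=0.95, max_depth=1000):
    """Smallest depth at which at least `min_alt_reads` alternate reads are
    observed with probability >= `power`, or None if not reached by max_depth."""
    for depth in range(min_alt_reads, max_depth + 1):
        if prob_at_least(min_alt_reads, depth, fraction) >= power:
            return depth
    return None


fractions = expected_alt_fractions(4)
simplex = fractions[0]  # one alternate copy out of four
# Depth needed to see at least 5 alternate reads at a simplex site
depth_needed = min_depth_for_detection(5, simplex)
```

A hard allele-fraction cutoff well above the simplex fraction would systematically miss single-copy alleles, which is one reason a ploidy-aware caller may be preferable to a fixed threshold.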
So far we haven’t been able to find any satisfying answers to these questions and are therefore uncertain how to proceed. Thanks in advance to anyone taking the time to read this and to anyone who is willing to help us with our problems.
The discussion has begun with one answer, but if you have other thoughts or know of tools that fit the bill for this, please bring your information over.