Removing contamination from genomic sequences based on vector reference libraries
Citation
Bağcı, C., and Allmer, J. (2012, April 19-22). Removing contamination from genomic sequences based on vector reference libraries. Paper presented at the 7th International Symposium on Health Informatics and Bioinformatics, HIBIT 2012. doi:10.1109/HIBIT.2012.6209053
Abstract
DNA is often cloned into a vector before sequencing because this allows the use of standard primers and removes the need to design custom ones. As a result, some vector sequence is read along with the sequence of interest. Occasionally, these contaminating vector sequences find their way into public databases as part of submitted sequences. It has been pointed out that SeqClean, a program used to remove vector contamination from sequences, does not take into account that vectors are circular, so contamination that spans the point at which the vector sequence was linearized can be missed. A workaround has been presented previously; we simplify the process and additionally provide an implementation. We applied our method to a test set of EST sequences and analyzed the amount of contamination in the EST sequences available from NCBI. © 2012 IEEE.
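The circularity issue mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use SeqClean. It assumes one common way of handling a circular reference, namely appending the start of the vector to its end so that fragments spanning the junction appear in a linear string, and it uses exact substring matching where a real tool would use alignment. The function names, the `min_len` threshold, and the toy sequences are all hypothetical.

```python
def extend_circular(vector: str, overlap: int) -> str:
    """Append the first `overlap` bases of the vector to its end.

    Assumption: this is one simple way to expose junction-spanning
    fragments of a circular molecule in a linear string.
    """
    overlap = min(overlap, len(vector))
    return vector + vector[:overlap]


def find_vector_hit(read: str, vector: str, min_len: int = 12,
                    circular: bool = True):
    """Return (start, end) of the longest exact vector match in `read`.

    Returns None if no match of at least `min_len` bases is found.
    Exact matching is a simplification; real tools tolerate mismatches.
    """
    reference = extend_circular(vector, len(vector) - 1) if circular else vector
    best = None
    for start in range(len(read) - min_len + 1):
        end = start + min_len
        if read[start:end] not in reference:
            continue
        # Greedily extend the seed match to the right.
        while end < len(read) and read[start:end + 1] in reference:
            end += 1
        if best is None or end - start > best[1] - best[0]:
            best = (start, end)
    return best


def trim_vector(read: str, vector: str, min_len: int = 12) -> str:
    """Remove the matched vector region and keep the longer flank."""
    hit = find_vector_hit(read, vector, min_len)
    if hit is None:
        return read
    start, end = hit
    left, right = read[:start], read[end:]
    return left if len(left) >= len(right) else right


# Toy data (hypothetical): the read starts with a fragment that spans
# the junction between the end and the start of the vector sequence.
vector = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
insert = "TGTGTGTGTGTGTGTGTGTG"
read = vector[-8:] + vector[:8] + insert

linear_hit = find_vector_hit(read, vector, circular=False)
circular_hit = find_vector_hit(read, vector, circular=True)
cleaned = trim_vector(read, vector)
```

In this toy case the junction-spanning fragment is only detected when the vector is treated as circular; searching the linear sequence alone leaves the contamination in place.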