Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences

Doğan, Tunca; Karaçalı, Bilge

Please use this identifier to cite or link to this item: https://hdl.handle.net/11147/5277

Title:	Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences
Authors:	Doğan, Tunca; Karaçalı, Bilge
Keywords:	Sequence analysis; Proteins; Genome analysis; Genetic database; Receiver operating characteristic
Publisher:	Public Library of Science
Source:	Doğan, T., and Karaçalı, B. (2013). Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. PLoS One, 8(9). doi:10.1371/journal.pone.0075458
Abstract:	Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.
URI:	http://doi.org/10.1371/journal.pone.0075458; http://hdl.handle.net/11147/5277
ISSN:	1932-6203
Appears in Collections:	Electrical - Electronic Engineering / Elektrik - Elektronik Mühendisliği; PubMed İndeksli Yayınlar Koleksiyonu / PubMed Indexed Publications Collection; Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection; WoS İndeksli Yayınlar Koleksiyonu / WoS Indexed Publications Collection
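
The abstract mentions residue conservation scoring and a table of associations between sequences and conserved regions. The toy sketch below illustrates only that general idea on a pre-aligned set of sequences; it is not the authors' algorithm. The function names, the thresholds (`threshold`, `min_len`, `min_identity`), the scoring rule (share of the most common residue per column), and the example sequences are all assumptions made for illustration, and the alignment and graph-theoretical steps described in the paper are omitted.

```python
from collections import Counter

def column_profile(aligned):
    """Per column: (consensus residue, fraction of all sequences carrying it)."""
    n = len(aligned)
    profile = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps never count as matches
        if residues:
            residue, count = Counter(residues).most_common(1)[0]
        else:
            residue, count = "-", 0
        profile.append((residue, count / n))
    return profile

def conserved_regions(profile, threshold=0.75, min_len=3):
    """Maximal runs of columns scoring >= threshold, as half-open (start, end)."""
    scores = [score for _, score in profile] + [0.0]  # sentinel closes a trailing run
    regions, start = [], None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions

def association_table(named, regions, profile, min_identity=0.75):
    """Map each sequence name to the regions where it matches the consensus."""
    table = {}
    for name, seq in named.items():
        hits = []
        for start, end in regions:
            matches = sum(seq[i] == profile[i][0] for i in range(start, end))
            if matches / (end - start) >= min_identity:
                hits.append((start, end))
        table[name] = hits
    return table

# Hypothetical, already-aligned toy sequences (equal length, '-' marks a gap).
named = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYLAKQR",
    "seqC": "GHTAYIAPLE",
    "seqD": "--WCV--KQR",
}
profile = column_profile(list(named.values()))
regions = conserved_regions(profile)
table = association_table(named, regions, profile)
print(regions)
print(table)
```

In this toy input, two regions pass the threshold, and `seqC` and `seqD` are each associated with only one of them, which mirrors the abstract's point that a conserved region may be present in only a fraction of the collection.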

Files in This Item:

File	Description	Size	Format
5277.PDF	Article	2.53 MB	Adobe PDF


Metric	Count	Checked on
Scopus citations	9	May 16, 2025
Web of Science citations	7	May 17, 2025
Page views	812	May 12, 2025
Downloads	240	May 12, 2025
