Decreasing the number of false positives in sequence classification


Autoria(s): Machado-Lima, Ariane ; Kashiwabara, André ; Durham, Alan 
Contribuinte(s)

UNIVERSIDADE DE SÃO PAULO

Data(s)

26/08/2013

26/08/2013

01/12/2010

Resumo

Abstract Background A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. To decide when a given probability is sufficient the most common way is bayesian binary classification, where the probability of the model characterizing the sequence family of interest is compared to that of an alternative probability model. We can use as alternative model a null model. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions that include: the uniform distribution, genomic distribution, family-specific distribution and the target sequence distribution. This paper presents a study to evaluate the impact of the choice of a null model in the final result of classifications. In particular, we are interested in minimizing the number of false predictions in a classification. This is a crucial issue to reduce costs of biological validation. Results For all the tests, the target null model presented the lowest number of false positives, when using random sequences as a test. The study was performed in DNA sequences using GC content as the measure of content bias, but the results should be valid also for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on aminoacid sequences, using only one probabilistic model (HMM) and on a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results. Conclusions Of the evaluated models the best suited for classification are the uniform model and the target model. However, the use of the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies. In these cases the target model is more dependable for biological validation due to its higher specificity.

We would like to thank Hernando A. del Portillo who proposed the initial biological problem that motivated this study, Sean R. Eddy who helped AML suggesting possible null models for log-odds scoring analysis and for important advice in the final form of the paper, Eric Nawrocki for important insights about Infernal, Alex Coventry for helpful discussion and modification in cmsearch's source code to report negative values, and BIOINFO-Vision Laboratory (University of São Paulo) for computing facilities. During this work, AML was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and Fundação de Amparo a Pesquisa do Estado de São Paulo (FAPESP, 2007/01549-5), AYK was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and AMD was partially supported by CNPq.

We would like to thank Hernando A. del Portillo who proposed the initial biological problem that motivated this study, Sean R. Eddy who helped AML suggesting possible null models for logodds scoring analysis and for important advice in the final form of the paper, Eric Nawrocki for important insights about Infernal, Alex Coventry for helpful discussion and modification in cmsearchs source code to report negative values, and BIOINFOVision Laboratory (University of São Paulo) for computing facilities. During this work, AML was supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and Fundação de Amparo a Pesquisa do Estado de São Paulo (FAPESP, 2007/015495), AYK was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and AMD was partially supported by CNPq.

This article has been published as part of BMC Genomics Volume 11 Supplement 5, 2010: Proceedings of the 5th International Conference of the Brazilian Association for Bioinformatics and Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/14712164/11?issue=S5.

This article has been published as part of BMC Genomics Volume 11 Supplement 5, 2010: Proceedings of the 5th International Conference of the Brazilian Association for Bioinformatics and Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S5.

Identificador

BMC Genomics. 2010 Dec 22;11(Suppl 5):S10

1471-2164

http://www.producao.usp.br/handle/BDPI/32778

http://dx.doi.org/10.1186/1471-2164-11-S5-S10

10.1186/1471-2164-11-S5-S10

http://www.biomedcentral.com/1471-2164/11/S5/S10

Idioma(s)

eng

Relação

BMC Genomics

Direitos

openAccess

Durham et al; licensee BioMed Central Ltd. - This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Tipo

article

original article