2 resultados para Conceptual mistakes in text books
em Massachusetts Institute of Technology
Resumo:
There are numerous text documents available in electronic form. More and more are becoming available every day. Such documents represent a massive amount of information that is easily accessible. Seeking value in this huge collection requires organization; much of the work of organizing documents can be automated through text classification. The accuracy and our understanding of such systems greatly influences their usefulness. In this paper, we seek 1) to advance the understanding of commonly used text classification techniques, and 2) through that understanding, improve the tools that are available for text classification. We begin by clarifying the assumptions made in the derivation of Naive Bayes, noting basic properties and proposing ways for its extension and improvement. Next, we investigate the quality of Naive Bayes parameter estimates and their impact on classification. Our analysis leads to a theorem which gives an explanation for the improvements that can be found in multiclass classification with Naive Bayes using Error-Correcting Output Codes. We use experimental evidence on two commonly-used data sets to exhibit an application of the theorem. Finally, we show fundamental flaws in a commonly-used feature selection algorithm and develop a statistics-based framework for text feature selection. Greater understanding of Naive Bayes and the properties of text allows us to make better use of it in text classification.
Resumo:
Many online services access a large number of autonomous data sources and at the same time need to meet different user requirements. It is essential for these services to achieve semantic interoperability among these information exchange entities. In the presence of an increasing number of proprietary business processes, heterogeneous data standards, and diverse user requirements, it is critical that the services are implemented using adaptable, extensible, and scalable technology. The COntext INterchange (COIN) approach, inspired by similar goals of the Semantic Web, provides a robust solution. In this paper, we describe how COIN can be used to implement dynamic online services where semantic differences are reconciled on the fly. We show that COIN is flexible and scalable by comparing it with several conventional approaches. With a given ontology, the number of conversions in COIN is quadratic to the semantic aspect that has the largest number of distinctions. These semantic aspects are modeled as modifiers in a conceptual ontology; in most cases the number of conversions is linear with the number of modifiers, which is significantly smaller than traditional hard-wiring middleware approach where the number of conversion programs is quadratic to the number of sources and data receivers. In the example scenario in the paper, the COIN approach needs only 5 conversions to be defined while traditional approaches require 20,000 to 100 million. COIN achieves this scalability by automatically composing all the comprehensive conversions from a small number of declaratively defined sub-conversions.