8 results for Associative Classifiers
in Helda - Digital Repository of University of Helsinki
Abstract:
The work is based on the assumption that words with similar syntactic usage have similar meaning, which was proposed by Zellig S. Harris (1954, 1968). We study his assumption from two aspects: Firstly, different meanings (word senses) of a word should manifest themselves in different usages (contexts), and secondly, similar usages (contexts) should lead to similar meanings (word senses). If we start with the different meanings of a word, we should be able to find distinct contexts for the meanings in text corpora. We separate the meanings by grouping and labeling contexts in an unsupervised or weakly supervised manner (Publications 1, 2 and 3). We are confronted with the question of how best to represent contexts in order to induce effective classifiers of contexts, because differences in context are the only means we have to separate word senses. If we start with words in similar contexts, we should be able to discover similarities in meaning. We can do this monolingually or multilingually. In the monolingual material, we find synonyms and other related words in an unsupervised way (Publication 4). In the multilingual material, we find translations by supervised learning of transliterations (Publication 5). In both the monolingual and multilingual case, we first discover words with similar contexts, i.e., synonym or translation lists. In the monolingual case we also aim at finding structure in the lists by discovering groups of similar words, e.g., synonym sets. In this introduction to the publications of the thesis, we consider the larger background issues of how meaning arises, how it is quantized into word senses, and how it is modeled. We also consider how to define, collect and represent contexts. We discuss how to evaluate the trained context classifiers and discovered word sense classifications, and finally we present the word sense discovery and disambiguation methods of the publications.
This work supports Harris' hypothesis by implementing three new methods modeled on his hypothesis. The methods have practical consequences for creating thesauruses and translation dictionaries, e.g., for information retrieval and machine translation purposes. Keywords: Word senses, Context, Evaluation, Word sense disambiguation, Word sense discovery.
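The idea that words occurring in similar contexts have similar meanings can be illustrated with a minimal sketch. It is not one of the thesis's methods: the toy corpus, the window size, the raw co-occurrence counts and the names `context_vectors` and `cosine` are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus; a real study would use a large text corpus.
TOY_CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
    "the car needs fuel",
]
VECTORS = context_vectors(TOY_CORPUS)

if __name__ == "__main__":
    print(cosine(VECTORS["cat"], VECTORS["dog"]))
    print(cosine(VECTORS["cat"], VECTORS["fuel"]))
```

In this toy setting, "cat" and "dog" share most of their contexts and score higher than "cat" and "fuel", which share none. Grouping words by such scores is one simple way to obtain candidate synonym lists.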
Abstract:
In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data with an application to mobile device positioning. In the second part of the Thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
Abstract:
In visual object detection and recognition, classifiers have two interesting characteristics: accuracy and speed. Accuracy depends on the complexity of the image features and classifier decision surfaces. Speed depends on the hardware and the computational effort required to use the features and decision surfaces. When attempts to increase accuracy lead to increases in complexity and effort, it is necessary to ask how much we are willing to pay for increased accuracy. For example, if increased computational effort implies quickly diminishing returns in accuracy, then those designing inexpensive surveillance applications cannot aim for maximum accuracy at any cost. It becomes necessary to find trade-offs between accuracy and effort. We study efficient classification of images depicting real-world objects and scenes. Classification is efficient when a classifier can be controlled so that the desired trade-off between accuracy and effort (speed) is achieved and unnecessary computations are avoided on a per-input basis. A framework is proposed for understanding and modeling efficient classification of images. Classification is modeled as a tree-like process. In designing the framework, it is important to recognize what is essential and to avoid structures that are narrow in applicability. Earlier frameworks are lacking in this regard. The overall contribution is two-fold. First, the framework is presented, subjected to experiments, and shown to be satisfactory. Second, certain unconventional approaches are experimented with. This allows the separation of the essential from the conventional. To determine if the framework is satisfactory, three categories of questions are identified: trade-off optimization, classifier tree organization, and rules for delegation and confidence modeling. Questions and problems related to each category are addressed and empirical results are presented.
For example, related to trade-off optimization, we address the problem of computational bottlenecks that limit the range of trade-offs. We also ask if accuracy versus effort trade-offs can be controlled after training. For another example, regarding classifier tree organization, we first consider the task of organizing a tree in a problem-specific manner. We then ask if problem-specific organization is necessary.
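The notions of delegation, confidence and a trade-off that is adjustable after training can be illustrated with a minimal sketch. It is not the framework proposed in the thesis: it uses a simple chain of stages rather than a tree, and the function names, the scalar input, the stage costs and the threshold rule are all hypothetical.

```python
def cascade_classify(x, stages, threshold):
    """Run stages in order and stop at the first sufficiently confident one.

    `stages` is a list of (classify_fn, cost) pairs, where classify_fn(x)
    returns (label, confidence). Returns the chosen label and the total
    effort spent on this input.
    """
    effort = 0.0
    label = None
    for index, (classify, cost) in enumerate(stages):
        label, confidence = classify(x)
        effort += cost
        is_last = index == len(stages) - 1
        if confidence >= threshold or is_last:
            break  # accept this stage's answer, do not delegate further
    return label, effort


def cheap_stage(x):
    # Hypothetical fast rule: confident only when x is far from zero.
    return x > 0.0, min(1.0, abs(x))


def expensive_stage(x):
    # Hypothetical slow but more precise rule, always fully confident.
    return x > 0.1, 1.0


STAGES = [(cheap_stage, 1.0), (expensive_stage, 10.0)]

if __name__ == "__main__":
    for threshold in (0.0, 0.5):
        print(threshold, cascade_classify(0.05, STAGES, threshold))
```

Raising `threshold` sends more inputs to the expensive stage, and lowering it saves effort at the risk of accepting uncertain answers. Because the threshold is applied only at classification time, the trade-off in this sketch can be changed without retraining either stage.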
Abstract:
In recent reports, adolescents and young adults (AYA) with acute lymphoblastic leukemia (ALL) have had a better outcome with pediatric treatment than with adult protocols. ALL can be classified into biologic subgroups according to immunophenotype and cytogenetics, with different clinical characteristics and outcome. The proportions of the subgroups are different in children and adults. ALL subtypes in AYA patients are less well characterized. In this study, the treatment and outcome of ALL in AYA patients aged 10-25 years in Finland on pediatric and adult protocols were retrospectively analyzed. In total, 245 patients were included. The proportions of biologic subgroups in different age groups were determined. Patients with initially normal or failed karyotype were examined with oligonucleotide microarray-based comparative genomic hybridization (aCGH). Deletions and instability of chromosome 9p were also screened in ALL patients. In addition, patients with other hematologic malignancies were screened for 9p instability. aCGH data were also used to determine a gene set that classifies AYA patients at diagnosis according to their risk of relapse. Receiver operating characteristic analysis was used to assess the value of the set of genes as prognostic classifiers. The 5-year event-free survival of AYA patients treated with pediatric or adult protocols was 67% and 60% (p=0.30), respectively. A white blood cell count above 100 × 10⁹/l was associated with poor prognosis. Patients treated with pediatric protocols and assigned to an intermediate-risk group fared significantly better than those of the pediatric high-risk or adult treatment groups. Deletions of 9p were detected in 46% of AYA ALL patients. The chromosomal region 9p21.3 was always affected, and the CDKN2A gene was always deleted. In about 15% of AYA patients, the 9p21.3 deletion was smaller than 200 kb in size, and therefore, probably undetectable with conventional methods.
Deletion of 9p was the most common aberration of AYA ALL patients with initially normal karyotype. Instability of 9p, defined as multiple separate areas of copy number loss or homozygous loss within a larger heterozygous area in 9p, was detected in 19% (n=27) of ALL patients. This abnormality was restricted to ALL; none of the patients with other hematologic malignancies had the aberration. The prognostic model identification procedure resulted in a model of four genes: BAK1, CDKN2B, GSTM1, and MT1F. The copy number profile combinations of these genes differentiated between AYA ALL patients at diagnosis depending on their risk of relapse. Deletions of CDKN2B and BAK1 in combination with amplification of GSTM1 and MT1F were associated with a higher probability of relapse. Unlike all previous studies, we found that the outcome of AYA patients with ALL treated using pediatric or adult therapeutic protocols was comparable. The success of adult ALL therapy emphasizes the benefit of referral of patients to academic centers and adherence to research protocols. 9p deletions and instability are common features of ALL and may act together with oncogene-activating translocations in leukemogenesis. New and more sensitive methods of molecular cytogenetics can reveal previously cryptic genetic aberrations with an important role in leukemic development and prognosis and that may be potential targets of therapy. aCGH also provides a viable approach for model design aiming at evaluation of risk of relapse in ALL.
Abstract:
We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (f̂CLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that f̂CLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly fewer computational resources.
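To make the two criteria named above concrete, the following minimal sketch computes the traditional log-likelihood and the conditional log-likelihood of a tiny naive Bayes classifier. It does not implement the factorized approximation proposed in the paper; the model layout, the parameter values and the function names are assumptions chosen for illustration.

```python
from math import log


def joint_prob(model, c, x):
    """P(c, x) under a naive Bayes model given as (prior, cond)."""
    prior, cond = model
    p = prior[c]
    for i, value in enumerate(x):
        p *= cond[c][i][value]  # P(x_i = value | c)
    return p


def log_likelihood(model, data):
    """Generative score: sum of log P(c, x) over the data."""
    return sum(log(joint_prob(model, c, x)) for x, c in data)


def conditional_log_likelihood(model, data):
    """Discriminative score: sum of log P(c | x) over the data."""
    prior, _ = model
    total = 0.0
    for x, c in data:
        evidence = sum(joint_prob(model, k, x) for k in prior)  # P(x)
        total += log(joint_prob(model, c, x) / evidence)
    return total


# Hypothetical model with two classes and one binary feature.
MODEL = (
    {0: 0.5, 1: 0.5},
    {0: [{0: 0.8, 1: 0.2}], 1: [{0: 0.3, 1: 0.7}]},
)
DATA = [((0,), 0), ((1,), 1)]  # (feature tuple, class label) pairs

if __name__ == "__main__":
    print(log_likelihood(MODEL, DATA))
    print(conditional_log_likelihood(MODEL, DATA))
```

The log-likelihood sums over joint probabilities, so it splits into one term per variable and its parents. The conditional log-likelihood divides by P(x), which couples all classes in a single normalizer. This coupling is why the conditional criterion does not decompose over the network structure without an approximation.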