981 results for Unsupervised document classification
Abstract:
This paper presents an approach to developing an intelligent search system and automatic document classification and cataloguing tools for a metadata-based CASE system. The described method combines the advantages of the ontology approach with those of the traditional keyword-based approach. The method offers powerful intelligent capabilities and can be integrated with existing document search systems.
Abstract:
The aim of this thesis is to present a new approach to document classification using verb-object pairs. We explore one possible strategy that uses the presence of relevant verb-object pairs in documents as features and trains a Naive Bayes classifier on them. We then assess the results of a case study that uses software based on this strategy and draw conclusions.
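As a rough illustration of the strategy described in this abstract, the sketch below trains a Bernoulli Naive Bayes classifier on the presence or absence of verb-object pairs. It is not the thesis's software: the function names, the toy documents, the class labels, and the Laplace smoothing parameter `alpha` are all assumptions made for the example, and the pairs are taken as already extracted from the text.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model.

    docs   -- list of sets of (verb, object) pairs, one set per document
    labels -- list of class names, aligned with docs
    alpha  -- Laplace smoothing constant (an assumption of this sketch)
    """
    vocab = set().union(*docs)
    class_docs = defaultdict(list)
    for pairs, label in zip(docs, labels):
        class_docs[label].append(pairs)
    model = {"vocab": vocab, "prior": {}, "p_present": {}}
    for label, members in class_docs.items():
        model["prior"][label] = len(members) / len(docs)
        # Smoothed probability that each pair occurs in a document of this class.
        model["p_present"][label] = {
            pair: (sum(pair in d for d in members) + alpha)
            / (len(members) + 2 * alpha)
            for pair in vocab
        }
    return model

def predict(model, pairs):
    """Return the class with the highest log-posterior for a set of pairs."""
    best_label, best_score = None, -math.inf
    for label, prior in model["prior"].items():
        score = math.log(prior)
        for pair in model["vocab"]:
            p = model["p_present"][label][pair]
            # Both presence and absence of a pair contribute evidence.
            score += math.log(p if pair in pairs else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus: each document is reduced to its verb-object pairs.
train_docs = [
    {("install", "package"), ("run", "script")},
    {("compile", "code"), ("run", "script")},
    {("bake", "bread"), ("chop", "onion")},
    {("boil", "water"), ("chop", "onion")},
]
train_labels = ["software", "software", "cooking", "cooking"]
model = train_bernoulli_nb(train_docs, train_labels)
```

With this toy model, `predict(model, {("run", "script")})` selects the `software` class, because that pair occurs in both software documents and in neither cooking document.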
Abstract:
This thesis introduces a novel conceptual framework to support the creation of knowledge representations based on enriched Semantic Vectors, using the classical vector space model approach extended with ontological support. One of the primary research challenges addressed here relates to the formalization and representation of document contents, where most existing approaches are limited and take into account only the explicit, word-based information in the document. This research explores how traditional knowledge representations can be enriched by incorporating implicit information derived from the complex relationships (semantic associations) modelled by domain ontologies, in addition to the information presented in documents. The relevant achievements pursued by this thesis are the following: (i) conceptualization of a model that enables the semantic enrichment of knowledge sources supported by domain experts; (ii) development of a method for extending the traditional vector space, using domain ontologies; (iii) development of a method to support ontology learning, based on the discovery of new ontological relations expressed in non-structured information sources; (iv) development of a process to evaluate the semantic enrichment; (v) implementation of a proof of concept, named SENSE (Semantic Enrichment kNowledge SourcEs), which makes it possible to validate the ideas established within the scope of this thesis; (vi) publication of several scientific articles and support for four master's dissertations carried out in the Department of Electrical and Computer Engineering at FCT/UNL. It is worth mentioning that the work developed under the semantic referential covered by this thesis has reused relevant achievements from European research projects, in order to build on approaches that are considered scientifically sound and coherent and to avoid “reinventing the wheel”.
Abstract:
Conventionally, research on document classification focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observation, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features that are used in a representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of high dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By looking at documents in this way, we can clearly identify the contributions made by different features to document classification. Experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
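The abstract does not specify its measure, so the sketch below substitutes a well-known classifier-independent stand-in, a Fisher-style score (between-class variance divided by within-class variance), to show how a labelled corpus can rank features without training any classifier. The function name, the two example terms, and the counts are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Ratio of between-class to within-class variance for one feature.

    values -- the feature's value in each labelled document (e.g. a term count)
    labels -- the class of each document, aligned with values
    A higher score means the feature separates the classes more cleanly.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    overall = mean(values)
    n = len(values)
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values()) / n
    within = sum(len(g) * pvariance(g) for g in groups.values()) / n
    return between / within if within > 0 else float("inf")

# Hypothetical term counts over six labelled documents.
labels = ["sport", "sport", "sport", "finance", "finance", "finance"]
score_goal = fisher_score([3, 4, 5, 0, 1, 0], labels)     # topical term
score_the = fisher_score([10, 11, 9, 10, 9, 11], labels)  # function word
```

In this toy corpus the topical term scores well above the function word, whose class means coincide, which is the kind of per-feature contribution such a measure is meant to expose.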
Abstract:
This paper presents a comparative study of three closely related Bayesian models for unsupervised document-level sentiment classification, namely, the latent sentiment model (LSM), the joint sentiment-topic (JST) model, and the Reverse-JST model. Extensive experiments have been conducted on two corpora, the movie review dataset and the multi-domain sentiment dataset. It has been found that while all three models achieve either better or comparable performance on these two corpora when compared to existing unsupervised sentiment classification approaches, both JST and Reverse-JST are able to extract sentiment-oriented topics. In addition, Reverse-JST always performs worse than JST, suggesting that the JST model is more appropriate for joint sentiment-topic detection.
Abstract:
2000 Mathematics Subject Classification: 62H30
Abstract:
Document classification is a supervised machine learning process, where predefined category labels are assigned to documents based on the hypothesis derived from a training set of labelled documents. Documents cannot be directly interpreted by a computer system unless they have been modelled as a collection of computable features. Rogati and Yang [M. Rogati and Y. Yang, Resource selection for domain-specific cross-lingual IR, in SIGIR 2004: Proceedings of the 27th annual international conference on Research and Development in Information Retrieval, ACM Press, Sheffield, United Kingdom, pp. 154-161.] pointed out that the effectiveness of a document classification system may vary across domains. This implies that the quality of the document model contributes to the effectiveness of document classification. Conventionally, model evaluation is accomplished by comparing the effectiveness scores of classifiers on model candidates. However, this kind of evaluation method may encounter either under-fitting or over-fitting problems, because the effectiveness scores are restricted by the learning capacities of the classifiers. We propose a model fitness evaluation method to determine whether a model is sufficient to distinguish positive and negative instances while still being able to provide satisfactory effectiveness with a small feature subset. Our experiments demonstrate how the fitness of models is assessed. The results of our work contribute to research on feature selection, dimensionality reduction and document classification.
Abstract:
Master's in Informatics Engineering
Abstract:
Master's in Informatics Engineering - Specialization in Architectures, Systems and Networks
Abstract:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Abstract:
In this article, we propose a framework, namely Prediction-Learning-Distillation (PLD), for interactive document classification and for distilling misclassified documents. Whenever a user points out misclassified documents, the PLD learns from the mistakes and identifies the same mistakes among all other classified documents. The PLD then enforces this learning for future classifications. If the classifier fails to accept relevant documents or to reject irrelevant documents in certain categories, the PLD assigns those documents as new positive or negative training instances. The classifier can then address its weaknesses by learning from these new training instances. Our experimental results demonstrate that the proposed algorithm can learn from user-identified misclassified documents and then distil the rest successfully.
Abstract:
This research analyzed the spatial relationship between a mega-scale fracture network and the occurrence of vegetation in an arid region. High-resolution aerial photographs of Arches National Park, Utah were used for digital image processing. Four sets of large-scale joints were digitized from the rectified color photograph in order to characterize the geospatial properties of the fracture network with the aid of a Geographic Information System. An unsupervised landcover classification was carried out to identify the spatial distribution of vegetation on the fractured outcrop. Results of this study confirm that the WNW-ESE alignment of vegetation is dominantly controlled by the spatial distribution of the systematic joint set, which in turn parallels the regional fold axis. This research provides insight into the spatial heterogeneity inherent to fracture networks, as well as the effects of jointing on the distribution of surface vegetation in desert environments.
Abstract:
The Venice Lagoon is a complex, heterogeneous and highly dynamic system, subject to anthropogenic and natural pressures that deeply affect the functioning of this ecosystem. Thanks to the development of acoustic technologies, it is possible to obtain high-resolution maps that describe the characteristics of the seabed. With this aim, a high-resolution Multibeam Echosounder (MBES) bathymetry and backscatter survey was carried out in 2021 within the project Research Programme Venezia 2021. Ground-truthing samples were collected at 24 sampling sites to characterize the seafloor and validate the maps produced with the MBES acoustic data. Ground-truthing included the collection of sediment samples for particle size analysis and video footage of the seabed to describe the biological component. The backscatter data was analysed using the unsupervised Jenks classification. We created a map of the habitats by integrating morphological, granulometric and biological data in a GIS environment. The results obtained in this study were compared to those collected in 2015 as part of the National Flagship Project RITMARE. By comparing the repeated morpho-bathymetric surveys over time, we highlighted the changes in seafloor geomorphology, sediment, and habitat distribution. We observed different types of habitats and the presence of areas characterized by erosive processes and others in which deposition occurred. These effects led to changes in the benthic communities and in the type of sediment. The combination of the MBES surveys, the ground-truth data and the GIS methodology made it possible to construct high-resolution maps of the seafloor and proved to be an effective tool for monitoring an extremely dynamic area. This work can contribute not only to broadening the knowledge of transitional environments, but also to their monitoring and protection.
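The Jenks (natural breaks) classification mentioned in this abstract chooses class boundaries that minimise the total squared deviation of values from their class means. The sketch below is a minimal brute-force version for small one-dimensional samples; it is not the study's workflow, and the function names and the backscatter values (in dB) are invented for illustration.

```python
from itertools import combinations

def sum_sq_dev(values):
    """Sum of squared deviations of values from their mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def jenks_breaks(values, n_classes):
    """Return the upper bound of every class except the last.

    Exhaustively tries each way of cutting the sorted data into n_classes
    contiguous groups and keeps the one with the lowest within-class sum of
    squared deviations. Suitable only for small samples; production tools
    use dynamic programming instead.
    """
    data = sorted(values)
    best_cost, best_cuts = float("inf"), ()
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0, *cuts, len(data))
        cost = sum(sum_sq_dev(data[a:b]) for a, b in zip(bounds, bounds[1:]))
        if cost < best_cost:
            best_cost, best_cuts = cost, cuts
    return [data[i - 1] for i in best_cuts]

# Hypothetical backscatter intensities in dB, forming three loose clusters.
backscatter = [-32.0, -31.5, -30.8, -24.1, -23.7, -23.0, -15.2, -14.9]
breaks = jenks_breaks(backscatter, 3)
```

For this sample the breaks fall at the top of the two lower clusters, so the values are partitioned into the three groups visible by eye.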
Abstract:
Work presented within the scope of the Master's in Informatics Engineering, as a partial requirement for obtaining the degree of Master in Informatics Engineering