50 resultados para automatic content extraction
Resumo:
Trabalho apresentado no âmbito do Mestrado em Engenharia Informática, como requisito parcial para obtenção do grau de Mestre em Engenharia Informática
Resumo:
Trabalho apresentado no âmbito do Mestrado em Engenharia Informática, como requisito parcial para obtenção do grau de Mestre em Engenharia Informática
Resumo:
The extraction of relevant terms from texts is an extensively researched task in Text- Mining. Relevant terms have been applied in areas such as Information Retrieval or document clustering and classification. However, relevance has a rather fuzzy nature since the classification of some terms as relevant or not relevant is not consensual. For instance, while words such as "president" and "republic" are generally considered relevant by human evaluators, and words like "the" and "or" are not, terms such as "read" and "finish" gather no consensus about their semantic and informativeness. Concepts, on the other hand, have a less fuzzy nature. Therefore, instead of deciding on the relevance of a term during the extraction phase, as most extractors do, I propose to first extract, from texts, what I have called generic concepts (all concepts) and postpone the decision about relevance for downstream applications, accordingly to their needs. For instance, a keyword extractor may assume that the most relevant keywords are the most frequent concepts on the documents. Moreover, most statistical extractors are incapable of extracting single-word and multi-word expressions using the same methodology. These factors led to the development of the ConceptExtractor, a statistical and language-independent methodology which is explained in Part I of this thesis. In Part II, I will show that the automatic extraction of concepts has great applicability. For instance, for the extraction of keywords from documents, using the Tf-Idf metric only on concepts yields better results than using Tf-Idf without concepts, specially for multi-words. In addition, since concepts can be semantically related to other concepts, this allows us to build implicit document descriptors. These applications led to published work. Finally, I will present some work that, although not published yet, is briefly discussed in this document.
Resumo:
Currently the world swiftly adapts to visual communication. Online services like YouTube and Vine show that video is no longer the domain of broadcast television only. Video is used for different purposes like entertainment, information, education or communication. The rapid growth of today’s video archives with sparsely available editorial data creates a big problem of its retrieval. The humans see a video like a complex interplay of cognitive concepts. As a result there is a need to build a bridge between numeric values and semantic concepts. This establishes a connection that will facilitate videos’ retrieval by humans. The critical aspect of this bridge is video annotation. The process could be done manually or automatically. Manual annotation is very tedious, subjective and expensive. Therefore automatic annotation is being actively studied. In this thesis we focus on the multimedia content automatic annotation. Namely the use of analysis techniques for information retrieval allowing to automatically extract metadata from video in a videomail system. Furthermore the identification of text, people, actions, spaces, objects, including animals and plants. Hence it will be possible to align multimedia content with the text presented in the email message and the creation of applications for semantic video database indexing and retrieving.
Resumo:
The automatic acquisition of lexical associations from corpora is a crucial issue for Natural Language Processing. A lexical association is a recurrent combination of words that co-occur together more often than expected by chance in a given domain. In fact, lexical associations define linguistic phenomena such as idiomes, collocations or compound words. Due to the fact that the sense of a lexical association is not compositionnal, their identification is fundamental for the realization of analysis and synthesis that take into account all the subtilities of the language. In this report, we introduce a new statistically-based architecture that extracts from naturally occurring texts contiguous and non contiguous. For that purpose, three new concepts have been defined : the positional N-gram models, the Mutual Expectation and the GenLocalMaxs algorithm. Thus, the initial text is fisrtly transformed in a set of positionnal N-grams i.e ordered vectors of simple lexical units. Then, an association measure, the Mutual Expectation, evaluates the degree of cohesion of each positional N-grams based on the identification of local maximum values of Mutual Expectation. Great efforts have also been carried out to evaluate our metodology. For that purpose, we have proposed the normalisation of five well-known association measures and shown that both the Mutual Expectation and the GenLocalMaxs algorithm evidence significant improvements comparing to existent metodologies.
Resumo:
Trabalho apresentado no âmbito do Mestrado em Engenharia Informática, como requisito parcial para obtenção do grau de Mestre em Engenharia Informática
Resumo:
Thesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer Science
Resumo:
Dissertation to obtain a Master Degree in Biotechnology
Resumo:
Based in internet growth, through semantic web, together with communication speed improvement and fast development of storage device sizes, data and information volume rises considerably every day. Because of this, in the last few years there has been a growing interest in structures for formal representation with suitable characteristics, such as the possibility to organize data and information, as well as the reuse of its contents aimed for the generation of new knowledge. Controlled Vocabulary, specifically Ontologies, present themselves in the lead as one of such structures of representation with high potential. Not only allow for data representation, as well as the reuse of such data for knowledge extraction, coupled with its subsequent storage through not so complex formalisms. However, for the purpose of assuring that ontology knowledge is always up to date, they need maintenance. Ontology Learning is an area which studies the details of update and maintenance of ontologies. It is worth noting that relevant literature already presents first results on automatic maintenance of ontologies, but still in a very early stage. Human-based processes are still the current way to update and maintain an ontology, which turns this into a cumbersome task. The generation of new knowledge aimed for ontology growth can be done based in Data Mining techniques, which is an area that studies techniques for data processing, pattern discovery and knowledge extraction in IT systems. This work aims at proposing a novel semi-automatic method for knowledge extraction from unstructured data sources, using Data Mining techniques, namely through pattern discovery, focused in improving the precision of concept and its semantic relations present in an ontology. In order to verify the applicability of the proposed method, a proof of concept was developed, presenting its results, which were applied in building and construction sector.
Resumo:
This Thesis describes the application of automatic learning methods for a) the classification of organic and metabolic reactions, and b) the mapping of Potential Energy Surfaces(PES). The classification of reactions was approached with two distinct methodologies: a representation of chemical reactions based on NMR data, and a representation of chemical reactions from the reaction equation based on the physico-chemical and topological features of chemical bonds. NMR-based classification of photochemical and enzymatic reactions. Photochemical and metabolic reactions were classified by Kohonen Self-Organizing Maps (Kohonen SOMs) and Random Forests (RFs) taking as input the difference between the 1H NMR spectra of the products and the reactants. The development of such a representation can be applied in automatic analysis of changes in the 1H NMR spectrum of a mixture and their interpretation in terms of the chemical reactions taking place. Examples of possible applications are the monitoring of reaction processes, evaluation of the stability of chemicals, or even the interpretation of metabonomic data. A Kohonen SOM trained with a data set of metabolic reactions catalysed by transferases was able to correctly classify 75% of an independent test set in terms of the EC number subclass. Random Forests improved the correct predictions to 79%. With photochemical reactions classified into 7 groups, an independent test set was classified with 86-93% accuracy. The data set of photochemical reactions was also used to simulate mixtures with two reactions occurring simultaneously. Kohonen SOMs and Feed-Forward Neural Networks (FFNNs) were trained to classify the reactions occurring in a mixture based on the 1H NMR spectra of the products and reactants. Kohonen SOMs allowed the correct assignment of 53-63% of the mixtures (in a test set). Counter-Propagation Neural Networks (CPNNs) gave origin to similar results. The use of supervised learning techniques allowed an improvement in the results. They were improved to 77% of correct assignments when an ensemble of ten FFNNs were used and to 80% when Random Forests were used. This study was performed with NMR data simulated from the molecular structure by the SPINUS program. In the design of one test set, simulated data was combined with experimental data. The results support the proposal of linking databases of chemical reactions to experimental or simulated NMR data for automatic classification of reactions and mixtures of reactions. Genome-scale classification of enzymatic reactions from their reaction equation. The MOLMAP descriptor relies on a Kohonen SOM that defines types of bonds on the basis of their physico-chemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants, and numerically encodes the pattern of bonds that are broken, changed, and made during a chemical reaction. The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications ranging from the computer validation of classification systems, genome-scale reconstruction (or comparison) of metabolic pathways, to the classification of enzymatic mechanisms. Catalytic functions of proteins are generally described by the EC numbers that are simultaneously employed as identifiers of reactions, enzymes, and enzyme genes, thus linking metabolic and genomic information. Different methods should be available to automatically compare metabolic reactions and for the automatic assignment of EC numbers to reactions still not officially classified. In this study, the genome-scale data set of enzymatic reactions available in the KEGG database was encoded by the MOLMAP descriptors, and was submitted to Kohonen SOMs to compare the resulting map with the official EC number classification, to explore the possibility of predicting EC numbers from the reaction equation, and to assess the internal consistency of the EC classification at the class level. A general agreement with the EC classification was observed, i.e. a relationship between the similarity of MOLMAPs and the similarity of EC numbers. At the same time, MOLMAPs were able to discriminate between EC sub-subclasses. EC numbers could be assigned at the class, subclass, and sub-subclass levels with accuracies up to 92%, 80%, and 70% for independent test sets. The correspondence between chemical similarity of metabolic reactions and their MOLMAP descriptors was applied to the identification of a number of reactions mapped into the same neuron but belonging to different EC classes, which demonstrated the ability of the MOLMAP/SOM approach to verify the internal consistency of classifications in databases of metabolic reactions. RFs were also used to assign the four levels of the EC hierarchy from the reaction equation. EC numbers were correctly assigned in 95%, 90%, 85% and 86% of the cases (for independent test sets) at the class, subclass, sub-subclass and full EC number level,respectively. Experiments for the classification of reactions from the main reactants and products were performed with RFs - EC numbers were assigned at the class, subclass and sub-subclass level with accuracies of 78%, 74% and 63%, respectively. In the course of the experiments with metabolic reactions we suggested that the MOLMAP / SOM concept could be extended to the representation of other levels of metabolic information such as metabolic pathways. Following the MOLMAP idea, the pattern of neurons activated by the reactions of a metabolic pathway is a representation of the reactions involved in that pathway - a descriptor of the metabolic pathway. This reasoning enabled the comparison of different pathways, the automatic classification of pathways, and a classification of organisms based on their biochemical machinery. The three levels of classification (from bonds to metabolic pathways) allowed to map and perceive chemical similarities between metabolic pathways even for pathways of different types of metabolism and pathways that do not share similarities in terms of EC numbers. Mapping of PES by neural networks (NNs). In a first series of experiments, ensembles of Feed-Forward NNs (EnsFFNNs) and Associative Neural Networks (ASNNs) were trained to reproduce PES represented by the Lennard-Jones (LJ) analytical potential function. The accuracy of the method was assessed by comparing the results of molecular dynamics simulations (thermal, structural, and dynamic properties) obtained from the NNs-PES and from the LJ function. The results indicated that for LJ-type potentials, NNs can be trained to generate accurate PES to be used in molecular simulations. EnsFFNNs and ASNNs gave better results than single FFNNs. A remarkable ability of the NNs models to interpolate between distant curves and accurately reproduce potentials to be used in molecular simulations is shown. The purpose of the first study was to systematically analyse the accuracy of different NNs. Our main motivation, however, is reflected in the next study: the mapping of multidimensional PES by NNs to simulate, by Molecular Dynamics or Monte Carlo, the adsorption and self-assembly of solvated organic molecules on noble-metal electrodes. Indeed, for such complex and heterogeneous systems the development of suitable analytical functions that fit quantum mechanical interaction energies is a non-trivial or even impossible task. The data consisted of energy values, from Density Functional Theory (DFT) calculations, at different distances, for several molecular orientations and three electrode adsorption sites. The results indicate that NNs require a data set large enough to cover well the diversity of possible interaction sites, distances, and orientations. NNs trained with such data sets can perform equally well or even better than analytical functions. Therefore, they can be used in molecular simulations, particularly for the ethanol/Au (111) interface which is the case studied in the present Thesis. Once properly trained, the networks are able to produce, as output, any required number of energy points for accurate interpolations.
Resumo:
Dissertation presented at the Faculty of Science and Technology of the New University of Lisbon in fulfillment of the requirements for the Masters degree in Electrical Engineering and Computers
Resumo:
Trabalho apresentado no âmbito do Mestrado em Engenharia Informática, como requisito parcial para obtenção do grau de Mestre em Engenharia Informática
Resumo:
European Master in Multimedia and Audiovisual Administration
Resumo:
Dissertação apresentada na Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa para obtenção do grau de Mestre em Engenharia Informática
Resumo:
Dissertation for a Masters Degree in Computer and Electronic Engineering