904 results for INFORMATION EXTRACTION FROM DOCUMENTS


Relevance: 100.00%

Abstract:

The extraction of relevant terms from texts is an extensively researched task in Text Mining. Relevant terms have been applied in areas such as Information Retrieval or document clustering and classification. However, relevance has a rather fuzzy nature, since the classification of some terms as relevant or not is not consensual. For instance, while words such as "president" and "republic" are generally considered relevant by human evaluators, and words like "the" and "or" are not, terms such as "read" and "finish" gather no consensus about their semantics and informativeness. Concepts, on the other hand, have a less fuzzy nature. Therefore, instead of deciding on the relevance of a term during the extraction phase, as most extractors do, I propose to first extract from texts what I have called generic concepts (all concepts) and postpone the decision about relevance to downstream applications, according to their needs. For instance, a keyword extractor may assume that the most relevant keywords are the most frequent concepts in the documents. Moreover, most statistical extractors are incapable of extracting single-word and multi-word expressions with the same methodology. These factors led to the development of the ConceptExtractor, a statistical and language-independent methodology which is explained in Part I of this thesis. In Part II, I show that the automatic extraction of concepts has great applicability. For instance, for the extraction of keywords from documents, using the Tf-Idf metric only on concepts yields better results than using Tf-Idf without concepts, especially for multi-word expressions. In addition, since concepts can be semantically related to other concepts, they allow us to build implicit document descriptors. These applications led to published work. Finally, I present some work that, although not yet published, is briefly discussed in this document.
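
As a rough illustration of the proposal, here is a minimal sketch of ranking extracted concepts by Tf-Idf; the concept lists, function name, and scoring details are hypothetical stand-ins, not the ConceptExtractor itself.

```python
import math
from collections import Counter

def tfidf_keywords(doc_concepts, top_n=5):
    """Rank each document's concepts by Tf-Idf.

    doc_concepts: one list of extracted concepts per document
    (the kind of output a concept extractor might produce).
    """
    n_docs = len(doc_concepts)
    # Document frequency: number of documents containing each concept.
    df = Counter()
    for concepts in doc_concepts:
        df.update(set(concepts))

    keywords = []
    for concepts in doc_concepts:
        tf = Counter(concepts)
        scores = {c: (count / len(concepts)) * math.log(n_docs / df[c])
                  for c, count in tf.items()}
        keywords.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return keywords

# Hypothetical concept lists, single- and multi-word alike.
docs = [["president", "republic", "head of state", "president"],
        ["republic", "constitution", "head of state"],
        ["football", "match", "president"]]
print(tfidf_keywords(docs))
```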

Relevance: 100.00%

Abstract:

Apart from the article forming the main content, most HTML documents on the WWW contain additional content such as navigation menus, design elements, or commercial banners. In the context of several applications it is necessary to distinguish the main content from the additional content automatically. Content extraction and template detection are the two approaches to this task. This thesis gives an extensive overview of existing algorithms from both areas. It contributes an objective way to measure and evaluate the performance of content extraction algorithms under different aspects. These evaluation measures permit the first objective comparison of existing extraction solutions. The newly introduced content code blurring algorithm overcomes several drawbacks of previous approaches and proves to be the best content extraction algorithm currently available. An analysis of methods for clustering web documents according to their underlying templates is the third major contribution of this thesis. In combination with a localised crawling process, this clustering analysis can be used to automatically create sets of training documents for template detection algorithms. As the whole process can be automated, it makes it possible to perform template detection on a single document, thereby combining the advantages of single- and multi-document algorithms.
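
The following toy sketch illustrates the general family of ratio-based extraction this line of work builds on: each character of the HTML source is labeled as markup or text, the label sequence is smoothed ("blurred"), and high-scoring regions are kept as main content. It is an assumption-laden approximation, not the thesis's content code blurring algorithm; the window size and threshold are arbitrary.

```python
import re

def extract_main_content(html, window=40, threshold=0.7):
    """Crude ratio-based content extractor: label each character as
    markup (0) or text (1), smooth the labels with a moving average,
    and keep characters whose smoothed score exceeds a threshold."""
    labels, in_tag = [], False
    for ch in html:
        if ch == "<":
            in_tag = True
        labels.append(0 if in_tag else 1)
        if ch == ">":
            in_tag = False

    # Moving-average smoothing ("blurring") of the label sequence.
    scores = []
    for i in range(len(labels)):
        lo, hi = max(0, i - window), min(len(labels), i + window)
        scores.append(sum(labels[lo:hi]) / (hi - lo))

    kept = "".join(ch for ch, s in zip(html, scores) if s >= threshold)
    return re.sub(r"\s+", " ", kept).strip()

page = ("<html><body><nav><a href='/'>Home</a></nav><p>"
        + "Long article text. " * 10 + "</p></body></html>")
print(extract_main_content(page))
```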

Relevance: 100.00%

Abstract:

Computer modelling has shown that electrical characteristics of individual pixels may be extracted from within multiple-frequency electrical impedance tomography (MFEIT) images formed using a reference data set obtained from a purely resistive, homogeneous medium. In some applications it is desirable to extract the electrical characteristics of individual pixels from images where a purely resistive, homogeneous reference data set is not available. One such application of MFEIT is the acquisition of in vivo images using reference data sets obtained from a non-homogeneous medium with a reactive component. However, the reactive component of the reference data set introduces difficulties in extracting the true electrical characteristics from the image pixels. This study was a preliminary investigation of a technique to extract electrical parameters from multifrequency images when the reference data set has a reactive component. Unlike the situation in which a homogeneous, resistive reference data set is available, it is not possible to obtain the impedance and phase information directly from the pixel values of the MFEIT image data set, as the phase of the reactive reference is not known. The method reported here for extracting the electrical characteristics (the Cole-Cole plot) initially assumes that this phase angle is zero. Under this assumption, an impedance spectrum can be extracted directly from the image set. To obtain the true Cole-Cole plot, a correction must then be applied to account for the rotation of the extracted impedance spectrum about the origin that results from the assumption. This work shows that the angle of rotation associated with the reactive component of the reference data set may be determined using a priori knowledge of the distribution of frequencies along the Cole-Cole plot. Using this angle of rotation, the true Cole-Cole plot can be obtained from the impedance spectrum extracted from the MFEIT image data set. The method was investigated using simulated data, both with and without noise, and also using image data obtained in vitro. The in vitro studies involved 32 logarithmically spaced frequencies from 4 kHz to 1 MHz and demonstrated that differences between the true characteristics and those of the extracted impedance spectrum were reduced significantly after application of the correction technique: prior to correction, the differences between the extracted parameters and the true values ranged from 16% to 70%; after correction, they were reduced to less than 5%. The parameters obtained from the Cole-Cole plot may be useful for characterizing the nature and health of the imaged tissues.
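
A minimal NumPy sketch of the correction step, assuming the rotation angle has already been estimated from the a priori frequency-distribution knowledge described above; the toy spectrum and angle are placeholders, not data from the study.

```python
import numpy as np

def correct_cole_cole(z_extracted, theta):
    """Rotate an extracted complex impedance spectrum about the origin.

    z_extracted: complex array, the spectrum obtained under the
                 zero-phase assumption for the reference data set.
    theta:       rotation angle (radians) attributed to the reactive
                 component of the reference, assumed already estimated.
    """
    # Rotation about the origin in the complex (Cole-Cole) plane.
    return z_extracted * np.exp(-1j * theta)

# Hypothetical spectrum at 32 log-spaced frequencies (4 kHz to 1 MHz).
freqs = np.logspace(np.log10(4e3), np.log10(1e6), 32)
z_true = 50 + 30 / (1 + 1j * freqs / 1e5)   # toy Cole-type model
z_rotated = z_true * np.exp(1j * 0.2)       # what the images would yield
print(np.allclose(correct_cole_cole(z_rotated, 0.2), z_true))  # True
```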

Relevance: 100.00%

Abstract:

This report aims to show that XML technology is the best alternative for meeting the technological challenge facing the information extraction systems of new-generation applications. These systems must, on the one hand, guarantee their independence from the schemas of the databases that feed them and, on the other, be capable of presenting the information in multiple formats.
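
A minimal sketch of the pattern argued for above, assuming a Python implementation: extracted records are serialized once as XML, and any number of presentation formats can then be derived from that XML, insulating consumers from the source database schema. The element names and sample rows are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows extracted from some back-end database; the XML
# layer below is what insulates consumers from that schema.
rows = [{"title": "Report A", "year": "2003"},
        {"title": "Report B", "year": "2004"}]

root = ET.Element("documents")
for row in rows:
    doc = ET.SubElement(root, "document")
    for field, value in row.items():
        ET.SubElement(doc, field).text = value

xml_data = ET.tostring(root, encoding="unicode")

# The same XML can now feed multiple output formats, e.g. plain text:
for doc in ET.fromstring(xml_data):
    print(f"{doc.findtext('title')} ({doc.findtext('year')})")
```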

Relevance: 100.00%

Abstract:

Situated within the fields of Computer-Assisted Reading and Text Analysis (LATAO), Electronic Document Management (GÉD), information visualization and, in part, anthropology, this exploratory research proposes the experimentation of a descriptive text-mining methodology for thematically mapping a corpus of anthropological texts. More precisely, we test the hierarchical agglomerative clustering (CHA) method to extract and analyze the themes found in the abstracts of master's theses and doctoral dissertations awarded from 1985 to 2009 (1,240 abstracts) by the anthropology departments of the Université de Montréal and Université Laval, as well as the history department of Université Laval (for the archaeology and ethnology abstracts). In the first part of the thesis, we present our theoretical framework: we explain what text mining is, its origins, its applications and its methodological steps, and we conclude with a review of the main publications. The second part is devoted to the methodological framework, covering the various stages through which the project was conducted: data collection, linguistic filtering and automatic clustering, to name only a few. Finally, in the last part, we present the results of our research, dwelling in particular on two experiments. We also address thematic navigation and conceptual approaches to thematization, for example the culture/biology dichotomy in anthropology. We conclude with the limitations of this project and avenues of interest for future research.
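
A minimal sketch of the core experiment, using scikit-learn as a stand-in for the tooling actually used in the research; the sample abstracts, vectorization choices, and number of clusters are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

# Placeholder abstracts standing in for the 1,240 thesis summaries.
abstracts = [
    "Ritual and kinship in a rural community",
    "Kinship structures and marriage alliances",
    "Lithic technology at an archaeological site",
    "Excavation of prehistoric stone tools",
]

# Vectorize, then cluster the abstracts bottom-up (agglomerative).
X = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())
for text, label in zip(abstracts, labels):
    print(label, text)
```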

Relevance: 100.00%

Abstract:

A new method of poly-beta-hydroxybutyrate (PHB) extraction from recombinant E. coli is proposed, using homogenization and centrifugation coupled with sodium hypochlorite treatment. The size of PHB granules and cell debris in homogenates was characterised as a function of the number of homogenization passes. Simulation was used to develop the PHB and cell debris fractionation system, enabling numerical examination of the effects of repeated homogenization and centrifuge-feedrate variation. The simulation provided a good prediction of experimental performance. Sodium hypochlorite treatment was necessary to optimise PHB fractionation. A PHB recovery of 80% at a purity of 96.5% was obtained with the final optimised process. Protein and DNA contained in the resultant product were negligible. The developed process holds promise for significantly reducing the recovery cost associated with PHB manufacture.

Relevance: 100.00%

Abstract:

Proteins are biochemical entities consisting of one or more blocks typically folded in a 3D pattern. Each block (a polypeptide) is a single linear sequence of amino acids that are biochemically bonded together. The amino acid sequence of a protein is defined by the sequence of a gene or several genes encoded in the DNA-based genetic code. This genetic code typically uses twenty amino acids, but in certain organisms it can include two additional amino acids. After the amino acids are linked during protein synthesis, each becomes a residue in the protein, which is then chemically modified, ultimately shaping and defining the protein's function. In this study, the authors analyze amino acid sequences using alignment-free methods, aiming to identify structural patterns in sets of proteins and in the proteome without any prior assumptions. The paper starts by analyzing amino acid sequence data by means of histograms over fixed-length amino acid words (tuples). After the initial relative frequency histograms are created, they are transformed and processed to generate quantitative results for information extraction and graphical visualization. Selected samples from two reference datasets are used, and the results reveal that the proposed method generates relevant outputs consistent with current scientific knowledge in domains such as protein sequence and proteome analysis.
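
A minimal sketch of the initial histogram step, assuming plain Python: fixed-length amino acid words are counted and normalized into relative frequencies. The sequence fragment and tuple length are placeholders, and the subsequent transformation and visualization steps are not reproduced.

```python
from collections import Counter

def aa_tuple_histogram(sequence, k=2):
    """Relative-frequency histogram of length-k amino acid words."""
    tuples = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(tuples)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Placeholder fragment; real input would be protein/proteome FASTA data.
print(aa_tuple_histogram("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", k=2))
```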

Relevance: 100.00%

Abstract:

This dissertation proposes a system capable of bridging the gap between legislative documents in PDF format and legislative documents in open formats. The main goal is to map the knowledge present in these documents so as to represent the collection as linked information. The system comprises several components responsible for carrying out three proposed phases: data extraction, knowledge organization, and information access. The first phase proposes an approach to extracting the structure, text, and entities of PDF documents in order to obtain the desired information according to user-defined parameters. This approach uses two different extraction methods, corresponding to the two stages of document processing: document analysis and document understanding. The criterion used to group text objects is the font used in those objects, as defined in the PDF's Content Stream. The approach is divided into three parts: document analysis, document understanding, and conjunction. The first part handles the extraction of text segments, adopting a geometric approach; the result is a list of the document's text lines. The second part groups the text objects according to the stipulated criterion, producing an XML document with the result of that extraction. The third and final part joins the results of the two previous parts and applies structural and logical rules to obtain the final XML document. The second phase proposes an ontology in the legal domain capable of organizing the information extracted by the first phase; it is also responsible for indexing the text of the documents. The proposed ontology has three characteristics: it is small, interoperable, and shareable. The first characteristic reflects the fact that the ontology does not focus on a detailed description of the concepts involved, proposing instead a more abstract description of the entities; the second arises from the need for interoperability with other legal-domain ontologies, as well as with the standard ontologies in general use; the third is defined so that knowledge expressed according to the proposed ontology is independent of factors such as country, language, or jurisdiction. The third phase answers the question of access to and reuse of the knowledge by users external to the system, through the development of a Web Service. This component provides access to the information by exposing a set of resources to external actors who wish to access it. The Web Service follows the REST architecture. An Android mobile application was also developed to provide visualizations of the information requests. The final result is a system capable of transforming collections of PDF documents into open-format collections, allowing access and reuse by other users. This system responds directly to the needs of the open-data community and of governments, which hold many collections of this kind about whose content no reasoning is currently possible, by turning them into data that citizens and professionals can visualize and use.
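
A minimal sketch of the grouping criterion described above: text objects annotated with the font named in the PDF Content Stream are grouped into consecutive runs and serialized as XML. The input records and element names are hypothetical stand-ins for the output of the document-analysis stage.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical output of the document-analysis stage: text objects
# annotated with the font named in the PDF Content Stream.
text_objects = [
    {"font": "F1-Bold", "text": "Article 1"},
    {"font": "F2",      "text": "The present law"},
    {"font": "F2",      "text": "establishes the rules"},
    {"font": "F1-Bold", "text": "Article 2"},
]

root = ET.Element("document")
# Group consecutive text objects that share the same font.
for font, group in groupby(text_objects, key=lambda o: o["font"]):
    block = ET.SubElement(root, "block", font=font)
    block.text = " ".join(o["text"] for o in group)

print(ET.tostring(root, encoding="unicode"))
```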

Relevance: 100.00%

Abstract:

Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies.

Relevance: 100.00%

Abstract:

Asymptomatic Plasmodium infection is a new challenge for public health in the American region. The polymerase chain reaction (PCR) is the best method for diagnosing subpatent parasitemias. In endemic areas, blood collection is hampered by geographical distances and deficient transport and storage conditions for the samples. Because DNA extraction from blood collected on filter paper is an efficient method for molecular studies of highly parasitemic individuals, we investigated whether the technique could be an alternative for Plasmodium diagnosis among asymptomatic and pauciparasitemic subjects. In this report we compared three different methods (Chelex®-saponin, methanol, and TRIS-EDTA) of DNA extraction from blood collected on filter paper from asymptomatic Plasmodium-infected individuals. PCR assays for the detection of Plasmodium species gave the best results when the Chelex®-saponin method was used. However, with detection sensitivities of approximately 66% for P. falciparum and 31% for P. vivax, even this method did not achieve the DNA extraction efficiency required for molecular diagnosis of Plasmodium. The development of better methods for extracting DNA from blood collected on filter paper is important for the diagnosis of subpatent malarial infections in remote areas and would contribute to establishing the epidemiology of this form of infection.

Relevance: 100.00%

Abstract:

Lecture Notes in Computer Science, 9309

Relevance: 100.00%

Abstract:

The purpose of this paper is to reflect on the possibilities and challenges of Community Development Banks (CDBs) as an innovative method of socioeconomic management of microcredit for poor populations. To this end, we discuss the case of Banco Palmas in Conjunto Palmeiras in the city of Fortaleza, in the northeastern state of Ceará, as an empirical case study. The analyses presented here are based on information obtained from Banco Palmas between late 2011 and early 2012; previous studies of the bank by other researchers, as well as other studies on CDBs, were also important sources. The primary data collected at Banco Palmas came from documents made available by the bank, such as reports and mappings. The analyses describe some of the characteristics of the granting of microcredit and allow one to situate it within the universe of microfinance and solidarity finance. They also show the significant growth of local consumption, mostly through the use of the Palmas social currency. The Banco Palmas experience, aside from influencing national public policies on solidarity finance, initiated a network of CDBs that encourages the replication of these experiences throughout the country.

Relevance: 100.00%

Abstract:

The present dissertation examined reading development during the elementary school years by means of eye movement tracking. Three different but related issues in this field were assessed. First, the development of parafoveal processing skills in reading was investigated. Second, it was assessed whether and to what extent sublexical units such as syllables and morphemes are used in processing Finnish words, and whether the use of these sublexical units changes as a function of reading proficiency. Finally, the developmental trend in the speed of visual information extraction during reading was examined. With regard to parafoveal processing skills, it was shown that 2nd graders extract letter identity information approx. 5 characters to the right of fixation, 4th graders approx. 7 characters, and 6th graders and adults approx. 9 characters. Furthermore, all age groups extract more parafoveal information within compound words than across adjective-noun pairs of similar length. In compounds, parafoveal word information can be extracted in parallel with foveal word information if the compound in question is of high frequency. With regard to the use of sublexical units in Finnish word processing, less proficient 2nd graders were shown to use both syllables and morphemes in the course of lexical access, whereas more proficient 2nd graders as well as older readers seem to process words more holistically. Finally, 60 ms proved to be enough for 4th graders and adults to extract visual information from both 4-letter and 8-letter words, whereas 2nd graders clearly needed more than 60 ms to extract all information from 8-letter words for processing to proceed smoothly. The present dissertation demonstrates that Finnish 2nd graders develop their reading skills rapidly and are already at an adult level in some aspects of reading. This is not to say that there are no differences between less proficient readers (e.g., 2nd graders) and more proficient readers (e.g., adults), but in some respects it seems that the visual system used in extracting information from text has matured by the 2nd grade. Furthermore, the dissertation demonstrates that the allocation of attention in reading depends greatly on textual properties such as word frequency and whether words are spatially unified (as in compounds) or not. This flexibility of the attentional system naturally needs to be captured in word processing models. Finally, individual differences within age groups are quite substantial, but it seems that by the end of the 2nd grade practically all Finnish children have reached a reasonable level of reading proficiency.

Relevance: 100.00%

Abstract:

The objective of the present study was to test three different procedures for DNA extraction from Melipona quadrifasciata, based on existing methods for DNA extraction from Apis, plants, and fungi. These methods differ in the concentrations of specific substances in the extraction buffer. The results demonstrate that the method used for Apis is not adequate for DNA extraction from M. quadrifasciata. However, with minor modifications this method, as well as the methods for plants and fungi, proved adequate for DNA extraction from this stingless bee, for both adults and larvae.