925 results for text mining
Abstract:
Curves are a common feature of road infrastructure; however, crashes on road curves are associated with an increased risk of injury and fatality to vehicle occupants. Countermeasures require the identification of contributing factors, yet current approaches to identifying contributors use traditional statistical methods and have not used self-reported narrative claims to identify factors related to the driver, the vehicle and the environment in a systemic way. Text mining of 3434 road-curve crash claim records filed between 1 January 2003 and 31 December 2005 at a major insurer in Queensland, Australia, was undertaken to identify risk levels and contributing factors. Rough set analysis was applied to the insurance claim narratives to identify significant contributing factors to crashes and their associated severity. New contributing factors unique to curve crashes were identified (e.g., tree, phone, over-steer) in addition to those previously identified via traditional statistical analysis of police and licensing authority records. Text mining is a novel methodology for improving knowledge of risk and of the factors contributing to road-curve crash severity. Future road-curve crash countermeasures should more fully consider the interrelationships between the environment, the road, the driver and the vehicle, and education campaigns in particular could highlight the increased risk of crashes on road curves.
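A minimal sketch of the rough set machinery invoked above, assuming claim narratives have already been reduced to keyword attributes with a severity decision per record; the attribute names and records are invented. Records indiscernible on the chosen attributes form equivalence classes, and the lower/upper approximations of a severity class give the attribute patterns that certainly or possibly imply it.

```python
from collections import defaultdict

# Invented claim records: keyword attributes mined from narratives,
# plus a severity decision label.
records = [
    ({"wet_road": 1, "tree": 1, "over_steer": 0}, "severe"),
    ({"wet_road": 1, "tree": 1, "over_steer": 0}, "severe"),
    ({"wet_road": 1, "tree": 0, "over_steer": 1}, "minor"),
    ({"wet_road": 0, "tree": 0, "over_steer": 1}, "severe"),
    ({"wet_road": 0, "tree": 0, "over_steer": 1}, "minor"),
]

def approximations(records, attrs, decision):
    """Lower/upper rough-set approximation of a decision class over
    the equivalence classes induced by the chosen attributes."""
    classes = defaultdict(list)
    for cond, dec in records:
        classes[tuple(cond[a] for a in attrs)].append(dec)
    lower = {k for k, ds in classes.items() if all(d == decision for d in ds)}
    upper = {k for k, ds in classes.items() if any(d == decision for d in ds)}
    return lower, upper

lower, upper = approximations(records, ["wet_road", "tree"], "severe")
print("certainly severe:", lower)  # {(1, 1)}
print("possibly severe:", upper)   # {(1, 1), (0, 0)}
```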
Abstract:
This thesis studies the human gene expression space using high-throughput gene expression data from DNA microarrays. In molecular biology, high-throughput techniques allow numerical measurement of the expression of tens of thousands of genes simultaneously. In a single study, such data are traditionally obtained from a limited number of sample types with a small number of replicates. For organism-wide analysis, this data has been largely unavailable and the global structure of the human transcriptome has remained unknown. This thesis introduces a human transcriptome map of different biological entities and an analysis of its general structure. The map is constructed from gene expression data from the two largest public microarray data repositories, GEO and ArrayExpress. The creation of this map contributed to the development of ArrayExpress by identifying and retrofitting previously unusable and missing data and by improving access to its data. It also contributed to the creation of several new tools for microarray data manipulation and to the establishment of data exchange between GEO and ArrayExpress. The data integration for the global map required the creation of a large new ontology of human cell types, disease states, organism parts and cell lines. The ontology was used in a new text mining and decision-tree-based method for automatic conversion of human-readable free-text microarray data annotations into a categorised format. Comparability of the data, and minimisation of the systematic measurement errors characteristic of each laboratory in this large cross-laboratory integrated dataset, were ensured by computing a range of microarray data quality metrics and excluding incomparable data. The structure of the global map of human gene expression was then explored by principal component analysis and hierarchical clustering, using heuristics and help from another purpose-built sample ontology. A preface and motivation for the construction and analysis of a global map of human gene expression is given by the analysis of two microarray datasets of human malignant melanoma. The analysis of these sets incorporates an indirect comparison of statistical methods for finding differentially expressed genes and points to the need to study gene expression at a global level.
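A minimal sketch of the exploratory step described above (principal component analysis followed by hierarchical clustering), assuming a quality-filtered samples-by-genes expression matrix; the matrix here is random placeholder data, not the integrated GEO/ArrayExpress dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Placeholder expression matrix: 60 samples x 500 genes (log-scale values).
X = rng.normal(size=(60, 500))

# Principal components expose the dominant axes of variation across samples.
pcs = PCA(n_components=10).fit_transform(X)

# Hierarchical clustering on the reduced representation groups samples;
# in the thesis such groups are interpreted against a sample ontology.
Z = linkage(pcs, method="ward")
labels = fcluster(Z, t=4, criterion="maxclust")
print(labels[:10])
```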
Abstract:
A central tenet in the theory of reliability modelling is the quantification of the probability of asset failure. In general, reliability depends on asset age and the maintenance policy applied. Usually, failure and maintenance times are the primary inputs to reliability models. However, for many organisations, different aspects of these data are often recorded in different databases (e.g. work order notifications, event logs, condition monitoring data, and process control data). These recorded data cannot be interpreted individually, since they typically do not have all the information necessary to ascertain failure and preventive maintenance times. This paper presents a methodology for the extraction of failure and preventive maintenance times using commonly-available, real-world data sources. A text-mining approach is employed to extract keywords indicative of the source of the maintenance event. Using these keywords, a Naïve Bayes classifier is then applied to attribute each machine stoppage to one of two classes: failure or preventive. The accuracy of the algorithm is assessed and the classified failure time data are then presented. The applicability of the methodology is demonstrated on a maintenance data set from an Australian electricity company.
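A minimal sketch of the classification step, assuming maintenance work-order texts have already been reduced to keywords; the example strings and labels are invented, and scikit-learn's multinomial Naive Bayes stands in for whatever implementation the paper used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented work-order snippets labelled by maintenance-event class.
texts = [
    "bearing seized trip alarm unplanned",
    "scheduled lubrication service inspection",
    "motor winding fault emergency stop",
    "planned overhaul routine replacement",
]
labels = ["failure", "preventive", "failure", "preventive"]

# Bag-of-keywords features feed a multinomial Naive Bayes classifier,
# which attributes each stoppage to 'failure' or 'preventive'.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["unplanned trip bearing fault"]))  # expect: ['failure']
```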
Abstract:
The rapid increase in the number of text documents available on the Internet has created pressure to use effective cleaning techniques for converting these documents into structured documents. Text cleaning techniques are one of the key mechanisms in typical text mining application frameworks. In this paper, we explore the role of text cleaning on the 20 Newsgroups dataset and report experimental results.
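A minimal sketch of this kind of cleaning step, using the 20 Newsgroups corpus as shipped with scikit-learn; the particular cleaning rules are illustrative assumptions, not the paper's pipeline.

```python
import re
from sklearn.datasets import fetch_20newsgroups

# Load the corpus with the usual metadata stripped.
data = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes"))

def clean(text: str) -> str:
    """Lowercase, drop non-letters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = [clean(doc) for doc in data.data[:100]]
print(cleaned[0][:80])
```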
Abstract:
The DADAISM project brings together researchers from the diverse fields of archaeology, human-computer interaction, image processing, image search and retrieval, and text mining to create a rich interactive system that addresses the problem of researchers finding images relevant to their research. In the age of digital photography, thousands of images are taken of archaeological artefacts. These images could help archaeologists enormously in their tasks of classification and identification if they could be related to one another effectively, and they would yield many new insights into a range of archaeological problems. However, these images are currently greatly underutilised for two key reasons. Firstly, the current paradigm for interaction with image collections is basic keyword search or, at best, simple faceted search. Secondly, even where these interactions are possible, the metadata for the majority of images of archaeological artefacts contains little information about the content of the image and the nature of the artefact, and is time-intensive to enter manually. DADAISM will transform the way in which archaeologists interact with online image collections. It will deploy user-centred design methodologies to create an interactive system that goes well beyond current systems for working with images, and will support archaeologists' tasks of finding, organising, relating and labelling images, as well as other relevant sources of information such as grey literature documents.
Abstract:
This dissertation presents the structuring of a system for indexing and visualising oral history video testimonies. Building on a review of the theoretical literature on indexing, the system resulted in a high-fidelity functional prototype. Its content was obtained by indexing 12 testimonies collected by the Museu da Pessoa team during the Memórias da Vila Madalena project in São Paulo (Aug/2012). Oral history archives such as the Museu da Pessoa, the Museu da Imagem e do Som, and the Centro de Pesquisa e Documentação de História Contemporânea do Brasil / CPDOC of the Fundação Getúlio Vargas hold thousands of hours of audio and video testimonies. These testimonies are generally long individual interviews covering many topics, which makes their analysis, their synthesis and, consequently, their retrieval difficult. Transcribing the testimonies enables textual searches for specific topics within the long interviews. Transcriptions can therefore be said to be the main source consulted by oral history researchers, leaving the primary source (the video) to a possible later stage of the research. The present proposal aims to broaden the retrieval of primary sources by indexing video segments, creating points of immediate access to relevant excerpts of the interviews. In this approach, the index terms (terms, tags or annotations) are associated not with the whole video but with in and out points (timecodes) that define specific excerpts of it. Tags combined with timecodes create new challenges and possibilities for indexing and navigating video files. The system structured here integrates concepts and techniques from apparently disconnected areas: indexing methodologies, taxonomy building, folksonomies, data visualisation and interaction design are combined in a unified process that runs from the collection and indexing of the testimonies to their visualisation and interaction.
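A minimal sketch of the indexing structure described: tags bound to in/out timecodes rather than to the whole video, with a lookup that returns the segments matching a tag. The class, file names and sample tags are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """A tagged excerpt of one testimony video."""
    video: str
    t_in: float    # entry timecode, seconds
    t_out: float   # exit timecode, seconds
    tags: set[str] = field(default_factory=set)

index: list[Segment] = [
    Segment("depoimento_01.mp4", 120.0, 185.5, {"vila madalena", "infancia"}),
    Segment("depoimento_01.mp4", 640.0, 712.0, {"comercio local"}),
    Segment("depoimento_07.mp4", 95.0, 160.0, {"vila madalena", "carnaval"}),
]

def find(tag: str) -> list[Segment]:
    """Immediate access points: every segment carrying the tag."""
    return [s for s in index if tag in s.tags]

for seg in find("vila madalena"):
    print(f"{seg.video} {seg.t_in:.0f}s-{seg.t_out:.0f}s")
```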
Abstract:
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
Abstract:
BACKGROUND: In recent years large bibliographic databases have made much of the published literature of biology available for searches. However, the capabilities of the search engines integrated into these databases for text-based bibliographic searches are limited. To enable searches that deliver the results expected by comparative anatomists, an underlying logical structure known as an ontology is required. DEVELOPMENT AND TESTING OF THE ONTOLOGY: Here we present the Mammalian Feeding Muscle Ontology (MFMO), a multi-species ontology focused on anatomical structures that participate in feeding and other oral/pharyngeal behaviors. A unique feature of the MFMO is that a simple, computable definition of each muscle, including its attachments and innervation, holds true across mammals. This construction mirrors the logical foundation of comparative anatomy and permits searches using language familiar to biologists. Further, it provides a template for muscles that will be useful in extending any anatomy ontology. The MFMO was developed to support the Feeding Experiments End-User Database Project (FEED, https://feedexp.org/), a publicly available online repository for physiological data collected from in vivo studies of feeding (e.g., mastication, biting, swallowing) in mammals. Currently the MFMO is integrated into FEED and also into two literature-specific implementations of Textpresso, a text-mining system that facilitates powerful searches of a corpus of scientific publications. We evaluate the MFMO by asking questions that test the ability of the ontology to return appropriate answers (competency questions), and we compare the results of queries of the MFMO to results from similar searches in PubMed and Google Scholar. RESULTS AND SIGNIFICANCE: Our tests demonstrate that the MFMO is competent to answer queries formed in the common language of comparative anatomy, whereas PubMed and Google Scholar are not. Overall, our results show that by incorporating anatomical ontologies into searches, an expanded and anatomically comprehensive set of results can be obtained. The broader scientific and publishing communities should consider taking up the challenge of semantically enabled search capabilities.
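A minimal sketch of the kind of competency query an anatomy ontology enables, using rdflib to run SPARQL over an OWL file; the file name, the ex: property IRI and the nerve label are placeholders, not the MFMO's actual identifiers.

```python
import rdflib

# Placeholder ontology file; rdflib parses RDF/XML OWL directly.
g = rdflib.Graph()
g.parse("mfmo.owl")

# Competency question (hypothetical vocabulary): which muscles are
# innervated by the trigeminal nerve?
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/anatomy#>

SELECT ?muscle ?label WHERE {
  ?muscle ex:innervated_by ?nerve ;
          rdfs:label ?label .
  ?nerve rdfs:label "trigeminal nerve" .
}
"""

for muscle, label in g.query(query):
    print(label)
```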
Abstract:
Dascalu, M., Stavarache, L.L., Dessus, P., Trausan-Matu, S., McNamara, D.S., & Bianco, M. (2015). ReaderBench: An Integrated Cohesion-Centered Framework. In G. Conole, T. Klobucar, C. Rensing, J. Konert & É. Lavoué (Eds.), 10th European Conf. on Technology Enhanced Learning (pp. 505–508). Toledo, Spain: Springer.
Abstract:
Purpose – The purpose of this paper is to present an analysis of the media representation of business ethics within 62 international newspapers, in order to explore the longitudinal and contextual evolution of business ethics and its associated terminology. Levels of coverage and contextual analysis of the content of the articles are used as surrogate measures of the penetration of business ethics concepts into society. Design/methodology/approach – The paper uses a text mining application based on two samples of data: an analysis of 62 national newspapers in 21 countries from 1990 to 2008, and an analysis of the content of two samples of articles containing the term business ethics (comprising 100 newspaper articles spread over an 18-year period, drawn from a sample of US and UK newspapers). Findings – The paper demonstrates increased coverage of sustainability topics within the media over the last 18 years, associated with events such as the Rio Summit. Whilst some peaks are associated with business ethics scandals, the overall coverage remains steady. There is little apparent use in the media of concepts such as corporate citizenship. The academic community and company ethical codes appear to adopt a wider definition of business ethics, more akin to that associated with sustainability, in comparison with the focus taken by the media, especially in the USA. Coverage demonstrates clear regional bias, and contextual analysis of the articles in the UK and USA also shows interesting parallels and divergences in the media representation of business ethics. Originality/value – The paper offers a promising avenue to explore how the evolution of sustainability issues, including business ethics, can be tracked within a societal context.
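A minimal sketch of the coverage measure used as a surrogate for penetration: counting, per year, the articles whose text mentions a term. The article records here are invented.

```python
from collections import Counter

# Invented article records: (year, text).
articles = [
    (1992, "Rio Summit puts sustainability on the agenda"),
    (2002, "Business ethics scandal engulfs energy giant"),
    (2002, "Corporate governance and business ethics under scrutiny"),
    (2008, "Banks face questions over business ethics"),
]

def coverage(term: str) -> Counter:
    """Articles per year whose text mentions the term (case-insensitive)."""
    return Counter(year for year, text in articles
                   if term.lower() in text.lower())

print(sorted(coverage("business ethics").items()))
# [(2002, 2), (2008, 1)]
```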
Abstract:
This paper is concerned with the language of policy documents in the field of health care, and with how ‘readings’ of such documents might be validated in the context of a narrative analysis. The substantive focus is a comparative study of UK health policy documents (N=20) produced by the various assemblies, governments and executives of England, Scotland, Wales and Northern Ireland during the period 2000-2009. Following an identification of some key characteristics of narrative structure, the authors indicate how text-mining strategies allied with features of semantic and network analysis can be used to unravel the basic elements of policy stories and to facilitate the presentation of data in such a way that readers can verify the strengths (and weaknesses) of any given analysis – with regard to claims concerning, say, the presence, absence, or relative importance of key ideas and concepts. Readers can also ‘see’ how the different components of any one story might fit together, get a sense of what has been excluded from the narrative as well as what has been included, and thereby assess the reliability and validity of the interpretations that have been placed upon the data.
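A minimal sketch of the network-analysis step alluded to: building a concept co-occurrence graph from documents, from which the presence and relative weight of key ideas can be read off. The per-document concept lists are invented.

```python
from itertools import combinations
import networkx as nx

# Invented per-document concept extractions.
docs = [
    {"patient choice", "waiting times", "markets"},
    {"patient choice", "markets"},
    {"waiting times", "partnership"},
]

# Edge weight = number of documents in which two concepts co-occur.
G = nx.Graph()
for concepts in docs:
    for a, b in combinations(sorted(concepts), 2):
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

for a, b, d in G.edges(data=True):
    print(f"{a} -- {b}: {d['weight']}")
```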
Abstract:
In this paper we discuss the differences between sustainability-related media agendas across different countries and regions. Utilising a sample of 115 leading national newspapers covering forty-one countries, we show that typically no homogeneous global trends exist with regard to sustainability-related media agendas. Instead, significant differences exist in the national-level prioritisation of sustainability-related issues in the countries under review. To some extent, these observed differences can be attributed to different levels of socioeconomic development as measured by Human Development Index scores and gross domestic product per capita. Here, generic differences can be identified between newspapers from the Global North and South, with a range of issues such as climate change emerging as typically Northern issues, whereas issues such as corruption and poverty show significantly higher levels of coverage across newspapers from the Global South. We conclude with a discussion of the results in the context of global environmental governance.
Abstract:
One of the major challenges in systems biology is to understand the complex responses of a biological system to external perturbations or internal signalling, depending on its biological conditions. Genome-wide transcriptomic profiling of cellular systems under various chemical perturbations allows certain features of the chemicals to manifest through their transcriptomic expression profiles. The insights obtained may help to establish connections between human diseases, associated genes and therapeutic drugs. The main objective of this study was to systematically analyse cellular gene expression data under various drug treatments to elucidate drug-feature-specific transcriptomic signatures. We first extracted drug-related information (drug features) from the collected textual descriptions of DrugBank entries using text-mining techniques. A novel statistical method employing orthogonal least squares learning was proposed to obtain drug-feature-specific signatures by integrating gene expression with DrugBank data. To obtain robust signatures from noisy input datasets, a stringent ensemble approach was applied combining three techniques: resampling, leave-one-out cross-validation, and aggregation. The validation experiments showed that the proposed method has the capacity to extract biologically meaningful drug-feature-specific gene expression signatures. Regulatory network analysis also showed that most of the signature genes are connected with common hub genes, and Gene Ontology analysis further showed these common hub genes to be related to general drug metabolism. Each set of genes has relatively few interactions with other sets, indicating the modular nature of each signature and its drug-feature specificity. Based on Gene Ontology analysis, we also found that each set of drug feature (DF)-specific genes was indeed enriched in biological processes related to the drug feature. The results of these experiments demonstrate the potential of the method for predicting certain features of new drugs from their transcriptomic profiles, providing a useful methodological framework and a valuable resource for drug development and characterization.
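A minimal sketch of the orthogonal least squares selection at the core of the method, run on invented data; the real method integrates DrugBank-derived drug features and wraps this step in an ensemble of resampling, leave-one-out cross-validation and aggregation, all of which this sketch omits.

```python
import numpy as np

def ols_select(P, y, k):
    """Forward orthogonal least squares: greedily pick k columns of P
    by error-reduction ratio, orthogonalising each candidate against
    the already-selected basis (Gram-Schmidt)."""
    n, m = P.shape
    yty = float(y @ y)
    selected, basis = [], []
    for _ in range(k):
        best, best_err, best_w = None, -1.0, None
        for i in range(m):
            if i in selected:
                continue
            w = P[:, i].astype(float).copy()
            for b in basis:
                w -= (b @ w) / (b @ b) * b
            denom = float(w @ w)
            if denom < 1e-12:   # candidate (near-)dependent on basis
                continue
            err = float(w @ y) ** 2 / (denom * yty)
            if err > best_err:
                best, best_err, best_w = i, err, w
        selected.append(best)
        basis.append(best_w)
    return selected

# Invented demo: expression of 50 genes; response driven by genes 3 and 7.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.1 * rng.normal(size=100)
print(ols_select(X, y, 2))  # expect [3, 7] in some order
```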
Abstract:
Doctoral thesis, Biotechnological Sciences (Food Biotechnology), Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2014
Automatic classification of scientific records using the German Subject Heading Authority File (SWD)
Abstract:
The following paper deals with an automatic text classification method which does not require training documents. The method uses the German Subject Heading Authority File (SWD) provided by the linked data service of the German National Library. Recently the SWD was enriched with notations of the Dewey Decimal Classification (DDC); in consequence, it became possible to use the subject headings as textual representations of the DDC notations. Basically, we derive the classification of a text from the classification of the words in the text given by the thesaurus. The method was tested by classifying 3826 OAI records from 7 different repositories. Mean reciprocal rank and recall were chosen as evaluation measures. Direct comparison to a machine learning method has shown that this method is clearly competitive. Thus we can conclude that the enriched version of the SWD provides high-quality information with broad coverage for the classification of German scientific articles.
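A minimal sketch of the training-free scheme described: subject headings carry DDC notations, so a document votes for the notations of the headings that occur in it, and candidates are ranked by accumulated votes. The heading-to-notation fragment and the document are invented, not taken from the SWD.

```python
from collections import Counter

# Invented fragment of a heading -> DDC-notation mapping (in the paper,
# derived from the SWD enriched with DDC notations).
heading_to_ddc = {
    "informatik": "004",
    "maschinelles lernen": "006.31",
    "bibliothek": "020",
    "klassifikation": "025.4",
}

def classify(text: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Rank DDC notations by how many mapped headings occur in the text."""
    text = text.lower()
    votes = Counter()
    for heading, notation in heading_to_ddc.items():
        if heading in text:
            votes[notation] += 1
    return votes.most_common(top_n)

doc = "Maschinelles Lernen zur Klassifikation in der Bibliothek"
print(classify(doc))  # e.g. [('006.31', 1), ('020', 1), ('025.4', 1)]
```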