163 results for Guido, Tomás
Abstract:
Objective This paper presents an automatic active learning-based system for the extraction of medical concepts from clinical free-text reports. Specifically, we determine (1) the contribution of active learning in reducing the annotation effort, and (2) the robustness of the incremental active learning framework across different selection criteria and datasets. Materials and methods The comparative performance of an active learning framework and a fully supervised approach was investigated to study how active learning reduces the annotation effort while achieving the same effectiveness as a supervised approach. Conditional Random Fields were used as the supervised method, with least confidence and information density as the two selection criteria for the active learning framework. The effect of incremental vs. standard learning on the robustness of the models within the active learning framework, under different selection criteria, was also investigated. Two clinical datasets were used for evaluation: the i2b2/VA 2010 NLP challenge and the ShARe/CLEF 2013 eHealth Evaluation Lab. Results The annotation effort saved by active learning to achieve the same effectiveness as supervised learning is up to 77%, 57%, and 46% of the total number of sequences, tokens, and concepts, respectively. Compared to the random sampling baseline, the saving is at least doubled. Discussion Incremental active learning guarantees robustness across all selection criteria and datasets. The reduction in annotation effort is always above the random sampling and longest sequence baselines. Conclusion Incremental active learning is a promising approach for building effective and robust medical concept extraction models while significantly reducing the burden of manual annotation.
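A minimal sketch of the least confidence criterion mentioned above, assuming a hypothetical sequence tagger that exposes the probability of its own most likely label sequence (the interface names are illustrative, not the authors' implementation):

```python
# Least-confidence sample selection for sequence labelling (sketch).
# `model` is a hypothetical tagger with a `best_sequence_probability`
# method; real CRF toolkits expose an equivalent quantity.

def least_confidence(model, unlabelled, batch_size=10):
    """Pick the sequences the model is least confident about."""
    # Confidence of a sequence = probability of its best label sequence;
    # least confidence selects the sequences minimising that probability.
    scored = [(model.best_sequence_probability(x), x) for x in unlabelled]
    scored.sort(key=lambda pair: pair[0])          # least confident first
    return [x for _, x in scored[:batch_size]]

# Typical active learning loop: query, annotate, retrain, repeat.
# while unlabelled:
#     batch = least_confidence(model, unlabelled)
#     labelled += annotate(batch)                  # manual annotation step
#     unlabelled = [x for x in unlabelled if x not in batch]
#     model.fit(labelled)                          # standard retraining;
#                                                  # incremental learning
#                                                  # would update the model
#                                                  # instead of refitting
```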
Abstract:
This paper presents a new active learning query strategy for information extraction, called Domain Knowledge Informativeness (DKI). Active learning is often used to reduce the amount of annotation effort required to obtain training data for machine learning algorithms. A key component of an active learning approach is the query strategy, which is used to iteratively select samples for annotation. Knowledge resources have been used in information extraction as a means to derive additional features for sample representation. DKI is, however, the first query strategy that exploits such resources to inform sample selection. To evaluate the merits of DKI, in particular with respect to the reduction in annotation effort that the new query strategy makes possible, we conduct a comprehensive empirical comparison of active learning query strategies for information extraction within the clinical domain. The clinical domain was chosen for this work because of the availability of extensive structured knowledge resources, which have often been exploited for feature generation. In addition, the clinical domain offers a compelling use case for active learning because of the high costs and hurdles associated with obtaining annotations in this domain. Our experimental findings demonstrate that (1) amongst existing query strategies, those based on the classification model's confidence are a better choice for clinical data, as they perform equally well with a much lighter computational load, and (2) significant reductions in annotation effort are achievable by exploiting knowledge resources within active learning query strategies, with up to 14% fewer tokens and concepts to manually annotate than with state-of-the-art query strategies.
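The abstract does not spell out DKI's scoring function; a hedged sketch of the general idea is a query strategy that combines model uncertainty with how strongly a sample matches a domain knowledge resource. The weighting scheme and all names below are assumptions, not the published DKI formulation:

```python
# Sketch of a knowledge-informed query strategy in the spirit of DKI.
# The interpolation and the term-matching scheme are illustrative only.

def knowledge_informativeness(tokens, knowledge_terms):
    """Fraction of tokens that appear in the knowledge resource."""
    hits = sum(1 for t in tokens if t.lower() in knowledge_terms)
    return hits / max(len(tokens), 1)

def dki_score(model_uncertainty, tokens, knowledge_terms, alpha=0.5):
    # Higher = more worth annotating: uncertain AND rich in domain terms.
    return (alpha * model_uncertainty
            + (1 - alpha) * knowledge_informativeness(tokens, knowledge_terms))

# Toy clinical vocabulary standing in for a resource such as SNOMED CT:
vocab = {"hypertension", "metformin", "dyspnoea"}
print(dki_score(0.8, "patient reports dyspnoea on exertion".split(), vocab))
```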
Abstract:
PBDE concentrations are higher in children than in adults, with dust ingestion suggested as one of the exposure pathways. Besides the home environment, children spend a great deal of time in school classrooms, which may be a source of exposure. As part of the “Ultrafine Particles from Traffic Emissions and Children's Health (UPTECH)” project, dust samples (n=28) were obtained in 2011/12 from 10 metropolitan schools in Brisbane, Australia, and analysed using GC and LC–MS for polybrominated diphenyl ethers (PBDEs) -17, -28, -47, -49, -66, -85, -99, -100, -154, -183, and -209. Σ11PBDEs ranged from 11 to 2163 ng/g dust, with a mean and median of 600 and 469 ng/g dust, respectively. BDE-209 (range n.d.–2034 ng/g dust; mean (median) 402 (217) ng/g dust) was the dominant congener in most classrooms. Frequencies of detection were 96%, 96%, 39% and 93% for BDE-47, -99, -100 and -209, respectively. No seasonal variations were apparent, and in each of the two schools where XRF measurements were carried out, only two classroom items had detectable bromine. Based on mean PBDE concentrations, intake for 8–11 year olds from ingestion of classroom dust can be estimated at 0.094 ng/day of BDE-47, 0.187 ng/day of BDE-99 and 0.522 ng/day of BDE-209. The 97.5th percentile intake is estimated to be 0.62, 1.03 and 2.14 ng/day for BDEs -47, -99 and -209, respectively. These PBDE concentrations in classroom dust, which are higher than in Australian homes, may explain some of the higher body burden of PBDEs in children compared to adults, taking into consideration age-dependent behaviours that increase dust ingestion.
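The abstract does not restate the intake calculation; the standard form behind estimates like these, assuming a daily dust ingestion rate $IR$ and a fraction $f$ of that ingestion attributable to time spent in the classroom (both are assumptions here, not values from the study), is:

$$\text{Intake (ng/day)} = C_{\text{dust}}\;(\text{ng/g}) \times IR\;(\text{g/day}) \times f$$

so each congener's estimated intake scales linearly with its mean classroom dust concentration.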
Abstract:
In [8] the authors developed a logical system based on the definition of a new non-classical connective ⊗ capturing the notion of reparative obligation. The system proved to be appropriate for handling well-known contrary-to-duty paradoxes, but no model-theoretic semantics was presented. In this paper we fill the gap and define a suitable possible-world semantics for the system, for which we prove soundness and completeness. The semantics is a preference-based, non-normal one, extending and generalizing semantics for classical modal logics.
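The abstract does not restate the reading of ⊗; in reparative (contrary-to-duty) settings such a connective is usually read as an ordered chain of obligations. The following illustration is ours, not a quote from [8]:

$$a \otimes b \;:\; a \text{ is obligatory, and if } a \text{ is violated, then } b \text{ is obligatory as reparation.}$$

For instance, "do not damage the goods ⊗ pay compensation" prescribes compensation exactly when the primary obligation is violated, which is the contrary-to-duty pattern the well-known paradoxes turn on.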
Abstract:
Previous qualitative research has highlighted that temporality plays an important role in relevance for clinical records search. In this study, an investigation is undertaken to determine the effect that the timespan of events within a patient record has on relevance in a retrieval scenario. In addition, based on the standard practice of document length normalisation, a document timespan normalisation model that specifically accounts for timespans is proposed. Initial analysis revealed that, in general, relevant patient records tended to cover a longer timespan of events than non-relevant patient records. However, an empirical evaluation using the TREC Medical Records track supports the opposite view, that shorter documents (in terms of timespan) are better for retrieval. These findings highlight that the role of temporality in relevance is complex, and how to effectively deal with temporality within a retrieval scenario remains an open question.
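The paper's exact model is not given in the abstract; a sketch by analogy with pivoted document length normalisation, replacing length with timespan (the slope parameter and the penalising direction are assumptions), might look like this:

```python
# Hedged sketch: timespan normalisation by analogy with pivoted document
# length normalisation. Not the paper's published model.

def timespan_norm(timespan_days, avg_timespan_days, slope=0.2):
    """Pivoted normaliser: >1 for records spanning more than average."""
    return 1.0 - slope + slope * (timespan_days / avg_timespan_days)

def normalised_score(raw_score, timespan_days, avg_timespan_days):
    # Dividing by the normaliser penalises long-timespan records, in line
    # with the empirical finding that shorter timespans helped retrieval.
    return raw_score / timespan_norm(timespan_days, avg_timespan_days)

# A record spanning two years, against a one-year collection average:
print(normalised_score(10.0, timespan_days=730, avg_timespan_days=365))
```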
Abstract:
Non-monotonic reasoning typically deals with three kinds of knowledge. Facts are meant to describe immutable statements about the environment. Rules define relationships among elements. Lastly, an ordering among the rules, in the form of a superiority relation, establishes their relative strength. To revise a non-monotonic theory, we can change any one of these three elements. We prove that the problem of revising a non-monotonic theory by changing only the superiority relation is NP-complete.
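The abstract gives no example; a standard defeasible-logic illustration (ours, not the paper's) shows how changing only the superiority relation revises what the theory concludes:

$$r_1: \mathit{bird}(x) \Rightarrow \mathit{flies}(x) \qquad\qquad r_2: \mathit{penguin}(x) \Rightarrow \neg\mathit{flies}(x)$$

Given the facts $\mathit{bird}(tweety)$ and $\mathit{penguin}(tweety)$, setting $r_2 > r_1$ yields $\neg\mathit{flies}(tweety)$; flipping the relation to $r_1 > r_2$ makes the theory conclude $\mathit{flies}(tweety)$, with the facts and rules left untouched.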
Abstract:
Concept mapping involves determining the relevant concepts in a free-text input, where concepts are defined in an external reference ontology. This is an important process that underpins many applications: clinical information reporting, derivation of phenotypic descriptions, and a number of state-of-the-art medical information retrieval methods. Concept mapping can be cast as an information retrieval (IR) problem: free-text mentions are treated as queries, and concepts from a reference ontology as the documents to be indexed and retrieved. This paper presents an empirical investigation applying general-purpose IR techniques to concept mapping in the medical domain. A dataset used for evaluating medical information extraction is adapted to measure the effectiveness of the considered IR approaches. The standard IR approaches used here are contrasted with two established benchmark methods specifically developed for medical concept mapping. The empirical findings show that the IR approaches are comparable with one benchmark method but well below the best benchmark.
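A minimal sketch of the IR casting described above: concept descriptions from a toy reference ontology are indexed as "documents", and a free-text mention is the "query". Plain TF cosine scoring stands in for whatever IR models are actually evaluated in the paper, and the ontology entries are illustrative:

```python
# Concept mapping cast as retrieval over ontology entries (sketch).
import math
from collections import Counter

ontology = {
    "C0020538": "hypertensive disease high blood pressure",
    "C0011849": "diabetes mellitus",
    "C0004096": "asthma bronchial",
}

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def map_mention(mention):
    query = Counter(mention.lower().split())
    docs = {cid: Counter(text.split()) for cid, text in ontology.items()}
    return max(docs, key=lambda cid: cosine(query, docs[cid]))

print(map_mention("high blood pressure"))  # -> C0020538
```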
Abstract:
Background Bien Hoa and Da Nang airbases were bulk storage sites for Agent Orange during the Vietnam War and are currently the two most severe dioxin hot spots. Objectives This study assesses the health risk of exposure to dioxin through foods for local residents living in seven wards surrounding these airbases. Methods The study follows the Australian Environmental Health Risk Assessment Framework to assess the health risk of exposure to dioxin in foods. Forty-six pooled samples of commonly consumed local foods were collected and analyzed for dioxins/furans. A food frequency and Knowledge–Attitude–Practice survey was also undertaken at 1000 local households; various stakeholders were involved, and related publications were reviewed. Results Total dioxin/furan concentrations in samples of local “high-risk” foods (e.g. free-range chicken meat and eggs, ducks, freshwater fish, snail and beef) ranged from 3.8 pg TEQ/g to 95 pg TEQ/g, while in “low-risk” foods (e.g. caged chicken meat and eggs, seafood, pork, leafy vegetables, fruits, and rice) concentrations ranged from 0.03 pg TEQ/g to 6.1 pg TEQ/g. Estimated daily intake of dioxin for people who did not consume local high-risk foods ranged from 3.2 pg TEQ/kg bw/day to 6.2 pg TEQ/kg bw/day (Bien Hoa) and from 1.2 pg TEQ/kg bw/day to 4.3 pg TEQ/kg bw/day (Da Nang). Consumption of local high-risk foods resulted in extremely high dioxin daily intakes (60.4–102.8 pg TEQ/kg bw/day in Bien Hoa; 27.0–148.0 pg TEQ/kg bw/day in Da Nang). Conclusions Consumption of local “high-risk” foods increases dioxin daily intakes far above the WHO recommended TDI (1–4 pg TEQ/kg bw/day). Practicing appropriate preventive measures is necessary to significantly reduce exposure and health risk.
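The abstract does not restate the intake calculation; the standard estimated daily intake (EDI) form that underlies figures like those above is:

$$\mathrm{EDI}\;(\text{pg TEQ/kg bw/day}) = \frac{\sum_i C_i \times \mathit{IR}_i}{\mathit{bw}}$$

where $C_i$ is the dioxin/furan concentration in food $i$ (pg TEQ/g), $\mathit{IR}_i$ is the daily consumption of that food (g/day, obtainable from the food frequency survey), and $\mathit{bw}$ is body weight (kg).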
Abstract:
Bien Hoa Airbase was one of the bulk storage and supply facilities for defoliants during the Vietnam War. Environmental and biological samples taken around the airbase have elevated levels of dioxin. In 2007, a pre-intervention knowledge, attitude and practice (KAP) survey of local residents living in Trung Dung and Tan Phong wards was undertaken regarding appropriate strategies to reduce dioxin exposure. A risk reduction programme was implemented in 2008, and post-intervention KAP surveys were undertaken in 2009 and 2013 to evaluate the longer-term impacts. Quantitative assessment was undertaken via a KAP survey in 2013 among 600 local residents randomly selected from the two intervention wards and one control ward (Buu Long). Eight in-depth interviews and two focus group discussions were also undertaken for qualitative assessment. Most programme activities had ceased and dioxin risk communication activities had not been integrated into local routine health education programmes; however, the main results generally persisted and were better than those in Buu Long. In total, 48.2% of households undertook measures to prevent exposure, higher than in the pre-intervention and earlier post-intervention surveys (25.8% and 39.7%) and the control ward (7.7%). Migration and the sensitive nature of dioxin issues were the main challenges for the programme's sustainability.
Abstract:
Permissions are a special case of deontic effects and play an important role in compliance. Essentially, they are used to determine obligations or prohibitions to the contrary. A formal language (e.g., temporal logic, event calculus, etc.) that is not able to represent permissions is doomed to be unable to represent most real-life legal norms. In this paper we address this issue and extend the deontic event calculus (DEC) with new predicates for modelling permissions, enabling it to elegantly capture the intuition behind real-life cases of permissions.
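The abstract does not list the new predicates; purely as an illustration of what an event-calculus-style permission predicate could look like (the predicate $\mathit{Permitted}$ and the fluents below are hypothetical, only $\mathit{HoldsAt}$ is standard event calculus):

$$\mathit{Permitted}(a, t) \leftarrow \mathit{HoldsAt}(\mathit{licence}, t) \wedge \neg\,\mathit{HoldsAt}(\mathit{forbidden}(a), t)$$

Reading: action $a$ is permitted at time $t$ if a licence fluent holds and no prohibition on $a$ holds, i.e., the permission derogates an obligation or prohibition to the contrary.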
Abstract:
This study investigates the use of unsupervised features derived from word embedding approaches and from novel sequence representation approaches for improving clinical information extraction systems. Our results corroborate previous findings indicating that the use of word embeddings significantly improves the effectiveness of concept extraction models; we further determine the influence of the corpora used to generate such features. We also demonstrate the promise of sequence-based unsupervised features for further improving concept extraction.
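A hedged sketch of one common way to turn word embeddings into discrete features for concept extraction: train embeddings on an unlabelled corpus, cluster them, and use each token's cluster id as a feature for a tagger such as a CRF. The corpus and hyperparameters below are illustrative only, and this may differ from the exact feature scheme in the paper:

```python
# Embedding-derived cluster features for sequence labelling (sketch).
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

corpus = [["patient", "denies", "chest", "pain"],
          ["chest", "pain", "resolved", "after", "nitroglycerin"]]

# Skip-gram embeddings trained on the (toy) unlabelled corpus.
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2,
               min_count=1, sg=1, seed=1)

vocab = w2v.wv.index_to_key
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(
    [w2v.wv[w] for w in vocab])
cluster_of = dict(zip(vocab, kmeans.labels_))

# Discrete feature usable alongside lexical/orthographic CRF features:
print({w: int(c) for w, c in cluster_of.items()})
```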
Abstract:
Recent advances in neural language models have contributed new methods for learning distributed vector representations of words (also called word embeddings). Two such methods are the continuous bag-of-words model and the skip-gram model. These methods have been shown to produce embeddings that capture higher-order relationships between words and that are highly effective in natural language processing tasks involving word similarity and word analogy. Despite these promising results, there has been little analysis of the use of these word embeddings for retrieval. Motivated by these observations, in this paper we set out to determine how these word embeddings can be used within a retrieval model and what the benefit might be. To this end, we use neural word embeddings within the well-known translation language model for information retrieval. This language model captures implicit semantic relations between the words in queries and those in relevant documents, thus producing more accurate estimations of document relevance. The word embeddings used to estimate neural language models produce translations that differ from previous translation language model approaches, and these differences deliver improvements in retrieval effectiveness. The models are robust to choices made in building word embeddings; even more so, our results show that embeddings do not even need to be produced from the same corpus used for retrieval.
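A common formulation consistent with this description plugs embedding-based translation probabilities into the translation language model (the cosine normalisation over the vocabulary $V$ is an assumption on our part):

$$P(q \mid d) = \prod_{q_i \in q} \sum_{w \in d} p_t(q_i \mid w)\, P(w \mid d), \qquad p_t(q_i \mid w) = \frac{\cos(\vec{q}_i, \vec{w})}{\sum_{v \in V} \cos(\vec{v}, \vec{w})}$$

Terms whose embeddings lie close to a document's terms thus contribute probability mass even when they do not literally occur in the document, which is how the implicit semantic relations enter the relevance estimate.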
Abstract:
Objective Death certificates provide an invaluable source for cancer mortality statistics; however, this value can only be realised if accurate, quantitative data can be extracted from certificates – an aim hampered by both the volume and the variable nature of certificates written in natural language. This paper proposes an automatic classification system for identifying cancer-related causes of death from death certificates. Methods Detailed features, including terms, n-grams and SNOMED CT concepts, were extracted from a collection of 447,336 death certificates. These features were used to train Support Vector Machine classifiers (one classifier for each cancer type). The classifiers were deployed in a cascaded architecture: the first level identified the presence of cancer (i.e., binary cancer/no-cancer) and the second level identified the type of cancer (according to the ICD-10 classification system). A held-out test set was used to evaluate the effectiveness of the classifiers according to precision, recall and F-measure. In addition, a detailed feature analysis was performed to reveal the characteristics of a successful cancer classification model. Results The system was highly effective at identifying cancer as the underlying cause of death (F-measure 0.94). The system was also effective at determining the type of cancer for common cancers (F-measure 0.7). Rare cancers, for which there was little training data, were difficult to classify accurately (F-measure 0.12). Factors influencing performance were the amount of training data and certain ambiguous cancers (e.g., those in the stomach region). The feature analysis revealed that a combination of features was important for cancer type classification, with SNOMED CT concept and oncology-specific morphology features proving the most valuable. Conclusion The system proposed in this study provides automatic identification and characterisation of cancers from large collections of free-text death certificates. This allows organisations such as cancer registries to monitor and report on cancer mortality in a timely and accurate manner. In addition, the methods and findings are applicable beyond cancer classification and to sources of medical text other than death certificates.
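A hedged sketch of the two-level cascade described above, using scikit-learn in place of whatever SVM implementation the authors used. Features here are plain word n-grams; the SNOMED CT concept features would be added as extra feature columns in practice, and the data is illustrative:

```python
# Two-level cascaded classification of death certificates (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

certificates = ["metastatic carcinoma of stomach",
                "ischaemic heart disease",
                "small cell carcinoma of lung"]
is_cancer = [1, 0, 1]
cancer_type = ["C16", "C34"]            # ICD-10 codes, cancer cases only

# Level 1: binary cancer / no-cancer.
level1 = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
level1.fit(certificates, is_cancer)

# Level 2: cancer type, trained on the cancer cases only.
cancer_docs = [d for d, y in zip(certificates, is_cancer) if y == 1]
level2 = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
level2.fit(cancer_docs, cancer_type)

def classify(text):
    # Only certificates flagged as cancer reach the type classifier.
    if level1.predict([text])[0] == 1:
        return level2.predict([text])[0]
    return "no cancer"

print(classify("adenocarcinoma of the stomach"))
```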