5 results for Imprecise Data
in QUB Research Portal - Research Directory and Institutional Repository for Queen's University Belfast
Abstract:
We present TANC, a TAN classifier (tree-augmented naive) based on imprecise probabilities. TANC models prior near-ignorance via the Extreme Imprecise Dirichlet Model (EDM). A first contribution of this paper is the experimental comparison between EDM and the global Imprecise Dirichlet Model using the naive credal classifier (NCC), with the aim of showing that EDM is a sensible approximation of the global IDM. TANC is able to deal with missing data in a conservative manner by considering all possible completions (without assuming them to be missing-at-random), but avoiding an exponential increase of the computational time. By experiments on real data sets, we show that TANC is more reliable than the Bayesian TAN and that it provides better performance compared to previous TANs based on imprecise probabilities. Yet, TANC is sometimes outperformed by NCC because the learned TAN structures are too complex; this calls for novel algorithms for learning the TAN structures, better suited for an imprecise probability classifier.
Abstract:
In this paper we present TANC, i.e., a tree-augmented naive credal classifier based on imprecise probabilities; it models prior near-ignorance via the Extreme Imprecise Dirichlet Model (EDM) (Cano et al., 2007) and deals conservatively with missing data in the training set, without assuming them to be missing-at-random. The EDM is an approximation of the global Imprecise Dirichlet Model (IDM), which considerably simplifies the computation of upper and lower probabilities; yet, having been only recently introduced, the quality of the provided approximation still needs to be verified. As a first contribution, we extensively compare the output of the naive credal classifier (one of the few cases in which the global IDM can be exactly implemented) when learned with the EDM and the global IDM; the output of the classifier appears to be identical in the vast majority of cases, thus supporting the adoption of the EDM in real classification problems. Then, by experiments we show that TANC is more reliable than the precise TAN (learned with uniform prior), and also that it provides better performance compared to a previous (Zaffalon, 2003) TAN model based on imprecise probabilities. TANC treats missing data by considering all possible completions of the training set, while avoiding an exponential increase in computation time; finally, we present some preliminary results with missing data.
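The abstract above refers to upper and lower probabilities under the Imprecise Dirichlet Model. As a rough illustration only, the sketch below computes the standard IDM bounds for a single categorical variable, n_k/(N+s) and (n_k+s)/(N+s). It is not the EDM approximation or the TANC classifier described in the paper; the function name `idm_bounds`, the example counts, and the default `s = 1.0` are assumptions made for this example.

```python
from typing import Dict, Tuple


def idm_bounds(counts: Dict[str, int], s: float = 1.0) -> Dict[str, Tuple[float, float]]:
    """Lower and upper probability of each category under the standard IDM.

    counts: observed count per category (hypothetical input).
    s: prior strength of the IDM; must be positive (assumed default 1.0).
    """
    if s <= 0:
        raise ValueError("s must be positive")
    total = sum(counts.values())
    return {
        category: (n / (total + s), (n + s) / (total + s))
        for category, n in counts.items()
    }


# Illustrative counts, not taken from any data set in the paper.
bounds = idm_bounds({"a": 3, "b": 1}, s=1.0)
for category, (lower, upper) in bounds.items():
    print(f"{category}: [{lower:.2f}, {upper:.2f}]")
```

The width of each interval is s/(N+s), so the bounds tighten as more observations are collected.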
Abstract:
Background: Long working hours might increase the risk of cardiovascular disease, but prospective evidence is scarce, imprecise, and mostly limited to coronary heart disease. We aimed to assess long working hours as a risk factor for incident coronary heart disease and stroke.
Methods: We identified published studies through a systematic review of PubMed and Embase from inception to Aug 20, 2014. We obtained unpublished data for 20 cohort studies from the Individual-Participant-Data Meta-analysis in Working Populations (IPD-Work) Consortium and open-access data archives. We used cumulative random-effects meta-analysis to combine effect estimates from published and unpublished data.
Findings: We included 25 studies from 24 cohorts in Europe, the USA, and Australia. The meta-analysis of coronary heart disease comprised data for 603 838 men and women who were free from coronary heart disease at baseline; the meta-analysis of stroke comprised data for 528 908 men and women who were free from stroke at baseline. Follow-up for coronary heart disease was 5·1 million person-years (mean 8·5 years), in which 4768 events were recorded, and for stroke was 3·8 million person-years (mean 7·2 years), in which 1722 events were recorded. In cumulative meta-analysis adjusted for age, sex, and socioeconomic status, compared with standard hours (35-40 h per week), working long hours (≥55 h per week) was associated with an increase in risk of incident coronary heart disease (relative risk [RR] 1·13, 95% CI 1·02-1·26; p=0·02) and incident stroke (1·33, 1·11-1·61; p=0·002). The excess risk of stroke remained unchanged in analyses that addressed reverse causation, multivariable adjustments for other risk factors, and different methods of stroke ascertainment (range of RR estimates 1·30-1·42). We recorded a dose-response association for stroke, with RR estimates of 1·10 (95% CI 0·94-1·28; p=0·24) for 41-48 working hours, 1·27 (1·03-1·56; p=0·03) for 49-54 working hours, and 1·33 (1·11-1·61; p=0·002) for 55 working hours or more per week compared with standard working hours (p for trend<0·0001).
Interpretation: Employees who work long hours have a higher risk of stroke than those working standard hours; the association with coronary heart disease is weaker. These findings suggest that more attention should be paid to the management of vascular risk factors in individuals who work long hours.
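The abstract above combines study-level relative risks with a cumulative random-effects meta-analysis. The sketch below shows one common way such pooling can be done, using the DerSimonian-Laird estimator on the log scale. The abstract does not say which estimator the authors used, so that choice is an assumption, as are the function names `pool_random_effects` and `cumulative_pool`. The example inputs are hypothetical and are not taken from the study.

```python
import math
from typing import List, Tuple

Study = Tuple[float, float, float]  # (relative risk, 95% CI lower, 95% CI upper)


def pool_random_effects(studies: List[Study]) -> Tuple[float, float, float]:
    """Pool relative risks with a DerSimonian-Laird random-effects model.

    Assumes each CI is a symmetric 95% interval on the log scale with lower < upper.
    Returns (pooled RR, CI lower, CI upper).
    """
    z = 1.96
    y = [math.log(rr) for rr, _, _ in studies]
    # Recover each study's variance from the width of its confidence interval.
    var = [((math.log(hi) - math.log(lo)) / (2 * z)) ** 2 for _, lo, hi in studies]
    w = [1.0 / v for v in var]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Cochran's Q and the between-study variance tau^2 (truncated at zero).
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    k = len(studies)
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in var]
    sws = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sws
    se = math.sqrt(1.0 / sws)
    return math.exp(pooled), math.exp(pooled - z * se), math.exp(pooled + z * se)


def cumulative_pool(studies: List[Study]) -> List[Tuple[float, float, float]]:
    """Re-pool after adding each study in the given order (cumulative meta-analysis)."""
    return [pool_random_effects(studies[:i]) for i in range(1, len(studies) + 1)]


# Hypothetical inputs for illustration only.
example = [(1.20, 0.90, 1.60), (1.40, 1.05, 1.87), (1.10, 0.85, 1.42)]
for step, (rr, lo, hi) in enumerate(cumulative_pool(example), start=1):
    print(f"after {step} studies: RR {rr:.2f} ({lo:.2f}-{hi:.2f})")
```

In a cumulative analysis the order of the studies matters for the intermediate estimates; the final pooled estimate is the same regardless of order.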
Abstract:
Master data management (MDM) integrates data from multiple structured data sources and builds a consolidated 360-degree view of business entities such as customers and products. Today’s MDM systems are not prepared to integrate information from unstructured data sources, such as news reports, emails, call-center transcripts, and chat logs. However, those unstructured data sources may contain valuable information about the same entities known to MDM from the structured data sources. Integrating information from unstructured data into MDM is challenging as textual references to existing MDM entities are often incomplete and imprecise and the additional entity information extracted from text should not impact the trustworthiness of MDM data.
In this paper, we present an architecture for making MDM text-aware and showcase its implementation as IBM InfoSphere MDM Extension for Unstructured Text Correlation, an add-on to IBM InfoSphere Master Data Management Standard Edition. We highlight how MDM benefits from additional evidence found in documents when doing entity resolution and relationship discovery. We experimentally demonstrate the feasibility of integrating information from unstructured data sources into MDM.
Abstract:
Perfect information is seldom available to man or machines due to uncertainties inherent in real-world problems. Uncertainties in geographic information systems (GIS) stem from either vague/ambiguous or imprecise/inaccurate/incomplete information, and it is necessary for GIS to develop tools and techniques to manage these uncertainties. There is widespread agreement in the GIS community that although GIS has the potential to support a wide range of spatial data analysis problems, this potential is often hindered by the lack of consistency and uniformity. Uncertainties come in many shapes and forms, and processing uncertain spatial data requires a practical taxonomy to aid decision makers in choosing the most suitable data modeling and analysis method. In this paper, we: (1) review important developments in handling uncertainties when working with spatial data and GIS applications; (2) propose a taxonomy of models for dealing with uncertainties in GIS; and (3) identify current challenges and future research directions in spatial data analysis and GIS for managing uncertainties.