165 results for predictive accuracy
in Queensland University of Technology - ePrints Archive
Abstract:
This paper addresses the problem of predicting the outcome of an ongoing case of a business process based on event logs. In this setting, the outcome of a case may refer, for example, to the achievement of a performance objective or the fulfillment of a compliance rule upon completion of the case. Given a log consisting of traces of completed cases, given a trace of an ongoing case, and given two or more possible outcomes (e.g., a positive and a negative outcome), the paper addresses the problem of determining the most likely outcome for the case in question. Previous approaches to this problem are largely based on simple symbolic sequence classification, meaning that they extract features from traces seen as sequences of event labels, and use these features to construct a classifier for runtime prediction. In doing so, these approaches ignore the data payload associated with each event. This paper approaches the problem from a different angle by treating traces as complex symbolic sequences, that is, sequences of events each carrying a data payload. In this context, the paper outlines different feature encodings of complex symbolic sequences and compares their predictive accuracy on real-life business process event logs.
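As a concrete illustration of the kind of feature encoding the paper compares, the sketch below flattens a prefix of a complex symbolic sequence into one fixed-width vector: an index-based one-hot of the event label at each position plus the numeric payload attributes at that position. This is an illustrative construction under stated assumptions, not the paper's code; the trace, activity names and `amount` attribute are invented.

```python
# Hypothetical index-based encoding for complex symbolic sequences:
# a trace is a list of (activity, payload) events, and a prefix of
# length k is flattened into one fixed-width feature vector.
def index_encode(prefix, k, activities, attrs):
    """Encode the first k events of a trace; pad short prefixes with None."""
    vec = []
    events = list(prefix[:k]) + [None] * max(0, k - len(prefix))
    for ev in events:
        # one-hot of the event label at this index (all zeros when padded)
        label = ev[0] if ev else None
        vec.extend(1.0 if label == a else 0.0 for a in activities)
        # numeric payload attributes at this index (0.0 when missing)
        payload = ev[1] if ev else {}
        vec.extend(float(payload.get(a, 0.0)) for a in attrs)
    return vec

# Invented three-event trace of a claims-handling process.
trace = [("register", {"amount": 120.0}),
         ("check", {"amount": 120.0}),
         ("decide", {"amount": 120.0})]
v = index_encode(trace, k=4, activities=["register", "check", "decide"],
                 attrs=["amount"])
```

A vector like `v` (here 4 positions × 4 features = 16 entries) can then be fed to any standard classifier trained on prefixes of completed cases.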
Abstract:
The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains, and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia-related complications. Stress electrocardiography/exercise testing is predictive of 10-year risk of CVD events, and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and predictive data mining methods are applied to these data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for removal of outliers, calculation of moving averages, and data summarisation and data abstraction. Feature selection methods of both wrapper and filter types are applied to derived physiological time-series variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed. The major dataset is derived from the entire anaesthesia population, and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads).
Because of the unbalanced class distribution in the data, majority-class under-sampling and the Kappa statistic, together with misclassification rate and area under the ROC curve (AUC), are used for evaluation of models generated using different prediction algorithms. The performance of models derived from feature-reduced datasets reveals the filter method, Cfs subset evaluation, to be most consistently effective, although Consistency-derived subsets tended to slightly increase accuracy at the cost of markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This could be eliminated by consideration of the AUC or Kappa statistic, as well as by evaluation of subsets with an under-sampled majority class. The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62, with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection reduces MR to 9.8-10.16, with time-segmented summary data (dataset F) MR being 9.8 and raw time-series summary data (dataset A) being 9.92. However, for all time-series-only datasets, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series-alone datasets, but models derived from these subsets are of one leaf only. MR values are consistent with class distribution in the subset folds evaluated in the n-fold cross-validation method. For models based on Cfs-selected time-series-derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36, with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time-segmented time-series variables and RF) being 9.09.
The models based on counts of outliers and counts of data points outside the normal range (dataset RF_E) and on variables derived from time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (dataset RF_G) perform the least well, with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28, while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are the most comprehensible and clinically relevant. The predictive accuracy increase achieved by adding risk factor variables to time-series-based models is significant. The addition of time-series-derived variables to models based on risk factor variables alone is associated with a trend towards improved performance. Data mining of feature-reduced anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone. The limited additional value provided by electrocardiographic variables compared to risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The pre-processing used in this study had limited effect when time-series variables and risk factor variables are used as model input.
In the absence of risk factor input, the use of time-series variables after outlier removal, and of time-series variables based on physiological values being outside the accepted normal range, is associated with some improvement in model performance.
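The motivation for preferring the Kappa statistic and AUC over raw misclassification rate under an unbalanced class distribution can be shown with a small sketch (mine, not the thesis code): a classifier that always predicts the majority class achieves a low MR yet a kappa of zero, because it agrees with the truth no better than chance.

```python
# Illustrative only: Cohen's kappa and misclassification rate (MR)
# from a 2x2 confusion matrix, showing why kappa is less flattering
# to a majority-class classifier than raw MR.
def kappa_and_mr(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    observed = (tp + tn) / n                      # observed agreement (accuracy)
    # expected agreement under chance, from the row/column marginals
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((tn + fp) / n) * ((tn + fn) / n)
    expected = p_pos + p_neg
    kappa = (observed - expected) / (1 - expected)
    return kappa, 1 - observed                    # (kappa, MR)

# A classifier that always predicts the negative majority class on a
# 90:10 split: MR is only 0.10, but kappa is exactly 0.
k, mr = kappa_and_mr(tp=0, fp=0, fn=10, tn=90)
```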
Abstract:
BACKGROUND AND AIMS: Crohn's disease (CD) is an inflammatory bowel disease (IBD) caused by a combination of genetic, clinical, and environmental factors. Identification of CD patients at high risk of requiring surgery may assist clinicians to decide on a top-down or step-up treatment approach. METHODS: We conducted a retrospective case-control analysis of a population-based cohort of 503 CD patients. A regression-based data reduction approach was used to systematically analyse 63 genomic, clinical and environmental factors for association with IBD-related surgery as the primary outcome variable. RESULTS: A multi-factor model was identified that yielded the highest predictive accuracy for need for surgery. The factors included in the model were the NOD2 genotype (OR = 1.607, P = 2.3 × 10^-5), having ever had perianal disease (OR = 2.847, P = 4 × 10^-6), being a post-diagnosis smoker (OR = 6.312, P = 7.4 × 10^-3), being an ex-smoker at diagnosis (OR = 2.405, P = 1.1 × 10^-3) and age (OR = 1.012, P = 4.4 × 10^-3). Diagnostic testing for this multi-factor model produced an area under the curve of 0.681 (P = 1 × 10^-4) and an odds ratio of 3.169 (95% CI; P = 1 × 10^-4), which was higher than any factor considered independently. CONCLUSIONS: The results of this study require validation in other populations but represent a step forward in the development of more accurate prognostic tests, allowing clinicians to prescribe the optimal treatment approach for complicated CD patients.
Abstract:
Background: Patients with Crohn’s disease (CD) often require surgery at some stage of disease course. Prediction of CD outcome is influenced by clinical, environmental, serological, and genetic factors (eg, NOD2). Being able to identify CD patients at high risk of surgical intervention should assist clinicians to decide whether or not to prescribe early aggressive treatment with immunomodulators. Methods: We performed a retrospective analysis of selected clinical (age at diagnosis, perianal disease, active smoking) and genetic (NOD2 genotype) data obtained for a population-based CD cohort from the Canterbury Inflammatory Bowel Disease study. Logistic regression was used to identify predictors of complicated outcome in these CD patients (ie, need for inflammatory bowel disease-related surgery). Results: Perianal disease and the NOD2 genotype were the only independent factors associated with the need for surgery in this patient group (odds ratio=2.84 and 1.60, respectively). By combining the associated NOD2 genotype with perianal disease we generated a single “clinicogenetic” variable. This was strongly associated with increased risk of surgery (odds ratio=3.84, P=0.00, confidence interval, 2.28-6.46) and offered moderate predictive accuracy (positive predictive value=0.62). Approximately 1/3 of surgical outcomes in this population are attributable to the NOD2+PA variable (attributable risk=0.32). Conclusions: Knowledge of perianal disease and NOD2 genotype in patients presenting with CD may offer clinicians some decision-making utility for early diagnosis of complicated CD progression and initiating intensive treatment to avoid surgical intervention. Future studies should investigate combination effects of other genetic, clinical, and environmental factors when attempting to identify predictors of complicated CD outcomes.
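The odds ratio and positive predictive value reported for the combined "clinicogenetic" variable can both be derived from a 2x2 table of the risk variable against the surgery outcome. The sketch below is illustrative only; the counts are invented, not the Canterbury cohort's.

```python
# Illustrative only: odds ratio and positive predictive value from a
# 2x2 table of a binary risk variable (e.g. NOD2 genotype + perianal
# disease) against a binary outcome (IBD-related surgery).
def two_by_two_stats(a, b, c, d):
    """a: exposed with surgery, b: exposed without,
    c: unexposed with surgery, d: unexposed without."""
    odds_ratio = (a / b) / (c / d)
    ppv = a / (a + b)   # P(surgery | risk variable present)
    return odds_ratio, ppv

# Invented counts for a 150-patient toy cohort.
or_, ppv = two_by_two_stats(a=30, b=20, c=25, d=75)
```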
Abstract:
Biodiesel, produced from renewable feedstocks, represents a more sustainable source of energy and will therefore play a significant role in providing the energy requirements for transportation in the near future. Chemically, all biodiesels are fatty acid methyl esters (FAME), produced from raw vegetable oil and animal fat. However, clear differences in chemical structure are apparent from one feedstock to the next in terms of chain length, degree of unsaturation, number of double bonds and double bond configuration, which all determine the fuel properties of biodiesel. In this study, prediction models were developed to estimate the kinematic viscosity of biodiesel using an Artificial Neural Network (ANN) modelling technique. In developing the model, 27 parameters based on chemical composition commonly found in biodiesel were used as input variables, and kinematic viscosity of biodiesel was used as the output variable. The data needed to develop and simulate the network were collected from more than 120 published peer-reviewed papers. The Neural Networks Toolbox of MATLAB R2012a was used to train, validate and simulate the ANN model on a personal computer. The network architecture and learning algorithm were optimised by trial and error to obtain the best prediction of kinematic viscosity. The predictive performance of the model was determined by calculating the coefficient of determination (R2), root mean squared (RMS) error and maximum average error percentage (MAEP) between predicted and experimental results. This study found high predictive accuracy of the ANN in predicting the fuel properties of biodiesel and has demonstrated the ability of the ANN model to find a meaningful relationship between biodiesel chemical composition and fuel properties. Therefore, the model developed in this study can be a useful tool to accurately predict biodiesel fuel properties instead of undertaking costly and time-consuming experimental tests.
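The three performance measures reported can be written down compactly. The sketch below (my construction, not the study's MATLAB code) computes R2, RMS error, and MAEP on invented viscosity data; MAEP is interpreted here as the largest absolute percentage error, which is an assumption about the study's definition.

```python
# Illustrative fit metrics for a viscosity prediction model:
# coefficient of determination (R2), root mean squared (RMS) error,
# and maximum average error percentage (MAEP, assumed here to mean
# the largest absolute percentage error over the test set).
def fit_metrics(y_true, y_pred):
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    rms = (ss_res / n) ** 0.5
    maep = max(abs(t - p) / abs(t) * 100 for t, p in zip(y_true, y_pred))
    return r2, rms, maep

# Invented kinematic viscosities (mm^2/s): measured vs predicted.
r2, rms, maep = fit_metrics([4.0, 4.5, 5.0], [4.1, 4.4, 5.0])
```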
Abstract:
Temporary Traffic Control Plans (TCPs), which provide construction phasing to maintain traffic during construction operations, are an integral component of highway construction project design. Using the initial design, designers develop estimated quantities for the required TCP devices that become the basis for bids submitted by highway contractors. However, actual as-built quantities are often significantly different from the engineer's original estimate. The total cost of TCP phasing on highway construction projects amounts to 6-10% of the total construction cost. Variations between engineer-estimated quantities and final quantities contribute to reduced cost control, increased chances of cost-related litigation, and distorted bid rankings and selection. Statistical analyses of over 2000 highway construction projects were performed to determine the sources of variation, which were then used as the basis for developing an automated hybrid prediction model that uses multiple regressions and heuristic rules to provide accurate TCP quantities and costs. The predictive accuracy of the model was demonstrated through several case studies.
Abstract:
Background: Wearable monitors are increasingly being used to objectively monitor physical activity in research studies within the field of exercise science. Calibration and validation of these devices are vital to obtaining accurate data. This article is aimed primarily at the physical activity measurement specialist, although the end user who is conducting studies with these devices may also benefit from knowing about this topic. Best Practices: Initially, wearable physical activity monitors should undergo unit calibration to ensure interinstrument reliability. The next step is to simultaneously collect both raw signal data (e.g., acceleration) from the wearable monitors and rates of energy expenditure, so that algorithms can be developed to convert the direct signals into energy expenditure. This process should use multiple wearable monitors and a large and diverse subject group, and should include a wide range of physical activities commonly performed in daily life (from sedentary to vigorous). Future Directions: New methods of calibration now use "pattern recognition" approaches to train the algorithms on various activities, and they provide estimates of energy expenditure that are much better than those previously available with the single-regression approach. Once a method of predicting energy expenditure has been established, the next step is to examine its predictive accuracy by cross-validating it in other populations. In this article, we attempt to summarize the best practices for calibration and validation of wearable physical activity monitors. Finally, we conclude with some ideas for future research that will move the field of physical activity measurement forward.
Abstract:
The concentrations of Na, K, Ca, Mg, Ba, Sr, Fe, Al, Mn, Zn, Pb, Cu, Ni, Cr, Co, Se, U and Ti were determined in the osteoderms and/or flesh of estuarine crocodiles (Crocodylus porosus) captured in three adjacent catchments within the Alligator Rivers Region (ARR) of northern Australia. Results from multivariate analysis of variance showed that when all metals were considered simultaneously, catchment effects were significant (P≤0.05). Despite considerable within-catchment variability, linear discriminant analysis (LDA) showed that differences in elemental signatures in the osteoderms and/or flesh of C. porosus amongst the catchments were sufficient to classify individuals accurately to their catchment of occurrence. Using cross-validation, the accuracy of classifying a crocodile to its catchment of occurrence was 76% for osteoderms and 60% for flesh. These data suggest that osteoderms provide better predictive accuracy than flesh for discriminating crocodiles amongst catchments. There was no advantage in combining the osteoderm and flesh results to increase the accuracy of classification (i.e. 67%). Based on the discriminant function coefficients for the osteoderm data, Ca, Co, Mg and U were the most important elements for discriminating amongst the three catchments. For flesh data, Ca, K, Mg, Na, Ni and Pb were the most important metals for discriminating amongst the catchments. Reasons for differences in the elemental signatures of crocodiles between catchments are generally not interpretable, due to limited data on surface water and sediment chemistry of the catchments or chemical composition of dietary items of C. porosus. From a wildlife management perspective, the provenance or source catchment(s) of 'problem' crocodiles captured at settlements or recreational areas along the ARR coastline may be established using catchment-specific elemental signatures. 
If the incidence of problem crocodiles can be reduced in settled or recreational areas by effective management at their source, then public safety concerns about these predators may be moderated, as well as the cost of their capture and removal.
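The cross-validated classification accuracy reported above can be illustrated with a small sketch. The code below uses leave-one-out cross-validation with a nearest-centroid classifier standing in for the paper's linear discriminant analysis, and the toy element-concentration vectors are invented, not the ARR data.

```python
# Hedged illustration: leave-one-out classification of crocodiles to
# catchments by elemental signature, using a nearest-centroid rule as
# a simple stand-in for linear discriminant analysis (LDA).
def loo_accuracy(samples):
    """samples: list of (catchment_label, concentration_vector)."""
    correct = 0
    for i, (label, vec) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        # centroid of each catchment, excluding the held-out crocodile
        groups = {}
        for lab, v in rest:
            groups.setdefault(lab, []).append(v)
        cents = {lab: [sum(col) / len(vs) for col in zip(*vs)]
                 for lab, vs in groups.items()}
        # assign to the catchment with the nearest centroid
        pred = min(cents, key=lambda lab: sum((a - b) ** 2
                   for a, b in zip(vec, cents[lab])))
        correct += pred == label
    return correct / len(samples)

# Invented two-element signatures (e.g. Ca, Mg) for two catchments.
data = [("A", (1.0, 9.0)), ("A", (1.2, 8.8)), ("A", (0.9, 9.2)),
        ("B", (5.0, 2.0)), ("B", (5.2, 2.1)), ("B", (4.8, 1.9))]
acc = loo_accuracy(data)
```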
Abstract:
Circuit-breakers (CBs) are subject to electrical stresses when restrikes occur during capacitor bank operation. Stresses are caused by the overvoltages across CBs, the interrupting currents and the rate of rise of recovery voltage (RRRV). Such electrical stresses also depend on the type of system grounding and the type of dielectric strength curve. The aim of this study is to demonstrate a restrike waveform predictive model for an SF6 CB that considers both grounded and non-grounded systems, and to compare computation accuracy when applying the cold withstand dielectric strength curve versus the hot recovery dielectric strength curve, including point-on-wave (POW) recommendations, in order to assess options for increasing the CB's remaining life. SF6 CB stresses in a typical 400 kV system were simulated and the results are presented. The simulated restrike waveforms, with features identified using the wavelet transform, can be used to develop a restrike diagnostic algorithm that locates a substation with breaker restrikes. This study found that the hot withstand dielectric strength curve has a lower magnitude than the cold withstand dielectric strength curve in the restrike simulation results. Computation accuracy improved with the hot withstand dielectric strength curve, and POW-controlled switching can increase the life of an SF6 CB.
Abstract:
Recent efforts in mission planning for underwater vehicles have utilised predictive models to aid navigation and optimal path planning and to drive opportunistic sampling. Although these models provide information at unprecedented resolutions and have proven to increase accuracy and effectiveness in multiple campaigns, most are deterministic in nature. Thus, predictions cannot be incorporated into probabilistic planning frameworks, nor do they provide any metric on the variance or confidence of the output variables. In this paper, we provide an initial investigation into determining the confidence of ocean model predictions based on the results of multiple field deployments of two autonomous underwater vehicles. For multiple missions conducted over a two-month period in 2011, we compare actual vehicle executions to simulations of the same missions through the Regional Ocean Modeling System in an ocean region off the coast of southern California. This comparison provides a qualitative analysis of the current velocity predictions for areas within the selected deployment region. Ultimately, we present a spatial heat-map of the correlation between the ocean model predictions and the actual mission executions. Knowing where the model provides unreliable predictions can be incorporated into planners to increase the utility and application of the deterministic estimations.
Abstract:
Purpose: This study evaluated the predictive validity of three previously published ActiGraph energy expenditure (EE) prediction equations developed for children and adolescents. Methods: A total of 45 healthy children and adolescents (mean age: 13.7 +/- 2.6 yr) completed four 5-min activity trials (normal walking, brisk walking, easy running, and fast running) in an indoor exercise facility. During each trial, participants wore an ActiGraph accelerometer on the right hip. EE was monitored breath by breath using the Cosmed K4b2 portable indirect calorimetry system. Differences and associations between measured and predicted EE were assessed using dependent t-tests and Pearson correlations, respectively. Classification accuracy was assessed using percent agreement, sensitivity, specificity, and area under the receiver operating characteristic (ROC) curve. Results: None of the equations accurately predicted mean EE during all four activity trials. Each equation, however, accurately predicted mean EE in at least one activity trial. The Puyau equation accurately predicted EE during slow walking. The Trost equation accurately predicted EE during slow running. The Freedson equation accurately predicted EE during fast running. None of the three equations accurately predicted EE during brisk walking. The equations exhibited fair to excellent classification accuracy with respect to activity intensity, with the Trost equation exhibiting the highest classification accuracy and the Puyau equation the lowest. Conclusions: These data suggest that the three accelerometer prediction equations do not accurately predict EE on a minute-by-minute basis in children and adolescents during overground walking and running. The equations may be useful, however, for estimating participation in moderate and vigorous activity.
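The classification-accuracy measures used in the study are straightforward to compute from paired intensity classifications. The sketch below (not the study's code) computes percent agreement, sensitivity, and specificity for a binary case (e.g. vigorous vs not vigorous), on invented minute-by-minute labels.

```python
# Illustrative only: percent agreement, sensitivity and specificity
# for binary intensity classifications (1 = vigorous, 0 = not).
def agreement_stats(truth, pred):
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    agree = (tp + tn) / len(truth)   # percent agreement (as a fraction)
    sens = tp / (tp + fn)            # true-positive rate
    spec = tn / (tn + fp)            # true-negative rate
    return agree, sens, spec

# Invented measured vs equation-predicted intensity labels.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
pred  = [1, 0, 1, 0, 0, 1, 0, 1]
agree, sens, spec = agreement_stats(truth, pred)
```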
Abstract:
Multivariate predictive models are widely used tools for assessment of aquatic ecosystem health, and models have been successfully developed for the prediction and assessment of aquatic macroinvertebrates, diatoms, local stream habitat features and fish. We evaluated the ability of a modelling method based on the River InVertebrate Prediction and Classification System (RIVPACS) to accurately predict freshwater fish assemblage composition and assess aquatic ecosystem health in rivers and streams of south-eastern Queensland, Australia. The predictive model was developed, validated and tested in a region of comparatively high environmental variability due to the unpredictable nature of rainfall and river discharge. The model was concluded to provide sufficiently accurate and precise predictions of species composition and was sensitive enough to distinguish test sites impacted by several common types of human disturbance (particularly impacts associated with catchment land use and associated local riparian, in-stream habitat and water quality degradation). The total number of fish species available for prediction was low in comparison to similar applications of multivariate predictive models based on other indicator groups, yet the accuracy and precision of our model was comparable to outcomes from such studies. In addition, our model, developed for sites sampled on one occasion and in one season only (winter), was able to accurately predict fish assemblage composition at sites sampled during other seasons and years, provided that they were not subject to unusually extreme environmental conditions (e.g. extended periods of low flow that restricted fish movement or resulted in habitat desiccation and local fish extinctions).
Abstract:
In the Bayesian framework, a standard approach to model criticism is to compare some function of the observed data to a reference predictive distribution. The result of the comparison can be summarized in the form of a p-value, and it is well known that computation of some kinds of Bayesian predictive p-values can be challenging. The use of regression adjustment approximate Bayesian computation (ABC) methods is explored for this task. Two problems are considered. The first is the calibration of posterior predictive p-values so that they are uniformly distributed under some reference distribution for the data. Computation is difficult because the calibration process requires repeated approximation of the posterior for different data sets under the reference distribution. The second problem considered is approximation of distributions of prior predictive p-values for the purpose of choosing weakly informative priors in the case where the model checking statistic is expensive to compute. Here the computation is difficult because of the need to repeatedly sample from a prior predictive distribution for different values of a prior hyperparameter. In both these problems we argue that high accuracy in the computations is not required, which makes fast approximations such as regression adjustment ABC very useful. We illustrate our methods with several examples.
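The calibration idea in the first problem can be sketched very simply: a raw predictive p-value is replaced by its rank among p-values computed from replicate data sets drawn under the reference distribution, which makes the calibrated values uniform by construction. The toy reference distribution below is my own construction, not from the paper.

```python
import random

# Toy sketch of p-value calibration: the calibrated value is the
# fraction of reference-distribution p-values at or below the
# observed one, so calibrated p-values are uniform by construction.
def calibrate(p_obs, p_refs):
    return sum(p <= p_obs for p in p_refs) / len(p_refs)

random.seed(0)
# Invented reference: squaring a uniform draw gives conservative
# (small-skewed) raw p-values, mimicking an uncalibrated posterior
# predictive p-value.
p_refs = [random.random() ** 2 for _ in range(1000)]
c = calibrate(0.25, p_refs)   # P(U^2 <= 0.25) = 0.5, so c is near 0.5
```

In the paper's setting each reference p-value itself requires an approximate posterior, which is why fast regression-adjustment ABC approximations are attractive here.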
Abstract:
OBJECTIVE: To compare, in patients with cancer and in healthy subjects, measured resting energy expenditure (REE) from traditional indirect calorimetry to a new portable device (MedGem) and predicted REE. DESIGN: Cross-sectional clinical validation study. SETTING: Private radiation oncology centre, Brisbane, Australia. SUBJECTS: Cancer patients (n = 18) and healthy subjects (n = 17) aged 37-86 y, with body mass indices ranging from 18 to 42 kg/m(2). INTERVENTIONS: Oxygen consumption (VO(2)) and REE were measured by VMax229 (VM) and MedGem (MG) indirect calorimeters in random order after a 12-h fast and 30-min rest. REE was also calculated from the MG without adjustment for nitrogen excretion (MGN) and estimated from Harris-Benedict prediction equations. Data were analysed using the Bland and Altman approach, based on a clinically acceptable difference between methods of 5%. RESULTS: The mean bias (MGN-VM) was 10% and limits of agreement were -42 to 21% for cancer patients; mean bias -5% with limits of -45 to 35% for healthy subjects. Less than half of the cancer patients (n = 7, 46.7%) and only a third (n = 5, 33.3%) of healthy subjects had measured REE by MGN within clinically acceptable limits of VM. Predicted REE showed a mean bias (HB-VM) of -5% for cancer patients and 4% for healthy subjects, with limits of agreement of -30 to 20% and -27 to 34%, respectively. CONCLUSIONS: Limits of agreement for the MG and Harris Benedict equations compared to traditional indirect calorimetry were similar but wide, indicating poor clinical accuracy for determining the REE of individual cancer patients and healthy subjects.
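The Bland and Altman quantities reported above (mean bias and limits of agreement) are simple to compute from paired measurements. The sketch below is illustrative; the paired REE values are invented, not the study's data.

```python
import statistics

# Minimal sketch of a Bland-Altman comparison of two measurement
# methods: mean bias and 95% limits of agreement, i.e. bias +/- 1.96
# standard deviations of the paired differences.
def bland_altman(a, b):
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Invented paired resting energy expenditure values (kcal/day).
mg = [1500, 1620, 1480, 1700, 1550]   # hypothetical MedGem readings
vm = [1450, 1600, 1500, 1650, 1580]   # hypothetical VMax229 readings
bias, lo, hi = bland_altman(mg, vm)
```

Wide limits relative to a clinically acceptable difference (5% in the study) indicate poor agreement even when the mean bias is small.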