409 resultados para decision tree
em Queensland University of Technology - ePrints Archive
Resumo:
The Thai written language is one of the languages that does not have word boundaries. In order to discover the meaning of the document, all texts must be separated into syllables, words, sentences, and paragraphs. This paper develops a novel method to segment the Thai text by combining a non-dictionary based technique with a dictionary-based technique. This method first applies the Thai language grammar rules to the text for identifying syllables. The hidden Markov model is then used for merging possible syllables into words. The identified words are verified with a lexical dictionary and a decision tree is employed to discover the words unidentified by the lexical dictionary. Documents used in the litigation process of Thai court proceedings have been used in experiments. The results which are segmented words, obtained by the proposed method outperform the results obtained by other existing methods.
Resumo:
Being able to accurately predict the risk of falling is crucial in patients with Parkinson’s dis- ease (PD). This is due to the unfavorable effect of falls, which can lower the quality of life as well as directly impact on survival. Three methods considered for predicting falls are decision trees (DT), Bayesian networks (BN), and support vector machines (SVM). Data on a 1-year prospective study conducted at IHBI, Australia, for 51 people with PD are used. Data processing are conducted using rpart and e1071 packages in R for DT and SVM, con- secutively; and Bayes Server 5.5 for the BN. The results show that BN and SVM produce consistently higher accuracy over the 12 months evaluation time points (average sensitivity and specificity > 92%) than DT (average sensitivity 88%, average specificity 72%). DT is prone to imbalanced data so needs to adjust for the misclassification cost. However, DT provides a straightforward, interpretable result and thus is appealing for helping to identify important items related to falls and to generate fallers’ profiles.
Resumo:
The economiser is a critical component for efficient operation of coal-fired power stations. It consists of a large system of water-filled tubes which extract heat from the exhaust gases. When it fails, usually due to erosion causing a leak, the entire power station must be shut down to effect repairs. Not only are such repairs highly expensive, but the overall repair costs are significantly affected by fluctuations in electricity market prices, due to revenue lost during the outage. As a result, decisions about when to repair an economiser can alter the repair costs by millions of dollars. Therefore, economiser repair decisions are critical and must be optimised. However, making optimal repair decisions is difficult because economiser leaks are a type of interactive failure. If left unfixed, a leak in a tube can cause additional leaks in adjacent tubes which will need more time to repair. In addition, when choosing repair times, one also needs to consider a number of other uncertain inputs such as future electricity market prices and demands. Although many different decision models and methodologies have been developed, an effective decision-making method specifically for economiser repairs has yet to be defined. In this paper, we describe a Decision Tree based method to meet this need. An industrial case study is presented to demonstrate the application of our method.
Resumo:
This paper presents a fault diagnosis method based on adaptive neuro-fuzzy inference system (ANFIS) in combination with decision trees. Classification and regression tree (CART) which is one of the decision tree methods is used as a feature selection procedure to select pertinent features from data set. The crisp rules obtained from the decision tree are then converted to fuzzy if-then rules that are employed to identify the structure of ANFIS classifier. The hybrid of back-propagation and least squares algorithm are utilized to tune the parameters of the membership functions. In order to evaluate the proposed algorithm, the data sets obtained from vibration signals and current signals of the induction motors are used. The results indicate that the CART–ANFIS model has potential for fault diagnosis of induction motors.
Resumo:
Age-related macular degeneration (AMD) affects the central vision and subsequently may lead to visual loss in people over 60 years of age. There is no permanent cure for AMD, but early detection and successive treatment may improve the visual acuity. AMD is mainly classified into dry and wet type; however, dry AMD is more common in aging population. AMD is characterized by drusen, yellow pigmentation, and neovascularization. These lesions are examined through visual inspection of retinal fundus images by ophthalmologists. It is laborious, time-consuming, and resource-intensive. Hence, in this study, we have proposed an automated AMD detection system using discrete wavelet transform (DWT) and feature ranking strategies. The first four-order statistical moments (mean, variance, skewness, and kurtosis), energy, entropy, and Gini index-based features are extracted from DWT coefficients. We have used five (t test, Kullback–Lieber Divergence (KLD), Chernoff Bound and Bhattacharyya Distance, receiver operating characteristics curve-based, and Wilcoxon) feature ranking strategies to identify optimal feature set. A set of supervised classifiers namely support vector machine (SVM), decision tree, k -nearest neighbor ( k -NN), Naive Bayes, and probabilistic neural network were used to evaluate the highest performance measure using minimum number of features in classifying normal and dry AMD classes. The proposed framework obtained an average accuracy of 93.70 %, sensitivity of 91.11 %, and specificity of 96.30 % using KLD ranking and SVM classifier. We have also formulated an AMD Risk Index using selected features to classify the normal and dry AMD classes using one number. The proposed system can be used to assist the clinicians and also for mass AMD screening programs.
Resumo:
Objectives Demonstrate the application of decision trees – classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs) – to understand structure in missing data. Setting Data taken from employees at three different industry sites in Australia. Participants 7915 observations were included. Materials and Methods The approach was evaluated using an occupational health dataset comprising results of questionnaires, medical tests, and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. Results CART and BRT models were effective in highlighting a missingness structure in the data, related to the Type of data (medical or environmental), the site in which it was collected, the number of visits and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured compared to structured missingness. Discussion Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. Conclusion Researchers are encouraged to use CART and BRT models to explore and understand missing data.
Resumo:
PURPOSE To develop and test decision tree (DT) models to classify physical activity (PA) intensity from accelerometer output and Gross Motor Function Classification System (GMFCS) classification level in ambulatory youth with cerebral palsy (CP); and 2) compare the classification accuracy of the new DT models to that achieved by previously published cut-points for youth with CP. METHODS Youth with CP (GMFCS Levels I - III) (N=51) completed seven activity trials with increasing PA intensity while wearing a portable metabolic system and ActiGraph GT3X accelerometers. DT models were used to identify vertical axis (VA) and vector magnitude (VM) count thresholds corresponding to sedentary (SED) (<1.5 METs), light PA (LPA) (>/=1.5 and <3 METs) and moderate-to-vigorous PA (MVPA) (>/=3 METs). Models were trained and cross-validated using the 'rpart' and 'caret' packages within R. RESULTS For the VA (VA_DT) and VM decision trees (VM_DT), a single threshold differentiated LPA from SED, while the threshold for differentiating MVPA from LPA decreased as the level of impairment increased. The average cross-validation accuracy for the VC_DT was 81.1%, 76.7%, and 82.9% for GMFCS levels I, II, and III, respectively. The corresponding cross-validation accuracy for the VM_DT was 80.5%, 75.6%, and 84.2%, respectively. Within each GMFCS level, the decision tree models achieved better PA intensity recognition than previously published cut-points. The accuracy differential was greatest among GMFCS level III participants, in whom the previously published cut-points misclassified 40% of the MVPA activity trials. CONCLUSION GMFCS-specific cut-points provide more accurate assessments of MVPA levels in youth with CP across the full spectrum of ambulatory ability.
Resumo:
The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by the increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia related complications. Stress electrocardiography/exercise testing is predictive of 10 year risk of CVD events and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and predictive data mining methods are applied to this data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for removal of outliers, calculation of moving averages as well as data summarisation and data abstraction methods. Feature selection methods of both wrapper and filter types are applied to derived physiological time-series variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed. The major dataset is derived from the entire anaesthesia population and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads). Because of the unbalanced class distribution in the data, majority class under-sampling and Kappa statistic together with misclassification rate and area under the ROC curve (AUC) are used for evaluation of models generated using different prediction algorithms. The performance based on models derived from feature reduced datasets reveal the filter method, Cfs subset evaluation, to be most consistently effective although Consistency derived subsets tended to slightly increased accuracy but markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This could be eliminated by consideration of the AUC or Kappa statistic as well by evaluation of subsets with under-sampled majority class. The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62 with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection results in reduction in MR to 9.8 to 10.16 with time segmented summary data (dataset F) MR being 9.8 and raw time-series summary data (dataset A) being 9.92. However, for all time-series only based datasets, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series alone datasets but models derived from these subsets are of one leaf only. MR values are consistent with class distribution in the subset folds evaluated in the n-cross validation method. For models based on Cfs selected time-series derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36 with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time segmented time-series variables and RF) being 9.09. The models based on counts of outliers and counts of data points outside normal range (Dataset RF_E) and derived variables based on time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (Dataset RF_ G) perform the least well with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28 while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are most comprehensible and clinically relevant. The predictive accuracy increase achieved by addition of risk factor variables to time-series variable based models is significant. The addition of time-series derived variables to models based on risk factor variables alone is associated with a trend to improved performance. Data mining of feature reduced, anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone. The limited additional value provided by electrocardiographic variables when compared to use of risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The effect of the pre-processing used in this study had limited effect when time-series variables and risk factor variables are used as model input. In the absence of risk factor input, the use of time-series variables after outlier removal and time series variables based on physiological variable values’ being outside the accepted normal range is associated with some improvement in model performance.
Resumo:
Distraction whilst driving on an approach to a signalized intersection is particularly dangerous, as potential vehicular conflicts and resulting angle collisions tend to be severe. This study examines the decisions of distracted drivers during the onset of amber lights. Driving simulator data were obtained from a sample of 58 drivers under baseline and handheld mobile phone conditions at the University of IOWA - National Advanced Driving Simulator. Explanatory variables include age, gender, cell phone use, distance to stop-line, and speed. An iterative combination of decision tree and logistic regression analyses are employed to identify main effects, non-linearities, and interactions effects. Results show that novice (16-17 years) and younger (18-25 years) drivers’ had heightened amber light running risk while distracted by cell phone, and speed and distance thresholds yielded significant interaction effects. Driver experience captured by age has a multiplicative effect with distraction, making the combined effect of being inexperienced and distracted particularly risky. Solutions are needed to combat the use of mobile phones whilst driving.
Resumo:
Our daily lives become more and more dependent upon smartphones due to their increased capabilities. Smartphones are used in various ways from payment systems to assisting the lives of elderly or disabled people. Security threats for these devices become increasingly dangerous since there is still a lack of proper security tools for protection. Android emerges as an open smartphone platform which allows modification even on operating system level. Therefore, third-party developers have the opportunity to develop kernel-based low-level security tools which is not normal for smartphone platforms. Android quickly gained its popularity among smartphone developers and even beyond since it bases on Java on top of "open" Linux in comparison to former proprietary platforms which have very restrictive SDKs and corresponding APIs. Symbian OS for example, holding the greatest market share among all smartphone OSs, was closing critical APIs to common developers and introduced application certification. This was done since this OS was the main target for smartphone malwares in the past. In fact, more than 290 malwares designed for Symbian OS appeared from July 2004 to July 2008. Android, in turn, promises to be completely open source. Together with the Linux-based smartphone OS OpenMoko, open smartphone platforms may attract malware writers for creating malicious applications endangering the critical smartphone applications and owners� privacy. In this work, we present our current results in analyzing the security of Android smartphones with a focus on its Linux side. Our results are not limited to Android, they are also applicable to Linux-based smartphones such as OpenMoko Neo FreeRunner. Our contribution in this work is three-fold. First, we analyze android framework and the Linux-kernel to check security functionalities. We survey wellaccepted security mechanisms and tools which can increase device security. We provide descriptions on how to adopt these security tools on Android kernel, and provide their overhead analysis in terms of resource usage. As open smartphones are released and may increase their market share similar to Symbian, they may attract attention of malware writers. Therefore, our second contribution focuses on malware detection techniques at the kernel level. We test applicability of existing signature and intrusion detection methods in Android environment. We focus on monitoring events on the kernel; that is, identifying critical kernel, log file, file system and network activity events, and devising efficient mechanisms to monitor them in a resource limited environment. Our third contribution involves initial results of our malware detection mechanism basing on static function call analysis. We identified approximately 105 Executable and Linking Format (ELF) executables installed to the Linux side of Android. We perform a statistical analysis on the function calls used by these applications. The results of the analysis can be compared to newly installed applications for detecting significant differences. Additionally, certain function calls indicate malicious activity. Therefore, we present a simple decision tree for deciding the suspiciousness of the corresponding application. Our results present a first step towards detecting malicious applications on Android-based devices.
Resumo:
Background Falls are one of the most frequently occurring adverse events that impact upon the recovery of older hospital inpatients. Falls can threaten both immediate and longer-term health and independence. There is need to identify cost-effective means for preventing falls in hospitals. Hospital-based falls prevention interventions tested in randomized trials have not yet been subjected to economic evaluation. Methods Incremental cost-effectiveness analysis was undertaken from the health service provider perspective, over the period of hospitalization (time horizon) using the Australian Dollar (A$) at 2008 values. Analyses were based on data from a randomized trial among n = 1,206 acute and rehabilitation inpatients. Decision tree modeling with three-way sensitivity analyses were conducted using burden of disease estimates developed from trial data and previous research. The intervention was a multimedia patient education program provided with trained health professional follow-up shown to reduce falls among cognitively intact hospital patients. Results The short-term cost to a health service of one cognitively intact patient being a faller could be as high as A$14,591 (2008). The education program cost A$526 (2008) to prevent one cognitively intact patient becoming a faller and A$294 (2008) to prevent one fall based on primary trial data. These estimates were unstable due to high variability in the hospital costs accrued by individual patients involved in the trial. There was a 52% probability the complete program was both more effective and less costly (from the health service perspective) than providing usual care alone. Decision tree modeling sensitivity analyses identified that when provided in real life contexts, the program would be both more effective in preventing falls among cognitively intact inpatients and cost saving where the proportion of these patients who would otherwise fall under usual care conditions is at least 4.0%. Conclusions This economic evaluation was designed to assist health care providers decide in what circumstances this intervention should be provided. If the proportion of cognitively intact patients falling on a ward under usual care conditions is 4% or greater, then provision of the complete program in addition to usual care will likely both prevent falls and reduce costs for a health service.
Resumo:
With increasing interest shown by Universities in workplace learning, especially in STEM disciplines, an issue has arisen amongst educators and industry partners regarding authentic assessment tasks for work integrated learning (WIL) subjects. This paper describes the use of a matrix, which is also available as a decision-tree, based on the features of the WIL experience, in order to facilitate the selection of appropriate assessment strategies. The matrix divides the WIL experiences into seven categories, based on such factors as: the extent to which the experience is compulsory, required for membership of a professional body or elective; whether the student is undertaking a project, or embedding in a professional culture; and other key aspects of the WIL experience. One important variable is linked to the fundamental purpose of the assessment. This question revolves around the focus of the assessment: whether on the person (student development); the process (professional conduct/language); or the product (project, assignment, literature review, report, software). The matrix has been trialed at QUT in the Faculty of Science and Technology, and also at the University of Surrey, UK, and has proven to have good applicability in both universities.
Resumo:
Due to the demand for better and deeper analysis in sports, organizations (both professional teams and broadcasters) are looking to use spatiotemporal data in the form of player tracking information to obtain an advantage over their competitors. However, due to the large volume of data, its unstructured nature, and lack of associated team activity labels (e.g. strategic/tactical), effective and efficient strategies to deal with such data have yet to be deployed. A bottleneck restricting such solutions is the lack of a suitable representation (i.e. ordering of players) which is immune to the potentially infinite number of possible permutations of player orderings, in addition to the high dimensionality of temporal signal (e.g. a game of soccer last for 90 mins). Leveraging a recent method which utilizes a "role-representation", as well as a feature reduction strategy that uses a spatiotemporal bilinear basis model to form a compact spatiotemporal representation. Using this representation, we find the most likely formation patterns of a team associated with match events across nearly 14 hours of continuous player and ball tracking data in soccer. Additionally, we show that we can accurately segment a match into distinct game phases and detect highlights. (i.e. shots, corners, free-kicks, etc) completely automatically using a decision-tree formulation.
Resumo:
A cell classification algorithm that uses first, second and third order statistics of pixel intensity distributions over pre-defined regions is implemented and evaluated. A cell image is segmented into 6 regions extending from a boundary layer to an inner circle. First, second and third order statistical features are extracted from histograms of pixel intensities in these regions. Third order statistical features used are one-dimensional bispectral invariants. 108 features were considered as candidates for Adaboost based fusion. The best 10 stage fused classifier was selected for each class and a decision tree constructed for the 6-class problem. The classifier is robust, accurate and fast by design.