19 results for random forest regression
at QUB Research Portal - Research Directory and Institutional Repository for Queen's University Belfast
Abstract:
Algorithms for concept drift handling are important for various applications including video analysis and smart grids. In this paper we present a decision tree ensemble classification method, based on the Random Forest algorithm, for concept drift. A weighted majority voting ensemble aggregation rule is employed, based on the ideas of the Accuracy Weighted Ensemble (AWE) method. In our case, each base learner's weight is computed at every sample evaluation using the base learner's accuracy and the intrinsic proximity measure of Random Forest. Our algorithm exploits both temporal weighting of samples and ensemble pruning as a forgetting strategy. We present results of an empirical comparison of our method with the original Random Forest with incorporated replace-the-loser forgetting, and with other state-of-the-art concept-drift classifiers such as AWE2.
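The accuracy-weighted voting idea described in this abstract can be sketched roughly as follows. This is a simplified illustration in Python with scikit-learn (the paper does not specify an implementation): each tree in a small bootstrap ensemble votes with a weight equal to its accuracy on a held-out evaluation window, standing in for the paper's more elaborate per-sample accuracy-and-proximity weighting.

```python
# Simplified accuracy-weighted majority voting over a bootstrap tree
# ensemble; a sketch of the AWE idea, not the paper's exact algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees, weights = [], []
for _ in range(10):
    # Bootstrap sample, random-forest style.
    idx = rng.integers(0, len(X_train), len(X_train))
    t = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(t)
    # Weight = accuracy on a recent evaluation window (stand-in for the
    # paper's per-sample accuracy/proximity weighting).
    weights.append(t.score(X_eval, y_eval))

def predict(x):
    votes = np.zeros(2)
    for t, w in zip(trees, weights):
        votes[int(t.predict(x.reshape(1, -1))[0])] += w
    return int(np.argmax(votes))

preds = np.array([predict(x) for x in X_eval])
acc = (preds == y_eval).mean()
```

In a streaming setting, the weights would be recomputed as new labelled samples arrive, and low-weight trees pruned as a forgetting strategy.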
Abstract:
In this study, 39 sets of hard turning (HT) experimental trials were performed on a Mori-Seiki SL-25Y (4-axis) computer numerical controlled (CNC) lathe to study the effect of cutting parameters on the machined surface roughness. In all the trials, an AISI 4340 steel workpiece (hardened up to 69 HRC) was machined with a commercially available CBN insert (Warren Tooling Limited, UK) under dry conditions. The surface topography of the machined samples was examined using a white light interferometer, and the measurements were reconfirmed using a Form Talysurf. The machining outcomes were used as input to develop various regression models to predict the average machined surface roughness of this material. Three regression models (multiple regression, Random Forest regression, and quantile regression) were applied to the experimental outcomes. To the best of the authors’ knowledge, this paper is the first to apply Random Forest or quantile regression techniques to the machining domain. The performance of these models was compared to ascertain how feed, depth of cut, and spindle speed affect surface roughness, and finally to obtain a mathematical equation correlating these variables. It was concluded that the Random Forest regression model is a superior choice over multiple regression models for the prediction of surface roughness during machining of AISI 4340 steel (69 HRC).
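The kind of comparison this abstract describes can be sketched as below, in Python with scikit-learn. The data here are synthetic stand-ins (hypothetical ranges for feed, depth of cut, and spindle speed), not the paper's experimental trials; the point is only the mechanics of fitting and cross-validating both model families on the same predictors.

```python
# Compare multiple linear regression with random forest regression on
# synthetic machining-style data (feed, depth of cut, spindle speed ->
# roughness). Variable names and ranges are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200
feed = rng.uniform(0.05, 0.2, n)        # mm/rev (hypothetical)
doc = rng.uniform(0.1, 0.5, n)          # depth of cut, mm (hypothetical)
speed = rng.uniform(100, 300, n)        # m/min (hypothetical)
# Nonlinear synthetic response with an interaction term, plus noise.
Ra = 5 * feed**2 + 0.5 * doc + 100 * feed / speed + rng.normal(0, 0.01, n)
X = np.column_stack([feed, doc, speed])

lin_r2 = cross_val_score(LinearRegression(), X, Ra, cv=5, scoring="r2").mean()
rf_r2 = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                        X, Ra, cv=5, scoring="r2").mean()
```

With real trial data, the same cross-validated R² comparison is one way to substantiate a claim that one model family predicts roughness better than another.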
Abstract:
Although visual surveillance has emerged as an effective technology for public security, privacy has become an issue of great concern in the transmission and distribution of surveillance videos. For example, personal facial images should not be browsed without permission. To cope with this issue, face image scrambling has emerged as a simple solution for privacy-related applications. Consequently, online facial biometric verification needs to be carried out in the scrambled domain, bringing a new challenge to face classification. In this paper, we investigate face verification in the scrambled domain and propose a novel scheme to handle this challenge. In our proposed method, to make feature extraction from scrambled face images robust, a biased random subspace sampling scheme is applied to construct fuzzy decision trees from randomly selected features, and a fuzzy forest decision using fuzzy memberships is then obtained by combining all fuzzy tree decisions. In our experiment, we first estimated the optimal parameters for the construction of the random forest, and then applied the optimized model to benchmark tests using three publicly available face datasets. The experimental results validate that our proposed scheme can robustly cope with the challenging tests in the scrambled domain, achieving improved accuracy over all tests and making our method a promising candidate for emerging privacy-related facial biometric applications.
Abstract:
The main curative therapy for patients with non-small cell lung cancer is surgery. Despite this, the survival rate is only 50%; it is therefore important to diagnose and predict prognosis more efficiently for lung cancer patients. Raman spectroscopy is useful in the diagnosis of malignant and premalignant lesions. The aim of this study is to investigate the ability of Raman microscopy to diagnose lung cancer from surgically resected tissue sections, and to predict the prognosis of these patients. Tumor tissue sections from curative resections are mapped by Raman microscopy and the spectra analysed using multivariate techniques. Spectra from the tumor samples are also compared with their outcome data to define their prognostic significance. Using principal component analysis and random forest classification, Raman microscopy differentiates malignant from normal lung tissue. Principal component analysis of 34 tumor spectra predicts early postoperative cancer recurrence with a sensitivity of 73% and specificity of 74%. Spectral analysis reveals elevated porphyrin levels in the normal samples and more DNA in the tumor samples. Raman microscopy can be a useful technique for the diagnosis and prognosis of lung cancer patients receiving surgery, and for elucidating the biochemical properties of lung tumors. (C) 2010 Society of Photo-Optical Instrumentation Engineers. [DOI: 10.1117/1.3323088]
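The analysis pattern named here, principal component analysis followed by random forest classification, can be sketched as follows. The spectra below are synthetic placeholders, not Raman measurements, and Python/scikit-learn is an assumption; only the PCA-then-classify structure mirrors the abstract.

```python
# PCA to compress spectra, then random forest classification of
# tumor vs normal; data are synthetic stand-ins for Raman spectra.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, n_wavenumbers = 120, 300
spectra = rng.normal(0, 1, (n, n_wavenumbers))
labels = rng.integers(0, 2, n)            # 1 = tumor, 0 = normal
# Shift one spectral band in the "tumor" class (a stand-in for a
# biochemical difference such as the elevated DNA signal).
spectra[labels == 1, 100:120] += 1.0

scores = PCA(n_components=10).fit_transform(spectra)
acc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                      scores, labels, cv=5).mean()
```

Reducing a few hundred wavenumbers to ten principal components before classification keeps the forest from overfitting when, as here, samples are far fewer than spectral channels.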
Abstract:
Background: Evidence suggests that in prokaryotes sequence-dependent transcriptional pauses affect the dynamics of transcription and translation, as well as of small genetic circuits. So far, a few pause-prone sequences have been identified from in vitro measurements of transcription elongation kinetics.
Results: Using a stochastic model of gene expression at the nucleotide and codon levels with realistic parameter values, we investigate three different but related questions and present statistical methods for their analysis. First, we show that information from in vivo RNA and protein temporal numbers is sufficient to discriminate between models with and without a pause site in their coding sequence. Second, we demonstrate that it is possible to separate a large variety of models from each other, with pauses of various durations and locations in the template, by means of hierarchical clustering and a random forest classifier. Third, we introduce an approximate likelihood function that allows the location of a pause site to be estimated.
Conclusions: This method can aid in detecting unknown pause-prone sequences from temporal measurements of RNA and protein numbers at a genome-wide scale and thus elucidate possible roles that these sequences play in the dynamics of genetic networks and phenotype.
Abstract:
The concentration of organic acids in anaerobic digesters is one of the most critical parameters for monitoring and advanced control of anaerobic digestion processes. Thus, a reliable online measurement system is absolutely necessary. A novel approach to obtaining these measurements indirectly and online, using UV/vis spectroscopic probes in conjunction with powerful pattern recognition methods, is presented in this paper. A UV/vis spectroscopic probe from S::CAN is used in combination with a custom-built dilution system to monitor the absorption of fully fermented sludge over a spectrum from 200 to 750 nm. Advanced pattern recognition methods are then used to map the non-linear relationship between the measured absorption spectra and laboratory measurements of organic acid concentrations. Linear discriminant analysis, generalized discriminant analysis (GerDA), support vector machines (SVM), relevance vector machines, random forest and neural networks are investigated for this purpose and their performance compared. To validate the approach, online measurements were taken at a full-scale 1.3-MW industrial biogas plant. Results show that whereas some of the methods considered do not yield satisfactory results, accurate prediction of organic acid concentration ranges can be obtained with both GerDA and SVM-based classifiers, with classification rates in excess of 87% achieved on test data.
Abstract:
Background: Ineffective risk stratification can delay diagnosis of serious disease in patients with hematuria. We applied a systems biology approach to analyze clinical, demographic and biomarker measurements (n = 29) collected from 157 hematuric patients: 80 urothelial cancer (UC) and 77 controls with confounding pathologies.
Methods: On the basis of biomarkers, we conducted agglomerative hierarchical clustering to identify patient and biomarker clusters. We then explored the relationship between the patient clusters and clinical characteristics using chi-square analyses. We determined classification errors and areas under the receiver operating characteristic curves of Random Forest Classifiers (RFC) for patient subpopulations, using the biomarker clusters to reduce the dimensionality of the data.
Results: Agglomerative clustering identified five patient clusters and seven biomarker clusters. Final diagnosis categories were non-randomly distributed across the five patient clusters. In addition, two of the patient clusters were enriched with patients with ‘low cancer-risk’ characteristics. The biomarkers which contributed to the diagnostic classifiers for these two patient clusters were similar. In contrast, three of the patient clusters were significantly enriched with patients harboring ‘high cancer-risk’ characteristics including proteinuria, aggressive pathological stage and grade, and malignant cytology. Patients in these three clusters included controls, that is, patients with other serious disease, and patients with cancers other than UC. Biomarkers which contributed to the diagnostic classifiers for the largest ‘high cancer-risk’ cluster were different from those contributing to the classifiers for the ‘low cancer-risk’ clusters. Biomarkers which contributed to subpopulations that were split according to smoking status, gender and medication were also different.
Conclusions: The systems biology approach applied in this study allowed the hematuric patients to cluster naturally on the basis of the heterogeneity within their biomarker data, into five distinct risk subpopulations. Our findings highlight an approach with the promise to unlock the potential of biomarkers. This will be especially valuable in the field of diagnostic bladder cancer where biomarkers are urgently required. Clinicians could interpret risk classification scores in the context of clinical parameters at the time of triage. This could reduce cystoscopies and enable priority diagnosis of aggressive diseases, leading to improved patient outcomes at reduced costs. © 2013 Emmert-Streib et al; licensee BioMed Central Ltd.
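The Methods pipeline above, clustering biomarkers to reduce dimensionality and then scoring a Random Forest classifier by area under the ROC curve, can be sketched roughly as follows. The data are synthetic and the library choice (scikit-learn) is an assumption; only the shapes echo the study (157 patients, 29 measurements, seven biomarker clusters).

```python
# Collapse correlated biomarkers into cluster-level features via
# agglomerative feature clustering, then classify with a random forest
# and report AUC. Synthetic data; dimensions mirror the study.
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=157, n_features=29, n_informative=8,
                           random_state=0)
# 29 measurements -> 7 cluster-level features (cluster means), echoing
# the seven biomarker clusters reported in the paper.
X_red = FeatureAgglomeration(n_clusters=7).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_red, y, random_state=0)
rfc = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rfc.predict_proba(X_te)[:, 1])
```

Replacing individual biomarkers with cluster averages trades some resolution for stability, which matters with only 157 patients against 29 measurements.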
Abstract:
Despite the importance of laughter in social interactions, it remains little studied in affective computing. Respiratory, auditory, and facial laughter signals have been investigated, but laughter-related body movements have received almost no attention. The aim of this study is twofold: first, an investigation into observers' perception of laughter states (hilarious, social, awkward, fake, and non-laughter) based on body movements alone, through their categorization of avatars animated with natural and acted motion capture data. Significant differences in torso and limb movements were found between animations perceived as containing laughter and those perceived as non-laughter. Hilarious laughter also differed from social laughter in the amount of bending of the spine, the amount of shoulder rotation and the amount of hand movement. The body movement features indicative of laughter differed between sitting and standing avatar postures. Based on the positive findings in this perceptual study, the second aim is to investigate the possibility of automatically predicting the distributions of observers' ratings for the laughter states. The findings show that the automated laughter recognition rates approach human rating levels, with the Random Forest method yielding the best performance.
Abstract:
The in-line measurement of COD and NH4-N in the WWTP inflow is crucial for the timely monitoring of biological wastewater treatment processes and for the development of advanced control strategies for optimized WWTP operation. As a direct measurement of COD and NH4-N requires expensive, high-maintenance in-line probes or analyzers, an approach estimating COD and NH4-N from standard and spectroscopic in-line inflow measurement systems using machine learning techniques is presented in this paper. The results show that COD estimation with a normalized MSE of 0.3, which is sufficiently accurate for practical applications, can be achieved with Random Forest Regression using only standard in-line measurements. In the case of NH4-N, a good estimation using Partial Least Squares Regression with a normalized MSE of 0.16 is only possible based on a combination of standard and spectroscopic in-line measurements. Furthermore, the comparison of regression and classification methods shows that both perform equally well in most cases.
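The estimation setup described here can be sketched as follows, with the normalized MSE taken as the test MSE divided by the variance of the target (one common normalization; the paper does not state its exact definition). All variable names and data below are synthetic assumptions standing in for standard in-line inflow measurements.

```python
# Random forest regression of a COD-like target from synthetic
# "standard in-line" signals, scored by a normalized MSE
# (test MSE / variance of the target). Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
flow = rng.uniform(50, 150, n)            # hypothetical inflow rate
turbidity = rng.uniform(0, 10, n)         # hypothetical turbidity signal
conductivity = rng.uniform(500, 1500, n)  # hypothetical conductivity
cod = 2.0 * turbidity + 0.1 * flow + rng.normal(0, 1.0, n)

X = np.column_stack([flow, turbidity, conductivity])
X_tr, X_te, y_tr, y_te = train_test_split(X, cod, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
nmse = np.mean((rf.predict(X_te) - y_te) ** 2) / np.var(y_te)
```

A normalized MSE below 1 means the model explains more variance than simply predicting the mean, which is the baseline this normalization implies.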
Abstract:
Despite its importance in social interactions, laughter remains little studied in affective computing. Intelligent virtual agents are often blind to users’ laughter and unable to produce convincing laughter themselves. Respiratory, auditory, and facial laughter signals have been investigated, but laughter-related body movements have received less attention. The aim of this study is threefold. First, to probe human laughter perception by analyzing patterns of categorisations of natural laughter animated on a minimal avatar. Results reveal that a low-dimensional space can describe perception of laughter “types”. Second, to investigate observers’ perception of laughter (hilarious, social, awkward, fake, and non-laughter) based on animated avatars generated from natural and acted motion-capture data. Significant differences in torso and limb movements are found between animations perceived as laughter and those perceived as non-laughter. Hilarious laughter also differs from social laughter. Different body movement features were indicative of laughter in sitting and standing avatar postures. Third, to investigate whether laughter can be recognized automatically to the same level of certainty as observers’ perceptions. Results show that recognition rates of the Random Forest model approach human rating levels. Classification comparisons and feature importance analyses indicate an improvement in recognition of social laughter when localized features and nonlinear models are used.
Abstract:
Efficient identification and follow-up of astronomical transients is hindered by the need for humans to manually select promising candidates from data streams that contain many false positives. These artefacts arise in the difference images that are produced by most major ground-based time-domain surveys with large-format CCD cameras. This dependence on humans to reject bogus detections is unsustainable for next generation all-sky surveys and significant effort is now being invested to solve the problem computationally. In this paper, we explore a simple machine learning approach to real-bogus classification by constructing a training set from the image data of approximately 32 000 real astrophysical transients and bogus detections from the Pan-STARRS1 Medium Deep Survey. We derive our feature representation from the pixel intensity values of a 20 x 20 pixel stamp around the centre of the candidates. This differs from previous work in that it works directly on the pixels rather than catalogued domain knowledge for feature design or selection. Three machine learning algorithms are trained (artificial neural networks, support vector machines and random forests) and their performances are tested on a held-out subset of 25 per cent of the training data. We find the best results from the random forest classifier and demonstrate that by accepting a false positive rate of 1 per cent, the classifier initially suggests a missed detection rate of around 10 per cent. However, we also find that a combination of bright star variability, nuclear transients and uncertainty in human labelling means that our best estimate of the missed detection rate is approximately 6 per cent.
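The pixel-level approach described here can be sketched as below: flatten each 20 x 20 stamp into a 400-dimensional feature vector, train a random forest, and read off the missed-detection rate at a fixed 1 per cent false-positive rate. The stamps below are synthetic stand-ins, not Pan-STARRS1 difference images, and scikit-learn is an assumed implementation.

```python
# Real-bogus classification directly on pixel intensities: synthetic
# stamps, random forest scores, and a threshold set at a 1% false
# positive rate on the "bogus" class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
stamps = rng.normal(0, 1, (n, 20, 20))
labels = rng.integers(0, 2, n)            # 1 = real transient, 0 = bogus
# Give "real" stamps a faint central source so the classes separate.
stamps[labels == 1, 8:12, 8:12] += 1.5

X = stamps.reshape(n, -1)                 # each stamp -> 400 pixel features
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]
# Threshold so that at most 1% of bogus detections pass (1% FPR).
thr = np.quantile(scores[y_te == 0], 0.99)
mdr = np.mean(scores[y_te == 1] <= thr)   # missed detection rate
```

Fixing the operating point by false-positive rate, rather than by raw accuracy, matches how survey pipelines are actually tuned: the tolerable volume of bogus candidates is the binding constraint.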
Abstract:
PURPOSE Potentially inappropriate prescribing (PIP) is common in older people and can result in increased morbidity, adverse drug events, and hospitalizations. The OPTI-SCRIPT study (Optimizing Prescribing for Older People in Primary Care, a cluster-randomized controlled trial) tested the effectiveness of a multifaceted intervention for reducing PIP in primary care.
METHODS We conducted a cluster-randomized controlled trial among 21 general practitioner practices and 196 patients with PIP. Intervention participants received a complex, multifaceted intervention incorporating academic detailing; review of medicines with web-based pharmaceutical treatment algorithms that provide recommended alternative-treatment options; and tailored patient information leaflets. Control practices delivered usual care and received simple, patient-level PIP feedback. Primary outcomes were the proportion of patients with PIP and the mean number of potentially inappropriate prescriptions. We performed intention-to-treat analysis using random-effects regression.
RESULTS All 21 practices and 190 patients were followed. At intervention completion, patients in the intervention group had significantly lower odds of having PIP than patients in the control group (adjusted odds ratio = 0.32; 95% CI, 0.15–0.70; P = .02). The mean number of PIP drugs in the intervention group was 0.70, compared with 1.18 in the control group (P = .02). The intervention group was almost one-third less likely than the control group to have PIP drugs at intervention completion, but this difference was not significant (incidence rate ratio = 0.71; 95% CI, 0.50–1.02; P = .49). The intervention was effective in reducing proton pump inhibitor prescribing (adjusted odds ratio = 0.30; 95% CI, 0.14–0.68; P = .04).
CONCLUSIONS The OPTI-SCRIPT intervention incorporating academic detailing with a pharmacist, and a review of medicines with web-based pharmaceutical treatment algorithms, was effective in reducing PIP, particularly in modifying prescribing of proton pump inhibitors, the most commonly occurring PIP drugs nationally.
Abstract:
With over 50 billion downloads and more than 1.3 million apps in Google’s official market, Android has continued to gain popularity amongst smartphone users worldwide. At the same time there has been a rise in malware targeting the platform, with more recent strains employing highly sophisticated detection-avoidance techniques. As traditional signature-based methods become less potent in detecting unknown malware, alternatives are needed for timely zero-day discovery. This paper therefore proposes an approach that utilizes ensemble learning for Android malware detection. It combines the advantages of static analysis with the efficiency and performance of ensemble machine learning to improve Android malware detection accuracy. The machine learning models are built using a large repository of malware samples and benign apps from a leading antivirus vendor. The experimental results and analysis presented show that the proposed method, which uses a large feature space to leverage the power of ensemble learning, is capable of 97.3% to 99% detection accuracy with very low false positive rates.
Abstract:
One of the most popular techniques for generating classifier ensembles is stacking, which is based on a meta-learning approach. In this paper, we introduce an alternative to stacking based on cluster analysis. As in stacking, instances from a validation set are initially classified by all base classifiers. The output of each classifier is then treated as a new attribute of the instance. Following this, the validation set is divided into clusters according to the new attributes and a small subset of the original attributes of the instances. For each cluster, we find its centroid and calculate its class label. The collection of centroids is considered as a meta-classifier. Experimental results show that the new method outperformed all benchmark methods, namely Majority Voting, Stacking J48, Stacking LR, AdaBoost J48, and Random Forest, in 12 out of 22 data sets. The proposed method has two advantageous properties: it is very robust to relatively small training sets and it can be applied in semi-supervised learning problems. We also provide a theoretical investigation of the proposed method, which demonstrates that, for the method to be successful, the base classifiers in the ensemble should have accuracy levels greater than 50%.
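The procedure described in this abstract can be sketched roughly as follows, with several simplifications that are mine rather than the paper's: base-classifier outputs on a validation set become new attributes, the validation set is clustered on those attributes plus a small subset of original attributes, and each cluster centroid, labelled by majority vote, then acts as the meta-classifier via nearest-centroid lookup.

```python
# Clustering-based alternative to stacking: cluster centroids labelled
# by majority vote serve as the meta-classifier. Simplified sketch.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=900, random_state=1)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=1)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

base = [DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr),
        LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
        GaussianNB().fit(X_tr, y_tr)]

def meta_features(Xs):
    # Base-classifier outputs become new attributes of each instance...
    probs = np.column_stack([c.predict_proba(Xs)[:, 1] for c in base])
    # ...combined with a small subset of the original attributes.
    return np.column_stack([probs, Xs[:, :2]])

km = KMeans(n_clusters=8, n_init=10, random_state=1).fit(meta_features(X_val))
# Label each centroid with the majority class of its cluster.
centroid_label = np.array([np.bincount(y_val[km.labels_ == k]).argmax()
                           for k in range(8)])
# Classify test instances by the label of their nearest centroid.
preds = centroid_label[km.predict(meta_features(X_te))]
acc = (preds == y_te).mean()
```

Because the meta-classifier is just a small set of labelled centroids, it stays cheap for small training sets, and unlabelled instances could still contribute to the clustering step, which is the semi-supervised angle the abstract mentions.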