68 results for random forest data analysis
Abstract:
In this study we show that forest areas contribute significantly to the estimated benefits from outdoor recreation in Northern Ireland. Secondly, we provide empirical evidence of the gains in the statistical efficiency of both benefit and parameter estimates obtained by analysing follow-up responses with Double Bounded interval data analysis. As these gains are considerable, it is clearly worth considering this method in CVM survey design even when moderately large sample sizes are used. Finally, we demonstrate that estimates of means and medians of WTP distributions for access to forest recreation show plausible magnitude, are consistent with previous UK studies, and converge across parametric and non-parametric methods of estimation.
Abstract:
The concentration of organic acids in anaerobic digesters is one of the most critical parameters for monitoring and advanced control of anaerobic digestion processes. Thus, a reliable online-measurement system is absolutely necessary. A novel approach to obtaining these measurements indirectly and online using UV/vis spectroscopic probes, in conjunction with powerful pattern recognition methods, is presented in this paper. A UV/vis spectroscopic probe from S::CAN is used in combination with a custom-built dilution system to monitor the absorption of fully fermented sludge across a spectrum from 200 to 750 nm. Advanced pattern recognition methods are then used to map the non-linear relationship between measured absorption spectra and laboratory measurements of organic acid concentrations. Linear discriminant analysis, generalized discriminant analysis (GerDA), support vector machines (SVM), relevance vector machines, random forest and neural networks are investigated for this purpose and their performance compared. To validate the approach, online measurements have been taken at a full-scale 1.3-MW industrial biogas plant. Results show that whereas some of the methods considered do not yield satisfactory results, accurate prediction of organic acid concentration ranges can be obtained with both GerDA and SVM-based classifiers, with classification rates in excess of 87% achieved on test data.
Abstract:
Background: Ineffective risk stratification can delay diagnosis of serious disease in patients with hematuria. We applied a systems biology approach to analyze clinical, demographic and biomarker measurements (n = 29) collected from 157 hematuric patients: 80 urothelial cancer (UC) and 77 controls with confounding pathologies.
Methods: On the basis of biomarkers, we conducted agglomerative hierarchical clustering to identify patient and biomarker clusters. We then explored the relationship between the patient clusters and clinical characteristics using Chi-square analyses. We determined classification errors and areas under the receiver operating curve of Random Forest Classifiers (RFC) for patient subpopulations using the biomarker clusters to reduce the dimensionality of the data.
Results: Agglomerative clustering identified five patient clusters and seven biomarker clusters. Final diagnosis categories were non-randomly distributed across the five patient clusters. In addition, two of the patient clusters were enriched with patients with ‘low cancer-risk’ characteristics. The biomarkers which contributed to the diagnostic classifiers for these two patient clusters were similar. In contrast, three of the patient clusters were significantly enriched with patients harboring ‘high cancer-risk’ characteristics including proteinuria, aggressive pathological stage and grade, and malignant cytology. Patients in these three clusters included controls, that is, patients with other serious disease and patients with cancers other than UC. Biomarkers which contributed to the diagnostic classifiers for the largest ‘high cancer-risk’ cluster were different from those contributing to the classifiers for the ‘low cancer-risk’ clusters. Biomarkers which contributed to subpopulations that were split according to smoking status, gender and medication were different.
Conclusions: The systems biology approach applied in this study allowed the hematuric patients to cluster naturally on the basis of the heterogeneity within their biomarker data, into five distinct risk subpopulations. Our findings highlight an approach with the promise to unlock the potential of biomarkers. This will be especially valuable in the field of diagnostic bladder cancer where biomarkers are urgently required. Clinicians could interpret risk classification scores in the context of clinical parameters at the time of triage. This could reduce cystoscopies and enable priority diagnosis of aggressive diseases, leading to improved patient outcomes at reduced costs. © 2013 Emmert-Streib et al; licensee BioMed Central Ltd.
Abstract:
Identifying differential expression of genes in psoriatic and healthy skin by microarray data analysis is a key approach to understand the pathogenesis of psoriasis. Analysis of more than one dataset to identify genes commonly upregulated reduces the likelihood of false positives and narrows down the possible signature genes. Genes controlling the critical balance between T helper 17 and regulatory T cells are of special interest in psoriasis. Our objectives were to identify genes that are consistently upregulated in lesional skin from three published microarray datasets. We carried out a reanalysis of gene expression data extracted from three experiments on samples from psoriatic and nonlesional skin using the same stringency threshold and software and further compared the expression levels of 92 genes related to the T helper 17 and regulatory T cell signaling pathways. We found 73 probe sets representing 57 genes commonly upregulated in lesional skin from all datasets. These included 26 probe sets representing 20 genes that have no previous link to the etiopathogenesis of psoriasis. These genes may represent novel therapeutic targets and surely need more rigorous experimental testing to be validated. Our analysis also identified 12 of 92 genes known to be related to the T helper 17 and regulatory T cell signaling pathways, and these were found to be differentially expressed in the lesional skin samples.
Abstract:
Our review and meta-analysis examined the association between a posteriori–derived dietary patterns (DPs) and risk of type 2 diabetes mellitus. MEDLINE and EMBASE were searched for articles published up to July 2012 and data were extracted by two independent reviewers. Overall, 19 cross-sectional, 12 prospective cohort, and two nested case-control studies were eligible for inclusion. Results from cross-sectional studies reported an inconsistent association between DPs and measures of insulin resistance and/or glucose abnormalities, or prevalence of type 2 diabetes. A meta-analysis was carried out on nine prospective cohort studies that had examined DPs derived by principal component/factor analysis and incidence of type 2 diabetes (totaling 309,430 participants and 16,644 incident cases). Multivariate-adjusted odds ratios were combined using a random-effects meta-analysis. Two broad DPs (Healthy/Prudent and Unhealthy/Western) were identified based on food factor loadings published in original studies. Pooled results indicated a 15% lower type 2 diabetes risk for those in the highest category of Healthy/Prudent pattern compared with those in the lowest category (95% CI 0.80 to 0.91; P<0.0001). Compared with the lowest category of Unhealthy/Western DP, those in the highest category had a 41% increased risk of type 2 diabetes (95% CI 1.32 to 1.52; P<0.0001). These results provide evidence that DPs are consistently associated with risk of type 2 diabetes even when other lifestyle factors are controlled for. Thus, greater adherence to a DP characterized by high intakes of fruit, vegetables, and complex carbohydrate and low intakes of refined carbohydrate, processed meat, and fried food may be one strategy that could have a positive influence on the global public health burden of type 2 diabetes.
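As an illustration of how odds ratios can be combined in a random-effects meta-analysis, the following is a minimal sketch of the DerSimonian-Laird estimator. The abstract does not state which estimator was used, so this choice is an assumption, and the function name `dersimonian_laird` and all input values are hypothetical, made up purely for demonstration.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool study effects (e.g. log odds ratios) under a random-effects model.

    Assumes at least two studies; uses the DerSimonian-Laird moment
    estimator for the between-study variance tau^2.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, truncated at 0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2

# Hypothetical log odds ratios and their variances (not data from the review).
log_ors = [math.log(0.80), math.log(0.90), math.log(0.85)]
variances = [0.010, 0.020, 0.015]

pooled, se, tau2 = dersimonian_laird(log_ors, variances)
low, high = math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)
print(f"pooled OR = {math.exp(pooled):.2f} "
      f"(95% CI {low:.2f} to {high:.2f}), tau^2 = {tau2:.4f}")
```

Pooling is done on the log scale because log odds ratios are closer to normally distributed; the result is exponentiated only for reporting.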
Abstract:
Retrospective clinical datasets are often characterized by a relatively small sample size and a large amount of missing data. In this case, a common way of handling missingness is to discard from the analysis patients with missing covariates, further reducing the sample size. Alternatively, if the mechanism that generated the missing data allows, incomplete data can be imputed on the basis of the observed data, avoiding the reduction of the sample size and allowing methods to deal with complete data later on. Moreover, methodologies for data imputation might depend on the particular purpose and might achieve better results by considering specific characteristics of the domain. The problem of missing data treatment is studied in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem by using surrogate splits, that is, splitting rules that use other variables yielding similar results to the original ones. Instead, our methodology consists in modeling the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied on the completed dataset. The Bayesian network is directly learned from the incomplete data using a structural expectation–maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is due to the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation and the imputation so obtained improved the stratification estimated by the survival tree (especially with respect to using surrogate splits).
Abstract:
The predominant fear in capital markets is that of a price spike. Commodity markets differ in that there is a fear of both upward and downward jumps; this results in implied volatility curves displaying distinct shapes when compared to equity markets. The use of a novel functional data analysis (FDA) approach provides a framework to produce and interpret functional objects that characterise the underlying dynamics of oil future options. We use the FDA framework to examine implied volatility, jump risk, and pricing dynamics within crude oil markets. Examining a WTI crude oil sample for the 2007–2013 period, which includes the global financial crisis and the Arab Spring, strong evidence is found of converse jump dynamics during periods of demand and supply side weakness. This is used as a basis for an FDA-derived Merton (1976) jump diffusion optimised delta hedging strategy, which exhibits superior portfolio management results over traditional methods.
Abstract:
Efficient identification and follow-up of astronomical transients is hindered by the need for humans to manually select promising candidates from data streams that contain many false positives. These artefacts arise in the difference images that are produced by most major ground-based time-domain surveys with large format CCD cameras. This dependence on humans to reject bogus detections is unsustainable for next generation all-sky surveys and significant effort is now being invested to solve the problem computationally. In this paper, we explore a simple machine learning approach to real-bogus classification by constructing a training set from the image data of ~32 000 real astrophysical transients and bogus detections from the Pan-STARRS1 Medium Deep Survey. We derive our feature representation from the pixel intensity values of a 20 x 20 pixel stamp around the centre of the candidates. This differs from previous work in that it works directly on the pixels rather than catalogued domain knowledge for feature design or selection. Three machine learning algorithms are trained (artificial neural networks, support vector machines and random forests) and their performances are tested on a held-out subset of 25 per cent of the training data. We find the best results from the random forest classifier and demonstrate that by accepting a false positive rate of 1 per cent, the classifier initially suggests a missed detection rate of around 10 per cent. However, we also find that a combination of bright star variability, nuclear transients and uncertainty in human labelling means that our best estimate of the missed detection rate is approximately 6 per cent.
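The trade-off quoted above (a missed detection rate at a fixed false positive rate) can be read off any scoring classifier by sweeping its decision threshold. The sketch below is one simple way to do this, not the authors' pipeline: the function name `missed_detection_rate_at_fpr`, the scores and the labels are all hypothetical, and it assumes both classes are present in the held-out set.

```python
def missed_detection_rate_at_fpr(scores, labels, max_fpr=0.01):
    """Find the lowest threshold whose false positive rate stays <= max_fpr.

    scores: classifier outputs, higher means "more likely real".
    labels: 1 for a real transient, 0 for a bogus detection.
    A candidate is accepted as real when score >= threshold.
    Returns (threshold, fpr, missed_detection_rate), or None if even the
    strictest threshold exceeds max_fpr.
    """
    n_real = sum(labels)
    n_bogus = len(labels) - n_real
    best = None
    # Lowering the threshold can only add false positives, so stop at the
    # first threshold that breaks the FPR budget.
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fpr = fp / n_bogus
        if fpr > max_fpr:
            break
        best = (t, fpr, fn / n_real)
    return best

# Hypothetical held-out scores and labels (made up for illustration).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

print(missed_detection_rate_at_fpr(scores, labels, max_fpr=0.2))
```

With only ten examples the FPR budget here is set to 20 per cent; a 1 per cent budget as in the abstract needs a much larger held-out set to be meaningful.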
Abstract:
PURPOSE: This systematic review reports on the survival of feldspathic porcelain veneers.
MATERIALS AND METHODS: The Cochrane Library, MEDLINE (OVID), Embase, Web of Knowledge, selected journals, clinical trials registers, and conference proceedings were searched independently by two reviewers. Academic colleagues were also contacted to identify relevant research. Inclusion criteria were human cohort studies (prospective and retrospective) and controlled trials assessing outcomes of feldspathic porcelain veneers in more than 15 patients and with at least some of the veneers in situ for 5 years. Of 4,294 articles identified, 116 studies underwent full-text screenings and 69 were further reviewed for eligibility. Of these, 11 were included in the qualitative analysis and 6 (5 cohorts) were included in meta-analyses. Estimated cumulative survival and standard error for each study were assessed and used for meta-, sensitivity, and post hoc analyses. The I² statistic and the Cochran Q test and its associated P value were used to evaluate statistical heterogeneity, with a random-effects meta-analysis used when the P value for heterogeneity was less than .1. Galbraith, forest, and funnel plots explored heterogeneity, publication patterns, and small study biases.
RESULTS: The estimated cumulative survival for feldspathic porcelain veneers was 95.7% (95% confidence interval [CI]: 92.9% to 98.4%) at 5 years and ranged from 64% to 95% at 10 years across three studies. A post hoc meta-analysis indicated that the 10-year best estimate may approach 95.6% (95% CI: 93.8% to 97.5%). High levels of statistical heterogeneity were found.
CONCLUSIONS: When bonded to enamel substrate, feldspathic porcelain veneers have a very high 10-year survival rate that may approach 95%. Clinical heterogeneity is associated with differences in reported survival rates. Use of clinically relevant survival definitions and careful reporting of tooth characteristics, censorship, clustering, and precise results in future research would improve meta-analytic estimates and aid treatment decisions.
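The heterogeneity measures named in the methods of this abstract, Cochran's Q and the I² statistic, follow from inverse-variance weights. The sketch below uses the standard definitions, with I² = max(0, (Q − df)/Q) × 100; the function name `cochran_q_and_i2` and the input values are hypothetical and are not taken from the review.

```python
def cochran_q_and_i2(effects, variances):
    """Return (Q, degrees of freedom, I^2 in per cent) for a set of studies.

    effects:   per-study estimates (e.g. cumulative survival or log odds ratios)
    variances: squared standard errors of those estimates
    """
    w = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    # I^2 is the share of total variation attributed to between-study
    # heterogeneity rather than chance; it is truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2

# Hypothetical study estimates and variances (made up for illustration).
q, df, i2 = cochran_q_and_i2([0.0, 1.0], [0.1, 0.1])
print(f"Q = {q:.2f} on {df} df, I^2 = {i2:.1f}%")
```

Q is compared against a chi-square distribution with `df` degrees of freedom to obtain the P value for heterogeneity; that step is left out here because it needs a chi-square routine that the Python standard library does not provide.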
Abstract:
With over 50 billion downloads and more than 1.3 million apps in Google’s official market, Android has continued to gain popularity amongst smartphone users worldwide. At the same time there has been a rise in malware targeting the platform, with more recent strains employing highly sophisticated detection avoidance techniques. As traditional signature based methods become less potent in detecting unknown malware, alternatives are needed for timely zero-day discovery. Thus, this paper proposes an approach that utilizes ensemble learning for Android malware detection. It combines advantages of static analysis with the efficiency and performance of ensemble machine learning to improve Android malware detection accuracy. The machine learning models are built using a large repository of malware samples and benign apps from a leading antivirus vendor. The experimental results and analysis presented show that the proposed method, which uses a large feature space to leverage the power of ensemble learning, is capable of 97.3% to 99% detection accuracy with very low false positive rates.
Abstract:
Aims/hypothesis The aim of this study was to investigate the association between routine vaccinations and the risk of childhood type 1 diabetes mellitus by systematically reviewing the published literature and performing meta-analyses where possible.
Methods A comprehensive literature search was performed of MEDLINE and EMBASE to identify all studies that compared vaccination rates in children who subsequently developed type 1 diabetes mellitus and in control children. ORs and 95% CIs were obtained from published reports or derived from individual patient data and then combined using a random effects meta-analysis.
Results In total, 23 studies investigating 16 vaccinations met the inclusion criteria. Eleven of these contributed to meta-analyses which included data from between 359 and 11,828 childhood diabetes cases. Overall, there was no evidence to suggest an association between any of the childhood vaccinations investigated and type 1 diabetes mellitus. The pooled ORs ranged from 0.58 (95% CI 0.24, 1.40) for the measles, mumps and rubella (MMR) vaccination in five studies up to 1.04 (95% CI 0.94, 1.14) for the Haemophilus influenzae type b (Hib) vaccination in 11 studies. Significant heterogeneity was present in most of the pooled analyses, but was markedly reduced when analyses were restricted to study reports with high methodology quality scores. Neither this restriction by quality nor the original authors’ adjustments for potential confounding made a substantial difference to the pooled ORs.
Conclusions/interpretation This study provides no evidence of an association between routine vaccinations and childhood type 1 diabetes.
Abstract:
One of the most popular techniques of generating classifier ensembles is known as stacking which is based on a meta-learning approach. In this paper, we introduce an alternative method to stacking which is based on cluster analysis. Similar to stacking, instances from a validation set are initially classified by all base classifiers. The output of each classifier is subsequently considered as a new attribute of the instance. Following this, a validation set is divided into clusters according to the new attributes and a small subset of the original attributes of the instances. For each cluster, we find its centroid and calculate its class label. The collection of centroids is considered as a meta-classifier. Experimental results show that the new method outperformed all benchmark methods, namely Majority Voting, Stacking J48, Stacking LR, AdaBoost J48, and Random Forest, in 12 out of 22 data sets. The proposed method has two advantageous properties: it is very robust to relatively small training sets and it can be applied in semi-supervised learning problems. We provide a theoretical investigation regarding the proposed method. This demonstrates that for the method to be successful, the base classifiers applied in the ensemble should have greater than 50% accuracy levels.
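The abstract does not spell out why base classifiers need more than 50% accuracy. A rough intuition comes from plain majority voting, one of the benchmark methods listed above, and not from the proposed cluster-based method. The sketch below assumes a binary task and fully independent base classifiers, both simplifying assumptions, and the function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a strict majority of n classifiers is correct.

    Assumes a binary task, an odd n, and independent classifiers that are
    each correct with probability p (a simplification; real base
    classifiers are usually correlated).
    """
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Ensembles help when p > 0.5 and hurt when p < 0.5.
for p in (0.4, 0.5, 0.6):
    print(p, [round(majority_vote_accuracy(p, n), 3) for n in (1, 3, 11)])
```

Under these assumptions the ensemble accuracy rises with `n` when `p` is above 0.5 and falls when it is below, which is consistent with the condition stated in the abstract.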
Abstract:
BACKGROUND: The task of revising dietary folate recommendations for optimal health is complicated by a lack of data quantifying the biomarker response that reliably reflects a given folate intake.
OBJECTIVE: We conducted a dose-response meta-analysis in healthy adults to quantify the typical response of recognized folate biomarkers to a change in folic acid intake.
DESIGN: Electronic and bibliographic searches identified 19 randomized controlled trials that supplemented with folic acid and measured folate biomarkers before and after the intervention in apparently healthy adults aged ≥18 y. For each biomarker response, the regression coefficient (β) for individual studies and the overall pooled β were calculated by using random-effects meta-analysis.
RESULTS: Folate biomarkers (serum/plasma and red blood cell folate) increased in response to folic acid in a dose-response manner only up to an intake of 400 μg/d. Calculation of the overall pooled β for studies in the range of 50 to 400 μg/d indicated that a doubling of folic acid intake resulted in an increase in serum/plasma folate by 63% (71% for microbiological assay; 61% for nonmicrobiological assay) and red blood cell folate by 31% (irrespective of whether microbiological or other assay was used). Studies that used the microbiological assay indicated lower heterogeneity compared with studies using nonmicrobiological assays for determining serum/plasma (I² = 13.5% compared with I² = 77.2%) and red blood cell (I² = 45.9% compared with I² = 70.2%) folate.
CONCLUSIONS: Studies administering >400 μg folic acid/d show no dose-response relation and thus will not yield meaningful results for consideration when generating dietary folate recommendations. The calculated folate biomarker response to a given folic acid intake may be more robust with the use of a microbiological assay rather than alternative methods for blood folate measurement.
Abstract:
A compositional multivariate approach is used to analyse regional scale soil geochemical data obtained as part of the Tellus Project generated by the Geological Survey of Northern Ireland (GSNI). The multi-element total concentration data presented comprise XRF analyses of 6862 rural soil samples collected at 20 cm depths on a non-aligned grid at one site per 2 km². Censored data were imputed using published detection limits. Using these imputed values for 46 elements (including LOI), each soil sample site was assigned to the regional geology map provided by GSNI initially using the dominant lithology for the map polygon. Northern Ireland includes a diversity of geology representing a stratigraphic record from the Mesoproterozoic, up to and including the Palaeogene. However, the advance of ice sheets and their meltwaters over the last 100,000 years has left at least 80% of the bedrock covered by superficial deposits, including glacial till and post-glacial alluvium and peat. The question is to what extent the soil geochemistry reflects the underlying geology or superficial deposits. To address this, the geochemical data were transformed using centered log ratios (clr) to observe the requirements of compositional data analysis and avoid closure issues. Following this, compositional multivariate techniques including compositional Principal Component Analysis (PCA) and minimum/maximum autocorrelation factor (MAF) analysis method were used to determine the influence of underlying geology on the soil geochemistry signature. PCA showed that 72% of the variation was determined by the first four principal components (PCs) implying “significant” structure in the data. Analysis of variance showed that only 10 PCs were necessary to classify the soil geochemical data. To consider an improvement over PCA that uses the spatial relationships of the data, a classification based on MAF analysis was undertaken using the first 6 dominant factors.
Understanding the relationship between soil geochemistry and superficial deposits is important for environmental monitoring of fragile ecosystems such as peat. To explore whether peat cover could be predicted from the classification, the lithology designation was adapted to include the presence of peat, based on GSNI superficial deposit polygons, and linear discriminant analysis (LDA) was undertaken. Prediction accuracy for LDA classification improved from 60.98% based on PCA using 10 principal components to 64.73% using MAF based on the 6 most dominant factors. The misclassification of peat may reflect degradation of peat-covered areas since the creation of superficial deposit classification. Further work will examine the influence of underlying lithologies on elemental concentrations in peat composition and the effect of this in classification analysis.
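The centred log-ratio (clr) transform mentioned in this abstract maps each composition to the logarithms of its parts minus the log of their geometric mean. A minimal sketch follows; the function name `clr` and the three-part sample are hypothetical, and all parts are assumed strictly positive, which is why censored values must be imputed first.

```python
import math

def clr(composition):
    """Centred log-ratio transform of one strictly positive composition."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]

# Hypothetical three-part composition (made-up values, e.g. percentages).
sample = [10.0, 20.0, 70.0]
transformed = clr(sample)
print([round(v, 3) for v in transformed])

# The transform depends only on ratios between parts, so rescaling the
# sample (changing units or closure constant) leaves the result unchanged.
rescaled = clr([x / 100.0 for x in sample])
```

The clr coordinates of a sample always sum to zero, so the transformed data lie in a subspace of one fewer dimension than the number of parts; this is worth keeping in mind when PCA is run on them.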
Abstract:
BACKGROUND & AIMS: Gluteofemoral obesity (determined by measurement of subcutaneous fat in hip and thigh regions) could reduce risks of cardiovascular and diabetic disorders associated with abdominal obesity. We evaluated whether gluteofemoral obesity also reduces risk of Barrett's esophagus (BE), a premalignant lesion associated with abdominal obesity.
METHODS: We collected data from non-Hispanic white participants in 8 studies in the Barrett's and Esophageal Adenocarcinoma Consortium. We compared measures of hip circumference (as a proxy for gluteofemoral obesity) from cases of BE (n=1559) separately with 2 control groups: 2557 population-based controls and 2064 individuals with gastroesophageal reflux disease (GERD controls). Study-specific odds ratios (OR) and 95% confidence intervals (95% CI) were estimated using individual participant data and multivariable logistic regression and combined using random effects meta-analysis.
RESULTS: We found an inverse relationship between hip circumference and BE (OR per 5 cm increase, 0.88; 95% CI, 0.81-0.96), compared with population-based controls in a multivariable model that included waist circumference. This association was not observed in models that did not include waist circumference. Similar results were observed in analyses stratified by frequency of GERD symptoms. The inverse association with hip circumference was only statistically significant among men (vs population-based controls: OR, 0.85; 95% CI, 0.76-0.96 for men; OR, 0.93; 95% CI, 0.74-1.16 for women). For men, within each category of waist circumference, a larger hip circumference was associated with decreased risk of BE. Increasing waist circumference was associated with increased risk of BE in the mutually adjusted population-based and GERD control models.
CONCLUSIONS: Although abdominal obesity is associated with increased risk of BE, there is an inverse association between gluteofemoral obesity and BE, particularly among men.