876 results for Boosted regression trees


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Background: Individual signs and symptoms are of limited value for the diagnosis of influenza. Objective: To develop a decision tree for the diagnosis of influenza based on a classification and regression tree (CART) analysis. Methods: Data from two previous similar cohort studies were assembled into a single dataset. The data were randomly divided into a development set (70%) and a validation set (30%). We used CART analysis to develop three models that maximize the number of patients who do not require diagnostic testing prior to treatment decisions. The validation set was used to evaluate overfitting of the model to the training set. Results: Model 1 had seven terminal nodes based on temperature, the onset of symptoms and the presence of chills, cough and myalgia. Model 2 was a simpler tree with only two splits based on temperature and the presence of chills. Model 3 was developed with temperature as a dichotomous variable (≥38°C) and had only two splits based on the presence of fever and myalgia. The areas under the receiver operating characteristic curve (AUROCC) for the development and validation sets, respectively, were 0.82 and 0.80 for Model 1, 0.75 and 0.76 for Model 2 and 0.76 and 0.77 for Model 3. Model 2 classified 67% of patients in the validation group into a high- or low-risk group compared with only 38% for Model 1 and 54% for Model 3. Conclusions: A simple decision tree (Model 2) classified two-thirds of patients as low or high risk and had an AUROCC of 0.76. After further validation in an independent population, this CART model could support clinical decision making regarding influenza, with low-risk patients requiring no further evaluation for influenza and high-risk patients being candidates for empiric symptomatic or drug therapy.
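The evaluation design in this abstract (a random 70/30 split, then AUROCC on each part) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_70_30` and `auroc` are hypothetical, the scores are made up, and the AUROCC is computed with the standard rank (Mann-Whitney) formulation rather than whatever software the authors used.

```python
import random


def split_70_30(records, seed=0):
    """Randomly split records into a development set (70%) and a validation set (30%)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def auroc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive case gets the higher score; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up risk scores for four patients (1 = influenza, 0 = no influenza).
print(auroc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))  # 0.75
```

Comparing the value on the development set with the value on the validation set, as the abstract does, gives a rough check for overfitting.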

Relevance:

100.00% 100.00%

Publisher:

Abstract:

PURPOSE: According to estimates, around 230 people die as a result of radon exposure in Switzerland. This public health concern makes reliable indoor radon prediction and mapping methods necessary in order to improve risk communication to the public. The aim of this study was to develop an automated method to classify lithological units according to their radon characteristics and to develop mapping and predictive tools in order to improve local radon prediction. METHOD: About 240 000 indoor radon concentration (IRC) measurements in about 150 000 buildings were available for our analysis. The automated classification of lithological units was based on k-medoids clustering via pair-wise Kolmogorov distances between IRC distributions of lithological units. For IRC mapping and prediction we used random forests and Bayesian additive regression trees (BART). RESULTS: The automated classification groups lithological units well in terms of their IRC characteristics. In particular, the IRC differences in metamorphic rocks such as gneiss are well revealed by this method. The maps produced by random forests soundly represent the regional differences in IRCs in Switzerland and improve the spatial detail compared to existing approaches. We could explain 33% of the variation in IRC data with random forests. Additionally, the variable influence evaluated by random forests shows that building characteristics are less important predictors of IRCs than spatial/geological influences. BART could explain 29% of IRC variability and produced maps that indicate the prediction uncertainty. CONCLUSION: Ensemble regression trees are a powerful tool to model and understand the multidimensional influences on IRCs. Automatic clustering of lithological units complements this method by facilitating the interpretation of radon properties of rock types. This study provides an important element for radon risk communication. Future approaches should consider taking into account further variables such as soil gas radon measurements as well as more detailed geological information.
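The pair-wise Kolmogorov distance that feeds the k-medoids clustering is the largest vertical gap between two empirical cumulative distribution functions. A minimal sketch follows, assuming plain Python lists of measurements per lithological unit. The function names and the unit names `unit_a`, `unit_b`, `unit_c` are hypothetical, the values are made up, and the clustering step itself is not shown.

```python
def kolmogorov_distance(sample_a, sample_b):
    """Maximum absolute difference between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every observation <= x,
        # so i/len(a) and j/len(b) are the two ECDF values at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def pairwise_distances(groups):
    """Distance for every ordered pair of groups; usable as input to a clustering step."""
    names = sorted(groups)
    return {(p, q): kolmogorov_distance(groups[p], groups[q])
            for p in names for q in names}


# Made-up concentrations for three hypothetical lithological units.
units = {"unit_a": [1, 2, 3, 4], "unit_b": [3, 4, 5, 6], "unit_c": [1, 2, 3, 4]}
dist = pairwise_distances(units)
print(dist[("unit_a", "unit_b")])  # 0.5
print(dist[("unit_a", "unit_c")])  # 0.0
```

Units whose measurement distributions nearly coincide get a distance near 0, which is what lets a medoid-based clustering group them together.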

Relevance:

100.00% 100.00%

Publisher:

Abstract:

1. Species distribution modelling is used increasingly in both applied and theoretical research to predict how species are distributed and to understand attributes of species' environmental requirements. In species distribution modelling, various statistical methods are used that combine species occurrence data with environmental spatial data layers to predict the suitability of any site for that species. While the number of data sharing initiatives involving species' occurrences in the scientific community has increased dramatically over the past few years, various data quality and methodological concerns related to using these data for species distribution modelling have not been addressed adequately. 2. We evaluated how uncertainty in georeferences and associated locational error in occurrences influence species distribution modelling using two treatments: (1) a control treatment where models were calibrated with original, accurate data and (2) an error treatment where data were first degraded spatially to simulate locational error. To incorporate error into the coordinates, we moved each coordinate with a random number drawn from the normal distribution with a mean of zero and a standard deviation of 5 km. We evaluated the influence of error on the performance of 10 commonly used distributional modelling techniques applied to 40 species in four distinct geographical regions. 3. Locational error in occurrences reduced model performance in three of these regions; relatively accurate predictions of species distributions were possible for most species, even with degraded occurrences. Two species distribution modelling techniques, boosted regression trees and maximum entropy, were the best performing models in the face of locational errors. The results obtained with boosted regression trees were only slightly degraded by errors in location, and the results obtained with the maximum entropy approach were not affected by such errors. 4. Synthesis and applications. To use the vast array of occurrence data that exists currently for research and management relating to the geographical ranges of species, modellers need to know the influence of locational error on model quality and whether some modelling techniques are particularly robust to error. We show that certain modelling techniques are particularly robust to a moderate level of locational error and that useful predictions of species distributions can be made even when occurrence data include some error.
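The error treatment described in point 2 (shifting each coordinate by a draw from a normal distribution with mean zero and standard deviation 5 km) might look like the sketch below. It assumes occurrence points are stored as (x, y) pairs in a projected coordinate system measured in kilometres and that x and y are shifted independently; neither detail is stated in the abstract, and the function name `degrade_coordinates` is hypothetical.

```python
import random


def degrade_coordinates(points_km, sd_km=5.0, seed=None):
    """Return a copy of the points with independent Gaussian noise
    (mean 0, standard deviation sd_km) added to each coordinate."""
    rng = random.Random(seed)
    return [(x + rng.gauss(0.0, sd_km), y + rng.gauss(0.0, sd_km))
            for x, y in points_km]


# Made-up occurrence records; the control treatment keeps `original`,
# the error treatment calibrates models on `degraded` instead.
original = [(512.3, 188.9), (640.0, 201.5), (598.7, 175.2)]
degraded = degrade_coordinates(original, sd_km=5.0, seed=1)
for before, after in zip(original, degraded):
    print(before, "->", after)
```

Fixing the seed makes the degraded data set reproducible, so every modelling technique can be compared on the same perturbed occurrences.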

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Predictive species distribution modelling (SDM) has become an essential tool in biodiversity conservation and management. The choice of grain size (resolution) of environmental layers used in modelling is one important factor that may affect predictions. We applied 10 distinct modelling techniques to presence-only data for 50 species in five different regions, to test whether: (1) a 10-fold coarsening of resolution affects predictive performance of SDMs, and (2) any observed effects are dependent on the type of region, modelling technique, or species considered. Results show that a 10 times change in grain size does not severely affect predictions from species distribution models. The overall trend is towards degradation of model performance, but improvement can also be observed. Changing grain size does not equally affect models across regions, techniques, and species types. The strongest effect is on regions and species types, with tree species in the data sets (regions) with highest locational accuracy being most affected. Changing grain size had little influence on the ranking of techniques: boosted regression trees remain best at both resolutions. The number of occurrences used for model training had an important effect, with larger sample sizes resulting in better models, which tended to be more sensitive to grain. Effect of grain change was only noticeable for models reaching sufficient performance and/or with initial data that have an intrinsic error smaller than the coarser grain size.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Conservation and monitoring of forest biodiversity requires reliable information about forest structure and composition at multiple spatial scales. However, detailed data about forest habitat characteristics across large areas are often incomplete due to difficulties associated with field sampling methods. To overcome this limitation we employed a nationally available light detection and ranging (LiDAR) remote sensing dataset to develop variables describing forest landscape structure across a large environmental gradient in Switzerland. Using a model species indicative of structurally rich mountain forests (hazel grouse Bonasa bonasia), we tested the potential of such variables to predict species occurrence and evaluated the additional benefit of LiDAR data when used in combination with traditional, sample plot-based field variables. We calibrated boosted regression trees (BRT) models for both variable sets separately and in combination, and compared the models’ accuracies. While both field-based and LiDAR models performed well, combining the two data sources improved the accuracy of the species’ habitat model. The variables retained from the two datasets held different types of information: field variables mostly quantified food resources and cover in the field and shrub layer, LiDAR variables characterized heterogeneity of vegetation structure which correlated with field variables describing the understory and ground vegetation. When combined with data on forest vegetation composition from field surveys, LiDAR provides valuable complementary information for encompassing species niches more comprehensively. Thus, LiDAR bridges the gap between precise, locally restricted field-data and coarse digital land cover information by reliably identifying habitat structure and quality across large areas.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In the study of forest communities, establishing the relative importance of the factors that define species composition and distribution is a challenge. Studying the responses of tree species along environmental gradients is essential for understanding ecological processes and for conservation decisions. To help elucidate the ecological processes in the main forest formations of the State of São Paulo (Lowland Dense Ombrophilous Forest, Submontane Dense Ombrophilous Forest, Seasonal Semideciduous Forest and Forested Savanna), this work aimed to answer the following questions: (I) do the floristic composition and abundance of tree species in each phytogeographic unit vary along the edaphic and topographic gradient? (II) can soil characteristics and topography influence the predictability of occurrence of widely distributed tree species in different vegetation types? (III) is there a relationship between the spatial distribution pattern of tree species and the soil and topography parameters? The work was carried out in plots located in conservation units (UC) containing stretches that are representative, in terms of conservation and size, of the four main forest formations present in the State of São Paulo. In each UC, tree individuals (circumference at breast height, CAP ≥ 15 cm), topography, soil texture data and soil chemical attributes were recorded in a 10.24 ha plot subdivided into 256 subplots. Canonical correspondence analyses (CCA) were applied to establish the correspondence between species abundance and the environmental gradient (soil and topography). The modified TWINSPAN method was applied to the CCA ordination diagram to assess the influence of the environmental variables (soil and topography) on species composition. Boosted regression trees (BRT) were fitted to predict species occurrence from the soil and topography variables. The Getis-Ord index (G) was used to determine the spatial autocorrelation of the environmental variables used in the species occurrence prediction models. In the phytogeographic units analysed, the correspondence between the environmental gradient (soil and topography) and species abundance was significant, especially in the Forested Savanna, where the strongest relationship was observed. Soil and topography were also related to the similarity in floristic composition among subplots, except in the Seasonal Semideciduous Forest (EEC). The main soil and topography variables related to the flora in each UC were: (1) in the Lowland Dense Ombrophilous Forest (PEIC): aluminium content in the deep layer (Al (80-100 cm)), which may reflect the Al content at the surface, soil acidity (pH(H2O) (5-25 cm)) and altitude, which delimited the flooded areas; (2) in the Submontane Dense Ombrophilous Forest (PECB): altitude, a factor that, owing to the rugged relief, influences temperature and sunlight incidence in the understory; (3) in the Forested Savanna (EEA): fertility, aluminium tolerance and soil acidity. In the BRT prediction models, soil chemical variables were more important than texture, owing to the small variation of this attribute in the soil of the sampled areas. Among the soil chemical variables, cation exchange capacity was used to predict species occurrence in all four forest formations, and was particularly important in the deepest soil layer of the Lowland Dense Ombrophilous Forest (PEIC). As for topography, altitude was included in most models and had different influences across the study areas. In general, the presence of widely distributed species showed the same trend in association with soil attributes, but with ranges of the edaphic descriptors that varied according to the study area. The occurrence of Guapira opposita and Syagrus romanzoffiana, whose pattern varied with scale, was explained by variables with aggregated spatial patterns that together accounted for 30% to 50% of relative importance in the BRT model. The presence of A. anthelmia, whose pattern also showed some degree of aggregation, was associated with only one variable with an aggregated pattern, altitude (21%), which may have strongly influenced the distribution of the species by delimiting flooded areas. T. guianensis was associated with environmental predictor variables with an aggregated spatial pattern that together accounted for about 70% of relative importance, which was probably sufficient to establish the aggregated pattern at all scales. However, the influence of environmental factors on the distribution pattern of a species does not depend only on the environmental optimum of the species, but results from the species-environment interaction. It was concluded that: (I) edaphic and topographic characteristics explained a small share of the floristic composition in each phytogeographic unit, although the occurrence of some species was associated with the edaphic and topographic gradient; (II) soil and topography characteristics made it possible to predict the presence of tree species, which showed particularities in their association with the soil of each phytophysiognomy; (III) based on descriptive associations, soil and topography influence the spatial distribution pattern of the species in proportion to their contribution to the presence of those species.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The Highway Safety Manual (HSM) estimates roadway safety performance based on predictive models that were calibrated using national data. Calibration factors are then used to adjust these predictive models to local conditions for local applications. The HSM recommends that local calibration factors be estimated using 30 to 50 randomly selected sites that experienced at least a total of 100 crashes per year. It also recommends that the factors be updated every two to three years, preferably on an annual basis. However, these recommendations are primarily based on expert opinions rather than data-driven research findings. Furthermore, most agencies do not have data for many of the input variables recommended in the HSM. This dissertation is aimed at determining the best way to meet three major data needs affecting the estimation of calibration factors: (1) the required minimum sample sizes for different roadway facilities, (2) the required frequency for calibration factor updates, and (3) the influential variables affecting calibration factors. In this dissertation, statewide segment and intersection data were first collected for most of the HSM recommended calibration variables using a Google Maps application. In addition, eight years (2005-2012) of traffic and crash data were retrieved from existing databases from the Florida Department of Transportation. With these data, the effect of sample size criterion on calibration factor estimates was first studied using a sensitivity analysis. The results showed that the minimum sample sizes not only vary across different roadway facilities, but they are also significantly higher than those recommended in the HSM. In addition, results from paired sample t-tests showed that calibration factors in Florida need to be updated annually. 
To identify influential variables affecting the calibration factors for roadway segments, the variables were prioritized by combining the results from three different methods: negative binomial regression, random forests, and boosted regression trees. Only a few variables were found to explain most of the variation in the crash data. Traffic volume was consistently found to be the most influential. In addition, roadside object density, major and minor commercial driveway densities, and minor residential driveway density were also identified as influential variables.
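For readers unfamiliar with the quantity being studied, an HSM-style local calibration factor is the ratio of total observed crashes to total crashes predicted by the uncalibrated model over the sampled sites. The sketch below illustrates that ratio only; the function name `calibration_factor` and the site counts are made up, and it does not reproduce the sample-size sensitivity analysis in the dissertation.

```python
def calibration_factor(observed_crashes, predicted_crashes):
    """Ratio of total observed to total predicted crashes across sample sites."""
    if len(observed_crashes) != len(predicted_crashes):
        raise ValueError("one observed and one predicted value per site is required")
    total_predicted = sum(predicted_crashes)
    if total_predicted <= 0:
        raise ValueError("total predicted crashes must be positive")
    return sum(observed_crashes) / total_predicted


# Made-up yearly counts for four hypothetical segments.
observed = [3, 7, 2, 6]
predicted = [2.5, 5.0, 2.0, 2.5]
print(calibration_factor(observed, predicted))  # 1.5
```

A factor above 1 means the sampled sites experienced more crashes than the uncalibrated model predicts, and a factor below 1 means fewer.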

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Background: Development of three classification trees (CT) based on the CART (Classification and Regression Trees), CHAID (Chi-Square Automatic Interaction Detection) and C4.5 methodologies for the calculation of probability of hospital mortality; the comparison of the results with the APACHE II, SAPS II and MPM II-24 scores, and with a model based on multiple logistic regression (LR). Methods: Retrospective study of 2864 patients. Random partition (70:30) into a Development Set (DS) n = 1808 and Validation Set (VS) n = 808. Discrimination is compared using the ROC curve (AUC, 95% CI) and the percentage of correct classification (PCC, 95% CI); calibration is compared using the calibration curve and the Standardized Mortality Ratio (SMR, 95% CI). Results: CTs are produced with a different selection of variables and decision rules: CART (5 variables and 8 decision rules), CHAID (7 variables and 15 rules) and C4.5 (6 variables and 10 rules). The common variables were: inotropic therapy, Glasgow, age, (A-a)O2 gradient and antecedent of chronic illness. In VS: all the models achieved acceptable discrimination with AUC above 0.7. CT: CART (0.75(0.71-0.81)), CHAID (0.76(0.72-0.79)) and C4.5 (0.76(0.73-0.80)). PCC: CART (72(69-75)), CHAID (72(69-75)) and C4.5 (76(73-79)). Calibration (SMR) better in the CT: CART (1.04(0.95-1.31)), CHAID (1.06(0.97-1.15)) and C4.5 (1.08(0.98-1.16)). Conclusion: With different methodologies of CTs, trees are generated with different selection of variables and decision rules. The CTs are easy to interpret, and they stratify the risk of hospital mortality. The CTs should be taken into account for the classification of the prognosis of critically ill patients.
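Two of the reported measures are simple to compute from per-patient predictions. The SMR is observed deaths divided by expected deaths (the sum of predicted probabilities), and the PCC is the share of patients classified correctly at a chosen cut-off. The sketch below uses made-up data, hypothetical function names, and an assumed cut-off of 0.5, which the abstract does not state; confidence intervals are not computed.

```python
def standardized_mortality_ratio(died, predicted_prob):
    """Observed deaths divided by expected deaths (sum of predicted probabilities)."""
    expected = sum(predicted_prob)
    if expected <= 0:
        raise ValueError("expected number of deaths must be positive")
    return sum(died) / expected


def percent_correct(died, predicted_prob, threshold=0.5):
    """Percentage of patients whose predicted class matches the outcome.
    The 0.5 threshold is an assumption made for this illustration."""
    correct = sum(1 for d, p in zip(died, predicted_prob)
                  if (p >= threshold) == bool(d))
    return 100.0 * correct / len(died)


# Made-up outcomes (1 = died in hospital) and predicted mortality probabilities.
died = [1, 0, 1, 0]
prob = [0.9, 0.2, 0.4, 0.6]
print(percent_correct(died, prob))                                      # 50.0
print(standardized_mortality_ratio([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))  # 1.0
```

An SMR close to 1 indicates that the model predicts about as many deaths as actually occurred, which is the sense in which the abstract describes calibration.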

Relevance:

80.00% 80.00%

Publisher:

Abstract:

This paper proposes a template for modelling complex datasets that integrates traditional statistical modelling approaches with more recent advances in statistics and modelling through an exploratory framework. Our approach builds on the well-known and long standing traditional idea of 'good practice in statistics' by establishing a comprehensive framework for modelling that focuses on exploration, prediction, interpretation and reliability assessment, a relatively new idea that allows individual assessment of predictions. The integrated framework we present comprises two stages. The first involves the use of exploratory methods to help visually understand the data and identify a parsimonious set of explanatory variables. The second encompasses a two step modelling process, where the use of non-parametric methods such as decision trees and generalized additive models are promoted to identify important variables and their modelling relationship with the response before a final predictive model is considered. We focus on fitting the predictive model using parametric, non-parametric and Bayesian approaches. This paper is motivated by a medical problem where interest focuses on developing a risk stratification system for morbidity of 1,710 cardiac patients given a suite of demographic, clinical and preoperative variables. Although the methods we use are applied specifically to this case study, these methods can be applied across any field, irrespective of the type of response.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Dissertation submitted for the degree of Doctor in Electrical and Computer Engineering (Digital and Perceptional Systems) at Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia

Relevance:

80.00% 80.00%

Publisher:

Abstract:

This dissertation presents a study on the guarantee of electricity supply by special-regime producers using cogeneration technology and the impact they have on the network planning phase. The work was carried out at Energias de Portugal - Distribuição (EDP-D), in the network planning department (DPL). The study used the case of a substation with eighteen special-regime producers connected to its network, sixteen of which are cogeneration producers. The proposed study for this specific case involves analysing the operating conditions of the substation and determining whether it needs any reformulation, in view of the current loads to be met and a possible future load increase, considering that the substation is located in an industrial environment and that there are several electricity producers in its vicinity. To address the guarantee of energy supply by cogeneration, the possibility of forecasting the energy produced by these producers was studied using the following forecasting models: a regression tree, a regression tree with bagging, and a (feedforward) neural network. By implementing these models, the aim is to estimate the power that can be expected to guarantee supply to the load, preventing greater power demand on the substation. The methodology used is based on computational simulations.
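To illustrate what bagging adds to a regression tree, the sketch below fits many small trees on bootstrap resamples and averages their predictions. It is a deliberately reduced example: each "tree" is a single-split stump on one feature, the function names are hypothetical, and the data are made up. It is not the forecasting model built in the dissertation.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree: choose the threshold minimising squared error.
    Returns (threshold, mean_left, mean_right)."""
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((y - mean_left) ** 2 for y in left)
               + sum((y - mean_right) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best = (sse, threshold, mean_left, mean_right)
    if best is None:  # all x values identical: predict the overall mean everywhere
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, mean_left, mean_right = stump
    return mean_left if x <= threshold else mean_right


def fit_bagged(xs, ys, n_trees=25, seed=0):
    """Bagging: fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Made-up data: output is low for small inputs and high for large inputs.
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
model = fit_bagged(xs, ys)
print(predict_bagged(model, 2), predict_bagged(model, 12))
```

Averaging over resamples reduces the variance of a single tree, which is the usual motivation for comparing a plain regression tree against its bagged version.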

Relevance:

80.00% 80.00%

Publisher:

Abstract:

Dissertation submitted for the degree of Master in Biomedical Engineering

Relevance:

80.00% 80.00%

Publisher:

Abstract:

OBJECTIVE: The European Surgical Outcomes Study described mortality following in-patient surgery. Several factors were identified that were able to predict poor outcomes in a multivariate analysis. These included age, procedure urgency, severity and type and the American Association of Anaesthesia score. This study describes in greater detail the relationship between the American Association of Anaesthesia score and postoperative mortality. METHODS: Patients in this 7-day cohort study were enrolled in April 2011. Consecutive patients aged 16 years and older undergoing inpatient non-cardiac surgery with a recorded American Association of Anaesthesia score in 498 hospitals across 28 European nations were included and followed up for a maximum of 60 days. The primary endpoint was in-hospital mortality. Decision tree analysis with the CHAID (SPSS) system was used to delineate nodes associated with mortality. RESULTS: The study enrolled 46,539 patients. Due to missing values, 873 patients were excluded, resulting in the analysis of 45,666 patients. Increasing American Association of Anaesthesia scores were associated with increased admission rates to intensive care and higher mortality rates. Despite a progressive relationship with mortality, discrimination was poor, with an area under the ROC curve of 0.658 (95% CI 0.642 - 0.6775). Using regression trees (CHAID), we identified four discrete American Association of Anaesthesia nodes associated with mortality, with American Association of Anaesthesia 1 and American Association of Anaesthesia 2 compressed into the same node. CONCLUSION: The American Association of Anaesthesia score can be used to determine higher risk groups of surgical patients, but clinicians cannot use the score to discriminate between grades 1 and 2. Overall, the discriminatory power of the model was less than acceptable for widespread use.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

This Letter presents a search at the LHC for s-channel single top-quark production in proton-proton collisions at a centre-of-mass energy of 8 TeV. The analyzed data set was recorded by the ATLAS detector and corresponds to an integrated luminosity of 20.3 fb−1. Selected events contain one charged lepton, large missing transverse momentum and exactly two b-tagged jets. A multivariate event classifier based on boosted decision trees is developed to discriminate s-channel single top-quark events from the main background contributions. The signal extraction is based on a binned maximum-likelihood fit of the output classifier distribution. The analysis leads to an upper limit on the s-channel single top-quark production cross-section of 14.6 pb at the 95% confidence level. The fit gives a cross-section of σs=5.0±4.3 pb, consistent with the Standard Model expectation.

Relevance:

80.00% 80.00%

Publisher:

Abstract:

A search for new charged massive gauge bosons, called W′, is performed with the ATLAS detector at the LHC, in proton-proton collisions at a centre-of-mass energy of √s = 8 TeV, using a dataset corresponding to an integrated luminosity of 20.3 fb−1. This analysis searches for W′ bosons in the W′→tb̄ decay channel in final states with electrons or muons, using a multivariate method based on boosted decision trees. The search covers masses between 0.5 and 3.0 TeV, for right-handed or left-handed W′ bosons. No significant deviation from the Standard Model expectation is observed and limits are set on the W′→tb̄ cross-section times branching ratio and on the W′-boson effective couplings as a function of the W′-boson mass using the CLs procedure. For a left-handed (right-handed) W′ boson, masses below 1.70 (1.92) TeV are excluded at 95% confidence level.