866 results for Boosted regression trees
Abstract:
Modeling the distributions of species, especially of invasive species in non-native ranges, involves multiple challenges. Here, we developed some novel approaches to species distribution modeling aimed at reducing the influences of such challenges and improving the realism of projections. We estimated species-environment relationships with four modeling methods run with multiple scenarios of (1) sources of occurrences and geographically isolated background ranges for absences, (2) approaches to drawing background (absence) points, and (3) alternate sets of predictor variables. We further tested various quantitative metrics of model evaluation against biological insight. Model projections were very sensitive to the choice of training dataset. Model accuracy was much improved by using a global dataset for model training, rather than restricting data input to the species’ native range. AUC score was a poor metric for model evaluation and, if used alone, was not a useful criterion for assessing model performance. Projections away from the sampled space (i.e. into areas of potential future invasion) were very different depending on the modeling methods used, raising questions about the reliability of ensemble projections. Generalized linear models gave very unrealistic projections far away from the training region. Models that efficiently fit the dominant pattern, but exclude highly local patterns in the dataset and capture interactions as they appear in data (e.g. boosted regression trees), improved generalization of the models. Biological knowledge of the species and its distribution was important in refining choices about the best set of projections. A post-hoc test conducted on a new Parthenium dataset from Nepal validated excellent predictive performance of our “best” model. We showed that vast stretches of currently uninvaded geographic areas on multiple continents harbor highly suitable habitats for Parthenium hysterophorus L. (Asteraceae; parthenium).
However, discrepancies between model predictions and parthenium invasion in Australia indicate successful management for this globally significant weed.
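To make the idea of boosted regression trees in the abstract above more concrete, the following is a minimal sketch of gradient boosting with depth-1 trees (stumps) under squared-error loss. It is not the authors' model: the function names, the single predictor, and the default settings are all assumptions chosen for illustration. Stumps cannot represent interactions between predictors; deeper trees would be needed for that.

```python
# Minimal gradient boosting with regression stumps (illustrative sketch only).
# All names and defaults here are hypothetical, not taken from the study.

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    candidates = sorted(set(xs))[:-1]  # the largest value would leave the right side empty
    if not candidates:
        raise ValueError("need at least two distinct predictor values")
    best = None
    for t in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1], best[2], best[3]


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit an additive model: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return {"base": base, "rate": learning_rate, "stumps": stumps}


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    total = model["base"]
    for t, lm, rm in model["stumps"]:
        total += model["rate"] * (lm if x <= t else rm)
    return total


if __name__ == "__main__":
    model = boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
    print(round(predict(model, 1), 3), round(predict(model, 6), 3))
```

The small learning rate means each tree corrects only part of the remaining error, which is one reason such models tend to follow the dominant pattern rather than highly local ones.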
Mapping reef fish and the seascape: using acoustics and spatial modeling to guide coastal management
Abstract:
Reef fish distributions are patchy in time and space, with some coral reef habitats supporting higher densities (i.e., aggregations) of fish than others. Identifying and quantifying fish aggregations (particularly during spawning events) are often top priorities for coastal managers. However, the rapid mapping of these aggregations using conventional survey methods (e.g., non-technical SCUBA diving and remotely operated cameras) is limited by depth, visibility and time. Acoustic sensors (i.e., splitbeam and multibeam echosounders) are not constrained by these limitations, and were used to concurrently map and quantify the location, density and size of reef fish along with seafloor structure in two separate locations in the U.S. Virgin Islands. Reef fish aggregations were documented along the shelf edge, an ecologically important ecotone in the region. Fish were grouped into three classes according to body size, and relationships with the benthic seascape were modeled in one area using Boosted Regression Trees. These models were validated in a second area to test their predictive performance in locations where fish have not been mapped. Models predicting the density of large fish (≥29 cm) performed well (i.e., AUC = 0.77). Water depth and standard deviation of depth were the most influential predictors at two spatial scales (100 and 300 m). Models of small (≤11 cm) and medium (12–28 cm) fish performed poorly (i.e., AUC = 0.49 to 0.68) due to the high prevalence (45–79%) of smaller fish in both locations, and the unequal prevalence of smaller fish in the training and validation areas. Integrating acoustic sensors with spatial modeling offers a new and reliable approach to rapidly identify fish aggregations and to predict the density of large fish in un-surveyed locations. This integrative approach will help coastal managers to prioritize sites, and focus their limited resources on areas that may be of higher conservation value.
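The AUC values quoted above can be read as the probability that a randomly chosen presence site receives a higher predicted score than a randomly chosen absence site. A minimal sketch of that rank-based calculation follows, assuming a binary (presence/absence) response; the function name and the example data are hypothetical and not taken from the study.

```python
# Rank-based AUC (illustrative sketch; data below are invented).

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Three of the four positive/negative pairs are ordered correctly.
    print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

An AUC near 0.5, as reported for the small-fish models, means the scores order presence and absence sites no better than chance.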
Abstract:
Aim: Ecological niche modelling can provide valuable insight into species' environmental preferences and aid the identification of key habitats for populations of conservation concern. Here, we integrate biologging, satellite remote-sensing and ensemble ecological niche models (EENMs) to identify predictable foraging habitats for a globally important population of the grey-headed albatross (GHA) Thalassarche chrysostoma. Location: Bird Island, South Georgia; Southern Atlantic Ocean. Methods: GPS and geolocation-immersion loggers were used to track at-sea movements and activity patterns of GHA over two breeding seasons (n = 55; brood-guard). Immersion frequency (landings per 10-min interval) was used to define foraging events. EENM combining Generalized Additive Models (GAM), MaxEnt, Random Forest (RF) and Boosted Regression Trees (BRT) identified the biophysical conditions characterizing the locations of foraging events, using time-matched oceanographic predictors (Sea Surface Temperature, SST; chlorophyll a, chl-a; thermal front frequency, TFreq; depth). Model performance was assessed through iterative cross-validation and extrapolative performance through cross-validation among years. Results: Predictable foraging habitats identified by EENM spanned neritic (<500 m), shelf break and oceanic waters, coinciding with a set of persistent biophysical conditions characterized by particular thermal ranges (3–8 °C, 12–13 °C), elevated primary productivity (chl-a > 0.5 mg m−3) and frequent manifestation of mesoscale thermal fronts. Our results confirm previous indications that GHA exploit enhanced foraging opportunities associated with frontal systems and objectively identify the Antarctic Polar Frontal Zone (APFZ) as a region of high foraging habitat suitability. Moreover, at the spatial and temporal scales investigated here, the performance of multi-model ensembles was superior to that of single-algorithm models, and cross-validation among years indicated reasonable extrapolative performance.
Main conclusions: EENM techniques are useful for integrating the predictions of several single-algorithm models, reducing potential bias and increasing confidence in predictions. Our analysis highlights the value of EENM for use with movement data in identifying at-sea habitats of wide-ranging marine predators, with clear implications for conservation and management.
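The abstract does not state how the single-algorithm predictions were combined. One common option is a weighted or unweighted mean of the per-cell suitability scores, sketched below. The function name, the model labels, and every number are hypothetical and chosen only for illustration.

```python
# Combining per-cell predictions from several models (illustrative sketch).
# The averaging rule and all values are assumptions, not the study's method.

def ensemble_mean(predictions, weights=None):
    """predictions: dict of model name -> list of scores for the same cells."""
    names = sorted(predictions)
    if weights is None:
        weights = {name: 1.0 for name in names}  # unweighted mean by default
    n_cells = len(predictions[names[0]])
    if any(len(predictions[name]) != n_cells for name in names):
        raise ValueError("all models must predict the same number of cells")
    total_weight = sum(weights[name] for name in names)
    return [
        sum(weights[name] * predictions[name][i] for name in names) / total_weight
        for i in range(n_cells)
    ]


if __name__ == "__main__":
    preds = {"gam": [0.2, 0.8], "brt": [0.4, 0.6]}
    print(ensemble_mean(preds))
    # Weights could, for example, reflect each model's cross-validated skill.
    print(ensemble_mean(preds, weights={"gam": 1.0, "brt": 3.0}))
```

Averaging tends to dampen the idiosyncrasies of any one algorithm, which is consistent with the abstract's point about reducing potential bias.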
Abstract:
Conservation and monitoring of forest biodiversity require reliable information about forest structure and composition at multiple spatial scales. However, detailed data about forest habitat characteristics across large areas are often incomplete due to difficulties associated with field sampling methods. To overcome this limitation we employed a nationally available light detection and ranging (LiDAR) remote sensing dataset to develop variables describing forest landscape structure across a large environmental gradient in Switzerland. Using a model species indicative of structurally rich mountain forests (hazel grouse Bonasa bonasia), we tested the potential of such variables to predict species occurrence and evaluated the additional benefit of LiDAR data when used in combination with traditional, sample plot-based field variables. We calibrated boosted regression tree (BRT) models for both variable sets separately and in combination, and compared the models’ accuracies. While both field-based and LiDAR models performed well, combining the two data sources improved the accuracy of the species’ habitat model. The variables retained from the two datasets held different types of information: field variables mostly quantified food resources and cover in the field and shrub layer, whereas LiDAR variables characterized heterogeneity of vegetation structure, which correlated with field variables describing the understory and ground vegetation. When combined with data on forest vegetation composition from field surveys, LiDAR provides valuable complementary information for encompassing species niches more comprehensively. Thus, LiDAR bridges the gap between precise, locally restricted field data and coarse digital land cover information by reliably identifying habitat structure and quality across large areas.
Abstract:
In the study of forest communities, establishing the relative importance of the factors that define species composition and distribution is a challenge. Studying how tree species respond to environmental gradients is essential for understanding ecological processes and informing conservation decisions. To help elucidate the ecological processes in the main forest formations of the State of São Paulo (Lowland Dense Ombrophilous Forest, Submontane Dense Ombrophilous Forest, Seasonal Semideciduous Forest and Forested Savanna), this study aimed to answer the following questions: (I) do the floristic composition and abundance of tree species in each phytogeographic unit vary along the edaphic and topographic gradient? (II) can soil characteristics and topography influence the predictability of occurrence of widely distributed tree species in different vegetation types? (III) is there a relationship between the spatial distribution pattern of tree species and soil and topographic parameters? The study was carried out in plots located in conservation units (UCs) containing stretches representative, in terms of conservation status and size, of the four main forest formations present in the State of São Paulo. In each UC, tree individuals (circumference at breast height, CAP ≥ 15 cm), topography, soil texture and soil chemical attributes were recorded in a 10.24 ha plot subdivided into 256 subplots. Canonical correspondence analyses (CCA) were applied to establish the correspondence between species abundance and the environmental gradient (soil and topography). The modified TWINSPAN method was applied to the CCA ordination diagram to assess the influence of the environmental variables (soil and topography) on species composition. Boosted regression trees (BRT) were fitted to predict species occurrence from the soil and topography variables.
The Getis-Ord index (G) was used to determine the spatial autocorrelation of the environmental variables used in the species occurrence prediction models. In the phytogeographic units analyzed, the correspondence between the environmental gradient (soil and topography) and species abundance was significant, especially in the Forested Savanna, where the strongest relationship was observed. Soil and topography were also related to similarity in the floristic composition of the subplots, except in the Seasonal Semideciduous Forest (EEC). The main soil and topography variables related to the flora in each UC were: (1) in the Lowland Dense Ombrophilous Forest (PEIC), aluminium content in the deep layer (Al (80-100 cm)), which may reflect Al content at the surface, soil acidity (pH(H2O) (5-25 cm)) and altitude, which delimited the flooded areas; (2) in the Submontane Dense Ombrophilous Forest (PECB), altitude, a factor that, owing to the rugged relief, influences temperature and sunlight incidence in the understory; (3) in the Forested Savanna (EEA), fertility, aluminium tolerance and soil acidity. In the BRT prediction models, soil chemical variables were more important than texture, owing to the small variation of the latter attribute in the sampled areas. Among the soil chemical variables, cation exchange capacity was used to predict species occurrence in all four forest formations, being particularly important in the deepest soil layer of the Lowland Dense Ombrophilous Forest (PEIC). As for topography, altitude was included in most models and had different influences across the study areas. In general, the presence of widely distributed species showed the same trend in association with soil attributes, but with ranges of the edaphic descriptors that varied according to the study area.
The occurrence of Guapira opposita and Syagrus romanzoffiana, whose pattern varied with scale, was explained by variables with aggregated spatial patterns that together accounted for between 30% and 50% of relative importance in the BRT model. The presence of A. anthelmia, whose pattern also showed some degree of aggregation, was associated with only one variable with an aggregated pattern, altitude (21%), which may have strongly influenced the distribution of the species by delimiting flooded areas. T. guianensis was associated with environmental predictor variables with an aggregated spatial pattern that together accounted for about 70% of relative importance, which must have been sufficient to establish the aggregated pattern at all scales. However, the influence of environmental factors on a species' distribution pattern does not depend only on the species' environmental optimum, but is a result of the species-environment interaction. It was concluded that: (I) edaphic and topographic characteristics explained a small portion of the floristic composition in each phytogeographic unit, although the occurrence of some species was associated with the edaphic and topographic gradient; (II) soil and topographic characteristics made it possible to predict the presence of tree species, which showed particularities in their association with the soil of each phytophysiognomy; (III) based on descriptive associations, soil and topography influence the spatial distribution pattern of the species in proportion to their contribution to the species' presence.
Abstract:
The Highway Safety Manual (HSM) estimates roadway safety performance based on predictive models that were calibrated using national data. Calibration factors are then used to adjust these predictive models to local conditions for local applications. The HSM recommends that local calibration factors be estimated using 30 to 50 randomly selected sites that experienced at least a total of 100 crashes per year. It also recommends that the factors be updated every two to three years, preferably on an annual basis. However, these recommendations are primarily based on expert opinions rather than data-driven research findings. Furthermore, most agencies do not have data for many of the input variables recommended in the HSM. This dissertation is aimed at determining the best way to meet three major data needs affecting the estimation of calibration factors: (1) the required minimum sample sizes for different roadway facilities, (2) the required frequency for calibration factor updates, and (3) the influential variables affecting calibration factors. In this dissertation, statewide segment and intersection data were first collected for most of the HSM recommended calibration variables using a Google Maps application. In addition, eight years (2005-2012) of traffic and crash data were retrieved from existing databases from the Florida Department of Transportation. With these data, the effect of sample size criterion on calibration factor estimates was first studied using a sensitivity analysis. The results showed that the minimum sample sizes not only vary across different roadway facilities, but they are also significantly higher than those recommended in the HSM. In addition, results from paired sample t-tests showed that calibration factors in Florida need to be updated annually. 
To identify influential variables affecting the calibration factors for roadway segments, the variables were prioritized by combining the results from three different methods: negative binomial regression, random forests, and boosted regression trees. Only a few variables were found to explain most of the variation in the crash data. Traffic volume was consistently found to be the most influential. In addition, roadside object density, major and minor commercial driveway densities, and minor residential driveway density were also identified as influential variables.
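For readers unfamiliar with HSM calibration, the local calibration factor is the ratio of total observed crashes to total crashes predicted by the uncalibrated model over the sampled sites. A minimal sketch follows; the function name and the site values are hypothetical and are not data from the dissertation.

```python
# HSM-style local calibration factor (illustrative sketch; numbers invented).

def calibration_factor(observed, predicted):
    """Ratio of total observed crashes to total predicted crashes across sites."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must cover the same sites")
    total_predicted = sum(predicted)
    if total_predicted <= 0:
        raise ValueError("total predicted crashes must be positive")
    return sum(observed) / total_predicted


if __name__ == "__main__":
    observed = [3, 5, 2]          # crashes recorded at three hypothetical sites
    predicted = [2.0, 4.0, 2.0]   # uncalibrated model predictions for those sites
    print(calibration_factor(observed, predicted))  # 10 / 8 = 1.25
```

A factor above 1 indicates that the sampled sites experienced more crashes than the nationally calibrated model predicts. Because the estimate is a ratio of sums, it becomes unstable when few sites or few crashes are sampled, which is why the sample-size question studied here matters.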
Abstract:
The benefits of applying tree-based methods to the modelling of financial assets, as opposed to linear factor analysis, are increasingly being understood by market practitioners. Tree-based models such as CART (classification and regression trees) are particularly well suited to analysing stock market data, which is noisy and often contains non-linear relationships and high-order interactions. CART was originally developed in the 1980s by medical researchers disheartened by the stringent assumptions applied by traditional regression analysis (Breiman et al. [1984]). In the intervening years, CART has been successfully applied to many areas of finance such as the classification of financial distress of firms (see Frydman, Altman and Kao [1985]), asset allocation (see Sorensen, Mezrich and Miller [1996]), equity style timing (see Kao and Shumaker [1999]) and stock selection (see Sorensen, Miller and Ooi [2000])...
Abstract:
The quality of species distribution models (SDMs) relies to a large degree on the quality of the input data, from bioclimatic indices to environmental and habitat descriptors (Austin, 2002). Recent reviews of SDM techniques have sought to optimize predictive performance (e.g. Elith et al., 2006). In general, SDMs employ one of three approaches to variable selection. The simplest approach relies on the expert to select the variables, as in environmental niche models (Nix, 1986) or a generalized linear model without variable selection (Miller and Franklin, 2002). A second approach explicitly incorporates variable selection into model fitting, which allows examination of particular combinations of variables. Examples include generalized linear or additive models with variable selection (Hastie et al., 2002), or classification trees with complexity or model-based pruning (Breiman et al., 1984; Zeileis, 2008). A third approach uses model averaging to summarize the overall contribution of a variable, without considering particular combinations. Examples include neural networks, boosted or bagged regression trees and Maximum Entropy, as compared in Elith et al. (2006). Typically, users of SDMs will either consider a small number of variable sets, via the first approach, or else supply all of the candidate variables (often numbering more than a hundred) to the second or third approaches. Bayesian SDMs exist, with several methods for eliciting and encoding priors on model parameters (see review in Low Choy et al., 2010). However, few methods have been published for informative variable selection; one example is Bayesian trees (O’Leary, 2008). Here we report an elicitation protocol that helps make explicit a priori expert judgements on the quality of candidate variables. This protocol can be flexibly applied to any of the three approaches to variable selection described above, Bayesian or otherwise.
We demonstrate how this information can be obtained and then used to guide variable selection in classical or machine learning SDMs, or to define priors within Bayesian SDMs.
Abstract:
Discrete Conditional Phase-type (DC-Ph) models are a family of models which represent skewed survival data conditioned on specific inter-related discrete variables. The survival data are modeled using a Coxian phase-type distribution, which is associated with the inter-related variables using a range of possible data mining approaches such as Bayesian networks (BNs), the Naïve Bayes classification method and classification and regression trees. This paper uses the DC-Ph model to explore the modeling of patient waiting times in an Accident and Emergency Department of a UK hospital. The resulting DC-Ph model takes the form of the Coxian phase-type distribution conditioned on the outcome of a logistic regression model.
Abstract:
The roles of weather variability and sunspots in the occurrence of cyanobacteria blooms were investigated using cyanobacteria cell data collected from the Fred Haigh Dam, Queensland, Australia. A time series generalized linear model and a classification and regression tree (CART) model were used in the analysis. Data on notified cell numbers of cyanobacteria and weather variables over the periods 2001 and 2005 were provided by the Australian Department of Natural Resources and Water, and Australian Bureau of Meteorology, respectively. The results indicate that monthly minimum temperature (relative risk [RR]: 1.13; 95% confidence interval [CI]: 1.02-1.25) and rainfall (RR: 1.11; 95% CI: 1.03-1.20) had a positive association, but relative humidity (RR: 0.94; 95% CI: 0.91-0.98) and wind speed (RR: 0.90; 95% CI: 0.82-0.98) were negatively associated with the cyanobacterial numbers, after adjustment for seasonality and auto-correlation. The CART model showed that the cyanobacteria numbers were best described by an interaction between minimum temperature, relative humidity, and sunspot numbers. When minimum temperature exceeded 18 °C and relative humidity was under 66%, the number of cyanobacterial cells rose by 2.15-fold. We conclude that weather variability and sunspot activity may affect cyanobacterial blooms in dams.
Abstract:
Objectives: The objectives of this study were to specifically investigate the differences in culture, attitudes and social networks between Australian and Taiwanese men and women and identify the factors that predict midlife men and women’s quality of life in both countries. Methods: A stratified random sample strategy based on probability proportional sampling (PPS) was conducted to investigate 278 Australian and 398 Taiwanese midlife men and women’s quality of life. Multiple regression modelling and classification and regression trees (CARTs) were performed to examine the potential differences on culture, attitude, social networks, social demographic factors and religion/spirituality in midlife men and women’s quality of life in both Australia and Taiwan. Results: The results of this study suggest that culture involves multiple functions and interacts with attitudes, social networks and individual factors to influence a person’s quality of life. Significant relationships were found between the interaction between cultural circumstances and a person’s internal and external factors. The research found that good social support networks and a healthy optimistic disposition may significantly enhance midlife men and women’s quality of life. Conclusion: The study indicated that there is a significant relationship between culture, attitude, social networks and quality of life in midlife Australian and Taiwanese men and women. People who had higher levels of horizontal individualism and collectivism, positive attitudes and better social support had better psychological, social, physical and environmental health, while it emerged that vertical individualists with competitive characteristics would experience a lower quality of life. 
This study has highlighted areas where opportunities exist to further reflect upon contemporary social health policies for Australian and Taiwanese societies and also within the global perspective, in order to provide enhanced quality care for growing midlife populations.
Abstract:
Habitat models are widely used in ecology; however, there are relatively few studies of rare species, primarily because of a paucity of survey records and a lack of robust means of assessing the accuracy of modelled spatial predictions. We investigated the potential of compiled ecological data in developing habitat models for Macadamia integrifolia, a vulnerable mid-stratum tree endemic to lowland subtropical rainforests of southeast Queensland, Australia. We compared the performance of two binomial models, Classification and Regression Trees (CART) and Generalised Additive Models (GAM), with Maximum Entropy (MAXENT) models developed from (i) presence records and available absence data and (ii) presence records and background data. The GAM model was the best performer across the range of evaluation measures employed; however, all models were assessed as potentially useful for informing in situ conservation of M. integrifolia. A significant loss in the amount of M. integrifolia habitat has occurred (p < 0.05), with only 37% of former habitat (pre-clearing) remaining in 2003. Remnant patches are significantly smaller, have larger edge-to-area ratios and are more isolated from each other compared to pre-clearing configurations (p < 0.05). Whilst the network of suitable habitat patches is still largely intact, there are numerous smaller patches that are more isolated in the contemporary landscape compared with their connectedness before clearing. These results suggest that in situ conservation of M. integrifolia may be best achieved through a landscape approach that considers the relative contribution of small remnant habitat fragments to the species as a whole, as well as facilitating connectivity among the entire network of habitat patches.