956 results for random forest regression
Abstract:
Hospitals attached to the Spanish Ministry of Health currently use the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9-CM) to classify health discharge records. At present, this work is done manually by experts. This paper tackles the automatic classification of real Discharge Records in Spanish following the ICD9-CM standard. The challenge is that the Discharge Records are written in spontaneous language. We explore several machine learning techniques to deal with the classification problem. Random Forest proved the most competitive, achieving an F-measure of 0.876.
Abstract:
Thesis (Master, Computing) -- Queen's University, 2016-05-29 18:11:34.114
Abstract:
Species distribution models (SDM) predict species occurrence based on statistical relationships with environmental conditions. The R package biomod2, which includes 10 different SDM techniques and 10 different evaluation methods, was used in this study. Macroalgae are the main biomass producers in Potter Cove, King George Island (Isla 25 de Mayo), Antarctica, and they are sensitive to climate change factors such as suspended particulate matter (SPM). Macroalgae presence and absence data were used to test the suitability of SDMs and, simultaneously, to assess the environmental response of macroalgae as well as to model four scenarios of distribution shifts by varying SPM conditions due to climate change. According to the averaged evaluation scores of Relative Operating Characteristics (ROC) and True Skill Statistic (TSS) by models, the methods based on a multitude of decision trees, such as Random Forest and Classification Tree Analysis, reached the highest predictive power, followed by generalized boosted models (GBM) and maximum-entropy approaches (Maxent). The final ensemble model used 135 of 200 calculated models (TSS > 0.7) and identified hard substrate and SPM as the most influential parameters, followed by distance to glacier, total organic carbon (TOC), bathymetry and slope. The climate change scenarios show an invasive reaction of the macroalgae in the case of lower SPM and a retreat of the macroalgae in the case of higher assumed SPM values.
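As an illustration of the selection rule mentioned in the abstract (models kept when TSS > 0.7), the sketch below computes the True Skill Statistic in its usual form, sensitivity + specificity - 1, from confusion-matrix counts and filters candidate models by that threshold. The model names and counts are hypothetical, and this is plain Python rather than the biomod2 API, so it should be read as a minimal sketch and not as the authors' procedure.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, ranging from -1 to 1."""
    sensitivity = tp / (tp + fn)  # share of presences predicted as presences
    specificity = tn / (tn + fp)  # share of absences predicted as absences
    return sensitivity + specificity - 1.0


def select_for_ensemble(scores, threshold=0.7):
    """Keep the names of models whose TSS is strictly above the threshold."""
    return [name for name, tss in scores.items() if tss > threshold]


# Hypothetical confusion-matrix counts for three candidate models.
example_scores = {
    "RF_run1": true_skill_statistic(tp=46, fn=4, tn=42, fp=8),
    "GBM_run1": true_skill_statistic(tp=40, fn=10, tn=38, fp=12),
    "MAXENT_run1": true_skill_statistic(tp=35, fn=15, tn=30, fp=20),
}

selected = select_for_ensemble(example_scores)
print(selected)
```

With these made-up counts only the first model clears the threshold; in the study itself, 135 of 200 models did.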
Abstract:
Paleoceanographic archives derived from 17 marine sediment cores reconstruct the response of the Southwest Pacific Ocean to the peak interglacial, Marine Isotope Stage (MIS) 5e (ca. 125 ka). Paleo-Sea Surface Temperature (SST) estimates were obtained from the Random Forest model (an ensemble decision tree tool) applied to core-top planktonic foraminiferal faunas calibrated to modern SSTs. The reconstructed geographic pattern of the SST anomaly (maximum SST between 120 and 132 ka minus mean modern SST) suggests that MIS 5e conditions were generally warmer in the Southwest Pacific, especially in the western Tasman Sea where a strengthened East Australian Current (EAC) likely extended subtropical influence to ca. 45°S off Tasmania. In contrast, the eastern Tasman Sea may have had a modest cooling except around 45°S. The observed pattern resembles that developing under the present warming trend in the region. An increase in wind stress curl over the modern South Pacific is hypothesized to have spun up the South Pacific Subtropical Gyre, with a concurrent increase in subtropical flow in the western boundary currents that include the EAC. However, warmer temperatures along the Subtropical Front and Campbell Plateau to the south suggest that the relative influence of the boundary inflows to eastern New Zealand may have differed in MIS 5e, and these currents may have followed different paths compared to today.
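The SST anomaly defined in the abstract (maximum SST between 120 and 132 ka minus mean modern SST) can be written out as a short calculation. The function name, the inclusive handling of the window boundaries, and the sample values below are assumptions made for illustration; they are not taken from the study's data.

```python
def mis5e_sst_anomaly(ages_ka, sst_c, modern_mean_sst_c, window=(120.0, 132.0)):
    """Maximum reconstructed SST inside the age window minus mean modern SST.

    ages_ka and sst_c are parallel sequences for one core; the window
    boundaries are treated as inclusive (an assumption).
    """
    lo, hi = window
    in_window = [t for age, t in zip(ages_ka, sst_c) if lo <= age <= hi]
    if not in_window:
        raise ValueError("no SST estimates fall inside the age window")
    return max(in_window) - modern_mean_sst_c


# Hypothetical downcore record for a single site (ages in ka, SST in deg C).
ages = [115.0, 120.0, 125.0, 130.0, 135.0]
ssts = [14.0, 15.5, 16.5, 16.0, 13.0]

anomaly = mis5e_sst_anomaly(ages, ssts, modern_mean_sst_c=15.0)
print(anomaly)  # positive values mean MIS 5e was warmer than today at this site
```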
Abstract:
Loading of the femoral neck (FN) is dominated by bending and compressive stresses. We hypothesize that adaptation of the FN to physical activity would be manifested in the cross-sectional area (CSA) and section modulus (Z) of bone, indices of axial and bending strength, respectively. We investigated the influence of physical activity on bone strength during adolescence using 7 years of longitudinal data from 109 boys and 121 girls from the Saskatchewan Paediatric Bone and Mineral Accrual Study (PBMAS). Physical activity data (PAC-Q physical activity inventory) and anthropometric measurements were taken every 6 months and DXA bone scans were measured annually (Hologic QDR2000, array mode). We applied hip structural analysis to derive strength and geometric indices of the femoral neck using DXA scans. To control for maturation, we determined a biological maturity age defined as years from age at peak height velocity (APHV). To account for the repeated measures within individuals that characterize longitudinal data, multilevel random effects regression analyses were used to analyze the data. When biological maturity age and body size (height and weight) were controlled, in both boys and girls, physical activity was a significant positive independent predictor of CSA and Z of the narrow region of the femoral neck (P < 0.05). There was no independent effect of physical activity on the subperiosteal width of the femoral neck. When leg length and leg lean mass were introduced into the random effects models to control for size and muscle mass of the leg (instead of height and weight), all significant effects of physical activity disappeared. Even among adolescents engaged in normal levels of physical activity, the statistically significant relationship between physical activity and indices of bone strength demonstrates that modifiable lifestyle factors like exercise play an important role in optimizing bone strength during the growing years.
Physical activity differences were explained by the interdependence between activity and lean mass considerations. Physical activity is important for optimal development of bone strength. (c) 2005 Elsevier Inc. All rights reserved.
Abstract:
The aim of this study is to accurately distinguish Parkinson's disease (PD) participants from healthy controls using self-administered tests of gait and postural sway. Using consumer-grade smartphones with in-built accelerometers, we objectively measure and quantify key movement severity symptoms of Parkinson's disease. Specifically, we record tri-axial accelerations, and extract a range of different features based on the time and frequency-domain properties of the acceleration time series. The features quantify key characteristics of the acceleration time series, and enhance the underlying differences in the gait and postural sway accelerations between PD participants and controls. Using a random forest classifier, we demonstrate an average sensitivity of 98.5% and average specificity of 97.5% in discriminating PD participants from controls. © 2014 IEEE.
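The sensitivity and specificity figures reported above follow the standard definitions: the fraction of PD participants correctly identified and the fraction of controls correctly identified. The sketch below computes both from binary labels; the label encoding (1 for PD, 0 for control) and the sample predictions are hypothetical and do not reproduce the study's classifier or data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = PD, 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # PD correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # PD missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical ground truth and classifier output for eight participants.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

Averaging these two quantities over cross-validation folds is one common way to arrive at "average" values of the kind quoted in the abstract, though the abstract does not state which averaging scheme was used.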
Abstract:
Large-extent vegetation datasets that co-occur with long-term hydrology data provide new ways to develop biologically meaningful hydrologic variables and to determine plant community responses to hydrology. We analyzed the suitability of different hydrological variables to predict vegetation in two water conservation areas (WCAs) in the Florida Everglades, USA, and developed metrics to define realized hydrologic optima and tolerances. Using vegetation data spatially co-located with long-term hydrological records, we evaluated seven variables describing water depth, hydroperiod length, and number of wet/dry events; each variable was tested for 2-, 4- and 10-year intervals for Julian annual averages and environmentally-defined hydrologic intervals. Maximum length and maximum water depth during the wet period calculated for environmentally-defined hydrologic intervals over a 4-year period were the best predictors of vegetation type. Proportional abundance of vegetation types along hydrological gradients indicated that communities had different realized optima and tolerances across WCAs. Although in both WCAs, the trees/shrubs class was on the drier/shallower end of hydrological gradients, while slough communities occupied the wetter/deeper end, the distribution of Cladium, Typha, wet prairie and Salix communities, which were intermediate for most hydrological variables, varied in proportional abundance along hydrologic gradients between WCAs, indicating that realized optima and tolerances are context-dependent.
Abstract:
Biodiversity citizen science projects are growing in number, size, and scope, and are gaining recognition as valuable data sources that build public engagement. Yet publication rates indicate that citizen science is still infrequently used as a primary tool for conservation research and the causes of this apparent disconnect have not been quantitatively evaluated. To uncover the barriers to the use of citizen science as a research tool, we surveyed professional biodiversity scientists (n = 423) and citizen science project managers (n = 125). We conducted three analyses using non-parametric recursive modeling (random forest), using questions that addressed: scientists' perceptions and preferences regarding citizen science, scientists' requirements for their own data, and the actual practices of citizen science projects. For all three analyses we identified the most important factors that influence the probability of publication using citizen science data. Four general barriers emerged: a narrow awareness among scientists of citizen science projects that match their needs; the fact that not all biodiversity science is well-suited for citizen science; inconsistency in data quality across citizen science projects; and bias among scientists for certain data sources (institutions and ages/education levels of data collectors). Notably, we find limited evidence to suggest a relationship between citizen science projects that satisfy scientists' biases and data quality or probability of publication. These results illuminate the need for greater visibility of citizen science practices with respect to the requirements of biodiversity science and show that addressing bias among scientists could improve application of citizen science in conservation.
Abstract:
Effective conservation and management of top predators requires a comprehensive understanding of their distributions and of the underlying biological and physical processes that affect these distributions. The Mid-Atlantic Bight shelf break system is a dynamic and productive region where at least 32 species of cetaceans have been recorded through various systematic and opportunistic marine mammal surveys from the 1970s through 2012. My dissertation characterizes the spatial distribution and habitat of cetaceans in the Mid-Atlantic Bight shelf break system by utilizing marine mammal line-transect survey data, synoptic multi-frequency active acoustic data, and fine-scale hydrographic data collected during the 2011 summer Atlantic Marine Assessment Program for Protected Species (AMAPPS) survey. Although studies describing cetacean habitat and distributions have been previously conducted in the Mid-Atlantic Bight, my research specifically focuses on the shelf break region to elucidate both the physical and biological processes that influence cetacean distribution patterns within this cetacean hotspot.
In Chapter One I review biologically important areas for cetaceans in the Atlantic waters of the United States. I describe the study area, the shelf break region of the Mid-Atlantic Bight, in terms of the general oceanography, productivity and biodiversity. According to recent habitat-based cetacean density models, the shelf break region is an area of high cetacean abundance and density, yet little research is directed at understanding the mechanisms that establish this region as a cetacean hotspot.
In Chapter Two I present the basic physical principles of sound in water and describe the methodology used to categorize opportunistically collected multi-frequency active acoustic data using frequency response techniques. Frequency response classification methods are usually employed in conjunction with net-tow data, but the logistics of the 2011 AMAPPS survey did not allow for appropriate net-tow data to be collected. Biologically meaningful information can be extracted from acoustic scattering regions by comparing the frequency response curves of acoustic regions to theoretical curves of known scattering models. Using the five frequencies on the EK60 system (18, 38, 70, 120, and 200 kHz), three categories of scatterers were defined: fish-like (with swim bladder), nekton-like (e.g., euphausiids), and plankton-like (e.g., copepods). I also employed a multi-frequency acoustic categorization method using three frequencies (18, 38, and 120 kHz) that has been used in the Gulf of Maine and Georges Bank and that is based on the presence or absence of volume backscatter above a threshold. This method is more objective than the comparison of frequency response curves because it uses an established backscatter value for the threshold. By removing all data below the threshold, only strong scattering information is retained.
In Chapter Three I analyze the distribution of the categorized acoustic regions of interest during the daytime cross-shelf transects. Over all transects, plankton-like acoustic regions of interest were detected most frequently, followed by fish-like acoustic regions and then nekton-like acoustic regions. Plankton-like detections per kilometer were the only acoustic detections that differed significantly, although nekton-like detections fell just short of significance. The threshold categorization method of Jech and Michaels (2006) provides a more conservative and discrete detection of acoustic scatterers and allows me to retrieve backscatter values along transects in areas that have been categorized. This provides continuous data values that can be integrated at discrete spatial increments for wavelet analysis. Wavelet analysis indicates that significant spatial scales of interest for fish-like and nekton-like acoustic backscatter range from one to four kilometers and vary among transects.
In Chapter Four I analyze the fine scale distribution of cetaceans in the shelf break system of the Mid-Atlantic Bight using corrected sightings per trackline region, classification trees, multidimensional scaling, and random forest analysis. I describe habitat for common dolphins, Risso’s dolphins and sperm whales. From the distribution of cetacean sightings, patterns of habitat start to emerge: within the shelf break region of the Mid-Atlantic Bight, common dolphins were sighted more prevalently over the shelf while sperm whales were more frequently found in the deep waters offshore and Risso’s dolphins were most prevalent at the shelf break. Multidimensional scaling presents clear environmental separation among common dolphins and Risso’s dolphins and sperm whales. The sperm whale random forest habitat model had the lowest misclassification error (0.30) and the Risso’s dolphin random forest habitat model had the greatest misclassification error (0.37). Shallow water depth (less than 148 meters) was the primary variable selected in the classification model for common dolphin habitat. Distance to surface density fronts and surface temperature fronts were the primary variables selected in the classification models to describe Risso’s dolphin habitat and sperm whale habitat respectively. When mapped back into geographic space, these three cetacean species occupy different fine-scale habitats within the dynamic Mid-Atlantic Bight shelf break system.
In Chapter Five I summarize the previous chapters and present potential analytical steps to address ecological questions pertaining to the dynamic shelf break region. Taken together, the results of my dissertation demonstrate the use of opportunistically collected data in ecosystem studies; emphasize the need to incorporate middle trophic level data and oceanographic features into cetacean habitat models; and emphasize the importance of developing a more mechanistic understanding of dynamic ecosystems.
Abstract:
Over the past few years, logging has evolved from simple printf statements to more complex and widely used logging libraries. Today logging information is used to support various development activities such as fixing bugs, analyzing the results of load tests, monitoring performance and transferring knowledge. Recent research has examined how to improve logging practices by informing developers what to log and where to log. Furthermore, the strong dependence on logging has led to the development of logging libraries that have reduced the intricacies of logging, which has resulted in an abundance of log information. Two recent challenges have emerged as modern software systems start to treat logging as a core aspect of their software. In particular, 1) infrastructural challenges have emerged due to the plethora of logging libraries available today and 2) processing challenges have emerged due to the large number of log processing tools that ingest logs and produce useful information from them. In this thesis, we explore these two challenges. We first explore the infrastructural challenges that arise due to the plethora of logging libraries available today. As systems evolve, their logging infrastructure has to evolve (commonly this is done by migrating to new logging libraries). We explore logging library migrations within Apache Software Foundation (ASF) projects. We find that close to 14% of the projects within the ASF migrate their logging libraries at least once. For processing challenges, we explore the different factors which can affect the likelihood of a logging statement changing in the future in four open source systems, namely ActiveMQ, Camel, Cloudstack and Liferay. Such changes are likely to negatively impact the log processing tools that must be updated to accommodate such changes. We find that 20%-45% of the logging statements within the four systems are changed at least once.
We construct random forest classifiers and Cox models to determine the likelihood of both just-introduced and long-lived logging statements changing in the future. We find that file ownership, developer experience, log density and SLOC are important factors in determining the stability of logging statements.
Abstract:
Background: Potentially inappropriate prescribing (PIP) is common in older people in primary care and can result in increased morbidity, adverse drug events and hospitalisations. We previously demonstrated the success of a multifaceted intervention in decreasing PIP in primary care in a cluster randomised controlled trial (RCT).
Objective: We sought to determine whether the improvement in PIP in the short term was sustained at 1-year follow-up.
Methods: A cluster RCT was conducted with 21 GP practices and 196 patients (aged ≥70) with PIP in Irish primary care. Intervention participants received a complex multifaceted intervention incorporating academic detailing, medicine review with web-based pharmaceutical treatment algorithms that provide recommended alternative treatment options, and tailored patient information leaflets. Control practices delivered usual care and received simple, patient-level PIP feedback. Primary outcomes were the proportion of patients with PIP and the mean number of potentially inappropriate prescriptions at 1-year follow-up. Intention-to-treat analysis using random effects regression was used.
Results: All 21 GP practices and 186 (95 %) patients were followed up. We found that at 1-year follow-up, the significant reduction in the odds of PIP exposure achieved during the intervention was sustained after its discontinuation (adjusted OR 0.28, 95 % CI 0.11 to 0.76, P = 0.01). Intervention participants had significantly lower odds of having a potentially inappropriate proton pump inhibitor compared to controls (adjusted OR 0.40, 95 % CI 0.17 to 0.94, P = 0.04).
Conclusion: The significant reduction in the odds of PIP achieved during the intervention was sustained after its discontinuation. These results indicate that improvements in prescribing quality can be maintained over time.
Abstract:
The application of custom classification techniques and posterior probability modeling (PPM) using Worldview-2 multispectral imagery to archaeological field survey is presented in this paper. Research is focused on the identification of Neolithic felsite stone tool workshops in the North Mavine region of the Shetland Islands in Northern Scotland. Sample data from known workshops surveyed using differential GPS are used alongside known non-sites to train a linear discriminant analysis (LDA) classifier based on a combination of datasets including Worldview-2 bands, band difference ratios (BDR) and topographical derivatives. Principal components analysis is further used to test and reduce dimensionality caused by redundant datasets. Probability models were generated by LDA using principal components and tested with sites identified through geological field survey. Testing shows the prospective ability of this technique and significance between 0.05 and 0.01, and gain statistics between 0.90 and 0.94, higher than those obtained using maximum likelihood and random forest classifiers. Results suggest that this approach is best suited to relatively homogenous site types, and performs better with correlated data sources. Finally, by combining posterior probability models and least-cost analysis, a survey least-cost efficacy model is generated showing the utility of such approaches to archaeological field survey.
Abstract:
Facing widespread poverty and land degradation, Vietnam started a land reform in 1993 as part of its renovation policy package known as “Doi Moi”. This paper examines the impacts of improved land tenure security, via this land reform, on manure use by farm households. As manure potentially improves soil fertility by adding organic matter and nutrients to the soil surface, it might contribute to improving soil productive capacity and reversing land degradation. Random effect regression models are applied to a panel dataset of 133 farm households in the Northern Uplands of Vietnam collected in 1993, 1998, and 2006. The results confirm that land tenure security has positive effects on manure use, but the levels of influence differ depending on whether the land has been privatized or whether the land title has already been issued. In addition, manure use is also influenced by the number of cattle and pigs, the education level and ethnicity of household heads, farm land size and non-farm income. The findings suggest that speeding up land privatization and titling, encouraging cattle and pig rearing, and improving education would promote manure use in farm production. However, careful interpretation of our research findings is required as land privatization, together with economic growth and population pressure, might lead to overuse of farm inputs.