941 results for random forest


Relevance:

60.00% 60.00%

Publisher:

Abstract:

Hospitals attached to the Spanish Ministry of Health currently use the International Classification of Diseases 9 Clinical Modification (ICD9-CM) to classify health discharge records. At present, this work is done manually by experts. This paper tackles the automatic classification of real Discharge Records in Spanish following the ICD9-CM standard. The challenge is that the Discharge Records are written in spontaneous language. We explore several machine learning techniques to deal with the classification problem. Random Forest proved the most competitive, achieving an F-measure of 0.876.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Thesis (Master, Computing) -- Queen's University, 2016-05-29 18:11:34.114

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Species distribution models (SDM) predict species occurrence based on statistical relationships with environmental conditions. The R package biomod2, which includes 10 different SDM techniques and 10 different evaluation methods, was used in this study. Macroalgae are the main biomass producers in Potter Cove, King George Island (Isla 25 de Mayo), Antarctica, and they are sensitive to climate change factors such as suspended particulate matter (SPM). Macroalgae presence and absence data were used to test the suitability of SDMs and, simultaneously, to assess the environmental response of macroalgae as well as to model four scenarios of distribution shifts by varying SPM conditions due to climate change. According to the evaluation scores of Relative Operating Characteristics (ROC) and True Skill Statistic (TSS) averaged across models, methods based on a multitude of decision trees, such as Random Forest and Classification Tree Analysis, reached the highest predictive power, followed by generalized boosted models (GBM) and maximum-entropy approaches (Maxent). The final ensemble model used 135 of 200 calculated models (TSS > 0.7) and identified hard substrate and SPM as the most influential parameters, followed by distance to glacier, total organic carbon (TOC), bathymetry and slope. The climate change scenarios show an invasive reaction of the macroalgae in case of less SPM and a retreat of the macroalgae in case of higher assumed SPM values.
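
The TSS-based model selection described in this abstract can be illustrated with a minimal Python sketch. It assumes the standard definition TSS = sensitivity + specificity - 1; the model names and confusion-matrix counts are hypothetical and are not taken from the study, and the functions are not part of biomod2.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """Return TSS = sensitivity + specificity - 1 (ranges from -1 to 1)."""
    sensitivity = tp / (tp + fn)  # share of presences predicted correctly
    specificity = tn / (tn + fp)  # share of absences predicted correctly
    return sensitivity + specificity - 1.0


def select_models(scores, threshold=0.7):
    """Keep the names of models whose TSS strictly exceeds the threshold."""
    return [name for name, tss in scores.items() if tss > threshold]


# Hypothetical presence/absence confusion counts, for illustration only.
example_counts = {
    "RF_run1": dict(tp=46, fn=4, tn=42, fp=8),
    "GBM_run1": dict(tp=40, fn=10, tn=38, fp=12),
    "MAXENT_run1": dict(tp=35, fn=15, tn=30, fp=20),
}

scores = {name: true_skill_statistic(**c) for name, c in example_counts.items()}
kept = select_models(scores)  # models that would enter the ensemble
print(scores, kept)
```

Only models above the cutoff would contribute to an ensemble in this sketch; how the study weighted or combined the 135 retained models is not stated in the abstract.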

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Paleoceanographic archives derived from 17 marine sediment cores reconstruct the response of the Southwest Pacific Ocean to the peak interglacial, Marine Isotope Stage (MIS) 5e (ca. 125 ka). Paleo-Sea Surface Temperature (SST) estimates were obtained from the Random Forest model (an ensemble decision tree tool) applied to core-top planktonic foraminiferal faunas calibrated to modern SSTs. The reconstructed geographic pattern of the SST anomaly (maximum SST between 120 and 132 ka minus mean modern SST) seems to indicate that MIS 5e conditions were generally warmer in the Southwest Pacific, especially in the western Tasman Sea where a strengthened East Australian Current (EAC) likely extended subtropical influence to ca. 45°S off Tasmania. In contrast, the eastern Tasman Sea may have had a modest cooling except around 45°S. The observed pattern resembles that developing under the present warming trend in the region. An increase in wind stress curl over the modern South Pacific is hypothesized to have spun up the South Pacific Subtropical Gyre, with concurrent increase in subtropical flow in the western boundary currents that include the EAC. However, warmer temperatures along the Subtropical Front and Campbell Plateau to the south suggest that the relative influence of the boundary inflows to eastern New Zealand may have differed in MIS 5e, and these currents may have followed different paths compared to today.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

The aim of this study is to accurately distinguish Parkinson's disease (PD) participants from healthy controls using self-administered tests of gait and postural sway. Using consumer-grade smartphones with in-built accelerometers, we objectively measure and quantify key movement severity symptoms of Parkinson's disease. Specifically, we record tri-axial accelerations, and extract a range of different features based on the time and frequency-domain properties of the acceleration time series. The features quantify key characteristics of the acceleration time series, and enhance the underlying differences in the gait and postural sway accelerations between PD participants and controls. Using a random forest classifier, we demonstrate an average sensitivity of 98.5% and average specificity of 97.5% in discriminating PD participants from controls. © 2014 IEEE.
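
As a rough illustration of the kind of time- and frequency-domain features mentioned in this abstract, the Python sketch below computes a few summary statistics and the dominant frequency of a single acceleration axis. The specific features, the 50 Hz sample rate, and the synthetic signal are assumptions made for the example; the abstract does not list the features actually used.

```python
import cmath
import math


def time_domain_features(signal):
    """Return mean, standard deviation and RMS of one acceleration axis."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((s - mean) ** 2 for s in signal) / n
    rms = math.sqrt(sum(s * s for s in signal) / n)
    return {"mean": mean, "std": math.sqrt(variance), "rms": rms}


def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC component.

    Uses a naive discrete Fourier transform, which is fine for short windows.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC offset (e.g. gravity)
    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * i / n)
            for i, c in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


# Synthetic stand-in for one axis: a 2 Hz oscillation sampled at 50 Hz for 2 s.
SAMPLE_RATE_HZ = 50
vertical_accel = [
    math.sin(2 * math.pi * 2 * i / SAMPLE_RATE_HZ) for i in range(100)
]

features = time_domain_features(vertical_accel)
peak_hz = dominant_frequency(vertical_accel, SAMPLE_RATE_HZ)
print(features, peak_hz)
```

In a pipeline of this kind, such per-axis values would be collected into one feature vector per recording and passed to a classifier such as a random forest.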

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Allergy is an overreaction by the immune system to a previously encountered, ordinarily harmless substance - typically proteins - resulting in skin rash, swelling of mucous membranes, sneezing or wheezing, or other abnormal conditions. The use of modified proteins is increasingly widespread: their presence in food, commercial products, such as washing powder, and medical therapeutics and diagnostics, makes predicting and identifying potential allergens a crucial societal issue. The prediction of allergens has been explored widely using bioinformatics, with many tools being developed in the last decade; many of these are freely available online. Here, we report a set of novel models for allergen prediction utilizing amino acid E-descriptors, auto- and cross-covariance transformation, and several machine learning methods for classification, including logistic regression (LR), decision tree (DT), naïve Bayes (NB), random forest (RF), multilayer perceptron (MLP) and k nearest neighbours (kNN). The best performing method was kNN with 85.3% accuracy at 5-fold cross-validation. The resulting model has been implemented in a revised version of the AllerTOP server (http://www.ddg-pharmfac.net/AllerTOP). © Springer-Verlag 2014.
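
The auto- and cross-covariance (ACC) transformation mentioned in this abstract turns proteins of different lengths into fixed-length feature vectors. The Python sketch below uses a common formulation, ACC_jk(lag) = sum over i of E_j,i * E_k,i+lag, divided by (n - lag); the normalization, the lag range, and the toy two-descriptor values are assumptions for illustration and are not the published E-descriptor values.

```python
def acc_transform(descriptors, max_lag):
    """Auto- and cross-covariance transform of a descriptor sequence.

    descriptors: list of n per-residue vectors, each with d descriptor values.
    Returns a flat list of d * d * max_lag values, independent of n.
    """
    n = len(descriptors)
    d = len(descriptors[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):      # descriptor at position i
            for k in range(d):  # descriptor at position i + lag
                total = sum(
                    descriptors[i][j] * descriptors[i + lag][k]
                    for i in range(n - lag)
                )
                features.append(total / (n - lag))
    return features


# Toy sequences with two hypothetical descriptors per residue.
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.5, -0.5], [1.0, 0.2], [-0.3, 0.8], [0.0, 0.1], [0.7, -0.9], [0.4, 0.4]]

features_a = acc_transform(seq_a, max_lag=2)
features_b = acc_transform(seq_b, max_lag=2)
print(len(features_a), len(features_b))
```

Both sequences yield vectors of the same length even though they differ in residue count, which is what lets classifiers such as kNN or random forest operate on them.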

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Large-extent vegetation datasets that co-occur with long-term hydrology data provide new ways to develop biologically meaningful hydrologic variables and to determine plant community responses to hydrology. We analyzed the suitability of different hydrological variables to predict vegetation in two water conservation areas (WCAs) in the Florida Everglades, USA, and developed metrics to define realized hydrologic optima and tolerances. Using vegetation data spatially co-located with long-term hydrological records, we evaluated seven variables describing water depth, hydroperiod length, and number of wet/dry events; each variable was tested for 2-, 4- and 10-year intervals for Julian annual averages and environmentally-defined hydrologic intervals. Maximum length and maximum water depth during the wet period calculated for environmentally-defined hydrologic intervals over a 4-year period were the best predictors of vegetation type. Proportional abundance of vegetation types along hydrological gradients indicated that communities had different realized optima and tolerances across WCAs. Although in both WCAs, the trees/shrubs class was on the drier/shallower end of hydrological gradients, while slough communities occupied the wetter/deeper end, the distribution of Cladium, Typha, wet prairie and Salix communities, which were intermediate for most hydrological variables, varied in proportional abundance along hydrologic gradients between WCAs, indicating that realized optima and tolerances are context-dependent.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Biodiversity citizen science projects are growing in number, size, and scope, and are gaining recognition as valuable data sources that build public engagement. Yet publication rates indicate that citizen science is still infrequently used as a primary tool for conservation research and the causes of this apparent disconnect have not been quantitatively evaluated. To uncover the barriers to the use of citizen science as a research tool, we surveyed professional biodiversity scientists (n = 423) and citizen science project managers (n = 125). We conducted three analyses using non-parametric recursive modeling (random forest), using questions that addressed: scientists' perceptions and preferences regarding citizen science, scientists' requirements for their own data, and the actual practices of citizen science projects. For all three analyses we identified the most important factors that influence the probability of publication using citizen science data. Four general barriers emerged: a narrow awareness among scientists of citizen science projects that match their needs; the fact that not all biodiversity science is well-suited for citizen science; inconsistency in data quality across citizen science projects; and bias among scientists for certain data sources (institutions and ages/education levels of data collectors). Notably, we find limited evidence to suggest a relationship between citizen science projects that satisfy scientists' biases and data quality or probability of publication. These results illuminate the need for greater visibility of citizen science practices with respect to the requirements of biodiversity science and show that addressing bias among scientists could improve application of citizen science in conservation.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Effective conservation and management of top predators requires a comprehensive understanding of their distributions and of the underlying biological and physical processes that affect these distributions. The Mid-Atlantic Bight shelf break system is a dynamic and productive region where at least 32 species of cetaceans have been recorded through various systematic and opportunistic marine mammal surveys from the 1970s through 2012. My dissertation characterizes the spatial distribution and habitat of cetaceans in the Mid-Atlantic Bight shelf break system by utilizing marine mammal line-transect survey data, synoptic multi-frequency active acoustic data, and fine-scale hydrographic data collected during the 2011 summer Atlantic Marine Assessment Program for Protected Species (AMAPPS) survey. Although studies describing cetacean habitat and distributions have been previously conducted in the Mid-Atlantic Bight, my research specifically focuses on the shelf break region to elucidate both the physical and biological processes that influence cetacean distribution patterns within this cetacean hotspot.

In Chapter One I review biologically important areas for cetaceans in the Atlantic waters of the United States. I describe the study area, the shelf break region of the Mid-Atlantic Bight, in terms of the general oceanography, productivity and biodiversity. According to recent habitat-based cetacean density models, the shelf break region is an area of high cetacean abundance and density, yet little research is directed at understanding the mechanisms that establish this region as a cetacean hotspot.

In Chapter Two I present the basic physical principles of sound in water and describe the methodology used to categorize opportunistically collected multi-frequency active acoustic data using frequency response techniques. Frequency response classification methods are usually employed in conjunction with net-tow data, but the logistics of the 2011 AMAPPS survey did not allow for appropriate net-tow data to be collected. Biologically meaningful information can be extracted from acoustic scattering regions by comparing the frequency response curves of acoustic regions to theoretical curves of known scattering models. Using the five frequencies on the EK60 system (18, 38, 70, 120, and 200 kHz), three categories of scatterers were defined: fish-like (with swim bladder), nekton-like (e.g., euphausiids), and plankton-like (e.g., copepods). I also employed a multi-frequency acoustic categorization method using three frequencies (18, 38, and 120 kHz) that has been used in the Gulf of Maine and Georges Bank and is based on the presence or absence of volume backscatter above a threshold. This method is more objective than the comparison of frequency response curves because it uses an established backscatter value for the threshold. By removing all data below the threshold, only strong scattering information is retained.

In Chapter Three I analyze the distribution of the categorized acoustic regions of interest during the daytime cross-shelf transects. Over all transects, plankton-like acoustic regions of interest were detected most frequently, followed by fish-like acoustic regions and then nekton-like acoustic regions. Plankton-like detections were the only significantly different acoustic detections per kilometer, although nekton-like detections fell just short of significance. Using the threshold categorization method by Jech and Michaels (2006) provides a more conservative and discrete detection of acoustic scatterers and allows me to retrieve backscatter values along transects in areas that have been categorized. This provides continuous data values that can be integrated at discrete spatial increments for wavelet analysis. Wavelet analysis indicates that significant spatial scales of interest for fish-like and nekton-like acoustic backscatter range from one to four kilometers and vary among transects.

In Chapter Four I analyze the fine-scale distribution of cetaceans in the shelf break system of the Mid-Atlantic Bight using corrected sightings per trackline region, classification trees, multidimensional scaling, and random forest analysis. I describe habitat for common dolphins, Risso’s dolphins and sperm whales. From the distribution of cetacean sightings, patterns of habitat start to emerge: within the shelf break region of the Mid-Atlantic Bight, common dolphins were sighted more often over the shelf, sperm whales were more frequently found in the deep waters offshore, and Risso’s dolphins were most prevalent at the shelf break. Multidimensional scaling presents clear environmental separation among common dolphins, Risso’s dolphins and sperm whales. The sperm whale random forest habitat model had the lowest misclassification error (0.30) and the Risso’s dolphin random forest habitat model had the greatest misclassification error (0.37). Shallow water depth (less than 148 meters) was the primary variable selected in the classification model for common dolphin habitat. Distance to surface density fronts and surface temperature fronts were the primary variables selected in the classification models to describe Risso’s dolphin habitat and sperm whale habitat respectively. When mapped back into geographic space, these three cetacean species occupy different fine-scale habitats within the dynamic Mid-Atlantic Bight shelf break system.

In Chapter Five I summarize the previous chapters and present potential analytical steps to address ecological questions pertaining to the dynamic shelf break region. Taken together, the results of my dissertation demonstrate the use of opportunistically collected data in ecosystem studies; emphasize the need to incorporate middle trophic level data and oceanographic features into cetacean habitat models; and emphasize the importance of developing a more mechanistic understanding of dynamic ecosystems.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Over the past few years, logging has evolved from simple printf statements to more complex and widely used logging libraries. Today logging information is used to support various development activities such as fixing bugs, analyzing the results of load tests, monitoring performance and transferring knowledge. Recent research has examined how to improve logging practices by informing developers what to log and where to log. Furthermore, the strong dependence on logging has led to the development of logging libraries that have reduced the intricacies of logging, which has resulted in an abundance of log information. Two recent challenges have emerged as modern software systems start to treat logging as a core aspect of their software. In particular, 1) infrastructural challenges have emerged due to the plethora of logging libraries available today and 2) processing challenges have emerged due to the large number of log processing tools that ingest logs and produce useful information from them. In this thesis, we explore these two challenges. We first explore the infrastructural challenges that arise due to the plethora of logging libraries available today. As systems evolve, their logging infrastructure has to evolve (commonly this is done by migrating to new logging libraries). We explore logging library migrations within Apache Software Foundation (ASF) projects. We find that close to 14% of the projects within the ASF migrate their logging libraries at least once. For processing challenges, we explore the different factors which can affect the likelihood of a logging statement changing in the future in four open source systems, namely ActiveMQ, Camel, Cloudstack and Liferay. Such changes are likely to negatively impact the log processing tools, which must be updated to accommodate them. We find that 20%-45% of the logging statements within the four systems are changed at least once. We construct random forest classifiers and Cox models to determine the likelihood of both just-introduced and long-lived logging statements changing in the future. We find that file ownership, developer experience, log density and SLOC are important factors in determining the stability of logging statements.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

The application of custom classification techniques and posterior probability modeling (PPM) using Worldview-2 multispectral imagery to archaeological field survey is presented in this paper. Research is focused on the identification of Neolithic felsite stone tool workshops in the North Mavine region of the Shetland Islands in Northern Scotland. Sample data from known workshops surveyed using differential GPS are used alongside known non-sites to train a linear discriminant analysis (LDA) classifier based on a combination of datasets including Worldview-2 bands, band difference ratios (BDR) and topographical derivatives. Principal components analysis is further used to test and reduce dimensionality caused by redundant datasets. Probability models were generated by LDA using principal components and tested with sites identified through geological field survey. Testing shows the prospective ability of this technique and significance between 0.05 and 0.01, and gain statistics between 0.90 and 0.94, higher than those obtained using maximum likelihood and random forest classifiers. Results suggest that this approach is best suited to relatively homogenous site types, and performs better with correlated data sources. Finally, by combining posterior probability models and least-cost analysis, a survey least-cost efficacy model is generated showing the utility of such approaches to archaeological field survey.

Relevance:

60.00% 60.00%

Publisher:

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-08