Digital Library

967 results for random forest

An Exploration of the challenges associated with software logging in large systems

Relevance:

60.00%

Publisher:

Abstract:

Thesis (Master, Computing) -- Queen's University, 2016-05-29 18:11:34.114

Ensemble prediction distribution maps of macroalgae for current conditions and four climate change scenarios and high resolution bathymetry for Potter Cove, WAP, Antarctica

Relevance:

60.00%

Publisher:

Abstract:

Species distribution models (SDM) predict species occurrence based on statistical relationships with environmental conditions. The R package biomod2, which includes 10 different SDM techniques and 10 different evaluation methods, was used in this study. Macroalgae are the main biomass producers in Potter Cove, King George Island (Isla 25 de Mayo), Antarctica, and they are sensitive to climate change factors such as suspended particulate matter (SPM). Macroalgae presence and absence data were used to test the suitability of SDMs and, simultaneously, to assess the environmental response of macroalgae as well as to model four scenarios of distribution shifts by varying SPM conditions due to climate change. According to the evaluation scores of Relative Operating Characteristic (ROC) and True Skill Statistic (TSS) averaged across models, methods based on a multitude of decision trees, such as Random Forest and Classification Tree Analysis, reached the highest predictive power, followed by generalized boosted models (GBM) and maximum-entropy approaches (Maxent). The final ensemble model used 135 of 200 calculated models (TSS > 0.7) and identified hard substrate and SPM as the most influential parameters, followed by distance to glacier, total organic carbon (TOC), bathymetry and slope. The climate change scenarios show an invasive reaction of the macroalgae in the case of lower SPM and a retreat of the macroalgae in the case of higher assumed SPM values.
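
The TSS cut-off mentioned in this abstract can be illustrated with a short sketch. TSS is conventionally defined as sensitivity plus specificity minus one. The Python below is a minimal illustration only: the model names and confusion-matrix counts are hypothetical and are not taken from the study, which used the R package biomod2.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """Return TSS = sensitivity + specificity - 1 (range -1 to +1)."""
    sensitivity = tp / (tp + fn)  # share of presences predicted as presences
    specificity = tn / (tn + fp)  # share of absences predicted as absences
    return sensitivity + specificity - 1.0


def select_models(scores, threshold=0.7):
    """Keep the names of models whose TSS is strictly above the threshold."""
    return [name for name, tss in scores.items() if tss > threshold]


# Hypothetical (tp, fn, tn, fp) counts for three illustrative model runs.
counts = {
    "rf_run1": (47, 3, 45, 5),
    "gbm_run1": (40, 10, 35, 15),
    "maxent_run1": (30, 20, 30, 20),
}
scores = {name: true_skill_statistic(*c) for name, c in counts.items()}
kept = select_models(scores)
print(kept)
```

Only the runs that pass the threshold would then contribute to an ensemble; how the study weighted or averaged the retained models is not stated in the abstract.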

Sea surface temperature reconstruction for the Southwest Pacific Ocean

Relevance:

60.00%

Publisher:

Abstract:

Paleoceanographic archives derived from 17 marine sediment cores reconstruct the response of the Southwest Pacific Ocean to the peak interglacial, Marine Isotope Stage (MIS) 5e (ca. 125 ka). Paleo-Sea Surface Temperature (SST) estimates were obtained from the Random Forest model, an ensemble decision tree tool, applied to core-top planktonic foraminiferal faunas calibrated to modern SSTs. The reconstructed geographic pattern of the SST anomaly (maximum SST between 120 and 132 ka minus mean modern SST) seems to indicate that MIS 5e conditions were generally warmer in the Southwest Pacific, especially in the western Tasman Sea, where a strengthened East Australian Current (EAC) likely extended subtropical influence to ca. 45°S off Tasmania. In contrast, the eastern Tasman Sea may have had a modest cooling except around 45°S. The observed pattern resembles that developing under the present warming trend in the region. An increase in wind stress curl over the modern South Pacific is hypothesized to have spun up the South Pacific Subtropical Gyre, with a concurrent increase in subtropical flow in the western boundary currents that include the EAC. However, warmer temperatures along the Subtropical Front and Campbell Plateau to the south suggest that the relative influence of the boundary inflows to eastern New Zealand may have differed in MIS 5e, and these currents may have followed different paths compared to today.

High accuracy discrimination of Parkinson's disease participants from healthy controls using smartphones

Relevance:

60.00%

Publisher:

Abstract:

The aim of this study is to accurately distinguish Parkinson's disease (PD) participants from healthy controls using self-administered tests of gait and postural sway. Using consumer-grade smartphones with in-built accelerometers, we objectively measure and quantify key movement severity symptoms of Parkinson's disease. Specifically, we record tri-axial accelerations, and extract a range of different features based on the time and frequency-domain properties of the acceleration time series. The features quantify key characteristics of the acceleration time series, and enhance the underlying differences in the gait and postural sway accelerations between PD participants and controls. Using a random forest classifier, we demonstrate an average sensitivity of 98.5% and average specificity of 97.5% in discriminating PD participants from controls. © 2014 IEEE.

AllerTOP v.2 - a server for in silico prediction of allergens

Relevance:

60.00%

Publisher:

Abstract:

Allergy is an overreaction by the immune system to a previously encountered, ordinarily harmless substance - typically proteins - resulting in skin rash, swelling of mucous membranes, sneezing or wheezing, or other abnormal conditions. The use of modified proteins is increasingly widespread: their presence in food, commercial products, such as washing powder, and medical therapeutics and diagnostics, makes predicting and identifying potential allergens a crucial societal issue. The prediction of allergens has been explored widely using bioinformatics, with many tools being developed in the last decade; many of these are freely available online. Here, we report a set of novel models for allergen prediction utilizing amino acid E-descriptors, auto- and cross-covariance transformation, and several machine learning methods for classification, including logistic regression (LR), decision tree (DT), naïve Bayes (NB), random forest (RF), multilayer perceptron (MLP) and k nearest neighbours (kNN). The best performing method was kNN with 85.3% accuracy at 5-fold cross-validation. The resulting model has been implemented in a revised version of the AllerTOP server (http://www.ddg-pharmfac.net/AllerTOP). © Springer-Verlag 2014.
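
The auto- and cross-covariance (ACC) transformation named in this abstract converts a protein of any length into a fixed-length feature vector, which is what lets classifiers such as kNN or random forest be applied. The Python sketch below uses a common form of the transform, the lagged product of two descriptors averaged over the sequence. The exact normalisation used by AllerTOP is an assumption here, and the descriptor values are hypothetical placeholders, not real E-descriptors.

```python
def acc_transform(descriptors, max_lag):
    """Auto-/cross-covariance transform of per-residue descriptor vectors.

    descriptors: list of equal-length lists, one list per residue.
    Returns a flat list of max_lag * d * d values, where d is the number
    of descriptors per residue. Terms with j == k are auto-covariances,
    terms with j != k are cross-covariances.
    """
    n = len(descriptors)
    d = len(descriptors[0])
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                total = sum(
                    descriptors[i][j] * descriptors[i + lag][k]
                    for i in range(n - lag)
                )
                features.append(total / (n - lag))
    return features


# Hypothetical 2-descriptor encoding of a 4-residue sequence.
toy_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
features = acc_transform(toy_sequence, max_lag=1)
print(len(features))
```

Because the output length depends only on the number of descriptors and the maximum lag, sequences of different lengths map to vectors of the same size.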

Quantitative Comparison of Plant Community Hydrology Using Large-Extent, Long-Term Data

Relevance:

60.00%

Publisher:

Abstract:

Large-extent vegetation datasets that co-occur with long-term hydrology data provide new ways to develop biologically meaningful hydrologic variables and to determine plant community responses to hydrology. We analyzed the suitability of different hydrological variables to predict vegetation in two water conservation areas (WCAs) in the Florida Everglades, USA, and developed metrics to define realized hydrologic optima and tolerances. Using vegetation data spatially co-located with long-term hydrological records, we evaluated seven variables describing water depth, hydroperiod length, and number of wet/dry events; each variable was tested for 2-, 4- and 10-year intervals for Julian annual averages and environmentally-defined hydrologic intervals. Maximum length and maximum water depth during the wet period calculated for environmentally-defined hydrologic intervals over a 4-year period were the best predictors of vegetation type. Proportional abundance of vegetation types along hydrological gradients indicated that communities had different realized optima and tolerances across WCAs. In both WCAs, the trees/shrubs class was on the drier/shallower end of hydrological gradients, while slough communities occupied the wetter/deeper end. However, the distribution of Cladium, Typha, wet prairie and Salix communities, which were intermediate for most hydrological variables, varied in proportional abundance along hydrologic gradients between WCAs, indicating that realized optima and tolerances are context-dependent.

The science of citizen science

Relevance:

60.00%

Publisher:

Abstract:

Biodiversity citizen science projects are growing in number, size, and scope, and are gaining recognition as valuable data sources that build public engagement. Yet publication rates indicate that citizen science is still infrequently used as a primary tool for conservation research and the causes of this apparent disconnect have not been quantitatively evaluated. To uncover the barriers to the use of citizen science as a research tool, we surveyed professional biodiversity scientists (n = 423) and citizen science project managers (n = 125). We conducted three analyses using non-parametric recursive modeling (random forest), using questions that addressed: scientists' perceptions and preferences regarding citizen science, scientists' requirements for their own data, and the actual practices of citizen science projects. For all three analyses we identified the most important factors that influence the probability of publication using citizen science data. Four general barriers emerged: a narrow awareness among scientists of citizen science projects that match their needs; the fact that not all biodiversity science is well-suited for citizen science; inconsistency in data quality across citizen science projects; and bias among scientists for certain data sources (institutions and ages/education levels of data collectors). Notably, we find limited evidence to suggest a relationship between citizen science projects that satisfy scientists' biases and data quality or probability of publication. These results illuminate the need for greater visibility of citizen science practices with respect to the requirements of biodiversity science and show that addressing bias among scientists could improve application of citizen science in conservation.

Spatial Relationships among Hydroacoustic, Hydrographic and Top Predator Patterns: Cetacean Distributions in the Mid-Atlantic Bight

Relevance:

60.00%

Publisher:

Abstract:

Effective conservation and management of top predators requires a comprehensive understanding of their distributions and of the underlying biological and physical processes that affect these distributions. The Mid-Atlantic Bight shelf break system is a dynamic and productive region where at least 32 species of cetaceans have been recorded through various systematic and opportunistic marine mammal surveys from the 1970s through 2012. My dissertation characterizes the spatial distribution and habitat of cetaceans in the Mid-Atlantic Bight shelf break system by utilizing marine mammal line-transect survey data, synoptic multi-frequency active acoustic data, and fine-scale hydrographic data collected during the 2011 summer Atlantic Marine Assessment Program for Protected Species (AMAPPS) survey. Although studies describing cetacean habitat and distributions have been previously conducted in the Mid-Atlantic Bight, my research specifically focuses on the shelf break region to elucidate both the physical and biological processes that influence cetacean distribution patterns within this cetacean hotspot.

In Chapter One I review biologically important areas for cetaceans in the Atlantic waters of the United States. I describe the study area, the shelf break region of the Mid-Atlantic Bight, in terms of the general oceanography, productivity and biodiversity. According to recent habitat-based cetacean density models, the shelf break region is an area of high cetacean abundance and density, yet little research is directed at understanding the mechanisms that establish this region as a cetacean hotspot.

In Chapter Two I present the basic physical principles of sound in water and describe the methodology used to categorize opportunistically collected multi-frequency active acoustic data using frequency response techniques. Frequency response classification methods are usually employed in conjunction with net-tow data, but the logistics of the 2011 AMAPPS survey did not allow for appropriate net-tow data to be collected. Biologically meaningful information can be extracted from acoustic scattering regions by comparing the frequency response curves of acoustic regions to theoretical curves of known scattering models. Using the five frequencies on the EK60 system (18, 38, 70, 120, and 200 kHz), three categories of scatterers were defined: fish-like (with swim bladder), nekton-like (e.g., euphausiids), and plankton-like (e.g., copepods). I also employed a multi-frequency acoustic categorization method using three frequencies (18, 38, and 120 kHz) that has been used in the Gulf of Maine and Georges Bank and is based on the presence or absence of volume backscatter above a threshold. This method is more objective than the comparison of frequency response curves because it uses an established backscatter value for the threshold. By removing all data below the threshold, only strong scattering information is retained.

In Chapter Three I analyze the distribution of the categorized acoustic regions of interest during the daytime cross shelf transects. Over all transects, plankton-like acoustic regions of interest were detected most frequently, followed by fish-like acoustic regions and then nekton-like acoustic regions. Plankton-like detections per kilometer were the only ones that differed significantly, although nekton-like detections fell just short of significance. The threshold categorization method of Jech and Michaels (2006) provides a more conservative and discrete detection of acoustic scatterers and allows me to retrieve backscatter values along transects in areas that have been categorized. This provides continuous data values that can be integrated at discrete spatial increments for wavelet analysis. Wavelet analysis indicates that significant spatial scales of interest for fish-like and nekton-like acoustic backscatter range from one to four kilometers and vary among transects.

In Chapter Four I analyze the fine scale distribution of cetaceans in the shelf break system of the Mid-Atlantic Bight using corrected sightings per trackline region, classification trees, multidimensional scaling, and random forest analysis. I describe habitat for common dolphins, Risso’s dolphins and sperm whales. From the distribution of cetacean sightings, patterns of habitat start to emerge: within the shelf break region of the Mid-Atlantic Bight, common dolphins were sighted more prevalently over the shelf while sperm whales were more frequently found in the deep waters offshore and Risso’s dolphins were most prevalent at the shelf break. Multidimensional scaling presents clear environmental separation among common dolphins and Risso’s dolphins and sperm whales. The sperm whale random forest habitat model had the lowest misclassification error (0.30) and the Risso’s dolphin random forest habitat model had the greatest misclassification error (0.37). Shallow water depth (less than 148 meters) was the primary variable selected in the classification model for common dolphin habitat. Distance to surface density fronts and surface temperature fronts were the primary variables selected in the classification models to describe Risso’s dolphin habitat and sperm whale habitat respectively. When mapped back into geographic space, these three cetacean species occupy different fine-scale habitats within the dynamic Mid-Atlantic Bight shelf break system.

In Chapter Five I summarize the previous chapters and present potential analytical steps to address ecological questions pertaining to the dynamic shelf break region. Taken together, the results of my dissertation demonstrate the use of opportunistically collected data in ecosystem studies; emphasize the need to incorporate middle trophic level data and oceanographic features into cetacean habitat models; and emphasize the importance of developing a more mechanistic understanding of dynamic ecosystems.

An Exploration of the challenges associated with software logging in large systems

Relevance:

60.00%

Publisher:

Abstract:

Over the past few years, logging has evolved from simple printf statements to more complex and widely used logging libraries. Today logging information is used to support various development activities such as fixing bugs, analyzing the results of load tests, monitoring performance and transferring knowledge. Recent research has examined how to improve logging practices by informing developers what to log and where to log. Furthermore, the strong dependence on logging has led to the development of logging libraries that have reduced the intricacies of logging, which has resulted in an abundance of log information. Two recent challenges have emerged as modern software systems start to treat logging as a core aspect of their software. In particular, 1) infrastructural challenges have emerged due to the plethora of logging libraries available today and 2) processing challenges have emerged due to the large number of log processing tools that ingest logs and produce useful information from them. In this thesis, we explore these two challenges. We first explore the infrastructural challenges that arise due to the plethora of logging libraries available today. As systems evolve, their logging infrastructure has to evolve (commonly this is done by migrating to new logging libraries). We explore logging library migrations within Apache Software Foundation (ASF) projects. We find that close to 14% of the projects within the ASF migrate their logging libraries at least once. For processing challenges, we explore the different factors that can affect the likelihood of a logging statement changing in the future in four open source systems, namely ActiveMQ, Camel, Cloudstack and Liferay. Such changes are likely to negatively impact the log processing tools, which must be updated to accommodate them. We find that 20%-45% of the logging statements within the four systems are changed at least once. We construct random forest classifiers and Cox models to determine the likelihood of both just-introduced and long-lived logging statements changing in the future. We find that file ownership, developer experience, log density and SLOC are important factors in determining the stability of logging statements.

Posterior Probability Modeling and Image Classification for Archaeological Site Prospection: Building a Survey Efficacy Model for Identifying Neolithic Felsite Workshops in the Shetland Islands

Relevance:

60.00%

Publisher:

Abstract:

The application of custom classification techniques and posterior probability modeling (PPM) using Worldview-2 multispectral imagery to archaeological field survey is presented in this paper. Research is focused on the identification of Neolithic felsite stone tool workshops in the North Mavine region of the Shetland Islands in Northern Scotland. Sample data from known workshops surveyed using differential GPS are used alongside known non-sites to train a linear discriminant analysis (LDA) classifier based on a combination of datasets including Worldview-2 bands, band difference ratios (BDR) and topographical derivatives. Principal components analysis is further used to test and reduce dimensionality caused by redundant datasets. Probability models were generated by LDA using principal components and tested with sites identified through geological field survey. Testing shows the prospective ability of this technique and significance between 0.05 and 0.01, and gain statistics between 0.90 and 0.94, higher than those obtained using maximum likelihood and random forest classifiers. Results suggest that this approach is best suited to relatively homogenous site types, and performs better with correlated data sources. Finally, by combining posterior probability models and least-cost analysis, a survey least-cost efficacy model is generated showing the utility of such approaches to archaeological field survey.

Essays on Machine Learning and Hedonic Models

Relevance:

60.00%

Publisher:

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-08

Uninformative frame detection in colonoscopy through motion, edge and color features

Relevance:

60.00%

Publisher:

Estudo comparativo entre técnicas de aprendizado de máquina para estimação de risco de crédito (Comparative study of machine learning techniques for credit risk estimation)

Relevance:

60.00%

Publisher:

Abstract:

Dissertation (Master's), Universidade de Brasília, Faculdade de Economia, Administração e Contabilidade, Programa de Pós-Graduação em Administração, 2016.

Desarrollo de técnicas de aprendizaje automático para la predicción de resultados de partidos en ligas futbolísticas (Development of machine learning techniques for predicting match results in football leagues)

Relevance:

60.00%

Publisher:

Abstract:

At present, a large body of research uses machine learning techniques based on decision trees. Building on that work, methods using multi-classifiers (Random forest, Boosting, Bagging) have been developed that solve the same problems addressed with simple decision trees while increasing accuracy. The range of problems traditionally solved by these techniques is very varied, although bioinformatics stands out. In any case, the classification can always be referred to an expert, whose answer is taken to be correct. There are problems, however, where a domain expert is not always right. One example is football pools (1X2), where knowledge of the problem domain increases accuracy but an incorrect prediction remains quite possible. The reason is that the number of factors influencing a result is so large that, in many cases, prediction becomes a matter of chance. In this work we aim to find a multi-classifier, built on the most widely studied simple classifiers such as the Multilayer Perceptron or Decision Trees, with the highest possible accuracy. To that end, a series of configurations of our own classifiers will be studied and implemented alongside multi-classifiers developed by third parties. Another line of study is the data itself, that is, the training set. Through a study of the problem domain we add new attributes that enrich the information available for each result, trying to imitate the knowledge on which an expert relies. The developments described were carried out in R. In addition, an application was built that allows a multi-classifier to be trained (either one of our own or one developed by third parties) and returns the confusion matrix together with the accuracy. As for results, we obtain accuracies between 50% and 55%, above chance and close to the results of experts.
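
The confusion matrix and accuracy reported by the application in this abstract can be sketched for the three 1X2 outcomes as follows. This is a minimal Python illustration; the work itself was carried out in R, and the match outcomes below are hypothetical, not results from the study.

```python
LABELS = ["1", "X", "2"]  # home win, draw, away win


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are actual outcomes, columns are predicted outcomes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical outcomes for eight matches.
actual = ["1", "X", "2", "1", "1", "2", "X", "1"]
predicted = ["1", "1", "2", "1", "X", "2", "X", "2"]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(accuracy(matrix))
```

With three roughly balanced outcomes, uniform guessing would be right about a third of the time, which is the baseline against which the reported 50% to 55% should be read.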

Classifying And Predicting Software Security Vulnerabilities based on Reproducibility

Relevance:

60.00%

Publisher:

Abstract:

Security defects are common in large software systems because of their size and complexity. Although efficient development processes, testing, and maintenance policies are applied to software systems, a large number of vulnerabilities can still remain despite these measures. Some vulnerabilities stay in a system from one release to the next because they cannot be easily reproduced through testing. These vulnerabilities endanger the security of the systems. We propose vulnerability classification and prediction frameworks based on vulnerability reproducibility. The frameworks are effective at identifying the types and locations of vulnerabilities at an early stage, and at improving the security of software in the next versions (referred to as releases). We extend an existing concept of software bug classification to vulnerability classification (easily reproducible and hard to reproduce) to develop a classification framework for differentiating between these vulnerabilities based on code fixes and textual reports. We then investigate the potential correlations between the vulnerability categories and the classical software metrics and some other runtime environmental factors of reproducibility to develop a vulnerability prediction framework. The classification and prediction frameworks help developers adopt corresponding mitigation or elimination actions and develop appropriate test cases. The vulnerability prediction framework also helps security experts focus their effort on the top-ranked vulnerability-prone files. As a result, the frameworks decrease the number of attacks that exploit security vulnerabilities in the next versions of the software. To build the classification and prediction frameworks, different machine learning techniques (C4.5 Decision Tree, Random Forest, Logistic Regression, and Naive Bayes) are employed. The effectiveness of the proposed frameworks is assessed based on collected software security defects of Mozilla Firefox.
