47 results for random forests

at Université de Lausanne, Switzerland


Relevance:

60.00%

Publisher:

Abstract:

BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, are performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.
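The nested cross-validation scheme mentioned in the METHODS part can be sketched as follows. This is only a minimal illustration, not the authors' pipeline: it assumes a one-dimensional feature, a plain kNN classifier, and interleaved folds, and the names `knn_predict`, `folds` and `nested_cv` are hypothetical. The key point it shows is that the tuning parameter (here k) is chosen in an inner loop that never sees the outer test fold.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Share of test points whose label is predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def folds(data, n):
    """Split data into n interleaved folds (an illustrative choice)."""
    return [data[i::n] for i in range(n)]

def nested_cv(data, ks=(1, 3, 5), outer=5, inner=3):
    """Estimate accuracy with k tuned inside each outer training set."""
    outer_folds = folds(data, outer)
    scores = []
    for i, test in enumerate(outer_folds):
        train = [p for j, f in enumerate(outer_folds) if j != i for p in f]
        inner_folds = folds(train, inner)

        def inner_score(k):
            # Mean validation accuracy of k over the inner folds only
            total = 0.0
            for a, val in enumerate(inner_folds):
                fit = [p for b, f in enumerate(inner_folds) if b != a for p in f]
                total += accuracy(fit, val, k)
            return total / inner

        best_k = max(ks, key=inner_score)
        # The outer test fold is used once, after tuning is finished
        scores.append(accuracy(train, test, best_k))
    return sum(scores) / len(scores)

# Toy data: (feature, label) pairs with two well-separated groups
toy = [(i, 0) for i in range(10)] + [(100 + i, 1) for i in range(10)]
estimate = nested_cv(toy)
```

To study batch effects in the spirit of the abstract, one would compare such an estimate with the accuracy obtained on data from an independent batch; that comparison is not part of this sketch.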

Relevance:

60.00%

Publisher:

Abstract:

The current research project is both a process and impact evaluation of community policing in Switzerland's five major urban areas - Basel, Bern, Geneva, Lausanne, and Zurich. Community policing is both a philosophy and an organizational strategy that promotes a renewed partnership between the police and the community to solve problems of crime and disorder. The process evaluation data on police internal reforms were obtained through semi-structured interviews with key administrators from the five police departments as well as from police internal documents and additional public sources. 
The impact evaluation uses official crime records and census statistics as contextual variables as well as Swiss Crime Survey (SCS) data on fear of crime, perceptions of disorder, and public attitudes towards the police as outcome measures. The SCS is a standing survey instrument that has polled residents of the five urban areas repeatedly since the mid-1980s. The process evaluation produced a "Calendar of Action" to create panel data to measure community policing implementation progress over six evaluative dimensions in intervals of five years between 1990 and 2010. The impact evaluation, carried out ex post facto, uses an observational design that analyzes the impact of the different community policing models between matched comparison areas across the five cities. Using ZIP code districts as proxies for urban neighborhoods, geospatial data mining algorithms serve to develop a neighborhood typology in order to match the comparison areas. To this end, both unsupervised and supervised algorithms are used to analyze high-dimensional data on crime, the socio-economic and demographic structure, and the built environment in order to classify urban neighborhoods into clusters of similar type. In a first step, self-organizing maps serve as tools to develop a clustering algorithm that reduces the within-cluster variance in the contextual variables and simultaneously maximizes the between-cluster variance in survey responses. The random forests algorithm then serves to assess the appropriateness of the resulting neighborhood typology and to select the key contextual variables in order to build a parsimonious model that makes a minimum of classification errors. 
Finally, for the impact analysis, propensity score matching methods are used to match the survey respondents of the pretest and posttest samples on age, gender, and their level of education for each neighborhood type identified within each city, before conducting a statistical test of the observed difference in the outcome measures. Moreover, all significant results were subjected to a sensitivity analysis to assess the robustness of these findings in the face of potential bias due to some unobserved covariates. The study finds that over the last fifteen years, all five police departments have undertaken major reforms of their internal organization and operating strategies and forged strategic partnerships in order to implement community policing. The resulting neighborhood typology reduced the within-cluster variance of the contextual variables and accounted for a significant share of the between-cluster variance in the outcome measures prior to treatment, suggesting that geocomputational methods help to balance the observed covariates and hence to reduce threats to the internal validity of an observational design. Finally, the impact analysis revealed that fear of crime dropped significantly over the 2000-2005 period in the neighborhoods in and around the urban centers of Bern and Zurich. These improvements are fairly robust in the face of bias due to some unobserved covariate and covary temporally and spatially with the implementation of community policing. The alternative hypothesis that the observed reductions in fear of crime were at least in part a result of community policing interventions thus appears at least as plausible as the null hypothesis of absolutely no effect, even if the observational design cannot completely rule out selection and regression to the mean as alternative explanations.

Relevance:

60.00%

Publisher:

Abstract:

It is estimated that around 230 people die each year due to radon (222Rn) exposure in Switzerland. 222Rn occurs mainly in closed environments like buildings and originates primarily from the subjacent ground. Therefore it depends strongly on geology and shows substantial regional variations. Correct identification of these regional variations would lead to substantial reduction of 222Rn exposure of the population based on appropriate construction of new and mitigation of already existing buildings. Prediction of indoor 222Rn concentrations (IRC) and identification of 222Rn prone areas is however difficult since IRC depend on a variety of different variables like building characteristics, meteorology, geology and anthropogenic factors. The present work aims at the development of predictive models and the understanding of IRC in Switzerland, taking into account a maximum of information in order to minimize the prediction uncertainty. The predictive maps will be used as a decision-support tool for 222Rn risk management. The construction of these models is based on different data-driven statistical methods, in combination with geographical information systems (GIS). In a first phase we performed univariate analysis of IRC for different variables, namely the detector type, building category, foundation, year of construction, the average outdoor temperature during measurement, altitude and lithology. All variables showed significant associations to IRC. Buildings constructed after 1900 showed significantly lower IRC compared to earlier constructions. We observed a further drop of IRC after 1970. In addition to that, we found an association of IRC with altitude. With regard to lithology, we observed the lowest IRC in sedimentary rocks (excluding carbonates) and sediments and the highest IRC in the Jura carbonates and igneous rock. The IRC data was systematically analyzed for potential bias due to spatially unbalanced sampling of measurements. 
In order to facilitate the modeling and the interpretation of the influence of geology on IRC, we developed an algorithm based on k-medoids clustering which makes it possible to define coherent geological classes in terms of IRC. We performed a soil gas 222Rn concentration (SRC) measurement campaign in order to determine the predictive power of SRC with respect to IRC. We found that the use of SRC is limited for IRC prediction. The second part of the project was dedicated to predictive mapping of IRC using models which take into account the multidimensionality of the process of 222Rn entry into buildings. We used kernel regression and ensemble regression trees for this purpose. We could explain up to 33% of the variance of the log-transformed IRC all over Switzerland. This is a good performance compared to former attempts at IRC modeling in Switzerland. As predictor variables we considered geographical coordinates, altitude, outdoor temperature, building type, foundation, year of construction and detector type. Ensemble regression trees like random forests make it possible to determine the role of each IRC predictor in a multidimensional setting. We found spatial information like geology, altitude and coordinates to have stronger influences on IRC than building-related variables like foundation type, building type and year of construction. Based on kernel estimation we developed an approach to determine the local probability that IRC exceeds 300 Bq/m3. In addition, we developed a confidence index in order to provide an estimate of the uncertainty of the map. All methods allow an easy creation of tailor-made maps for different building characteristics. Our work is an essential step towards a 222Rn risk assessment which accounts at the same time for different architectural situations as well as geological and geographical conditions. For the communication of 222Rn hazard to the population we recommend using the probability map based on kernel estimation. 
The communication of 222Rn hazard could for example be implemented via a web interface where the users specify the characteristics and coordinates of their home in order to obtain the probability to be above a given IRC with a corresponding index of confidence. Taking into account the health effects of 222Rn, our results have the potential to substantially improve the estimation of the effective dose from 222Rn delivered to the Swiss population.
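One simple way to picture a kernel-based local probability of exceeding 300 Bq/m3 is a kernel-weighted share of nearby measurements above the threshold. The sketch below assumes a Gaussian kernel on planar coordinates in kilometres and a 5 km bandwidth; both are illustrative assumptions, and the function name `exceedance_probability` is hypothetical. The abstract does not specify the authors' kernel, bandwidth, or covariates.

```python
import math

def exceedance_probability(measurements, x, y, bandwidth_km=5.0, threshold=300.0):
    """Kernel-weighted share of nearby measurements above the threshold.

    measurements: iterable of (x_km, y_km, irc_bq_m3) tuples.
    The Gaussian kernel and the default bandwidth are assumptions.
    """
    num = den = 0.0
    for mx, my, irc in measurements:
        d2 = (mx - x) ** 2 + (my - y) ** 2
        weight = math.exp(-d2 / (2.0 * bandwidth_km ** 2))
        num += weight * (irc > threshold)  # indicator of exceedance
        den += weight
    # With no usable neighbours the estimate is undefined
    return num / den if den > 0 else float("nan")

# Two measurements at the query location, one above the threshold
p = exceedance_probability([(0.0, 0.0, 100.0), (0.0, 0.0, 400.0)], 0.0, 0.0)
```

The total kernel weight `den` could serve as a crude indicator of how much data supports a local estimate; whether the confidence index in the abstract works this way is not stated.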

Relevance:

60.00%

Publisher:

Abstract:

BACKGROUND: Workers with persistent disabilities after orthopaedic trauma may need occupational rehabilitation. Despite various risk profiles for non-return-to-work (non-RTW), there is no available predictive model. Moreover, injured workers may have various origins (immigrant workers), which may either affect their return to work or their eligibility for research purposes. The aim of this study was to develop and validate a predictive model that estimates the likelihood of non-RTW after occupational rehabilitation using predictors which do not rely on the worker's background. METHODS: Prospective cohort study (3177 participants, native (51%) and immigrant workers (49%)) with two samples: a) Development sample with patients from 2004 to 2007 with Full and Reduced Models, b) External validation of the Reduced Model with patients from 2008 to March 2010. We collected patients' data and biopsychosocial complexity with an observer rated interview (INTERMED). Non-RTW was assessed two years after discharge from the rehabilitation. Discrimination was assessed by the area under the receiver operating curve (AUC) and calibration was evaluated with a calibration plot. The model was reduced with random forests. RESULTS: At 2 years, the non-RTW status was known for 2462 patients (77.5% of the total sample). The prevalence of non-RTW was 50%. The full model (36 items) and the reduced model (19 items) had acceptable discrimination performance (AUC 0.75, 95% CI 0.72 to 0.78 and 0.74, 95% CI 0.71 to 0.76, respectively) and good calibration. For the validation model, the discrimination performance was acceptable (AUC 0.73; 95% CI 0.70 to 0.77) and calibration was also adequate. CONCLUSIONS: Non-RTW may be predicted with a simple model constructed with variables independent of the patient's education and language fluency. This model is useful for all kinds of trauma in order to adjust for case mix and it is applicable to vulnerable populations like immigrant workers.
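The discrimination measure used above, the AUC, can be read as the probability that a randomly chosen non-RTW case receives a higher predicted risk than a randomly chosen RTW case. A minimal sketch of that pairwise definition follows; the function name `auc` and the toy numbers are illustrative and are not taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted risks; labels: 1 for the event (e.g. non-RTW), 0 otherwise.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly
value = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```

The pairwise loop is quadratic in the sample size, which is fine for illustration; for cohorts of several thousand patients a rank-based computation would be the usual choice.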

Relevance:

60.00%

Publisher:

Abstract:

PURPOSE: According to estimations around 230 people die as a result of radon exposure in Switzerland. This public health concern makes reliable indoor radon prediction and mapping methods necessary in order to improve risk communication to the public. The aim of this study was to develop an automated method to classify lithological units according to their radon characteristics and to develop mapping and predictive tools in order to improve local radon prediction. METHOD: About 240 000 indoor radon concentration (IRC) measurements in about 150 000 buildings were available for our analysis. The automated classification of lithological units was based on k-medoids clustering via pair-wise Kolmogorov distances between IRC distributions of lithological units. For IRC mapping and prediction we used random forests and Bayesian additive regression trees (BART). RESULTS: The automated classification groups lithological units well in terms of their IRC characteristics. Especially the IRC differences in metamorphic rocks like gneiss are well revealed by this method. The maps produced by random forests soundly represent the regional difference of IRCs in Switzerland and improve the spatial detail compared to existing approaches. We could explain 33% of the variations in IRC data with random forests. Additionally, the influence of a variable evaluated by random forests shows that building characteristics are less important predictors for IRCs than spatial/geological influences. BART could explain 29% of IRC variability and produced maps that indicate the prediction uncertainty. CONCLUSION: Ensemble regression trees are a powerful tool to model and understand the multidimensional influences on IRCs. Automatic clustering of lithological units complements this method by facilitating the interpretation of radon properties of rock types. This study provides an important element for radon risk communication. 
Future approaches should consider taking into account further variables like soil gas radon measurements as well as more detailed geological information.
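The clustering step described in the METHOD part, k-medoids on pairwise Kolmogorov distances between IRC distributions, can be illustrated on a toy scale. The sketch below is an assumption-laden simplification: it finds the medoids by exhaustive search, which is only feasible for a handful of units, and the unit names and values are invented for illustration. The abstract does not say which k-medoids algorithm the authors used.

```python
from itertools import combinations

def kolmogorov_distance(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in set(a) | set(b))

def k_medoids(units, k):
    """Exact k-medoids by exhaustive search over medoid sets (small inputs only).

    units: dict mapping a unit name to its list of measured values.
    Returns a dict mapping each unit name to the name of its medoid.
    """
    names = sorted(units)
    dist = {(u, v): kolmogorov_distance(units[u], units[v])
            for u in names for v in names}

    def cost(medoids):
        # Sum of distances from every unit to its closest medoid
        return sum(min(dist[u, m] for m in medoids) for u in names)

    best = min(combinations(names, k), key=cost)
    return {u: min(best, key=lambda m: dist[u, m]) for u in names}

# Hypothetical lithological units with invented IRC values (Bq/m3)
units = {
    "unit_a": [50, 60, 70],
    "unit_b": [55, 65, 75],
    "unit_c": [400, 500, 600],
    "unit_d": [450, 550, 650],
}
clusters = k_medoids(units, 2)
```

Because the distance compares whole distributions rather than means, two units with similar averages but different spreads can still end up in different classes.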

Relevance:

30.00%

Publisher:

Abstract:

Altitudinal tree lines are mainly constrained by temperature, but can also be influenced by factors such as human activity, particularly in the European Alps, where centuries of agricultural use have affected the tree-line. Over the last decades this trend has been reversed due to changing agricultural practices and land-abandonment. We aimed to combine a statistical land-abandonment model with a forest dynamics model, to take into account the combined effects of climate and human land-use on the Alpine tree-line in Switzerland. Land-abandonment probability was expressed by a logistic regression function of degree-day sum, distance from forest edge, soil stoniness, slope, proportion of employees in the secondary and tertiary sectors, proportion of commuters and proportion of full-time farms. This was implemented in the TreeMig spatio-temporal forest model. Distance from forest edge and degree-day sum vary through feed-back from the dynamics part of TreeMig and climate change scenarios, while the other variables remain constant for each grid cell over time. The new model, TreeMig-LAb, was tested on theoretical landscapes, where the variables in the land-abandonment model were varied one by one. This confirmed the strong influence of distance from forest and slope on the abandonment probability. Degree-day sum has a more complex role, with opposite influences on land-abandonment and forest growth. TreeMig-LAb was also applied to a case study area in the Upper Engadine (Swiss Alps), along with a model where abandonment probability was a constant. Two scenarios were used: natural succession only (100% probability) and a probability of abandonment based on past transition proportions in that area (2.1% per decade). The former showed new forest growing in all but the highest-altitude locations. The latter was more realistic as to numbers of newly forested cells, but their location was random and the resulting landscape heterogeneous. 
Using the logistic regression model gave results consistent with observed patterns of land-abandonment: existing forests expanded and gaps closed, leading to an increasingly homogeneous landscape.
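The two ways of expressing abandonment probability described above can be sketched in a few lines. The logistic function below takes whatever coefficients are supplied; the abstract gives no coefficient values, so any numbers passed in are hypothetical. The second function shows the constant-probability scenario, under the added assumption that decades are independent.

```python
import math

def abandonment_probability(predictors, coefficients, intercept):
    """Logistic function of a linear combination of predictors.

    predictors and coefficients are dicts keyed by predictor name
    (e.g. "slope"); all values used with this sketch are hypothetical.
    """
    z = intercept + sum(coefficients[name] * value
                        for name, value in predictors.items())
    # Numerically stable evaluation of 1 / (1 + exp(-z))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def still_farmed_after(decades, p_per_decade=0.021):
    """Share of cells not yet abandoned under a constant per-decade probability."""
    return (1.0 - p_per_decade) ** decades

# Hypothetical coefficients, for illustration only
p = abandonment_probability({"slope": 20.0}, {"slope": 0.05}, intercept=-2.0)
remaining = still_farmed_after(5)
```

In the model described in the abstract, predictors such as distance from forest edge change over time through feedback from the forest dynamics, so the probability would be re-evaluated for each grid cell at each time step.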

Relevance:

20.00%

Publisher:

Abstract:

Random mating is the null model central to population genetics. One assumption behind random mating is that individuals mate an infinite number of times. This is obviously unrealistic. Here we show that when each female mates a finite number of times, the effective size of the population is substantially decreased.

Relevance:

20.00%

Publisher:

Relevance:

20.00%

Publisher:

Abstract:

In this paper, we study the average crossing number of equilateral random walks and polygons. We derive the asymptotic form of the mean average crossing number (ACN) of all equilateral random walks of length n. A similar result holds for equilateral random polygons. These results are confirmed by our numerical studies. Furthermore, our numerical studies indicate that when random polygons of length n are divided into individual knot types, the mean ACN for each knot type K can be described by a function of n whose constants a, b and c depend on K and in which n0 is the minimal number of segments required to form K. The ACN profiles of the individual knot types diverge from each other, with more complex knots showing a higher ACN than less complex knots. Moreover, these profiles intersect with the ACN profile of all closed walks. These points of intersection define the equilibrium length of K, i.e., the chain length at which a statistical ensemble of configurations with given knot type K, upon cutting, equilibration and reclosure to a new knot type, does not show a tendency to increase or decrease its ACN. This concept of equilibrium length seems to be universal, and applies also to other length-dependent observables for random knots, such as the mean radius of gyration Rg.

Relevance:

20.00%

Publisher:

Abstract:

In this paper, we study the average inter-crossing number between two random walks and two random polygons in the three-dimensional space. The random walks and polygons in this paper are the so-called equilateral random walks and polygons in which each segment of the walk or polygon is of unit length. We show that the mean average inter-crossing number ICN between two equilateral random walks of the same length n is approximately linear in terms of n and we were able to determine the prefactor of the linear term, which is $a = (3 \ln 2)/8 \approx 0.2599$. In the case of two random polygons of length n, the mean average inter-crossing number ICN is also linear, but the prefactor of the linear term is different from that of the random walks. These approximations apply when the starting points of the random walks and polygons are of a distance p apart and p is small compared to n. We propose a fitting model that would capture the theoretical asymptotic behaviour of the mean average ICN for large values of p. Our simulation result shows that the model in fact works very well for the entire range of p. We also study the mean ICN between two equilateral random walks and polygons of different lengths. An interesting result is that even if one random walk (polygon) has a fixed length, the mean average ICN between the two random walks (polygons) would still approach infinity if the length of the other random walk (polygon) approached infinity. The data provided by our simulations match our theoretical predictions very well.
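The numerical value of the prefactor quoted above is easy to check. The short sketch below evaluates 3 ln 2 / 8 and, as a rough illustration, the leading linear term for a given walk length; the helper `leading_term_icn` is a hypothetical name, and it ignores all lower-order corrections, which the abstract does not specify.

```python
import math

# Prefactor of the linear term for two equilateral random walks: 3 ln 2 / 8
prefactor = 3 * math.log(2) / 8

def leading_term_icn(n):
    """Leading linear term a * n only; lower-order corrections are ignored."""
    return prefactor * n

rounded = round(prefactor, 4)  # 0.2599
```

This applies to the random-walk case only; the abstract states that the prefactor for random polygons is different and does not give its value.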

Relevance:

20.00%

Publisher:

Abstract:

We introduce disk matrices which encode the knotting of all subchains in circular knot configurations. The disk matrices allow us to dissect circular knots into their subknots, i.e. knot types formed by subchains of the global knot. The identification of subknots is based on the study of linear chains in which a knot type is associated to the chain by means of a spatially robust closure protocol. We characterize the sets of observed subknot types in global knots taking energy-minimized shapes such as KnotPlot configurations and ideal geometric configurations. We compare the sets of observed subknots to knot types obtained by changing crossings in the classical prime knot diagrams. Building upon this analysis, we study the sets of subknots in random configurations of corresponding knot types. In many of the knot types we analyzed, the sets of subknots from the ideal geometric configurations are found in each of the hundreds of random configurations of the same global knot type. We also compare the sets of subknots observed in open protein knots with the subknots observed in the ideal configurations of the corresponding knot type. This comparison enables us to explain the specific dispositions of subknots in the analyzed protein knots.

Relevance:

20.00%

Publisher:

Abstract:

Long polymers in solution frequently adopt knotted configurations. To understand the physical properties of knotted polymers, it is important to find out whether the knots formed at thermodynamic equilibrium are spread over the whole polymer chain or rather are localized as tight knots. We present here a method to analyze the knottedness of short linear portions of simulated random chains. Using this method, we observe that knot-determining domains are usually very tight, so that, for example, the preferred size of the trefoil-determining portions of knotted polymer chains corresponds to just seven freely jointed segments.