397 resultados para Bayes
Resumo:
L’un des problèmes importants en apprentissage automatique est de déterminer la complexité du modèle à apprendre. Une trop grande complexité mène au surapprentissage, ce qui correspond à trouver des structures qui n’existent pas réellement dans les données, tandis qu’une trop faible complexité mène au sous-apprentissage, c’est-à-dire que l’expressivité du modèle est insuffisante pour capturer l’ensemble des structures présentes dans les données. Pour certains modèles probabilistes, la complexité du modèle se traduit par l’introduction d’une ou plusieurs variables cachées dont le rôle est d’expliquer le processus génératif des données. Il existe diverses approches permettant d’identifier le nombre approprié de variables cachées d’un modèle. Cette thèse s’intéresse aux méthodes Bayésiennes nonparamétriques permettant de déterminer le nombre de variables cachées à utiliser ainsi que leur dimensionnalité. La popularisation des statistiques Bayésiennes nonparamétriques au sein de la communauté de l’apprentissage automatique est assez récente. Leur principal attrait vient du fait qu’elles offrent des modèles hautement flexibles et dont la complexité s’ajuste proportionnellement à la quantité de données disponibles. Au cours des dernières années, la recherche sur les méthodes d’apprentissage Bayésiennes nonparamétriques a porté sur trois aspects principaux : la construction de nouveaux modèles, le développement d’algorithmes d’inférence et les applications. Cette thèse présente nos contributions à ces trois sujets de recherches dans le contexte d’apprentissage de modèles à variables cachées. Dans un premier temps, nous introduisons le Pitman-Yor process mixture of Gaussians, un modèle permettant l’apprentissage de mélanges infinis de Gaussiennes. Nous présentons aussi un algorithme d’inférence permettant de découvrir les composantes cachées du modèle que nous évaluons sur deux applications concrètes de robotique. Nos résultats démontrent que l’approche proposée surpasse en performance et en flexibilité les approches classiques d’apprentissage. Dans un deuxième temps, nous proposons l’extended cascading Indian buffet process, un modèle servant de distribution de probabilité a priori sur l’espace des graphes dirigés acycliques. Dans le contexte de réseaux Bayésien, ce prior permet d’identifier à la fois la présence de variables cachées et la structure du réseau parmi celles-ci. Un algorithme d’inférence Monte Carlo par chaîne de Markov est utilisé pour l’évaluation sur des problèmes d’identification de structures et d’estimation de densités. Dans un dernier temps, nous proposons le Indian chefs process, un modèle plus général que l’extended cascading Indian buffet process servant à l’apprentissage de graphes et d’ordres. L’avantage du nouveau modèle est qu’il admet les connections entres les variables observables et qu’il prend en compte l’ordre des variables. Nous présentons un algorithme d’inférence Monte Carlo par chaîne de Markov avec saut réversible permettant l’apprentissage conjoint de graphes et d’ordres. L’évaluation est faite sur des problèmes d’estimations de densité et de test d’indépendance. Ce modèle est le premier modèle Bayésien nonparamétrique permettant d’apprendre des réseaux Bayésiens disposant d’une structure complètement arbitraire.
Resumo:
Resumo:
Particle filtering has proven to be an effective localization method for wheeled autonomous vehicles. For a given map, a sensor model, and observations, occasions arise where the vehicle could equally likely be in many locations of the map. Because particle filtering algorithms may generate low confidence pose estimates under these conditions, more robust localization strategies are required to produce reliable pose estimates. This becomes more critical if the state estimate is an integral part of system control. We investigate the use of particle filter estimation techniques on a hovercraft vehicle. The marginally stable dynamics of a hovercraft require reliable state estimates for proper stability and control. We use the Monte Carlo localization method, which implements a particle filter in a recursive state estimate algorithm. An H-infinity controller, designed to accommodate the latency inherent in our state estimation, provides stability and controllability to the hovercraft. In order to eliminate the low confidence estimates produced in certain environments, a multirobot system is designed to introduce mobile environment features. By tracking and controlling the secondary robot, we can position the mobile feature throughout the environment to ensure a high confidence estimate, thus maintaining stability in the system. A laser rangefinder is the sensor the hovercraft uses to track the secondary robot, observe the environment, and facilitate successful localization and stability in motion.
Resumo:
Hebb proposed that synapses between neurons that fire synchronously are strengthened, forming cell assemblies and phase sequences. The former, on a shorter scale, are ensembles of synchronized cells that function transiently as a closed processing system; the latter, on a larger scale, correspond to the sequential activation of cell assemblies able to represent percepts and behaviors. Nowadays, the recording of large neuronal populations allows for the detection of multiple cell assemblies. Within Hebb's theory, the next logical step is the analysis of phase sequences. Here we detected phase sequences as consecutive assembly activation patterns, and then analyzed their graph attributes in relation to behavior. We investigated action potentials recorded from the adult rat hippocampus and neocortex before, during and after novel object exploration (experimental periods). Within assembly graphs, each assembly corresponded to a node, and each edge corresponded to the temporal sequence of consecutive node activations. The sum of all assembly activations was proportional to firing rates, but the activity of individual assemblies was not. Assembly repertoire was stable across experimental periods, suggesting that novel experience does not create new assemblies in the adult rat. Assembly graph attributes, on the other hand, varied significantly across behavioral states and experimental periods, and were separable enough to correctly classify experimental periods (Naïve Bayes classifier; maximum AUROCs ranging from 0.55 to 0.99) and behavioral states (waking, slow wave sleep, and rapid eye movement sleep; maximum AUROCs ranging from 0.64 to 0.98). Our findings agree with Hebb's view that assemblies correspond to primitive building blocks of representation, nearly unchanged in the adult, while phase sequences are labile across behavioral states and change after novel experience. The results are compatible with a role for phase sequences in behavior and cognition.
Resumo:
El análisis de datos actual se enfrenta a problemas derivados de la combinación de datos procedentes de diversas fuentes de información. El valor de la información puede enriquecerse enormemente facilitando la integración de nuevas fuentes de datos y la industria es muy consciente de ello en la actualidad. Sin embargo, no solo el volumen sino también la gran diversidad de los datos constituye un problema previo al análisis. Una buena integración de los datos garantiza unos resultados fiables y por ello merece la pena detenerse en la mejora de procesos de especificación, recolección, limpieza e integración de los datos. Este trabajo está dedicado a la fase de limpieza e integración de datos analizando los procedimientos existentes y proponiendo una solución que se aplica a datos médicos, centrándose así en los proyectos de predicción (con finalidad de prevención) en ciencias de la salud. Además de la implementación de los procesos de limpieza, se desarrollan algoritmos de detección de outliers que permiten mejorar la calidad del conjunto de datos tras su eliminación. El trabajo también incluye la implementación de un proceso de predicción que sirva de ayuda a la toma de decisiones. Concretamente este trabajo realiza un análisis predictivo de los datos de pacientes drogodependientes de la Clínica Nuestra Señora de la Paz, con la finalidad de poder brindar un apoyo en la toma de decisiones del médico a cargo de admitir el internamiento de pacientes en dicha clínica. En la mayoría de los casos el estudio de los datos facilitados requiere un pre-procesado adecuado para que los resultados de los análisis estadísticos tradicionales sean fiables. En tal sentido en este trabajo se implementan varias formas de detectar los outliers: un algoritmo propio (Detección de Outliers con Cadenas No Monótonas), que utiliza las ventajas del algoritmo Knuth-Morris-Pratt para reconocimiento de patrones, y las librerías outliers y Rcmdr de R. La aplicación de procedimientos de cleaning e integración de datos, así como de eliminación de datos atípicos proporciona una base de datos limpia y fiable sobre la que se implementarán procedimientos de predicción de los datos con el algoritmo de clasificación Naive Bayes en R.
Resumo:
Teoría de la probabilidad, contiene definiciones y terminología de frecuente uso en esta parte de las matemáticas; también se exponen distintos métodos de solución y las reglas esenciales del análisis combinatorio que proporcionan, en muchas ocasiones, una vía más cómoda en la solución de problemas; además se enuncia el Teorema de Bayes y su adjunto, de la probabilidad total. Todos los temas son ilustrados con ejemplos y problemas resueltos; al final hay una serie de ejercicios propuestos que el lector debe intentar resolver. La colección lecciones de matemáticas, iniciativa del departamento de ciencias básicas de la universidad de Medellín, a través de su grupo de investigación SUMMA, incluye en cada número la exposición detallada de un tema matemático, tratado con mayor profundidad que en un curso regular. Las temáticas incluyen: algebra, trigonometría, calculo, estadística y probabilidades, algebra lineal, métodos lineales y numéricos, historia de las matemáticas, geometría, matemáticas puras y aplicadas, ecuaciones diferenciales y empleo de distintos softwares para la enseñanza de las matemáticas.
Resumo:
The aim of the present study was to propose and evaluate the use of factor analysis (FA) in obtaining latent variables (factors) that represent a set of pig traits simultaneously, for use in genome-wide selection (GWS) studies. We used crosses between outbred F2 populations of Brazilian Piau X commercial pigs. Data were obtained on 345 F2 pigs, genotyped for 237 SNPs, with 41 traits. FA allowed us to obtain four biologically interpretable factors: ?weight?, ?fat?, ?loin?, and ?performance?. These factors were used as dependent variables in multiple regression models of genomic selection (Bayes A, Bayes B, RR-BLUP, and Bayesian LASSO). The use of FA is presented as an interesting alternative to select individuals for multiple variables simultaneously in GWS studies; accuracy measurements of the factors were similar to those obtained when the original traits were considered individually. The similarities between the top 10% of individuals selected by the factor, and those selected by the individual traits, were also satisfactory. Moreover, the estimated markers effects for the traits were similar to those found for the relevant factor.
Resumo:
Genomic selection (GS) has been used to compute genomic estimated breeding values (GEBV) of individuals; however, it has only been applied to animal and major plant crops due to high costs. Besides, breeding and selection is performed at the family level in some crops. We aimed to study the implementation of genome-wide family selection (GWFS) in two loblolly pine (Pinus taeda L.) populations: i) the breeding population CCLONES composed of 63 families (5-20 individuals per family), phenotyped for four traits (stem diameter, stem rust susceptibility, tree stiffness and lignin content) and genotyped using an Illumina Infinium assay with 4740 polymorphic SNPs, and ii) a simulated population that reproduced the same pedigree as CCLONES, 5000 polymorphic loci and two traits (oligogenic and polygenic). In both populations, phenotypic and genotypic data was pooled at the family level in silico. Phenotypes were averaged across replicates for all the individuals and allele frequency was computed for each SNP. Marker effects were estimated at the individual (GEBV) and family (GEFV) levels with Bayes-B using the package BGLR in R and models were validated using 10-fold cross validations. Predicted ability, computed by correlating phenotypes with GEBV and GEFV, was always higher for GEFV in both populations, even after standardizing GEFV predictions to be comparable to GEBV. Results revealed great potential for using GWFS in breeding programs that select families, such as most outbreeding forage species. A significant drop in genotyping costs as one sample per family is needed would allow the application of GWFS in minor crops.
Resumo:
Dato il recente avvento delle tecnologie NGS, in grado di sequenziare interi genomi umani in tempi e costi ridotti, la capacità di estrarre informazioni dai dati ha un ruolo fondamentale per lo sviluppo della ricerca. Attualmente i problemi computazionali connessi a tali analisi rientrano nel topic dei Big Data, con databases contenenti svariati tipi di dati sperimentali di dimensione sempre più ampia. Questo lavoro di tesi si occupa dell'implementazione e del benchmarking dell'algoritmo QDANet PRO, sviluppato dal gruppo di Biofisica dell'Università di Bologna: il metodo consente l'elaborazione di dati ad alta dimensionalità per l'estrazione di una Signature a bassa dimensionalità di features con un'elevata performance di classificazione, mediante una pipeline d'analisi che comprende algoritmi di dimensionality reduction. Il metodo è generalizzabile anche all'analisi di dati non biologici, ma caratterizzati comunque da un elevato volume e complessità, fattori tipici dei Big Data. L'algoritmo QDANet PRO, valutando la performance di tutte le possibili coppie di features, ne stima il potere discriminante utilizzando un Naive Bayes Quadratic Classifier per poi determinarne il ranking. Una volta selezionata una soglia di performance, viene costruito un network delle features, da cui vengono determinate le componenti connesse. Ogni sottografo viene analizzato separatamente e ridotto mediante metodi basati sulla teoria dei networks fino all'estrapolazione della Signature finale. Il metodo, già precedentemente testato su alcuni datasets disponibili al gruppo di ricerca con riscontri positivi, è stato messo a confronto con i risultati ottenuti su databases omici disponibili in letteratura, i quali costituiscono un riferimento nel settore, e con algoritmi già esistenti che svolgono simili compiti. Per la riduzione dei tempi computazionali l'algoritmo è stato implementato in linguaggio C++ su HPC, con la parallelizzazione mediante librerie OpenMP delle parti più critiche.
Resumo:
Group testing has long been considered as a safe and sensible relative to one-at-a-time testing in applications where the prevalence rate p is small. In this thesis, we applied Bayes approach to estimate p using Beta-type prior distribution. First, we showed two Bayes estimators of p from prior on p derived from two different loss functions. Second, we presented two more Bayes estimators of p from prior on π according to two loss functions. We also displayed credible and HPD interval for p. In addition, we did intensive numerical studies. All results showed that the Bayes estimator was preferred over the usual maximum likelihood estimator (MLE) for small p. We also presented the optimal β for different p, m, and k.
Resumo:
In 2010, the American Association of State Highway and Transportation Officials (AASHTO) released a safety analysis software system known as SafetyAnalyst. SafetyAnalyst implements the empirical Bayes (EB) method, which requires the use of Safety Performance Functions (SPFs). The system is equipped with a set of national default SPFs, and the software calibrates the default SPFs to represent the agency’s safety performance. However, it is recommended that agencies generate agency-specific SPFs whenever possible. Many investigators support the view that the agency-specific SPFs represent the agency data better than the national default SPFs calibrated to agency data. Furthermore, it is believed that the crash trends in Florida are different from the states whose data were used to develop the national default SPFs. In this dissertation, Florida-specific SPFs were developed using the 2008 Roadway Characteristics Inventory (RCI) data and crash and traffic data from 2007-2010 for both total and fatal and injury (FI) crashes. The data were randomly divided into two sets, one for calibration (70% of the data) and another for validation (30% of the data). The negative binomial (NB) model was used to develop the Florida-specific SPFs for each of the subtypes of roadway segments, intersections and ramps, using the calibration data. Statistical goodness-of-fit tests were performed on the calibrated models, which were then validated using the validation data set. The results were compared in order to assess the transferability of the Florida-specific SPF models. The default SafetyAnalyst SPFs were calibrated to Florida data by adjusting the national default SPFs with local calibration factors. The performance of the Florida-specific SPFs and SafetyAnalyst default SPFs calibrated to Florida data were then compared using a number of methods, including visual plots and statistical goodness-of-fit tests. The plots of SPFs against the observed crash data were used to compare the prediction performance of the two models. Three goodness-of-fit tests, represented by the mean absolute deviance (MAD), the mean square prediction error (MSPE), and Freeman-Tukey R2 (R2FT), were also used for comparison in order to identify the better-fitting model. The results showed that Florida-specific SPFs yielded better prediction performance than the national default SPFs calibrated to Florida data. The performance of Florida-specific SPFs was further compared with that of the full SPFs, which include both traffic and geometric variables, in two major applications of SPFs, i.e., crash prediction and identification of high crash locations. The results showed that both SPF models yielded very similar performance in both applications. These empirical results support the use of the flow-only SPF models adopted in SafetyAnalyst, which require much less effort to develop compared to full SPFs.
Resumo:
Security defects are common in large software systems because of their size and complexity. Although efficient development processes, testing, and maintenance policies are applied to software systems, there are still a large number of vulnerabilities that can remain, despite these measures. Some vulnerabilities stay in a system from one release to the next one because they cannot be easily reproduced through testing. These vulnerabilities endanger the security of the systems. We propose vulnerability classification and prediction frameworks based on vulnerability reproducibility. The frameworks are effective to identify the types and locations of vulnerabilities in the earlier stage, and improve the security of software in the next versions (referred to as releases). We expand an existing concept of software bug classification to vulnerability classification (easily reproducible and hard to reproduce) to develop a classification framework for differentiating between these vulnerabilities based on code fixes and textual reports. We then investigate the potential correlations between the vulnerability categories and the classical software metrics and some other runtime environmental factors of reproducibility to develop a vulnerability prediction framework. The classification and prediction frameworks help developers adopt corresponding mitigation or elimination actions and develop appropriate test cases. Also, the vulnerability prediction framework is of great help for security experts focus their effort on the top-ranked vulnerability-prone files. As a result, the frameworks decrease the number of attacks that exploit security vulnerabilities in the next versions of the software. To build the classification and prediction frameworks, different machine learning techniques (C4.5 Decision Tree, Random Forest, Logistic Regression, and Naive Bayes) are employed. The effectiveness of the proposed frameworks is assessed based on collected software security defects of Mozilla Firefox.
Resumo:
SIN FINANCIACIÓN
Resumo:
Data are lacking on the characteristics of atrial activity in centenarians, including interatrial block (IAB). The aim of this study was to describe the prevalence of IAB and auricular arrhythmias in subjects older than 100 years and to elucidate their clinical implications. We studied 80 centenarians (mean age 101.4 ± 1.5 years; 21 men) with follow-ups of 6–34 months. Of these 80 centenarians, 71 subjects (88.8%) underwent echocardiography. The control group comprised 269 septuagenarians. A total of 23 subjects (28.8%) had normal P wave, 16 (20%) had partial IAB, 21 (26%) had advanced IAB, and 20 (25.0%) had atrial fibrillation/flutter. The IAB groups exhibited premature atrial beats more frequently than did the normal P wave group (35.1% vs 17.4%; P < .001); also, other measurements in the IAB groups frequently fell between values observed in the normal P wave and the atrial fibrillation/flutter groups. These measurements included sex preponderance, mental status and dementia, perceived health status, significant mitral regurgitation, and mortality. The IAB group had a higher previous stroke rate (24.3%) than did other groups. Compared with septuagenarians, centenarians less frequently presented a normal P wave (28.8% vs 53.5%) and more frequently presented advanced IAB (26.3% vs 8.2%), atrial fibrillation/flutter (25.0% vs 10.0%), and premature atrial beats (28.3 vs 7.0%) (P < .01). Relatively few centenarians (<30%) had a normal P wave, and nearly half had IAB. Our data suggested that IAB, particularly advanced IAB, is a pre–atrial fibrillation condition associated with premature atrial beats. Atrial arrhythmias and IAB occurred more frequently in centenarians than in septuagenarians.
Resumo:
SIN FINANCIACIÓN