19 results for probability distributions
at Universidad Politécnica de Madrid
Abstract:
The naïve Bayes approach is a simple but often satisfactory method for supervised classification. In this paper, we focus on the naïve Bayes model and propose the application of regularization techniques to learn a naïve Bayes classifier. The main contribution of the paper is a stagewise version of the selective naïve Bayes, which can be considered a regularized version of the naïve Bayes model. We call it forward stagewise naïve Bayes. For comparison’s sake, we also introduce an explicitly regularized formulation of the naïve Bayes model, where conditional independence (absence of arcs) is promoted via an L1/L2 group penalty on the parameters that define the conditional probability distributions. Although already published in the literature, this idea has only been applied to continuous predictors. We extend this formulation to discrete predictors and propose a modification that yields an adaptive penalization. We show that, whereas the L1/L2 group penalty formulation only discards irrelevant predictors, the forward stagewise naïve Bayes can discard both irrelevant and redundant predictors, which are known to be harmful for the naïve Bayes classifier. Both approaches, however, usually improve the classical naïve Bayes model’s accuracy.
Abstract:
In multi-attribute utility theory, it is often not easy to elicit precise values for the scaling weights representing the relative importance of criteria. A very widespread approach is to gather incomplete information. A recent approach for dealing with such situations is to use information about each alternative's intensity of dominance, known as dominance measuring methods. Different dominance measuring methods have been proposed, and simulation studies have been carried out to compare these methods with each other and with other approaches, but only when ordinal information about weights is available. In this paper, we use Monte Carlo simulation techniques to analyse the performance of such methods and adapt them to deal with weight intervals, weights fitting independent normal probability distributions or weights represented by fuzzy numbers. Moreover, dominance measuring method performance is also compared with a widely used methodology dealing with incomplete information on weights, the stochastic multicriteria acceptability analysis (SMAA). SMAA is based on exploring the weight space to describe the evaluations that would make each alternative the preferred one.
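The weight-space exploration that SMAA relies on can be illustrated with a minimal Monte Carlo sketch. It assumes an additive utility model and weights drawn uniformly from the simplex; the utility matrix, the function names and the sample size are hypothetical and are not taken from the paper.

```python
import random


def sample_simplex_weights(rng, k):
    # Normalised exponential draws are uniformly distributed on the weight simplex.
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def first_rank_acceptability(utilities, n=20_000, seed=0):
    # Share of sampled weight vectors for which each alternative ranks first.
    rng = random.Random(seed)
    k = len(utilities[0])
    wins = [0] * len(utilities)
    for _ in range(n):
        weights = sample_simplex_weights(rng, k)
        scores = [sum(w * u for w, u in zip(weights, row)) for row in utilities]
        wins[scores.index(max(scores))] += 1
    return [count / n for count in wins]


# Hypothetical component utilities: rows are alternatives, columns are criteria.
UTILITIES = [
    [0.9, 0.2, 0.5],
    [0.4, 0.8, 0.6],
    [0.3, 0.1, 0.4],  # dominated by the first alternative on every criterion
]

acceptability = first_rank_acceptability(UTILITIES)
print(acceptability)
```

In this toy setting the third alternative can never rank first, because the first alternative is better on every criterion; its acceptability index is therefore zero whatever the weights are.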
Abstract:
The new Spanish regulation on building acoustics sets values and limits for the different acoustic quantities, whose fulfillment can be verified by means of field measurements. In this sense, an essential aspect of a field measurement is to report the measured quantity together with its associated uncertainty. To calculate the uncertainty, it is very common to follow the uncertainty propagation method described in the Guide to the Expression of Uncertainty in Measurement (GUM). Another option is numerical calculation based on the distribution propagation method by means of Monte Carlo simulation. In fact, several publications already develop this latter method using different software programs. In the present work, we used Excel to run the Monte Carlo simulation and calculate the uncertainty associated with the different quantities derived from field measurements following ISO 140-4, 140-5 and 140-7. We compare the results with those obtained by the uncertainty propagation method. Although both methods give similar values, some small differences have been observed. Some arguments to explain such differences are the asymmetry of the probability distributions associated with the input quantities, the overestimation of the uncertainty following the GUM
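The comparison between the two methods can be sketched in a few lines. The example below is only an illustration, written in Python rather than Excel: it uses the standardized level difference of ISO 140-4, D_nT = L1 - L2 + 10 lg(T/T0) with T0 = 0.5 s, and assumes independent normal inputs. All input values and standard uncertainties are hypothetical.

```python
import math
import random
import statistics

T0 = 0.5  # reference reverberation time in seconds


def standardized_level_difference(l1, l2, t):
    # D_nT = L1 - L2 + 10 lg(T / T0)
    return l1 - l2 + 10.0 * math.log10(t / T0)


def gum_uncertainty(t, u_l1, u_l2, u_t):
    # First-order propagation: sensitivity is 1 for L1, -1 for L2,
    # and 10 / (ln(10) * T) for the reverberation time.
    c_t = 10.0 / (math.log(10.0) * t)
    return math.sqrt(u_l1 ** 2 + u_l2 ** 2 + (c_t * u_t) ** 2)


def monte_carlo_uncertainty(l1, l2, t, u_l1, u_l2, u_t, n=100_000, seed=0):
    # Propagate the input distributions by sampling and evaluating the model.
    rng = random.Random(seed)
    samples = [
        standardized_level_difference(
            rng.gauss(l1, u_l1), rng.gauss(l2, u_l2), rng.gauss(t, u_t)
        )
        for _ in range(n)
    ]
    return statistics.fmean(samples), statistics.stdev(samples)


# Hypothetical measurement: levels in dB, reverberation time in seconds.
L1, L2, T = 60.0, 30.0, 0.5
U_L1, U_L2, U_T = 0.5, 0.5, 0.05

u_gum = gum_uncertainty(T, U_L1, U_L2, U_T)
mc_mean, mc_std = monte_carlo_uncertainty(L1, L2, T, U_L1, U_L2, U_T)
print(f"GUM: {u_gum:.3f} dB, Monte Carlo: {mc_std:.3f} dB (mean {mc_mean:.2f} dB)")
```

Because the logarithm is nonlinear in T, the sampled output distribution is slightly asymmetric, which is one reason the two estimates need not coincide exactly.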
Abstract:
We propose distributed algorithms for sampling networks based on a new class of random walks that we call Centrifugal Random Walks (CRW). A CRW is a random walk that starts at a source and always moves away from it. We propose CRW algorithms for connected networks with arbitrary probability distributions, and for grids and networks with regular concentric connectivity with distance based distributions. All CRW sampling algorithms select a node with the exact probability distribution, do not need warm-up, and end in a number of hops bounded by the network diameter.
Abstract:
This work addresses the problem of modeling real dynamical systems from the study of their time series, using a standard formulation that aims to be a universal abstraction of dynamical systems, irrespective of their deterministic, stochastic or hybrid nature. Deterministic and stochastic models are first developed separately and then merged into a hybrid model, which allows the study of generic mixed systems, that is, those exhibiting both deterministic and random behavior. This model has two components. One is deterministic and consists of a difference equation derived from an autocorrelation study; the other is stochastic and models the error made by the deterministic one. The stochastic component is a universal generator of probability distributions based on a compound process of random variables distributed uniformly within a time-varying interval. This universal generator is derived in the thesis from a new theory of supply and demand for a generic resource. The resulting model can be viewed as an entity with three fundamental elements: an engine generating deterministic dynamics, an internal source of noise generating uncertainty, and an exposure to the environment that represents the interactions between the real system and the external world. In the applications, these three elements are fitted to the history of the time series of the dynamical system.
Once its components have been fitted, the model behaves adaptively, taking the new time series values from the system as inputs and calculating predictions about its future behavior. Every prediction is given as an interval within which any value is equally probable, while any value outside it has zero probability. In this way, the model computes the future behavior and its level of uncertainty based on the current state of the system. The model is applied to very different systems in this thesis, proving to be very flexible when dealing with fields of diverse nature. The exchange of traffic between telephony operators, the evolution of financial markets and the flow of information between servers on the Internet are studied in depth. All these systems are successfully modeled using the same “language”, despite being physically very different systems. The study of telephony networks shows that traffic patterns have a strong weekly pseudo-periodicity mixed with a large amount of noise, especially in the case of international calls. The study of financial markets shows that their underlying nature is random with a relatively bounded range of behavior. Part of this thesis is devoted to explaining some of the most important empirical observations in financial markets, such as “fat tails”, “power laws” and “volatility clustering”. Finally, it is shown that the communication between two servers on the Internet has, as in the case of financial markets, underlying random dynamics but with a narrow range of variability, and that this lack of variability becomes more marked as the distance between servers increases. Two aspects of the model stand out: its adaptability and its universality. The first is due to the fact that, once the general parameters have been fitted, the model is “fed” with the observable values of the system in order to calculate its future behavior.
Although the model has fixed parameters, the variability in the observable values of the system, which are used as inputs to the model, leads to a great variety of possible outputs. The second aspect is due to the general “language” used in the formulation of the hybrid model and to the fact that its parameters are fitted to external manifestations of the system under study rather than to its physical characteristics. These factors make the model suitable for use in a great variety of fields. Lastly, this thesis proposes other fields in which promising preliminary results have been obtained, such as the modeling of financial risk, the development of routing algorithms for telecommunication networks and the assessment of climate change.
Abstract:
Neuronal morphology is a key feature in the study of brain circuits, as it is highly related to information processing and functional identification. Neuronal morphology affects the process of integration of inputs from other neurons and determines the neurons which receive the output of the neurons. Different parts of the neurons can operate semi-independently according to the spatial location of the synaptic connections. As a result, there is considerable interest in the analysis of the microanatomy of nervous cells since it constitutes an excellent tool for better understanding cortical function. However, the morphologies, molecular features and electrophysiological properties of neuronal cells are extremely variable. Except for some special cases, this variability makes it hard to find a set of features that unambiguously define a neuronal type. In addition, there are distinct types of neurons in particular regions of the brain. This morphological variability makes the analysis and modeling of neuronal morphology a challenge. Uncertainty is a key feature in many complex real-world problems. Probability theory provides a framework for modeling and reasoning with uncertainty. Probabilistic graphical models combine statistical theory and graph theory to provide a tool for managing domains with uncertainty. In particular, we focus on Bayesian networks, the most commonly used probabilistic graphical model. In this dissertation, we design new methods for learning Bayesian networks and apply them to the problem of modeling and analyzing morphological data from neurons. The morphology of a neuron can be quantified using a number of measurements, e.g., the length of the dendrites and the axon, the number of bifurcations, the direction of the dendrites and the axon, etc. These measurements can be modeled as discrete or continuous data. The continuous data can be linear (e.g., the length or the width of a dendrite) or directional (e.g., the direction of the axon). 
These data may follow complex probability distributions and may not fit any known parametric distribution. Modeling problems of this kind using hybrid Bayesian networks with discrete, linear and directional variables poses a number of challenges regarding learning from data, inference, etc. In this dissertation, we propose a method for modeling and simulating basal dendritic trees from pyramidal neurons using Bayesian networks to capture the interactions between the variables in the problem domain. A complete set of variables is measured from the dendrites, and a learning algorithm is applied to find the structure and estimate the parameters of the probability distributions included in the Bayesian networks. Then, a simulation algorithm is used to build the virtual dendrites by sampling values from the Bayesian networks, and a thorough evaluation is performed to show the model’s ability to generate realistic dendrites. In this first approach, the variables are discretized so that discrete Bayesian networks can be learned and simulated. Then, we address the problem of learning hybrid Bayesian networks with different kinds of variables. Mixtures of polynomials have been proposed as a way of representing probability densities in hybrid Bayesian networks. We present a method for learning mixture-of-polynomials approximations of one-dimensional, multidimensional and conditional probability densities from data. The method is based on basis spline interpolation, where a density is approximated as a linear combination of basis splines. The proposed algorithms are evaluated using artificial datasets. We also use the proposed methods as a non-parametric density estimation technique in Bayesian network classifiers. Next, we address the problem of including directional data in Bayesian networks. These data have some special properties that rule out the use of classical statistics.
Therefore, different distributions and statistics, such as the univariate von Mises and the multivariate von Mises–Fisher distributions, should be used to deal with this kind of information. In particular, we extend the naive Bayes classifier to the case where the conditional probability distributions of the predictive variables given the class follow either of these distributions. We consider the simple scenario, where only directional predictive variables are used, and the hybrid case, where discrete, Gaussian and directional distributions are mixed. The classifier decision functions and their decision surfaces are studied at length. Artificial examples are used to illustrate the behavior of the classifiers. The proposed classifiers are empirically evaluated over real datasets. We also study the problem of interneuron classification. An extensive group of experts is asked to classify a set of neurons according to their most prominent anatomical features. A web application is developed to retrieve the experts’ classifications. We compute agreement measures to analyze the consensus between the experts when classifying the neurons. Using Bayesian networks and clustering algorithms on the resulting data, we investigate the suitability of the anatomical terms and neuron types commonly used in the literature. Additionally, we apply supervised learning approaches to automatically classify interneurons using the values of their morphological measurements. Then, a methodology for building a model which captures the opinions of all the experts is presented. First, one Bayesian network is learned for each expert, and we propose an algorithm for clustering Bayesian networks corresponding to experts with similar behaviors. Then, a Bayesian network which represents the opinions of each group of experts is induced. Finally, a consensus Bayesian multinet which models the opinions of the whole group of experts is built. 
A thorough analysis of the consensus model identifies different behaviors between the experts when classifying the interneurons in the experiment. A set of characterizing morphological traits for the neuronal types can be defined by performing inference in the Bayesian multinet. These findings are used to validate the model and to gain some insights into neuron morphology. Finally, we study a classification problem where the true class label of the training instances is not known. Instead, a set of class labels is available for each instance. This is inspired by the neuron classification problem, where a group of experts is asked to individually provide a class label for each instance. We propose a novel approach for learning Bayesian networks using count vectors which represent the number of experts who selected each class label for each instance. These Bayesian networks are evaluated using artificial datasets from supervised learning problems.
Abstract:
Stochastic model updating must be considered for quantifying the uncertainties inherent in real-world engineering structures. In this way, the statistical properties of structural parameters, rather than their deterministic values, can be sought in order to indicate parameter variability. However, stochastic model updating is much more complicated to implement than deterministic methods, particularly because of its theoretical complexity and low computational efficiency. This study proposes a simple and cost-efficient method that decomposes a stochastic updating process into a series of deterministic ones with the aid of response surface models and Monte Carlo simulation. The response surface models are used as surrogates for the original FE models in the interest of simpler programming, fast response computation and easy inverse optimization. Monte Carlo simulation is adopted for generating samples from the assumed or measured probability distributions of responses. Each sample corresponds to an individual deterministic inverse process predicting the deterministic values of the parameters. The parameter means and variances can then be statistically estimated from the parameter predictions obtained by running all the samples. Meanwhile, the analysis of variance approach is employed to evaluate the significance of parameter variability. The proposed method has been demonstrated first on a numerical beam and then on a set of nominally identical steel plates tested in the laboratory. It is found that, compared with existing stochastic model updating methods, the proposed method achieves similar accuracy, while its primary merits are its simple implementation and its cost efficiency in response computation and inverse optimization.
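The decomposition described in this abstract can be sketched for a single parameter. The quadratic response surface, the bisection-based inverse step and all numerical values below are hypothetical stand-ins chosen for illustration; they are not the surrogate or the optimizer used in the study.

```python
import random
import statistics


def response_surface(p):
    # Hypothetical quadratic surrogate standing in for an FE model response.
    return 2.0 + 3.0 * p + 0.5 * p * p


def invert_response(target, lo=0.0, hi=5.0, iterations=60):
    # Deterministic inverse step: bisection, valid because the surrogate
    # is strictly increasing on [lo, hi].
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if response_surface(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def stochastic_update(resp_mean, resp_std, n=5_000, seed=0):
    # Each sampled response is inverted on its own; the parameter statistics
    # are then estimated from the collection of deterministic solutions.
    rng = random.Random(seed)
    estimates = [invert_response(rng.gauss(resp_mean, resp_std)) for _ in range(n)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


# Assumed response distribution: centred on the response at p = 1.0.
param_mean, param_std = stochastic_update(response_surface(1.0), 0.1)
print(f"parameter mean {param_mean:.4f}, standard deviation {param_std:.4f}")
```

With this surrogate the slope at p = 1.0 is 4, so a response standard deviation of 0.1 maps to a parameter standard deviation of roughly 0.025, which the sampled estimate should reproduce.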
Abstract:
The verified security methodology is an emerging approach to build high assurance proofs about security properties of computer systems. Computer systems are modeled as probabilistic programs and one relies on rigorous program semantics techniques to prove that they comply with a given security goal. In particular, it advocates the use of interactive theorem provers or automated provers to build fully formal machine-checked versions of these security proofs. The verified security methodology has proved successful in modeling and reasoning about several standard security notions in the area of cryptography. However, it has fallen short of covering an important class of approximate, quantitative security notions. The distinguishing characteristic of this class of security notions is that they are stated as a “similarity” condition between the output distributions of two probabilistic programs, and this similarity is quantified using some notion of distance between probability distributions.
This class comprises prominent security notions from multiple areas such as private data analysis, information flow analysis and cryptography. These include, for instance, indifferentiability, which enables securely replacing an idealized component of a system with a concrete implementation, and differential privacy, a notion of privacy-preserving data mining that has received a great deal of attention in the last few years. The lack of rigorous techniques for verifying these properties is thus an important problem that needs to be addressed. In this dissertation we introduce several quantitative program logics to reason about this class of security notions. Our main theoretical contribution is, in particular, a quantitative variant of a full-fledged relational Hoare logic for probabilistic programs. The soundness of these logics is fully formalized in the Coq proof assistant, and tool support is also available through an extension of CertiCrypt, a framework to verify cryptographic proofs in Coq. We validate the applicability of our approach by building fully machine-checked proofs for several systems that were out of the reach of the verified security methodology. These comprise, among others, a construction to build “safe” hash functions into elliptic curves and differentially private algorithms for several combinatorial optimization problems from the recent literature.
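As a rough illustration of what a “similarity” condition between output distributions can look like, the sketch below computes the smallest ε for which two discrete distributions satisfy the differential-privacy bound p(x) ≤ e^ε · q(x) in both directions. The helper names `dp_epsilon` and `randomized_response`, and the randomized-response mechanism itself, are assumptions chosen for the example; they are not taken from the thesis or from CertiCrypt.

```python
import math


def dp_epsilon(p, q):
    """Smallest eps with p(x) <= exp(eps) * q(x) and q(x) <= exp(eps) * p(x) for all x.

    p and q are dicts mapping outcomes to probabilities.
    Returns math.inf if one distribution gives zero mass where the other does not.
    """
    eps = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        if px == 0.0 and qx == 0.0:
            continue
        if px == 0.0 or qx == 0.0:
            return math.inf
        eps = max(eps, abs(math.log(px / qx)))
    return eps


def randomized_response(bit, p_truth=0.75):
    """Output distribution of a mechanism that reports the true bit with probability p_truth."""
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}


# Two "adjacent" inputs (the secret bit is 0 or 1) and the distance between their outputs.
eps = dp_epsilon(randomized_response(0), randomized_response(1))
print(f"epsilon = {eps:.4f}")  # equals ln(0.75 / 0.25) = ln 3
```

A program logic of the kind described above would establish such a bound symbolically for every pair of adjacent inputs, rather than by enumerating outcomes as this toy check does.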
Resumo:
We propose a new Bayesian framework for automatically determining the position (location and orientation) of an uncalibrated camera using the observations of moving objects and a schematic map of the passable areas of the environment. Our approach takes advantage of static and dynamic information on the scene structures through prior probability distributions for object dynamics. The proposed approach restricts the plausible positions where the sensor can be located while taking into account the inherent ambiguity of the given setting. The proposed framework samples from the posterior probability distribution for the camera position via data-driven MCMC, guided by an initial geometric analysis that restricts the search space. A Kullback-Leibler divergence analysis then yields the final camera position estimate, while explicitly isolating ambiguous settings. The proposed approach is evaluated in synthetic and real environments, showing its satisfactory performance in both ambiguous and unambiguous settings.
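The abstract does not say how the Kullback-Leibler analysis is set up, so the following is only a generic sketch of the quantity involved: the KL divergence between two discrete distributions, for instance histograms of posterior samples over candidate camera positions. The function name `kl_divergence` and the example histograms are hypothetical.

```python
import numpy as np


def kl_divergence(p, q, floor=1e-12):
    """KL(p || q) in nats for two discrete distributions given as arrays of weights."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize so both arrays are proper distributions
    q = q / q.sum()
    mask = p > 0     # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], floor))))


# Hypothetical histograms over four candidate position cells.
peaked = [0.85, 0.05, 0.05, 0.05]   # one clearly preferred position
bimodal = [0.45, 0.45, 0.05, 0.05]  # two competing positions (an ambiguous setting)
uniform = [0.25, 0.25, 0.25, 0.25]

print(kl_divergence(peaked, uniform), kl_divergence(bimodal, uniform))
```

In this toy setup the peaked histogram lies farther from the uniform one than the bimodal histogram does, which is one simple way a divergence can help separate unambiguous from ambiguous settings.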
Resumo:
Colombia is one of the largest per capita mercury polluters in the world as a consequence of its artisanal gold mining activities. The severity of this problem in terms of potential health effects was evaluated by means of a probabilistic risk assessment carried out in the twelve departments (or provinces) in Colombia with the largest gold production. The two exposure pathways included in the risk assessment were inhalation of elemental Hg vapors and ingestion of fish contaminated with methyl mercury. Exposure parameters for the adult population (especially rates of fish consumption) were obtained from nation-wide surveys, and concentrations of Hg in air and of methyl mercury in fish were gathered from previous scientific studies. Fish consumption varied between departments and ranged from 0 to 0.3 kg d⁻¹. Average concentrations of total mercury in fish (70 data points) ranged from 0.026 to 3.3 µg g⁻¹. A total of 550 individual measurements of Hg in workshop air (ranging from <DL to 1 mg m⁻³) and 261 measurements of Hg in outdoor air (ranging from <DL to 0.652 mg m⁻³) were used to generate the probability distributions used as concentration terms in the calculation of risk. All but two of the distributions of Hazard Quotients (HQ) associated with ingestion of Hg-contaminated fish for the twelve regions evaluated presented median values higher than the threshold value of 1, and the 95th percentiles ranged from 4 to 90. In the case of exposure to Hg vapors, minimum values of HQ for the general population exceeded 1 in all the towns included in this study, and the HQs for miner-smelters burning the amalgam are two orders of magnitude higher, reaching values of 200 for the 95th percentile. 
Even acknowledging the conservative assumptions included in the risk assessment and the uncertainties associated with it, its results clearly reveal the exorbitant levels of risk endured not only by miner-smelters but also by the general population of artisanal gold mining communities in Colombia.
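A probabilistic hazard quotient of this kind can be sketched as a Monte Carlo simulation in which HQ is the ratio of an estimated daily dose to a reference dose. Everything numeric below is an assumption made for illustration: the log-normal concentration, the uniform intake over the 0 to 0.3 kg d⁻¹ range quoted above, the body-weight distribution and the reference dose of 0.1 µg per kg per day are not the study's inputs.

```python
import numpy as np

RFD_MEHG = 0.1  # assumed reference dose for methyl mercury, in µg per kg body weight per day


def simulate_hazard_quotients(n=100_000, seed=0):
    """Monte Carlo sample of HQ = dose / reference dose for fish ingestion."""
    rng = np.random.default_rng(seed)
    conc = rng.lognormal(mean=np.log(0.3), sigma=0.8, size=n)           # µg Hg per g of fish (assumed)
    intake = rng.uniform(0.0, 300.0, size=n)                            # g of fish per day (0 to 0.3 kg)
    body_weight = np.clip(rng.normal(70.0, 10.0, size=n), 40.0, None)   # kg (assumed)
    dose = conc * intake / body_weight                                  # µg per kg per day
    return dose / RFD_MEHG


hq = simulate_hazard_quotients()
median, p95 = np.percentile(hq, [50, 95])
print(f"median HQ = {median:.2f}, 95th percentile HQ = {p95:.2f}")
```

Reporting the median and the 95th percentile of the simulated HQ distribution mirrors the way the results are summarized in the abstract.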
Resumo:
The purpose of this work is to describe the heavy rainfall phenomenon in a Spanish region using statistical tools. We want to quantify the effect of climate change and to gauge how quickly it is evolving through the variation of the probability distributions. Our conclusions are of particular interest for agricultural insurance, since they allow costs to be estimated more realistically. The analysis focuses mainly on two points. The first is the distribution of consecutive days without rain for each gauge station and season. We estimate kernel density functions and the Generalized Pareto Distribution (GPD) for a network of stations in the Ebro River basin up to a threshold value u. We can establish a relation between distributional parameters and regional characteristics. Moreover, we pay particular attention to the tail of the probability distribution. These tails are governed by a power law, meaning that the number of events n can be expressed as a power of another quantity x: $n(x) = x^{\alpha}$. The exponent $\alpha$ can be estimated as the slope of the log-log plot of the number of events against their size. The most convenient way to analyze n(x) is through the empirical probability distribution, $\Pr(X > x) \propto x^{-\alpha}$. The second is the distribution of rainfall above the 0.95 percentile of wet days, at the seasonal and yearly scales, with the same treatment of the tails as in the first point.
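The log-log slope estimate mentioned above can be sketched as follows: build the empirical exceedance probability Pr(X > x) for the observations above a threshold and fit a straight line to its logarithm against log x. The function name `tail_exponent` and the synthetic Pareto sample are assumptions for the example; no rainfall data from the study are used.

```python
import numpy as np


def tail_exponent(sample, threshold):
    """Estimate alpha in Pr(X > x) ~ x**(-alpha) from the log-log slope of the empirical tail."""
    tail = np.sort(np.asarray(sample, dtype=float))
    tail = tail[tail > threshold]
    n = tail.size
    # Empirical Pr(X > x) at each sorted observation: (n - rank) / n.
    exceedance = 1.0 - np.arange(1, n + 1) / n
    # Drop the largest observation, where the empirical exceedance is zero.
    x, y = tail[:-1], exceedance[:-1]
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return -slope


# Synthetic check: a Pareto sample with known tail exponent 2.5 and minimum value 1.
rng = np.random.default_rng(0)
synthetic = rng.pareto(2.5, size=20_000) + 1.0
alpha_hat = tail_exponent(synthetic, threshold=1.0)
print(f"estimated alpha = {alpha_hat:.2f}")
```

Least-squares fits on log-log plots are easy to read but can be biased; a maximum-likelihood fit of the GPD is the usual alternative when a more careful estimate is needed.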
Resumo:
Interneuron classification is an important and long-debated topic in neuroscience. A recent study provided a data set of digitally reconstructed interneurons classified by 42 leading neuroscientists according to a pragmatic classification scheme composed of five categorical variables, namely, the interneuron type and four features of axonal morphology. From this data set we have now learned a model which can classify interneurons, on the basis of their axonal morphometric parameters, into these five descriptive variables simultaneously. Because of differences in opinion among the neuroscientists, especially regarding neuronal type, for many interneurons we lacked a unique, agreed-upon classification which we could use to guide model learning. Instead, we guided model learning with a probability distribution over the neuronal type and the axonal features, obtained, for each interneuron, from the neuroscientists’ classification choices. We conveniently encoded such probability distributions with Bayesian networks, calling them label Bayesian networks (LBNs), and developed a method to predict them. This method predicts an LBN by forming a probabilistic consensus among the LBNs of the interneurons most similar to the one being classified. We used 18 axonal morphometric parameters as predictor variables, 13 of which we introduce in this paper as quantitative counterparts to the categorical axonal features. We were able to accurately predict interneuronal LBNs. Furthermore, when extracting crisp (i.e., non-probabilistic) predictions from the predicted LBNs, our method outperformed related work on interneuron classification. Our results indicate that our method is adequate for multi-dimensional classification of interneurons with probabilistic labels. 
Moreover, the introduced morphometric parameters are good predictors of interneuron type and the four features of axonal morphology and thus may serve as objective counterparts to the subjective, categorical axonal features.
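The idea of forming a probabilistic consensus among the most similar interneurons can be illustrated with a much simpler stand-in: averaging the label distributions of the k nearest neighbors in predictor space. This sketch handles a single class variable with plain probability vectors, whereas the paper's LBNs are Bayesian networks over five variables; the function name `predict_label_distribution` and the toy data are hypothetical.

```python
import numpy as np


def predict_label_distribution(x_new, X_train, P_train, k=3):
    """Average the label distributions of the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance in predictor space
    nearest = np.argsort(distances)[:k]
    consensus = P_train[nearest].mean(axis=0)
    return consensus / consensus.sum()


# Toy data: two morphometric predictors per cell, and the fraction of experts
# voting for each of two hypothetical types.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
P_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8]])

predicted = predict_label_distribution(np.array([0.05, 0.05]), X_train, P_train, k=3)
crisp = int(np.argmax(predicted))  # a crisp label extracted from the distribution
print(predicted, crisp)
```

Taking the most probable label, as in the last step, corresponds to the extraction of crisp predictions mentioned in the abstract.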
Resumo:
The Integrated Safety Assessment (ISA) methodology, developed by the Spanish Nuclear Safety Council (CSN), has been applied to a thermal-hydraulic analysis of PWR Station Blackout (SBO) sequences in the context of the IDPSA (Integrated Deterministic-Probabilistic Safety Assessment) network objectives. The ISA methodology makes it possible to obtain the damage domain (the region of the uncertain parameter space where the damage limit is exceeded) for each sequence of interest as a function of the operator actuation times. Given a particular safety limit or damage limit, several data for each sequence are needed in order to obtain the exceedance frequency of that limit. In this application these data are obtained from the results of the transient simulations performed with the MAAP code inside each damage domain and from the time-density probability distributions of the manual actions. The damage limits taken into account in this analysis are: local cladding damage (PCT>1477 K); local fuel melting (T>2499 K); fuel relocation in the lower plenum and vessel failure. Each of these damage variables therefore has its own damage domain. The operation of the new passive thermal shutdown seals developed by several companies since the Fukushima accident is considered in the paper. The results show the capability and necessity of the ISA methodology, or a similar one, for obtaining accurate results that take time uncertainties into account.
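In a one-dimensional simplification, the exceedance frequency of a damage limit is the frequency of the sequence multiplied by the probability that the operator action time falls inside the damage domain. The sketch below assumes a log-normal time density and a damage domain given as time intervals; the function names, the distribution and all numbers are illustrative assumptions, not values from the ISA application or from MAAP.

```python
import math


def lognormal_cdf(t, mu, sigma):
    """CDF of a log-normal distribution with log-mean mu and log-standard-deviation sigma."""
    if t <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))


def exceedance_frequency(sequence_frequency, damage_intervals, mu, sigma):
    """Sequence frequency times the probability that the action time lies in the damage domain.

    damage_intervals is a list of (t_low, t_high) pairs, in the same time unit as exp(mu).
    """
    probability = sum(
        lognormal_cdf(t_high, mu, sigma) - lognormal_cdf(t_low, mu, sigma)
        for t_low, t_high in damage_intervals
    )
    return sequence_frequency * probability


# Hypothetical case: damage occurs if the manual action comes later than 2 hours.
freq = exceedance_frequency(
    sequence_frequency=1.0e-5,           # assumed sequence frequency, per year
    damage_intervals=[(2.0, math.inf)],  # assumed damage domain, in hours
    mu=math.log(1.0),                    # assumed median action time of 1 hour
    sigma=0.5,
)
print(f"exceedance frequency = {freq:.3e} per year")
```

With several uncertain times the damage domain becomes a multidimensional region and the sum over intervals becomes an integral of the joint density over that region.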
Resumo:
The main objective of this research is to design an Operational Risk Management Model (MORM) according to the guidelines of Accords II and III of the Basel Committee on Banking Supervision of the Bank for International Settlements (BCBS-BIS). A study of this subject is considered important because operational risks (OpR) are largely responsible for the recent world financial crises and because they are difficult to detect in organizations. A management model has been proposed that is subdivided into two paths of influence. The first embraces the holistic paradigm, which holds that there are multiple ways of perceiving a cyclical process, together with the tools to observe, know and understand the perceived object or subject. The second path is the totalizing paradigm, in which both qualitative and quantitative data are obtained, the two being complementary. In addition, this work proposes the design of a qualitative OpR software program, designed to determine the root of the risks in organizations and their Operational Value at Risk (OpVaR) based on the basic indicator method. Applying the holistic cycle to the case study yielded the following research design: non-experimental, univariate, descriptive cross-sectional, contemporary, retrospective, mixed-source, qualitative (phenomenological and ethnographic) and quantitative (descriptive and analytical). Decision making and data collection were carried out in two phases in the study unit. The first considered the whole of the Corpoelec-EDELCA company, which presented a statistical universe of 4271 people, a population of 2390 people and a sampling unit of 87 people. 
The process was repeated in a second phase for the Simón Bolívar Hydroelectric Power Plant, giving a second statistical universe of 300 workers, a population of 191 people and a sample of 58 professionals. Primary and secondary sources were used to collect information. To gather the primary information, direct observations were made, two surveys were conducted to detect the areas and processes with the highest level of risk, and a questionnaire combined with another (ad hoc) survey was designed to establish the estimates of frequency and severity of operational losses. The information from secondary sources was extracted from the databases of Corpoelec-EDELCA, the IEA, the World Bank, the BCBS-BIS, the UPM and UC at Berkeley, among others. The frequency and severity distributions of operational losses were established as the independent variables and the OpVaR as the dependent variable. No monitoring or control of the variables under analysis was performed, since they were considered at a specific instant and are determined only to establish the existence and point-in-time valuation of the OpR in the study unit. The qualitative analysis proposed in the MORM revealed that, in the research unit, 67% of the OpR detected come from two main sources: processes (32%) and external events (35%). In addition, the validation of the MORM at Corpoelec-EDELCA revealed that 63% of the OpR in the organization come from three main categories, with external fraud being the one that occurs with the greatest regularity and severity of losses in the organization. Risk exposure was determined by adapting the concept of OpVaR, which is generally used for time series and which, in the case study, has the novelty of being applied to qualitative data transformed with the Likert scale. 
The possibility of using probability distributions typical of quantitative data for frequency and loss-severity distributions with data of qualitative origin was analyzed. For 64% of the OpR studied, the frequency behaves similarly to the Poisson probability distribution, and in 55% of the cases the log-normal distributions were the most common probability distributions for loss severity, from which it was concluded that the approaches suggested by the BCBS-BIS for time series are applicable to qualitative data. Once the frequency and loss-severity distributions had been obtained, they were convolved using the Monte Carlo method, yielding the loss distribution approaches (LDA) for each of the OpR. The OpVaR was derived, as the BCBS-BIS suggests, from the 99.9th or 99th percentile of each LDA; the OpR were found to behave similarly to the financial system, the most dangerous being those of low frequency and high impact, because they are difficult to detect and monitor. Finally, it is considered that the MORM will allow market agents and their stakeholders to know the status of their entities effectively, reliably and efficiently, which will reduce the uncertainty of their investments and allow them to establish a new management culture in their organizations. ABSTRACT The main objective of this research is the design of a Model for Operational Risk Management (MORM) according to the guidelines of Accords II and III of the Basel Committee on Banking Supervision of the Bank for International Settlements (BCBS-BIS). It is considered important to conduct a study on this issue since operational risks (OpR) are largely responsible for the recent world financial crises and are difficult to detect in organizations. 
A management model has been designed which is divided into two paths of influence. The first supports the holistic paradigm, in which it is considered that there are multiple ways of perceiving a cyclical process, and contains the tools to observe, know and understand the subject or object perceived. The second path is the totalizing paradigm, in which both qualitative and quantitative data are obtained, which are complementary to each other. Moreover, this paper presents the design of qualitative OpR software which is designed to determine the root of risks in organizations and their Operational Value at Risk (OpVaR) based on the basic indicator approach. Applying the holistic cycle to the case study, the following research design was obtained: non-experimental, univariate, descriptive cross-sectional, contemporary, retrospective, mixed-source, qualitative (phenomenological and ethnographic) and quantitative (descriptive and analytical). Decision making and data collection were conducted in two phases in the study unit. The first took into account the totality of the Corpoelec-EDELCA company, which presented a statistical universe of 4271 individuals, a population of 2390 individuals and a sampling unit of 87 individuals. The process was repeated in a second phase for the Simón Bolívar Hydroelectric Power Plant, and a second statistical universe of 300 workers, a population of 191 people and a sample of 58 professionals was determined. Primary and secondary sources were used to gather information. To obtain the primary information, direct observations were conducted and two surveys to identify the areas and processes with higher risks were designed. A questionnaire combined with an ad hoc survey was also designed to establish estimates of the frequency and severity of operational losses. The secondary information was extracted from the databases of Corpoelec-EDELCA, IEA, the World Bank, the BCBS-BIS, UPM and UC at Berkeley, among others. 
The operational loss frequency distributions and the operational loss severity distributions were established as the independent variables and OpVaR as the dependent variable. No monitoring or control of the variables under analysis was performed, as these were considered for a specific time and are determined only for the purpose of establishing the existence and point-in-time assessment of the OpR in the study unit. The qualitative analysis proposed in the MORM made it possible to detect that, in the research unit, 67% of detected OpR come from two main sources: processes (32%) and external events (35%). Additionally, validation of the MORM in Corpoelec-EDELCA made it possible to estimate that 63% of OpR in the organization come from three main categories, with external fraud occurring most regularly and with the greatest severity of losses in the organization. Risk exposure was determined by adapting the concept of OpVaR, which is generally used for time series and which, in the case study, has the novelty of being applied to qualitative data transformed with the Likert scale. The possibility of using probability distributions typical of quantitative data for loss frequency and loss severity distributions with data of qualitative origin was analyzed. For 64% of the OpR studied, the frequency was found to behave similarly to the Poisson probability distribution, and in 55% of the cases the log-normal distributions were found to be the most common probability distributions for loss severity. It was concluded that the approach suggested by the BCBS-BIS for time series can be applied to qualitative data. Once the distributions of loss frequency and severity had been obtained, they were convolved using the Monte Carlo method. Thus the loss distribution approaches (LDA) were obtained for each of the OpR. The OpVaR was derived, as suggested by the BCBS-BIS, from the 99.9th or 99th percentile of each of the LDA. 
It was determined that the OpR exhibit a behavior similar to that of the financial system, the most dangerous being those with low frequency and high impact, because they are difficult to detect and monitor. Finally, it is considered that the MORM will allow market players and their stakeholders to know the status of their entities effectively, efficiently and reliably, which will reduce the uncertainty of their investments and enable them to establish a new management culture in their organizations.
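The convolution step described above can be sketched as a Monte Carlo simulation: draw a Poisson number of loss events per period, draw a log-normal severity for each event, sum the severities per period, and read the OpVaR off a high percentile of the resulting aggregate loss distribution. The function name `simulate_aggregate_losses` and the parameters (λ = 3, log-mean 10, log-standard-deviation 1.2) are assumptions for the example, not values fitted in the thesis.

```python
import numpy as np


def simulate_aggregate_losses(lam, mu, sigma, n_sims=100_000, seed=0):
    """Aggregate loss per simulated period for Poisson frequency and log-normal severity."""
    rng = np.random.default_rng(seed)
    counts = rng.poisson(lam, size=n_sims)                       # number of loss events per period
    severities = rng.lognormal(mean=mu, sigma=sigma, size=counts.sum())
    period = np.repeat(np.arange(n_sims), counts)                # period index of each loss event
    # Sum the severities belonging to each period; periods with no events get 0.
    return np.bincount(period, weights=severities, minlength=n_sims)


losses = simulate_aggregate_losses(lam=3.0, mu=10.0, sigma=1.2)
op_var_999 = np.percentile(losses, 99.9)  # OpVaR at the 99.9th percentile
op_var_99 = np.percentile(losses, 99.0)   # OpVaR at the 99th percentile
print(f"OpVaR 99.9% = {op_var_999:,.0f}, OpVaR 99% = {op_var_99:,.0f}")
```

Because the OpVaR sits far out in the tail, the number of simulated periods matters: at the 99.9th percentile only about one simulation in a thousand lies beyond the estimate.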