866 results for expectation maximization
Abstract:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Abstract:
HLA-G has an important role in the modulation of the maternal immune system during pregnancy, and evidence that balancing selection acts in the promoter and 3′UTR regions has been previously reported. To determine whether selection acts on the HLA-G coding region in the Amazon Rainforest, exons 2, 3 and 4 were analyzed in a sample of 142 Amerindians from nine villages of five isolated tribes that inhabit the Central Amazon. Six previously described single-nucleotide polymorphisms (SNPs) were identified, and the Expectation-Maximization (EM) and PHASE algorithms were used to computationally reconstruct SNP haplotypes (HLA-G alleles). A new HLA-G allele, which originated in Amerindian populations by a crossing-over event between two widespread HLA-G alleles, was identified in 18 individuals. Neutrality tests showed that natural selection plays a complex role in the HLA-G coding region. Although balancing selection is the type of selection that shapes variability at a local level (Native American populations), we have also shown that purifying selection may occur on a worldwide scale. Moreover, balancing selection does not seem to act on the coding region as strongly as it acts on the flanking regulatory regions, and such a coding signature may actually reflect a hitchhiking effect. Genes and Immunity advance online publication, 3 October 2013; doi:10.1038/gene.2013.47.
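The EM haplotype reconstruction mentioned in the abstract above can be illustrated with a minimal sketch. It assumes only two biallelic SNPs and unphased genotypes coded as copies of the '1' allele; the function name `em_haplotype_freqs` and the data layout are illustrative assumptions, not the procedure used in the study (which also relied on PHASE).

```python
from itertools import product


def em_haplotype_freqs(genotypes, n_iter=100, tol=1e-10):
    """Estimate frequencies of the haplotypes 00, 01, 10, 11.

    genotypes: list of (g1, g2), each g in {0, 1, 2}, counting copies of
    the '1' allele at SNP 1 and SNP 2 (phase unknown).
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freqs = {h: 0.25 for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            # Ordered haplotype pairs compatible with this genotype.
            pairs = [(a, b) for a, b in product(haps, repeat=2)
                     if a[0] + b[0] == g[0] and a[1] + b[1] == g[1]]
            weights = [freqs[a] * freqs[b] for a, b in pairs]
            total = sum(weights)
            # E-step: share this individual's two haplotypes among the
            # compatible pairs in proportion to their current probability.
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: new frequencies are expected counts over 2N chromosomes.
        new = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new[h] - freqs[h]) for h in haps)
        freqs = new
        if delta < tol:
            break
    return freqs
```

Only double heterozygotes, coded (1, 1), are phase-ambiguous here; every other genotype is compatible with a single unordered haplotype pair, so the E-step leaves its haplotype counts fixed.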
Abstract:
Background: In honeybees, differential feeding of female larvae promotes the occurrence of two different phenotypes, a queen and a worker, from identical genotypes, through incremental alterations, which affect general growth, and character state alterations that result in the presence or absence of specific structures. Although previous studies revealed a link between incremental alterations and differential expression of physiometabolic genes, the molecular changes accompanying character state alterations remain unknown.
Results: By using cDNA microarray analyses of >6,000 Apis mellifera ESTs, we found 240 differentially expressed genes (DEGs) between developing queens and workers. Many genes recorded as up-regulated in prospective workers appear to be unique to A. mellifera, suggesting that the workers' developmental pathway involves the participation of novel genes. Workers up-regulate more developmental genes than queens, whereas queens up-regulate a greater proportion of physiometabolic genes, including genes coding for metabolic enzymes and genes whose products are known to regulate the rate of mass-transforming processes and the general growth of the organism (e.g., tor). Many DEGs are likely to be involved in processes favoring the development of caste-biased structures, like brain, legs and ovaries, as well as genes that code for cytoskeleton constituents. Treatment of developing worker larvae with juvenile hormone (JH) revealed 52 JH responsive genes, specifically during the critical period of caste development. Using Gibbs sampling and Expectation Maximization algorithms, we discovered eight overrepresented cis-elements from four gene groups. Graph theory and complex networks concepts were adopted to attain powerful graphical representations of the interrelation between cis-elements and genes and to objectively quantify the degree of relationship between these entities.
Conclusion: We suggest that clusters of functionally related DEGs are co-regulated during caste development in honeybees. This network of interactions is activated by nutrition-driven stimuli in early larval stages. Our data are consistent with the hypothesis that JH is a key component of the developmental determination of queen-like characters. Finally, we propose a conceptual model of caste differentiation in A. mellifera based on gene-regulatory networks.
Abstract:
Latent class analysis (LCA) and latent class regression (LCR) are widely used for modeling multivariate categorical outcomes in social sciences and biomedical studies. Standard analyses assume data of different respondents to be mutually independent, excluding application of the methods to familial and other designs in which participants are clustered. In this paper, we develop a multilevel latent class model, in which subpopulation mixing probabilities are treated as random effects that vary among clusters according to a common Dirichlet distribution. We apply the Expectation-Maximization (EM) algorithm for model fitting by maximum likelihood (ML). This approach works well, but is computationally intensive when either the number of classes or the cluster size is large. We propose a maximum pairwise likelihood (MPL) approach via a modified EM algorithm for this case. We also show that a simple latent class analysis, combined with robust standard errors, provides another consistent, robust, but less efficient inferential procedure. Simulation studies suggest that the three methods work well in finite samples, and that the MPL estimates often have precision comparable to that of the ML estimates. We apply our methods to the analysis of comorbid symptoms in the Obsessive Compulsive Disorder study. Our models' random effects structure has a more straightforward interpretation than those of competing methods, and thus should usefully augment the tools available for latent class analysis of multilevel data.
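As a point of reference for the EM fitting described above, here is a minimal sketch of EM for a simple, single-level latent class model with binary items. It does not implement the multilevel Dirichlet random effects or the pairwise-likelihood variant proposed in the paper; the function name `em_latent_class` and its arguments are assumptions made for illustration.

```python
def em_latent_class(data, weights, item_probs, n_iter=200):
    """EM for a latent class model with binary (0/1) items.

    data:       list of tuples of 0/1 responses, one tuple per respondent.
    weights:    initial class mixing probabilities, one per class.
    item_probs: initial P(item = 1 | class), indexed [class][item].
    Initial values must differ between classes, otherwise the classes
    can never separate.
    """
    n_items = len(data[0])
    n_classes = len(weights)
    for _ in range(n_iter):
        # E-step: posterior class membership for every respondent.
        resp = []
        for x in data:
            lik = []
            for c in range(n_classes):
                p = weights[c]
                for j in range(n_items):
                    p *= item_probs[c][j] if x[j] == 1 else 1.0 - item_probs[c][j]
                lik.append(p)
            total = sum(lik)
            resp.append([value / total for value in lik])
        # M-step: weighted proportions given the posterior memberships.
        class_size = [sum(r[c] for r in resp) for c in range(n_classes)]
        weights = [size / len(data) for size in class_size]
        item_probs = [
            [sum(r[c] * x[j] for r, x in zip(resp, data)) / class_size[c]
             for j in range(n_items)]
            for c in range(n_classes)
        ]
    return weights, item_probs
```

A call such as `em_latent_class(data, [0.5, 0.5], [[0.6] * 3, [0.4] * 3])` fits two classes to three items; the sketch assumes no class becomes empty during fitting.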
Abstract:
The degree of polarization of a reflected field from active laser illumination can be used for object identification and classification. The goal of this study is to investigate methods for estimating the degree of polarization for reflected fields with active laser illumination, which involves the measurement and processing of two orthogonal field components (complex amplitudes), two orthogonal intensity components, and the total field intensity. We propose to replace interferometric optical apparatuses with a computational approach for estimating the degree of polarization from two orthogonal intensity data and total intensity data. Cramér-Rao bounds for each of the three sensing modalities with various noise models are computed. Algebraic estimators and maximum-likelihood (ML) estimators are proposed. An active-set algorithm and an expectation-maximization (EM) algorithm are used to compute the ML estimates. The performances of the estimators are compared with each other and with their corresponding Cramér-Rao bounds. Estimators for four-channel polarimeter (intensity interferometer) sensing perform better than orthogonal-intensity estimators and total-intensity estimators. Processing the four intensity data from a polarimeter, however, requires complicated optical devices, alignment, and four CCD detectors. Processing orthogonal intensity data and total intensity data requires only one or two detectors and a computer, and the bounds and estimator performances demonstrate that reasonable estimates may still be obtained from orthogonal intensities or total intensity data. Computational sensing is a promising way to estimate the degree of polarization.
Abstract:
This paper presents a comparison of principal component (PC) regression and regularized expectation maximization (RegEM) to reconstruct European summer and winter surface air temperature over the past millennium. Reconstruction is performed within a surrogate climate using the National Center for Atmospheric Research (NCAR) Climate System Model (CSM) 1.4 and the climate model ECHO-G 4, assuming different white and red noise scenarios to define the distortion of pseudoproxy series. We show how sensitivity tests lead to valuable “a priori” information that provides a basis for improving real world proxy reconstructions. Our results emphasize the need to carefully test and evaluate reconstruction techniques with respect to the temporal resolution and the spatial scale they are applied to. Furthermore, we demonstrate that uncertainties inherent to the predictand and predictor data have to be more rigorously taken into account. The comparison of the two statistical techniques, in the specific experimental setting presented here, indicates that more skilful results are achieved with RegEM as low frequency variability is better preserved. We further detect seasonal differences in reconstruction skill at the continental scale; for example, the target temperature average is more adequately reconstructed for summer than for winter. For the specific predictor network given in this paper, both techniques underestimate the target temperature variations to an increasing extent as more noise is added to the signal, although less so with RegEM than with PC regression. We conclude that climate field reconstruction techniques can be improved and need to be further optimized in future applications.
Abstract:
Information-theoretic metrics such as mutual information (MI) are widely used as similarity measures for multimodal registration. Nevertheless, such a metric may lead to matching ambiguity in non-rigid registration. Moreover, maximization of MI alone does not necessarily produce an optimal solution. In this paper, we propose a segmentation-assisted similarity metric based on point-wise mutual information (PMI). This similarity metric, termed SPMI, enhances the registration accuracy by using as prior information tissue classification probabilities generated by an expectation maximization (EM) algorithm. Diffeomorphic demons is then adopted as the registration model and is optimized in a hierarchical framework (H-SPMI) based on different levels of anatomical structure as prior knowledge. The proposed method is evaluated using Brainweb synthetic data and clinical fMRI images. Both qualitative and quantitative assessments were performed, as well as an analysis of sensitivity to segmentation error. Compared to pure intensity-based approaches that only maximize mutual information, we show that the proposed algorithm provides significantly better accuracy on both synthetic and clinical data.
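To make the relation between point-wise mutual information and MI concrete, the following sketch computes both from a joint intensity histogram of two images. It covers plain PMI only and does not include the segmentation priors of SPMI; the names `pmi_table` and `mutual_information` are illustrative assumptions.

```python
import math


def pmi_table(joint_counts):
    """Point-wise mutual information for each pair of intensity bins.

    joint_counts[i][j] is the number of voxel pairs with bin i in image A
    and bin j in image B. Entries with zero count are returned as None.
    """
    total = float(sum(sum(row) for row in joint_counts))
    n_rows, n_cols = len(joint_counts), len(joint_counts[0])
    p_a = [sum(row) / total for row in joint_counts]
    p_b = [sum(joint_counts[i][j] for i in range(n_rows)) / total
           for j in range(n_cols)]
    table = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            if joint_counts[i][j] > 0:
                p_ab = joint_counts[i][j] / total
                row.append(math.log(p_ab / (p_a[i] * p_b[j])))
            else:
                row.append(None)
        table.append(row)
    return table


def mutual_information(joint_counts):
    """MI is the expectation of PMI under the joint distribution."""
    total = float(sum(sum(row) for row in joint_counts))
    table = pmi_table(joint_counts)
    return sum(joint_counts[i][j] / total * table[i][j]
               for i in range(len(table))
               for j in range(len(table[0]))
               if table[i][j] is not None)
```

For perfectly aligned two-bin images, for instance `[[5, 0], [0, 5]]`, each occupied bin pair has a PMI of log 2 and the MI is log 2 as well; for independent intensities every PMI value is zero.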
Abstract:
Diseases are believed to arise from dysregulation of biological systems (pathways) perturbed by environmental triggers. Biological systems as a whole are not just the sum of their components, but ever-changing, complex and dynamic systems that respond over time to internal and external perturbation. In the past, biologists have mainly focused on studying either the functions of isolated genes or the steady states of small biological pathways. However, it is systems dynamics that play an essential role in giving rise to cellular functions such as growth, differentiation, division and apoptosis, and to the dysfunction that causes disease. Biological phenomena of the entire organism are determined not only by steady-state characteristics of the biological systems, but also by their intrinsic dynamic properties, including stability, transient response, and controllability, which determine how the systems maintain their functions and performance under a broad range of random internal and external perturbations. As a proof of principle, we examine signal transduction pathways and genetic regulatory pathways as biological systems. We employ state-space equations, widely used in systems science, to model biological systems, and use expectation-maximization (EM) algorithms and the Kalman filter to estimate the parameters in the models. We apply the developed state-space models to human fibroblasts obtained from the autoimmune fibrosing disease scleroderma, and then perform dynamic analysis of a partial TGF-beta pathway in both normal and scleroderma fibroblasts stimulated by silica. We find that the TGF-beta pathway under perturbation by silica shows significant differences in dynamic properties between normal and scleroderma fibroblasts. Our findings may open a new avenue for exploring the functions of cells and the mechanisms operative in disease development.
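The Kalman filter that the abstract pairs with EM can be sketched for the simplest case, a scalar linear-Gaussian state-space model. In an EM scheme the filtered quantities (followed by a smoothing pass, not shown) feed the E-step, and the log-likelihood can be used to monitor convergence. The model form, the function name `kalman_filter_1d` and its parameter names are assumptions for illustration, not the models used in the study.

```python
import math


def kalman_filter_1d(ys, a, c, q, r, m0, p0):
    """Kalman filter for x_t = a*x_{t-1} + w_t, y_t = c*x_t + v_t,
    with w_t ~ N(0, q), v_t ~ N(0, r) and initial state x_0 ~ N(m0, p0).

    Returns the filtered means, filtered variances and the log-likelihood
    of the observations ys.
    """
    m, p = m0, p0
    means, variances, loglik = [], [], 0.0
    for y in ys:
        # Prediction step: propagate the state estimate through the dynamics.
        m_pred = a * m
        p_pred = a * a * p + q
        # Update step: correct the prediction with the new observation.
        innovation = y - c * m_pred
        s = c * c * p_pred + r          # innovation variance
        gain = p_pred * c / s           # Kalman gain
        m = m_pred + gain * innovation
        p = (1.0 - gain * c) * p_pred
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + innovation ** 2 / s)
        means.append(m)
        variances.append(p)
    return means, variances, loglik
```

With `a = 1`, `c = 1`, `q = 0` and a very diffuse prior, the filtered mean reduces to the running average of the observations, which is a convenient sanity check.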
Abstract:
We have developed a new projector model specifically tailored for fast list-mode tomographic reconstructions in positron emission tomography (PET) scanners with parallel planar detectors. The model provides an accurate estimation of the probability distribution of coincidence events defined by pairs of scintillating crystals. This distribution is parameterized with 2D elliptical Gaussian functions defined in planes perpendicular to the main axis of the tube of response (TOR). The parameters of these Gaussian functions have been obtained by fitting Monte Carlo simulations that include positron range, acolinearity of gamma rays, as well as detector attenuation and scatter effects. The proposed model has been applied efficiently to list-mode reconstruction algorithms. Evaluation with Monte Carlo simulations over a rotating high resolution PET scanner indicates that this model yields a better recovery-to-noise ratio in OSEM (ordered-subsets expectation-maximization) reconstruction than list-mode reconstruction with a symmetric circular Gaussian TOR model and histogram-based OSEM with a precalculated system matrix using Monte Carlo simulated models and symmetries.
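For readers unfamiliar with OSEM, the update it applies can be sketched in a few lines. This is a histogram-based (binned) toy version with a small dense system matrix, not the list-mode algorithm or the TOR projector of the paper; the function name `osem` and its arguments are assumptions for illustration.

```python
def osem(system_matrix, counts, subsets, n_iter=10):
    """Ordered-subsets EM for emission tomography (binned data).

    system_matrix[i][j]: probability that an emission in pixel j is
                         detected in measurement bin i.
    counts[i]:           measured counts in bin i.
    subsets:             list of lists of bin indices; with a single
                         subset containing all bins this is plain MLEM.
    """
    n_pix = len(system_matrix[0])
    image = [1.0] * n_pix  # strictly positive starting image
    for _ in range(n_iter):
        for subset in subsets:
            # Sensitivity of each pixel restricted to this subset.
            sens = [sum(system_matrix[i][j] for i in subset)
                    for j in range(n_pix)]
            # Forward projection of the current image.
            proj = {i: sum(system_matrix[i][k] * image[k]
                           for k in range(n_pix))
                    for i in subset}
            # Back projection of the measured/expected count ratios.
            back = [sum(system_matrix[i][j] * counts[i] / proj[i]
                        for i in subset if proj[i] > 0)
                    for j in range(n_pix)]
            # Multiplicative update; pixels unseen by the subset are kept.
            image = [image[j] * back[j] / sens[j] if sens[j] > 0 else image[j]
                     for j in range(n_pix)]
    return image
```

The multiplicative form keeps the image non-negative, and cycling through subsets of the data is what gives OSEM its speed-up over updating with all bins at once.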
Abstract:
Most human designed environments present specific geometrical characteristics. In them, it is easy to find polygonal, rectangular and circular shapes, with a series of typical relations between different elements of the environment. Introducing this kind of knowledge in the mapping process of mobile robots can notably improve the quality and accuracy of the resulting maps. It can also make them more suitable for higher level reasoning applications. When mapping is formulated in a Bayesian probabilistic framework, a complete specification of the problem requires considering a prior for the environment. The prior over the structure of the environment can be applied in several ways; this dissertation presents two different frameworks, one using a feature based approach and another one employing a dense representation close to the measurements space. A feature based approach implicitly imposes a prior for the environment. In this sense, feature based graph SLAM was a first step towards a new mapping solution for structured scenarios. In the first framework, the prior is inferred by the system from a wide collection of feature based priors, following an Expectation-Maximization approach to obtain the most probable structure and the most probable map.
The representation of the structure of the environment is based on a hierarchical model with different levels of abstraction for the geometrical elements describing it. Various experiments were conducted to show the versatility and the good performance of the proposed method. In the second framework, different priors can be defined by the user as sets of local constraints and energies for consecutive points in a range scan from a given environment. The set of constraints applied to each group of points depends on the topology, which is inferred by the system. This way, flexible and generic priors can be incorporated very easily. Several tests were carried out to demonstrate the flexibility and the good results of the proposed approach.
Abstract:
The modal analysis of a structural system consists of computing its vibrational modes. The experimental way to estimate these modes requires exciting the system with a measured or known input and then measuring the system output at different points using sensors. Finally, system inputs and outputs are used to compute the modes of vibration. When the system is a large structure such as a building or a bridge, the tests have to be performed in situ, so it is not possible to measure system inputs such as wind or traffic. Even if a known input is applied, the procedure is usually difficult and expensive, and there are still uncontrolled disturbances acting at the time of the test. These facts led to the idea of computing the modes of vibration using only the measured vibrations, regardless of the inputs that originated them, whether they are ambient vibrations (wind, earthquakes, ...) or operational loads (traffic, human loading, ...). This procedure is usually called Operational Modal Analysis (OMA), and in general consists of fitting a mathematical model to the measured data, assuming the unobserved excitations are realizations of a stationary stochastic process (usually white noise processes). Then, the modes of vibration are computed from the estimated model. The first issue investigated in this thesis is the performance of the Expectation-Maximization (EM) algorithm for the maximum likelihood estimation of the state space model in the field of OMA. The algorithm is described in detail, and its application to vibration data is analysed. After that, it is compared to another well known method, the Stochastic Subspace Identification algorithm. The maximum likelihood estimate enjoys some optimal properties from a statistical point of view, which makes it very attractive in practice, but the most remarkable property of the EM algorithm is that it can be used to address a wide range of situations in OMA.
In this work, three additional state space models are proposed and estimated using the EM algorithm:
• The first model is proposed to estimate the modes of vibration when several tests are performed on the same structural system. Instead of analysing record by record and then computing averages, the EM algorithm is extended for the joint estimation of the proposed state space model using all the available data.
• The second state space model is used to estimate the modes of vibration when the number of available sensors is lower than the number of points to be tested. In these cases it is usual to perform several tests, changing the position of the sensors from one test to the next (multiple setups of sensors). Here, the proposed state space model and the EM algorithm are used to estimate the modal parameters taking into account the data of all setups.
• Lastly, a state space model is proposed to estimate the modes of vibration in the presence of unmeasured inputs that cannot be modelled as white noise processes. In these cases, the frequency components of the inputs cannot be separated from the eigenfrequencies of the system, and spurious modes are obtained in the identification process. The idea is to measure the response of the structure corresponding to different inputs; then, it is assumed that the parameters common to all the data correspond to the structure (modes of vibration), and the parameters found in a specific test correspond to the input in that test. The problem is solved using the proposed state space model and the EM algorithm.
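Once a discrete-time state space model has been estimated, the step "the modes of vibration are computed from the estimated model" usually amounts to converting the eigenvalues of the state matrix into natural frequencies and damping ratios. The sketch below shows that conversion for a single mode described by a 2x2 state matrix; the function names and the 2x2 restriction are assumptions made to keep the example free of external libraries.

```python
import cmath
import math


def eigenvalues_2x2(a):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    root = cmath.sqrt(trace * trace / 4.0 - det)
    return trace / 2.0 + root, trace / 2.0 - root


def modal_parameters(discrete_eigenvalue, dt):
    """Natural frequency (Hz) and damping ratio of one mode.

    discrete_eigenvalue: eigenvalue of the discrete-time state matrix.
    dt:                  sampling interval in seconds.
    """
    # Map the discrete-time pole to continuous time: lam = ln(mu) / dt.
    lam = cmath.log(discrete_eigenvalue) / dt
    omega = abs(lam)                    # undamped natural frequency, rad/s
    frequency_hz = omega / (2.0 * math.pi)
    damping_ratio = -lam.real / omega
    return frequency_hz, damping_ratio
```

The principal branch of the logarithm is valid as long as the damped frequency stays below the Nyquist frequency, that is, while the imaginary part of the continuous-time pole times `dt` is smaller than pi.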
Resumo:
Hoy en día, con la evolución continua y rápida de las tecnologías de la información y los dispositivos de computación, se recogen y almacenan continuamente grandes volúmenes de datos en distintos dominios y a través de diversas aplicaciones del mundo real. La extracción de conocimiento útil de una cantidad tan enorme de datos no se puede realizar habitualmente de forma manual, y requiere el uso de técnicas adecuadas de aprendizaje automático y de minería de datos. La clasificación es una de las técnicas más importantes que ha sido aplicada con éxito a varias áreas. En general, la clasificación se compone de dos pasos principales: en primer lugar, aprender un modelo de clasificación o clasificador a partir de un conjunto de datos de entrenamiento, y en segundo lugar, clasificar las nuevas instancias de datos utilizando el clasificador aprendido. La clasificación es supervisada cuando todas las etiquetas están presentes en los datos de entrenamiento (es decir, datos completamente etiquetados), semi-supervisada cuando sólo algunas etiquetas son conocidas (es decir, datos parcialmente etiquetados), y no supervisada cuando todas las etiquetas están ausentes en los datos de entrenamiento (es decir, datos no etiquetados). Además, aparte de esta taxonomía, el problema de clasificación se puede categorizar en unidimensional o multidimensional en función del número de variables clase, una o más, respectivamente; o también puede ser categorizado en estacionario o cambiante con el tiempo en función de las características de los datos y de la tasa de cambio subyacente. A lo largo de esta tesis, tratamos el problema de clasificación desde tres perspectivas diferentes, a saber, clasificación supervisada multidimensional estacionaria, clasificación semisupervisada unidimensional cambiante con el tiempo, y clasificación supervisada multidimensional cambiante con el tiempo. Para llevar a cabo esta tarea, hemos usado básicamente los clasificadores Bayesianos como modelos. 
La primera contribución, dirigiéndose al problema de clasificación supervisada multidimensional estacionaria, se compone de dos nuevos métodos de aprendizaje de clasificadores Bayesianos multidimensionales a partir de datos estacionarios. Los métodos se proponen desde dos puntos de vista diferentes. El primer método, denominado CB-MBC, se basa en una estrategia de envoltura de selección de variables que es voraz y hacia delante, mientras que el segundo, denominado MB-MBC, es una estrategia de filtrado de variables con una aproximación basada en restricciones y en el manto de Markov. Ambos métodos han sido aplicados a dos problemas reales importantes, a saber, la predicción de los inhibidores de la transcriptasa inversa y de la proteasa para el problema de infección por el virus de la inmunodeficiencia humana tipo 1 (HIV-1), y la predicción del European Quality of Life-5 Dimensions (EQ-5D) a partir de los cuestionarios de la enfermedad de Parkinson con 39 ítems (PDQ-39). El estudio experimental incluye comparaciones de CB-MBC y MB-MBC con los métodos del estado del arte de la clasificación multidimensional, así como con métodos comúnmente utilizados para resolver el problema de predicción de la enfermedad de Parkinson, a saber, la regresión logística multinomial, mínimos cuadrados ordinarios, y mínimas desviaciones absolutas censuradas. En ambas aplicaciones, los resultados han sido prometedores con respecto a la precisión de la clasificación, así como en relación al análisis de las estructuras gráficas que identifican interacciones conocidas y novedosas entre las variables. La segunda contribución, referida al problema de clasificación semi-supervisada unidimensional cambiante con el tiempo, consiste en un método nuevo (CPL-DS) para clasificar flujos de datos parcialmente etiquetados. Los flujos de datos difieren de los conjuntos de datos estacionarios en su proceso de generación muy rápido y en su aspecto de cambio de concepto. 
Es decir, los conceptos aprendidos y/o la distribución subyacente están probablemente cambiando y evolucionando en el tiempo, lo que hace que el modelo de clasificación actual sea obsoleto y deba ser actualizado. CPL-DS utiliza la divergencia de Kullback-Leibler y el método de bootstrapping para cuantificar y detectar tres tipos posibles de cambio: en las predictoras, en la a posteriori de la clase o en ambas. Después, si se detecta cualquier cambio, un nuevo modelo de clasificación se aprende usando el algoritmo EM; si no, el modelo de clasificación actual se mantiene sin modificaciones. CPL-DS es general, ya que puede ser aplicado a varios modelos de clasificación. Usando dos modelos diferentes, el clasificador naive Bayes y la regresión logística, CPL-DS se ha probado con flujos de datos sintéticos y también se ha aplicado al problema real de la detección de código malware, en el cual los nuevos ficheros recibidos deben ser continuamente clasificados en malware o goodware. Los resultados experimentales muestran que nuestro método es efectivo para la detección de diferentes tipos de cambio a partir de los flujos de datos parcialmente etiquetados y también tiene una buena precisión de la clasificación. Finalmente, la tercera contribución, sobre el problema de clasificación supervisada multidimensional cambiante con el tiempo, consiste en dos métodos adaptativos, a saber, Locally Adpative-MB-MBC (LA-MB-MBC) y Globally Adpative-MB-MBC (GA-MB-MBC). Ambos métodos monitorizan el cambio de concepto a lo largo del tiempo utilizando la log-verosimilitud media como métrica y el test de Page-Hinkley. Luego, si se detecta un cambio de concepto, LA-MB-MBC adapta el actual clasificador Bayesiano multidimensional localmente alrededor de cada nodo cambiado, mientras que GA-MB-MBC aprende un nuevo clasificador Bayesiano multidimensional. El estudio experimental realizado usando flujos de datos sintéticos multidimensionales indica los méritos de los métodos adaptativos propuestos. 
ABSTRACT Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually, and requires the use of adequate machine learning and data mining techniques. Classification is one of the most important of these techniques and has been successfully applied in several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from the available training data, and second, classify new, unseen data instances using the learned classifier. Classification is supervised when all class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when all class values are missing from the training data (i.e., unlabeled data). Besides this taxonomy, the classification problem can be categorized as uni-dimensional or multi-dimensional, depending on whether there is one class variable or more, and as stationary or streaming, depending on the characteristics of the data and the rate of change underlying it. Throughout this thesis, we deal with the classification problem under three different settings, namely, supervised multi-dimensional stationary classification, semi-supervised uni-dimensional streaming classification, and supervised multi-dimensional streaming classification. To this end, we mainly used Bayesian network classifiers as models. The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers from stationary data. 
They are proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second one, named MB-MBC, is a filter, constraint-based approach that relies on Markov blankets. Both methods are applied to two important real-world problems, namely, the prediction of the human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from the 39-item Parkinson’s Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against commonly used methods for solving the Parkinson’s disease prediction problem, namely, multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both case studies, results are promising in terms of classification accuracy as well as in the analysis of the learned MBC graphical structures, which identify known and novel interactions among variables. The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from stationary data sets in their very rapid generation process and their concept-drifting nature. That is, the learned concepts and/or the underlying distribution are likely to change and evolve over time, which makes the current classification model out of date and in need of updating. CPL-DS uses the Kullback-Leibler divergence and the bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional or dual. Then, if any drift occurs, a new classification model is learned using the expectation-maximization algorithm; otherwise, the current classification model is kept unchanged. 
CPL-DS is general, as it can be applied to several classification models. Using two different models, namely, the naive Bayes classifier and logistic regression, CPL-DS is tested with synthetic data streams and applied to the real-world problem of malware detection, where newly received files must be continuously classified as malware or goodware. Experimental results show that our approach is effective at detecting different kinds of drift in partially labeled data streams and achieves good classification performance. Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely, Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. The experimental study, carried out using synthetic multi-dimensional data streams, shows the merits of both proposed adaptive methods.
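As a minimal sketch of the kind of monitoring mentioned in this abstract (not the thesis's implementation), a Page-Hinkley test can be run over a stream of per-batch scores. The function name `page_hinkley`, the parameters `delta` and `threshold`, and the choice to feed the negated average log-likelihood (so that a drop in likelihood shows up as an upward shift in the monitored mean) are illustrative assumptions.

```python
from typing import Iterable, Optional


def page_hinkley(values: Iterable[float], delta: float = 0.005,
                 threshold: float = 5.0) -> Optional[int]:
    """Return the index at which an upward shift in the mean is flagged, or None.

    delta is the tolerated magnitude of change and threshold is the alarm
    level; both defaults are hypothetical and would need tuning.
    """
    mean = 0.0        # running mean of the values seen so far
    cumulative = 0.0  # cumulative deviation from the running mean
    minimum = 0.0     # smallest cumulative deviation seen so far
    for t, x in enumerate(values):
        mean += (x - mean) / (t + 1)
        cumulative += x - mean - delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return t  # drift alarm raised at this position
    return None


# Hypothetical stream: the average log-likelihood drops from -1 to -3.
avg_log_likelihood = [-1.0] * 50 + [-3.0] * 50
alarm = page_hinkley(-score for score in avg_log_likelihood)
```

In this toy stream the alarm is raised a few batches after the drop at position 50; a larger `threshold` delays the alarm but reduces false alarms.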
Resumo:
Computing the modal parameters of structural systems often requires processing data from multiple, non-simultaneously recorded sensor setups. These setups share some sensors, the so-called reference sensors, which are fixed for all measurements, while the other sensors change position from one setup to the next. One possibility is to process the setups separately, which yields different modal parameter estimates for each setup. The reference sensors are then used to merge, or glue, the different parts of the mode shapes into global mode shapes, while the natural frequencies and damping ratios are usually averaged. In this paper we present a new state space model that processes all setups at once. As a result, the global mode shapes are obtained automatically, and only a single value of the natural frequency and damping ratio is estimated for each mode. We also investigate the estimation of this model using maximum likelihood and the Expectation Maximization algorithm, and apply the technique to simulated and measured data from different structures.
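For orientation only: in a discrete-time state space model, each vibration mode is commonly associated with a complex eigenvalue of the state transition matrix, from which a natural frequency and a damping ratio follow. The sketch below assumes this standard conversion. The function names and the sampling interval `dt` are hypothetical and are not taken from the paper, and the conversion is valid only when the damped frequency lies below the Nyquist frequency.

```python
import cmath
import math
from typing import Tuple


def modal_parameters(lam: complex, dt: float) -> Tuple[float, float]:
    """Convert one discrete-time eigenvalue into (frequency in Hz, damping ratio)."""
    mu = cmath.log(lam) / dt  # continuous-time pole (principal branch of the log)
    omega = abs(mu)           # undamped natural frequency in rad/s
    return omega / (2 * math.pi), -mu.real / omega


def discrete_eigenvalue(freq_hz: float, zeta: float, dt: float) -> complex:
    """Build the discrete-time eigenvalue of an underdamped mode (0 <= zeta < 1)."""
    omega = 2 * math.pi * freq_hz
    mu = complex(-zeta * omega, omega * math.sqrt(1 - zeta ** 2))
    return cmath.exp(mu * dt)


# Round trip on a hypothetical mode: 2 Hz, 5 % damping, sampled every 0.01 s.
lam = discrete_eigenvalue(2.0, 0.05, 0.01)
freq_hz, zeta = modal_parameters(lam, 0.01)
```

Because the frequency and damping depend only on the eigenvalue, a model that shares one state transition matrix across all setups yields one such pair per mode, which is consistent with the single estimate described in the abstract.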