941 results for Kernel density estimation
Abstract:
We consider online prediction problems where the loss between the prediction and the outcome is measured by the squared Euclidean distance and its generalization, the squared Mahalanobis distance. We derive the minimax solutions for the case where the prediction and action spaces are the simplex (this setup is sometimes called the Brier game) and the \ell_2 ball (this setup is related to Gaussian density estimation). We show that in both cases the value of each sub-game is a quadratic function of a simple statistic of the state, with coefficients that can be efficiently computed using an explicit recurrence relation. The resulting deterministic minimax strategy and randomized maximin strategy are linear functions of the statistic.
Abstract:
Aerial surveys of kangaroos (Macropus spp.) in Queensland are used to make economically important judgements on the levels of viable commercial harvest. Previous analysis methods for aerial kangaroo surveys have used both mark-recapture methodologies and conventional distance-sampling analyses. Conventional distance sampling has the disadvantage that detection is assumed to be perfect on the transect line, while mark-recapture methods are notoriously sensitive to problems with unmodelled heterogeneity in capture probabilities. We introduce three methodologies for combining mark-recapture and distance-sampling data, aimed at exploiting the strengths of both methodologies and overcoming their weaknesses. Of these methods, two are based on the assumption of full independence between observers in the mark-recapture component, and this appears to introduce more bias in density estimation than it resolves through allowing uncertain trackline detection. Both of these methods give lower density estimates than conventional distance sampling, indicating a clear failure of the independence assumption. The third method, termed point independence, appears to perform very well, giving credible density estimates and good properties in terms of goodness-of-fit and percentage coefficient of variation. Estimated densities of eastern grey kangaroos range from 21 to 36 individuals km⁻², with estimated coefficients of variation between 11% and 14% and estimated trackline detection probabilities primarily between 0.7 and 0.9.
Abstract:
Images from cell biology experiments often indicate the presence of cell clustering, which can provide insight into the mechanisms driving the collective cell behaviour. Pair-correlation functions provide quantitative information about the presence, or absence, of clustering in a spatial distribution of cells. This is because the pair-correlation function describes the ratio of the abundance of pairs of cells, separated by a particular distance, relative to a randomly distributed reference population. Pair-correlation functions are often presented as a kernel density estimate in which the frequency of pairs of objects is grouped using a particular bandwidth (or bin width), Δ > 0. The choice of bandwidth has a dramatic impact: choosing Δ too large produces a pair-correlation function that contains insufficient information, whereas choosing Δ too small produces a pair-correlation signal dominated by fluctuations. Presently, there is little guidance available regarding how to make an objective choice of Δ. We present a new technique to choose Δ by analysing the power spectrum of the discrete Fourier transform of the pair-correlation function. Using synthetic simulation data, we confirm that our approach allows us to objectively choose Δ such that the appropriately binned pair-correlation function captures known features in uniform and clustered synthetic images. We also apply our technique to images from two different cell biology assays. The first assay corresponds to an approximately uniform distribution of cells, while the second assay involves a time series of images of a cell population which forms aggregates over time. The appropriately binned pair-correlation function allows us to make quantitative inferences about the average aggregate size, as well as quantifying how the average aggregate size changes with time.
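The role of the bin width Δ described above can be illustrated with a minimal sketch of a binned pair-correlation function. This is not the authors' implementation: the function name `pair_correlation`, the periodic square domain, and the parameter names are assumptions made for illustration, and the Fourier-based choice of Δ from the abstract is not reproduced here. Pair counts in each distance bin are divided by the counts expected for a uniformly random population, so values near 1 suggest no clustering.

```python
import numpy as np

def pair_correlation(points, side, delta, r_max=None):
    """Binned pair-correlation function on a periodic square of width `side`.

    Assumption for this sketch: periodic boundaries, so no edge correction
    is needed as long as r_max <= side / 2.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    if r_max is None:
        r_max = side / 2.0
    # Pairwise displacements, wrapped with the minimum-image convention.
    diff = pts[:, None, :] - pts[None, :, :]
    diff -= side * np.round(diff / side)
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(n, k=1)]  # each unordered pair once
    # Bins of width delta: [0, delta), [delta, 2*delta), ...
    n_bins = int(np.floor(r_max / delta))
    edges = delta * np.arange(n_bins + 1)
    counts, _ = np.histogram(d, bins=edges)
    # Expected pair counts for a uniformly random population.
    annulus = np.pi * (edges[1:] ** 2 - edges[:-1] ** 2)
    expected = 0.5 * n * (n - 1) * annulus / side ** 2
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, counts / expected
```

Calling the function with several values of `delta` on the same point set shows the trade-off described in the abstract: wide bins flatten the signal, while narrow bins leave few pairs per bin and a noisy curve.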
Abstract:
State-of-the-art image-set matching techniques typically implicitly model each image-set with a Gaussian distribution. Here, we propose to go beyond these representations and model image-sets as probability distribution functions (PDFs) using kernel density estimators. To compare and match image-sets, we exploit Csiszár f-divergences, which bear strong connections to the geodesic distance defined on the space of PDFs, i.e., the statistical manifold. Furthermore, we introduce valid positive definite kernels on the statistical manifold, which let us make use of more powerful classification schemes to match image-sets. Finally, we introduce a supervised dimensionality reduction technique that learns a latent space where f-divergences reflect the class labels of the data. Our experiments on diverse problems, such as video-based face recognition and dynamic texture classification, demonstrate the benefits of our approach over the state-of-the-art image-set matching methods.
Abstract:
Interstellar clouds are not featureless, but show quite complex internal structures of filaments and clumps when observed with high enough resolution. These structures have been generated by 1) turbulent motions driven mainly by supernovae, 2) magnetic fields working on the ions and, through neutral-ion collisions, on neutral gas as well, and 3) self-gravity pulling a dense clump together to form a new star. The study of the cloud structure gives us information on the relative importance of each of these mechanisms, and helps us to gain a better understanding of the details of the star formation process. Interstellar dust is often used as a tracer for the interstellar gas which forms the bulk of the interstellar matter. Some of the methods that are used to derive the column density are summarized in this thesis. A new method, which uses the scattered light to map the column density in large fields with high spatial resolution, is introduced. This thesis also takes a look at the grain alignment with respect to the magnetic fields. The aligned grains give rise to the polarization of starlight and dust emission, thus revealing the magnetic field. The alignment mechanisms have been debated for the last half century. The strongest candidate at present is the radiative torques mechanism. In the first four papers included in this thesis, the scattered light method of column density estimation is formulated, tested in simulations, and finally used to obtain a column density map from observations. They demonstrate that the scattered light method is a very useful and reliable tool in column density estimation, and is able to provide higher resolution than the near-infrared color excess method. These two methods are complementary. The derived column density maps are also used to gain information on the dust emissivity within the observed cloud. 
The two final papers present simulations of polarized thermal dust emission assuming that the alignment happens by the radiative torques mechanism. We show that the radiative torques can explain the observed decline of the polarization degree towards dense cores. Furthermore, the results indicate that the dense cores themselves might not contribute significantly to the polarized signal, and hence one needs to be careful when interpreting the observations and deriving the magnetic field.
Abstract:
In this paper, we compare the experimental results for Tamil online handwritten character recognition using HMM and Statistical Dynamic Time Warping (SDTW) as classifiers. HMM was used for a 156-class problem. Different feature sets and values for the HMM states and mixtures were tried, and the best combination was found to be 16 states and 14 mixtures, giving an accuracy of 85%. The features used in this combination were retained, and an SDTW model with 20 states and a single Gaussian was used as the classifier. Also, the symbol set was increased to include numerals, punctuation marks and special symbols like $, & and #, taking the number of classes to 188. It was found that, with a small addition to the feature set, this simple SDTW classifier performed on par with the more complicated HMM model, giving an accuracy of 84%. Mixture density estimation computations were reduced by a factor of 11. The recognition is writer independent, as the dataset used is quite large, with a variety of handwriting styles.
Abstract:
Determining home ranges has been a widely discussed topic in studies that seek to understand the relationship between a species and the characteristics of its habitat. Guanabara Bay hosts a resident population of Guiana dolphins (Sotalia guianensis), and the aim of the present study was to analyze the spatial use of Sotalia guianensis in Guanabara Bay (RJ) between 2002 and 2012. A total of 204 sampling days were analyzed and 902 points were selected to generate the distribution maps. The bay was divided into four subareas, and the difference in effort among them did not exceed 16%. The Kernel Density method was used to estimate and interpret habitat use by groups of Guiana dolphins. The population's areas of concentration were also interpreted using 1.5 km × 1.5 km grid cells, followed by application of Pianka's niche overlap index. The depths used by S. guianensis did not vary significantly over the study period (p = 0.531). The areas used during 2002/2004 were estimated at 79.4 km², with concentration areas of 19.4 km². The 2008/2010 and 2010/2012 periods had estimated use areas totaling 51.4 and 58.9 km², respectively, and concentration areas of 10.8 and 10.4 km², respectively. The areas used encompassed regions extending along the entire central channel and the northeastern region of Guanabara Bay, where the Guapimirim Environmental Protection Area is also located. Despite this, the population's home range, as well as its concentration areas, decreased gradually over the years, especially around Paquetá Island and the south-central part of the central channel. Groups with more than 10 individuals and groups in the class with ≥ 25% calves showed reductions of more than 60% in the size of the areas used.
The Guiana dolphin population has been declining rapidly in recent years, in addition to interacting daily with sources of disturbance, which are possible causes of the reduced habitat use in Guanabara Bay. For this reason, the results presented are of fundamental importance for the conservation of this population, since they represent consequences of long-term interaction with a coastal environment heavily impacted by human activity.
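The grid-cell comparison mentioned in this abstract relies on Pianka's niche overlap index, which in its standard form is O = Σ pᵢqᵢ / √(Σ pᵢ² · Σ qᵢ²), where pᵢ and qᵢ are the proportions of use in cell i for the two periods being compared. The sketch below is only an illustration of that formula: the function name `pianka_overlap` and the per-cell sighting counts are hypothetical, and the study's actual data handling is not described in the abstract.

```python
import numpy as np

def pianka_overlap(counts_a, counts_b):
    """Pianka's niche overlap index between two sets of per-cell counts.

    Inputs are sighting counts per grid cell (hypothetical data layout);
    they are converted to proportions before the index is computed.
    Returns a value between 0 (no shared cells) and 1 (identical use).
    """
    a = np.asarray(counts_a, dtype=float).ravel()
    b = np.asarray(counts_b, dtype=float).ravel()
    p = a / a.sum()
    q = b / b.sum()
    return float((p * q).sum() / np.sqrt((p ** 2).sum() * (q ** 2).sum()))
```

For example, two periods that use exactly the same cells in the same proportions give an index of 1, while two periods with no cells in common give 0.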
Abstract:
The mixtures of factor analyzers (MFA) model allows data to be modeled as a mixture of Gaussians with a reduced parametrization. We present the formulation of a nonparametric form of the MFA model, the Dirichlet process MFA (DPMFA). The proposed model can be used for density estimation or clustering of high-dimensional data. We utilize the DPMFA for clustering the action potentials of different neurons from extracellular recordings, a problem known as spike sorting. The DPMFA model is compared to the Dirichlet process mixture of Gaussians model (DPGMM), which has a higher computational complexity. We show that DPMFA has similar modeling performance in lower dimensions when compared to DPGMM, and is able to work in higher dimensions. ©2009 IEEE.
Abstract:
A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.
Abstract:
Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian non-parametrics. This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian non-parametrics. The survey covers the use of Bayesian non-parametrics for modelling unknown functions, density estimation, clustering, time-series modelling, and representing sparsity, hierarchies, and covariance structure. More specifically, it gives brief non-technical overviews of Gaussian processes, Dirichlet processes, infinite hidden Markov models, Indian buffet processes, Kingman's coalescent, Dirichlet diffusion trees and Wishart processes.
Abstract:
A real-time VHF swept frequency (20–300 MHz) reflectometry measurement for radio-frequency capacitive-coupled atmospheric pressure plasmas is described. The measurement is scalar, non-invasive and deployed on the main power line of the plasma chamber. The purpose of this VHF signal injection is to remotely interrogate in real-time the frequency reflection properties of plasma. The information obtained is used for remote monitoring of high-value atmospheric plasma processing. Measurements are performed under varying gas feed (helium mixed with 0–2% oxygen) and power conditions (0–40 W) on two contrasting reactors. The first is a classical parallel-plate chamber driven at 16 MHz with well-defined electrical grounding but limited optical access and the second is a cross-field plasma jet driven at 13.56 MHz with open optical access but with poor electrical shielding of the driven electrode. The electrical measurements are modelled using a lumped element electrical circuit to provide an estimate of power dissipated in the plasma as a function of gas and applied power. The performances of both reactors are evaluated against each other. The scalar measurements reveal that 0.1% oxygen admixture in helium plasma can be detected. The equivalent electrical model indicates that the current density between the parallel-plate reactor electrodes is of the order of 8–20 mA cm⁻². This value is in accord with the 0.03 A cm⁻² values reported by Park et al (2001 J. Appl. Phys. 89 20–8). The current density of the cross-field plasma jet electrodes is found to be 20 times higher. When the cross-field plasma jet unshielded electrode area is factored into the current density estimation, the resultant current density agrees with the parallel-plate reactor. This indicates that the unshielded reactor radiates electromagnetic energy into free space and so acts as a plasma antenna.
Abstract:
The conflict known as the “Troubles” in Northern Ireland began during the late 1960s and is defined by political and ethno-sectarian violence between state, pro-state, and anti-state forces. Reasons for the conflict are contested and complicated by social, religious, political, and cultural disputes, with much of the debate concerning the victims of violence hardened by competing propaganda-conditioning perspectives. This article introduces a database holding information on the location of individual fatalities connected with the contemporary Irish conflict. For each victim, it includes a demographic profile, home address, manner of death, and the organization responsible. Employing geographic information system (GIS) techniques, the database is used to measure, map, and analyze the spatial distribution of conflict-related deaths between 1966 and 2007 across Belfast, the capital city of Northern Ireland, with respect to levels of segregation, social and economic deprivation, and interfacing. The GIS analysis includes a kernel density estimator designed to generate smooth intensity surfaces of the conflict-related deaths by both incident and home locations. Neighborhoods with high-intensity surfaces of deaths were those with the highest levels of segregation (90 percent Catholic or Protestant) and deprivation, and they were located near physical barriers, the so-called peacelines, between predominantly Catholic and predominantly Protestant communities. Finally, despite the onset of peace and the formation of a power-sharing and devolved administration (the Northern Ireland Assembly), disagreements remain over the responsibility and “commemoration” of victims, sentiments that still uphold division and atavistic attitudes between spatially divided Catholic and Protestant populations.
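A smooth intensity surface of the kind mentioned in this abstract can be sketched with a Gaussian kernel summed over event locations. The abstract does not state which kernel, bandwidth, or GIS software was used, so everything below is an assumption for illustration: the function name `kernel_intensity`, the Gaussian kernel, and the single fixed `bandwidth`. The surface is expressed in events per unit area, so it integrates to the number of events.

```python
import numpy as np

def kernel_intensity(events, grid_x, grid_y, bandwidth):
    """Gaussian kernel intensity surface for 2-D point events.

    events    : array of shape (n, 2) with x, y coordinates
    grid_x/y  : 1-D arrays of grid coordinates
    bandwidth : kernel standard deviation (assumed fixed and isotropic)
    Returns an array of shape (len(grid_y), len(grid_x)).
    """
    ev = np.asarray(events, dtype=float)
    gx, gy = np.meshgrid(grid_x, grid_y)
    # Offsets from every grid node to every event.
    dx = gx[..., None] - ev[:, 0]
    dy = gy[..., None] - ev[:, 1]
    norm = 2.0 * np.pi * bandwidth ** 2
    kernels = np.exp(-(dx ** 2 + dy ** 2) / (2.0 * bandwidth ** 2)) / norm
    # Summing (not averaging) gives an intensity rather than a density.
    return kernels.sum(axis=-1)
```

Dividing the result by the number of events would turn the intensity into a probability density, which is the more common convention outside spatial point-pattern analysis.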
Abstract:
Camera traps are used to estimate densities or abundances using capture-recapture and, more recently, random encounter models (REMs). We deploy REMs to describe an invasive-native species replacement process, and to demonstrate their wider application beyond abundance estimation. The Irish hare Lepus timidus hibernicus is a high priority endemic of conservation concern. It is threatened by an expanding population of non-native, European hares L. europaeus, an invasive species of global importance. Camera traps were deployed in thirteen 1 km squares, wherein the ratio of invader to native densities was corroborated by night-driven line transect distance sampling throughout the study area of 1652 km². Spatial patterns of invasive and native densities between the invader’s core and peripheral ranges, and native allopatry, were comparable between methods. Native densities in the peripheral range were comparable to those in native allopatry using REM, or marginally depressed using Distance Sampling. Numbers of the invader were substantially higher than the native in the core range, irrespective of method, with a 5:1 invader-to-native ratio indicating species replacement. We also describe a post hoc optimization protocol for REM which will inform subsequent (re-)surveys, allowing survey effort (camera hours) to be reduced by up to 57% without compromising the width of confidence intervals associated with density estimates. This approach will form the basis of a more cost-effective means of surveillance and monitoring for both the endemic and invasive species. The European hare undoubtedly represents a significant threat to the endemic Irish hare.
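For readers unfamiliar with random encounter models, the usual REM density estimator can be sketched as D = (y / t) · π / (v · r · (2 + θ)), where y / t is the number of detections per unit of camera time, v is the animal's average speed of movement, and r and θ are the radius and angle (in radians) of the camera detection zone. The abstract does not give the formula or the parameter values used in the study, so the function name `rem_density` and its arguments are assumptions for illustration only.

```python
import math

def rem_density(n_detections, camera_time, speed, radius, angle):
    """Density from the standard random encounter model formula.

    n_detections : number of independent detections (y)
    camera_time  : total camera effort (t), e.g. in days
    speed        : average animal speed (v), distance per time unit of t
    radius       : detection-zone radius (r), same distance unit as speed
    angle        : detection-zone angle (theta) in radians
    Returns animals per unit area (distance unit squared).
    """
    if min(camera_time, speed, radius) <= 0:
        raise ValueError("camera_time, speed and radius must be positive")
    trap_rate = n_detections / camera_time
    return trap_rate * math.pi / (speed * radius * (2.0 + angle))
```

Units must be consistent: with effort in days, speed in km per day and radius in km, the result is in animals per km².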
Abstract:
One of the unsupervised learning models generating the most active research is the Boltzmann machine, in particular the restricted Boltzmann machine, or RBM. An important aspect of both training and using such a model is sampling. Two recent developments, fast persistent contrastive divergence (FPCD) and herding, aim to improve this aspect, focusing mainly on the learning process itself. Notably, herding gives up on obtaining a precise estimate of the RBM parameters, instead defining a distribution through a dynamical system guided by the training examples. We generalize these ideas to obtain algorithms that exploit the probability distribution defined by a pre-trained RBM by drawing samples representative of it, without requiring the training set. We present three methods: sample penalization (based on a theoretical intuition), as well as FPCD and herding using constant statistics for the positive phase. These methods define dynamical systems that produce samples with the desired statistics, and we evaluate them using a nonparametric density estimation method. We show that these methods mix substantially better than the conventional method, Gibbs sampling.
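The conventional baseline named in this abstract, Gibbs sampling in an RBM, can be sketched as alternating block updates of the hidden and visible units. This is a minimal illustration for a binary RBM, not the thesis code; the function names, the weight layout `W` of shape (visible, hidden), and the bias names `b` and `c` are assumptions. The three proposed methods (sample penalization, FPCD and herding with constant positive-phase statistics) are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(W, b, c, v0, n_steps, rng):
    """Block Gibbs sampling in a binary RBM.

    W  : weights, shape (n_visible, n_hidden)
    b  : visible biases, shape (n_visible,)
    c  : hidden biases, shape (n_hidden,)
    v0 : initial visible state, shape (n_visible,)
    Returns the visible states visited, shape (n_steps, n_visible).
    """
    v = np.asarray(v0, dtype=float)
    samples = []
    for _ in range(n_steps):
        # Sample all hidden units given the visible units.
        p_h = sigmoid(v @ W + c)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Sample all visible units given the hidden units.
        p_v = sigmoid(h @ W.T + b)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        samples.append(v.copy())
    return np.array(samples)
```

A chain like this can stay in one mode of the distribution for many steps, which is the slow-mixing behaviour the abstract's methods are meant to improve on.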
Abstract:
Supervised learning of large-scale hierarchical networks is currently enjoying tremendous success. Despite this excitement, unsupervised learning still represents, according to many researchers, a key element of Artificial Intelligence, where agents must learn from a potentially limited amount of data. This thesis follows that line of thought and addresses several research topics related to the density estimation problem through Boltzmann machines (BMs), probabilistic graphical models at the heart of deep learning. Our contributions touch on sampling, partition function estimation, optimization, and the learning of invariant representations. The thesis begins by presenting a new adaptive sampling algorithm that automatically adjusts the temperature of the simulated Markov chains in order to maintain a high convergence speed throughout learning. When used in the context of stochastic maximum likelihood (SML) learning, our algorithm yields increased robustness to the choice of learning rate, as well as faster convergence. Our results are presented for BMs, but the method is general and applicable to learning any probabilistic model that relies on Markov chain sampling. While the maximum likelihood gradient can be approximated by sampling, evaluating the log-likelihood requires an estimate of the partition function. In contrast to traditional approaches that treat a given model as a black box, we instead propose to exploit the dynamics of learning by estimating the successive changes in the log-partition function incurred at each parameter update.
The estimation problem is reformulated as an inference problem similar to the Kalman filter, but on a two-dimensional graph whose dimensions correspond to the time axis and the temperature parameter. On the topic of optimization, we also present an algorithm for efficiently applying the natural gradient to Boltzmann machines with thousands of units. Until now, its adoption was limited by its high computational cost and memory requirements. Our algorithm, Metric-Free Natural Gradient (MFNG), avoids explicitly computing the Fisher information matrix (and its inverse) by exploiting a linear solver combined with an efficient matrix-vector product. The algorithm is promising: in terms of the number of function evaluations, MFNG converges faster than SML. Unfortunately, its implementation remains inefficient in computation time. This work also explores the mechanisms underlying the learning of invariant representations. To this end, we use the family of “spike & slab” restricted Boltzmann machines (ssRBM), which we modify so that they can model binary and sparse distributions. The binary latent variables of the ssRBM can be made invariant to a vector subspace by associating with each of them a vector of continuous latent variables (called “slabs”). This results in increased invariance at the representation level and a better classification rate when few labeled data are available. We conclude this thesis with an ambitious topic: learning representations that can separate the factors of variation present in the input signal. We propose a solution based on a bilinear ssRBM (with two groups of latent factors) and formulate the problem as one of “pooling” in complementary vector subspaces.