827 results for DISTANCE MATRICES
Abstract:
Given a metric space with a Borel probability measure, for each integer N, we obtain a probability distribution on N x N distance matrices by considering the distances between pairs of points in a sample consisting of N points chosen independently from the metric space with respect to the given measure. We show that this gives an asymptotically bi-Lipschitz relation between metric measure spaces and the corresponding distance matrices. This is an effective version of a result of Vershik that metric measure spaces are determined by associated distributions on infinite random matrices.
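The sampling construction in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `sample_distance_matrix`, the example space (unit circle with arc-length metric and uniform measure), and the seed are all assumptions made for the example.

```python
import math
import random

def sample_distance_matrix(sample_point, metric, n, seed=None):
    """Draw n independent points and return their n x n pairwise distance matrix."""
    rng = random.Random(seed)
    points = [sample_point(rng) for _ in range(n)]
    return [[metric(p, q) for q in points] for p in points]

# Hypothetical example space: the unit circle, with points stored as
# angles in [0, 2*pi], the uniform measure, and the arc-length metric.
def uniform_angle(rng):
    return rng.uniform(0.0, 2.0 * math.pi)

def arc_distance(a, b):
    gap = abs(a - b) % (2.0 * math.pi)
    return min(gap, 2.0 * math.pi - gap)

# One draw from the induced distribution on 5 x 5 distance matrices.
D = sample_distance_matrix(uniform_angle, arc_distance, n=5, seed=0)
```

Repeating the call with different seeds gives independent draws from the distribution on N x N distance matrices that the abstract refers to.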
Abstract:
A mathematical model for computing molecular similarity is proposed. The algorithmic technique for measuring the degree of similarity between pairs of three-dimensional chemical molecules is based on modified interatomic distance matrices. The work was carried out on an Indigo 2 workstation with Sybyl software. Four groups of molecules were used to compute molecular similarity and to test the mathematical model, with satisfactory results.
Abstract:
Bayesian clustering methods are typically used to identify barriers to gene flow, but they are prone to deduce artificial subdivisions in a study population characterized by an isolation-by-distance pattern (IbD). Here we analysed the landscape genetic structure of a population of wild boars (Sus scrofa) from south-western Germany. Two clustering methods inferred the presence of the same genetic discontinuity. However, the population in question was characterized by a strong IbD pattern. While landscape-resistance modelling failed to identify landscape features that influenced wild boar movement, partial Mantel tests and multiple regression of distance matrices (MRDMs) suggested that the empirically inferred clusters were separated by a genuine barrier. When simulating random lines bisecting the study area, 60% of the unique barriers represented, according to partial Mantel tests and MRDMs, significant obstacles to gene flow. By contrast, the random-lines simulation showed that the boundaries of the inferred empirical clusters corresponded to the most important genetic discontinuity in the study area. Given the degree of habitat fragmentation separating the two empirical partitions, it is likely that the clustering programs correctly identified a barrier to gene flow. The differing results between the work published here and other studies suggest that it will be very difficult to draw general conclusions about habitat permeability in wild boar from individual studies.
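For readers unfamiliar with the Mantel family of tests mentioned above, the following is a minimal sketch of a simple (not partial) one-sided Mantel permutation test between two distance matrices. It is not the procedure used in the study: the function names, the default of 999 permutations, and the one-sided alternative are assumptions chosen for illustration.

```python
import math
import random

def upper_triangle(matrix, order=None):
    """Flatten the strict upper triangle, optionally after relabelling the objects."""
    n = len(matrix)
    idx = list(range(n)) if order is None else order
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mantel_test(dist_a, dist_b, permutations=999, seed=None):
    """Return (r, p): the observed correlation and a one-sided permutation p-value."""
    rng = random.Random(seed)
    a = upper_triangle(dist_a)
    observed = pearson(a, upper_triangle(dist_b))
    order = list(range(len(dist_b)))
    hits = 0
    for _ in range(permutations):
        # Permute rows and columns of dist_b together by relabelling objects.
        rng.shuffle(order)
        if pearson(a, upper_triangle(dist_b, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

Rows and columns are permuted jointly because the entries of a distance matrix are not independent observations; shuffling individual entries would not give a valid null distribution.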
Abstract:
Ensis siliqua is regarded as an increasingly valuable fishery resource with potential for commercial aquaculture in many European countries. The genetic variation of this razor clam was analysed by randomly amplified polymorphic DNA (RAPD) in six populations from Spain, Portugal and Ireland. Out of the 40 primers tested, five were chosen to assess genetic variation. A total of 61 RAPD loci were developed ranging in size from 400 to 2000 bp. The percentages of polymorphic loci, the allele effective number and the genetic diversity were comparable among populations, and demonstrated a high level of genetic variability. The values of Nei's genetic distance were small among the Spanish and Portuguese populations (0.051-0.065), and high between these and the Irish populations. Cluster and principal coordinate analyses supported these findings. A Mantel test performed between geographic and genetic distance matrices showed a significant correlation (r=0.84, P
Abstract:
Doctoral thesis, Biology (Conservation Biology), Universidade de Lisboa, Faculdade de Ciências, 2015
Abstract:
Some sesquiterpene lactones (SLs) are the active compounds of a great number of traditionally medicinal plants from the Asteraceae family and possess considerable cytotoxic activity. Several in vitro studies have shown inhibitory activity against cells derived from human carcinoma of the nasopharynx (KB). Chemical studies showed that the cytotoxic activity is due to the reaction of alpha,beta-unsaturated carbonyl structures of the SLs with thiols, such as cysteine. These studies support the view that SLs inhibit tumour growth by selective alkylation of growth-regulatory biological macromolecules, such as key enzymes, which control cell division, thereby inhibiting a variety of cellular functions, which directs the cells into apoptosis. In this study we investigated a set of 55 different sesquiterpene lactones, represented by 5 skeletons (22 germacranolides, 6 elemanolides, 2 eudesmanolides, 16 guaianolides and nor-derivatives and 9 pseudoguaianolides), with respect to their cytotoxic properties. The experimental results and 3D molecular descriptors were submitted to a Kohonen self-organizing map (SOM) to classify (training set) and predict (test set) the cytotoxic activity. From the obtained results, it was concluded that only the geometrical descriptors showed satisfactory values. The Kohonen map obtained from the training set using 25 geometrical descriptors shows a very significant match, mainly among the inactive compounds (approximately 84%). Analyzing both groups, the percentage seen is high (83%). The test set shows the highest match, where 89% of the substances had their cytotoxic activity correctly predicted. From these results, important properties for the inhibition potency are discussed for the whole dataset and for subsets of the different structural skeletons. © 2008 Elsevier Masson SAS. All rights reserved.
Abstract:
Ecologists usually estimate means, but devote much less attention to variation. The study of variation is a key aspect to understand natural systems and to make predictions regarding them. In community ecology, most studies focus on local species diversity (alpha diversity), but only in recent decades have ecologists devoted proper attention to variation in community composition among sites (beta diversity). This is in spite of the fact that the first attempts to estimate beta diversity date back to the pioneering work by Koch and Whittaker in the 1950s. Progress in the last decade has been made in the development both of methods and of hypotheses about the origin and maintenance of variation in community composition. For instance, methods are available to partition total diversity in a region (gamma diversity), in a local component (alpha), and several beta diversities, each corresponding to one scale in a hierarchy. The popularization of the so-called raw-data approach (based on partial constrained ordination techniques) and the distance-based approach (based on correlation of dissimilarity/distance matrices) have allowed many ecologists to address current hypotheses about beta diversity patterns. Overall, these hypotheses are based on niche and neutral theory, accounting for the relative roles of environmental and spatial processes (or a combination of them) in shaping metacommunities. Recent studies have addressed these issues on a variety of spatial and temporal scales, habitats and taxonomic groups. Moreover, life history and functional traits of species such as dispersal abilities and rarity have begun to be considered in studies of beta diversity. In this article we briefly review some of these new tools and approaches developed in recent years, and illustrate them by using case studies in aquatic ecosystems.
Abstract:
To identify genetic susceptibility loci for severe diabetic retinopathy, 286 Mexican-Americans with type 2 diabetes from Starr County, Texas completed detailed physical and ophthalmologic examinations including fundus photography for diabetic retinopathy grading. 103 individuals with moderate-to-severe non-proliferative diabetic retinopathy or proliferative diabetic retinopathy were defined as cases for this study. DNA samples extracted from study subjects were genotyped using the Affymetrix GeneChip® Human Mapping 100K Set, which includes 116,204 single nucleotide polymorphisms (SNPs) across the whole genome. Single-marker allelic tests and 2- to 8-SNP sliding-window Haplotype Trend Regression implemented in HelixTree™ were first performed with these direct genotypes to identify genes/regions contributing to the risk of severe diabetic retinopathy. An additional 1,885,781 HapMap Phase II SNPs were imputed from the direct genotypes to expand the genomic coverage for a more detailed exploration of genetic susceptibility to diabetic retinopathy. The average estimated allelic dosage and imputed genotypes with the highest posterior probabilities were subsequently analyzed for associations using logistic regression and Fisher's Exact allelic tests, respectively. To move beyond these SNP-based approaches, 104,572 directly genotyped and 333,375 well-imputed SNPs were used to construct genetic distance matrices based on 262 retinopathy candidate genes and their 112 related biological pathways. Multivariate distance matrix regression was then used to test hypotheses with genes and pathways as the units of inference in the context of susceptibility to diabetic retinopathy. This study provides a framework for genome-wide association analyses, and implicated several genes involved in the regulation of oxidative stress, inflammatory processes, histidine metabolism, and pancreatic cancer pathways associated with severe diabetic retinopathy.
Many of these loci have not previously been implicated in either diabetic retinopathy or diabetes. In summary, CDC73, IL12RB2, and SULF1 had the best evidence as candidates to influence diabetic retinopathy, possibly through novel biological mechanisms related to VEGF-mediated signaling pathway or inflammatory processes. While this study uncovered some genes for diabetic retinopathy, a comprehensive picture of the genetic architecture of diabetic retinopathy has not yet been achieved. Once fully understood, the genetics and biology of diabetic retinopathy will contribute to better strategies for diagnosis, treatment and prevention of this disease.
Abstract:
In this paper, we investigate the use of manifold learning techniques to enhance the separation properties of standard graph kernels. The idea stems from the observation that when we perform multidimensional scaling on the distance matrices extracted from the kernels, the resulting data tends to be clustered along a curve that wraps around the embedding space, a behavior that suggests that long range distances are not estimated accurately, resulting in an increased curvature of the embedding space. Hence, we propose to use a number of manifold learning techniques to compute a low-dimensional embedding of the graphs in an attempt to unfold the embedding manifold, and increase the class separation. We perform an extensive experimental evaluation on a number of standard graph datasets using the shortest-path (Borgwardt and Kriegel, 2005), graphlet (Shervashidze et al., 2009), random walk (Kashima et al., 2003) and Weisfeiler-Lehman (Shervashidze et al., 2011) kernels. We observe the most significant improvement in the case of the graphlet kernel, which fits with the observation that neglecting the locational information of the substructures leads to a stronger curvature of the embedding manifold. On the other hand, the Weisfeiler-Lehman kernel partially mitigates the locality problem by using the node labels information, and thus does not clearly benefit from the manifold learning. Interestingly, our experiments also show that the unfolding of the space seems to reduce the performance gap between the examined kernels.
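The abstract does not say how distance matrices are extracted from the kernels. One standard choice, assumed here purely for illustration, is the kernel-induced distance d(i, j) = sqrt(K[i][i] + K[j][j] - 2 K[i][j]). The function name `kernel_to_distance` is hypothetical.

```python
import math

def kernel_to_distance(K):
    """Convert a symmetric kernel (Gram) matrix into kernel-induced distances.

    Assumption: d(i, j)^2 = K[i][i] + K[j][j] - 2 * K[i][j], the squared
    distance between feature-space images of objects i and j.
    """
    n = len(K)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            squared = K[i][i] + K[j][j] - 2.0 * K[i][j]
            # Clamp tiny negative values caused by rounding.
            D[i][j] = math.sqrt(max(squared, 0.0))
    return D

# Linear kernel on the points (0, 0), (3, 4), (6, 8), written out by hand.
K = [[0.0, 0.0, 0.0],
     [0.0, 25.0, 50.0],
     [0.0, 50.0, 100.0]]
D = kernel_to_distance(K)  # D[0][1] == 5.0, D[0][2] == 10.0
```

The resulting matrix is the kind of input that multidimensional scaling or a manifold learning method would then embed in a low-dimensional space.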
Abstract:
In this work, we study a version of the general question of how well a Haar-distributed orthogonal matrix can be approximated by a random Gaussian matrix. Here, we consider a Gaussian random matrix (Formula presented.) of order n and apply to it the Gram–Schmidt orthonormalization procedure by columns to obtain a Haar-distributed orthogonal matrix (Formula presented.). If (Formula presented.) denotes the vector formed by the first m-coordinates of the ith row of (Formula presented.) and (Formula presented.), our main result shows that the Euclidean norm of (Formula presented.) converges exponentially fast to (Formula presented.), up to negligible terms. To show the extent of this result, we use it to study the convergence of the supremum norm (Formula presented.) and we find a coupling that improves by a factor (Formula presented.) the recently proved best known upper bound on (Formula presented.). Our main result also has applications in Quantum Information Theory.
Abstract:
Reorganizing a dataset so that its hidden structure can be observed is useful in any data analysis task. For example, detecting a regularity in a dataset helps us to interpret the data, compress the data, and explain the processes behind the data. We study datasets that come in the form of binary matrices (tables with 0s and 1s). Our goal is to develop automatic methods that bring out certain patterns by permuting the rows and columns. We concentrate on the following patterns in binary matrices: consecutive-ones (C1P), simultaneous consecutive-ones (SC1P), nestedness, k-nestedness, and bandedness. These patterns reflect specific types of interplay and variation between the rows and columns, such as continuity and hierarchies. Furthermore, their combinatorial properties are interlinked, which helps us to develop the theory of binary matrices and efficient algorithms. Indeed, we can detect all these patterns in a binary matrix efficiently, that is, in polynomial time in the size of the matrix. Since real-world datasets often contain noise and errors, we rarely witness perfect patterns. Therefore we also need to assess how far an input matrix is from a pattern: we count the number of flips (from 0s to 1s or vice versa) needed to bring out the perfect pattern in the matrix. Unfortunately, for most patterns it is an NP-complete problem to find the minimum distance to a matrix that has the perfect pattern, which means that the existence of a polynomial-time algorithm is unlikely. To find patterns in datasets with noise, we need methods that are noise-tolerant and work in practical time with large datasets. The theory of binary matrices gives rise to robust heuristics that have good performance with synthetic data and discover easily interpretable structures in real-world datasets: dialectical variation in the spoken Finnish language, division of European locations by the hierarchies found in mammal occurrences, and co-occurring groups in network data.
In addition to determining the distance from a dataset to a pattern, we need to determine whether the pattern is significant or merely a chance occurrence. To this end, we use significance testing: we deem a dataset significant if it appears exceptional when compared to datasets generated from a certain null hypothesis. After detecting a significant pattern in a dataset, it is up to domain experts to interpret the results in terms of the application.
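To make the consecutive-ones pattern and the idea of counting flips concrete, here is a deliberately simplified sketch. It checks rows under the given column order only, and counts only the 0-to-1 flips that fill gaps. It is not the minimum flip distance discussed in the abstract, which ranges over all row and column permutations and both flip directions. The function names are hypothetical.

```python
def row_has_consecutive_ones(row):
    """True if all 1s in the row form one contiguous block (or there are none)."""
    ones = [i for i, value in enumerate(row) if value == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def has_consecutive_ones(matrix):
    """True if every row is consecutive-ones under the current column order."""
    return all(row_has_consecutive_ones(row) for row in matrix)

def gap_fill_flips(matrix):
    """Count 0 -> 1 flips needed to fill the gaps between the first and last 1 of each row.

    This is one way to repair the pattern for a fixed column order,
    not the minimum over all permutations and flip directions.
    """
    total = 0
    for row in matrix:
        ones = [i for i, value in enumerate(row) if value == 1]
        if ones:
            total += (ones[-1] - ones[0] + 1) - len(ones)
    return total

noisy = [[1, 0, 1, 0],
         [0, 1, 1, 1],
         [0, 0, 0, 0]]
flips = gap_fill_flips(noisy)  # the single gap in the first row
```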
Abstract:
Acoustic modeling using mixtures of multivariate Gaussians is the prevalent approach for many speech processing problems. Computing likelihoods against a large set of Gaussians is required as a part of many speech processing systems and it is the computationally dominant phase for Large Vocabulary Continuous Speech Recognition (LVCSR) systems. We express the likelihood computation as a multiplication of matrices representing augmented feature vectors and Gaussian parameters. The computational gain of this approach over traditional methods comes from exploiting the structure of these matrices and an efficient implementation of their multiplication. In particular, we explore direct low-rank approximation of the Gaussian parameter matrix and indirect derivation of low-rank factors of the Gaussian parameter matrix by optimum approximation of the likelihood matrix. We show that both methods lead to similar speedups but the latter has far less impact on the recognition accuracy. Experiments on the 1,138-word vocabulary RM1 task and the 6,224-word vocabulary TIMIT task using the Sphinx 3.7 system show that, for a typical case, the matrix multiplication based approach leads to an overall speedup of 46 % on the RM1 task and 115 % for the TIMIT task. Our low-rank approximation methods provide a way of trading off recognition accuracy for a further increase in computational performance, extending overall speedups up to 61 % for RM1 and 119 % for TIMIT for an increase of word error rate (WER) from 3.2 to 3.5 % for RM1 and for no increase in WER for TIMIT. We also express pairwise Euclidean distance computation phase in Dynamic Time Warping (DTW) in terms of matrix multiplication leading to saving of approximately of computational operations. In our experiments using efficient implementation of matrix multiplication, this leads to a speedup of 5.6 in computing the pairwise Euclidean distances and overall speedup up to 3.25 for DTW.
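The idea of expressing pairwise Euclidean distances through a matrix product rests on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y. The sketch below shows the identity in plain Python; it is not the paper's implementation, and the names `matmul_transpose` and `pairwise_sq_euclidean` are hypothetical. In practice the cross-term product would be handed to an optimized matrix multiplication routine, which is where a speedup would come from.

```python
def matmul_transpose(A, B):
    """Return the product A * B^T for two lists of row vectors."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B]
            for row_a in A]

def pairwise_sq_euclidean(X, Y):
    """Squared Euclidean distances between every row of X and every row of Y."""
    cross = matmul_transpose(X, Y)             # all inner products x_i . y_j
    x_norms = [sum(v * v for v in x) for x in X]
    y_norms = [sum(v * v for v in y) for y in Y]
    return [[x_norms[i] + y_norms[j] - 2.0 * cross[i][j]
             for j in range(len(Y))]
            for i in range(len(X))]

X = [[0.0, 0.0], [1.0, 1.0]]
Y = [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]]
D2 = pairwise_sq_euclidean(X, Y)
# D2 == [[25.0, 2.0, 4.0], [13.0, 0.0, 2.0]]
```

With floating-point inputs the subtraction can produce tiny negative values for near-identical vectors, so a caller that takes square roots may want to clamp at zero first.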
Abstract:
This paper addresses the problem of low-rank distance matrix completion. This problem amounts to recovering the missing entries of a distance matrix when the dimension of the data embedding space is possibly unknown but small compared to the number of considered data points. The focus is on high-dimensional problems. We recast the considered problem into an optimization problem over the set of low-rank positive semidefinite matrices and propose two efficient algorithms for low-rank distance matrix completion. In addition, we propose a strategy to determine the dimension of the embedding space. The resulting algorithms scale to high-dimensional problems and monotonically converge to a global solution of the problem. Finally, numerical experiments illustrate the good performance of the proposed algorithms on benchmarks. © 2011 IEEE.
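One common way to write such a recasting is given below. It is a sketch under stated assumptions and not necessarily the exact cost function used in the paper. The symbols are assumptions: D_{ij} denotes the observed squared distances, \Omega the set of observed index pairs, W the Gram matrix of the embedded points, and r a bound on the embedding dimension.

```latex
\min_{W \succeq 0,\ \operatorname{rank}(W) \le r}
\quad \sum_{(i,j) \in \Omega}
\bigl( D_{ij} - ( W_{ii} + W_{jj} - 2 W_{ij} ) \bigr)^{2}
```

The term W_{ii} + W_{jj} - 2 W_{ij} is the squared distance between points i and j when W = Y Y^T for an embedding Y with r columns, which is why a rank bound on W corresponds to a bound on the dimension of the embedding space.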
Abstract:
The paper addresses the problem of learning a regression model parameterized by a fixed-rank positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on scalability to high-dimensional problems. The mathematical developments rely on the theory of gradient descent algorithms adapted to the Riemannian geometry that underlies the set of fixed-rank positive semidefinite matrices. In contrast with previous contributions in the literature, no restrictions are imposed on the range space of the learned matrix. The resulting algorithms maintain a linear complexity in the problem size and enjoy important invariance properties. We apply the proposed algorithms to the problem of learning a distance function parameterized by a positive semidefinite matrix. Good performance is observed on classical benchmarks. © 2011 Gilles Meyer, Silvere Bonnabel and Rodolphe Sepulchre.
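A distance function parameterized by a fixed-rank positive semidefinite matrix can be evaluated without ever forming the full matrix. The sketch below assumes the factorization W = G G^T with G of size d x r, so that (x - y)^T W (x - y) = ||G^T (x - y)||^2. The function name `low_rank_sq_distance` and the factor layout are assumptions for illustration; the learning algorithm itself is not shown.

```python
def low_rank_sq_distance(x, y, G):
    """Evaluate (x - y)^T G G^T (x - y) for a d x r factor G given as a list of rows.

    The cost is O(d * r) per pair, since the d x d matrix W = G G^T
    is never built explicitly.
    """
    diff = [a - b for a, b in zip(x, y)]
    rank = len(G[0])
    # Project the difference vector onto the r columns of G.
    projected = [sum(diff[k] * G[k][c] for k in range(len(diff)))
                 for c in range(rank)]
    return sum(p * p for p in projected)

# Hypothetical rank-2 factor in three dimensions: the third coordinate is ignored.
G = [[1.0, 0.0],
     [0.0, 2.0],
     [0.0, 0.0]]
d2 = low_rank_sq_distance([1.0, 2.0, 3.0], [4.0, 6.0, 8.0], G)  # 73.0
```

Because G has fewer columns than the data has dimensions, directions outside the column space of G contribute nothing to the distance.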
Abstract:
This paper introduces a new metric and mean on the set of positive semidefinite matrices of fixed-rank. The proposed metric is derived from a well-chosen Riemannian quotient geometry that generalizes the reductive geometry of the positive cone and the associated natural metric. The resulting Riemannian space has strong geometrical properties: it is geodesically complete, and the metric is invariant with respect to all transformations that preserve angles (orthogonal transformations, scalings, and pseudoinversion). A meaningful approximation of the associated Riemannian distance is proposed, that can be efficiently numerically computed via a simple algorithm based on SVD. The induced mean preserves the rank, possesses the most desirable characteristics of a geometric mean, and is easy to compute. © 2009 Society for Industrial and Applied Mathematics.