811 resultados para hierarchical clustering
Resumo:
Spatial data warehouses (SDWs) allow for spatial analysis together with analytical multidimensional queries over huge volumes of data. The challenge is to retrieve data related to ad hoc spatial query windows according to spatial predicates, avoiding the high cost of joining large tables. Therefore, mechanisms to provide efficient query processing over SDWs are essential. In this paper, we propose two efficient indices for SDW: the SB-index and the HSB-index. The proposed indices share the following characteristics. They enable multidimensional queries with spatial predicate for SDW and also support predefined spatial hierarchies. Furthermore, they compute the spatial predicate and transform it into a conventional one, which can be evaluated together with other conventional predicates by accessing a star-join Bitmap index. While the SB-index has a sequential data structure, the HSB-index uses a hierarchical data structure to enable spatial objects clustering and a specialized buffer-pool to decrease the number of disk accesses. The advantages of the SB-index and the HSB-index over the DBMS resources for SDW indexing (i.e. star-join computation and materialized views) were investigated through performance tests, which issued roll-up operations extended with containment and intersection range queries. The performance results showed that improvements ranged from 68% up to 99% over both the star-join computation and the materialized view. Furthermore, the proposed indices proved to be very compact, adding only less than 1% to the storage requirements. Therefore, both the SB-index and the HSB-index are excellent choices for SDW indexing. Choosing between the SB-index and the HSB-index mainly depends on the query selectivity of spatial predicates. While low query selectivity benefits the HSB-index, the SB-index provides better performance for higher query selectivity.
Resumo:
This paper addresses the m-machine no-wait flow shop problem where the set-up time of a job is separated from its processing time. The performance measure considered is the total flowtime. A new hybrid metaheuristic Genetic Algorithm-Cluster Search is proposed to solve the scheduling problem. The performance of the proposed method is evaluated and the results are compared with the best method reported in the literature. Experimental tests show superiority of the new method for the test problems set, regarding the solution quality. (c) 2012 Elsevier Ltd. All rights reserved.
Resumo:
Objective: To assess the risk factors for delayed diagnosis of uterine cervical lesions. Materials and Methods: This is a case-control study that recruited 178 women at 2 Brazilian hospitals. The cases (n = 74) were composed of women with a late diagnosis of a lesion in the uterine cervix (invasive carcinoma in any stage). The controls (n = 104) were composed of women with cervical lesions diagnosed early on (low-or high-grade intraepithelial lesions). The analysis was performed by means of logistic regression model using a hierarchical model. The socioeconomic and demographic variables were included at level I (distal). Level II (intermediate) included the personal and family antecedents and knowledge about the Papanicolaou test and human papillomavirus. Level III (proximal) encompassed the variables relating to individuals' care for their own health, gynecologic symptoms, and variables relating to access to the health care system. Results: The risk factors for late diagnosis of uterine cervical lesions were age older than 40 years (odds ratio [OR] = 10.4; 95% confidence interval [CI], 2.3-48.4), not knowing the difference between the Papanicolaou test and gynecological pelvic examinations (OR, = 2.5; 95% CI, 1.3-4.9), not thinking that the Papanicolaou test was important (odds ratio [OR], 4.2; 95% CI, 1.3-13.4), and abnormal vaginal bleeding (OR, 15.0; 95% CI, 6.5-35.0). Previous treatment for sexually transmissible disease was a protective factor (OR, 0.3; 95% CI, 0.1-0.8) for delayed diagnosis. Conclusions: Deficiencies in cervical cancer prevention programs in developing countries are not simply a matter of better provision and coverage of Papanicolaou tests. The misconception about the Papanicolaou test is a serious educational problem, as demonstrated by the present study.
Resumo:
Abstract Background Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern", are important high-throughput techniques for digital gene expression measurement. As other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space where the summation of the components is constrained. These properties are not present on regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques since they ignore certain fundamental properties of this space. Results Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster. Conclusion Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
Resumo:
Background: A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes. Results: In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence. Conclusions: This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them.
Resumo:
Spin systems in the presence of disorder are described by two sets of degrees of freedom, associated with orientational (spin) and disorder variables, which may be characterized by two distinct relaxation times. Disordered spin models have been mostly investigated in the quenched regime, which is the usual situation in solid state physics, and in which the relaxation time of the disorder variables is much larger than the typical measurement times. In this quenched regime, disorder variables are fixed, and only the orientational variables are duly thermalized. Recent studies in the context of lattice statistical models for the phase diagrams of nematic liquid-crystalline systems have stimulated the interest of going beyond the quenched regime. The phase diagrams predicted by these calculations for a simple Maier-Saupe model turn out to be qualitative different from the quenched case if the two sets of degrees of freedom are allowed to reach thermal equilibrium during the experimental time, which is known as the fully annealed regime. In this work, we develop a transfer matrix formalism to investigate annealed disordered Ising models on two hierarchical structures, the diamond hierarchical lattice (DHL) and the Apollonian network (AN). The calculations follow the same steps used for the analysis of simple uniform systems, which amounts to deriving proper recurrence maps for the thermodynamic and magnetic variables in terms of the generations of the construction of the hierarchical structures. In this context, we may consider different kinds of disorder, and different types of ferromagnetic and anti-ferromagnetic interactions. In the present work, we analyze the effects of dilution, which are produced by the removal of some magnetic ions. The system is treated in a “grand canonical" ensemble. The introduction of two extra fields, related to the concentration of two different types of particles, leads to higher-rank transfer matrices as compared with the formalism for the usual uniform models. Preliminary calculations on a DHL indicate that there is a phase transition for a wide range of dilution concentrations. Ising spin systems on the AN are known to be ferromagnetically ordered at all temperatures; in the presence of dilution, however, there are indications of a disordered (paramagnetic) phase at low concentrations of magnetic ions.
Resumo:
Hierarchical multi-label classification is a complex classification task where the classes involved in the problem are hierarchically structured and each example may simultaneously belong to more than one class in each hierarchical level. In this paper, we extend our previous works, where we investigated a new local-based classification method that incrementally trains a multi-layer perceptron for each level of the classification hierarchy. Predictions made by a neural network in a given level are used as inputs to the neural network responsible for the prediction in the next level. We compare the proposed method with one state-of-the-art decision-tree induction method and two decision-tree induction methods, using several hierarchical multi-label classification datasets. We perform a thorough experimental analysis, showing that our method obtains competitive results to a robust global method regarding both precision and recall evaluation measures.
Resumo:
The present work proposes a method based on CLV (Clustering around Latent Variables) for identifying groups of consumers in L-shape data. This kind of datastructure is very common in consumer studies where a panel of consumers is asked to assess the global liking of a certain number of products and then, preference scores are arranged in a two-way table Y. External information on both products (physicalchemical description or sensory attributes) and consumers (socio-demographic background, purchase behaviours or consumption habits) may be available in a row descriptor matrix X and in a column descriptor matrix Z respectively. The aim of this method is to automatically provide a consumer segmentation where all the three matrices play an active role in the classification, getting homogeneous groups from all points of view: preference, products and consumer characteristics. The proposed clustering method is illustrated on data from preference studies on food products: juices based on berry fruits and traditional cheeses from Trentino. The hedonic ratings given by the consumer panel on the products under study were explained with respect to the product chemical compounds, sensory evaluation and consumer socio-demographic information, purchase behaviour and consumption habits.
Resumo:
The main aim of this Ph.D. dissertation is the study of clustering dependent data by means of copula functions with particular emphasis on microarray data. Copula functions are a popular multivariate modeling tool in each field where the multivariate dependence is of great interest and their use in clustering has not been still investigated. The first part of this work contains the review of the literature of clustering methods, copula functions and microarray experiments. The attention focuses on the K–means (Hartigan, 1975; Hartigan and Wong, 1979), the hierarchical (Everitt, 1974) and the model–based (Fraley and Raftery, 1998, 1999, 2000, 2007) clustering techniques because their performance is compared. Then, the probabilistic interpretation of the Sklar’s theorem (Sklar’s, 1959), the estimation methods for copulas like the Inference for Margins (Joe and Xu, 1996) and the Archimedean and Elliptical copula families are presented. In the end, applications of clustering methods and copulas to the genetic and microarray experiments are highlighted. The second part contains the original contribution proposed. A simulation study is performed in order to evaluate the performance of the K–means and the hierarchical bottom–up clustering methods in identifying clusters according to the dependence structure of the data generating process. Different simulations are performed by varying different conditions (e.g., the kind of margins (distinct, overlapping and nested) and the value of the dependence parameter ) and the results are evaluated by means of different measures of performance. In light of the simulation results and of the limits of the two investigated clustering methods, a new clustering algorithm based on copula functions (‘CoClust’ in brief) is proposed. The basic idea, the iterative procedure of the CoClust and the description of the written R functions with their output are given. The CoClust algorithm is tested on simulated data (by varying the number of clusters, the copula models, the dependence parameter value and the degree of overlap of margins) and is compared with the performance of model–based clustering by using different measures of performance, like the percentage of well–identified number of clusters and the not rejection percentage of H0 on . It is shown that the CoClust algorithm allows to overcome all observed limits of the other investigated clustering techniques and is able to identify clusters according to the dependence structure of the data independently of the degree of overlap of margins and the strength of the dependence. The CoClust uses a criterion based on the maximized log–likelihood function of the copula and can virtually account for any possible dependence relationship between observations. Many peculiar characteristics are shown for the CoClust, e.g. its capability of identifying the true number of clusters and the fact that it does not require a starting classification. Finally, the CoClust algorithm is applied to the real microarray data of Hedenfalk et al. (2001) both to the gene expressions observed in three different cancer samples and to the columns (tumor samples) of the whole data matrix.
Resumo:
[EN]We present a new strategy for constructing spline spaces over hierarchical T-meshes with quad- and octree subdivision scheme. The proposed technique includes some simple rules for inferring local knot vectors to define C 2 -continuous cubic tensor product spline blending functions. Our conjecture is that these rules allow to obtain, for a given T-mesh, a set of linearly independent spline functions with the property that spaces spanned by nested T-meshes are also nested, and therefore, the functions can reproduce cubic polynomials. In order to span spaces with these properties applying the proposed rules, the T-mesh should fulfill the only requirement of being a 0- balanced mesh...
Resumo:
The intensity of regional specialization in specific activities, and conversely, the level of industrial concentration in specific locations, has been used as a complementary evidence for the existence and significance of externalities. Additionally, economists have mainly focused the debate on disentangling the sources of specialization and concentration processes according to three vectors: natural advantages, internal, and external scale economies. The arbitrariness of partitions plays a key role in capturing these effects, while the selection of the partition would have to reflect the actual characteristics of the economy. Thus, the identification of spatial boundaries to measure specialization becomes critical, since most likely the model will be adapted to different scales of distance, and be influenced by different types of externalities or economies of agglomeration, which are based on the mechanisms of interaction with particular requirements of spatial proximity. This work is based on the analysis of the spatial aspect of economic specialization supported by the manufacturing industry case. The main objective is to propose, for discrete and continuous space: i) a measure of global specialization; ii) a local disaggregation of the global measure; and iii) a spatial clustering method for the identification of specialized agglomerations.
Resumo:
This PhD Thesis is part of a long-term wide research project, carried out by the "Osservatorio Astronomico di Bologna (INAF-OABO)", that has as primary goal the comprehension and reconstruction of formation mechanism of galaxies and their evolution history. There is now substantial evidence, both from theoretical and observational point of view, in favor of the hypothesis that the halo of our Galaxy has been at least partially, built up by the progressive accretion of small fragments, similar in nature to the present day dwarf galaxies of the Local Group. In this context, the photometric and spectroscopic study of systems which populate the halo of our Galaxy (i.e. dwarf spheroidal galaxy, tidal streams, massive globular cluster, etc) permits to discover, not only the origin and behaviour of these systems, but also the structure of our Galactic halo, combined with its formation history. In fact, the study of the population of these objects and also of their chemical compositions, age, metallicities and velocity dispersion, permit us not only an improvement in the understanding of the mechanisms that govern the Galactic formation, but also a valid indirect test for cosmological model itself. Specifically, in this Thesis we provided a complete characterization of the tidal Stream of the Sagittarius dwarf spheroidal galaxy, that is the most striking example of the process of tidal disruption and accretion of a dwarf satellite in to our Galaxy. Using Red Clump stars, extracted from the catalogue of the Sloan Digital Sky Survey (SDSS) we obtained an estimate of the distance, the depth along the line of sight and of the number density for each detected portion of the Stream (and more in general for each detected structure along our line of sight). Moreover comparing the relative number (i.e. the ratio) of Blue Horizontal Branch stars and Red Clump stars (the two features are tracers of different age/different metallicity populations) in the main body of the galaxy and in the Stream, in order to verify the presence of an age-metallicity gradient along the Stream. We also report the detection of a population of Red Clump stars probably associated with the recently discovered Bootes III stellar system. Finally, we also present the results of a survey of radial velocities over a wide region, extending from r ~ 10' out to r ~ 80' within the massive star cluster Omega Centauri. The survey was performed with FLAMES@VLT, to study the velocity dispersion profile in the outer regions of this stellar system. All the results presented in this Thesis, have already been published in refeered journals.