849 resultados para Constrained Clustering


Relevância:

60.00% 60.00%

Publicador:

Resumo:

Com a crescente geração, armazenamento e disseminação da informação nos últimos anos, o anterior problema de falta de informação transformou-se num problema de extracção do conhecimento útil a partir da informação disponível. As representações visuais da informação abstracta têm sido utilizadas para auxiliar a interpretação os dados e para revelar padrões de outra forma escondidos. A visualização de informação procura aumentar a cognição humana aproveitando as capacidades visuais humanas, de forma a tornar perceptível a informação abstracta, fornecendo os meios necessários para que um humano possa absorver quantidades crescentes de informação, com as suas capacidades de percepção. O objectivo das técnicas de agrupamento de dados consiste na divisão de um conjunto de dados em vários grupos, em que dados semelhantes são colocados no mesmo grupo e dados dissemelhantes em grupos diferentes. Mais especificamente, o agrupamento de dados com restrições tem o intuito de incorporar conhecimento a priori no processo de agrupamento de dados, com o objectivo de aumentar a qualidade do agrupamento de dados e, simultaneamente, encontrar soluções apropriadas a tarefas e interesses específicos. Nesta dissertação é estudado a abordagem de Agrupamento de Dados Visual Interactivo que permite ao utilizador, através da interacção com uma representação visual da informação, incorporar o seu conhecimento prévio acerca do domínio de dados, de forma a influenciar o agrupamento resultante para satisfazer os seus objectivos. Esta abordagem combina e estende técnicas de visualização interactiva de informação, desenho de grafos de forças direccionadas e agrupamento de dados com restrições. Com o propósito de avaliar o desempenho de diferentes estratégias de interacção com o utilizador, são efectuados estudos comparativos utilizando conjuntos de dados sintéticos e reais.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper seeks to address the problem of the empirical identification of housing market segmentation,once we assume that submarkets exist. The typical difficulty in identifying housing submarkets when dealing with many locations is the vast number of potential solutions and, in such cases, the use of the Chow test for hedonic functions is not a practical solution. Here, we solve this problem by undertaking an identification process with a heuristic for spatially constrained clustering, the"Housing Submarket Identifier" (HouSI). The solution is applied to the housing market in the city of Barcelona (Spain), where we estimate a hedonic model for fifty thousand dwellings aggregated into ten groups. In order to determine the utility of the procedure we seek to verify whether the final solution provided by the heuristic is comparable with the division of the city into ten administrative districts.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data are used appropriately and effectively, knowledge discovery can be better achieved than what is possible from only a single source. ^ Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources; representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. ^ The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools, SVM and KNN, were used to successfully distinguish between several soil samples. ^ The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to have a better performance than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. ^ The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. ^ The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database called PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface to allow biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. ^ The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.^

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper addresses the estimation of object boundaries from a set of 3D points. An extension of the constrained clustering algorithm developed by Abrantes and Marques in the context of edge linking is presented. The object surface is approximated using rectangular meshes and simplex nets. Centroid-based forces are used for attracting the model nodes towards the data, using competitive learning methods. It is shown that competitive learning improves the model performance in the presence of concavities and allows to discriminate close surfaces. The proposed model is evaluated using synthetic data and medical images (MRI and ultrasound images).

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern", are important high-throughput techniques for digital gene expression measurement. As other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space where the summation of the components is constrained. These properties are not present on regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques since they ignore certain fundamental properties of this space. Results Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster. Conclusion Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Different types of water bodies, including lakes, streams, and coastal marine waters, are often susceptible to fecal contamination from a range of point and nonpoint sources, and have been evaluated using fecal indicator microorganisms. The most commonly used fecal indicator is Escherichia coli, but traditional cultivation methods do not allow discrimination of the source of pollution. The use of triplex PCR offers an approach that is fast and inexpensive, and here enabled the identification of phylogroups. The phylogenetic distribution of E. coli subgroups isolated from water samples revealed higher frequencies of subgroups A1 and B23 in rivers impacted by human pollution sources, while subgroups D1 and D2 were associated with pristine sites, and subgroup B1 with domesticated animal sources, suggesting their use as a first screening for pollution source identification. A simple classification is also proposed based on phylogenetic subgroup distribution using the w-clique metric, enabling differentiation of polluted and unpolluted sites.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In the southern region of Mato Grosso do Sul state, Brazil, a foot-and-mouth disease (FMD) epidemic started in September 2005. A total of 33 outbreaks were detected and 33,741 FMD-susceptible animals were slaughtered and destroyed. There were no reports of FMD cases in other species than bovines. Based on the data of this epidemic, it was carried out an analysis using the K-function and it was observed spatial clustering of outbreaks within a range of 25km. This observation may be related to the dynamics of foot-and-mouth disease spread and to the measures undertaken to control the disease dissemination. The control measures were effective once the disease did not spread to farms more than 47 km apart from the initial outbreaks.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Gene clustering is a useful exploratory technique to group together genes with similar expression levels under distinct cell cycle phases or distinct conditions. It helps the biologist to identify potentially meaningful relationships between genes. In this study, we propose a clustering method based on multivariate normal mixture models, where the number of clusters is predicted via sequential hypothesis tests: at each step, the method considers a mixture model of m components (m = 2 in the first step) and tests if in fact it should be m - 1. If the hypothesis is rejected, m is increased and a new test is carried out. The method continues (increasing m) until the hypothesis is accepted. The theoretical core of the method is the full Bayesian significance test, an intuitive Bayesian approach, which needs no model complexity penalization nor positive probabilities for sharp hypotheses. Numerical experiments were based on a cDNA microarray dataset consisting of expression levels of 205 genes belonging to four functional categories, for 10 distinct strains of Saccharomyces cerevisiae. To analyze the method's sensitivity to data dimension, we performed principal components analysis on the original dataset and predicted the number of classes using 2 to 10 principal components. Compared to Mclust (model-based clustering), our method shows more consistent results.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We present a new analysis of J/psi production yields in deuteron-gold collisions at root s(NN) =200 GeV using data taken from the PHENIX experiment in 2003 and previously published in S. S. Adler [Phys. Rev. Lett 96, 012304 (2006)]. The high statistics proton-proton J/psi data taken in 2005 are used to improve the baseline measurement and thus construct updated cold nuclear matter modification factors (R(dAu)). A suppression of J/psi in cold nuclear matter is observed as one goes forward in rapidity (in the deuteron-going direction), corresponding to a region more sensitive to initial-state low-x gluons in the gold nucleus. The measured nuclear modification factors are compared to theoretical calculations of nuclear shadowing to which a J/psi (or precursor) breakup cross section is added. Breakup cross sections of sigma(breakup)=2.8(-1.4)(+1.7) (2.2(-1.5)(+1.6)) mb are obtained by fitting these calculations to the data using two different models of nuclear shadowing. These breakup cross-section values are consistent within large uncertainties with the 4.2 +/- 0.5 mb determined at lower collision energies. Projecting this range of cold nuclear matter effects to copper-copper and gold-gold collisions reveals that the current constraints are not sufficient to firmly quantify the additional hot nuclear matter effect.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Recent efforts in the characterization of air-water flows properties have included some clustering process analysis. A cluster of bubbles is defined as a group of two or more bubbles, with a distinct separation from other bubbles before and after the cluster. The present paper compares the results of clustering processes two hydraulic structures. That is, a large-size dropshaft and a hydraulic jump in a rectangular horizontal channel. The comparison highlighted some significant differences in clustering production and structures. Both dropshaft and hydraulic jump flows are complex turbulent shear flows, and some clustering index may provide some measure of the bubble-turbulence interactions and associated energy dissipation.

Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Activation of the human complement system of plasma proteins in response to infection or injury produces a 4-helix bundle glycoprotein (74 amino acids) known as C5a. C5a binds to G-protein-coupled receptors on cell surfaces triggering receptor-ligand internalization, signal transduction, and powerful inflammatory responses. Since excessive levels of C5a are associated with autoimmune and chronic inflammatory disorders, inhibitors of receptor activation may have therapeutic potential. We now report solution structures and receptor-binding and antagonist activities for some of the first small molecule antagonists of C5a derived from its hexapeptide C terminus. The antagonist NMe-Phe-Lys-Pro-D-Cha-Trp-D-Arg-CO2H (1) surprisingly shows an unusually well-defined solution structure as determined by H-1 NMR spectroscopy. This is one of the smallest acyclic peptides found to possess a defined solution conformation, which can be explained by the constraining role of intramolecular hydrogen bonding. NOE and coupling constant data, slow deuterium exchange, and a low dependence on temperature for the chemical shift of the D-Cha-NH strongly indicate an inverse gamma turn stabilized by a D-Cha-NH ... OC-Lys hydrogen bond. Smaller conformational populations are associated with a hydrogen bond between Trp-NH ... OC-Lys, defining a type II beta turn distorted by the inverse gamma turn incorporated within it. An excellent correlation between receptor-affinity and antagonist activity is indicated for a limited set of synthetic peptides. Conversion of the C-terminal carboxylate of 1 to an amide decreases antagonist potency 5-fold, but potency is increased up to 10-fold over 1 if the amide bond is made between the C-terminal carboxylate and a Lys/Orn side chain to form a cyclic analogue. The solution structure of cycle 6 also shows gamma and beta turns; however, the latter occurs in a different position, and there are clear conformational changes in 6 vs 1 that result in enhanced activity. These results indicate that potent C5a antagonists can be developed by targeting site 2 alone of the C5a receptor and define a novel pharmacophore for developing powerful receptor probes or drug candidates.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The high-affinity receptors for human granulocyte-macrophage colony-stimulating factor (GM-CSF), interleukin-1 (IL-3), and IL-5 are heterodimeric complexes consisting of cytokine-specific alpha subunits and a common signal-transducing beta subunit (h beta c). We have previously demonstrated the oncogenic potential of this group of receptors by identifying constitutively activating point mutations in the extracellular and transmembrane domains of h beta c. We report here a comprehensive screen of the entire h beta c molecule that has led to the identification of additional constitutive point mutations by virtue of their ability to confer factor independence on murine FDC-P1 cells. These mutations were clustered exclusively in a central region of h beta c that encompasses the extracellular membrane-proximal domain, transmembrane domain, and membrane-proximal region of the cytoplasmic domain. Interestingly, most h beta c mutants exhibited cell type-specific constitutive activity, with only two transmembrane domain mutants able to confer factor independence on both murine FDC-P1 and BAF-B03 cells. Examination of the biochemical properties of these mutants in FDC-P1 cells indicated that MAP kinase (ERK1/2), STAT, and JAK2 signaling molecules were constitutively activated. In contrast, only some of the mutant beta subunits were constitutively tyrosine phosphorylated. Taken together; these results highlight key regions involved in h beta c activation, dissociate h beta c tyrosine phosphorylation from MAP kinase and STAT activation, and suggest the involvement of distinct mechanisms by which proliferative signals can be generated by h beta c. (C) 1998 by The American Society of Hematology.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A chance constrained programming model is developed to assist Queensland barley growers make varietal and agronomic decisions in the face of changing product demands and volatile production conditions. Unsuitable or overlooked in many risk programming applications, the chance constrained programming approach nonetheless aptly captures the single-stage decision problem faced by barley growers of whether to plant lower-yielding but potentially higher-priced malting varieties, given a particular expectation of meeting malting grade standards. Different expectations greatly affect the optimal mix of malting and feed barley activities. The analysis highlights the suitability of chance constrained programming to this specific class of farm decision problem.