26 resultados para Labeling hierarchical clustering
Resumo:
The automatic disambiguation of word senses (i.e., the identification of which of the meanings is used in a given context for a word that has multiple meanings) is essential for such applications as machine translation and information retrieval, and represents a key step for developing the so-called Semantic Web. Humans disambiguate words in a straightforward fashion, but this does not apply to computers. In this paper we address the problem of Word Sense Disambiguation (WSD) by treating texts as complex networks, and show that word senses can be distinguished upon characterizing the local structure around ambiguous words. Our goal was not to obtain the best possible disambiguation system, but we nevertheless found that in half of the cases our approach outperforms traditional shallow methods. We show that the hierarchical connectivity and clustering of words are usually the most relevant features for WSD. The results reported here shed light on the relationship between semantic and structural parameters of complex networks. They also indicate that when combined with traditional techniques the complex network approach may be useful to enhance the discrimination of senses in large texts. Copyright (C) EPLA, 2012
Resumo:
The mechanisms responsible for containing activity in systems represented by networks are crucial in various phenomena, for example, in diseases such as epilepsy that affect the neuronal networks and for information dissemination in social networks. The first models to account for contained activity included triggering and inhibition processes, but they cannot be applied to social networks where inhibition is clearly absent. A recent model showed that contained activity can be achieved with no need of inhibition processes provided that the network is subdivided into modules (communities). In this paper, we introduce a new concept inspired in the Hebbian theory, through which containment of activity is achieved by incorporating a dynamics based on a decaying activity in a random walk mechanism preferential to the node activity. Upon selecting the decay coefficient within a proper range, we observed sustained activity in all the networks tested, namely, random, Barabasi-Albert and geographical networks. The generality of this finding was confirmed by showing that modularity is no longer needed if the dynamics based on the integrate-and-fire dynamics incorporated the decay factor. Taken together, these results provide a proof of principle that persistent, restrained network activation might occur in the absence of any particular topological structure. This may be the reason why neuronal activity does not spread out to the entire neuronal network, even when no special topological organization exists.
Resumo:
Objective. We aimed to evaluate whether the differential gene expression profiles of patients with rheumatoid arthritis (RA) could distinguish responders from nonresponders to methotrexate (MTX) and, in the case of MTX nonresponders, responsiveness to MTX plus anti-tumor necrosis factor-alpha (anti-TNF) combined therapy. Methods. We evaluated 25 patients with RA taking MTX 15-20 mg/week as a monotherapy (8 responders and 17 nonresponders). All MTX nonresponders received intliximab and were reassessed after 20 weeks to evaluate their anti-TNF responsiveness using the European League Against Rheumatism response criteria. A differential gene expression analysis from peripheral blood mononuclear cells was performed in terms of hierarchical gene clustering, and an evaluation of differentially expressed genes was performed using the significance analysis of microarrays program. Results. Hierarchical gene expression clustering discriminated MTX responders from nonresponders, and MTX plus anti-TNF responders from nonresponders. The evaluation of only highly modulated genes (fold change > 1.3 or < 0.7) yielded 5 induced (4 antiapoptotic and CCL4) and 4 repressed (4 proapoptotic) genes in MTX nonresponders compared to responders. In MTX plus anti-TNF nonresponders, the CCL4, CD83, and BCL2A1 genes were induced in relation to responders. Conclusion. Study of the gene expression profiles of RA peripheral blood cells permitted differentiation of responders from nonresponders to MTX and anti-TNF. Several candidate genes in MTX non-responders (CCL4, HTRA2, PRKCD, BCL2A1, CAV1, TNIP1 CASP8AP2, MXD1, and BTG2) and 3 genes in MTX plus anti-TNF nonresponders (CCL4, CD83, and BCL2A1) were identified for further study. (First Release July 1 2012; J Rheumatol 2012;39:1524-32; doi:10.3899/jrheum.120092)
Resumo:
Spatial data warehouses (SDWs) allow for spatial analysis together with analytical multidimensional queries over huge volumes of data. The challenge is to retrieve data related to ad hoc spatial query windows according to spatial predicates, avoiding the high cost of joining large tables. Therefore, mechanisms to provide efficient query processing over SDWs are essential. In this paper, we propose two efficient indices for SDW: the SB-index and the HSB-index. The proposed indices share the following characteristics. They enable multidimensional queries with spatial predicate for SDW and also support predefined spatial hierarchies. Furthermore, they compute the spatial predicate and transform it into a conventional one, which can be evaluated together with other conventional predicates by accessing a star-join Bitmap index. While the SB-index has a sequential data structure, the HSB-index uses a hierarchical data structure to enable spatial objects clustering and a specialized buffer-pool to decrease the number of disk accesses. The advantages of the SB-index and the HSB-index over the DBMS resources for SDW indexing (i.e. star-join computation and materialized views) were investigated through performance tests, which issued roll-up operations extended with containment and intersection range queries. The performance results showed that improvements ranged from 68% up to 99% over both the star-join computation and the materialized view. Furthermore, the proposed indices proved to be very compact, adding only less than 1% to the storage requirements. Therefore, both the SB-index and the HSB-index are excellent choices for SDW indexing. Choosing between the SB-index and the HSB-index mainly depends on the query selectivity of spatial predicates. While low query selectivity benefits the HSB-index, the SB-index provides better performance for higher query selectivity.
Resumo:
This paper addresses the m-machine no-wait flow shop problem where the set-up time of a job is separated from its processing time. The performance measure considered is the total flowtime. A new hybrid metaheuristic Genetic Algorithm-Cluster Search is proposed to solve the scheduling problem. The performance of the proposed method is evaluated and the results are compared with the best method reported in the literature. Experimental tests show superiority of the new method for the test problems set, regarding the solution quality. (c) 2012 Elsevier Ltd. All rights reserved.
Resumo:
Objective: To assess the risk factors for delayed diagnosis of uterine cervical lesions. Materials and Methods: This is a case-control study that recruited 178 women at 2 Brazilian hospitals. The cases (n = 74) were composed of women with a late diagnosis of a lesion in the uterine cervix (invasive carcinoma in any stage). The controls (n = 104) were composed of women with cervical lesions diagnosed early on (low-or high-grade intraepithelial lesions). The analysis was performed by means of logistic regression model using a hierarchical model. The socioeconomic and demographic variables were included at level I (distal). Level II (intermediate) included the personal and family antecedents and knowledge about the Papanicolaou test and human papillomavirus. Level III (proximal) encompassed the variables relating to individuals' care for their own health, gynecologic symptoms, and variables relating to access to the health care system. Results: The risk factors for late diagnosis of uterine cervical lesions were age older than 40 years (odds ratio [OR] = 10.4; 95% confidence interval [CI], 2.3-48.4), not knowing the difference between the Papanicolaou test and gynecological pelvic examinations (OR, = 2.5; 95% CI, 1.3-4.9), not thinking that the Papanicolaou test was important (odds ratio [OR], 4.2; 95% CI, 1.3-13.4), and abnormal vaginal bleeding (OR, 15.0; 95% CI, 6.5-35.0). Previous treatment for sexually transmissible disease was a protective factor (OR, 0.3; 95% CI, 0.1-0.8) for delayed diagnosis. Conclusions: Deficiencies in cervical cancer prevention programs in developing countries are not simply a matter of better provision and coverage of Papanicolaou tests. The misconception about the Papanicolaou test is a serious educational problem, as demonstrated by the present study.
Resumo:
Abstract Background Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern", are important high-throughput techniques for digital gene expression measurement. As other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space where the summation of the components is constrained. These properties are not present on regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques since they ignore certain fundamental properties of this space. Results Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster. Conclusion Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
Resumo:
Background: A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes. Results: In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence. Conclusions: This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them.
Resumo:
Spin systems in the presence of disorder are described by two sets of degrees of freedom, associated with orientational (spin) and disorder variables, which may be characterized by two distinct relaxation times. Disordered spin models have been mostly investigated in the quenched regime, which is the usual situation in solid state physics, and in which the relaxation time of the disorder variables is much larger than the typical measurement times. In this quenched regime, disorder variables are fixed, and only the orientational variables are duly thermalized. Recent studies in the context of lattice statistical models for the phase diagrams of nematic liquid-crystalline systems have stimulated the interest of going beyond the quenched regime. The phase diagrams predicted by these calculations for a simple Maier-Saupe model turn out to be qualitative different from the quenched case if the two sets of degrees of freedom are allowed to reach thermal equilibrium during the experimental time, which is known as the fully annealed regime. In this work, we develop a transfer matrix formalism to investigate annealed disordered Ising models on two hierarchical structures, the diamond hierarchical lattice (DHL) and the Apollonian network (AN). The calculations follow the same steps used for the analysis of simple uniform systems, which amounts to deriving proper recurrence maps for the thermodynamic and magnetic variables in terms of the generations of the construction of the hierarchical structures. In this context, we may consider different kinds of disorder, and different types of ferromagnetic and anti-ferromagnetic interactions. In the present work, we analyze the effects of dilution, which are produced by the removal of some magnetic ions. The system is treated in a “grand canonical" ensemble. The introduction of two extra fields, related to the concentration of two different types of particles, leads to higher-rank transfer matrices as compared with the formalism for the usual uniform models. Preliminary calculations on a DHL indicate that there is a phase transition for a wide range of dilution concentrations. Ising spin systems on the AN are known to be ferromagnetically ordered at all temperatures; in the presence of dilution, however, there are indications of a disordered (paramagnetic) phase at low concentrations of magnetic ions.
Resumo:
Hierarchical multi-label classification is a complex classification task where the classes involved in the problem are hierarchically structured and each example may simultaneously belong to more than one class in each hierarchical level. In this paper, we extend our previous works, where we investigated a new local-based classification method that incrementally trains a multi-layer perceptron for each level of the classification hierarchy. Predictions made by a neural network in a given level are used as inputs to the neural network responsible for the prediction in the next level. We compare the proposed method with one state-of-the-art decision-tree induction method and two decision-tree induction methods, using several hierarchical multi-label classification datasets. We perform a thorough experimental analysis, showing that our method obtains competitive results to a robust global method regarding both precision and recall evaluation measures.
Resumo:
Given a large image set, in which very few images have labels, how to guess labels for the remaining majority? How to spot images that need brand new labels different from the predefined ones? How to summarize these data to route the user’s attention to what really matters? Here we answer all these questions. Specifically, we propose QuMinS, a fast, scalable solution to two problems: (i) Low-labor labeling (LLL) – given an image set, very few images have labels, find the most appropriate labels for the rest; and (ii) Mining and attention routing – in the same setting, find clusters, the top-'N IND.O' outlier images, and the 'N IND.R' images that best represent the data. Experiments on satellite images spanning up to 2.25 GB show that, contrasting to the state-of-the-art labeling techniques, QuMinS scales linearly on the data size, being up to 40 times faster than top competitors (GCap), still achieving better or equal accuracy, it spots images that potentially require unpredicted labels, and it works even with tiny initial label sets, i.e., nearly five examples. We also report a case study of our method’s practical usage to show that QuMinS is a viable tool for automatic coffee crop detection from remote sensing images.