812 resultados para Hier-archical clustering
Resumo:
In recent years, there has been an increas-ing interest in learning a distributed rep-resentation of word sense. Traditional context clustering based models usually require careful tuning of model parame-ters, and typically perform worse on infre-quent word senses. This paper presents a novel approach which addresses these lim-itations by first initializing the word sense embeddings through learning sentence-level embeddings from WordNet glosses using a convolutional neural networks. The initialized word sense embeddings are used by a context clustering based model to generate the distributed representations of word senses. Our learned represen-tations outperform the publicly available embeddings on 2 out of 4 metrics in the word similarity task, and 6 out of 13 sub tasks in the analogical reasoning task.
Resumo:
2000 Mathematics Subject Classification: 62H30
Resumo:
In recent years, there has been an increasing interest in learning a distributed representation of word sense. Traditional context clustering based models usually require careful tuning of model parameters, and typically perform worse on infrequent word senses. This paper presents a novel approach which addresses these limitations by first initializing the word sense embeddings through learning sentence-level embeddings from WordNet glosses using a convolutional neural networks. The initialized word sense embeddings are used by a context clustering based model to generate the distributed representations of word senses. Our learned representations outperform the publicly available embeddings on half of the metrics in the word similarity task, 6 out of 13 sub tasks in the analogical reasoning task, and gives the best overall accuracy in the word sense effect classification task, which shows the effectiveness of our proposed distributed distribution learning model.
Resumo:
In this paper, we focus on the design of bivariate EDAs for discrete optimization problems and propose a new approach named HSMIEC. While the current EDAs require much time in the statistical learning process as the relationships among the variables are too complicated, we employ the Selfish gene theory (SG) in this approach, as well as a Mutual Information and Entropy based Cluster (MIEC) model is also set to optimize the probability distribution of the virtual population. This model uses a hybrid sampling method by considering both the clustering accuracy and clustering diversity and an incremental learning and resample scheme is also set to optimize the parameters of the correlations of the variables. Compared with several benchmark problems, our experimental results demonstrate that HSMIEC often performs better than some other EDAs, such as BMDA, COMIT, MIMIC and ECGA. © 2009 Elsevier B.V. All rights reserved.
Resumo:
Abstract Driven by the political and economic forces of cross-strait, Taiwan has become one of the major source markets for Hong Kong tourism industry since 1987. The major purposes of this study were to investigate the following factors (1) The influential factors of travel motivation, (2) The clusters of travel motivations, (3) The marketing segmentation of clusters of Taiwanese tourists to visit Hong Kong. Through ten travel agents, self-report surveys were distributed to collect data from 366 Taiwanese travelers. Hence, four push factors and six pull factors were identified as travel motivations through the factor analysis. Combined with the cluster analysis; five new groups were founded. Finally, five clusters which process unique profiles (location difference, visiting frequency, travel satisfaction, and destination loyalty) were addressed. The suggestions of developing effective market strategies to attract Taiwanese tourists to Hong Kong were also provided.
Resumo:
With the popularization of GPS-enabled devices such as mobile phones, location data are becoming available at an unprecedented scale. The locations may be collected from many different sources such as vehicles moving around a city, user check-ins in social networks, and geo-tagged micro-blogging photos or messages. Besides the longitude and latitude, each location record may also have a timestamp and additional information such as the name of the location. Time-ordered sequences of these locations form trajectories, which together contain useful high-level information about people's movement patterns.
The first part of this thesis focuses on a few geometric problems motivated by the matching and clustering of trajectories. We first give a new algorithm for computing a matching between a pair of curves under existing models such as dynamic time warping (DTW). The algorithm is more efficient than standard dynamic programming algorithms both theoretically and practically. We then propose a new matching model for trajectories that avoids the drawbacks of existing models. For trajectory clustering, we present an algorithm that computes clusters of subtrajectories, which correspond to common movement patterns. We also consider trajectories of check-ins, and propose a statistical generative model, which identifies check-in clusters as well as the transition patterns between the clusters.
The second part of the thesis considers the problem of covering shortest paths in a road network, motivated by an EV charging station placement problem. More specifically, a subset of vertices in the road network are selected to place charging stations so that every shortest path contains enough charging stations and can be traveled by an EV without draining the battery. We first introduce a general technique for the geometric set cover problem. This technique leads to near-linear-time approximation algorithms, which are the state-of-the-art algorithms for this problem in either running time or approximation ratio. We then use this technique to develop a near-linear-time algorithm for this
shortest-path cover problem.
Resumo:
China is today facing rapid economic development and the long-term implications of China’s rise for European economy, society and culture, are constantly debated but still almost unknown. Moreover, only recently a new volume edited by Kunzmann has clearly pointed out a particular field of research like the EU spatial impact of China’s convergence in the global market. The aim of the present paper is to deal with the spatial issues related to the growing Chinese communities, especially in Italy, that are part of a more general and considerable transformation process of the traditional Chinese enclaves in EU cities: from recognizable “Chinatowns” to new hybrid urban formations where housing, retail, wholesale and even commodity production often tend to match. Key-Concepts like rise, fragmentation, infringement and fear are useful in analysing some of the more controversial socio-economic dynamics of Chinese clusters especially in a traditionally manufactured-based country like Italy, where it’s recognizable a unique paradox of a “double competition” from outside and from inside. This statement poses a serious threat to local economic systems in terms of sustainability and social cohesion, making it necessary to rethink the role and the nature of public action in facing new forms of marginality at urban and regional level.
Resumo:
Non-parametric multivariate analyses of complex ecological datasets are widely used. Following appropriate pre-treatment of the data inter-sample resemblances are calculated using appropriate measures. Ordination and clustering derived from these resemblances are used to visualise relationships among samples (or variables). Hierarchical agglomerative clustering with group-average (UPGMA) linkage is often the clustering method chosen. Using an example dataset of zooplankton densities from the Bristol Channel and Severn Estuary, UK, a range of existing and new clustering methods are applied and the results compared. Although the examples focus on analysis of samples, the methods may also be applied to species analysis. Dendrograms derived by hierarchical clustering are compared using cophenetic correlations, which are also used to determine optimum in flexible beta clustering. A plot of cophenetic correlation against original dissimilarities reveals that a tree may be a poor representation of the full multivariate information. UNCTREE is an unconstrained binary divisive clustering algorithm in which values of the ANOSIM R statistic are used to determine (binary) splits in the data, to form a dendrogram. A form of flat clustering, k-R clustering, uses a combination of ANOSIM R and Similarity Profiles (SIMPROF) analyses to determine the optimum value of k, the number of groups into which samples should be clustered, and the sample membership of the groups. Robust outcomes from the application of such a range of differing techniques to the same resemblance matrix, as here, result in greater confidence in the validity of a clustering approach.
Resumo:
Non-parametric multivariate analyses of complex ecological datasets are widely used. Following appropriate pre-treatment of the data inter-sample resemblances are calculated using appropriate measures. Ordination and clustering derived from these resemblances are used to visualise relationships among samples (or variables). Hierarchical agglomerative clustering with group-average (UPGMA) linkage is often the clustering method chosen. Using an example dataset of zooplankton densities from the Bristol Channel and Severn Estuary, UK, a range of existing and new clustering methods are applied and the results compared. Although the examples focus on analysis of samples, the methods may also be applied to species analysis. Dendrograms derived by hierarchical clustering are compared using cophenetic correlations, which are also used to determine optimum in flexible beta clustering. A plot of cophenetic correlation against original dissimilarities reveals that a tree may be a poor representation of the full multivariate information. UNCTREE is an unconstrained binary divisive clustering algorithm in which values of the ANOSIM R statistic are used to determine (binary) splits in the data, to form a dendrogram. A form of flat clustering, k-R clustering, uses a combination of ANOSIM R and Similarity Profiles (SIMPROF) analyses to determine the optimum value of k, the number of groups into which samples should be clustered, and the sample membership of the groups. Robust outcomes from the application of such a range of differing techniques to the same resemblance matrix, as here, result in greater confidence in the validity of a clustering approach.
Resumo:
Clustering algorithms, pattern mining techniques and associated quality metrics emerged as reliable methods for modeling learners’ performance, comprehension and interaction in given educational scenarios. The specificity of available data such as missing values, extreme values or outliers, creates a challenge to extract significant user models from an educational perspective. In this paper we introduce a pattern detection mechanism with-in our data analytics tool based on k-means clustering and on SSE, silhouette, Dunn index and Xi-Beni index quality metrics. Experiments performed on a dataset obtained from our online e-learning platform show that the extracted interaction patterns were representative in classifying learners. Furthermore, the performed monitoring activities created a strong basis for generating automatic feedback to learners in terms of their course participation, while relying on their previous performance. In addition, our analysis introduces automatic triggers that highlight learners who will potentially fail the course, enabling tutors to take timely actions.
Resumo:
Community-driven Question Answering (CQA) systems that crowdsource experiential information in the form of questions and answers and have accumulated valuable reusable knowledge. Clustering of QA datasets from CQA systems provides a means of organizing the content to ease tasks such as manual curation and tagging. In this paper, we present a clustering method that exploits the two-part question-answer structure in QA datasets to improve clustering quality. Our method, {\it MixKMeans}, composes question and answer space similarities in a way that the space on which the match is higher is allowed to dominate. This construction is motivated by our observation that semantic similarity between question-answer data (QAs) could get localized in either space. We empirically evaluate our method on a variety of real-world labeled datasets. Our results indicate that our method significantly outperforms state-of-the-art clustering methods for the task of clustering question-answer archives.
Resumo:
This papers examines the use of trajectory distance measures and clustering techniques to define normal
and abnormal trajectories in the context of pedestrian tracking in public spaces. In order to detect abnormal
trajectories, what is meant by a normal trajectory in a given scene is firstly defined. Then every trajectory
that deviates from this normality is classified as abnormal. By combining Dynamic Time Warping and a
modified K-Means algorithms for arbitrary-length data series, we have developed an algorithm for trajectory
clustering and abnormality detection. The final system performs with an overall accuracy of 83% and 75%
when tested in two different standard datasets.
Resumo:
BACKGROUND: We used four years of paediatric severe acute respiratory illness (SARI) sentinel surveillance in Blantyre, Malawi to identify factors associated with clinical severity and co-viral clustering.
METHODS: From January 2011 to December 2014, 2363 children aged 3 months to 14 years presenting to hospital with SARI were enrolled. Nasopharyngeal aspirates were tested for influenza and other respiratory viruses. We assessed risk factors for clinical severity and conducted clustering analysis to identify viral clusters in children with co-viral detection.
RESULTS: Hospital-attended influenza-positive SARI incidence was 2.0 cases per 10,000 children annually; it was highest children aged under 1 year (6.3 cases per 10,000), and HIV-infected children aged 5 to 9 years (6.0 cases per 10,000). 605 (26.8%) SARI cases had warning signs, which were positively associated with HIV infection (adjusted risk ratio [aRR]: 2.4, 95% CI: 1.4, 3.9), RSV infection (aRR: 1.9, 95% CI: 1.3, 3.0) and rainy season (aRR: 2.4, 95% CI: 1.6, 3.8). We identified six co-viral clusters; one cluster was associated with SARI with warning signs.
CONCLUSIONS: Influenza vaccination may benefit young children and HIV infected children in this setting. Viral clustering may be associated with SARI severity; its assessment should be included in routine SARI surveillance.
Resumo:
We consider the problem of resource selection in clustered Peer-to-Peer Information Retrieval (P2P IR) networks with cooperative peers. The clustered P2P IR framework presents a significant departure from general P2P IR architectures by employing clustering to ensure content coherence between resources at the resource selection layer, without disturbing document allocation. We propose that such a property could be leveraged in resource selection by adapting well-studied and popular inverted lists for centralized document retrieval. Accordingly, we propose the Inverted PeerCluster Index (IPI), an approach that adapts the inverted lists, in a straightforward manner, for resource selection in clustered P2P IR. IPI also encompasses a strikingly simple peer-specific scoring mechanism that exploits the said index for resource selection. Through an extensive empirical analysis on P2P IR testbeds, we establish that IPI competes well with the sophisticated state-of-the-art methods in virtually every parameter of interest for the resource selection task, in the context of clustered P2P IR.
Resumo:
The Twitter System is the biggest social network in the world, and everyday millions of tweets are posted and talked about, expressing various views and opinions. A large variety of research activities have been conducted to study how the opinions can be clustered and analyzed, so that some tendencies can be uncovered. Due to the inherent weaknesses of the tweets - very short texts and very informal styles of writing - it is rather hard to make an investigation of tweet data analysis giving results with good performance and accuracy. In this paper, we intend to attack the problem from another aspect - using a two-layer structure to analyze the twitter data: LDA with topic map modelling. The experimental results demonstrate that this approach shows a progress in twitter data analysis. However, more experiments with this method are expected in order to ensure that the accurate analytic results can be maintained.