997 results for complete linkage clustering


Relevance:

20.00%

Abstract:

Important words, which usually appear in the Title, Subject and Keywords fields, can briefly reflect the main topic of a document. In recent years, it has become common practice to exploit the semantic topic of documents and to use important words for document clustering, especially for short texts such as news articles. This paper proposes a novel method that extracts important words from the Subject and Keywords of articles and then partitions documents using only those words. Because the frequencies of important words are usually low and the scale matrix dataset for important words is small, a normalization method is proposed to normalize the scale dataset, so that the limited information is fully exploited and more accurate results are obtained. The experiments validate the effectiveness of our method.

Relevance:

20.00%

Abstract:

One reason semi-supervised clustering fails to deliver satisfactory performance in document clustering is that the transformed optimization problem can have many candidate solutions, yet existing methods provide no mechanism for selecting a suitable one among them. This paper alleviates the problem by posing the same task as a soft-constrained optimization problem and introduces the salient degree measure as an information guide to control the search for an optimal solution. Experimental results show that the proposed method improves performance, especially when the amount of prior domain knowledge is limited.

Relevance:

20.00%

Abstract:

In this paper, we present a document clustering framework that incorporates instance-level knowledge in the form of pairwise constraints and attribute-level knowledge in the form of keyphrases. First, we initialize weights based on metric learning with pairwise constraints; then we learn the two kinds of knowledge simultaneously by combining the distance-based and constraint-based approaches; finally, we evaluate and select the clustering result based on the degree of users’ satisfaction. The experimental results demonstrate the effectiveness and potential of the proposed method.

Relevance:

20.00%

Abstract:

Network traffic classification is an essential component of network management and security systems. To address the limitations of traditional port-based and payload-based methods, recent studies have focused on alternative approaches. One promising direction is to apply machine learning techniques that classify traffic flows based on packet-level and flow-level statistics. In particular, previous papers have shown that clustering can achieve high accuracy and discover unknown application classes. In this work, we present a novel semi-supervised learning method using constrained clustering algorithms. The motivation is that, in the network domain, a lot of background information is available in addition to the data instances themselves. For example, we might know that flows ƒ1 and ƒ2 use the same application protocol because they visit the same host address at the same port simultaneously. In this case, ƒ1 and ƒ2 should ideally be grouped into the same cluster. We therefore describe these correlations in the form of pairwise must-link constraints and incorporate them into the clustering process. We have applied three constrained variants of the K-Means algorithm, which perform hard or soft constraint satisfaction and metric learning from constraints. A number of real-world traffic traces have been used to show the availability of constraints and to test the proposed approach. The experimental results indicate that incorporating constraints in the course of clustering can significantly improve the overall accuracy and cluster purity.
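To illustrate how pairwise must-link constraints can enter a K-Means-style procedure, here is a minimal sketch of hard constraint satisfaction. It is not the authors' implementation: the function names (`must_link_groups`, `constrained_kmeans`), the toy flow features and the deterministic seeding are assumptions made for illustration. Flows connected by must-link constraints are first merged into groups, and each group is then assigned as a whole to the centroid nearest its mean, which minimises the group's total squared distance.

```python
from math import dist


def must_link_groups(n, must_link):
    """Merge indices connected by must-link pairs (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def constrained_kmeans(points, k, must_link, iters=20):
    """K-Means in which every must-link group shares one cluster label."""
    groups = must_link_groups(len(points), must_link)
    group_means = [mean([points[i] for i in g]) for g in groups]
    # Assumption: seed centroids with the first k group means (deterministic).
    centroids = group_means[:k]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: a whole group goes to the centroid nearest its mean.
        for g, gm in zip(groups, group_means):
            c = min(range(k), key=lambda j: dist(gm, centroids[j]))
            for i in g:
                labels[i] = c
        # Update step: recompute each centroid from its current members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == j]
            if members:
                centroids[j] = mean(members)
    return labels


# Hypothetical flow features: (mean packet size, mean inter-arrival time).
flows = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (3.2, 3.0)]
# Flows 1 and 4 contacted the same host and port, so they must share a cluster.
labels = constrained_kmeans(flows, k=2, must_link=[(1, 4)])
```

Soft constraint satisfaction and metric learning, the other two variants mentioned in the abstract, would need a penalty term or a learned distance and are not covered by this sketch.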

Relevance:

20.00%

Abstract:

Background: Children who participate in regular physical activity obtain health benefits. Preliminary pedometer-based cut-points representing sufficient levels of physical activity among youth have been established; however, limited evidence exists regarding correlates of achieving these cut-points. The purpose of this study was to identify correlates of pedometer-based cut-points among elementary school-aged children.
Method: A cross-section of children in grades 5-7 (10-12 years of age) were randomly selected from the most (n = 13) and least (n = 12) ‘walkable’ public elementary schools (Perth, Western Australia), stratified by socioeconomic status. Children (n = 1480; response rate = 56.6%) and parents (n = 1332; response rate = 88.8%) completed a survey, and steps were collected from children using pedometers. Pedometer data were categorized to reflect the sex-specific pedometer-based cut-points of ≥15000 steps/day for boys and ≥12000 steps/day for girls. Associations between socio-demographic characteristics, sedentary and active leisure-time behavior, independent mobility, active transportation and built environmental variables - collected from the child and parent surveys - and meeting pedometer-based cut-points were estimated (odds ratios: OR) using generalized estimating equations.
Results: Overall 927 children participated in all components of the study and provided complete data. On average, children took 11407 ± 3136 steps/day (boys: 12270 ± 3350 vs. girls: 10681 ± 2745 steps/day; p < 0.001) and 25.9% (boys: 19.1 vs. girls: 31.6%; p < 0.001) achieved the pedometer-based cut-points. After adjusting for all other variables and school clustering, meeting the pedometer-based cut-points was negatively associated (p < 0.05) with being male (OR = 0.42), parent self-reported number of different destinations in the neighborhood (OR 0.93), and a friend’s (OR 0.62) or relative’s (OR 0.44, boys only) house being at least a 10-minute walk from home. Achieving the pedometer-based cut-points was positively associated with participating in screen-time < 2 hours/day (OR 1.88), not being driven to school (OR 1.48), attending a school located in a high SES neighborhood (OR 1.33), the average number of steps among children within the respondent’s grade (for each 500 step/day increase: OR 1.29), and living further than a 10-minute walk from a relative’s house (OR 1.69, girls only).
Conclusions: Comprehensive multi-level interventions that reduce screen-time, encourage active travel to/from school and foster a physically active classroom culture might encourage more physical activity among children.

Relevance:

20.00%

Abstract:

This paper examines the recovery of user context in indoor environments with existing wireless infrastructures to enable assistive systems. We present a novel approach to the extraction of user context, casting the problem of context recovery as an unsupervised clustering problem. A well-known density-based clustering technique, DBSCAN, is adapted to recover user context, which includes the user's motion state and the significant places the user visits, from WiFi observations consisting of access point ID and signal strength. Furthermore, user rhythms, or sequences of places the user visits periodically, are derived from these low-level contexts by employing a state-of-the-art probabilistic clustering technique, Latent Dirichlet Allocation (LDA), to enable a variety of application services. Experimental results with real data are presented to validate the proposed unsupervised learning approach and demonstrate its applicability.
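As a rough illustration of density-based clustering on WiFi observations, the sketch below runs textbook DBSCAN on signal-strength vectors. It does not reproduce the adaptation described in the abstract: the representation of a scan as a fixed-length tuple of signal strengths (one entry per access point, in dBm), the `eps` and `min_pts` values and the sample scans are all assumptions.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN: one cluster label per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # The neighbourhood of a point includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels


# Hypothetical scans: signal strength (dBm) seen from two access points.
scans = [
    (-40, -70), (-42, -71), (-41, -69),  # user stays near access point 1
    (-80, -45), (-79, -47), (-81, -44),  # user stays near access point 2
    (-60, -60),                          # isolated scan, e.g. while moving
]
places = dbscan(scans, eps=3.0, min_pts=3)
```

In this toy input the dense groups play the role of significant places, while the isolated scan is labelled noise, which loosely corresponds to observations taken in motion.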

Relevance:

20.00%

Abstract:

This paper presents an efficient evaluation algorithm for cross-validating the two-stage approach of KFD classifiers. The proposed algorithm has the same complexity level as the existing indirect efficient cross-validation methods, but it is more reliable because it is direct and constitutes exact cross-validation for the KFD classifier formulation. Simulations demonstrate that the proposed algorithm is almost as fast as the existing fast indirect evaluation algorithm and that the two-stage cross-validation selects better models on most of the thirteen benchmark data sets.

Relevance:

20.00%

Abstract:

This article presents experimental results devoted to a new application of the novel clustering technique recently introduced by the authors. Our aim is to facilitate the application of robust and stable consensus functions in information security, where it is often necessary to process large data sets and monitor outcomes in real time, as is required, for example, for intrusion detection. Here we concentrate on the particular case of profiling phishing websites. First, we apply several independent clustering algorithms to a randomized sample of data to obtain independent initial clusterings. The silhouette index is used to determine the number of clusters. Second, we use a consensus function to combine these independent clusterings into one consensus clustering. Feature ranking is used to select a subset of features for the consensus function. Third, we train fast supervised classification algorithms on the resulting consensus clustering so that they can process the whole large data set as well as new data. The precision and recall of the classifiers at the final stage of this scheme are critical for the effectiveness of the whole procedure. We investigated various combinations of three consensus functions, Cluster-Based Graph Formulation (CBGF), Hybrid Bipartite Graph Formulation (HBGF), and Instance-Based Graph Formulation (IBGF), with a variety of supervised classification algorithms. The best precision and recall were obtained by combining the HBGF consensus function with the SMO classifier using the polynomial kernel.
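The step of choosing the number of clusters with the silhouette index can be sketched as follows. This covers only that one step, not the consensus functions or the classifiers. The toy points and the hard-coded candidate labelings, which stand in for the outputs of real clustering runs, are assumptions, as is the convention that a singleton cluster scores 0.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # convention: a point in a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Hypothetical labelings for k = 2 and k = 3 clusters.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
}
scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```

The number of clusters whose labeling has the highest mean silhouette is kept, which for this toy input is the two-cluster labeling.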

Relevance:

20.00%

Abstract:

Background Cohort studies can provide valuable evidence of cause and effect relationships but are subject to loss of participants over time, limiting the validity of findings. Computerised record linkage offers a passive and ongoing method of obtaining health outcomes from existing routinely collected data sources. However, the quality of record linkage is reliant upon the availability and accuracy of common identifying variables. We sought to develop and validate a method for linking a cohort study to a state-wide hospital admissions dataset with limited availability of unique identifying variables.

Methods A sample of 2000 participants from a cohort study (n = 41 514) was linked to a state-wide hospitalisations dataset in Victoria, Australia using the national health insurance (Medicare) number and demographic data as identifying variables. Availability of the health insurance number was limited in both datasets; therefore linkage was undertaken both with and without use of this number and agreement tested between both algorithms. Sensitivity was calculated for a sub-sample of 101 participants with a hospital admission confirmed by medical record review.

Results Of the 2000 study participants, 85% were found to have a record in the hospitalisations dataset when the national health insurance number and sex were used as linkage variables and 92% when demographic details only were used. When agreement between the two methods was tested the disagreement fraction was 9%, mainly due to "false positive" links when demographic details only were used. A final algorithm that used multiple combinations of identifying variables resulted in a match proportion of 87%. Sensitivity of this final linkage was 95%.

Conclusions High quality record linkage of cohort data with a hospitalisations dataset that has limited identifiers can be achieved using combinations of a national health insurance number and demographic data as identifying variables.
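A linkage that tries several combinations of identifying variables can be sketched as a sequence of deterministic matching passes. The abstract does not list the demographic variables or the order of the combinations, so the field names (`medicare`, `surname`, `dob`, `sex`), the pass order, the sample records and the rule that a pass links only on a unique match are all hypothetical. The sketch also assumes one hospital record per person.

```python
# Each pass is a tuple of fields that must all agree; earlier passes are stricter.
# Assumption: these field names and this order are illustrative only.
PASSES = [
    ("medicare", "sex"),
    ("surname", "dob", "sex"),
]


def link(cohort, hospital, passes=PASSES):
    """Return {cohort_id: hospital_id}, using the first pass with a unique match."""
    links = {}
    for fields in passes:
        # Index hospital records by the key for this pass, skipping missing values.
        index = {}
        for rec in hospital:
            key = tuple(rec.get(f) for f in fields)
            if None not in key:
                index.setdefault(key, []).append(rec["id"])
        for person in cohort:
            if person["id"] in links:
                continue  # already linked by a stricter pass
            key = tuple(person.get(f) for f in fields)
            candidates = index.get(key, [])
            if None not in key and len(candidates) == 1:
                links[person["id"]] = candidates[0]
    return links


def sensitivity(links, confirmed_ids):
    """Share of participants with a confirmed admission that were linked."""
    return sum(1 for pid in confirmed_ids if pid in links) / len(confirmed_ids)


cohort = [
    {"id": "c1", "medicare": "111", "surname": "Lee", "dob": "1950-01-01", "sex": "F"},
    {"id": "c2", "medicare": None, "surname": "Ng", "dob": "1948-05-05", "sex": "M"},
    {"id": "c3", "medicare": None, "surname": "Roy", "dob": "1960-07-07", "sex": "F"},
]
hospital = [
    {"id": "h1", "medicare": "111", "surname": "Lee", "dob": "1950-01-01", "sex": "F"},
    {"id": "h2", "medicare": None, "surname": "Ng", "dob": "1948-05-05", "sex": "M"},
]
links = link(cohort, hospital)
# Suppose record review confirmed an admission for all three participants.
rate = sensitivity(links, ["c1", "c2", "c3"])
```

Here the first participant is linked through the insurance number, the second through demographic details alone, and the third stays unlinked, so two of the three confirmed admissions are found.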