928 results for K-Means Cluster


Relevance: 80.00%

Abstract:

Collaborative recommendation is one of the most widely used recommendation approaches: it recommends items to a visitor by referring to the preferences of other users who are similar to the current user. User profiling over Web transaction data can capture such informative knowledge about user tasks and interests. With the discovered usage patterns, it becomes possible to recommend more relevant content to Web users, or to customise the Web presentation for visitors, via collaborative recommendation. It also helps to identify the underlying relationships among Web users, items, and latent tasks during Web mining. In this paper, we propose a Web recommendation framework based on user profiling. In this approach, we employ Probabilistic Latent Semantic Analysis (PLSA) to model co-occurrence activities and develop a modified k-means clustering algorithm to build user profiles as representatives of usage patterns. Moreover, the hidden task model is derived by characterising the meaningful latent factor space. Given the discovered user profiles, we then select the profile whose preferences are most similar to those of the current user and make collaborative recommendations based on the page weights in the selected profile. Preliminary experimental results on real-world data sets show that the proposed approach makes recommendations accurately and efficiently.
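
The profile-matching step described above lends itself to a short illustration. The sketch below is not the authors' PLSA pipeline or modified k-means; it only shows the general idea, under hypothetical data and names, of clustering user page-weight vectors with plain k-means and recommending the top-weighted pages of the closest profile.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical session matrix: rows = users, columns = page visit weights.
rng = np.random.default_rng(0)
sessions = rng.random((200, 12))

# Plain k-means stands in for the paper's modified k-means over PLSA factors.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sessions)
profiles = km.cluster_centers_  # each centroid acts as a usage profile

def recommend(user_vector, profiles, top_n=3):
    """Pick the closest profile and return its highest-weighted pages."""
    nearest = np.argmin(np.linalg.norm(profiles - user_vector, axis=1))
    return np.argsort(profiles[nearest])[::-1][:top_n]

print(recommend(sessions[0], profiles))
```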

Relevance: 80.00%

Abstract:

An overview of neural networks, covering multilayer perceptrons, radial basis functions, constructive algorithms, Kohonen and K-means unsupervised algorithms, RAMnets, first- and second-order training methods, and Bayesian regularisation methods.

Relevance: 80.00%

Abstract:

Clustering techniques such as k-means and hierarchical clustering are commonly used to analyze DNA microarray-derived gene expression data. However, the interactions between the processes underlying cell activity suggest that the complexity of the microarray data structure may not be fully represented by discrete clustering methods.
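
As a point of reference only (the abstract argues such discrete methods may be insufficient), the following sketch applies k-means and agglomerative clustering to a hypothetical expression matrix and compares their hard partitions; it is illustrative and not tied to any dataset used by the authors.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Hypothetical expression matrix: rows = genes, columns = samples/conditions.
rng = np.random.default_rng(1)
expression = rng.normal(size=(500, 20))

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(expression)
hier_labels = AgglomerativeClustering(n_clusters=4).fit_predict(expression)

# Agreement between the two discrete partitions (1.0 = identical clusterings).
print(adjusted_rand_score(kmeans_labels, hier_labels))
```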

Relevance: 80.00%

Abstract:

Objective: Recently, much research has proposed nature-inspired algorithms for complex machine learning tasks. Ant colony optimization (ACO) is one such algorithm based on swarm intelligence; it is derived from a model inspired by the collective foraging behavior of ants. Taking advantage of ACO traits such as self-organization and robustness, this paper investigates ant-based algorithms for gene expression data clustering and associative classification. Methods and material: An ant-based clustering algorithm (Ant-C) and an ant-based association rule mining algorithm (Ant-ARM) are proposed for gene expression data analysis. The proposed algorithms make use of natural ant behaviors such as cooperation and adaptation to allow a flexible, robust search for good candidate solutions. Results: Ant-C has been tested on three datasets selected from the Stanford Genomic Resource Database and achieved relatively high accuracy compared to other classical clustering methods. Ant-ARM has been tested on the acute lymphoblastic leukemia (ALL)/acute myeloid leukemia (AML) dataset and generated about 30 classification rules with high accuracy. Conclusions: Ant-C can generate an optimal number of clusters without incorporating any other algorithm such as K-means or agglomerative hierarchical clustering. For associative classification, while some well-known algorithms such as Apriori, FP-growth and Magnum Opus are unable to mine any association rules from the ALL/AML dataset within a reasonable period of time, Ant-ARM is able to extract associative classification rules.
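
The conclusion contrasts Ant-C with classical pipelines that must choose the number of clusters externally. As a hedged illustration of that classical baseline (not of the ant-based algorithms themselves), the sketch below picks k for ordinary k-means by silhouette score on hypothetical expression data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
data = rng.normal(size=(300, 10))  # hypothetical gene expression profiles

# Classical approach: sweep k and keep the silhouette-optimal value.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```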

Relevance: 80.00%

Abstract:

We propose a hybrid generative/discriminative framework for semantic parsing which combines the hidden vector state (HVS) model and hidden Markov support vector machines (HM-SVMs). The HVS model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. HM-SVMs combine the advantages of hidden Markov models and support vector machines. By employing a modified K-means clustering method, a small set of the most representative sentences can be automatically selected from an un-annotated corpus. These sentences, together with their abstract annotations, are used to train an HVS model which can subsequently be applied to the whole corpus to generate semantic parsing results. The most confident semantic parsing results are selected to generate a fully annotated corpus, which is used to train the HM-SVMs. The proposed framework has been tested on the DARPA Communicator Data. Experimental results show that the hybrid framework improves on the baseline HVS parser. When compared with HM-SVMs trained from the fully annotated corpus, the hybrid framework gives comparable performance with only a small set of lightly annotated sentences. © 2008. Licensed under the Creative Commons.
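
A minimal sketch of the sentence-selection step, assuming sentences have already been mapped to fixed-length vectors; it uses ordinary k-means rather than the authors' modified variant and simply picks the sentence nearest each centroid as the cluster representative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(3)
sentence_vectors = rng.random((1000, 50))  # hypothetical sentence embeddings

km = KMeans(n_clusters=20, n_init=10, random_state=3).fit(sentence_vectors)

# Index of the sentence closest to each centroid = representative to annotate.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, sentence_vectors)
print(sorted(closest.tolist()))
```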

Relevance: 80.00%

Abstract:

Projection of a high-dimensional dataset onto a two-dimensional space is a useful tool for visualising structures and relationships in the dataset. However, a single two-dimensional visualisation may not display all the intrinsic structure, so hierarchical/multi-level visualisation methods have been used to extract a more detailed understanding of the data. Here we propose a multi-level Gaussian process latent variable model (MLGPLVM). MLGPLVM works by segmenting data (with, e.g., K-means, a Gaussian mixture model or interactive clustering) in the visualisation space and then fitting a visualisation model to each subset. To measure the quality of multi-level visualisation (with respect to parent and child models), metrics such as trustworthiness, continuity, mean relative rank errors, visualisation distance distortion and the negative log-likelihood per point are used. We evaluate the MLGPLVM approach on the ‘Oil Flow’ dataset and a dataset of protein electrostatic potentials for the ‘Major Histocompatibility Complex (MHC) class I’ of humans. In both cases, visual observation and the quantitative quality measures show better visualisation at lower levels.
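
The segmentation step can be illustrated with a simplified stand-in: project with PCA instead of a GP-LVM, segment the 2-D projection with k-means, and fit a separate child projection to each segment. All details below are assumptions for illustration, not the MLGPLVM implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
data = rng.normal(size=(600, 12))  # hypothetical high-dimensional dataset

parent = PCA(n_components=2).fit_transform(data)  # parent visualisation space
segments = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(parent)

# Child visualisation model fitted to each segment of the parent space.
children = {c: PCA(n_components=2).fit(data[segments == c]) for c in np.unique(segments)}
for c, model in children.items():
    print(c, model.explained_variance_ratio_)
```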

Relevance: 80.00%

Abstract:

We present a test for identifying clusters in high dimensional data based on the k-means algorithm when the null hypothesis is spherical normal. We show that projection techniques used for evaluating validity of clusters may be misleading for such data. In particular, we demonstrate that increasingly well-separated clusters are identified as the dimensionality increases, when no such clusters exist. Furthermore, in a case of true bimodality, increasing the dimensionality makes identifying the correct clusters more difficult. In addition to the original conservative test, we propose a practical test with the same asymptotic behavior that performs well for a moderate number of points and moderate dimensionality. ACM Computing Classification System (1998): I.5.3.
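
A rough Monte Carlo rendering of the underlying idea (not the authors' test statistic): compare the k-means within-cluster sum of squares on the data with its distribution under spherical-normal samples of the same size and dimensionality. Sample sizes and dimensions below are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss(x, k=2, seed=0):
    """Within-cluster sum of squares reported by k-means."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(x).inertia_

rng = np.random.default_rng(5)
data = rng.normal(size=(200, 50))  # hypothetical high-dimensional sample

observed = wcss(data)
null = [wcss(rng.normal(size=data.shape), seed=i) for i in range(100)]

# Fraction of spherical-normal samples with a WCSS at least as small:
p_value = np.mean([n <= observed for n in null])
print(observed, p_value)
```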

Relevance: 80.00%

Abstract:

Several measures are in common use for quantifying liquidity, each capturing the phenomenon from a different aspect. The article analyses the various liquidity measures proposed in the literature with multivariate statistical methods: principal component analysis is used to find the factors that best summarise the liquidity characteristics, we then examine how strongly each measure co-moves with these factors, and, based on the correlations, a clustering procedure is used to find groups of measures with similar properties. We investigate whether the groups of variables obtained from our sample coincide with the measures associated with the individual aspects of liquidity, and whether composite liquidity measures can be defined with which liquidity can be measured along several dimensions. / === / Liquidity is measured from different aspects (e.g. tightness, depth, and resiliency) by different ratios. We studied the co-movements and the clustering of different liquidity measures on a sample of the Swiss stock market. We performed a PCA to obtain the main factors that explain the cross-sectional variability of liquidity measures, and we used the k-means clustering methodology to define groups of liquidity measures. Based on our explorative data analysis, we formed clusters of liquidity measures, and we compared the resulting groups with the expectations and intuition. Our modelling methodology provides a framework to analyze the correlation between the different aspects of liquidity as well as a means to define complex liquidity measures.
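
A sketch of the two analysis steps, assuming a hypothetical panel of daily liquidity measures: PCA extracts the main factors, and k-means groups the measures by their correlation profiles. This is illustrative, not the authors' data or exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
# Hypothetical panel: rows = trading days, columns = liquidity measures.
measures = rng.normal(size=(500, 15))

factors = PCA(n_components=3).fit(measures)  # main liquidity factors
print(factors.explained_variance_ratio_)

# Group the measures themselves: cluster the columns by their correlation profile.
corr = np.corrcoef(measures, rowvar=False)
groups = KMeans(n_clusters=4, n_init=10, random_state=6).fit_predict(corr)
print(groups)
```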

Relevance: 80.00%

Abstract:

The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data is used appropriately and effectively, knowledge discovery can be achieved better than is possible from only a single source. Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources, representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools SVM and KNN were used to successfully distinguish between several soil samples. The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to perform better than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from the K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database, called PlasmoTFBM and focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface that allows biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.
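
The feature-level integration in the first part amounts to concatenating features from the different sources before training a classifier. A minimal sketch with SVM and KNN follows; the data, soil classes and dimensions are all hypothetical placeholders, not the dissertation's datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
# Two hypothetical data sources describing the same soil samples.
source_a = rng.normal(size=(150, 8))
source_b = rng.normal(size=(150, 5))
labels = rng.integers(0, 3, size=150)  # three hypothetical soil types

features = np.hstack([source_a, source_b])  # feature-level integration
x_tr, x_te, y_tr, y_te = train_test_split(features, labels, random_state=7)

for clf in (SVC(), KNeighborsClassifier(n_neighbors=5)):
    print(type(clf).__name__, clf.fit(x_tr, y_tr).score(x_te, y_te))
```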

Relevance: 80.00%

Abstract:

This dissertation introduces a new approach for assessing the effects of pediatric epilepsy on the language connectome. Two novel data-driven network construction approaches are presented. These methods rely on connecting different brain regions using either the extent or the intensity of language-related activations as identified by independent component analysis of fMRI data. An auditory description decision task (ADDT) paradigm was used to activate the language network for 29 patients and 30 controls recruited from three major pediatric hospitals. Empirical evaluations illustrated that pediatric epilepsy can cause, or is associated with, a reduction in network efficiency. Patients showed a propensity to inefficiently employ the whole brain network to perform the ADDT language task; the controls, on the contrary, seemed to efficiently use smaller segregated network components to achieve the same task. To explain the causes of the decreased efficiency, graph theoretical analysis was carried out. The analysis revealed no substantial global network feature differences between the patient and control groups. It also showed that for both subject groups the language network exhibited small-world characteristics; however, the patients' extent-of-activation network showed a tendency towards more random networks. It was also shown that the intensity-of-activation network displayed ipsilateral hub reorganization at the local level. The left hemispheric hubs displayed greater centrality values for patients, whereas the right hemispheric hubs displayed greater centrality values for controls. This hub hemispheric disparity was not correlated with the atypical right language laterality found in six patients. Finally, it was shown that a multi-level unsupervised clustering scheme based on self-organizing maps (a type of artificial neural network) and k-means was able to fairly and blindly separate the subjects into their respective patient or control groups. The clustering was initiated using the local nodal centrality measurements only. Compared to the extent-of-activation network, the intensity-of-activation network clustering demonstrated better precision. This outcome supports the assertion that the local centrality differences presented by the intensity-of-activation network can be associated with focal epilepsy.
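
A simplified sketch of the final clustering step, using local nodal centrality vectors and plain k-means (the dissertation additionally uses a self-organizing map, which is omitted here; all values are hypothetical).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
# Hypothetical local centrality vectors: 59 subjects x 90 network nodes.
centrality = rng.random((59, 90))
true_group = np.array([0] * 29 + [1] * 30)  # 29 patients, 30 controls

pred = KMeans(n_clusters=2, n_init=10, random_state=8).fit_predict(centrality)

# Agreement with the known grouping, up to label permutation.
agreement = max(np.mean(pred == true_group), np.mean(pred == 1 - true_group))
print(agreement)
```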

Relevance: 80.00%

Abstract:

Based on a well-established stratigraphic framework and 47 AMS-14C dated sediment cores, the distribution of facies types on the NW Iberian margin is analysed in response to the last deglacial sea-level rise, thus providing a case study on the sedimentary evolution of a high-energy, low-accumulation shelf system. Altogether, four main types of sedimentary facies are defined. (1) A gravel-dominated facies occurs mostly as time-transgressive ravinement beds, which initially developed as shoreface and storm deposits in shallow waters on the outer shelf during the last sea-level lowstand; (2) A widespread, time-transgressive mixed siliceous/biogenic-carbonaceous sand facies indicates areas of moderate hydrodynamic regimes, high contribution of reworked shelf material, and fluvial supply to the shelf; (3) A glaucony-containing sand facies in a stationary position on the outer shelf formed mostly during the last deglacial sea-level rise by reworking of older deposits as well as authigenic mineral formation; and (4) A mud facies is mostly restricted to confined Holocene fine-grained depocentres, which are located in a mid-shelf position. The observed spatial and temporal distribution of these facies types on the high-energy, low-accumulation NW Iberian shelf was essentially controlled by the local interplay of sediment supply, shelf morphology, and strength of the hydrodynamic system. These patterns are in contrast to high-accumulation systems, where extensive sediment supply is the dominant control on the facies distribution. This study emphasises the importance of large-scale erosion and material recycling for the sedimentary buildup during the deglacial drowning of the shelf. The presence of a homogeneous, up to 15-m-thick transgressive cover above a lag horizon contradicts the common assumption of sparse and laterally confined sediment accumulation on high-energy shelf systems during deglacial sea-level rise. In contrast to this extensive sand cover, only laterally very confined mud depocentres, at most 4 m thick, developed during the Holocene sea-level highstand. This restricted formation of fine-grained depocentres was related to the combination of: (1) frequently occurring high-energy hydrodynamic conditions; (2) low overall terrigenous input by the adjacent rivers; and (3) the large distance of the Galicia Mud Belt from its main sediment supplier.

Relevance: 80.00%

Abstract:

Skeletal muscle consists of muscle fiber types that have different physiological and biochemical characteristics. Basically, muscle fibers can be classified into type I and type II, which differ, among other features, in contraction speed and sensitivity to fatigue. These fibers coexist in the skeletal muscles, and their relative proportions are modulated according to the muscle's function and the stimuli to which it is submitted. To identify the proportions of fiber types in muscle composition, many studies use biopsy as the standard procedure. Since surface electromyography (EMG) allows information about the recruitment of different motor units to be extracted, this study is based on the assumption that it is possible to use the EMG to identify different proportions of fiber types in a muscle. The goal of this study was to identify the characteristics of the EMG signal that can distinguish, most precisely, different proportions of fiber types; the combination of characteristics using appropriate mathematical models was also investigated. To achieve this objective, signals were simulated with different proportions of recruited motor units and different signal-to-noise ratios. Thirteen time- and frequency-domain characteristics were extracted from the emulated signals. The results for each extracted characteristic were submitted to the k-means clustering algorithm to separate the different proportions of motor units recruited in the emulated signals. Mathematical techniques (confusion matrix and capability analysis) were applied to select the characteristics able to identify different proportions of muscle fiber types. As a result, the mean frequency and the median frequency were selected as able to distinguish, most precisely, the proportions of the different muscle fiber types. Subsequently, the characteristics considered most capable were analysed jointly through principal component analysis. Two principal components were found for the signals emulated without noise (CP1 and CP2) and two for the noisy signals (CP1 and CP2). The first principal components (CP1 and CP1) were identified as able to distinguish different proportions of muscle fiber types. The selected characteristics (median frequency, mean frequency, CP1 and CP1) were used to analyse real EMG signals, comparing sedentary people with physically active people who practice strength training (weight training). The results obtained with the different groups of volunteers show that the physically active people had higher values of mean frequency, median frequency and principal components than the sedentary people. Moreover, these values decreased with increasing power level for both groups; however, the decline was more accentuated for the group of physically active people. Based on these results, it is assumed that the volunteers of the physically active group have higher proportions of type II fibers than the sedentary people. Finally, based on these results, we can conclude that the selected characteristics were able to distinguish different proportions of muscle fiber types, both for the emulated signals and for the real signals. These characteristics can be used in several studies, for example to evaluate the progress of people with myopathies and neuromyopathies undergoing physiotherapy, and also to analyse the development of athletes in order to improve their muscle capacity according to their sport. In both cases, extracting these characteristics from surface electromyography signals provides feedback to the physiotherapist and the physical trainer, who can monitor the increase in the proportion of a given fiber type, as desired in each case.
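
A hedged sketch of the selected spectral features and the clustering step: mean and median frequency computed from a power spectrum, then grouped by k-means. The signal, sampling rate and epoch length are hypothetical and do not reproduce the study's simulation protocol.

```python
import numpy as np
from scipy.signal import welch
from sklearn.cluster import KMeans

def spectral_features(signal, fs=2000):
    """Mean and median frequency of an EMG epoch."""
    freqs, power = welch(signal, fs=fs)
    mean_f = np.sum(freqs * power) / np.sum(power)
    median_f = freqs[np.searchsorted(np.cumsum(power), np.sum(power) / 2)]
    return mean_f, median_f

rng = np.random.default_rng(9)
epochs = rng.normal(size=(40, 4000))  # hypothetical 2-second EMG epochs

features = np.array([spectral_features(e) for e in epochs])
labels = KMeans(n_clusters=2, n_init=10, random_state=9).fit_predict(features)
print(labels)
```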

Relevance: 80.00%

Abstract:

Modelling industrial systems gives organisations a strategic advantage in the study of their production processes. Through modelling it is possible to increase knowledge about the systems, which can in turn enable, where possible, improvements in production management and planning. This knowledge can also increase the efficiency of production processes, by improving or eliminating the main losses detected in the process. The main objective of this work is the development and validation of a modelling, forecasting and analysis tool for industrial production systems, aimed at increasing knowledge about them. For the execution and development of this work, several tools, concepts, methodologies and theoretical foundations known from the literature were used and developed, such as OEE (Overall Equipment Effectiveness), Petri nets (RdP), time series, k-means and SPC (Statistical Process Control). The modelling, forecasting and analysis tool developed and described in this work proved capable of assisting in the detection and interpretation of the causes that influence the results of the production system and give rise to losses, demonstrating the expected advantages. These results were based on real data from a production system.
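
One of the building blocks mentioned, the OEE indicator, reduces to a simple product of three rates. The figures below are hypothetical shift data, shown only to make the calculation concrete.

```python
def oee(availability, performance, quality):
    """Overall Equipment Effectiveness = availability x performance x quality."""
    return availability * performance * quality

# Hypothetical shift figures.
availability = 420 / 480   # run time / planned production time
performance = 950 / 1100   # actual output / theoretical output at ideal rate
quality = 920 / 950        # good units / total units produced

print(round(oee(availability, performance, quality), 3))
```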

Relevance: 80.00%

Abstract:

Ground Delay Programs (GDP) are sometimes cancelled before their initially planned duration, and for this reason aircraft are delayed when it is no longer needed. Recovering this delay usually leads to extra fuel consumption, since the aircraft will typically depart after having absorbed their assigned delay on the ground and will therefore need to cruise at more fuel-consuming speeds. Past research has proposed a speed reduction strategy aimed at splitting the GDP-assigned delay between ground and airborne delay while using the same fuel as in nominal conditions. Being airborne earlier, an aircraft can speed up to nominal cruise speed and recover part of the GDP delay without incurring extra fuel consumption if the GDP is cancelled earlier than planned. In this paper, all GDP initiatives that occurred at San Francisco International Airport during 2006 are studied and grouped into three different clusters by a K-means algorithm. The centroids of these three clusters have been used to simulate three different GDPs at the airport, using a realistic set of inbound traffic and the Future Air Traffic Management Concepts Evaluation Tool (FACET). The amount of delay that can be recovered using this cruise speed reduction technique, as a function of the GDP cancellation time, has been computed and compared with the delay recovered under the current concept of operations. Simulations have been conducted in a calm-wind situation and without considering a radius of exemption. Results indicate that when aircraft depart early and fly at the slower speed, they can recover additional delay, compared to current operations where all delay is absorbed prior to take-off, in the event the GDP is cancelled early. The amount of extra delay recovered varies, and is more significant, in relative terms, for those GDPs with a relatively low amount of demand exceeding the airport capacity.
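
The characterisation step can be sketched as k-means over simple per-GDP features, with the three centroids taken as representative scenarios to simulate. The features and values below are hypothetical; the actual study used the 2006 San Francisco GDP records and FACET.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)
# Hypothetical per-GDP features: planned duration (h), mean assigned delay (min),
# and number of flights affected.
gdp_features = np.column_stack([
    rng.uniform(2, 12, 80),
    rng.uniform(20, 120, 80),
    rng.uniform(30, 300, 80),
])

km = KMeans(n_clusters=3, n_init=10, random_state=10).fit(gdp_features)
print(km.cluster_centers_)  # three representative GDP scenarios to simulate
```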

Relevance: 80.00%

Abstract:

Clustering algorithms, pattern mining techniques and associated quality metrics have emerged as reliable methods for modeling learners' performance, comprehension and interaction in given educational scenarios. The specificity of the available data, such as missing values, extreme values or outliers, makes it a challenge to extract significant user models from an educational perspective. In this paper we introduce a pattern detection mechanism within our data analytics tool based on k-means clustering and on the SSE, silhouette, Dunn index and Xie-Beni index quality metrics. Experiments performed on a dataset obtained from our online e-learning platform show that the extracted interaction patterns were representative for classifying learners. Furthermore, the monitoring activities performed created a strong basis for generating automatic feedback to learners on their course participation, relying on their previous performance. In addition, our analysis introduces automatic triggers that highlight learners who are likely to fail the course, enabling tutors to take timely action.
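
A compact sketch of the quality metrics named above: SSE (k-means inertia) and silhouette from scikit-learn, plus hand-rolled Dunn and Xie-Beni indices, computed on a hypothetical learner-interaction dataset. The implementations are illustrative, not the tool described in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist, pdist

rng = np.random.default_rng(11)
data = rng.random((300, 6))  # hypothetical learner-interaction features

km = KMeans(n_clusters=4, n_init=10, random_state=11).fit(data)
labels, centers = km.labels_, km.cluster_centers_

sse = km.inertia_
silhouette = silhouette_score(data, labels)

clusters = [data[labels == c] for c in np.unique(labels)]
# Dunn index: smallest between-cluster distance / largest within-cluster diameter.
min_between = min(cdist(a, b).min() for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
max_diameter = max(pdist(c).max() for c in clusters)
dunn = min_between / max_diameter

# Xie-Beni index: SSE / (n * squared minimum distance between centroids).
min_center_dist = pdist(centers).min()
xie_beni = sse / (len(data) * min_center_dist ** 2)

print(sse, silhouette, dunn, xie_beni)
```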