853 results for Unsupervised clustering
Abstract:
Study, design and implementation of different fiber clustering techniques in order to integrate into the DTIWeb platform several clustering algorithms and fiber-cluster visualization techniques, so as to make the interpretation of DTI data easier for specialists.
Abstract:
A methodology of exploratory data analysis investigating the phenomenon of orographic precipitation enhancement is proposed. The precipitation observations obtained from three Swiss Doppler weather radars are analysed for the major precipitation event of August 2005 in the Alps. Image processing techniques are used to detect significant precipitation cells/pixels from radar images while filtering out spurious effects due to ground clutter. The contribution of topography to precipitation patterns is described by an extensive set of topographical descriptors computed from the digital elevation model at multiple spatial scales. Additionally, the motion vector field is derived from subsequent radar images and integrated into a set of topographic features to highlight the slopes exposed to main flows. Following the exploratory data analysis with a recent algorithm of spectral clustering, it is shown that orographic precipitation cells are generated under specific flow and topographic conditions. Repeatability of precipitation patterns in particular spatial locations is found to be linked to specific local terrain shapes, e.g. at the top of hills and on the upwind side of the mountains. This methodology and our empirical findings for the Alpine region provide a basis for building computational data-driven models of orographic enhancement and triggering of precipitation. Copyright (C) 2011 Royal Meteorological Society .
Abstract:
This project carries out research both on finding predictors via clustering techniques and on reviewing free data mining software. The research is based on a case study in which, in addition to the free KDD software used by the scientific community, a new free tool for pre-processing the data is presented. The predictors are intended for the e-learning domain, since the data from which they have to be inferred are student qualifications from different e-learning environments. Through our case study, not only are clustering algorithms tested but additional goals are also proposed.
Abstract:
Recurrent breast cancer occurring after the initial treatment is associated with poor outcome. A bimodal relapse pattern after surgery for primary tumor has been described with peaks of early and late recurrence occurring at about 2 and 5 years, respectively. Although several clinical and pathological features have been used to discriminate between low- and high-risk patients, the identification of molecular biomarkers with prognostic value remains an unmet need in the current management of breast cancer. Using microarray-based technology, we have performed a microRNA expression analysis in 71 primary breast tumors from patients that either remained disease-free at 5 years post-surgery (group A) or developed early (group B) or late (group C) recurrence. Unsupervised hierarchical clustering of microRNA expression data segregated tumors in two groups, mainly corresponding to patients with early recurrence and those with no recurrence. Microarray data analysis and RT-qPCR validation led to the identification of a set of 5 microRNAs (the 5-miRNA signature) differentially expressed between these two groups: miR-149, miR-10a, miR-20b, miR-30a-3p and miR-342-5p. All five microRNAs were down-regulated in tumors from patients with early recurrence. We show here that the 5-miRNA signature defines a high-risk group of patients with shorter relapse-free survival and has predictive value to discriminate non-relapsing versus early-relapsing patients (AUC = 0.993, p-value<0.05). Network analysis based on miRNA-target interactions curated by public databases suggests that down-regulation of the 5-miRNA signature in the subset of early-relapsing tumors would result in an overall increased proliferative and angiogenic capacity. In summary, we have identified a set of recurrence-related microRNAs with potential prognostic value to identify patients who will likely develop metastasis early after primary breast surgery.
Abstract:
HEMOLIA (a project under the European Community's 7th Framework Programme) is a new-generation Anti-Money Laundering (AML) intelligent multi-agent alert and investigation system which, in addition to the traditional financial data, makes extensive use of modern society's huge telecom data source, thereby opening up a new dimension of capabilities to all money laundering fighters (FIUs, LEAs) and financial institutions (banks, insurance companies, etc.). This master's thesis project was carried out at AIA, one of the partners in the HEMOLIA project, in Barcelona. The objective of this thesis is to find the clusters in a network drawn from the financial data. An extensive literature survey has been carried out, and several standard algorithms related to networks have been studied and implemented. The clustering problem is NP-hard, and several algorithms such as K-Means and hierarchical clustering have been applied to study problems in sociology, evolution, anthropology, etc. However, these algorithms have certain drawbacks which make them very difficult to implement. The thesis suggests (a) a possible improvement to the K-Means algorithm, (b) a novel approach to the clustering problem using genetic algorithms and (c) a new algorithm for finding the cluster of a node using a genetic algorithm.
Abstract:
OBJECTIVE: This study assessed clustering of multiple risk behaviors (i.e., low leisure-time physical activity, low fruits/vegetables intake, and high alcohol consumption) with level of cigarette consumption. METHODS: Data from the 2002 Swiss Health Survey, a population-based cross-sectional telephone survey assessing health and self-reported risk behaviors, were used. A total of 18,005 subjects (8052 men and 9953 women) aged 25 years or older participated. RESULTS: Smokers more frequently had low leisure-time physical activity, low fruits/vegetables intake, and high alcohol consumption than non- and ex-smokers. Frequency of each risk behavior increased steadily with cigarette consumption. Clustering of risk behaviors increased with cigarette consumption in both men and women. For men, the odds ratios of multiple (≥2) risk behaviors other than smoking, adjusted for age, nationality, and educational level, were 1.14 (95% confidence interval: 0.97, 1.33) for ex-smokers, 1.24 (0.93, 1.64) for light smokers (1-9 cigarettes/day), 1.72 (1.36, 2.17) for moderate smokers (10-19 cigarettes/day), and 3.07 (2.59, 3.64) for heavy smokers (≥20 cigarettes/day) versus non-smokers. Similar odds ratios were found for women for corresponding groups, i.e., 1.01 (0.86, 1.19), 1.26 (1.00, 1.58), 1.62 (1.33, 1.98), and 2.75 (2.30, 3.29). CONCLUSIONS: Counseling and intervention with smokers should take into account the strong clustering of risk behaviors with level of cigarette consumption.
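As a reading aid for the odds ratios and confidence intervals quoted above, the following is a minimal sketch of how an unadjusted odds ratio and its Wald 95% interval can be computed from a 2x2 table. It is not the study's method: the reported values are adjusted for age, nationality and educational level (which would require a regression model), and the function name and counts below are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts for illustration only, NOT taken from the survey
or_value, ci_low, ci_high = odds_ratio_ci(a=120, b=80, c=300, d=500)
print(f"OR = {or_value:.2f} (95% CI: {ci_low:.2f}, {ci_high:.2f})")
```

An interval whose lower bound lies above 1, as for the moderate and heavy smokers in the abstract, indicates an association that is unlikely to be due to chance alone at the 5% level.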
Abstract:
Our essay aims at studying suitable statistical methods for the clustering of compositional data in situations where observations are constituted by trajectories of compositional data, that is, by sequences of composition measurements along a domain. Observed trajectories are known as "functional data" and several methods have been proposed for their analysis. In particular, methods for clustering functional data, known as Functional Cluster Analysis (FCA), have been applied by practitioners and scientists in many fields. To our knowledge, FCA techniques have not been extended to cope with the problem of clustering compositional data trajectories. In order to extend FCA techniques to the analysis of compositional data, FCA clustering techniques have to be adapted by using a suitable compositional algebra. The present work centres on the following question: given a sample of compositional data trajectories, how can we formulate a segmentation procedure giving homogeneous classes? To address this problem we follow the steps described below. First of all, we adapt the well-known spline smoothing techniques in order to cope with the smoothing of compositional data trajectories. In fact, an observed curve can be thought of as the sum of a smooth part plus some noise due to measurement errors. Spline smoothing techniques are used to isolate the smooth part of the trajectory: clustering algorithms are then applied to these smooth curves. The second step consists in building suitable metrics for measuring the dissimilarity between trajectories: we propose a metric that accounts for differences in both shape and level, and a metric accounting for differences in shape only. A simulation study is performed in order to evaluate the proposed methodologies, using both hierarchical and partitional clustering algorithms. The quality of the obtained results is assessed by means of several indices.
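The abstract does not state which compositional algebra or which metrics the authors use. As one possible illustration of the two kinds of dissimilarity it mentions, the sketch below assumes the centered log-ratio (clr) transform and Aitchison-style distances, skips the spline smoothing step, and treats trajectories as sampled at the same points. All function names are hypothetical.

```python
import math

def closure(parts):
    """Rescale strictly positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]

def clr(composition):
    """Centered log-ratio transform of one composition."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def remove_level(clr_rows):
    """Subtract each part's mean along the trajectory, keeping only its shape."""
    n = len(clr_rows)
    means = [sum(row[k] for row in clr_rows) / n for k in range(len(clr_rows[0]))]
    return [[row[k] - means[k] for k in range(len(row))] for row in clr_rows]

def _euclid(rows1, rows2):
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(rows1, rows2)
                         for a, b in zip(r1, r2)))

def dist_shape_and_level(traj1, traj2):
    """Distance in clr space, sensitive to both shape and level."""
    return _euclid([clr(c) for c in traj1], [clr(c) for c in traj2])

def dist_shape_only(traj1, traj2):
    """Same distance after removing each trajectory's mean level."""
    return _euclid(remove_level([clr(c) for c in traj1]),
                   remove_level([clr(c) for c in traj2]))

# Hypothetical three-part trajectories observed at three points
traj_a = [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4], [0.4, 0.3, 0.3]]
# traj_b is traj_a perturbed by a constant composition: same shape, other level
traj_b = [closure([2 * x, y, z]) for x, y, z in traj_a]

print(dist_shape_and_level(traj_a, traj_b))
print(dist_shape_only(traj_a, traj_b))
```

Under these assumptions a constant perturbation shifts every clr vector by the same amount, so the two trajectories differ in level but the shape-only distance between them is zero up to rounding.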
Abstract:
Globalization involves several facility location problems that need to be handled at large scale. Location Allocation (LA) is a combinatorial problem in which the distance among points in the data space matters. Taking advantage of this distance property of the domain, we exploit the capability of clustering techniques to partition the data space in order to convert an initial large LA problem into several simpler LA problems. In particular, our motivating problem involves a huge geographical area that can be partitioned under overall conditions. We present different types of clustering techniques and then perform a cluster analysis over our dataset in order to partition it. After that, we solve the LA problem by applying a simulated annealing algorithm to the clustered and non-clustered data in order to work out how profitable the clustering is and which of the presented methods is the most suitable.
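The partitioning idea described above can be sketched as follows. The abstract does not say which clustering methods were compared, so this example assumes plain K-means (Lloyd's algorithm) on hypothetical 2-D coordinates; the simulated annealing solver that would then run on each subproblem is not shown.

```python
import random
from collections import defaultdict

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members)
                                         for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical demand sites forming two distant regions
sites = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels, centers = kmeans(sites, k=2)

# Each cluster becomes a smaller, independent location-allocation subproblem
subproblems = defaultdict(list)
for site, label in zip(sites, labels):
    subproblems[label].append(site)
print(dict(subproblems))
```

Whether solving the subproblems separately pays off is exactly what the comparison between clustered and non-clustered data in the abstract is meant to measure.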
Abstract:
To cluster textual sequence types (discourse types/modes) in French texts, the K-means algorithm with high-dimensional embeddings and a fuzzy clustering algorithm were applied to clauses whose POS (part-of-speech) n-gram profiles had previously been extracted. Uni-, bi- and trigrams were used on four 19th-century French short stories by Maupassant. For high-dimensional embeddings, power transformations on the chi-squared distances between clauses were explored. Preliminary results show that high-dimensional embeddings improve the quality of clustering, in contrast to the use of bi- and trigrams, whose performance is disappointing, possibly because of feature space sparsity.
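To make the distance computation concrete, here is a minimal sketch of POS n-gram profiles and the chi-squared distance between clause profiles as used in correspondence analysis. The tag sequences are invented, the function names are hypothetical, and applying the power transformation directly to the distance (rather than to its square) is an assumption, since the abstract does not specify it.

```python
import math
from collections import Counter

def ngram_profile(tags, n):
    """Count the POS n-grams of one clause."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def chi2_distance_matrix(profiles, power=1.0):
    """Pairwise chi-squared distances between count profiles, raised to `power`."""
    features = sorted(set().union(*profiles))
    grand_total = sum(sum(p.values()) for p in profiles)
    # Relative weight of each n-gram over the whole corpus
    col_mass = {f: sum(p[f] for p in profiles) / grand_total for f in features}
    # Row profiles: relative frequencies within each clause
    rows = []
    for p in profiles:
        row_total = sum(p.values())
        rows.append({f: p[f] / row_total for f in features})

    def dist(r1, r2):
        return math.sqrt(sum((r1[f] - r2[f]) ** 2 / col_mass[f]
                             for f in features))

    return [[dist(r1, r2) ** power for r2 in rows] for r1 in rows]

# Hypothetical POS-tagged clauses (tags are illustrative, not from the corpus)
clauses = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "ADV", "ADJ"],
]
profiles = [ngram_profile(tags, n=1) for tags in clauses]
distances = chi2_distance_matrix(profiles, power=0.5)
print(distances[0][2])
```

With n=2 or n=3 most n-grams occur in very few clauses, which is one way to see the feature space sparsity that the abstract suggests as an explanation for the weaker bigram and trigram results.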
Abstract:
BACKGROUND: The trithorax group (trxG) and Polycomb group (PcG) proteins are responsible for the maintenance of stable transcriptional patterns of many developmental regulators. They bind to specific regions of DNA and direct the post-translational modifications of histones, playing a role in the dynamics of chromatin structure. RESULTS: We have performed genome-wide expression studies of trx and ash2 mutants in Drosophila melanogaster. Using computational analysis of our microarray data, we have identified 25 clusters of genes potentially regulated by TRX. Most of these clusters consist of genes that encode structural proteins involved in cuticle formation. This organization appears to be a distinctive feature of the regulatory networks of TRX and other chromatin regulators, since we have observed the same arrangement in clusters after experiments performed with ASH2, as well as in experiments performed by others with NURF, dMyc, and ASH1. We have also found many of these clusters to be significantly conserved in D. simulans, D. yakuba, D. pseudoobscura and partially in Anopheles gambiae. CONCLUSION: The analysis of genes governed by chromatin regulators has led to the identification of clusters of functionally related genes conserved in other insect species, suggesting this chromosomal organization is biologically important. Moreover, our results indicate that TRX and other chromatin regulators may act globally on chromatin domains that contain transcriptionally co-regulated genes.
Abstract:
AIMS/HYPOTHESIS: The metabolic syndrome comprises a clustering of cardiovascular risk factors, but the underlying mechanism is not known. Mice with targeted disruption of endothelial nitric oxide synthase (eNOS) are hypertensive and insulin resistant. We wondered whether eNOS deficiency in mice is associated with a phenotype mimicking the human metabolic syndrome. METHODS AND RESULTS: In addition to arterial pressure and insulin sensitivity (euglycaemic hyperinsulinaemic clamp), we measured the plasma concentration of leptin, insulin, cholesterol, triglycerides, free fatty acids, fibrinogen and uric acid in 10- to 12-week-old eNOS-/- and wild-type mice. We also assessed glucose tolerance under basal conditions and following a metabolic stress with a high-fat diet. As expected, eNOS-/- mice were hypertensive and insulin resistant, as evidenced by fasting hyperinsulinaemia and a roughly 30% lower steady-state glucose infusion rate during the clamp. eNOS-/- mice had a 1.5- to 2-fold elevation of the cholesterol, triglyceride and free fatty acid plasma concentrations. Even though body weight was comparable, the leptin plasma level was 30% higher in eNOS-/- than in wild-type mice. Finally, uric acid and fibrinogen were elevated in the eNOS-/- mice. Whereas under basal conditions glucose tolerance was comparable in knock-out and control mice, on a high-fat diet knock-out mice became significantly more glucose intolerant than control mice. CONCLUSIONS: A single gene defect, eNOS deficiency, causes a clustering of cardiovascular risk factors in young mice. We speculate that defective nitric oxide synthesis could trigger many of the abnormalities making up the metabolic syndrome in humans.
Abstract:
The article examines the structure of the collaboration networks of research groups where Slovenian and Spanish PhD students are pursuing their doctorate. The units of analysis are student-supervisor dyads. We use duocentred networks, a novel network structure appropriate for networks which are centred around a dyad. A cluster analysis reveals three typical clusters of research groups. Those which are large and belong to several institutions are labelled as bridging social capital. Those which are small and centred in a single institution but have high cohesion are labelled as bonding social capital. Those which are small and have low cohesion are called weak social capital groups. Academic performance of both PhD students and supervisors is highest in bridging groups and lowest in weak groups. Other variables are also found to differ according to the type of research group. Finally, some recommendations regarding academic and research policy are drawn.
Abstract:
BACKGROUND: Little is known about engagement in multiple health behaviours in childhood cancer survivors. METHODS: Using latent class analysis, we identified health behaviour patterns in 835 adult survivors of childhood cancer (age 20-35 years) and 1670 age- and sex-matched controls from the general population. Behaviour groups were determined from replies to questions on smoking, drinking, cannabis use, sporting activities, diet, sun protection and skin examination. RESULTS: The model identified four health behaviour patterns: 'risk-avoidance', with a generally healthy behaviour; 'moderate drinking', with higher levels of sporting activities, but moderate alcohol-consumption; 'risk-taking', engaging in several risk behaviours; and 'smoking', smoking but not drinking. Similar proportions of survivors and controls fell into the 'risk-avoiding' (42% vs 44%) and the 'risk-taking' cluster (14% vs 12%), but more survivors were in the 'moderate drinking' (39% vs 28%) and fewer in the 'smoking' cluster (5% vs 16%). Determinants of health behaviour clusters were gender, migration background, income and therapy. CONCLUSION: A comparable proportion of childhood cancer survivors as in the general population engage in multiple health-compromising behaviours. Because of increased vulnerability of survivors, multiple risk behaviours should be addressed in targeted health interventions.
Abstract:
The coverage and volume of geo-referenced datasets are extensive and incessantly growing. The systematic capture of geo-referenced information generates large volumes of spatio-temporal data to be analyzed. Clustering and visualization play a key role in the exploratory data analysis and the extraction of knowledge embedded in these data. However, new challenges in visualization and clustering are posed when dealing with the special characteristics of these data: for instance, their complex structures, large quantity of samples, variables involved in a temporal context, high dimensionality and large variability in cluster shapes.
The central aim of my thesis is to propose new algorithms and methodologies for clustering and visualization, in order to assist the knowledge extraction from spatio-temporal geo-referenced data, thus improving decision-making processes.
I present two original algorithms, one for clustering, the Fuzzy Growing Hierarchical Self-Organizing Networks (FGHSON), and the second for exploratory visual data analysis, the Tree-structured Self-organizing Maps Component Planes. In addition, I present methodologies that, combined with FGHSON and the Tree-structured SOM Component Planes, allow the integration of space and time seamlessly and simultaneously in order to extract knowledge embedded in a temporal context.
The originality of the FGHSON lies in its capability to reflect the underlying structure of a dataset in a hierarchical fuzzy way. A hierarchical fuzzy representation of clusters is crucial when data include complex structures with large variability of cluster shapes, variances, densities and number of clusters. The most important characteristics of the FGHSON include: (1) It does not require an a priori setup of the number of clusters. (2) The algorithm executes several self-organizing processes in parallel. Hence, when dealing with large datasets, the processes can be distributed, reducing the computational cost. (3) Only three parameters are necessary to set up the algorithm.
In the case of the Tree-structured SOM Component Planes, the novelty of this algorithm lies in its ability to create a structure that allows the visual exploratory data analysis of large high-dimensional datasets. This algorithm creates a hierarchical structure of Self-Organizing Map Component Planes, arranging similar variables' projections in the same branches of the tree. Hence, similarities in variables' behavior can be easily detected (e.g. local correlations, maximal and minimal values and outliers).
Both FGHSON and the Tree-structured SOM Component Planes were applied to several agroecological problems, proving to be very efficient in the exploratory analysis and clustering of spatio-temporal datasets.
In this thesis I also tested three soft competitive learning algorithms. Two of them are well-known unsupervised soft competitive algorithms, namely the Self-Organizing Maps (SOMs) and the Growing Hierarchical Self-Organizing Maps (GHSOMs); the third was our original contribution, the FGHSON. Although the algorithms presented here have been used in several areas, to my knowledge there is no work applying and comparing the performance of those techniques when dealing with spatio-temporal geospatial data, as is presented in this thesis.
I propose original methodologies to explore spatio-temporal geo-referenced datasets through time. Our approach uses time windows to capture temporal similarities and variations by using the FGHSON clustering algorithm. The developed methodologies are used in two case studies. In the first, the objective was to find similar agroecozones through time, and in the second it was to find similar environmental patterns shifted in time.
Several results presented in this thesis have led to new contributions to agroecological knowledge, for instance in sugar cane and blackberry production.
Finally, in the framework of this thesis we developed several software tools: (1) a Matlab toolbox that implements the FGHSON algorithm, and (2) a program called BIS (Bio-inspired Identification of Similar agroecozones), an interactive graphical user interface tool which integrates the FGHSON algorithm with Google Earth in order to show zones with similar agroecological characteristics.