925 results for Data clustering. Fuzzy C-Means. Cluster centers initialization. Validation indices
Abstract:
In 2000 the European Statistical Office published the guidelines for developing the Harmonized European Time Use Surveys system. Under such a unified framework, the first Time Use Survey of national scope was conducted in Spain during 2002–03. The aim of these surveys is to understand human behavior and the lifestyle of people. Time allocation data are of compositional nature in origin, that is, they are subject to non-negativity and constant-sum constraints. Thus, standard multivariate techniques cannot be directly applied to analyze them. The goal of this work is to identify homogeneous Spanish Autonomous Communities with regard to the typical activity pattern of their respective populations. To this end, a fuzzy clustering approach is followed. Rather than the hard partitioning of classical clustering, where objects are allocated to only a single group, fuzzy methods identify overlapping groups of objects by allowing them to belong to more than one group. Concretely, the probabilistic fuzzy c-means algorithm is conveniently adapted to deal with the Spanish Time Use Survey microdata. As a result, a map distinguishing Autonomous Communities with similar activity patterns is drawn.
Key words: Time use data; Fuzzy clustering; FCM; simplex space; Aitchison distance
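As a rough illustration of how fuzzy c-means can be adapted to compositional data, the sketch below computes the Aitchison distance through the centred log-ratio (clr) transform and applies the standard probabilistic FCM membership update with that distance. This is a minimal sketch, not the authors' implementation: the function names, the fuzzifier `m = 2`, and the three-part time-share vectors are all assumptions made for the example.

```python
import math

def clr(x):
    """Centred log-ratio transform of a composition with strictly positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def aitchison_distance(x, y):
    """Aitchison distance: Euclidean distance between clr-transformed compositions."""
    return math.dist(clr(x), clr(y))

def fcm_memberships(data, centers, m=2.0):
    """Standard probabilistic FCM membership update; each row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in data:
        d = [aitchison_distance(x, c) for c in centers]
        if any(di == 0.0 for di in d):
            # A point lying exactly on one or more centers shares full membership among them.
            hits = [1.0 if di == 0.0 else 0.0 for di in d]
            memberships.append([h / sum(hits) for h in hits])
            continue
        inv = [di ** -exponent for di in d]
        memberships.append([v / sum(inv) for v in inv])
    return memberships

# Hypothetical daily time shares (e.g. rest, work, leisure); illustrative values only.
shares = [[0.35, 0.30, 0.35], [0.34, 0.33, 0.33], [0.40, 0.10, 0.50]]
centers = [[0.35, 0.30, 0.35], [0.40, 0.10, 0.50]]
memberships = fcm_memberships(shares, centers)
```

Working in clr coordinates makes the distance invariant to the scale of the composition, so shares expressed as proportions or as hours per day give the same result.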
Abstract:
Our essay aims at studying suitable statistical methods for the clustering of compositional data in situations where observations are constituted by trajectories of compositional data, that is, by sequences of composition measurements along a domain. Observed trajectories are known as “functional data” and several methods have been proposed for their analysis. In particular, methods for clustering functional data, known as Functional Cluster Analysis (FCA), have been applied by practitioners and scientists in many fields. To our knowledge, FCA techniques have not been extended to cope with the problem of clustering compositional data trajectories. In order to extend FCA techniques to the analysis of compositional data, FCA clustering techniques have to be adapted by using a suitable compositional algebra. The present work centres on the following question: given a sample of compositional data trajectories, how can we formulate a segmentation procedure giving homogeneous classes? To address this problem we follow the steps described below. First of all, we adapt the well-known spline smoothing techniques in order to cope with the smoothing of compositional data trajectories. In fact, an observed curve can be thought of as the sum of a smooth part plus some noise due to measurement errors. Spline smoothing techniques are used to isolate the smooth part of the trajectory: clustering algorithms are then applied to these smooth curves. The second step consists in building suitable metrics for measuring the dissimilarity between trajectories: we propose a metric that accounts for differences in both shape and level, and a metric accounting for differences in shape only. A simulation study is performed in order to evaluate the proposed methodologies, using both hierarchical and partitional clustering algorithms. The quality of the obtained results is assessed by means of several indices.
Abstract:
BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology. RESULTS: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
Abstract:
Positron emission tomography with [18F] fluorodeoxyglucose (FDG-PET) plays a well-established role in assisting early detection of frontotemporal lobar degeneration (FTLD). Here, we examined the impact of intensity normalization to different reference areas on accuracy of FDG-PET to discriminate between patients with mild FTLD and healthy elderly subjects. FDG-PET was conducted at two centers using different acquisition protocols: 41 FTLD patients and 42 controls were studied at center 1, 11 FTLD patients and 13 controls were studied at center 2. All PET images were intensity normalized to the cerebellum, primary sensorimotor cortex (SMC), cerebral global mean (CGM), and a reference cluster with most preserved FDG uptake in the aforementioned patients group of center 1. Metabolic deficits in the patient group at center 1 appeared 1.5, 3.6, and 4.6 times greater in spatial extent, when tracer uptake was normalized to the reference cluster rather than to the cerebellum, SMC, and CGM, respectively. Logistic regression analyses based on normalized values from FTLD-typical regions showed that at center 1, cerebellar, SMC, CGM, and cluster normalizations differentiated patients from controls with accuracies of 86%, 76%, 75% and 90%, respectively. A similar order of effects was found at center 2. Cluster normalization leads to a significant increase of statistical power in detecting early FTLD-associated metabolic deficits. The established FTLD-specific cluster can be used to improve detection of FTLD on a single case basis at independent centers - a decisive step towards early diagnosis and prediction of FTLD syndromes enabling specific therapies in the future.
Abstract:
OBJECTIVE: To assess the effect of a governmentally led, center-based child care physical activity program (Youp'la Bouge) on child motor skills. PATIENTS AND METHODS: We conducted a single-blinded cluster randomized controlled trial in 58 Swiss child care centers. Centers were randomly selected and assigned 1:1 to a control or intervention group. The intervention lasted from September 2009 to June 2010 and included training of the educators, adaptation of the child care built environment, parental involvement and daily physical activity. Motor skill was the primary outcome, and body mass index (BMI), physical activity and quality of life were secondary outcomes. The intervention implementation was also assessed. RESULTS: At baseline, 648 children present on the motor test day were included (age 3.3 +/- 0.6, BMI 16.3 +/- 1.3 kg/m2, 13.2% overweight, 49% girls) and 313 received the intervention. Relative to children in the control group (n = 201), children in the intervention group (n = 187) showed no significant increase in motor skills (delta of mean change (95% confidence interval): -0.2 (-0.8 to 0.3), p = 0.43) or in any of the secondary outcomes. Not all child care centers implemented all the intervention components. Within the intervention group, several predictors were positively associated with trial outcomes: 1) free access to a movement space and a parental information session for motor skills; 2) highly motivated and trained educators for BMI; 3) free access to a movement space and purchase of mobile equipment for physical activity (all p < 0.05). CONCLUSION: This "real-life" physical activity program in child care centers confirms the complexity of implementing an intervention outside a study setting and identified potentially relevant predictors that could improve future programs. TRIAL REGISTRATION: ClinicalTrials.gov NCT00967460 http://clinicaltrials.gov/ct2/show/NCT00967460.
Abstract:
Multicentric carpotarsal osteolysis (MCTO) is a rare skeletal dysplasia characterized by aggressive osteolysis, particularly affecting the carpal and tarsal bones, and is frequently associated with progressive renal failure. Using exome capture and next-generation sequencing in five unrelated simplex cases of MCTO, we identified previously unreported missense mutations clustering within a 51 base pair region of the single exon of MAFB, validated by Sanger sequencing. A further six unrelated simplex cases with MCTO were also heterozygous for previously unreported mutations within this same region, as were affected members of two families with autosomal-dominant MCTO. MAFB encodes a transcription factor that negatively regulates RANKL-induced osteoclastogenesis and is essential for normal renal development. Identification of this gene paves the way for development of novel therapeutic approaches for this crippling disease and provides insight into normal bone and kidney development.
Abstract:
Many classification systems rely on clustering techniques in which a collection of training examples is provided as an input, and a number of clusters c1,...cm modelling some concept C results as an output, such that every cluster ci is labelled as positive or negative. Given a new, unlabelled instance enew, the above classification is used to determine to which particular cluster ci this new instance belongs. In such a setting clusters can overlap, and a new unlabelled instance can be assigned to more than one cluster with conflicting labels. In the literature, such a case is usually solved non-deterministically by making a random choice. This paper presents a novel, hybrid approach to solve this situation by combining a neural network for classification along with a defeasible argumentation framework which models preference criteria for performing clustering.
Abstract:
PURPOSE: According to estimations around 230 people die as a result of radon exposure in Switzerland. This public health concern makes reliable indoor radon prediction and mapping methods necessary in order to improve risk communication to the public. The aim of this study was to develop an automated method to classify lithological units according to their radon characteristics and to develop mapping and predictive tools in order to improve local radon prediction. METHOD: About 240 000 indoor radon concentration (IRC) measurements in about 150 000 buildings were available for our analysis. The automated classification of lithological units was based on k-medoids clustering via pair-wise Kolmogorov distances between IRC distributions of lithological units. For IRC mapping and prediction we used random forests and Bayesian additive regression trees (BART). RESULTS: The automated classification groups lithological units well in terms of their IRC characteristics. Especially the IRC differences in metamorphic rocks like gneiss are well revealed by this method. The maps produced by random forests soundly represent the regional difference of IRCs in Switzerland and improve the spatial detail compared to existing approaches. We could explain 33% of the variations in IRC data with random forests. Additionally, the influence of a variable evaluated by random forests shows that building characteristics are less important predictors for IRCs than spatial/geological influences. BART could explain 29% of IRC variability and produced maps that indicate the prediction uncertainty. CONCLUSION: Ensemble regression trees are a powerful tool to model and understand the multidimensional influences on IRCs. Automatic clustering of lithological units complements this method by facilitating the interpretation of radon properties of rock types. This study provides an important element for radon risk communication. 
Future approaches should consider taking into account further variables like soil gas radon measurements as well as more detailed geological information.
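The classification step described above (k-medoids on pairwise Kolmogorov distances between IRC distributions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the Kolmogorov distance is taken as the largest gap between two empirical CDFs, the k-medoids routine is a simple alternating scheme with a naive deterministic start, and the four "units" and their measurements are invented for the example.

```python
def kolmogorov_distance(a, b):
    """Largest absolute difference between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    best = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] == v:
            i += 1
        while j < len(b) and b[j] == v:
            j += 1
        best = max(best, abs(i / len(a) - j / len(b)))
    return best

def assign(dist, medoids):
    """Label each item with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    medoids = list(range(k))  # naive start: first k items (an assumption)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimises the total distance to its cluster members.
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Hypothetical radon measurements for four lithological units (illustrative only).
units = [[10, 20, 30, 40], [12, 22, 32, 42], [200, 300, 400, 500], [210, 310, 410, 510]]
dist = [[kolmogorov_distance(u, v) for v in units] for u in units]
medoids, labels = k_medoids(dist, k=2)
```

Because k-medoids only needs a distance matrix, it works directly with a distance between distributions, whereas k-means would require averaging the distributions themselves.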
Abstract:
This study aimed at identifying different conditions of coffee plants after the harvesting period, using data mining and spectral behavior profiles from the Hyperion/EO1 sensor. The Hyperion image, with a spatial resolution of 30 m, was acquired on August 28, 2008, at the end of the coffee harvest season in the studied area. For image pre-processing, atmospheric and signal/noise effect corrections were carried out using the FLAASH and MNF (Minimum Noise Fraction Transform) algorithms, respectively. Spectral behavior profiles (38) of different coffee varieties were generated from 150 Hyperion bands. The spectral behavior profiles were analyzed by the Expectation-Maximization (EM) algorithm considering 2, 3, 4 and 5 clusters. A t-test at the 5% significance level was used to verify the similarity among the wavelength cluster means. The results demonstrated that it is possible to separate five different clusters, which comprised different coffee crop conditions, making it possible to improve future intervention actions.
Abstract:
This master's thesis introduces the fuzzy tolerance/equivalence relation and its application in cluster analysis. The work presents the construction of fuzzy equivalence relations using increasing generators. We investigate the role of increasing generators in the creation of intersection, union and complement operators. The objective is to develop different varieties of fuzzy tolerance/equivalence relations using different varieties of increasing generators. Finally, we perform a comparative study of these fuzzy tolerance/equivalence relations in their application to a clustering method.
Abstract:
Coronary artery disease (CAD) is a worldwide leading cause of death. The standard method for evaluating critical partial occlusions is coronary arteriography, a catheterization technique which is invasive, time consuming, and costly. There are noninvasive approaches for the early detection of CAD. The basis for the noninvasive diagnosis of CAD has been laid in a sequential analysis of the risk factors, and the results of the treadmill test and myocardial perfusion scintigraphy (MPS). Many investigators have demonstrated that the diagnostic applications of MPS are appropriate for patients who have an intermediate likelihood of disease. Although this information is useful, it is only partially utilized in clinical practice due to the difficulty to properly classify the patients. Since the seminal work of Lotfi Zadeh, fuzzy logic has been applied in numerous areas. In the present study, we proposed and tested a model to select patients for MPS based on fuzzy sets theory. A group of 1053 patients was used to develop the model and another group of 1045 patients was used to test it. Receiver operating characteristic curves were used to compare the performance of the fuzzy model against expert physician opinions, and showed that the performance of the fuzzy model was equal or superior to that of the physicians. Therefore, we conclude that the fuzzy model could be a useful tool to assist the general practitioner in the selection of patients for MPS.
Abstract:
Participation in street running races has increased; this calls for detecting risks before physical exertion. Objective. To identify behavioral risk factors and readiness among people registered for a race. Method. Cross-sectional study of amateur runners aged 18-64 years. Digital survey with modules from IPAQ, PARQ+ and STEP. Systematic random sampling with n=510, for an expected physical inactivity of 35% (±5%). Physical activity level, consumption of alcohol (hazardous), fruit, vegetables, tobacco and salt, and readiness were assessed. Results. Compliance with the physical activity level was 97.4%; 2.4% consume an optimal level of fruit or vegetables (with differences by age, sex and socioeconomic stratum), 3.7% smoke and 44.1% have hazardous alcohol consumption. A positive PARQ+ was reported by 19.8%, and 5.7% require supervision. There are differences by work and study. Discussion. The amateur runners meet the physical activity level, but not the levels for the other factors. One safety strategy for street athletics is to assess lifestyle-related risk factors as well as readiness.
Abstract:
In the past decade, the amount of data in the biological field has grown steadily; techniques for the analysis of biological data have been developed and new tools have been introduced. Several computational methods are based on unsupervised neural network algorithms that are widely used for multiple purposes including clustering and visualization, such as Self-Organizing Maps (SOM). Unfortunately, even though this method is unsupervised, its performance in terms of result quality and learning speed is strongly dependent on the initialization of the neuron weights. In this paper we present a new initialization technique based on a totally connected undirected graph that reports relations among some interesting features of the input data. Results of experimental tests, in which the proposed algorithm is compared to the original initialization techniques, show that our technique ensures faster learning and better performance in terms of quantization error.
Abstract:
This dissertation deals with aspects of sequential data assimilation (in particular ensemble Kalman filtering) and numerical weather forecasting. In the first part, the recently formulated Ensemble Kalman-Bucy filter (EnKBF) is revisited. It is shown that the previously used numerical integration scheme fails when the magnitude of the background error covariance grows beyond that of the observational error covariance in the forecast window. Therefore, we present a suitable integration scheme that handles the stiffening of the differential equations involved and does not entail further computational expense. Moreover, a transform-based alternative to the EnKBF is developed: under this scheme, the operations are performed in the ensemble space instead of in the state space. Advantages of this formulation are explained. For the first time, the EnKBF is implemented in an atmospheric model. The second part of this work deals with ensemble clustering, a phenomenon that arises when performing data assimilation using deterministic ensemble square root filters (EnSRFs) in highly nonlinear forecast models. Namely, an M-member ensemble detaches into an outlier and a cluster of M-1 members. Previous works may suggest that this issue represents a failure of EnSRFs; this work dispels that notion. It is shown that ensemble clustering can also be reverted by nonlinear processes, in particular the alternation between nonlinear expansion and compression of the ensemble for different regions of the attractor. Some EnSRFs that use random rotations have been developed to overcome this issue; these formulations are analyzed and their advantages and disadvantages with respect to common EnSRFs are discussed. The third and last part contains the implementation of the Robert-Asselin-Williams (RAW) filter in an atmospheric model.
The RAW filter is an improvement to the widely popular Robert-Asselin filter that successfully suppresses spurious computational waves while avoiding any distortion in the mean value of the function. Using statistical significance tests both at the local and field level, it is shown that the climatology of the SPEEDY model is not modified by the changed time stepping scheme; hence, no retuning of the parameterizations is required. It is found that the accuracy of the medium-term forecasts is increased by using the RAW filter.
Abstract:
Ensemble clustering (EC) can arise in data assimilation with ensemble square root filters (EnSRFs) using non-linear models: an M-member ensemble splits into a single outlier and a cluster of M−1 members. The stochastic Ensemble Kalman Filter does not present this problem. Modifications to the EnSRFs by a periodic resampling of the ensemble through random rotations have been proposed to address it. We introduce a metric to quantify the presence of EC and present evidence to dispel the notion that EC leads to filter failure. Starting from a univariate model, we show that EC is not a permanent but a transient phenomenon; it occurs intermittently in non-linear models. We perform a series of data assimilation experiments using a standard EnSRF and an EnSRF modified by resampling through random rotations. The modified EnSRF alleviates issues associated with EC at the cost of traceability of individual ensemble trajectories, and it cannot use some of the algorithms that enhance the performance of the standard EnSRF. In the non-linear regimes of low-dimensional models, the analysis root mean square error of the standard EnSRF slowly grows with ensemble size if the size is larger than the dimension of the model state. However, we do not observe this problem in a more complex model that uses an ensemble size much smaller than the dimension of the model state, along with inflation and localisation. Overall, we find that transient EC does not handicap the performance of the standard EnSRF.
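The abstract does not define its EC metric, so the sketch below is only a hypothetical indicator for a scalar ensemble, not the authors' measure. It relies on Samuelson's inequality: for M values with sample standard deviation s, no member can lie further than s(M−1)/√M from the mean, and that bound is reached exactly when M−1 members coincide and one is an outlier. The function name and the example ensembles are assumptions.

```python
import math
import statistics

def ec_indicator(ensemble):
    """Hypothetical clustering indicator in [0, 1] for a scalar ensemble.

    Returns the largest standardized deviation from the ensemble mean,
    divided by its theoretical maximum (M - 1) / sqrt(M). A value of 1
    means M - 1 members coincide and one member is an outlier.
    """
    m = len(ensemble)
    mean = statistics.mean(ensemble)
    spread = statistics.stdev(ensemble)  # sample standard deviation (M - 1 divisor)
    if spread == 0.0:
        return 0.0  # fully collapsed ensemble: no outlier to detect
    max_z = max(abs(x - mean) for x in ensemble) / spread
    return max_z / ((m - 1) / math.sqrt(m))

clustered = ec_indicator([0.0, 0.0, 0.0, 0.0, 5.0])   # one outlier, four identical members
spread_out = ec_indicator([-2.0, -1.0, 0.0, 1.0, 2.0])  # evenly spread members
```

Tracking such a value over assimilation cycles would show whether clustering is persistent or intermittent. A multivariate state would need a different construction, for example applying the indicator along the leading direction of ensemble spread.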