11 results for Data clustering
in BORIS: Bern Open Repository and Information System - Bern - Switzerland
Abstract:
An important problem in unsupervised data clustering is how to determine the number of clusters. Here we investigate how this can be achieved in an automated way by using interrelation matrices of multivariate time series. Two nonparametric and purely data-driven algorithms are expounded and compared. The first exploits the eigenvalue spectra of surrogate data, while the second employs the eigenvector components of the interrelation matrix. Compared to the first algorithm, the second approach is computationally faster and not limited to linear interrelation measures.
Abstract:
The task considered in this paper is performance evaluation of region segmentation algorithms in the ground-truth-based paradigm. Given a machine segmentation and a ground-truth segmentation, performance measures are needed. We propose to consider the image segmentation problem as one of data clustering and, as a consequence, to use measures for comparing clusterings developed in statistics and machine learning. By doing so, we obtain a variety of performance measures which have not been used before in image processing. In particular, some of these measures have the highly desired property of being a metric. Experimental results are reported on both synthetic and real data to validate the measures and compare them with others.
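As one concrete illustration of a clustering-comparison measure that is a true metric, the variation of information can be computed from two label lists of equal length (for example, one region label per pixel). The abstract does not say which measures the authors evaluate, so the choice of measure and the function name below are assumptions made for illustration only.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information VI = H(A|B) + H(B|A), in nats.

    labels_a and labels_b assign a cluster label to the same items
    (e.g. machine vs. ground-truth region label for each pixel).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # -p(a,b) * [log p(a|b) + log p(b|a)]
        vi -= p_ab * (math.log(p_ab / p_b) + math.log(p_ab / p_a))
    return vi
```

The value is zero when the two segmentations agree up to a renaming of labels and grows as they diverge.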
Abstract:
We have investigated the use of hierarchical clustering of flow cytometry data to classify samples of conventional central chondrosarcoma, a malignant cartilage-forming tumor of uncertain cellular origin, according to similarities with surface marker profiles of several known cell types. Human primary chondrosarcoma cells, articular chondrocytes, mesenchymal stem cells, fibroblasts, and a panel of tumor cell lines of chondrocytic or epithelial origin were clustered based on the expression profile of eleven surface markers. For clustering, eight hierarchical clustering algorithms, three distance metrics, as well as several approaches for data preprocessing, including multivariate outlier detection, logarithmic transformation, and z-score normalization, were systematically evaluated. By selecting clustering approaches shown to give reproducible results for cluster recovery of known cell types, primary conventional central chondrosarcoma cells could be grouped into two main clusters with distinctive marker expression signatures: one group clustering together with mesenchymal stem cells (CD49b-high/CD10-low/CD221-high) and a second group clustering close to fibroblasts (CD49b-low/CD10-high/CD221-low). Hierarchical clustering also revealed substantial differences between primary conventional central chondrosarcoma cells and established chondrosarcoma cell lines, with the latter not only segregating apart from primary tumor cells and normal tissue cells, but clustering together with cell lines from epithelial lineage. Our study provides a foundation for the use of hierarchical clustering applied to flow cytometry data as a powerful tool to classify samples according to marker expression patterns, which could lead to the discovery of new cancer subtypes.
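A minimal sketch of the kind of pipeline described above: z-score normalization of each marker followed by agglomerative clustering. It assumes Euclidean distance and average linkage, which is only one of many possible combinations; the abstract does not state which algorithms or distance metrics were finally selected, and the function names and sample values here are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every column (marker) to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    sds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(rows, n_clusters):
    """Naive agglomerative clustering; returns lists of row indices."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = mean(
                    euclidean(rows[a], rows[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical samples: two markers measured on four samples.
samples = [[1.0, 10.0], [1.2, 11.0], [8.0, 90.0], [8.5, 95.0]]
groups = average_linkage(zscore_columns(samples), n_clusters=2)
```

The nested loops make this cubic or worse in the number of samples, so it is suitable only for small panels; a library implementation would be preferable for real data.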
Does published orthodontic research account for clustering effects during statistical data analysis?
Abstract:
In orthodontics, multiple site observations within patients or multiple observations collected at consecutive time points are often encountered. Clustered designs require larger sample sizes compared to individually randomized trials and special statistical analyses that account for the fact that observations within clusters are correlated. It is the purpose of this study to assess to what degree clustering effects are considered during design and data analysis in the three major orthodontic journals. The contents of the most recent 24 issues of the American Journal of Orthodontics and Dentofacial Orthopedics (AJODO), Angle Orthodontist (AO), and European Journal of Orthodontics (EJO) from December 2010 backwards were hand searched. Articles with clustering effects were identified, and it was recorded whether the authors accounted for those effects. Additionally, information was collected on: involvement of a statistician, single or multicenter study, number of authors in the publication, geographical area, and statistical significance. From the 1584 articles, after exclusions, 1062 were assessed for clustering effects, of which 250 (23.5 per cent) were considered to have clustering effects in the design (kappa = 0.92, 95 per cent CI: 0.67-0.99 for inter-rater agreement). Of the studies with clustering effects, only 63 (25.2 per cent) indicated that they had accounted for them. There was evidence that the studies published in the AO have higher odds of accounting for clustering effects [AO versus AJODO: odds ratio (OR) = 2.17, 95 per cent confidence interval (CI): 1.06-4.43, P = 0.03; EJO versus AJODO: OR = 1.90, 95 per cent CI: 0.84-4.24, non-significant; and EJO versus AO: OR = 1.15, 95 per cent CI: 0.57-2.33, non-significant]. The results of this study indicate that only about a quarter of the studies with clustering effects account for this in statistical data analysis.
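The sample-size penalty of a clustered design is commonly quantified with the design effect, 1 + (m - 1) × ICC, where m is the number of observations per cluster and ICC is the intraclass correlation. The sketch below applies that standard formula; the cluster size, ICC, and baseline sample size are hypothetical values chosen for illustration and are not taken from the study.

```python
import math


def design_effect(cluster_size, icc):
    """Variance inflation from correlated observations within clusters."""
    return 1.0 + (cluster_size - 1) * icc


def inflated_sample_size(n_individual, cluster_size, icc):
    """Observations needed under clustering to match the power of
    n_individual independent observations."""
    return math.ceil(n_individual * design_effect(cluster_size, icc))


# Hypothetical example: 4 teeth measured per patient, ICC = 0.5.
needed = inflated_sample_size(100, cluster_size=4, icc=0.5)
```

With one observation per cluster the design effect is 1 and no inflation is needed, which is why ignoring clustering understates the required sample size whenever the ICC is positive.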
Abstract:
The study describes brain areas involved in medial temporal lobe (mTL) seizures of 12 patients. All patients showed so-called oro-alimentary behavior within the first 20 s of clinical seizure manifestation characteristic of mTL seizures. Single photon emission computed tomography (SPECT) images of regional cerebral blood flow (rCBF) were acquired from the patients in ictal and interictal phases and from normal volunteers. Image analysis employed categorical comparisons with statistical parametric mapping and principal component analysis (PCA) to assess functional connectivity. PCA supplemented the findings of the categorical analysis by decomposing the covariance matrix containing images of patients and healthy subjects into distinct component images of independent variance, including areas not identified by the categorical analysis. Two principal components (PCs) discriminated the subject groups: patients with right or left mTL seizures and normal volunteers, indicating distinct neuronal networks implicated by the seizure. Both PCs were correlated with seizure duration, one positively and the other negatively, confirming their physiological significance. The independence of the two PCs yielded a clear clustering of subject groups. The local pattern within the temporal lobe describes critical relay nodes which are the counterpart of oro-alimentary behavior: (1) right mesial temporal zone and ipsilateral anterior insula in right mTL seizures, and (2) temporal poles on both sides that are densely interconnected by the anterior commissure. Regions remote from the temporal lobe may be related to seizure propagation and include positively and negatively loaded areas. These patterns, the covarying areas of the temporal pole and occipito-basal visual association cortices, for example, are related to known anatomic paths.
Abstract:
AIMS: Multiple arrhythmia re-inductions were recently shown in His-Purkinje system (HPS) ventricular tachycardia (VT). We hypothesized that HPS VT was a frequent mechanism of repetitive or incessant VT and assessed diagnostic criteria to select patients likely to have HPS VT. METHODS AND RESULTS: Consecutive patients with clustering VT episodes (>3 sustained monomorphic VT within 2 weeks) were included in the analysis. HPS VT was considered plausible in patients with (i) impaired left ventricular function associated with dilated cardiomyopathy or valvular heart disease; or (ii) ECG during VT similar to sinus rhythm QRS or to bundle-branch block QRS. HPS VT was plausible in 12 of 48 patients and HPS VT was demonstrated in 6 of 12 patients (50%, or 13% of the whole study group). Median VT cycle length was 318 ms (250-550). Catheter ablation was successful in all six patients. CONCLUSION: His-Purkinje system VT is found in a significant number of patients with repetitive or incessant VT episodes, and in a large proportion of patients with predefined clinical or electrocardiographic characteristics. Since it is easily amenable to catheter ablation, our data support the screening of all patients with repetitive VT in this regard and an invasive approach in a selected group of patients.
Abstract:
Dynamic changes in ERP topographies can be conveniently analyzed by means of microstates, the so-called "atoms of thoughts", that represent brief periods of quasi-stable synchronized network activation. Comparing temporal microstate features such as on- and offset or duration between groups and conditions therefore allows a precise assessment of the timing of cognitive processes. So far, this has been achieved by assigning the individual time-varying ERP maps to spatially defined microstate templates obtained from clustering the grand mean data into predetermined numbers of topographies (microstate prototypes). Features obtained from these individual assignments were then statistically compared. This has the problem that the individual noise dilutes the match between individual topographies and templates leading to lower statistical power. We therefore propose a randomization-based procedure that works without assigning grand-mean microstate prototypes to individual data. In addition, we propose a new criterion to select the optimal number of microstate prototypes based on cross-validation across subjects. After a formal introduction, the method is applied to a sample data set of an N400 experiment and to simulated data with varying signal-to-noise ratios, and the results are compared to existing methods. In a first comparison with previously employed statistical procedures, the new method showed an increased robustness to noise, and a higher sensitivity for more subtle effects of microstate timing. We conclude that the proposed method is well-suited for the assessment of timing differences in cognitive processes. The increased statistical power allows identifying more subtle effects, which is particularly important in small and scarce patient populations.
Abstract:
BACKGROUND: HCV coinfection remains a major cause of morbidity and mortality among HIV-infected individuals and its incidence has increased dramatically in HIV-infected men who have sex with men (MSM). METHODS: Hepatitis C virus (HCV) coinfection in the Swiss HIV Cohort Study (SHCS) was studied by combining clinical data with HIV-1 pol-sequences from the SHCS Drug Resistance Database (DRDB). We inferred maximum-likelihood phylogenetic trees, determined Swiss HIV-transmission pairs as monophyletic patient pairs, and then considered the distribution of HCV on those pairs. RESULTS: Among the 9748 patients in the SHCS-DRDB with known HCV status, 2768 (28%) were HCV-positive. Focusing on subtype B (7644 patients), we identified 1555 potential HIV-1 transmission pairs. There, we found that, even after controlling for transmission group, calendar year, age and sex, the odds for an HCV coinfection were increased by an odds ratio (OR) of 3.2 [95% confidence interval (CI) 2.2, 4.7] if a patient clustered with another HCV-positive case. This strong association persisted if transmission groups of intravenous drug users (IDUs), MSMs and heterosexuals (HETs) were considered separately (in all cases OR >2). Finally, we found that HCV incidence was increased by a hazard ratio of 2.1 (1.1, 3.8) for individuals paired with an HCV-positive partner. CONCLUSIONS: Patients whose HIV is closely related to that of HIV/HCV-coinfected patients have a higher risk for carrying or acquiring HCV themselves. This indicates the occurrence of domestic and sexual HCV transmission and allows the identification of patients with a high HCV-infection risk.
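For readers unfamiliar with the statistic, the sketch below computes a crude odds ratio and its Wald 95% confidence interval from a 2×2 table. This is a simplification: the odds ratio reported above was adjusted for transmission group, calendar year, age and sex, which requires a regression model, and the abstract does not give the underlying counts. The counts and the function name here are hypothetical.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald confidence interval for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    All four counts must be positive.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts: HCV status by whether the paired partner is HCV-positive.
or_hat, ci_low, ci_high = odds_ratio_wald_ci(20, 10, 10, 20)
```

An interval that excludes 1 corresponds to a statistically significant association at the chosen level.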
Abstract:
We consider the problem of fitting a union of subspaces to a collection of data points drawn from one or more subspaces and corrupted by noise and/or gross errors. We pose this problem as a non-convex optimization problem, where the goal is to decompose the corrupted data matrix as the sum of a clean and self-expressive dictionary plus a matrix of noise and/or gross errors. By self-expressive we mean a dictionary whose atoms can be expressed as linear combinations of themselves with low-rank coefficients. In the case of noisy data, our key contribution is to show that this non-convex matrix decomposition problem can be solved in closed form from the SVD of the noisy data matrix. The solution involves a novel polynomial thresholding operator on the singular values of the data matrix, which requires minimal shrinkage. For one subspace, a particular case of our framework leads to classical PCA, which requires no shrinkage. For multiple subspaces, the low-rank coefficients obtained by our framework can be used to construct a data affinity matrix from which the clustering of the data according to the subspaces can be obtained by spectral clustering. In the case of data corrupted by gross errors, we solve the problem using an alternating minimization approach, which combines our polynomial thresholding operator with the more traditional shrinkage-thresholding operator. Experiments on motion segmentation and face clustering show that our framework performs on par with state-of-the-art techniques at a reduced computational cost.
Abstract:
SUMMARY There is interest in the potential of companion animal surveillance to provide data to improve pet health and to provide early warning of environmental hazards to people. We implemented a companion animal surveillance system in Calgary, Alberta and the surrounding communities. Informatics technologies automatically extracted electronic medical records from participating veterinary practices and identified cases of enteric syndrome in the warehoused records. The data were analysed using time-series analyses and a retrospective space-time permutation scan statistic. We identified a seasonal pattern of reports of occurrences of enteric syndromes in companion animals and four statistically significant clusters of enteric syndrome cases. The cases within each cluster were examined and information about the animals involved (species, age, sex), their vaccination history, possible exposure or risk behaviour history, information about disease severity, and the aetiological diagnosis was collected. We then assessed whether the cases within the cluster were unusual and if they represented an animal or public health threat. There was often insufficient information recorded in the medical record to characterize the clusters by aetiology or exposures. Space-time analysis of companion animal enteric syndrome cases found evidence of clustering. Collection of more epidemiologically relevant data would enhance the utility of practice-based companion animal surveillance.
Abstract:
Smart homes for the aging population have recently started attracting the attention of the research community. The "health state" of smart homes is comprised of many different levels; starting with the physical health of citizens, it also includes longer-term health norms and outcomes, as well as the arena of positive behavior changes. One of the problems of interest is to monitor the activities of daily living (ADL) of the elderly, aiming at their protection and well-being. For this purpose, we installed passive infrared (PIR) sensors to detect motion in a specific area inside a smart apartment and used them to collect a set of ADL. In a novel approach, we describe a technology that allows the ground truth collected in one smart home to train activity recognition systems for other smart homes. We asked the users to label all instances of all ADL only once and subsequently applied data mining techniques to cluster in-home sensor firings. Each cluster would therefore represent the instances of the same activity. Once the clusters were associated with their corresponding activities, our system was able to recognize future activities. To improve the activity recognition accuracy, our system preprocessed raw sensor data by identifying overlapping activities. To evaluate the recognition performance from a 200-day dataset, we implemented three different active learning classification algorithms and compared their performance: naive Bayesian (NB), support vector machine (SVM) and random forest (RF). Based on our results, the RF classifier recognized activities with an average specificity of 96.53%, a sensitivity of 68.49%, a precision of 74.41% and an F-measure of 71.33%, outperforming both the NB and SVM classifiers. Further clustering markedly improved the results of the RF classifier. An activity recognition system based on PIR sensors in conjunction with a clustering classification approach was able to detect ADL from datasets collected from different homes. Thus, our PIR-based smart home technology could improve care and provide valuable information to better understand the functioning of our societies, as well as to inform both individual and collective action in a smart city scenario.
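The reported F-measure is consistent with the harmonic mean of the reported precision and sensitivity, as the short check below shows. The helper that derives rates from confusion-matrix counts is included only for illustration; the abstract gives no raw counts, so the function names and any counts passed to it are hypothetical.

```python
def f_measure(precision, sensitivity):
    """F1 score: harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def rates_from_counts(tp, fp, tn, fn):
    """Standard rates from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Precision and sensitivity as reported for the RF classifier.
reported = f_measure(0.7441, 0.6849)
```

Because the abstract describes the figures as averages, an F-measure averaged per activity need not match this calculation exactly; here the two agree to two decimal places.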