948 results for Clustering Analysis
Abstract:
This paper presents the results of our data mining study of Pb-Zn (lead-zinc) ore assay records from a mining enterprise in Bulgaria. We examined the dataset, cleaned outliers, visualized the data, and computed dataset statistics. A Pb-Zn cluster data mining model was created for segmentation and prediction of Pb-Zn ore assay data. The model consists of five clusters and a set of DMX queries. We analyzed the content, size, structure, and characteristics of the clusters. The DMX queries allow browsing and managing the clusters, as well as predicting ore assay records. The model was tested and validated to show that its accuracy is reasonable before it is used in a production environment. The model can be used to adjust the mine's grinding and flotation processing parameters in near real time, which is important for the efficiency of the Pb-Zn ore beneficiation process. ACM Computing Classification System (1998): H.2.8, H.3.3.
Abstract:
To determine the factors influencing the distribution of β-amyloid (Aβ) deposits in Alzheimer's disease (AD), the spatial patterns of the diffuse, primitive, and classic Aβ deposits were studied from the superior temporal gyrus (STG) to sector CA4 of the hippocampus in six sporadic cases of the disease. In cortical gyri and in the CA sectors of the hippocampus, the Aβ deposits were distributed either in clusters 200-6400 μm in diameter that were regularly distributed parallel to the tissue boundary or in larger clusters greater than 6400 μm in diameter. In some regions, smaller clusters of Aβ deposits were aggregated into larger 'superclusters'. In many cortical gyri, the density of Aβ deposits was positively correlated with distance below the gyral crest. In the majority of regions, clusters of the diffuse, primitive, and classic deposits were not spatially correlated with each other. In two cases, double immunolabelled to reveal the Aβ deposits and blood vessels, the classic Aβ deposits were clustered around the larger diameter vessels. These results suggest a complex pattern of Aβ deposition in the temporal lobe in sporadic AD. A regular distribution of Aβ deposit clusters may reflect the degeneration of specific cortico-cortical and cortico-hippocampal pathways and the influence of the cerebral blood vessels. Large-scale clustering may reflect the aggregation of deposits in the depths of the sulci and the coalescence of smaller clusters.
Abstract:
The purpose of the study was to examine the relationship between teacher beliefs and actual classroom practice in early literacy instruction. Conjoint analysis was used to measure teachers' beliefs on four early literacy factors: phonological awareness, print awareness, graphophonic awareness, and structural awareness. A collective case study format was then used to measure the correspondence of teachers' beliefs with their actual classroom practice. Ninety Project READS participants were given twelve cards in an orthogonal experimental design describing students who either met or did not meet criteria on the four early literacy factors. Conjoint measurements of whether the student is an efficient reader were taken. These measurements provided relative importance scores for each respondent. Based on the relative importance scores, four teachers were chosen to participate in a collective case study. The conjoint results enabled the clustering of teachers into four distinct groups, each aligned with one of the four early literacy factors. K-means cluster analysis of the relative importance measurements showed commonalities among the ninety respondents' beliefs. The collective case study results were mixed. Implications for researchers and practitioners include the use of conjoint analysis in measuring teacher beliefs on the four early literacy factors. Further, an understanding of teacher preferences on these beliefs may assist in curriculum design and thereby increase educational effectiveness. Finally, comparisons between teachers' beliefs on the four early literacy factors and actual instructional practices may facilitate teacher self-reflection, thus encouraging positive teacher change.
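As an illustration of how relative importance scores are commonly derived in conjoint analysis, the sketch below takes each factor's share of the total part-worth range. The part-worth values and the function name `relative_importance` are hypothetical and are not taken from the study, which does not state its exact computation here.

```python
def relative_importance(part_worths):
    """Return each factor's share of the summed part-worth ranges."""
    # Range of a factor = spread between its best and worst level utility.
    ranges = {
        factor: max(levels.values()) - min(levels.values())
        for factor, levels in part_worths.items()
    }
    total = sum(ranges.values())
    if total == 0:
        raise ValueError("all factors have zero range")
    return {factor: r / total for factor, r in ranges.items()}


# Hypothetical part-worth utilities for one respondent (illustrative only).
example = {
    "phonological": {"met": 1.5, "not_met": -1.5},
    "print": {"met": 0.5, "not_met": -0.5},
    "graphophonic": {"met": 0.75, "not_met": -0.75},
    "structural": {"met": 0.25, "not_met": -0.25},
}
scores = relative_importance(example)
```

Vectors of such scores, one per respondent, are the kind of input a k-means step could then group.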
Abstract:
Microarray technology provides a high-throughput technique for studying gene expression. Microarrays can help us diagnose different types of cancers, understand biological processes, assess host responses to drugs and pathogens, find markers for specific diseases, and much more. Microarray experiments generate large amounts of data, so effective data processing and analysis are critical for making reliable inferences from the data. The first part of the dissertation addresses the problem of finding an optimal set of genes (biomarkers) to classify a set of samples as diseased or normal. Three statistical gene selection methods (GS, GS-NR, and GS-PCA) were developed to identify a set of genes that best differentiate between samples. A comparative study of different classification tools was performed, and the best combinations of gene selection and classifiers for multi-class cancer classification were identified. For most of the benchmark cancer data sets, the gene selection method proposed in this dissertation, GS, outperformed other gene selection methods. The classifiers based on Random Forests, neural network ensembles, and K-nearest neighbor (KNN) showed consistently good performance. A striking commonality among these classifiers is that they all use a committee-based approach, suggesting that ensemble classification methods are superior. The same biological problem may be studied at different research labs and/or performed using different lab protocols or samples. In such situations, it is important to combine results from these efforts. The second part of the dissertation addresses the problem of pooling the results from different independent experiments to obtain improved results. Four statistical pooling techniques (Fisher inverse chi-square method, Logit method, Stouffer's Z transform method, and Liptak-Stouffer weighted Z-method) were investigated in this dissertation.
These pooling techniques were applied to the problem of identifying cell cycle-regulated genes in two different yeast species. As a result, improved sets of cell cycle-regulated genes were identified. The last part of the dissertation explores the effectiveness of wavelet data transforms for the task of clustering. Discrete wavelet transforms, with an appropriate choice of wavelet bases, were shown to be effective in producing clusters that were biologically more meaningful.
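Three of the pooling techniques named above have standard textbook forms, sketched here with the Python standard library only. The function names are hypothetical, the inputs are assumed to be independent one-sided p-values strictly between 0 and 1, and the Logit method is not covered. This is not the dissertation's implementation.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def fisher_combined_p(p_values):
    """Fisher's inverse chi-square method: X = -2 * sum(ln p), df = 2k."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Chi-square survival function for even df = 2k has a closed form:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    tail = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail


def stouffer_combined_p(p_values, weights=None):
    """Stouffer's Z transform; passing weights gives the Liptak-Stouffer form."""
    if weights is None:
        weights = [1.0] * len(p_values)
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    z = numerator / math.sqrt(sum(w * w for w in weights))
    return 1.0 - _STD_NORMAL.cdf(z)


# Two weakly significant results pool into stronger combined evidence.
pooled_fisher = fisher_combined_p([0.05, 0.05])
pooled_stouffer = stouffer_combined_p([0.05, 0.05])
pooled_weighted = stouffer_combined_p([0.05, 0.05], weights=[2.0, 2.0])
```

With equal weights the weighted form reduces to the unweighted one; unequal weights are usually chosen to reflect sample size or study quality.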
Abstract:
To carry out their specific roles in the cell, genes and gene products often work together in groups, forming many relationships among themselves and with other molecules. Such relationships include physical protein-protein interaction relationships, regulatory relationships, metabolic relationships, genetic relationships, and much more. With advances in science and technology, high-throughput technologies have been developed to simultaneously detect tens of thousands of pairwise protein-protein interactions and protein-DNA interactions. However, the data generated by high-throughput methods are prone to noise. Furthermore, the technology itself has its limitations and cannot detect all kinds of relationships between genes and their products. Thus there is a pressing need to investigate all kinds of relationships and their roles in a living system using bioinformatic approaches, which is a central challenge in Computational Biology and Systems Biology. This dissertation focuses on exploring relationships between genes and gene products using bioinformatic approaches. Specifically, we consider problems related to regulatory relationships, protein-protein interactions, and semantic relationships between genes. A regulatory element is an important pattern or "signal", often located in the promoter of a gene, which is used in the process of turning a gene "on" or "off". Predicting regulatory elements is a key step in exploring the regulatory relationships between genes and gene products. In this dissertation, we consider the problem of improving the prediction of regulatory elements by using comparative genomics data. With regard to protein-protein interactions, we have developed bioinformatics techniques to estimate support for the data on these interactions.
While protein-protein interactions and regulatory relationships can be detected by high-throughput biological techniques, there is another type of relationship, called a semantic relationship, that cannot be detected by a single technique but can be inferred using multiple sources of biological data. The contributions of this thesis involve the development and application of a set of bioinformatic approaches that address the challenges mentioned above. These include (i) an EM-based algorithm that improves the prediction of regulatory elements using comparative genomics data, (ii) an approach for estimating the support of protein-protein interaction data, with application to functional annotation of genes, (iii) a novel method for inferring functional networks of genes, and (iv) techniques for clustering genes using multi-source data.
Abstract:
This dissertation establishes a novel data-driven method to identify language network activation patterns in pediatric epilepsy through the use of Principal Component Analysis (PCA) on functional magnetic resonance imaging (fMRI). A total of 122 subjects’ data sets from five different hospitals were included in the study through a web-based repository site designed here at FIU. Research was conducted to evaluate different classification and clustering techniques in identifying hidden activation patterns and their associations with meaningful clinical variables. The results were assessed through agreement analysis with the conventional methods of lateralization index (LI) and visual rating. What is unique in this approach is the new mechanism designed for projecting language network patterns in the PCA-based decisional space. Synthetic activation maps were randomly generated from real data sets to uniquely establish nonlinear decision functions (NDF), which are then used to classify any new fMRI activation map as typical or atypical. The best nonlinear classifier was obtained on a 4D space with a complexity (nonlinearity) degree of 7. Based on the significant association of language dominance and intensities with the top eigenvectors of the PCA decisional space, a new algorithm was deployed to delineate primary cluster members without intensity normalization. In this case, three distinct activation patterns (groups) were identified (averaged kappa with rating 0.65, with LI 0.76) and were characterized by the regions of: (1) the left inferior frontal gyrus (IFG) and left superior temporal gyrus (STG), considered typical for the language task; (2) the IFG, left mesial frontal lobe, and right cerebellum regions, representing a variant left dominant pattern by higher activation; and (3) the right homologues of the first pattern in Broca's and Wernicke's language areas.
Interestingly, group 2 was found to reflect a different language compensation mechanism than reorganization. Its high intensity activation suggests a possible remote effect on the right hemisphere focus on traditionally left-lateralized functions. In retrospect, this data-driven method provides new insights into mechanisms for brain compensation/reorganization and neural plasticity in pediatric epilepsy.
Abstract:
This dissertation introduces a new approach for assessing the effects of pediatric epilepsy on the language connectome. Two novel data-driven network construction approaches are presented. These methods rely on connecting different brain regions using either the extent or the intensity of language-related activations as identified by independent component analysis of fMRI data. An auditory description decision task (ADDT) paradigm was used to activate the language network for 29 patients and 30 controls recruited from three major pediatric hospitals. Empirical evaluations illustrated that pediatric epilepsy can cause, or is associated with, a reduction in network efficiency. Patients showed a propensity to inefficiently employ the whole brain network to perform the ADDT language task; by contrast, controls seemed to efficiently use smaller segregated network components to achieve the same task. To explain the causes of the decreased efficiency, graph theoretical analysis was carried out. The analysis revealed no substantial global network feature differences between the patient and control groups. It also showed that for both subject groups the language network exhibited small-world characteristics; however, the patients' extent-of-activation network showed a tendency towards more random networks. It was also shown that the intensity-of-activation network displayed ipsilateral hub reorganization on the local level. The left hemispheric hubs displayed greater centrality values for patients, whereas the right hemispheric hubs displayed greater centrality values for controls. This hub hemispheric disparity was not correlated with the right atypical language laterality found in six patients. Finally, it was shown that a multi-level unsupervised clustering scheme based on self-organizing maps, a type of artificial neural network, and k-means was able to fairly and blindly separate the subjects into their respective patient or control groups.
The clustering was initiated using the local nodal centrality measurements only. Compared to the extent-of-activation network, the intensity-of-activation network clustering demonstrated better precision. This outcome supports the assertion that the local centrality differences presented by the intensity-of-activation network can be associated with focal epilepsy.
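The abstract does not say which efficiency measure was used. As one common choice in graph theoretical analysis, the sketch below computes global efficiency (the mean inverse shortest-path length over all ordered node pairs) for an unweighted, undirected graph. The adjacency-dictionary format and the toy graphs are assumptions for illustration only.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def global_efficiency(adjacency):
    """Mean of 1/d(i, j) over ordered pairs; unreachable pairs contribute 0."""
    nodes = list(adjacency)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0.0
    for source in nodes:
        dist = shortest_path_lengths(adjacency, source)
        total += sum(1.0 / d for target, d in dist.items() if target != source)
    return total / (n * (n - 1))


# Toy graphs (hypothetical): a 3-node path and a fully connected 4-node graph.
path_graph = {0: [1], 1: [0, 2], 2: [1]}
complete_graph = {i: [j for j in range(4) if j != i] for i in range(4)}
eff_path = global_efficiency(path_graph)
eff_complete = global_efficiency(complete_graph)
```

A fully connected graph reaches the maximum value of 1, and longer paths between regions lower the score.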
Abstract:
Driven by cross-strait political and economic forces, Taiwan has been one of the major source markets for the Hong Kong tourism industry since 1987. The purposes of this study were to investigate (1) the factors influencing travel motivation, (2) the clusters of travel motivations, and (3) the market segmentation of Taiwanese tourists visiting Hong Kong based on these clusters. Self-report surveys were distributed through ten travel agents to collect data from 366 Taiwanese travelers. Factor analysis identified four push factors and six pull factors as travel motivations. Cluster analysis then yielded five groups, each of which possesses a unique profile (location difference, visiting frequency, travel satisfaction, and destination loyalty). Suggestions for developing effective marketing strategies to attract Taiwanese tourists to Hong Kong are also provided.
Abstract:
This study subdivides the Weddell Sea, Antarctica, into seafloor regions using multivariate statistical methods. These regions are categories used for comparing, contrasting and quantifying biogeochemical processes and biodiversity between ocean regions geographically but also regions under development within the scope of global change. The division obtained is characterized by the dominating components and interpreted in terms of ruling environmental conditions. The analysis uses 28 environmental variables for the sea surface, 25 variables for the seabed and 9 variables for the analysis between surface and bottom variables. The data were collected during the years 1983-2013. Some data were interpolated. The statistical errors of several interpolation methods (e.g. IDW, Indicator, Ordinary and Co-Kriging) with changing settings were compared to identify the most reasonable method. The multivariate mathematical procedures used are regionalized classification via k-means cluster analysis, canonical-correlation analysis and multidimensional scaling. Canonical-correlation analysis identifies the influencing factors in the different parts of the cove. Several methods for identifying the optimum number of clusters were tested. For the seabed, 8 and 12 clusters were identified as reasonable numbers for clustering the Weddell Sea. For the sea surface the numbers 8 and 13 and for the top/bottom analysis 8 and 3 were identified, respectively. Additionally, the results of 20 clusters are presented for the three alternatives, offering the first small-scale environmental regionalization of the Weddell Sea. In particular, the 12-cluster results identify marine-influenced regions which can be clearly separated from those determined by the geological catchment area and the ones dominated by river discharge.
Abstract:
Olfactory sensory neurons (OSNs), which detect a myriad of odorants, are known to express one allele of one olfactory receptor (OR) gene (Olfr) from the largest gene family in the mammalian genome. The OSNs expressing the same OR project their axons to the main olfactory bulb, where they converge to form glomeruli. This “One neuron-one receptor rule” makes the olfactory epithelium (OE), which consists of a vast number of OSNs expressing unique ORs, one of the most heterogeneous cell populations. However, the mechanism by which the single OR allele is chosen remains unclear, as does the question of whether one OSN expresses only a single OR gene, a hypothesis that had not been rigorously verified when we performed the experiments. Moreover, failure of axonal targeting to a single glomerulus was observed in MeCP2-deficient OSNs, where delayed development was proposed as an explanation for the phenotype. How Mecp2 mutation causes this aberrant targeting is not entirely understood.
In this dissertation, we explored the transcriptomes of single, mature OSNs by single-cell RNA-Seq to reveal their heterogeneity and further studied OR gene expression in these isolated OSNs. The singularity of sequenced OSNs was ensured by the observation of monoallelic expression of X-linked genes in hybrid samples from crosses between mice of different strains, where strain-specific polymorphisms could be used to track the allelic origins of SNP-containing reads. The clustering of expression profiles from triplicates that originated from the same cell assured that the transcriptomic identities of OSNs were maintained through the experimental process. The average gene expression profiles of sequenced OSNs correlated well with the conventional transcriptome data of FACS-sorted Omp-positive cells, and the top-ranked expression of the OR was confirmed in the single-OSN transcriptomes. While exploring cellular diversity, in addition to OR genes, we revealed nearly 200 differentially expressed genes among the sequenced OSNs in this study. Among the 36 sequenced OSNs, eight cells (22.2%) showed multiple OR gene expression, and the presence of additional ORs was not restricted to the neighboring loci that shared the transcriptional effect of the primary OR expression, suggesting that the “One neuron-one receptor rule” might not be strictly true at the transcription level. All of the inferable ORs, including additional co-expressed ORs, were shown to be monoallelic. Sequencing of 21 Mecp2308 mutant OSNs, of which 62% expressed more than one OR gene, with expression levels of the additional ORs significantly higher than those in the wild type, suggested that MeCP2 plays a role in the regulation of singular OR gene expression.
Dual-label in situ hybridization, along with the sequence data, revealed that dorsal and ventral ORs were co-expressed in the same Mecp2 mutant OSN, further implying that MeCP2 might be involved in the regulation of OR territories in the OE. Our results suggest a new role for MeCP2 in OR gene choice and confirm that the multiple-OR expression caused by Mecp2 mutation was not accompanied by the delayed OSN development observed in previous studies of Mecp2 mutants.
Abstract:
Two concepts in rural economic development policy have been the focus of much research and policy action: the identification and support of clusters or networks of firms and the availability and adoption by rural businesses of Information and Communication Technologies (ICT). From a theoretical viewpoint these policies are based on two contrasting models, with clustering seen as a process of economic agglomeration, and ICT-mediated communication as a means of facilitating economic dispersion. The study’s conceptual framework is based on four interrelated elements: location, interaction, knowledge, and advantage, together with the concept of networks which is employed as an operationally and theoretically unifying concept. The research questions are developed in four successive categories: Policy, Theory, Networks, and Method. The questions are approached using a study of two contrasting groups of rural small businesses in West Cork, Ireland: (a) Speciality Foods, and (b) firms in Digital Products and Services. The study combines Social Network Analysis (SNA) with Qualitative Thematic Analysis, using data collected from semi-structured interviews with 58 owners or managers of these businesses. Data comprise relational network data on the firms’ connections to suppliers, customers, allies and competitors, together with linked qualitative data on how the firms established connections, and how tacit and codified knowledge was sourced and utilised. The research finds that the key characteristics identified in the cluster literature are evident in the sample of Speciality Food businesses, in relation to flows of tacit knowledge, social embedding, and the development of forms of social capital. In particular the research identified the presence of two distinct forms of collective social capital in this network, termed “community” and “reputation”. 
By contrast the sample of Digital Products and Services businesses does not have the form of a cluster, but matches more closely to dispersive models, or “chain” structures. Much of the economic and social structure of this set of firms is best explained in terms of “project organisation”, and by the operation of an individual rather than collective form of “reputation”. The rural setting in which these firms are located has resulted in their being service-centric, and consequently they rely on ICT-mediated communication in order to exchange tacit knowledge “at a distance”. It is this factor, rather than inputs of codified knowledge, that most strongly influences their operation and their need for availability and adoption of high quality communication technologies. Thus the findings have applicability in relation to theory in Economic Geography and to policy and practice in Rural Development. In addition the research contributes to methodological questions in SNA, and to methodological questions about the combination or mixing of quantitative and qualitative methods.
Abstract:
Non-parametric multivariate analyses of complex ecological datasets are widely used. Following appropriate pre-treatment of the data, inter-sample resemblances are calculated using appropriate measures. Ordination and clustering derived from these resemblances are used to visualise relationships among samples (or variables). Hierarchical agglomerative clustering with group-average (UPGMA) linkage is often the clustering method chosen. Using an example dataset of zooplankton densities from the Bristol Channel and Severn Estuary, UK, a range of existing and new clustering methods are applied and the results compared. Although the examples focus on analysis of samples, the methods may also be applied to species analysis. Dendrograms derived by hierarchical clustering are compared using cophenetic correlations, which are also used to determine the optimum β in flexible beta clustering. A plot of cophenetic correlation against original dissimilarities reveals that a tree may be a poor representation of the full multivariate information. UNCTREE is an unconstrained binary divisive clustering algorithm in which values of the ANOSIM R statistic are used to determine (binary) splits in the data, to form a dendrogram. A form of flat clustering, k-R clustering, uses a combination of ANOSIM R and Similarity Profiles (SIMPROF) analyses to determine the optimum value of k, the number of groups into which samples should be clustered, and the sample membership of the groups. Robust outcomes from the application of such a range of differing techniques to the same resemblance matrix, as here, result in greater confidence in the validity of a clustering approach.
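To make the cophenetic correlation concrete, the following sketch builds a group-average (UPGMA) hierarchy from a small dissimilarity matrix and correlates the original dissimilarities with the heights at which each pair of samples first joins. The matrices are toy values, not the zooplankton data, and the function names are hypothetical; this naive version is intended only for small inputs with at least two distinct merge heights.

```python
import math
from itertools import combinations


def upgma_cophenetic(dist):
    """Return the cophenetic matrix of a group-average (UPGMA) hierarchy."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    coph = [[0.0] * n for _ in range(n)]

    def average_link(a, b):
        # Mean dissimilarity over all cross-cluster sample pairs.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    next_id = n
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: average_link(*ab))
        height = average_link(a, b)
        # Every pair split across a and b first joins at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return coph


def cophenetic_correlation(dist):
    """Pearson correlation between original and cophenetic dissimilarities."""
    coph = upgma_cophenetic(dist)
    pairs = list(combinations(range(len(dist)), 2))
    x = [dist[i][j] for i, j in pairs]
    y = [coph[i][j] for i, j in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Tree-like (ultrametric) toy matrix: the dendrogram represents it exactly.
tree_like = [[0, 1, 4, 4], [1, 0, 4, 4], [4, 4, 0, 2], [4, 4, 2, 0]]
# Four points on a line at 0, 1, 3, 7: a tree distorts these distances.
on_a_line = [[0, 1, 3, 7], [1, 0, 2, 6], [3, 2, 0, 4], [7, 6, 4, 0]]
c_tree = cophenetic_correlation(tree_like)
c_line = cophenetic_correlation(on_a_line)
```

A value near 1 indicates the dendrogram preserves the resemblance structure well; lower values flag the distortion the abstract warns about.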
Abstract:
Clustering algorithms, pattern mining techniques and associated quality metrics have emerged as reliable methods for modeling learners’ performance, comprehension and interaction in given educational scenarios. Specific properties of the available data, such as missing values, extreme values or outliers, make it challenging to extract user models that are significant from an educational perspective. In this paper we introduce a pattern detection mechanism within our data analytics tool based on k-means clustering and on the SSE, silhouette, Dunn index and Xie-Beni index quality metrics. Experiments performed on a dataset obtained from our online e-learning platform show that the extracted interaction patterns were representative in classifying learners. Furthermore, the monitoring activities performed created a strong basis for generating automatic feedback to learners in terms of their course participation, while relying on their previous performance. In addition, our analysis introduces automatic triggers that highlight learners who will potentially fail the course, enabling tutors to take timely action.
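A minimal sketch of k-means together with two of the quality metrics named above, SSE and the mean silhouette, is given below using only the standard library. The data points, the choice of initial centroids, and the function names are assumptions for illustration; the Dunn and Xie-Beni indices are omitted, and this is not the paper's tool.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def mean_silhouette(points, labels):
    """Mean silhouette width; assumes at least two non-empty clusters."""
    cluster_ids = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        def mean_dist(k):
            d = [math.dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                 if lab == k and j != i]
            return sum(d) / len(d) if d else None

        a = mean_dist(labels[i])
        if a is None:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        b = min(mean_dist(k) for k in cluster_ids if k != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D interaction features for six learners (illustrative only).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(data, [data[0], data[3]])
total_sse = sse(data, labels, centroids)
silhouette = mean_silhouette(data, labels)
```

In practice one would run this for several values of k and compare the metrics: lower SSE and higher silhouette suggest tighter, better separated clusters.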