931 results for k-Means algorithm
Abstract:
This dissertation introduces a new approach for assessing the effects of pediatric epilepsy on the language connectome. Two novel data-driven network construction approaches are presented. These methods rely on connecting different brain regions using either the extent or the intensity of language-related activations as identified by independent component analysis of fMRI data. An auditory description decision task (ADDT) paradigm was used to activate the language network for 29 patients and 30 controls recruited from three major pediatric hospitals. Empirical evaluations illustrated that pediatric epilepsy can cause, or is associated with, a network efficiency reduction. Patients showed a propensity to inefficiently employ the whole brain network to perform the ADDT language task; in contrast, controls seemed to efficiently use smaller segregated network components to achieve the same task. To explain the causes of the decreased efficiency, graph theoretical analysis was carried out. The analysis revealed no substantial global network feature differences between the patient and control groups. It also showed that for both subject groups the language network exhibited small-world characteristics; however, the patients' extent-of-activation network showed a tendency towards more random networks. It was also shown that the intensity-of-activation network displayed ipsilateral hub reorganization on the local level. The left hemispheric hubs displayed greater centrality values for patients, whereas the right hemispheric hubs displayed greater centrality values for controls. This hub hemispheric disparity was not correlated with the right atypical language laterality found in six patients. Finally, it was shown that a multi-level unsupervised clustering scheme based on self-organizing maps, a type of artificial neural network, and k-means was able to fairly and blindly separate the subjects into their respective patient or control groups. 
The clustering was initiated using the local nodal centrality measurements only. Compared to the extent of activation network, the intensity of activation network clustering demonstrated better precision. This outcome supports the assertion that the local centrality differences presented by the intensity of activation network can be associated with focal epilepsy.
Abstract:
The aim of the present study was to trace the mortality profile of the elderly in Brazil using two age groups: 60 to 69 years (young-old) and 80 years or more (oldest-old). To do this, we sought to characterize the trend and distinctions of different mortality profiles, as well as the quality of the data and associations with socioeconomic and sanitary conditions in the micro-regions of Brazil. Data were collected from the Mortality Information System (SIM) and the Brazilian Institute of Geography and Statistics (IBGE). Based on these data, the coefficients of mortality were calculated for the chapters of the International Classification of Diseases (ICD-10). A polynomial regression model was used to ascertain the trend of the main chapters. Non-hierarchical cluster analysis (K-Means) was used to obtain the profiles for different Brazilian micro-regions. Factor analysis of the contextual variables was used to obtain the socio-economic and sanitary deprivation indices (IPSS). The trend of the CMId and of the ratio of its values in the two age groups confirmed a decrease in most of the indicators, particularly for ill-defined causes among the oldest-old. Among the young-old, the following profiles emerged: the Development Profile; the Modernity Profile; the Epidemiological Paradox Profile and the Ignorance Profile. Among the oldest-old, the latter three profiles were confirmed, in addition to the Low Mortality Rates Profile. When comparing the mean IPSS values in global terms, all of the groups were different in both of the age groups. The Ignorance Profile was compared with the other profiles using orthogonal contrasts. This profile differed from all of the others in isolation and in clusters. However, the mean IPSS was similar for the Low Mortality Rates Profile among the oldest-old. 
Furthermore, associations were found between the data quality indicators, the CMId for ill-defined causes, the general coefficient of mortality for each age group (CGMId) and the IPSS of the micro-regions. The worst rates were recorded in areas with the greatest socioeconomic and sanitary deprivation. The findings of the present study show that, despite the decrease in the mortality coefficients, there are notable differences in the profiles related to contextual conditions, including regional differences in data quality. These differences increase the vulnerability of the age groups studied and the health inequities that are already present.
Abstract:
This study subdivides the Weddell Sea, Antarctica, into seafloor regions using multivariate statistical methods. These regions are categories used for comparing, contrasting and quantifying biogeochemical processes and biodiversity between ocean regions geographically but also regions under development within the scope of global change. The division obtained is characterized by the dominating components and interpreted in terms of ruling environmental conditions. The analysis uses 28 environmental variables for the sea surface, 25 variables for the seabed and 9 variables for the analysis between surface and bottom variables. The data were taken during the years 1983-2013. Some data were interpolated. The statistical errors of several interpolation methods (e.g. IDW, Indicator, Ordinary and Co-Kriging) with changing settings have been compared to identify the most reasonable method. The multivariate mathematical procedures used are regionalized classification via k-means cluster analysis, canonical-correlation analysis and multidimensional scaling. Canonical-correlation analysis identifies the influencing factors in the different parts of the cove. Several methods for identifying the optimum number of clusters have been tested. For the seabed, 8 and 12 clusters were identified as reasonable numbers for clustering the Weddell Sea. For the sea surface the numbers 8 and 13 and for the top/bottom analysis 8 and 3 were identified, respectively. Additionally, the results of 20 clusters are presented for the three alternatives, offering the first small-scale environmental regionalization of the Weddell Sea. In particular, the 12-cluster results identify marine-influenced regions which can be clearly separated from those determined by the geological catchment area and the ones dominated by river discharge.
Abstract:
Modelling industrial systems gives organizations a strategic advantage in studying their production processes. Modelling makes it possible to increase knowledge about these systems and may allow, where possible, improvements in production management and planning. This knowledge may also allow production processes to become more efficient, through the mitigation or elimination of the main losses detected in the process. The main objective of this work is the development and validation of a modelling, forecasting and analysis tool for industrial production systems, with a view to increasing knowledge about them. To carry out and develop this work, several tools, concepts, methodologies and theoretical foundations known from the literature were used and developed, such as OEE (Overall Equipment Effectiveness), Petri nets (RdP, Redes de Petri), time series, k-means, and SPC (Statistical Process Control). The modelling, forecasting and analysis tool developed and described in this work proved capable of assisting in the detection and interpretation of the causes that influence the results of the production system and give rise to losses, demonstrating the expected advantages. These results were based on real data from a production system.
Abstract:
Understanding the impact of atmospheric black carbon (BC) containing particles on human health and radiative forcing requires knowledge of the mixing state of BC, including the characteristics of the materials with which it is internally mixed. In this study, we demonstrate for the first time the capabilities of the Aerodyne Soot-Particle Aerosol Mass Spectrometer equipped with a light scattering module (LS-SP-AMS) to examine the mixing state of refractory BC (rBC) and other aerosol components in an urban environment (downtown Toronto). K-means clustering analysis was used to classify single particle mass spectra into chemically distinct groups. One resultant cluster is dominated by rBC mass spectral signals (C1+ to C5+) while the organic signals fall into a few major clusters, identified as hydrocarbon-like organic aerosol (HOA), oxygenated organic aerosol (OOA), and cooking emission organic aerosol (COA). A nearly external mixing is observed with small BC particles only thinly coated by HOA ( 28% by mass on average), while over 90% of the HOA-rich particles did not contain detectable amounts of rBC. Most of the particles classified into other inorganic and organic clusters were not significantly associated with BC. The single particle results also suggest that HOA and COA emitted from anthropogenic sources were likely major contributors to organic-rich particles with low to mid-range aerodynamic diameter (dva). The similar temporal profiles and mass spectral features of the organic clusters and the factors from a positive matrix factorization (PMF) analysis of the ensemble aerosol dataset validate the conventional interpretation of the PMF results.
Abstract:
Clustering algorithms, pattern mining techniques and associated quality metrics have emerged as reliable methods for modeling learners’ performance, comprehension and interaction in given educational scenarios. The specificity of available data, such as missing values, extreme values or outliers, makes it challenging to extract user models that are significant from an educational perspective. In this paper we introduce a pattern detection mechanism within our data analytics tool based on k-means clustering and on the SSE, silhouette, Dunn index and Xie-Beni index quality metrics. Experiments performed on a dataset obtained from our online e-learning platform show that the extracted interaction patterns were representative in classifying learners. Furthermore, the performed monitoring activities created a strong basis for generating automatic feedback to learners in terms of their course participation, while relying on their previous performance. In addition, our analysis introduces automatic triggers that highlight learners who will potentially fail the course, enabling tutors to take timely actions.
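As a rough illustration of how k-means can be paired with two of the quality metrics named above (SSE and the silhouette coefficient), the following is a minimal standard-library sketch. It is not the authors' tool: the function names, the explicit `init_centroids` argument (used here so the run is deterministic) and the toy data are assumptions made for this example.

```python
import math


def kmeans(points, init_centroids, max_iters=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(  # mean distance to the closest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated groups of learners' feature vectors.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = kmeans(toy_points, [toy_points[0], toy_points[3]])
```

A lower SSE indicates tighter clusters, while a silhouette value close to 1 indicates clusters that are both compact and well separated; comparing both across several values of k is one common way to choose the number of clusters.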
Abstract:
Background and aims: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
Materials and methods: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: ‘semi-structured’ and ‘unstructured’. The second technique was developed using the open source language engineering framework GATE and aimed at the prediction of chunks of the report text containing information pertaining to the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested respectively on sets of 635 and 163 manually classified or annotated reports, from the Northern Ireland Cancer Registry.
Results: The best result of 99.4% accuracy, which included only one semi-structured report predicted as unstructured, was produced by the layout classifier with the k-nearest neighbour algorithm, using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports the performance ranged from 0.97 to 0.94 and from 0.92 to 0.83 in precision and recall, while for unstructured reports performance ranged from 0.91 to 0.64 and from 0.68 to 0.41 in precision and recall. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured reports.
Conclusions: These results show that it is possible and beneficial to predict the layout of reports and that the accuracy of prediction of which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.
Abstract:
Data mining can be defined as the extraction of implicit, previously unknown, and potentially useful information from data. Numerous researchers have been developing security technology and exploring new methods to detect cyber-attacks with the DARPA 1998 dataset for Intrusion Detection and the modified versions of this dataset, KDDCup99 and NSL-KDD, but until now no one has examined the performance of the Top 10 data mining algorithms selected by experts in data mining. The classification learning algorithms compared in this thesis are C4.5, CART, k-NN and Naïve Bayes. The performance of these algorithms is compared in terms of accuracy, error rate and average cost on modified versions of the NSL-KDD train and test datasets, where the instances are classified into normal and four cyber-attack categories: DoS, Probing, R2L and U2R. Additionally, the most important features for detecting cyber-attacks in all categories and in each category are evaluated with Weka’s Attribute Evaluator and ranked according to Information Gain. The results show that the classification algorithm with the best performance on the dataset is the k-NN algorithm. The most important features for detecting cyber-attacks are basic features such as the number of seconds of a network connection, the protocol used for the connection, the network service used, the normal or error status of the connection and the number of data bytes sent. The most important features for detecting DoS, Probing and R2L attacks are basic features, and the least important features are content features. For U2R attacks, by contrast, the content features are the most important for detecting attacks.
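To make the idea of ranking features by Information Gain concrete, here is a minimal sketch for categorical features. It is not Weka's implementation and does not reproduce the thesis's results; the feature names and the four toy records are hypothetical values chosen only to show the computation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(features, labels):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(values, labels)
             for name, values in features.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records: one feature separates the classes, the other does not.
toy_labels = ["normal", "normal", "dos", "dos"]
toy_features = {
    "protocol": ["tcp", "tcp", "icmp", "icmp"],
    "flag": ["SF", "REJ", "SF", "REJ"],
}
toy_ranking = rank_features(toy_features, toy_labels)
```

In this toy case the first feature removes all uncertainty about the class (gain of 1 bit) while the second removes none, so the first is ranked higher. Numeric attributes would need to be discretized before a computation like this applies.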
Abstract:
In this paper an architecture for an estimator of short-term wind farm power is proposed. The estimator is made up of a Linear Machine classifier and a set of k Multilayer Perceptrons, each trained for a specific subspace of the input space. The input dataset is split into the k clusters using a k-means technique, and the equivalent Linear Machine classifier is obtained from the cluster centroids...
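One standard way to see why a linear machine can be derived from cluster centroids: assigning x to the nearest centroid c_j minimizes ||x - c_j||^2, which is the same as maximizing the linear score c_j·x - ||c_j||^2 / 2. The sketch below illustrates that equivalence only; whether the paper builds its classifier exactly this way is an assumption, and the function names and example centroids are hypothetical.

```python
import math


def linear_machine_from_centroids(centroids):
    """One (weights, bias) pair per centroid: w_j = c_j, b_j = -||c_j||^2 / 2."""
    return [(tuple(c), -0.5 * sum(v * v for v in c)) for c in centroids]


def classify(x, machine):
    """Linear machine decision: index of the largest score w_j . x + b_j."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b for w, b in machine]
    return max(range(len(scores)), key=scores.__getitem__)


def nearest_centroid(x, centroids):
    """Reference rule: index of the centroid closest to x (Euclidean)."""
    return min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))


# Hypothetical centroids, e.g. as returned by a k-means run with k = 3.
example_centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 5.0)]
example_machine = linear_machine_from_centroids(example_centroids)
```

Under this construction, the cluster index returned by `classify` could be used to route an input to the perceptron trained on that cluster.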
Abstract:
The aim of environmental impact assessment (EIA) is to enable an integrated analysis of possible direct or indirect impacts on the environment arising from the construction and operation of projects, so as to propose measures or programmes that seek to avoid, mitigate or compensate for such impacts. To this end, it is necessary to know the various characteristics of the areas directly and indirectly affected by the installation of a project, such as the meteorological and climatological conditions. These are also relevant to the study of emissions under regular or accidental operating scenarios, given their influence on the conditions for transport and dispersion of pollutants in the atmosphere. This work presents a study of the atmospheric pollutant dispersion conditions for the region of the Central Nuclear Almirante Álvaro Alberto (CNAAA) in Angra dos Reis, in the State of Rio de Janeiro, using the WRF model and considering an accident scenario with releases lasting 48 hours. The two simulated episodes represent the predominant weather regimes in the region, obtained by applying the k-means method to the EOFs of the mean sea level pressure field for the years 1985 to 2014. Applying the weather-regime methodology makes it possible to observe persistent and recurrent large-scale meteorological phenomena over a given region, serving as a tool for preparing technical studies and documents that support the decisions of regulatory bodies.
Abstract:
The recent advent of new technologies has led to huge amounts of genomic data. With these data come new opportunities to understand biological cellular processes underlying hidden regulation mechanisms and to identify disease-related biomarkers for informative diagnostics. However, extracting biological insights from the immense amounts of genomic data is a challenging task. Therefore, effective and efficient computational techniques are needed to analyze and interpret genomic data. In this thesis, novel computational methods are proposed to address such challenges: a Bayesian mixture model, an extended Bayesian mixture model, and an Eigen-brain approach. The Bayesian mixture framework involves integration of the Bayesian network and the Gaussian mixture model. Based on the proposed framework and its conjunction with K-means clustering and principal component analysis (PCA), biological insights are derived, such as context specific/dependent relationships and nested structures within microarray data where biological replicates are encapsulated. The Bayesian mixture framework is then extended to explore posterior distributions of network space by incorporating a Markov chain Monte Carlo (MCMC) model. The extended Bayesian mixture model summarizes the sampled network structures by extracting biologically meaningful features. Finally, an Eigen-brain approach is proposed to analyze in situ hybridization data for the identification of cell-type specific genes, which can be useful for informative blood diagnostics. Computational results with region-based clustering reveal critical evidence of consistency with brain anatomical structure.
Abstract:
A flexible and multipurpose bio-inspired hierarchical model for analyzing musical timbre is presented in this paper. Inspired by findings in the fields of neuroscience, computational neuroscience, and psychoacoustics, not only does the model extract spectral and temporal characteristics of a signal, but it also analyzes amplitude modulations on different timescales. It uses a cochlear filter bank to resolve the spectral components of a sound, lateral inhibition to enhance spectral resolution, and a modulation filter bank to extract the global temporal envelope and roughness of the sound from amplitude modulations. The model was evaluated in three applications. First, it was used to simulate subjective data from two roughness experiments. Second, it was used for musical instrument classification using the k-NN algorithm and a Bayesian network. Third, it was applied to find the features that characterize sounds whose timbres were labeled in an audiovisual experiment. The successful application of the proposed model in these diverse tasks revealed its potential in capturing timbral information.
Abstract:
Background: The capacity of European pear fruit (Pyrus communis L.) to ripen after harvest develops during the final stages of growth on the tree. The objective of this study was to characterize changes in 'Bartlett' pear fruit physico-chemical properties and transcription profiles during fruit maturation leading to attainment of ripening capacity. Results: The softening response of pear fruit held for 14 days at 20°C after harvest depended on their maturity. We identified four maturity stages: S1, failed to soften; S2, displayed partial softening (with or without ethylene treatment, ET); S3, able to soften following ET; and S4, able to soften without ET. Illumina sequencing and Trinity assembly generated 68,010 unigenes (mean length of 911 bp), of which 32.8% were annotated to the RefSeq plant database. Higher numbers of differentially expressed transcripts were recorded in the S3-S4 and S1-S2 transitions (2805 and 2505 unigenes, respectively) than in the S2-S3 transition (2037 unigenes). High expression of genes putatively encoding pectin degradation enzymes in the S1-S2 transition suggests pectic oligomers may be involved as early signals triggering the transition to responsiveness to ethylene in pear fruit. Moreover, the co-expression of these genes with Exps (Expansins) suggests their collaboration in modifying cell wall polysaccharide networks that are required for fruit growth. K-means cluster analysis revealed that auxin signaling associated transcripts were enriched in cluster K6, which showed the highest gene expression at S3. AP2/EREBP (APETALA 2/ethylene response element binding protein) and bHLH (basic helix-loop-helix) transcripts were enriched in all three transitions: S1-S2, S2-S3, and S3-S4. Several members of Aux/IAA (Auxin/indole-3-acetic acid), ARF (Auxin response factors), and WRKY appeared to play an important role in orchestrating the S2-S3 transition. 
Conclusions: We identified maturity stages associated with the development of ripening capacity in 'Bartlett' pear, and described the transcription profile of fruit at these stages. Our findings suggest that auxin is essential in regulating the transition of pear fruit from being ethylene-unresponsive (S2) to ethylene-responsive (S3), resulting in fruit softening. The transcriptome will be helpful for future studies about specific developmental pathways regulating the transition to ripening. © 2015 Nham et al.
Abstract:
The Internet has grown in size at rapid rates since BGP records began, and continues to do so. This has raised concerns about the scalability of the current BGP routing system, as the routing state at each router in a shortest-path routing protocol will grow at a supra-linear rate as the network grows. The concerns are that the memory capacity of routers will not be able to keep up with demands, and that the growth of the Internet will become ever more cramped as more and more of the world seeks the benefits of being connected. Compact routing schemes, where the routing state grows only sub-linearly relative to the growth of the network, could solve this problem and ensure that router memory would not be a bottleneck to Internet growth. These schemes trade away shortest-path routing for scalable memory state, by allowing some paths to have a certain amount of bounded “stretch”. The most promising such scheme is Cowen Routing, which can provide scalable, compact routing state for Internet routing, while still providing shortest-path routing to nearly all other nodes, with only slightly stretched paths to a very small subset of the network. Currently, there is no fully distributed form of Cowen Routing that would be practical for the Internet. This dissertation describes a fully distributed and compact protocol for Cowen Routing, using the k-core graph decomposition. Previous compact routing work showed the k-core graph decomposition is useful for Cowen Routing on the Internet, but no distributed form existed. This dissertation gives a distributed k-core algorithm optimised to be efficient on dynamic graphs, along with proofs of its correctness. The performance and efficiency of this distributed k-core algorithm are evaluated on large Internet AS graphs, with excellent results. This dissertation then goes on to describe a fully distributed and compact Cowen Routing protocol. 
This protocol comprises a landmark selection process for Cowen Routing using the k-core algorithm, with mechanisms to ensure compact state at all times, including at bootstrap; a local cluster routing process, with mechanisms for policy application and control of cluster sizes, ensuring again that state can remain compact at all times; and a landmark routing process with a prioritisation mechanism for announcements that ensures compact state at all times.
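For readers unfamiliar with the k-core decomposition mentioned above, the following is a minimal centralized sketch that computes each node's core number by repeatedly peeling off a minimum-degree node. It is explicitly not the distributed, dynamic-graph algorithm that the dissertation contributes; the function name, the adjacency-dictionary input format and the toy graph are assumptions for illustration.

```python
def core_numbers(adjacency):
    """Core number of every node in an undirected graph.

    `adjacency` maps each node to the set of its neighbours. A node has core
    number k if it belongs to the k-core (the maximal subgraph in which every
    node has degree at least k) but not to the (k+1)-core.
    """
    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel a node of minimum remaining degree (simple O(n^2) selection).
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


# Toy graph (hypothetical): a triangle a-b-c with a pendant node d attached to a.
toy_graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
toy_cores = core_numbers(toy_graph)
```

In the toy graph the triangle nodes form the 2-core, while the pendant node only reaches the 1-core. Nodes in the innermost cores tend to be the densely connected ones, which is the intuition behind using the decomposition when choosing landmarks.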
Abstract:
Forensic speaker comparison exams have complex characteristics, demanding a long time for manual analysis. A method for automatic recognition of vowels, providing feature extraction for acoustic analysis, is proposed, aiming to serve as a support tool in these exams. The proposal is based on formant measurements by LPC (Linear Predictive Coding), applied selectively according to fundamental frequency detection, zero crossing rate, bandwidth and continuity, with the clustering being done by the k-means method. Experiments using samples from three different databases have shown promising results, in which the regions corresponding to five of the Brazilian Portuguese vowels were successfully located, providing visualization of a speaker’s vocal tract behavior, as well as the detection of segments corresponding to target vowels.
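Of the selection criteria listed above, the zero crossing rate is the simplest to show in code. The sketch below computes it for a single frame of samples; the function name, the sign convention for zero-valued samples and any threshold used to separate voiced from unvoiced frames are assumptions, not details taken from the paper.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    Samples equal to zero are treated as non-negative (an assumed convention).
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)
```

Voiced sounds such as vowels are dominated by low-frequency energy and usually give a low rate, whereas noisy, fricative-like segments give a high one, which is why the measure can help discard frames unlikely to contain a vowel.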