928 results for K-Means Cluster
Abstract:
This paper examines the use of trajectory distance measures and clustering techniques to define normal
and abnormal trajectories in the context of pedestrian tracking in public spaces. In order to detect abnormal
trajectories, what is meant by a normal trajectory in a given scene is first defined. Then every trajectory
that deviates from this normality is classified as abnormal. By combining Dynamic Time Warping and a
modified K-Means algorithm for arbitrary-length data series, we have developed an algorithm for trajectory
clustering and abnormality detection. The final system achieves overall accuracies of 83% and 75%
when tested on two different standard datasets.
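As an illustration of the distance measure named in this abstract, the following is a minimal Python sketch of Dynamic Time Warping between two trajectories of different lengths. It is not the authors' implementation: the function name `dtw_distance`, the point-wise Euclidean cost and the toy trajectories are assumptions made for this example, and the modified K-Means step is not shown.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences of (x, y) points.

    Hypothetical helper: the local cost is the Euclidean distance between
    points, and no warping-window constraint is applied.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a point
                cost[i][j - 1],      # b advances, a repeats a point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy trajectories (made up): the same path sampled with 3 and 4 points.
short = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
longer = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(dtw_distance(short, longer))
```

Because DTW allows one point to align with several, the two toy trajectories, which trace the same path at different sampling rates, are at distance zero. This is what makes the measure usable on series of arbitrary length.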
Abstract:
The aim of environmental impact assessment (EIA) is to enable an integrated analysis of possible direct or indirect impacts on the environment arising from the installation and operation of projects, so as to propose measures or programs intended to avoid, mitigate or compensate for those impacts. To that end, it is necessary to know the various characteristics of the areas directly and indirectly affected by the installation of a project, such as meteorological and climatological conditions. These are also relevant to the study of emissions under regular or accidental operating scenarios, given their influence on the transport and dispersion of pollutants in the atmosphere. This work studies the atmospheric pollutant dispersion conditions for the region of the Central Nuclear Almirante Álvaro Alberto (CNAAA) in Angra dos Reis, in the State of Rio de Janeiro, using the WRF model and considering an accidental scenario with releases lasting 48 hours. The two simulated episodes represent the predominant weather regimes in the region, obtained by applying the k-means method to the EOFs of the mean sea level pressure field for the years 1985 to 2014. Applying the weather-regime methodology makes it possible to observe persistent and recurrent large-scale meteorological phenomena over a given region, serving as a tool for preparing studies and technical documents that support the decisions of regulatory bodies.
Abstract:
The recent advent of new technologies has led to huge amounts of genomic data. With these data come new opportunities to understand the biological cellular processes underlying hidden regulation mechanisms and to identify disease-related biomarkers for informative diagnostics. However, extracting biological insights from the immense amounts of genomic data is a challenging task. Therefore, effective and efficient computational techniques are needed to analyze and interpret genomic data. In this thesis, novel computational methods are proposed to address such challenges: a Bayesian mixture model, an extended Bayesian mixture model, and an Eigen-brain approach. The Bayesian mixture framework integrates the Bayesian network and the Gaussian mixture model. Based on the proposed framework, in conjunction with K-means clustering and principal component analysis (PCA), biological insights are derived, such as context-specific/dependent relationships and nested structures within microarray data where biological replicates are encapsulated. The Bayesian mixture framework is then extended to explore posterior distributions of network space by incorporating a Markov chain Monte Carlo (MCMC) model. The extended Bayesian mixture model summarizes the sampled network structures by extracting biologically meaningful features. Finally, an Eigen-brain approach is proposed to analyze in situ hybridization data for the identification of cell-type-specific genes, which can be useful for informative blood diagnostics. Computational results with region-based clustering reveal critical evidence of consistency with brain anatomical structure.
Abstract:
Forensic speaker comparison exams have complex characteristics and demand a long time for manual analysis. A method for the automatic recognition of vowels, providing feature extraction for acoustic analysis, is proposed as a support tool for these exams. The proposal is based on formant measurements by LPC (Linear Predictive Coding), applied selectively according to fundamental frequency detection, zero-crossing rate, bandwidth and continuity, with clustering performed by the k-means method. Experiments using samples from three different databases have shown promising results: the regions corresponding to five of the Brazilian Portuguese vowels were successfully located, providing visualization of a speaker’s vocal tract behavior as well as the detection of segments corresponding to target vowels.
Abstract:
This work presents an analysis for constructing a three-dimensional model of a solid part by integrating the two-dimensional profiles provided by the interface of a laser scanner mounted on a robotic arm, using quaternions for the spatial description of the assembly. This scanner-robot assembly is designed to assist in the inspection processes of manufacturing industries. An analysis based on weighted principal component analysis (WPCA) combined with the k-means algorithm is also presented to discriminate the outliers that inherently appear in the profiles provided by the laser scanner interface. This makes it possible to reduce the computational load of processing by reducing the point cloud following the linear trend of certain blocks of points.
Abstract:
Flour is a cassava derivative of great dietary importance, but with little standardization, owing to the artisanal manufacturing process. The aim of this study was to analyze the variability of artisanal cassava flour produced in the Território da Cidadania do Vale do Juruá, Acre, and to group the producing municipalities according to their physicochemical characteristics by means of multivariate analyses, determining their influence on the quality of cassava flour. A total of 138 flour samples were analyzed, collected in the municipalities of Cruzeiro do Sul, Mâncio Lima, Rodrigues Alves, Porto Walter and Marechal Thaumaturgo, with determination of moisture, ash, total protein, ether extract, total fiber, total carbohydrates, energy value, titratable acidity, pH and water activity. The data were analyzed by descriptive statistics, with comparison of means by Tukey's test, and by multivariate statistics, used in a complementary manner: hierarchical cluster analysis with Euclidean distance and Ward's method; non-hierarchical cluster analysis by k-means; principal component analysis using the correlation matrix; and discriminant analysis by the stepwise elimination method. The results showed that the flours are within the quality standards required by legislation. The different multivariate analyses were consistent, indicating that there is a distribution pattern in the physicochemical characteristics of the flours, which suggests patterns in the manufacturing process distributed according to the location of the municipalities analyzed. The characteristics with the greatest influence on the discrimination of the flours are acidity, pH, water activity and moisture, indicating that the manufacturing method has a great influence on the quality of the flour produced.
Abstract:
The main objective of this study is to apply recently developed statistical physics methods to time series analysis, particularly to electrical induction profiles from oil wells, in order to study the petrophysical similarity of those wells in a spatial distribution. For this, we used the DFA method to determine whether this technique can be used to characterize the fields spatially. After obtaining the DFA values for all wells, we applied cluster analysis using the non-hierarchical method called K-means. Usually based on the Euclidean distance, K-means consists of dividing the elements of a data matrix N into k groups, so that the similarities among elements belonging to different groups are the smallest possible. To test whether a dataset generated by the K-means method, or randomly generated datasets, form spatial patterns, we created the parameter Ω (index of neighborhood). High values of Ω reveal more aggregated data, and low values of Ω indicate scattered data or data without spatial correlation. We thus concluded that the DFA data from the 54 wells are grouped and can be used to characterize spatial fields. Applying the contour level technique, we confirmed the results obtained by K-means, confirming that DFA is effective for spatial analysis.
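The partitioning step described in this abstract can be sketched in a few lines of Python. This is a generic illustration of standard K-means (Lloyd's iteration) with Euclidean distance, not the authors' code: the function name `kmeans`, the random initialization from the data, and the six sample values standing in for per-well DFA values are all assumptions made for the example. The Ω neighborhood index is not reproduced here, since the abstract does not define it.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Partition points (tuples of floats) into k groups by Lloyd's iteration."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Made-up one-dimensional values standing in for per-well DFA values.
dfa_values = [(0.0,), (0.1,), (0.2,), (10.0,), (10.1,), (10.2,)]
labels, centroids = kmeans(dfa_values, k=2)
print(labels, centroids)
```

On well-separated data such as this toy set, the iteration settles on the two obvious groups whichever points are drawn as initial centroids. On real data the result can depend on initialization, so running several seeds and keeping the best partition is a common precaution.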
Abstract:
The extent of the Brazilian Atlantic rainforest, a global biodiversity hotspot, has been reduced to less than 7% of its original range. Yet it contains one of the richest butterfly faunas in the world. Butterflies are commonly used as environmental indicators, mostly because of their strict association with host plants, microclimate and resource availability. This research describes the diversity, composition and species richness of frugivorous butterflies in a forest fragment in the Brazilian Northeast. It compares communities in different physiognomies and seasons. The climate in the study area is classified as tropical rainy, with two well-defined seasons. Butterflies were captured with 60 Van Someren-Rydon traps, randomly located within six different habitat units (10 traps per unit) that varied from very open (e.g. coconut plantation) to forest interior. Sampling took place between January and December 2008, for five days each month. I captured 12090 individuals from 32 species. The most abundant species were Taygetis laches, Opsiphanes invirae and Hamadryas februa, which accounted for 70% of all captures. Similarity analysis identified two main groups, one of species associated with open or disturbed areas and a second of species associated with shaded areas. There was a strong seasonal component in species composition, with fewer species and lower abundance in the dry season and more species and higher abundance in the rainy season. K-means analysis indicates that choice of habitat units overestimated faunal perceptions, suggesting less distinct units. The species Taygetis virgilia, Hamadryas chloe, Callicore pygas and Morpho achilles were associated with less disturbed habitats, while Yphthimoides sp., Historis odius, H. acheronta, Hamadryas feronia and Siderone marthesia likely indicate open or disturbed habitats.
This research provides important information for the conservation of frugivorous butterflies and will serve as a baseline for future environmental monitoring projects.
Abstract:
This dissertation introduces a new approach for assessing the effects of pediatric epilepsy on the language connectome. Two novel data-driven network construction approaches are presented. These methods rely on connecting different brain regions using either extent or intensity of language-related activations as identified by independent component analysis of fMRI data. An auditory description decision task (ADDT) paradigm was used to activate the language network for 29 patients and 30 controls recruited from three major pediatric hospitals. Empirical evaluations illustrated that pediatric epilepsy can cause, or is associated with, a network efficiency reduction. Patients showed a propensity to inefficiently employ the whole brain network to perform the ADDT language task; on the contrary, controls seemed to efficiently use smaller segregated network components to achieve the same task. To explain the causes of the decreased efficiency, graph theoretical analysis was carried out. The analysis revealed no substantial global network feature differences between the patient and control groups. It also showed that for both subject groups the language network exhibited small-world characteristics; however, the patients’ extent-of-activation network showed a tendency towards more random networks. It was also shown that the intensity-of-activation network displayed ipsilateral hub reorganization on the local level. The left hemispheric hubs displayed greater centrality values for patients, whereas the right hemispheric hubs displayed greater centrality values for controls. This hub hemispheric disparity was not correlated with a right atypical language laterality found in six patients. Finally, it was shown that a multi-level unsupervised clustering scheme based on self-organizing maps, a type of artificial neural network, and k-means was able to fairly and blindly separate the subjects into their respective patient or control groups.
The clustering was initiated using the local nodal centrality measurements only. Compared to the extent of activation network, the intensity of activation network clustering demonstrated better precision. This outcome supports the assertion that the local centrality differences presented by the intensity of activation network can be associated with focal epilepsy.
Abstract:
The problems related to modelling water quality in reservoirs can be approached from different viewpoints. This work draws on problem-solving methodologies from the scientific area of Artificial Intelligence, as well as on tools used in the search for solutions, such as Decision Trees, Artificial Neural Networks and the Nearest-Neighbour Method. Currently, the methods for assessing water quality are very restrictive because they do not allow water quality to be assessed in real time. The development of forecasting models based on Knowledge Discovery in Databases techniques proved to be an alternative, enabling pro-active behaviour that may contribute decisively to diagnosing, preserving and rehabilitating reservoirs. In the course of the work, unsupervised learning was used to study the dynamics of the reservoirs, and two distinct behaviours, related to the time of year, are described.
Abstract:
The semiarid region of northeastern Brazil, the Caatinga, is extremely important due to its biodiversity and endemism. Measurements of plant physiology are crucial to the calibration of Dynamic Global Vegetation Models (DGVMs), which are currently used to simulate the responses of vegetation in the face of global change. During fieldwork carried out in an area of preserved Caatinga forest located in Petrolina, Pernambuco, measurements of carbon assimilation (in response to light and CO2) were performed on 11 individuals of Poincianella microphylla, a native species that is abundant in this region. These data were used to calibrate the maximum carboxylation velocity (Vcmax) used in the INLAND model. The calibration techniques used were Multiple Linear Regression (MLR) and data mining techniques such as the Classification And Regression Tree (CART) and K-MEANS. The results were compared to the UNCALIBRATED model. It was found that simulated Gross Primary Productivity (GPP) reached 72% of observed GPP when using the calibrated Vcmax values, whereas the UNCALIBRATED approach accounted for 42% of observed GPP. Thus, this work shows the benefits of calibrating DGVMs using field ecophysiological measurements, especially in areas where field data are scarce or non-existent, such as the Caatinga.
Abstract:
Although the debate of what data science is has a long history and has not reached a complete consensus yet, Data Science can be summarized as the process of learning from data. Guided by the above vision, this thesis presents two independent data science projects developed in the scope of multidisciplinary applied research. The first part analyzes fluorescence microscopy images typically produced in life science experiments, where the objective is to count how many marked neuronal cells are present in each image. Aiming to automate the task for supporting research in the area, we propose a neural network architecture tuned specifically for this use case, cell ResUnet (c-ResUnet), and discuss the impact of alternative training strategies in overcoming particular challenges of our data. The approach provides good results in terms of both detection and counting, showing performance comparable to the interpretation of human operators. As a meaningful addition, we release the pre-trained model and the Fluorescent Neuronal Cells dataset collecting pixel-level annotations of where neuronal cells are located. In this way, we hope to help future research in the area and foster innovative methodologies for tackling similar problems. The second part deals with the problem of distributed data management in the context of LHC experiments, with a focus on supporting ATLAS operations concerning data transfer failures. In particular, we analyze error messages produced by failed transfers and propose a Machine Learning pipeline that leverages the word2vec language model and K-means clustering. This provides groups of similar errors that are presented to human operators as suggestions of potential issues to investigate. The approach is demonstrated on one full day of data, showing promising ability in understanding the message content and providing meaningful groupings, in line with previously reported incidents by human operators.
Abstract:
Long-term monitoring of acoustical environments is gaining popularity thanks to the considerable scientific and engineering insight it provides. The increasing interest is due to the constant growth of storage capacity and of the computational power available to process large amounts of data. In this perspective, machine learning (ML) provides a broad family of data-driven statistical techniques to deal with large databases. Nowadays, the conventional practice of sound level meter measurements limits the global description of a sound scene to an energetic point of view: the equivalent continuous level Leq is the main metric used to define an acoustic environment. Finer analyses involve the use of statistical levels. However, acoustic percentiles are based on temporal assumptions, which are not always reliable. A statistical approach, based on the study of the occurrences of sound pressure levels, would bring a different perspective to the analysis of long-term monitoring. Depicting a sound scene through the most probable sound pressure level, rather than portions of energy, provides more specific information about the activity carried out during the measurements. The statistical mode of the occurrences can capture typical behaviors of specific kinds of sound sources. The present work proposes an ML-based method to identify, separate and measure coexisting sound sources in real-world scenarios. It is based on long-term monitoring and is addressed to acousticians focused on the analysis of environmental noise in manifold contexts. The presented method is based on clustering analysis. Two algorithms, Gaussian Mixture Model and K-means clustering, form the core of a process to investigate different active spaces monitored through sound level meters. The procedure has been applied in two different contexts: university lecture halls and offices.
The proposed method shows robust and reliable results in describing the acoustic scenario and it could represent an important analytical tool for acousticians.
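The contrast this abstract draws between an energetic descriptor and the most probable level can be made concrete with a small Python sketch. It is only an illustration, not the authors' method (which relies on Gaussian Mixture Model and K-means clustering): the function names, the 1 dB binning, the assumption of equal-duration samples and the sample levels are all hypothetical. The Leq formula used is the standard energy average, 10·log10 of the mean of 10^(L/10).

```python
import math
from collections import Counter


def equivalent_level(levels_db):
    """Energy-average level (Leq) of equal-duration short-term levels in dB."""
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)


def modal_level(levels_db, bin_width=1.0):
    """Most frequently occurring level after grouping into bins of bin_width dB."""
    bins = Counter(round(level / bin_width) for level in levels_db)
    most_common_bin, _count = bins.most_common(1)[0]
    return most_common_bin * bin_width


# Made-up record: nine quiet samples at 40 dB and one loud event at 70 dB.
levels = [40.0] * 9 + [70.0]
print(equivalent_level(levels))
print(modal_level(levels))
```

In this toy record the single loud event pulls the energy average to roughly 60 dB, while the modal level stays at 40 dB, the level that actually occurs most of the time. This is the sense in which the statistical mode can describe the typical activity in a space better than an energetic metric.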
Abstract:
The ATLAS experiment, like the other experiments operating at the Large Hadron Collider, produces petabytes of data every year, which must then be stored and processed. In addition, the experiments have set out to make these data accessible worldwide. In response to these needs, the Worldwide LHC Computing Grid (WLCG) was designed, combining the computing power and storage capacity of more than 170 sites spread across the world. At most WLCG sites, storage management technologies have been developed that also handle user requests and data transfers. These systems record their activity in log files, which are rich in information that helps operators identify a problem when the system malfunctions. In anticipation of a greater data flow in the coming years, work is under way to make these sites even more reliable, and one possible way to do so is to develop a system able to analyze log files autonomously and detect the anomalies that foreshadow a malfunction. To build such a system, the most suitable method for analyzing log files must first be identified. This thesis studies an approach to the problem that uses artificial intelligence to analyze log files; more specifically, it studies the approach based on the K-means clustering algorithm.
Abstract:
This thesis is the result of an internship carried out at Gruppo Montenegro S.r.l., whose objective was to develop an algorithm for palletization and for saturating the transport vehicle for the Food Division. Specifically, a heuristic algorithm written in the Python programming language is proposed. The Food Division comprises three categories: Cannamela, Cuore and Vitalia. These include very heterogeneous products. With the involvement of the company's Packaging and Quality functions, the constraints to be respected when palletizing the products were established. The proposed algorithm is described by dividing the process into three macro-steps. The first part addresses the 3D Bin Packing Problem, using and modifying a program already available in the literature to meet the needs of the Cannamela category. Unlike the other categories, Cannamela is prepared as pre-assembled groupage, since Cannamela orders may contain quantities that are not multiples of the quantities contained in the secondary packaging. The second part of the algorithm deals with creating pallets for the Cuore and Vitalia categories. Using the K-means clustering algorithm, families of product codes were created that allow pallets to be assembled from products considered similar. Consequently, the palletization algorithm for these two categories was developed from scratch, based on the percentage of the pallet occupied by the product. The last part of the algorithm studies the possibility of stacking the previously created pallets. Finally, an analysis of a strategic period is carried out, comparing the results of the Python algorithm with those of the algorithm in the company's management software. The results are then analyzed in relation to two impacts that are important for the company: economic and environmental.