938 resultados para k-means clustering
Resumo:
El trabajo realizado se divide en dos bloques bien diferenciados, ambos relacionados con el análisis de microarrays. El primer bloque consiste en agrupar las condiciones muestrales de todos los genes en grupos o clústers. Estas agrupaciones se obtienen al aplicar directamente sobre la microarray los siguientes algoritmos de agrupación: SOM,PAM,SOTA,HC y al aplicar sobre la microarray escalada con PC y MDS los siguientes algoritmos: SOM,PAM,SOTA,HC y K-MEANS. El segundo bloque consiste en realizar una búsqueda de genes basada en los intervalos de confianza de cada clúster de la agrupación activa. Las condiciones de búsqueda ajustadas por el usuario se validan para cada clúster respecto el valor basal 0 y respecto el resto de clústers, para estas validaciones se usan los intervalos de confianza. Estos dos bloques se integran en una aplicación web ya existente, el applet PCOPGene, alojada en el servidor: http://revolutionresearch.uab.es.
Mejora diagnóstica de hepatopatías de afectación difusa mediante técnicas de inteligencia artificial
Resumo:
The automatic diagnostic discrimination is an application of artificial intelligence techniques that can solve clinical cases based on imaging. Diffuse liver diseases are diseases of wide prominence in the population and insidious course, yet early in its progression. Early and effective diagnosis is necessary because many of these diseases progress to cirrhosis and liver cancer. The usual technique of choice for accurate diagnosis is liver biopsy, an invasive and not without incompatibilities one. It is proposed in this project an alternative non-invasive and free of contraindications method based on liver ultrasonography. The images are digitized and then analyzed using statistical techniques and analysis of texture. The results are validated from the pathology report. Finally, we apply artificial intelligence techniques as Fuzzy k-Means or Support Vector Machines and compare its significance to the analysis Statistics and the report of the clinician. The results show that this technique is significantly valid and a promising alternative as a noninvasive diagnostic chronic liver disease from diffuse involvement. Artificial Intelligence classifying techniques significantly improve the diagnosing discrimination compared to other statistics.
Resumo:
In an earlier investigation (Burger et al., 2000) five sediment cores near the RodriguesTriple Junction in the Indian Ocean were studied applying classical statistical methods(fuzzy c-means clustering, linear mixing model, principal component analysis) for theextraction of endmembers and evaluating the spatial and temporal variation ofgeochemical signals. Three main factors of sedimentation were expected by the marinegeologists: a volcano-genetic, a hydro-hydrothermal and an ultra-basic factor. Thedisplay of fuzzy membership values and/or factor scores versus depth providedconsistent results for two factors only; the ultra-basic component could not beidentified. The reason for this may be that only traditional statistical methods wereapplied, i.e. the untransformed components were used and the cosine-theta coefficient assimilarity measure.During the last decade considerable progress in compositional data analysis was madeand many case studies were published using new tools for exploratory analysis of thesedata. Therefore it makes sense to check if the application of suitable data transformations,reduction of the D-part simplex to two or three factors and visualinterpretation of the factor scores would lead to a revision of earlier results and toanswers to open questions . In this paper we follow the lines of a paper of R. Tolosana-Delgado et al. (2005) starting with a problem-oriented interpretation of the biplotscattergram, extracting compositional factors, ilr-transformation of the components andvisualization of the factor scores in a spatial context: The compositional factors will beplotted versus depth (time) of the core samples in order to facilitate the identification ofthe expected sources of the sedimentary process.Kew words: compositional data analysis, biplot, deep sea sediments
Resumo:
This paper presents a pattern recognition method focused on paintings images. The purpose is construct a system able to recognize authors or art styles based on common elements of his work (here called patterns). The method is based on comparing images that contain the same or similar patterns. It uses different computer vision techniques, like SIFT and SURF, to describe the patterns in descriptors, K-Means to classify and simplify these descriptors, and RANSAC to determine and detect good results. The method are good to find patterns of known images but not so good if they are not.
Resumo:
Os solos submetidos aos sistemas de produção sem preparo estão sujeitos à compactação, provocada pelo tráfego de máquinas, tornando necessário o acompanhamento das alterações do ambiente físico, que, quando desfavorável, restringe o crescimento radicular, podendo reduzir a produtividade das culturas. O objetivo do trabalho foi avaliar o efeito de diferentes intensidades de compactação na qualidade física de um Latossolo Vermelho textura média, localizado em Jaboticabal (SP), sob cultivo de milho, usando métodos de estatística multivariada. O delineamento experimental foi inteiramente casualizado, com seis intensidades de compactação e quatro repetições. Foram coletadas amostras indeformadas do solo nas camadas de 0,02-0,05, 0,08-0,11 e 0,15-0,18 m para determinação da densidade do solo (Ds), na camada de 0-0,20 m. As características da cultura avaliadas foram: densidade radicular, diâmetro radicular, matéria seca das raízes, altura das plantas, altura de inserção da primeira espiga, diâmetro do colmo e matéria seca das plantas. As análises de agrupamentos e componentes principais permitiram identificar três grupos de alta, média e baixa produtividade de plantas de milho, segundo variáveis do solo, do sistema radicular e da parte aérea das plantas. A classificação dos acessos em grupos foi feita por três métodos: método de agrupamentos hierárquico, método não-hierárquico k-means e análise de componentes principais. Os componentes principais evidenciaram que elevadas produtividades de milho estão correlacionadas com o bom crescimento da parte aérea das plantas, em condições de menor densidade do solo, proporcionando elevada produção de matéria seca das raízes, contudo, de pequeno diâmetro. A qualidade física do Latossolo Vermelho para o cultivo do milho foi assegurada até à densidade do solo de 1,38 Mg m-3.
Resumo:
A erodibilidade é um fator de extrema importância na caracterização da perda de solo, representando os processos que regulam a infiltração de água e sua resistência à desagregação e o transporte de partículas. Assim, por meio da análise de dependência espacial dos componentes principais da erodibilidade (fator K), objetivou-se estimar a erodibilidade do solo em uma área de nascentes da microbacia do Córrego do Tijuco, Monte Alto-SP, e analisar a variabilidade espacial das variáveis granulométricas do solo ao longo do relevo. A erodibilidade média da área foi considerada alta, e a análise de agrupamento k-means apontou para uma formação de cinco grupos: no primeiro, os altos teores de areia grossa (AG) e média (AM) condicionaram sua distribuição nas áreas planas; o segundo, caracterizado pelo alto teor de areia fina (AF), distribui-se nos declives mais convexos; o terceiro, com altos teores de silte e areia muito fina (AMF), concentrou-se nos maiores declives e concavidades; o quarto, com maior teor de argila, seguiu as zonas de escoamento de água; e o quinto, com alto teor de matéria orgânica (MO) e areia grossa (AG), distribui-se nas proximidades da zona urbana. A análise de componentes principais (ACP) mostrou quatro componentes com 87,4 % das informações, sendo o primeiro componente principal (CP1) discriminado pelo transporte seletivo de partículas principalmente em zonas pontuais de maior declividade e acúmulo de sedimentos; o segundo (CP2), discriminado pela baixa coesão entre as partículas, mostra acúmulo da areia fina nas áreas de menor cota em toda a área de concentração de água; o terceiro (CP3), discriminado pela maior agregação do solo, concentra-se principalmente nas bases de grandes declives; e o quarto (CP4), discriminado pela areia muito fina, distribui-se ao longo das declividades nas maiores altitudes. Os resultados sugerem o comportamento granulométrico do solo, que se mostra suscetível ao processo erosivo devido às condições texturais superficiais e à movimentação do relevo.
Resumo:
A compreensão do potencial agrícola de um solo muitas vezes é com base apenas na interpretação por meio de análises univariadas, podendo elevar a dimensão dos problemas no momento de escolher o manejo adequado ao solo. Dessa forma, a análise multivariada surge como alternativa, pois ela é um conjunto de procedimentos que visa agrupar e discriminar grupos de indivíduos. Também, serve como instrumento de seleção de variáveis, na medida em que aquelas com maior peso na construção dos primeiros componentes principais são as possíveis que melhor representam o conjunto de dados estudados. O objetivo deste trabalho foi identificar, por meio de análises multivariadas, os atributos do solo que melhor explicam a variabilidade espacial na produção da cultura do feijão. No ano agrícola de 2006/2007, no município de Selvíria, MS, foi analisada a produtividade do feijão em razão de alguns atributos físicos e químicos de um Latossolo Vermelho distroférrico, cultivado nas condições de elevado nível tecnológico de manejo pelo sistema de cultivo mínimo irrigado com pivô central. Foi demarcada a malha geoestatística para a coleta de dados do solo e da planta, com 117 pontos amostrais, numa área de 2.025 m2 e declive homogêneo de 0,055 m m-1. A classificação em grupos foi feita por três métodos: método de agrupamentos hierárquico, método não hierárquico k-means e análise de componentes principais. Essa última análise permitiu identificar três grupos que explicam 86,3 % da variabilidade total dos dados, os quais são constituídos pelos atributos físicos: densidade do solo, porosidade total, umidade gravimétrica e umidade volumétrica, que se evidenciaram com maior poder de explicação da variação da produtividade. Pode-se concluir que a análise multivariada aliada à geoestatística é importante ferramenta no auxílio ao manejo localizado.
Resumo:
Axée dans un premier temps sur le formalisme et les méthodes, cette thèse est construite sur trois concepts formalisés: une table de contingence, une matrice de dissimilarités euclidiennes et une matrice d'échange. À partir de ces derniers, plusieurs méthodes d'Analyse des données ou d'apprentissage automatique sont exprimées et développées: l'analyse factorielle des correspondances (AFC), vue comme un cas particulier du multidimensional scaling; la classification supervisée, ou non, combinée aux transformations de Schoenberg; et les indices d'autocorrélation et d'autocorrélation croisée, adaptés à des analyses multivariées et permettant de considérer diverses familles de voisinages. Ces méthodes débouchent dans un second temps sur une pratique de l'analyse exploratoire de différentes données textuelles et musicales. Pour les données textuelles, on s'intéresse à la classification automatique en types de discours de propositions énoncées, en se basant sur les catégories morphosyntaxiques (CMS) qu'elles contiennent. Bien que le lien statistique entre les CMS et les types de discours soit confirmé, les résultats de la classification obtenus avec la méthode K- means, combinée à une transformation de Schoenberg, ainsi qu'avec une variante floue de l'algorithme K-means, sont plus difficiles à interpréter. On traite aussi de la classification supervisée multi-étiquette en actes de dialogue de tours de parole, en se basant à nouveau sur les CMS qu'ils contiennent, mais aussi sur les lemmes et le sens des verbes. Les résultats obtenus par l'intermédiaire de l'analyse discriminante combinée à une transformation de Schoenberg sont prometteurs. Finalement, on examine l'autocorrélation textuelle, sous l'angle des similarités entre diverses positions d'un texte, pensé comme une séquence d'unités. En particulier, le phénomène d'alternance de la longueur des mots dans un texte est observé pour des voisinages d'empan variable. On étudie aussi les similarités en fonction de l'apparition, ou non, de certaines parties du discours, ainsi que les similarités sémantiques des diverses positions d'un texte. Concernant les données musicales, on propose une représentation d'une partition musicale sous forme d'une table de contingence. On commence par utiliser l'AFC et l'indice d'autocorrélation pour découvrir les structures existant dans chaque partition. Ensuite, on opère le même type d'approche sur les différentes voix d'une partition, grâce à l'analyse des correspondances multiples, dans une variante floue, et à l'indice d'autocorrélation croisée. Qu'il s'agisse de la partition complète ou des différentes voix qu'elle contient, des structures répétées sont effectivement détectées, à condition qu'elles ne soient pas transposées. Finalement, on propose de classer automatiquement vingt partitions de quatre compositeurs différents, chacune représentée par une table de contingence, par l'intermédiaire d'un indice mesurant la similarité de deux configurations. Les résultats ainsi obtenus permettent de regrouper avec succès la plupart des oeuvres selon leur compositeur.
Resumo:
It is estimated that around 230 people die each year due to radon (222Rn) exposure in Switzerland. 222Rn occurs mainly in closed environments like buildings and originates primarily from the subjacent ground. Therefore it depends strongly on geology and shows substantial regional variations. Correct identification of these regional variations would lead to substantial reduction of 222Rn exposure of the population based on appropriate construction of new and mitigation of already existing buildings. Prediction of indoor 222Rn concentrations (IRC) and identification of 222Rn prone areas is however difficult since IRC depend on a variety of different variables like building characteristics, meteorology, geology and anthropogenic factors. The present work aims at the development of predictive models and the understanding of IRC in Switzerland, taking into account a maximum of information in order to minimize the prediction uncertainty. The predictive maps will be used as a decision-support tool for 222Rn risk management. The construction of these models is based on different data-driven statistical methods, in combination with geographical information systems (GIS). In a first phase we performed univariate analysis of IRC for different variables, namely the detector type, building category, foundation, year of construction, the average outdoor temperature during measurement, altitude and lithology. All variables showed significant associations to IRC. Buildings constructed after 1900 showed significantly lower IRC compared to earlier constructions. We observed a further drop of IRC after 1970. In addition to that, we found an association of IRC with altitude. With regard to lithology, we observed the lowest IRC in sedimentary rocks (excluding carbonates) and sediments and the highest IRC in the Jura carbonates and igneous rock. The IRC data was systematically analyzed for potential bias due to spatially unbalanced sampling of measurements. In order to facilitate the modeling and the interpretation of the influence of geology on IRC, we developed an algorithm based on k-medoids clustering which permits to define coherent geological classes in terms of IRC. We performed a soil gas 222Rn concentration (SRC) measurement campaign in order to determine the predictive power of SRC with respect to IRC. We found that the use of SRC is limited for IRC prediction. The second part of the project was dedicated to predictive mapping of IRC using models which take into account the multidimensionality of the process of 222Rn entry into buildings. We used kernel regression and ensemble regression tree for this purpose. We could explain up to 33% of the variance of the log transformed IRC all over Switzerland. This is a good performance compared to former attempts of IRC modeling in Switzerland. As predictor variables we considered geographical coordinates, altitude, outdoor temperature, building type, foundation, year of construction and detector type. Ensemble regression trees like random forests allow to determine the role of each IRC predictor in a multidimensional setting. We found spatial information like geology, altitude and coordinates to have stronger influences on IRC than building related variables like foundation type, building type and year of construction. Based on kernel estimation we developed an approach to determine the local probability of IRC to exceed 300 Bq/m3. In addition to that we developed a confidence index in order to provide an estimate of uncertainty of the map. All methods allow an easy creation of tailor-made maps for different building characteristics. Our work is an essential step towards a 222Rn risk assessment which accounts at the same time for different architectural situations as well as geological and geographical conditions. For the communication of 222Rn hazard to the population we recommend to make use of the probability map based on kernel estimation. The communication of 222Rn hazard could for example be implemented via a web interface where the users specify the characteristics and coordinates of their home in order to obtain the probability to be above a given IRC with a corresponding index of confidence. Taking into account the health effects of 222Rn, our results have the potential to substantially improve the estimation of the effective dose from 222Rn delivered to the Swiss population.
Resumo:
The aim of this research was to investigate the effects of high pressure processing (HPP) on consumer acceptance for chilled ready meals manufactured using a low-value beef cut. Three hundred consumers evaluated chilled ready meals subjected to 4 pressure treatments and a non-treated control monadically on a 9-point scale for liking for beef tenderness and juiciness, overall flavour, overall liking, and purchase intent. Data were also collected on consumers' food consumption patterns, their attitudes towards food by means of the reduced food-related lifestyle (FRL) instrument, and socio-demographics. The results indicated that a pressure treatment of 200 MPa was acceptable to most consumers. K-means cluster analysis identified 4 consumer groups with similar preferences, and the optimal pressure treatments acceptable to specific consumer groups were identified for those firms that would wish to target attitudinally differentiated consumer segments
Resumo:
Tässä työssä raportoidaan hybridihitsauksesta otettujen suurnopeuskuvasarjojen automaattisen analyysijärjestelmän kehittäminen.Järjestelmän tarkoitus oli tuottaa tietoa, joka avustaisi analysoijaa arvioimaan kuvatun hitsausprosessin laatua. Tutkimus keskittyi valokaaren taajuuden säännöllisyyden ja lisäainepisaroiden lentosuuntien mittaamiseen. Valokaaria havaittiin kuvasarjoista sumean c-means-klusterointimenetelmän avullaja perättäisten valokaarien välistä aikaväliä käytettiin valokaaren taajuuden säännöllisyyden mittarina. Pisaroita paikannettiin menetelmällä, jossa yhdistyi pääkomponenttianalyysi ja tukivektoriluokitin. Kalman-suodinta käytettiin tuottamaan arvioita pisaroiden lentosuunnista ja nopeuksista. Lentosuunnanmääritysmenetelmä luokitteli pisarat niiden arvioitujen lentosuuntien perusteella. Järjestelmän kehittämiseen käytettävissä olleet kuvasarjat poikkesivat merkittävästi toisistaan kuvanlaadun ja pisaroiden ulkomuodon osalta, johtuen eroista kuvaus- ja hitsausprosesseissa. Analyysijärjestelmä kehitettiin toimimaan pienellä osajoukolla kuvasarjoja, joissa oli tietynlainen kuvaus- ja hitsausprosessi ja joiden kuvanlaatu ja pisaroiden ulkomuoto olivat samankaltaisia, mutta järjestelmää testattiin myös osajoukon ulkopuolisilla kuvasarjoilla. Testitulokset osoittivat, että lentosuunnanmääritystarkkuus oli kohtuullisen suuri osajoukonsisällä ja pieni muissa kuvasarjoissa. Valokaaren taajuuden säännöllisyyden määritys oli tarkka useammassa kuvasarjassa.
Resumo:
En aquest projecte fem un estudi de diferents mètodes per a la segmentació i extracció de línies de mapes de metro com a suport per a daltònics. Hem aplicat dos mètodes amb intervenció de l’usuari i cinc mètodes automàtics on fem servir K-means per a la segmentació de color i Hough per a l’extracció de línies. Dels mètodes amb intervenció obtenim millors resultats amb un mètode d’assignació aproximada del color, i entre els autoàatics tenim com a millor una solució ad-hoc sense paràmetres aplicada sobre l’espai RGB. D’acord amb els resultats experimentals, aquests mètodes ens permeten fer una bona segmentació i extracció de les línies de metro.
Resumo:
Chironomidae spatial distribution was investigated at 63 near-pristine sites in 22 catchments of the Iberian Mediterranean coast. We used partial redundancy analysis to study Chironomidae community responses to a number of environmental factors acting at several spatial scales. The percentage of variation explained by local factors (23.3%) was higher than that explained by geographical (8.5%) or regional factors(8%). Catchment area, longitude, pH, % siliceous rocks in the catchment, and altitude were the best predictors of Chironomidae assemblages. We used a k-means cluster analysis to classified sites into 3 major groups based on Chironomidae assemblages. These groups were explained mainly by longitudinal zonation and geographical position, and were defined as 1) siliceous headwater streams, 2) mid-altitude streams with small catchment areas, and 3) medium-sized calcareous streams. Distinct species assemblages with associated indicator taxa were established for each stream category using IndVal analysis. Species responses to previously identified key environmental variables were determined, and optima and tolerances were established by weighted average regression. Distinct ecological requirements were observed among genera and among species of the same genus. Some genera were restricted to headwater systems (e.g., Diamesa), whereas others (e.g., Eukiefferiella) had wider ecological preferences but with distinct distributions among congenerics. In the present period of climate change, optima and tolerances of species might be a useful tool to predict responses of different species to changes in significant environmental variables, such as temperature and hydrology.
Resumo:
Early identification of beginning readers at risk of developing reading and writing difficulties plays an important role in the prevention and provision of appropriate intervention. In Tanzania, as in other countries, there are children in schools who are at risk of developing reading and writing difficulties. Many of these children complete school without being identified and without proper and relevant support. The main language in Tanzania is Kiswahili, a transparent language. Contextually relevant, reliable and valid instruments of identification are needed in Tanzanian schools. This study aimed at the construction and validation of a group-based screening instrument in the Kiswahili language for identifying beginning readers at risk of reading and writing difficulties. In studying the function of the test there was special interest in analyzing the explanatory power of certain contextual factors related to the home and school. Halfway through grade one, 337 children from four purposively selected primary schools in Morogoro municipality were screened with a group test consisting of 7 subscales measuring phonological awareness, word and letter knowledge and spelling. A questionnaire about background factors and the home and school environments related to literacy was also used. The schools were chosen based on performance status (i.e. high, good, average and low performing schools) in order to include variation. For validation, 64 children were chosen from the original sample to take an individual test measuring nonsense word reading, word reading, actual text reading, one-minute reading and writing. School marks from grade one and a follow-up test half way through grade two were also used for validation. The correlations between the results from the group test and the three measures used for validation were very high (.83-.95). Content validity of the group test was established by using items drawn from authorized text books for reading in grade one. Construct validity was analyzed through item analysis and principal component analysis. The difficulty level of most items in both the group test and the follow-up test was good. The items also discriminated well. Principal component analysis revealed one powerful latent dimension (initial literacy factor), accounting for 93% of the variance. This implies that it could be possible to use any set of the subtests of the group test for screening and prediction. The K-Means cluster analysis revealed four clusters: at-risk children, strugglers, readers and good readers. The main concern in this study was with the groups of at-risk children (24%) and strugglers (22%), who need the most assistance. The predictive validity of the group test was analyzed by correlating the measures from the two school years and by cross tabulating grade one and grade two clusters. All the correlations were positive and very high, and 94% of the at-risk children in grade two were already identified in the group test in grade one. The explanatory power of some of the home and school factors was very strong. The number of books at home accounted for 38% of the variance in reading and writing ability measured by the group test. Parents´ reading ability and the support children received at home for schoolwork were also influential factors. Among the studied school factors school attendance had the strongest explanatory power, accounting for 21% of the variance in reading and writing ability. Having been in nursery school was also of importance. Based on the findings in the study a short version of the group test was created. It is suggested for use in the screening processes in grade one aiming at identifying children at risk of reading and writing difficulties in the Tanzanian context. Suggestions for further research as well as for actions for improving the literacy skills of Tanzanian children are presented.
Resumo:
The aim of this study was to group temporal profiles of 10-day composites NDVI product by similarity, which was obtained by the SPOT Vegetation sensor, for municipalities with high soybean production in the state of Paraná, Brazil, in the 2005/2006 cropping season. Data mining is a valuable tool that allows extracting knowledge from a database, identifying valid, new, potentially useful and understandable patterns. Therefore, it was used the methods for clusters generation by means of the algorithms K-Means, MAXVER and DBSCAN, implemented in the WEKA software package. Clusters were created based on the average temporal profiles of NDVI of the 277 municipalities with high soybean production in the state and the best results were found with the K-Means algorithm, grouping the municipalities into six clusters, considering the period from the beginning of October until the end of March, which is equivalent to the crop vegetative cycle. Half of the generated clusters presented spectro-temporal pattern, a characteristic of soybeans and were mostly under the soybean belt in the state of Paraná, which shows good results that were obtained with the proposed methodology as for identification of homogeneous areas. These results will be useful for the creation of regional soybean "masks" to estimate the planted area for this crop.