57 results for Data clustering. Fuzzy C-Means. Cluster centers initialization. Validation indices
in Consorci de Serveis Universitaris de Catalunya (CSUC), Spain
Abstract:
Our purpose is to provide a set-theoretical framework for clustering fuzzy relational data, based essentially on the cardinality of the fuzzy subsets that represent objects and of their complements, without applying any crisp property. From this perspective we define a family of fuzzy similarity indexes, which includes a set of fuzzy indexes introduced by Tolias et al., and we analyze the conditions under which it defines a fuzzy proximity relation. Following an original idea due to S. Miyamoto, we evaluate the similarity between objects and features by means of the same mathematical procedure. Combining these concepts and methods, we establish an algorithm for clustering fuzzy relational data. Finally, we present an example that illustrates the whole process.
Abstract:
Zonal management in vineyards requires the prior delineation of stable yield zones within the parcel. Among the different methodologies used for zone delineation, cluster analysis of yield data from several years is one of the possibilities cited in the scientific literature. However, there are reasonable doubts concerning the cluster algorithm to be used and the number of zones that have to be delineated within a field. In this paper two different cluster algorithms (k-means and fuzzy c-means) have been compared using the grape yield data corresponding to three successive years (2002, 2003 and 2004) for a ‘Pinot Noir’ vineyard parcel. The final choice of the most recommendable algorithm has been linked to obtaining a stable pattern of spatial yield distribution and to allowing for the delineation of compact, average-sized areas. The general recommendation is to use reclassified maps of two clusters or yield classes (low yield zone and high yield zone) and, consequently, site-specific vineyard management should be based on the prior delineation of just two different zones or sub-parcels. The two tested algorithms are good options for this purpose. However, the fuzzy c-means algorithm allows for a better zoning of the parcel, forming more compact areas with more balanced zonal differences over time.
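To make the fuzzy c-means step mentioned in this abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `fuzzy_c_means_1d`, the one-dimensional setting, the evenly spaced initial centers, the fuzzifier `m = 2.0` and the yield values are all illustrative assumptions. Taking the cluster with the highest membership gives a crisp label comparable to what k-means would report.

```python
def fuzzy_c_means_1d(values, c=2, m=2.0, max_iter=100, tol=1e-6):
    """Minimal 1-D fuzzy c-means (illustrative sketch; needs c >= 2 and m > 1).

    Returns (centers, memberships), where memberships[i][k] is the degree
    to which values[i] belongs to cluster k, computed in the last update.
    """
    lo, hi = min(values), max(values)
    # Assumption: deterministic start with centers evenly spaced over the data range.
    centers = [lo + (hi - lo) * k / (c - 1) for k in range(c)]
    memberships = []
    for _ in range(max_iter):
        memberships = []
        for x in values:
            dists = [abs(x - v) for v in centers]
            if min(dists) == 0.0:
                # The point sits exactly on a center: give it full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                # Standard membership update: u_k proportional to d_k^(-2/(m-1)).
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                row = [w / sum(inv) for w in inv]
            memberships.append(row)
        # Center update: mean of the data weighted by membership^m.
        new_centers = []
        for k in range(c):
            weights = [row[k] ** m for row in memberships]
            new_centers.append(
                sum(w * x for w, x in zip(weights, values)) / sum(weights)
            )
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


# Made-up yield values, purely for illustration (not data from the study).
yields = [4.1, 4.5, 5.0, 9.2, 9.8, 10.1]
centers, memberships = fuzzy_c_means_1d(yields, c=2)
# Crisp "low yield" / "high yield" labels: the cluster with the highest membership.
hard_labels = [max(range(len(row)), key=row.__getitem__) for row in memberships]
```

The graded memberships are what distinguish the fuzzy variant: a point near the boundary between zones keeps a partial membership in both, instead of being forced into one zone.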
Abstract:
In an earlier investigation (Burger et al., 2000) five sediment cores near the Rodrigues Triple Junction in the Indian Ocean were studied applying classical statistical methods (fuzzy c-means clustering, linear mixing model, principal component analysis) for the extraction of endmembers and evaluating the spatial and temporal variation of geochemical signals. Three main factors of sedimentation were expected by the marine geologists: a volcano-genetic, a hydro-hydrothermal and an ultra-basic factor. The display of fuzzy membership values and/or factor scores versus depth provided consistent results for two factors only; the ultra-basic component could not be identified. The reason for this may be that only traditional statistical methods were applied, i.e. the untransformed components were used and the cosine-theta coefficient as similarity measure. During the last decade considerable progress in compositional data analysis was made and many case studies were published using new tools for exploratory analysis of these data. Therefore it makes sense to check if the application of suitable data transformations, reduction of the D-part simplex to two or three factors and visual interpretation of the factor scores would lead to a revision of earlier results and to answers to open questions. In this paper we follow the lines of a paper of R. Tolosana-Delgado et al. (2005) starting with a problem-oriented interpretation of the biplot scattergram, extracting compositional factors, ilr-transformation of the components and visualization of the factor scores in a spatial context: The compositional factors will be plotted versus depth (time) of the core samples in order to facilitate the identification of the expected sources of the sedimentary process. Key words: compositional data analysis, biplot, deep sea sediments
Abstract:
When continuous data are coded to categorical variables, two types of coding are possible: crisp coding in the form of indicator, or dummy, variables with values either 0 or 1; or fuzzy coding where each observation is transformed to a set of "degrees of membership" between 0 and 1, using so-called membership functions. It is well known that the correspondence analysis of crisp coded data, namely multiple correspondence analysis, yields principal inertias (eigenvalues) that considerably underestimate the quality of the solution in a low-dimensional space. Since the crisp data only code the categories to which each individual case belongs, an alternative measure of fit is simply to count how well these categories are predicted by the solution. Another approach is to consider multiple correspondence analysis equivalently as the analysis of the Burt matrix (i.e., the matrix of all two-way cross-tabulations of the categorical variables), and then perform a joint correspondence analysis to fit just the off-diagonal tables of the Burt matrix; the measure of fit is then computed as the quality of explaining these tables only. The correspondence analysis of fuzzy coded data, called "fuzzy multiple correspondence analysis", suffers from the same problem, albeit attenuated. Again, one can count how many correct predictions are made of the categories which have the highest degree of membership. But here one can also defuzzify the results of the analysis to obtain estimated values of the original data, and then calculate a measure of fit in the familiar percentage form, thanks to the resultant orthogonal decomposition of variance. Furthermore, if one thinks of fuzzy multiple correspondence analysis as explaining the two-way associations between variables, a fuzzy Burt matrix can be computed and the same strategy as in the crisp case can be applied to analyse the off-diagonal part of this matrix.
In this paper these alternative measures of fit are defined and applied to a data set of continuous meteorological variables, which are coded crisply and fuzzily into three categories. Measuring the fit is further discussed when the data set consists of a mixture of discrete and continuous variables.
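The coding step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which membership functions were used, so triangular functions with three hinge points (`low`, `mid`, `high`) are assumed here, and the names `fuzzy_code`, `crisp_code` and `defuzzify` as well as the numeric values are hypothetical. The crisp code is taken, for simplicity, as the category with the highest degree of membership.

```python
def fuzzy_code(x, low, mid, high):
    """Code x into three memberships (low, medium, high) that sum to 1.

    Assumption: triangular membership functions hinged at low < mid < high;
    values outside [low, high] are clamped to the nearest end.
    """
    x = min(max(x, low), high)
    if x <= mid:
        t = (x - low) / (mid - low)
        return (1.0 - t, t, 0.0)
    t = (x - mid) / (high - mid)
    return (0.0, 1.0 - t, t)


def crisp_code(x, low, mid, high):
    """Indicator (0/1) coding: 1 for the category with the highest membership."""
    memberships = fuzzy_code(x, low, mid, high)
    best = max(range(3), key=memberships.__getitem__)
    return tuple(1 if i == best else 0 for i in range(3))


def defuzzify(memberships, low, mid, high):
    """Recover a value as the membership-weighted average of the hinge points."""
    return sum(m * h for m, h in zip(memberships, (low, mid, high)))


# Hypothetical hinge points and observation, chosen only for illustration.
coded = fuzzy_code(15.0, 10.0, 20.0, 40.0)
recovered = defuzzify(coded, 10.0, 20.0, 40.0)
```

With triangular functions, defuzzifying the coded observation returns the original value whenever it lies between the outer hinge points, which is why estimated data values, and hence a percentage-of-variance measure of fit, can be obtained from the fuzzy analysis but not from the crisp one.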
Abstract:
This work focuses on predicting the two main nitrogenous variables that describe water quality at the effluent of a Wastewater Treatment Plant. We have developed two kinds of Neural Network architectures, one considering a single output and the other considering the usual five effluent variables that define water quality: suspended solids, biochemical organic matter, chemical organic matter, total nitrogen and total Kjeldahl nitrogen. Two learning techniques, based on a classical adaptive gradient and on a Kalman filter, have been implemented. To try to improve generalization and performance, we have selected variables by means of genetic algorithms and fuzzy systems. The training, testing and validation sets show that the final networks are able to learn the available simulated data well enough, especially for total nitrogen.
Abstract:
The objective of this research was to analyse the potential of Normalized Difference Vegetation Index (NDVI) maps from satellite images, yield maps and grapevine fertility and load variables to delineate zones with different wine grape properties for selective harvesting. Two vineyard blocks located in NE Spain (Cabernet Sauvignon and Syrah) were analysed. The NDVI was computed from a Quickbird-2 multi-spectral image at veraison (July 2005). Yield data were acquired by means of a yield monitor during September 2005. Other variables, such as the number of buds, number of shoots, number of wine grape clusters and weight of 100 berries, were sampled in a 10 rows × 5 vines pattern and used as input variables, in combination with the NDVI, to define the clusters as an alternative to yield maps. Two days prior to harvesting, grape samples were taken. The analysed variables were probable alcoholic degree, pH of the juice, total acidity, total phenolics, colour, anthocyanins and tannins. The input variables, alone or in combination, were clustered (2 and 3 clusters) using the ISODATA algorithm, and an analysis of variance and a multiple range test were performed. The results show that the zones derived from the NDVI maps are more effective at differentiating grape maturity and quality variables than the zones derived from the yield maps. The inclusion of other grapevine fertility and load variables did not improve the results.
Abstract:
Research project carried out during a stay at the National Oceanography Centre of Southampton (NOCS), Great Britain, between May and July 2006. The ability to obtain an accurate estimate of sea surface salinity (SSS) is important for investigating and predicting the extent of climate change. The Soil Moisture and Ocean Salinity (SMOS) mission was selected by the European Space Agency (ESA) to obtain maps of sea surface salinity at global scale and with a short revisit time. Before the launch of SMOS, an analysis is planned of the horizontal variability of SSS and of the potential of the data retrieved from SMOS measurements to reproduce known oceanographic behaviour. The overall objective is to fill the gap between reliable input/auxiliary data sources and the tools developed to simulate and process data acquired in the SMOS configuration. The SMOS End-to-end Performance Simulator (SEPS) is an ad hoc simulator developed by the Universitat Politècnica de Catalunya (UPC) to generate data in the SMOS configuration. Input data for SEPS were taken, at different spatial resolutions, from the Ocean Circulation and Climate Advanced Modeling (OCCAM) project, used at NOCS. By modifying SEPS so that it could take OCCAM data as input, simulated brightness temperature data were obtained for one month, with different ascending passes covering the selected area. The tasks carried out during the stay at NOCS aimed to provide a reliable technique for performing the external calibration and thereby cancelling the bias, to provide a methodology for temporally averaging the different acquisitions during the ascending passes, and to determine the best configuration of the cost function, before exploiting and investigating the possibilities of the SEPS/OCCAM data for deriving the retrieved SSS with high-resolution patterns.
Improved diagnosis of diffuse liver diseases using artificial intelligence techniques
Abstract:
Automatic diagnostic discrimination is an application of artificial intelligence techniques that can solve clinical cases based on imaging. Diffuse liver diseases are highly prevalent in the population and follow an insidious course, especially early in their progression. Early and effective diagnosis is necessary because many of these diseases progress to cirrhosis and liver cancer. The usual technique of choice for accurate diagnosis is liver biopsy, which is invasive and not free of contraindications. This project proposes an alternative method, non-invasive and free of contraindications, based on liver ultrasonography. The images are digitized and then analyzed using statistical techniques and texture analysis. The results are validated against the pathology report. Finally, we apply artificial intelligence techniques such as Fuzzy k-Means or Support Vector Machines and compare their significance with that of the statistical analysis and the clinician's report. The results show that this technique is significantly valid and a promising non-invasive alternative for diagnosing chronic liver disease with diffuse involvement. Artificial intelligence classification techniques significantly improve diagnostic discrimination compared to the other statistical methods.
Abstract:
R from http://www.r-project.org/ is ‘GNU S’ – a language and environment for statistical computing and graphics. It is an environment in which many classical and modern statistical techniques have been implemented, though many are supplied as packages. There are 8 standard packages and many more are available through the CRAN family of Internet sites http://cran.r-project.org . We started to develop a library of functions in R to support the analysis of mixtures, and our goal is a MixeR package for compositional data analysis that provides support for: operations on compositions (perturbation and power multiplication, subcomposition with or without residuals, centering of the data, computing Aitchison’s, Euclidean and Bhattacharyya distances, compositional Kullback-Leibler divergence, etc.); graphical presentation of compositions in ternary diagrams and tetrahedrons with additional features (barycenter, geometric mean of the data set, percentile lines, marking and coloring of subsets of the data set, their geometric means, notation of individual data in the set . . .); dealing with zeros and missing values in compositional data sets with R procedures for simple and multiplicative replacement strategies; and the time series analysis of compositional data. We’ll present the current status of MixeR development and illustrate its use on selected data sets.
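For readers unfamiliar with the operations on compositions named in this abstract, the following sketch shows perturbation, power multiplication and the Aitchison distance using their standard textbook definitions. It is written in Python rather than R, it is not the MixeR API (whose function names are not given in the abstract), and every name and numeric value below is an illustrative assumption. All parts must be strictly positive; zeros would first need a replacement strategy of the kind the abstract mentions.

```python
import math


def closure(parts):
    """Rescale strictly positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def perturb(x, y):
    """Perturbation: component-wise product of two compositions, then closure."""
    return closure([a * b for a, b in zip(x, y)])


def power(x, alpha):
    """Power multiplication: raise each part to alpha, then closure."""
    return closure([a ** alpha for a in x])


def clr(x):
    """Centered log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(a) for a in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Aitchison distance: Euclidean distance between clr-transformed compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Hypothetical three-part compositions, chosen only for illustration.
x = [0.2, 0.3, 0.5]
y = [0.1, 0.6, 0.3]
p = [0.5, 0.25, 0.25]
shifted_distance = aitchison_distance(perturb(x, p), perturb(y, p))
```

Perturbing both compositions by the same `p` leaves their Aitchison distance unchanged, which is one reason this distance is preferred to the plain Euclidean one for compositional data.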
Abstract:
This article sets out the common methodological process that frames several of the studies presented in this volume of the journal, all sharing one objective: the construction of typologies in different subject areas. The article specifies the general design framework and describes the methodological and data analysis process, which can be characterised by: (1) The data source is a large survey on the habits and living conditions of the population, the Enquesta Metropolitana de Barcelona 1990. (2) The framing of a specific object of study within a multivariate reality. (3) The use of multivariate analysis techniques, specifically Multiple Correspondence Analysis and Automatic Classification Techniques.
Abstract:
The aim of this research was to investigate the effects of high pressure processing (HPP) on consumer acceptance for chilled ready meals manufactured using a low-value beef cut. Three hundred consumers evaluated chilled ready meals subjected to 4 pressure treatments and a non-treated control monadically on a 9-point scale for liking for beef tenderness and juiciness, overall flavour, overall liking, and purchase intent. Data were also collected on consumers' food consumption patterns, their attitudes towards food by means of the reduced food-related lifestyle (FRL) instrument, and socio-demographics. The results indicated that a pressure treatment of 200 MPa was acceptable to most consumers. K-means cluster analysis identified 4 consumer groups with similar preferences, and the optimal pressure treatments acceptable to specific consumer groups were identified for those firms that would wish to target attitudinally differentiated consumer segments
Abstract:
Chironomidae spatial distribution was investigated at 63 near-pristine sites in 22 catchments of the Iberian Mediterranean coast. We used partial redundancy analysis to study Chironomidae community responses to a number of environmental factors acting at several spatial scales. The percentage of variation explained by local factors (23.3%) was higher than that explained by geographical (8.5%) or regional factors (8%). Catchment area, longitude, pH, % siliceous rocks in the catchment, and altitude were the best predictors of Chironomidae assemblages. We used a k-means cluster analysis to classify sites into 3 major groups based on Chironomidae assemblages. These groups were explained mainly by longitudinal zonation and geographical position, and were defined as 1) siliceous headwater streams, 2) mid-altitude streams with small catchment areas, and 3) medium-sized calcareous streams. Distinct species assemblages with associated indicator taxa were established for each stream category using IndVal analysis. Species responses to previously identified key environmental variables were determined, and optima and tolerances were established by weighted average regression. Distinct ecological requirements were observed among genera and among species of the same genus. Some genera were restricted to headwater systems (e.g., Diamesa), whereas others (e.g., Eukiefferiella) had wider ecological preferences but with distinct distributions among congenerics. In the present period of climate change, optima and tolerances of species might be a useful tool to predict responses of different species to changes in significant environmental variables, such as temperature and hydrology.
Abstract:
Background: Assessing the costs of treating disease is necessary to demonstrate cost-effectiveness and to estimate the budget impact of new interventions and therapeutic innovations. However, there are few comprehensive studies on resource use and costs associated with lung cancer patients in clinical practice in Spain or internationally. The aim of this paper was to assess the hospital cost associated with lung cancer diagnosis and treatment by histology, type of cost and stage at diagnosis in the Spanish National Health Service. Methods: A retrospective, descriptive analysis of resource use and a direct medical cost analysis were performed. Resource utilisation data were collected by means of patient files from nine teaching hospitals. From a hospital budget impact perspective, the aggregate and mean costs per patient were calculated over the first three years following diagnosis or up to death. Both aggregate and mean costs per patient were analysed by histology, stage at diagnosis and cost type. Results: A total of 232 cases of lung cancer were analysed, of which 74.1% corresponded to non-small cell lung cancer (NSCLC) and 11.2% to small cell lung cancer (SCLC); 14.7% had no cytohistologic confirmation. The mean cost per patient in NSCLC ranged from 13,218 Euros in Stage III to 16,120 Euros in Stage II. The main cost components were chemotherapy (29.5%) and surgery (22.8%). Advanced disease stages were associated with a decrease in the relative weight of surgical and inpatient care costs but an increase in chemotherapy costs. In SCLC patients, the mean cost per patient was 15,418 Euros for limited disease and 12,482 Euros for extensive disease. The main cost components were chemotherapy (36.1%) and other inpatient costs (28.7%). In both groups, the Kruskal-Wallis test did not show statistically significant differences in mean cost per patient between stages.
Conclusions: This study provides the costs of lung cancer treatment based on patient file reviews, with chemotherapy and surgery accounting for the major components of costs. This cost analysis is a baseline study that will provide a useful source of information for future studies on cost-effectiveness and on the budget impact of different therapeutic innovations in Spain.
Abstract:
This project presents a scientific study of methods for generating synthetic data within the field of data privacy. These methods make it possible to control the transfer of sensitive data to third parties and the statistical utility of the synthetically generated data. All the basic concepts needed to orient the reader are introduced, and one of the most widely used existing methods (IPSO) is analysed. Next, a new method for generating synthetic data (FCRM) is proposed; it is based on Fuzzy c-Regression and makes it possible to control the trade-off between information loss and disclosure risk by means of a parameter c.
Abstract:
Study and implementation of an intelligent multi-agent system and its application to fuzzy systems. Use of the JADE and JFuzzyLogic libraries.