775 results for Data Mining, Rough Sets, Multi-Dimension, Association Rules, Constraint


Relevance:

100.00%

Publisher:

Abstract:

Visual data mining, multi-dimensional scaling, POLARMAP, Sammon's mapping, clustering, outlier detection

Relevance:

100.00%

Publisher:

Abstract:

The algorithmic approach to data modelling has developed rapidly in recent years; in particular, methods based on data mining and machine learning have been used in a growing number of applications. These methods follow a data-driven methodology, aiming at providing the best possible generalization and predictive ability instead of concentrating on the properties of the data model. One of the most successful groups of such methods is known as Support Vector algorithms. Following the fruitful developments in applying Support Vector algorithms to spatial data, this paper introduces a new extension of the traditional support vector regression (SVR) algorithm. This extension allows for the simultaneous modelling of environmental data at several spatial scales. The joint influence of environmental processes presenting different patterns at different scales is learned automatically from data, providing the optimum mixture of short- and large-scale models. The method is adaptive to the spatial scale of the data. With this advantage, it can provide efficient means to model local anomalies that typically arise in the early phase of an environmental emergency. However, the proposed approach still requires some prior knowledge of the possible existence of such short-scale patterns, which is a possible limitation of the method for its implementation in early warning systems. The purpose of this paper is to present the multi-scale SVR model and to illustrate its use with an application to the mapping of Cs137 activity from the measurements taken in the region of Briansk following the Chernobyl accident.
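
The multi-scale idea can be illustrated with a hedged sketch: a support vector regression whose kernel is a weighted sum of two RBF kernels with different length scales, one capturing the large-scale trend and one capturing short-scale anomalies. The kernel widths, the mixing weight and the toy data below are illustrative assumptions, not the parameters or data of the paper's model.

# Hedged sketch: SVR with a two-scale kernel (weighted sum of a large- and a short-scale RBF).
# The length scales, mixing weight and synthetic data are illustrative assumptions only.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

def multi_scale_kernel(X, Y, gamma_large=0.1, gamma_short=10.0, w=0.7):
    """Weighted sum of a large-scale and a short-scale RBF kernel."""
    return w * rbf_kernel(X, Y, gamma=gamma_large) + (1 - w) * rbf_kernel(X, Y, gamma=gamma_short)

# Toy 2-D spatial data: a smooth regional trend plus one local anomaly.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 2.0 * np.exp(-50 * ((X[:, 0] - 0.5) ** 2 + (X[:, 1] - 0.5) ** 2))

model = SVR(kernel=multi_scale_kernel, C=10.0, epsilon=0.05)
model.fit(X, y)
print(model.predict(X[:5]))

In this sketch the mixing weight w plays the role of the "optimum mixture" mentioned above; in practice it would be tuned (e.g. by cross-validation) rather than fixed.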

Relevance:

100.00%

Publisher:

Abstract:

This work, titled Data and Text Mining Techniques for the Annotation of a Digital Archive, aims to test the feasibility of using automatic text-processing techniques to annotate the sessions of the parliamentary debates of the Assembleia da República of Portugal. Throughout the work, concepts such as knowledge discovery technologies (KDD), the process of knowledge discovery in text, the characterisation of the various stages of text processing and a description of some open source text-mining tools are addressed. The methodology was based on experimentation with several text-processing techniques using the open source R/tm package. The results presented show the influence of pre-processing, document size and corpus size on the outcome of processing with the knnflex algorithm.
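
The original experiments were carried out with the R packages tm and knnflex; as a hedged illustration of the same kind of pipeline (pre-processing, term weighting, k-nearest-neighbour classification), here is a small Python analogue. The documents, labels and parameter values are invented for illustration and do not reproduce the thesis experiments.

# Hedged Python analogue of a text pre-processing + kNN annotation pipeline.
# The thesis used R/tm and knnflex; the documents, labels and k below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "debate on the national budget and public spending",
    "discussion of health policy and hospital funding",
    "vote on the state budget amendments",
    "questions to the minister of health",
]
labels = ["budget", "health", "budget", "health"]  # illustrative annotations

pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),  # pre-processing + term weighting
    KNeighborsClassifier(n_neighbors=3),                    # kNN classifier, analogous to knnflex
)
pipeline.fit(docs, labels)
print(pipeline.predict(["committee report on hospital budgets"]))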

Relevance:

100.00%

Publisher:

Abstract:

The induction of fungal metabolites by fungal co-cultures grown on solid media was explored using multi-well co-cultures in 2 cm diameter Petri dishes. Fungi were grown in 12-well plates to easily and rapidly obtain the large number of replicates necessary for employing metabolomic approaches. Fungal culture in such a format accelerated the production of metabolites by several weeks compared with the large-format 9 cm Petri dishes. This strategy was applied to a co-culture of a Fusarium and an Aspergillus strain. The metabolite composition of the cultures was assessed using ultra-high pressure liquid chromatography coupled to electrospray ionisation and time-of-flight mass spectrometry, followed by automated data mining. The de novo production of metabolites was dramatically increased by nutrient reduction. A time-series study of the induction of the fungal metabolites of interest over nine days revealed that they exhibited various induction patterns. The concentrations of most of the de novo induced metabolites increased over time. However, interesting patterns were observed, such as the presence of some compounds only at certain time points. This result indicates the complexity and dynamic nature of fungal metabolism. The large-scale production of the compounds of interest was verified by co-culture in 15 cm Petri dishes; most of the induced metabolites of interest (16/18) were found to be produced as effectively as on a small scale, although not in the same time frames. Large-scale production is a practical solution for the future production, identification and biological evaluation of these metabolites.

Relevance:

100.00%

Publisher:

Abstract:

Digital information generates the possibility of a high degree of redundancy in the data available for fitting predictive models used for Digital Soil Mapping (DSM). Among these models, the Decision Tree (DT) technique has been increasingly applied due to its capacity to deal with large datasets. The purpose of this study was to evaluate the impact of the data volume used to generate the DT models on the quality of soil maps. An area of 889.33 km² was chosen in the northern region of the State of Rio Grande do Sul. The soil-landscape relationship was obtained from field re-survey (reambulation) of the studied area and the alignment of the units on the 1:50,000-scale topographic map. Six predictive covariates linked to the soil formation factors relief and organisms, together with data sets of 1, 3, 5, 10, 15, 20 and 25 % of the total data volume, were used to generate the predictive DT models in the data mining program Waikato Environment for Knowledge Analysis (WEKA). In this study, sample densities below 5 % resulted in models with less power to capture the complexity of the spatial distribution of the soil in the study area. The relation between the data volume to be handled and the predictive capacity of the models was best for samples between 5 and 15 %. For the models based on these sample densities, the collected field data indicated a predictive mapping accuracy close to 70 %.
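
The study trained its trees in WEKA; below is a hedged scikit-learn sketch of the same experimental design (subsampling the data at several densities, fitting a decision tree on each subsample, and checking accuracy against held-out data). The synthetic covariates, target and sample sizes are assumptions for illustration only and do not reflect the study's data.

# Hedged sketch of the sample-density experiment using scikit-learn decision trees
# (the original study used WEKA); the synthetic covariates and soil-class target are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(20000, 6))                 # six relief/organism covariates (synthetic)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic soil-class target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for fraction in (0.01, 0.03, 0.05, 0.10, 0.15, 0.20, 0.25):
    n = int(fraction * len(X_train))
    idx = rng.choice(len(X_train), size=n, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    print(f"{fraction:.0%} of data -> test accuracy {tree.score(X_test, y_test):.3f}")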

Relevance:

100.00%

Publisher:

Abstract:

There are many known examples of multiple semi-independent associations at individual loci; such associations might arise either because of true allelic heterogeneity or because of imperfect tagging of an unobserved causal variant. This phenomenon is of great importance in monogenic traits but has not yet been systematically investigated and quantified in complex-trait genome-wide association studies (GWASs). Here, we describe a multi-SNP association method that estimates the effect of loci harboring multiple association signals by using GWAS summary statistics. Applying the method to a large anthropometric GWAS meta-analysis (from the Genetic Investigation of Anthropometric Traits consortium study), we show that for height, body mass index (BMI), and waist-to-hip ratio (WHR), 3%, 2%, and 1%, respectively, of additional phenotypic variance can be explained on top of the previously reported 10% (height), 1.5% (BMI), and 1% (WHR). The method also permitted a substantial increase (by up to 50%) in the number of loci that replicate in a discovery-validation design. Specifically, we identified 74 loci at which the multi-SNP, a linear combination of SNPs, explains significantly more variance than does the best individual SNP. A detailed analysis of multi-SNPs shows that most of the additional variability explained is derived from SNPs that are not in linkage disequilibrium with the lead SNP, suggesting a major contribution of allelic heterogeneity to the missing heritability.
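
A common way to estimate such joint multi-SNP effects from GWAS summary statistics (whether it is exactly the estimator used in this study is an assumption here) is to convert the marginal, single-SNP effect estimates into joint estimates using the linkage-disequilibrium correlation matrix R of the SNPs at the locus; with standardized genotypes and phenotype,

\hat{\boldsymbol{\beta}}_{\mathrm{joint}} = R^{-1}\,\hat{\boldsymbol{\beta}}_{\mathrm{marg}}, \qquad \widehat{r^2}_{\mathrm{locus}} = \hat{\boldsymbol{\beta}}_{\mathrm{joint}}^{\top} R\, \hat{\boldsymbol{\beta}}_{\mathrm{joint}} = \hat{\boldsymbol{\beta}}_{\mathrm{marg}}^{\top} R^{-1}\, \hat{\boldsymbol{\beta}}_{\mathrm{marg}},

so the multi-SNP (the linear combination of SNPs mentioned above) explains more variance than the best single SNP whenever the marginal signals are not fully captured by the lead variant.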

Relevance:

100.00%

Publisher:

Abstract:

The book presents the state of the art in machine learning algorithms (artificial neural networks of different architectures, support vector machines, etc.) as applied to the classification and mapping of spatially distributed environmental data. Basic geostatistical algorithms are presented as well. New trends in machine learning and their application to spatial data are discussed, and real case studies based on environmental and pollution data are carried out. The book provides a CD-ROM with the Machine Learning Office software, including sample data sets, allowing both students and researchers to put the concepts rapidly into practice.

Relevance:

100.00%

Publisher:

Abstract:

This thesis presents a topological approach to studying fuzzy sets by means of modifier operators. Modifier operators are mathematical models, e.g., for hedges, and we briefly present different approaches to studying them. We are interested in compositional modifier operators, modifiers for short, and these modifiers depend on binary relations. We show that if a modifier depends on a reflexive and transitive binary relation on U, then there exists a unique topology on U such that this modifier is the closure operator in that topology. Also, if U is finite then there exists a lattice isomorphism between the class of all reflexive and transitive relations and the class of all topologies on U. We define a topological similarity relation "≈" between L-fuzzy sets in a universe U, and show that the class L^U/≈ is isomorphic with the class of all topologies on U, if U is finite and L is suitable. We consider finite bitopological spaces as approximation spaces, and we show that lower and upper approximations can be computed by means of α-level sets also in the case of equivalence relations. This means that approximations in the sense of Rough Set Theory can be computed by means of α-level sets. Finally, we present an application to data analysis: we study an approach to detecting dependencies of attributes in database-like systems, called information systems.
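
For reference, the approximations mentioned at the end are, in their standard Rough Set Theory form (the thesis' bitopological formulation is not reproduced here), defined for an equivalence relation R on U with classes [x]_R by

\underline{R}(A) = \{\, x \in U : [x]_R \subseteq A \,\}, \qquad \overline{R}(A) = \{\, x \in U : [x]_R \cap A \neq \emptyset \,\},

and for an L-fuzzy set A the α-level sets in question are A_\alpha = \{\, x \in U : A(x) \ge \alpha \,\}, so that the claim amounts to computing \underline{R} and \overline{R} level-wise on the sets A_\alpha.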

Relevance:

100.00%

Publisher:

Abstract:

One main assumption in the theory of rough sets applied to information tables is that elements exhibiting the same information are indiscernible (similar) and form blocks that can be understood as elementary granules of knowledge about the universe. We propose a variant of this concept by defining a measure of similarity between the elements of the universe, so that two objects can be considered indiscernible even when they do not share all attribute values, because the knowledge is partial or uncertain. The set of similarities defines the matrix of a fuzzy relation satisfying reflexivity and symmetry but not transitivity; thus a partition of the universe is not attained. This problem can be solved by calculating the transitive closure of the relation, which ensures a partition for each level belonging to the unit interval [0,1]. This procedure allows the theory of rough sets to be generalized depending on the minimum level of similarity accepted. This new point of view increases the rough character of the data because it enlarges the set of indiscernible objects. Finally, we apply our results to a synthetic (not real) application in order to highlight the differences and improvements between this methodology and the classical one.
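
The closure step admits a standard formulation (the exact operators used in the paper are not quoted here, so max-min composition is assumed): the transitive closure of the reflexive, symmetric fuzzy relation R is

R^{*} = \bigcup_{k \ge 1} R^{k}, \qquad (R \circ R)(x,z) = \max_{y \in U} \min\bigl(R(x,y),\, R(y,z)\bigr),

and for every level \alpha \in [0,1] the α-cut R^{*}_{\alpha} = \{\, (x,y) : R^{*}(x,y) \ge \alpha \,\} is a crisp equivalence relation, hence yields the partition of the universe used at that minimum level of similarity.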

Relevance:

100.00%

Publisher:

Abstract:

In this text, the author answers a question raised at a conference organised jointly by the US Department of Commerce and the Article 29 Working Party, which asks how data protection rules should apply to transfers of personal data in a global, multi-economic and multicultural society. The question is pertinent in such a society, characterised by the need, on the one hand, to ensure a certain data protection regime regardless of borders and, on the other hand, to respect the diversity of the economic and cultural realities that increasingly coexist. The author first recalls how Europe progressively put in place the system of the right to the protection of personal data. He then explains how the European Union addressed the question of regulating transborder data flows, leading to the development of an adequate and effective protection system for transfers of data outside the European Union. However, the system put in place no longer seems to match today's reality of transborder flows, hence the possible need to reform it.

Relevance:

100.00%

Publisher:

Abstract:

Computational Biology is the research area that contributes to the analysis of biological data through the development of algorithms addressing significant research problems. The data from molecular biology include DNA, RNA, protein and gene expression data. Gene expression data provide the expression level of genes under different conditions. Gene expression is the process of transcribing the DNA sequence of a gene into mRNA sequences, which in turn are later translated into proteins; the number of copies of mRNA produced is called the expression level of a gene. Gene expression data are organized in the form of a matrix: rows represent genes, columns represent experimental conditions (different tissue types or time points), and the entries are real values. Through the analysis of gene expression data it is possible to determine the behavioural patterns of genes, such as the similarity of their behaviour, the nature of their interactions, and their respective contributions to the same pathways. Genes participating in the same biological process exhibit similar expression patterns. These patterns have immense relevance and application in bioinformatics and clinical research; in the medical domain they aid more accurate diagnosis, prognosis, treatment planning, drug discovery and protein network analysis. To identify such patterns in gene expression data, data mining techniques are essential. Clustering is an important data mining technique for the analysis of gene expression data; to overcome the problems associated with clustering, biclustering was introduced. Biclustering refers to the simultaneous clustering of both rows and columns of a data matrix: clustering is a global model, whereas biclustering is a local model. Discovering local expression patterns is essential for identifying many genetic pathways that are not apparent otherwise, so it is necessary to move beyond the clustering paradigm towards approaches capable of discovering local patterns in gene expression data. A bicluster is a submatrix of the gene expression data matrix; its rows and columns need not be contiguous in the original matrix, and biclusters are not disjoint. Computing biclusters is costly because all combinations of rows and columns must be considered in order to find all the biclusters: the search space for the biclustering problem is 2^(m+n), where m and n are the numbers of genes and conditions respectively, and usually m+n is more than 3000. The biclustering problem is NP-hard. Biclustering is nonetheless a powerful analytical tool for the biologist. The research reported in this thesis addresses the problem of biclustering: ten algorithms are developed for the identification of coherent biclusters from gene expression data. All of these algorithms use a measure called the mean squared residue to search for biclusters; the objective is to identify biclusters of maximum size with a mean squared residue lower than a given threshold.

All the algorithms begin the search from tightly coregulated submatrices called seeds, which are generated by the K-Means clustering algorithm. The algorithms developed can be classified as constraint-based, greedy and metaheuristic. The constraint-based algorithms use one or more constraints, namely the MSR threshold and the MSR difference threshold. The greedy approach makes a locally optimal choice at each stage with the objective of finding the global optimum. In the metaheuristic approaches, Particle Swarm Optimization (PSO) and variants of the Greedy Randomized Adaptive Search Procedure (GRASP) are used for the identification of biclusters. These algorithms are applied to the Yeast and Lymphoma datasets. Biologically relevant and statistically significant biclusters, validated against the Gene Ontology database, are identified by all these algorithms, and all of them are compared with other biclustering algorithms. The algorithms developed in this work overcome some of the problems associated with existing algorithms. With the help of some of the algorithms developed here, biclusters with very high row variance (higher than the row variance achieved by any other algorithm using the mean squared residue) are identified from both the Yeast and Lymphoma data sets. Such biclusters, which reflect significant change in expression level, are highly relevant biologically.
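
The mean squared residue used as the search criterion is, in its standard (Cheng and Church) form, the average squared deviation of a submatrix from an additive row/column model; a hedged sketch of how it is computed for a candidate bicluster is given below. The function name and the toy submatrix are illustrative and not taken from the thesis.

# Hedged sketch: mean squared residue (MSR) of a bicluster, in the standard
# Cheng-and-Church sense; the toy submatrix below is illustrative only.
import numpy as np

def mean_squared_residue(submatrix):
    """MSR: average squared deviation from the additive model
    a_ij ~ row_mean_i + col_mean_j - overall_mean."""
    row_means = submatrix.mean(axis=1, keepdims=True)
    col_means = submatrix.mean(axis=0, keepdims=True)
    overall_mean = submatrix.mean()
    residue = submatrix - row_means - col_means + overall_mean
    return float((residue ** 2).mean())

bicluster = np.array([[1.0, 2.0, 3.0],
                      [2.0, 3.0, 4.0],
                      [0.5, 1.5, 2.5]])
print(mean_squared_residue(bicluster))  # 0.0 for a perfectly coherent (additive) bicluster

A bicluster search of the kind described then tries to grow such submatrices as large as possible while keeping this score below the chosen threshold.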

Relevance:

100.00%

Publisher:

Abstract:

Knowledge discovery support environments include data mining tools in addition to classical data analysis tools. For supporting both kinds of tools, a unified knowledge representation is needed. We show that concept lattices, which are used as the knowledge representation in Conceptual Information Systems, can also be used for structuring the results of mining association rules. Vice versa, we use ideas from association rules for reducing the complexity of the visualization of Conceptual Information Systems.
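
For reference, the two formalisms being combined rest on standard definitions (stated here in their usual form, which may differ from the paper's notation). In a formal context (G, M, I), the derivation of an attribute set X ⊆ M is X' = \{\, g \in G : (g,m) \in I \text{ for all } m \in X \,\}; a formal concept is a pair (A, B) with A' = B and B' = A, and the concepts ordered by extent inclusion form the concept lattice. An association rule X ⇒ Y between attribute sets then has

\mathrm{supp}(X \Rightarrow Y) = \frac{|(X \cup Y)'|}{|G|}, \qquad \mathrm{conf}(X \Rightarrow Y) = \frac{|(X \cup Y)'|}{|X'|},

which is what allows rule mining results to be read off from, and arranged on, the concept lattice.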

Relevance:

100.00%

Publisher:

Abstract:

In this paper we study two orthogonal extensions of the classical data mining problem of mining association rules, and show how they naturally interact. The first is the extension from a propositional representation to datalog, and the second is the condensed representation of frequent itemsets by means of Formal Concept Analysis (FCA). We combine the notion of frequent datalog queries with iceberg concept lattices (also called closed itemsets) of FCA and introduce two kinds of iceberg query lattices as condensed representations of frequent datalog queries. We demonstrate that iceberg query lattices provide a natural way to visualize relational association rules in a non-redundant way.
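
In the propositional case the condensed representation can be stated as follows (a standard formulation; the paper's datalog generalisation is only summarised, not quoted). With the derivation operator (·)' of the context as above, an itemset X is closed iff X = X'', and the iceberg concept lattice keeps only the concepts whose extent reaches a minimum support threshold:

X \text{ closed} \iff X = X'', \qquad \mathcal{B}_{\mathrm{minsupp}}(G, M, I) = \{\, (A, B) : A' = B,\ B' = A,\ \tfrac{|A|}{|G|} \ge \mathrm{minsupp} \,\}.

The iceberg query lattices introduced in the paper play the analogous role for frequent datalog queries.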

Relevance:

100.00%

Publisher:

Abstract:

Modeling and predicting co-occurrences of events is a fundamental problem of unsupervised learning. In this contribution we develop a statistical framework for analyzing co-occurrence data in a general setting where elementary observations are joint occurrences of pairs of abstract objects from two finite sets. The main challenge for statistical models in this context is to overcome the inherent data sparseness and to estimate the probabilities for pairs which were rarely observed or even unobserved in a given sample set. Moreover, it is often of considerable interest to extract grouping structure or to find a hierarchical data organization. A novel family of mixture models is proposed which explain the observed data by a finite number of shared aspects or clusters. This provides a common framework for statistical inference and structure discovery and also includes several recently proposed models as special cases. Adopting the maximum likelihood principle, EM algorithms are derived to fit the model parameters. We develop improved versions of EM which largely avoid overfitting problems and overcome the inherent locality of EM-based optimization. Among the broad variety of possible applications, e.g., in information retrieval, natural language processing, data mining, and computer vision, we have chosen document retrieval, the statistical analysis of noun/adjective co-occurrence and the unsupervised segmentation of textured images to test and evaluate the proposed algorithms.
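
One member of this family, the two-mode aspect model, and its EM updates can be sketched as follows (a standard formulation assumed here to be representative of the proposed family rather than quoted from it), with n(x, y) the observed co-occurrence count of the pair (x, y):

P(x, y) = \sum_{k=1}^{K} P(k)\, P(x \mid k)\, P(y \mid k),

E-step: \quad P(k \mid x, y) \propto P(k)\, P(x \mid k)\, P(y \mid k),

M-step: \quad P(x \mid k) \propto \sum_{y} n(x, y)\, P(k \mid x, y), \qquad P(y \mid k) \propto \sum_{x} n(x, y)\, P(k \mid x, y), \qquad P(k) \propto \sum_{x, y} n(x, y)\, P(k \mid x, y).

Smoothing the sparse counts n(x, y) through the shared aspects k is what lets the model assign non-zero probability to rarely observed or unobserved pairs.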

Relevance:

100.00%

Publisher:

Abstract:

This seminar is a research discussion around a very interesting problem, which may be a good basis for a WAISfest theme. A little over a year ago Professor Alan Dix came to tell us of his plans for a magnificent adventure: to walk all of the way round Wales, 1000 miles, 'Alan Walks Wales'. The walk was a personal journey, but also a technological and community one, exploring the needs of the walker and the people along the way. Whilst walking he recorded his thoughts in an audio diary, took lots of photos, wrote a blog and collected data from the tech instruments he was wearing. As a result, Alan has extensive quantitative data (bio-sensing and location) and qualitative data (text, images and some audio). There are challenges in analysing the individual kinds of data, including merging similar data streams, entity identification, time-series and textual data mining, dealing with provenance, and ontologies for paths and journeys. There are also challenges for author and third-party annotation, linking the data sets, and visualising the merged narrative or facets of it.