945 results for Data mining models


Relevance: 80.00%

Abstract:

In this paper, we develop a game theoretic approach for clustering features in a learning problem. Feature clustering can serve as an important preprocessing step in many problems such as feature selection and dimensionality reduction. In this approach, we view features as rational players in a coalitional game, where they form coalitions (or clusters) among themselves in order to maximize their individual payoffs. We show how the Nash Stable Partition (NSP), a well-known concept in coalitional game theory, provides a natural way of clustering features. Through this approach, one can obtain desirable properties of the clusters by choosing appropriate payoff functions. For a small number of features, the NSP-based clustering can be found by solving an integer linear program (ILP). However, for a large number of features, the ILP-based approach does not scale well, and hence we propose a hierarchical approach. Interestingly, a key result that we prove on the equivalence between a k-size NSP of a coalitional game and the minimum k-cut of an appropriately constructed graph comes in handy for large-scale problems. In this paper, we use the feature selection problem (in a classification setting) as a running example to illustrate our approach, and we conduct experiments to demonstrate its efficacy.
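The idea of a Nash Stable Partition can be sketched with simple best-response dynamics: each feature repeatedly jumps to whichever cluster maximizes its own payoff, and the partition is stable once no feature wants to move. The snippet below is a minimal illustration only, not the paper's ILP or min k-cut formulation; the payoff function (absolute correlation with coalition members, minus a size penalty `alpha`) is an assumed, illustrative choice.

```python
import numpy as np

def nash_stable_feature_clusters(X, n_clusters=2, alpha=0.5, max_rounds=100, seed=0):
    """Best-response dynamics: each feature moves to the cluster that
    maximizes its own payoff; when no feature wants to move, the
    partition is Nash stable. Payoff (illustrative choice): total
    absolute correlation with the cluster's other members, minus
    alpha per member to discourage one giant coalition."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    sim = np.abs(np.corrcoef(X, rowvar=False))  # feature-feature similarity
    np.fill_diagonal(sim, 0.0)
    labels = rng.integers(n_clusters, size=d)
    for _ in range(max_rounds):
        moved = False
        for i in range(d):
            payoffs = []
            for c in range(n_clusters):
                members = (labels == c)
                members[i] = False  # a feature is not its own coalition partner
                payoffs.append(sim[i, members].sum() - alpha * members.sum())
            best = int(np.argmax(payoffs))
            if payoffs[best] > payoffs[labels[i]]:
                labels[i] = best  # strictly improving move
                moved = True
        if not moved:  # Nash stable: no feature can gain by deviating
            break
    return labels
```

Because the payoff is symmetric and additive, every improving move increases a global potential, so the dynamics terminate; this mirrors why an NSP exists for the class of games the paper considers.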

Relevance: 80.00%

Abstract:

Ranking problems have become increasingly important in machine learning and data mining in recent years, with applications ranging from information retrieval and recommender systems to computational biology and drug discovery. In this paper, we describe a new ranking algorithm that directly maximizes the number of relevant objects retrieved at the absolute top of the list. The algorithm is a support-vector-style algorithm but, due to the different objective, it no longer leads to a quadratic programming problem. Instead, the dual optimization problem involves l1,∞ constraints; we solve this dual problem using the recent l1,∞ projection method of Quattoni et al. (2009). Our algorithm can be viewed as an l∞-norm extreme of the lp-norm based algorithm of Rudin (2009) (albeit in a support vector setting rather than a boosting setting); thus we refer to it as the 'Infinite Push'. Experiments on real-world data sets confirm the algorithm's focus on accuracy at the absolute top of the list.
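The quantity being pushed on, accuracy at the absolute top, can be made concrete as an evaluation metric: the fraction of relevant (positive) items ranked above every irrelevant (negative) item. A minimal sketch, with function name and strict-inequality tie handling chosen here for illustration:

```python
import numpy as np

def pos_at_top(scores, labels):
    """Fraction of positive examples scored strictly above every
    negative example -- the 'accuracy at the absolute top of the
    list' that the Infinite Push objective targets."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    top_neg = scores[labels == 0].max()   # highest-scoring negative
    pos = scores[labels == 1]
    return float((pos > top_neg).mean())
```

A pairwise-AUC style metric averages over all positive-negative pairs; this metric instead depends only on the single worst negative, which is what distinguishes the l∞ ("infinite") push from finite lp pushes.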

Relevance: 80.00%

Abstract:

Structural Support Vector Machines (SSVMs) have recently gained wide prominence for classifying structured and complex objects such as parse trees, image segments and part-of-speech (POS) tags. Typical learning algorithms used to train SSVMs produce model parameter vectors residing in a high-dimensional feature space. Such vectors contain many non-zero components, which often leads to slow prediction and storage issues. Hence there is a need for sparse parameter vectors containing a very small number of non-zero components. The L1 regularizer and the elastic net regularizer have traditionally been used to obtain sparse model parameters. Though L1-regularized structural SVMs have been studied in the past, the use of the elastic net regularizer for structural SVMs has not yet been explored. In this work, we formulate the elastic net SSVM and propose a sequential alternating proximal algorithm to solve the dual formulation. We compare the proposed method with existing methods for L1-regularized structural SVMs. Experiments on large-scale benchmark datasets show that the proposed dual elastic net SSVM, trained using the sequential alternating proximal algorithm, scales well and yields highly sparse model parameters while achieving comparable generalization performance, making it a competitive method for elastic net regularized structural SVMs on very large datasets.
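The building block of any proximal algorithm for elastic net models is the elastic net proximal operator: soft-thresholding (which creates the sparsity) followed by a uniform shrinkage. This is not the paper's sequential alternating algorithm for the dual, just the standard proximal map such methods apply at each step, with the step size assumed folded into `lam1` and `lam2`:

```python
import numpy as np

def prox_elastic_net(v, lam1, lam2):
    """Proximal operator of lam1*||w||_1 + (lam2/2)*||w||_2^2.
    Solving argmin_w 0.5*||w - v||^2 + lam1*||w||_1 + (lam2/2)*||w||_2^2
    gives soft-thresholding by lam1 followed by shrinkage by 1/(1+lam2):
    components with |v_i| <= lam1 are zeroed, producing sparsity."""
    return np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0) / (1.0 + lam2)
```

The L2 term only rescales the surviving components, which is why the elastic net retains L1-style sparsity while adding the stability of ridge regularization.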

Relevance: 80.00%

Abstract:

In this paper, we present a methodology for identifying the best features from a large feature space. In high-dimensional feature spaces, nearest neighbor search becomes unreliable, and both the quality and the performance of the search degrade. Since many data mining algorithms rely on nearest neighbor search, it is preferable to select relevant features rather than search over all of them. We propose feature selection using Non-negative Matrix Factorization (NMF) and apply it to nearest neighbor search. A recent clustering algorithm based on Locally Consistent Concept Factorization (LCCF) achieves better document clustering quality by exploiting the local geometrical and discriminative structure of the data. We show that our feature selection method further improves clustering performance.
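NMF-based feature scoring can be sketched as follows. This is a minimal illustration, not necessarily the paper's exact criterion: the data matrix is factored with the standard Lee-Seung multiplicative updates, and each feature is scored by the norm of its column in the coefficient matrix H (an assumed, common scoring choice).

```python
import numpy as np

def nmf_feature_scores(X, rank=2, n_iter=200, seed=0):
    """Factor a nonnegative matrix X (samples x features) as X ~ W @ H
    via multiplicative updates, then score each feature by the norm of
    its column in H: features that dominate the learned parts-based
    representation get high scores."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.random((n, rank)) + 1e-4
    H = rng.random((rank, d)) + 1e-4
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return np.linalg.norm(H, axis=0)

def select_features(X, k, **kw):
    """Keep the k highest-scoring features; nearest neighbor search
    would then be run on X[:, selected] instead of all of X."""
    scores = nmf_feature_scores(X, **kw)
    return np.argsort(scores)[::-1][:k]
```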

Relevance: 80.00%

Abstract:

Learning from Positive and Unlabelled examples (LPU) has emerged as an important problem in data mining and information retrieval applications. Existing techniques are not ideally suited to real-world scenarios where the datasets are linearly inseparable, as they either build linear classifiers or build non-linear classifiers that fail to achieve the desired performance. In this work, we propose to extend maximum margin clustering ideas and present an iterative procedure to design a non-linear classifier for LPU. In particular, we build a least squares support vector classifier, which is suited to this problem due to the symmetry of its loss function. Further, we present techniques for appropriately initializing the labels of the unlabelled examples and for enforcing the desired ratio of positive to negative examples while obtaining these labels. Experiments on real-world datasets demonstrate that the non-linear classifier designed using the proposed approach gives significantly better generalization performance than existing approaches for LPU.
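The class-ratio constraint mentioned above can be sketched simply: given classifier scores for the unlabelled examples and a target positive fraction, label the top-ranked fraction positive and the rest negative. This sketch is our own illustration of the idea, not the paper's exact procedure:

```python
import numpy as np

def label_with_ratio(scores, pos_fraction):
    """Assign +1/-1 labels to unlabelled examples so that the desired
    fraction (by score rank) is labelled positive -- the kind of
    class-ratio constraint used when initializing labels in LPU."""
    scores = np.asarray(scores, dtype=float)
    n_pos = int(round(pos_fraction * len(scores)))
    order = np.argsort(scores)[::-1]      # indices from highest score down
    labels = -np.ones(len(scores), dtype=int)
    labels[order[:n_pos]] = 1             # top-ranked fraction is positive
    return labels
```

In an iterative LPU procedure this step would be repeated: retrain on the current labels, rescore the unlabelled examples, and relabel under the same ratio until the labels stabilize.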

Relevance: 80.00%

Abstract:

Data clustering is a common technique for statistical data analysis that is used in many fields, including machine learning and data mining. Clustering is the partitioning of a data set into subsets (clusters) so that the data in each subset (ideally) share some common trait according to a defined distance measure. In this paper we present a genetically improved version of the particle swarm optimization (PSO) algorithm, a population-based heuristic search technique derived from the analysis of particle swarm intelligence and the concepts of genetic algorithms (GA). The algorithm combines PSO concepts such as the velocity and position update rules with GA concepts such as selection, crossover and mutation. The performance of the proposed algorithm is evaluated on benchmark datasets from the Machine Learning Repository, where it outperforms both k-means and the standard PSO algorithm.
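The PSO velocity and position update rules referred to above can be sketched in a few lines. This shows only the standard PSO step, not the paper's genetic hybridization; the inertia and acceleration coefficients (`w`, `c1`, `c2`) are common textbook defaults, not values from the paper:

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO iteration: the new velocity blends inertia (w), a random
    pull toward each particle's personal best (c1) and a random pull
    toward the swarm's global best (c2); positions then move along the
    updated velocities."""
    rng = rng or np.random.default_rng(0)
    r1 = rng.random(pos.shape)
    r2 = rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel
```

In the hybrid described by the abstract, GA-style selection, crossover and mutation would be interleaved with this step to inject diversity into the swarm; for clustering, each particle typically encodes a full set of candidate cluster centroids.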

Relevance: 80.00%

Abstract:

Query suggestion has become an important feature of search engines given the explosive and diverse growth of web content. Suggestions of various kinds, covering queries, images, movies, music, books and more, are used every day, drawing on various types of data sources. If we model the data as graphs, we can build a general method for any kind of suggestion. In this paper, we propose a general method for query suggestion that combines two graphs: (1) a query click graph, which captures the relationship between queries frequently clicked for common URLs, and (2) a query text similarity graph, which measures the similarity between two queries using Jaccard similarity. The proposed method provides lexically as well as semantically relevant queries for users' needs. Simulation results show that the proposed algorithm outperforms the heat diffusion method by providing more relevant queries. It can also be used for recommendation tasks such as query, image, and product suggestion.
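The edge weights of the query text similarity graph come from Jaccard similarity, which for two queries treated as sets of terms is the size of the intersection over the size of the union. A minimal sketch (whitespace tokenization is an assumed simplification):

```python
def jaccard(q1, q2):
    """Jaccard similarity between two queries treated as term sets:
    |A ∩ B| / |A ∪ B|, in [0, 1]."""
    a = set(q1.lower().split())
    b = set(q2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

In the combined method, an edge between two queries in the text similarity graph would carry this score, to be fused with the co-click weights from the query click graph.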

Relevance: 80.00%

Abstract:

Infrared magnitude-redshift relations for the 3CR and 6C samples of radio galaxies are presented for a wide range of plausible cosmological models, including those with a non-zero cosmological constant OmegaLambda. Variations in the galaxy formation redshift, metallicity and star formation history are also considered. The results of the modelling are displayed in terms of magnitude differences between the models and no-evolution tracks, illustrating the amount of K-band evolution necessary to account for the observational data. Given a number of plausible assumptions, the results of these analyses suggest that: (i) cosmologies which predict T_0 x H_0 > 1 (where T_0 denotes the current age of the universe) can be excluded; (ii) the star formation redshift should lie in the redshift interval 5 < […] data; (iv) models with finite values of OmegaLambda can provide good agreement with the observations only if appropriate adjustments of other parameters, such as the galaxy metallicities and star formation histories, are made. Without such modifications, even after accounting for stellar evolution, the high-redshift radio galaxies are more luminous (i.e. more massive) than those nearby in models with finite OmegaLambda, including the favoured model with Omega = 0.3, OmegaLambda = 0.7. For cosmological models with larger values of T_0 x H_0, the conclusions are the same regardless of whether any adjustments are made. The implications of these results for cosmology and models of galaxy formation are discussed.

Relevance: 80.00%

Abstract:

The thesis contains four main chapters. The first reviews the concept of data mining and its typology from the perspective of survey data analysis. A distinction is drawn between exploratory and predictive techniques, with emphasis on component analysis, simple and multiple correspondence analysis and clustering on the one hand, and on the PLS path modelling methodology and logit models on the other. In the next chapter, the above methods are applied to data obtained from an online survey on satisfaction with an institution and on the viability of a shop selling corporate products bearing its logo, comparing the results of the different techniques employed. The following chapter deals with a technique, related to the exploratory techniques presented earlier, concerning the situation that arises when several data tables are to be analysed simultaneously and in a balanced way; in particular, the problem posed when those tables contain different individuals and different numbers of individuals. A modification of the original method that permits such an analysis is presented, and its effectiveness is tested through a small simulation exercise as well as the practical analysis of a real survey on social inequality in a set of 10 different countries. Finally, the last chapter considers the case in which answers to different types of questions are to be analysed in an exploratory setting; in particular, when the questions give rise to continuous variables, categorical variables, and frequencies derived from text corpora generated from the answers to an open-ended question. The specific situation considered is one in which there are two groups of respondents, distinguished by the language in which they answer, generating separate corpora. A possible way of handling this situation is shown, using the same survey as in the first chapter.

Relevance: 80.00%

Abstract:

This report describes cases relating to the management of national marine sanctuaries in which certain scientific information was required so managers could make decisions that effectively protected trust resources. The cases presented represent only a fraction of the difficult issues that marine sanctuary managers deal with daily. They include, among others, problems related to wildlife disturbance, vessel routing, marine reserve placement, watershed management, oil spill response, and habitat restoration. Scientific approaches to addressing these problems vary significantly, and include literature surveys, data mining, field studies (monitoring, mapping, observations, and measurement), geospatial and biogeographic analysis, and modeling. In most cases there is also an element of expert consultation and collaboration among multiple partners, agencies with resource protection responsibilities, and other users and stakeholders. The resulting management responses may involve direct intervention (e.g., for spill response or habitat restoration issues), proposal of boundary alternatives for marine sanctuaries or reserves, changes in agency policy or regulations, recommendations to other agencies with resource protection responsibilities, proposed changes to international or domestic shipping rules, or development of new education or outreach programs. (PDF contains 37 pages.)

Relevance: 80.00%

Abstract:

As academic libraries are increasingly supported by a matrix of database functions, the use of data mining and visualization techniques offers significant potential for future collection development and service initiatives based on quantifiable data. While data collection techniques are still not standardized and results may be skewed by granularity problems, faulty algorithms, and a host of other factors, useful baseline data are extractable and broad trends can be identified. The purpose of the current study is to provide an initial assessment of data associated with the science monograph collection at the Marston Science Library (MSL), University of Florida. These sciences fall within the major Library of Congress Classification schedules of Q, S, and T, excluding R, TN, TR, and TT. The overall strategy of this project is to look at the potential science audiences within the university community and analyze data related to purchasing and circulation patterns, e-book usage, and interlibrary loan statistics. While a longitudinal study from 2004 to the present would be ideal, this paper presents results from the academic year July 1, 2008 to June 30, 2009, which was chosen as the pilot period because all of the data reservoirs identified above were available.

Relevance: 80.00%

Abstract:

This project describes how to build a gradient boosting predictive model to forecast the number of online sales of a product X, identified only by its ID number, taking into account advertising campaigns and the product's qualitative and quantitative characteristics. The different techniques employed, such as cross-validation and blending, are used and explained. The goal of the project is to implement the model, explain each technique and tool used precisely, and obtain a valid result for the Kaggle competition named Online Product Sales.
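The two supporting techniques named above can be sketched in their simplest forms: k-fold cross-validation splits the samples into disjoint validation folds, and blending combines several models' predictions by a (weighted) average. This is a generic illustration, not the project's actual pipeline:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n sample indices and split them into k disjoint
    validation folds for cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def blend(predictions, weights=None):
    """Blending in its simplest form: a (weighted) average of the
    predictions of several models. `predictions` is a list of
    per-model prediction arrays of equal length."""
    preds = np.asarray(predictions, dtype=float)
    if weights is None:
        weights = np.full(len(preds), 1.0 / len(preds))
    return np.average(preds, axis=0, weights=weights)
```

In practice the cross-validation scores of the individual models are often used to choose the blending weights, giving better-performing models a larger say.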

Relevance: 80.00%

Abstract:

The CTC algorithm (Consolidated Tree Construction algorithm) is a machine learning paradigm that was designed to solve a class imbalance problem: a fraud detection problem in the area of car insurance [1] where, in addition, an explanation of the classification made was required. The algorithm is based on a decision tree construction algorithm, in this case the well-known C4.5, but it extracts knowledge from the data using a set of samples instead of the single sample C4.5 uses. In contrast to other methodologies that build a classifier from several samples, such as bagging, CTC builds a single tree and, as a consequence, obtains comprehensible classifiers. The main motivation of this implementation is to make an implementation of the CTC algorithm public and available. For this purpose we have implemented the algorithm within the well-known WEKA data mining environment (http://www.cs.waikato.ac.nz/ml/weka/). WEKA is an open source project that contains a collection of machine learning algorithms written in Java for data mining tasks. J48 is the implementation of the C4.5 algorithm within the WEKA package; we call our implementation of the CTC algorithm, which is based on the J48 Java class, J48Consolidated.