868 resultados para Data mining


Relevância:

60.00% 60.00%

Publicador:

Resumo:

Ranking problems have become increasingly important in machine learning and data mining in recent years, with applications ranging from information retrieval and recommender systems to computational biology and drug discovery. In this paper, we describe a new ranking algorithm that directly maximizes the number of relevant objects retrieved at the absolute top of the list. The algorithm is a support vector style algorithm, but due to the different objective, it no longer leads to a quadratic programming problem. Instead, the dual optimization problem involves l1, ∞ constraints; we solve this dual problem using the recent l1, ∞ projection method of Quattoni et al (2009). Our algorithm can be viewed as an l∞-norm extreme of the lp-norm based algorithm of Rudin (2009) (albeit in a support vector setting rather than a boosting setting); thus we refer to the algorithm as the ‘Infinite Push’. Experiments on real-world data sets confirm the algorithm’s focus on accuracy at the absolute top of the list.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The rapid emergence of infectious diseases calls for immediate attention to determine practical solutions for intervention strategies. To this end, it becomes necessary to obtain a holistic view of the complex hostpathogen interactome. Advances in omics and related technology have resulted in massive generation of data for the interacting systems at unprecedented levels of detail. Systems-level studies with the aid of mathematical tools contribute to a deeper understanding of biological systems, where intuitive reasoning alone does not suffice. In this review, we discuss different aspects of hostpathogen interactions (HPIs) and the available data resources and tools used to study them. We discuss in detail models of HPIs at various levels of abstraction, along with their applications and limitations. We also enlist a few case studies, which incorporate different modeling approaches, providing significant insights into disease. (c) 2013 Wiley Periodicals, Inc.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Structural Support Vector Machines (SSVMs) have recently gained wide prominence in classifying structured and complex objects like parse-trees, image segments and Part-of-Speech (POS) tags. Typical learning algorithms used in training SSVMs result in model parameters which are vectors residing in a large-dimensional feature space. Such a high-dimensional model parameter vector contains many non-zero components which often lead to slow prediction and storage issues. Hence there is a need for sparse parameter vectors which contain a very small number of non-zero components. L1-regularizer and elastic net regularizer have been traditionally used to get sparse model parameters. Though L1-regularized structural SVMs have been studied in the past, the use of elastic net regularizer for structural SVMs has not been explored yet. In this work, we formulate the elastic net SSVM and propose a sequential alternating proximal algorithm to solve the dual formulation. We compare the proposed method with existing methods for L1-regularized Structural SVMs. Experiments on large-scale benchmark datasets show that the proposed dual elastic net SSVM trained using the sequential alternating proximal algorithm scales well and results in highly sparse model parameters while achieving a comparable generalization performance. Hence the proposed sequential alternating proximal algorithm is a competitive method to achieve sparse model parameters and a comparable generalization performance when elastic net regularized Structural SVMs are used on very large datasets.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In this paper, we present a methodology for identifying best features from a large feature space. In high dimensional feature space nearest neighbor search is meaningless. In this feature space we see quality and performance issue with nearest neighbor search. Many data mining algorithms use nearest neighbor search. So instead of doing nearest neighbor search using all the features we need to select relevant features. We propose feature selection using Non-negative Matrix Factorization(NMF) and its application to nearest neighbor search. Recent clustering algorithm based on Locally Consistent Concept Factorization(LCCF) shows better quality of document clustering by using local geometrical and discriminating structure of the data. By using our feature selection method we have shown further improvement of performance in the clustering.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Learning from Positive and Unlabelled examples (LPU) has emerged as an important problem in data mining and information retrieval applications. Existing techniques are not ideally suited for real world scenarios where the datasets are linearly inseparable, as they either build linear classifiers or the non-linear classifiers fail to achieve the desired performance. In this work, we propose to extend maximum margin clustering ideas and present an iterative procedure to design a non-linear classifier for LPU. In particular, we build a least squares support vector classifier, suitable for handling this problem due to symmetry of its loss function. Further, we present techniques for appropriately initializing the labels of unlabelled examples and for enforcing the ratio of positive to negative examples while obtaining these labels. Experiments on real-world datasets demonstrate that the non-linear classifier designed using the proposed approach gives significantly better generalization performance than the existing relevant approaches for LPU.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We study the problem of analyzing influence of various factors affecting individual messages posted in social media. The problem is challenging because of various types of influences propagating through the social media network that act simultaneously on any user. Additionally, the topic composition of the influencing factors and the susceptibility of users to these influences evolve over time. This problem has not been studied before, and off-the-shelf models are unsuitable for this purpose. To capture the complex interplay of these various factors, we propose a new non-parametric model called the Dynamic Multi-Relational Chinese Restaurant Process. This accounts for the user network for data generation and also allows the parameters to evolve over time. Designing inference algorithms for this model suited for large scale social-media data is another challenge. To this end, we propose a scalable and multi-threaded inference algorithm based on online Gibbs Sampling. Extensive evaluations on large-scale Twitter and Face book data show that the extracted topics when applied to authorship and commenting prediction outperform state-of-the-art baselines. More importantly, our model produces valuable insights on topic trends and user personality trends beyond the capability of existing approaches.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning and data mining. Clustering is grouping of a data set or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait according to some defined distance measure. In this paper we present the genetically improved version of particle swarm optimization algorithm which is a population based heuristic search technique derived from the analysis of the particle swarm intelligence and the concepts of genetic algorithms (GA). The algorithm combines the concepts of PSO such as velocity and position update rules together with the concepts of GA such as selection, crossover and mutation. The performance of the above proposed algorithm is evaluated using some benchmark datasets from Machine Learning Repository. The performance of our method is better than k-means and PSO algorithm.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Elastic Net Regularizers have shown much promise in designing sparse classifiers for linear classification. In this work, we propose an alternating optimization approach to solve the dual problems of elastic net regularized linear classification Support Vector Machines (SVMs) and logistic regression (LR). One of the sub-problems turns out to be a simple projection. The other sub-problem can be solved using dual coordinate descent methods developed for non-sparse L2-regularized linear SVMs and LR, without altering their iteration complexity and convergence properties. Experiments on very large datasets indicate that the proposed dual coordinate descent - projection (DCD-P) methods are fast and achieve comparable generalization performance after the first pass through the data, with extremely sparse models.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The performance of prediction models is often based on ``abstract metrics'' that estimate the model's ability to limit residual errors between the observed and predicted values. However, meaningful evaluation and selection of prediction models for end-user domains requires holistic and application-sensitive performance measures. Inspired by energy consumption prediction models used in the emerging ``big data'' domain of Smart Power Grids, we propose a suite of performance measures to rationally compare models along the dimensions of scale independence, reliability, volatility and cost. We include both application independent and dependent measures, the latter parameterized to allow customization by domain experts to fit their scenario. While our measures are generalizable to other domains, we offer an empirical analysis using real energy use data for three Smart Grid applications: planning, customer education and demand response, which are relevant for energy sustainability. Our results underscore the value of the proposed measures to offer a deeper insight into models' behavior and their impact on real applications, which benefit both data mining researchers and practitioners.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The rapid emergence of infectious diseases calls for immediate attention to determine practical solutions for intervention strategies. To this end, it becomes necessary to obtain a holistic view of the complex hostpathogen interactome. Advances in omics and related technology have resulted in massive generation of data for the interacting systems at unprecedented levels of detail. Systems-level studies with the aid of mathematical tools contribute to a deeper understanding of biological systems, where intuitive reasoning alone does not suffice. In this review, we discuss different aspects of hostpathogen interactions (HPIs) and the available data resources and tools used to study them. We discuss in detail models of HPIs at various levels of abstraction, along with their applications and limitations. We also enlist a few case studies, which incorporate different modeling approaches, providing significant insights into disease. (c) 2013 Wiley Periodicals, Inc.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Query suggestion is an important feature of the search engine with the explosive and diverse growth of web contents. Different kind of suggestions like query, image, movies, music and book etc. are used every day. Various types of data sources are used for the suggestions. If we model the data into various kinds of graphs then we can build a general method for any suggestions. In this paper, we have proposed a general method for query suggestion by combining two graphs: (1) query click graph which captures the relationship between queries frequently clicked on common URLs and (2) query text similarity graph which finds the similarity between two queries using Jaccard similarity. The proposed method provides literally as well as semantically relevant queries for users' need. Simulation results show that the proposed algorithm outperforms heat diffusion method by providing more number of relevant queries. It can be used for recommendation tasks like query, image, and product suggestion.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

La tesis contiene 4 capítulos principales. El primero de ellos recapitula sobre el concepto de data mining y su tipología, desde la perspectiva del análisis de datos de encuestas. Se realiza una clasificación entre técnicas exploratorias y técnicas predictivas, poniendo el énfasis en los análisis de componentes, de correspondencias simples, múltiples y clasificación, por un lado, y la metodología PLS path modelling y modelos Logit por otro. En el siguiente capítulo se realiza una aplicación de los métodos anteriores sobre los datos obtenidos de una encuesta on-line sobre satisfacción respecto a una institución y la viabilidad de una tienda de productos corporativos con el logotipo de la misma, comparando los resultados de las diferentes técnicas empleadas. El siguiente capítulo trata sobre una técnica relacionada con las técnicas exploratorias expuestas anteriormente que tiene que ver con la situación que se produce cuando se quieren analizar varias tablas de datos simultáneamente y de forma equilibrada. En particular trata sobre el problema que se presenta cuando esas tablas contienen distintos y distinto número de individuos. Se presenta una modificación del método original que permite dicho análisis y cuya efectividad es probada mediante un pequeño ejercicio de simulación así como el análisis práctico de una encuesta real sobre desigualdad social en un conjunto de 10 países diferentes. Para acabar, el último capítulo considera el caso en el que se quieren analizar respuestas a diferentes tipos de preguntas en un análisis de tipo exploratorio. En particular, cuando las preguntas dan lugar a variables continuas, categóricas y frecuencias provenientes de corpus textuales generados a partir de las respuestas a una pregunta abierta. Se considera en concreto la situación producida cuando existen dos tipos de entrevistados diferenciados por el idioma en que contestan, generando corpus distintos. Se muestra una posible manera de tratar esta situación, utilizando para ello la misma encuesta del primer capítulo.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This report describes cases relating to the management of national marine sanctuaries in which certain scientific information was required so managers could make decisions that effectively protected trust resources. The cases presented represent only a fraction of difficult issues that marine sanctuary managers deal with daily. They include, among others, problems related to wildlife disturbance, vessel routing, marine reserve placement, watershed management, oil spill response, and habitat restoration. Scientific approaches to address these problems vary significantly, and include literature surveys, data mining, field studies (monitoring, mapping, observations, and measurement), geospatial and biogeographic analysis, and modeling. In most cases there is also an element of expert consultation and collaboration among multiple partners, agencies with resource protection responsibilities, and other users and stakeholders. The resulting management responses may involve direct intervention (e.g., for spill response or habitat restoration issues), proposal of boundary alternatives for marine sanctuaries or reserves, changes in agency policy or regulations, making recommendations to other agencies with resource protection responsibilities, proposing changes to international or domestic shipping rules, or development of new education or outreach programs. (PDF contains 37 pages.)