818 results for MULTI-RELATIONAL DATA MINING


Relevance:

100.00%

Publisher:

Abstract:

The Australia Telescope Low-brightness Survey (ATLBS) regions have been mosaic imaged at a radio frequency of 1.4 GHz with 6″ angular resolution and 72 μJy beam⁻¹ rms noise. The images (centered at R.A. 00h35m00s, decl. −67°00′00″ and R.A. 00h59m17s, decl. −67°00′00″, J2000 epoch) cover 8.42 deg² of sky area and have no artifacts or imaging errors above the image thermal noise. Multi-resolution radio and optical r-band images (made using the 4 m CTIO Blanco telescope) were used to recognize multi-component sources and prepare a source list; the detection threshold was 0.38 mJy in a low-resolution radio image made with a beam FWHM of 50″. Radio source counts in the flux density range 0.4-8.7 mJy are estimated, with corrections applied for noise bias, effective area, and resolution bias. The resolution bias is mitigated using low-resolution radio images, while effects of source confusion are removed by using high-resolution images to identify blended sources. Below 1 mJy the ATLBS counts are systematically lower than previous estimates. Showing no evidence for an upturn down to 0.4 mJy, they do not require any changes in the radio source population down to the limit of the survey. The work suggests that automated image analysis for counts may depend on the ability of the imaging to reproduce connecting emission with low surface brightness and on the ability of the algorithm to recognize sources, which may require that source-finding algorithms work effectively with multi-resolution and multi-wavelength data. The work underscores the importance of using source lists, as opposed to component lists, and of correcting for the noise bias in order to precisely estimate counts close to the image noise and determine the upturn at sub-mJy flux density.

Relevance:

100.00%

Publisher:

Abstract:

Fast content addressable data access mechanisms have compelling applications in today's systems. Many of these exploit the powerful wildcard matching capabilities provided by ternary content addressable memories (TCAMs). For example, TCAM-based implementations of important data mining algorithms have been developed in recent years; these achieve an order-of-magnitude speedup over prevalent techniques. However, large hardware TCAMs are still prohibitively expensive in terms of power consumption and cost per bit. This has been a barrier to extending their exploitation beyond niche and special purpose systems. We propose an approach to overcome this barrier by extending the traditional virtual memory hierarchy to scale up the user-visible capacity of TCAMs while mitigating the power consumption overhead. By exploiting the notion of content locality (as opposed to spatial locality), we devise a novel combination of software and hardware techniques to provide an abstraction of a large virtual ternary content addressable space. In the long run, such abstractions enable applications to disassociate considerations of spatial locality and contiguity from the way data is referenced. If successful, ideas for making content addressability a first-class abstraction in computing systems can open up a radical shift in the way applications are optimized for memory locality, just as storage class memories are soon expected to shift away from the way in which applications are typically optimized for disk access locality.
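
To make the wildcard-matching capability concrete, below is a minimal software model of ternary matching, the operation a hardware TCAM performs over all entries in parallel; the SoftTCAM class, its entry layout, and the example rules are purely illustrative assumptions, not part of the proposed hardware/software design.

    # Minimal software model of ternary (wildcard) matching, the operation a
    # hardware TCAM performs in parallel. Pattern symbols: '0', '1', or '*'
    # (don't care). The table layout and API names are illustrative only.

    def ternary_match(pattern, key):
        """Return True if the bit-string `key` matches the ternary `pattern`."""
        return len(pattern) == len(key) and all(
            p == '*' or p == k for p, k in zip(pattern, key))

    class SoftTCAM:
        def __init__(self):
            self.entries = []          # (pattern, value), in priority order

        def add(self, pattern, value):
            self.entries.append((pattern, value))

        def lookup(self, key):
            # A real TCAM compares all entries in one cycle; here we scan.
            for pattern, value in self.entries:
                if ternary_match(pattern, key):
                    return value
            return None

    tcam = SoftTCAM()
    tcam.add('10**', 'rule-A')
    tcam.add('1***', 'rule-B')
    print(tcam.lookup('1011'))         # 'rule-A' (first, highest-priority match)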

Relevance:

100.00%

Publisher:

Abstract:

Users can rarely reveal their information need in full detail to a search engine within 1-2 words, so search engines need to "hedge their bets" and present diverse results within the precious 10 response slots. Diversity in ranking is of much recent interest. Most existing solutions estimate the marginal utility of an item given a set of items already in the response, and then use variants of greedy set cover. Others design graphs with the items as nodes and choose diverse items based on visit rates (PageRank). Here we introduce a radically new and natural formulation of diversity as finding centers in resistive graphs. Unlike in PageRank, we do not specify the edge resistances (equivalently, conductances) and ask for node visit rates. Instead, we look for a sparse set of center nodes so that the effective conductance from the center to the rest of the graph has maximum entropy. We give a cogent semantic justification for turning PageRank thus on its head. In marked deviation from prior work, our edge resistances are learnt from training data. Inference and learning are NP-hard, but we give practical solutions. In extensive experiments with subtopic retrieval, social network search, and document summarization, our approach convincingly surpasses recently published diversity algorithms like subtopic cover, max-marginal relevance (MMR), Grasshopper, DivRank, and SVMdiv.
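
For contrast with the baselines named above, here is a minimal sketch of greedy max-marginal relevance (MMR), one of the compared diversity algorithms; it is not the resistive-graph center formulation proposed in the paper, and the relevance scores, similarity matrix, and trade-off parameter lambda_ are assumed inputs.

    # Greedy max-marginal relevance (MMR), one of the baselines named above.
    # `relevance[i]` and the pairwise `similarity` matrix are assumed given;
    # lambda_ trades off relevance against redundancy.
    import numpy as np

    def mmr_select(relevance, similarity, k, lambda_=0.7):
        relevance = np.asarray(relevance, dtype=float)
        n = len(relevance)
        selected, remaining = [], set(range(n))
        while remaining and len(selected) < k:
            def score(i):
                redundancy = max((similarity[i][j] for j in selected), default=0.0)
                return lambda_ * relevance[i] - (1.0 - lambda_) * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected

    rel = [0.9, 0.85, 0.3, 0.7]
    sim = np.array([[1.0, 0.95, 0.1, 0.2],
                    [0.95, 1.0, 0.1, 0.2],
                    [0.1, 0.1, 1.0, 0.3],
                    [0.2, 0.2, 0.3, 1.0]])
    print(mmr_select(rel, sim, k=2))   # picks item 0, then the non-redundant 3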

Relevance:

100.00%

Publisher:

Abstract:

In many real-world prediction problems the output is a structured object such as a sequence, a tree, or a graph. Such problems range from natural language processing to computational biology and computer vision, and have been tackled using algorithms referred to as structured output learning algorithms. We consider the problem of structured classification. In the last few years, large margin classifiers like support vector machines (SVMs) have shown much promise for structured output learning. The related optimization problem is a convex quadratic program (QP) with a large number of constraints, which makes the problem intractable for large data sets. This paper proposes a fast sequential dual method (SDM) for structural SVMs. The method makes repeated passes over the training set and optimizes the dual variables associated with one example at a time. The use of additional heuristics makes the proposed method more efficient. We present an extensive empirical evaluation of the proposed method on several sequence learning problems. Our experiments on large data sets demonstrate that the proposed method is an order of magnitude faster than state-of-the-art methods like the cutting-plane method and stochastic gradient descent (SGD). Further, SDM reaches steady-state generalization performance faster than SGD. The proposed SDM is thus a useful alternative for large-scale structured output learning.
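
As an illustration of the "one example at a time" dual optimization style described above, the sketch below applies the standard sequential dual coordinate update to a plain binary linear SVM with hinge loss; it is a simplified stand-in, not the structural-SVM SDM of the paper, and the synthetic data, epoch count, and lack of a stopping rule are assumptions.

    # Sequential dual updates for a *binary* linear SVM with hinge loss --
    # a simplified stand-in for the per-example dual optimization described
    # above (the structural case optimizes a set of dual variables per example).
    import numpy as np

    def dcd_linear_svm(X, y, C=1.0, epochs=10):
        n, d = X.shape
        alpha = np.zeros(n)
        w = np.zeros(d)
        Qii = np.einsum('ij,ij->i', X, X)       # squared row norms
        for _ in range(epochs):
            for i in np.random.permutation(n):  # one example at a time
                if Qii[i] == 0.0:
                    continue
                g = y[i] * X[i].dot(w) - 1.0    # gradient of the dual at alpha_i
                a_new = min(max(alpha[i] - g / Qii[i], 0.0), C)
                w += (a_new - alpha[i]) * y[i] * X[i]
                alpha[i] = a_new
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
    w = dcd_linear_svm(X, y)
    print((np.sign(X.dot(w)) == y).mean())      # training accuracy of the sketch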

Relevance:

100.00%

Publisher:

Abstract:

In this paper, we develop a game theoretic approach for clustering features in a learning problem. Feature clustering can serve as an important preprocessing step in many problems such as feature selection, dimensionality reduction, etc. In this approach, we view features as rational players of a coalitional game where they form coalitions (or clusters) among themselves in order to maximize their individual payoffs. We show how the Nash Stable Partition (NSP), a well-known concept in coalitional game theory, provides a natural way of clustering features. Through this approach, one can obtain desirable properties of the clusters by choosing appropriate payoff functions. For a small number of features, the NSP-based clustering can be found by solving an integer linear program (ILP). However, for a large number of features, the ILP-based approach does not scale well and hence we propose a hierarchical approach. Interestingly, a key result that we prove on the equivalence between a k-size NSP of a coalitional game and a minimum k-cut of an appropriately constructed graph comes in handy for large-scale problems. In this paper, we use the feature selection problem (in a classification setting) as a running example to illustrate our approach. We conduct experiments to illustrate the efficacy of our approach.
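
A rough sketch of the Nash-stability idea: each feature repeatedly moves to the coalition that maximizes its own payoff until no feature wants to deviate. The payoff used here (sum of absolute correlations with the other coalition members) and the best-response loop are illustrative assumptions; the paper instead solves an ILP for small problems and exploits the NSP / minimum k-cut equivalence for large ones.

    # Best-response local search toward a Nash-stable partition of features.
    # The payoff (sum of |correlation| with the other members of a coalition)
    # is an illustrative assumption, not the paper's exact formulation.
    import numpy as np

    def feature_coalitions(X, n_coalitions=3, max_rounds=50, seed=0):
        d = X.shape[1]
        sim = np.abs(np.corrcoef(X, rowvar=False))   # feature-feature similarity
        rng = np.random.default_rng(seed)
        assign = rng.integers(n_coalitions, size=d)

        def payoff(i, c):
            members = np.flatnonzero(assign == c)
            return sim[i, members[members != i]].sum()

        for _ in range(max_rounds):
            moved = False
            for i in range(d):
                best = max(range(n_coalitions), key=lambda c: payoff(i, c))
                if best != assign[i]:                # profitable deviation exists
                    assign[i] = best
                    moved = True
            if not moved:                            # no feature wants to move:
                break                                # the partition is stable
        return assign

    rng = np.random.default_rng(1)
    base = rng.normal(size=(300, 2))
    X = np.column_stack([base[:, 0], base[:, 0] + 0.01 * rng.normal(size=300),
                         base[:, 1], base[:, 1] + 0.01 * rng.normal(size=300)])
    print(feature_coalitions(X, n_coalitions=2))     # near-duplicate features co-cluster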

Relevance:

100.00%

Publisher:

Abstract:

Ranking problems have become increasingly important in machine learning and data mining in recent years, with applications ranging from information retrieval and recommender systems to computational biology and drug discovery. In this paper, we describe a new ranking algorithm that directly maximizes the number of relevant objects retrieved at the absolute top of the list. The algorithm is a support vector style algorithm, but due to the different objective, it no longer leads to a quadratic programming problem. Instead, the dual optimization problem involves ℓ1,∞ constraints; we solve this dual problem using the recent ℓ1,∞ projection method of Quattoni et al. (2009). Our algorithm can be viewed as an ℓ∞-norm extreme of the ℓp-norm-based algorithm of Rudin (2009) (albeit in a support vector setting rather than a boosting setting); thus we refer to the algorithm as the ‘Infinite Push’. Experiments on real-world data sets confirm the algorithm’s focus on accuracy at the absolute top of the list.
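
To make "accuracy at the absolute top" concrete, the snippet below computes an infinite-push-style penalty: the maximum, over negative examples, of how many positives each negative outranks (the ℓ∞ counterpart of the p-norm push). This is only the raw quantity being targeted, evaluated on assumed toy scores; the actual algorithm minimizes a margin-based surrogate through the ℓ1,∞-constrained dual.

    # Illustrative computation of an "infinite push"-style penalty: the l_inf
    # norm (maximum), over negatives, of how many positives each negative
    # outranks. The real algorithm optimizes a margin-based surrogate of this.
    import numpy as np

    def infinite_push_penalty(scores_pos, scores_neg):
        scores_pos = np.asarray(scores_pos)
        # For each negative, count the positives it ties with or beats,
        # then take the worst (maximum) over negatives.
        per_negative = [(scores_pos <= s_neg).sum() for s_neg in scores_neg]
        return max(per_negative)

    pos = [2.5, 1.9, 0.8]          # scores of relevant items
    neg = [0.5, 1.0, 2.0]          # scores of irrelevant items
    print(infinite_push_penalty(pos, neg))   # 2: one negative outranks two positives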

Relevance:

100.00%

Publisher:

Abstract:

The rapid emergence of infectious diseases calls for immediate attention to determine practical solutions for intervention strategies. To this end, it becomes necessary to obtain a holistic view of the complex host-pathogen interactome. Advances in omics and related technology have resulted in massive generation of data for the interacting systems at unprecedented levels of detail. Systems-level studies with the aid of mathematical tools contribute to a deeper understanding of biological systems, where intuitive reasoning alone does not suffice. In this review, we discuss different aspects of host-pathogen interactions (HPIs) and the available data resources and tools used to study them. We discuss in detail models of HPIs at various levels of abstraction, along with their applications and limitations. We also present a few case studies, which incorporate different modeling approaches, providing significant insights into disease. (c) 2013 Wiley Periodicals, Inc.

Relevance:

100.00%

Publisher:

Abstract:

Structural Support Vector Machines (SSVMs) have recently gained wide prominence in classifying structured and complex objects like parse trees, image segments and Part-of-Speech (POS) tags. Typical learning algorithms used in training SSVMs result in model parameters which are vectors residing in a large-dimensional feature space. Such a high-dimensional model parameter vector contains many non-zero components, which often leads to slow prediction and storage issues. Hence there is a need for sparse parameter vectors that contain a very small number of non-zero components. The L1 regularizer and the elastic net regularizer have traditionally been used to obtain sparse model parameters. Though L1-regularized structural SVMs have been studied in the past, the use of the elastic net regularizer for structural SVMs has not been explored yet. In this work, we formulate the elastic net SSVM and propose a sequential alternating proximal algorithm to solve the dual formulation. We compare the proposed method with existing methods for L1-regularized structural SVMs. Experiments on large-scale benchmark datasets show that the proposed dual elastic net SSVM trained using the sequential alternating proximal algorithm scales well and results in highly sparse model parameters while achieving comparable generalization performance. Hence the proposed sequential alternating proximal algorithm is a competitive method for achieving sparse model parameters and comparable generalization performance when elastic net regularized structural SVMs are used on very large datasets.
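
As a small building block behind such proximal methods, the sketch below evaluates the proximal operator of the elastic-net penalty, which reduces to soft-thresholding followed by a uniform shrinkage; the penalty parameters are assumed values, and this is not the paper's full sequential alternating proximal algorithm.

    # Proximal operator of the elastic-net penalty
    #   lambda1 * ||w||_1 + (lambda2 / 2) * ||w||_2^2
    # i.e. soft-thresholding followed by a uniform shrinkage. This is only a
    # building block one would use inside a proximal method, not the paper's
    # full sequential alternating procedure.
    import numpy as np

    def prox_elastic_net(v, lambda1, lambda2):
        soft = np.sign(v) * np.maximum(np.abs(v) - lambda1, 0.0)  # l1 part
        return soft / (1.0 + lambda2)                             # l2 part

    v = np.array([3.0, -0.2, 0.5, -2.0])
    print(prox_elastic_net(v, lambda1=0.5, lambda2=1.0))
    # small entries are zeroed, the larger ones are shrunk toward 0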

Relevance:

100.00%

Publisher:

Abstract:

In this paper, we present a methodology for identifying the best features from a large feature space. In a high-dimensional feature space, nearest neighbor search is meaningless, and it suffers from quality and performance issues. Many data mining algorithms use nearest neighbor search, so instead of searching over all the features we need to select the relevant ones. We propose feature selection using Non-negative Matrix Factorization (NMF) and its application to nearest neighbor search. A recent clustering algorithm based on Locally Consistent Concept Factorization (LCCF) shows better quality of document clustering by using the local geometrical and discriminating structure of the data. Using our feature selection method, we show a further improvement in clustering performance.
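
The abstract does not spell out the scoring rule, so the following is one plausible sketch of NMF-based feature ranking using scikit-learn: factor the data, then score each feature by its largest loading across the learned components; the max-loading rule, the component count, and the top-k cutoff are assumptions.

    # One plausible way to rank features with NMF (the exact scoring rule used
    # in the paper is not given in the abstract): factor X ~ W H with
    # scikit-learn, score each feature by its largest weight across the
    # learned components, and keep the top-k.
    import numpy as np
    from sklearn.decomposition import NMF

    def nmf_feature_ranking(X, n_components=5, top_k=10, seed=0):
        # X must be non-negative (e.g. term-document counts or tf-idf).
        model = NMF(n_components=n_components, init='nndsvda',
                    random_state=seed, max_iter=500)
        model.fit(X)
        H = model.components_                 # shape: (n_components, n_features)
        scores = H.max(axis=0)                # each feature's strongest loading
        return np.argsort(scores)[::-1][:top_k]

    rng = np.random.default_rng(0)
    X = rng.random((100, 30))
    print(nmf_feature_ranking(X, n_components=4, top_k=5))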

Relevance:

100.00%

Publisher:

Abstract:

Learning from Positive and Unlabelled examples (LPU) has emerged as an important problem in data mining and information retrieval applications. Existing techniques are not ideally suited for real-world scenarios where the datasets are linearly inseparable, as they either build linear classifiers or build non-linear classifiers that fail to achieve the desired performance. In this work, we propose to extend maximum margin clustering ideas and present an iterative procedure to design a non-linear classifier for LPU. In particular, we build a least squares support vector classifier, which is suitable for handling this problem due to the symmetry of its loss function. Further, we present techniques for appropriately initializing the labels of unlabelled examples and for enforcing the ratio of positive to negative examples while obtaining these labels. Experiments on real-world datasets demonstrate that the non-linear classifier designed using the proposed approach gives significantly better generalization performance than the existing relevant approaches for LPU.
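
A minimal sketch in the spirit of the iterative procedure described above, with simplifying assumptions: a linear least-squares (ridge) classifier stands in for the kernel least-squares SVC, the positive ratio is taken as known, and unlabelled labels are initialized by distance to the positive-class mean.

    # Illustrative LPU loop: fix the fraction of unlabelled points treated as
    # positive, initialize their labels by similarity to the labelled
    # positives, then alternate between fitting a least-squares classifier
    # and relabelling by score while keeping the ratio fixed.
    import numpy as np

    def lpu_iterate(X_pos, X_unl, pos_ratio=0.3, reg=1.0, iters=10):
        X = np.vstack([X_pos, X_unl])
        n_pos, n_unl = len(X_pos), len(X_unl)
        k = int(round(pos_ratio * n_unl))          # unlabelled kept positive

        # Initial labels: unlabelled points closest to the positive mean.
        d = np.linalg.norm(X_unl - X_pos.mean(axis=0), axis=1)
        y_unl = np.where(np.argsort(np.argsort(d)) < k, 1.0, -1.0)

        for _ in range(iters):
            y = np.concatenate([np.ones(n_pos), y_unl])
            # Least-squares classifier: ridge regression onto +/-1 targets.
            A = X.T @ X + reg * np.eye(X.shape[1])
            w = np.linalg.solve(A, X.T @ y)
            s = X_unl @ w
            # Relabel: the k highest-scoring unlabelled points stay positive.
            new = np.full(n_unl, -1.0)
            new[np.argsort(s)[::-1][:k]] = 1.0
            if np.array_equal(new, y_unl):
                break
            y_unl = new
        return w, y_unl

    rng = np.random.default_rng(0)
    X_pos = rng.normal(loc=+2.0, size=(30, 2))
    X_unl = np.vstack([rng.normal(+2.0, size=(20, 2)),
                       rng.normal(-2.0, size=(50, 2))])
    w, labels = lpu_iterate(X_pos, X_unl, pos_ratio=20 / 70)
    print((labels[:20] == 1).mean(), (labels[20:] == -1).mean())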

Relevance:

100.00%

Publisher:

Abstract:

Data clustering is a common technique for statistical data analysis, used in many fields including machine learning and data mining. Clustering is the grouping of a data set or, more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait according to some defined distance measure. In this paper we present a genetically improved version of the particle swarm optimization (PSO) algorithm, a population-based heuristic search technique derived from the analysis of particle swarm intelligence and the concepts of genetic algorithms (GA). The algorithm combines PSO concepts such as velocity and position update rules with GA concepts such as selection, crossover and mutation. The performance of the proposed algorithm is evaluated using benchmark datasets from the Machine Learning Repository. Our method performs better than the k-means and PSO algorithms.
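
A compact sketch of the hybrid idea: particles encode candidate centroid sets, the fitness is the k-means objective, and PSO velocity/position updates are combined with a GA-style Gaussian mutation; selection and crossover from the full method are omitted for brevity, and all parameter values are assumptions.

    # PSO-style clustering: each particle encodes k centroids, fitness is the
    # k-means objective (sum of squared distances to the nearest centroid),
    # and a GA-style Gaussian mutation occasionally perturbs particles.
    import numpy as np

    def pso_cluster(X, k=3, n_particles=20, iters=100, seed=0,
                    w_in=0.7, c1=1.5, c2=1.5, mut_prob=0.1):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pos = X[rng.integers(n, size=(n_particles, k))].copy()   # centroid sets
        vel = np.zeros_like(pos)

        def fitness(cent):
            dist = np.linalg.norm(X[:, None, :] - cent[None, :, :], axis=2)
            return dist.min(axis=1).sum()

        pbest = pos.copy()
        pbest_fit = np.array([fitness(p) for p in pos])
        g = pbest[pbest_fit.argmin()].copy()

        for _ in range(iters):
            r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
            vel = w_in * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
            pos = pos + vel
            # GA-style mutation: occasionally jitter a particle's centroids.
            mutate = rng.random(n_particles) < mut_prob
            pos[mutate] += 0.1 * X.std() * rng.normal(size=pos[mutate].shape)
            fit = np.array([fitness(p) for p in pos])
            improved = fit < pbest_fit
            pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
            g = pbest[pbest_fit.argmin()].copy()
        return g

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (-3, 0, 3)])
    print(pso_cluster(X, k=3))     # the best centroid set found by the swarm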

Relevance:

100.00%

Publisher:

Abstract:

Identification and mapping of crevasses in glaciated regions is important for safe movement. However, the remote and rugged glacial terrain in the Himalaya poses considerable challenges for field data collection. In the present study, crevasse signatures were collected from the Siachen and Samudra Tapu glaciers in the Indian Himalaya using ground-penetrating radar (GPR). The surveys were conducted using 250 MHz antennas in ground mode and 350 MHz antennas in airborne mode. The identified signatures of open and hidden crevasses in GPR profiles collected in ground mode were validated by ground truthing. The crevasse zones and buried boulder areas in a glacier were identified using a combination of airborne GPR profiles and SAR data, and these were validated against high-resolution optical satellite imagery (Cartosat-1) and a Survey of India map sheet. Using multi-sensor data, a crevasse map for Samudra Tapu glacier was prepared. The present methodology can also be used for mapping crevasse zones in other glaciers in the Himalaya.

Relevance:

100.00%

Publisher:

Abstract:

Elastic net regularizers have shown much promise in designing sparse classifiers for linear classification. In this work, we propose an alternating optimization approach to solve the dual problems of elastic net regularized linear Support Vector Machines (SVMs) and logistic regression (LR). One of the sub-problems turns out to be a simple projection. The other sub-problem can be solved using dual coordinate descent methods developed for non-sparse L2-regularized linear SVMs and LR, without altering their iteration complexity and convergence properties. Experiments on very large datasets indicate that the proposed dual coordinate descent-projection (DCD-P) methods are fast and achieve comparable generalization performance after the first pass through the data, with extremely sparse models.

Relevance:

100.00%

Publisher:

Abstract:

Land use (LU) and land cover (LC) information at a temporal scale illustrates the physical coverage of the Earth's terrestrial surface according to its use and provides the intricate information needed for effective planning and management activities. LULC changes are local and location specific, but collectively they act as drivers of global environmental change. Understanding and predicting the impact of LULC change processes requires long-term historical reconstructions and projections of land cover change at regional to global scales. The present study aims at quantifying spatio-temporal landscape dynamics along the gradient of varying terrains present in the landscape using a multi-data approach (MDA). MDA incorporates multi-temporal satellite imagery with demographic data and other relevant data sets. The gradient covers three types of topographic features, plains, hilly terrain and coastal region, to account for the significant role of elevation in land cover change. Seasonality is another aspect to be considered in vegetation-dominated landscapes; variations are accounted for using multi-seasonal data. Spatial patterns of the various patches are identified and analysed using landscape metrics to understand forest fragmentation. The prediction of likely changes in 2020 through scenario analysis has been carried out, considering present growth rates and the proposed developmental projects. This work summarizes recent estimates of changes in cropland, agricultural intensification, deforestation, pasture expansion, and urbanization as the causal factors of LULC change.

Relevance:

100.00%

Publisher:

Abstract:

The performance of prediction models is often based on "abstract metrics" that estimate the model's ability to limit residual errors between the observed and predicted values. However, meaningful evaluation and selection of prediction models for end-user domains requires holistic and application-sensitive performance measures. Inspired by energy consumption prediction models used in the emerging "big data" domain of Smart Power Grids, we propose a suite of performance measures to rationally compare models along the dimensions of scale independence, reliability, volatility and cost. We include both application-independent and application-dependent measures, the latter parameterized to allow customization by domain experts to fit their scenario. While our measures are generalizable to other domains, we offer an empirical analysis using real energy use data for three Smart Grid applications: planning, customer education and demand response, which are relevant for energy sustainability. Our results underscore the value of the proposed measures in offering a deeper insight into models' behavior and their impact on real applications, which benefits both data mining researchers and practitioners.
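
The abstract does not enumerate the measures, so the sketch below merely illustrates the flavour with commonly used choices: two scale-independent errors (MAPE and the coefficient of variation of RMSE) and the standard deviation of relative errors as a simple volatility proxy; these specific formulas are assumptions, not the paper's suite.

    # Illustrative scale-independent error measures and a volatility proxy for
    # comparing energy-use prediction models; the concrete formulas here are
    # common choices, not the specific suite proposed in the paper.
    import numpy as np

    def evaluate(observed, predicted):
        observed = np.asarray(observed, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        rel_err = (predicted - observed) / observed
        return {
            'MAPE': np.mean(np.abs(rel_err)),                       # scale-free
            'CVRMSE': np.sqrt(np.mean((predicted - observed) ** 2))
                      / observed.mean(),                            # scale-free
            'volatility': rel_err.std(),        # how erratic the errors are
        }

    obs  = [10.0, 12.0, 11.5, 30.0, 28.0]       # e.g. hourly energy use (kWh)
    pred = [11.0, 11.5, 12.0, 27.0, 29.0]
    print(evaluate(obs, pred))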