970 resultados para Datasets
Resumo:
Practical usage of machine learning is gaining strategic importance in enterprises looking for business intelligence. However, most enterprise data is distributed in multiple relational databases with expert-designed schema. Using traditional single-table machine learning techniques over such data not only incur a computational penalty for converting to a flat form (mega-join), even the human-specified semantic information present in the relations is lost. In this paper, we present a practical, two-phase hierarchical meta-classification algorithm for relational databases with a semantic divide and conquer approach. We propose a recursive, prediction aggregation technique over heterogeneous classifiers applied on individual database tables. The proposed algorithm was evaluated on three diverse datasets. namely TPCH, PKDD and UCI benchmarks and showed considerable reduction in classification time without any loss of prediction accuracy. (C) 2012 Elsevier Ltd. All rights reserved.
Resumo:
Structural Support Vector Machines (SSVMs) have become a popular tool in machine learning for predicting structured objects like parse trees, Part-of-Speech (POS) label sequences and image segments. Various efficient algorithmic techniques have been proposed for training SSVMs for large datasets. The typical SSVM formulation contains a regularizer term and a composite loss term. The loss term is usually composed of the Linear Maximum Error (LME) associated with the training examples. Other alternatives for the loss term are yet to be explored for SSVMs. We formulate a new SSVM with Linear Summed Error (LSE) loss term and propose efficient algorithms to train the new SSVM formulation using primal cutting-plane method and sequential dual coordinate descent method. Numerical experiments on benchmark datasets demonstrate that the sequential dual coordinate descent method is faster than the cutting-plane method and reaches the steady-state generalization performance faster. It is thus a useful alternative for training SSVMs when linear summed error is used.
Resumo:
This study uses precipitation estimates from the Tropical Rainfall Measuring Mission to quantify the spatial and temporal scales of northward propagation of convection over the Indian monsoon region during boreal summer. Propagating modes of convective systems in the intraseasonal time scales such as the Madden-Julian oscillation can interact with the intertropical convergence zone and bring active and break spells of the Indian summer monsoon. Wavelet analysis was used to quantify the spatial extent (scale) and center of these propagating convective bands, as well as the time period associated with different spatial scales. Results presented here suggest that during a good monsoon year the spatial scale of this oscillation is about 30 degrees centered around 10 degrees N. During weak monsoon years, the scale of propagation decreases and the center shifts farther south closer to the equator. A strong linear relationship is obtained between the center/scale of convective wave bands and intensity of monsoon precipitation over Indian land on the interannual time scale. Moreover, the spatial scale and its center during the break monsoon were found to be similar to an overall weak monsoon year. Based on this analysis, a new index is proposed to quantify the spatial scales associated with propagating convective bands. This automated wavelet-based technique developed here can be used to study meridional propagation of convection in a large volume of datasets from observations and model simulations. The information so obtained can be related to the interannual and intraseasonal variation of Indian monsoon precipitation.
Resumo:
In this study, the authors have investigated the likely future changes in the summer monsoon over the Western Ghats (WG) orographic region of India in response to global warming, using time-slice simulations of an ultra high-resolution global climate model and climate datasets of recent past. The model with approximately 20-km mesh horizontal resolution resolves orographic features on finer spatial scales leading to a quasi-realistic simulation of the spatial distribution of the present-day summer monsoon rainfall over India and trends in monsoon rainfall over the west coast of India. As a result, a higher degree of confidence appears to emerge in many aspects of the 20-km model simulation, and therefore, we can have better confidence in the validity of the model prediction of future changes in the climate over WG mountains. Our analysis suggests that the summer mean rainfall and the vertical velocities over the orographic regions of Western Ghats have significantly weakened during the recent past and the model simulates these features realistically in the present-day climate simulation. Under future climate scenario, by the end of the twenty-first century, the model projects reduced orographic precipitation over the narrow Western Ghats south of 16A degrees N that is found to be associated with drastic reduction in the southwesterly winds and moisture transport into the region, weakening of the summer mean meridional circulation and diminished vertical velocities. We show that this is due to larger upper tropospheric warming relative to the surface and lower levels, which decreases the lapse rate causing an increase in vertical moist static stability (which in turn inhibits vertical ascent) in response to global warming. Increased stability that weakens vertical velocities leads to reduction in large-scale precipitation which is found to be the major contributor to summer mean rainfall over WG orographic region. This is further corroborated by a significant decrease in the frequency of moderate-to-heavy rainfall days over WG which is a typical manifestation of the decrease in large-scale precipitation over this region. Thus, the drastic reduction of vertical ascent and weakening of circulation due to `upper tropospheric warming effect' predominates over the `moisture build-up effect' in reducing the rainfall over this narrow orographic region. This analysis illustrates that monsoon rainfall over mountainous regions is strongly controlled by processes and parameterized physics which need to be resolved with adequately high resolution for accurate assessment of local and regional-scale climate change.
Resumo:
Background: The correlation of genetic distances between pairs of protein sequence alignments has been used to infer protein-protein interactions. It has been suggested that these correlations are based on the signal of co-evolution between interacting proteins. However, although mutations in different proteins associated with maintaining an interaction clearly occur (particularly in binding interfaces and neighbourhoods), many other factors contribute to correlated rates of sequence evolution. Proteins in the same genome are usually linked by shared evolutionary history and so it would be expected that there would be topological similarities in their phylogenetic trees, whether they are interacting or not. For this reason the underlying species tree is often corrected for. Moreover processes such as expression level, are known to effect evolutionary rates. However, it has been argued that the correlated rates of evolution used to predict protein interaction explicitly includes shared evolutionary history; here we test this hypothesis. Results: In order to identify the evolutionary mechanisms giving rise to the correlations between interaction proteins, we use phylogenetic methods to distinguish similarities in tree topologies from similarities in genetic distances. We use a range of datasets of interacting and non-interacting proteins from Saccharomyces cerevisiae. We find that the signal of correlated evolution between interacting proteins is predominantly a result of shared evolutionary rates, rather than similarities in tree topology, independent of evolutionary divergence. Conclusions: Since interacting proteins do not have tree topologies that are more similar than the control group of non-interacting proteins, it is likely that coevolution does not contribute much to, if any, of the observed correlations.
Resumo:
We investigate the problem of influence limitation in the presence of competing campaigns in a social network. Given a negative campaign which starts propagating from a specified source and a positive/counter campaign that is initiated, after a certain time delay, to limit the the influence or spread of misinformation by the negative campaign, we are interested in finding the top k influential nodes at which the positive campaign may be triggered. This problem has numerous applications in situations such as limiting the propagation of rumor, arresting the spread of virus through inoculation, initiating a counter-campaign against malicious propaganda, etc. The influence function for the generic influence limitation problem is non-submodular. Restricted versions of the influence limitation problem, reported in the literature, assume submodularity of the influence function and do not capture the problem in a realistic setting. In this paper, we propose a novel computational approach for the influence limitation problem based on Shapley value, a solution concept in cooperative game theory. Our approach works equally effectively for both submodular and non-submodular influence functions. Experiments on standard real world social network datasets reveal that the proposed approach outperforms existing heuristics in the literature. As a non-trivial extension, we also address the problem of influence limitation in the presence of multiple competing campaigns.
Resumo:
Time series classification deals with the problem of classification of data that is multivariate in nature. This means that one or more of the attributes is in the form of a sequence. The notion of similarity or distance, used in time series data, is significant and affects the accuracy, time, and space complexity of the classification algorithm. There exist numerous similarity measures for time series data, but each of them has its own disadvantages. Instead of relying upon a single similarity measure, our aim is to find the near optimal solution to the classification problem by combining different similarity measures. In this work, we use genetic algorithms to combine the similarity measures so as to get the best performance. The weightage given to different similarity measures evolves over a number of generations so as to get the best combination. We test our approach on a number of benchmark time series datasets and present promising results.
Resumo:
Adaptive Gaussian Mixture Models (GMM) have been one of the most popular and successful approaches to perform foreground segmentation on multimodal background scenes. However, the good accuracy of the GMM algorithm comes at a high computational cost. An improved GMM technique was proposed by Zivkovic to reduce computational cost by minimizing the number of modes adaptively. In this paper, we propose a modification to his adaptive GMM algorithm that further reduces execution time by replacing expensive floating point computations with low cost integer operations. To maintain accuracy, we derive a heuristic that computes periodic floating point updates for the GMM weight parameter using the value of an integer counter. Experiments show speedups in the range of 1.33 - 1.44 on standard video datasets where a large fraction of pixels are multimodal.
Resumo:
In this paper, a comparative study is carried using three nature-inspired algorithms namely Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and Cuckoo Search (CS) on clustering problem. Cuckoo search is used with levy flight. The heavy-tail property of levy flight is exploited here. These algorithms are used on three standard benchmark datasets and one real-time multi-spectral satellite dataset. The results are tabulated and analysed using various techniques. Finally we conclude that under the given set of parameters, cuckoo search works efficiently for majority of the dataset and levy flight plays an important role.
Resumo:
Structural Support Vector Machines (SSVMs) have recently gained wide prominence in classifying structured and complex objects like parse-trees, image segments and Part-of-Speech (POS) tags. Typical learning algorithms used in training SSVMs result in model parameters which are vectors residing in a large-dimensional feature space. Such a high-dimensional model parameter vector contains many non-zero components which often lead to slow prediction and storage issues. Hence there is a need for sparse parameter vectors which contain a very small number of non-zero components. L1-regularizer and elastic net regularizer have been traditionally used to get sparse model parameters. Though L1-regularized structural SVMs have been studied in the past, the use of elastic net regularizer for structural SVMs has not been explored yet. In this work, we formulate the elastic net SSVM and propose a sequential alternating proximal algorithm to solve the dual formulation. We compare the proposed method with existing methods for L1-regularized Structural SVMs. Experiments on large-scale benchmark datasets show that the proposed dual elastic net SSVM trained using the sequential alternating proximal algorithm scales well and results in highly sparse model parameters while achieving a comparable generalization performance. Hence the proposed sequential alternating proximal algorithm is a competitive method to achieve sparse model parameters and a comparable generalization performance when elastic net regularized Structural SVMs are used on very large datasets.
Resumo:
Scenic word images undergo degradations due to motion blur, uneven illumination, shadows and defocussing, which lead to difficulty in segmentation. As a result, the recognition results reported on the scenic word image datasets of ICDAR have been low. We introduce a novel technique, where we choose the middle row of the image as a sub-image and segment it first. Then, the labels from this segmented sub-image are used to propagate labels to other pixels in the image. This approach, which is unique and distinct from the existing methods, results in improved segmentation. Bayesian classification and Max-flow methods have been independently used for label propagation. This midline based approach limits the impact of degradations that happens to the image. The segmented text image is recognized using the trial version of Omnipage OCR. We have tested our method on ICDAR 2003 and ICDAR 2011 datasets. Our word recognition results of 64.5% and 71.6% are better than those of methods in the literature and also methods that competed in the Robust reading competition. Our method makes an implicit assumption that degradation is not present in the middle row.
Resumo:
This work proposes a boosting-based transfer learning approach for head-pose classification from multiple, low-resolution views. Head-pose classification performance is adversely affected when the source (training) and target (test) data arise from different distributions (due to change in face appearance, lighting, etc). Under such conditions, we employ Xferboost, a Logitboost-based transfer learning framework that integrates knowledge from a few labeled target samples with the source model to effectively minimize misclassifications on the target data. Experiments confirm that the Xferboost framework can improve classification performance by up to 6%, when knowledge is transferred between the CLEAR and FBK four-view headpose datasets.
Resumo:
In this paper we present a segmentation algorithm to extract foreground object motion in a moving camera scenario without any preprocessing step such as tracking selected features, video alignment, or foreground segmentation. By viewing it as a curve fitting problem on advected particle trajectories, we use RANSAC to find the polynomial that best fits the camera motion and identify all trajectories that correspond to the camera motion. The remaining trajectories are those due to the foreground motion. By using the superposition principle, we subtract the motion due to camera from foreground trajectories and obtain the true object-induced trajectories. We show that our method performs on par with state-of-the-art technique, with an execution time speed-up of 10x-40x. We compare the results on real-world datasets such as UCF-ARG, UCF Sports and Liris-HARL. We further show that it can be used toper-form video alignment.
Resumo:
Clustering has been the most popular method for data exploration. Clustering is partitioning the data set into sub-partitions based on some measures say the distance measure, each partition has its own significant information. There are a number of algorithms explored for this purpose, one such algorithm is the Particle Swarm Optimization(PSO) which is a population based heuristic search technique derived from swarm intelligence. In this paper we present an improved version of the Particle Swarm Optimization where, each feature of the data set is given significance accordingly by adding some random weights, which also minimizes the distortions in the dataset if any. The performance of the above proposed algorithm is evaluated using some benchmark datasets from Machine Learning Repository. The experimental results shows that our proposed methodology performs significantly better than the previously performed experiments.
Resumo:
Chebyshev-inequality-based convex relaxations of Chance-Constrained Programs (CCPs) are shown to be useful for learning classifiers on massive datasets. In particular, an algorithm that integrates efficient clustering procedures and CCP approaches for computing classifiers on large datasets is proposed. The key idea is to identify high density regions or clusters from individual class conditional densities and then use a CCP formulation to learn a classifier on the clusters. The CCP formulation ensures that most of the data points in a cluster are correctly classified by employing a Chebyshev-inequality-based convex relaxation. This relaxation is heavily dependent on the second-order statistics. However, this formulation and in general such relaxations that depend on the second-order moments are susceptible to moment estimation errors. One of the contributions of the paper is to propose several formulations that are robust to such errors. In particular a generic way of making such formulations robust to moment estimation errors is illustrated using two novel confidence sets. An important contribution is to show that when either of the confidence sets is employed, for the special case of a spherical normal distribution of clusters, the robust variant of the formulation can be posed as a second-order cone program. Empirical results show that the robust formulations achieve accuracies comparable to that with true moments, even when moment estimates are erroneous. Results also illustrate the benefits of employing the proposed methodology for robust classification of large-scale datasets.