74 resultados para real world
Resumo:
This paper presents a novel Second Order Cone Programming (SOCP) formulation for large scale binary classification tasks. Assuming that the class conditional densities are mixture distributions, where each component of the mixture has a spherical covariance, the second order statistics of the components can be estimated efficiently using clustering algorithms like BIRCH. For each cluster, the second order moments are used to derive a second order cone constraint via a Chebyshev-Cantelli inequality. This constraint ensures that any data point in the cluster is classified correctly with a high probability. This leads to a large margin SOCP formulation whose size depends on the number of clusters rather than the number of training data points. Hence, the proposed formulation scales well for large datasets when compared to the state-of-the-art classifiers, Support Vector Machines (SVMs). Experiments on real world and synthetic datasets show that the proposed algorithm outperforms SVM solvers in terms of training time and achieves similar accuracies.
Resumo:
When hosting XML information on relational backends, a mapping has to be established between the schemas of the information source and the target storage repositories. A rich body of recent literature exists for mapping isolated components of XML Schema to their relational counterparts, especially with regard to table configurations. In this paper, we present the Elixir system for designing industrial-strength mappings for real-world applications. Specifically, it produces an information-preserving holistic mapping that transforms the complete XML world-view (XML schema with constraints, XML documents XQuery queries including triggers and views) into a full-scale relational mapping (table definitions, integrity constraints, indices, triggers and views) that is tuned to the application workload. A key design feature of Elixir is that it performs all its mapping-related optimizations in the XML source space, rather than in the relational target space. Further, unlike the XML mapping tools of commercial database systems, which rely heavily on user inputs, Elixir takes a principled cost-based approach to automatically find an efficient relational mapping. A prototype of Elixir is operational and we quantitatively demonstrate its functionality and efficacy on a variety of real-life XML schemas.
Resumo:
We consider the problem of scheduling semiconductor burn-in operations, where burn-in ovens are modelled as batch processing machines. Most of the studies assume that ready times and due dates of jobs are agreeable (i.e., ri < rj implies di ≤ dj). In many real world applications, the agreeable property assumption does not hold. Therefore, in this paper, scheduling of a single burn-in oven with non-agreeable release times and due dates along with non-identical job sizes as well as non-identical processing of time problem is formulated as a Non-Linear (0-1) Integer Programming optimisation problem. The objective measure of the problem is minimising the maximum completion time (makespan) of all jobs. Due to computational intractability, we have proposed four variants of a two-phase greedy heuristic algorithm. Computational experiments indicate that two out of four proposed algorithms have excellent average performance and also capable of solving any large-scale real life problems with a relatively low computational effort on a Pentium IV computer.
Resumo:
Differential evolution (DE) is arguably one of the most powerful stochastic real-parameter optimization algorithms of current interest. Since its inception in the mid 1990s, DE has been finding many successful applications in real-world optimization problems from diverse domains of science and engineering. This paper takes a first significant step toward the convergence analysis of a canonical DE (DE/rand/1/bin) algorithm. It first deduces a time-recursive relationship for the probability density function (PDF) of the trial solutions, taking into consideration the DE-type mutation, crossover, and selection mechanisms. Then, by applying the concepts of Lyapunov stability theorems, it shows that as time approaches infinity, the PDF of the trial solutions concentrates narrowly around the global optimum of the objective function, assuming the shape of a Dirac delta distribution. Asymptotic convergence behavior of the population PDF is established by constructing a Lyapunov functional based on the PDF and showing that it monotonically decreases with time. The analysis is applicable to a class of continuous and real-valued objective functions that possesses a unique global optimum (but may have multiple local optima). Theoretical results have been substantiated with relevant computer simulations.
Resumo:
The questions that one should answer in engineering computations - deterministic, probabilistic/randomized, as well as heuristic - are (i) how good the computed results/outputs are and (ii) how much the cost in terms of amount of computation and the amount of storage utilized in getting the outputs is. The absolutely errorfree quantities as well as the completely errorless computations done in a natural process can never be captured by any means that we have at our disposal. While the computations including the input real quantities in nature/natural processes are exact, all the computations that we do using a digital computer or are carried out in an embedded form are never exact. The input data for such computations are also never exact because any measuring instrument has inherent error of a fixed order associated with it and this error, as a matter of hypothesis and not as a matter of assumption, is not less than 0.005 per cent. Here by error we imply relative error bounds. The fact that exact error is never known under any circumstances and any context implies that the term error is nothing but error-bounds. Further, in engineering computations, it is the relative error or, equivalently, the relative error-bounds (and not the absolute error) which is supremely important in providing us the information regarding the quality of the results/outputs. Another important fact is that inconsistency and/or near-consistency in nature, i.e., in problems created from nature is completely nonexistent while in our modelling of the natural problems we may introduce inconsistency or near-inconsistency due to human error or due to inherent non-removable error associated with any measuring device or due to assumptions introduced to make the problem solvable or more easily solvable in practice. Thus if we discover any inconsistency or possibly any near-inconsistency in a mathematical model, it is certainly due to any or all of the three foregoing factors. We do, however, go ahead to solve such inconsistent/near-consistent problems and do get results that could be useful in real-world situations. The talk considers several deterministic, probabilistic, and heuristic algorithms in numerical optimisation, other numerical and statistical computations, and in PAC (probably approximately correct) learning models. It highlights the quality of the results/outputs through specifying relative error-bounds along with the associated confidence level, and the cost, viz., amount of computations and that of storage through complexity. It points out the limitation in error-free computations (wherever possible, i.e., where the number of arithmetic operations is finite and is known a priori) as well as in the usage of interval arithmetic. Further, the interdependence among the error, the confidence, and the cost is discussed.
Resumo:
Many knowledge based systems (KBS) transform a situation information into an appropriate decision using an in built knowledge base. As the knowledge in real world situation is often uncertain, the degree of truth of a proposition provides a measure of uncertainty in the underlying knowledge. This uncertainty can be evaluated by collecting `evidence' about the truth or falsehood of the proposition from multiple sources. In this paper we propose a simple framework for representing uncertainty in using the notion of an evidence space.
Resumo:
Image segmentation is formulated as a stochastic process whose invariant distribution is concentrated at points of the desired region. By choosing multiple seed points, different regions can be segmented. The algorithm is based on the theory of time-homogeneous Markov chains and has been largely motivated by the technique of simulated annealing. The method proposed here has been found to perform well on real-world clean as well as noisy images while being computationally far less expensive than stochastic optimisation techniques
Resumo:
In this paper we propose a new algorithm for learning polyhedral classifiers. In contrast to existing methods for learning polyhedral classifier which solve a constrained optimization problem, our method solves an unconstrained optimization problem. Our method is based on a logistic function based model for the posterior probability function. We propose an alternating optimization algorithm, namely, SPLA1 (Single Polyhedral Learning Algorithm1) which maximizes the loglikelihood of the training data to learn the parameters. We also extend our method to make it independent of any user specified parameter (e.g., number of hyperplanes required to form a polyhedral set) in SPLA2. We show the effectiveness of our approach with experiments on various synthetic and real world datasets and compare our approach with a standard decision tree method (OC1) and a constrained optimization based method for learning polyhedral sets.
Resumo:
A terrestrial biosphere model with dynamic vegetation capability, Integrated Biosphere Simulator (IBIS2), coupled to the NCAR Community Atmosphere Model (CAM2) is used to investigate the multiple climate-forest equilibrium states of the climate system. A 1000-year control simulation and another 1000-year land cover change simulation that consisted of global deforestation for 100 years followed by re-growth of forests for the subsequent 900 years were performed. After several centuries of interactive climate-vegetation dynamics, the land cover change simulation converged to essentially the same climate state as the control simulation. However, the climate system takes about a millennium to reach the control forest state. In the absence of deep ocean feedbacks in our model, the millennial time scale for converging to the original climate state is dictated by long time scales of the vegetation dynamics in the northern high latitudes. Our idealized modeling study suggests that the equilibrium state reached after complete global deforestation followed by re-growth of forests is unlikely to be distinguishable from the control climate. The real world, however, could have multiple climate-forest states since our modeling study is unlikely to have represented all the essential ecological processes (e. g. altered fire regimes, seed sources and seedling establishment dynamics) for the reestablishment of major biomes.
Resumo:
Lack of supervision in clustering algorithms often leads to clusters that are not useful or interesting to human reviewers. We investigate if supervision can be automatically transferred for clustering a target task, by providing a relevant supervised partitioning of a dataset from a different source task. The target clustering is made more meaningful for the human user by trading-off intrinsic clustering goodness on the target task for alignment with relevant supervised partitions in the source task, wherever possible. We propose a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with source partitions. The alignment process makes use of a cross-task similarity measure that discovers hidden relationships across tasks. When the source and target tasks correspond to different domains with potentially different vocabularies, we propose a projection approach using pivot vocabularies for the cross-domain similarity measure. Using multiple real-world and synthetic datasets, we show that our approach improves clustering accuracy significantly over traditional k-means and state-of-the-art semi-supervised clustering baselines, over a wide range of data characteristics and parameter settings.
Resumo:
In recent times computational algorithms inspired by biological processes and evolution are gaining much popularity for solving science and engineering problems. These algorithms are broadly classified into evolutionary computation and swarm intelligence algorithms, which are derived based on the analogy of natural evolution and biological activities. These include genetic algorithms, genetic programming, differential evolution, particle swarm optimization, ant colony optimization, artificial neural networks, etc. The algorithms being random-search techniques, use some heuristics to guide the search towards optimal solution and speed-up the convergence to obtain the global optimal solutions. The bio-inspired methods have several attractive features and advantages compared to conventional optimization solvers. They also facilitate the advantage of simulation and optimization environment simultaneously to solve hard-to-define (in simple expressions), real-world problems. These biologically inspired methods have provided novel ways of problem-solving for practical problems in traffic routing, networking, games, industry, robotics, economics, mechanical, chemical, electrical, civil, water resources and others fields. This article discusses the key features and development of bio-inspired computational algorithms, and their scope for application in science and engineering fields.
Resumo:
In this paper we study the problem of designing SVM classifiers when the kernel matrix, K, is affected by uncertainty. Specifically K is modeled as a positive affine combination of given positive semi definite kernels, with the coefficients ranging in a norm-bounded uncertainty set. We treat the problem using the Robust Optimization methodology. This reduces the uncertain SVM problem into a deterministic conic quadratic problem which can be solved in principle by a polynomial time Interior Point (IP) algorithm. However, for large-scale classification problems, IP methods become intractable and one has to resort to first-order gradient type methods. The strategy we use here is to reformulate the robust counterpart of the uncertain SVM problem as a saddle point problem and employ a special gradient scheme which works directly on the convex-concave saddle function. The algorithm is a simplified version of a general scheme due to Juditski and Nemirovski (2011). It achieves an O(1/T-2) reduction of the initial error after T iterations. A comprehensive empirical study on both synthetic data and real-world protein structure data sets show that the proposed formulations achieve the desired robustness, and the saddle point based algorithm outperforms the IP method significantly.
Resumo:
We investigate the problem of influence limitation in the presence of competing campaigns in a social network. Given a negative campaign which starts propagating from a specified source and a positive/counter campaign that is initiated, after a certain time delay, to limit the the influence or spread of misinformation by the negative campaign, we are interested in finding the top k influential nodes at which the positive campaign may be triggered. This problem has numerous applications in situations such as limiting the propagation of rumor, arresting the spread of virus through inoculation, initiating a counter-campaign against malicious propaganda, etc. The influence function for the generic influence limitation problem is non-submodular. Restricted versions of the influence limitation problem, reported in the literature, assume submodularity of the influence function and do not capture the problem in a realistic setting. In this paper, we propose a novel computational approach for the influence limitation problem based on Shapley value, a solution concept in cooperative game theory. Our approach works equally effectively for both submodular and non-submodular influence functions. Experiments on standard real world social network datasets reveal that the proposed approach outperforms existing heuristics in the literature. As a non-trivial extension, we also address the problem of influence limitation in the presence of multiple competing campaigns.
Resumo:
Savitzky-Golay (S-G) filters are finite impulse response lowpass filters obtained while smoothing data using a local least-squares (LS) polynomial approximation. Savitzky and Golay proved in their hallmark paper that local LS fitting of polynomials and their evaluation at the mid-point of the approximation interval is equivalent to filtering with a fixed impulse response. The problem that we address here is, ``how to choose a pointwise minimum mean squared error (MMSE) S-G filter length or order for smoothing, while preserving the temporal structure of a time-varying signal.'' We solve the bias-variance tradeoff involved in the MMSE optimization using Stein's unbiased risk estimator (SURE). We observe that the 3-dB cutoff frequency of the SURE-optimal S-G filter is higher where the signal varies fast locally, and vice versa, essentially enabling us to suitably trade off the bias and variance, thereby resulting in near-MMSE performance. At low signal-to-noise ratios (SNRs), it is seen that the adaptive filter length algorithm performance improves by incorporating a regularization term in the SURE objective function. We consider the algorithm performance on real-world electrocardiogram (ECG) signals. The results exhibit considerable SNR improvement. Noise performance analysis shows that the proposed algorithms are comparable, and in some cases, better than some standard denoising techniques available in the literature.
Resumo:
Facet-based sentiment analysis involves discovering the latent facets, sentiments and their associations. Traditional facet-based sentiment analysis algorithms typically perform the various tasks in sequence, and fail to take advantage of the mutual reinforcement of the tasks. Additionally,inferring sentiment levels typically requires domain knowledge or human intervention. In this paper, we propose aseries of probabilistic models that jointly discover latent facets and sentiment topics, and also order the sentiment topics with respect to a multi-point scale, in a language and domain independent manner. This is achieved by simultaneously capturing both short-range syntactic structure and long range semantic dependencies between the sentiment and facet words. The models further incorporate coherence in reviews, where reviewers dwell on one facet or sentiment level before moving on, for more accurate facet and sentiment discovery. For reviews which are supplemented with ratings, our models automatically order the latent sentiment topics, without requiring seed-words or domain-knowledge. To the best of our knowledge, our work is the first attempt to combine the notions of syntactic and semantic dependencies in the domain of review mining. Further, the concept of facet and sentiment coherence has not been explored earlier either. Extensive experimental results on real world review data show that the proposed models outperform various state of the art baselines for facet-based sentiment analysis.