970 results for Datasets


Relevance: 10.00%

Abstract:

This paper studies the problem of constructing robust classifiers when the training data is plagued with uncertainty. The problem is posed as a Chance-Constrained Program (CCP) which ensures that the uncertain data points are classified correctly with high probability. Unfortunately, such a CCP turns out to be intractable. The key novelty is in employing Bernstein bounding schemes to relax the CCP as a convex second order cone program whose solution is guaranteed to satisfy the probabilistic constraint. Prior to this work, only Chebyshev-based relaxations were exploited in learning algorithms. Bernstein bounds employ richer partial information and hence can be far less conservative than Chebyshev bounds. Due to this efficient modeling of uncertainty, the resulting classifiers achieve higher classification margins and hence better generalization. Methodologies for classifying uncertain test data points and error measures for evaluating classifiers robust to uncertain data are discussed. Experimental results on synthetic and real-world datasets show that the proposed classifiers are better equipped to handle data uncertainty and outperform the state of the art in many cases.

Relevance: 10.00%

Abstract:

Due to its wide applicability, semi-supervised learning is an attractive method for using unlabeled data in classification. In this work, we present a semi-supervised support vector classifier that is designed using a quasi-Newton method for nonsmooth convex functions. The proposed algorithm is suitable for dealing with a very large number of examples and features. Numerical experiments on various benchmark datasets showed that the proposed algorithm is fast and gives improved generalization performance over the existing methods. Further, a non-linear semi-supervised SVM has been proposed based on a multiple label switching scheme. This non-linear semi-supervised SVM is found to converge faster and to improve generalization performance on several benchmark datasets. (C) 2010 Elsevier Ltd. All rights reserved.

Relevance: 10.00%

Abstract:

We have analysed the diurnal cycle of rainfall over the Indian region (10S-35N, 60E-100E) using both satellite and in-situ data, and found many interesting features associated with this fundamental, yet under-explored, mode of variability. Since there is a distinct and strong diurnal mode of variability associated with the Indian summer monsoon rainfall, we evaluate the ability of the Weather Research and Forecasting Model (WRF) to simulate the observed diurnal rainfall characteristics. The model (at 54 km grid spacing) is integrated for the month of July 2006, since this period was particularly favourable for the study of the diurnal cycle. We first evaluate the sensitivity of the model to the prescribed sea surface temperature (SST), by using two different SST datasets, namely, Final Analyses (FNL) and Real-time Global (RTG). It was found that with RTG SST the rainfall simulation over central India (CI) was significantly better than that with FNL. On the other hand, over the Bay of Bengal (BoB), rainfall simulated with FNL was marginally better than with RTG. However, the overall performance of RTG SST was found to be better than FNL, and hence it was used for further model simulations. Next, we investigated the role of the convective parameterization scheme on the simulation of the diurnal cycle of rainfall. We found that the Kain-Fritsch (KF) scheme performs significantly better than the Betts-Miller-Janjić (BMJ) and Grell-Devenyi schemes. We also studied the impact of other physical parameterizations, namely, microphysics, boundary layer, land surface, and the radiation parameterization, on the simulation of the diurnal cycle of rainfall, and identified the “best” model configuration. We used this “best” model configuration to perform a sensitivity study on the role of various convective components used in the KF scheme. In particular, we studied the role of convective downdrafts, convective timescale, and feedback fraction on the simulated diurnal cycle of rainfall.
The “best” model simulations, in general, show a good agreement with observations. Specifically, (i) Over CI, the simulated diurnal rainfall peak is at 1430 IST, in comparison to the observed 1430-1730 IST peak; (ii) Over Western Ghats and Burmese mountains, the model simulates a diurnal rainfall peak at 1430 IST, as opposed to the observed peak of 1430-1730 IST; (iii) Over Sumatra, both model and observations show a diurnal peak at 1730 IST; (iv) The observed southward propagating diurnal rainfall bands over BoB are weakly simulated by WRF. Besides the diurnal cycle of rainfall, the mean spatial pattern of total rainfall and its partitioning between the convective and stratiform components are also well simulated. The “best” model configuration was used to conduct two nested simulations with one-way, three-level nesting (54-18-6 km) over CI and BoB. While the 54 km and 18 km simulations were conducted for the whole of July 2006, the 6 km simulation was carried out for the period 18-24 July 2006. The results of our coarse- and fine-scale numerical simulations of the diurnal cycle of monsoon rainfall will be discussed.
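
As a rough illustration of how a diurnal rainfall peak such as "1430 IST" can be read off gridded data, the sketch below averages rainfall by time-of-day slot over all days and reports the slot with the largest mean. The array layout, the 3-hourly sampling, and the slot labels (00, 03, ..., 21 UTC shifted by 5:30 to IST) are assumptions made for this example; the abstract does not describe the authors' actual procedure.

```python
import numpy as np

# Assumed slot labels: 3-hourly samples at 00, 03, ..., 21 UTC expressed in IST (UTC + 5:30).
SLOT_LABELS_IST = ["0530", "0830", "1130", "1430", "1730", "2030", "2330", "0230"]


def diurnal_peak(rain):
    """Return (peak slot label, composite) for rain[d, k] = rainfall on day d in slot k."""
    rain = np.asarray(rain, dtype=float)
    if rain.ndim != 2 or rain.shape[1] != len(SLOT_LABELS_IST):
        raise ValueError("expected an array of shape (days, 8)")
    # Composite diurnal cycle: mean over all days for each time-of-day slot.
    composite = rain.mean(axis=0)
    return SLOT_LABELS_IST[int(np.argmax(composite))], composite
```

Applied separately to model output and to observations for the same region, a function like this would give the two peak times being compared in items (i) to (iii).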

Relevance: 10.00%

Abstract:

Background: Temporal analysis of gene expression data has been limited to identifying genes whose expression varies with time and/or correlation between genes that have similar temporal profiles. Often, the methods do not consider the underlying network constraints that connect the genes. It is becoming increasingly evident that interactions change substantially with time. Thus far, there is no systematic method to relate the temporal changes in gene expression to the dynamics of interactions between them. Information on interaction dynamics would open up possibilities for discovering new mechanisms of regulation by providing valuable insight into identifying time-sensitive interactions as well as permit studies on the effect of a genetic perturbation. Results: We present NETGEM, a tractable model rooted in Markov dynamics, for analyzing the dynamics of the interactions between proteins based on the dynamics of the expression changes of the genes that encode them. The model treats the interaction strengths as random variables which are modulated by suitable priors. This approach is necessitated by the extremely small sample size of the datasets, relative to the number of interactions. The model is amenable to a linear time algorithm for efficient inference. Using temporal gene expression data, NETGEM was successful in identifying (i) temporal interactions and determining their strength, (ii) functional categories of the actively interacting partners and (iii) dynamics of interactions in perturbed networks. Conclusions: NETGEM represents an optimal trade-off between model complexity and data requirement. It was able to deduce actively interacting genes and functional categories from temporal gene expression data. It permits inference by incorporating the information available in perturbed networks. 
Given that the inputs to NETGEM are only the network and the temporal variation of the nodes, this algorithm promises to have widespread applications, beyond biological systems. The source code for NETGEM is available from https://github.com/vjethava/NETGEM

Relevance: 10.00%

Abstract:

This paper addresses the problem of maximum margin classification given the moments of class conditional densities and the false positive and false negative error rates. Using Chebyshev inequalities, the problem can be posed as a second order cone programming problem. The dual of the formulation leads to a geometric optimization problem, that of computing the distance between two ellipsoids, which is solved by an iterative algorithm. The formulation is extended to non-linear classifiers using kernel methods. The resultant classifiers are applied to the case of classification of unbalanced datasets with asymmetric costs for misclassification. Experimental results on benchmark datasets show the efficacy of the proposed method.

Relevance: 10.00%

Abstract:

Support Vector Clustering has gained considerable attention from researchers in exploratory data analysis due to its firm theoretical foundation in statistical learning theory. The hard partitioning of the data set achieved by Support Vector Clustering may not be acceptable in real-world scenarios. Rough Support Vector Clustering is an extension of Support Vector Clustering that attains a soft partitioning of the data set. However, the Quadratic Programming Problem involved in Rough Support Vector Clustering makes it computationally expensive for large datasets. In this paper, we propose the Rough Core Vector Clustering algorithm, which is a computationally efficient realization of Rough Support Vector Clustering. Here the Rough Support Vector Clustering problem is formulated as an approximate Minimum Enclosing Ball problem and is solved using an approximate Minimum Enclosing Ball finding algorithm. Experiments on several large multi-class datasets, such as Forest cover type, and other multi-class datasets taken from the LIBSVM page show that the proposed strategy is efficient and finds meaningful soft cluster abstractions which provide superior generalization performance compared with the SVM classifier.
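
To make the "approximate Minimum Enclosing Ball" idea concrete, here is a minimal sketch of one well-known approximation scheme, the simple iterative update attributed to Badoiu and Clarkson: start at any data point and repeatedly step toward the current farthest point with a shrinking step size. This is an illustration in input space only; the abstract does not say which MEB algorithm the authors use, and the function name and default iteration count are assumptions.

```python
import numpy as np


def approx_meb(points, n_iter=1000):
    """Approximate the minimum enclosing ball of `points` (one point per row).

    After n_iter updates the centre is within roughly R / sqrt(n_iter) of the
    optimal centre, where R is the optimal radius.
    """
    pts = np.asarray(points, dtype=float)
    center = pts[0].copy()
    for i in range(1, n_iter + 1):
        # Find the point currently farthest from the centre.
        dists = np.linalg.norm(pts - center, axis=1)
        farthest = pts[int(np.argmax(dists))]
        # Move a fraction 1 / (i + 1) of the way toward it.
        center += (farthest - center) / (i + 1)
    radius = float(np.linalg.norm(pts - center, axis=1).max())
    return center, radius
```

Each iteration costs one pass over the data, which is why schemes of this kind scale to large datasets more easily than solving a full quadratic program.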

Relevance: 10.00%

Abstract:

In this paper we propose a novel, scalable, clustering-based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with one Second Order Cone (SOC) constraint. The main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more efficiently than general purpose solvers. A second contribution is to pose the problem of focused crawling as a large-scale Ordinal Regression problem and solve it using the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing the problem of focused crawling as an Ordinal Regression problem avoids the need for a negative class and topic hierarchy, which are the main drawbacks of the existing focused crawling methods. Experiments on large synthetic and benchmark datasets show the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state of the art.

Relevance: 10.00%

Abstract:

Applications in various domains often lead to very large and frequently high-dimensional data. Successful algorithms must avoid the curse of dimensionality but at the same time should be computationally efficient. Finding useful patterns in large datasets has attracted considerable interest recently. The primary goal of the paper is to implement an efficient Hybrid Tree based clustering method based on CF-Tree and KD-Tree, and to combine the clustering methods with KNN-Classification. Implementing the algorithm involves balancing several concerns, such as good accuracy, low space usage, and low running time. We will evaluate the time and space efficiency, data input order sensitivity, and clustering quality through several experiments.

Relevance: 10.00%

Abstract:

This paper presents a novel Second Order Cone Programming (SOCP) formulation for large scale binary classification tasks. Assuming that the class conditional densities are mixture distributions, where each component of the mixture has a spherical covariance, the second order statistics of the components can be estimated efficiently using clustering algorithms like BIRCH. For each cluster, the second order moments are used to derive a second order cone constraint via a Chebyshev-Cantelli inequality. This constraint ensures that any data point in the cluster is classified correctly with a high probability. This leads to a large margin SOCP formulation whose size depends on the number of clusters rather than the number of training data points. Hence, the proposed formulation scales well for large datasets when compared to the state-of-the-art classifiers, Support Vector Machines (SVMs). Experiments on real world and synthetic datasets show that the proposed algorithm outperforms SVM solvers in terms of training time and achieves similar accuracies.
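
The per-cluster constraint described above can be sketched as follows. For a cluster with mean mu and spherical covariance sigma^2 I, the one-sided Chebyshev (Cantelli) inequality guarantees that a point from the cluster lies on the correct side of the margin with probability at least eta whenever label * (w . mu + b) >= 1 + kappa * sigma * ||w||, with kappa = sqrt(eta / (1 - eta)). The margin normalisation (the constant 1, no slack variable) and all function and parameter names are assumptions for this example, not the paper's exact formulation.

```python
import math

import numpy as np


def cantelli_kappa(eta):
    """Multiplier kappa = sqrt(eta / (1 - eta)) for a probability level eta in [0, 1)."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    return math.sqrt(eta / (1.0 - eta))


def cluster_constraint_holds(w, b, mu, sigma, label, eta):
    """Check the second order cone constraint for one spherical cluster.

    w, b  : hyperplane parameters
    mu    : cluster mean
    sigma : cluster standard deviation (covariance is sigma**2 * I)
    label : +1 or -1
    eta   : required probability of correct classification
    """
    w = np.asarray(w, dtype=float)
    mu = np.asarray(mu, dtype=float)
    margin = label * (float(w @ mu) + b)
    return bool(margin >= 1.0 + cantelli_kappa(eta) * sigma * float(np.linalg.norm(w)))
```

Because there is one such constraint per cluster rather than per training point, the size of the resulting program grows with the number of clusters, which is the scaling property the abstract highlights.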

Relevance: 10.00%

Abstract:

In this paper we propose a new algorithm for learning polyhedral classifiers. In contrast to existing methods for learning polyhedral classifiers, which solve a constrained optimization problem, our method solves an unconstrained optimization problem. Our method is based on a logistic function based model for the posterior probability function. We propose an alternating optimization algorithm, namely, SPLA1 (Single Polyhedral Learning Algorithm1), which maximizes the log-likelihood of the training data to learn the parameters. We also extend our method to make it independent of any user-specified parameter (e.g., the number of hyperplanes required to form a polyhedral set) in SPLA2. We show the effectiveness of our approach with experiments on various synthetic and real-world datasets and compare our approach with a standard decision tree method (OC1) and a constrained optimization based method for learning polyhedral sets.

Relevance: 10.00%

Abstract:

In this paper we consider the process of discovering frequent episodes in event sequences. The most computationally intensive part of this process is that of counting the frequencies of a set of candidate episodes. We present two new frequency counting algorithms for speeding up this part. These, referred to as non-overlapping and non-interleaved frequency counts, are based on directly counting suitable subsets of the occurrences of an episode. Hence they differ from the frequency counts of Mannila et al. [1], which count the number of windows in which the episode occurs. Our new frequency counts offer a speed-up factor of 7 or more on real and synthetic datasets. We also show how the new frequency counts can be used when the events in episodes have time durations as well.
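
The non-overlapping count can be illustrated for a single serial episode with a small state machine: scan the sequence once, wait for the next expected event type, and reset each time the episode completes. This is a simplified sketch under stated assumptions (one serial episode, events given as an ordered list of types, no time stamps or durations); the function name is hypothetical, and the paper's algorithms handle whole sets of candidate episodes at once.

```python
def count_non_overlapped(events, episode):
    """Count non-overlapped occurrences of a serial episode in an event sequence.

    events  : iterable of event types in time order
    episode : non-empty sequence of event types that must occur in this order
    """
    if len(episode) == 0:
        raise ValueError("episode must contain at least one event type")
    count = 0
    pos = 0  # index of the next event type the episode is waiting for
    for event in events:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):
                # A full occurrence has ended; start looking for the next one.
                count += 1
                pos = 0
    return count
```

For example, `count_non_overlapped(list("ABACBCABC"), list("ABC"))` finds two occurrences in a single pass, with no sliding windows to maintain.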

Relevance: 10.00%

Abstract:

During summer, the northern Indian Ocean exhibits significant atmospheric intraseasonal variability associated with active and break phases of the monsoon in the 30-90 days band. In this paper, we investigate mechanisms of the Sea Surface Temperature (SST) signature of this atmospheric variability, using a combination of observational datasets and Ocean General Circulation Model sensitivity experiments. In addition to the previously reported intraseasonal SST signature in the Bay of Bengal, observations show clear SST signals in the Arabian Sea related to the active/break cycle of the monsoon. As the atmospheric intraseasonal oscillation moves northward, SST variations appear first at the southern tip of India (day 0), then in the Somali upwelling region (day 10), northern Bay of Bengal (day 19) and finally in the Oman upwelling region (day 23). The Bay of Bengal and Oman signals are most clearly associated with the monsoon active/break index, whereas the relationship with signals near the Somali upwelling and the southern tip of India is weaker. In agreement with previous studies, we find that heat flux variations drive most of the intraseasonal SST variability in the Bay of Bengal, both in our model (regression coefficient of 0.9, against ~0.25 for wind stress) and in observations (regression coefficient of 0.8); ~60% of the heat flux variation is due to shortwave radiation and ~40% to latent heat flux. On the other hand, both observations and model results indicate a prominent role of dynamical oceanic processes in the Arabian Sea. Wind-stress variations force about 70-100% of SST intraseasonal variations in the Arabian Sea, through modulation of oceanic processes (entrainment, mixing, Ekman pumping, lateral advection). Our ~100 km resolution model suggests that internal oceanic variability (i.e. eddies) contributes substantially to intraseasonal variability at small scales in the Somali upwelling region, but does not contribute to large-scale intraseasonal SST variability due to its small spatial scale and random phase relation to the active-break monsoon cycle. The effect of oceanic eddies, however, remains to be explored at a higher spatial resolution.

Relevance: 10.00%

Abstract:

Comparison of multiple protein structures has a broad range of applications in the analysis of protein structure, function and evolution. Multiple structure alignment tools (MSTAs) are necessary to obtain a simultaneous comparison of a family of related folds. In this study, we have developed a method for multiple structure comparison largely based on sequence alignment techniques. A widely used Structural Alphabet named Protein Blocks (PBs) was used to transform the information on 3D protein backbone conformation into a 1D sequence string. A progressive alignment strategy similar to CLUSTALW was adopted for multiple PB sequence alignment (mulPBA). Highly similar stretches identified by the pairwise alignments are given higher weights during the alignment. The residue equivalences from PB-based alignments are used to obtain a three-dimensional fit of the structures, followed by an iterative refinement of the structural superposition. Systematic comparisons using benchmark datasets of MSTAs underline that the alignment quality is better than that of MULTIPROT, MUSTANG and the alignments in HOMSTRAD in more than 85% of the cases. Comparison with other rigid-body and flexible MSTAs also indicates that mulPBA alignments are superior to most of the rigid-body MSTAs and highly comparable to the flexible alignment methods. (C) 2012 Elsevier Masson SAS. All rights reserved.

Relevance: 10.00%

Abstract:

Lack of supervision in clustering algorithms often leads to clusters that are not useful or interesting to human reviewers. We investigate if supervision can be automatically transferred for clustering a target task, by providing a relevant supervised partitioning of a dataset from a different source task. The target clustering is made more meaningful for the human user by trading-off intrinsic clustering goodness on the target task for alignment with relevant supervised partitions in the source task, wherever possible. We propose a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with source partitions. The alignment process makes use of a cross-task similarity measure that discovers hidden relationships across tasks. When the source and target tasks correspond to different domains with potentially different vocabularies, we propose a projection approach using pivot vocabularies for the cross-domain similarity measure. Using multiple real-world and synthetic datasets, we show that our approach improves clustering accuracy significantly over traditional k-means and state-of-the-art semi-supervised clustering baselines, over a wide range of data characteristics and parameter settings.

Relevance: 10.00%

Abstract:

Background: There has been growing interest in integrative taxonomy that uses data from multiple disciplines for species delimitation. Typically, in such studies, monophyly is taken as a proxy for taxonomic distinctiveness and these units are treated as potential species. However, monophyly could arise due to stochastic processes. Thus, here we have employed a recently developed tool based on a coalescent approach to ascertain the taxonomic distinctiveness of various monophyletic units. Subsequently, the species status of these taxonomic units was further tested using corroborative evidence from morphology and ecology. This inter-disciplinary approach was implemented on endemic centipedes of the genus Digitipes (Attems 1930) from the Western Ghats (WG) biodiversity hotspot of India. The species of the genus Digitipes are morphologically conserved, despite their ancient late Cretaceous origin. Principal Findings: Our coalescent analysis based on the mitochondrial dataset indicated the presence of nine putative species. The integrative approach, which includes nuclear, morphology, and climate datasets, supported the distinctiveness of eight putative species, of which three represent described species and five were new species. Among the five new species, three were morphologically cryptic species, emphasizing the effectiveness of this approach in discovering cryptic diversity in less explored areas of the tropics like the WG. In addition, species pairs showed variable divergence along the molecular, morphological and climate axes. Conclusions: The multidisciplinary approach illustrated here is successful in discovering cryptic diversity and indicates that the invertebrate species richness of the WG may currently be underestimated. Additionally, the importance of measuring multiple secondary properties of species while defining species boundaries was highlighted, given the variable divergence of each species pair across the disciplines.