44 results for Large Data Sets
at Indian Institute of Science - Bangalore - India
Abstract:
We revise and extend the extreme value statistic, introduced in Gupta et al., to study direction dependence in the high-redshift supernova data, arising either from departures from the cosmological principle or from direction-dependent statistical systematics in the data. We introduce a likelihood function that analytically marginalizes over the Hubble constant and use it to extend our previous statistic. We also introduce a new statistic that is sensitive to direction dependence arising from living off-centre inside a large void, as well as from the previously mentioned sources of anisotropy. We show that for large data sets this statistic has a limiting form that can be computed analytically. We apply our statistics to the gold data sets from Riess et al., as in our previous work. Our revision and extension of the previous statistic show that marginalizing over the Hubble constant instead of using its best-fitting value has only a marginal effect on our results. However, correcting errors in our previous work reduces the level of non-Gaussianity found in the 2004 gold data in our earlier analysis. The revised results for the 2007 gold data show that the data are consistent with isotropy and Gaussianity. Our second statistic confirms these results.
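For reference, a minimal sketch of the standard analytic marginalization over an additive offset absorbing the Hubble constant in a Gaussian supernova likelihood is given below; the notation (mu_obs, mu_th, M, sigma_i) is assumed here for illustration and the paper's exact likelihood may differ in detail.

```latex
\chi^2(\theta, M) = \sum_i \frac{\left[\mu_{\mathrm{obs},i} - \mu_{\mathrm{th},i}(\theta) - M\right]^2}{\sigma_i^2},
\qquad
\chi^2_{\mathrm{marg}}(\theta) = A - \frac{B^2}{C},
```

where M is an additive offset collecting the dependence on the Hubble constant (and absolute magnitude), Delta_i = mu_obs,i - mu_th,i(theta), A = sum_i Delta_i^2/sigma_i^2, B = sum_i Delta_i/sigma_i^2, and C = sum_i 1/sigma_i^2, up to an additive constant from the Gaussian integral over M.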
Abstract:
We consider an inverse elasticity problem in which forces and displacements are known on the boundary and the material property distribution inside the body is to be found. In other words, we need to estimate the distribution of constitutive properties from a finite number of boundary data sets. Uniqueness of the solution to this problem has been proved in the literature only under certain assumptions and for a given complete Dirichlet-to-Neumann map. A further complication in the numerical solution of this problem is that the number of boundary data sets needed to establish uniqueness is not known, even in the restricted cases where uniqueness has been proved theoretically. In this paper, we present a numerical technique that can assess the sufficiency of given boundary data sets by computing the rank of a sensitivity matrix that arises in the Gauss-Newton method used to solve the problem. Numerical experiments are presented to illustrate the method.
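As an illustration of the rank test described above, a minimal sketch is given below; the function name numerical_rank, the tolerance, and the random placeholder matrix are assumptions made here, not the paper's implementation.

```python
import numpy as np

def numerical_rank(sensitivity, rel_tol=1e-8):
    """Estimate the rank of a sensitivity (Jacobian) matrix from its
    singular values, as one way to assess whether the assembled
    boundary data sets constrain all material parameters."""
    s = np.linalg.svd(sensitivity, compute_uv=False)
    if s.size == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))

# Hypothetical usage: rows = boundary measurements from all data sets,
# columns = discretized material parameters.
J = np.random.rand(200, 50)          # placeholder sensitivity matrix
print(numerical_rank(J), "of", J.shape[1], "parameters are identifiable")
```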
Abstract:
We have benchmarked the maximum obtainable recognition accuracy on five publicly available standard word image data sets using semi-automated segmentation and a commercial OCR. These images have been cropped from camera-captured scene images, born-digital images (BDI) and street view images. Using the Matlab-based tool developed by us, we have annotated at the pixel level more than 3600 word images from the five data sets. The word images binarized by the tool, as well as by our own midline analysis and propagation of segmentation (MAPS) algorithm, are recognized using the trial version of Nuance Omnipage OCR, and these two results are compared with the best reported in the literature. The benchmark word recognition rates obtained on the ICDAR 2003, Sign evaluation, Street view, Born-digital and ICDAR 2011 data sets are 83.9%, 89.3%, 79.6%, 88.5% and 86.7%, respectively. The results obtained from MAPS-binarized word images without the use of any lexicon are 64.5% and 71.7% for ICDAR 2003 and 2011 respectively, and these values are higher than the best reported values in the literature of 61.1% and 41.2%, respectively. The MAPS result of 82.8% on the BDI 2011 data set matches the performance of the state-of-the-art method based on the power law transform.
Abstract:
In this study, we applied the integration methodology developed in the companion paper by Aires (2014) to real satellite observations over the Mississippi Basin. The methodology provides basin-scale estimates of the four water budget components (precipitation P, evapotranspiration E, water storage change Delta S, and runoff R) in a two-step process: the Simple Weighting (SW) integration and a Postprocessing Filtering (PF) that imposes the water budget closure. A comparison with in situ observations of P and E demonstrated that PF improved the estimation of both components. A Closure Correction Model (CCM) has been derived from the integrated product (SW+PF) that allows each observation data set to be corrected independently, unlike the SW+PF method, which requires simultaneous estimates of the four components. The CCM allows the various data sets for each component to be standardized and greatly decreases the budget residual (P - E - Delta S - R). As a direct application, the CCM was combined with the water budget equation to reconstruct missing values in any component. Results of a Monte Carlo experiment with synthetic gaps demonstrated the good performance of the method, except for the runoff data, whose variability is of the same order of magnitude as the budget residual. Similarly, we proposed a reconstruction of Delta S between 1990 and 2002, where no Gravity Recovery and Climate Experiment data are available. Unlike most studies dealing with water budget closure at the basin scale, only satellite observations and in situ runoff measurements are used. Consequently, the integrated data sets are model independent and can be used for model calibration or validation.
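A minimal sketch of a closure step of this kind is shown below; spreading the residual in proportion to assumed error variances is a simplification, and the function name, the sigma values and the example numbers are illustrative assumptions rather than the paper's PF scheme.

```python
import numpy as np

def close_water_budget(P, E, dS, R, sigma):
    """Minimal closure sketch: adjust the four components so that
    P - E - dS - R = 0, spreading the residual in proportion to the
    assumed error variances (a simplified stand-in for the
    Postprocessing Filtering described in the abstract)."""
    x = np.array([P, E, dS, R], dtype=float)
    a = np.array([1.0, -1.0, -1.0, -1.0])   # closure constraint a.x = 0
    var = np.asarray(sigma, dtype=float) ** 2
    residual = a @ x
    # Lagrange-multiplier solution of: min sum((x'-x)^2/var) s.t. a.x' = 0
    correction = var * a * residual / (a @ (var * a))
    return x - correction

P_c, E_c, dS_c, R_c = close_water_budget(3.2, 2.1, 0.4, 0.9, sigma=[0.3, 0.4, 0.2, 0.1])
print(P_c - E_c - dS_c - R_c)  # ~0 by construction
```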
Abstract:
Support Vector Machines (SVMs) are hyperplane classifiers defined in a kernel-induced feature space. The data-size-dependent training time complexity of SVMs usually prohibits their use in applications involving more than a few thousand data points. In this paper we propose a novel kernel-based incremental data clustering approach and its use for scaling non-linear Support Vector Machines to handle large data sets. The clustering method introduced can find cluster abstractions of the training data in a kernel-induced feature space. These cluster abstractions are then used for selective-sampling-based training of Support Vector Machines to reduce the training time without compromising the generalization performance. Experiments done with real-world data sets show that this approach gives good generalization performance at reasonable computational expense.
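To make the selective-sampling idea concrete, a simplified sketch is shown below; plain k-means and the parameters n_clusters and per_cluster are stand-in assumptions, since the paper clusters in the kernel-induced feature space with its own incremental algorithm.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import SVC

def selective_sample_and_train(X, y, n_clusters=50, per_cluster=20, seed=0):
    """Illustrative sketch of selective-sampling-based SVM training:
    cluster the data, keep a few points per cluster, and train a
    non-linear SVM on the reduced set."""
    rng = np.random.default_rng(seed)
    labels = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        if idx.size:
            keep.extend(rng.choice(idx, size=min(per_cluster, idx.size), replace=False))
    keep = np.array(keep)
    return SVC(kernel="rbf", gamma="scale").fit(X[keep], y[keep])

# Hypothetical usage on a large labelled training set (X, y):
# model = selective_sample_and_train(X, y)
```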
Abstract:
Core Vector Machine (CVM) is suitable for efficient large-scale pattern classification. In this paper, we propose a method for improving the performance of CVM with the Gaussian kernel function irrespective of the ordering of patterns belonging to different classes within the data set. This method employs selective-sampling-based training of CVM using a novel kernel-based scalable hierarchical clustering algorithm. Empirical studies on synthetic and real-world data sets show that the proposed strategy performs well on large data sets.
Abstract:
This paper discusses a method for scaling SVM with the Gaussian kernel function to handle large data sets by using a selective sampling strategy for the training set. It employs a scalable hierarchical clustering algorithm to construct cluster indexing structures of the training data in the kernel-induced feature space. These are then used for selective sampling of the training data for the SVM, imparting scalability to the training process. Empirical studies on real-world data sets show that the proposed strategy performs well on large data sets.
Abstract:
Even though several techniques have been proposed in the literature for achieving multiclass classification using the Support Vector Machine (SVM), the scalability of these approaches to large data sets still needs much exploration. Core Vector Machine (CVM) is a technique for scaling up a two-class SVM to handle large data sets. In this paper we propose a Multiclass Core Vector Machine (MCVM). Here we formulate the multiclass SVM problem as a Quadratic Programming (QP) problem defining an SVM with vector-valued output. This QP problem is then solved using the CVM technique to achieve scalability to large data sets. Experiments done with several large synthetic and real-world data sets show that the proposed MCVM technique gives generalization performance comparable to that of SVM at a much lower computational expense. Further, it is observed that MCVM scales well with the size of the data set.
Abstract:
In many real world prediction problems the output is a structured object like a sequence, a tree or a graph. Such problems range from natural language processing to computational biology or computer vision and have been tackled using algorithms referred to as structured output learning algorithms. We consider the problem of structured classification. In the last few years, large margin classifiers like support vector machines (SVMs) have shown much promise for structured output learning. The related optimization problem is a convex quadratic program (QP) with a large number of constraints, which makes the problem intractable for large data sets. This paper proposes a fast sequential dual method (SDM) for structural SVMs. The method makes repeated passes over the training set and optimizes the dual variables associated with one example at a time. The use of additional heuristics makes the proposed method more efficient. We present an extensive empirical evaluation of the proposed method on several sequence learning problems. Our experiments on large data sets demonstrate that the proposed method is an order of magnitude faster than state-of-the-art methods like the cutting-plane method and the stochastic gradient descent method (SGD). Further, SDM reaches steady-state generalization performance faster than the SGD method. The proposed SDM is thus a useful alternative for large-scale structured output learning.
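The flavor of optimizing the dual variables of one example at a time can be illustrated with the sketch below for an ordinary binary linear SVM; the function name and parameters are assumptions, and the paper's SDM operates on the much larger dual of structural SVMs with additional heuristics on top.

```python
import numpy as np

def dual_coordinate_ascent_svm(X, y, C=1.0, epochs=10, seed=0):
    """Toy illustration of sweeping over examples and updating the dual
    variable of one example at a time, shown for a plain binary linear
    SVM with hinge loss and labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    q = np.einsum("ij,ij->i", X, X)          # Q_ii = ||x_i||^2
    for _ in range(epochs):
        for i in rng.permutation(n):
            if q[i] == 0:
                continue
            grad = y[i] * (w @ X[i]) - 1.0    # gradient of the dual in alpha_i
            new_alpha = np.clip(alpha[i] - grad / q[i], 0.0, C)
            w += (new_alpha - alpha[i]) * y[i] * X[i]
            alpha[i] = new_alpha
    return w

# Hypothetical usage: w = dual_coordinate_ascent_svm(X, y_pm1)
```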
Abstract:
There are many applications, such as software for processing customer records in telecom, patient records in hospitals, or email software accessing a single email in a mailbox, which require access to a single record in a database consisting of millions of records. A basic feature of these applications is that they need to access data sets which are very large but simple. Cloud computing provides the computing requirements for this new generation of applications involving very large data sets, which cannot be handled efficiently using traditional computing infrastructure. In this paper, we describe the storage services provided by three well-known cloud service providers and compare their features, with a view to characterizing the storage requirements of very large data sets; we hope this will act as a catalyst for the design of storage services for very large data set requirements in the future. We also give a brief overview of other kinds of storage that have emerged recently for cloud computing.
Abstract:
The contour tree is a topological abstraction of a scalar field that captures the evolution of level set connectivity. It is an effective representation for visual exploration and analysis of scientific data. We describe a work-efficient, output-sensitive, and scalable parallel algorithm for computing the contour tree of a scalar field defined on a domain that is represented using either an unstructured mesh or a structured grid. A hybrid implementation of the algorithm using the GPU and multi-core CPU can compute the contour tree of an input containing 16 million vertices in less than ten seconds with a speedup factor of up to 13. Experiments based on an implementation in a multi-core CPU environment show near-linear speedup for large data sets.
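For intuition, a sequential sketch of the join-tree half of the contour tree computation (a sweep with union-find over superlevel sets) is shown below; the function name and the toy path-graph example are assumptions, and the paper's contribution is a parallel, output-sensitive formulation rather than this sequential version.

```python
import numpy as np

def join_tree_edges(values, adjacency):
    """Sweep vertices from high to low scalar value and track how
    superlevel-set components merge with a union-find structure.
    Returns (child, parent) vertex pairs of the augmented join tree."""
    order = np.argsort(-np.asarray(values, dtype=float))
    parent_uf = {}
    lowest = {}            # lowest (most recently added) vertex of each component

    def find(v):
        while parent_uf[v] != v:
            parent_uf[v] = parent_uf[parent_uf[v]]
            v = parent_uf[v]
        return v

    edges = []
    seen = set()
    for v in order:
        v = int(v)
        parent_uf[v] = v
        lowest[v] = v
        for u in adjacency[v]:
            if u in seen:
                ru, rv = find(u), find(v)
                if ru != rv:
                    edges.append((lowest[ru], v))   # component rooted at ru joins at v
                    parent_uf[ru] = rv
                    lowest[rv] = v
        seen.add(v)
    return edges

# Hypothetical usage on a path graph with scalar values [3, 1, 2]:
# join_tree_edges([3, 1, 2], {0: [1], 1: [0, 2], 2: [1]})  -> [(0, 1), (2, 1)]
```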
Abstract:
The maximum entropy approach to classification is very well studied in applied statistics and machine learning, and almost all the methods that exist in the literature are discriminative in nature. In this paper, we introduce a generative maximum entropy classification method with feature selection for high-dimensional data such as text data sets. To tackle the curse of dimensionality of large data sets, we employ a conditional independence assumption (Naive Bayes) and perform feature selection simultaneously, by enforcing a `maximum discrimination' between the estimated class conditional densities. For two-class problems, the proposed method uses the Jeffreys (J) divergence to discriminate the class conditional densities. To extend our method to the multi-class case, we propose a completely new approach based on a multi-distribution divergence: we replace the Jeffreys divergence by the Jensen-Shannon (JS) divergence to discriminate the conditional densities of multiple classes. In order to reduce computational complexity, we employ a modified Jensen-Shannon divergence (JS(GM)), based on the AM-GM inequality. We show that the resulting divergence is a natural generalization of the Jeffreys divergence to the case of multiple distributions. As far as theoretical justification is concerned, we show that when one intends to select the best features in a generative maximum entropy approach, maximum discrimination using the J-divergence emerges naturally in binary classification. The performance of the proposed algorithms is demonstrated in a comparative study on high-dimensional text and gene expression data sets, which shows that our methods scale very well with dimensionality.
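The divergences referred to above have the following standard forms; the weights pi_i and the reading of JS(GM) as a geometric-mean variant are assumptions made here for illustration and may differ from the paper's exact definitions.

```latex
J(p,q) = \mathrm{KL}(p\,\|\,q) + \mathrm{KL}(q\,\|\,p),
\qquad
\mathrm{JS}(p_1,\dots,p_k) = \sum_{i=1}^{k} \pi_i\, \mathrm{KL}\!\Big(p_i \,\Big\|\, \sum_{j=1}^{k} \pi_j p_j\Big).
```

Replacing the arithmetic-mean mixture inside the KL term by the (unnormalized) geometric mean \prod_j p_j^{\pi_j}, as suggested by the AM-GM inequality, gives

```latex
\mathrm{JS}_{\mathrm{GM}}(p_1,\dots,p_k) = \sum_{i,j} \pi_i \pi_j\, \mathrm{KL}(p_i\,\|\,p_j),
```

which for two equally weighted classes equals J(p_1, p_2)/4, consistent with the abstract's claim that JS(GM) generalizes the Jeffreys divergence to multiple distributions.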
Abstract:
The fluctuations exhibited by the cross sections generated in a compound-nucleus reaction or, more generally, in a quantum-chaotic scattering process, when varying the excitation energy or another external parameter, are characterized by the width Gamma_corr of the cross-section correlation function. Brink and Stephen [Phys. Lett. 5, 77 (1963)] proposed a method for its determination by simply counting the number of maxima featured by the cross sections as a function of the parameter under consideration. They stated that the product of the average number of maxima per unit energy range and Gamma_corr is constant in the Ericson region of strongly overlapping resonances. We use the analogy between the scattering formalism for compound-nucleus reactions and for microwave resonators to test this method experimentally with unprecedented accuracy using large data sets, and propose an analytical description for the regions of isolated and overlapping resonances.
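A simplified numerical sketch of the two quantities being compared (the density of maxima of an excitation function and the width of its autocorrelation) is given below; the function names, the half-maximum criterion and the uniform energy grid are assumptions made here, not the experimental analysis of the paper.

```python
import numpy as np

def correlation_width(sigma, d_energy):
    """Width at which the cross-section autocorrelation drops below half
    its zero-lag value, a simple estimator of Gamma_corr."""
    x = sigma - sigma.mean()
    acf = np.correlate(x, x, mode="full")[x.size - 1:]
    acf /= acf[0]
    below = np.flatnonzero(acf < 0.5)
    return below[0] * d_energy if below.size else np.nan

def maxima_density(sigma, d_energy):
    """Average number of local maxima per unit energy range."""
    interior = sigma[1:-1]
    n_max = np.count_nonzero((interior > sigma[:-2]) & (interior > sigma[2:]))
    return n_max / (d_energy * (sigma.size - 1))

# Hypothetical check of the Brink-Stephen idea on a measured or simulated
# excitation function sigma(E) sampled on a uniform grid with spacing dE:
# print(maxima_density(sigma, dE) * correlation_width(sigma, dE))
```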
Abstract:
Background: A genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graph representations of genetic networks, graphs with many thousands of nodes where an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data. Results: The structure learning task is cast as a sparse linear regression problem, which is then posed as a LASSO (l1-constrained fitting) problem and finally solved by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs is assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisiae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law, suggesting that its degree distribution is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification. Conclusion: A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence for integrated computational-experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.
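A simplified sketch of the sparse-linear-network idea is shown below; regressing each gene on all others with an l1 penalty is the general recipe, but the function name, the alpha and threshold parameters, and the use of scikit-learn's Lasso (instead of the paper's LP formulation) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_linear_network(expr, alpha=0.05, threshold=1e-3):
    """Regress each gene on all other genes with an l1 penalty and keep
    nonzero coefficients as undirected edges of a sparse network."""
    n_samples, n_genes = expr.shape
    adjacency = np.zeros((n_genes, n_genes), dtype=bool)
    for g in range(n_genes):
        others = np.delete(np.arange(n_genes), g)
        coef = Lasso(alpha=alpha, max_iter=5000).fit(expr[:, others], expr[:, g]).coef_
        adjacency[g, others[np.abs(coef) > threshold]] = True
    return adjacency | adjacency.T        # symmetrize: undirected edges

# Hypothetical usage on a (samples x genes) transcript profile matrix:
# A = sparse_linear_network(expr_matrix)
```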
Abstract:
It has long been thought that tropical rainfall retrievals from satellites have large errors. Here we show, using a new daily 1 degree gridded rainfall data set based on about 1800 gauges from the India Meteorology Department (IMD), that modern satellite estimates are reasonably close to observed rainfall over the Indian monsoon region. Daily satellite rainfalls from the Global Precipitation Climatology Project (GPCP 1DD) and the Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA) are available since 1998. The high summer monsoon (June-September) rain over the Western Ghats and Himalayan foothills is captured in TMPA data. Away from hilly regions, the seasonal mean and intraseasonal variability of rainfall (averaged over regions of a few hundred kilometers linear dimension) from both satellite products are within about 15% of observations. Satellite data generally underestimate both the mean and variability of rain, but the phase of intraseasonal variations is accurate. On synoptic timescales, TMPA gives a reasonable depiction of the pattern and intensity of torrential rain from individual monsoon low-pressure systems and depressions. A pronounced biennial oscillation of seasonal total central India rain is seen in all three data sets, with GPCP 1DD being closest to IMD observations. The new satellite data are a promising resource for the study of tropical rainfall variability.