31 results for Data sets storage

at Indian Institute of Science - Bangalore - India


Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this study, we applied the integration methodology developed in the companion paper by Aires (2014) to real satellite observations over the Mississippi Basin. The methodology provides basin-scale estimates of the four water budget components (precipitation P, evapotranspiration E, water storage change Delta S, and runoff R) in a two-step process: the Simple Weighting (SW) integration and a Postprocessing Filtering (PF) that imposes the water budget closure. A comparison with in situ observations of P and E demonstrated that PF improved the estimation of both components. A Closure Correction Model (CCM) has been derived from the integrated product (SW+PF); it makes it possible to correct each observation data set independently, unlike the SW+PF method, which requires simultaneous estimates of the four components. The CCM standardizes the various data sets for each component and greatly decreases the budget residual (P - E - Delta S - R). As a direct application, the CCM was combined with the water budget equation to reconstruct missing values in any component. Results of a Monte Carlo experiment with synthetic gaps demonstrated the good performance of the method, except for the runoff data, whose variability is of the same order of magnitude as the budget residual. Similarly, we proposed a reconstruction of Delta S between 1990 and 2002, a period for which no Gravity Recovery and Climate Experiment data are available. Unlike most studies dealing with water budget closure at the basin scale, only satellite observations and in situ runoff measurements are used. Consequently, the integrated data sets are model independent and can be used for model calibration or validation.
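
The reconstruction step mentioned in this abstract rests on the closure relation P - E - Delta S - R = 0. The following is only a minimal sketch of that arithmetic, assuming the inputs have already been corrected and share the same units. The function names and the sample numbers are hypothetical, and the CCM correction itself is not reproduced here.

```python
def budget_residual(p, e, ds, r):
    """Water budget residual P - E - dS - R (all terms in the same units)."""
    return p - e - ds - r


def reconstruct_missing(p=None, e=None, ds=None, r=None):
    """Fill the single missing component (passed as None) by assuming closure."""
    values = {"p": p, "e": e, "ds": ds, "r": r}
    missing = [name for name, v in values.items() if v is None]
    if len(missing) != 1:
        raise ValueError("exactly one component must be None")
    name = missing[0]
    # Solve P - E - dS - R = 0 for whichever term is absent.
    if name == "p":
        values["p"] = e + ds + r
    elif name == "e":
        values["e"] = p - ds - r
    elif name == "ds":
        values["ds"] = p - e - r
    else:
        values["r"] = p - e - ds
    return values


# Hypothetical monthly values in mm; the storage change is the unknown.
filled = reconstruct_missing(p=100.0, e=60.0, r=25.0)
print(filled["ds"], budget_residual(**filled))
```

Because the missing term is obtained by forcing the residual to zero, any error in the three known terms is transferred directly to the reconstructed one, which is consistent with the abstract's caveat about runoff.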

Relevance:

100.00% 100.00%

Publisher:

Abstract:

We consider an inverse elasticity problem in which forces and displacements are known on the boundary and the material property distribution inside the body is to be found. In other words, we need to estimate the distribution of constitutive properties using a finite number of boundary data sets. Uniqueness of the solution to this problem is proved in the literature only under certain assumptions for a given complete Dirichlet-to-Neumann map. Another complication in the numerical solution of this problem is that the number of boundary data sets needed to establish uniqueness is not known, even in the restricted cases where uniqueness is proved theoretically. In this paper, we present a numerical technique that can assess the sufficiency of given boundary data sets by computing the rank of a sensitivity matrix that arises in the Gauss-Newton method used to solve the problem. Numerical experiments are presented to illustrate the method.
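
The rank test described in this abstract can be illustrated with a small sketch. It assumes that the sensitivity matrix has one row per boundary measurement and one column per unknown material parameter, and that the data are judged sufficient when the matrix has full column rank. The elimination routine, the tolerance, and the toy matrices are illustrative assumptions; the abstract does not say how the rank is computed, and a singular value decomposition is the more common choice in practice.

```python
def numerical_rank(matrix, tol=1e-9):
    """Rank of a dense matrix (list of rows) by Gaussian elimination with partial pivoting."""
    rows = [list(map(float, row)) for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Use the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda i: abs(rows[i][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, n_rows):
            factor = rows[i][col] / rows[rank][col]
            for j in range(col, n_cols):
                rows[i][j] -= factor * rows[rank][j]
        rank += 1
    return rank


def is_sufficient(sensitivity, tol=1e-9):
    """Treat the data as sufficient when the matrix has full column rank."""
    return numerical_rank(sensitivity, tol) == len(sensitivity[0])


# Toy matrices: 3 measurements, 2 unknown parameters.
independent = [[1, 0], [0, 1], [1, 1]]
redundant = [[1, 2], [2, 4], [3, 6]]  # second column is twice the first
print(is_sufficient(independent), is_sufficient(redundant))
```

The absolute tolerance used here depends on the scaling of the matrix, so a real application would need to normalize the columns or use a relative threshold.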

Relevance:

100.00% 100.00%

Publisher:

Abstract:

We have benchmarked the maximum obtainable recognition accuracy on five publicly available standard word image data sets using semi-automated segmentation and a commercial OCR. These images have been cropped from camera-captured scene images, born-digital images (BDI) and street view images. Using a MATLAB-based tool that we developed, we have annotated at the pixel level more than 3600 word images from the five data sets. The word images binarized by the tool, as well as by our own midline analysis and propagation of segmentation (MAPS) algorithm, are recognized using the trial version of Nuance Omnipage OCR, and these two results are compared with the best reported in the literature. The benchmark word recognition rates obtained on the ICDAR 2003, Sign evaluation, Street view, Born-digital and ICDAR 2011 data sets are 83.9%, 89.3%, 79.6%, 88.5% and 86.7%, respectively. The results obtained from MAPS-binarized word images without the use of any lexicon are 64.5% and 71.7% for ICDAR 2003 and 2011, respectively, and these values are higher than the best reported values in the literature of 61.1% and 41.2%, respectively. The MAPS result of 82.8% on the BDI 2011 data set matches the performance of the state-of-the-art method based on the power law transform.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Many applications, such as software for processing customer records in telecom, patient records in hospitals, or a single email in a mailbox, need to access a single record in a database consisting of millions of records. A basic feature of these applications is that they access data sets that are very large but simple. Cloud computing meets the computing requirements of this new generation of applications, whose very large data sets cannot be handled efficiently using traditional computing infrastructure. In this paper, we describe the storage services provided by three well-known cloud service providers and compare their features, with a view to characterizing the storage requirements of very large data sets. We hope that these examples will act as a catalyst for the design of storage services for very large data sets in the future. We also give a brief overview of other kinds of storage that have emerged recently for cloud computing.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper presents a comprehensive and robust strategy for the estimation of battery model parameters from noise-corrupted data. The deficiencies of the existing methods for parameter estimation are studied, and the proposed parameter estimation strategy improves on earlier methods by working optimally for low as well as high discharge currents, providing accurate estimates even under high levels of noise and with a wide range of initial values. Testing on different data sets confirms the performance of the proposed parameter estimation strategy.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Background: A genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graph representations of genetic networks, graphs with many thousands of nodes where an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data. Results: The structure learning task is cast as a sparse linear regression problem, which is then posed as a LASSO (l1-constrained fitting) problem and finally solved by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs are assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisiae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law, suggesting that its degree distribution is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification. Conclusion: A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence for integrated computational-experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

The K-means algorithm is a well-known nonhierarchical method for clustering data. The most important limitations of this algorithm are that: (1) it gives final clusters on the basis of the cluster centroids or the seed points chosen initially, and (2) it is appropriate for data sets having fairly isotropic clusters. However, this algorithm has the advantage of low computation and storage requirements. On the other hand, the hierarchical agglomerative clustering algorithm, which can cluster nonisotropic (chain-like and concentric) clusters, has high storage and computation requirements. This paper suggests a new method for selecting the initial seed points, so that the K-means algorithm gives the same results for any input data order. This paper also describes a hybrid clustering algorithm, based on the concepts of multilevel theory, which is nonhierarchical at the first level and hierarchical from the second level onwards, to cluster data sets having (i) chain-like clusters and (ii) concentric clusters. It is observed that this hybrid clustering algorithm gives the same results as the hierarchical clustering algorithm, with lower computation and storage requirements.
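
The abstract does not describe its seed selection rule, so the sketch below only illustrates the property it aims for: a K-means run whose result does not depend on the order of the input. The seeding rule used here (sort the points, then take k evenly spaced ones) is a stand-in assumption and not the paper's method; the function name and sample points are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's K-means with a deterministic, order-independent initialization.

    Seeding rule (an assumption, not the paper's): sort the points and take
    k evenly spaced ones as the initial centroids.
    """
    # Sorting removes any dependence on the order in which points arrive.
    data = sorted(tuple(map(float, p)) for p in points)
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in data:
            # Assign each point to the centroid with the smallest squared distance.
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep it if empty.
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(pts, 2) == kmeans(list(reversed(pts)), 2))
```

Sorting costs O(n log n) time, which is small next to the clustering itself, but this rule says nothing about seed quality for nonisotropic clusters, the case the hybrid algorithm in the abstract is meant to handle.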

Relevance:

90.00% 90.00%

Publisher:

Abstract:

It has long been thought that tropical rainfall retrievals from satellites have large errors. Here we show, using a new daily 1 degree gridded rainfall data set based on about 1800 gauges from the India Meteorology Department (IMD), that modern satellite estimates are reasonably close to observed rainfall over the Indian monsoon region. Daily satellite rainfalls from the Global Precipitation Climatology Project (GPCP 1DD) and the Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA) are available since 1998. The high summer monsoon (June-September) rain over the Western Ghats and Himalayan foothills is captured in TMPA data. Away from hilly regions, the seasonal mean and intraseasonal variability of rainfall (averaged over regions of a few hundred kilometers linear dimension) from both satellite products are about 15% of observations. Satellite data generally underestimate both the mean and variability of rain, but the phase of intraseasonal variations is accurate. On synoptic timescales, TMPA gives a reasonable depiction of the pattern and intensity of torrential rain from individual monsoon low-pressure systems and depressions. A pronounced biennial oscillation of seasonal total central India rain is seen in all three data sets, with GPCP 1DD being closest to IMD observations. The new satellite data are a promising resource for the study of tropical rainfall variability.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

We revise and extend the extreme value statistic, introduced in Gupta et al., to study direction dependence in the high-redshift supernova data, arising either from departures from the cosmological principle or from direction-dependent statistical systematics in the data. We introduce a likelihood function that analytically marginalizes over the Hubble constant and use it to extend our previous statistic. We also introduce a new statistic that is sensitive to direction dependence arising from living off-centre inside a large void as well as from the previously mentioned reasons for anisotropy. We show that for large data sets, this statistic has a limiting form that can be computed analytically. We apply our statistics to the gold data sets from Riess et al., as in our previous work. Our revision and extension of the previous statistic show that the effect of marginalizing over the Hubble constant instead of using its best-fitting value on our results is only marginal. However, correction of errors in our previous work reduces the level of non-Gaussianity in the 2004 gold data that was found in our earlier work. The revised results for the 2007 gold data show that the data are consistent with isotropy and Gaussianity. Our second statistic confirms these results.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Tower platforms, with instrumentation at six levels above the surface to a height of 30 m, were used to record various atmospheric parameters in the surface layer. Sensors for measuring both mean and fluctuating quantities were used, with the majority of them indigenously built. Soil temperature sensors up to a depth of 30 cm from the surface were among the sensors connected to the mean data logger. A PC-based data acquisition system built at the Centre for Atmospheric Sciences, IISc, was used to acquire the data from fast response sensors. This paper reports the various components of a typical MONTBLEX tower observatory and describes the actual experiments carried out in the surface layer at four sites over the monsoon trough region as a part of the MONTBLEX programme. It also describes and discusses several checks made on randomly selected tower data sets acquired during the experiment. Checks made include visual inspection of time traces from various sensors, comparative plots of sensors measuring the same variable, wind and temperature profile plots, calculation of roughness lengths, statistical and stability parameters, diurnal variation of stability parameters, and plots of probability density and energy spectrum for the different sensors. Results from these checks are found to be very encouraging and reveal the potential for further detailed analysis to understand more about surface layer characteristics.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

We introduce a multifield comparison measure for scalar fields that helps in studying relations between them. The comparison measure is insensitive to noise in the scalar fields and to noise in their gradients. Further, it can be computed robustly and efficiently. Results from the visual analysis of various data sets from climate science and combustion applications demonstrate the effective use of the measure.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

In most taxa, species boundaries are inferred based on differences in morphology or DNA sequences revealed by taxonomic or phylogenetic analyses. In crickets, acoustic mating signals or calling songs have species-specific structures and provide a third data set to infer species boundaries. We examined the concordance in species boundaries obtained using acoustic, morphological, and molecular data sets in the field cricket genus Itaropsis. This genus is currently described by only one valid species, Itaropsis tenella, with a broad distribution in western peninsular India and Sri Lanka. Calling songs of males sampled from four sites in peninsular India exhibited significant differences in a number of call features, suggesting the existence of multiple species. Cluster analysis of the acoustic data, molecular phylogenetic analyses, and phylogenetic analyses combining all data sets suggested the existence of three clades. Whatever the differences in calling signals, no full congruence was obtained between all the data sets, even though the resultant lineages were largely concordant with the acoustic clusters. The genus Itaropsis could thus be represented by three morphologically cryptic incipient species in peninsular India; their distributions are congruent with usual patterns of endemism in the Western Ghats, India. Song evolution is analysed through the divergence in syllable period, syllable and call duration, and dominant frequency.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

The rapid growth in the field of data mining has led to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.
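
One of the two ranking schemes mentioned in this abstract is based on attribute value frequencies. The sketch below shows a generic frequency-based score of that kind, assuming that a record whose attribute values are rare across the data set is a more likely outlier. It is not the paper's algorithm: the clustering phase and the paper's own definition of outliers are not reproduced, and the function names and toy records are hypothetical.

```python
from collections import Counter


def frequency_scores(records):
    """Score each record by the mean relative frequency of its attribute values."""
    n = len(records)
    n_attrs = len(records[0])
    # One frequency table per attribute (column).
    counts = [Counter(rec[j] for rec in records) for j in range(n_attrs)]
    return [
        sum(counts[j][rec[j]] for j in range(n_attrs)) / (n * n_attrs)
        for rec in records
    ]


def rank_outliers(records, top=1):
    """Indices of the `top` records with the lowest frequency score."""
    scores = frequency_scores(records)
    # Ties are broken by original index so the ranking is deterministic.
    order = sorted(range(len(records)), key=lambda i: (scores[i], i))
    return order[:top]


records = [
    ("red", "small"),
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("green", "huge"),
]
print(rank_outliers(records, top=1))
```

Computing every score takes a single pass over the data after the frequency tables are built, so the cost of this particular scheme does not grow with the number of outliers requested, apart from the final sort.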

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Outlier detection in high-dimensional categorical data has been a problem of much interest due to the extensive use of qualitative features for describing the data across various application areas. Though there exist various established methods for dealing with the dimensionality aspect through feature selection on numerical data, the categorical domain is still being actively explored. As outlier detection is generally considered an unsupervised learning problem due to lack of knowledge about the nature of various types of outliers, the related feature selection task also needs to be handled in a similar manner. This motivates the need to develop an unsupervised feature selection algorithm for efficient detection of outliers in categorical data. Addressing this aspect, we propose a novel feature selection algorithm based on the mutual information measure and the entropy computation. The redundancy among the features is characterized using the mutual information measure for identifying a suitable feature subset with less redundancy. The performance of the proposed algorithm in comparison with the information gain based feature selection shows its effectiveness for outlier detection. The efficacy of the proposed algorithm is demonstrated on various high-dimensional benchmark data sets employing two existing outlier detection methods.
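
The two quantities named in this abstract, entropy and mutual information, can be estimated for categorical features from empirical frequencies. The sketch below shows only those estimates, using the identity I(X;Y) = H(X) + H(Y) - H(X,Y). How the paper combines them into a selection procedure is not stated in the abstract and is not attempted here; the function names and toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(column):
    """Empirical Shannon entropy of a categorical column, in bits."""
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


colour = ["a", "a", "b", "b"]
copy_of_colour = ["a", "a", "b", "b"]  # fully redundant with `colour`
shape = ["x", "y", "x", "y"]           # independent of `colour` in this sample

print(mutual_information(colour, copy_of_colour))
print(mutual_information(colour, shape))
```

A high mutual information between two features signals redundancy, so a selection step could keep one of the pair. With few records these plug-in estimates are biased, which matters in the high-dimensional setting the abstract targets.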

Relevance:

90.00% 90.00%

Publisher:

Abstract:

This paper proposes a sparse modeling approach to solve ordinal regression problems using Gaussian processes (GP). Designing a sparse GP model is important from training time and inference time viewpoints. We first propose a variant of the Gaussian process ordinal regression (GPOR) approach, leave-one-out GPOR (LOO-GPOR). It performs model selection using the leave-one-out cross-validation (LOO-CV) technique. We then provide an approach to design a sparse model for GPOR. The sparse GPOR model reduces computational time and storage requirements. Further, it provides faster inference. We compare the proposed approaches with the state-of-the-art GPOR approach on some benchmark data sets. Experimental results show that the proposed approaches are competitive.