8 results for Large Data Sets

in Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)


Relevance: 100.00%

Abstract:

Most multidimensional projection techniques rely on distance (dissimilarity) information between data instances to embed high-dimensional data into a visual space. When data are endowed with Cartesian coordinates, an extra computational effort is necessary to compute the needed distances, making multidimensional projection prohibitive in applications dealing with interactivity and massive data. The novel multidimensional projection technique proposed in this work, called Part-Linear Multidimensional Projection (PLMP), has been tailored to handle multivariate data represented in Cartesian high-dimensional spaces, requiring only distance information between pairs of representative samples. This characteristic renders PLMP faster than previous methods when processing large data sets while still being competitive in terms of precision. Moreover, knowing the range of variation for data instances in the high-dimensional space, we can make PLMP a truly streaming data projection technique, a trait absent in previous methods.
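
The abstract does not give the construction, but its key idea can be sketched: only a small set of representative samples needs a distance-based embedding, after which the remaining points are mapped by a single fitted linear transformation. Below is a minimal sketch under our own assumptions; the use of scikit-learn's MDS for the sample projection and the name fit_plmp are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of a PLMP-style part-linear projection (illustrative only).
# Only the representative samples need pairwise distances; the remaining
# points are mapped by one linear transformation fitted on those samples.
import numpy as np
from sklearn.manifold import MDS

def fit_plmp(X, n_samples=200, random_state=0):
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    S = X[idx]                                  # representative samples (k x d)
    # Distance-based embedding of the samples only (the costly step).
    Y_s = MDS(n_components=2, dissimilarity='euclidean',
              random_state=random_state).fit_transform(S)
    # Least-squares linear map P (d x 2) with S @ P ~= Y_s.
    P, *_ = np.linalg.lstsq(S, Y_s, rcond=None)
    return P

X = np.random.rand(10000, 50)     # synthetic high-dimensional data
P = fit_plmp(X)
Y = X @ P                         # project everything with one matrix product
print(Y.shape)                    # (10000, 2)
```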

Relevance: 100.00%

Abstract:

Astronomy has evolved almost exclusively through the use of spectroscopic and imaging techniques, operated separately. With the development of modern technologies, it is possible to obtain data cubes that combine both techniques simultaneously, producing images with spectral resolution. Extracting information from them can be quite complex, and hence the development of new methods of data analysis is desirable. We present a method for the analysis of data cubes (data from single-field observations, containing two spatial dimensions and one spectral dimension) that uses Principal Component Analysis (PCA) to express the data in a form of reduced dimensionality, facilitating efficient information extraction from very large data sets. PCA transforms the system of correlated coordinates into a system of uncorrelated coordinates ordered by principal components of decreasing variance. The new coordinates are referred to as eigenvectors, and the projections of the data onto these coordinates produce images we call tomograms. The association of the tomograms (images) with the eigenvectors (spectra) is important for the interpretation of both. The eigenvectors are mutually orthogonal, and this property is fundamental for their handling and interpretation. When the data cube shows objects that present uncorrelated physical phenomena, the eigenvectors' orthogonality may be instrumental in separating and identifying them. By handling eigenvectors and tomograms, one can enhance features, extract noise, compress data, extract spectra, etc. We applied the method, for illustration purposes only, to the central region of the low-ionization nuclear emission region (LINER) galaxy NGC 4736, and demonstrate that it has a type 1 active nucleus, not previously known. Furthermore, we show that it is displaced from the centre of its stellar bulge.
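
The cube-to-matrix reshaping behind this kind of PCA analysis can be illustrated in a few lines. The sketch below uses a synthetic cube; the variable names are ours.

```python
# Minimal sketch of PCA applied to a data cube (two spatial + one spectral axis).
import numpy as np

cube = np.random.rand(40, 40, 300)            # (x, y, wavelength), synthetic
nx, ny, nl = cube.shape
M = cube.reshape(nx * ny, nl)                 # one spectrum per row
M = M - M.mean(axis=0)                        # centre each wavelength channel

# Eigenvectors of the spectral covariance matrix are the "eigenspectra".
cov = M.T @ M / (M.shape[0] - 1)
eigval, eigvec = np.linalg.eigh(cov)
order = np.argsort(eigval)[::-1]              # decreasing variance
eigval, eigvec = eigval[order], eigvec[:, order]

# Projecting the data onto an eigenvector gives a "tomogram" image.
tomogram_1 = (M @ eigvec[:, 0]).reshape(nx, ny)
print(eigval[:3], tomogram_1.shape)
```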

Relevance: 100.00%

Abstract:

In Information Visualization, adding and removing data elements can strongly impact the underlying visual space. We have developed an inherently incremental technique (incBoard) that maintains a coherent disposition of elements from a dynamic multidimensional data set on a 2D grid as the set changes. Here, we introduce a novel layout, incSpace, that uses pairwise similarity from grid neighbors, as defined in incBoard, to reposition elements on the visual space, free from the constraints imposed by the grid. The board continues to be updated and can be displayed alongside the new space. As similar items are placed together, while dissimilar neighbors are moved apart, the technique supports users in the identification of clusters and subsets of related elements. Densely populated areas identified in the incSpace layout can be efficiently explored with the corresponding incBoard visualization, which is not susceptible to occlusion. The solution remains inherently incremental and maintains a coherent disposition of elements, even for fully renewed sets. The algorithm considers relative positions for the initial placement of elements, and raw dissimilarity to fine-tune the visualization. It has low computational cost, with complexity depending only on the size of the currently viewed subset, V. Thus, a data set of size N can be sequentially displayed in O(N) time, reaching O(N²) only if the complete set is displayed simultaneously.
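
The published incBoard/incSpace procedures are not reproduced here; the toy sketch below only conveys the incremental flavour of the approach: each arriving element is placed in a free grid cell adjacent to its most similar already-placed element. The function name and the spiral search are our illustrative choices.

```python
# Toy sketch of incremental grid placement in the spirit of incBoard
# (not the published algorithm): each new element lands in a free cell
# adjacent to its most similar already-placed element.
import numpy as np

def place_incrementally(X):
    pos = {0: (0, 0)}                          # item index -> grid cell
    occupied = {(0, 0)}
    for i in range(1, len(X)):
        placed = list(pos)
        dists = [np.linalg.norm(X[i] - X[j]) for j in placed]
        anchor = pos[placed[int(np.argmin(dists))]]  # most similar neighbor
        # Spiral outwards from the anchor until a free cell is found.
        r = 1
        while True:
            ring = [(anchor[0] + dx, anchor[1] + dy)
                    for dx in range(-r, r + 1) for dy in range(-r, r + 1)
                    if max(abs(dx), abs(dy)) == r]
            free = [c for c in ring if c not in occupied]
            if free:
                pos[i] = free[0]
                occupied.add(free[0])
                break
            r += 1
    return pos

X = np.random.rand(30, 8)
print(sorted(place_incrementally(X).items())[:5])
```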

Relevance: 100.00%

Abstract:

Precipitation and temperature climate indices are calculated using the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis and validated against observational data from some stations over Brazil and other data sources. The spatial patterns of the climate index trends are analyzed for the period 1961-1990 over South America. In addition, correlation and linear regression coefficients for some specific stations were also obtained in order to compare with the reanalysis data. In general, the results suggest that the NCEP/NCAR reanalysis can provide useful information about minimum temperature and consecutive dry days indices at individual grid cells in Brazil. However, some regional differences in the climate index trends are observed when different data sets are compared. For instance, the NCEP/NCAR reanalysis shows a reversed signal for all annual rainfall indices and the cold night index over Argentina. Despite these differences, maps of the trends for most of the annual climate indices obtained from the NCEP/NCAR reanalysis and BRANT analysis are generally in good agreement with other available data sources and previous findings in the literature for large areas of southern South America. The pattern of trends for the annual precipitation indices over the 30 years analyzed indicates a change to wetter conditions over southern and southeastern parts of Brazil, Paraguay, Uruguay, central and northern Argentina, and parts of Chile, and a decrease over southwestern South America. All over South America, the climate indices related to minimum temperature (warm or cold nights) clearly show a warming tendency; however, no consistent changes in maximum temperature extremes (warm and cold days) have been observed. Therefore, one must be careful before suggesting any trends for warm or cold days.
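
As a concrete illustration of one of the indices discussed, the sketch below computes the consecutive dry days (CDD) index per year from a synthetic daily precipitation series and fits a linear trend. The 1 mm dry-day threshold follows the common ETCCDI convention; the data are fabricated for illustration.

```python
# Minimal sketch: consecutive dry days (CDD) index per year and its trend.
import numpy as np

def cdd(daily_precip_mm, wet_threshold=1.0):
    """Longest run of days with precipitation below the threshold."""
    longest = run = 0
    for p in daily_precip_mm:
        run = run + 1 if p < wet_threshold else 0
        longest = max(longest, run)
    return longest

rng = np.random.default_rng(42)
years = np.arange(1961, 1991)
series = [rng.exponential(3.0, 365) * (rng.random(365) < 0.4) for _ in years]
index = np.array([cdd(y) for y in series])

slope, intercept = np.polyfit(years, index, 1)   # linear trend (days/year)
print(f"CDD trend: {slope:.3f} days/year")
```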

Relevance: 90.00%

Abstract:

Phylogenetic analyses of representative species from the five genera of Winteraceae (Drimys, Pseudowintera, Takhtajania, Tasmannia, and Zygogynum s.l.) were performed using ITS nuclear sequences and a combined data-set of ITS + psbA-trnH + rpS16 sequences (sampling of 30 and 15 species, respectively). Indel informativity using simple gap coding or gaps as a fifth character was examined in both data-sets. Parsimony and Bayesian analyses support the monophyly of Drimys, Tasmannia, and Zygogynum s.l., but do not support the monophyly of Belliolum, Zygogynum s.s., and Bubbia. Within Drimys, the combined data-set recovers two subclades. Divergence time estimates suggest that the splitting between Drimys and its sister clade (Pseudowintera + Zygogynum s.l.) occurred around the end of the Cretaceous; in contrast, the divergence between the two subclades within Drimys is more recent (15.5-18.5 MY) and coincides in time with the Andean uplift. Estimates suggest that the earliest divergences within Winteraceae could have predated the first events of Gondwana fragmentation. (C) 2009 Elsevier Inc. All rights reserved.
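
The gap coding mentioned above can be illustrated with a toy example: under simple indel coding, each distinct gap span (same start and end position) in the alignment becomes one binary presence/absence character appended to the data matrix. The three-taxon alignment below is fabricated for illustration.

```python
# Toy sketch of simple gap coding: every distinct gap span (same start and
# end) in the alignment becomes one binary presence/absence character.
import re

alignment = {
    "sp1": "ATG--CGTA",
    "sp2": "ATG--CGTA",
    "sp3": "ATGTTCG-A",
}

# Collect distinct (start, end) gap spans across all sequences.
spans = sorted({(m.start(), m.end())
                for seq in alignment.values()
                for m in re.finditer(r"-+", seq)})

for name, seq in alignment.items():
    codes = ["1" if any(m.span() == s for m in re.finditer(r"-+", seq)) else "0"
             for s in spans]
    print(name, "".join(codes))
# sp1 10, sp2 10, sp3 01 -> two indel characters appended to the matrix
```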

Relevance: 90.00%

Abstract:

This paper is concerned with the computational efficiency of fuzzy clustering algorithms when the data set to be clustered is described by a proximity matrix only (relational data) and the number of clusters must be automatically estimated from such data. A fuzzy variant of an evolutionary algorithm for relational clustering is derived and compared against two systematic (pseudo-exhaustive) approaches that can also be used to automatically estimate the number of fuzzy clusters in relational data. An extensive collection of experiments involving 18 artificial and two real data sets is reported and analyzed. (C) 2011 Elsevier B.V. All rights reserved.
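
The systematic (pseudo-exhaustive) baseline can be sketched under our own assumptions: embed the proximity matrix with metric MDS, run a textbook fuzzy c-means for each candidate number of clusters, and keep the value that maximizes a validity index (here Bezdek's partition coefficient). This is an illustration of the baseline idea, not the evolutionary algorithm derived in the paper.

```python
# Sketch: estimate the number of fuzzy clusters from relational data
# (a proximity matrix D) by exhaustive search over candidate values.
import numpy as np
from sklearn.manifold import MDS

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))       # fuzzy memberships
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))                # standard FCM update
        U = inv / inv.sum(axis=1, keepdims=True)
    return U

def estimate_k(D, k_range=range(2, 8)):
    X = MDS(n_components=2, dissimilarity='precomputed',
            random_state=0).fit_transform(D)
    # Bezdek's partition coefficient: (1/n) * sum of squared memberships.
    pc = {c: (fuzzy_cmeans(X, c) ** 2).sum() / len(X) for c in k_range}
    return max(pc, key=pc.get)

pts = np.vstack([np.random.randn(30, 2) + off for off in ((0, 0), (6, 6), (0, 6))])
D = np.linalg.norm(pts[:, None] - pts[None], axis=2)  # relational data
print(estimate_k(D))                                  # likely 3
```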

Relevance: 90.00%

Abstract:

In this paper, we present an algorithm for cluster analysis that integrates aspects of cluster ensembles and multi-objective clustering. The algorithm is based on a Pareto-based multi-objective genetic algorithm, with a special crossover operator, which uses clustering validation measures as objective functions. The proposed algorithm can deal with data sets presenting different types of clusters, without the need for expertise in cluster analysis. Its result is a concise set of partitions representing alternative trade-offs among the objective functions. We compare the results obtained with our algorithm, in the context of gene expression data sets, to those achieved with multi-objective clustering with automatic K-determination (MOCK), the algorithm most closely related to ours. (C) 2009 Elsevier B.V. All rights reserved.
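
The Pareto-based selection step can be illustrated independently of the genetic algorithm: score candidate partitions on two validation measures and retain the non-dominated ones. The objectives below (k-means inertia and negative silhouette) are our illustrative stand-ins, not the measures used in the paper.

```python
# Sketch of multi-objective partition selection: keep the Pareto-optimal
# (non-dominated) partitions under two validation measures, both minimized.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

candidates = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Objectives: compactness (inertia) and -silhouette, both to minimize.
    candidates.append((k, km.inertia_, -silhouette_score(X, km.labels_)))

def pareto_front(items):
    front = []
    for a in items:
        dominated = any(b[1] <= a[1] and b[2] <= a[2] and
                        (b[1] < a[1] or b[2] < a[2]) for b in items)
        if not dominated:
            front.append(a)
    return front

for k, inertia, neg_sil in pareto_front(candidates):
    print(f"k={k}  inertia={inertia:.1f}  silhouette={-neg_sil:.3f}")
```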

Relevance: 90.00%

Abstract:

In this paper we introduce the Weibull power series (WPS) class of distributions, which is obtained by compounding Weibull and power series distributions, where the compounding procedure follows the same approach previously carried out by Adamidis and Loukas (1998). This new class of distributions has as a particular case the two-parameter exponential power series (EPS) class of distributions (Chahkandi and Ganjali, 2009), which contains several lifetime models such as the exponential geometric (Adamidis and Loukas, 1998), exponential Poisson (Kus, 2007), and exponential logarithmic (Tahmasbi and Rezaei, 2008) distributions. The hazard function of our class can be increasing, decreasing, and upside-down bathtub shaped, among others, while the hazard function of an EPS distribution is only decreasing. We obtain several properties of the WPS distributions, such as moments, order statistics, estimation by maximum likelihood, and inference for large samples. Furthermore, the EM algorithm is also used to determine the maximum likelihood estimates of the parameters, and we discuss maximum entropy characterizations under suitable constraints. Special distributions are studied in some detail. Applications to two real data sets are given to show the flexibility and potential of the new class of distributions. (C) 2010 Elsevier B.V. All rights reserved.
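
A minimal sketch of one member of the class, the Weibull-geometric distribution (Weibull compounded with a geometric power series), is given below. We simulate data and fit by direct numerical maximum likelihood rather than the EM algorithm discussed in the paper; all data and starting values are illustrative.

```python
# Sketch of the Weibull-geometric special case of the WPS class:
# X = min(W_1, ..., W_N), W_i ~ Weibull(shape a, scale b), N geometric.
# Density used below, derived from S(x) = (1-p) S_W(x) / (1 - p S_W(x)):
#   f(x) = (1 - p) f_W(x) / (1 - p * S_W(x))**2
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min, geom

rng = np.random.default_rng(1)
a, b, p = 1.5, 2.0, 0.6                       # true shape, scale, geometric p
N = geom.rvs(1 - p, size=2000, random_state=1)          # P(N=n)=(1-p)p^(n-1)
x = np.array([weibull_min.rvs(a, scale=b, size=n, random_state=rng).min()
              for n in N])

def neg_loglik(theta):
    a_, b_, p_ = theta
    if a_ <= 0 or b_ <= 0 or not 0 < p_ < 1:
        return np.inf
    fw = weibull_min.pdf(x, a_, scale=b_)
    sw = weibull_min.sf(x, a_, scale=b_)
    return -np.sum(np.log((1 - p_) * fw) - 2 * np.log(1 - p_ * sw))

fit = minimize(neg_loglik, x0=[1.0, 1.0, 0.5], method='Nelder-Mead')
print(fit.x)                                   # roughly (1.5, 2.0, 0.6)
```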