882 resultados para large scale data gathering


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Dissertação para obtenção do Grau de Doutor em Engenharia Electrotécnica e de Computadores

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Data envelopment analysis (DEA) is the most widely used methods for measuring the efficiency and productivity of decision-making units (DMUs). The need for huge computer resources in terms of memory and CPU time in DEA is inevitable for a large-scale data set, especially with negative measures. In recent years, wide ranges of studies have been conducted in the area of artificial neural network and DEA combined methods. In this study, a supervised feed-forward neural network is proposed to evaluate the efficiency and productivity of large-scale data sets with negative values in contrast to the corresponding DEA method. Results indicate that the proposed network has some computational advantages over the corresponding DEA models; therefore, it can be considered as a useful tool for measuring the efficiency of DMUs with (large-scale) negative data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The seminal multiple-view stereo benchmark evaluations from Middlebury and by Strecha et al. have played a major role in propelling the development of multi-view stereopsis (MVS) methodology. The somewhat small size and variability of these data sets, however, limit their scope and the conclusions that can be derived from them. To facilitate further development within MVS, we here present a new and varied data set consisting of 80 scenes, seen from 49 or 64 accurate camera positions. This is accompanied by accurate structured light scans for reference and evaluation. In addition all images are taken under seven different lighting conditions. As a benchmark and to validate the use of our data set for obtaining reasonable and statistically significant findings about MVS, we have applied the three state-of-the-art MVS algorithms by Campbell et al., Furukawa et al., and Tola et al. to the data set. To do this we have extended the evaluation protocol from the Middlebury evaluation, necessitated by the more complex geometry of some of our scenes. The data set and accompanying evaluation framework are made freely available online. Based on this evaluation, we are able to observe several characteristics of state-of-the-art MVS, e.g. that there is a tradeoff between the quality of the reconstructed 3D points (accuracy) and how much of an object’s surface is captured (completeness). Also, several issues that we hypothesized would challenge MVS, such as specularities and changing lighting conditions did not pose serious problems. Our study finds that the two most pressing issues for MVS are lack of texture and meshing (forming 3D points into closed triangulated surfaces).

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As massive data sets become increasingly available, people are facing the problem of how to effectively process and understand these data. Traditional sequential computing models are giving way to parallel and distributed computing models, such as MapReduce, both due to the large size of the data sets and their high dimensionality. This dissertation, as in the same direction of other researches that are based on MapReduce, tries to develop effective techniques and applications using MapReduce that can help people solve large-scale problems. Three different problems are tackled in the dissertation. The first one deals with processing terabytes of raster data in a spatial data management system. Aerial imagery files are broken into tiles to enable data parallel computation. The second and third problems deal with dimension reduction techniques that can be used to handle data sets of high dimensionality. Three variants of the nonnegative matrix factorization technique are scaled up to factorize matrices of dimensions in the order of millions in MapReduce based on different matrix multiplication implementations. Two algorithms, which compute CANDECOMP/PARAFAC and Tucker tensor decompositions respectively, are parallelized in MapReduce based on carefully partitioning the data and arranging the computation to maximize data locality and parallelism.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We present a new framework for large-scale data clustering. The main idea is to modify functional dimensionality reduction techniques to directly optimize over discrete labels using stochastic gradient descent. Compared to methods like spectral clustering our approach solves a single optimization problem, rather than an ad-hoc two-stage optimization approach, does not require a matrix inversion, can easily encode prior knowledge in the set of implementable functions, and does not have an ?out-of-sample? problem. Experimental results on both artificial and real-world datasets show the usefulness of our approach.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed. Also, synthetic positive control was used to validate Random Forests method. Throughout this study we showed that Random Forests can deal with large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. ^ Random Forests can deal with the large-scale data sets without rigorous data preprocessing. It has robust variable importance ranking measure. Proposed is a novel variable selection method in context of Random Forests that uses the data noise level as the cut-off value to determine the subset of the important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors for complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. ^ When the data set had high variables to observations ratio, Random Forests complemented the established logistic regression. This study suggested that Random Forests is recommended for such high dimensionality data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations. ^ We also found that the mean decrease of accuracy is a more reliable variable ranking measurement than mean decrease of Gini. ^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Thesis (Ph.D.)--University of Washington, 2016-06

Relevância:

100.00% 100.00%

Publicador:

Resumo:

High-throughput technologies are now used to generate more than one type of data from the same biological samples. To properly integrate such data, we propose using co-modules, which describe coherent patterns across paired data sets, and conceive several modular methods for their identification. We first test these methods using in silico data, demonstrating that the integrative scheme of our Ping-Pong Algorithm uncovers drug-gene associations more accurately when considering noisy or complex data. Second, we provide an extensive comparative study using the gene-expression and drug-response data from the NCI-60 cell lines. Using information from the DrugBank and the Connectivity Map databases we show that the Ping-Pong Algorithm predicts drug-gene associations significantly better than other methods. Co-modules provide insights into possible mechanisms of action for a wide range of drugs and suggest new targets for therapy

Relevância:

100.00% 100.00%

Publicador:

Resumo:

SUMMARY: We present a tool designed for visualization of large-scale genetic and genomic data exemplified by results from genome-wide association studies. This software provides an integrated framework to facilitate the interpretation of SNP association studies in genomic context. Gene annotations can be retrieved from Ensembl, linkage disequilibrium data downloaded from HapMap and custom data imported in BED or WIG format. AssociationViewer integrates functionalities that enable the aggregation or intersection of data tracks. It implements an efficient cache system and allows the display of several, very large-scale genomic datasets. AVAILABILITY: The Java code for AssociationViewer is distributed under the GNU General Public Licence and has been tested on Microsoft Windows XP, MacOSX and GNU/Linux operating systems. It is available from the SourceForge repository. This also includes Java webstart, documentation and example datafiles.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Large scale image mosaicing methods are in great demand among scientists who study different aspects of the seabed, and have been fostered by impressive advances in the capabilities of underwater robots in gathering optical data from the seafloor. Cost and weight constraints mean that lowcost Remotely operated vehicles (ROVs) usually have a very limited number of sensors. When a low-cost robot carries out a seafloor survey using a down-looking camera, it usually follows a predetermined trajectory that provides several non time-consecutive overlapping image pairs. Finding these pairs (a process known as topology estimation) is indispensable to obtaining globally consistent mosaics and accurate trajectory estimates, which are necessary for a global view of the surveyed area, especially when optical sensors are the only data source. This thesis presents a set of consistent methods aimed at creating large area image mosaics from optical data obtained during surveys with low-cost underwater vehicles. First, a global alignment method developed within a Feature-based image mosaicing (FIM) framework, where nonlinear minimisation is substituted by two linear steps, is discussed. Then, a simple four-point mosaic rectifying method is proposed to reduce distortions that might occur due to lens distortions, error accumulation and the difficulties of optical imaging in an underwater medium. The topology estimation problem is addressed by means of an augmented state and extended Kalman filter combined framework, aimed at minimising the total number of matching attempts and simultaneously obtaining the best possible trajectory. Potential image pairs are predicted by taking into account the uncertainty in the trajectory. The contribution of matching an image pair is investigated using information theory principles. Lastly, a different solution to the topology estimation problem is proposed in a bundle adjustment framework. Innovative aspects include the use of fast image similarity criterion combined with a Minimum spanning tree (MST) solution, to obtain a tentative topology. This topology is improved by attempting image matching with the pairs for which there is the most overlap evidence. Unlike previous approaches for large-area mosaicing, our framework is able to deal naturally with cases where time-consecutive images cannot be matched successfully, such as completely unordered sets. Finally, the efficiency of the proposed methods is discussed and a comparison made with other state-of-the-art approaches, using a series of challenging datasets in underwater scenarios

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Climate-G is a large scale distributed testbed devoted to climate change research. It is an unfunded effort started in 2008 and involving a wide community both in Europe and US. The testbed is an interdisciplinary effort involving partners from several institutions and joining expertise in the field of climate change and computational science. Its main goal is to allow scientists carrying out geographical and cross-institutional data discovery, access, analysis, visualization and sharing of climate data. It represents an attempt to address, in a real environment, challenging data and metadata management issues. This paper presents a complete overview about the Climate-G testbed highlighting the most important results that have been achieved since the beginning of this project.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The time evolution of the circulation change at the end of the Baiu season is investigated using ERA40 data. An end-day is defined for each of the 23 years based on the 850 hPa θe value at 40˚Nin the 130-140˚E sector exceeding 330 K. Daily time series of variables are composited with respect to this day. These composite time-series exhibit a clearer and more rapid change in the precipitation and the large-scale circulation over the whole East Asia region than those performed using calendar days. The precipitation change includes the abrupt end of the Baiu rain, the northward shift of tropical convection perhaps starting a few days before this, and the start of the heavier rain at higher latitudes. The northward migration of lower tropospheric warm, moist tropical air, a general feature of the seasonal march in the region, is fast over the continent and slow over the ocean. By mid to late July the cooler air over the Sea of Japan is surrounded on 3 sides by the tropical air. It is suggestive that the large-scale stage has been set for a jump to the post-Baiu state, i.e., for the end of the Baiu season. Two likely triggers for the actual change emerge from the analysis. The first is the northward movement of tropical convection into the Philippine region. The second is an equivalent barotropic Rossby wave-train, that over a 10-day period develops downstream across Eurasia. It appears likely that in most years one or both mechanisms can be important in triggering the actual end of the Baiu season.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The best irrigation management depends on accurate estimation of reference evapotranspiration (ET0) and then selection of the appropriate crop coefficient for each phenological stage. However, the evaluation of water productivity on a large scale can be done by using actual evapotranspiration (ETa), determined by coupling agrometeorological and remote sensing data. This paper describes methodologies used for estimating ETa for 20 centerpivots using three different approaches: the traditional FAO crop coefficient (K-c) method and two remote sensing algorithms, one called SEBAL and other named TEIXEIRA. The methods were applied to one Landsat 5 Thematic Mapper image acquired in July 2010 over the Northwest portion of the Sao Paulo State, Brazil. The corn, bean and sugar cane crops are grown under center pivot sprinkler irrigation. ET0 was calculated by the Penman-Monteith method with data from one automated weather station close to the study site. The results showed that for the crops at effective full cover, SEBAL and TEIXEIRA's methods agreed well comparing with the traditional method. However, both remote sensing methods overestimated ETa according to the degree of exposed soil, with the TEIXEIRA method presenting closer ETa values with those resulted from the traditional FAO K-c method. This study showed that remote sensing algorithms can be useful tools for monitoring and establishing realistic K-c values to further determine ETa on a large scale. However, several images during the growing seasons must be used to establish the necessary adjustments to the traditional FAO crop coefficient method.