981 results for Large datasets


Relevance:

100.00%

Publisher:

Abstract:

Advances in hardware and software technology enable us to collect, store and distribute large quantities of data on a very large scale. Automatically discovering and extracting hidden knowledge in the form of patterns from these large data volumes is known as data mining. Data mining technology is not only a part of business intelligence, but is also used in many other application areas such as research, marketing and financial analytics. For example, medical scientists can use patterns extracted from historic patient data to determine whether a new patient is likely to respond positively to a particular treatment; marketing analysts can use patterns extracted from customer data for future advertisement campaigns; finance experts have an interest in patterns that forecast the development of certain stock market shares for investment recommendations. However, extracting knowledge in the form of patterns from massive data volumes imposes a number of computational challenges in terms of processing time, memory, bandwidth and power consumption. These challenges have led to the development of parallel and distributed data analysis approaches and the utilisation of Grid and Cloud computing. This chapter gives an overview of parallel and distributed computing approaches and how they can be used to scale up data mining to large datasets.
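
For a flavour of the data-parallel approach, the sketch below partitions the data, counts candidate patterns locally on each partition, and merges the partial counts; the pair-counting task and all names are illustrative, not from the chapter.

    # Data-parallel co-occurrence counting in a map-reduce style: each
    # worker counts item pairs on its own partition, and the partial
    # counts are merged at the end.
    from collections import Counter
    from itertools import combinations
    from multiprocessing import Pool

    def count_pairs(transactions):
        """Count co-occurring item pairs in one data partition."""
        counts = Counter()
        for items in transactions:
            counts.update(combinations(sorted(set(items)), 2))
        return counts

    def parallel_pair_counts(transactions, n_workers=4):
        """Split the data, count locally in parallel, then merge."""
        chunks = [transactions[i::n_workers] for i in range(n_workers)]
        with Pool(n_workers) as pool:
            partials = pool.map(count_pairs, chunks)
        total = Counter()
        for c in partials:
            total.update(c)
        return total

    if __name__ == "__main__":
        data = [["milk", "bread"], ["milk", "eggs", "bread"], ["eggs", "bread"]]
        print(parallel_pair_counts(data, n_workers=2).most_common(3))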

Relevance:

100.00%

Publisher:

Abstract:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevance:

100.00%

Publisher:

Abstract:

In this work, a new approach for supervised pattern recognition is presented that improves the learning algorithm of the Optimum-Path Forest (OPF) classifier, centered on the detection and elimination of outliers in the training set. Outliers are identified through a penalty computed for each training sample from the number of false-positive and false-negative classifications imputable to it. This approach enhances the accuracy of OPF while also reducing classification time, at the expense of a slight increase in training time. © 2010 Springer-Verlag.
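
The penalty bookkeeping can be sketched as follows, with a 1-NN classifier standing in for OPF; the leave-one-out scheme and the penalty threshold are assumptions for illustration, not the paper's exact rule.

    # Score each training sample by how many misclassifications can be
    # imputed to it, then drop the worst offenders before final training.
    import numpy as np

    def outlier_penalties(X, y):
        """Penalty = number of other samples this sample misclassifies
        as their nearest neighbour (leave-one-out)."""
        penalties = np.zeros(len(X), dtype=int)
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                      # leave sample i out
            nn = int(np.argmin(d))
            if y[nn] != y[i]:                  # sample i was misclassified...
                penalties[nn] += 1             # ...and nn gets the blame
        return penalties

    def prune_training_set(X, y, max_penalty=1):
        keep = outlier_penalties(X, y) <= max_penalty
        return X[keep], y[keep]

    X = np.array([[0.0, 0], [0.1, 0], [5.0, 5], [5.1, 5], [0.05, 0.05]])
    y = np.array([0, 0, 1, 1, 1])              # last sample is a likely outlier
    Xc, yc = prune_training_set(X, y)
    print(len(X), "->", len(Xc), "training samples")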

Relevance:

100.00%

Publisher:

Abstract:

Gender recognition based on facial appearance has achieved impressive results on controlled datasets; its application to in-the-wild and large datasets remains a challenging task. In this paper, we apply classical techniques and analyze their performance under uncontrolled and controlled conditions with the LFW and MORPH datasets, respectively. For both sets, the benchmarking protocol follows the 5-fold cross-validation proposed by the BEFIT challenge.
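
A sketch of the benchmarking protocol with one classical pipeline (PCA features plus a linear SVM) might look as follows; feature extraction from the LFW/MORPH face images is assumed to have happened already, and scikit-learn is assumed to be available.

    # 5-fold cross-validation of a classical gender classifier; X holds
    # one feature vector per face (synthetic stand-in here), y the labels.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))            # stand-in for face descriptors
    y = rng.integers(0, 2, size=200)          # synthetic gender labels

    clf = make_pipeline(StandardScaler(), PCA(n_components=20), LinearSVC())
    scores = []
    for train, test in StratifiedKFold(n_splits=5, shuffle=True,
                                       random_state=0).split(X, y):
        clf.fit(X[train], y[train])
        scores.append(clf.score(X[test], y[test]))
    print("mean accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))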

Relevance:

100.00%

Publisher:

Abstract:

In epidemiological work, outcomes are frequently non-normal, sample sizes may be large, and effects are often small. To relate health outcomes to geographic risk factors, fast and powerful methods for fitting spatial models, particularly for non-normal data, are required. We focus on binary outcomes, with the risk surface a smooth function of space. We compare penalized likelihood models, including the penalized quasi-likelihood (PQL) approach, and Bayesian models based on fit, speed, and ease of implementation. A Bayesian model using a spectral basis representation of the spatial surface provides the best tradeoff of sensitivity and specificity in simulations, detecting real spatial features while limiting overfitting and being more efficient computationally than other Bayesian approaches. One of the contributions of this work is further development of this underused representation. The spectral basis model outperforms the penalized likelihood methods, which are prone to overfitting, but is slower to fit and not as easily implemented. Results based on a real dataset of cancer cases in Taiwan are similar, albeit less decisive in comparing the approaches. The success of the spectral basis with binary data and similar results with count data suggest that it may be generally useful in spatial models and more complicated hierarchical models.
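
To illustrate how such a surface can be represented, here is a minimal sketch of a low-frequency Fourier (spectral) basis, with a penalized logistic fit standing in for the full Bayesian inference; the frequency cut-off and the penalty are assumptions for illustration.

    # A smooth spatial risk surface for binary outcomes, expressed in a
    # small set of 2-D Fourier basis functions of the coordinates.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def spectral_basis(coords, n_freq=4):
        """Low-frequency 2-D Fourier features of coordinates in [0,1]^2."""
        feats = [np.ones(len(coords))]
        for kx in range(n_freq):
            for ky in range(n_freq):
                phase = 2 * np.pi * (kx * coords[:, 0] + ky * coords[:, 1])
                feats += [np.cos(phase), np.sin(phase)]
        return np.column_stack(feats)

    rng = np.random.default_rng(1)
    coords = rng.uniform(size=(500, 2))
    true_risk = 1 / (1 + np.exp(-3 * np.sin(2 * np.pi * coords[:, 0])))
    y = rng.binomial(1, true_risk)

    B = spectral_basis(coords)
    model = LogisticRegression(C=1.0, max_iter=1000).fit(B, y)  # C ~ shrinkage
    print("fitted risk at centre:", model.predict_proba(
        spectral_basis(np.array([[0.5, 0.5]])))[0, 1])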

Relevance:

100.00%

Publisher:

Abstract:

With the ability to collect and store increasingly large datasets on modern computers comes the need to process the data in a way that is useful to a Geostatistician or application scientist. Although the storage requirements scale only linearly with the number of observations in the dataset, the computational complexity of likelihood-based Geostatistics scales quadratically in memory and cubically in processing time. Various methods have been proposed, and are extensively used, in an attempt to overcome these complexity issues. This thesis introduces a number of principled techniques for treating large datasets, with an emphasis on three main areas: reduced complexity covariance matrices, sparsity in the covariance matrix, and parallel algorithms for distributed computation. These techniques are presented individually, but it is also shown how they can be combined to produce techniques for further improving computational efficiency.
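
As an illustration of the first theme, the following sketch shows a reduced-rank (Nystroem-type) covariance approximation with a Woodbury-style solve, so that only an m x m system is ever factorized (m << n); the kernel and the inducing-point choice are assumptions.

    # Low-rank covariance approximation: K ~= Knm Kmm^{-1} Knm^T, so that
    # (K + s2 I) v = y is solved via an m x m inner system (Woodbury).
    import numpy as np

    def rbf(A, B, ls=0.3):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls**2)

    rng = np.random.default_rng(2)
    X = rng.uniform(size=(2000, 2))               # n observation locations
    Z = X[rng.choice(len(X), 50, replace=False)]  # m inducing locations

    Knm = rbf(X, Z)                               # n x m cross-covariance
    Kmm = rbf(Z, Z) + 1e-8 * np.eye(len(Z))       # m x m, jittered
    s2, y = 0.1, rng.normal(size=len(X))
    A = Kmm * s2 + Knm.T @ Knm                    # m x m inner matrix
    v = (y - Knm @ np.linalg.solve(A, Knm.T @ y)) / s2
    print("solved length-%d system via a %dx%d one" % (len(X), *A.shape))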

Relevance:

100.00%

Publisher:

Abstract:

Heterogeneous datasets arise naturally in most applications due to the use of a variety of sensors and measuring platforms. Such datasets can be heterogeneous in terms of their error characteristics and sensor models. Treating such data is most naturally accomplished using a Bayesian or model-based geostatistical approach; however, such methods generally scale rather badly with the size of the dataset and require computationally expensive Monte Carlo based inference. Recently, within the machine learning and spatial statistics communities, many papers have explored the potential of reduced rank representations of the covariance matrix, often referred to as projected or fixed rank approaches. In such methods the covariance function of the posterior process is represented by a reduced rank approximation chosen such that there is minimal information loss. In this paper a sequential Bayesian framework for inference in such projected processes is presented. The observations are considered one at a time, which avoids the high dimensional integrals typically required in a Bayesian approach. A C++ library, gptk, which is part of the INTAMAP web service, is introduced; it implements projected, sequential estimation and adds several novel features, in particular the ability to use a generic observation operator, or sensor model, to permit data fusion. It is also possible to cope with a range of observation error characteristics, including non-Gaussian observation errors. Inference for the covariance parameters is explored, including the impact of the projected process approximation on likelihood profiles. We illustrate the projected sequential method in application to synthetic and real datasets. Limitations and extensions are discussed. © 2010 Elsevier Ltd.
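
The one-at-a-time flavour of the update can be sketched with a toy Python analogue (gptk itself is C++): a fixed projection basis, a scalar observation operator as the sensor model, and conjugate Gaussian updates. All of these concrete choices are assumptions for illustration.

    # Sequential assimilation: with the process held in a fixed low-rank
    # basis, each observation triggers a cheap conjugate Gaussian update,
    # so no high-dimensional integral is ever formed.
    import numpy as np

    def h(x):                                  # fixed projection basis (assumed)
        return np.array([1.0, x, x**2])

    mu = np.zeros(3)                           # prior mean of basis weights
    S = np.eye(3)                              # prior covariance of weights

    def assimilate(mu, S, x, y_obs, noise_var=0.05, obs_op=1.0):
        """One sequential update; obs_op is a (scalar) sensor model."""
        H = obs_op * h(x)                      # observation vector in the basis
        v = S @ H
        gain = v / (H @ v + noise_var)         # Kalman-style gain
        return mu + gain * (y_obs - H @ mu), S - np.outer(gain, v)

    rng = np.random.default_rng(3)
    for x in rng.uniform(-1, 1, size=100):     # stream the observations
        mu, S = assimilate(mu, S, x, 0.5 + 2 * x + rng.normal(0, 0.2))
    print("posterior weights:", np.round(mu, 2))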

Relevance:

80.00%

Publisher:

Abstract:

In the earth sciences, data are commonly cast on complex grids in order to model irregular domains such as coastlines, or to evenly distribute grid points over the globe. It is common for a scientist to wish to re-cast such data onto a grid that is more amenable to manipulation, visualization, or comparison with other data sources. The complexity of the grids presents a significant technical difficulty to the regridding process. In particular, the regridding of complex grids may suffer from severe performance issues, in the worst case scaling with the product of the sizes of the source and destination grids. We present a mechanism for the fast regridding of such datasets, based upon the construction of a spatial index that allows fast searching of the source grid. We discover that the most efficient spatial index under test (in terms of memory usage and query time) is a simple look-up table. A kd-tree implementation was found to be faster to build and to give similar query performance at the expense of a larger memory footprint. Using our approach, we demonstrate that regridding of complex data may proceed at speeds sufficient to permit regridding on-the-fly in an interactive visualization application, or in a Web Map Service implementation. For large datasets with complex grids the new mechanism is shown to significantly outperform algorithms used in many scientific visualization packages.
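
A minimal sketch of the look-up-table index is given below: source points are binned into a regular lattice of cells, and a destination point searches only its own cell and the immediate neighbours; the cell size and the nearest-neighbour regridding rule are illustrative assumptions.

    # Look-up-table spatial index for fast nearest-source-point queries
    # during regridding of an irregular source grid.
    import numpy as np
    from collections import defaultdict

    def build_index(src, cell=0.1):
        table = defaultdict(list)
        for i, (x, y) in enumerate(src):
            table[(int(x / cell), int(y / cell))].append(i)
        return table

    def nearest(src, table, q, cell=0.1):
        cx, cy = int(q[0] / cell), int(q[1] / cell)
        cand = [i for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                for i in table.get((cx + dx, cy + dy), [])]
        if not cand:                          # fall back to brute force
            cand = list(range(len(src)))
        d = ((src[list(cand)] - q) ** 2).sum(axis=1)
        return list(cand)[int(np.argmin(d))]

    rng = np.random.default_rng(4)
    src = rng.uniform(size=(100000, 2))       # a large, irregular source grid
    table = build_index(src)
    print("nearest source point:", nearest(src, table, np.array([0.5, 0.5])))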

Relevance:

80.00%

Publisher:

Abstract:

Data Envelopment Analysis (DEA) is one of the most widely used methods for measuring the efficiency and productivity of Decision Making Units (DMUs). DEA for a large dataset with many inputs/outputs requires huge computer resources in terms of memory and CPU time. This paper proposes a back-propagation neural network DEA to address this problem for the very large datasets now emerging in practice. A neural network's requirements for computer memory and CPU time are far lower than those of conventional DEA methods, making it a useful tool for measuring the efficiency of large datasets. Finally, the back-propagation DEA algorithm is applied to five large datasets and its results are compared with those obtained by conventional DEA.
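
The following sketch shows the idea under stated assumptions: conventional DEA (input-oriented CCR, solved as a linear programme) is run on a subsample, and a small network is trained on those scores so that efficiency estimates for the full dataset become cheap. Layer sizes, subsample size, and the use of scipy/scikit-learn are assumptions.

    # One LP per subsampled DMU gives training targets; the network then
    # approximates efficiency for every DMU without further LPs.
    import numpy as np
    from scipy.optimize import linprog
    from sklearn.neural_network import MLPRegressor

    def ccr_efficiency(X, Y, j):
        """CCR multiplier LP for DMU j: max u.y_j s.t. v.x_j = 1, uY - vX <= 0."""
        n, m = X.shape
        s = Y.shape[1]
        c = np.concatenate([-Y[j], np.zeros(m)])          # maximise u.y_j
        A_ub = np.hstack([Y, -X])                         # u.y_i - v.x_i <= 0
        A_eq = np.concatenate([np.zeros(s), X[j]])[None]  # v.x_j = 1
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (s + m))
        return -res.fun

    rng = np.random.default_rng(5)
    X_in = rng.uniform(1, 10, size=(500, 2))              # inputs of 500 DMUs
    Y_out = rng.uniform(1, 10, size=(500, 1))             # outputs
    sub = rng.choice(500, 100, replace=False)             # LP only on a subsample
    scores = [ccr_efficiency(X_in, Y_out, j) for j in sub]
    net = MLPRegressor((32, 32), max_iter=2000, random_state=0)
    net.fit(np.hstack([X_in, Y_out])[sub], scores)
    est = net.predict(np.hstack([X_in, Y_out]))           # cheap for all DMUs
    print("estimated efficiency of DMU 0: %.3f" % est[0])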

Relevance:

70.00%

Publisher:

Abstract:

The occurrence of mid-latitude windstorms is related to strong socio-economic effects. For detailed and reliable regional impact studies, large datasets of high-resolution wind fields are required. In this study, a statistical downscaling approach in combination with dynamical downscaling is introduced to derive storm related gust speeds on a high-resolution grid over Europe. Multiple linear regression models are trained using reanalysis data and wind gusts from regional climate model simulations for a sample of 100 top ranking windstorm events. The method is computationally inexpensive and reproduces individual windstorm footprints adequately. Compared to observations, the results for Germany are at least as good as pure dynamical downscaling. This new tool can be easily applied to large ensembles of general circulation model simulations and thus contribute to a better understanding of the regional impact of windstorms based on decadal and climate change projections.
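
A minimal sketch of the regression step, on synthetic data: one multiple linear regression per high-resolution grid point, mapping coarse-scale predictors for each training storm to the gust the regional climate model produced there. The predictor set is an assumption.

    # Per-grid-point multiple linear regression for downscaled gust fields.
    import numpy as np

    rng = np.random.default_rng(6)
    n_events, n_pred, n_grid = 100, 3, 2500    # 100 training storms
    P = rng.normal(size=(n_events, n_pred))    # coarse predictors per event
    true_w = rng.normal(size=(n_pred, n_grid))
    G = P @ true_w + rng.normal(0, 0.1, size=(n_events, n_grid))  # RCM gusts

    # One least-squares fit per grid point, done for all points at once:
    Pd = np.hstack([P, np.ones((n_events, 1))])          # add intercept
    W, *_ = np.linalg.lstsq(Pd, G, rcond=None)           # (n_pred+1) x n_grid

    new_event = np.append(rng.normal(size=n_pred), 1.0)
    gust_field = new_event @ W                           # downscaled footprint
    print("gust field shape:", gust_field.shape)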

Relevance:

70.00%

Publisher:

Abstract:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Relevance:

70.00%

Publisher:

Abstract:

Traditional pattern recognition techniques cannot handle the classification of large datasets with both efficiency and effectiveness. In this context, the Optimum-Path Forest (OPF) classifier was recently introduced, aiming at high recognition rates and low computational cost. Although OPF was much faster than Support Vector Machines for training, it was slightly slower for classification. In this paper, we present the Efficient OPF (EOPF), an enhanced and faster version of the traditional OPF, and validate it for the automatic recognition of white matter and gray matter in magnetic resonance images of the human brain. © 2010 IEEE.
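
For orientation, a minimal sketch of OPF classification with the f_max path cost is given below; prototype selection is simplified to one sample per class (the papers use an MST-based rule), so this illustrates the classifier, not the EOPF speed-up itself.

    # Each training sample carries the cost of its best path from a
    # prototype; a new sample gets the label of the training sample s
    # minimising max(cost[s], d(s, t)).
    import heapq
    import numpy as np

    def opf_train(X, y, prototypes):
        cost = np.full(len(X), np.inf)
        heap = [(0.0, p) for p in prototypes]
        heapq.heapify(heap)
        for p in prototypes:
            cost[p] = 0.0
        while heap:                            # Dijkstra-like pass with f_max
            c, s = heapq.heappop(heap)
            if c > cost[s]:
                continue
            for t in range(len(X)):
                c2 = max(c, np.linalg.norm(X[s] - X[t]))
                if c2 < cost[t]:
                    cost[t] = c2
                    heapq.heappush(heap, (c2, t))
        return cost

    def opf_classify(X, y, cost, t):
        total = np.maximum(cost, np.linalg.norm(X - t, axis=1))
        return y[int(np.argmin(total))]

    X = np.array([[0.0, 0], [1, 0], [5, 5], [6, 5]])
    y = np.array([0, 0, 1, 1])
    cost = opf_train(X, y, prototypes=[0, 2])  # one prototype per class
    print("predicted label:", opf_classify(X, y, cost, np.array([5.5, 4.8])))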

Relevance:

70.00%

Publisher:

Abstract:

Well-known data mining algorithms rely on inputs in the form of pairwise similarities between objects. For large datasets it is computationally infeasible to perform all pairwise comparisons. We therefore propose a novel approach that uses approximate Principal Component Analysis to efficiently identify groups of similar objects. The effectiveness of the approach is demonstrated in the context of binary classification using the supervised normalized cut as a classifier. For large datasets from the UCI repository, the approach significantly improves run times with minimal loss in accuracy.
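
A minimal sketch of the idea: project the objects with randomized (approximate) PCA and compare pairs only within buckets of the projected space; bucketing by rounding the leading components is an illustrative stand-in for the paper's grouping step.

    # Approximate PCA projection followed by coarse bucketing, so that
    # pairwise comparisons are restricted to objects in the same bucket.
    import numpy as np
    from collections import defaultdict
    from sklearn.decomposition import TruncatedSVD

    rng = np.random.default_rng(7)
    X = rng.normal(size=(10000, 100))

    Z = TruncatedSVD(n_components=2, algorithm="randomized",
                     random_state=0).fit_transform(X)
    buckets = defaultdict(list)
    for i, z in enumerate(Z):
        buckets[tuple(np.round(z, 0))].append(i)   # coarse grid cells

    pairs = sum(len(b) * (len(b) - 1) // 2 for b in buckets.values())
    print("candidate pairs: %d of %d possible" % (pairs, 10000 * 9999 // 2))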

Relevance:

70.00%

Publisher:

Abstract:

This paper analyses local geographical contexts targeted by transnational large-scale land acquisitions (>200 ha per deal) in order to understand how emerging patterns of socio-ecological characteristics can be related to processes of large-scale foreign investment in land. Using a sample of 139 land deals georeferenced with high spatial accuracy, we first analyse their target contexts in terms of land cover, population density, accessibility, and indicators of agricultural potential. Three distinct patterns emerge from the analysis: densely populated and easily accessible croplands (35% of land deals); remote forestlands with lower population densities (34% of land deals); and moderately populated and moderately accessible shrub- or grasslands (26% of land deals). These patterns are consistent with processes described in the relevant case study literature, and they each involve distinct types of stakeholders and associated competition over land. We then repeat the often-cited analysis that postulates a link between land investments and target countries with abundant so-called “idle” or “marginal” lands, as measured by yield gap and available suitable but uncultivated land; our methods differ from the earlier approach, however, in that we examine the local context (10-km radius) rather than countries as a whole. The results show that the earlier findings are disputable in terms of concepts, methods, and contents. Further, we reflect on methodologies for exploring linkages between socio-ecological patterns and land investment processes. Improving and expanding large datasets of georeferenced land deals is an important next step; at the same time, careful choice of the spatial scale of analysis is crucial for ensuring compatibility between the spatial accuracy of land deal locations and the resolution of available geospatial data layers. Finally, we argue that new approaches and methods must be developed to empirically link socio-ecological patterns in target contexts to key determinants of land investment processes. This would help to improve the validity and the reach of our findings as an input for evidence-informed policy debates.
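
One way such target-context patterns can be surfaced is by clustering the per-deal indicators, as in the sketch below; whether the study used exactly k-means is an assumption, and three clusters simply mirrors the three patterns it reports.

    # Cluster per-deal context indicators and inspect the groups.
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(8)
    # columns: cropland share, population density, travel time, yield gap
    deals = rng.uniform(size=(139, 4))         # synthetic stand-in for 139 deals

    Z = StandardScaler().fit_transform(deals)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
    for k in range(3):
        print("pattern %d: %d deals, mean indicators %s"
              % (k, (labels == k).sum(), np.round(deals[labels == k].mean(0), 2)))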

Relevance:

70.00%

Publisher:

Abstract:

Forecasting abrupt variations in wind power generation (the so-called ramps) helps achieve large-scale wind power integration. One of the main issues to be confronted when addressing wind power ramp forecasting is how to identify relevant information in large datasets so as to optimally feed forecasting models. To this end, an innovative methodology for systematically relating multivariate datasets to ramp events is presented. The methodology comprises two stages: the identification of relevant features in the data and the assessment of the dependence between these features and ramp occurrence. As a test case, the proposed methodology was employed to explore the relationships between atmospheric dynamics at the global/synoptic scales and ramp events experienced at two wind farms located in Spain. The results suggested different degrees of connection between these atmospheric scales and ramp occurrence. For one of the wind farms, it was found that ramp events could be partly explained by regional circulations and zonal pressure gradients. To perform a comprehensive analysis of the underlying causes of ramps, the proposed methodology could be applied to datasets related to other stages of the wind-to-power conversion chain.
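
On synthetic data, the two stages can be sketched as follows: ramps are labelled as power changes exceeding a threshold within a time window, and a simple correlation scores the dependence between a candidate feature and ramp occurrence. The threshold, window, and score are all assumptions for illustration.

    # Stage 1: label ramp events; stage 2: score feature-ramp dependence.
    import numpy as np

    rng = np.random.default_rng(9)
    pressure_grad = rng.normal(size=500)                    # candidate feature
    power = np.clip(np.cumsum(0.3 * pressure_grad
                              + rng.normal(0, 0.5, 500)), 0, 100)

    window, threshold = 4, 5.0                              # ramp definition
    ramps = np.abs(power[window:] - power[:-window]) > threshold

    # Dependence between the feature and ramp occurrence (point-biserial r):
    feat = pressure_grad[:-window]
    r = np.corrcoef(np.abs(feat), ramps.astype(float))[0, 1]
    print("ramps: %d, |feature|-ramp correlation: %.2f" % (ramps.sum(), r))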