782 results for Spatial Data mining
Abstract:
In this paper we discuss the current state-of-the-art in estimating, evaluating, and selecting among non-linear forecasting models for economic and financial time series. We review theoretical and empirical issues, including predictive density, interval and point evaluation and model selection, loss functions, data-mining, and aggregation. In addition, we argue that although the evidence in favor of constructing forecasts using non-linear models is rather sparse, there is reason to be optimistic. However, much remains to be done. Finally, we outline a variety of topics for future research, and discuss a number of areas which have received considerable attention in the recent literature, but where many questions remain.
Abstract:
A glance along the finance shelves at any bookshop reveals a large number of books that seek to show readers how to ‘make a million’ or ‘beat the market’ with allegedly highly profitable equity trading strategies. This paper investigates whether useful trading strategies can be derived from popular books of investment strategy, with What Works on Wall Street by James P. O'Shaughnessy used as an example. Specifically, we test whether this strategy would have produced a similarly spectacular performance in the UK context as was demonstrated by the author for the US market. As part of our investigation, we highlight a general methodology for determining whether the observed superior performance of a trading rule could be attributed in part or in entirety to data mining. Overall, we find that the O'Shaughnessy rule performs reasonably well in the UK equity market, yielding higher returns than the FTSE All-Share Index, but lower returns than an equally weighted benchmark.
Abstract:
Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploiting the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in iterative parallel data mining algorithms. In particular, the analysis focuses on one of the most influential and popular data mining methods, the k-means algorithm for cluster analysis. The straightforward parallel formulation of the k-means algorithm requires a global reduction operation at each iteration step, which hinders its scalability. This work studies a different parallel formulation of the algorithm where the requirement of global communication can be relaxed while still providing the exact solution of the centralised k-means algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs.
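To make the "global reduction at each iteration step" concrete, the following is a minimal sketch of the straightforward parallel formulation only, not of the relaxed-communication approach the paper proposes. The partitions are processed in a plain loop standing in for separate processes, and all function names are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_sums(partition, centroids):
    """Per-partition work: coordinate sums and counts for each cluster."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        j = nearest(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def kmeans_step(partitions, centroids):
    """One k-means iteration over data split into partitions."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    # In a real system each partition would live on its own process.
    for partition in partitions:
        sums, counts = local_sums(partition, centroids)
        # Global reduction: every partition's partial result is combined.
        # This is the step that requires communication among all processes.
        for j in range(len(centroids)):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # New centroid = mean of assigned points; empty clusters keep their centroid.
    return [
        tuple(s / total_counts[j] for s in total_sums[j])
        if total_counts[j]
        else tuple(centroids[j])
        for j in range(len(centroids))
    ]
```

Because the partial sums and counts are simply added, the result is the same as running the step on the undivided data, which is why the reduction cannot be skipped in this formulation.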
Abstract:
Twitter is both a micro-blogging service and a platform for public conversation. Direct conversation is facilitated in Twitter through the use of @’s (mentions) and replies. While the conversational element of Twitter is of particular interest to the marketing sector, relatively few data-mining studies have focused on this area. We analyse conversations associated with reciprocated mentions that take place in a data-set consisting of approximately 4 million tweets collected over a period of 28 days that contain at least one mention. We ignore tweet content and instead use the mention network structure and its dynamical properties to identify and characterise Twitter conversations between pairs of users and within larger groups. We consider conversational balance, meaning the fraction of content contributed by each party. The goal of this work is to draw out some of the mechanisms driving conversation in Twitter, with the potential aim of developing conversational models.
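As an illustration of conversational balance, the sketch below computes each party's share of a two-user conversation. It assumes that "content" is measured by the number of mentions each user sends to the other, which the abstract does not specify; the function name and the input format (a list of sender/receiver pairs) are hypothetical.

```python
from collections import Counter


def conversational_balance(mentions, user_a, user_b):
    """Share of the exchange between two distinct users contributed by each.

    mentions is an iterable of (sender, receiver) pairs. Returns None when
    the two users never mention each other.
    """
    pair = {user_a, user_b}
    counts = Counter(
        sender for sender, receiver in mentions if {sender, receiver} == pair
    )
    total = counts[user_a] + counts[user_b]
    if total == 0:
        return None
    return {user_a: counts[user_a] / total, user_b: counts[user_b] / total}
```

A perfectly balanced conversation gives 0.5 for each user; values far from 0.5 indicate that one party drives the exchange.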
Abstract:
To analyze patterns in marine productivity, harmful algal blooms, thermal stress in coral reefs, and oceanographic processes, optical and biophysical marine parameters, such as sea surface temperature, and ocean color products, such as chlorophyll-a concentration, diffuse attenuation coefficient, total suspended matter concentration, chlorophyll fluorescence line height, and remote sensing reflectance, are required. In this paper we present a novel automatic Satellite-based Ocean Monitoring System (SATMO) developed to provide, in near real-time, continuous spatial data sets of the above-mentioned variables for marine-coastal ecosystems in the Gulf of Mexico, northeastern Pacific Ocean, and western Caribbean Sea, with 1 km spatial resolution. The products are obtained from Moderate Resolution Imaging Spectroradiometer (MODIS) images received at the Direct Readout Ground Station (located at CONABIO) after each overpass of the Aqua and Terra satellites. In addition, at the end of each week and month the system provides composite images for several ocean products, as well as weekly and monthly anomaly composites for chlorophyll-a concentration and sea surface temperature. These anomaly data are reported for the first time for the study region and represent valuable information for analyzing time series of ocean color data for the study of coastal and marine ecosystems in Mexico, Central America, and the western Caribbean.
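The composite and anomaly products can be sketched as follows. This is an illustration under assumptions, not SATMO's actual processing chain: a composite is taken to be the per-pixel mean of the valid observations in the period, and an anomaly is taken to be the composite minus a long-term mean (climatology) for the same period. Grids are nested lists, `None` marks a missing pixel (for example, cloud cover), and the function names are hypothetical.

```python
def composite(images):
    """Per-pixel mean over a stack of equally sized grids, skipping None."""
    rows, cols = len(images[0]), len(images[0][0])
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            valid = [img[r][c] for img in images if img[r][c] is not None]
            row.append(sum(valid) / len(valid) if valid else None)
        result.append(row)
    return result


def anomaly(period_composite, climatology):
    """Per-pixel difference between a composite and the long-term mean."""
    return [
        [None if v is None or m is None else v - m for v, m in zip(crow, mrow)]
        for crow, mrow in zip(period_composite, climatology)
    ]
```

Positive values then mark pixels that were above the long-term mean in that week or month, and negative values mark pixels below it.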
Abstract:
Classical regression methods take vectors as covariates and estimate the corresponding vectors of regression parameters. When addressing regression problems on covariates of more complex form such as multi-dimensional arrays (i.e. tensors), traditional computational models can be severely compromised by ultrahigh dimensionality as well as complex structure. By exploiting the special structure of tensor covariates, the tensor regression model provides a promising solution to reduce the model’s dimensionality to a manageable level, thus leading to efficient estimation. Most of the existing tensor-based methods independently estimate each individual regression problem based on tensor decomposition which allows the simultaneous projections of an input tensor to more than one direction along each mode. As a matter of fact, multi-dimensional data are collected under the same or very similar conditions, so that data share some common latent components but can also have their own independent parameters for each regression task. Therefore, it is beneficial to analyse regression parameters among all the regressions in a linked way. In this paper, we propose a tensor regression model based on Tucker Decomposition, which identifies not only the common components of parameters across all the regression tasks, but also independent factors contributing to each particular regression task simultaneously. Under this paradigm, the number of independent parameters along each mode is constrained by a sparsity-preserving regulariser. Linked multiway parameter analysis and sparsity modeling further reduce the total number of parameters, with lower memory cost than their tensor-based counterparts. The effectiveness of the new method is demonstrated on real data sets.
Abstract:
The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study, which together show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user’s trust in the system.
Abstract:
A basic data requirement of a river flood inundation model is a Digital Terrain Model (DTM) of the reach being studied. The scale at which modeling is required determines the accuracy required of the DTM. For modeling floods in urban areas, a high resolution DTM such as that produced by airborne LiDAR (Light Detection And Ranging) is most useful, and large parts of many developed countries have now been mapped using LiDAR. In remoter areas, it is possible to model flooding on a larger scale using a lower resolution DTM, and in the near future the DTM of choice is likely to be that derived from the TanDEM-X Digital Elevation Model (DEM). A variable-resolution global DTM obtained by combining existing high and low resolution data sets would be useful for modeling flood water dynamics globally, at high resolution wherever possible and at lower resolution over larger rivers in remote areas. A further important data resource used in flood modeling is the flood extent, commonly derived from Synthetic Aperture Radar (SAR) images. Flood extents become more useful if they are intersected with the DTM, when water level observations (WLOs) at the flood boundary can be estimated at various points along the river reach. To illustrate the utility of such a global DTM, two examples of recent research involving WLOs at opposite ends of the spatial scale are discussed. The first requires high resolution spatial data, and involves the assimilation of WLOs from a real sequence of high resolution SAR images into a flood model to update the model state with observations over time, and to estimate river discharge and model parameters, including river bathymetry and friction. The results indicate the feasibility of such an Earth Observation-based flood forecasting system. The second example is at a larger scale, and uses SAR-derived WLOs to improve the lower-resolution TanDEM-X DEM in the area covered by the flood extents. The resulting reduction in random height error is significant.
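Intersecting a flood extent with the DTM can be sketched very simply on a raster grid. The version below is a simplified illustration under assumptions, not the procedure used in the research described: the flood extent is a boolean grid aligned with the DTM, a boundary cell is a flooded cell with at least one dry 4-neighbour, and the water level observation at that cell is taken to be the DTM height there. The function name is hypothetical.

```python
def water_level_observations(flooded, dtm):
    """Return (row, col, height) for each flooded cell on the flood boundary.

    flooded: grid of booleans (True = inside the SAR-derived flood extent).
    dtm: grid of terrain heights with the same shape as flooded.
    """
    rows, cols = len(flooded), len(flooded[0])
    observations = []
    for r in range(rows):
        for c in range(cols):
            if not flooded[r][c]:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # A cell is on the boundary if any in-grid neighbour is dry.
            on_boundary = any(
                0 <= nr < rows and 0 <= nc < cols and not flooded[nr][nc]
                for nr, nc in neighbours
            )
            if on_boundary:
                observations.append((r, c, dtm[r][c]))
    return observations
```

The quality of such observations depends directly on the height accuracy of the DTM at the boundary, which is why a high resolution DTM matters in the first example and why the observations can be used to correct the DEM in the second.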
Abstract:
Sparse coding aims to find a more compact representation based on a set of dictionary atoms. A well-known technique looking at 2D sparsity is the low rank representation (LRR). However, in many computer vision applications, data often originate from a manifold, which is equipped with some Riemannian geometry. In this case, the existing LRR becomes inappropriate for modeling and incorporating the intrinsic geometry of the manifold that is potentially important and critical to applications. In this paper, we generalize the LRR over the Euclidean space to the LRR model over a specific Riemannian manifold: the manifold of symmetric positive definite (SPD) matrices. Experiments on several computer vision datasets showcase its noise robustness and superior performance on classification and segmentation compared with state-of-the-art approaches.
Abstract:
The contraction of a species’ distribution range, which results from the extirpation of local populations, generally precedes its extinction. Therefore, understanding drivers of range contraction is important for conservation and management. Although there are many processes that can potentially lead to local extirpation and range contraction, three main null models have been proposed: demographic, contagion, and refuge. The first two models postulate that the probability of local extirpation for a given area depends on its relative position within the range; but these models generate distinct spatial predictions because they assume either a ubiquitous (demographic) or a clinal (contagion) distribution of threats. The third model (refuge) postulates that extirpations are determined by the intensity of human impacts, leading to heterogeneous spatial predictions potentially compatible with those made by the other two null models. A few previous studies have explored the generality of some of these null models, but we present here the first comprehensive evaluation of all three models. Using descriptive indices and regression analyses we contrast the predictions made by each of the null models using empirical spatial data describing range contraction in 386 terrestrial vertebrates (mammals, birds, amphibians, and reptiles) distributed across the world. Observed contraction patterns do not consistently conform to the predictions of any of the three models, suggesting that these may not be adequate null models to evaluate range contraction dynamics among terrestrial vertebrates. Instead, our results support alternative null models that account for both relative position and intensity of human impacts. These new models provide a better multifactorial baseline to describe range contraction patterns in vertebrates. This general baseline can be used to explore how additional factors influence contraction, and ultimately extinction for particular areas or species as well as to predict future changes in light of current and new threats.
Abstract:
Prior to deforestation, São Paulo State had 79,000 km² covered by Cerrado (Brazilian savanna) physiognomies, but today less than 8.5% of this biodiversity hotspot remains, mostly in private lands. The global demand for agricultural goods has imposed strong pressure on natural areas, and the economic decisions of agribusiness managers are crucial to the fate of Cerrado domain remaining areas (CDRA) in Brazil. Our aim was to investigate the effectiveness of Brazilian private protected areas policy, and to propose a feasible alternative to promote CDRA protection. This article assessed the main agribusiness opportunity costs for natural areas preservation: the land use profitability and the arable land price. The CDRA percentage and the opportunity costs were estimated for 349 municipal districts of São Paulo State through secondary spatial data and profitability values of 38 main agricultural products. We found that Brazilian private protected areas policy fails to preserve CDRA, although the values of non-compliance fines were higher than average opportunity costs. The scenario with very restrictive laws on private protected areas and historical high interest rates allowed us to conceive a feasible cross compliance proposal to improve environmental and agricultural policies.
Abstract:
Predictive performance evaluation is a fundamental issue in design, development, and deployment of classification systems. As predictive performance evaluation is a multidimensional problem, single scalar summaries such as error rate, although quite convenient due to their simplicity, can seldom evaluate all the aspects that a complete and reliable evaluation must consider. Due to this, various graphical performance evaluation methods are increasingly drawing the attention of machine learning, data mining, and pattern recognition communities. The main advantage of these types of methods resides in their ability to depict the trade-offs between evaluation aspects in a multidimensional space rather than reducing these aspects to an arbitrarily chosen (and often biased) single scalar measure. Furthermore, to appropriately select a suitable graphical method for a given task, it is crucial to identify its strengths and weaknesses. This paper surveys various graphical methods often used for predictive performance evaluation. By presenting these methods in the same framework, we hope this paper may shed some light on deciding which methods are more suitable to use in different situations.
Abstract:
One of the top ten most influential data mining algorithms, k-means, is known for being simple and scalable. However, it is sensitive to initialization of prototypes and requires that the number of clusters be specified in advance. This paper shows that evolutionary techniques conceived to guide the application of k-means can be more computationally efficient than systematic (i.e., repetitive) approaches that try to get around the above-mentioned drawbacks by repeatedly running the algorithm from different configurations for the number of clusters and initial positions of prototypes. To do so, a modified version of a (k-means based) fast evolutionary algorithm for clustering is employed. Theoretical complexity analyses for the systematic and evolutionary algorithms of interest are provided. Computational experiments and statistical analyses of the results are presented for artificial and text mining data sets. (C) 2010 Elsevier B.V. All rights reserved.
Abstract:
Multidimensional Visualization techniques are invaluable tools for analysis of structured and unstructured data with variable dimensionality. This paper introduces PEx-Image (Projection Explorer for Images), a tool aimed at supporting analysis of image collections. The tool supports a methodology that employs interactive visualizations to aid user-driven feature detection and classification tasks, thus offering improved analysis and exploration capabilities. The visual mappings employ similarity-based multidimensional projections and point placement to lay out the data on a plane for visual exploration. In addition to its application to image databases, we also illustrate how the proposed approach can be successfully employed in simultaneous analysis of different data types, such as text and images, offering a common visual representation for data expressed in different modalities.