6 resultados para Large Data
em Digital Commons at Florida International University
Resumo:
Modern geographical databases, which are at the core of geographic information systems (GIS), store a rich set of aspatial attributes in addition to geographic data. Typically, aspatial information comes in textual and numeric format. Retrieving information constrained on spatial and aspatial data from geodatabases provides GIS users the ability to perform more interesting spatial analyses, and for applications to support composite location-aware searches; for example, in a real estate database: “Find the nearest homes for sale to my current location that have backyard and whose prices are between $50,000 and $80,000”. Efficient processing of such queries require combined indexing strategies of multiple types of data. Existing spatial query engines commonly apply a two-filter approach (spatial filter followed by nonspatial filter, or viceversa), which can incur large performance overheads. On the other hand, more recently, the amount of geolocation data has grown rapidly in databases due in part to advances in geolocation technologies (e.g., GPS-enabled smartphones) that allow users to associate location data to objects or events. The latter poses potential data ingestion challenges of large data volumes for practical GIS databases. In this dissertation, we first show how indexing spatial data with R-trees (a typical data pre-processing task) can be scaled in MapReduce—a widely-adopted parallel programming model for data intensive problems. The evaluation of our algorithms in a Hadoop cluster showed close to linear scalability in building R-tree indexes. Subsequently, we develop efficient algorithms for processing spatial queries with aspatial conditions. Novel techniques for simultaneously indexing spatial with textual and numeric data are developed to that end. Experimental evaluations with real-world, large spatial datasets measured query response times within the sub-second range for most cases, and up to a few seconds for a small number of cases, which is reasonable for interactive applications. Overall, the previous results show that the MapReduce parallel model is suitable for indexing tasks in spatial databases, and the adequate combination of spatial and aspatial attribute indexes can attain acceptable response times for interactive spatial queries with constraints on aspatial data.
Resumo:
Cloud computing realizes the long-held dream of converting computing capability into a type of utility. It has the potential to fundamentally change the landscape of the IT industry and our way of life. However, as cloud computing expanding substantially in both scale and scope, ensuring its sustainable growth is a critical problem. Service providers have long been suffering from high operational costs. Especially the costs associated with the skyrocketing power consumption of large data centers. In the meantime, while efficient power/energy utilization is indispensable for the sustainable growth of cloud computing, service providers must also satisfy a user's quality of service (QoS) requirements. This problem becomes even more challenging considering the increasingly stringent power/energy and QoS constraints, as well as other factors such as the highly dynamic, heterogeneous, and distributed nature of the computing infrastructures, etc. ^ In this dissertation, we study the problem of delay-sensitive cloud service scheduling for the sustainable development of cloud computing. We first focus our research on the development of scheduling methods for delay-sensitive cloud services on a single server with the goal of maximizing a service provider's profit. We then extend our study to scheduling cloud services in distributed environments. In particular, we develop a queue-based model and derive efficient request dispatching and processing decisions in a multi-electricity-market environment to improve the profits for service providers. We next study a problem of multi-tier service scheduling. By carefully assigning sub deadlines to the service tiers, our approach can significantly improve resource usage efficiencies with statistically guaranteed QoS. Finally, we study the power conscious resource provision problem for service requests with different QoS requirements. By properly sharing computing resources among different requests, our method statistically guarantees all QoS requirements with a minimized number of powered-on servers and thus the power consumptions. The significance of our research is that it is one part of the integrated effort from both industry and academia to ensure the sustainable growth of cloud computing as it continues to evolve and change our society profoundly.^
Resumo:
Cloud computing realizes the long-held dream of converting computing capability into a type of utility. It has the potential to fundamentally change the landscape of the IT industry and our way of life. However, as cloud computing expanding substantially in both scale and scope, ensuring its sustainable growth is a critical problem. Service providers have long been suffering from high operational costs. Especially the costs associated with the skyrocketing power consumption of large data centers. In the meantime, while efficient power/energy utilization is indispensable for the sustainable growth of cloud computing, service providers must also satisfy a user's quality of service (QoS) requirements. This problem becomes even more challenging considering the increasingly stringent power/energy and QoS constraints, as well as other factors such as the highly dynamic, heterogeneous, and distributed nature of the computing infrastructures, etc. In this dissertation, we study the problem of delay-sensitive cloud service scheduling for the sustainable development of cloud computing. We first focus our research on the development of scheduling methods for delay-sensitive cloud services on a single server with the goal of maximizing a service provider's profit. We then extend our study to scheduling cloud services in distributed environments. In particular, we develop a queue-based model and derive efficient request dispatching and processing decisions in a multi-electricity-market environment to improve the profits for service providers. We next study a problem of multi-tier service scheduling. By carefully assigning sub deadlines to the service tiers, our approach can significantly improve resource usage efficiencies with statistically guaranteed QoS. Finally, we study the power conscious resource provision problem for service requests with different QoS requirements. By properly sharing computing resources among different requests, our method statistically guarantees all QoS requirements with a minimized number of powered-on servers and thus the power consumptions. The significance of our research is that it is one part of the integrated effort from both industry and academia to ensure the sustainable growth of cloud computing as it continues to evolve and change our society profoundly.
Resumo:
As massive data sets become increasingly available, people are facing the problem of how to effectively process and understand these data. Traditional sequential computing models are giving way to parallel and distributed computing models, such as MapReduce, both due to the large size of the data sets and their high dimensionality. This dissertation, as in the same direction of other researches that are based on MapReduce, tries to develop effective techniques and applications using MapReduce that can help people solve large-scale problems. Three different problems are tackled in the dissertation. The first one deals with processing terabytes of raster data in a spatial data management system. Aerial imagery files are broken into tiles to enable data parallel computation. The second and third problems deal with dimension reduction techniques that can be used to handle data sets of high dimensionality. Three variants of the nonnegative matrix factorization technique are scaled up to factorize matrices of dimensions in the order of millions in MapReduce based on different matrix multiplication implementations. Two algorithms, which compute CANDECOMP/PARAFAC and Tucker tensor decompositions respectively, are parallelized in MapReduce based on carefully partitioning the data and arranging the computation to maximize data locality and parallelism.
Resumo:
With the exponential increasing demands and uses of GIS data visualization system, such as urban planning, environment and climate change monitoring, weather simulation, hydrographic gauge and so forth, the geospatial vector and raster data visualization research, application and technology has become prevalent. However, we observe that current web GIS techniques are merely suitable for static vector and raster data where no dynamic overlaying layers. While it is desirable to enable visual explorations of large-scale dynamic vector and raster geospatial data in a web environment, improving the performance between backend datasets and the vector and raster applications remains a challenging technical issue. This dissertation is to implement these challenging and unimplemented areas: how to provide a large-scale dynamic vector and raster data visualization service with dynamic overlaying layers accessible from various client devices through a standard web browser, and how to make the large-scale dynamic vector and raster data visualization service as rapid as the static one. To accomplish these, a large-scale dynamic vector and raster data visualization geographic information system based on parallel map tiling and a comprehensive performance improvement solution are proposed, designed and implemented. They include: the quadtree-based indexing and parallel map tiling, the Legend String, the vector data visualization with dynamic layers overlaying, the vector data time series visualization, the algorithm of vector data rendering, the algorithm of raster data re-projection, the algorithm for elimination of superfluous level of detail, the algorithm for vector data gridding and re-grouping and the cluster servers side vector and raster data caching.
Resumo:
Large-extent vegetation datasets that co-occur with long-term hydrology data provide new ways to develop biologically meaningful hydrologic variables and to determine plant community responses to hydrology. We analyzed the suitability of different hydrological variables to predict vegetation in two water conservation areas (WCAs) in the Florida Everglades, USA, and developed metrics to define realized hydrologic optima and tolerances. Using vegetation data spatially co-located with long-term hydrological records, we evaluated seven variables describing water depth, hydroperiod length, and number of wet/dry events; each variable was tested for 2-, 4- and 10-year intervals for Julian annual averages and environmentally-defined hydrologic intervals. Maximum length and maximum water depth during the wet period calculated for environmentally-defined hydrologic intervals over a 4-year period were the best predictors of vegetation type. Proportional abundance of vegetation types along hydrological gradients indicated that communities had different realized optima and tolerances across WCAs. Although in both WCAs, the trees/shrubs class was on the drier/shallower end of hydrological gradients, while slough communities occupied the wetter/deeper end, the distribution ofCladium, Typha, wet prairie and Salix communities, which were intermediate for most hydrological variables, varied in proportional abundance along hydrologic gradients between WCAs, indicating that realized optima and tolerances are context-dependent.