Digital Library

828 results for cloud computing resources

A multi-resource load balancing algorithm for cloud cache systems

Relevance: 100.00%

Abstract:

With the advent of the cloud computing model, distributed caches have become the cornerstone for building scalable applications. Popular systems like Facebook [1] or Twitter use Memcached [5], a highly scalable distributed object cache, to speed up applications by avoiding database accesses. Distributed object caches assign objects to cache instances based on a hashing function, and objects are not moved from one cache instance to another unless more instances are added to the cache and objects are redistributed. This may lead to situations where some cache instances are overloaded because some of the objects they store are frequently accessed, while other cache instances are less frequently used. In this paper we propose a multi-resource load balancing algorithm for distributed cache systems. The algorithm aims at balancing both CPU and memory resources among cache instances by redistributing stored data. Considering the possible conflict of balancing multiple resources at the same time, we give CPU and memory resources weighted priorities based on the runtime load distributions. A scarcer resource is given a higher weight than a less scarce resource when load balancing. The system imbalance degree is evaluated based on monitoring information and on the utility load of a node, a unit of resource consumption. In addition, since continuous rebalancing of the system may affect the QoS of applications using the cache system, our data selection policy ensures that each data migration minimizes the system imbalance degree, and hence the total reconfiguration cost can be minimized. An extensive simulation is conducted to compare our policy with other policies. Our policy shows a significant improvement in time efficiency and a decrease in reconfiguration cost.
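
The weighting idea in this abstract can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not give the formulas, so the sketch assumes that each resource's weight is its share of the mean cluster utilization (the scarcer resource weighs more), that a node's utility load is the weighted sum of its CPU and memory load, and that the imbalance degree is the population standard deviation of the utility loads. The names `node_load`, `imbalance_degree` and `best_migration` are hypothetical.

```python
from statistics import mean, pstdev

def node_load(items):
    """Sum per-item (cpu, mem) demands into a node-level (cpu, mem) load."""
    cpu = sum(c for c, _ in items.values())
    mem = sum(m for _, m in items.values())
    return cpu, mem

def imbalance_degree(cluster):
    """Std. deviation of weighted utility loads (assumed definition, see lead-in)."""
    loads = [node_load(items) for items in cluster.values()]
    mean_cpu = mean([c for c, _ in loads])
    mean_mem = mean([m for _, m in loads])
    total = mean_cpu + mean_mem
    # Assumption: the more heavily used (scarcer) resource gets the larger weight.
    w_cpu, w_mem = (mean_cpu / total, mean_mem / total) if total else (0.5, 0.5)
    utility = [w_cpu * c + w_mem * m for c, m in loads]
    return pstdev(utility)

def best_migration(cluster):
    """Return the (item, src, dst) move that lowers imbalance most, or None."""
    best, best_score = None, imbalance_degree(cluster)
    for src, items in cluster.items():
        for item in list(items):
            for dst in cluster:
                if dst == src:
                    continue
                demand = cluster[src].pop(item)   # tentatively move the item
                cluster[dst][item] = demand
                score = imbalance_degree(cluster)
                del cluster[dst][item]            # undo the tentative move
                cluster[src][item] = demand
                if score < best_score:
                    best, best_score = (item, src, dst), score
    return best
```

Exhaustively trying every move is only practical for tiny examples; a real cache would need a cheaper candidate selection, which the abstract does not detail.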

A proposal for a modular and application-aware autonomic manager of private cloud infrastructures

Relevance: 100.00%

Abstract:

The cloud computing paradigm has risen in popularity within industry and academia. Public cloud infrastructures are enabling new business models and helping to reduce costs. However, the desire to host a company's data and services on premises, and the need to abide by data protection laws, make private cloud infrastructures desirable, either to complement or even fully replace public offerings. Unfortunately, a lack of standardization has prevented private infrastructure management solutions from being adequately developed, and a myriad of different options has induced a fear of lock-in among customers. One of the causes of this problem is the misalignment between academic research and industry offerings, with the former focusing on idealized scenarios dissimilar from real-world situations, and the latter developing solutions without regard for how they fit with common standards, or without disseminating their results. To solve this problem, I propose a modular management system for private cloud infrastructures that focuses on the applications instead of just the hardware resources. This management system follows the autonomic computing paradigm and is designed around a simple information model developed to be compatible with common standards. This model splits the environment into two views that separate the concerns of the stakeholders while still enabling traceability between the physical environment and the virtual machines deployed onto it.
In it, cloud applications are classified into three broad types (Services, Big Data Jobs and Instance Reservations), so that the management system can take advantage of each type's features. The information model is paired with a set of atomic, reversible and independent management actions, which determine the operations that can be performed on the environment and are used to realize the cloud environment's scalability. I also describe a management engine tasked with resource placement, working from the environment's state and using the aforementioned set of actions. It is divided into two tiers: the Application Managers layer, concerned only with applications; and the Infrastructure Manager layer, responsible for the actual physical resources. This management engine follows a lifecycle with two phases, to better model the behavior of a real infrastructure.
The placement problem is tackled during one phase (consolidation) by an integer programming solver, and during the other (online) by a custom heuristic. Tests have demonstrated that this combined approach is superior to other strategies. Finally, the management system is paired with monitoring and actuator architectures. The former collects the necessary information from the environment; the latter is modular in design and capable of interfacing with several technologies and offering several access interfaces.
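
To make the ideas of reversible management actions and online placement more concrete, here is a minimal sketch. It is not the author's heuristic, which the abstract does not describe: it substitutes a plain first-fit rule, and the `Host` and `Deploy` classes, their fields and the `first_fit` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cpu: int   # free CPU cores (hypothetical unit)
    free_ram: int   # free RAM in GiB (hypothetical unit)

@dataclass
class Deploy:
    """A reversible action: reserve cpu/ram for one VM on a host."""
    host: Host
    cpu: int
    ram: int

    def apply(self):
        self.host.free_cpu -= self.cpu
        self.host.free_ram -= self.ram

    def undo(self):
        # Exactly reverses apply(), restoring the previous state.
        self.host.free_cpu += self.cpu
        self.host.free_ram += self.ram

def first_fit(cpu, ram, hosts):
    """Return a Deploy action for the first host with enough room, else None."""
    for host in hosts:
        if host.free_cpu >= cpu and host.free_ram >= ram:
            return Deploy(host, cpu, ram)
    return None
```

Returning an action object rather than mutating the hosts directly lets a manager inspect, apply or roll back a decision, which is one way to read the "atomic, reversible and independent" property mentioned above.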

A new paradigm: cloud agile manufacturing

Relevance: 100.00%

Abstract:

Cloud Agile Manufacturing is a new paradigm proposed in this article. Its main objective is to offer industrial production systems as a service. Users can thus access any functionality available in the manufacturing cloud (process design, production, management, business integration, factory virtualization, etc.) without needing to know how to manage the required resources, or at least without having to be experts in doing so. The proposal takes advantage of many of the benefits offered by technologies and models such as Business Process Management (BPM), Cloud Computing, Service Oriented Architectures (SOA) and Ontologies. It takes as its starting point the Semantic Industrial Machinery as a Service (SIMaaS) proposed in previous work. This proposal facilitates the effective integration of industrial machinery in a computing environment, offering it as a network service. The work also includes an analysis of the benefits and disadvantages of the proposal.

Cloud agile manufacturing

Relevance: 100.00%

Abstract:

This paper proposes a new manufacturing paradigm, which we call Cloud Agile Manufacturing, whose principal objective is to offer industrial production systems as a service. Users can thus access any functionality available in the manufacturing cloud (process design, production, management, business integration, factory virtualization, etc.) without needing to know how to manage the required resources, or at least without having to be experts in doing so. The proposal takes advantage of many of the benefits offered by technologies and models such as Business Process Management (BPM), Cloud Computing, Service Oriented Architectures (SOA) and Ontologies. It takes as its starting point the Semantic Industrial Machinery as a Service (SIMaaS) proposed in previous work. This proposal facilitates the effective integration of industrial machinery in a computing environment, offering it as a network service. The work also includes an analysis of the benefits and disadvantages of the proposal.

Towards Automation in Digital Investigations : Seeking Efficiency in Digital Forensics in Mobile and Cloud Environments

Relevance: 100.00%

Abstract:

Cybercrime and related malicious activity in our increasingly digital world have become more prevalent and sophisticated, evading traditional security mechanisms. Digital forensics has been proposed to help investigate, understand and eventually mitigate such attacks. The practice of digital forensics, however, is still fraught with challenges. Some of the most prominent include the increasing amounts of data and the diversity of digital evidence sources appearing in digital investigations. Mobile devices and cloud infrastructures are an interesting case, as they inherently exhibit these challenging circumstances and are becoming more prevalent in digital investigations today. Additionally, they embody further characteristics such as large volumes of data from multiple sources, dynamic sharing of resources, limited individual device capabilities and the presence of sensitive data. This combined set of circumstances makes digital investigations in mobile and cloud environments particularly challenging. This is not helped by the fact that digital forensics today still involves manual, time-consuming tasks within the processes of identifying evidence, performing evidence acquisition and correlating multiple diverse sources of evidence in the analysis phase. Furthermore, industry-standard tools are largely evidence-oriented, have limited support for evidence integration and automate only certain precursory tasks, such as indexing and text searching. In this study, efficiency, in the form of reducing the time and human labour expended, is sought in digital investigations in highly networked environments through the automation of certain activities in the digital forensic process. To this end, requirements are outlined and an architecture is designed for an automated system that performs digital forensics in highly networked mobile and cloud environments.
Part of the remote evidence acquisition activity of this architecture is built and tested on several mobile devices in terms of speed and reliability. A method for integrating multiple diverse evidence sources in an automated manner, supporting correlation and automated reasoning, is developed and tested. Finally, the proposed architecture is reviewed and enhancements are proposed to further automate it by introducing decentralization, particularly within the storage and processing functionality. This decentralization also improves machine-to-machine communication, supporting several digital investigation processes enabled by the architecture by harnessing the properties of various peer-to-peer overlays. Remote evidence acquisition helps to improve the efficiency (time and effort involved) of digital investigations by removing the need for proximity to the evidence. Experiments show that a client-server paradigm with a single TCP connection does not offer the required scalability and reliability for remote evidence acquisition and that a paradigm with multiple TCP connections is required. The automated integration, correlation and reasoning on multiple diverse evidence sources demonstrated in the experiments improve speed and reduce the human effort needed in the analysis phase by removing the need for time-consuming manual correlation. Finally, informed by published scientific literature, the proposed enhancements for further decentralizing the Live Evidence Information Aggregator (LEIA) architecture offer a platform for increased machine-to-machine communication, thereby enabling automation and reducing the need for manual human intervention.

Benchmarking of distributed computing engines Spark and GraphLab for big data analytics

Relevance: 100.00%

Abstract:

In this paper we evaluate and compare two representative and popular distributed processing engines for large scale big data analytics, Spark and graph based engine GraphLab. We design a benchmark suite including representative algorithms and datasets to compare the performances of the computing engines, from performance aspects of running time, memory and CPU usage, network and I/O overhead. The benchmark suite is tested on both local computer cluster and virtual machines on cloud. By varying the number of computers and memory we examine the scalability of the computing engines with increasing computing resources (such as CPU and memory). We also run cross-evaluation of generic and graph based analytic algorithms over graph processing and generic platforms to identify the potential performance degradation if only one processing engine is available. It is observed that both computing engines show good scalability with increase of computing resources. While GraphLab largely outperforms Spark for graph algorithms, it has close running time performance as Spark for non-graph algorithms. Additionally the running time with Spark for graph algorithms over cloud virtual machines is observed to increase by almost 100% compared to over local computer clusters.

Scheduling medical application workloads on virtualized computing systems

Relevance: 100.00%

Abstract:

This dissertation presents and evaluates a methodology for scheduling medical application workloads in virtualized computing environments. Such environments are being widely adopted by providers of "cloud computing" services. In the context of provisioning resources for medical applications, such environments allow users to deploy applications on distributed computing resources while keeping their data secure. Furthermore, higher level services that further abstract the infrastructure-related issues can be built on top of such infrastructures. For example, a medical imaging service can allow medical professionals to process their data in the cloud, easing them from the burden of having to deploy and manage these resources themselves. In this work, we focus on issues related to scheduling scientific workloads on virtualized environments. We build upon the knowledge base of traditional parallel job scheduling to address the specific case of medical applications while harnessing the benefits afforded by virtualization technology. To this end, we provide the following contributions: (1) An in-depth analysis of the execution characteristics of the target applications when run in virtualized environments. (2) A performance prediction methodology applicable to the target environment. (3) A scheduling algorithm that harnesses application knowledge and virtualization-related benefits to provide strong scheduling performance and quality of service guarantees. In the process of addressing these pertinent issues for our target user base (i.e. medical professionals and researchers), we provide insight that benefits a large community of scientific application users in industry and academia. Our execution time prediction and scheduling methodologies are implemented and evaluated on a real system running popular scientific applications. We find that we are able to predict the execution time of a number of these applications with an average error of 15%. 
Our scheduling methodology, which is tested with medical image processing workloads, is compared to that of two baseline scheduling solutions and we find that it outperforms them in terms of both the number of jobs processed and resource utilization by 20–30%, without violating any deadlines. We conclude that our solution is a viable approach to supporting the computational needs of medical users, even if the cloud computing paradigm is not widely adopted in its current form.

Cloud Stratus: uma plataforma de middleware para desenvolvimento de aplicações em nuvem [Cloud Stratus: a middleware platform for developing cloud applications]

Relevance: 100.00%

Abstract:

Cloud Computing is a paradigm that enables simple and pervasive network access to shared and configurable computing resources. Such resources can be offered on demand to users in a pay-per-use model. With the advance of this paradigm, a single service offered by a cloud platform might not be enough to meet all the requirements of clients. It is therefore necessary to compose services provided by different cloud platforms. However, current cloud platforms are not implemented using common standards: each one has its own APIs and development tools, which is a barrier to composing different services. In this context, Cloud Integrator, a service-oriented middleware platform, provides an environment to facilitate the development and execution of multi-cloud applications. The applications are compositions of services from different cloud platforms and are represented by abstract workflows. However, Cloud Integrator has some limitations: (i) applications are executed locally; (ii) users cannot specify the application in terms of its inputs and outputs; and (iii) experienced users cannot directly determine the concrete Web services that will perform the workflow. To deal with these limitations, this work proposes Cloud Stratus, a middleware platform that extends Cloud Integrator and offers different ways to specify an application: as an abstract workflow or as a complete or partial execution flow. The platform enables application deployment in cloud virtual machines, so that several users can access it through the Internet. It also supports the access and management of virtual machines in different cloud platforms and provides mechanisms for service monitoring and for the assessment of QoS parameters. Cloud Stratus was validated through a case study consisting of an application that uses different services provided by different cloud platforms.
Cloud Stratus was also evaluated through computing experiments that analyze the performance of its processes.

Cloud query manager: uso de web semântica para evitar o problema de aprisionamento em IaaS [Cloud Query Manager: using the Semantic Web to avoid the lock-in problem in IaaS]

Relevance: 100.00%

Abstract:

Cloud computing can be defined as a distributed computational model through which resources (hardware, storage, development platforms and communication) are shared as paid services accessible with minimal management effort and interaction. A great benefit of this model is that it enables the use of various providers (e.g., a multi-cloud architecture) to compose a set of services in order to obtain an optimal configuration for performance and cost. However, multi-cloud use is hindered by the problem of cloud lock-in, the dependency between an application and a cloud platform. It is commonly addressed by three strategies: (i) use of an intermediate layer that stands between consumers of cloud services and the provider, (ii) use of standardized interfaces to access the cloud, or (iii) use of models with open specifications. This paper outlines an approach to evaluate these strategies. Applying it showed that, despite the advances made by these strategies, none of them actually solves the problem of cloud lock-in. Accordingly, this work proposes the use of the Semantic Web to avoid cloud lock-in, where RDF models are used to specify the features of a cloud, which are managed by SPARQL queries. To that end, this work: (i) presents an evaluation model that quantifies the problem of cloud lock-in, (ii) evaluates cloud lock-in in three multi-cloud solutions and three cloud platforms, (iii) proposes using RDF and SPARQL for the management of cloud resources, (iv) presents the Cloud Query Manager (CQM), a SPARQL server that implements the proposal, and (v) compares three multi-cloud solutions with CQM in terms of response time and effectiveness in resolving cloud lock-in.
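
The idea of describing cloud features as RDF triples and selecting them with queries can be sketched without any RDF library. The following is a plain-Python stand-in for a SPARQL basic graph pattern, not CQM's interface: the vocabulary (`ex:Instance`, `ex:provider`, `ex:memoryGiB`), the provider names and the function names are all assumptions made for illustration.

```python
# Each triple is (subject, predicate, object), as in an RDF graph.
TRIPLES = {
    ("vm-small", "rdf:type", "ex:Instance"),
    ("vm-small", "ex:provider", "ProviderA"),
    ("vm-small", "ex:memoryGiB", 2),
    ("vm-large", "rdf:type", "ex:Instance"),
    ("vm-large", "ex:provider", "ProviderB"),
    ("vm-large", "ex:memoryGiB", 16),
}

def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None is a wildcard (like a SPARQL variable)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

def instances_with_memory(triples, min_gib):
    """Roughly equivalent to the SPARQL query:
    SELECT ?i WHERE { ?i rdf:type ex:Instance ; ex:memoryGiB ?m . FILTER(?m >= min_gib) }
    """
    instances = {s for s, _, _ in match(triples, p="rdf:type", o="ex:Instance")}
    return sorted(s for s, _, m in match(triples, p="ex:memoryGiB")
                  if s in instances and m >= min_gib)
```

Because the query names only properties and never a provider-specific API call, the same query can run over descriptions of any provider, which is the portability argument the abstract makes.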

Design and Implementation of a Cloud Infrastructure for Distributed Scientific Calculation

Relevance: 100.00%

Abstract:

Cloud computing enables independent end users and applications to share data and pooled resources, possibly located in geographically distributed data centers, in a fully transparent way. This need is particularly felt by scientific applications, which must exploit distributed resources in an efficient and scalable way to process large amounts of data. This paper proposes an open solution to deploy a Platform as a Service (PaaS) over a set of multi-site data centers by applying open source virtualization tools to facilitate operation among virtual machines while optimizing the usage of distributed resources. An experimental testbed is set up in an OpenStack environment to evaluate different types of sample TCP connections, demonstrating the functionality of the proposed solution and yielding throughput measurements in relation to relevant design parameters.

Cumulon: Simplified Matrix-Based Data Analytics in the Cloud

Relevance: 100.00%

Abstract:

Cumulon is a system aimed at simplifying the development and deployment of statistical analysis of big data in public clouds. Cumulon allows users to program in their familiar language of matrices and linear algebra, without worrying about how to map data and computation to specific hardware and cloud software platforms. Given user-specified requirements in terms of time, monetary cost, and risk tolerance, Cumulon automatically makes intelligent decisions on implementation alternatives, execution parameters, and hardware provisioning and configuration settings, such as what type of machines and how many of them to acquire. Cumulon also supports clouds with auction-based markets: it effectively utilizes computing resources whose availability varies according to market conditions, and suggests the best bidding strategies for them. Cumulon explores two alternative approaches toward supporting such markets, with different trade-offs between system and optimization complexity. An experimental study is conducted to show the efficiency of Cumulon's execution engine, as well as the optimizer's effectiveness in finding the optimal plan in the vast plan space.
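
The provisioning decision described here, choosing a machine type and count under time and cost requirements, can be illustrated with a toy search. This is not Cumulon's optimizer: the sketch assumes ideal linear speedup, billing by whole hours, and a hypothetical catalog mapping each machine type to (work units per hour, price per hour); it ignores risk tolerance and auction-based pricing entirely.

```python
import math

def cheapest_plan(work_units, deadline_hours, catalog, max_machines=32):
    """Enumerate (machine type, count) and return the cheapest plan meeting the deadline.

    Returns (cost, type_name, count, hours), or None if no plan is feasible.
    Assumes ideal linear speedup and per-hour billing (both simplifications).
    """
    best = None
    for name, (units_per_hour, price_per_hour) in catalog.items():
        for n in range(1, max_machines + 1):
            hours = work_units / (units_per_hour * n)
            if hours > deadline_hours:
                continue  # too slow with this many machines
            cost = math.ceil(hours) * n * price_per_hour
            if best is None or cost < best[0]:
                best = (cost, name, n, hours)
    return best
```

Even in this toy form the plan space grows with every extra knob, which hints at why the abstract speaks of a "vast plan space" once implementation alternatives and execution parameters are included.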

Simulating Business Processes of Manufacturing SMEs on the Cloud

Relevance: 100.00%

Abstract:

Simulating the efficiency of business processes could reveal crucial bottlenecks for manufacturing companies and could lead to significant optimizations resulting in decreased time to market, more efficient resource utilization, and larger profit. While such business optimization software is widely utilized by larger companies, SMEs typically do not have the required expertise and resources to efficiently exploit these advantages. The aim of this work is to explore how simulation software vendors and consultancies can extend their portfolio to SMEs by providing business process optimization based on a cloud computing platform. By executing simulation runs on the cloud, software vendors and associated business consultancies can get access to large computing power and data storage capacity on demand, run large simulation scenarios on behalf of their clients, analyze simulation results, and advise their clients regarding process optimization. The solution is mutually beneficial for both vendor/consultant and the end-user SME. End-user companies will only pay for the service without requiring large upfront costs for software licenses and expensive hardware. Software vendors can extend their business towards the SME market with potentially huge benefits.

HPC management and engineering in the hybrid cloud

Relevance: 100.00%

Abstract:

The evolution and maturation of Cloud Computing created an opportunity for the emergence of new Cloud applications. High-performance Computing (HPC), a class of complex problem solving, arises as a new business consumer by taking advantage of the Cloud's premises and leaving behind expensive datacenter management and difficult grid development. Now in an advanced phase of maturity, today's Cloud has discarded many of its drawbacks, becoming more and more efficient and widespread. Performance enhancements, price drops due to massification and customizable on-demand services have drawn increased attention from other markets. HPC, despite being a very well established field, has traditionally had narrow deployment options and runs on dedicated datacenters or large computing grids. The problem with this common placement is mainly the initial cost, which not all research labs can afford, and the inability to fully use resources. The main objective of this work was to investigate new technical solutions to allow the deployment of HPC applications on the Cloud, with particular emphasis on private on-premise resources, the lower end of the chain, which reduces costs. The work includes many experiments and analyses to identify obstacles and technology limitations. The feasibility of the objective was tested with new modeling, a new architecture and the migration of several applications. The final application integrates a simplified incorporation of both public and private Cloud resources, as well as HPC application scheduling, deployment and management. It uses a well-defined user role strategy, based on federated authentication, and a seamless procedure for daily usage that balances low cost and performance.

Radio over fiber enabling PON fronthaul in a two-tiered cloud

Relevance: 100.00%

Abstract:

With the advent of connected objects, the required bandwidth exceeds the capacity of electrical interconnects and wireless interfaces in access networks as well as in core networks. High-capacity photonic systems located in access networks and using radio-over-fiber technology have been proposed as a solution for fifth-generation (5G) wireless networks. To maximize the utilization of server and network resources, cloud computing and storage services are being deployed. In this way, centralized resources could be delivered dynamically as the end user wishes. Since each exchange requires synchronization between the server and its infrastructure, an optical physical layer allows the cloud to support network virtualization and software-defined networking. Reflective semiconductor optical amplifiers (RSOAs) are a key technology at the ONU (optical network unit) in fiber passive optical networks (PONs). Here we examine the possibility of using an RSOA and radio-over-fiber technology to carry wireless signals together with a digital signal over a PON. Radio over fiber can be easily realized thanks to the wavelength insensitivity of the RSOA. The wavelength used by the physical layer, however, is chosen at layers 2/3 of the OSI model. Interactions between the physical layer and network switching can be achieved by adding an SDN controller that includes optical layer managers. Network virtualization could thus benefit from a flexible optical layer through dynamic, tailored network resources. In this thesis, we study a system with an optical physical layer based on an RSOA.
This layer allows us to simultaneously send wireless signals and carry digital signals in on-off keying (OOK) format in a WDM (wavelength-division multiplexing) PON system. The RSOA was characterized to demonstrate its ability to handle a high dynamic range of the analog wireless signal. Next, RF and IF signals over the fiber system are compared, along with their advantages and disadvantages. Finally, we experimentally demonstrate a point-to-point WDM link using full-duplex transmission of an analog Wi-Fi signal together with a downstream OOK signal. By introducing two RF mixers in the uplink, we resolved the incompatibility with the wireless system based on TDD (time-division duplexing).

BUILDING EFFICIENT AND COST-EFFECTIVE CLOUD-BASED BIG DATA MANAGEMENT SYSTEMS

Relevance: 100.00%

Abstract:

In today’s big data world, data is being produced in massive volumes, at great velocity and from a variety of different sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are being increasingly used to derive value out of this big data. A large portion of this data is being stored and processed in the Cloud due the several advantages provided by the Cloud such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus, a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments. In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully. 
I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system that utilizes workload-aware data placement and replication to minimize the number of distributed transactions, and that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime. In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has been traditionally used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides the data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics. Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud.
The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading them onto distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
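
The neighborhood-centric idea described in this abstract can be illustrated with a small single-machine sketch. It is not NSCALE's API or implementation: the function names (`k_hop_neighborhood`, `induced_subgraph`, `run_on_neighborhoods`, `count_edges`) and the toy graph are assumptions made purely for illustration, and nothing here is distributed. The sketch extracts the k-hop neighborhood of each chosen vertex, builds the induced subgraph, and hands the whole subgraph to a user function, rather than exposing one vertex at a time.

```python
from collections import deque


def k_hop_neighborhood(adj, source, k):
    """Return the set of vertices within k hops of `source` (breadth-first search)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in adj.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def induced_subgraph(adj, vertices):
    """Keep only the edges whose two endpoints both lie in `vertices`."""
    return {v: {w for w in adj.get(v, ()) if w in vertices} for v in vertices}


def count_edges(subgraph):
    """Hypothetical user program: number of undirected edges in a subgraph."""
    return sum(len(neighbors) for neighbors in subgraph.values()) // 2


def run_on_neighborhoods(adj, centers, k, user_fn):
    """Apply `user_fn` to the k-hop ego network of every vertex in `centers`."""
    return {
        c: user_fn(induced_subgraph(adj, k_hop_neighborhood(adj, c, k)))
        for c in centers
    }


# Toy undirected graph stored as adjacency sets (illustrative data only).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

result = run_on_neighborhoods(adj, ["a", "d"], 1, count_edges)
print(result)  # {'a': 3, 'd': 2}
```

Here the user function sees an entire ego network at once, which is the contrast with vertex-centric frameworks drawn in the abstract; how such subgraphs are selected, packed into memory and distributed is exactly the part this sketch leaves out.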
