850 results for parallel scalability
Abstract:
In today’s big data world, data is being produced in massive volumes, at great velocity, and from a variety of sources such as mobile devices, sensors, a plethora of small devices connected to the internet (the Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are increasingly used to derive value out of this big data. A large portion of this data is stored and processed in the Cloud due to the several advantages the Cloud provides, such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments. In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully. I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system that utilizes workload-aware data placement and replication to minimize the number of distributed transactions, and that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime. In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has traditionally been used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing size (progressive samples) for exploratory querying. This provides data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics. Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud.
The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading them into distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
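For illustration, the sketch below shows the style of neighborhood-centric computation this abstract describes: the user program operates on an extracted multi-hop subgraph (here, a 1-hop ego network) rather than on the state of a single vertex. It is a minimal, hypothetical example written against networkx, not NSCALE's actual API; the function names and the karate-club graph are purely illustrative.

```python
# Hypothetical neighborhood-centric computation: local clustering coefficient
# evaluated on each vertex's 1-hop ego network. Not NSCALE's API.
import networkx as nx

def local_clustering(ego_net: nx.Graph, center) -> float:
    """The user program sees a whole extracted subgraph, not one vertex."""
    neighbors = list(ego_net.neighbors(center))
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Count edges among the center's neighbors inside the ego network.
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if ego_net.has_edge(neighbors[i], neighbors[j]))
    return 2.0 * links / (k * (k - 1))

G = nx.karate_club_graph()
for v in G.nodes():
    ego = nx.ego_graph(G, v, radius=1)   # the declared "subgraph of interest"
    print(v, round(local_clustering(ego, v), 3))
```

In a distributed setting, each extracted neighborhood could be processed independently, which is the property a neighborhood-centric framework exploits.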
Abstract:
“Parallel Ruptures: Jews of Bessarabia and Transnistria between Romanian Nationalism and Soviet Communism, 1918-1940” explores the political and social debates that took place in Jewish communities in Romanian-held Bessarabia and the Moldovan Autonomous Soviet Socialist Republic during the interwar era. Both had been part of the Russian Pale of Settlement until its dissolution in 1917; they were then divided by the Romanian Army’s occupation of Bessarabia in 1918 and the establishment of a well-guarded border along the Dniester River between two newly formed states, Greater Romania and the Soviet Union. At its core, the project focuses in comparative context on the traumatic and multi-faceted confrontation with these two modernizing states: exclusion, discrimination and growing violence in Bessarabia; destruction of religious tradition, agricultural resettlement, and socialist re-education and assimilation in Soviet Transnistria. It also examines the similarities in both states’ striving to create model subjects usable by the homeland, as well as commonalities within Jewish responses on both sides of the border. Contacts between Jews on either side of the border remained significant after 1918 despite the efforts of both states to curb them, thereby necessitating a transnational view in order to examine Jewish political and social life in these borderland regions. The desire among Jewish secular leaders to mold their co-religionists into modern Jews reached across state borders and ideological divides and sought to manipulate the respective governments to achieve these goals, however unsuccessfully in the final analysis. Finally, the strained relations between Jews in the peripheral borderlands and those at the national/imperial cores, Moscow and Bucharest, shed light on the complex circumstances surrounding the inclusion-versus-exclusion debates at the heart of all interwar European states and on the complicated negotiations that took place within all minority communities that responded to state policies.
Abstract:
In the past decade, systems that extract information from millions of Internet documents have become commonplace. Knowledge graphs -- structured knowledge bases that describe entities, their attributes and the relationships between them -- are a powerful tool for understanding and organizing this vast amount of information. However, a significant obstacle to knowledge graph construction is the unreliability of the extracted information, due to noise and ambiguity in the underlying data, errors made by the extraction system, and the complexity of reasoning about the dependencies between these noisy extractions. My dissertation addresses these challenges by exploiting the interdependencies between facts to improve the quality of the knowledge graph in a scalable framework. I introduce a new approach called knowledge graph identification (KGI), which resolves the entities, attributes and relationships in the knowledge graph by incorporating uncertain extractions from multiple sources, entity co-references, and ontological constraints. I define a probability distribution over possible knowledge graphs and infer the most probable knowledge graph using a combination of probabilistic and logical reasoning. Such probabilistic models are frequently dismissed due to scalability concerns, but my implementation of KGI maintains tractable performance on large problems through the use of hinge-loss Markov random fields, which have a convex inference objective. This allows the inference of large knowledge graphs with 4M facts and 20M ground constraints in 2 hours. To further scale the solution, I develop a distributed approach to the KGI problem which runs in parallel across multiple machines, reducing inference time by 90%. Finally, I extend my model to the streaming setting, where a knowledge graph is continuously updated by incorporating newly extracted facts. I devise a general approach for approximately updating inference in convex probabilistic models, and quantify the approximation error by defining and bounding inference regret for online models. Together, my work retains the attractive features of probabilistic models while providing the scalability necessary for large-scale knowledge graph construction. These models have been applied to a number of real-world knowledge graph projects, including the NELL project at Carnegie Mellon and the Google Knowledge Graph.
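For reference, a hinge-loss Markov random field as defined in the HL-MRF/PSL literature takes roughly the following general form; the symbols below follow that literature and are not specific to this dissertation's notation:

```latex
% HL-MRF over continuous variables y in [0,1]^n, given evidence x
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\Big(-\sum_{r=1}^{m} \lambda_r \, \phi_r(\mathbf{y}, \mathbf{x})\Big),
\qquad
\phi_r(\mathbf{y}, \mathbf{x}) = \big(\max\{\ell_r(\mathbf{y}, \mathbf{x}),\, 0\}\big)^{\rho_r},
\quad \rho_r \in \{1, 2\}
```

Each \(\ell_r\) is a linear function of \(\mathbf{y}\) obtained by grounding a rule (for example an ontological constraint or a co-reference link), so MAP inference over this density, i.e. finding the most probable knowledge graph, is a convex optimization problem.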
Abstract:
Scientific applications rely heavily on floating point data types. Floating point operations are complex and require complicated hardware that is both area- and power-intensive. The emergence of massively parallel architectures like Rigel creates new challenges and poses new questions with respect to floating point support. The massively parallel nature of Rigel places great emphasis on area-efficient, low-power designs. At the same time, Rigel is a general purpose accelerator and must provide high performance for a wide class of applications. This thesis presents an analysis of various floating point unit (FPU) components with respect to Rigel, and attempts to present a candidate FPU design that balances performance, area, and power and is suitable for massively parallel architectures like Rigel.
Abstract:
Small-colony variants (SCVs) are commonly observed in evolution experiments and clinical isolates, and are associated with antibiotic resistance and persistent infections. We recently observed the repeated emergence of Escherichia coli SCVs during adaptation to the interaction with macrophages. To identify the genetic targets underlying the emergence of this clinically relevant morphotype, we performed whole-genome sequencing of independently evolved SCV clones. We uncovered novel mutational targets not previously associated with SCVs (e.g. cydA, pepP) and observed widespread functional parallelism. All SCV clones had mutations in genes related to the electron-transport chain. As SCVs emerged during adaptation to macrophages and often show increased antibiotic resistance, we measured SCV fitness inside macrophages and determined their antibiotic resistance profiles. SCVs had a fitness advantage inside macrophages and showed increased aminoglycoside resistance in vitro, but had collateral sensitivity to other antibiotics (e.g. tetracycline). Importantly, we observed similar results in vivo. SCVs had a fitness advantage upon colonization of the mouse gut, which could be tuned by antibiotic treatment: kanamycin (an aminoglycoside) increased SCV fitness, but tetracycline strongly reduced it. Our results highlight the power of using experimental evolution as the basis for identifying the causes and consequences of adaptation during host-microbe interactions.
Abstract:
Solving linear systems is an important problem for scientific computing. Exploiting parallelism is essential for solving complex systems, and this traditionally involves writing parallel algorithms on top of a library such as MPI. The SPIKE family of algorithms is one well-known example of a parallel solver for linear systems. The Hierarchically Tiled Array (HTA) data type extends traditional data-parallel array operations with explicit tiling and allows programmers to directly manipulate tiles. The tiles of the HTA data type map naturally to the block nature of many numeric computations, including the SPIKE family of algorithms. The higher level of abstraction of the HTA enables the same program to be portable across different platforms. Current implementations target both shared-memory and distributed-memory models. In this thesis we present a proof-of-concept for portable linear solvers. We implement two algorithms from the SPIKE family using the HTA library. We show that our implementations of SPIKE exploit the abstractions provided by the HTA to produce compact, clean code that can run on both shared-memory and distributed-memory models without modification. We discuss how we map the algorithms to HTA programs and examine their performance. We compare the performance of our HTA codes to that of comparable codes written in MPI, as well as to current state-of-the-art linear algebra routines.
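To illustrate the tiled view of data that this abstract refers to, the sketch below performs a matrix-vector product one tile at a time in plain NumPy. It is only a conceptual illustration of explicit tiling, not the HTA library's API, and the tile size is an arbitrary assumption; in an HTA program each tile-level operation could be assigned to a different core or node.

```python
# Illustrative sketch only: a "tiled" matrix-vector product in plain NumPy,
# mimicking the tile-level view that Hierarchically Tiled Arrays expose.
import numpy as np

def tiled_matvec(A, x, tile):
    n = A.shape[0]
    y = np.zeros(n)
    # The loops iterate over tiles; each tile-level product is an independent
    # block operation that a tiled runtime could place on a different worker.
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            y[i:i + tile] += A[i:i + tile, j:j + tile] @ x[j:j + tile]
    return y

A = np.random.rand(8, 8)
x = np.random.rand(8)
assert np.allclose(tiled_matvec(A, x, tile=4), A @ x)
```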
Abstract:
Vertebrate genomes are organised into a variety of nuclear environments and chromatin states that have profound effects on the regulation of gene transcription. This variation presents a major challenge to the expression of transgenes for experimental research, genetic therapies and the production of biopharmaceuticals. The majority of transgenes succumb to transcriptional silencing by their chromosomal environment when they are randomly integrated into the genome, a phenomenon known as chromosomal position effect (CPE). It is not always feasible to target transgene integration to transcriptionally permissive “safe harbour” loci that favour transgene expression, so there remains an unmet need to identify gene regulatory elements that can be added to transgenes to protect them against CPE. Dominant regulatory elements (DREs) with chromatin barrier (or boundary) activity have been shown to protect transgenes from CPE. The HS4 element from the chicken beta-globin locus and the A2UCOE element from a human housekeeping gene locus have been shown to function as DRE barriers in a wide variety of cell types and species. Despite rapid advances in the profiling of transcription factor binding, chromatin states and chromosomal looping interactions, progress towards functionally validating the many candidate barrier elements in vertebrates has been very slow. This is largely due to the lack of a tractable and efficient assay for chromatin barrier activity. In this study, I have developed the RGBarrier assay system to test the chromatin barrier activity of candidate DREs at pre-defined isogenic loci in human cells. The RGBarrier assay consists of a Flp-based RMCE reaction for the integration of an expression construct, carrying candidate DREs, at a pre-characterised chromosomal location. The RGBarrier system tracks red, green and blue fluorescent proteins by flow cytometry to monitor on-target versus off-target integration and transgene expression. Analysis of reporter (GFP) expression over several weeks gives a measure of the ability of each candidate element to protect against chromosomal silencing. This assay can be scaled up to test tens of new putative barrier elements in the same chromosomal context in parallel. The defined chromosomal contexts of the RGBarrier assays will allow detailed mechanistic studies of chromosomal silencing and of DRE barrier element action. Understanding these mechanisms will be of paramount importance for the design of specific solutions for overcoming chromosomal silencing in specific transgenic applications.
Abstract:
A poster of this paper will be presented at the 25th International Conference on Parallel Architecture and Compilation Technology (PACT ’16), September 11-15, 2016, Haifa, Israel.
Abstract:
Virtual Screening (VS) methods can considerably aid clinical research by predicting how ligands interact with drug targets. Most VS methods assume a single binding site for the target, but it has been demonstrated that diverse ligands interact with unrelated parts of the target, and many VS methods do not take this relevant fact into account. This problem is circumvented by a novel VS methodology named BINDSURF, which scans the whole protein surface to find new hotspots where ligands might potentially interact, and which is implemented on massively parallel Graphics Processing Units, allowing fast processing of large ligand databases. BINDSURF can thus be used in drug discovery, drug design and drug repurposing, and therefore helps considerably in clinical research. However, the accuracy of most VS methods is constrained by limitations in the scoring function that describes biomolecular interactions, and even nowadays these uncertainties are not completely understood. In order to address this problem, we propose a novel approach in which neural networks are trained on databases of known active (drug) and inactive compounds, and are later used to improve VS predictions.
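As a rough illustration of the final idea in this abstract, the sketch below trains a small neural network on labelled active/inactive compounds and uses its scores as an additional signal for ranking. The descriptors, labels and network size are placeholder assumptions, not the authors' actual data or model.

```python
# Minimal sketch: a neural-network classifier over compound descriptors,
# whose probabilities could be blended with a docking score to re-rank hits.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))             # placeholder molecular descriptors
y = (X[:, :4].sum(axis=1) > 0).astype(int)  # placeholder active/inactive labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
# clf.predict_proba(new_descriptors)[:, 1] would give the activity score
# used to refine the VS ranking.
```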
Abstract:
In Brazil, human and canine visceral leishmaniasis (CVL) caused by Leishmania infantum has undergone urbanisation since 1980, constituting a public health problem, and serological tests are the tools of choice for identifying infected dogs. Until recently, the Brazilian zoonoses control program recommended enzyme-linked immunosorbent assays (ELISA) and indirect immunofluorescence assays (IFA) as the screening and confirmatory methods, respectively, for the detection of canine infection. The purpose of this study was to estimate the accuracy of ELISA and IFA in parallel or serial combinations. The reference standard comprised the results of direct visualisation of parasites in histological sections, immunohistochemical testing, or isolation of the parasite in culture. Samples from 98 cases and 1,327 non-cases were included. Individually, the tests presented sensitivities of 91.8% and 90.8%, and specificities of 83.4% and 53.4%, for ELISA and IFA, respectively. When the tests were used in parallel combination, sensitivity reached 99.2%, while specificity dropped to 44.8%. When they were used in serial combination (ELISA followed by IFA), decreased sensitivity (83.3%) and increased specificity (92.5%) were observed. The serial testing approach improved specificity with a moderate loss in sensitivity. This strategy could partially fulfill the needs of public health authorities and dog owners for a more accurate diagnosis of CVL.
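For context, if the two tests are assumed to err independently given the true infection status, the standard formulas for the combined sensitivity (Se) and specificity (Sp) are:

```latex
% Parallel combination: positive if ELISA or IFA is positive
Se_{par} = 1 - (1 - Se_{E})(1 - Se_{I}), \qquad Sp_{par} = Sp_{E}\,Sp_{I}
% Serial combination: positive only if both ELISA and IFA are positive
Se_{ser} = Se_{E}\,Se_{I}, \qquad Sp_{ser} = 1 - (1 - Sp_{E})(1 - Sp_{I})
```

Plugging in the individual estimates above (91.8%/83.4% for ELISA, 90.8%/53.4% for IFA) gives roughly 99.2%/44.5% for the parallel scheme and 83.3%/92.3% for the serial scheme, close to the empirically observed combined values reported in the study.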
Abstract:
Performance and scalability of model transformations are becoming prominent topics in Model-Driven Engineering. In previous work we introduced LinTra, a platform for executing model transformations in parallel. LinTra is based on the Linda coordination model and is intended to be used as a middleware to which high-level model transformation languages are compiled. In this paper we present the initial results of our analysis of the scalability of out-place model-to-model transformation executions in LinTra when the models and the processing elements are distributed over a set of machines.
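As background, the sketch below shows a toy version of the Linda coordination model mentioned here: workers coordinate only through shared tuple spaces via out/take operations. It is a simplified Python illustration, not LinTra's implementation; the two-space layout and the "poison pill" shutdown are assumptions made for brevity.

```python
# Toy Linda-style coordination: workers pull model elements from a tuple
# space, apply an out-place "transformation rule", and publish the results.
import queue
import threading

class TupleSpace:
    """Minimal tuple space: out() publishes a tuple, take() removes one."""
    def __init__(self):
        self._q = queue.Queue()
    def out(self, t):
        self._q.put(t)
    def take(self):
        return self._q.get()

tasks, results = TupleSpace(), TupleSpace()

def worker():
    while True:
        element = tasks.take()
        if element is None:          # poison pill ends the worker
            break
        results.out(element.upper()) # trivial stand-in for a rule application

threads = [threading.Thread(target=worker) for _ in range(2)]
for th in threads:
    th.start()
for element in ["class", "attribute", "reference"]:
    tasks.out(element)
for _ in threads:
    tasks.out(None)
for th in threads:
    th.join()
for _ in range(3):
    print(results.take())
```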
Abstract:
With hundreds of millions of users reporting locations and embracing mobile technologies, Location Based Services (LBSs) are raising new challenges. In this dissertation, we address three emerging problems in location services, where geolocation data plays a central role. First, to handle the unprecedented growth of generated geolocation data, existing location services rely on geospatial database systems. However, their inability to leverage combined geographical and textual information in analytical queries (e.g. spatial similarity joins) remains an open problem. To address this, we introduce SpsJoin, a framework for computing spatial set-similarity joins. SpsJoin handles combined similarity queries that involve textual and spatial constraints simultaneously. LBSs use this system to tackle different types of problems, such as deduplication, geolocation enhancement and record linkage. We define the spatial set-similarity join problem in a general case and propose an algorithm for its efficient computation. Our solution utilizes parallel computing with MapReduce to handle scalability issues in large geospatial databases. Second, applications that use geolocation data are seldom concerned with ensuring the privacy of participating users. To motivate participation and address privacy concerns, we propose iSafe, a privacy-preserving algorithm for computing safety snapshots of co-located mobile devices as well as geosocial network users. iSafe combines geolocation data extracted from crime datasets and geosocial networks such as Yelp. In order to enhance iSafe's ability to compute safety recommendations, even when crime information is incomplete or sparse, we need to identify relationships between Yelp venues and crime indices at their locations. To achieve this, we use SpsJoin on two datasets (Yelp venues and geolocated businesses) to find venues that have not been reviewed and to further compute the crime indices of their locations. Our results show a statistically significant dependence between location crime indices and Yelp features. Third, review-centered LBSs (e.g., Yelp) are increasingly becoming targets of malicious campaigns that aim to bias the public image of represented businesses. Although Yelp actively attempts to detect and filter fraudulent reviews, our experiments showed that Yelp is still vulnerable. Fraudulent LBS information also impacts the ability of iSafe to provide correct safety values. We take steps toward addressing this problem by proposing SpiDeR, an algorithm that takes advantage of the richness of information available in Yelp to detect abnormal review patterns. We propose a fake venue detection solution that applies SpsJoin on Yelp and U.S. housing datasets. We validate the proposed solutions using ground truth data extracted by our experiments and reviews filtered by Yelp.
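As an illustration of the kind of predicate a spatial set-similarity join evaluates, the sketch below pairs records whose token sets are similar (Jaccard) and whose locations are close (haversine distance). The thresholds, field names and naive nested loop are illustrative assumptions only; SpsJoin itself distributes this computation with MapReduce.

```python
# Toy combined textual + spatial join predicate, not SpsJoin's algorithm.
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def sim_join(R, S, tau=0.5, d_km=1.0):
    # Naive nested loop for clarity; a real system would prune with indexes.
    return [(r, s) for r in R for s in S
            if jaccard(r["tokens"], s["tokens"]) >= tau
            and haversine_km(r["loc"], s["loc"]) <= d_km]

R = [{"tokens": ["joes", "pizza"], "loc": (25.760, -80.190)}]
S = [{"tokens": ["joes", "pizza", "miami"], "loc": (25.761, -80.191)}]
print(sim_join(R, S))
```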
Abstract:
Since the precise linear actuators of a compliant parallel manipulator cannot tolerate transverse motions/loads in multi-axis motion, actuation isolation should be considered in compliant manipulator design to eliminate transverse motion at the point of actuation. This paper presents an effective design method for constructing compliant parallel manipulators with actuation isolation, by adding the same number of actuation legs as the number of degrees of freedom (DOF) of the original mechanism. The method is demonstrated by two design case studies, one of which is studied quantitatively through analytical modelling. The modelling results confirm possible inherent issues of the proposed structural design method, such as increased primary stiffness, extra parasitic motions and cross-axis coupling motions.
Abstract:
This work aimed to study the drying of ryegrass seeds (Lolium multiflorum L.) in a fixed-bed dryer with parallel air flow.