23 resultados para cache-oblivious
Resumo:
FastFlow is a programming framework specifically targeting cache-coherent shared-memory multi-cores. It is implemented as a stack of C++ template libraries built on top of lock-free (and memory fence free) synchronization mechanisms. Its philosophy is to combine programmability with performance. In this paper a new FastFlow programming methodology aimed at supporting parallelization of existing sequential code via offloading onto a dynamically created software accelerator is presented. The new methodology has been validated using a set of simple micro-benchmarks and some real applications. © 2011 Springer-Verlag.
Resumo:
This paper describes the ParaPhrase project, a new 3-year targeted research project funded under EU Framework 7 Objective 3.4 (Computer Systems), starting in October 2011. ParaPhrase aims to follow a new approach to introducing parallelism using advanced refactoring techniques coupled with high-level parallel design patterns. The refactoring approach will use these design patterns to restructure programs defined as networks of software components into other forms that are more suited to parallel execution. The programmer will be aided by high-level cost information that will be integrated into the refactoring tools. The implementation of these patterns will then use a well-understood algorithmic skeleton approach to achieve good parallelism. A key ParaPhrase design goal is that parallel components are intended to match heterogeneous architectures, defined in terms of CPU/GPU combinations, for example. In order to achieve this, the ParaPhrase approach will map components at link time to the available hardware, and will then re-map them during program execution, taking account of multiple applications, changes in hardware resource availability, the desire to reduce communication costs etc. In this way, we aim to develop a new approach to programming that will be able to produce software that can adapt to dynamic changes in the system environment. Moreover, by using a strong component basis for parallelism, we can achieve potentially significant gains in terms of reducing sharing at a high level of abstraction, and so in reducing or even eliminating the costs that are usually associated with cache management, locking, and synchronisation. © 2013 Springer-Verlag Berlin Heidelberg.
Resumo:
This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and the implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system whereby different stages and the synchronization operations between them alternate between a distributed memory address space and a shared memory address space, to improve performance and scalability. We compare HyMR to a reference implementation and we find that HyMR improves performance by a factor of 1.71× over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-art implementation for systems with hardware-managed cache coherence in terms of scalability and sustained to peak data processing bandwidth, where HyMR demon- strates improvements of a factor of 3.1× and 3.2× respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and assess its performance to be superior of that of message passing.
Resumo:
A framework for assessing the robustness of long-duration repetitive orchestrations in uncertain evolving environments is proposed. The model assumes that service-based evaluation environments are stable over short time-frames only; over longer periods service-based environments evolve as demand fluctuates and contention for shared resources varies. The behaviour of a short-duration orchestration E in a stable environment is assessed by an uncertainty profile U and a corresponding zero-sum angel-daemon game Γ(U) [2]. Here the angel-daemon approach is extended to assess evolving environments by means of a subfamily of stochastic games. These games are called strategy oblivious because their transition probabilities are strategy independent. It is shown that the value of a strategy oblivious stochastic game is well defined and that it can be computed by solving a linear system. Finally, the proposed stochastic framework is used to assess the evolution of the Gabrmn IT system.
Resumo:
Motivated by the need for designing efficient and robust fully-distributed computation in highly dynamic networks such as Peer-to-Peer (P2P) networks, we study distributed protocols for constructing and maintaining dynamic network topologies with good expansion properties. Our goal is to maintain a sparse (bounded degree) expander topology despite heavy {\em churn} (i.e., nodes joining and leaving the network continuously over time). We assume that the churn is controlled by an adversary that has complete knowledge and control of what nodes join and leave and at what time and has unlimited computational power, but is oblivious to the random choices made by the algorithm. Our main contribution is a randomized distributed protocol that guarantees with high probability the maintenance of a {\em constant} degree graph with {\em high expansion} even under {\em continuous high adversarial} churn. Our protocol can tolerate a churn rate of up to $O(n/\poly\log(n))$ per round (where $n$ is the stable network size). Our protocol is efficient, lightweight, and scalable, and it incurs only $O(\poly\log(n))$ overhead for topology maintenance: only polylogarithmic (in $n$) bits needs to be processed and sent by each node per round and any node's computation cost per round is also polylogarithmic. The given protocol is a fundamental ingredient that is needed for the design of efficient fully-distributed algorithms for solving fundamental distributed computing problems such as agreement, leader election, search, and storage in highly dynamic P2P networks and enables fast and scalable algorithms for these problems that can tolerate a large amount of churn.
Resumo:
We study the fundamental Byzantine leader election problem in dynamic networks where the topology can change from round to round and nodes can also experience heavy {\em churn} (i.e., nodes can join and leave the network continuously over time). We assume the full information model where the Byzantine nodes have complete knowledge about the entire state of the network at every round (including random choices made by all the nodes), have unbounded computational power and can deviate arbitrarily from the protocol. The churn is controlled by an adversary that has complete knowledge and control over which nodes join and leave and at what times and also may rewire the topology in every round and has unlimited computational power, but is oblivious to the random choices made by the algorithm. Our main contribution is an $O(\log^3 n)$ round algorithm that achieves Byzantine leader election under the presence of up to $O({n}^{1/2 - \epsilon})$ Byzantine nodes (for a small constant $\epsilon > 0$) and a churn of up to \\$O(\sqrt{n}/\poly\log(n))$ nodes per round (where $n$ is the stable network size).The algorithm elects a leader with probability at least $1-n^{-\Omega(1)}$ and guarantees that it is an honest node with probability at least $1-n^{-\Omega(1)}$; assuming the algorithm succeeds, the leader's identity will be known to a $1-o(1)$ fraction of the honest nodes. Our algorithm is fully-distributed, lightweight, and is simple to implement. It is also scalable, as it runs in polylogarithmic (in $n$) time and requires nodes to send and receive messages of only polylogarithmic size per round.To the best of our knowledge, our algorithm is the first scalable solution for Byzantine leader election in a dynamic network with a high rate of churn; our protocol can also be used to solve Byzantine agreement in a straightforward way.We also show how to implement an (almost-everywhere) public coin with constant bias in a dynamic network with Byzantine nodes and provide a mechanism for enabling honest nodes to store information reliably in the network, which might be of independent interest.
Resumo:
WHIRLBOB, also known as STRIBOBr2, is an AEAD (Authenticated Encryption with Associated Data) algorithm derived from STRIBOBr1 and the Whirlpool hash algorithm. WHIRLBOB/STRIBOBr2 is a second round candidate in the CAESAR competition. As with STRIBOBr1, the reduced-size Sponge design has a strong provable security link with a standardized hash algorithm. The new design utilizes only the LPS or ρ component of Whirlpool in flexibly domain-separated BLNK Sponge mode. The number of rounds is increased from 10 to 12 as a countermeasure against Rebound Distinguishing attacks. The 8 ×8 - bit S-Box used by Whirlpool and WHIRLBOB is constructed from 4 ×4 - bit “MiniBoxes”. We report on fast constant-time Intel SSSE3 and ARM NEON SIMD WHIRLBOB implementations that keep full miniboxes in registers and access them via SIMD shuffles. This is an efficient countermeasure against AES-style cache timing side-channel attacks. Another main advantage of WHIRLBOB over STRIBOBr1 (and most other AEADs) is its greatly reduced implementation footprint on lightweight platforms. On many lower-end microcontrollers the total software footprint of π+BLNK = WHIRLBOB AEAD is less than half a kilobyte. We also report an FPGA implementation that requires 4,946 logic units for a single round of WHIRLBOB, which compares favorably to 7,972 required for Keccak / Keyak on the same target platform. The relatively small S-Box gate count also enables efficient 64-bit bitsliced straight-line implementations. We finally present some discussion and analysis on the relationships between WHIRLBOB, Whirlpool, the Russian GOST Streebog hash, and the recent draft Russian Encryption Standard Kuznyechik.
Resumo:
The worsening of process variations and the consequent increased spreads in circuit performance and consumed power hinder the satisfaction of the targeted budgets and lead to yield loss. Corner based design and adoption of design guardbands might limit the yield loss. However, in many cases such methods may not be able to capture the real effects which might be way better than the predicted ones leading to increasingly pessimistic designs. The situation is even more severe in memories which consist of substantially different individual building blocks, further complicating the accurate analysis of the impact of variations at the architecture level leaving many potential issues uncovered and opportunities unexploited. In this paper, we develop a framework for capturing non-trivial statistical interactions among all the components of a memory/cache. The developed tool is able to find the optimum memory/cache configuration under various constraints allowing the designers to make the right choices early in the design cycle and consequently improve performance, energy, and especially yield. Our, results indicate that the consideration of the architectural interactions between the memory components allow to relax the pessimistic access times that are predicted by existing techniques.