44 results for Benchmarks


Relevance:

10.00% 10.00%

Publisher:

Abstract:

The efficient development of multi-threaded software has, for many years, been an unsolved problem in computer science. Finding a solution to this problem has become urgent with the advent of multi-core processors. Furthermore, the problem has become more complicated because multi-cores are everywhere (desktops, laptops, embedded systems). As such, they execute generic programs, which exhibit very different characteristics from the scientific applications that have been the focus of parallel computing in the past.
Implicitly parallel programming is an approach to parallel programming that promises high productivity and efficiency and rules out synchronization errors and race conditions by design. There are two main ingredients to implicitly parallel programming: (i) a conventional sequential programming language that is extended with annotations that describe the semantics of the program and (ii) an automatic parallelizing compiler that uses the annotations to increase the degree of parallelization.
It is extremely important that the annotations and the automatic parallelizing compiler are designed with the target application domain in mind. In this paper, we discuss the Paralax approach to implicitly parallel programming and we review how the annotations and the compiler design help to successfully parallelize generic programs. We evaluate Paralax on SPECint benchmarks, which are a model for such programs, and demonstrate scalable speedups, up to a factor of 6 on 8 cores.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Many scientific applications are programmed using hybrid programming models that use both message passing and shared memory, due to the increasing prevalence of large-scale systems with multicore, multisocket nodes. Previous work has shown that energy efficiency can be improved using software-controlled execution schemes that consider both the programming model and the power-aware execution capabilities of the system. However, such approaches have focused on identifying optimal resource utilization for one programming model, either shared memory or message passing, in isolation. The potential solution space, thus the challenge, increases substantially when optimizing hybrid models since the possible resource configurations increase exponentially. Nonetheless, with the accelerating adoption of hybrid programming models, we increasingly need improved energy efficiency in hybrid parallel applications on large-scale systems. In this work, we present new software-controlled execution schemes that consider the effects of dynamic concurrency throttling (DCT) and dynamic voltage and frequency scaling (DVFS) in the context of hybrid programming models. Specifically, we present predictive models and novel algorithms based on statistical analysis that anticipate application power and time requirements under different concurrency and frequency configurations. We apply our models and methods to the NPB MZ benchmarks and selected applications from the ASC Sequoia codes. Overall, we achieve substantial energy savings (8.74 percent on average and up to 13.8 percent) with some performance gain (up to 7.5 percent) or negligible performance loss.
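
The selection step described in this abstract, choosing a concurrency and frequency configuration from predicted time and power, can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `pick_configuration`, the `(threads, GHz)` keys, and the numbers in the usage note are all hypothetical, and the predictions are assumed to come from some external model. The sketch only shows the arithmetic of energy = power × time under a bound on slowdown.

```python
def pick_configuration(predictions, baseline, max_slowdown=0.0):
    """Pick the configuration with the lowest predicted energy.

    predictions: dict mapping (threads, ghz) -> (time_s, power_w),
                 assumed to be produced by some predictive model.
    baseline:    key of the reference configuration.
    max_slowdown: tolerated relative increase in run time (0.0 = none).
    """
    base_time, _ = predictions[baseline]
    limit = base_time * (1.0 + max_slowdown)
    # Energy in joules is predicted power (W) times predicted time (s).
    feasible = {
        cfg: time_s * power_w
        for cfg, (time_s, power_w) in predictions.items()
        if time_s <= limit
    }
    return min(feasible, key=feasible.get)


# Hypothetical predictions, for illustration only.
example = {
    (8, 2.4): (10.0, 200.0),
    (8, 1.8): (12.0, 155.0),
    (4, 2.4): (10.0, 180.0),
    (4, 1.8): (14.0, 120.0),
}
```

With these made-up inputs, `pick_configuration(example, (8, 2.4))` keeps only configurations that are no slower than the baseline, while a larger `max_slowdown` trades run time for lower predicted energy.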

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Purpose: Collaboration in academic medicine is encouraged, yet no one has studied the environment in which faculty collaborate. The authors investigated how faculty experienced collaboration and the institutional atmosphere for collaboration. Method: In 2007, as part of a qualitative study of faculty in five disparate U.S. medical schools, the authors interviewed 96 medical faculty at different career stages and in diverse specialties, with an oversampling of women, minorities, and generalists, regarding their perceptions and experiences of collaboration in academic medicine. Data analysis was inductive and driven by the grounded theory tradition. Results: Female faculty expressed enthusiasm about the potential and process of collaboration; male faculty were more likely to focus on outcomes. Senior faculty experienced a more collaborative environment than early career faculty, who faced numerous barriers to collaboration: the hierarchy of medical academe, advancement criteria, and the lack of infrastructure supportive of collaboration. Research faculty appreciated shared ideas, knowledge, resources, and the increased productivity that could result from collaboration, but they were acutely aware that advancement requires an independent body of work, which was a major deterrent to collaboration among early career faculty. Conclusions: Academic medicine faculty have differing views on the impact and benefits of collaboration. Early career faculty face concerning obstacles to collaboration. Female faculty seemed more appreciative of the process of collaboration, which may be of importance for transitioning to a more collaborative academic environment. A reevaluation of effective benchmarks for promotion of faculty is warranted to address the often exclusive reliance on individualistic achievement. © 2009 The Association of American Medical Colleges.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Dual-rail encoding, return-to-spacer protocol, and hazard-free logic can be used to resist power analysis attacks by making energy consumed per clock cycle independent of processed data. Standard dual-rail logic uses a protocol with a single spacer, e.g., all-zeros, which gives rise to energy balancing problems. We address these problems by incorporating two spacers; the spacers alternate between adjacent clock cycles. This guarantees that all gates switch in every clock cycle regardless of the transmitted data values. To generate these dual-rail circuits, an automated tool has been developed. It is capable of converting synchronous netlists into dual-rail circuits and it is interfaced to industry CAD tools. Dual-rail and single-rail benchmarks based upon the advanced encryption standard (AES) have been simulated and compared in order to evaluate the method and the tool.
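
The effect of alternating spacers can be checked with a small toggle-counting sketch. It is an illustration under simplifying assumptions, not the authors' tool: a single bit is carried on two rails, `(1, 0)` is assumed to encode 1 and `(0, 1)` to encode 0, and each clock cycle is modeled as spacer, code word, next spacer. The names `encode` and `rail_toggles` are hypothetical.

```python
def encode(bit):
    # Assumed encoding: (rail_1, rail_0); rail_1 high means 1, rail_0 high means 0.
    return (1, 0) if bit else (0, 1)


def rail_toggles(bits, spacers):
    """Count toggles per rail in each clock cycle.

    bits:    sequence of data bits, one per clock cycle.
    spacers: list of spacer states used in rotation, e.g. [(0, 0)] for a
             single all-zeros spacer or [(0, 0), (1, 1)] for alternation.
    Returns a list of (rail_1_toggles, rail_0_toggles), one per cycle.
    """
    state = spacers[0]
    counts = []
    for i, bit in enumerate(bits):
        next_spacer = spacers[(i + 1) % len(spacers)]
        toggles = [0, 0]
        # Each cycle moves from the current spacer to the code word,
        # then on to the next spacer.
        for nxt in (encode(bit), next_spacer):
            for rail in range(2):
                toggles[rail] += state[rail] != nxt[rail]
            state = nxt
        counts.append(tuple(toggles))
    return counts
```

In this model a single all-zeros spacer makes only the data-dependent rail switch (twice per cycle), whereas alternating all-zeros and all-ones spacers makes both rails switch exactly once per cycle whatever the data.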

Relevance:

10.00% 10.00%

Publisher:

Abstract:

FastFlow is a programming framework specifically targeting cache-coherent shared-memory multi-cores. It is implemented as a stack of C++ template libraries built on top of lock-free (and memory fence free) synchronization mechanisms. Its philosophy is to combine programmability with performance. In this paper a new FastFlow programming methodology aimed at supporting parallelization of existing sequential code via offloading onto a dynamically created software accelerator is presented. The new methodology has been validated using a set of simple micro-benchmarks and some real applications. © 2011 Springer-Verlag.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

We propose a trace-driven approach to predict the performance degradation of disk request response times due to storage device contention in consolidated virtualized environments. Our performance model evaluates a queueing network with fair share scheduling using trace-driven simulation. The model parameters can be deduced from measurements obtained inside Virtual Machines (VMs) from a system where a single VM accesses a remote storage server. The parameterized model can then be used to predict the effect of storage contention when multiple VMs are consolidated on the same virtualized server. The model parameter estimation relies on a search technique that tries to estimate the splitting and merging of blocks at the Virtual Machine Monitor (VMM) level in the case of multiple competing VMs. Simulation experiments based on traces of the Postmark and FFSB disk benchmarks show that our model is able to accurately predict the impact of workload consolidation on VM disk IO response times.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

We present BDDT, a task-parallel runtime system that dynamically discovers and resolves dependencies among parallel tasks. BDDT allows the programmer to specify detailed task footprints on any memory address range, multidimensional array tile or dynamic region. BDDT uses a block-based dependence analysis with arbitrary granularity. The analysis is applicable to existing C programs without having to restructure object or array allocation, and provides flexibility in array layouts and tile dimensions.
We evaluate BDDT using a representative set of benchmarks, and we compare it to SMPSs (the equivalent runtime system in StarSs) and OpenMP. BDDT performs comparably to or better than SMPSs and is able to cope with task granularity as much as one order of magnitude finer than SMPSs. Compared to OpenMP, BDDT performs up to 3.9× better for benchmarks that benefit from dynamic dependence analysis. BDDT provides additional data annotations to bypass dependence analysis. Using these annotations, BDDT also outperforms OpenMP in benchmarks where dependence analysis does not discover additional parallelism, thanks to a more efficient implementation of the runtime system.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

The global ETF industry provides more complicated investment vehicles than low-cost index trackers. Instead, we find that the real investments of ETFs that do not fully replicate their benchmarks may deviate from their benchmarks to leverage informational advantages (which leads to a surprising stock-selection ability), to benefit from the securities lending market, to support ETF-affiliated banks’ stock prices, and to help affiliated OEFs through cross-trading. These effects are more prevalent in ETFs domiciled in Europe. Market awareness of such additional risk is reflected in ETF outflows. These results have important normative implications for consumer protection and financial stability.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

This paper examines power quality benchmarks in the electricity supply industry (ESI) and the impact of standards for the reduction of voltage dip incidents. The paper considers adherence to particular standards and is supported by several case studies of incidents where voltage dips have been detected and assessed by the power systems division of Scottish Power and where improvements have been implemented to help militate against subsequent incidents.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Processor architecture has taken a turn towards many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core processors are harder to program than multi-core and single-core processors because they require parallel or concurrent programs with high degrees of parallelism. Moreover, many-cores have to operate in a mode of strong scaling because of memory bandwidth constraints. In strong scaling, increasingly finer-grain parallelism must be extracted in order to keep all processing cores busy.

Task dataflow programming models have a high potential to simplify parallel programming because they relieve the programmer of identifying precisely all inter-task dependences when writing programs. Instead, the task dataflow runtime system detects and enforces inter-task dependences during execution based on the description of the memory each task accesses. The runtime constructs a task dataflow graph that captures all tasks and their dependences. Tasks are scheduled to execute in parallel, taking into account the dependences specified in the task graph.

Several papers report important overheads for task dataflow systems, which severely limit the scalability and usability of such systems. In this paper we study efficient schemes to manage task graphs and analyze their scalability. We assume a programming model that supports input, output and in/out annotations on task arguments, as well as commutative in/out and reductions. We analyze the structure of task graphs and identify versions and generations as key concepts for efficient management of task graphs. Then, we present three schemes to manage task graphs building on graph representations, hypergraphs and lists. We also consider a fourth edge-less scheme that synchronizes tasks using integers. Analysis using micro-benchmarks shows that the graph representation is not always scalable and that the edge-less scheme introduces the least overhead in nearly all situations.
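
How a runtime can derive a task graph from input and output annotations can be sketched as follows. This is a generic, simplified illustration and not one of the schemes studied in the paper: it handles only plain read and write sets on named objects (an in/out argument appears in both sets), ignores commutativity and reductions, and the function name `build_task_graph` and the task tuples are hypothetical.

```python
from collections import defaultdict


def build_task_graph(tasks):
    """Derive predecessor sets from per-task read and write annotations.

    tasks: list of (name, reads, writes) in program (spawn) order, where
           reads and writes are lists of object names.
    Returns a dict mapping each task name to the set of tasks it must wait for.
    """
    last_writer = {}                        # object -> most recent writing task
    readers_since_write = defaultdict(list)  # object -> readers since that write
    deps = {}
    for name, reads, writes in tasks:
        preds = set()
        for obj in reads:
            if obj in last_writer:
                preds.add(last_writer[obj])          # read-after-write
        for obj in writes:
            if obj in last_writer:
                preds.add(last_writer[obj])          # write-after-write
            preds.update(readers_since_write[obj])   # write-after-read
        deps[name] = preds
        # Record this task's accesses for the tasks that follow it.
        for obj in reads:
            readers_since_write[obj].append(name)
        for obj in writes:
            last_writer[obj] = name
            readers_since_write[obj] = []
    return deps


graph = build_task_graph([
    ("A", [], ["x"]),
    ("B", ["x"], ["y"]),
    ("C", ["x"], ["z"]),
    ("D", ["y", "z"], ["x"]),
])
```

Here `B` and `C` depend only on `A` and may run in parallel, while `D` waits for all three because it reads their outputs and overwrites `x`.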

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Performance evaluation of parallel software and architectural exploration of innovative hardware support face a common challenge with emerging manycore platforms: they are limited by the slow running time and the low accuracy of software simulators. Manycore FPGA prototypes are difficult to build, but they offer great rewards. Software running on such prototypes runs orders of magnitude faster than current simulators. Moreover, researchers gain significant architectural insight during the modeling process. We use the Formic FPGA prototyping board [1], which specifically targets scalable and cost-efficient multi-board prototyping, to build and test a 64-board model of a 512-core, MicroBlaze-based, non-coherent hardware prototype with a full network-on-chip in a 3D-mesh topology. We expand the hardware architecture to include the ARM Versatile Express platforms and build a 520-core heterogeneous prototype of 8 Cortex-A9 cores and 512 MicroBlaze cores. We then develop an MPI library for the prototype and evaluate it extensively using several bare-metal and MPI benchmarks. We find that our processor prototype is highly scalable, faithfully models single-chip multicore architectures, and is a very efficient platform for parallel programming research, being 50,000 times faster than software simulation.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and the implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system whereby different stages and the synchronization operations between them alternate between a distributed memory address space and a shared memory address space, to improve performance and scalability. We compare HyMR to a reference implementation and we find that HyMR improves performance by a factor of 1.71× over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-the-art implementation for systems with hardware-managed cache coherence, in terms of scalability and sustained-to-peak data processing bandwidth, where HyMR demonstrates improvements of a factor of 3.1× and 3.2× respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and assess its performance to be superior to that of message passing.

Relevance:

10.00% 10.00%

Publisher:

Abstract:

Power capping is an essential function for efficient power budgeting and cost management on modern server systems. Contemporary server processors operate under power caps by using dynamic voltage and frequency scaling (DVFS). However, these processors are often deployed in non-uniform memory access (NUMA) architectures, where thread allocation between cores may significantly affect performance and power consumption. This paper proposes a method which maximizes performance under power caps on NUMA systems by dynamically optimizing two knobs: DVFS and thread allocation. The method selects the optimal combination of the two knobs with models based on an artificial neural network (ANN) that captures the nonlinear effect of thread allocation on performance. We implement the proposed method as a runtime system and evaluate it with twelve multithreaded benchmarks on a real AMD Opteron based NUMA system. The evaluation results show that our method outperforms a naive technique optimizing only DVFS by up to 67.1%, under a power cap.