222 resultados para multicore


Relevância:

10.00% 10.00%

Publicador:

Resumo:

Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs using simulations that can take weeks to months to complete. For example, designers of special purpose chips need to explore parameters such as the optimal bitwidth and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems. Currently, high-performance computing offers a wide set of acceleration options, that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. Depending on the simulation requirements, the ideal architecture to use can vary. In this paper we propose a new design flow based on OpenCL, a unified multiplatform programming model, which accelerates LDPC decoding simulations, thereby significantly reducing architectural exploration and design time. OpenCL-based parallel kernels are used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL for mapping the simulations into FPGAs. To the best of our knowledge, this is the first time that a single, unmodified OpenCL code is used to target those three different platforms. We show that, depending on the design parameters to be explored in the simulation, on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors. For example, although simulations can typically execute more than 3x faster on FPGAs than on GPUs, the overhead of circuit synthesis often outweighs the benefits of FPGA-accelerated execution.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Energy consumption has become an important area of research of late. With the advent of new manycore processors, situations have arisen where not all the processors need to be active to reach an optimal relation between performance and energy usage. In this paper a study of the power and energy usage of a series of benchmarks, the PARSEC and the SPLASH- 2X Benchmark Suites, on the Intel Xeon Phi for different threads configurations, is presented. To carry out this study, a tool was designed to monitor and record the power usage in real time during execution time and afterwards to compare the r

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design space exploration for optimization, followed by design verification. Designers of special purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context, we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8,000 bit) to long length (e.g., 64,800 bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated, on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Field programmable gate array devices boast abundant resources with which custom accelerator components for signal, image and data processing may be realised; however, realising high performance, low cost accelerators currently demands manual register transfer level design. Software-programmable ’soft’ processors have been proposed as a way to reduce this design burden but they are unable to support performance and cost comparable to custom circuits. This paper proposes a new soft processing approach for FPGA which promises to overcome this barrier. A high performance, fine-grained streaming processor, known as a Streaming Accelerator Element, is proposed which realises accelerators as large scale custom multicore networks. By adopting a streaming execution approach with advanced program control and memory addressing capabilities, typical program inefficiencies can be almost completely eliminated to enable performance and cost which are unprecedented amongst software-programmable solutions. When used to realise accelerators for fast fourier transform, motion estimation, matrix multiplication and sobel edge detection it is shown how the proposed architecture enables real-time performance and with performance and cost comparable with hand-crafted custom circuit accelerators and up to two orders of magnitude beyond existing soft processors.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In the reinsurance market, the risks natural catastrophes pose to portfolios of properties must be quantified, so that they can be priced, and insurance offered. The analysis of such risks at a portfolio level requires a simulation of up to 800 000 trials with an average of 1000 catastrophic events per trial. This is sufficient to capture risk for a global multi-peril reinsurance portfolio covering a range of perils including earthquake, hurricane, tornado, hail, severe thunderstorm, wind storm, storm surge and riverine flooding, and wildfire. Such simulations are both computation and data intensive, making the application of high-performance computing techniques desirable.

In this paper, we explore the design and implementation of portfolio risk analysis on both multi-core and many-core computing platforms. Given a portfolio of property catastrophe insurance treaties, key risk measures, such as probable maximum loss, are computed by taking both primary and secondary uncertainties into account. Primary uncertainty is associated with whether or not an event occurs in a simulated year, while secondary uncertainty captures the uncertainty in the level of loss due to the use of simplified physical models and limitations in the available data. A combination of fast lookup structures, multi-threading and careful hand tuning of numerical operations is required to achieve good performance. Experimental results are reported for multi-core processors and systems using NVIDIA graphics processing unit and Intel Phi many-core accelerators.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Background: Gene expression connectivity mapping has proven to be a powerful and flexible tool for research. Its application has been shown in a broad range of research topics, most commonly as a means of identifying potential small molecule compounds, which may be further investigated as candidates for repurposing to treat diseases. The public release of voluminous data from the Library of Integrated Cellular Signatures (LINCS) programme further enhanced the utilities and potentials of gene expression connectivity mapping in biomedicine. Results: We describe QUADrATiC (http://go.qub.ac.uk/QUADrATiC), a user-friendly tool for the exploration of gene expression connectivity on the subset of the LINCS data set corresponding to FDA-approved small molecule compounds. It enables the identification of compounds for repurposing therapeutic potentials. The software is designed to cope with the increased volume of data over existing tools, by taking advantage of multicore computing architectures to provide a scalable solution, which may be installed and operated on a range of computers, from laptops to servers. This scalability is provided by the use of the modern concurrent programming paradigm provided by the Akka framework. The QUADrATiC Graphical User Interface (GUI) has been developed using advanced Javascript frameworks, providing novel visualization capabilities for further analysis of connections. There is also a web services interface, allowing integration with other programs or scripts.Conclusions: QUADrATiC has been shown to provide an improvement over existing connectivity map software, in terms of scope (based on the LINCS data set), applicability (using FDA-approved compounds), usability and speed. It offers potential to biological researchers to analyze transcriptional data and generate potential therapeutics for focussed study in the lab. QUADrATiC represents a step change in the process of investigating gene expression connectivity and provides more biologically-relevant results than previous alternative solutions.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Energy consumption is an important concern in modern multicore processors. The energy consumed by a multicore processor during the execution of an application can be minimized by tuning the hardware state utilizing knobs such as frequency, voltage etc. The existing theoretical work on energy minimization using Global DVFS (Dynamic Voltage and Frequency Scaling), despite being thorough, ignores the time and the energy consumed by the CPU on memory accesses and the dynamic energy consumed by the idle cores. This article presents an analytical energy-performance model for parallel workloads that accounts for the time and the energy consumed by the CPU chip on memory accesses in addition to the time and energy consumed by the CPU on CPU instructions. In addition, the model we present also accounts for the dynamic energy consumed by the idle cores. The existing work on global DVFS for parallel workloads shows that using a single frequency for the entire duration of a parallel application is not energy optimal and that varying the frequency according to the changes in the parallelism of the workload can save energy. We present an analytical framework around our energy-performance model to predict the operating frequencies (that depend upon the amount of parallelism) for global DVFS that minimize the overall CPU energy consumption. We show how the optimal frequencies in our model differ from the optimal frequencies in a model that does not account for memory accesses. We further show how the memory intensity of an application affects the optimal frequencies.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Tese de doutoramento, Informática (Ciências da Computação), Universidade de Lisboa, Faculdade de Ciências, 2015

Relevância:

10.00% 10.00%

Publicador:

Resumo:

High-level parallel languages offer a simple way for application programmers to specify parallelism in a form that easily scales with problem size, leaving the scheduling of the tasks onto processors to be performed at runtime. Therefore, if the underlying system cannot efficiently execute those applications on the available cores, the benefits will be lost. In this paper, we consider how to schedule highly heterogenous parallel applications that require real-time performance guarantees on multicore processors. The paper proposes a novel scheduling approach that combines the global Earliest Deadline First (EDF) scheduler with a priority-aware work-stealing load balancing scheme, which enables parallel realtime tasks to be executed on more than one processor at a given time instant. Experimental results demonstrate the better scalability and lower scheduling overhead of the proposed approach comparatively to an existing real-time deadline-oriented scheduling class for the Linux kernel.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The usage of COTS-based multicores is becoming widespread in the field of embedded systems. Providing realtime guarantees at design-time is a pre-requisite to deploy real-time systems on these multicores. This necessitates the consideration of the impact of the contention due to shared low-level hardware resources on the Worst-Case Execution Time (WCET) of the tasks. As a step towards this aim, this paper first identifies the different factors that make the WCET analysis a challenging problem in a typical COTS-based multicore system. Then, we propose and prove, a mathematically correct method to determine tight upper bounds on the WCET of the tasks, when they are co-scheduled on different cores.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Consider the problem of scheduling a set of sporadic tasks on a multiprocessor system to meet deadlines using a task-splitting scheduling algorithm. Task-splitting (also called semi-partitioning) scheduling algorithms assign most tasks to just one processor but a few tasks are assigned to two or more processors, and they are dispatched in a way that ensures that a task never executes on two or more processors simultaneously. A particular type of task-splitting algorithms, called slot-based task-splitting dispatching, is of particular interest because of its ability to schedule tasks with high processor utilizations. Unfortunately, no slot-based task-splitting algorithm has been implemented in a real operating system so far. In this paper we discuss and propose some modifications to the slot-based task-splitting algorithm driven by implementation concerns, and we report the first implementation of this family of algorithms in a real operating system running Linux kernel version 2.6.34. We have also conducted an extensive range of experiments on a 4-core multicore desktop PC running task-sets with utilizations of up to 88%. The results show that the behavior of our implementation is in line with the theoretical framework behind it.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Embedded real-time systems often have to support the embedding system in very different and changing application scenarios. An aircraft taxiing, taking off and in cruise flight is one example. The different application scenarios are reflected in the software structure with a changing task set and thus different operational modes. At the same time there is a strong push for integrating previously isolated functionalities in single-chip multicore processors. On such multicores the behavior of the system during a mode change, when the systems transitions from one mode to another, is complex but crucial to get right. In the past we have investigated mode change in multiprocessor systems where a mode change requires a complete change of task set. Now, we present the first analysis which considers mode changes in multicore systems, which use global EDF to schedule a set of mode independent (MI) and mode specific (MS) tasks. In such systems, only the set of MS tasks has to be replaced during mode changes, without jeopardizing the schedulability of the MI tasks. Of prime concern is that the mode change is safe and efficient: i.e. the mode change needs to be performed in a predefined time window and no deadlines may be missed as a function of the mode change.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Nowadays, many real-time operating systems discretize the time relying on a system time unit. To take this behavior into account, real-time scheduling algorithms must adopt a discrete-time model in which both timing requirements of tasks and their time allocations have to be integer multiples of the system time unit. That is, tasks cannot be executed for less than one time unit, which implies that they always have to achieve a minimum amount of work before they can be preempted. Assuming such a discrete-time model, the authors of Zhu et al. (Proceedings of the 24th IEEE international real-time systems symposium (RTSS 2003), 2003, J Parallel Distrib Comput 71(10):1411–1425, 2011) proposed an efficient “boundary fair” algorithm (named BF) and proved its optimality for the scheduling of periodic tasks while achieving full system utilization. However, BF cannot handle sporadic tasks due to their inherent irregular and unpredictable job release patterns. In this paper, we propose an optimal boundary-fair scheduling algorithm for sporadic tasks (named BF TeX ), which follows the same principle as BF by making scheduling decisions only at the job arrival times and (expected) task deadlines. This new algorithm was implemented in Linux and we show through experiments conducted upon a multicore machine that BF TeX outperforms the state-of-the-art discrete-time optimal scheduler (PD TeX ), benefiting from much less scheduling overheads. Furthermore, it appears from these experimental results that BF TeX is barely dependent on the length of the system time unit while PD TeX —the only other existing solution for the scheduling of sporadic tasks in discrete-time systems—sees its number of preemptions, migrations and the time spent to take scheduling decisions increasing linearly when improving the time resolution of the system.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Heterogeneous multicore platforms are becoming an interesting alternative for embedded computing systems with limited power supply as they can execute specific tasks in an efficient manner. Nonetheless, one of the main challenges of such platforms consists of optimising the energy consumption in the presence of temporal constraints. This paper addresses the problem of task-to-core allocation onto heterogeneous multicore platforms such that the overall energy consumption of the system is minimised. To this end, we propose a two-phase approach that considers both dynamic and leakage energy consumption: (i) the first phase allocates tasks to the cores such that the dynamic energy consumption is reduced; (ii) the second phase refines the allocation performed in the first phase in order to achieve better sleep states by trading off the dynamic energy consumption with the reduction in leakage energy consumption. This hybrid approach considers core frequency set-points, tasks energy consumption and sleep states of the cores to reduce the energy consumption of the system. Major value has been placed on a realistic power model which increases the practical relevance of the proposed approach. Finally, extensive simulations have been carried out to demonstrate the effectiveness of the proposed algorithm. In the best-case, savings up to 18% of energy are reached over the first fit algorithm, which has shown, in previous works, to perform better than other bin-packing heuristics for the target heterogeneous multicore platform.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The last decade has witnessed a major shift towards the deployment of embedded applications on multi-core platforms. However, real-time applications have not been able to fully benefit from this transition, as the computational gains offered by multi-cores are often offset by performance degradation due to shared resources, such as main memory. To efficiently use multi-core platforms for real-time systems, it is hence essential to tightly bound the interference when accessing shared resources. Although there has been much recent work in this area, a remaining key problem is to address the diversity of memory arbiters in the analysis to make it applicable to a wide range of systems. This work handles diverse arbiters by proposing a general framework to compute the maximum interference caused by the shared memory bus and its impact on the execution time of the tasks running on the cores, considering different bus arbiters. Our novel approach clearly demarcates the arbiter-dependent and independent stages in the analysis of these upper bounds. The arbiter-dependent phase takes the arbiter and the task memory-traffic pattern as inputs and produces a model of the availability of the bus to a given task. Then, based on the availability of the bus, the arbiter-independent phase determines the worst-case request-release scenario that maximizes the interference experienced by the tasks due to the contention for the bus. We show that the framework addresses the diversity problem by applying it to a memory bus shared by a fixed-priority arbiter, a time-division multiplexing (TDM) arbiter, and an unspecified work-conserving arbiter using applications from the MediaBench test suite. We also experimentally evaluate the quality of the analysis by comparison with a state-of-the-art TDM analysis approach and consistently showing a considerable reduction in maximum interference.