988 resultados para Multi-core
Resumo:
Modern multicore processors for the embedded market are often heterogeneous in nature. One feature often available are multiple sleep states with varying transition cost for entering and leaving said sleep states. This research effort explores the energy efficient task-mapping on such a heterogeneous multicore platform to reduce overall energy consumption of the system. This is performed in the context of a partitioned scheduling approach and a very realistic power model, which improves over some of the simplifying assumptions often made in the state-of-the-art. The developed heuristic consists of two phases, in the first phase, tasks are allocated to minimise their active energy consumption, while the second phase trades off a higher active energy consumption for an increased ability to exploit savings through more efficient sleep states. Extensive simulations demonstrate the effectiveness of the approach.
Resumo:
The last decade has witnessed a major shift towards the deployment of embedded applications on multi-core platforms. However, real-time applications have not been able to fully benefit from this transition, as the computational gains offered by multi-cores are often offset by performance degradation due to shared resources, such as main memory. To efficiently use multi-core platforms for real-time systems, it is hence essential to tightly bound the interference when accessing shared resources. Although there has been much recent work in this area, a remaining key problem is to address the diversity of memory arbiters in the analysis to make it applicable to a wide range of systems. This work handles diverse arbiters by proposing a general framework to compute the maximum interference caused by the shared memory bus and its impact on the execution time of the tasks running on the cores, considering different bus arbiters. Our novel approach clearly demarcates the arbiter-dependent and independent stages in the analysis of these upper bounds. The arbiter-dependent phase takes the arbiter and the task memory-traffic pattern as inputs and produces a model of the availability of the bus to a given task. Then, based on the availability of the bus, the arbiter-independent phase determines the worst-case request-release scenario that maximizes the interference experienced by the tasks due to the contention for the bus. We show that the framework addresses the diversity problem by applying it to a memory bus shared by a fixed-priority arbiter, a time-division multiplexing (TDM) arbiter, and an unspecified work-conserving arbiter using applications from the MediaBench test suite. We also experimentally evaluate the quality of the analysis by comparison with a state-of-the-art TDM analysis approach and consistently showing a considerable reduction in maximum interference.
Resumo:
Distributed real-time systems such as automotive applications are becoming larger and more complex, thus, requiring the use of more powerful hardware and software architectures. Furthermore, those distributed applications commonly have stringent real-time constraints. This implies that such applications would gain in flexibility if they were parallelized and distributed over the system. In this paper, we consider the problem of allocating fixed-priority fork-join Parallel/Distributed real-time tasks onto distributed multi-core nodes connected through a Flexible Time Triggered Switched Ethernet network. We analyze the system requirements and present a set of formulations based on a constraint programming approach. Constraint programming allows us to express the relations between variables in the form of constraints. Our approach is guaranteed to find a feasible solution, if one exists, in contrast to other approaches based on heuristics. Furthermore, approaches based on constraint programming have shown to obtain solutions for these type of formulations in reasonable time.
Resumo:
EMC2 finds solutions for dynamic adaptability in open systems. It provides handling of mixed criticality multicore applications in r eal-time conditions, withscalability and utmost flexibility, full-scale deployment and management of integrated tool chains, through the entire lifecycle.
Resumo:
Multi-core processors is a design philosophy that has become mainstream in scientific and engineering applications. Increasing performance and gate capacity of recent FPGA devices has permitted complex logic systems to be implemented on a single programmable device. By using VHDL here we present an implementation of one multi-core processor by using the PLASMA IP core based on the (most) MIPS I ISA and give an overview of the processor architecture and share theexecution results.
Resumo:
Los procesadores multi-core y el multi-threading por hardware permiten aumentar el rendimiento de las aplicaciones. Por un lado, los procesadores multi-core combinan 2 o más procesadores en un mismo chip. Por otro lado, el multi-threading por hardware es una técnica que incrementa la utilización de los recursos del procesador. Este trabajo presenta un análisis de rendimiento de los resultados obtenidos en dos aplicaciones, multiplicación de matrices densas y transformada rápida de Fourier. Ambas aplicaciones se han ejecutado en arquitecturas multi-core que explotan el paralelismo a nivel de thread pero con un modelo de multi-threading diferente. Los resultados obtenidos muestran la importancia de entender y saber analizar el efecto del multi-core y multi-threading en el rendimiento.
Resumo:
La gestión de recursos en los procesadores multi-core ha ganado importancia con la evolución de las aplicaciones y arquitecturas. Pero esta gestión es muy compleja. Por ejemplo, una misma aplicación paralela ejecutada múltiples veces con los mismos datos de entrada, en un único nodo multi-core, puede tener tiempos de ejecución muy variables. Hay múltiples factores hardware y software que afectan al rendimiento. La forma en que los recursos hardware (cómputo y memoria) se asignan a los procesos o threads, posiblemente de varias aplicaciones que compiten entre sí, es fundamental para determinar este rendimiento. La diferencia entre hacer la asignación de recursos sin conocer la verdadera necesidad de la aplicación, frente a asignación con una meta específica es cada vez mayor. La mejor manera de realizar esta asignación és automáticamente, con una mínima intervención del programador. Es importante destacar, que la forma en que la aplicación se ejecuta en una arquitectura no necesariamente es la más adecuada, y esta situación puede mejorarse a través de la gestión adecuada de los recursos disponibles. Una apropiada gestión de recursos puede ofrecer ventajas tanto al desarrollador de las aplicaciones, como al entorno informático donde ésta se ejecuta, permitiendo un mayor número de aplicaciones en ejecución con la misma cantidad de recursos. Así mismo, esta gestión de recursos no requeriría introducir cambios a la aplicación, o a su estrategia operativa. A fin de proponer políticas para la gestión de los recursos, se analizó el comportamiento de aplicaciones intensivas de cómputo e intensivas de memoria. Este análisis se llevó a cabo a través del estudio de los parámetros de ubicación entre los cores, la necesidad de usar la memoria compartida, el tamaño de la carga de entrada, la distribución de los datos dentro del procesador y la granularidad de trabajo. Nuestro objetivo es identificar cómo estos parámetros influyen en la eficiencia de la ejecución, identificar cuellos de botella y proponer posibles mejoras. Otra propuesta es adaptar las estrategias ya utilizadas por el Scheduler con el fin de obtener mejores resultados.
Diseño y evaluación de un algoritmo paralelo para la eliminación gausiana en procesadores multi-core
Resumo:
Las aplicaciones de alineamiento de secuencias son una herramienta importante para la comunidad científica. Estas aplicaciones bioinformáticas son usadas en muchos campos distintos como pueden ser la medicina, la biología, la farmacología, la genética, etc. A día de hoy los algoritmos de alineamiento de secuencias tienen una complejidad elevada y cada día tienen que manejar un volumen de datos más grande. Por esta razón se deben buscar alternativas para que estas aplicaciones sean capaces de manejar el aumento de tamaño que los bancos de secuencias están sufriendo día a día. En este proyecto se estudian y se investigan mejoras en este tipo de aplicaciones como puede ser el uso de sistemas paralelos que pueden mejorar el rendimiento notablemente.
Resumo:
The simulated annealing approach to structure solution from powder diffraction data, as implemented in the DASH program, is easily amenable to parallelization at the individual run level. Modest increases in speed of execution can therefore be achieved by executing individual DASH runs on the individual cores of CPUs.
Resumo:
Since its introduction in 1993, the Message Passing Interface (MPI) has become a de facto standard for writing High Performance Computing (HPC) applications on clusters and Massively Parallel Processors (MPPs). The recent emergence of multi-core processor systems presents a new challenge for established parallel programming paradigms, including those based on MPI. This paper presents a new Java messaging system called MPJ Express. Using this system, we exploit multiple levels of parallelism - messaging and threading - to improve application performance on multi-core processors. We refer to our approach as nested parallelism. This MPI-like Java library can support nested parallelism by using Java or Java OpenMP (JOMP) threads within an MPJ Express process. Practicality of this approach is assessed by porting to Java a massively parallel structure formation code from Cosmology called Gadget-2. We introduce nested parallelism in the Java version of the simulation code and report good speed-ups. To the best of our knowledge it is the first time this kind of hybrid parallelism is demonstrated in a high performance Java application. (C) 2009 Elsevier Inc. All rights reserved.
A benchmark-driven modelling approach for evaluating deployment choices on a multi-core architecture
Resumo:
The complexity of current and emerging architectures provides users with options about how best to use the available resources, but makes predicting performance challenging. In this work a benchmark-driven model is developed for a simple shallow water code on a Cray XE6 system, to explore how deployment choices such as domain decomposition and core affinity affect performance. The resource sharing present in modern multi-core architectures adds various levels of heterogeneity to the system. Shared resources often includes cache, memory, network controllers and in some cases floating point units (as in the AMD Bulldozer), which mean that the access time depends on the mapping of application tasks, and the core's location within the system. Heterogeneity further increases with the use of hardware-accelerators such as GPUs and the Intel Xeon Phi, where many specialist cores are attached to general-purpose cores. This trend for shared resources and non-uniform cores is expected to continue into the exascale era. The complexity of these systems means that various runtime scenarios are possible, and it has been found that under-populating nodes, altering the domain decomposition and non-standard task to core mappings can dramatically alter performance. To find this out, however, is often a process of trial and error. To better inform this process, a performance model was developed for a simple regular grid-based kernel code, shallow. The code comprises two distinct types of work, loop-based array updates and nearest-neighbour halo-exchanges. Separate performance models were developed for each part, both based on a similar methodology. Application specific benchmarks were run to measure performance for different problem sizes under different execution scenarios. These results were then fed into a performance model that derives resource usage for a given deployment scenario, with interpolation between results as necessary.
Resumo:
Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase the data locality. It is therefore programmers’ responsibility to explicitly manage the memory transfers, and this make programming these platform cumbersome. Moreover, complex modern applications must be adequately parallelized before they can the parallel potential of the platform into actual performance. To support this, programming languages were proposed, which work at a high level of abstraction, and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm on modern many-core systems, focusing on the ease-of-programming. It focuses on OpenMP, the de-facto standard for shared memory programming. In a first part, the cost of algorithms for synchronization and data partitioning are analyzed, and they are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude of speedup and energy efficiency compared to the “pure software” version. However, three main issues rise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which share the memory banks with cores, and the template for a scalable architecture is shown, which integrates them through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploiting the accelerators of the platform. The OpenMP frontend is extended to interact with it.
Resumo:
This thesis deals with heterogeneous architectures in standard workstations. Heterogeneous architectures represent an appealing alternative to traditional supercomputers because they are based on commodity components fabricated in large quantities. Hence their price-performance ratio is unparalleled in the world of high performance computing (HPC). In particular, different aspects related to the performance and consumption of heterogeneous architectures have been explored. The thesis initially focuses on an efficient implementation of a parallel application, where the execution time is dominated by an high number of floating point instructions. Then the thesis touches the central problem of efficient management of power peaks in heterogeneous computing systems. Finally it discusses a memory-bounded problem, where the execution time is dominated by the memory latency. Specifically, the following main contributions have been carried out: A novel framework for the design and analysis of solar field for Central Receiver Systems (CRS) has been developed. The implementation based on desktop workstation equipped with multiple Graphics Processing Units (GPUs) is motivated by the need to have an accurate and fast simulation environment for studying mirror imperfection and non-planar geometries. Secondly, a power-aware scheduling algorithm on heterogeneous CPU-GPU architectures, based on an efficient distribution of the computing workload to the resources, has been realized. The scheduler manages the resources of several computing nodes with a view to reducing the peak power. The two main contributions of this work follow: the approach reduces the supply cost due to high peak power whilst having negligible impact on the parallelism of computational nodes. from another point of view the developed model allows designer to increase the number of cores without increasing the capacity of the power supply unit. Finally, an implementation for efficient graph exploration on reconfigurable architectures is presented. The purpose is to accelerate graph exploration, reducing the number of random memory accesses.