Biblioteca Digital

905 resultados para shared memory

Distributed and shared memory parallelism with a smoothed particle hydrodynamics code

Relevância:

100.00% 100.00%

Publicador:

Veja mais

Response time analysis of COTS-Based multicores considering the contention on the shared memory bus

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The current industry trend is towards using Commercially available Off-The-Shelf (COTS) based multicores for developing real time embedded systems, as opposed to the usage of custom-made hardware. In typical implementation of such COTS-based multicores, multiple cores access the main memory via a shared bus. This often leads to contention on this shared channel, which results in an increase of the response time of the tasks. Analyzing this increased response time, considering the contention on the shared bus, is challenging on COTS-based systems mainly because bus arbitration protocols are often undocumented and the exact instants at which the shared bus is accessed by tasks are not explicitly controlled by the operating system scheduler; they are instead a result of cache misses. This paper makes three contributions towards analyzing tasks scheduled on COTS-based multicores. Firstly, we describe a method to model the memory access patterns of a task. Secondly, we apply this model to analyze the worst case response time for a set of tasks. Although the required parameters to obtain the request profile can be obtained by static analysis, we provide an alternative method to experimentally obtain them by using performance monitoring counters (PMCs). We also compare our work against an existing approach and show that our approach outperforms it by providing tighter upper-bound on the number of bus requests generated by a task.

Veja mais

Rely-guarantee protocols for safe interference over shared memory

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Mutable state can be useful in certain algorithms, to structure programs, or for efficiency purposes. However, when shared mutable state is used in non-local or nonobvious ways, the interactions that can occur via aliases to that shared memory can be a source of program errors. Undisciplined uses of shared state may unsafely interfere with local reasoning as other aliases may interleave their changes to the shared state in unexpected ways. We propose a novel technique, rely-guarantee protocols, that structures the interactions between aliases and ensures that only safe interference is possible. We present a linear type system outfitted with our novel sharing mechanism that enables controlled interference over shared mutable resources. Each alias is assigned separate, local roles encoded in a protocol abstraction that constrains how an alias can legally use that shared state. By following the spirit of rely-guarantee reasoning, our rely-guarantee protocols ensure that only safe interference can occur but still allow many interesting uses of shared state, such as going beyond invariant and monotonic usages. This thesis describes the three core mechanisms that enable our type-based technique to work: 1) we show how a protocol models an alias’s perspective on how the shared state evolves and constrains that alias’s interactions with the shared state; 2) we show how protocols can be used while enforcing the agreed interference contract; and finally, 3) we show how to check that all local protocols to some shared state can be safely composed to ensure globally safe interference over that shared memory. The interference caused by shared state is rooted at how the uses of di↵erent aliases to that state may be interleaved (perhaps even in non-deterministic ways) at run-time. Therefore, our technique is mostly agnostic as to whether this interference was the result of alias interleaving caused by sequential or concurrent semantics. We show implementations of our technique in both settings, and highlight their di↵erences. Because sharing is “first-class” (and not tied to a module), we show a polymorphic procedure that enables abstract compositions of protocols. Thus, protocols can be specialized or extended without requiring specific knowledge of the interference produce by other protocols to that state. We show that protocol composition can ensure safety even when considering abstracted protocols. We show that this core composition mechanism is sound, decidable (without the need for manual intervention), and provide an algorithm implementation.

Veja mais

Use of shared memory in the context of embedded multi-core processors: exploration of the technology and its limits

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase the data locality. It is therefore programmers’ responsibility to explicitly manage the memory transfers, and this make programming these platform cumbersome. Moreover, complex modern applications must be adequately parallelized before they can the parallel potential of the platform into actual performance. To support this, programming languages were proposed, which work at a high level of abstraction, and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm on modern many-core systems, focusing on the ease-of-programming. It focuses on OpenMP, the de-facto standard for shared memory programming. In a first part, the cost of algorithms for synchronization and data partitioning are analyzed, and they are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude of speedup and energy efficiency compared to the “pure software” version. However, three main issues rise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which share the memory banks with cores, and the template for a scalable architecture is shown, which integrates them through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploiting the accelerators of the platform. The OpenMP frontend is extended to interact with it.

Veja mais

OpenMP task scheduling strategies to mitigate hardware variability in tightly-coupled shared memory clusters

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In questa tesi sono stati apportati due importanti contributi nel campo degli acceleratori embedded many-core. Abbiamo implementato un runtime OpenMP ottimizzato per la gestione del tasking model per sistemi a processori strettamente accoppiati in cluster e poi interconnessi attraverso una network on chip. Ci siamo focalizzati sulla loro scalabilità e sul supporto di task di granularità fine, come è tipico nelle applicazioni embedded. Il secondo contributo di questa tesi è stata proporre una estensione del runtime di OpenMP che cerca di prevedere la manifestazione di errori dati da fenomeni di variability tramite una schedulazione efficiente del carico di lavoro.

Veja mais

Memory performance of and-parallel prolog on shared-memory architectures

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level sharedmemory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 ML IPS on real applications exhibiting medium parallelism can be attained with current technology.

Veja mais

Jasmine: A shared-object multi-locking distributed shared memory system for heterogeneous computers

Relevância:

100.00% 100.00%

Publicador:

Veja mais

Evaluation and measurement of multiprocessor systems with shared memory

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This study is concerned with several proposals concerning multiprocessor systems and with the various possible methods of evaluating such proposals. After a discussion of the advantages and disadvantages of several performance evaluation tools, the author decides that simulation is the only tool powerful enough to develop a model which would be of practical use, in the design, comparison and extension of systems. The main aims of the simulation package developed as part of this study are cost effectiveness, ease of use and generality. The methodology on which the simulation package is based is described in detail. The fundamental principles are that model design should reflect actual systems design, that measuring procedures should be carried out alongside design that models should be well documented and easily adaptable and that models should be dynamic. The simulation package itself is modular, and in this way reflects current design trends. This approach also aids documentation and ensures that the model is easily adaptable. It contains a skeleton structure and a library of segments which can be added to or directly swapped with segments of the skeleton structure, to form a model which fits a user's requirements. The study also contains the results of some experimental work carried out using the model, the first part of which tests• the model's capabilities by simulating a large operating system, the ICL George 3 system; the second part deals with general questions and some of the many proposals concerning multiprocessor systems.

Veja mais

A framework for memory contention analysis in multi-core platforms

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The last decade has witnessed a major shift towards the deployment of embedded applications on multi-core platforms. However, real-time applications have not been able to fully benefit from this transition, as the computational gains offered by multi-cores are often offset by performance degradation due to shared resources, such as main memory. To efficiently use multi-core platforms for real-time systems, it is hence essential to tightly bound the interference when accessing shared resources. Although there has been much recent work in this area, a remaining key problem is to address the diversity of memory arbiters in the analysis to make it applicable to a wide range of systems. This work handles diverse arbiters by proposing a general framework to compute the maximum interference caused by the shared memory bus and its impact on the execution time of the tasks running on the cores, considering different bus arbiters. Our novel approach clearly demarcates the arbiter-dependent and independent stages in the analysis of these upper bounds. The arbiter-dependent phase takes the arbiter and the task memory-traffic pattern as inputs and produces a model of the availability of the bus to a given task. Then, based on the availability of the bus, the arbiter-independent phase determines the worst-case request-release scenario that maximizes the interference experienced by the tasks due to the contention for the bus. We show that the framework addresses the diversity problem by applying it to a memory bus shared by a fixed-priority arbiter, a time-division multiplexing (TDM) arbiter, and an unspecified work-conserving arbiter using applications from the MediaBench test suite. We also experimentally evaluate the quality of the analysis by comparison with a state-of-the-art TDM analysis approach and consistently showing a considerable reduction in maximum interference.

Veja mais

Analytic model of a cache-only memory architecture

Relevância:

70.00% 70.00%

Publicador:

Resumo:

An approximate analytic model of a shared memory multiprocessor with a Cache Only Memory Architecture (COMA), the busbased Data Difussion Machine (DDM), is presented and validated. It describes the timing and interference in the system as a function of the hardware, the protocols, the topology and the workload. Model results have been compared to results from an independent simulator. The comparison shows good model accuracy specially for non-saturated systems, where the errors in response times and device utilizations are independent of the number of processors and remain below 10% in 90% of the simulations. Therefore, the model can be used as an average performance prediction tool that avoids expensive simulations in the design of systems with many processors.

Veja mais

Energy Efficiency of Software Transactional Memory in a Heterogeneous Architecture

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Hardware vendors make an important effort creating low-power CPUs that keep battery duration and durability above acceptable levels. In order to achieve this goal and provide good performance-energy for a wide variety of applications, ARM designed the big.LITTLE architecture. This heterogeneous multi-core architecture features two different types of cores: big cores oriented to performance and little cores, slower and aimed to save energy consumption. As all the cores have access to the same memory, multi-threaded applications must resort to some mutual exclusion mechanism to coordinate the access to shared data by the concurrent threads. Transactional Memory (TM) represents an optimistic approach for shared-memory synchronization. To take full advantage of the features offered by software TM, but also benefit from the characteristics of the heterogeneous big.LITTLE architectures, our focus is to propose TM solutions that take into account the power/performance requirements of the application and what it is offered by the architecture. In order to understand the current state-of-the-art and obtain useful information for future power-aware software TM solutions, we have performed an analysis of a popular TM library running on top of an ARM big.LITTLE processor. Experiments show, in general, better scalability for the LITTLE cores for most of the applications except for one, which requires the computing performance that the big cores offer.

Veja mais

Theory manual to OctVCE – a cartesian cell CFD code with special application to blast wave problems, Department of Mechanical Engineering Report 2007/12

Relevância:

60.00% 60.00%

Publicador:

Resumo:

OctVCE is a cartesian cell CFD code produced especially for numerical simulations of shock and blast wave interactions with complex geometries, in particular, from explosions. Virtual Cell Embedding (VCE) was chosen as its cartesian cell kernel for its simplicity and sufficiency for practical engineering design problems. The code uses a finite-volume formulation of the unsteady Euler equations with a second order explicit Runge-Kutta Godonov (MUSCL) scheme. Gradients are calculated using a least-squares method with a minmod limiter. Flux solvers used are AUSM, AUSMDV and EFM. No fluid-structure coupling or chemical reactions are allowed, but gas models can be perfect gas and JWL or JWLB for the explosive products. This report also describes the code’s ‘octree’ mesh adaptive capability and point-inclusion query procedures for the VCE geometry engine. Finally, some space will also be devoted to describing code parallelization using the shared-memory OpenMP paradigm. The user manual to the code is to be found in the companion report 2007/13.

Veja mais

User guide for shock and blast simulation with the OctVCE code (version 3.5+), Department of Mechanical Engineering Report 2007/13

Relevância:

60.00% 60.00%

Publicador:

Resumo:

OctVCE is a cartesian cell CFD code produced especially for numerical simulations of shock and blast wave interactions with complex geometries. Virtual Cell Embedding (VCE) was chosen as its cartesian cell kernel as it is simple to code and sufficient for practical engineering design problems. This also makes the code much more ‘user-friendly’ than structured grid approaches as the gridding process is done automatically. The CFD methodology relies on a finite-volume formulation of the unsteady Euler equations and is solved using a standard explicit Godonov (MUSCL) scheme. Both octree-based adaptive mesh refinement and shared-memory parallel processing capability have also been incorporated. For further details on the theory behind the code, see the companion report 2007/12.

Veja mais

Practical parallel coset enumeration

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Coset enumeration is a most important procedure for investigating finitely presented groups. We present a practical parallel procedure for coset enumeration on shared memory processors. The shared memory architecture is particularly interesting because such parallel computation is both faster and cheaper. The lower cost comes when the program requires large amounts of memory, and additional CPU's. allow us to lower the time that the expensive memory is being used. Rather than report on a suite of test cases, we take a single, typical case, and analyze the performance factors in-depth. The parallelization is achieved through a master-slave architecture. This results in an interesting phenomenon, whereby the CPU time is divided into a sequential and a parallel portion, and the parallel part demonstrates a speedup that is linear in the number of processors. We describe an early version for which only 40% of the program was parallelized, and we describe how this was modified to achieve 90% parallelization while using 15 slave processors and a master. In the latter case, a sequential time of 158 seconds was reduced to 29 seconds using 15 slaves.

Veja mais

Sistema de aquisição de dados em tempo real

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Neste trabalho propus-me realizar um Sistema de Aquisição de Dados em Tempo Real via Porta Paralela. Para atingir com sucesso este objectivo, foi realizado um levantamento bibliográfico sobre sistemas operativos de tempo real, salientando e exemplificando quais foram marcos mais importantes ao longo da sua evolução. Este levantamento permitiu perceber o porquê da proliferação destes sistemas face aos custos que envolvem, em função da sua aplicação, bem como as dificuldades, científicas e tecnológicas, que os investigadores foram tendo, e que foram ultrapassando com sucesso. Para que Linux se comporte como um sistema de tempo real, é necessário configura-lo e adicionar um patch, como por exemplo o RTAI ou ADEOS. Como existem vários tipos de soluções que permitem aplicar as características inerentes aos sistemas de tempo real ao Linux, foi realizado um estudo, acompanhado de exemplos, sobre o tipo de arquitecturas de kernel mais utilizadas para o fazer. Nos sistemas operativos de tempo real existem determinados serviços, funcionalidades e restrições que os distinguem dos sistemas operativos de uso comum. Tendo em conta o objectivo do trabalho, e apoiado em exemplos, fizemos um pequeno estudo onde descrevemos, entre outros, o funcionamento escalonador, e os conceitos de latência e tempo de resposta. Mostramos que há apenas dois tipos de sistemas de tempo real o ‘hard’ que tem restrições temporais rígidas e o ‘soft’ que engloba as restrições temporais firmes e suaves. As tarefas foram classificadas em função dos tipos de eventos que as despoletam, e evidenciando as suas principais características. O sistema de tempo real eleito para criar o sistema de aquisição de dados via porta paralela foi o RTAI/Linux. Para melhor percebermos o seu comportamento, estudamos os serviços e funções do RTAI. Foi dada especial atenção, aos serviços de comunicação entre tarefas e processos (memória partilhada e FIFOs), aos serviços de escalonamento (tipos de escalonadores e tarefas) e atendimento de interrupções (serviço de rotina de interrupção - ISR). O estudo destes serviços levou às opções tomadas quanto ao método de comunicação entre tarefas e serviços, bem como ao tipo de tarefa a utilizar (esporádica ou periódica). Como neste trabalho, o meio físico de comunicação entre o meio ambiente externo e o hardware utilizado é a porta paralela, também tivemos necessidade de perceber como funciona este interface. Nomeadamente os registos de configuração da porta paralela. Assim, foi possível configura-lo ao nível de hardware (BIOS) e software (módulo do kernel) atendendo aos objectivos do presente trabalho, e optimizando a utilização da porta paralela, nomeadamente, aumentando o número de bits disponíveis para a leitura de dados. No desenvolvimento da tarefa de hard real-time, foram tidas em atenção as várias considerações atrás referenciadas. Foi desenvolvida uma tarefa do tipo esporádica, pois era pretendido, ler dados pela porta paralela apenas quando houvesse necessidade (interrupção), ou seja, quando houvesse dados disponíveis para ler. Desenvolvemos também uma aplicação para permitir visualizar os dados recolhidos via porta paralela. A comunicação entre a tarefa e a aplicação é assegurada através de memória partilhada, pois garantindo a consistência de dados, a comunicação entre processos do Linux e as tarefas de tempo real (RTAI) que correm ao nível do kernel torna-se muito simples. Para puder avaliar o desempenho do sistema desenvolvido, foi criada uma tarefa de soft real-time cujos tempos de resposta foram comparados com os da tarefa de hard real-time. As respostas temporais obtidas através do analisador lógico em conjunto com gráficos elaborados a partir destes dados, mostram e comprovam, os benefícios do sistema de aquisição de dados em tempo real via porta paralela, usando uma tarefa de hard real-time.

Veja mais

905 resultados para shared memory

Filtro por publicador