992 results for memory access complexity
Abstract:
As the performance gap between microprocessors and memory continues to increase, main memory accesses result in long latencies which become a factor limiting system performance. Previous studies show that main memory access streams contain significant localities and SDRAM devices provide parallelism through multiple banks and channels. This locality and parallelism have not been exploited thoroughly by conventional memory controllers. In this thesis, SDRAM address mapping techniques and memory access reordering mechanisms are studied and applied to memory controller design with the goal of reducing observed main memory access latency. The proposed bit-reversal address mapping attempts to distribute main memory accesses evenly in the SDRAM address space to enable bank parallelism. As memory accesses to unique banks are interleaved, the access latencies are partially hidden and therefore reduced. By taking cache conflict misses into account, bit-reversal address mapping is able to direct potential row conflicts to different banks, further improving performance. The proposed burst scheduling is a novel access reordering mechanism, which creates bursts by clustering accesses directed to the same rows of the same banks. Subject to a threshold, reads are allowed to preempt writes and qualified writes are piggybacked at the end of the bursts. A sophisticated access scheduler selects accesses based on priorities and interleaves accesses to maximize the SDRAM data bus utilization. Consequently, burst scheduling reduces the row conflict rate, increasing and exploiting the available row locality. Using revised versions of the SimpleScalar and M5 simulators, both techniques are evaluated and compared with existing academic and industrial solutions. With SPEC CPU2000 benchmarks, bit-reversal reduces the execution time by 14% on average over traditional page interleaving address mapping.
Burst scheduling also achieves a 15% reduction in execution time over conventional bank-in-order scheduling. Working constructively together, bit-reversal and burst scheduling achieve a 19% speedup across simulated benchmarks.
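The effect of reversing address bits on bank selection can be illustrated with a small sketch. The field widths (10 column bits, 2 bank bits, 12 row bits), the function names, and the choice to reverse the whole field above the column offset are assumptions made for illustration only; the mapping actually used in the thesis may differ.

```python
# Illustrative only: field widths and the reversed bit range are assumptions.
COL_BITS, BANK_BITS, ROW_BITS = 10, 2, 12
UPPER_BITS = BANK_BITS + ROW_BITS


def reverse_bits(value: int, width: int) -> int:
    """Return `value` with its lowest `width` bits in reversed order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def split(upper: int, column: int):
    """Split the field above the column offset into (bank, row, column)."""
    bank = upper & ((1 << BANK_BITS) - 1)
    row = upper >> BANK_BITS
    return bank, row, column


def page_interleaved(addr: int):
    """Baseline: the bits just above the column offset select the bank."""
    column = addr & ((1 << COL_BITS) - 1)
    upper = (addr >> COL_BITS) & ((1 << UPPER_BITS) - 1)
    return split(upper, column)


def bit_reversed(addr: int):
    """Variant: reverse the upper field first, so high bits select the bank."""
    column = addr & ((1 << COL_BITS) - 1)
    upper = (addr >> COL_BITS) & ((1 << UPPER_BITS) - 1)
    return split(reverse_bits(upper, UPPER_BITS), column)


if __name__ == "__main__":
    # Two addresses that differ only in a high-order bit, as is typical
    # for blocks that conflict in a cache set.
    a, b = 0, 1 << 23
    print(page_interleaved(a), page_interleaved(b))
    print(bit_reversed(a), bit_reversed(b))
```

In this sketch the two addresses fall into the same bank but different rows under the baseline mapping (a row conflict), while the reversed mapping places them in different banks.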
Abstract:
Cache-coherent non-uniform memory access (ccNUMA) architecture is a standard design pattern for contemporary multicore processors, and future generations of architectures are likely to be NUMA. NUMA architectures create new challenges for managed runtime systems. Memory-intensive applications use the system’s distributed memory banks to allocate data, and the automatic memory manager collects garbage left in these memory banks. The garbage collector may need to access remote memory banks, which entails access latency overhead and potential bandwidth saturation for the interconnection between memory banks. This dissertation makes five significant contributions to garbage collection on NUMA systems, with a case study implementation using the Hotspot Java Virtual Machine. It empirically studies data locality for a Stop-The-World garbage collector when tracing connected objects in NUMA heaps. First, it identifies a rich locality that exists naturally in connected objects consisting of a root object and its reachable set, called ‘rooted sub-graphs’. Second, this dissertation leverages the locality characteristic of rooted sub-graphs to develop a new NUMA-aware garbage collection mechanism. A garbage collector thread processes a local root and its reachable set, which is likely to have a large number of objects in the same NUMA node. Third, a garbage collector thread steals references from sibling threads that run on the same NUMA node to improve data locality. This research evaluates the new NUMA-aware garbage collector using seven benchmarks from the established real-world DaCapo benchmark suite. In addition, the evaluation involves the widely used SPECjbb benchmark and a Neo4J graph database Java benchmark, as well as an artificial benchmark. The results of the NUMA-aware garbage collector on a multi-hop NUMA architecture show an average performance improvement of 15%.
Furthermore, this performance gain is shown to result from improved NUMA memory access in a ccNUMA system. Fourth, the existing Hotspot JVM adaptive policy for configuring the number of garbage collection threads is shown to be suboptimal for current NUMA machines. The policy uses outdated assumptions and generates a constant thread count. In fact, the Hotspot JVM still uses this policy in the production version. This research shows that the optimal number of garbage collection threads is application-specific and that configuring the optimal number of garbage collection threads yields better collection throughput than the default policy. Fifth, this dissertation designs and implements a runtime technique that uses heuristics from dynamic collection behavior to calculate an optimal number of garbage collector threads for each collection cycle. The results show an average improvement of 21% in garbage collection performance for DaCapo benchmarks.
Abstract:
The development of 3G (third-generation telecommunication) value-added services brings higher Quality of Service (QoS) requirements. Wideband Code Division Multiple Access (WCDMA) is one of three 3G standards, and enhancing QoS for the WCDMA Core Network (CN) is becoming more and more important for users and carriers. The dissertation focuses on enhancing QoS for the WCDMA CN. The purpose is to realize the DiffServ (Differentiated Services) model of QoS for the WCDMA CN. Based on the parallelism characteristic of Network Processors (NPs), the NP programming model is classified as Pool of Threads (POTs) and Hyper Task Chaining (HTC). In this study, an integrated programming model that combines the two models was designed. This model is highly efficient and flexible, and also solves the problems of sharing conflicts and packet ordering. We used this model as the programming model to realize DiffServ QoS for the WCDMA CN. The realization mechanism of the DiffServ model mainly consists of buffer management, packet scheduling and packet classification algorithms based on NPs. First, we proposed an adaptive buffer management algorithm called Packet Adaptive Fair Dropping (PAFD), which takes both fairness and throughput into consideration and has smooth service curves. Then, an improved packet scheduling algorithm called Priority-based Weighted Fair Queuing (PWFQ) was introduced to ensure the fairness of packet scheduling and reduce the queue time of data packets. At the same time, delay and jitter are kept within a small range. Thirdly, a multi-dimensional packet classification algorithm called Classification Based on Network Processors (CBNPs) was designed. It effectively reduces memory accesses and storage space, and has lower time and space complexity. Lastly, an integrated hardware and software system of the DiffServ model of QoS for the WCDMA CN was proposed. It was implemented on the NP IXP2400.
According to the corresponding experimental results, the proposed system significantly enhanced QoS for the WCDMA CN. It substantially improves response-time consistency, display distortion and sound-image synchronization, and thus increases network efficiency and saves network resources.
Abstract:
The current industry trend is towards using Commercially available Off-The-Shelf (COTS) based multicores for developing real-time embedded systems, as opposed to custom-made hardware. In a typical implementation of such COTS-based multicores, multiple cores access the main memory via a shared bus. This often leads to contention on this shared channel, which increases the response time of the tasks. Analyzing this increased response time, considering the contention on the shared bus, is challenging on COTS-based systems, mainly because bus arbitration protocols are often undocumented and the exact instants at which the shared bus is accessed by tasks are not explicitly controlled by the operating system scheduler; they are instead a result of cache misses. This paper makes three contributions towards analyzing tasks scheduled on COTS-based multicores. Firstly, we describe a method to model the memory access patterns of a task. Secondly, we apply this model to analyze the worst-case response time for a set of tasks. Although the parameters required to obtain the request profile can be obtained by static analysis, we provide an alternative method to obtain them experimentally by using performance monitoring counters (PMCs). We also compare our work against an existing approach and show that our approach outperforms it by providing a tighter upper bound on the number of bus requests generated by a task.
Abstract:
Master’s Thesis in Computer Engineering
Abstract:
The miniaturization race in the hardware industry, which aims at continuously increasing transistor density on a die, no longer brings corresponding improvements in application performance. One of the most promising alternatives is to exploit the heterogeneous nature of common applications in hardware. Supported by reconfigurable computation, which has already proved its efficiency in accelerating data-intensive applications, this concept promises a breakthrough in contemporary technology development. Memory organization in such heterogeneous reconfigurable architectures becomes critical. Two primary aspects introduce a sophisticated trade-off. On the one hand, a memory subsystem should provide a well-organized distributed data structure and guarantee the required data bandwidth. On the other hand, it should hide the heterogeneous hardware structure from the end user, in order to support feasible high-level programmability of the system. This thesis explores heterogeneous reconfigurable hardware architectures and presents possible solutions to cope with the problem of memory organization and data structure. Using the MORPHEUS heterogeneous platform as an example, the discussion follows the complete design cycle, from decision making and justification to hardware realization. Particular emphasis is placed on methods to support high system performance, meet application requirements, and provide a user-friendly programmer interface. As a result, the research introduces a complete heterogeneous platform enhanced with a hierarchical memory organization, which copes with its task by separating computation from communication, providing reconfigurable engines with computation and configuration data, and unifying heterogeneous computational devices using local storage buffers.
It is distinguished from the related solutions by distributed data-flow organization, specifically engineered mechanisms to operate with data on local domains, particular communication infrastructure based on Network-on-Chip, and thorough methods to prevent computation and communication stalls. In addition, a novel advanced technique to accelerate memory access was developed and implemented.
Abstract:
Evidence for expectancy-based priming in the pronunciation task was provided in three experiments. In Experiments 1 and 2, a high proportion of associatively related trials produced greater associative priming and superior retrieval of primes in a subsequent test of memory for primes, whereas high- and low-proportion groups showed comparable repetition benefits in perceptual identification of previously presented primes. In Experiment 2, the low-proportion condition had few associatively related pairs but many identity pairs. In Experiment 3, identity priming was greater in a high- than a low-identity proportion group, with similar repetition benefits and prime retrieval responses for the two groups. These results indicate that when the prime-target relationship is salient, subjects strategically vary their processing of the prime according to the nature of the prime-target relationship.
Abstract:
This work introduces the concepts associated with fuzzy logic in systems control, here in the area of autonomous robotics, and frames the use of fuzzy controllers in that area. An AGV (Autonomous Guided Vehicle) was developed from scratch in order to implement the fuzzy controller and test its performance. Because future improvements and/or extensions are intended, a modular system was chosen in which each module is responsible for a specific task. In this work there are three modules, responsible for speed control, for acquiring sensor data and, lastly, for the system's fuzzy controller. After the fuzzy controller was implemented, tests were carried out to validate the system, in which the sensor data were collected and logged during normal operation of the robot. These data allowed a better analysis of the robot's performance. Fuzzy logic is found to give smoother transitions between decisions, and increasing the number of rules can make the system smoother still. Fuzzy logic is thus found to be a useful and functional tool for controlling applications. The drawback is the amount of data associated with the implementation, such as the universes of discourse, the membership functions and the rules. Increasing the number of control rules also increases the number of membership functions considered for each linguistic variable; this raises the memory required and the implementation complexity because of the amount of data that must be handled.
The greatest difficulty in designing a fuzzy controller lies in defining the linguistic variables through their universes of discourse and membership functions, since the chosen definition may not be the most suitable for the control context; tests and, consequently, modifications to the definition of the membership functions become necessary to improve system performance. All the aspects mentioned are addressed in the development of the AGV, and the respective results are presented and analyzed.
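How membership functions and rules combine into a smooth control output can be sketched as follows. The linguistic variable (obstacle distance in centimetres), the three triangular membership functions, the rule outputs and all names are hypothetical and are not taken from the AGV described in the abstract; defuzzification uses a simple weighted average of singleton outputs.

```python
def triangle(x: float, left: float, peak: float, right: float) -> float:
    """Triangular membership: 0 at `left` and `right`, 1 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical universe of discourse: obstacle distance from 0 to 100 cm.
DISTANCE_SETS = {
    "near": (-50.0, 0.0, 50.0),
    "medium": (0.0, 50.0, 100.0),
    "far": (50.0, 100.0, 150.0),
}

# Hypothetical rule base: IF distance IS <set> THEN speed IS <value>.
RULES = {"near": 0.1, "medium": 0.5, "far": 1.0}


def fuzzy_speed(distance_cm: float) -> float:
    """Return a normalized speed command in [0.1, 1.0]."""
    x = min(max(distance_cm, 0.0), 100.0)  # clamp to the universe of discourse
    weighted_sum = 0.0
    total_weight = 0.0
    for name, params in DISTANCE_SETS.items():
        degree = triangle(x, *params)       # fuzzification
        weighted_sum += degree * RULES[name]  # rule firing strength * output
        total_weight += degree
    return weighted_sum / total_weight      # weighted-average defuzzification


if __name__ == "__main__":
    for d in (0, 25, 50, 75, 100):
        print(d, round(fuzzy_speed(d), 3))
```

Each additional rule or membership function adds an entry to the two tables, which is the memory and complexity cost the abstract refers to.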
Abstract:
Classical lock-based concurrency control does not scale with current and foreseen multi-core architectures, opening space for alternative concurrency control mechanisms. The concept of transactions executing concurrently in isolation with an underlying mechanism maintaining a consistent system state was already explored in fault-tolerant and distributed systems, and is currently being explored by transactional memory, this time being used to manage concurrent memory access. In this paper we discuss the use of Software Transactional Memory (STM), and how Ada can provide support for it. Furthermore, we draft a general programming interface to transactional memory, supporting future implementations of STM oriented to real-time systems.
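The transactional idea described above (run optimistically, validate, then commit or retry) can be sketched in a few lines. This is a language-neutral illustration written in Python, not the Ada interface the paper drafts; the names `TVar`, `Transaction` and `atomically` are hypothetical, and the design (per-variable version numbers validated under one global commit lock) is only one of many possible STM schemes.

```python
import threading

_commit_lock = threading.Lock()  # serializes snapshots and commits


class Conflict(Exception):
    """Raised when a variable read by the transaction changed before commit."""


class TVar:
    """A transactional variable with a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version first observed
        self.writes = {}  # TVar -> value to install at commit

    def read(self, tvar):
        if tvar in self.writes:          # read-your-own-writes
            return self.writes[tvar]
        with _commit_lock:               # consistent (version, value) pair
            self.reads.setdefault(tvar, tvar.version)
            return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value        # buffered until commit

    def commit(self):
        with _commit_lock:
            for tvar, seen in self.reads.items():
                if tvar.version != seen:  # someone committed in between
                    raise Conflict()
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1


def atomically(body):
    """Run `body(tx)` until it commits without conflict."""
    while True:
        tx = Transaction()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue  # retry with a fresh transaction


if __name__ == "__main__":
    counter = TVar(0)

    def increment(tx):
        tx.write(counter, tx.read(counter) + 1)

    workers = [
        threading.Thread(target=lambda: [atomically(increment) for _ in range(100)])
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(counter.value)
```

The unbounded retry loop is the part a real-time oriented design would need to replace, since retries make execution time hard to bound.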
Abstract:
This Master's thesis examines threaded programming at the upper hierarchy level of parallel programming, with particular attention to hyper-threading technology. It examines the advantages and disadvantages of hyper-threading and its effects on parallel algorithms. The aim was to understand the hyper-threading implementation of the Intel Pentium 4 processor and to enable its use where it brings a performance benefit. Performance data were collected and analyzed by running a large set of benchmarks under different conditions (memory handling, compiler settings, environment variables...). Two types of algorithms were examined: matrix operations and sorting. These applications have a regular memory access pattern, which is a double-edged sword. It is an advantage in arithmetic-logic processing but, on the other hand, degrades memory performance. The reason is that modern processors have very good raw performance when processing regular data, while the memory architecture is limited by cache sizes and the many buffers. When the problem size exceeds a certain limit, actual performance can drop to a fraction of peak performance.
A dual QPSK soft-demapper for ECMA-368 exploiting time-domain spreading and guard interval diversity
Abstract:
Given the relatively fast processing speed and low power requirements of Wireless Personal Area Network (WPAN) and Wireless Universal Serial Bus (USB) consumer products, the efficiency and cost effectiveness of these products become paramount. This paper presents an improved soft-output QPSK demapper suitable for these products that not only exploits time diversity and guard carrier diversity, but also merges the demapping and symbol combining functions to minimize CPU cycles or memory accesses, depending upon the chosen implementation architecture. The proposed demapper is presented in the context of the Multiband OFDM version of UWB (ECMA-368) as the chosen physical implementation for high-rate Wireless USB.
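Merging symbol combining with demapping can be illustrated generically: when the same QPSK symbol is received several times, the copies can be accumulated with their channel weights and turned into soft bits in a single pass. The sketch below is a textbook maximum-ratio-combining demapper and not the paper's algorithm; the function name, the unit-energy constellation (±1 ± j)/√2, the convention that a positive value favours the +1/√2 component, and the assumption of equal noise variance on every copy are all illustrative.

```python
import math


def soft_demap_qpsk(received, channel, noise_var=1.0):
    """Combine repeated copies of one QPSK symbol and return two soft bits.

    received  -- complex samples carrying the same transmitted symbol
    channel   -- complex channel estimates, one per sample
    noise_var -- complex noise variance, assumed equal for all copies
    """
    acc = 0j
    for y, h in zip(received, channel):
        acc += h.conjugate() * y  # matched filter and combining in one step
    scale = 2.0 * math.sqrt(2.0) / noise_var
    return scale * acc.real, scale * acc.imag


if __name__ == "__main__":
    symbol = (1 + 1j) / math.sqrt(2)
    copies = [symbol, symbol]     # e.g. two spread copies, noise-free here
    print(soft_demap_qpsk(copies, [1 + 0j, 1 + 0j]))
```

Only one accumulator is kept per symbol, so no intermediate combined-symbol buffer has to be written and read back before demapping.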
Abstract:
Given the relatively fast processing speeds and low power requirements of Wireless Personal Area Network (WPAN) consumer products, including Wireless Universal Serial Bus (WUSB) products, the efficiency and cost effectiveness of these products become paramount. This paper presents an improved soft-output QPSK demapper suitable for these products that not only exploits time diversity and guard carrier diversity, but also merges the demapping and symbol combining functions to minimize CPU cycles or memory accesses, depending upon the chosen implementation architecture. The proposed demapper is presented in the context of the Multiband OFDM version of Ultra Wideband (UWB) (ECMA-368) as the chosen physical implementation for high-rate Wireless USB.
Abstract:
Process scheduling techniques consider the current load situation to allocate computing resources. These techniques rely on approximations, such as average communication, processing, and memory access figures, to improve scheduling, although processes may behave differently over the course of their execution. They may start with high communication requirements and later only perform processing. By discovering how processes behave over time, we believe it is possible to improve resource allocation. This motivated this paper, which adopts chaos theory concepts and nonlinear prediction techniques to model and predict process behavior. Results confirm that the radial basis function technique produces good predictions with low processing demands, which is essential in a real distributed environment.