966 resultados para memory access complexity


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Past studies of memory interference in multiprocessor systems have generally assumed that the references of each processor are uniformly distributed among the memory modules. In this paper we develop a model with local referencing, which reflects more closely the behavior of real-life programs. This model is analyzed using Markov chain techniques and expressions are derived for the multiprocessor performance. New expressions are also obtained for the performance in the traditional uniform reference model and are compared with other expressions-available in the literature. Results of a simulation study are given to show the accuracy of the expressions for both models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As the performance gap between microprocessors and memory continues to increase, main memory accesses result in long latencies which become a factor limiting system performance. Previous studies show that main memory access streams contain significant localities and SDRAM devices provide parallelism through multiple banks and channels. These locality and parallelism have not been exploited thoroughly by conventional memory controllers. In this thesis, SDRAM address mapping techniques and memory access reordering mechanisms are studied and applied to memory controller design with the goal of reducing observed main memory access latency. The proposed bit-reversal address mapping attempts to distribute main memory accesses evenly in the SDRAM address space to enable bank parallelism. As memory accesses to unique banks are interleaved, the access latencies are partially hidden and therefore reduced. With the consideration of cache conflict misses, bit-reversal address mapping is able to direct potential row conflicts to different banks, further improving the performance. The proposed burst scheduling is a novel access reordering mechanism, which creates bursts by clustering accesses directed to the same rows of the same banks. Subjected to a threshold, reads are allowed to preempt writes and qualified writes are piggybacked at the end of the bursts. A sophisticated access scheduler selects accesses based on priorities and interleaves accesses to maximize the SDRAM data bus utilization. Consequentially burst scheduling reduces row conflict rate, increasing and exploiting the available row locality. Using a revised SimpleScalar and M5 simulator, both techniques are evaluated and compared with existing academic and industrial solutions. With SPEC CPU2000 benchmarks, bit-reversal reduces the execution time by 14% on average over traditional page interleaving address mapping. Burst scheduling also achieves a 15% reduction in execution time over conventional bank in order scheduling. Working constructively together, bit-reversal and burst scheduling successfully achieve a 19% speedup across simulated benchmarks.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Cache-coherent non uniform memory access (ccNUMA) architecture is a standard design pattern for contemporary multicore processors, and future generations of architectures are likely to be NUMA. NUMA architectures create new challenges for managed runtime systems. Memory-intensive applications use the system’s distributed memory banks to allocate data, and the automatic memory manager collects garbage left in these memory banks. The garbage collector may need to access remote memory banks, which entails access latency overhead and potential bandwidth saturation for the interconnection between memory banks. This dissertation makes five significant contributions to garbage collection on NUMA systems, with a case study implementation using the Hotspot Java Virtual Machine. It empirically studies data locality for a Stop-The-World garbage collector when tracing connected objects in NUMA heaps. First, it identifies a locality richness which exists naturally in connected objects that contain a root object and its reachable set— ‘rooted sub-graphs’. Second, this dissertation leverages the locality characteristic of rooted sub-graphs to develop a new NUMA-aware garbage collection mechanism. A garbage collector thread processes a local root and its reachable set, which is likely to have a large number of objects in the same NUMA node. Third, a garbage collector thread steals references from sibling threads that run on the same NUMA node to improve data locality. This research evaluates the new NUMA-aware garbage collector using seven benchmarks of an established real-world DaCapo benchmark suite. In addition, evaluation involves a widely used SPECjbb benchmark and Neo4J graph database Java benchmark, as well as an artificial benchmark. The results of the NUMA-aware garbage collector on a multi-hop NUMA architecture show an average of 15% performance improvement. Furthermore, this performance gain is shown to be as a result of an improved NUMA memory access in a ccNUMA system. Fourth, the existing Hotspot JVM adaptive policy for configuring the number of garbage collection threads is shown to be suboptimal for current NUMA machines. The policy uses outdated assumptions and it generates a constant thread count. In fact, the Hotspot JVM still uses this policy in the production version. This research shows that the optimal number of garbage collection threads is application-specific and configuring the optimal number of garbage collection threads yields better collection throughput than the default policy. Fifth, this dissertation designs and implements a runtime technique, which involves heuristics from dynamic collection behavior to calculate an optimal number of garbage collector threads for each collection cycle. The results show an average of 21% improvements to the garbage collection performance for DaCapo benchmarks.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

稀疏矩阵向量乘(SpMV)采取压缩行存储格式的算法性能非常差,而寄存器分块算法可以使得数据尽量在靠近处理器的存储层次中访问而提高性能.利用RAM(h)模型进行分析和比较不同算法形式的存储访问复杂度,可以比较两种算法的优劣.通过RAM(h)分析SpMV两种实现形式的存储访问复杂度,同时在奔腾四平台上,测试了7个稀疏矩阵的SpMV性能,并统计了这两种算法中L1,L2,和TLB的缺失率,实验结果与模型分析的数据一致.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The development of 3G (the 3rd generation telecommunication) value-added services brings higher requirements of Quality of Service (QoS). Wideband Code Division Multiple Access (WCDMA) is one of three 3G standards, and enhancement of QoS for WCDMA Core Network (CN) becomes more and more important for users and carriers. The dissertation focuses on enhancement of QoS for WCDMA CN. The purpose is to realize the DiffServ (Differentiated Services) model of QoS for WCDMA CN. Based on the parallelism characteristic of Network Processors (NPs), the NP programming model is classified as Pool of Threads (POTs) and Hyper Task Chaining (HTC). In this study, an integrated programming model that combines both of the two models was designed. This model has highly efficient and flexible features, and also solves the problems of sharing conflicts and packet ordering. We used this model as the programming model to realize DiffServ QoS for WCDMA CN. ^ The realization mechanism of the DiffServ model mainly consists of buffer management, packet scheduling and packet classification algorithms based on NPs. First, we proposed an adaptive buffer management algorithm called Packet Adaptive Fair Dropping (PAFD), which takes into consideration of both fairness and throughput, and has smooth service curves. Then, an improved packet scheduling algorithm called Priority-based Weighted Fair Queuing (PWFQ) was introduced to ensure the fairness of packet scheduling and reduce queue time of data packets. At the same time, the delay and jitter are also maintained in a small range. Thirdly, a multi-dimensional packet classification algorithm called Classification Based on Network Processors (CBNPs) was designed. It effectively reduces the memory access and storage space, and provides less time and space complexity. ^ Lastly, an integrated hardware and software system of the DiffServ model of QoS for WCDMA CN was proposed. It was implemented on the NP IXP2400. According to the corresponding experiment results, the proposed system significantly enhanced QoS for WCDMA CN. It extensively improves consistent response time, display distortion and sound image synchronization, and thus increases network efficiency and saves network resource.^

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Non-Volatile Memory (NVM) technology holds promise to replace SRAM and DRAM at various levels of the memory hierarchy. The interest in NVM is motivated by the difficulty faced in scaling DRAM beyond 22 nm and, long-term, lower cost per bit. While offering higher density and negligible static power (leakage and refresh), NVM suffers increased latency and energy per memory access. This paper develops energy and performance models of memory systems and applies them to understand the energy-efficiency of replacing or complementing DRAM with NVM. Our analysis focusses on the application of NVM in main memory. We demonstrate that NVM such as STT-RAM and RRAM is energy-efficient for memory sizes commonly employed in servers and high-end workstations, but PCM is not. Furthermore, the model is well suited to quickly evaluate the impact of changes to the model parameters, which may be achieved through optimization of the memory architecture, and to determine the key parameters that impact system-level energy and performance.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The current industry trend is towards using Commercially available Off-The-Shelf (COTS) based multicores for developing real time embedded systems, as opposed to the usage of custom-made hardware. In typical implementation of such COTS-based multicores, multiple cores access the main memory via a shared bus. This often leads to contention on this shared channel, which results in an increase of the response time of the tasks. Analyzing this increased response time, considering the contention on the shared bus, is challenging on COTS-based systems mainly because bus arbitration protocols are often undocumented and the exact instants at which the shared bus is accessed by tasks are not explicitly controlled by the operating system scheduler; they are instead a result of cache misses. This paper makes three contributions towards analyzing tasks scheduled on COTS-based multicores. Firstly, we describe a method to model the memory access patterns of a task. Secondly, we apply this model to analyze the worst case response time for a set of tasks. Although the required parameters to obtain the request profile can be obtained by static analysis, we provide an alternative method to experimentally obtain them by using performance monitoring counters (PMCs). We also compare our work against an existing approach and show that our approach outperforms it by providing tighter upper-bound on the number of bus requests generated by a task.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The miniaturization race in the hardware industry aiming at continuous increasing of transistor density on a die does not bring respective application performance improvements any more. One of the most promising alternatives is to exploit a heterogeneous nature of common applications in hardware. Supported by reconfigurable computation, which has already proved its efficiency in accelerating data intensive applications, this concept promises a breakthrough in contemporary technology development. Memory organization in such heterogeneous reconfigurable architectures becomes very critical. Two primary aspects introduce a sophisticated trade-off. On the one hand, a memory subsystem should provide well organized distributed data structure and guarantee the required data bandwidth. On the other hand, it should hide the heterogeneous hardware structure from the end-user, in order to support feasible high-level programmability of the system. This thesis work explores the heterogeneous reconfigurable hardware architectures and presents possible solutions to cope the problem of memory organization and data structure. By the example of the MORPHEUS heterogeneous platform, the discussion follows the complete design cycle, starting from decision making and justification, until hardware realization. Particular emphasis is made on the methods to support high system performance, meet application requirements, and provide a user-friendly programmer interface. As a result, the research introduces a complete heterogeneous platform enhanced with a hierarchical memory organization, which copes with its task by means of separating computation from communication, providing reconfigurable engines with computation and configuration data, and unification of heterogeneous computational devices using local storage buffers. It is distinguished from the related solutions by distributed data-flow organization, specifically engineered mechanisms to operate with data on local domains, particular communication infrastructure based on Network-on-Chip, and thorough methods to prevent computation and communication stalls. In addition, a novel advanced technique to accelerate memory access was developed and implemented.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We introduce multiple-control fuzzy vaults allowing generalised threshold, compartmented and multilevel access structure. The presented schemes enable many useful applications employing multiple users and/or multiple locking sets. Introducing the original single control fuzzy vault of Juels and Sudan we identify several similarities and differences between their vault and secret sharing schemes which influence how best to obtain working generalisations. We design multiple-control fuzzy vaults suggesting applications using biometric credentials as locking and unlocking values. Furthermore we assess the security of our obtained generalisations for insider/ outsider attacks and examine the access-complexity for legitimate vault owners.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

X-ray microtomography (micro-CT) with micron resolution enables new ways of characterizing microstructures and opens pathways for forward calculations of multiscale rock properties. A quantitative characterization of the microstructure is the first step in this challenge. We developed a new approach to extract scale-dependent characteristics of porosity, percolation, and anisotropic permeability from 3-D microstructural models of rocks. The Hoshen-Kopelman algorithm of percolation theory is employed for a standard percolation analysis. The anisotropy of permeability is calculated by means of the star volume distribution approach. The local porosity distribution and local percolation probability are obtained by using the local porosity theory. Additionally, the local anisotropy distribution is defined and analyzed through two empirical probability density functions, the isotropy index and the elongation index. For such a high-resolution data set, the typical data sizes of the CT images are on the order of gigabytes to tens of gigabytes; thus an extremely large number of calculations are required. To resolve this large memory problem parallelization in OpenMP was used to optimally harness the shared memory infrastructure on cache coherent Non-Uniform Memory Access architecture machines such as the iVEC SGI Altix 3700Bx2 Supercomputer. We see adequate visualization of the results as an important element in this first pioneering study.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In this paper we present a cryptanalysis of a new 256-bit hash function, FORK-256, proposed by Hong et al. at FSE 2006. This cryptanalysis is based on some unexpected differentials existing for the step transformation. We show their possible uses in different attack scenarios by giving a 1-bit (resp. 2-bit) near collision attack against the full compression function of FORK-256 running with complexity of 2^125 (resp. 2^120) and with negligible memory, and by exhibiting a 22-bit near pseudo-collision. We also show that we can find collisions for the full compression function with a small amount of memory with complexity not exceeding 2^126.6 hash evaluations. We further show how to reduce this complexity to 2^109.6 hash computations by using 273 memory words. Finally, we show that this attack can be extended with no additional cost to find collisions for the full hash function, i.e. with the predefined IV.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In this paper, we analyze the SHAvite-3-512 hash function, as proposed and tweaked for round 2 of the SHA-3 competition. We present cryptanalytic results on 10 out of 14 rounds of the hash function SHAvite-3-512, and on the full 14 round compression function of SHAvite-3-512. We show a second preimage attack on the hash function reduced to 10 rounds with a complexity of 2497 compression function evaluations and 216 memory. For the full 14-round compression function, we give a chosen counter, chosen salt preimage attack with 2384 compression function evaluations and 2128 memory (or complexity 2448 without memory), and a collision attack with 2192 compression function evaluations and 2128 memory.