985 resultados para Main Memory


Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper evaluates the viability of user-level software management of a hybrid DRAM/NVM main memory system. We propose an operating system (OS) and programming interface to place data from within the user application. We present a profiling tool to help programmers decide on the placement of application data in hybrid memory systems. Cycle-accurate simulation of modified applications confirms that our approach is more energy-efficient than state-of-the- art hardware or OS approaches at equivalent performance. Moreover, our results are validated on several candidate NVM technologies and a wide set of 14 benchmarks.
The key observation behind this work is that, for the work- loads we evaluated, application objects are too short-lived to motivate migration. Utilizing this property significantly reduces the hardware complexity of hybrid memory systems.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Searching in a dataset for elements that are similar to a given query element is a core problem in applications that manage complex data, and has been aided by metric access methods (MAMs). A growing number of applications require indices that must be built faster and repeatedly, also providing faster response for similarity queries. The increase in the main memory capacity and its lowering costs also motivate using memory-based MAMs. In this paper. we propose the Onion-tree, a new and robust dynamic memory-based MAM that slices the metric space into disjoint subspaces to provide quick indexing of complex data. It introduces three major characteristics: (i) a partitioning method that controls the number of disjoint subspaces generated at each node; (ii) a replacement technique that can change the leaf node pivots in insertion operations; and (iii) range and k-NN extended query algorithms to support the new partitioning method, including a new visit order of the subspaces in k-NN queries. Performance tests with both real-world and synthetic datasets showed that the Onion-tree is very compact. Comparisons of the Onion-tree with the MM-tree and a memory-based version of the Slim-tree showed that the Onion-tree was always faster to build the index. The experiments also showed that the Onion-tree significantly improved range and k-NN query processing performance and was the most efficient MAM, followed by the MM-tree, which in turn outperformed the Slim-tree in almost all the tests. (C) 2010 Elsevier B.V. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As the performance gap between microprocessors and memory continues to increase, main memory accesses result in long latencies which become a factor limiting system performance. Previous studies show that main memory access streams contain significant localities and SDRAM devices provide parallelism through multiple banks and channels. These locality and parallelism have not been exploited thoroughly by conventional memory controllers. In this thesis, SDRAM address mapping techniques and memory access reordering mechanisms are studied and applied to memory controller design with the goal of reducing observed main memory access latency. The proposed bit-reversal address mapping attempts to distribute main memory accesses evenly in the SDRAM address space to enable bank parallelism. As memory accesses to unique banks are interleaved, the access latencies are partially hidden and therefore reduced. With the consideration of cache conflict misses, bit-reversal address mapping is able to direct potential row conflicts to different banks, further improving the performance. The proposed burst scheduling is a novel access reordering mechanism, which creates bursts by clustering accesses directed to the same rows of the same banks. Subjected to a threshold, reads are allowed to preempt writes and qualified writes are piggybacked at the end of the bursts. A sophisticated access scheduler selects accesses based on priorities and interleaves accesses to maximize the SDRAM data bus utilization. Consequentially burst scheduling reduces row conflict rate, increasing and exploiting the available row locality. Using a revised SimpleScalar and M5 simulator, both techniques are evaluated and compared with existing academic and industrial solutions. With SPEC CPU2000 benchmarks, bit-reversal reduces the execution time by 14% on average over traditional page interleaving address mapping. Burst scheduling also achieves a 15% reduction in execution time over conventional bank in order scheduling. Working constructively together, bit-reversal and burst scheduling successfully achieve a 19% speedup across simulated benchmarks.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The generation of a correlation matrix for set of genomic sequences is a common requirement in many bioinformatics problems such as phylogenetic analysis. Each sequence may be millions of bases long and there may be thousands of such sequences which we wish to compare, so not all sequences may fit into main memory at the same time. Each sequence needs to be compared with every other sequence, so we will generally need to page some sequences in and out more than once. In order to minimize execution time we need to minimize this I/O. This paper develops an approach for faster and scalable computing of large-size correlation matrices through the maximal exploitation of available memory and reducing the number of I/O operations. The approach is scalable in the sense that the same algorithms can be executed on different computing platforms with different amounts of memory and can be applied to different bioinformatics problems with different correlation matrix sizes. The significant performance improvement of the approach over previous work is demonstrated through benchmark examples.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Non-Volatile Memory (NVM) technology holds promise to replace SRAM and DRAM at various levels of the memory hierarchy. The interest in NVM is motivated by the difficulty faced in scaling DRAM beyond 22 nm and, long-term, lower cost per bit. While offering higher density and negligible static power (leakage and refresh), NVM suffers increased latency and energy per memory access. This paper develops energy and performance models of memory systems and applies them to understand the energy-efficiency of replacing or complementing DRAM with NVM. Our analysis focusses on the application of NVM in main memory. We demonstrate that NVM such as STT-RAM and RRAM is energy-efficient for memory sizes commonly employed in servers and high-end workstations, but PCM is not. Furthermore, the model is well suited to quickly evaluate the impact of changes to the model parameters, which may be achieved through optimization of the memory architecture, and to determine the key parameters that impact system-level energy and performance.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Increasingly large amounts of data are stored in main memory of data center servers. However, DRAM-based memory is an important consumer of energy and is unlikely to scale in the future. Various byte-addressable non-volatile memory (NVM) technologies promise high density and near-zero static energy, however they suffer from increased latency and increased dynamic energy consumption.

This paper proposes to leverage a hybrid memory architecture, consisting of both DRAM and NVM, by novel, application-level data management policies that decide to place data on DRAM vs. NVM. We analyze modern column-oriented and key-value data stores and demonstrate the feasibility of application-level data management. Cycle-accurate simulation confirms that our methodology reduces the energy with least performance degradation as compared to the current state-of-the-art hardware or OS approaches. Moreover, we utilize our techniques to apportion DRAM and NVM memory sizes for these workloads.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The current industry trend is towards using Commercially available Off-The-Shelf (COTS) based multicores for developing real time embedded systems, as opposed to the usage of custom-made hardware. In typical implementation of such COTS-based multicores, multiple cores access the main memory via a shared bus. This often leads to contention on this shared channel, which results in an increase of the response time of the tasks. Analyzing this increased response time, considering the contention on the shared bus, is challenging on COTS-based systems mainly because bus arbitration protocols are often undocumented and the exact instants at which the shared bus is accessed by tasks are not explicitly controlled by the operating system scheduler; they are instead a result of cache misses. This paper makes three contributions towards analyzing tasks scheduled on COTS-based multicores. Firstly, we describe a method to model the memory access patterns of a task. Secondly, we apply this model to analyze the worst case response time for a set of tasks. Although the required parameters to obtain the request profile can be obtained by static analysis, we provide an alternative method to experimentally obtain them by using performance monitoring counters (PMCs). We also compare our work against an existing approach and show that our approach outperforms it by providing tighter upper-bound on the number of bus requests generated by a task.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The last decade has witnessed a major shift towards the deployment of embedded applications on multi-core platforms. However, real-time applications have not been able to fully benefit from this transition, as the computational gains offered by multi-cores are often offset by performance degradation due to shared resources, such as main memory. To efficiently use multi-core platforms for real-time systems, it is hence essential to tightly bound the interference when accessing shared resources. Although there has been much recent work in this area, a remaining key problem is to address the diversity of memory arbiters in the analysis to make it applicable to a wide range of systems. This work handles diverse arbiters by proposing a general framework to compute the maximum interference caused by the shared memory bus and its impact on the execution time of the tasks running on the cores, considering different bus arbiters. Our novel approach clearly demarcates the arbiter-dependent and independent stages in the analysis of these upper bounds. The arbiter-dependent phase takes the arbiter and the task memory-traffic pattern as inputs and produces a model of the availability of the bus to a given task. Then, based on the availability of the bus, the arbiter-independent phase determines the worst-case request-release scenario that maximizes the interference experienced by the tasks due to the contention for the bus. We show that the framework addresses the diversity problem by applying it to a memory bus shared by a fixed-priority arbiter, a time-division multiplexing (TDM) arbiter, and an unspecified work-conserving arbiter using applications from the MediaBench test suite. We also experimentally evaluate the quality of the analysis by comparison with a state-of-the-art TDM analysis approach and consistently showing a considerable reduction in maximum interference.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Current computer systems have evolved from featuring only a single processing unit and limited RAM, in the order of kilobytes or few megabytes, to include several multicore processors, o↵ering in the order of several tens of concurrent execution contexts, and have main memory in the order of several tens to hundreds of gigabytes. This allows to keep all data of many applications in the main memory, leading to the development of inmemory databases. Compared to disk-backed databases, in-memory databases (IMDBs) are expected to provide better performance by incurring in less I/O overhead. In this dissertation, we present a scalability study of two general purpose IMDBs on multicore systems. The results show that current general purpose IMDBs do not scale on multicores, due to contention among threads running concurrent transactions. In this work, we explore di↵erent direction to overcome the scalability issues of IMDBs in multicores, while enforcing strong isolation semantics. First, we present a solution that requires no modification to either database systems or to the applications, called MacroDB. MacroDB replicates the database among several engines, using a master-slave replication scheme, where update transactions execute on the master, while read-only transactions execute on slaves. This reduces contention, allowing MacroDB to o↵er scalable performance under read-only workloads, while updateintensive workloads su↵er from performance loss, when compared to the standalone engine. Second, we delve into the database engine and identify the concurrency control mechanism used by the storage sub-component as a scalability bottleneck. We then propose a new locking scheme that allows the removal of such mechanisms from the storage sub-component. This modification o↵ers performance improvement under all workloads, when compared to the standalone engine, while scalability is limited to read-only workloads. Next we addressed the scalability limitations for update-intensive workloads, and propose the reduction of locking granularity from the table level to the attribute level. This further improved performance for intensive and moderate update workloads, at a slight cost for read-only workloads. Scalability is limited to intensive-read and read-only workloads. Finally, we investigate the impact applications have on the performance of database systems, by studying how operation order inside transactions influences the database performance. We then propose a Read before Write (RbW) interaction pattern, under which transaction perform all read operations before executing write operations. The RbW pattern allowed TPC-C to achieve scalable performance on our modified engine for all workloads. Additionally, the RbW pattern allowed our modified engine to achieve scalable performance on multicores, almost up to the total number of cores, while enforcing strong isolation.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The quest for universal memory is driving the rapid development of memories with superior all-round capabilities in non-volatility, high speed, high endurance and low power. The memory subsystem accounts for a significant cost and power budget of a computer system. Current DRAM-based main memory systems are starting to hit the power and cost limit. To resolve this issue the industry is improving existing technologies such as Flash and exploring new ones. Among those new technologies is the Phase Change Memory (PCM), which overcomes some of the shortcomings of the Flash such as durability and scalability. This alternative non-volatile memory technology, which uses resistance contrast in phase-change materials, offers more density relative to DRAM, and can help to increase main memory capacity of future systems while remaining within the cost and power constraints. Chalcogenide materials can suitably be exploited for manufacturing phase-change memory devices. Charge transport in amorphous chalcogenide-GST used for memory devices is modeled using two contributions: hopping of trapped electrons and motion of band electrons in extended states. Crystalline GST exhibits an almost Ohmic I(V) curve. In contrast amorphous GST shows a high resistance at low biases while, above a threshold voltage, a transition takes place from a highly resistive to a conductive state, characterized by a negative differential-resistance behavior. A clear and complete understanding of the threshold behavior of the amorphous phase is fundamental for exploiting such materials in the fabrication of innovative nonvolatile memories. The type of feedback that produces the snapback phenomenon is described as a filamentation in energy that is controlled by electron–electron interactions between trapped electrons and band electrons. The model thus derived is implemented within a state-of-the-art simulator. An analytical version of the model is also derived and is useful for discussing the snapback behavior and the scaling properties of the device.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

This paper explores potential for the RAMpage memory hierarchy to use a microkernel with a small memory footprint, in a specialized cache-speed static RAM (tightly-coupled memory, TCM). Dreamy memory is DRAM kept in low-power mode, unless referenced. Simulations show that a small microkernel suits RAMpage well, in that it achieves significantly better speed and energy gains than a standard hierarchy from adding TCM. RAMpage, in its best 128KB L2 case, gained 11% speed using TCM, and reduced energy 14%. Equivalent conventional hierarchy gains were under 1%. While 1MB L2 was significantly faster against lower-energy cases for the smaller L2, the larger SRAM's energy does not justify the speed gain. Using a 128KB L2 cache in a conventional architecture resulted in a best-case overall run time of 2.58s, compared with the best dreamy mode run time (RAMpage without context switches on misses) of 3.34s, a speed penalty of 29%. Energy in the fastest 128KB L2 case was 2.18J vs. 1.50J, a reduction of 31%. The same RAMpage configuration without dreamy mode took 2.83s as simulated, and used 2.39J, an acceptable trade-off (penalty under 10%) for being able to switch easily to a lower-energy mode.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

We introduce K-tree in an information retrieval context. It is an efficient approximation of the k-means clustering algorithm. Unlike k-means it forms a hierarchy of clusters. It has been extended to address issues with sparse representations. We compare performance and quality to CLUTO using document collections. The K-tree has a low time complexity that is suitable for large document collections. This tree structure allows for efficient disk based implementations where space requirements exceed that of main memory.