8 results for cache-oblivious
at Universidad Politécnica de Madrid
Abstract:
An approximate analytic model of a shared memory multiprocessor with a Cache Only Memory Architecture (COMA), the bus-based Data Diffusion Machine (DDM), is presented and validated. It describes the timing and interference in the system as a function of the hardware, the protocols, the topology and the workload. Model results have been compared to results from an independent simulator. The comparison shows good model accuracy, especially for non-saturated systems, where the errors in response times and device utilizations are independent of the number of processors and remain below 10% in 90% of the simulations. Therefore, the model can be used as an average performance prediction tool that avoids expensive simulations in the design of systems with many processors.
Abstract:
The first-level data cache in modern processors has become a major consumer of energy due to its increasing size and high access frequency. In order to reduce this high energy consumption, we propose in this paper a straightforward filtering technique based on a highly accurate forwarding predictor. Specifically, a simple structure predicts whether a load instruction will obtain its data via forwarding from the load-store structure, thus avoiding the data cache access, or from the data cache itself. This mechanism reduces the data cache energy consumption by an average of 21.5% with a negligible performance penalty of less than 0.1%. Furthermore, in this paper we also address the static energy consumption of the cache by disabling a portion of the sets of the L2 associative cache. Overall, when merging both proposals, the combined L1 and L2 total energy consumption is reduced by an average of 29.2% with a performance penalty of just 0.25%. Keywords: Energy consumption; filtering; forwarding predictor; cache hierarchy
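As a rough illustration of what a forwarding predictor might look like, the sketch below uses a table of 2-bit saturating counters indexed by the load's program counter. The abstract does not describe the actual structure, so the class name `ForwardingPredictor`, the table size, the indexing scheme and the counter width are all assumptions made for this example.

```python
class ForwardingPredictor:
    """Hypothetical predictor: one 2-bit saturating counter per table entry."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0..3; values >= 2 predict "data will be forwarded".
        self.counters = [0] * entries

    def _index(self, pc):
        # Assumed indexing: drop the two low bits of the PC, then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return True if the load at `pc` is predicted to skip the data cache."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, forwarded):
        """Train the counter with the observed outcome of the load."""
        i = self._index(pc)
        if forwarded:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = ForwardingPredictor()
load_pc = 0x400
first_guess = predictor.predict(load_pc)      # untrained: predicts a cache access
predictor.update(load_pc, forwarded=True)
predictor.update(load_pc, forwarded=True)
trained_guess = predictor.predict(load_pc)    # two forwards later: predicts forwarding
```

In a design of this kind, a load predicted to be forwarded would not access the L1 data cache at all, which is where the energy saving would come from; a misprediction would need a recovery path that this sketch does not model.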
Abstract:
With the advent of the cloud computing model, distributed caches have become the cornerstone for building scalable applications. Popular systems like Facebook [1] or Twitter use Memcached [5], a highly scalable distributed object cache, to speed up applications by avoiding database accesses. Distributed object caches assign objects to cache instances based on a hashing function, and objects are not moved from one cache instance to another unless more instances are added to the cache and objects are redistributed. This may lead to situations where some cache instances are overloaded because some of the objects they store are frequently accessed, while other cache instances are used less frequently. In this paper we propose a multi-resource load balancing algorithm for distributed cache systems. The algorithm aims at balancing both CPU and memory resources among cache instances by redistributing stored data. Considering the possible conflict of balancing multiple resources at the same time, we give CPU and memory resources weighted priorities based on the runtime load distributions. A scarcer resource is given a higher weight than a less scarce resource when load balancing. The system imbalance degree is evaluated based on monitoring information and on the utility load of a node, a unit for resource consumption. In addition, since continuous rebalancing of the system may affect the QoS of applications using the cache system, our data selection policy ensures that each data migration minimizes the system imbalance degree, and hence the total reconfiguration cost can be minimized. An extensive simulation is conducted to compare our policy with other policies. Our policy shows a significant improvement in time efficiency and a decrease in reconfiguration cost.
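The weighting idea can be made concrete with a small sketch. The abstract does not give the formulas, so everything here is an assumption for illustration: a resource's scarcity is approximated by its mean utilization across nodes, the utility load of a node is the weighted sum of its CPU and memory utilization, the imbalance degree is the population standard deviation of the utility loads, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def resource_weights(nodes):
    """Assumed scarcity proxy: weight each resource by its mean utilization."""
    cpu = mean(n["cpu"] for n in nodes)
    mem = mean(n["mem"] for n in nodes)
    total = cpu + mem
    if total == 0:
        return 0.5, 0.5
    return cpu / total, mem / total


def utility_load(node, w_cpu, w_mem):
    """Assumed utility load: weighted sum of CPU and memory utilization."""
    return w_cpu * node["cpu"] + w_mem * node["mem"]


def imbalance(nodes):
    """Assumed imbalance degree: spread of utility loads across nodes."""
    w_cpu, w_mem = resource_weights(nodes)
    return pstdev([utility_load(n, w_cpu, w_mem) for n in nodes])


def apply_move(nodes, src, dst, cpu, mem):
    """Return a copy of `nodes` after moving an item's load from src to dst."""
    moved = [dict(n) for n in nodes]
    moved[src]["cpu"] -= cpu
    moved[src]["mem"] -= mem
    moved[dst]["cpu"] += cpu
    moved[dst]["mem"] += mem
    return moved


def best_migration(nodes, items):
    """Greedy choice: the single move that lowers the imbalance the most.

    `items` holds (item_id, source_node_index, cpu_cost, mem_cost) tuples.
    Returns (item_id, src, dst), or None if no move improves the balance.
    """
    best, best_score = None, imbalance(nodes)
    for item_id, src, cpu, mem in items:
        for dst in range(len(nodes)):
            if dst == src:
                continue
            score = imbalance(apply_move(nodes, src, dst, cpu, mem))
            if score < best_score:
                best, best_score = (item_id, src, dst), score
    return best


cluster = [{"cpu": 0.8, "mem": 0.4}, {"cpu": 0.2, "mem": 0.4}]
hot_items = [("a", 0, 0.3, 0.0)]
choice = best_migration(cluster, hot_items)
```

Here CPU is the scarcer resource under the assumed proxy, so it receives the larger weight, and moving the CPU-heavy item `"a"` from node 0 to node 1 is the move selected.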
Abstract:
Models are an effective tool for systems and software design. They allow software architects to abstract from non-relevant details. Those qualities are also useful for the technical management of networks, systems and software, such as those that compose service-oriented architectures. Models can provide a set of well-defined abstractions over the distributed heterogeneous service infrastructure that enable its automated management. We propose to use the managed system as a source of dynamically generated runtime models, and to decompose management processes into a composition of model transformations. We have created an autonomic service deployment and configuration architecture that obtains, analyzes, and transforms system models to apply the required actions, while being oblivious to the low-level details. An instrumentation layer automatically builds these models and translates the planned management actions into operations on the system. We illustrate these concepts with a distributed service update operation.
Abstract:
Polyvariant specialization allows generating multiple versions of a procedure, which can then be separately optimized for different uses. Since allowing a high degree of polyvariance often results in more optimized code, polyvariant specializers, such as most partial evaluators, can generate a large number of versions. This can produce unnecessarily large residual programs. Also, large programs can be slower due to cache miss effects. A possible solution to this problem is to introduce a minimization step which identifies sets of equivalent versions and replaces all occurrences of such versions by a single one. In this work we present a unifying view of the problem of superfluous polyvariance. It includes both partial deduction and abstract multiple specialization. As regards partial deduction, we extend existing approaches in several ways. First, previous work has dealt with pure logic programs and a very limited class of builtins. Herein we propose an extension to traditional characteristic trees which can be used in the presence of calls to external predicates. This includes all builtins, libraries, other user modules, etc. Second, we propose the possibility of collapsing versions which are not strictly equivalent. This allows trading time for space and can be useful in the context of embedded and pervasive systems. This is done by residualizing certain computations for external predicates which would otherwise be performed at specialization time. Third, we provide an experimental evaluation of the potential gains achievable using minimization, which leads to interesting conclusions.
Abstract:
The goal of the RAP-WAM AND-parallel Prolog abstract architecture is to provide inference speeds significantly beyond those of sequential systems, while supporting Prolog semantics and preserving sequential performance and storage efficiency. This paper presents simulation results supporting these claims with special emphasis on memory performance on a two-level shared-memory multiprocessor organization. Several solutions to the cache coherency problem are analyzed. It is shown that RAP-WAM offers good locality and storage efficiency and that it can effectively take advantage of broadcast caches. It is argued that speeds in excess of 2 MLIPS on real applications exhibiting medium parallelism can be attained with current technology.
Abstract:
Applications that operate on meshes are very popular in High Performance Computing (HPC) environments. In the past, many techniques have been developed in order to optimize the memory accesses for these datasets. Different loop transformations and domain decompositions are commonly used for structured meshes. However, unstructured grids are more challenging. The memory accesses, based on the mesh connectivity, do not map well to the usual linear memory model. This work presents a method to improve the memory performance which is suitable for HPC codes that operate on meshes. We develop a method to adjust the sequence in which the data are used inside the algorithm, by means of traversing and sorting the mesh. This sorted mesh can be transferred sequentially to the lower memory levels and allows for minimum data transfer requirements. The method also reduces the lower memory requirements dramatically: up to 63% of the L1 cache misses are removed in a traditional cache system. We have obtained speedups of up to 2.58 on memory operations as measured in a general-purpose CPU. An improvement is also observed with sequential access memories, where we have observed reductions of up to 99% in the required low-level memory size.
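To give a feel for why traversing and sorting a mesh can help locality, the sketch below renumbers the nodes of a small unstructured mesh in breadth-first order, so that connected nodes end up close together in memory. This is a generic reordering swapped in for illustration (similar in spirit to Cuthill-McKee without degree sorting), not the method of the paper; the functions `bfs_order` and `bandwidth` and the toy mesh are assumptions.

```python
from collections import deque


def bfs_order(adjacency):
    """Return node ids in breadth-first order, covering every component."""
    visited = set()
    order = []
    for start in sorted(adjacency):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
    return order


def bandwidth(adjacency, position):
    """Largest distance in memory between two connected nodes."""
    return max(
        (abs(position[a] - position[b]) for a in adjacency for b in adjacency[a]),
        default=0,
    )


# Toy mesh: a chain 0 - 3 - 1 - 2 whose node ids do not follow the connectivity.
mesh = {0: [3], 3: [0, 1], 1: [3, 2], 2: [1]}

original_position = {node: node for node in mesh}
new_order = bfs_order(mesh)
new_position = {node: i for i, node in enumerate(new_order)}

before = bandwidth(mesh, original_position)
after = bandwidth(mesh, new_position)
```

With the original numbering, neighbouring nodes are up to three slots apart; after the breadth-first renumbering every pair of neighbours is adjacent in memory, which is the kind of sequential access pattern the abstract aims for.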
Abstract:
This work studies the technologies currently used on the web, explaining their main components, their purpose, and how they work. Starting from a theoretical scenario of a deployment for a web service with a very large number of users, and building on the technologies studied, a complete system setup is proposed that would be able to handle all incoming requests correctly, avoiding failures and downtime. A theoretical analysis of the costs of implementing the system is included, comparing it with a conventional web system, along with a further analysis of how a cache operates and the load benefits derived from its use.