900 resultados para network-on-chip,deadlock, message-dependent-deadlock,NoC
Resumo:
I continui sviluppi nel campo della fabbricazione dei circuiti integrati hanno comportato frequenti travolgimenti nel design, nell’implementazione e nella scalabilità dei device elettronici, così come nel modo di utilizzarli. Anche se la legge di Moore ha anticipato e caratterizzato questo trend nelle ultime decadi, essa stessa si trova a fronteggiare attualmente enormi limitazioni, superabili solo attraverso un diverso approccio nella produzione di chip, consistente in pratica nella sovrapposizione verticale di diversi strati collegati elettricamente attraverso speciali vias. Sul singolo strato, le network on chip sono state suggerite per ovviare le profonde limitazioni dovute allo scaling di strutture di comunicazione condivise. Questa tesi si colloca principalmente nel contesto delle nascenti piattaforme multicore ad alte prestazioni basate sulle 3D NoC, in cui la network on chip viene estesa nelle 3 direzioni. L’obiettivo di questo lavoro è quello di fornire una serie di strumenti e tecniche per poter costruire e aratterizzare una piattaforma tridimensionale, cosi come dimostrato nella realizzazione del testchip 3D NOC fabbricato presso la fonderia IMEC. Il primo contributo è costituito sia una accurata caratterizzazione delle interconnessioni verticali (TSVs) (ovvero delle speciali vias che attraversano l’intero substrato del die), sia dalla caratterizzazione dei router 3D (in cui una o più porte sono estese nella direzione verticale) ed infine dal setup di un design flow 3D utilizzando interamente CAD 2D. Questo primo step ci ha permesso di effettuare delle analisi dettagliate sia sul costo sia sulle varie implicazioni. Il secondo contributo è costituito dallo sviluppo di alcuni blocchi funzionali necessari per garantire il corretto funziomento della 3D NoC, in presenza sia di guasti nelle TSVs (fault tolerant links) che di deriva termica nei vari clock tree dei vari die (alberi di clock indipendenti). Questo secondo contributo è costituito dallo sviluppo delle seguenti soluzioni circuitali: 3D fault tolerant link, Look Up Table riconfigurabili e un sicnronizzatore mesocrono. Il primo è costituito fondamentalmente un bus verticale equipaggiato con delle TSV di riserva da utilizzare per rimpiazzare le vias guaste, più la logica di controllo per effettuare il test e la riconfigurazione. Il secondo è rappresentato da una Look Up Table riconfigurabile, ad alte prestazioni e dal costo contenuto, necesaria per bilanciare sia il traffico nella NoC che per bypassare link non riparabili. Infine la terza soluzione circuitale è rappresentata da un sincronizzatore mesocrono necessario per garantire la sincronizzazione nel trasferimento dati da un layer and un altro nelle 3D Noc. Il terzo contributo di questa tesi è dato dalla realizzazione di un interfaccia multicore per memorie 3D (stacked 3D DRAM) ad alte prestazioni, e dall’esplorazione architetturale dei benefici e del costo di questo nuovo sistema in cui il la memoria principale non è piu il collo di bottiglia dell’intero sistema. Il quarto ed ultimo contributo è rappresentato dalla realizzazione di un 3D NoC test chip presso la fonderia IMEC, e di un circuito full custom per la caratterizzazione della variability dei parametri RC delle interconnessioni verticali.
Resumo:
La temperatura es una preocupación que juega un papel protagonista en el diseño de circuitos integrados modernos. El importante aumento de las densidades de potencia que conllevan las últimas generaciones tecnológicas ha producido la aparición de gradientes térmicos y puntos calientes durante el funcionamiento normal de los chips. La temperatura tiene un impacto negativo en varios parámetros del circuito integrado como el retardo de las puertas, los gastos de disipación de calor, la fiabilidad, el consumo de energía, etc. Con el fin de luchar contra estos efectos nocivos, la técnicas de gestión dinámica de la temperatura (DTM) adaptan el comportamiento del chip en función en la información que proporciona un sistema de monitorización que mide en tiempo de ejecución la información térmica de la superficie del dado. El campo de la monitorización de la temperatura en el chip ha llamado la atención de la comunidad científica en los últimos años y es el objeto de estudio de esta tesis. Esta tesis aborda la temática de control de la temperatura en el chip desde diferentes perspectivas y niveles, ofreciendo soluciones a algunos de los temas más importantes. Los niveles físico y circuital se cubren con el diseño y la caracterización de dos nuevos sensores de temperatura especialmente diseñados para los propósitos de las técnicas DTM. El primer sensor está basado en un mecanismo que obtiene un pulso de anchura variable dependiente de la relación de las corrientes de fuga con la temperatura. De manera resumida, se carga un nodo del circuito y posteriormente se deja flotando de tal manera que se descarga a través de las corrientes de fugas de un transistor; el tiempo de descarga del nodo es la anchura del pulso. Dado que la anchura del pulso muestra una dependencia exponencial con la temperatura, la conversión a una palabra digital se realiza por medio de un contador logarítmico que realiza tanto la conversión tiempo a digital como la linealización de la salida. La estructura resultante de esta combinación de elementos se implementa en una tecnología de 0,35 _m. El sensor ocupa un área muy reducida, 10.250 nm2, y consume muy poca energía, 1.05-65.5nW a 5 muestras/s, estas cifras superaron todos los trabajos previos en el momento en que se publicó por primera vez y en el momento de la publicación de esta tesis, superan a todas las implementaciones anteriores fabricadas en el mismo nodo tecnológico. En cuanto a la precisión, el sensor ofrece una buena linealidad, incluso sin calibrar; se obtiene un error 3_ de 1,97oC, adecuado para tratar con las aplicaciones de DTM. Como se ha explicado, el sensor es completamente compatible con los procesos de fabricación CMOS, este hecho, junto con sus valores reducidos de área y consumo, lo hacen especialmente adecuado para la integración en un sistema de monitorización de DTM con un conjunto de monitores empotrados distribuidos a través del chip. Las crecientes incertidumbres de proceso asociadas a los últimos nodos tecnológicos comprometen las características de linealidad de nuestra primera propuesta de sensor. Con el objetivo de superar estos problemas, proponemos una nueva técnica para obtener la temperatura. La nueva técnica también está basada en las dependencias térmicas de las corrientes de fuga que se utilizan para descargar un nodo flotante. La novedad es que ahora la medida viene dada por el cociente de dos medidas diferentes, en una de las cuales se altera una característica del transistor de descarga |la tensión de puerta. Este cociente resulta ser muy robusto frente a variaciones de proceso y, además, la linealidad obtenida cumple ampliamente los requisitos impuestos por las políticas DTM |error 3_ de 1,17oC considerando variaciones del proceso y calibrando en dos puntos. La implementación de la parte sensora de esta nueva técnica implica varias consideraciones de diseño, tales como la generación de una referencia de tensión independiente de variaciones de proceso, que se analizan en profundidad en la tesis. Para la conversión tiempo-a-digital, se emplea la misma estructura de digitalización que en el primer sensor. Para la implementación física de la parte de digitalización, se ha construido una biblioteca de células estándar completamente nueva orientada a la reducción de área y consumo. El sensor resultante de la unión de todos los bloques se caracteriza por una energía por muestra ultra baja (48-640 pJ) y un área diminuta de 0,0016 mm2, esta cifra mejora todos los trabajos previos. Para probar esta afirmación, se realiza una comparación exhaustiva con más de 40 propuestas de sensores en la literatura científica. Subiendo el nivel de abstracción al sistema, la tercera contribución se centra en el modelado de un sistema de monitorización que consiste de un conjunto de sensores distribuidos por la superficie del chip. Todos los trabajos anteriores de la literatura tienen como objetivo maximizar la precisión del sistema con el mínimo número de monitores. Como novedad, en nuestra propuesta se introducen nuevos parámetros de calidad aparte del número de sensores, también se considera el consumo de energía, la frecuencia de muestreo, los costes de interconexión y la posibilidad de elegir diferentes tipos de monitores. El modelo se introduce en un algoritmo de recocido simulado que recibe la información térmica de un sistema, sus propiedades físicas, limitaciones de área, potencia e interconexión y una colección de tipos de monitor; el algoritmo proporciona el tipo seleccionado de monitor, el número de monitores, su posición y la velocidad de muestreo _optima. Para probar la validez del algoritmo, se presentan varios casos de estudio para el procesador Alpha 21364 considerando distintas restricciones. En comparación con otros trabajos previos en la literatura, el modelo que aquí se presenta es el más completo. Finalmente, la última contribución se dirige al nivel de red, partiendo de un conjunto de monitores de temperatura de posiciones conocidas, nos concentramos en resolver el problema de la conexión de los sensores de una forma eficiente en área y consumo. Nuestra primera propuesta en este campo es la introducción de un nuevo nivel en la jerarquía de interconexión, el nivel de trillado (o threshing en inglés), entre los monitores y los buses tradicionales de periféricos. En este nuevo nivel se aplica selectividad de datos para reducir la cantidad de información que se envía al controlador central. La idea detrás de este nuevo nivel es que en este tipo de redes la mayoría de los datos es inútil, porque desde el punto de vista del controlador sólo una pequeña cantidad de datos |normalmente sólo los valores extremos| es de interés. Para cubrir el nuevo nivel, proponemos una red de monitorización mono-conexión que se basa en un esquema de señalización en el dominio de tiempo. Este esquema reduce significativamente tanto la actividad de conmutación sobre la conexión como el consumo de energía de la red. Otra ventaja de este esquema es que los datos de los monitores llegan directamente ordenados al controlador. Si este tipo de señalización se aplica a sensores que realizan conversión tiempo-a-digital, se puede obtener compartición de recursos de digitalización tanto en tiempo como en espacio, lo que supone un importante ahorro de área y consumo. Finalmente, se presentan dos prototipos de sistemas de monitorización completos que de manera significativa superan la características de trabajos anteriores en términos de área y, especialmente, consumo de energía. Abstract Temperature is a first class design concern in modern integrated circuits. The important increase in power densities associated to recent technology evolutions has lead to the apparition of thermal gradients and hot spots during run time operation. Temperature impacts several circuit parameters such as speed, cooling budgets, reliability, power consumption, etc. In order to fight against these negative effects, dynamic thermal management (DTM) techniques adapt the behavior of the chip relying on the information of a monitoring system that provides run-time thermal information of the die surface. The field of on-chip temperature monitoring has drawn the attention of the scientific community in the recent years and is the object of study of this thesis. This thesis approaches the matter of on-chip temperature monitoring from different perspectives and levels, providing solutions to some of the most important issues. The physical and circuital levels are covered with the design and characterization of two novel temperature sensors specially tailored for DTM purposes. The first sensor is based upon a mechanism that obtains a pulse with a varying width based on the variations of the leakage currents on the temperature. In a nutshell, a circuit node is charged and subsequently left floating so that it discharges away through the subthreshold currents of a transistor; the time the node takes to discharge is the width of the pulse. Since the width of the pulse displays an exponential dependence on the temperature, the conversion into a digital word is realized by means of a logarithmic counter that performs both the timeto- digital conversion and the linearization of the output. The structure resulting from this combination of elements is implemented in a 0.35_m technology and is characterized by very reduced area, 10250 nm2, and power consumption, 1.05-65.5 nW at 5 samples/s, these figures outperformed all previous works by the time it was first published and still, by the time of the publication of this thesis, they outnumber all previous implementations in the same technology node. Concerning the accuracy, the sensor exhibits good linearity, even without calibration it displays a 3_ error of 1.97oC, appropriate to deal with DTM applications. As explained, the sensor is completely compatible with standard CMOS processes, this fact, along with its tiny area and power overhead, makes it specially suitable for the integration in a DTM monitoring system with a collection of on-chip monitors distributed across the chip. The exacerbated process fluctuations carried along with recent technology nodes jeop-ardize the linearity characteristics of the first sensor. In order to overcome these problems, a new temperature inferring technique is proposed. In this case, we also rely on the thermal dependencies of leakage currents that are used to discharge a floating node, but now, the result comes from the ratio of two different measures, in one of which we alter a characteristic of the discharging transistor |the gate voltage. This ratio proves to be very robust against process variations and displays a more than suficient linearity on the temperature |1.17oC 3_ error considering process variations and performing two-point calibration. The implementation of the sensing part based on this new technique implies several issues, such as the generation of process variations independent voltage reference, that are analyzed in depth in the thesis. In order to perform the time-to-digital conversion, we employ the same digitization structure the former sensor used. A completely new standard cell library targeting low area and power overhead is built from scratch to implement the digitization part. Putting all the pieces together, we achieve a complete sensor system that is characterized by ultra low energy per conversion of 48-640pJ and area of 0.0016mm2, this figure outperforms all previous works. To prove this statement, we perform a thorough comparison with over 40 works from the scientific literature. Moving up to the system level, the third contribution is centered on the modeling of a monitoring system consisting of set of thermal sensors distributed across the chip. All previous works from the literature target maximizing the accuracy of the system with the minimum number of monitors. In contrast, we introduce new metrics of quality apart form just the number of sensors; we consider the power consumption, the sampling frequency, the possibility to consider different types of monitors and the interconnection costs. The model is introduced in a simulated annealing algorithm that receives the thermal information of a system, its physical properties, area, power and interconnection constraints and a collection of monitor types; the algorithm yields the selected type of monitor, the number of monitors, their position and the optimum sampling rate. We test the algorithm with the Alpha 21364 processor under several constraint configurations to prove its validity. When compared to other previous works in the literature, the modeling presented here is the most complete. Finally, the last contribution targets the networking level, given an allocated set of temperature monitors, we focused on solving the problem of connecting them in an efficient way from the area and power perspectives. Our first proposal in this area is the introduction of a new interconnection hierarchy level, the threshing level, in between the monitors and the traditional peripheral buses that applies data selectivity to reduce the amount of information that is sent to the central controller. The idea behind this new level is that in this kind of networks most data are useless because from the controller viewpoint just a small amount of data |normally extreme values| is of interest. To cover the new interconnection level, we propose a single-wire monitoring network based on a time-domain signaling scheme that significantly reduces both the switching activity over the wire and the power consumption of the network. This scheme codes the information in the time domain and allows a straightforward obtention of an ordered list of values from the maximum to the minimum. If the scheme is applied to monitors that employ TDC, digitization resource sharing is achieved, producing an important saving in area and power consumption. Two prototypes of complete monitoring systems are presented, they significantly overcome previous works in terms of area and, specially, power consumption.
Resumo:
Neuronal responses are conspicuously variable. We focus on one particular aspect of that variability: the precision of action potential timing. We show that for common models of noisy spike generation, elementary considerations imply that such variability is a function of the input, and can be made arbitrarily large or small by a suitable choice of inputs. Our considerations are expected to extend to virtually any mechanism of spike generation, and we illustrate them with data from the visual pathway. Thus, a simplification usually made in the application of information theory to neural processing is violated: noise is not independent of the message. However, we also show the existence of error-correcting topologies, which can achieve better timing reliability than their components.
Resumo:
O paradigma das redes em chip (NoCs) surgiu a fim de permitir alto grau de integração entre vários núcleos de sistemas em chip (SoCs), cuja comunicação é tradicionalmente baseada em barramentos. As NoCs são definidas como uma estrutura de switches e canais ponto a ponto que interconectam núcleos de propriedades intelectuais (IPs) de um SoC, provendo uma plataforma de comunicação entre os mesmos. As redes em chip sem fio (WiNoCs) são uma abordagem evolucionária do conceito de rede em chip (NoC), a qual possibilita a adoção dos mecanismos de roteamento das NoCs com o uso de tecnologias sem fio, propondo a otimização dos fluxos de tráfego, a redução de conectores e a atuação em conjunto com as NoCs tradicionais, reduzindo a carga nos barramentos. O uso do roteamento dinâmico dentro das redes em chip sem fio permite o desligamento seletivo de partes do hardware, o que reduz a energia consumida. Contudo, a escolha de onde empregar um link sem fio em uma NoC é uma tarefa complexa, dado que os nós são pontes de tráfego os quais não podem ser desligados sem potencialmente quebrar uma rota preestabelecida. Além de fornecer uma visão sobre as arquiteturas de NoCs e do estado da arte do paradigma emergente de WiNoC, este trabalho também propõe um método de avaliação baseado no já consolidado simulador ns-2, cujo objetivo é testar cenários híbridos de NoC e WiNoC. A partir desta abordagem é possível avaliar diferentes parâmetros das WiNoCs associados a aspectos de roteamento, aplicação e número de nós envolvidos em redes hierárquicas. Por meio da análise de tais simulações também é possível investigar qual estratégia de roteamento é mais recomendada para um determinado cenário de utilização, o que é relevante ao se escolher a disposição espacial dos nós em uma NoC. Os experimentos realizados são o estudo da dinâmica de funcionamento dos protocolos ad hoc de roteamento sem fio em uma topologia hierárquica de WiNoC, seguido da análise de tamanho da rede e dos padrões de tráfego na WiNoC.
Resumo:
Nanoparticles offer an ideal platform for the delivery of small molecule drugs, subunit vaccines and genetic constructs. Besides the necessity of a homogenous size distribution, defined loading efficiencies and reasonable production and development costs, one of the major bottlenecks in translating nanoparticles into clinical application is the need for rapid, robust and reproducible development techniques. Within this thesis, microfluidic methods were investigated for the manufacturing, drug or protein loading and purification of pharmaceutically relevant nanoparticles. Initially, methods to prepare small liposomes were evaluated and compared to a microfluidics-directed nanoprecipitation method. To support the implementation of statistical process control, design of experiment models aided the process robustness and validation for the methods investigated and gave an initial overview of the size ranges obtainable in each method whilst evaluating advantages and disadvantages of each method. The lab-on-a-chip system resulted in a high-throughput vesicle manufacturing, enabling a rapid process and a high degree of process control. To further investigate this method, cationic low transition temperature lipids, cationic bola-amphiphiles with delocalized charge centers, neutral lipids and polymers were used in the microfluidics-directed nanoprecipitation method to formulate vesicles. Whereas the total flow rate (TFR) and the ratio of solvent to aqueous stream (flow rate ratio, FRR) was shown to be influential for controlling the vesicle size in high transition temperature lipids, the factor FRR was found the most influential factor controlling the size of vesicles consisting of low transition temperature lipids and polymer-based nanoparticles. The biological activity of the resulting constructs was confirmed by an invitro transfection of pDNA constructs using cationic nanoprecipitated vesicles. Design of experiments and multivariate data analysis revealed the mathematical relationship and significance of the factors TFR and FRR in the microfluidics process to the liposome size, polydispersity and transfection efficiency. Multivariate tools were used to cluster and predict specific in-vivo immune responses dependent on key liposome adjuvant characteristics upon delivery a tuberculosis antigen in a vaccine candidate. The addition of a low solubility model drug (propofol) in the nanoprecipitation method resulted in a significantly higher solubilisation of the drug within the liposomal bilayer, compared to the control method. The microfluidics method underwent scale-up work by increasing the channel diameter and parallelisation of the mixers in a planar way, resulting in an overall 40-fold increase in throughput. Furthermore, microfluidic tools were developed based on a microfluidics-directed tangential flow filtration, which allowed for a continuous manufacturing, purification and concentration of liposomal drug products.
Resumo:
Many-core systems are emerging from the need of more computational power and power efficiency. However there are many issues which still revolve around the many-core systems. These systems need specialized software before they can be fully utilized and the hardware itself may differ from the conventional computational systems. To gain efficiency from many-core system, programs need to be parallelized. In many-core systems the cores are small and less powerful than cores used in traditional computing, so running a conventional program is not an efficient option. Also in Network-on-Chip based processors the network might get congested and the cores might work at different speeds. In this thesis is, a dynamic load balancing method is proposed and tested on Intel 48-core Single-Chip Cloud Computer by parallelizing a fault simulator. The maximum speedup is difficult to obtain due to severe bottlenecks in the system. In order to exploit all the available parallelism of the Single-Chip Cloud Computer, a runtime approach capable of dynamically balancing the load during the fault simulation process is used. The proposed dynamic fault simulation approach on the Single-Chip Cloud Computer shows up to 45X speedup compared to a serial fault simulation approach. Many-core systems can draw enormous amounts of power, and if this power is not controlled properly, the system might get damaged. One way to manage power is to set power budget for the system. But if this power is drawn by just few cores of the many, these few cores get extremely hot and might get damaged. Due to increase in power density multiple thermal sensors are deployed on the chip area to provide realtime temperature feedback for thermal management techniques. Thermal sensor accuracy is extremely prone to intra-die process variation and aging phenomena. These factors lead to a situation where thermal sensor values drift from the nominal values. This necessitates efficient calibration techniques to be applied before the sensor values are used. In addition, in modern many-core systems cores have support for dynamic voltage and frequency scaling. Thermal sensors located on cores are sensitive to the core's current voltage level, meaning that dedicated calibration is needed for each voltage level. In this thesis a general-purpose software-based auto-calibration approach is also proposed for thermal sensors to calibrate thermal sensors on different range of voltages.
Resumo:
International audience
Resumo:
Solid–interstitial fluid interaction, which depends on tissue permeability, is significant to the strain-rate-dependent mechanical behavior of humeral head (shoulder) cartilage. Due to anatomical and biomechanical similarities to that of the human shoulder, kangaroos present a suitable animal model. Therefore, indentation experiments were conducted on kangaroo shoulder cartilage tissues from low (10−4/s) to moderately high (10−2/s) strain-rates. A porohyperelastic model was developed based on the experimental characterization; and a permeability function that takes into account the effect of strain-rate on permeability (strain-rate-dependent permeability) was introduced into the model to investigate the effect of rate-dependent fluid flow on tissue response. The prediction of the model with the strain-rate-dependent permeability was compared with those of the models using constant permeability and strain-dependent permeability. Compared to the model with constant permeability, the models with strain-dependent and strain-rate-dependent permeability were able to better capture the experimental variation at all strain-rates (p<0.05). Significant differences were not identified between models with strain-dependent and strain-rate-dependent permeability at strain-rate of 5×10−3/s (p=0.179). However, at strain-rate of 10−2/s, the model with strain-rate-dependent permeability was significantly better at capturing the experimental results (p<0.005). The findings thus revealed the significance of rate-dependent fluid flow on tissue behavior at large strain-rates, which provides insights into the mechanical deformation mechanisms of cartilage tissues.
Resumo:
The structure and properties of the double-helical form of the alternating copolymer poly(dA-dT) are considered. Different lines of evidence are interpreted in terms of a structure in which every second phosphate-diester linkage has a conformation different from that of the normal B form. A rationale for this “alternating-B” structure is given which provides an explanation for the effects of chemical modifications of the T residues on the binding of the poly(dA-dT)· poly(dA-dT) to the lac repressor of Escherichia coli.
Resumo:
The memory subsystem is a major contributor to the performance, power, and area of complex SoCs used in feature rich multimedia products. Hence, memory architecture of the embedded DSP is complex and usually custom designed with multiple banks of single-ported or dual ported on-chip scratch pad memory and multiple banks of off-chip memory. Building software for such large complex memories with many of the software components as individually optimized software IPs is a big challenge. In order to obtain good performance and a reduction in memory stalls, the data buffers of the application need to be placed carefully in different types of memory. In this paper we present a unified framework (MODLEX) that combines different data layout optimizations to address the complex DSP memory architectures. Our method models the data layout problem as multi-objective genetic algorithm (GA) with performance and power being the objectives and presents a set of solution points which is attractive from a platform design viewpoint. While most of the work in the literature assumes that performance and power are non-conflicting objectives, our work demonstrates that there is significant trade-off (up to 70%) that is possible between power and performance.
Resumo:
Today's feature-rich multimedia products require embedded system solution with complex System-on-Chip (SoC) to meet market expectations of high performance at a low cost and lower energy consumption. The memory architecture of the embedded system strongly influences these parameters. Hence the embedded system designer performs a complete memory architecture exploration. This problem is a multi-objective optimization problem and can be tackled as a two-level optimization problem. The outer level explores various memory architecture while the inner level explores placement of data sections (data layout problem) to minimize memory stalls. Further, the designer would be interested in multiple optimal design points to address various market segments. However, tight time-to-market constraints enforces short design cycle time. In this paper we address the multi-level multi-objective memory architecture exploration problem through a combination of Multi-objective Genetic Algorithm (Memory Architecture exploration) and an efficient heuristic data placement algorithm. At the outer level the memory architecture exploration is done by picking memory modules directly from a ASIC memory Library. This helps in performing the memory architecture exploration in a integrated framework, where the memory allocation, memory exploration and data layout works in a tightly coupled way to yield optimal design points with respect to area, power and performance. We experimented our approach for 3 embedded applications and our approach explores several thousand memory architecture for each application, yielding a few hundred optimal design points in a few hours of computation time on a standard desktop.
Resumo:
Today's SoCs are complex designs with multiple embedded processors, memory subsystems, and application specific peripherals. The memory architecture of embedded SoCs strongly influences the power and performance of the entire system. Further, the memory subsystem constitutes a major part (typically up to 70%) of the silicon area for the current day SoC. In this article, we address the on-chip memory architecture exploration for DSP processors which are organized as multiple memory banks, where banks can be single/dual ported with non-uniform bank sizes. In this paper we propose two different methods for physical memory architecture exploration and identify the strengths and applicability of these methods in a systematic way. Both methods address the memory architecture exploration for a given target application by considering the application's data access characteristics and generates a set of Pareto-optimal design points that are interesting from a power, performance and VLSI area perspective. To the best of our knowledge, this is the first comprehensive work on memory space exploration at physical memory level that integrates data layout and memory exploration to address the system objectives from both hardware design and application software development perspective. Further we propose an automatic framework that explores the design space identifying 100's of Pareto-optimal design points within a few hours of running on a standard desktop configuration.
Resumo:
Advances in technology have increased the number of cores and size of caches present on chip multicore platforms(CMPs). As a result, leakage power consumption of on-chip caches has already become a major power consuming component of the memory subsystem. We propose to reduce leakage power consumption in static nonuniform cache architecture(SNUCA) on a tiled CMP by dynamically varying the number of cache slices used and switching off unused cache slices. A cache slice in a tile includes all cache banks present in that tile. Switched-off cache slices are remapped considering the communication costs to reduce cache usage with minimal impact on execution time. This saves leakage power consumption in switched-off L2 cache slices. On an average, there map policy achieves 41% and 49% higher EDP savings compared to static and dynamic NUCA (DNUCA) cache policies on a scalable tiled CMP, respectively.