919 resultados para Proposed architectures
Resumo:
Miniaturization of devices and the ensuing decrease in the threshold voltage has led to a substantial increase in the leakage component of the total processor energy consumption. Relatively simpler issue logic and the presence of a large number of function units in the VLIW and the clustered VLIW architectures attribute a large fraction of this leakage energy consumption in the functional units. However, functional units are not fully utilized in the VLIW architectures because of the inherent variations in the ILP of the programs. This underutilization is even more pronounced in the context of clustered VLIW architectures because of the contentions for the limited number of slow intercluster communication channels which lead to many short idle cycles.In the past, some architectural schemes have been proposed to obtain leakage energy bene .ts by aggressively exploiting the idleness of functional units. However, presence of many short idle cycles cause frequent transitions from the active mode to the sleep mode and vice-versa and adversely a ffects the energy benefits of a purely hardware based scheme. In this paper, we propose and evaluate a compiler instruction scheduling algorithm that assist such a hardware based scheme in the context of VLIW and clustered VLIW architectures. The proposed scheme exploits the scheduling slacks of instructions to orchestrate the functional unit mapping with the objective of reducing the number of transitions in functional units thereby keeping them off for a longer duration. The proposed compiler-assisted scheme obtains a further 12% reduction of energy consumption of functional units with negligible performance degradation over a hardware-only scheme for a VLIW architecture. The benefits are 15% and 17% in the context of a 2-clustered and a 4-clustered VLIW architecture respectively. Our test bed uses the Trimaran compiler infrastructure.
Resumo:
To harvest solar energy more efficiently, novel Ag2S/Bi2WO6 heterojunctions were synthesized by a hydrothermal route. This novel photocatalyst was synthesized by impregnating Ag2S into a Bi2WO6 semiconductor by a hydrothermal route without any surfactants or templates. The as prepared structures were characterized by multiple techniques such as X-ray diffraction (XRD), X-ray photoelectron spectroscopy (XPS), Brunauer-Emmet-Teller (BET) analysis, scanning electron microscopy (SEM), transmission electron microscopy (TEM), energy dispersive X-ray spectrometry (EDS), UV-vis diffuse reflection spectroscopy (DRS) and photoluminescence (PL). The characterization results suggest mesoporous hierarchical spherical structures with a high surface area and improved photo response in the visible spectrum. Compared to bare Bi2WO6, Ag2S/Bi2WO6 exhibited much higher photocatalytic activity towards the degradation of dye Rhodamine B (RhB). Although silver based catalysts are easily eroded by photogenerated holes, the Ag2S/Bi2WO6 photocatalyst was found to be highly stable in the cyclic experiments. Based on the results of BET, Pl and DRS analysis, two possible reasons have been proposed for the enhanced visible light activity and stability of this novel photocatalyst: (1) broadening of the photoabsorption range and (2) efficient separation of photoinduced charge carriers which does not allow the photoexcited electrons to accumulate on the conduction band of Ag2S and hence prevents the photocorrosion.
Resumo:
We investigate the problem of timing recovery for 2-D magnetic recording (TDMR) channels. We develop a timing error model for TDMR channel considering the phase and frequency offsets with noise. We propose a 2-D data-aided phase-locked loop (PLL) architecture for tracking variations in the position and movement of the read head in the down-track and cross-track directions and analyze the convergence of the algorithm under non-separable timing errors. We further develop a 2-D interpolation-based timing recovery scheme that works in conjunction with the 2-D PLL. We quantify the efficiency of our proposed algorithms by simulations over a 2-D magnetic recording channel with timing errors.
Resumo:
Sheaflike terbium phosphate hydrate hierarchical architectures composed of filamentary nanorods have been fabricated by a hydrothermal method. The X-ray diffraction patterns and thermogravimetric/differential thermal analysis investigations reveal that the obtained terbium phosphate hydrate has a structural formula of TbPO4 center dot H2O, which can be readily indexed to the hexagonal phase GdPO4 center dot nH(2)O in JCPDS file 39-0232. The evolution of the morphology of the products has been investigated in detail. It is found that the addition of CTAB and Na2H2L (disodium ethylenediamine tetraacetate) plays an important role in controlling the final morphology of the products. A possible formation mechanism of the sheaflike architectures was proposed according to the experimental results and analysis. In addition, the phase structure of the product changes to monoclinic phase when it is annealed at 750 degrees C for 2 h in N-2-H-2 atmosphere. Tetragonal chase TbPO4 can be obtained when annealed temperature increases to 1150 degrees C.
Resumo:
The Expectation-Maximization (EM) algorithm is an iterative approach to maximum likelihood parameter estimation. Jordan and Jacobs (1993) recently proposed an EM algorithm for the mixture of experts architecture of Jacobs, Jordan, Nowlan and Hinton (1991) and the hierarchical mixture of experts architecture of Jordan and Jacobs (1992). They showed empirically that the EM algorithm for these architectures yields significantly faster convergence than gradient ascent. In the current paper we provide a theoretical analysis of this algorithm. We show that the algorithm can be regarded as a variable metric algorithm with its searching direction having a positive projection on the gradient of the log likelihood. We also analyze the convergence of the algorithm and provide an explicit expression for the convergence rate. In addition, we describe an acceleration technique that yields a significant speedup in simulation experiments.
Resumo:
Great demand in power optimized devices shows promising economic potential and draws lots of attention in industry and research area. Due to the continuously shrinking CMOS process, not only dynamic power but also static power has emerged as a big concern in power reduction. Other than power optimization, average-case power estimation is quite significant for power budget allocation but also challenging in terms of time and effort. In this thesis, we will introduce a methodology to support modular quantitative analysis in order to estimate average power of circuits, on the basis of two concepts named Random Bag Preserving and Linear Compositionality. It can shorten simulation time and sustain high accuracy, resulting in increasing the feasibility of power estimation of big systems. For power saving, firstly, we take advantages of the low power characteristic of adiabatic logic and asynchronous logic to achieve ultra-low dynamic and static power. We will propose two memory cells, which could run in adiabatic and non-adiabatic mode. About 90% dynamic power can be saved in adiabatic mode when compared to other up-to-date designs. About 90% leakage power is saved. Secondly, a novel logic, named Asynchronous Charge Sharing Logic (ACSL), will be introduced. The realization of completion detection is simplified considerably. Not just the power reduction improvement, ACSL brings another promising feature in average power estimation called data-independency where this characteristic would make power estimation effortless and be meaningful for modular quantitative average case analysis. Finally, a new asynchronous Arithmetic Logic Unit (ALU) with a ripple carry adder implemented using the logically reversible/bidirectional characteristic exhibiting ultra-low power dissipation with sub-threshold region operating point will be presented. The proposed adder is able to operate multi-functionally.
Resumo:
Along with the growing demand for cryptosystems in systems ranging from large servers to mobile devices, suitable cryptogrophic protocols for use under certain constraints are becoming more and more important. Constraints such as calculation time, area, efficiency and security, must be considered by the designer. Elliptic curves, since their introduction to public key cryptography in 1985 have challenged established public key and signature generation schemes such as RSA, offering more security per bit. Amongst Elliptic curve based systems, pairing based cryptographies are thoroughly researched and can be used in many public key protocols such as identity based schemes. For hardware implementions of pairing based protocols, all components which calculate operations over Elliptic curves can be considered. Designers of the pairing algorithms must choose calculation blocks and arrange the basic operations carefully so that the implementation can meet the constraints of time and hardware resource area. This thesis deals with different hardware architectures to accelerate the pairing based cryptosystems in the field of characteristic two. Using different top-level architectures the hardware efficiency of operations that run at different times is first considered in this thesis. Security is another important aspect of pairing based cryptography to be considered in practically Side Channel Analysis (SCA) attacks. The naively implemented hardware accelerators for pairing based cryptographies can be vulnerable when taking the physical analysis attacks into consideration. This thesis considered the weaknesses in pairing based public key cryptography and addresses the particular calculations in the systems that are insecure. In this case, countermeasures should be applied to protect the weak link of the implementation to improve and perfect the pairing based algorithms. Some important rules that the designers must obey to improve the security of the cryptosystems are proposed. According to these rules, three countermeasures that protect the pairing based cryptosystems against SCA attacks are applied. The implementations of the countermeasures are presented and their performances are investigated.
Resumo:
We propose a data flow based run time system as an efficient tool for supporting execution of parallel code on heterogeneous architectures hosting both multicore CPUs and GPUs. We discuss how the proposed run time system may be the target of both structured parallel applications developed using algorithmic skeletons/parallel design patterns and also more "domain specific" programming models. Experimental results demonstrating the feasibility of the approach are presented. © 2012 World Scientific Publishing Company.
Resumo:
A dissertação de doutoramento apresentada insere-se na área de electrónica não-linear de rádio-frequência (RF), UHF e microondas, tendo como principal campo de acção o estudo da distorção nãolinear em arquitecturas de recepção rádio, nomeadamente receptores de conversão directa como Power Meters, RFID (Radio Frequency IDentification) ou SDR (Software Define Radio) front-ends. Partindo de um estudo exaustivo das actuais arquitecturas de recepção de radiofrequência e revendo todos os conceitos teóricos relacionados com o desempenho não-linear dos sistemas/componentes electrónicos, foram desenvolvidos algoritmos matemáticos de modulação dos comportamentos não-lineares destas arquitecturas, simulados e testados em laboratório e propostas novas arquitecturas para a minimização ou cancelamento do impacto negativo de grandes interferidores em frequências vizinhas ao do sistema pretendido.
Resumo:
The continuous demand for highly efficient wireless transmitter systems has triggered an increased interest in switching mode techniques to handle the required power amplification. The RF carrier amplitude-burst transmitter, i.e. a wireless transmitter chain where a phase-modulated carrier is modulated in amplitude in an on-off mode, according to some prescribed envelope-to-time conversion, such as pulse-width or sigma-delta modulation, constitutes a promising architecture capable of efficiently transmitting signals of highly demanding complex modulation schemes. However, the tested practical implementations present results that are way behind the theoretically advanced promises (perfect linearity and efficiency). My original contribution to knowledge presented in this thesis is the first thorough study and model of the power efficiency and linearity characteristics that can be actually achieved with this architecture. The analysis starts with a brief revision of the theoretical idealized behavior of these switched-mode amplifier systems, followed by the study of the many sources of impairments that appear when the real system is implemented. In particular, a special attention is paid to the dynamic load modulation caused by the often ignored interaction between the narrowband signal reconstruction filter and the usual single-ended switched-mode power amplifier, which, among many other performance impairments, forces a two transistor implementation. The performance of this architecture is clearly explained based on the presented theory, which is supported by simulations and corresponding measured results of a fully working implementation. The drawn conclusions allow the development of a set of design rules for future improvements, one of which is proposed and verified in this thesis. It suggests a significant modification to this traditional architecture, where now the phase modulated carrier is always on – and thus allowing a single transistor implementation – and the amplitude is impressed into the carrier phase according to a bi-phase code.
Resumo:
Recent integrated circuit technologies have opened the possibility to design parallel architectures with hundreds of cores on a single chip. The design space of these parallel architectures is huge with many architectural options. Exploring the design space gets even more difficult if, beyond performance and area, we also consider extra metrics like performance and area efficiency, where the designer tries to design the architecture with the best performance per chip area and the best sustainable performance. In this paper we present an algorithm-oriented approach to design a many-core architecture. Instead of doing the design space exploration of the many core architecture based on the experimental execution results of a particular benchmark of algorithms, our approach is to make a formal analysis of the algorithms considering the main architectural aspects and to determine how each particular architectural aspect is related to the performance of the architecture when running an algorithm or set of algorithms. The architectural aspects considered include the number of cores, the local memory available in each core, the communication bandwidth between the many-core architecture and the external memory and the memory hierarchy. To exemplify the approach we did a theoretical analysis of a dense matrix multiplication algorithm and determined an equation that relates the number of execution cycles with the architectural parameters. Based on this equation a many-core architecture has been designed. The results obtained indicate that a 100 mm(2) integrated circuit design of the proposed architecture, using a 65 nm technology, is able to achieve 464 GFLOPs (double precision floating-point) for a memory bandwidth of 16 GB/s. This corresponds to a performance efficiency of 71 %. Considering a 45 nm technology, a 100 mm(2) chip attains 833 GFLOPs which corresponds to 84 % of peak performance These figures are better than those obtained by previous many-core architectures, except for the area efficiency which is limited by the lower memory bandwidth considered. The results achieved are also better than those of previous state-of-the-art many-cores architectures designed specifically to achieve high performance for matrix multiplication.
Resumo:
Les tâches de vision artificielle telles que la reconnaissance d’objets demeurent irrésolues à ce jour. Les algorithmes d’apprentissage tels que les Réseaux de Neurones Artificiels (RNA), représentent une approche prometteuse permettant d’apprendre des caractéristiques utiles pour ces tâches. Ce processus d’optimisation est néanmoins difficile. Les réseaux profonds à base de Machine de Boltzmann Restreintes (RBM) ont récemment été proposés afin de guider l’extraction de représentations intermédiaires, grâce à un algorithme d’apprentissage non-supervisé. Ce mémoire présente, par l’entremise de trois articles, des contributions à ce domaine de recherche. Le premier article traite de la RBM convolutionelle. L’usage de champs réceptifs locaux ainsi que le regroupement d’unités cachées en couches partageant les même paramètres, réduit considérablement le nombre de paramètres à apprendre et engendre des détecteurs de caractéristiques locaux et équivariant aux translations. Ceci mène à des modèles ayant une meilleure vraisemblance, comparativement aux RBMs entraînées sur des segments d’images. Le deuxième article est motivé par des découvertes récentes en neurosciences. Il analyse l’impact d’unités quadratiques sur des tâches de classification visuelles, ainsi que celui d’une nouvelle fonction d’activation. Nous observons que les RNAs à base d’unités quadratiques utilisant la fonction softsign, donnent de meilleures performances de généralisation. Le dernière article quand à lui, offre une vision critique des algorithmes populaires d’entraînement de RBMs. Nous montrons que l’algorithme de Divergence Contrastive (CD) et la CD Persistente ne sont pas robustes : tous deux nécessitent une surface d’énergie relativement plate afin que leur chaîne négative puisse mixer. La PCD à "poids rapides" contourne ce problème en perturbant légèrement le modèle, cependant, ceci génère des échantillons bruités. L’usage de chaînes tempérées dans la phase négative est une façon robuste d’adresser ces problèmes et mène à de meilleurs modèles génératifs.
Resumo:
Les systèmes multiprocesseurs sur puce électronique (On-Chip Multiprocessor [OCM]) sont considérés comme les meilleures structures pour occuper l'espace disponible sur les circuits intégrés actuels. Dans nos travaux, nous nous intéressons à un modèle architectural, appelé architecture isométrique de systèmes multiprocesseurs sur puce, qui permet d'évaluer, de prédire et d'optimiser les systèmes OCM en misant sur une organisation efficace des nœuds (processeurs et mémoires), et à des méthodologies qui permettent d'utiliser efficacement ces architectures. Dans la première partie de la thèse, nous nous intéressons à la topologie du modèle et nous proposons une architecture qui permet d'utiliser efficacement et massivement les mémoires sur la puce. Les processeurs et les mémoires sont organisés selon une approche isométrique qui consiste à rapprocher les données des processus plutôt que d'optimiser les transferts entre les processeurs et les mémoires disposés de manière conventionnelle. L'architecture est un modèle maillé en trois dimensions. La disposition des unités sur ce modèle est inspirée de la structure cristalline du chlorure de sodium (NaCl), où chaque processeur peut accéder à six mémoires à la fois et où chaque mémoire peut communiquer avec autant de processeurs à la fois. Dans la deuxième partie de notre travail, nous nous intéressons à une méthodologie de décomposition où le nombre de nœuds du modèle est idéal et peut être déterminé à partir d'une spécification matricielle de l'application qui est traitée par le modèle proposé. Sachant que la performance d'un modèle dépend de la quantité de flot de données échangées entre ses unités, en l'occurrence leur nombre, et notre but étant de garantir une bonne performance de calcul en fonction de l'application traitée, nous proposons de trouver le nombre idéal de processeurs et de mémoires du système à construire. Aussi, considérons-nous la décomposition de la spécification du modèle à construire ou de l'application à traiter en fonction de l'équilibre de charge des unités. Nous proposons ainsi une approche de décomposition sur trois points : la transformation de la spécification ou de l'application en une matrice d'incidence dont les éléments sont les flots de données entre les processus et les données, une nouvelle méthodologie basée sur le problème de la formation des cellules (Cell Formation Problem [CFP]), et un équilibre de charge de processus dans les processeurs et de données dans les mémoires. Dans la troisième partie, toujours dans le souci de concevoir un système efficace et performant, nous nous intéressons à l'affectation des processeurs et des mémoires par une méthodologie en deux étapes. Dans un premier temps, nous affectons des unités aux nœuds du système, considéré ici comme un graphe non orienté, et dans un deuxième temps, nous affectons des valeurs aux arcs de ce graphe. Pour l'affectation, nous proposons une modélisation des applications décomposées en utilisant une approche matricielle et l'utilisation du problème d'affectation quadratique (Quadratic Assignment Problem [QAP]). Pour l'affectation de valeurs aux arcs, nous proposons une approche de perturbation graduelle, afin de chercher la meilleure combinaison du coût de l'affectation, ceci en respectant certains paramètres comme la température, la dissipation de chaleur, la consommation d'énergie et la surface occupée par la puce. Le but ultime de ce travail est de proposer aux architectes de systèmes multiprocesseurs sur puce une méthodologie non traditionnelle et un outil systématique et efficace d'aide à la conception dès la phase de la spécification fonctionnelle du système.
Resumo:
This paper deals with the key issues encountered in testing during the development of high-speed networking hardware systems by documenting a practical method for "real-life like" testing. The proposed method is empowered by modern and commonly available Field Programmable Gate Array (FPGA) technology. Innovative application of standard FPGA blocks in combination with reconfigurability are used as a back-bone of the method. A detailed elaboration of the method is given so as to serve as a general reference. The method is fully characterised and compared to alternatives through a case study proving it to be the most efficient and effective one at a reasonable cost.
Resumo:
Proposed is a unique cell histogram architecture which will process k data items in parallel to compute 2q histogram bins per time step. An array of m/2q cells computes an m-bin histogram with a speed-up factor of k; k ⩾ 2 makes it faster than current dual-ported memory implementations. Furthermore, simple mechanisms for conflict-free storing of the histogram bins into an external memory array are discussed.