Digital Library

986 results for Floating Point Library

Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplication

Relevance:

80.00%

Publisher:

Abstract:

Recent integrated circuit technologies have opened the possibility of designing parallel architectures with hundreds of cores on a single chip. The design space of these parallel architectures is huge, with many architectural options. Exploring the design space becomes even more difficult if, beyond performance and area, we also consider extra metrics such as performance and area efficiency, where the designer seeks the architecture with the best performance per chip area and the best sustainable performance. In this paper we present an algorithm-oriented approach to designing a many-core architecture. Instead of exploring the design space of the many-core architecture based on the experimental execution results of a particular set of benchmark algorithms, our approach is to make a formal analysis of the algorithms considering the main architectural aspects and to determine how each particular architectural aspect is related to the performance of the architecture when running an algorithm or set of algorithms. The architectural aspects considered include the number of cores, the local memory available in each core, the communication bandwidth between the many-core architecture and the external memory, and the memory hierarchy. To exemplify the approach, we carried out a theoretical analysis of a dense matrix multiplication algorithm and derived an equation that relates the number of execution cycles to the architectural parameters. Based on this equation, a many-core architecture has been designed. The results obtained indicate that a 100 mm² integrated circuit design of the proposed architecture, using a 65 nm technology, is able to achieve 464 GFLOPs (double-precision floating-point) for a memory bandwidth of 16 GB/s. This corresponds to a performance efficiency of 71%. Considering a 45 nm technology, a 100 mm² chip attains 833 GFLOPs, which corresponds to 84% of peak performance. These figures are better than those obtained by previous many-core architectures, except for the area efficiency, which is limited by the lower memory bandwidth considered. The results achieved are also better than those of previous state-of-the-art many-core architectures designed specifically to achieve high performance for matrix multiplication.
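The headline figures in this abstract can be cross-checked with simple arithmetic. The sketch below assumes that "performance efficiency" means sustained performance divided by peak performance (the abstract's own phrase about a percentage "of peak performance" suggests this) and that GFLOPs denotes GFLOP/s. The function names are illustrative and are not taken from the paper.

```python
def implied_peak_gflops(sustained_gflops, efficiency):
    """Peak performance implied by a sustained figure and an efficiency in (0, 1]."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return sustained_gflops / efficiency


def flops_per_byte(sustained_gflops, bandwidth_gb_s):
    """Floating-point operations performed per byte of external memory traffic."""
    return sustained_gflops / bandwidth_gb_s


# Figures quoted in the abstract.
peak_65nm = implied_peak_gflops(464.0, 0.71)   # roughly 654 GFLOP/s
peak_45nm = implied_peak_gflops(833.0, 0.84)   # roughly 992 GFLOP/s
intensity_65nm = flops_per_byte(464.0, 16.0)   # 29 FLOP per byte at 16 GB/s
```

Under these assumptions the 65 nm design has to perform about 29 floating-point operations for every byte exchanged with external memory, which is consistent with the abstract listing local memory size and the memory hierarchy among the architectural parameters that matter.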

A many-core co-processor for embedded parallel computing on FPGA

Relevance:

80.00%

Publisher:

Abstract:

Single-processor architectures are unable to provide the performance required by high-performance embedded systems. Parallel processing based on general-purpose processors can achieve this performance, but with a considerable increase in required resources. In many cases, however, simplified, optimized parallel cores can be used instead of general-purpose processors, achieving better performance at lower resource utilization. In this paper, we propose a configurable many-core architecture to serve as a co-processor for high-performance embedded computing on Field-Programmable Gate Arrays. The architecture consists of an array of configurable simple cores with support for floating-point operations, interconnected by a configurable interconnection network. For each core it is possible to configure the size of the internal memory, the supported operations, and the number of interfacing ports. The architecture was tested on a ZYNQ-7020 FPGA in the execution of several parallel algorithms. The results show that the proposed many-core architecture achieves better performance than a parallel general-purpose processor and that up to 32 floating-point cores can be implemented in a ZYNQ-7020 SoC FPGA.

Parallel GPU architecture for hyperspectral unmixing based on augmented Lagrangian method

Relevance:

80.00%

Publisher:

Abstract:

Hyperspectral imaging has become one of the main topics in remote sensing. Hyperspectral images comprise hundreds of spectral bands at different (almost contiguous) wavelength channels over the same area, generating large data volumes of several GB per flight. This high spectral resolution can be used for object detection and to discriminate between different objects based on their spectral characteristics. One of the main problems in hyperspectral analysis is the presence of mixed pixels, which arise when the spatial resolution of the sensor is not able to separate spectrally distinct materials. Spectral unmixing is one of the most important tasks in hyperspectral data exploitation. However, unmixing algorithms can be computationally very expensive and power-hungry, which compromises their use in applications under on-board constraints. In recent years, graphics processing units (GPUs) have evolved into highly parallel and programmable systems. Specifically, several hyperspectral imaging algorithms have been shown to benefit from this hardware, taking advantage of the extremely high floating-point processing performance, compact size, huge memory bandwidth, and relatively low cost of these units, which make them appealing for onboard data processing. In this paper, we propose a parallel implementation of an augmented Lagrangian based method for unsupervised hyperspectral linear unmixing on GPUs using CUDA. The method, called simplex identification via split augmented Lagrangian (SISAL), aims to identify the endmembers of a scene, i.e., it is able to unmix hyperspectral data sets in which the pure pixel assumption is violated. The efficient implementation of the SISAL method presented in this work exploits the GPU architecture at a low level, using shared memory and coalesced accesses to memory.

A Grid Supercomputing Environment for High Demand Computational Applications

Relevance:

80.00%

Publisher:

Abstract:

Despite the huge increase in processor and interprocessor network performance, many computational problems remain unsolved due to a lack of critical resources such as sustained floating-point performance, memory bandwidth, etc. Examples of these problems are found in climate research, biology, astrophysics, high energy physics (Monte Carlo simulations), and artificial intelligence, among others. For some of these problems, the computing resources of a single supercomputing facility can be one or two orders of magnitude below the resources needed to solve them. Supercomputer centers have to face an increasing demand for processing performance, with the direct consequence of an increasing number of processors and systems, resulting in more difficult administration of HPC resources and the need for more physical space, higher electrical power consumption, and improved air conditioning, among other problems. Some of these problems cannot be easily solved, so grid computing, understood as a technology enabling the addition and consolidation of computing power, can help in solving large-scale supercomputing problems. In this document, we describe how two supercomputing facilities in Spain joined their resources to solve a problem of this kind. The objectives of this experience were, among others, to demonstrate that such cooperation can enable the solution of larger problems and to measure the efficiency that could be achieved. In this document we show some preliminary results of this experience and to what extent these objectives were achieved.

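Performance comparison study of multicore and multithreading benchmarks with OpenMP 2.5 and OpenMP 3.0 (Estudi de comparació de rendiment de Benchmark multicore i multithreading amb OpenMP 2.5 i OpenMP 3.0)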

Relevance:

80.00%

Publisher:

Abstract:

Comparative benchmark study of the performance, on two multicore multithreading platforms, of different ways of parallelizing the multiplication of integer matrices and floating-point matrices using the OpenMP shared-memory model, versions 2.5 and 3.0.

Implementation of vector control of a grid inverter on an FPGA (Verkkovaihtosuuntaajan vektorisäädön toteutus FPGA-piirillä)

Relevance:

80.00%

Publisher:

Abstract:

A grid inverter can convert DC voltage to AC voltage and vice versa. Its operation is based on controlling power switches and using a suitable modulation method. In vector control, the inverter currents and voltages are represented in the complex plane, so that the current and voltage components can be expressed as vectors. The grid inverter is then controlled by computing, in the complex plane, vector values that produce the desired vector at the inverter output. Because FPGAs enable fast parallel computation, they are well suited to implementing vector control. Owing to the structure of FPGAs, the design of the control system must take into account a sufficient bit width for fixed-point numbers and the discretization time of the system. This thesis designs a vector control for a grid inverter and studies the effect of bit width on implementing the control on an FPGA. Statistical methods are proposed for examining the bit width. The thesis examines the statistical characteristics and the histogram of the difference between a fixed-point system and a floating-point system. The examination showed that the maximum error by itself does not provide enough information about the distribution of the difference. Consequently, the maximum error is not a suitable method in all situations for determining a sufficient bit precision. The thesis proposes using the sample variance, the mean deviation, and the range as sample statistics for determining a sufficient bit precision.
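The statistical comparison of a fixed-point system against a floating-point reference can be illustrated with a minimal sketch. It assumes round-to-nearest quantization with a chosen number of fractional bits and uses a sine wave as a stand-in signal; the function names, the test signal, and the bit widths are hypothetical and do not come from the thesis.

```python
import math
import statistics


def to_fixed(x, frac_bits):
    """Quantize x to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def error_statistics(samples, frac_bits):
    """Statistics of the difference between fixed-point and floating-point values."""
    errors = [to_fixed(x, frac_bits) - x for x in samples]
    mean = statistics.fmean(errors)
    return {
        "max_abs_error": max(abs(e) for e in errors),
        "sample_variance": statistics.variance(errors),
        "mean_deviation": statistics.fmean([abs(e - mean) for e in errors]),
        "range": max(errors) - min(errors),
    }


# Stand-in signal (an assumption): 200 samples of a sine wave.
signal = [math.sin(0.1 * k) for k in range(200)]
stats_8_bits = error_statistics(signal, 8)
stats_16_bits = error_statistics(signal, 16)
```

The maximum error is bounded by half of the least significant bit in both cases, so on its own it says little about how the error is distributed; the variance, mean deviation, and range describe that distribution.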

Detection of rotor faults in an induction motor using the stator current (Oikosulkumoottorin roottorivaurioiden tunnistaminen staattorivirran avulla)

Relevance:

80.00%

Publisher:

Abstract:

This master's thesis presents methods for detecting a broken rotor bar. The aim of the work is to study rotor faults by means of the stator current. The work is roughly divided into three areas: faults of induction motors, detection of rotor faults, and the signal processing methods used to detect a broken bar. Faults of an induction motor include stator winding damage and rotor damage. Rotor winding faults include the breaking of rotor bars and the detachment of a rotor bar from the end ring. Methods for detecting rotor faults include parameter estimation and current spectrum analysis. The first part of the thesis presents the structure and operation of induction motors. Faults affecting the motor are presented, and methods for detecting rotor faults are sought. Finally, the thesis examines how results obtained from stator measurement data can be processed with the FFT algorithm and how the FFT algorithm can be implemented as an embedded solution on a SHARC processor. The work uses the ADSP-21062 EZ-LAB development environment, with which programs can be run from a RAM chip that interacts with the devices on the SHARC board.

Measuring population diversity in floating-point-encoded evolutionary algorithms (Populaation monimuotoisuuden mittaaminen liukulukukoodatuissa evoluutioalgoritmeissa)

Relevance:

80.00%

Publisher:

Abstract:

This master's thesis presents a method for measuring population diversity in floating-point-encoded evolutionary algorithms and examines its behavior experimentally. Evolutionary algorithms are population-based methods that aim to solve optimization problems. In evolutionary algorithms, controlling population diversity is essential for the search to be sufficiently reliable and, at the same time, sufficiently fast. Measuring diversity is particularly necessary when studying the dynamic behavior of evolutionary algorithms. The thesis examines the measurement of diversity in the search space and in the objective function space. So far there have been no fully satisfactory diversity measures, and the goal of the work is to develop a general-purpose method for measuring the relative and absolute diversity of floating-point-encoded evolutionary algorithms in the search space. The operation and usefulness of the developed measures are examined experimentally by solving optimization problems with the differential evolution algorithm. The implemented measures are based on computing standard deviations from the population. The standard deviations are scaled with respect to either the initial population or the current population, depending on whether absolute or relative diversity is being computed. In the experimental study, the developed measures were found to be functional and useful. Stretching the objective function along the coordinate axes does not affect the behavior of the measure. Rotating the objective function in the coordinate system does not affect the results of the measures either. The time complexity of the presented method depends linearly on the population size, so the measure is fast even with large populations. Relative diversity gives comparable results regardless of the number of parameters or the population size.
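The abstract does not give the formulas, so the following is only a sketch of one way a standard-deviation-based measure scaled by the initial population could look. The per-dimension ratio, the averaging over dimensions, the use of the population standard deviation, and all names are assumptions for illustration, not the thesis's definition.

```python
import statistics


def per_dimension_std(population):
    """Standard deviation of each parameter across the population."""
    return [statistics.pstdev(column) for column in zip(*population)]


def absolute_diversity(population, initial_population):
    """Mean ratio of current to initial per-dimension standard deviation.

    Assumes every dimension of the initial population has nonzero spread.
    """
    current = per_dimension_std(population)
    initial = per_dimension_std(initial_population)
    ratios = [c / i for c, i in zip(current, initial)]
    return sum(ratios) / len(ratios)


# Hypothetical two-parameter populations of three individuals each.
initial = [[0.0, 0.0], [2.0, 10.0], [4.0, 20.0]]
contracted = [[1.0, 5.0], [2.0, 10.0], [3.0, 15.0]]
diversity_now = absolute_diversity(contracted, initial)  # close to 0.5
```

Each individual is visited once per dimension, so the cost grows linearly with population size, and dividing by the initial spread makes this particular sketch insensitive to stretching an axis.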

A Co-Processor Approach for Efficient Java Execution in Embedded Systems

Relevance:

80.00%

Publisher:

Abstract:

This thesis deals with a hardware-accelerated Java virtual machine named REALJava. The REALJava virtual machine is targeted at resource-constrained embedded systems. The goal is to attain increased computational performance with reduced power consumption. While these objectives are often seen as trade-offs, in this context both can be attained simultaneously by using dedicated hardware. The target level of computational performance for the REALJava virtual machine is initially set to match the currently available full-custom ASIC Java processors. As a secondary goal, all components of the virtual machine are designed so that the resulting system can be scaled to support multiple co-processor cores. The virtual machine is designed using the hardware/software co-design paradigm. The partitioning between the two domains is flexible, allowing the resulting system to be customized; for instance, floating-point support can be omitted from the hardware in order to decrease the size of the co-processor core. The communication between the hardware and software domains is encapsulated into modules. This allows the REALJava virtual machine to be easily integrated into any system, simply by redesigning the communication modules. Besides the virtual machine and the related co-processor architecture, several performance-enhancing techniques are presented. These include techniques related to instruction folding, stack handling, method invocation, constant loading, and control in the time domain. The REALJava virtual machine is prototyped on three different FPGA platforms. The original pipeline structure is modified to suit the FPGA environment. The performance of the resulting Java virtual machine is evaluated against existing Java solutions in the embedded systems field. The results show that the goals are attained, both in terms of computational performance and power consumption. The computational performance in particular is evaluated thoroughly, and the results show that REALJava is more than twice as fast as the fastest full-custom ASIC Java processor. In addition to standard Java virtual machine benchmarks, several new Java applications are designed both to verify the results and to broaden the spectrum of the tests.

One chip solution for low-cost active magnetic bearing system

Relevance:

80.00%

Publisher:

Abstract:

In this work, the implementation of an active magnetic bearing control system in a single FPGA is studied. Requirements for the full magnetic bearing control system are reviewed. Different control methods for active magnetic bearings are described briefly. Flux-based and current-based controllers are implemented in an FPGA. The suitability of the controllers for a low-cost magnetic bearing application is studied. Floating-point arithmetic is used in the controllers to ease the design burden and improve calculation precision. The performance of the flux controller is verified with simulations.

Design and synthesis of efficient MAC architectures for high speed decimal processor

Relevance:

80.00%

Publisher:

Abstract:

Most commercial and financial data are stored in decimal form. Recently, support for decimal arithmetic has received increased attention due to its growing importance in financial analysis, banking, tax calculation, currency conversion, insurance, telephone billing, and accounting. Performing decimal arithmetic on systems that do not support decimal computations may give a result with representation error, conversion error, and/or rounding error. In this world of precision, such errors are no longer tolerable. The errors can be eliminated and better accuracy can be achieved if decimal computations are done using Decimal Floating Point (DFP) units. However, the floating-point arithmetic units in today's general-purpose microprocessors are based on the binary number system, and decimal computations are done using binary arithmetic. Only a few common decimal numbers can be exactly represented in Binary Floating Point (BFP). In many cases, the law requires that results generated from financial calculations performed on a computer exactly match manual calculations. Currently, many applications involving fractional decimal data perform decimal computations either in software or with a combination of software and hardware. Performance can be dramatically improved by complete hardware DFP units, and this leads to the design of processors that include DFP hardware. VLSI implementations using the same modular building blocks can decrease system design and manufacturing cost. A multiplexer realization is a natural choice from the viewpoint of cost and speed. This thesis focuses on the design and synthesis of an efficient decimal MAC (Multiply ACcumulate) architecture for high speed decimal processors based on the IEEE Standard for Floating-Point Arithmetic (IEEE 754-2008). The research goal is to design and synthesize decimal MAC architectures to achieve higher performance. Efficient design methods and architectures are developed for a high performance DFP MAC unit as part of this research.
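The representation error described in this abstract is easy to reproduce in software. The sketch below uses Python's standard `decimal` module, which is a software decimal arithmetic and not the hardware DFP unit the thesis designs; the variable names and the invoice-style example are assumptions for illustration.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly.
binary_sum = 0.1 + 0.2
binary_matches = (binary_sum == 0.3)            # False

# Decimal floating point represents these values exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_matches = (decimal_sum == Decimal("0.3"))  # True


def decimal_mac(accumulator, a, b):
    """Multiply-accumulate: a * b + accumulator, rounded once at the end."""
    return a.fma(b, accumulator)


# Hypothetical example: accumulate 3 items at 0.10 each onto a running total.
total = decimal_mac(Decimal("0.00"), Decimal("0.10"), Decimal("3"))
```

`Decimal.fma` computes the product and the sum without rounding the intermediate product, which mirrors the role of a multiply-accumulate unit in a single operation.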

Performance analysis of double digit decimal multiplier on various FPGA logic families

Relevance:

80.00%

Publisher:

Abstract:

Decimal multiplication is an integral part of financial, commercial, and internet-based computations. This paper presents a novel double digit decimal multiplication (DDDM) technique that performs two digit multiplications simultaneously in one clock cycle. This design offers low latency and high throughput. When multiplying two n-digit operands to produce a 2n-digit product, the design has a latency of (n / 2) + 1 cycles. The paper presents area and delay comparisons for 7-digit, 16-digit, and 34-digit double digit decimal multipliers on different families of Xilinx, Altera, Actel, and QuickLogic FPGAs. The multipliers presented can be extended to support decimal floating-point multiplication for the IEEE P754 standard.

Bit-Width Analysis for General Applications

Relevance:

80.00%

Publisher:

Abstract:

It is widely known that a significant fraction of bits are useless or even unused during program execution. Bit-width analysis aims to find the minimum number of bits needed for each variable in the program, ensuring correct execution while saving resources. In this paper, we propose a static analysis method for bit-widths in general applications, which approximates conservatively at compile time and is independent of runtime conditions. While most related work focuses on integer applications, our method is also tailored and applicable to floating-point variables, and could be extended to transform floating-point numbers into fixed-point numbers together with precision analysis. We use more precise representations for the data value ranges of both scalar and array variables. Element-level analysis is carried out for arrays. We also suggest an alternative to the standard fixed-point iterations in bi-directional range analysis. These techniques are implemented on the Trimaran compiler structure and tested on a set of benchmarks to show the results.
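The core idea of range-based bit-width analysis can be shown with a small sketch: propagate conservative value ranges through operations, then convert each range to a minimum bit count. This is a generic illustration under simple assumptions (integer ranges, two's complement for signed values), with hypothetical function names; it is not the paper's algorithm.

```python
def bits_for_range(lo, hi):
    """Minimum bits to hold every integer in [lo, hi]."""
    if lo > hi:
        raise ValueError("empty range")
    if lo >= 0:
        # Unsigned: need the largest value to fit.
        return max(1, hi.bit_length())
    # Signed two's complement: -2**(n-1) <= lo and hi <= 2**(n-1) - 1.
    return max((-lo - 1).bit_length(), max(hi, 0).bit_length()) + 1


def add_ranges(a, b):
    """Conservative range of x + y for x in a and y in b."""
    return (a[0] + b[0], a[1] + b[1])


def mul_ranges(a, b):
    """Conservative range of x * y for x in a and y in b."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


# Two hypothetical 8-bit unsigned inputs.
x = (0, 255)
y = (0, 255)
sum_bits = bits_for_range(*add_ranges(x, y))      # range (0, 510)
product_bits = bits_for_range(*mul_ranges(x, y))  # range (0, 65025)
```

The ranges are conservative by construction: they never exclude a value the variable could take, so the resulting widths are safe even when they are not the tightest possible.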

Improvements in the ray tracing of implicit surfaces based on interval arithmetic

Relevance:

80.00%

Publisher:

Abstract:

Implicit surfaces are useful in many areas of computer graphics. One of their main advantages is that they can easily be used as modeling primitives. Even so, they are not widely used because rendering them takes considerable time. When precise rendering is needed, the best option is ray tracing. However, small parts of the surfaces disappear during rendering. This happens because of the truncation inherent in the floating-point representation used by computers; some bits are lost during the mathematical operations in the intersection algorithms. This thesis presents algorithms to solve these problems. The research is based on Modal Interval Analysis, which includes tools for solving problems with quantified uncertainty. The thesis provides the mathematical foundations needed to develop these algorithms.
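To show why interval methods help with lost surface fragments, the sketch below uses classical interval arithmetic with outward rounding, which is simpler than the Modal Interval Analysis used in the thesis. It encloses the value of an implicit function along a ray segment; if the enclosure excludes zero, the segment provably contains no intersection. The unit sphere, the class, and the function names are assumptions for illustration, and `math.nextafter` requires Python 3.9 or later.

```python
import math


class Interval:
    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def _widen(lo, hi):
        # Step one float outward so rounding error cannot shrink the enclosure.
        return Interval(math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._widen(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval._widen(min(p), max(p))

    def contains_zero(self):
        return self.lo <= 0.0 <= self.hi


def sphere_along_ray(origin, direction, t):
    """Enclose f(o + t*d) for f(x, y, z) = x^2 + y^2 + z^2 - 1 over the interval t."""
    total = Interval(-1.0)
    for o, d in zip(origin, direction):
        coord = Interval(o) + Interval(d) * t
        total = total + coord * coord
    return total


# Ray from (0, 0, -3) along +z; it meets the unit sphere at t = 2 and t = 4.
miss = sphere_along_ray((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), Interval(0.0, 1.0))
maybe_hit = sphere_along_ray((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), Interval(1.5, 2.5))
```

An enclosure that contains zero only says a root is possible, so a ray tracer built this way would subdivide such segments further, while segments whose enclosure excludes zero can be discarded safely.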

Perspex machine XI: topology of the transreal numbers

Relevance:

80.00%

Publisher:

Abstract:

The transreal numbers are a total number system in which every arithmetical operation is well defined everywhere. This has many benefits over the real numbers as a basis for computation and, possibly, for physical theories. We define the topology of the transreal numbers and show that it gives a more coherent interpretation of two's complement arithmetic than the conventional integer model. Trans-two's-complement arithmetic handles the infinities and 0/0 more coherently, and with very much less circuitry, than floating-point arithmetic. This reduction in circuitry is especially beneficial in parallel computers, such as the Perspex machine, and the increase in functionality makes Digital Signal Processing chips better suited to general computation.
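What a total division looks like can be sketched in a few lines. The sketch assumes the usual transreal definitions, 1/0 = ∞, −1/0 = −∞ and 0/0 = Φ (nullity), with nullity absorbing in every operation and division by an infinity treated as multiplication by zero. The sentinel class and function name are hypothetical, and the code illustrates only the number system, not the trans-two's-complement circuitry discussed in the paper.

```python
import math


class _Nullity:
    """Hypothetical sentinel for the transreal number nullity (0/0)."""

    def __repr__(self):
        return "nullity"


NULLITY = _Nullity()


def transreal_div(a, b):
    """Division defined for every pair of transreal operands."""
    if a is NULLITY or b is NULLITY:
        return NULLITY                  # nullity absorbs
    if b == 0:
        if a > 0:
            return math.inf             # positive / 0
        if a < 0:
            return -math.inf            # negative / 0
        return NULLITY                  # 0 / 0
    if math.isinf(b):
        # The reciprocal of an infinity is 0, and infinity * 0 is nullity.
        return NULLITY if math.isinf(a) else 0.0
    return a / b


quotients = [transreal_div(1, 0), transreal_div(-1, 0), transreal_div(0, 0)]
```

Unlike IEEE 754 NaN, which compares unequal to itself, the single nullity object here compares equal to itself, which matches the idea of nullity being an ordinary number in the system.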
