11 results for Montgomery multiplication

in Repositório Científico do Instituto Politécnico de Lisboa - Portugal


Relevance:

20.00%

Abstract:

Recent integrated circuit technologies have opened the possibility of designing parallel architectures with hundreds of cores on a single chip. The design space of these parallel architectures is huge, with many architectural options. Exploring the design space becomes even more difficult if, beyond performance and area, we also consider extra metrics such as performance and area efficiency, where the designer tries to design the architecture with the best performance per chip area and the best sustainable performance. In this paper we present an algorithm-oriented approach to designing a many-core architecture. Instead of exploring the design space of the many-core architecture based on experimental execution results for a particular benchmark of algorithms, our approach is to perform a formal analysis of the algorithms that considers the main architectural aspects, and to determine how each architectural aspect relates to the performance of the architecture when running an algorithm or set of algorithms. The architectural aspects considered include the number of cores, the local memory available in each core, the communication bandwidth between the many-core architecture and the external memory, and the memory hierarchy. To exemplify the approach we performed a theoretical analysis of a dense matrix multiplication algorithm and derived an equation that relates the number of execution cycles to the architectural parameters. Based on this equation, a many-core architecture has been designed. The results obtained indicate that a 100 mm² integrated circuit implementing the proposed architecture in a 65 nm technology is able to achieve 464 GFLOPs (double-precision floating-point) for a memory bandwidth of 16 GB/s. This corresponds to a performance efficiency of 71 %.
Considering a 45 nm technology, a 100 mm² chip attains 833 GFLOPs, which corresponds to 84 % of peak performance. These figures are better than those obtained by previous many-core architectures, except for the area efficiency, which is limited by the lower memory bandwidth considered. The results achieved are also better than those of previous state-of-the-art many-core architectures designed specifically to achieve high performance for matrix multiplication.
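
The abstract does not reproduce the paper's cycle equation. The sketch below shows the general kind of relation it describes, but it uses a textbook blocked-matrix-multiplication traffic model instead of the authors' equation. The model makes these assumptions:

* Each b × b tile of C stays in on-chip memory.
* A b × n panel of A and an n × b panel of B are streamed in once per tile.
* Computation and transfers overlap perfectly, so execution time is the larger of compute time and memory time.

The function names and the block size `b` are hypothetical.

```python
def blocked_matmul_traffic_bytes(n, b, word_bytes=8):
    """External-memory traffic for C = A * B (n x n), assuming b x b tiles of C stay on chip."""
    if n % b != 0:
        raise ValueError("n must be a multiple of the block size b")
    tiles = (n // b) ** 2
    # Per tile: stream a b x n panel of A and an n x b panel of B; C is written back once.
    words = tiles * 2 * b * n + n * n
    return words * word_bytes


def estimated_gflops(n, b, peak_gflops, bandwidth_gbs, word_bytes=8):
    """Sustained GFLOPs under a simple max(compute, memory) bound (perfect overlap assumed)."""
    flops = 2 * n ** 3  # one multiply and one add per inner-product term
    t_compute = flops / (peak_gflops * 1e9)
    t_memory = blocked_matmul_traffic_bytes(n, b, word_bytes) / (bandwidth_gbs * 1e9)
    return flops / max(t_compute, t_memory) / 1e9
```

The model shows how the local memory per core interacts with external bandwidth. A larger b needs more on-chip storage but cuts external traffic roughly in proportion to 1/b, which moves the design from memory-bound toward compute-bound. For illustration only, take a peak of 650 GFLOPs, roughly what 464 GFLOPs at 71 % efficiency implies, and 16 GB/s of bandwidth. Under those inputs, `estimated_gflops(1024, 256, 650, 16)` is memory-bound, while `b = 512` reaches the assumed peak.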

Relevance:

20.00%

Abstract:

Sparse matrix-vector multiplication (SMVM) is a fundamental operation in many scientific and engineering applications. In many cases sparse matrices have thousands of rows and columns, most of whose entries are zero, while the non-zero data is scattered across the matrix. This lack of data locality reduces the effectiveness of the data cache in general-purpose processors, considerably reducing their performance efficiency compared with what is achieved for dense matrix multiplication. In this paper, we propose a parallel processing solution for SMVM on a many-core architecture. The architecture is tested with well-known benchmarks on a ZYNQ-7020 FPGA. The architecture is scalable in the number of core elements and is limited only by the available memory bandwidth. It achieves performance efficiencies of up to almost 70% and better performance than previous FPGA designs.
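
The locality problem described above can be seen in a plain SMVM kernel. The sketch below makes three assumptions that the abstract does not confirm:

* The matrix is stored in compressed sparse row (CSR) form. The abstract does not say which format the architecture uses.
* Each core gets a contiguous block of rows with roughly equal numbers of non-zeros. This is one common way to spread the work, not necessarily the paper's scheme.
* The "cores" are simulated sequentially.

All function names are hypothetical.

```python
def spmv_rows(values, col_idx, row_ptr, x, start, end):
    """Rows start..end-1 of y = A @ x, with A in CSR form (values, col_idx, row_ptr)."""
    y = []
    for r in range(start, end):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # Indirect access x[col_idx[k]]: this irregular pattern is what defeats data caches.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def partition_rows_by_nnz(row_ptr, num_cores):
    """Split rows into contiguous blocks holding roughly equal numbers of non-zeros."""
    n_rows, nnz = len(row_ptr) - 1, row_ptr[-1]
    bounds = [0]
    for c in range(1, num_cores):
        target = nnz * c / num_cores
        r = bounds[-1]
        while r < n_rows and row_ptr[r] < target:
            r += 1
        bounds.append(r)
    bounds.append(n_rows)
    return list(zip(bounds[:-1], bounds[1:]))


def parallel_spmv(values, col_idx, row_ptr, x, num_cores):
    """Each simulated core computes its own row block; blocks are concatenated in order."""
    y = []
    for start, end in partition_rows_by_nnz(row_ptr, num_cores):
        y.extend(spmv_rows(values, col_idx, row_ptr, x, start, end))
    return y
```

For example, the matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] is `values = [1.0, 2.0, 3.0, 4.0, 5.0]`, `col_idx = [0, 2, 1, 0, 2]`, `row_ptr = [0, 2, 3, 5]`. Because each core only writes its own rows, adding cores scales the work without synchronization. Throughput is then bounded by how fast `values`, `col_idx` and `x` can be fetched, which matches the abstract's remark that bandwidth is the limit.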

Relevance:

10.00%

Abstract:

A visible/near-infrared optical sensor based on an ITO/SiOx/n-Si structure with internal gain is presented. This surface-barrier structure was fabricated with a low-temperature processing technique. The interface properties and carrier transport were investigated from dark current-voltage and capacitance-voltage characteristics. The multiplication properties were examined under different light excitation and reverse bias conditions. The spectral and pulse response characteristics are analysed. The current amplification mechanism is interpreted as the control of the electron current by the space charge of photogenerated holes near the SiOx/Si interface. The output characteristics of the optical sensor and some possible device applications are presented.

Relevance:

10.00%

Abstract:

The population growth of a Staphylococcus aureus culture, an active colloidal system of spherical cells, was followed by rheological measurements under steady-state and oscillatory shear flows. We observed a rich viscoelastic behavior as a consequence of bacterial activity, namely their multiplication and their density-dependent aggregation properties. In the early stages of growth (lag and exponential phases), the viscosity increases by about a factor of 20, showing several drops followed by full recoveries. This suggests the existence of a percolation phenomenon. Remarkably, as the bacteria reach their late phase of development, in which the population stabilizes, the viscosity returns close to its initial value. This is most probably caused by a change in the bacterial physiological activity and, in particular, by a decrease in their adhesion properties. The viscous and elastic moduli exhibit power-law behaviors compatible with the "soft glassy materials" model, with exponents that depend on the bacterial growth stage. DOI: 10.1103/PhysRevE.87.030701.
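
Power-law exponents like those mentioned above are commonly estimated by a straight-line least-squares fit on log-log axes, because G = A·ω^α becomes log G = log A + α·log ω. The sketch below illustrates only that generic fitting step. It assumes the moduli were sampled at positive angular frequencies ω. It is not the authors' procedure and does not implement the soft glassy materials model itself. The function name is hypothetical.

```python
import math


def fit_power_law(freqs, modulus_values):
    """Least-squares fit of G = A * omega**alpha on log-log axes; returns (A, alpha)."""
    xs = [math.log(w) for w in freqs]
    ys = [math.log(g) for g in modulus_values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                         # slope on log-log axes = power-law exponent
    amplitude = math.exp(mean_y - alpha * mean_x)
    return amplitude, alpha
```

Fitting the elastic and viscous moduli separately at each growth stage would give one exponent per modulus and stage. This is the kind of stage-dependent exponent the abstract reports.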

Relevance:

10.00%

Abstract:

This article presents and discusses some aspects of learning division with natural numbers, focusing on the procedures used by students in a 3rd-grade class when solving division tasks. The results presented are part of a broader investigation whose aim was to understand how students deepen their learning of multiplication from the perspective of developing number sense. The investigation followed a design research methodology, in the form of a teaching experiment. The analysis of the students' written work and of classroom episodes from the whole-class discussions of their solutions to the proposed tasks shows that the students use a variety of procedures and that these evolve significantly over the course of the teaching experiment. This evolution appears to be supported by the characteristics of the tasks, their contexts and numbers, as well as by the link, established from the outset, between division and multiplication. In addition, the use of the rectangular model also appears to have contributed to the progression toward multiplicative procedures based on decomposing one of the factors. The results of the study also show that the evolution and diversity of the procedures used by the students are not unrelated to the classroom environment that was built.

Relevance:

10.00%

Abstract:

Conference: IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Jun 05-07, 2013

Relevance:

10.00%

Abstract:

Dissertation submitted for the degree of Master in Mathematics Education in Preschool Education and in the 1st and 2nd Cycles of Basic Education

Relevance:

10.00%

Abstract:

This paper proposes an efficient, scalable Residue Number System (RNS) architecture that supports moduli sets with an arbitrary number of channels, enabling a larger dynamic range and a higher level of parallelism. The proposed architecture performs both forward and reverse RNS conversion by reusing the arithmetic channel units. The arithmetic operations supported at the channel level include addition, subtraction, and multiplication with accumulation capability. For the reverse conversion, two algorithms are considered, one based on the Chinese Remainder Theorem and the other on Mixed-Radix Conversion, leading to implementations optimized for delay and for circuit area. The proposed architecture provides a complete and compact RNS platform. Experimental results suggest a 17 % improvement in the delay of the arithmetic operations, together with a 23 % area reduction relative to the RNS state of the art. Compared with a binary system, the proposed architecture performs the same computation 20 times faster while using only 10 % of the circuit area resources.
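
The RNS operations named above can be illustrated in software. The sketch below covers forward conversion, channel-wise multiply-accumulate, and reverse conversion by both the Chinese Remainder Theorem (CRT) and Mixed-Radix Conversion (MRC). The moduli list may have any length, mirroring the arbitrary channel count, but the moduli must be pairwise coprime. The test uses the well-known set {2^n − 1, 2^n, 2^n + 1}; the abstract does not say which moduli sets the architecture targets. Function names are hypothetical, and this models the arithmetic only, not the hardware.

```python
from math import prod


def to_rns(x, moduli):
    """Forward conversion: one residue per channel."""
    return tuple(x % m for m in moduli)


def rns_mac(acc, a, b, moduli):
    """Channel-wise multiply-accumulate acc + a*b; channels never exchange carries."""
    return tuple((r + s * t) % m for r, s, t, m in zip(acc, a, b, moduli))


def from_rns_crt(residues, moduli):
    """Reverse conversion via the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        mi = big_m // m
        x += r * mi * pow(mi, -1, m)  # pow(v, -1, m) is the modular inverse (Python 3.8+)
    return x % big_m


def from_rns_mrc(residues, moduli):
    """Reverse conversion via Mixed-Radix Conversion: x = d0 + d1*m0 + d2*m0*m1 + ..."""
    digits = []
    for i, (r, m) in enumerate(zip(residues, moduli)):
        v = r % m
        for j in range(i):
            v = (v - digits[j]) * pow(moduli[j], -1, m) % m
        digits.append(v)
    x, weight = 0, 1
    for d, m in zip(digits, moduli):
        x += d * weight
        weight *= m
    return x
```

The two reverse converters expose the usual trade-off. CRT combines all channels independently but needs a final reduction modulo the full dynamic range M. MRC uses only small per-channel modular operations but must compute its digits one after another.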

Relevance:

10.00%

Abstract:

This paper presents a single-precision floating-point arithmetic unit supporting multiplication, addition, fused multiply-add, reciprocal, square root and inverse square root, with high performance and low resource usage. The design uses a piecewise second-order polynomial approximation to implement the reciprocal, square root and inverse square root. The unit can be configured with any number of operations and can compute any of its functions with a throughput of one operation per cycle. The floating-point multiplier of the unit is also used to implement the polynomial approximation and the fused multiply-add operation. We compared our implementation with other state-of-the-art proposals, including the Xilinx Core-Gen operators, and conclude that the approach has a high relative performance/area efficiency. © 2014 Technical University of Munich (TUM).
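
A piecewise second-order approximation can be sketched in software for the reciprocal. The method:

1. Split the mantissa range [1, 2) into equal segments.
2. Store three coefficients per segment.
3. Evaluate the quadratic in Horner form, which needs two multiply-adds.
4. Apply the exponent separately.

The abstract does not say how the paper fits its coefficients or how many segments it uses. This sketch assumes 64 segments and quadratic interpolation at Chebyshev nodes. The function names are hypothetical.

```python
import math


def _chebyshev_nodes(a, b):
    """Three Chebyshev nodes on [a, b]; they reduce the worst-case interpolation error."""
    mid, half = (a + b) / 2, (b - a) / 2
    return [mid + half * math.cos(math.pi * (2 * k + 1) / 6) for k in range(3)]


def build_table(f, segments=64, lo=1.0, hi=2.0):
    """Per-segment coefficients (c0, c1, c2) of c0 + c1*t + c2*t**2, with t = x - segment_start."""
    width = (hi - lo) / segments
    table = []
    for i in range(segments):
        a = lo + i * width
        x0, x1, x2 = _chebyshev_nodes(a, a + width)
        y0, y1, y2 = f(x0), f(x1), f(x2)
        # Newton divided differences of the interpolating quadratic.
        d01 = (y1 - y0) / (x1 - x0)
        d12 = (y2 - y1) / (x2 - x1)
        d012 = (d12 - d01) / (x2 - x0)
        # Re-expand y0 + d01*(x - x0) + d012*(x - x0)*(x - x1) in powers of t = x - a.
        u0, u1 = x0 - a, x1 - a
        table.append((y0 - d01 * u0 + d012 * u0 * u1, d01 - d012 * (u0 + u1), d012))
    return table


def approx_reciprocal(x, table, lo=1.0, hi=2.0):
    """Approximate 1/x for finite x > 0 using the segment table built for f(m) = 1/m on [1, 2)."""
    if not (x > 0 and math.isfinite(x)):
        raise ValueError("x must be positive and finite")
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    m, e = 2 * m, e - 1           # renormalize the mantissa to [1, 2)
    segments = len(table)
    i = min(int((m - lo) / (hi - lo) * segments), segments - 1)  # index from leading mantissa bits
    c0, c1, c2 = table[i]
    t = m - (lo + i * (hi - lo) / segments)
    y = c0 + t * (c1 + t * c2)    # Horner form: two multiply-adds
    return math.ldexp(y, -e)      # 1/x = (1/m) * 2**-e
```

Typical use is `table = build_table(lambda m: 1.0 / m)` followed by `approx_reciprocal(3.0, table)`. The square root and inverse square root follow the same pattern but also need the exponent's parity handled, for example by approximating over [1, 4); this sketch omits that step.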

Relevance:

10.00%

Abstract:

Dissertation submitted to the Escola Superior de Educação de Lisboa for the degree of Master in Mathematics Education in Preschool Education and in the 1st and 2nd Cycles of Basic Education

Relevance:

10.00%

Abstract:

Single-processor architectures cannot deliver the performance required by high-performance embedded systems. Parallel processing based on general-purpose processors can achieve this performance, but at the cost of a considerable increase in required resources. However, in many cases, simplified, optimized parallel cores can be used instead of general-purpose processors, achieving better performance with lower resource utilization. In this paper, we propose a configurable many-core architecture to serve as a co-processor for high-performance embedded computing on Field-Programmable Gate Arrays. The architecture consists of an array of configurable simple cores with support for floating-point operations, interconnected by a configurable interconnection network. For each core it is possible to configure the size of the internal memory, the supported operations and the number of interfacing ports. The architecture was tested on a ZYNQ-7020 FPGA executing several parallel algorithms. The results show that the proposed many-core architecture achieves better performance than a parallel general-purpose processor and that up to 32 floating-point cores can be implemented in a ZYNQ-7020 SoC FPGA.
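
The three per-core parameters listed above can be captured as a small configuration record with basic validation. The sketch below uses hypothetical names throughout: `CoreConfig`, `KNOWN_OPS` and the memory budget are assumptions, not the architecture's actual configuration interface. The operation names are illustrative only, and the budget check does not model real ZYNQ-7020 resources.

```python
from dataclasses import dataclass

# Hypothetical operation names; the abstract does not list the exact operation set.
KNOWN_OPS = frozenset({"add", "sub", "mul", "div", "sqrt"})


@dataclass(frozen=True)
class CoreConfig:
    """Per-core parameters named in the abstract: local memory size, operations, ports."""
    memory_words: int
    ops: frozenset
    ports: int

    def __post_init__(self):
        if self.memory_words <= 0:
            raise ValueError("memory_words must be positive")
        if self.ports <= 0:
            raise ValueError("ports must be positive")
        unknown = self.ops - KNOWN_OPS
        if unknown:
            raise ValueError(f"unsupported operations: {sorted(unknown)}")


def fits_budget(cores, memory_budget_words):
    """Check a hypothetical on-chip memory budget (in words) for the whole core array."""
    return sum(c.memory_words for c in cores) <= memory_budget_words
```

Making each core a frozen record lets the array be heterogeneous: for instance, a few cores could support `div` and `sqrt` while the rest support only `add` and `mul`. That reflects the idea of trimming each core to what the target algorithm needs.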