15 resultados para Federal High Performance Computing Program (U.S.)
em Repositório Científico do Instituto Politécnico de Lisboa - Portugal
Resumo:
Floating-point computing with more than one TFLOP of peak performance is already a reality in recent Field-Programmable Gate Arrays (FPGA). General-Purpose Graphics Processing Units (GPGPU) and recent many-core CPUs have also taken advantage of the recent technological innovations in integrated circuit (IC) design and had also dramatically improved their peak performances. In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of algorithms belonging to different scientific application domains. Trends in peak performance, power consumption and sustained performances, for particular applications, show that FPGAs are increasing the gap to GPUs and many-core CPUs moving them away from high-performance computing with intensive floating-point calculations. FPGAs become competitive for custom floating-point or fixed-point representations, for smaller input sizes of certain algorithms, for combinational logic problems and parallel map-reduce problems. © 2014 Technical University of Munich (TUM).
Resumo:
A new high performance architecture for the computation of all the DCT operations adopted in the H.264/AVC and HEVC standards is proposed in this paper. Contrasting to other dedicated transform cores, the presented multi-standard transform architecture is supported on a completely configurable, scalable and unified structure, that is able to compute not only the forward and the inverse 8×8 and 4×4 integer DCTs and the 4×4 and 2×2 Hadamard transforms defined in the H.264/AVC standard, but also the 4×4, 8×8, 16×16 and 32×32 integer transforms adopted in HEVC. Experimental results obtained using a Xilinx Virtex-7 FPGA demonstrated the superior performance and hardware efficiency levels provided by the proposed structure, which outperforms its more prominent related designs by at least 1.8 times. When integrated in a multi-core embedded system, this architecture allows the computation, in real-time, of all the transforms mentioned above for resolutions as high as the 8k Ultra High Definition Television (UHDTV) (7680×4320 @ 30fps).
Resumo:
A unified architecture for fast and efficient computation of the set of two-dimensional (2-D) transforms adopted by the most recent state-of-the-art digital video standards is presented in this paper. Contrasting to other designs with similar functionality, the presented architecture is supported on a scalable, modular and completely configurable processing structure. This flexible structure not only allows to easily reconfigure the architecture to support different transform kernels, but it also permits its resizing to efficiently support transforms of different orders (e. g. order-4, order-8, order-16 and order-32). Consequently, not only is it highly suitable to realize high-performance multi-standard transform cores, but it also offers highly efficient implementations of specialized processing structures addressing only a reduced subset of transforms that are used by a specific video standard. The experimental results that were obtained by prototyping several configurations of this processing structure in a Xilinx Virtex-7 FPGA show the superior performance and hardware efficiency levels provided by the proposed unified architecture for the implementation of transform cores for the Advanced Video Coding (AVC), Audio Video coding Standard (AVS), VC-1 and High Efficiency Video Coding (HEVC) standards. In addition, such results also demonstrate the ability of this processing structure to realize multi-standard transform cores supporting all the standards mentioned above and that are capable of processing the 8k Ultra High Definition Television (UHDTV) video format (7,680 x 4,320 at 30 fps) in real time.
Resumo:
O desenvolvimento actual de aplicações paralelas com processamento intensivo (HPC - High Performance Computing) para alojamento em computadores organizados em Cluster baseia-se muito no modelo de passagem de mensagens, do qual é de realçar os esforços de definição de standards, por exemplo, MPI - Message - Passing Interface. Por outro lado, com a generalização do paradigma de programação orientado aos objectos para ambientes distribuídos (Java RMI, .NET Remoting), existe a possibilidade de considerar que a execução de uma aplicação, de processamento paralelo e intensivo, pode ser decomposta em vários fluxos de execução paralela, em que cada fluxo é constituído por uma ou mais tarefas executadas no contexto de objectos distribuídos. Normalmente, em ambientes baseados em objectos distribuídos, a especificação, controlo e sincronização dos vários fluxos de execução paralela, é realizada de forma explicita e codificada num programa principal (hard-coded), dificultando possíveis e necessárias modificações posteriores. No entanto, existem, neste contexto, trabalhos que propõem uma abordagem de decomposição, seguindo o paradigma de workflow com interacções entre as tarefas por, entre outras, data-flow, control-flow, finite - state - machine. Este trabalho consistiu em propor e explorar um modelo de execução, sincronização e controlo de múltiplas tarefas, que permita de forma flexível desenhar aplicações de processamento intensivo, tirando partido da execução paralela de tarefas em diferentes máquinas. O modelo proposto e consequente implementação, num protótipo experimental, permite: especificar aplicações usando fluxos de execução; submeter fluxos para execução e controlar e monitorizar a execução desses fluxos. As tarefas envolvidas nos fluxos de execução podem executar-se num conjunto de recursos distribuídos. As principais características a realçar no modelo proposto, são a expansibilidade e o desacoplamento entre as diferentes componentes envolvidas na execução dos fluxos de execução. São ainda descritos casos de teste que permitiram validar o modelo e o protótipo implementado. Tendo consciência da necessidade de continuar no futuro esta linha de investigação, este trabalho é um contributo para demonstrar que o paradigma de workflow é adequado para expressar e executar, de forma paralela e distribuída, aplicações complexas de processamento intensivo.
Resumo:
Hyperspectral imaging can be used for object detection and for discriminating between different objects based on their spectral characteristics. One of the main problems of hyperspectral data analysis is the presence of mixed pixels, due to the low spatial resolution of such images. This means that several spectrally pure signatures (endmembers) are combined into the same mixed pixel. Linear spectral unmixing follows an unsupervised approach which aims at inferring pure spectral signatures and their material fractions at each pixel of the scene. The huge data volumes acquired by such sensors put stringent requirements on processing and unmixing methods. This paper proposes an efficient implementation of a unsupervised linear unmixing method on GPUs using CUDA. The method finds the smallest simplex by solving a sequence of nonsmooth convex subproblems using variable splitting to obtain a constraint formulation, and then applying an augmented Lagrangian technique. The parallel implementation of SISAL presented in this work exploits the GPU architecture at low level, using shared memory and coalesced accesses to memory. The results herein presented indicate that the GPU implementation can significantly accelerate the method's execution over big datasets while maintaining the methods accuracy.
Resumo:
Remote hyperspectral sensors collect large amounts of data per flight usually with low spatial resolution. It is known that the bandwidth connection between the satellite/airborne platform and the ground station is reduced, thus a compression onboard method is desirable to reduce the amount of data to be transmitted. This paper presents a parallel implementation of an compressive sensing method, called parallel hyperspectral coded aperture (P-HYCA), for graphics processing units (GPU) using the compute unified device architecture (CUDA). This method takes into account two main properties of hyperspectral dataset, namely the high correlation existing among the spectral bands and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. Experimental results conducted using synthetic and real hyperspectral datasets on two different GPU architectures by NVIDIA: GeForce GTX 590 and GeForce GTX TITAN, reveal that the use of GPUs can provide real-time compressive sensing performance. The achieved speedup is up to 20 times when compared with the processing time of HYCA running on one core of the Intel i7-2600 CPU (3.4GHz), with 16 Gbyte memory.
Resumo:
The application of compressive sensing (CS) to hyperspectral images is an active area of research over the past few years, both in terms of the hardware and the signal processing algorithms. However, CS algorithms can be computationally very expensive due to the extremely large volumes of data collected by imaging spectrometers, a fact that compromises their use in applications under real-time constraints. This paper proposes four efficient implementations of hyperspectral coded aperture (HYCA) for CS, two of them termed P-HYCA and P-HYCA-FAST and two additional implementations for its constrained version (CHYCA), termed P-CHYCA and P-CHYCA-FAST on commodity graphics processing units (GPUs). HYCA algorithm exploits the high correlation existing among the spectral bands of the hyperspectral data sets and the generally low number of endmembers needed to explain the data, which largely reduces the number of measurements necessary to correctly reconstruct the original data. The proposed P-HYCA and P-CHYCA implementations have been developed using the compute unified device architecture (CUDA) and the cuFFT library. Moreover, this library has been replaced by a fast iterative method in the P-HYCA-FAST and P-CHYCA-FAST implementations that leads to very significant speedup factors in order to achieve real-time requirements. The proposed algorithms are evaluated not only in terms of reconstruction error for different compressions ratios but also in terms of computational performance using two different GPU architectures by NVIDIA: 1) GeForce GTX 590; and 2) GeForce GTX TITAN. Experiments are conducted using both simulated and real data revealing considerable acceleration factors and obtaining good results in the task of compressing remotely sensed hyperspectral data sets.
Resumo:
Parallel hyperspectral unmixing problem is considered in this paper. A semisupervised approach is developed under the linear mixture model, where the abundance's physical constraints are taken into account. The proposed approach relies on the increasing availability of spectral libraries of materials measured on the ground instead of resorting to endmember extraction methods. Since Libraries are potentially very large and hyperspectral datasets are of high dimensionality a parallel implementation in a pixel-by-pixel fashion is derived to properly exploits the graphics processing units (GPU) architecture at low level, thus taking full advantage of the computational power of GPUs. Experimental results obtained for real hyperspectral datasets reveal significant speedup factors, up to 164 times, with regards to optimized serial implementation.
Resumo:
Hyperspectral instruments have been incorporated in satellite missions, providing data of high spectral resolution of the Earth. This data can be used in remote sensing applications, such as, target detection, hazard prevention, and monitoring oil spills, among others. In most of these applications, one of the requirements of paramount importance is the ability to give real-time or near real-time response. Recently, onboard processing systems have emerged, in order to overcome the huge amount of data to transfer from the satellite to the ground station, and thus, avoiding delays between hyperspectral image acquisition and its interpretation. For this purpose, compact reconfigurable hardware modules, such as field programmable gate arrays (FPGAs) are widely used. This paper proposes a parallel FPGA-based architecture for endmember’s signature extraction. This method based on the Vertex Component Analysis (VCA) has several advantages, namely it is unsupervised, fully automatic, and it works without dimensionality reduction (DR) pre-processing step. The architecture has been designed for a low cost Xilinx Zynq board with a Zynq-7020 SoC FPGA based on the Artix-7 FPGA programmable logic and tested using real hyperspectral data sets collected by the NASA’s Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) over the Cuprite mining district in Nevada. Experimental results indicate that the proposed implementation can achieve real-time processing, while maintaining the methods accuracy, which indicate the potential of the proposed platform to implement high-performance, low cost embedded systems, opening new perspectives for onboard hyperspectral image processing.
Resumo:
In this paper, we develop a fast implementation of an hyperspectral coded aperture (HYCA) algorithm on different platforms using OpenCL, an open standard for parallel programing on heterogeneous systems, which includes a wide variety of devices, from dense multicore systems from major manufactures such as Intel or ARM to new accelerators such as graphics processing units (GPUs), field programmable gate arrays (FPGAs), the Intel Xeon Phi and other custom devices. Our proposed implementation of HYCA significantly reduces its computational cost. Our experiments have been conducted using simulated data and reveal considerable acceleration factors. This kind of implementations with the same descriptive language on different architectures are very important in order to really calibrate the possibility of using heterogeneous platforms for efficient hyperspectral imaging processing in real remote sensing missions.
Resumo:
Although the computational power of mobile devices has been increasing, it is still not enough for some classes of applications. In the present, these applications delegate the computing power burden on servers located on the Internet. This model assumes an always-on Internet connectivity and implies a non-negligible latency. The thesis addresses the challenges and contributions posed to the application of a mobile collaborative computing environment concept to wireless networks. The goal is to define a reference architecture for high performance mobile applications. Current work is focused on efficient data dissemination on a highly transitive environment, suitable to many mobile applications and also to the reputation and incentive system available on this mobile collaborative computing environment. For this we are improving our already published reputation/incentive algorithm with knowledge from the usage pattern from the eduroam wireless network in the Lisbon area.
Resumo:
Single processor architectures are unable to provide the required performance of high performance embedded systems. Parallel processing based on general-purpose processors can achieve these performances with a considerable increase of required resources. However, in many cases, simplified optimized parallel cores can be used instead of general-purpose processors achieving better performance at lower resource utilization. In this paper, we propose a configurable many-core architecture to serve as a co-processor for high-performance embedded computing on Field-Programmable Gate Arrays. The architecture consists of an array of configurable simple cores with support for floating-point operations interconnected with a configurable interconnection network. For each core it is possible to configure the size of the internal memory, the supported operations and number of interfacing ports. The architecture was tested in a ZYNQ-7020 FPGA in the execution of several parallel algorithms. The results show that the proposed many-core architecture achieves better performance than that achieved with a parallel generalpurpose processor and that up to 32 floating-point cores can be implemented in a ZYNQ-7020 SoC FPGA.
Resumo:
A new high throughput and scalable architecture for unified transform coding in H.264/AVC is proposed in this paper. Such flexible structure is capable of computing all the 4x4 and 2x2 transforms for Ultra High Definition Video (UHDV) applications (4320x7680@ 30fps) in real-time and with low hardware cost. These significantly high performance levels were proven with the implementation of several different configurations of the proposed structure using both FPGA and ASIC 90 nm technologies. In addition, such experimental evaluation also demonstrated the high area efficiency of theproposed architecture, which in terms of Data Throughput per Unit of Area (DTUA) is at least 1.5 times more efficient than its more prominent related designs(1).
Resumo:
This paper presents an algorithm to efficiently generate the state-space of systems specified using the IOPT Petri-net modeling formalism. IOPT nets are a non-autonomous Petri-net class, based on Place-Transition nets with an extended set of features designed to allow the rapid prototyping and synthesis of system controllers through an existing hardware-software co-design framework. To obtain coherent and deterministic operation, IOPT nets use a maximal-step execution semantics where, in a single execution step, all enabled transitions will fire simultaneously. This fact increases the resulting state-space complexity and can cause an arc "explosion" effect. Real-world applications, with several million states, will reach a higher order of magnitude number of arcs, leading to the need for high performance state-space generator algorithms. The proposed algorithm applies a compilation approach to read a PNML file containing one IOPT model and automatically generate an optimized C program to calculate the corresponding state-space.
Resumo:
This paper presents a single precision floating point arithmetic unit with support for multiplication, addition, fused multiply-add, reciprocal, square-root and inverse squareroot with high-performance and low resource usage. The design uses a piecewise 2nd order polynomial approximation to implement reciprocal, square-root and inverse square-root. The unit can be configured with any number of operations and is capable to calculate any function with a throughput of one operation per cycle. The floatingpoint multiplier of the unit is also used to implement the polynomial approximation and the fused multiply-add operation. We have compared our implementation with other state-of-the-art proposals, including the Xilinx Core-Gen operators, and conclude that the approach has a high relative performance/area efficiency. © 2014 Technical University of Munich (TUM).