256 resultados para accelerators
Resumo:
In this paper, we develop a fast implementation of an hyperspectral coded aperture (HYCA) algorithm on different platforms using OpenCL, an open standard for parallel programing on heterogeneous systems, which includes a wide variety of devices, from dense multicore systems from major manufactures such as Intel or ARM to new accelerators such as graphics processing units (GPUs), field programmable gate arrays (FPGAs), the Intel Xeon Phi and other custom devices. Our proposed implementation of HYCA significantly reduces its computational cost. Our experiments have been conducted using simulated data and reveal considerable acceleration factors. This kind of implementations with the same descriptive language on different architectures are very important in order to really calibrate the possibility of using heterogeneous platforms for efficient hyperspectral imaging processing in real remote sensing missions.
Resumo:
Observations of H3+ in the Galactic diffuse interstellar medium (ISM) have led to various surprising results, including the conclusion that the cosmic-ray ionization rate (zeta_2) is about 1 order of magnitude larger than previously thought. The present survey expands the sample of diffuse cloud sight lines with H3+ observations to 50, with detections in 21 of those. Ionization rates inferred from these detections are in the range (1.7+-1.0)x10^-16 s^-1 < zeta_2 < (10.6+-6.8)x10^-16 s^-1 with a mean value of zeta_2=(3.3+-0.4)x10^-16 s^-1. Upper limits (3sigma) derived from non-detections of H3+ are as low as zeta_2 < 0.4x10^-16 s^-1. These low upper-limits, in combination with the wide range of inferred cosmic-ray ionization rates, indicate variations in zeta_2 between different diffuse cloud sight lines. Calculations of the cosmic-ray ionization rate from theoretical cosmic-ray spectra require a large flux of low-energy (MeV) particles to reproduce values inferred from observations. Given the relatively short range of low-energy cosmic rays --- those most efficient at ionization --- the proximity of a cloud to a site of particle acceleration may set its ionization rate. Variations in zeta_2 are thus likely due to variations in the cosmic-ray spectrum at low energies resulting from the effects of particle propagation. To test this theory, H3+ was observed in sight lines passing through diffuse molecular clouds known to be interacting with the supernova remnant IC 443, a probable site of particle acceleration. Where H3+ is detected, ionization rates of zeta_2=(20+-10)x10^-16 s^-1 are inferred, higher than for any other diffuse cloud. These results support both the concept that supernova remnants act as particle accelerators, and the hypothesis that propagation effects are responsible for causing spatial variations in the cosmic-ray spectrum and ionization rate. Future observations of H3+ near other supernova remnants and in sight lines where complementary ionization tracers (OH+, H2O+, H3O+) have been observed will further our understanding of the subject.
Resumo:
Reconfigurable platforms are a promising technology that offers an interesting trade-off between flexibility and performance, which many recent embedded system applications demand, especially in fields such as multimedia processing. These applications typically involve multiple ad-hoc tasks for hardware acceleration, which are usually represented using formalisms such as Data Flow Diagrams (DFDs), Data Flow Graphs (DFGs), Control and Data Flow Graphs (CDFGs) or Petri Nets. However, none of these models is able to capture at the same time the pipeline behavior between tasks (that therefore can coexist in order to minimize the application execution time), their communication patterns, and their data dependencies. This paper proves that the knowledge of all this information can be effectively exploited to reduce the resource requirements and the timing performance of modern reconfigurable systems, where a set of hardware accelerators is used to support the computation. For this purpose, this paper proposes a novel task representation model, named Temporal Constrained Data Flow Diagram (TCDFD), which includes all this information. This paper also presents a mapping-scheduling algorithm that is able to take advantage of the new TCDFD model. It aims at minimizing the dynamic reconfiguration overhead while meeting the communication requirements among the tasks. Experimental results show that the presented approach achieves up to 75% of resources saving and up to 89% of reconfiguration overhead reduction with respect to other state-of-the-art techniques for reconfigurable platforms.
Resumo:
At the HL-LHC, proton bunches will cross each other every 25. ns, producing an average of 140 pp-collisions per bunch crossing. To operate in such an environment, the CMS experiment will need a L1 hardware trigger able to identify interesting events within a latency of 12.5. μs. The future L1 trigger will make use also of data coming from the silicon tracker to control the trigger rate. The architecture that will be used in future to process tracker data is still under discussion. One interesting proposal makes use of the Time Multiplexed Trigger concept, already implemented in the CMS calorimeter trigger for the Phase I trigger upgrade. The proposed track finding algorithm is based on the Hough Transform method. The algorithm has been tested using simulated pp-collision data. Results show a very good tracking efficiency. The algorithm will be demonstrated in hardware in the coming months using the MP7, which is a μTCA board with a powerful FPGA capable of handling data rates approaching 1. Tb/s.
Resumo:
Today, modern System-on-a-Chip (SoC) systems have grown rapidly due to the increased processing power, while maintaining the size of the hardware circuit. The number of transistors on a chip continues to increase, but current SoC designs may not be able to exploit the potential performance, especially with energy consumption and chip area becoming two major concerns. Traditional SoC designs usually separate software and hardware. Thus, the process of improving the system performance is a complicated task for both software and hardware designers. The aim of this research is to develop hardware acceleration workflow for software applications. Thus, system performance can be improved with constraints of energy consumption and on-chip resource costs. The characteristics of software applications can be identified by using profiling tools. Hardware acceleration can have significant performance improvement for highly mathematical calculations or repeated functions. The performance of SoC systems can then be improved, if the hardware acceleration method is used to accelerate the element that incurs performance overheads. The concepts mentioned in this study can be easily applied to a variety of sophisticated software applications. The contributions of SoC-based hardware acceleration in the hardware-software co-design platform include the following: (1) Software profiling methods are applied to H.264 Coder-Decoder (CODEC) core. The hotspot function of aimed application is identified by using critical attributes such as cycles per loop, loop rounds, etc. (2) Hardware acceleration method based on Field-Programmable Gate Array (FPGA) is used to resolve system bottlenecks and improve system performance. The identified hotspot function is then converted to a hardware accelerator and mapped onto the hardware platform. Two types of hardware acceleration methods – central bus design and co-processor design, are implemented for comparison in the proposed architecture. (3) System specifications, such as performance, energy consumption, and resource costs, are measured and analyzed. The trade-off of these three factors is compared and balanced. Different hardware accelerators are implemented and evaluated based on system requirements. 4) The system verification platform is designed based on Integrated Circuit (IC) workflow. Hardware optimization techniques are used for higher performance and less resource costs. Experimental results show that the proposed hardware acceleration workflow for software applications is an efficient technique. The system can reach 2.8X performance improvements and save 31.84% energy consumption by applying the Bus-IP design. The Co-processor design can have 7.9X performance and save 75.85% energy consumption.
Resumo:
Non-linear effects are responsible for peculiar phenomena in charged particles dynamics in circular accelerators. Recently, they have been used to propose novel beam manipulations where one can modify the transverse beam distribution in a controlled way, to fulfil the constraints posed by new applications. One example is the resonant beam splitting used at CERN for the Multi-Turn Extraction (MTE), to transfer proton beams from PS to SPS. The theoretical description of these effects relies on the formulation of the particle's dynamics in terms of Hamiltonian systems and symplectic maps, and on the theory of adiabatic invariance and resonant separatrix crossing. Close to resonance, new stable regions and new separatrices appear in the phase space. As non-linear effects do not preserve the Courant-Snyder invariant, it is possible for a particle to cross a separatrix, changing the value of its adiabatic invariant. This process opens the path to new beam manipulations. This thesis deals with various possible effects that can be used to shape the transverse beam dynamics, using 2D and 4D models of particles' motion. We show the possibility of splitting a beam using a resonant external exciter, or combining its action with MTE-like tune modulation close to resonance. Non-linear effects can also be used to cool a beam acting on its transverse beam distribution. We discuss the case of an annular beam distribution, showing that emittance can be reduced modulating amplitude and frequency of a resonant oscillating dipole. We then consider 4D models where, close to resonance, motion in the two transverse planes is coupled. This is exploited to operate on the transverse emittances with a 2D resonance crossing. Depending on the resonance, the result is an emittance exchange between the two planes, or an emittance sharing. These phenomena are described and understood in terms of adiabatic invariance theory.
Resumo:
Embedding intelligence in extreme edge devices allows distilling raw data acquired from sensors into actionable information, directly on IoT end-nodes. This computing paradigm, in which end-nodes no longer depend entirely on the Cloud, offers undeniable benefits, driving a large research area (TinyML) to deploy leading Machine Learning (ML) algorithms on micro-controller class of devices. To fit the limited memory storage capability of these tiny platforms, full-precision Deep Neural Networks (DNNs) are compressed by representing their data down to byte and sub-byte formats, in the integer domain. However, the current generation of micro-controller systems can barely cope with the computing requirements of QNNs. This thesis tackles the challenge from many perspectives, presenting solutions both at software and hardware levels, exploiting parallelism, heterogeneity and software programmability to guarantee high flexibility and high energy-performance proportionality. The first contribution, PULP-NN, is an optimized software computing library for QNN inference on parallel ultra-low-power (PULP) clusters of RISC-V processors, showing one order of magnitude improvements in performance and energy efficiency, compared to current State-of-the-Art (SoA) STM32 micro-controller systems (MCUs) based on ARM Cortex-M cores. The second contribution is XpulpNN, a set of RISC-V domain specific instruction set architecture (ISA) extensions to deal with sub-byte integer arithmetic computation. The solution, including the ISA extensions and the micro-architecture to support them, achieves energy efficiency comparable with dedicated DNN accelerators and surpasses the efficiency of SoA ARM Cortex-M based MCUs, such as the low-end STM32M4 and the high-end STM32H7 devices, by up to three orders of magnitude. To overcome the Von Neumann bottleneck while guaranteeing the highest flexibility, the final contribution integrates an Analog In-Memory Computing accelerator into the PULP cluster, creating a fully programmable heterogeneous fabric that demonstrates end-to-end inference capabilities of SoA MobileNetV2 models, showing two orders of magnitude performance improvements over current SoA analog/digital solutions.
Resumo:
The first topic analyzed in the thesis will be Neural Architecture Search (NAS). I will focus on two different tools that I developed, one to optimize the architecture of Temporal Convolutional Networks (TCNs), a convolutional model for time-series processing that has recently emerged, and one to optimize the data precision of tensors inside CNNs. The first NAS proposed explicitly targets the optimization of the most peculiar architectural parameters of TCNs, namely dilation, receptive field, and the number of features in each layer. Note that this is the first NAS that explicitly targets these networks. The second NAS proposed instead focuses on finding the most efficient data format for a target CNN, with the granularity of the layer filter. Note that applying these two NASes in sequence allows an "application designer" to minimize the structure of the neural network employed, minimizing the number of operations or the memory usage of the network. After that, the second topic described is the optimization of neural network deployment on edge devices. Importantly, exploiting edge platforms' scarce resources is critical for NN efficient execution on MCUs. To do so, I will introduce DORY (Deployment Oriented to memoRY) -- an automatic tool to deploy CNNs on low-cost MCUs. DORY, in different steps, can manage different levels of memory inside the MCU automatically, offload the computation workload (i.e., the different layers of a neural network) to dedicated hardware accelerators, and automatically generates ANSI C code that orchestrates off- and on-chip transfers with the computation phases. On top of this, I will introduce two optimized computation libraries that DORY can exploit to deploy TCNs and Transformers on edge efficiently. I conclude the thesis with two different applications on bio-signal analysis, i.e., heart rate tracking and sEMG-based gesture recognition.
Resumo:
Analog In-memory Computing (AIMC) has been proposed in the context of Beyond Von Neumann architectures as a valid strategy to reduce internal data transfers energy consumption and latency, and to improve compute efficiency. The aim of AIMC is to perform computations within the memory unit, typically leveraging the physical features of memory devices. Among resistive Non-volatile Memories (NVMs), Phase-change Memory (PCM) has become a promising technology due to its intrinsic capability to store multilevel data. Hence, PCM technology is currently investigated to enhance the possibilities and the applications of AIMC. This thesis aims at exploring the potential of new PCM-based architectures as in-memory computational accelerators. In a first step, a preliminar experimental characterization of PCM devices has been carried out in an AIMC perspective. PCM cells non-idealities, such as time-drift, noise, and non-linearity have been studied to develop a dedicated multilevel programming algorithm. Measurement-based simulations have been then employed to evaluate the feasibility of PCM-based operations in the fields of Deep Neural Networks (DNNs) and Structural Health Monitoring (SHM). Moreover, a first testchip has been designed and tested to evaluate the hardware implementation of Multiply-and-Accumulate (MAC) operations employing PCM cells. This prototype experimentally demonstrates the possibility to reach a 95% MAC accuracy with a circuit-level compensation of cells time drift and non-linearity. Finally, empirical circuit behavior models have been included in simulations to assess the use of this technology in specific DNN applications, and to enhance the potentiality of this innovative computation approach.