109 resultados para intel processor
Resumo:
Accurate supersymmetric spectra are required to confront data from direct and indirect searches of supersymmetry. SuSeFLAV is a numerical tool capable of computing supersymmetric spectra precisely for various supersymmetric breaking scenarios applicable even in the presence of flavor violation. The program solves MSSM RGEs with complete 3 x 3 flavor mixing at 2-loop level and one loop finite threshold corrections to all MSSM parameters by incorporating radiative electroweak symmetry breaking conditions. The program also incorporates the Type-I seesaw mechanism with three massive right handed neutrinos at user defined mass scales and mixing. It also computes branching ratios of flavor violating processes such as l(j) -> l(i)gamma, l(j) -> 3 l(i), b -> s gamma and supersymmetric contributions to flavor conserving quantities such as (g(mu) - 2). A large choice of executables suitable for various operations of the program are provided. Program summary Program title: SuSeFLAV Catalogue identifier: AEOD_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEOD_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GNU General Public License No. of lines in distributed program, including test data, etc.: 76552 No. of bytes in distributed program, including test data, etc.: 582787 Distribution format: tar.gz Programming language: Fortran 95. Computer: Personal Computer, Work-Station. Operating system: Linux, Unix. Classification: 11.6. Nature of problem: Determination of masses and mixing of supersymmetric particles within the context of MSSM with conserved R-parity with and without the presence of Type-I seesaw. Inter-generational mixing is considered while calculating the mass spectrum. Supersymmetry breaking parameters are taken as inputs at a high scale specified by the mechanism of supersymmetry breaking. RG equations including full inter-generational mixing are then used to evolve these parameters up to the electroweak breaking scale. The low energy supersymmetric spectrum is calculated at the scale where successful radiative electroweak symmetry breaking occurs. At weak scale standard model fermion masses, gauge couplings are determined including the supersymmetric radiative corrections. Once the spectrum is computed, the program proceeds to various lepton flavor violating observables (e.g., BR(mu -> e gamma), BR(tau -> mu gamma) etc.) at the weak scale. Solution method: Two loop RGEs with full 3 x 3 flavor mixing for all supersymmetry breaking parameters are used to compute the low energy supersymmetric mass spectrum. An adaptive step size Runge-Kutta method is used to solve the RGEs numerically between the high scale and the electroweak breaking scale. Iterative procedure is employed to get the consistent radiative electroweak symmetry breaking condition. The masses of the supersymmetric particles are computed at 1-loop order. The third generation SM particles and the gauge couplings are evaluated at the 1-loop order including supersymmetric corrections. A further iteration of the full program is employed such that the SM masses and couplings are consistent with the supersymmetric particle spectrum. Additional comments: Several executables are presented for the user. Running time: 0.2 s on a Intel(R) Core(TM) i5 CPU 650 with 3.20 GHz. (c) 2012 Elsevier B.V. All rights reserved.
Resumo:
Video decoders used in emerging applications need to be flexible to handle a large variety of video formats and deliver scalable performance to handle wide variations in workloads. In this paper we propose a unified software and hardware architecture for video decoding to achieve scalable performance with flexibility. The light weight processor tiles and the reconfigurable hardware tiles in our architecture enable software and hardware implementations to co-exist, while a programmable interconnect enables dynamic interconnection of the tiles. Our process network oriented compilation flow achieves realization agnostic application partitioning and enables seamless migration across uniprocessor, multi-processor, semi hardware and full hardware implementations of a video decoder. An application quality of service aware scheduler monitors and controls the operation of the entire system. We prove the concept through a prototype of the architecture on an off-the-shelf FPGA. The FPGA prototype shows a scaling in performance from QCIF to 1080p resolutions in four discrete steps. We also demonstrate that the reconfiguration time is short enough to allow migration from one configuration to the other without any frame loss.
Resumo:
Most stencil computations allow tile-wise concurrent start, i.e., there always exists a face of the iteration space and a set of tiling hyperplanes such that all tiles along that face can be started concurrently. This provides load balance and maximizes parallelism. However, existing automatic tiling frameworks often choose hyperplanes that lead to pipelined start-up and load imbalance. We address this issue with a new tiling technique that ensures concurrent start-up as well as perfect load-balance whenever possible. We first provide necessary and sufficient conditions on tiling hyperplanes to enable concurrent start for programs with affine data accesses. We then provide an approach to find such hyperplanes. Experimental evaluation on a 12-core Intel Westmere shows that our code is able to outperform a tuned domain-specific stencil code generator by 4% to 27%, and previous compiler techniques by a factor of 2x to 10.14x.
Resumo:
Managing heat produced by computer processors is an important issue today, especially when the size of processors is decreasing rapidly while the number of transistors in the processor is increasing rapidly. This poster describes a preliminary study of the process of adding carbon nanotubes (CNTs) to a standard silicon paste covering a CPU. Measurements were made in two rounds of tests to compare the rate of cool-down with and without CNTs present. The silicon paste acts as an interface between the CPU and the heat sink, increasing the heat transfer rate away from the CPU. To the silicon paste was added 0.05% by weight of CNTs. These were not aligned. A series of K-type thermocouples was used to measure the temperature as a function of time in the vicinity of the CPU, following its shut-off. An Omega data acquisition system was attached to the thermocouples. The CPU temperature was not measured directly because attachment of a thermocouple would have prevented its automatic shut-off A thermocouple in the paste containing the CNTs actually reached a higher temperature than the standard paste, an effect easily explained. But the rate of cooling with the CNTs was about 4.55% better.
Resumo:
Accurate and timely prediction of weather phenomena, such as hurricanes and flash floods, require high-fidelity compute intensive simulations of multiple finer regions of interest within a coarse simulation domain. Current weather applications execute these nested simulations sequentially using all the available processors, which is sub-optimal due to their sub-linear scalability. In this work, we present a strategy for parallel execution of multiple nested domain simulations based on partitioning the 2-D processor grid into disjoint rectangular regions associated with each domain. We propose a novel combination of performance prediction, processor allocation methods and topology-aware mapping of the regions on torus interconnects. Experiments on IBM Blue Gene systems using WRF show that the proposed strategies result in performance improvement of up to 33% with topology-oblivious mapping and up to additional 7% with topology-aware mapping over the default sequential strategy.
Resumo:
Rapid advancements in multi-core processor architectures coupled with low-cost, low-latency, high-bandwidth interconnects have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs that efficiently utilize all the resources in such a cluster is still a major challenge. Various programming languages have been proposed as a solution to this problem, but are yet to be adopted widely to run performance-critical code mainly due to the relatively immature software framework and the effort involved in re-writing existing code in the new language. In this paper, we motivate and describe our initial study in exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel, the growing popularity, support and stability of the CUDA software stack collectively make CUDA a good candidate to be considered as a programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work distribution runtime and a light-weight software distributed shared memory to manage parallel executions. Initial results on running several standard CUDA benchmark programs achieve impressive speedups of up to 7.5X on a cluster with 8 nodes, thereby opening up an interesting direction of research for further investigation.
Resumo:
We discuss the computational bottlenecks in molecular dynamics (MD) and describe the challenges in parallelizing the computation-intensive tasks. We present a hybrid algorithm using MPI (Message Passing Interface) with OpenMP threads for parallelizing a generalized MD computation scheme for systems with short range interatomic interactions. The algorithm is discussed in the context of nano-indentation of Chromium films with carbon indenters using the Embedded Atom Method potential for Cr-Cr interaction and the Morse potential for Cr-C interactions. We study the performance of our algorithm for a range of MPI-thread combinations and find the performance to depend strongly on the computational task and load sharing in the multi-core processor. The algorithm scaled poorly with MPI and our hybrid schemes were observed to outperform the pure message passing scheme, despite utilizing the same number of processors or cores in the cluster. Speed-up achieved by our algorithm compared favorably with that achieved by standard MD packages. (C) 2013 Elsevier Inc. All rights reserved.
Resumo:
We model communication of bursty sources: 1) over multiaccess channels, with either independent decoding or joint decoding and 2) over degraded broadcast channels, by a discrete-time multiclass processor sharing queue. We utilize error exponents to give a characterization of the processor sharing queue. We analyze the processor sharing queue model for the stable region of message arrival rates, and show the existence of scheduling policies for which the stability region converges to the information-theoretic capacity region in an appropriate limiting sense.
Resumo:
A nearly constant switching frequency current hysteresis controller for a 2-level inverter fed induction motor drive is proposed in this paper: The salient features of this controller are fast dynamics for the current, inherent protection against overloads and less switching frequency variation. The large variation of switching frequency as in the conventional hysteresis controller is avoided by defining a current-error boundary which is obtained from the current-error trajectory of the standard space vector PWM. The current-error boundary is computed at every sampling interval based on the induction machine parameters and from the estimated fundamental stator voltage. The stator currents are always monitored and when the current-error exceeds the boundary, voltage space vector is switched to reduce the current-error. The proposed boundary computation algorithm is applicable in linear and over-modulation region and it is simple to implement in any standard digital signal processor: Detailed experimental verification is done using a 7.5 kW induction motor and the results are given to show the performance of the drive at various operating conditions and validate the proposed advantages.
Resumo:
This paper presents a Radix-4(3) based FFT architecture suitable for OFDM based WLAN applications. The radix-4(3) parallel unrolled architecture presented here, uses a radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. A 64 point FFT processor based on the proposed architecture has been implemented in UMC 130nm 1P8M CMOS process with a maximum clock frequency of 100 MHz and area of 0.83mm(2). The proposed processor provides a throughput of four times the clock rate and can finish one 64 point FFT computation in 16 clock cycles. For IEEE 802.11a/g WLAN, the processor needs to be operated at a clock rate of 5 MHz with a power consumption of 2.27 mW which is 27% less than the previously reported low power implementations.
Resumo:
In this paper we propose a fully parallel 64K point radix-4(4) FFT processor. The radix-4(4) parallel unrolled architecture uses a novel radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. The radix-4(4) block can take all 256 inputs in parallel and can use the select control signals to generate one out of the 256 outputs. The resultant 64K point FFT processor shows significant reduction in intermediate memory but with increased hardware complexity. Compared to the state-of-art implementation 5], our architecture shows reduced latency with comparable throughput and area. The 64K point FFT architecture was synthesized using a 130nm CMOS technology which resulted in a throughput of 1.4 GSPS and latency of 47.7 mu s with a maximum clock frequency of 350MHz. When compared to 5], the latency is reduced by 303 mu s with 50.8% reduction in area.
Resumo:
Experimental quantum simulation of a Hamiltonian H requires unitary operator decomposition (UOD) of its evolution unitary U = exp(-iHt) in terms of native unitary operators of the experimental system. Here, using a genetic algorithm, we numerically evaluate the most generic UOD (valid over a continuous range of Hamiltonian parameters) of the unitary operator U, termed fidelity-profile optimization. The optimization is obtained by systematically evaluating the functional dependence of experimental unitary operators (such as single-qubit rotations and time-evolution unitaries of the system interactions) to the Hamiltonian (H) parameters. Using this technique, we have solved the experimental unitary decomposition of a controlled-phase gate (for any phase value), the evolution unitary of the Heisenberg XY interaction, and simulation of the Dzyaloshinskii-Moriya (DM) interaction in the presence of the Heisenberg XY interaction. Using these decompositions, we studied the entanglement dynamics of a Bell state in the DM interaction and experimentally verified the entanglement preservation procedure of Hou et al. Ann. Phys. (N.Y.) 327, 292 (2012)] in a nuclear magnetic resonance quantum information processor.
Resumo:
Precise pointer analysis is a problem of interest to both the compiler and the program verification community. Flow-sensitivity is an important dimension of pointer analysis that affects the precision of the final result computed. Scaling flow-sensitive pointer analysis to millions of lines of code is a major challenge. Recently, staged flow-sensitive pointer analysis has been proposed, which exploits a sparse representation of program code created by staged analysis. In this paper we formulate the staged flow-sensitive pointer analysis as a graph-rewriting problem. Graph-rewriting has already been used for flow-insensitive analysis. However, formulating flow-sensitive pointer analysis as a graph-rewriting problem adds additional challenges due to the nature of flow-sensitivity. We implement our parallel algorithm using Intel Threading Building Blocks and demonstrate considerable scaling (upto 2.6x) for 8 threads on a set of 10 benchmarks. Compared to the sequential implementation of staged flow-sensitive analysis, a single threaded execution of our implementation performs better in 8 of the benchmarks.
Resumo:
In this paper we present a framework for realizing arbitrary instruction set extensions (IE) that are identified post-silicon. The proposed framework has two components viz., an IE synthesis methodology and the architecture of a reconfigurable data-path for realization of the such IEs. The IE synthesis methodology ensures maximal utilization of resources on the reconfigurable data-path. In this context we present the techniques used to realize IEs for applications that demand high throughput or those that must process data streams. The reconfigurable hardware called HyperCell comprises a reconfigurable execution fabric. The fabric is a collection of interconnected compute units. A typical use case of HyperCell is where it acts as a co-processor with a host and accelerates execution of IEs that are defined post-silicon. We demonstrate the effectiveness of our approach by evaluating the performance of some well-known integer kernels that are realized as IEs on HyperCell. Our methodology for realizing IEs through HyperCells permits overlapping of potentially all memory transactions with computations. We show significant improvement in performance for streaming applications over general purpose processor based solutions, by fully pipelining the data-path. (C) 2014 Elsevier B.V. All rights reserved.
Resumo:
Polyhedral techniques for program transformation are now used in several proprietary and open source compilers. However, most of the research on polyhedral compilation has focused on imperative languages such as C, where the computation is specified in terms of statements with zero or more nested loops and other control structures around them. Graphical dataflow languages, where there is no notion of statements or a schedule specifying their relative execution order, have so far not been studied using a powerful transformation or optimization approach. The execution semantics and referential transparency of dataflow languages impose a different set of challenges. In this paper, we attempt to bridge this gap by presenting techniques that can be used to extract polyhedral representation from dataflow programs and to synthesize them from their equivalent polyhedral representation. We then describe PolyGLoT, a framework for automatic transformation of dataflow programs which we built using our techniques and other popular research tools such as Clan and Pluto. For the purpose of experimental evaluation, we used our tools to compile LabVIEW, one of the most widely used dataflow programming languages. Results show that dataflow programs transformed using our framework are able to outperform those compiled otherwise by up to a factor of seventeen, with a mean speed-up of 2.30x while running on an 8-core Intel system.