10 results for Compute Unified Device Architecture (CUDA)
at Massachusetts Institute of Technology
Abstract:
We consider the often-studied problem of sorting, for a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms.
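To make the idea of exact splitting concrete, here is a minimal Python sketch (not the paper's algorithm): selection finds splitter elements whose global ranks are exact multiples of n/p, so every output block is perfectly balanced. The gather step and helper names are illustrative only; a real implementation would run a distributed selection in place, and would break ties by rank when keys repeat.

```python
import heapq
import random

def select_kth(data, k):
    # k-th smallest element, 0-indexed; a parallel implementation would
    # use a distributed selection rather than touching all the data here
    return heapq.nsmallest(k + 1, data)[-1]

def exact_splitters(local_arrays, p):
    # Gathering everything is for illustration only; the point is that
    # selection yields splitters at exact global ranks i*n/p, giving a
    # perfectly load-balanced output rather than the usual Theta(n/p).
    everything = [x for arr in local_arrays for x in arr]
    n = len(everything)
    return [select_kth(everything, i * n // p - 1) for i in range(1, p)]

random.seed(0)
p = 4
local_arrays = [sorted(random.sample(range(10_000), 100)) for _ in range(p)]
print(exact_splitters(local_arrays, p))  # p-1 = 3 splitter values
```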
Abstract:
Performance and manufacturability are two important issues that must be taken into account during MEMS design. Existing MEMS design models and systems follow a process-driven design paradigm: design starts from the specification of a process sequence or the customization of a foundry-ready process template. There has been essentially no methodology or model that supports generic, high-level design synthesis for MEMS conceptual design, so there is no basis for specifying the initial process sequences. To address this problem, this paper proposes a performance-driven, microfabrication-oriented methodology for MEMS conceptual design. A unified behaviour representation method is proposed that incorporates information on both physical interactions and chemical/biological/other reactions. Based on this method, a behavioural-process-based design synthesis model is proposed that exploits multidisciplinary phenomena for design solutions, covering both the structural components and their configuration for the MEMS device, as well as the substances needed for the chemical/biological/other reactions. The model supports both forward and backward synthetic search for suitable phenomena. To ensure manufacturability, a strategy of using microfabrication-oriented phenomena as design knowledge is proposed, where the phenomena are derived from existing MEMS devices that have associated MEMS-specific microfabrication processes or foundry-ready process templates. To test the applicability of the proposed methodology, the paper also studies microfluidic device design and uses a micro-pump design as the case study.
Abstract:
The Scheme86 and the HP Precision architectures represent different trends in computer processor design. The former uses wide micro-instructions, parallel hardware, and a low-latency memory interface. The latter encourages pipelined implementation and visible interlocks. To compare the merits of these approaches, algorithms frequently encountered in numerical and symbolic computation were hand-coded for each architecture. Timings were done in simulators, and the results were evaluated to determine the speed of each design. Based on these measurements, conclusions were drawn as to which aspects of each architecture are suitable for a high-performance computer.
Abstract:
This report describes a working autonomous mobile robot whose only goal is to collect and return empty soda cans. It operates in an unmodified office environment occupied by moving people. The robot is controlled by a collection of over 40 independent "behaviors" distributed over a loosely coupled network of 24 processors. Together this ensemble helps the robot locate cans with its laser rangefinder, collect them with its on-board manipulator, and bring them home using a compass and an array of proximity sensors. We discuss the advantages of using such a multi-agent control system and show how to decompose the required tasks into component activities. We also examine the benefits and limitations of spatially local, stateless, and independent computation by the agents.
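As a hypothetical illustration of this control style (the behavior names and sensor fields below are invented, not the robot's actual 40 behaviors), each stateless behavior maps sensor readings to an optional command, and a fixed-priority arbiter picks the first one with an opinion:

```python
# Sketch of stateless, independent behaviors with fixed-priority arbitration.
def avoid_obstacles(sensors):
    if sensors["proximity"] < 0.3:          # something is close: veer away
        return "turn_away"
    return None

def grab_can(sensors):
    if sensors["can_in_gripper_range"]:     # rangefinder flagged a nearby can
        return "close_gripper"
    return None

def wander(sensors):
    return "drive_forward"                  # default activity, always applicable

BEHAVIORS = [avoid_obstacles, grab_can, wander]  # highest priority first

def arbitrate(sensors):
    # First behavior with an opinion wins, subsumption-style.
    for behavior in BEHAVIORS:
        command = behavior(sensors)
        if command is not None:
            return command

print(arbitrate({"proximity": 0.9, "can_in_gripper_range": True}))  # -> close_gripper
```

Because each behavior is stateless and independent, an ensemble like this can be spread across a loosely coupled network of processors, as the abstract describes.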
Abstract:
This thesis defines Pi, a parallel architecture interface that separates model and machine issues, allowing them to be addressed independently. This provides greater flexibility for both the model and machine builder. Pi addresses a set of common parallel model requirements including low latency communication, fast task switching, low cost synchronization, efficient storage management, the ability to exploit locality, and efficient support for sequential code. Since Pi provides generic parallel operations, it can efficiently support many parallel programming models including hybrids of existing models. Pi also forms a basis of comparison for architectural components.
Abstract:
General-purpose computing devices allow us to (1) customize computation after fabrication and (2) conserve area by reusing expensive active circuitry for different functions in time. We define RP-space, a restricted domain of the general-purpose architectural space focussed on reconfigurable computing architectures. Two dominant features differentiate reconfigurable from special-purpose architectures and account for most of the area overhead associated with RP devices: (1) instructions which tell the device how to behave, and (2) flexible interconnect which supports task-dependent dataflow between operations. We can characterize RP-space by the allocation and structure of these resources and compare the efficiencies of architectural points across broad application characteristics. Conventional FPGAs fall at one extreme end of this space, and their efficiency ranges over two orders of magnitude across the space of application characteristics. Understanding RP-space and its consequences allows us to pick the best architecture for a task and to search for more robust design points in the space. Our DPGA, a fine-grained computing device which adds small, on-chip instruction memories to FPGAs, is one such design point. For typical logic applications and finite-state machines, a DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the DPGA, while reducing typical physical mapping times from hours to seconds. Rigid, fabrication-time organization of instruction resources significantly narrows the range of efficiency for conventional architectures. To avoid this performance brittleness, we developed MATRIX, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs. Our focus MATRIX design point is based on an array of 8-bit ALU and register-file building blocks interconnected via a byte-wide network. With today's silicon, a single-chip MATRIX array can deliver over 10 Gop/s (8-bit ops). On sample image processing tasks, we show that MATRIX yields 10-20x the computational density of conventional processors. Understanding the cost structure of RP-space helps us identify these intermediate architectural points and may provide useful insight more broadly in guiding our continual search for robust and efficient general-purpose computing structures.
Abstract:
The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles needed to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has made migration an unserviceable solution so far. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine by using a set of high-performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles -- over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.
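A quick sanity check using only the figures quoted above: with a null-thread migration of 66 cycles and a typical thread costing 4 to 5 times that,

\[ t_{\text{typical}} \approx 4\ldots 5 \times 66 \approx 264\ldots 330 \text{ cycles}, \]

a latency of the same order as an L2 cache access, which is consistent with the abstract's decision to drop data caches in the simulated implementation.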
Abstract:
Regularization Networks and Support Vector Machines are techniques for solving certain problems of learning from examples -- in particular the regression problem of approximating a multivariate function from sparse data. We present both formulations in a unified framework, namely in the context of Vapnik's theory of statistical learning which provides a general foundation for the learning problem, combining functional analysis and statistics.
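As standard background for this framework (textbook form, not quoted from the paper): both techniques minimize a regularized empirical functional over a reproducing kernel Hilbert space \(\mathcal{H}_K\),

\[
\min_{f \in \mathcal{H}_K}\; \frac{1}{n}\sum_{i=1}^{n} V\bigl(y_i, f(x_i)\bigr) + \lambda\,\lVert f \rVert_K^2,
\]

where the square loss \(V(y, f(x)) = (y - f(x))^2\) yields Regularization Networks and Vapnik's \(\epsilon\)-insensitive loss \(V(y, f(x)) = \max(|y - f(x)| - \epsilon,\, 0)\) yields Support Vector Machine regression; in both cases the minimizer has the form \(f(x) = \sum_i c_i K(x, x_i)\).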
Abstract:
The Space Systems, Policy and Architecture Research Consortium (SSPARC) was formed to make substantial progress on problems of national importance. The goals of SSPARC were to:
• Provide technologies and methods that will allow the creation of flexible, upgradable space systems,
• Create a “clean sheet” approach to space systems architecture determination and design, including the incorporation of risk, uncertainty, and flexibility issues, and
• Consider the impact of national space policy on the above.
This report covers the last two goals, and demonstrates that the effort was largely successful.
Abstract:
This paper proposes three tests to determine whether a given nonlinear device noise model is in agreement with accepted thermodynamic principles. These tests are applied to several models. One conclusion is that every Gaussian noise model for any nonlinear device predicts thermodynamically impossible circuit behavior: these models should be abandoned. But the nonlinear shot-noise model predicts thermodynamically acceptable behavior under a constraint derived here. Further, this constraint specifies the current noise amplitude at each operating point from knowledge of the device v-i curve alone. For the Gaussian and shot-noise models, this paper shows how the thermodynamic requirements can be reduced to concise mathematical tests involving no approximations.
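For reference, the two model families under test are conventionally specified by their white-noise spectral densities (standard textbook forms, not the constraint derived in the paper):

\[
S^{\text{thermal}}_i(f) = \frac{4 k_B T}{R} \quad \text{(Nyquist-Johnson, linear resistor)},
\qquad
S^{\text{shot}}_i(f) = 2 q I \quad \text{(Schottky)};
\]

the paper's tests ask which generalizations of such models to nonlinear devices remain consistent with thermodynamics.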