960 resultados para CUDA (Computer architecture)


Relevância:

100.00% 100.00%

Publicador:

Resumo:

MinneSPEC proposes reduced input sets that microprocessor designers can use to model representative short-running workloads. A four-step methodology verifies the program behavior similarity of these input sets to reference sets.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The furious pace of Moore's Law is driving computer architecture into a realm where the the speed of light is the dominant factor in system latencies. The number of clock cycles to span a chip are increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative solution is to reduce latency by migrating threads and data, but the overhead of existing implementations has previously made migration an unserviceable solution so far. I present an architecture, implementation, and mechanisms that reduces the overhead of migration to the point where migration is a viable supplement to other latency hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine by using a set of high performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles -- over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Techniques of image combination, with extraction of objects to set a final scene, are very used in applications from photos montages to cinematographic productions. These techniques are called digital matting. With them is possible to decrease the cost of productions, because it is not necessary for the actor to be filmed in the location where the final scene occurs. This feature also favors its use in programs made to digital television, which demands a high quality image. Many digital matting algorithms use markings done on the images, to demarcate what is the foreground, the background and the uncertainty areas. This marking is called trimap, which is a triple map containing these three informations. The trimap is done, typically, from manual markings. In this project, methods were created that can be used in digital matting algorithms, with restriction of time and without human interaction, that is, the creation of an algorithm that generates the trimap automatically. This last one can be generated from the difference between a color of an arbitrary background and the foreground, or by using a depth map. It was also created a matting method, based on the Geodesic Matting (BAI; SAPIRO, 2009), which has an inferior processing time then the original one. Aiming to improve the performance of the applications that generates the trimap and of the algorithms that generates the alphamap (map that associates a value to the transparency of each pixel of the image), allowing its use in applications with time restrictions, it was used the CUDA architecture. Taking advantage, this way, of the computational power and the features of the GPGPU, which is massively parallel

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The term "Logic Programming" refers to a variety of computer languages and execution models which are based on the traditional concept of Symbolic Logic. The expressive power of these languages offers promise to be of great assistance in facing the programming challenges of present and future symbolic processing applications in Artificial Intelligence, Knowledge-based systems, and many other areas of computing. The sequential execution speed of logic programs has been greatly improved since the advent of the first interpreters. However, higher inference speeds are still required in order to meet the demands of applications such as those contemplated for next generation computer systems. The execution of logic programs in parallel is currently considered a promising strategy for attaining such inference speeds. Logic Programming in turn appears as a suitable programming paradigm for parallel architectures because of the many opportunities for parallel execution present in the implementation of logic programs. This dissertation presents an efficient parallel execution model for logic programs. The model is described from the source language level down to an "Abstract Machine" level suitable for direct implementation on existing parallel systems or for the design of special purpose parallel architectures. Few assumptions are made at the source language level and therefore the techniques developed and the general Abstract Machine design are applicable to a variety of logic (and also functional) languages. These techniques offer efficient solutions to several areas of parallel Logic Programming implementation previously considered problematic or a source of considerable overhead, such as the detection and handling of variable binding conflicts in AND-Parallelism, the specification of control and management of the execution tree, the treatment of distributed backtracking, and goal scheduling and memory management issues, etc. A parallel Abstract Machine design is offered, specifying data areas, operation, and a suitable instruction set. This design is based on extending to a parallel environment the techniques introduced by the Warren Abstract Machine, which have already made very fast and space efficient sequential systems a reality. Therefore, the model herein presented is capable of retaining sequential execution speed similar to that of high performance sequential systems, while extracting additional gains in speed by efficiently implementing parallel execution. These claims are supported by simulations of the Abstract Machine on sample programs.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We calculate the electron exchange coupling for a phosphorus donor pair in silicon perturbed by a J-gate potential and the boundary effects of the silicon host geometry. In addition to the electron-electron exchange interaction we also calculate the contact hyperfine interaction between the donor nucleus and electron as a function of the varying experimental conditions. Donor separation, depth of the P nuclei below the silicon oxide layer and J-gate voltage become decisive factors in determining the strength of both the exchange coupling and hyperfine interaction-both crucial components for qubit operations in the Kane quantum computer. These calculations were performed using an anisotropic effective-mass Hamiltonian approach. The behaviour of the donor exchange coupling as a function of the parameters varied in this work provides relevant information for the experimental design of these devices.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Computer simulation programs are essential tools for scientists and engineers to understand a particular system of interest. As expected, the complexity of the software increases with the depth of the model used. In addition to the exigent demands of software engineering, verification of simulation programs is especially challenging because the models represented are complex and ridden with unknowns that will be discovered by developers in an iterative process. To manage such complexity, advanced verification techniques for continually matching the intended model to the implemented model are necessary. Therefore, the main goal of this research work is to design a useful verification and validation framework that is able to identify model representation errors and is applicable to generic simulators. The framework that was developed and implemented consists of two parts. The first part is First-Order Logic Constraint Specification Language (FOLCSL) that enables users to specify the invariants of a model under consideration. From the first-order logic specification, the FOLCSL translator automatically synthesizes a verification program that reads the event trace generated by a simulator and signals whether all invariants are respected. The second part consists of mining the temporal flow of events using a newly developed representation called State Flow Temporal Analysis Graph (SFTAG). While the first part seeks an assurance of implementation correctness by checking that the model invariants hold, the second part derives an extended model of the implementation and hence enables a deeper understanding of what was implemented. The main application studied in this work is the validation of the timing behavior of micro-architecture simulators. The study includes SFTAGs generated for a wide set of benchmark programs and their analysis using several artificial intelligence algorithms. This work improves the computer architecture research and verification processes as shown by the case studies and experiments that have been conducted.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper presents some observations on how computer animation was used in the early years of a degree program in Electrical and Electronic Engineering to enhance the teaching of key skills and professional practice. This paper presents the results from two case studies. First, in a first year course which seeks to teach students how to manage and report on group projects in a professional way. Secondly, in a technical course on virtual reality, where the students are asked to use computer animation in a way that subliminally coerces them to come to terms with the fine detail of the mathematical principles that underlie 3D graphics, geometry, etc. as well as the most significant principles of computer architecture and software engineering. In addition, the findings reveal that by including a significant element of self and peer review processes into the assessment procedure students became more engaged with the course and achieved a deeper level of comprehension of the material in the course.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This work shows the design, simulation, and analysis of two optical interconnection networks for a Dataflow parallel computer architecture. To verify the optical interconnection network performance on the Dataflow architecture, we have analyzed the load balancing among the processors during the parallel programs executions. The load balancing is a very important parameter because it is directly associated to the dataflow parallelism degree. This article proves that optical interconnection networks designed with simple optical devices can provide efficiently the dataflow requirements of a high performance communication system.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The vascular segmentation is important in diagnosing vascular diseases like stroke and is hampered by noise in the image and very thin vessels that can pass unnoticed. One way to accomplish the segmentation is extracting the centerline of the vessel with height ridges, which uses the intensity as features for segmentation. This process can take from seconds to minutes, depending on the current technology employed. In order to accelerate the segmentation method proposed by Aylward [Aylward & Bullitt 2002] we have adapted it to run in parallel using CUDA architecture. The performance of the segmentation method running on GPU is compared to both the same method running on CPU and the original Aylward s method running also in CPU. The improvemente of the new method over the original one is twofold: the starting point for the segmentation process is not a single point in the blood vessel but a volume, thereby making it easier for the user to segment a region of interest, and; the overall gain method was 873 times faster running on GPU and 150 times more fast running on the CPU than the original CPU in Aylward

Relevância:

90.00% 90.00%

Publicador:

Resumo:

In order to simplify computer management, several system administrators are adopting advanced techniques to manage software configuration of enterprise computer networks, but the tight coupling between hardware and software makes every PC an individual managed entity, lowering the scalability and increasing the costs to manage hundreds or thousands of PCs. Virtualization is an established technology, however its use is been more focused on server consolidation and virtual desktop infrastructure, not for managing distributed computers over a network. This paper discusses the feasibility of the Distributed Virtual Machine Environment, a new approach for enterprise computer management that combines virtualization and distributed system architecture as the basis of the management architecture. © 2008 IEEE.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

"This report reproduces a thesis of the same title submitted to the Department of Electrical Engineering, Massachusetts Institute of Technology, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, May 1971."--p. 3

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The paper describes the architecture of SCIT - supercomputer system of cluster type and the base architecture features used during this research project. This supercomputer system is put into research operation in Glushkov Institute of Cybernetics NAS of Ukraine from the early 2006 year. The paper may be useful for those scientists and engineers that are practically engaged in a cluster supercomputer systems design, integration and services.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper describes a PC-based mainframe computer emulator called VisibleZ and its use in teaching mainframe Computer Organization and Assembly Programming classes. VisibleZ models IBM’s z/Architecture and allows direct interpretation of mainframe assembly language object code in a graphical user interface environment that was developed in Java. The VisibleZ emulator acts as an interactive visualization tool to simulate enterprise computer architecture. The provided architectural components include main storage, CPU, registers, Program Status Word (PSW), and I/O Channels. Particular attention is given to providing visual clues to the user by color-coding screen components, machine instruction execution, and animation of the machine architecture components. Students interact with VisibleZ by executing machine instructions in a step-by-step mode, simultaneously observing the contents of memory, registers, and changes in the PSW during the fetch-decode-execute machine instruction cycle. The object-oriented design and implementation of VisibleZ allows students to develop their own instruction semantics by coding Java for existing specific z/Architecture machine instructions or design and implement new machine instructions. The use of VisibleZ in lectures, labs, and assignments is described in the paper and supported by a website that hosts an extensive collection of related materials. VisibleZ has been proven a useful tool in mainframe Assembly Language Programming and Computer Organization classes. Using VisibleZ, students develop a better understanding of mainframe concepts, components, and how the mainframe computer works. ACM Computing Classification System (1998): C.0, K.3.2.