476 resultados para Processors


Relevância:

20.00% 20.00%

Publicador:

Resumo:

In this and a preceding paper, we provide an introduction to the Fujitsu VPP range of vector-parallel supercomputers and to some of the computational chemistry software available for the VPP. Here, we consider the implementation and performance of seven popular chemistry application packages. The codes discussed range from classical molecular dynamics to semiempirical and ab initio quantum chemistry. All have evolved from sequential codes, and have typically been parallelised using a replicated data approach. As such they are well suited to the large-memory/fast-processor architecture of the VPP. For one code, CASTEP, a distributed-memory data-driven parallelisation scheme is presented. (C) 2000 Published by Elsevier Science B.V. All rights reserved.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A preliminary version of this paper appeared in Proceedings of the 31st IEEE Real-Time Systems Symposium, 2010, pp. 239–248.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We present a 12*(1+|R|/(4m))-speed algorithm for scheduling constrained-deadline sporadic real-time tasks on a multiprocessor comprising m processors where a task may request one of |R| sequentially-reusable shared resources.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Consider the problem of scheduling a set of implicitdeadline sporadic tasks on a heterogeneous multiprocessor so as to meet all deadlines. Tasks cannot migrate and the platform is restricted in that each processor is either of type-1 or type-2 (with each task characterized by a different speed of execution upon each type of processor). We present an algorithm for this problem with a timecomplexity of O(n·m), where n is the number of tasks and m is the number of processors. It offers the guarantee that if a task set can be scheduled by any non-migrative algorithm to meet deadlines then our algorithm meets deadlines as well if given processors twice as fast. Although this result is proven for only a restricted heterogeneous multiprocessor, we consider it significant for being the first realtime scheduling algorithm to use a low-complexity binpacking approach to schedule tasks on a heterogeneous multiprocessor with provably good performance.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Consider the problem of designing an algorithm with a high utilisation bound for scheduling sporadic tasks with implicit deadlines on identical processors. A task is characterised by its minimum interarrival time and its execution time. Task preemption and migration is permitted. Still, low preemption and migration counts are desirable. We formulate an algorithm with a utilisation bound no less than 66.¯6%, characterised by worst-case preemption counts comparing favorably against the state-of-the-art.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

6th International Real-Time Scheduling Open Problems Seminar (RTSOPS 2015), Lund, Sweden.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

El avance en la potencia de cómputo en nuestros días viene dado por la paralelización del procesamiento, dadas las características que disponen las nuevas arquitecturas de hardware. Utilizar convenientemente este hardware impacta en la aceleración de los algoritmos en ejecución (programas). Sin embargo, convertir de forma adecuada el algoritmo en su forma paralela es complejo, y a su vez, esta forma, es específica para cada tipo de hardware paralelo. En la actualidad los procesadores de uso general más comunes son los multicore, procesadores paralelos, también denominados Symmetric Multi-Processors (SMP). Hoy en día es difícil hallar un procesador para computadoras de escritorio que no tengan algún tipo de paralelismo del caracterizado por los SMP, siendo la tendencia de desarrollo, que cada día nos encontremos con procesadores con mayor numero de cores disponibles. Por otro lado, los dispositivos de procesamiento de video (Graphics Processor Units - GPU), a su vez, han ido desarrollando su potencia de cómputo por medio de disponer de múltiples unidades de procesamiento dentro de su composición electrónica, a tal punto que en la actualidad no es difícil encontrar placas de GPU con capacidad de 200 a 400 hilos de procesamiento paralelo. Estos procesadores son muy veloces y específicos para la tarea que fueron desarrollados, principalmente el procesamiento de video. Sin embargo, como este tipo de procesadores tiene muchos puntos en común con el procesamiento científico, estos dispositivos han ido reorientándose con el nombre de General Processing Graphics Processor Unit (GPGPU). A diferencia de los procesadores SMP señalados anteriormente, las GPGPU no son de propósito general y tienen sus complicaciones para uso general debido al límite en la cantidad de memoria que cada placa puede disponer y al tipo de procesamiento paralelo que debe realizar para poder ser productiva su utilización. Los dispositivos de lógica programable, FPGA, son dispositivos capaces de realizar grandes cantidades de operaciones en paralelo, por lo que pueden ser usados para la implementación de algoritmos específicos, aprovechando el paralelismo que estas ofrecen. Su inconveniente viene derivado de la complejidad para la programación y el testing del algoritmo instanciado en el dispositivo. Ante esta diversidad de procesadores paralelos, el objetivo de nuestro trabajo está enfocado en analizar las características especificas que cada uno de estos tienen, y su impacto en la estructura de los algoritmos para que su utilización pueda obtener rendimientos de procesamiento acordes al número de recursos utilizados y combinarlos de forma tal que su complementación sea benéfica. Específicamente, partiendo desde las características del hardware, determinar las propiedades que el algoritmo paralelo debe tener para poder ser acelerado. Las características de los algoritmos paralelos determinará a su vez cuál de estos nuevos tipos de hardware son los mas adecuados para su instanciación. En particular serán tenidos en cuenta el nivel de dependencia de datos, la necesidad de realizar sincronizaciones durante el procesamiento paralelo, el tamaño de datos a procesar y la complejidad de la programación paralela en cada tipo de hardware. Today´s advances in high-performance computing are driven by parallel processing capabilities of available hardware architectures. These architectures enable the acceleration of algorithms when thes ealgorithms are properly parallelized and exploit the specific processing power of the underneath architecture. Most current processors are targeted for general pruposes and integrate several processor cores on a single chip, resulting in what is known as a Symmetric Multiprocessing (SMP) unit. Nowadays even desktop computers make use of multicore processors. Meanwhile, the industry trend is to increase the number of integrated rocessor cores as technology matures. On the other hand, Graphics Processor Units (GPU), originally designed to handle only video processing, have emerged as interesting alternatives to implement algorithm acceleration. Current available GPUs are able to implement from 200 to 400 threads for parallel processing. Scientific computing can be implemented in these hardware thanks to the programability of new GPUs that have been denoted as General Processing Graphics Processor Units (GPGPU).However, GPGPU offer little memory with respect to that available for general-prupose processors; thus, the implementation of algorithms need to be addressed carefully. Finally, Field Programmable Gate Arrays (FPGA) are programmable devices which can implement hardware logic with low latency, high parallelism and deep pipelines. Thes devices can be used to implement specific algorithms that need to run at very high speeds. However, their programmability is harder that software approaches and debugging is typically time-consuming. In this context where several alternatives for speeding up algorithms are available, our work aims at determining the main features of thes architectures and developing the required know-how to accelerate algorithm execution on them. We look at identifying those algorithms that may fit better on a given architecture as well as compleme

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The modern computer systems that are in use nowadays are mostly processor-dominant, which means that their memory is treated as a slave element that has one major task – to serve execution units data requirements. This organization is based on the classical Von Neumann's computer model, proposed seven decades ago in the 1950ties. This model suffers from a substantial processor-memory bottleneck, because of the huge disparity between the processor and memory working speeds. In order to solve this problem, in this paper we propose a novel architecture and organization of processors and computers that attempts to provide stronger match between the processing and memory elements in the system. The proposed model utilizes a memory-centric architecture, wherein the execution hardware is added to the memory code blocks, allowing them to perform instructions scheduling and execution, management of data requests and responses, and direct communication with the data memory blocks without using registers. This organization allows concurrent execution of all threads, processes or program segments that fit in the memory at a given time. Therefore, in this paper we describe several possibilities for organizing the proposed memory-centric system with multiple data and logicmemory merged blocks, by utilizing a high-speed interconnection switching network.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Bank switching in embedded processors having partitioned memory architecture results in code size as well as run time overhead. An algorithm and its application to assist the compiler in eliminating the redundant bank switching codes introduced and deciding the optimum data allocation to banked memory is presented in this work. A relation matrix formed for the memory bank state transition corresponding to each bank selection instruction is used for the detection of redundant codes. Data allocation to memory is done by considering all possible permutation of memory banks and combination of data. The compiler output corresponding to each data mapping scheme is subjected to a static machine code analysis which identifies the one with minimum number of bank switching codes. Even though the method is compiler independent, the algorithm utilizes certain architectural features of the target processor. A prototype based on PIC 16F87X microcontrollers is described. This method scales well into larger number of memory blocks and other architectures so that high performance compilers can integrate this technique for efficient code generation. The technique is illustrated with an example

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Scheduling tasks to efficiently use the available processor resources is crucial to minimizing the runtime of applications on shared-memory parallel processors. One factor that contributes to poor processor utilization is the idle time caused by long latency operations, such as remote memory references or processor synchronization operations. One way of tolerating this latency is to use a processor with multiple hardware contexts that can rapidly switch to executing another thread of computation whenever a long latency operation occurs, thus increasing processor utilization by overlapping computation with communication. Although multiple contexts are effective for tolerating latency, this effectiveness can be limited by memory and network bandwidth, by cache interference effects among the multiple contexts, and by critical tasks sharing processor resources with less critical tasks. This thesis presents techniques that increase the effectiveness of multiple contexts by intelligently scheduling threads to make more efficient use of processor pipeline, bandwidth, and cache resources. This thesis proposes thread prioritization as a fundamental mechanism for directing the thread schedule on a multiple-context processor. A priority is assigned to each thread either statically or dynamically and is used by the thread scheduler to decide which threads to load in the contexts, and to decide which context to switch to on a context switch. We develop a multiple-context model that integrates both cache and network effects, and shows how thread prioritization can both maintain high processor utilization, and limit increases in critical path runtime caused by multithreading. The model also shows that in order to be effective in bandwidth limited applications, thread prioritization must be extended to prioritize memory requests. We show how simple hardware can prioritize the running of threads in the multiple contexts, and the issuing of requests to both the local memory and the network. Simulation experiments show how thread prioritization is used in a variety of applications. Thread prioritization can improve the performance of synchronization primitives by minimizing the number of processor cycles wasted in spinning and devoting more cycles to critical threads. Thread prioritization can be used in combination with other techniques to improve cache performance and minimize cache interference between different working sets in the cache. For applications that are critical path limited, thread prioritization can improve performance by allowing processor resources to be devoted preferentially to critical threads. These experimental results show that thread prioritization is a mechanism that can be used to implement a wide range of scheduling policies.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A patent describing a computer architecture which implements a Perspex instruction.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A processing system comprises a plurality of processors (12) and communication means (20) arranged to carry messages between the processors, wherein each of the processors (12) has an operating instruction memory field (32, 34, 36) arranged to hold stored operating instructions including a re-routing target address. Each processor is arranged to receive a message (38) including operating instructions including a target address. On receipt of the message, each processor is arranged to: check the target address in the message to determine whether it corresponds to an address associated with the processor; if the target address in the message does correspond to an address associated with the processor, to check the operating instructions in the message to determine whether the message is to be re-routed; and, if the message is to be re-routed, to replace operating instructions within the message with the stored operating instructions, and place the message on the communication means for delivery to the re-routing target address.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper presents the use of a multiprocessor architecture for the performance improvement of tomographic image reconstruction. Image reconstruction in computed tomography (CT) is an intensive task for single-processor systems. We investigate the filtered image reconstruction suitability based on DSPs organized for parallel processing and its comparison with the Message Passing Interface (MPI) library. The experimental results show that the speedups observed for both platforms were increased in the same direction of the image resolution. In addition, the execution time to communication time ratios (Rt/Rc) as a function of the sample size have shown a narrow variation for the DSP platform in comparison with the MPI platform, which indicates its better performance for parallel image reconstruction.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This work presents the development of an IEEE 1451.2 protocol controller based on a low-cost FPGA that is directly connected to the parallel port of a conventional personal computer. In this manner it is possible to implement a Network Capable Application Processor (NCAP) based on a personal computer, without parallel port modifications. This approach allows supporting the ten signal lines of the 10-wire IEEE 1451.2 Transducer Independent Interface (TII), that connects the network processor to the Smart Transducer Interface Module (STIM) also defined in the IEEE 1451.2 standard. The protocol controller is connected to the STIM through the TII's physical interface, enabling the portability of the application at the transducer and network processor level. The protocol controller architecture was fully developed in VHDL language and we have projected a special prototype configured in a general-purpose programmable logic device. We have implemented two versions of the protocol controller, which is based on IEEE 1451 standard, and we have obtained results using simulation and experimental tests. (c) 2008 Elsevier B.V. All rights reserved.