27 results for Parallel Work Experience, Practise, Architecture


Relevance:

30.00% 30.00%

Publisher:

Abstract:

Workstation clusters equipped with high-performance interconnects that have programmable network processors offer interesting opportunities to enhance the performance of the parallel applications run on them. In this paper, we propose schemes where certain application-level processing in parallel database query execution is performed on the network processor. We evaluate the performance of TPC-H queries executing on a high-end cluster where all tuple processing is done on the host processor, using a timed Petri net model, and find that tuple processing costs on the host processor dominate the execution time. These results are validated using a small cluster. We therefore propose four schemes where certain tuple processing activity is offloaded to the network processor. The first two schemes offload the tuple splitting activity (the computation to identify the node on which to process the tuples), resulting in an execution time speedup of 1.09 relative to the base scheme, but with the I/O bus becoming the bottleneck resource. In the third scheme, in addition to offloading tuple processing activity, the disk and network interface are combined to avoid the I/O bus bottleneck, which results in speedups of up to 1.16, but with high host processor utilization. Our fourth scheme, where the network processor also performs a part of the join operation along with the host processor, gives a speedup of 1.47 along with balanced system resource utilization. Further, we observe that the proposed schemes perform equally well even in a scaled architecture, i.e., when the number of processors is increased from 2 to 64.
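The tuple splitting step mentioned in this abstract, deciding which node should process each tuple, can be illustrated with a minimal sketch. The abstract does not state the partitioning function, so the modulo-on-integer-key rule below, along with the names `split_tuples`, `key_index`, and `num_nodes`, is an assumption made for illustration only.

```python
from collections import defaultdict


def split_tuples(tuples, key_index, num_nodes):
    """Group tuples by destination node (hypothetical rule: join key modulo node count)."""
    buckets = defaultdict(list)
    for row in tuples:
        # Assumption: the join key is a non-negative integer.
        node = row[key_index] % num_nodes
        buckets[node].append(row)
    return dict(buckets)


# Four tuples split across a 4-node cluster on the first column.
rows = [(1, "a"), (2, "b"), (5, "c"), (6, "d")]
partitions = split_tuples(rows, key_index=0, num_nodes=4)
print(partitions)
```

Because this computation touches every tuple but needs no query state beyond the key, it is a plausible candidate for offloading, which is consistent with the motivation given in the abstract.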

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Today's SoCs are complex designs with multiple embedded processors, memory subsystems, and application-specific peripherals. The memory architecture of embedded SoCs strongly influences the power and performance of the entire system. Further, the memory subsystem constitutes a major part (typically up to 70%) of the silicon area of a current-day SoC. In this article, we address on-chip memory architecture exploration for DSP processors whose memory is organized as multiple memory banks, where banks can be single- or dual-ported with non-uniform bank sizes. We propose two different methods for physical memory architecture exploration and identify the strengths and applicability of these methods in a systematic way. Both methods address memory architecture exploration for a given target application by considering the application's data access characteristics, and generate a set of Pareto-optimal design points that are interesting from a power, performance, and VLSI area perspective. To the best of our knowledge, this is the first comprehensive work on memory space exploration at the physical memory level that integrates data layout and memory exploration to address the system objectives from both the hardware design and the application software development perspectives. Further, we propose an automatic framework that explores the design space, identifying hundreds of Pareto-optimal design points within a few hours of running on a standard desktop configuration.
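As a reminder of what "Pareto-optimal design points" means here, the sketch below filters a set of candidate designs down to the non-dominated ones. It is not the exploration framework described in the abstract. The objective tuples (power, cycles, area), the sample values, and the function names are hypothetical, and all objectives are assumed to be minimised.

```python
def dominates(a, b):
    """True if design a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the designs that no other design dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (power, cycles, area) triples; lower is better for each.
designs = [(10, 5, 3), (8, 6, 3), (12, 5, 4), (8, 6, 2)]
front = pareto_front(designs)
print(front)
```

This brute-force filter is quadratic in the number of candidates, which is fine for illustration but would likely need a smarter search for a large design space.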

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This paper presents a decentralized/peer-to-peer architecture-based parallel version of the vector evaluated particle swarm optimization (VEPSO) algorithm for multi-objective design optimization of laminated composite plates using the message passing interface (MPI). The design optimization of laminated composite plates, being a combinatorially explosive constrained non-linear optimization problem (CNOP) with many design variables and a vast solution space, warrants the use of non-parametric and heuristic optimization algorithms like PSO. Optimization requires minimizing both the weight and the cost of these composite plates simultaneously, which renders the problem multi-objective. Hence VEPSO, a multi-objective variant of the PSO algorithm, is used. Despite the use of such a heuristic, the application problem, being computationally intensive, suffers from long execution times due to sequential computation. Hence, a parallel version of the PSO algorithm for the problem has been developed to run on several nodes of an IBM P720 cluster. The proposed parallel algorithm, using MPI's collective communication directives, establishes a peer-to-peer relationship between the constituent parallel processes, deviating from the more common master-slave approach, and reduces computation time by a factor of up to 10. Finally, we show the effectiveness of the proposed parallel algorithm by comparing it with a serial implementation of VEPSO and a parallel implementation of the vector evaluated genetic algorithm (VEGA) for the same design problem. (c) 2012 Elsevier Ltd. All rights reserved.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The contour tree is a topological abstraction of a scalar field that captures evolution in level set connectivity. It is an effective representation for visual exploration and analysis of scientific data. We describe a work-efficient, output-sensitive, and scalable parallel algorithm for computing the contour tree of a scalar field defined on a domain that is represented using either an unstructured mesh or a structured grid. A hybrid implementation of the algorithm using the GPU and multi-core CPU can compute the contour tree of an input containing 16 million vertices in less than ten seconds, with a speedup factor of up to 13. Experiments based on an implementation in a multi-core CPU environment show near-linear speedup for large data sets.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This paper presents a study of the nature of the degrees-of-freedom of spatial manipulators based on the concept of partition of degrees-of-freedom. In particular, the partitioning of degrees-of-freedom is studied in five lower-mobility spatial parallel manipulators possessing different combinations of degrees-of-freedom. An extension of the existing theory is introduced so as to analyse the nature of the gained degree(s)-of-freedom at a gain-type singularity. The gain of one- and two-degrees-of-freedom is analysed in several well-studied, as well as newly developed manipulators. The formulations also present a basis for the analysis of the velocity kinematics of manipulators of any architecture. (C) 2013 Elsevier Ltd. All rights reserved.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This paper presents a radix-4^3 based FFT architecture suitable for OFDM-based WLAN applications. The radix-4^3 parallel unrolled architecture presented here uses a radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. A 64-point FFT processor based on the proposed architecture has been implemented in a UMC 130 nm 1P8M CMOS process with a maximum clock frequency of 100 MHz and an area of 0.83 mm^2. The proposed processor provides a throughput of four times the clock rate and can finish one 64-point FFT computation in 16 clock cycles. For IEEE 802.11a/g WLAN, the processor needs to be operated at a clock rate of 5 MHz with a power consumption of 2.27 mW, which is 27% less than the previously reported low-power implementations.
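The idea of a radix-4 butterfly that takes four inputs and selectively produces one of the four outputs can be sketched in software as a 4-point DFT evaluated for a single output index. This is only an arithmetic illustration of the standard radix-4 butterfly, not the hardware design from the abstract. The function name `radix4_select` and the sample inputs are assumptions.

```python
# Twiddle factors of a 4-point DFT: (-j)^m for m = 0..3, stored exactly.
W4 = [1, -1j, -1, 1j]


def radix4_select(x, k):
    """Return output k (0..3) of a radix-4 butterfly: X[k] = sum_n x[n] * (-j)^(n*k)."""
    if len(x) != 4 or k not in range(4):
        raise ValueError("expected four inputs and an output index in 0..3")
    return sum(x[n] * W4[(n * k) % 4] for n in range(4))


inputs = [1, 2, 3, 4]
outputs = [radix4_select(inputs, k) for k in range(4)]
print(outputs)

# Four samples per clock cycle over 64 points matches the 16 cycles quoted above.
cycles_for_64_points = 64 // 4
```

Since the twiddle factors are only 1, -1, j, and -j, each selected output needs additions, subtractions, and real/imaginary swaps but no general multiplications, which is the usual argument for radix-4 in hardware.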

Relevance:

30.00% 30.00%

Publisher:

Abstract:

In this paper we propose a fully parallel 64K-point radix-4^4 FFT processor. The radix-4^4 parallel unrolled architecture uses a novel radix-4 butterfly unit which takes all four inputs in parallel and can selectively produce one out of the four outputs. The radix-4^4 block can take all 256 inputs in parallel and can use the select control signals to generate one out of the 256 outputs. The resultant 64K-point FFT processor shows a significant reduction in intermediate memory, but with increased hardware complexity. Compared to the state-of-the-art implementation [5], our architecture shows reduced latency with comparable throughput and area. The 64K-point FFT architecture was synthesized using a 130 nm CMOS technology, which resulted in a throughput of 1.4 GSPS and a latency of 47.7 µs with a maximum clock frequency of 350 MHz. When compared to [5], the latency is reduced by 303 µs with a 50.8% reduction in area.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Prediction of queue waiting times of jobs submitted to production parallel batch systems is important for providing overall estimates to users and can also help meta-schedulers make scheduling decisions. In this work, we have developed a framework for predicting ranges of queue waiting times for jobs by employing multi-class classification of similar jobs in history. Our hierarchical prediction strategy first predicts the point wait time of a job using a dynamic k-nearest neighbor (kNN) method. It then performs a multi-class classification using support vector machines (SVMs) among all the classes of the jobs. The probabilities given by the SVM for the class predicted using kNN and its neighboring classes are used to provide a set of ranges of predicted wait times with probabilities. We have used these predictions and probabilities in a meta-scheduling strategy that distributes jobs to different queues/sites in a multi-queue/grid environment to minimize the wait times of the jobs. Experiments with different production supercomputer job traces show that our prediction strategies can give correct predictions for about 77-87% of the jobs, and also result in about 12% improved accuracy when compared to the next best existing method. Experiments with our meta-scheduling strategy using different production and synthetic job traces for various system sizes, partitioning schemes, and workloads show that the meta-scheduling strategy gives much improved performance when compared to existing scheduling policies, reducing the overall average queue waiting times of the jobs by about 47%.
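The first stage of the strategy above, a point prediction of wait time from similar past jobs, can be illustrated with a plain k-nearest-neighbor average. This is a simplification: the abstract's method is a dynamic kNN followed by SVM classification, neither of which is reproduced here. The feature choice (requested processors, requested hours), the sample history, and the function name `knn_wait_time` are all hypothetical.

```python
import math


def knn_wait_time(history, job, k=3):
    """Predict a job's wait time as the mean wait of its k nearest past jobs.

    history: list of (features, wait_time) pairs; job: a feature tuple.
    Uses unscaled Euclidean distance, which is an assumption of this sketch.
    """
    nearest = sorted(history, key=lambda h: math.dist(h[0], job))[:k]
    return sum(wait for _, wait in nearest) / len(nearest)


# Hypothetical history: ((requested processors, requested hours), wait in minutes).
history = [
    ((16, 1.0), 10.0),
    ((32, 2.0), 30.0),
    ((64, 4.0), 90.0),
    ((16, 2.0), 20.0),
]
predicted = knn_wait_time(history, (16, 1.5), k=2)
print(predicted)
```

In practice the features would need scaling so that processor counts do not swamp the time dimension; that detail is omitted here.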

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy, even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time-consuming, and error-prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The crystal structure of a tripeptide Boc-Leu-Val-Ac(12)c-OMe (1) is determined, which incorporates a bulky 1-aminocyclododecane-1-carboxylic acid (Ac(12)c) side chain. The peptide adopts a semi-extended backbone conformation for Leu and Val residues, while the backbone torsion angles of the Cα,α-dialkylated residue Ac(12)c are in the helical region of the Ramachandran map. The molecular packing of 1 revealed a unique supramolecular twisted parallel β-sheet coiling into a helical architecture in crystals, with the bulky hydrophobic Ac(12)c side chains projecting outward from the helical column. This arrangement resembles the packing of peptide helices in crystal structures. Although short oligopeptides often assemble as parallel or anti-parallel β-sheets in crystals, twisted or helical β-sheet formation has been observed in a few examples of dipeptide crystal structures. Peptide 1 presents the first example of a tripeptide showing twisted β-sheet assembly in crystals. Copyright (c) 2016 European Peptide Society and John Wiley & Sons, Ltd.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The effects of contact architecture, graphene defect density, and metal-semiconductor work function difference on the resistivity of metal-graphene contacts have been investigated. An architecture with metal on the bottom of graphene is found to yield resistivities that are lower, by a factor of four, and more consistent than those with metal on top of graphene. Growth defects in the graphene film were found to further reduce resistivity by a factor of two. Using a combination of method and metal, the contact resistivity of graphene has been decreased by a factor of 10 to 1200 ± 250 Ω µm using palladium as the contact metal. While the improved consistency is due to the metal being able to contact uncontaminated graphene in the metal-on-the-bottom architecture, the lower contact resistivities observed on defective graphene with the same metal are attributed to the increased number of modes of quantum transport in the channel.