4 results for GPGPU
at Indian Institute of Science - Bangalore - India
Abstract:
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources. However, we find that CUDA programs do not actually scale to utilize all available resources: on average, over 30% of resources go unused for the programs of the Parboil2 suite that we used in our work. Current GPUs therefore allow concurrent execution of kernels to improve utilization. In this work, we study concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs. On two-program workloads from the Parboil2 benchmark suite, we find that concurrent execution is often no better than serialized execution. We identify the lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels, which permit fine-grained control over their resource usage. We then propose several elastic-kernel-aware concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads when compared to the current CUDA concurrency implementation.
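The idea of an elastic kernel, decoupling the grid a program asks for from the grid that is actually launched, can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the authors' transformation: the function names, the doubling "kernel body", and the strided index mapping are all hypothetical choices made for illustration.

```python
# CPU-side simulation of decoupling a logical grid from a physical grid.
# All names and the strided mapping are illustrative assumptions, not the
# transformation described in the abstract.

def run_original(data, logical_blocks, threads_per_block):
    """Reference launch: one physical thread per logical thread."""
    out = [0] * len(data)
    for block in range(logical_blocks):
        for thread in range(threads_per_block):
            i = block * threads_per_block + thread
            if i < len(data):  # usual bounds guard
                out[i] = 2 * data[i]
    return out


def run_elastic(data, logical_blocks, threads_per_block,
                physical_blocks, physical_threads):
    """Elastic launch: a smaller physical grid walks the whole logical grid."""
    out = [0] * len(data)
    logical_total = logical_blocks * threads_per_block
    physical_total = physical_blocks * physical_threads
    for pb in range(physical_blocks):
        for pt in range(physical_threads):
            physical_id = pb * physical_threads + pt
            # Each physical thread handles every physical_total-th logical thread.
            for logical_id in range(physical_id, logical_total, physical_total):
                # Recover the logical block and thread indices the original
                # kernel body would have seen.
                block, thread = divmod(logical_id, threads_per_block)
                i = block * threads_per_block + thread
                if i < len(data):
                    out[i] = 2 * data[i]
    return out


data = list(range(1000))
same = run_elastic(data, 8, 128, 2, 32) == run_original(data, 8, 128)
```

In this sketch the physical grid (2 blocks of 32 threads) is chosen independently of the logical grid (8 blocks of 128 threads), which is the kind of knob a concurrency policy could turn to limit how many resources one kernel occupies.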
Abstract:
A series of 2′-5′-oligoguanylic acids are prepared by reacting G(cyclic)p with takadiastase T1 ribonuclease and separating the products chromatographically. The 3′-5′-oligoguanylic acids are obtained by separating the products of alkaline degradation of 3′-5′-poly(G). The optical rotatory dispersion and hypochromism of both 2′-5′- and 3′-5′-oligoguanylic acids are studied at two different pH values. The optical rotatory dispersion spectrum of 2′-5′-GpG is significantly different from that of 3′-5′-GpG. The magnitude of rotation of the long-wavelength peak of 2′-5′-GpG is larger than that of 3′-5′-GpG. This finding contradicts the explanation that the extra stability and more intense circular dichroism band of other 3′-5′-dinucleoside monophosphates are due to H-bond formation between 2′-OH and either the base or the phosphate oxygen. The end phosphate group has a marked effect on the spectrum of GpG between 230 and 250 mμ. In addition, the optical rotatory dispersion spectra of 2′-5′ exhibit strong pH, temperature, and solvent dependence between 230 and 250 mμ. ΔH and ΔS for the order ⇌ disorder transition are estimated to be 9.7 kcal/mole and 35.2 eu, respectively. The optical rotatory dispersion spectra of guanine-rich oligoribonucleotides, GpGpC, GpGpU, GpGpGpC, and GpGpGpU, are compared to the optical rotatory dispersion calculated from the semiempirical expression of Cantor and Tinoco, using the measured optical rotatory dispersion of dimers. Contrary to previous studies, the agreement is found to be not at all satisfactory. However, the optical rotatory dispersion of 3′-5′-GpGpGpC and GpGpGpU can be estimated from the semiempirical expression if a next-nearest interaction parameter is introduced empirically. Such an interaction parameter can be calculated from the measured properties of trinucleotide sequences like GpGpG, GpGpC, and GpGpU, assuming that only the nearest-neighbor interaction is important. The optical rotatory dispersion of single-stranded poly(G) is also predicted. The importance of syn-anti equilibrium and next-nearest-neighbor interaction in oligoguanylic acids is suggested as a probable explanation.
Abstract:
This paper presents a GPU implementation of normalized cuts for the road extraction problem using panchromatic satellite imagery. The roads are extracted in three stages: pre-processing, image segmentation, and post-processing. Initially, the image is pre-processed to improve the tolerance by reducing the clutter (which mostly represents buildings, vegetation, and fallow regions). The road regions are then extracted using the normalized cuts algorithm. The normalized cuts algorithm is a graph-based partitioning approach whose focus lies in extracting the global impression (perceptual grouping) of an image rather than local features. For the segmented image, post-processing is carried out using the morphological operations of erosion and dilation. Finally, the road-extracted image is overlaid on the original image. Here, a GPGPU (General Purpose Graphical Processing Unit) approach has been adopted to implement the same algorithm on the GPU for fast processing. A performance comparison of the proposed GPU implementation of the normalized cuts algorithm with the earlier algorithm (CPU implementation) is presented. From the results, we conclude that there is a computational improvement in terms of time as the size of the image increases for the proposed GPU implementation of normalized cuts. A qualitative and quantitative assessment of the segmentation results is also presented.
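To make the partitioning criterion concrete, the following sketch evaluates the standard normalized cut value, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), on a tiny hand-made graph and finds the best bipartition by brute force. The six-node weight matrix and the function names are assumptions made for illustration; a real implementation, including the one the abstract describes, would build the graph from image pixels and would not search exhaustively.

```python
from itertools import combinations

# Toy illustration of the normalized cut criterion. The graph and the
# exhaustive search are illustrative assumptions, not the paper's method.


def ncut_value(weights, group_a):
    """Return Ncut(A, B) for a symmetric weight matrix and node set A."""
    n = len(weights)
    a = set(group_a)
    b = set(range(n)) - a
    cut = sum(weights[i][j] for i in a for j in b)
    # assoc(X, V): total weight from nodes in X to all nodes in the graph.
    # Assumes every group has at least one edge, so neither sum is zero.
    assoc_a = sum(weights[i][j] for i in a for j in range(n))
    assoc_b = sum(weights[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b


def best_bipartition(weights):
    """Try every split (feasible only for tiny graphs) and keep the smallest Ncut."""
    n = len(weights)
    best = None
    for size in range(1, n // 2 + 1):
        for group in combinations(range(n), size):
            value = ncut_value(weights, group)
            if best is None or value < best[0]:
                best = (value, set(group))
    return best


def toy_graph():
    """Two tightly connected triangles joined by one weak edge (2-3)."""
    w = [[0.0] * 6 for _ in range(6)]
    strong = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
    for i, j in strong:
        w[i][j] = w[j][i] = 1.0
    w[2][3] = w[3][2] = 0.1
    return w


value, group = best_bipartition(toy_graph())
```

On this toy graph the smallest value is obtained by cutting the single weak edge, which separates the two triangles; the criterion favors balanced, weakly connected groups over splitting off a single node.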
Abstract:
Branch divergence is a very common performance problem in GPGPU programs, in which the execution of diverging branches is serialized so that only one control flow path executes at a time. The existing hardware mechanism, which reconverges threads using a stack, causes duplicate execution of code for unstructured control flow graphs (CFGs). The stack mechanism also cannot effectively utilize the available parallelism among diverging branches. Further, the amount of nested divergence allowed is limited by the depth of the branch divergence stack. In this paper we propose a simple and elegant transformation to handle all of the above-mentioned problems. The transformation converts an unstructured CFG to a structured CFG without duplicating user code. It incurs only a linear increase in the number of basic blocks and in the number of instructions. Our solution linearizes the CFG using a predicate variable. This mechanism reconverges the divergent threads as early as possible. It also reduces the depth of the reconvergence stack. The available parallelism in nested branches can be effectively extracted by scheduling the basic blocks to reduce the effect of stalls due to memory accesses. It can also increase the execution efficiency of nested loops with different trip counts for different threads. We implemented the proposed transformation at the PTX level using the Ocelot compiler infrastructure. We evaluated the technique using various benchmarks to show that it can be effective in handling the performance problem due to divergence in unstructured CFGs.
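A rough way to picture linearizing a CFG with a predicate variable is the simulation below. It is a sketch under assumptions, not the transformation implemented in Ocelot: the five-block CFG (the shape produced by a short-circuit `a or b` condition), the per-thread `pred` list, and all function names are hypothetical, and the blocks are assumed to be listed in an order where every edge goes forward, so a single pass suffices. Loops would need an outer loop around the pass.

```python
# Simulated linearization of a small unstructured CFG with a per-thread
# predicate. The CFG and all names are illustrative assumptions.

EXIT = 5  # sentinel "block id" meaning the thread has finished


def block0(t):  # test a
    return 2 if t["a"] else 1


def block1(t):  # test b (reached only when a is false)
    return 2 if t["b"] else 3


def block2(t):  # shared "then" body, entered from block0 or block1
    t["out"] = "X"
    return 4


def block3(t):  # "else" body
    t["out"] = "Y"
    return 4


def block4(t):  # join block
    t["done"] = True
    return EXIT


BLOCKS = [block0, block1, block2, block3, block4]


def run_reference(thread):
    """Follow the original CFG edges for one thread."""
    pc = 0
    while pc != EXIT:
        pc = BLOCKS[pc](thread)
    return thread


def run_linearized(threads):
    """Visit blocks in a fixed linear order; pred[i] names thread i's next block."""
    pred = [0] * len(threads)
    issued = []  # blocks that had at least one active thread
    for block_id, body in enumerate(BLOCKS):
        active = [i for i, p in enumerate(pred) if p == block_id]
        if active:
            issued.append(block_id)
        for i in active:  # all active threads run the block together
            pred[i] = body(threads[i])
    return issued


inputs = [{"a": a, "b": b} for a in (False, True) for b in (False, True)]
expected = [run_reference(dict(t)) for t in inputs]
warp = [dict(t) for t in inputs]
issued = run_linearized(warp)
```

In this simulation every block is issued once for the whole group of threads, and threads that took different paths to `block2` execute it together rather than in separate passes.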