48 resultados para Single Graphics Processing Units
em QUB Research Portal - Research Directory and Institutional Repository for Queen's University Belfast
Resumo:
Background: Modern cancer research often involves large datasets and the use of sophisticated statistical techniques. Together these add a heavy computational load to the analysis, which is often coupled with issues surrounding data accessibility. Connectivity mapping is an advanced bioinformatic and computational technique dedicated to therapeutics discovery and drug re-purposing around differential gene expression analysis. On a normal desktop PC, it is common for the connectivity mapping task with a single gene signature to take >2h to complete using sscMap, a popular Java application that runs on standard CPUs (Central Processing Units). Here, we describe new software, cudaMap, which has been implemented using CUDA C/C++ to harness the computational power of NVIDIA GPUs (Graphics Processing Units) to greatly reduce processing times for connectivity mapping.
Results: cudaMap can identify candidate therapeutics from the same signature in just over thirty seconds when using an NVIDIA Tesla C2050 GPU. Results from the analysis of multiple gene signatures, which would previously have taken several days, can now be obtained in as little as 10 minutes, greatly facilitating candidate therapeutics discovery with high throughput. We are able to demonstrate dramatic speed differentials between GPU assisted performance and CPU executions as the computational load increases for high accuracy evaluation of statistical significance.
Conclusion: Emerging 'omics' technologies are constantly increasing the volume of data and information to be processed in all areas of biomedical research. Embracing the multicore functionality of GPUs represents a major avenue of local accelerated computing. cudaMap will make a strong contribution in the discovery of candidate therapeutics by enabling speedy execution of heavy duty connectivity mapping tasks, which are increasingly required in modern cancer research. cudaMap is open source and can be freely downloaded from http://purl.oclc.org/NET/cudaMap.
Resumo:
This paper investigates sub-integer implementations of the adaptive Gaussian mixture model (GMM) for background/foreground segmentation to allow the deployment of the method on low cost/low power processors that lack Floating Point Unit (FPU). We propose two novel integer computer arithmetic techniques to update Gaussian parameters. Specifically, the mean value and the variance of each Gaussian are updated by a redefined and generalised "round'' operation that emulates the original updating rules for a large set of learning rates. Weights are represented by counters that are updated following stochastic rules to allow a wider range of learning rates and the weight trend is approximated by a line or a staircase. We demonstrate that the memory footprint and computational cost of GMM are significantly reduced, without significantly affecting the performance of background/foreground segmentation.
Resumo:
Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs using simulations that can take weeks to months to complete. For example, designers of special purpose chips need to explore parameters such as the optimal bitwidth and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems. Currently, high-performance computing offers a wide set of acceleration options, that range from multicore CPUs to graphics processing units (GPUs) and FPGAs. Depending on the simulation requirements, the ideal architecture to use can vary. In this paper we propose a new design flow based on OpenCL, a unified multiplatform programming model, which accelerates LDPC decoding simulations, thereby significantly reducing architectural exploration and design time. OpenCL-based parallel kernels are used without modifications or code tuning on multicore CPUs, GPUs and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL for mapping the simulations into FPGAs. To the best of our knowledge, this is the first time that a single, unmodified OpenCL code is used to target those three different platforms. We show that, depending on the design parameters to be explored in the simulation, on the dimension and phase of the design, the GPU or the FPGA may suit different purposes more conveniently, providing different acceleration factors. For example, although simulations can typically execute more than 3x faster on FPGAs than on GPUs, the overhead of circuit synthesis often outweighs the benefits of FPGA-accelerated execution.
Resumo:
The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design space exploration for optimization, followed by design verification. Designers of special purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct programming paradigms. In this context, we evaluate the concept of retargeting a single OpenCL program to multiple platforms, thereby significantly reducing design time. A single OpenCL-based parallel kernel is used without modifications or code tuning on multicore CPUs, GPUs, and FPGAs. We use SOpenCL (Silicon to OpenCL), a tool that automatically converts OpenCL kernels to RTL in order to introduce FPGAs as a potential platform to efficiently execute simulations coded in OpenCL. We use LDPC decoding simulations as a case study. Experimental results were obtained by testing a variety of regular and irregular LDPC codes that range from short/medium (e.g., 8,000 bit) to long length (e.g., 64,800 bit) DVB-S2 codes. We observe that, depending on the design parameters to be simulated, on the dimension and phase of the design, the GPU or FPGA may suit different purposes more conveniently, thus providing different acceleration factors over conventional multicore CPUs.
Resumo:
How can GPU acceleration be obtained as a service in a cluster? This question has become increasingly significant due to the inefficiency of installing GPUs on all nodes of a cluster. The research reported in this paper is motivated to address the above question by employing rCUDA (remote CUDA), a framework that facilitates Acceleration-as-a-Service (AaaS), such that the nodes of a cluster can request the acceleration of a set of remote GPUs on demand. The rCUDA framework exploits virtualisation and ensures that multiple nodes can share the same GPU. In this paper we test the feasibility of the rCUDA framework on a real-world application employed in the financial risk industry that can benefit from AaaS in the production setting. The results confirm the feasibility of rCUDA and highlight that rCUDA achieves similar performance compared to CUDA, provides consistent results, and more importantly, allows for a single application to benefit from all the GPUs available in the cluster without loosing efficiency.
Resumo:
Cloud computing technology has rapidly evolved over the last decade, offering an alternative way to store and work with large amounts of data. However data security remains an important issue particularly when using a public cloud service provider. The recent area of homomorphic cryptography allows computation on encrypted data, which would allow users to ensure data privacy on the cloud and increase the potential market for cloud computing. A significant amount of research on homomorphic cryptography appeared in the literature over the last few years; yet the performance of existing implementations of encryption schemes remains unsuitable for real time applications. One way this limitation is being addressed is through the use of graphics processing units (GPUs) and field programmable gate arrays (FPGAs) for implementations of homomorphic encryption schemes. This review presents the current state of the art in this promising new area of research and highlights the interesting remaining open problems.
Resumo:
Graphics Processing Units (GPUs) are becoming popular accelerators in modern High-Performance Computing (HPC) clusters. Installing GPUs on each node of the cluster is not efficient resulting in high costs and power consumption as well as underutilisation of the accelerator. The research reported in this paper is motivated towards the use of few physical GPUs by providing cluster nodes access to remote GPUs on-demand for a financial risk application. We hypothesise that sharing GPUs between several nodes, referred to as multi-tenancy, reduces the execution time and energy consumed by an application. Two data transfer modes between the CPU and the GPUs, namely concurrent and sequential, are explored. The key result from the experiments is that multi-tenancy with few physical GPUs using sequential data transfers lowers the execution time and the energy consumed, thereby improving the overall performance of the application.
Resumo:
In this paper, we propose a multi-camera application capable of processing high resolution images and extracting features based on colors patterns over graphic processing units (GPU). The goal is to work in real time under the uncontrolled environment of a sport event like a football match. Since football players are composed for diverse and complex color patterns, a Gaussian Mixture Models (GMM) is applied as segmentation paradigm, in order to analyze sport live images and video. Optimization techniques have also been applied over the C++ implementation using profiling tools focused on high performance. Time consuming tasks were implemented over NVIDIA's CUDA platform, and later restructured and enhanced, speeding up the whole process significantly. Our resulting code is around 4-11 times faster on a low cost GPU than a highly optimized C++ version on a central processing unit (CPU) over the same data. Real time has been obtained processing until 64 frames per second. An important conclusion derived from our study is the scalability of the application to the number of cores on the GPU. © 2011 Springer-Verlag.
Resumo:
Differential equations are often directly solvable by analytical means only in their one dimensional version. Partial differential equations are generally not solvable by analytical means in two and three dimensions, with the exception of few special cases. In all other cases, numerical approximation methods need to be utilized. One of the most popular methods is the finite element method. The main areas of focus, here, are the Poisson heat equation and the plate bending equation. The purpose of this paper is to provide a quick walkthrough of the various approaches that the authors followed in pursuit of creating optimal solvers, accelerated with the use of graphical processing units, and comparing them in terms of accuracy and time efficiency with existing or self-made non-accelerated solvers.
Resumo:
In the reinsurance market, the risks natural catastrophes pose to portfolios of properties must be quantified, so that they can be priced, and insurance offered. The analysis of such risks at a portfolio level requires a simulation of up to 800 000 trials with an average of 1000 catastrophic events per trial. This is sufficient to capture risk for a global multi-peril reinsurance portfolio covering a range of perils including earthquake, hurricane, tornado, hail, severe thunderstorm, wind storm, storm surge and riverine flooding, and wildfire. Such simulations are both computation and data intensive, making the application of high-performance computing techniques desirable.
In this paper, we explore the design and implementation of portfolio risk analysis on both multi-core and many-core computing platforms. Given a portfolio of property catastrophe insurance treaties, key risk measures, such as probable maximum loss, are computed by taking both primary and secondary uncertainties into account. Primary uncertainty is associated with whether or not an event occurs in a simulated year, while secondary uncertainty captures the uncertainty in the level of loss due to the use of simplified physical models and limitations in the available data. A combination of fast lookup structures, multi-threading and careful hand tuning of numerical operations is required to achieve good performance. Experimental results are reported for multi-core processors and systems using NVIDIA graphics processing unit and Intel Phi many-core accelerators.
Resumo:
In this paper, we propose a novel linear transmit precoding strategy for multiple-input, multiple-output (MIMO) systems employing improper signal constellations. In particular, improved zero-forcing (ZF) and minimum mean square error (MMSE) precoders are derived based on modified cost functions, and are shown to achieve a superior performance without loss of spectrum efficiency compared to the conventional linear and nonlinear precoders. The superiority of the proposed precoders over the conventional solutions are verified by both simulation and analytical results. The novel approach to precoding design is also applied to the case of an imperfect channel estimate with a known error covariance as well as to the multi-user scenario where precoding based on the nullspace of channel transmission matrix is employed to decouple multi-user channels. In both cases, the improved precoding schemes yield significant performance gain compared to the conventional counterparts.
Resumo:
Strong evidence of a single-photon tunneling effect, a direct analog of single-electron tunneling, has been obtained in the measurements of light tunneling through individual subwavelength pinholes in a gold film covered with a layer of polydiacetylene. The transmission of some pinholes reached saturation because of the optical nonlinearity of polydiacetylene at a very low light intensity of a few thousand photons per second. This result is explained theoretically in terms of a "photon blockade," similar to the Coulomb blockade phenomenon observed in single-electron tunneling experiments. Single-photon tunneling may find applications in the fields of quantum communication and information processing.
Resumo:
Protein TrwC is the conjugative relaxase responsible for DNA processing in plasmid R388 bacterial conjugation. TrwC has two catalytic tyrosines, Y18 and Y26, both able to carry out cleavage reactions using unmodified oligonucleotide substrates. Suicide substrates containing a 30-Sphosphorothiolate linkage at the cleavage site displaced TrwC reaction towards covalent adducts and thereby enabled intermediate steps in relaxase reactions to be investigated. Two distinct covalent TrwC–oligonucleotide complexes could be separated from noncovalently bound protein by SDS–PAGE. As observed by mass spectrometry, one complex contained a single, cleaved oligonucleotide bound to Y18, whereas the other contained two cleaved oligonucleotides, bound to Y18 and Y26. Analysis of the cleavage reaction using suicide substrates and Y18F or Y26F mutants showed that efficient Y26 cleavage only occurs after Y18 cleavage. Strand-transfer reactions carried out with the isolated Y18–DNA complex allowed the assignment of specific roles to each tyrosine. Thus, only Y18 was used for initiation. Y26 was specifically used in the second transesterification that leads to strand transfer, thus catalyzing the termination reaction that occurs in the recipient cell.
Resumo:
Transcription from morbillivirus genomes commences at a single promoter in the 3' non-coding terminus, with the six genes being transcribed sequentially. The 3' and 5' untranslated regions (UTRs) of the genes (mRNA sense), together with the intergenic trinucleotide spacer, comprise the non-coding sequences (NCS) of the virus and contain the conserved gene end and gene start signals, respectively. Bicistronic minigenomes containing transcription units (TUs) encoding autofluorescent reporter proteins separated by measles virus (MV) NCS were used to give a direct estimation of gene expression in single, living cells by assessing the relative amounts of each fluorescent protein in each cell. Initially, five minigenomes containing each of the MV NCS were generated. Assays were developed to determine the amount of each fluorescent protein in cells at both cell population and single-cell levels. This revealed significant variations in gene expression between cells expressing the same NCS-containing minigenome. The minigenome containing the M/F NCS produced significantly lower amounts of fluorescent protein from the second TU (TU2), compared with the other minigenomes. A minigenome with a truncated F 5' UTR had increased expression from TU2. This UTR is 524 nt longer than the other MV 5' UTRs. Insertions into the 5' UTR of the enhanced green fluorescent protein gene in the minigenome containing the N/P NCS showed that specific sequences, rather than just the additional length of F 5' UTR, govern this decreased expression from TU2.