990 results for sparse matrix-vector multiplication


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Sparse matrix-vector multiplication (SMVM) is a fundamental operation in many scientific and engineering applications. In many cases sparse matrices have thousands of rows and columns where most of the entries are zero, while the non-zero data is spread over the matrix. This lack of data locality reduces the effectiveness of the data cache in general-purpose processors, considerably reducing their performance efficiency compared to what is achieved with dense matrix multiplication. In this paper, we propose a parallel processing solution for SMVM in a many-core architecture. The architecture is tested with known benchmarks using a ZYNQ-7020 FPGA. The architecture is scalable in the number of core elements and limited only by the available memory bandwidth. It achieves performance efficiencies of up to almost 70% and better performance than previous FPGA designs.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Modern GPUs are well suited for computationally intensive tasks and massively parallel computation. Sparse matrix multiplication and the linear triangular solver are among the most important and heavily used kernels in scientific computation, and several challenges in developing a high-performance kernel with these two modules are investigated. The main interest is to solve linear systems derived from elliptic equations discretized with triangular elements. The resulting linear system has a symmetric positive definite matrix. The sparse matrix is stored in the compressed sparse row (CSR) format. A CUDA algorithm is proposed to execute the matrix-vector multiplication directly on the CSR format. A dependence tree algorithm is used to determine which variables the linear triangular solver can compute in parallel. To increase the number of parallel threads, a graph coloring algorithm is implemented to reorder the mesh numbering in a pre-processing phase. The proposed method is compared with available parallel and serial libraries. The results show that the proposed method reduces the computational cost of the matrix-vector multiplication. The pre-processing associated with the triangular solver needs to be executed just once in the proposed method. The conjugate gradient method was implemented and showed a similar convergence rate for all the compared methods. The proposed method showed a significantly smaller execution time.
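To illustrate the CSR-based matrix-vector product mentioned in this abstract, here is a minimal Python sketch. It is not the authors' CUDA kernel; the function name `csr_spmv` and the array names `row_ptr`, `col_idx`, and `values` are assumptions chosen for illustration. The outer loop over rows is the part a GPU implementation could plausibly map to one thread per row, since each row's dot product is independent.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    row_ptr[i] .. row_ptr[i + 1] - 1 index the nonzeros of row i
    inside col_idx (column positions) and values (entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    # Rows are independent of each other, so this loop is the natural
    # candidate for parallelization (hypothetically, one thread per row).
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Illustrative matrix A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Note that the read of `x[col_idx[k]]` is an indirect access, which is the source of the irregular memory pattern that makes this kernel hard to run efficiently.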

Relevance:

100.00% 100.00%

Publisher:

Abstract:

A bit-level systolic array for computing matrix × vector products is described. The operation is carried out on bit-parallel input data words and the basic circuit takes the form of a 1-bit slice. Several bit-slice components must be connected together to form the final result, and the authors outline two different ways in which this can be done. The basic array also has considerable potential as a stand-alone device, and its use in computing the Walsh-Hadamard transform and discrete Fourier transform operations is briefly discussed.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this dissertation we present a class of preconditioners based on a sparse approximation of the inverse of the coefficient matrix, for solving large sparse linear systems by iterative methods, more specifically Krylov methods. For a Krylov method to be efficient, the use of preconditioners is essential. In the current context, where hybrid-architecture computers are increasingly common, there is a growing demand for parallelizable preconditioners. The approximate inverse methods described here can be applied in parallel, since they depend only on a matrix-vector product operation, which is highly parallelizable. In addition, some of the methods can also be constructed in parallel. The main idea is to present an alternative to the traditional preconditioners that use approximations of the LU factors, which, although robust, are difficult to parallelize.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

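In scientific computing, sparse matrix-vector multiplication (SpMV) is a very important and heavily invoked computational kernel. Because the ratio of floating-point operations to memory accesses in typical SpMV implementations is very low, and the memory access pattern is highly irregular, its actual performance is often poor. By using a register blocking algorithm and a heuristic block-size selection algorithm, the sparse matrix is divided into small dense blocks and the elements of vector x held in registers are reused, which can improve the performance of this kernel. This paper analyzes and summarizes several key optimization techniques used by the OSKI software package and reports performance tests in real applications. The tests show that, when these optimization techniques are applied in practice, the application must call SpMV on the order of hundreds of times to offset the extra time overhead the optimizations introduce and obtain a speedup. Ten matrices were tested on Pentium 4 and AMD Athlon platforms, with average speedups of 1.69 and 1.48, respectively.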
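The register blocking idea in this abstract can be sketched in Python as a block CSR (BCSR) product with fixed r-by-c dense blocks. This is only an illustration under stated assumptions: it is not OSKI's API or implementation, the function name `bcsr_spmv` and the array names are hypothetical, and Python lists stand in for the registers that a compiled kernel would use.

```python
def bcsr_spmv(block_row_ptr, block_col_idx, blocks, x, r, c):
    """Compute y = A @ x for A stored as dense r-by-c blocks (BCSR).

    blocks[k] is a row-major list of r * c values; zeros inside a
    block are stored explicitly (the "fill" cost of register blocking).
    """
    n_block_rows = len(block_row_ptr) - 1
    y = [0.0] * (n_block_rows * r)
    for bi in range(n_block_rows):
        acc = [0.0] * r  # stands in for accumulators kept in registers
        for k in range(block_row_ptr[bi], block_row_ptr[bi + 1]):
            x0 = block_col_idx[k] * c
            xs = x[x0:x0 + c]  # loaded once, reused by all r rows of the block
            block = blocks[k]
            for i in range(r):
                for j in range(c):
                    acc[i] += block[i * c + j] * xs[j]
        y[bi * r:(bi + 1) * r] = acc
    return y


# Illustrative A = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 5, 0], [0, 0, 0, 6]]
# stored as two 2x2 blocks; the second block carries two explicit zeros.
block_row_ptr = [0, 1, 2]
block_col_idx = [0, 1]
blocks = [[1.0, 2.0, 3.0, 4.0], [5.0, 0.0, 0.0, 6.0]]
y = bcsr_spmv(block_row_ptr, block_col_idx, blocks, [1.0, 1.0, 1.0, 1.0], 2, 2)
```

One column index is stored per block rather than per nonzero, and each slice of `x` is reused across the rows of a block. The explicit zeros are the trade-off, which is why a block-size selection heuristic matters.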

Relevance:

100.00% 100.00%

Publisher:

Abstract:

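Sparse matrix-vector multiplication (SpMV) performs very poorly with the compressed row storage format, whereas the register blocking algorithm improves performance by keeping data accesses as close to the processor in the memory hierarchy as possible. The RAM(h) model can be used to analyze and compare the memory access complexity of different forms of an algorithm, and thus to compare the two algorithms. This paper uses RAM(h) to analyze the memory access complexity of the two SpMV implementations, measures the SpMV performance of 7 sparse matrices on a Pentium 4 platform, and collects the L1, L2, and TLB miss rates of the two algorithms. The experimental results agree with the model analysis.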

Relevance:

100.00% 100.00%

Publisher:

Abstract:

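OpenMP is a shared-memory parallel programming standard that supports Fortran and C/C++. It is based on the fork-join parallel execution model and divides a program into parallel regions and serial regions. In recent years, OpenMP has been widely used for parallel programming on SMP (Symmetric Multi-Processing) and multicore architectures. With the development of multicore processors, how real applications can fully exploit multiple processor cores to improve computational efficiency has become a research focus. In scientific computing, loop structures are among the most central targets of parallelization. Taking into account load balance, scheduling overhead, synchronization overhead, and other factors, the OpenMP standard defines different strategies such as Static, Dynamic, Guided, and Runtime scheduling. To address the shortcoming that Guided scheduling is unsuitable for decreasing loop structures, this thesis proposes an improved new_guided scheduling strategy and implements it in the OMPi compiler. The main idea of new_guided is to apply Static scheduling to the first half of the loop iterations and Guided scheduling to the second half. In addition, this thesis evaluates different scheduling strategies on multicore processors for different loop structures. The results show that, in general, OpenMP's default Static strategy has the worst scheduling performance; for regular and increasing loop structures, Dynamic, Guided, and new_guided perform similarly; for decreasing loop structures, Dynamic and new_guided perform comparably and both outperform Guided; for random loop structures such as computing the Mandelbrot set, where the workload is concentrated in the middle, Dynamic outperforms the other strategies and new_guided falls between Dynamic and Guided. With the advent and development of multicore processors, multithreaded programming has become unavoidable. Sparse matrix-vector multiplication (SpMV) is a very important and heavily invoked scientific computing kernel. SpMV memory accesses are generally highly irregular, so existing SpMV algorithms are relatively inefficient. The number of cores on multicore processor chips is steadily increasing, which makes parallel acceleration of SpMV on multicore processors very important. This thesis introduces two common sparse matrix storage formats, CSR and BCSR, and implements a multicore parallel SpMV with OpenMP. It also discusses optimization techniques such as register blocking and compressed column indices, as well as the effect of different scheduling strategies on multithreaded SpMV. Tests on a Dawning Tiankuo S4800A1 server show that most matrices achieve scalable, and even superlinear, speedups, but for some larger matrices the speedup is not significant. In our tests, SpMV optimized with register blocking ran 28.09% faster on average than the CSR-based multithreaded SpMV. Within the CSR-based multithreaded SpMV, the program with the column-index optimization ran 13.05% faster on average than before the optimization. Furthermore, this thesis implements a scheduling strategy based on the number of nonzeros, in which each thread processes nearly the same number of nonzeros. We tested and analyzed it against three scheduling strategies provided by the OpenMP standard. The results show that, compared with the strategies provided by OpenMP, nonzero-based scheduling achieves better load balance; Dynamic and Guided scheduling perform essentially the same in multithreaded SpMV, and both outperform Static scheduling.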
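The nonzero-count-based scheduling described in this abstract can be sketched as a row partitioning step over the CSR row pointer. The following Python is a minimal illustration, not the thesis's implementation: the function name `partition_rows_by_nnz` and the choice of contiguous row ranges found by binary search are assumptions.

```python
from bisect import bisect_left


def partition_rows_by_nnz(row_ptr, n_threads):
    """Split rows into contiguous chunks with roughly equal nonzero counts.

    Returns n_threads + 1 row boundaries; thread t would handle rows
    bounds[t] .. bounds[t + 1] - 1.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for t in range(1, n_threads):
        target = total_nnz * t / n_threads
        # row_ptr is a running nonzero count, so a binary search finds the
        # first row boundary whose cumulative count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries monotonic and within range.
        bounds.append(min(max(cut, bounds[-1]), n_rows))
    bounds.append(n_rows)
    return bounds


# Hypothetical matrix: row 0 holds 8 nonzeros, rows 1..8 hold 1 each.
row_ptr = [0, 8, 9, 10, 11, 12, 13, 14, 15, 16]
bounds = partition_rows_by_nnz(row_ptr, 2)
nnz_per_thread = [row_ptr[bounds[t + 1]] - row_ptr[bounds[t]] for t in range(2)]
```

In this example an even split by row count would give one thread 12 nonzeros and the other 4, whereas splitting by nonzero count gives each thread 8. Because rows are never split, a single very dense row can still leave the partition unbalanced.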

Relevance:

100.00% 100.00%

Publisher:

Abstract:

A simple and efficient algorithm for the bandwidth reduction of sparse symmetric matrices is proposed. It involves column-row permutations and is well-suited to map onto the linear array topology of the SIMD architectures. The efficiency of the algorithm is compared with the other existing algorithms. The interconnectivity and the memory requirement of the linear array are discussed and the complexity of its layout area is derived. The parallel version of the algorithm mapped onto the linear array is then introduced and is explained with the help of an example. The optimality of the parallel algorithm is proved by deriving the time complexities of the algorithm on a single processor and the linear array.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Support vector machines (SVMs), though accurate, are not preferred in applications requiring high classification speed or when deployed in systems of limited computational resources, due to the large number of support vectors involved in the model. To overcome this problem we have devised a primal SVM method with the following properties: (1) it solves for the SVM representation without the need to invoke the representer theorem, (2) forward and backward selections are combined to approach the final globally optimal solution, and (3) a criterion is introduced for identification of support vectors leading to a much reduced support vector set. In addition to introducing this method the paper analyzes the complexity of the algorithm and presents test results on three public benchmark problems and a human activity recognition application. These applications demonstrate the effectiveness and efficiency of the proposed algorithm.


Relevance:

100.00% 100.00%

Publisher:

Abstract:

How can we correlate the neural activity in the human brain as it responds to typed words with properties of these terms (like ‘edible’, ‘fits in hand’)? In short, we want to find latent variables that jointly explain both the brain activity and the behavioral responses. This is one of many settings of the Coupled Matrix-Tensor Factorization (CMTF) problem.

Can we accelerate any CMTF solver, so that it runs within a few minutes instead of tens of hours to a day, while maintaining good accuracy? We introduce Turbo-SMT, a meta-method capable of doing exactly that: it boosts the performance of any CMTF algorithm, by up to 200x, along with an up to 65 fold increase in sparsity, with comparable accuracy to the baseline.

We apply Turbo-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. Turbo-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy.




Relevance:

100.00% 100.00%

Publisher:

Abstract:

This research presents a fast algorithm for projected support vector machines (PSVM). A basis vector set (BVS) is selected for the kernel-induced feature space, and the training points are projected onto the subspace spanned by the selected BVS. A standard linear support vector machine (SVM) is then produced in the subspace with the projected training points. As the dimension of the subspace is determined by the size of the selected basis vector set, the size of the produced SVM expansion can be specified. A two-stage algorithm is derived which selects and refines the basis vector set, achieving a locally optimal model. The model expansion coefficients and bias are updated recursively for increases and decreases in the basis set and support vector set. The condition for a point to be classed as outside the current basis vector set and selected as a new basis vector is derived and embedded in the recursive procedure. This guarantees the linear independence of the produced basis set. The proposed algorithm is tested and compared with an existing sparse primal SVM (SpSVM) and a standard SVM (LibSVM) on seven public benchmark classification problems. Our new algorithm is designed for use in the application area of human activity recognition using smart devices and embedded sensors, where the sometimes limited memory and processing resources must be exploited to the full, and where more robust and accurate classification leads to more satisfied users. Experimental results demonstrate the effectiveness and efficiency of the proposed algorithm. This work builds upon a previously published algorithm specifically created for activity recognition within mobile applications for the EU Haptimap project [1]. The algorithms detailed in this paper are more memory and resource efficient, making them suitable for use with bigger data sets and more easily trained SVMs.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

How can we correlate neural activity in the human brain as it responds to words with behavioral data expressed as answers to questions about these same words? In short, we want to find latent variables that explain both the brain activity and the behavioral responses. We show that this is an instance of the Coupled Matrix-Tensor Factorization (CMTF) problem. We propose Scoup-SMT, a novel, fast, and parallel algorithm that solves the CMTF problem and produces a sparse latent low-rank subspace of the data. In our experiments, we find that Scoup-SMT is 50-100 times faster than a state-of-the-art algorithm for CMTF, along with a 5-fold increase in sparsity. Moreover, we extend Scoup-SMT to handle missing data without degradation of performance. We apply Scoup-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. Scoup-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy. Finally, we demonstrate the generality of Scoup-SMT by applying it to a Facebook dataset (users, friends, wall-postings); there, Scoup-SMT spots spammer-like anomalies.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Mode of access: Internet.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Sparse-matrix sampling using commercially available crystallization screen kits has become the most popular way of determining the preliminary crystallization conditions for macromolecules. In this study, the efficiency of three commercial screening kits, Crystal Screen and Crystal Screen 2 (Hampton Research), Wizard Screens I and II (Emerald BioStructures) and Personal Structure Screens 1 and 2 (Molecular Dimensions), has been compared using a set of 19 diverse proteins. 18 proteins yielded crystals using at least one crystallization screen. Surprisingly, Crystal Screens and Personal Structure Screens showed dramatically different results, although most of the crystallization formulations are identical as listed by the manufacturers. Higher molecular weight polyethylene glycols and mixed precipitants were found to be the most effective precipitants in this study.