989 resultados para Matrix-Vector Multiplication
Resumo:
The modern GPUs are well suited for intensive computational tasks and massive parallel computation. Sparse matrix multiplication and linear triangular solver are the most important and heavily used kernels in scientific computation, and several challenges in developing a high performance kernel with the two modules is investigated. The main interest it to solve linear systems derived from the elliptic equations with triangular elements. The resulting linear system has a symmetric positive definite matrix. The sparse matrix is stored in the compressed sparse row (CSR) format. It is proposed a CUDA algorithm to execute the matrix vector multiplication using directly the CSR format. A dependence tree algorithm is used to determine which variables the linear triangular solver can determine in parallel. To increase the number of the parallel threads, a coloring graph algorithm is implemented to reorder the mesh numbering in a pre-processing phase. The proposed method is compared with parallel and serial available libraries. The results show that the proposed method improves the computation cost of the matrix vector multiplication. The pre-processing associated with the triangular solver needs to be executed just once in the proposed method. The conjugate gradient method was implemented and showed similar convergence rate for all the compared methods. The proposed method showed significant smaller execution time.
Resumo:
Sparse matrix-vector multiplication (SMVM) is a fundamental operation in many scientific and engineering applications. In many cases sparse matrices have thousands of rows and columns where most of the entries are zero, while non-zero data is spread over the matrix. This sparsity of data locality reduces the effectiveness of data cache in general-purpose processors quite reducing their performance efficiency when compared to what is achieved with dense matrix multiplication. In this paper, we propose a parallel processing solution for SMVM in a many-core architecture. The architecture is tested with known benchmarks using a ZYNQ-7020 FPGA. The architecture is scalable in the number of core elements and limited only by the available memory bandwidth. It achieves performance efficiencies up to almost 70% and better performances than previous FPGA designs.
Resumo:
[EN]A natural generalization of the classical Moore-Penrose inverse is presented. The so-called S-Moore-Penrose inverse of a m x n complex matrix A, denoted by As, is defined for any linear subspace S of the matrix vector space Cnxm. The S-Moore-Penrose inverse As is characterized using either the singular value decomposition or (for the nonsingular square case) the orthogonal complements with respect to the Frobenius inner product. These results are applied to the preconditioning of linear systems based on Frobenius norm minimization and to the linearly constrained linear least squares problem.
Resumo:
Time-dependent wavepacket evolution techniques demand the action of the propagator, exp(-iHt/(h)over-bar), on a suitable initial wavepacket. When a complex absorbing potential is added to the Hamiltonian for combating unwanted reflection effects, polynomial expansions of the propagator are selected on their ability to cope with non-Hermiticity. An efficient subspace implementation of the Newton polynomial expansion scheme that requires fewer dense matrix-vector multiplications than its grid-based counterpart has been devised. Performance improvements are illustrated with some benchmark one and two-dimensional examples. (C) 2001 Elsevier Science B.V. All rights reserved.
Resumo:
In this work, we consider the numerical solution of a large eigenvalue problem resulting from a finite rank discretization of an integral operator. We are interested in computing a few eigenpairs, with an iterative method, so a matrix representation that allows for fast matrix-vector products is required. Hierarchical matrices are appropriate for this setting, and also provide cheap LU decompositions required in the spectral transformation technique. We illustrate the use of freely available software tools to address the problem, in particular SLEPc for the eigensolvers and HLib for the construction of H-matrices. The numerical tests are performed using an astrophysics application. Results show the benefits of the data-sparse representation compared to standard storage schemes, in terms of computational cost as well as memory requirements.
Resumo:
Debido al gran número de transistores por mm2 que hoy en día podemos encontrar en las GPU convencionales, en los últimos años éstas se vienen utilizando para propósitos generales gracias a que ofrecen un mayor rendimiento para computación paralela. Este proyecto implementa el producto sparse matrix-vector sobre OpenCL. En los primeros capítulos hacemos una revisión de la base teórica necesaria para comprender el problema. Después veremos los fundamentos de OpenCL y del hardware sobre el que se ejecutarán las librerías desarrolladas. En el siguiente capítulo seguiremos con una descripción del código de los kernels y de su flujo de datos. Finalmente, el software es evaluado basándose en comparativas con la CPU.
Resumo:
L’apprentissage supervisé de réseaux hiérarchiques à grande échelle connaît présentement un succès fulgurant. Malgré cette effervescence, l’apprentissage non-supervisé représente toujours, selon plusieurs chercheurs, un élément clé de l’Intelligence Artificielle, où les agents doivent apprendre à partir d’un nombre potentiellement limité de données. Cette thèse s’inscrit dans cette pensée et aborde divers sujets de recherche liés au problème d’estimation de densité par l’entremise des machines de Boltzmann (BM), modèles graphiques probabilistes au coeur de l’apprentissage profond. Nos contributions touchent les domaines de l’échantillonnage, l’estimation de fonctions de partition, l’optimisation ainsi que l’apprentissage de représentations invariantes. Cette thèse débute par l’exposition d’un nouvel algorithme d'échantillonnage adaptatif, qui ajuste (de fa ̧con automatique) la température des chaînes de Markov sous simulation, afin de maintenir une vitesse de convergence élevée tout au long de l’apprentissage. Lorsqu’utilisé dans le contexte de l’apprentissage par maximum de vraisemblance stochastique (SML), notre algorithme engendre une robustesse accrue face à la sélection du taux d’apprentissage, ainsi qu’une meilleure vitesse de convergence. Nos résultats sont présent ́es dans le domaine des BMs, mais la méthode est générale et applicable à l’apprentissage de tout modèle probabiliste exploitant l’échantillonnage par chaînes de Markov. Tandis que le gradient du maximum de vraisemblance peut-être approximé par échantillonnage, l’évaluation de la log-vraisemblance nécessite un estimé de la fonction de partition. Contrairement aux approches traditionnelles qui considèrent un modèle donné comme une boîte noire, nous proposons plutôt d’exploiter la dynamique de l’apprentissage en estimant les changements successifs de log-partition encourus à chaque mise à jour des paramètres. Le problème d’estimation est reformulé comme un problème d’inférence similaire au filtre de Kalman, mais sur un graphe bi-dimensionnel, où les dimensions correspondent aux axes du temps et au paramètre de température. Sur le thème de l’optimisation, nous présentons également un algorithme permettant d’appliquer, de manière efficace, le gradient naturel à des machines de Boltzmann comportant des milliers d’unités. Jusqu’à présent, son adoption était limitée par son haut coût computationel ainsi que sa demande en mémoire. Notre algorithme, Metric-Free Natural Gradient (MFNG), permet d’éviter le calcul explicite de la matrice d’information de Fisher (et son inverse) en exploitant un solveur linéaire combiné à un produit matrice-vecteur efficace. L’algorithme est prometteur: en terme du nombre d’évaluations de fonctions, MFNG converge plus rapidement que SML. Son implémentation demeure malheureusement inefficace en temps de calcul. Ces travaux explorent également les mécanismes sous-jacents à l’apprentissage de représentations invariantes. À cette fin, nous utilisons la famille de machines de Boltzmann restreintes “spike & slab” (ssRBM), que nous modifions afin de pouvoir modéliser des distributions binaires et parcimonieuses. Les variables latentes binaires de la ssRBM peuvent être rendues invariantes à un sous-espace vectoriel, en associant à chacune d’elles, un vecteur de variables latentes continues (dénommées “slabs”). Ceci se traduit par une invariance accrue au niveau de la représentation et un meilleur taux de classification lorsque peu de données étiquetées sont disponibles. Nous terminons cette thèse sur un sujet ambitieux: l’apprentissage de représentations pouvant séparer les facteurs de variations présents dans le signal d’entrée. Nous proposons une solution à base de ssRBM bilinéaire (avec deux groupes de facteurs latents) et formulons le problème comme l’un de “pooling” dans des sous-espaces vectoriels complémentaires.
Resumo:
We consider the numerical treatment of second kind integral equations on the real line of the form ∅(s) = ∫_(-∞)^(+∞)▒〖κ(s-t)z(t)ϕ(t)dt,s=R〗 (abbreviated ϕ= ψ+K_z ϕ) in which K ϵ L_1 (R), z ϵ L_∞ (R) and ψ ϵ BC(R), the space of bounded continuous functions on R, are assumed known and ϕ ϵ BC(R) is to be determined. We first derive sharp error estimates for the finite section approximation (reducing the range of integration to [-A, A]) via bounds on (1-K_z )^(-1)as an operator on spaces of weighted continuous functions. Numerical solution by a simple discrete collocation method on a uniform grid on R is then analysed: in the case when z is compactly supported this leads to a coefficient matrix which allows a rapid matrix-vector multiply via the FFT. To utilise this possibility we propose a modified two-grid iteration, a feature of which is that the coarse grid matrix is approximated by a banded matrix, and analyse convergence and computational cost. In cases where z is not compactly supported a combined finite section and two-grid algorithm can be applied and we extend the analysis to this case. As an application we consider acoustic scattering in the half-plane with a Robin or impedance boundary condition which we formulate as a boundary integral equation of the class studied. Our final result is that if z (related to the boundary impedance in the application) takes values in an appropriate compact subset Q of the complex plane, then the difference between ϕ(s)and its finite section approximation computed numerically using the iterative scheme proposed is ≤C_1 [kh log〖(1⁄kh)+(1-Θ)^((-1)⁄2) (kA)^((-1)⁄2) 〗 ] in the interval [-ΘA,ΘA](Θ<1) for kh sufficiently small, where k is the wavenumber and h the grid spacing. Moreover this numerical approximation can be computed in ≤C_2 N logN operations, where N = 2A/h is the number of degrees of freedom. The values of the constants C1 and C2 depend only on the set Q and not on the wavenumber k or the support of z.
Resumo:
The immersed boundary method is a versatile tool for the investigation of flow-structure interaction. In a large number of applications, the immersed boundaries or structures are very stiff and strong tangential forces on these interfaces induce a well-known, severe time-step restriction for explicit discretizations. This excessive stability constraint can be removed with fully implicit or suitable semi-implicit schemes but at a seemingly prohibitive computational cost. While economical alternatives have been proposed recently for some special cases, there is a practical need for a computationally efficient approach that can be applied more broadly. In this context, we revisit a robust semi-implicit discretization introduced by Peskin in the late 1970s which has received renewed attention recently. This discretization, in which the spreading and interpolation operators are lagged. leads to a linear system of equations for the inter-face configuration at the future time, when the interfacial force is linear. However, this linear system is large and dense and thus it is challenging to streamline its solution. Moreover, while the same linear system or one of similar structure could potentially be used in Newton-type iterations, nonlinear and highly stiff immersed structures pose additional challenges to iterative methods. In this work, we address these problems and propose cost-effective computational strategies for solving Peskin`s lagged-operators type of discretization. We do this by first constructing a sufficiently accurate approximation to the system`s matrix and we obtain a rigorous estimate for this approximation. This matrix is expeditiously computed by using a combination of pre-calculated values and interpolation. The availability of a matrix allows for more efficient matrix-vector products and facilitates the design of effective iterative schemes. We propose efficient iterative approaches to deal with both linear and nonlinear interfacial forces and simple or complex immersed structures with tethered or untethered points. One of these iterative approaches employs a splitting in which we first solve a linear problem for the interfacial force and then we use a nonlinear iteration to find the interface configuration corresponding to this force. We demonstrate that the proposed approach is several orders of magnitude more efficient than the standard explicit method. In addition to considering the standard elliptical drop test case, we show both the robustness and efficacy of the proposed methodology with a 2D model of a heart valve. (C) 2009 Elsevier Inc. All rights reserved.
Resumo:
In this work we have elaborated a spline-based method of solution of inicial value problems involving ordinary differential equations, with emphasis on linear equations. The method can be seen as an alternative for the traditional solvers such as Runge-Kutta, and avoids root calculations in the linear time invariant case. The method is then applied on a central problem of control theory, namely, the step response problem for linear EDOs with possibly varying coefficients, where root calculations do not apply. We have implemented an efficient algorithm which uses exclusively matrix-vector operations. The working interval (till the settling time) was determined through a calculation of the least stable mode using a modified power method. Several variants of the method have been compared by simulation. For general linear problems with fine grid, the proposed method compares favorably with the Euler method. In the time invariant case, where the alternative is root calculation, we have indications that the proposed method is competitive for equations of sifficiently high order.
Resumo:
General Fierz-type identities are examined and their well-known connection with completeness relations in matrix vector spaces is shown. In particular, I derive the chiral Fierz identities in a simple and systematic way by using a chiral basis for the complex 4 X 4 matrices. Other completeness relations for the fundamental representations of SU(N) algebras can be extracted using the same reasoning. (c) 2005 American Association of Physics Teachers.
Resumo:
Complementing our recent work on subspace wavepacket propagation [Chem. Phys. Lett. 336 (2001) 149], we introduce a Lanczos-based implementation of the Faber polynomial quantum long-time propagator. The original version [J. Chem. Phys. 101 (1994) 10493] implicitly handles non-Hermitian Hamiltonians, that is, those perturbed by imaginary absorbing potentials to handle unwanted reflection effects. However, like many wavepacket propagation schemes, it encounters a bottleneck associated with dense matrix-vector multiplications. Our implementation seeks to reduce the quantity of such costly operations without sacrificing numerical accuracy. For some benchmark scattering problems, our approach compares favourably with the original. (C) 2004 Elsevier B.V. All rights reserved.
Resumo:
Recent integrated circuit technologies have opened the possibility to design parallel architectures with hundreds of cores on a single chip. The design space of these parallel architectures is huge with many architectural options. Exploring the design space gets even more difficult if, beyond performance and area, we also consider extra metrics like performance and area efficiency, where the designer tries to design the architecture with the best performance per chip area and the best sustainable performance. In this paper we present an algorithm-oriented approach to design a many-core architecture. Instead of doing the design space exploration of the many core architecture based on the experimental execution results of a particular benchmark of algorithms, our approach is to make a formal analysis of the algorithms considering the main architectural aspects and to determine how each particular architectural aspect is related to the performance of the architecture when running an algorithm or set of algorithms. The architectural aspects considered include the number of cores, the local memory available in each core, the communication bandwidth between the many-core architecture and the external memory and the memory hierarchy. To exemplify the approach we did a theoretical analysis of a dense matrix multiplication algorithm and determined an equation that relates the number of execution cycles with the architectural parameters. Based on this equation a many-core architecture has been designed. The results obtained indicate that a 100 mm(2) integrated circuit design of the proposed architecture, using a 65 nm technology, is able to achieve 464 GFLOPs (double precision floating-point) for a memory bandwidth of 16 GB/s. This corresponds to a performance efficiency of 71 %. Considering a 45 nm technology, a 100 mm(2) chip attains 833 GFLOPs which corresponds to 84 % of peak performance These figures are better than those obtained by previous many-core architectures, except for the area efficiency which is limited by the lower memory bandwidth considered. The results achieved are also better than those of previous state-of-the-art many-cores architectures designed specifically to achieve high performance for matrix multiplication.