878 resultados para advanced compiler optimizations
Resumo:
This paper proposes the use of empirical modeling techniques for building microarchitecture sensitive models for compiler optimizations. The models we build relate program performance to settings of compiler optimization flags, associated heuristics and key microarchitectural parameters. Unlike traditional analytical modeling methods, this relationship is learned entirely from data obtained by measuring performance at a small number of carefully selected compiler/microarchitecture configurations. We evaluate three different learning techniques in this context viz. linear regression, adaptive regression splines and radial basis function networks. We use the generated models to a) predict program performance at arbitrary compiler/microarchitecture configurations, b) quantify the significance of complex interactions between optimizations and the microarchitecture, and c) efficiently search for'optimal' settings of optimization flags and heuristics for any given microarchitectural configuration. Our evaluation using benchmarks from the SPEC CPU2000 suits suggests that accurate models (< 5% average error in prediction) can be generated using a reasonable number of simulations. We also find that using compiler settings prescribed by a model-based search can improve program performance by as much as 19% (with an average of 9.5%) over highly optimized binaries.
Resumo:
Compiler optimizations help to make code run faster at runtime. When the compilation is done before the program is run, compilation time is less of an issue, but how do on-the-fly compilation and optimization impact the overall runtime? If the compiler must compete with the running application for resources, the running application will take more time to complete. This paper investigates the impact of specific compiler optimizations on the overall runtime of an application. A foldover Plackett and Burman design is used to choose compiler optimizations that appear to contribute to shorter overall runtimes. These selected optimizations are compared with the default optimization levels in the Jikes RVM. This method selects optimizations that result in a shorter overall runtime than the default O0, O1, and O2 levels. This shows that careful selection of compiler optimizations can have a significant, positive impact on overall runtime.
Resumo:
Los algoritmos basados en registros de desplazamiento con realimentación (en inglés FSR) se han utilizado como generadores de flujos pseudoaleatorios en aplicaciones con recursos limitados como los sistemas de apertura sin llave. Se considera canal primario a aquel que se utiliza para realizar una transmisión de información. La aparición de los ataques de canal auxiliar (en inglés SCA), que explotan información filtrada inintencionadamente a través de canales laterales como el consumo, las emisiones electromagnéticas o el tiempo empleado, supone una grave amenaza para estas aplicaciones, dado que los dispositivos son accesibles por un atacante. El objetivo de esta tesis es proporcionar un conjunto de protecciones que se puedan aplicar de forma automática y que utilicen recursos ya disponibles, evitando un incremento sustancial en los costes y alargando la vida útil de aplicaciones que puedan estar desplegadas. Explotamos el paralelismo existente en algoritmos FSR, ya que sólo hay 1 bit de diferencia entre estados de rondas consecutivas. Realizamos aportaciones en tres niveles: a nivel de sistema, utilizando un coprocesador reconfigurable, a través del compilador y a nivel de bit, aprovechando los recursos disponibles en el procesador. Proponemos un marco de trabajo que nos permite evaluar implementaciones de un algoritmo incluyendo los efectos introducidos por el compilador considerando que el atacante es experto. En el campo de los ataques, hemos propuesto un nuevo ataque diferencial que se adapta mejor a las condiciones de las implementaciones software de FSR, en las que el consumo entre rondas es muy similar. SORU2 es un co-procesador vectorial reconfigurable propuesto para reducir el consumo energético en aplicaciones con paralelismo y basadas en el uso de bucles. Proponemos el uso de SORU2, además, para ejecutar algoritmos basados en FSR de forma segura. Al ser reconfigurable, no supone un sobrecoste en recursos, ya que no está dedicado en exclusiva al algoritmo de cifrado. Proponemos una configuración que ejecuta múltiples algoritmos de cifrado similares de forma simultánea, con distintas implementaciones y claves. A partir de una implementación sin protecciones, que demostramos que es completamente vulnerable ante SCA, obtenemos una implementación segura a los ataques que hemos realizado. A nivel de compilador, proponemos un mecanismo para evaluar los efectos de las secuencias de optimización del compilador sobre una implementación. El número de posibles secuencias de optimizaciones de compilador es extremadamente alto. El marco de trabajo propuesto incluye un algoritmo para la selección de las secuencias de optimización a considerar. Debido a que las optimizaciones del compilador transforman las implementaciones, se pueden generar automáticamente implementaciones diferentes combinamos para incrementar la seguridad ante SCA. Proponemos 2 mecanismos de aplicación de estas contramedidas, que aumentan la seguridad de la implementación original sin poder considerarse seguras. Finalmente hemos propuesto la ejecución paralela a nivel de bit del algoritmo en un procesador. Utilizamos la forma algebraica normal del algoritmo, que automáticamente se paraleliza. La implementación sobre el algoritmo evaluado mejora en rendimiento y evita que se filtre información por una ejecución dependiente de datos. Sin embargo, es más vulnerable ante ataques diferenciales que la implementación original. Proponemos una modificación del algoritmo para obtener una implementación segura, descartando parcialmente ejecuciones del algoritmo, de forma aleatoria. Esta implementación no introduce una sobrecarga en rendimiento comparada con las implementaciones originales. En definitiva, hemos propuesto varios mecanismos originales a distintos niveles para introducir aleatoridad en implementaciones de algoritmos FSR sin incrementar sustancialmente los recursos necesarios. ABSTRACT Feedback Shift Registers (FSR) have been traditionally used to implement pseudorandom sequence generators. These generators are used in Stream ciphers in systems with tight resource constraints, such as Remote Keyless Entry. When communicating electronic devices, the primary channel is the one used to transmit the information. Side-Channel Attack (SCA) use additional information leaking from the actual implementation, including power consumption, electromagnetic emissions or timing information. Side-Channel Attacks (SCA) are a serious threat to FSR-based applications, as an attacker usually has physical access to the devices. The main objective of this Ph.D. thesis is to provide a set of countermeasures that can be applied automatically using the available resources, avoiding a significant cost overhead and extending the useful life of deployed systems. If possible, we propose to take advantage of the inherent parallelism of FSR-based algorithms, as the state of a FSR differs from previous values only in 1-bit. We have contributed in three different levels: architecture (using a reconfigurable co-processor), using compiler optimizations, and at bit level, making the most of the resources available at the processor. We have developed a framework to evaluate implementations of an algorithm including the effects introduced by the compiler. We consider the presence of an expert attacker with great knowledge on the application and the device. Regarding SCA, we have presented a new differential SCA that performs better than traditional SCA on software FSR-based algorithms, where the leaked values are similar between rounds. SORU2 is a reconfigurable vector co-processor. It has been developed to reduce energy consumption in loop-based applications with parallelism. In addition, we propose its use for secure implementations of FSR-based algorithms. The cost overhead is discarded as the co-processor is not exclusively dedicated to the encryption algorithm. We present a co-processor configuration that executes multiple simultaneous encryptions, using different implementations and keys. From a basic implementation, which is proved to be vulnerable to SCA, we obtain an implementation where the SCA applied were unsuccessful. At compiler level, we use the framework to evaluate the effect of sequences of compiler optimization passes on a software implementation. There are many optimization passes available. The optimization sequences are combinations of the available passes. The amount of sequences is extremely high. The framework includes an algorithm for the selection of interesting sequences that require detailed evaluation. As existing compiler optimizations transform the software implementation, using different optimization sequences we can automatically generate different implementations. We propose to randomly switch between the generated implementations to increase the resistance against SCA.We propose two countermeasures. The results show that, although they increase the resistance against SCA, the resulting implementations are not secure. At bit level, we propose to exploit bit level parallelism of FSR-based implementations using pseudo bitslice implementation in a wireless node processor. The bitslice implementation is automatically obtained from the Algebraic Normal Form of the algorithm. The results show a performance improvement, avoiding timing information leakage, but increasing the vulnerability against differential SCA.We provide a secure version of the algorithm by randomly discarding part of the data obtained. The overhead in performance is negligible when compared to the original implementations. To summarize, we have proposed a set of original countermeasures at different levels that introduce randomness in FSR-based algorithms avoiding a heavy overhead on the resources required.
Resumo:
近年来,以数据依赖分析为基础的高级编译优化成为现代编译器的重要研发内容.针对这类编译优化的测试问题提出了一种测试程序自动生成方法,能够根据指定的数据依赖特征生成测试程序.首先设计了LoSpec语言用以描述测试程序,然后采用一种便于表示数据依赖关系的模型——过程图作为中间表示模型实现了测试程序的自动生成,并开发了自动测试工具LoTester.与已有方法相比,该方法对高级优化更具针对性,自动化程度较高.LoTester目前在一款面向多媒体应用的优化编译器EECC的开发中得到应用并获得了良好效果.
Resumo:
In this paper we develop compilation techniques for the realization of applications described in a High Level Language (HLL) onto a Runtime Reconfigurable Architecture. The compiler determines Hyper Operations (HyperOps) that are subgraphs of a data flow graph (of an application) and comprise elementary operations that have strong producer-consumer relationship. These HyperOps are hosted on computation structures that are provisioned on demand at runtime. We also report compiler optimizations that collectively reduce the overheads of data-driven computations in runtime reconfigurable architectures. On an average, HyperOps offer a 44% reduction in total execution time and a 18% reduction in management overheads as compared to using basic blocks as coarse grained operations. We show that HyperOps formed using our compiler are suitable to support data flow software pipelining.
Resumo:
MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.
Resumo:
Compiler optimizations need precise and scalable analyses to discover program properties. We propose a partially flow-sensitive framework that tries to draw on the scalability of flow-insensitive algorithms while providing more precision at some specific program points. Provided with a set of critical nodes — basic blocks at which more precise information is desired — our partially flow-sensitive algorithm computes a reduced control-flow graph by collapsing some sets of non-critical nodes. The algorithm is more scalable than a fully flow-sensitive one as, assuming that the number of critical nodes is small, the reduced flow-graph is much smaller than the original flow-graph. At the same time, a much more precise information is obtained at certain program points than would had been obtained from a flow-insensitive algorithm.
Resumo:
Most Java programmers would agree that Java is a language that promotes a philosophy of “create and go forth”. By design, temporary objects are meant to be created on the heap, possibly used and then abandoned to be collected by the garbage collector. Excessive generation of temporary objects is termed “object churn” and is a form of software bloat that often leads to performance and memory problems. To mitigate this problem, many compiler optimizations aim at identifying objects that may be allocated on the stack. However, most such optimizations miss large opportunities for memory reuse when dealing with objects inside loops or when dealing with container objects. In this paper, we describe a novel algorithm that detects bloat caused by the creation of temporary container and String objects within a loop. Our analysis determines which objects created within a loop can be reused. Then we describe a source-to-source transformation that efficiently reuses such objects. Empirical evaluation indicates that our solution can reduce upto 40% of temporary object allocations in large programs, resulting in a performance improvement that can be as high as a 20% reduction in the run time, specifically when a program has a high churn rate or when the program is memory intensive and needs to run the GC often.
Resumo:
MATLAB is an array language, initially popular for rapid prototyping, but is now being increasingly used to develop production code for numerical and scientific applications. Typical MATLAB programs have abundant data parallelism. These programs also have control flow dominated scalar regions that have an impact on the program's execution time. Today's computer systems have tremendous computing power in the form of traditional CPU cores and throughput oriented accelerators such as graphics processing units(GPUs). Thus, an approach that maps the control flow dominated regions to the CPU and the data parallel regions to the GPU can significantly improve program performance. In this paper, we present the design and implementation of MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors. Our solution is fully automated and does not require programmer input for identifying data parallel regions. We propose a set of compiler optimizations tailored for MATLAB. Our compiler identifies data parallel regions of the program and composes them into kernels. The problem of combining statements into kernels is formulated as a constrained graph clustering problem. Heuristics are presented to map identified kernels to either the CPU or GPU so that kernel execution on the CPU and the GPU happens synergistically and the amount of data transfer needed is minimized. In order to ensure required data movement for dependencies across basic blocks, we propose a data flow analysis and edge splitting strategy. Thus our compiler automatically handles composition of kernels, mapping of kernels to CPU and GPU, scheduling and insertion of required data transfer. The proposed compiler was implemented and experimental evaluation using a set of MATLAB benchmarks shows that our approach achieves a geometric mean speedup of 19.8X for data parallel benchmarks over native execution of MATLAB.
Resumo:
编译优化是现代编译器不可缺少的重要功能。编译优化技术在过去几十年里取得了显著进展,对提升程序运行速度、节省存储空间、节省能耗等起到了不可替代的作用。然而,编译优化的可靠性却不尽人意。编译优化技术种类多、处理复杂而且可复用性弱,容易出错,即便是成熟的编译器,也不断有与编译优化相关的bug被发现出来。编译器的可靠性对软件产品的可靠性和安全性有直接影响,随着编译优化在现代编译器中的比重不断增加,编译优化的可靠性也日益受到人们的关注。 软件测试是保障编译优化可靠性的基本技术手段之一,然而,编译优化测试涉及测试程序编写、测试执行等过程,人工完成相当费时费力,因此有必要研究编译优化自动测试方法,以提高编译优化测试的效率。基于这一实际需求,Intel、MEI (Matsushita Electric Industrial)、DaimlerChrysler AG等业界产商近年来也相继与有关研究机构开展合作,研究编译优化自动测试方法。 目前已有的编译器自动测试方法中大多数都是以程序设计语言的语法和语义为主要依据,适用于测试语法检查、语义检查、代码生成等基本编译功能,对于编译优化的测试则缺乏针对性,测试效率较低,而已有的若干种面向编译优化的自动测试方法也存在着对编译优化刻画不够准确、自动化程度不高等缺陷。 本文提出一种基于形式描述的编译优化自动测试方法(TEMCOFS),其实现过程分为四个阶段,即:(1) 建立编译优化形式描述;(2) 分析编译优化描述的正确性;(3) 基于编译优化形式描述自动生成测试程序;(4) 自动执行测试。在TEMCOFS方法框架下,本文分别研究了编译优化形式化描述方法、编译优化描述正确性分析方法和基于形式描述的两种自动测试方法,实现了三类典型优化—表达式优化、数据流优化、循环优化的自动测试,主要工作包括: (1) 在编译优化形式描述方面,除了应用前人研究成果—TRANS语言描述了表达式优化和数据流优化之外,还对TRANS语言进行了扩展,建立了循环优化的形式描述机制; (2) 在编译优化正确性分析方面,首先证明了揭示程序数据依赖关系对程序变换正确性影响的依赖基础定理,为循环优化正确性分析提供了基础,然后探讨了编译优化正确性分析的一般方法; (3) 在测试自动执行方面,提出了编译优化自动变形测试执行方法,该方法将变形测试思想引入编译优化测试中,利用测试程序等价性质实现测试结果的自动判定,能够避免传统方法中测试结果判定所存在的问题; (4) 在测试程序自动生成方面,分别对应两种测试自动执行方法—比照法和变形法提出了基于编译优化形式描述的测试程序自动生成方法,能够根据表达式优化、数据流优化、循环优化的形式描述自动生成测试程序集。 在GCC编译器上的实验表明,基于本文方法自动生成的测试程序集可使GCC的编译优化模块较快达到较高的测试覆盖率。与其他编译器自动测试方法相比,本文方法对编译优化测试的针对性较好,自动化程度也较高。总而言之,本文方法对于提高编译优化测试效率、保障优化编译器的质量具有较好的实用和参考价值。
Resumo:
This thesis presents DCE, or Dynamic Conditional Execution, as an alternative to reduce the cost of mispredicted branches. The basic idea is to fetch all paths produced by a branch that obey certain restrictions regarding complexity and size. As a result, a smaller number of predictions is performed, and therefore, a lesser number of branches are mispredicted. DCE fetches through selected branches avoiding disruptions in the fetch flow when these branches are fetched. Both paths of selected branches are executed but only the correct path commits. In this thesis we propose an architecture to execute multiple paths of selected branches. Branches are selected based on the size and other conditions. Simple and complex branches can be dynamically predicated without requiring a special instruction set nor special compiler optimizations. Furthermore, a technique to reduce part of the overhead generated by the execution of multiple paths is proposed. The performance achieved reaches levels of up to 12% when comparing a Local predictor used in DCE against a Global predictor used in the reference machine. When both machines use a Local predictor, the speedup is increased by an average of 3-3.5%.
Resumo:
Finding useful sharing information between instances in object- oriented programs has recently been the focus of much research. The applications of such static analysis are multiple: by knowing which variables definitely do not share in memory we can apply conventional compiler optimizations, find coarse-grained parallelism opportunities, or, more importantly, verify certain correctness aspects of programs even in the absence of annotations. In this paper we introduce a framework for deriving precise sharing information based on abstract interpretation for a Java-like language. Our analysis achieves precision in various ways, including supporting multivariance, which allows separating different contexts. We propose a combined Set Sharing + Nullity + Classes domain which captures which instances do not share and which ones are definitively null, and which uses the classes to refine the static information when inheritance is present. The use of a set sharing abstraction allows a more precise representation of the existing sharings and is crucial in achieving precision during interprocedural analysis. Carrying the domains in a combined way facilitates the interaction among them in the presence of multivariance in the analysis. We show through examples and experimentally that both the set sharing part of the domain as well as the combined domain provide more accurate information than previous work based on pair sharing domains, at reasonable cost.
Resumo:
Finding useful sharing information between instances in object- oriented programs has been recently the focus of much research. The applications of such static analysis are multiple: by knowing which variables share in memory we can apply conventional compiler optimizations, find coarse-grained parallelism opportunities, or, more importantly,erify certain correctness aspects of programs even in the absence of annotations In this paper we introduce a framework for deriving precise sharing information based on abstract interpretation for a Java-like language. Our analysis achieves precision in various ways. The analysis is multivariant, which allows separating different contexts. We propose a combined Set Sharing + Nullity + Classes domain which captures which instances share and which ones do not or are definitively null, and which uses the classes to refine the static information when inheritance is present. Carrying the domains in a combined way facilitates the interaction among the domains in the presence of mutivariance in the analysis. We show that both the set sharing part of the domain as well as the combined domain provide more accurate information than previous work based on pair sharing domains, at reasonable cost.
Resumo:
There is demand for an easily programmable, high performance image processing platform based on FPGAs. In previous work, a novel, high performance processor - IPPro was developed and a Histogram of Orientated Gradients (HOG) algorithm study undertaken on a Xilinx Zynq platform. Here, we identify and explore a number of mapping strategies to improve processing efficiency for soft-cores and a number of options for creation of a division coprocessor. This is demonstrated for the revised high definition HOG implementation on a Zynq platform, resulting in a performance of 328 fps which represents a 146% speed improvement over the original realization and a tenfold reduction in energy.
Resumo:
CIAO is an advanced programming environment supporting Logic and Constraint programming. It offers a simple concurrent kernel on top of which declarative and non-declarative extensions are added via librarles. Librarles are available for supporting the ISOProlog standard, several constraint domains, functional and higher order programming, concurrent and distributed programming, internet programming, and others. The source language allows declaring properties of predicates via assertions, including types and modes. Such properties are checked at compile-time or at run-time. The compiler and system architecture are designed to natively support modular global analysis, with the two objectives of proving properties in assertions and performing program optimizations, including transparently exploiting parallelism in programs. The purpose of this paper is to report on recent progress made in the context of the CIAO system, with special emphasis on the capabilities of the compiler, the techniques used for supporting such capabilities, and the results in the áreas of program analysis and transformation already obtained with the system.