828 results for compression parallel
Abstract:
In this article, techniques are presented for faster evolution of wavelet lifting coefficients for fingerprint image compression (FIC). In addition to increasing the computational speed by 81.35%, the coefficients performed much better than those reported in the literature. Generally, full-size images are used for evolving wavelet coefficients, which is time consuming. To overcome this, in this work, wavelets were evolved with resized, cropped, resized-average and cropped-average images. On comparing the peak signal-to-noise ratios (PSNR) offered by the evolved wavelets, it was found that the cropped images outperformed the resized images and were on par with the results reported to date. Wavelet lifting coefficients evolved from an average of four 256×256 centre-cropped images took less than one fifth of the evolution time reported in the literature and produced an improvement of 1.009 dB in average PSNR. Improvement in average PSNR was observed for other compression ratios (CR) and for degraded images as well. The proposed technique gave better PSNR at various bit rates with the set partitioning in hierarchical trees (SPIHT) coder, and the coefficients also performed well with other fingerprint databases.
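The comparisons in this abstract are driven entirely by PSNR. As a reference point, here is a minimal sketch of the PSNR computation between an original and a reconstructed 8-bit grayscale image; the 256×256 buffers and pixel values are illustrative assumptions, not data from the article.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// PSNR between an original and a reconstructed 8-bit grayscale image.
// 255 is the peak value for 8-bit data; higher PSNR = closer reconstruction.
double psnr(const std::vector<unsigned char>& orig,
            const std::vector<unsigned char>& recon) {
    double mse = 0.0;
    for (size_t i = 0; i < orig.size(); ++i) {
        double d = static_cast<double>(orig[i]) - recon[i];
        mse += d * d;
    }
    mse /= orig.size();
    if (mse == 0.0) return INFINITY;  // identical images
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}

int main() {
    // Illustrative 256x256 buffers; real use would load fingerprint images.
    std::vector<unsigned char> a(256 * 256, 128), b(256 * 256, 130);
    std::printf("PSNR = %.3f dB\n", psnr(a, b));
}
```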
Abstract:
This paper explains the Genetic Algorithm (GA) evolution of an optimized wavelet that surpasses the cdf9/7 wavelet for fingerprint compression and reconstruction. Optimized wavelets have already been evolved in previous works in the literature, but these approaches are highly computationally complex and time consuming. Therefore, in this work, a simple approach is taken to reduce the computational complexity of the evolution algorithm. A training image set comprising three 32×32 cropped images performed much better than the coefficients reported in the literature. An average improvement of 1.0059 dB in PSNR over the classical cdf9/7 wavelet was achieved across the 80 fingerprint images. In addition, the computational speed was increased by 90.18%. The coefficients evolved for compression ratio (CR) 16:1 also yielded better average PSNR for other CRs, and improvement in average PSNR was observed for degraded and noisy images as well.
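To make the evolutionary setup concrete, below is a minimal real-coded GA skeleton of the general kind described: candidate coefficient vectors are scored by a fitness function, the fittest survive, and the rest of the population is rebuilt by crossover and mutation. The population size, coefficient count, and the stub fitness (which stands in for "average PSNR after compressing and reconstructing the training images") are all assumptions for illustration, not the paper's actual configuration.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Stand-in fitness: in the paper this would be the average PSNR obtained by
// compressing the cropped training images with the candidate lifting
// coefficients. Here we simply reward closeness to an arbitrary target.
double fitness(const std::vector<double>& coeffs) {
    double score = 0.0;
    for (double c : coeffs) score -= (c - 0.5) * (c - 0.5);
    return score;
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> init(-2.0, 2.0);
    std::normal_distribution<double> mutate(0.0, 0.05);

    const int kCoeffs = 9, kPop = 40, kGens = 100;  // illustrative parameters
    std::vector<std::vector<double>> pop(kPop, std::vector<double>(kCoeffs));
    for (auto& ind : pop)
        for (double& c : ind) c = init(rng);

    auto best_first = [](const std::vector<double>& a,
                         const std::vector<double>& b) {
        return fitness(a) > fitness(b);
    };

    for (int g = 0; g < kGens; ++g) {
        std::sort(pop.begin(), pop.end(), best_first);
        // Elitism: keep the top half; rebuild the rest by uniform crossover
        // of two elite parents followed by Gaussian mutation.
        std::uniform_int_distribution<int> elite(0, kPop / 2 - 1);
        for (int i = kPop / 2; i < kPop; ++i) {
            const auto& p1 = pop[elite(rng)];
            const auto& p2 = pop[elite(rng)];
            for (int j = 0; j < kCoeffs; ++j)
                pop[i][j] = (j % 2 == 0 ? p1[j] : p2[j]) + mutate(rng);
        }
    }
    std::sort(pop.begin(), pop.end(), best_first);
    std::printf("best fitness after %d generations: %.6f\n",
                kGens, fitness(pop[0]));
}
```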
Abstract:
The thesis explores the area of still image compression. Image compression techniques can be broadly classified into lossless and lossy compression. The most common lossy compression techniques are based on transform coding, vector quantization and fractals. Transform coding is the simplest of these and generally employs reversible transforms like the DCT, DWT, etc. The Mapped Real Transform (MRT) is an evolving integer transform based on real additions alone. The present research work aims at developing new image compression techniques based on MRT. Most transform coding techniques employ fixed block size image segmentation, usually 8×8. Hence, a fixed block size transform coder is implemented using MRT, and its merits and demerits are analyzed for both 8×8 and 4×4 blocks. The N² unique MRT coefficients for each block are computed using templates. Considering the merits and demerits of fixed block size transform coding, a hybrid form of these techniques is implemented to improve compression performance, and the hybrid coder is found to perform better than the fixed block size coders. Thus, if the block size is made adaptive, the performance can be improved further. In adaptive block size coding, the block size may vary from the size of the image down to 2×2, so the computation of MRT using templates is impractical due to memory requirements. An adaptive transform coder based on the Unique MRT (UMRT), a compact form of MRT, is therefore implemented to obtain better performance in terms of PSNR and HVS. The suitability of MRT for vector quantization of images is then examined, and a UMRT-based Classified Vector Quantization (CVQ) scheme is implemented, in which the edges in the images are identified and classified by employing a UMRT-based criterion. Based on the above experiments, a new technique named "MRT based Adaptive Transform Coder with Classified Vector Quantization (MATC-CVQ)" is developed. Its performance is evaluated and compared against existing techniques. A comparison with standard JPEG and Shapiro's well-known Embedded Zero-tree Wavelet (EZW) shows that the proposed technique gives better performance for the majority of images.
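The fixed block size pipeline the thesis builds on is the standard one: segment the image into 8×8 blocks, transform each block, and quantize so that small coefficients vanish. Since the abstract does not define the MRT itself, the sketch below uses a naive 2D DCT-II as a stand-in transform; the block content and the uniform quantizer step are likewise illustrative assumptions.

```cpp
#include <cmath>
#include <cstdio>

const int N = 8;  // fixed block size, as in 8x8 transform coding

// Naive orthonormal 2D DCT-II of one NxN block (a stand-in for the MRT).
void dct2d(const double in[N][N], double out[N][N]) {
    const double pi = std::acos(-1.0);
    for (int u = 0; u < N; ++u)
        for (int v = 0; v < N; ++v) {
            double s = 0.0;
            for (int x = 0; x < N; ++x)
                for (int y = 0; y < N; ++y)
                    s += in[x][y] *
                         std::cos((2 * x + 1) * u * pi / (2 * N)) *
                         std::cos((2 * y + 1) * v * pi / (2 * N));
            double cu = (u == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
            double cv = (v == 0) ? std::sqrt(1.0 / N) : std::sqrt(2.0 / N);
            out[u][v] = cu * cv * s;
        }
}

int main() {
    // One illustrative block containing a smooth gradient.
    double block[N][N], coeff[N][N];
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y)
            block[x][y] = 16.0 * x + 2.0 * y;

    dct2d(block, coeff);

    // Uniform quantization: small coefficients drop to zero, which is
    // where the compression comes from.
    const double q = 32.0;
    int nonzero = 0;
    for (int u = 0; u < N; ++u)
        for (int v = 0; v < N; ++v)
            if (std::lround(coeff[u][v] / q) != 0) ++nonzero;
    std::printf("%d of %d coefficients survive quantization\n", nonzero, N * N);
}
```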
Abstract:
This paper describes our plans to evaluate the present state of affairs concerning parallel programming and its systems. Three subprojects are proposed: a survey among programmers and scientists, a comparison of parallel programming systems using a standard set of test programs, and a wiki resource for the parallel programming community - the Parawiki. We would like to invite you to participate and turn these subprojects into true community efforts.
Abstract:
In this publication, we report on an online survey that was carried out among parallel programmers. More than 250 people worldwide have submitted answers to our questions, and their responses are analyzed here. Although not statistically sound, the data we provide give useful insights about which parallel programming systems and languages are known and in actual use. For instance, the collected data indicate that for our survey group MPI and (to a lesser extent) C are the most widely used parallel programming system and language, respectively.
Abstract:
The process of developing software that takes advantage of multiple processors is commonly referred to as parallel programming. For various reasons, this process is much harder than the sequential case. For decades, parallel programming has been a problem for a small niche only: engineers working on parallelizing mostly numerical applications in High Performance Computing. This has changed with the advent of multi-core processors in mainstream computer architectures. Parallel programming is nowadays becoming a problem for a much larger group of developers. The main objective of this thesis was to find ways to make parallel programming easier for them. Different aims were identified in order to reach the objective: research the state of the art of parallel programming today, improve the education of software developers about the topic, and provide programmers with powerful abstractions to make their work easier. To reach these aims, several key steps were taken. To start with, a survey was conducted among parallel programmers to find out about the state of the art. More than 250 people participated, yielding results about the parallel programming systems and languages in use, as well as about common problems with these systems. Furthermore, a study was conducted in university classes on parallel programming. It resulted in a list of frequently made mistakes that were analyzed and used to create a programmers' checklist to avoid them in the future. For programmers' education, an online resource, called the Parawiki, was set up to collect experiences and knowledge in the field of parallel programming. Another key step in this direction was the creation of the Thinking Parallel weblog, where more than 50,000 readers to date have read essays on the topic. For the third aim (powerful abstractions), it was decided to concentrate on one parallel programming system: OpenMP. Its ease of use and high level of abstraction were the most important reasons for this decision. Two different research directions were pursued. The first one resulted in a parallel library called AthenaMP. It contains so-called generic components, derived from design patterns for parallel programming. These include functionality to enhance the locks provided by OpenMP, to perform operations on large amounts of data (data-parallel programming), and to enable the implementation of irregular algorithms using task pools. AthenaMP itself serves a triple role: the components are well-documented and can be used directly in programs, it enables developers to study the source code and learn from it, and it is possible for compiler writers to use it as a testing ground for their OpenMP compilers. The second research direction was targeted at changing the OpenMP specification to make the system more powerful. The main contributions here were a proposal to enable thread cancellation and a proposal to avoid busy waiting. Both were implemented in a research compiler, shown to be useful in example applications, and proposed to the OpenMP Language Committee.
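As a flavor of the "generic components" direction (not AthenaMP's actual API, which the abstract does not show), here is a minimal RAII guard that makes OpenMP's raw lock interface scope-safe, used from a data-parallel loop; compile with -fopenmp.

```cpp
#include <omp.h>
#include <cstdio>

// A generic component in the spirit described above (not AthenaMP's actual
// interface): a RAII guard so an OpenMP lock is always released on scope exit.
class ScopedOmpLock {
public:
    explicit ScopedOmpLock(omp_lock_t& l) : lock_(l) { omp_set_lock(&lock_); }
    ~ScopedOmpLock() { omp_unset_lock(&lock_); }
    ScopedOmpLock(const ScopedOmpLock&) = delete;
    ScopedOmpLock& operator=(const ScopedOmpLock&) = delete;
private:
    omp_lock_t& lock_;
};

int main() {
    omp_lock_t lock;
    omp_init_lock(&lock);
    long sum = 0;

    // Data-parallel loop; the guard serializes only the shared update.
    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i) {
        ScopedOmpLock guard(lock);  // released automatically at scope exit
        sum += i;
    }

    omp_destroy_lock(&lock);
    std::printf("sum = %ld\n", sum);  // 499500
}
```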
Abstract:
Recurrent iterated function systems (RIFSs) are improvements of iterated function systems (IFSs) that use elements of the theory of Markovian stochastic processes and can produce more natural-looking images. We construct new RIFSs consisting essentially of a vertical contraction factor function and nonlinear transformations. These RIFSs are applied to image compression.
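As a much simplified illustration of the recurrent idea: in an RIFS the next contraction map is chosen according to a Markov transition matrix conditioned on the current map, rather than independently as in a plain IFS. The affine maps and transition matrix below are invented for illustration (the chaos game renders a Sierpinski-like attractor to a PGM file); they are not the constructions from the paper.

```cpp
#include <cstdio>
#include <random>
#include <vector>

struct Affine { double a, b, c, d, e, f; };  // (x,y) -> (ax+by+e, cx+dy+f)

int main() {
    // Three contractive maps of the unit square (chosen for illustration).
    std::vector<Affine> maps = {
        {0.5, 0, 0, 0.5, 0.0,  0.0},
        {0.5, 0, 0, 0.5, 0.5,  0.0},
        {0.5, 0, 0, 0.5, 0.25, 0.5},
    };
    // Markov transition matrix: row i gives P(next map | current map i).
    // This dependence on the previous map is what makes the IFS "recurrent".
    double T[3][3] = {{0.1, 0.6, 0.3},
                      {0.4, 0.2, 0.4},
                      {0.5, 0.4, 0.1}};

    std::mt19937 rng(7);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    const int W = 64, H = 64;
    std::vector<unsigned char> img(W * H, 0);

    double x = 0.3, y = 0.3;
    int state = 0;
    for (int n = 0; n < 200000; ++n) {
        // Sample the next map from the row of the current state.
        double r = u(rng), acc = 0.0;
        int next = 2;
        for (int j = 0; j < 3; ++j) {
            acc += T[state][j];
            if (r < acc) { next = j; break; }
        }
        const Affine& m = maps[next];
        double nx = m.a * x + m.b * y + m.e;
        double ny = m.c * x + m.d * y + m.f;
        x = nx; y = ny; state = next;
        // Skip the transient, then plot the orbit point.
        if (n > 100) img[int(y * (H - 1)) * W + int(x * (W - 1))] = 255;
    }

    // Write the attractor as a binary PGM image.
    FILE* f = std::fopen("attractor.pgm", "wb");
    std::fprintf(f, "P5\n%d %d\n255\n", W, H);
    std::fwrite(img.data(), 1, img.size(), f);
    std::fclose(f);
}
```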
Abstract:
This paper contributes to the study of Freely Rewriting Restarting Automata (FRR-automata) and Parallel Communicating Grammar Systems (PCGS), both of which are useful models in computational linguistics. For PCGSs we study two complexity measures called 'generation complexity' and 'distribution complexity', and we prove that a PCGS Pi, for which the generation complexity and the distribution complexity are both bounded by constants, can be transformed into a freely rewriting restarting automaton of a very restricted form. From this characterization it follows that the language L(Pi) generated by Pi is semi-linear, that its characteristic analysis is of polynomial size, and that this analysis can be computed in polynomial time.
Abstract:
This dissertation introduces and investigates systems of parallel communicating restarting automata (PCRA systems). Two well-known concepts from formal languages and automata theory are combined: the model of restarting automata and so-called PC systems (systems of parallel communicating components). A PCRA system consists of finitely many restarting automata that, on the one hand, carry out local computations in parallel and independently of one another and, on the other hand, are allowed to communicate with each other. Communication follows a fixed protocol realized by means of special communication states. An essential feature of the communication structure in systems of cooperating components is whether communication is centralized or non-centralized. While in a non-centralized communication structure every component may communicate with every other component, in a centralized communication structure all communication takes place exclusively with a designated master component. One of the most important results of this work shows that centralized systems and non-centralized systems have the same computational power (which is not the case for PC systems in general). Moreover, using multicast or broadcast communication in addition to point-to-point communication does not increase the computational power either. Furthermore, the expressive power of PCRA systems is investigated and compared with that of PC systems of finite automata and with that of multi-head automata. PC systems of finite automata are known to have the same expressive power as one-way multi-head automata and form a lower bound for the expressive power of PCRA systems with one-way components. In fact, PCRA systems are stronger than PC systems of finite automata even when the components, taken on their own, have the same expressive power, i.e., characterize the regular languages. For PCRA systems with two-way components, the language classes of two-way multi-head automata in the deterministic and the nondeterministic case are shown to be lower bounds; these in turn correspond to the well-known complexity classes L (deterministic logarithmic space) and NL (nondeterministic logarithmic space). The class of context-sensitive languages is shown to be an upper bound. In addition, extensions of restarting automata (the non-forgetting and the shrinking property) are considered that increase the computational power of individual components but do not increase the power of systems. The language classes characterized by PCRA systems are closed under various language operations, and some of these classes are even abstract families of languages (AFLs). Finally, problems specific to PCRA systems are examined for decidability. It is shown that emptiness, universality, inclusion, equivalence, and finiteness are not even semi-decidable for systems with just two restarting automata of the weakest type. The membership problem is shown to be decidable in quadratic time in the deterministic case and in exponential time in the nondeterministic case.
Abstract:
This thesis defines Pi, a parallel architecture interface that separates model and machine issues, allowing them to be addressed independently. This provides greater flexibility for both the model and machine builder. Pi addresses a set of common parallel model requirements including low latency communication, fast task switching, low cost synchronization, efficient storage management, the ability to exploit locality, and efficient support for sequential code. Since Pi provides generic parallel operations, it can efficiently support many parallel programming models including hybrids of existing models. Pi also forms a basis of comparison for architectural components.
Abstract:
This report addresses the problem of acquiring objects using articulated robotic hands. Standard grasps are used to make the problem tractable, and a technique is developed for generalizing these standard grasps to increase their flexibility to variations in the problem geometry. A generalized grasp description is applied to a new problem situation using a parallel search through hand configuration space, and the result of this operation is a global overview of the space of good solutions. The techniques presented in this report have been implemented, and the results are verified using the Salisbury three-finger robotic hand.
Abstract:
Scheduling tasks to efficiently use the available processor resources is crucial to minimizing the runtime of applications on shared-memory parallel processors. One factor that contributes to poor processor utilization is the idle time caused by long latency operations, such as remote memory references or processor synchronization operations. One way of tolerating this latency is to use a processor with multiple hardware contexts that can rapidly switch to executing another thread of computation whenever a long latency operation occurs, thus increasing processor utilization by overlapping computation with communication. Although multiple contexts are effective for tolerating latency, this effectiveness can be limited by memory and network bandwidth, by cache interference effects among the multiple contexts, and by critical tasks sharing processor resources with less critical tasks. This thesis presents techniques that increase the effectiveness of multiple contexts by intelligently scheduling threads to make more efficient use of processor pipeline, bandwidth, and cache resources. This thesis proposes thread prioritization as a fundamental mechanism for directing the thread schedule on a multiple-context processor. A priority is assigned to each thread either statically or dynamically and is used by the thread scheduler to decide which threads to load in the contexts, and to decide which context to switch to on a context switch. We develop a multiple-context model that integrates both cache and network effects, and show how thread prioritization can both maintain high processor utilization and limit increases in critical path runtime caused by multithreading. The model also shows that in order to be effective in bandwidth-limited applications, thread prioritization must be extended to prioritize memory requests. We show how simple hardware can prioritize the running of threads in the multiple contexts, and the issuing of requests to both the local memory and the network. Simulation experiments show how thread prioritization is used in a variety of applications. Thread prioritization can improve the performance of synchronization primitives by minimizing the number of processor cycles wasted in spinning and devoting more cycles to critical threads. Thread prioritization can be used in combination with other techniques to improve cache performance and minimize cache interference between different working sets in the cache. For applications that are critical path limited, thread prioritization can improve performance by allowing processor resources to be devoted preferentially to critical threads. These experimental results show that thread prioritization is a mechanism that can be used to implement a wide range of scheduling policies.
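The central scheduling decision described above can be sketched compactly: on each context switch, run the loaded thread with the highest priority. The thread structure and priority values below are invented for illustration; real hardware would make the same comparison in its context-selection logic.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Thread {
    int id;
    int priority;   // higher value = more critical thread
    int work_left;  // abstract units of remaining computation
};

// Order the ready set so the most critical loaded thread runs first,
// as a priority-driven context-switch policy would.
struct ByPriority {
    bool operator()(const Thread& a, const Thread& b) const {
        return a.priority < b.priority;
    }
};

int main() {
    std::priority_queue<Thread, std::vector<Thread>, ByPriority> contexts;
    contexts.push({0, 5, 3});  // critical-path thread
    contexts.push({1, 1, 2});
    contexts.push({2, 3, 1});

    // Each "long latency operation" triggers a switch to the highest-priority
    // loaded context; a descheduled thread re-enters the ready set.
    while (!contexts.empty()) {
        Thread t = contexts.top();
        contexts.pop();
        std::printf("running thread %d (priority %d)\n", t.id, t.priority);
        if (--t.work_left > 0) contexts.push(t);
    }
}
```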
Abstract:
This thesis presents a new actuator system consisting of a micro-actuator and a macro-actuator coupled in parallel via a compliant transmission. The system is called the Parallel Coupled Micro-Macro Actuator, or PaCMMA. In this system, the micro-actuator is capable of high bandwidth force control due to its low mass and direct-drive connection to the output shaft. The compliant transmission of the macro-actuator reduces the impedance (stiffness) at the output shaft and increases the dynamic range of force. Performance improvement over single actuator systems was expected in force control, impedance control, force distortion and reduction of transient impact forces. A set of quantitative measures is proposed and the actuator system is evaluated against them: Force Control Bandwidth, Position Bandwidth, Dynamic Range, Impact Force, Impedance ("Backdriveability"), Force Distortion and Force Performance Space. Several theoretical performance limits are derived from the saturation limits of the system. A control law is proposed and control system performance is compared to the theoretical limits. A prototype testbed was built using permanent magnet motors, and an experimental comparison was performed between this actuator concept and two single actuator systems. The following performance was observed: force bandwidth of 56 Hz, torque dynamic range of 800:1, peak torque of 1040 mNm, minimum torque of 1.3 mNm. Peak impact force was reduced by an order of magnitude. Distortion at small amplitudes was reduced substantially. Backdriven impedance was reduced by 2-3 orders of magnitude. This actuator system shows promise for manipulator design as well as psychophysical tests of human performance.
Abstract:
The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles needed to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative solution is to reduce latency by migrating threads and data, but the overhead of existing implementations has so far made migration an unserviceable solution. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine by using a set of high performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles -- over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.
Abstract:
A key capability of data-race detectors is to determine whether one thread executes logically in parallel with another or whether the threads must operate in series. This paper provides two algorithms, one serial and one parallel, to maintain series-parallel (SP) relationships "on the fly" for fork-join multithreaded programs. The serial SP-order algorithm runs in O(1) amortized time per operation. In contrast, the previously best algorithm requires a time per operation that is proportional to Tarjan's functional inverse of Ackermann's function. SP-order employs an order-maintenance data structure that allows us to implement a more efficient "English-Hebrew" labeling scheme than was used in earlier race detectors, which immediately yields an improved determinacy-race detector. In particular, any fork-join program running in T₁ time on a single processor can be checked on the fly for determinacy races in O(T₁) time. Corresponding improved bounds can also be obtained for more sophisticated data-race detectors, for example, those that use locks. By combining SP-order with Feng and Leiserson's serial SP-bags algorithm, we obtain a parallel SP-maintenance algorithm, called SP-hybrid. Suppose that a fork-join program has n threads, T₁ work, and a critical-path length of T∞. When executed on P processors, we prove that SP-hybrid runs in O((T₁/P + PT∞) lg n) expected time. To understand this bound, consider that the original program obtains linear speed-up over a 1-processor execution when P = O(T₁/T∞). In contrast, SP-hybrid obtains linear speed-up when P = O(√(T₁/T∞)), but the work is increased by a factor of O(lg n).
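A static simplification of the "English-Hebrew" idea may help intuition (SP-order itself maintains the labels on the fly with an order-maintenance structure, which this sketch does not attempt): English order visits the children of every node of the SP parse tree left to right, Hebrew order reverses the children of P-nodes only, and thread X logically precedes thread Y exactly when X comes before Y in both orders; disagreement between the two orders means the threads are logically parallel. The tree shape below is an invented example.

```cpp
#include <cstdio>
#include <vector>

// A node of a series-parallel (fork-join) parse tree: internal nodes are
// 'S' (children run in series) or 'P' (children run in parallel);
// 'L' leaves are threads.
struct Node {
    char kind;
    int eng = 0, heb = 0;  // English / Hebrew labels (leaves only)
    std::vector<Node*> kids;
};

void label(Node* n, int& counter, bool hebrew) {
    if (n->kind == 'L') {
        (hebrew ? n->heb : n->eng) = ++counter;
        return;
    }
    bool rev = hebrew && n->kind == 'P';  // Hebrew order reverses P-children
    for (size_t i = 0; i < n->kids.size(); ++i)
        label(n->kids[rev ? n->kids.size() - 1 - i : i], counter, hebrew);
}

// X logically precedes Y iff both orders agree; disagreement means parallel.
bool precedes(const Node& x, const Node& y) {
    return x.eng < y.eng && x.heb < y.heb;
}

int main() {
    // S( a, P(b, c), d ): b and c are parallel; a precedes everything.
    Node a{'L'}, b{'L'}, c{'L'}, d{'L'};
    Node p{'P'}; p.kids = {&b, &c};
    Node root{'S'}; root.kids = {&a, &p, &d};

    int e = 0, h = 0;
    label(&root, e, false);  // English labels
    label(&root, h, true);   // Hebrew labels

    std::printf("a precedes b: %d\n", precedes(a, b));              // 1
    std::printf("b parallel c: %d\n",
                !precedes(b, c) && !precedes(c, b));                // 1
    std::printf("c precedes d: %d\n", precedes(c, d));              // 1
}
```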