Multi-threaded processors execute multiple threads concurrently in order to increase overall throughput. It is well documented that multi-threading affects per-thread performance but, more importantly, some threads are affected more than others. This is especially troublesome for multi-programmed workloads. Fairness metrics measure whether all threads are affected equally. However, defining equal treatment is not straightforward. Several fairness metrics for multi-threaded processors have been used in the literature, but there does not seem to be a consensus on which metric does the best job of measuring fairness. This paper reviews the prevalent fairness metrics and analyzes their main properties. Each metric strikes a different trade-off between fairness in the strict sense and throughput. We categorize the metrics with respect to this property. Based on experimental data for SMT processors, we suggest using the minimum fairness metric in order to balance fairness and throughput.
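
The trade-off between strict fairness and throughput can be made concrete with a small Python sketch. The abstract does not define the metrics, so the definitions below are illustrative assumptions: per-thread progress is taken to be IPC when co-scheduled divided by IPC when running alone, and `minimum_progress` is only one plausible reading of a "minimum" style metric. All function names are hypothetical.

```python
from statistics import harmonic_mean


def relative_progress(ipc_multi, ipc_single):
    """Per-thread progress: IPC when co-scheduled divided by IPC when run alone.

    Assumption: this normalisation is not taken from the abstract.
    """
    return [m / s for m, s in zip(ipc_multi, ipc_single)]


def throughput(progress):
    # Sum of per-thread progress; rewards total work done, ignores balance.
    return sum(progress)


def strict_fairness(progress):
    # Least-favoured over most-favoured thread; 1.0 means equal treatment.
    return min(progress) / max(progress)


def harmonic_mean_progress(progress):
    # Harmonic mean penalises a single badly slowed thread more than the sum does.
    return harmonic_mean(progress)


def minimum_progress(progress):
    # Progress of the worst-off thread (assumed reading of a "minimum" metric).
    return min(progress)


# Two hypothetical schedules for the same pair of threads.
ipc_alone = [2.0, 1.0]
balanced = relative_progress([1.0, 0.5], ipc_alone)   # both threads at half speed
skewed = relative_progress([1.6, 0.3], ipc_alone)     # thread 0 favoured
```

In this made-up example the skewed schedule has the higher throughput but the lower strict fairness and the lower minimum progress, which is the kind of disagreement between metrics the abstract refers to.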


In the reinsurance market, the risks natural catastrophes pose to portfolios of properties must be quantified, so that they can be priced, and insurance offered. The analysis of such risks at a portfolio level requires a simulation of up to 800 000 trials with an average of 1000 catastrophic events per trial. This is sufficient to capture risk for a global multi-peril reinsurance portfolio covering a range of perils including earthquake, hurricane, tornado, hail, severe thunderstorm, wind storm, storm surge and riverine flooding, and wildfire. Such simulations are both computation and data intensive, making the application of high-performance computing techniques desirable.

In this paper, we explore the design and implementation of portfolio risk analysis on both multi-core and many-core computing platforms. Given a portfolio of property catastrophe insurance treaties, key risk measures, such as probable maximum loss, are computed by taking both primary and secondary uncertainties into account. Primary uncertainty is associated with whether or not an event occurs in a simulated year, while secondary uncertainty captures the uncertainty in the level of loss due to the use of simplified physical models and limitations in the available data. A combination of fast lookup structures, multi-threading and careful hand tuning of numerical operations is required to achieve good performance. Experimental results are reported for multi-core processors and systems using NVIDIA graphics processing unit and Intel Phi many-core accelerators.
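
The two kinds of uncertainty described above can be sketched as a small Monte Carlo loop in Python. This is a minimal illustration, not the paper's implementation: the event catalogue format, the Bernoulli occurrence test, the truncated normal loss distribution and both function names are assumptions, and none of the lookup structures or hardware tuning is shown.

```python
import random


def simulate_year_losses(event_catalog, num_trials, seed=0):
    """Simulate total loss per trial (one trial = one simulated year).

    event_catalog: list of (annual_probability, mean_loss, loss_std_dev)
    tuples; this layout is a hypothetical stand-in for real event loss data.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(num_trials):
        year_loss = 0.0
        for annual_prob, mean_loss, loss_sd in event_catalog:
            # Primary uncertainty: does the event occur in this simulated year?
            if rng.random() < annual_prob:
                # Secondary uncertainty: the loss given occurrence is random.
                # A normal draw clipped at zero is an assumption for brevity.
                year_loss += max(0.0, rng.gauss(mean_loss, loss_sd))
        losses.append(year_loss)
    return losses


def probable_maximum_loss(losses, return_period):
    """Empirical loss level exceeded roughly once every `return_period` years."""
    ordered = sorted(losses)
    index = min(len(ordered) - 1,
                int(len(ordered) * (1.0 - 1.0 / return_period)))
    return ordered[index]


catalog = [(0.10, 100.0, 20.0), (0.02, 1000.0, 300.0)]
year_losses = simulate_year_losses(catalog, num_trials=10_000, seed=42)
pml_100 = probable_maximum_loss(year_losses, return_period=100)
```

Each trial is independent of the others, which is what makes this kind of simulation a natural fit for the multi-threaded and many-core platforms the paper targets.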


We present in this paper several contributions to collision detection optimization centered on hardware performance. We focus on the broad phase, which is the first step of the collision detection process, and propose three new ways of parallelizing the well-known Sweep and Prune algorithm. We first developed a multi-core model that takes into account the number of available cores. A multi-core architecture enables us to distribute geometric computations using multi-threading. Critical write sections and thread idling have been minimized by introducing new data structures for each thread. Programming with directives, as in OpenMP, appears to be a good compromise for code portability. We then propose a new GPU-based algorithm, also based on Sweep and Prune, that has been adapted to multi-GPU architectures. Our technique is based on a spatial subdivision method used to distribute computations among GPUs. Results show that a significant speed-up can be obtained by going from 1 to 4 GPUs in a large-scale environment.
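
For readers unfamiliar with the broad phase, the basic serial Sweep and Prune idea can be sketched in Python as follows. This is a textbook-style single-axis version under assumed conventions (2D axis-aligned boxes given as tuples, a hypothetical function name); it does not reproduce the multi-core or multi-GPU variants proposed in the paper.

```python
def sweep_and_prune(boxes):
    """Return the set of index pairs whose axis-aligned boxes overlap.

    boxes: list of (min_x, max_x, min_y, max_y) tuples (assumed layout).
    """
    # Sweep: visit boxes in order of their lower bound on the x axis.
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    active = []   # boxes whose x interval may still overlap later boxes
    pairs = set()
    for i in order:
        min_x = boxes[i][0]
        # Prune: drop boxes whose x interval ended before this one starts.
        active = [j for j in active if boxes[j][1] >= min_x]
        for j in active:
            # The x intervals overlap; confirm the overlap on the y axis.
            if boxes[i][2] <= boxes[j][3] and boxes[j][2] <= boxes[i][3]:
                pairs.add((min(i, j), max(i, j)))
        active.append(i)
    return pairs


example_boxes = [
    (0.0, 2.0, 0.0, 2.0),
    (1.0, 3.0, 1.0, 3.0),    # overlaps box 0 on both axes
    (5.0, 6.0, 0.0, 1.0),    # far away on x
    (1.5, 2.5, 10.0, 11.0),  # overlaps on x only
]
overlapping = sweep_and_prune(example_boxes)
```

Only the pairs that survive this cheap test need to be passed on to the more expensive narrow phase.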


Sensor network nodes exhibit characteristics of both embedded systems and general-purpose systems. A sensor network operating system is a kind of embedded operating system, but unlike a typical embedded operating system, a sensor network operating system may not be real time and is constrained by memory and energy. Most sensor network operating systems are based on an event-driven approach. The event-driven approach is efficient in terms of time and space, and it does not require a separate stack for each execution context. With this model, however, it is difficult to implement long-running tasks such as cryptographic operations. Thread-based computation requires a separate stack for each execution context and is less efficient in terms of time and space. In this paper, we propose a thread-based execution model that uses only a fixed number of stacks. In this execution model, the number of stacks at each priority level is fixed. It minimizes the stack requirement of a multi-threading environment and at the same time provides ease of programming. We give an implementation of this model in Contiki OS by completely separating the thread implementation from the protothread implementation. We have tested our OS by implementing a clock synchronization protocol on it.


Programming environments for smartphones expose a concurrency model that combines multi-threading and asynchronous event-based dispatch. While this enables the development of efficient and feature-rich applications, unforeseen thread interleavings coupled with non-deterministic reorderings of asynchronous tasks can lead to subtle concurrency errors in the applications. In this paper, we formalize the concurrency semantics of the Android programming model. We further define the happens-before relation for Android applications, and develop a dynamic race detection technique based on this relation. Our relation generalizes the so far independently studied happens-before relations for multi-threaded programs and single-threaded event-driven programs. Additionally, our race detection technique uses a model of the Android runtime environment to reduce false positives. We have implemented a tool called DROIDRACER. It generates execution traces by systematically testing Android applications and detects data races by computing the happens-before relation on the traces. We analyzed 15 Android applications including popular applications such as Facebook, Twitter and K-9 Mail. Our results indicate that data races are prevalent in Android applications, and that DROIDRACER is an effective tool to identify data races.


A key capability of data-race detectors is to determine whether one thread executes logically in parallel with another or whether the threads must operate in series. This paper provides two algorithms, one serial and one parallel, to maintain series-parallel (SP) relationships "on the fly" for fork-join multithreaded programs. The serial SP-order algorithm runs in O(1) amortized time per operation. In contrast, the previously best algorithm requires a time per operation that is proportional to Tarjan’s functional inverse of Ackermann’s function. SP-order employs an order-maintenance data structure that allows us to implement a more efficient "English-Hebrew" labeling scheme than was used in earlier race detectors, which immediately yields an improved determinacy-race detector. In particular, any fork-join program running in T₁ time on a single processor can be checked on the fly for determinacy races in O(T₁) time. Corresponding improved bounds can also be obtained for more sophisticated data-race detectors, for example, those that use locks. By combining SP-order with Feng and Leiserson’s serial SP-bags algorithm, we obtain a parallel SP-maintenance algorithm, called SP-hybrid. Suppose that a fork-join program has n threads, T₁ work, and a critical-path length of T_∞. When executed on P processors, we prove that SP-hybrid runs in O((T₁/P + P·T_∞) lg n) expected time. To understand this bound, consider that the original program obtains linear speed-up over a 1-processor execution when P = O(T₁/T_∞). In contrast, SP-hybrid obtains linear speed-up when P = O(√(T₁/T_∞)), but the work is increased by a factor of O(lg n).
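
The idea behind English-Hebrew labeling can be illustrated with a static Python sketch. It assumes the complete SP parse tree is available up front, encoded as nested tuples (a hypothetical encoding), and simply numbers the threads in two traversal orders. It is not the on-the-fly SP-order algorithm and uses no order-maintenance data structure.

```python
def english_hebrew_labels(tree):
    """Return (english, hebrew) rank dictionaries for the leaves of an SP tree.

    tree: a thread name (str) or a tuple (kind, left, right) where kind is
    "S" for series or "P" for parallel composition (assumed encoding).
    """
    english, hebrew = {}, {}

    def walk(node, order, swap_parallel):
        if isinstance(node, str):
            order[node] = len(order)  # next rank in this traversal
            return
        kind, left, right = node
        if kind == "P" and swap_parallel:
            # The Hebrew order visits the right child of a P-node first.
            left, right = right, left
        walk(left, order, swap_parallel)
        walk(right, order, swap_parallel)

    walk(tree, english, False)  # English: always left to right
    walk(tree, hebrew, True)    # Hebrew: right to left at P-nodes only
    return english, hebrew


def precedes(x, y, english, hebrew):
    # x runs in series before y iff it comes first in both orders.
    return english[x] < english[y] and hebrew[x] < hebrew[y]


def logically_parallel(x, y, english, hebrew):
    # Distinct threads are parallel iff the two orders disagree about them.
    return x != y and (english[x] < english[y]) != (hebrew[x] < hebrew[y])


# a; then b in parallel with c; then d
program = ("S", "a", ("S", ("P", "b", "c"), "d"))
eng, heb = english_hebrew_labels(program)
```

Here `b` and `c` swap positions between the two orders and are therefore reported as parallel, while `a` precedes everything and `d` follows everything in both orders.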


In the 1990s the Message Passing Interface Forum defined MPI bindings for Fortran, C, and C++. With the success of MPI these relatively conservative languages have continued to dominate in the parallel computing community. There are compelling arguments in favour of more modern languages like Java. These include portability, better runtime error checking, modularity, and multi-threading. But these arguments have not converted many HPC programmers, perhaps due to the scarcity of full-scale scientific Java codes, and the lack of evidence for performance competitive with C or Fortran. This paper tries to redress this situation by porting two scientific applications to Java. Both of these applications are parallelized using our thread-safe Java messaging system, MPJ Express. The first application is the Gadget-2 code, which is a massively parallel structure formation code for cosmological simulations. The second application uses the finite-difference time-domain (FDTD) method for simulations in the area of computational electromagnetics. We evaluate and compare the performance of the Java and C versions of these two scientific applications, and demonstrate that the Java codes can achieve performance comparable with legacy applications written in conventional HPC languages. Copyright © 2009 John Wiley & Sons, Ltd.


The past decade has witnessed explosive growth of mobile subscribers and services. To provide better, swifter and cheaper services, radio network optimisation plays a crucial role but faces enormous challenges. The concept of Dynamic Network Optimisation (DNO) has therefore been introduced to optimally and continuously adjust network configurations in response to changes in network conditions and traffic. However, the realization of DNO has been seriously hindered by the bottleneck of optimisation speed. This paper presents an advanced distributed parallel solution that bridges the gap by accelerating the sophisticated proprietary network optimisation algorithm while maintaining optimisation quality and numerical consistency. The ariesoACP product from Arieso Ltd serves as the main platform for acceleration. The solution has been prototyped, implemented and tested. Results from real projects show high scalability and substantial acceleration, with average speed-ups of 2.5, 4.9 and 6.1 on distributed 5-core, 9-core and 16-core systems, respectively. This significantly outperforms other parallel solutions such as multi-threading. Furthermore, an improved optimisation outcome has been achieved, alongside high correctness and self-consistency. Overall, this is a breakthrough towards the realization of DNO.
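
The reported speed-ups can be put in perspective by converting them to parallel efficiency, defined in the usual way as speed-up divided by core count. The abstract reports only the speed-ups; the efficiency figures below are derived arithmetic, and they assume every counted core does optimisation work, which the abstract does not state.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / cores


# Average speed-ups quoted in the abstract, keyed by core count.
reported_speedups = {5: 2.5, 9: 4.9, 16: 6.1}

efficiency = {
    cores: parallel_efficiency(speedup, cores)
    for cores, speedup in reported_speedups.items()
}

for cores, value in efficiency.items():
    print(f"{cores:2d} cores: efficiency {value:.2f}")
```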


To ensure greater reliability and consistency of the data stored in a database, the data cleaning stage is placed early in the Knowledge Discovery in Databases (KDD) process and is responsible for eliminating problems and adjusting the data for the later stages, especially data mining. Such problems occur at the instance and schema levels and include missing values, null values, duplicate tuples and values outside the domain, among others. Several algorithms have been developed to perform the cleaning step in databases; some were developed specifically to work with the phonetics of words, since a word can be written in different ways. Within this perspective, the original contribution of this work is an optimization, based on multithreading, of an algorithm for the phonetic detection of duplicate tuples in databases, which requires no training data and is supported by a language-independent environment. © 2011 IEEE.
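
The general shape of phonetic duplicate detection with a thread pool can be sketched in Python. The abstract does not name its phonetic algorithm, so the sketch substitutes American Soundex, which is English-oriented and therefore not the language-independent method the paper describes. The function names and the grouping strategy are likewise assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Soundex digit for each coded consonant; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Four-character American Soundex key (stand-in phonetic encoding)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets this, so repeats across it count
    return (key + "000")[:4]


def candidate_duplicates(names, workers=4):
    """Group names that share a phonetic key; keys are computed in a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        keys = list(pool.map(soundex, names))
    groups = defaultdict(list)
    for name, key in zip(names, keys):
        groups[key].append(name)
    return [group for group in groups.values() if len(group) > 1]


duplicates = candidate_duplicates(["Robert", "Rupert", "Smith", "Smyth", "Jones"])
```

Because of CPython's global interpreter lock, the thread pool here shows the structure of the approach rather than a real speed-up. No training data is needed, since the keys depend only on spelling.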


Over the course of its existence, the Web has changed, partly in response to market demands but above all because of the evolution and constant emergence of the many technologies involved in it. It has moved from an initial, simple distribution of static content to a collection of websites, at first with limited dynamism and interactivity (owing to technological limits), which later evolved into today's modern web applications. These have closed the gap with desktop applications, both technologically and in actual market adoption. Such modern web applications can be every bit as complex as traditional desktop software systems. Web technologies have evolved over time along with the web itself, and among the most widespread is JavaScript, a scripting language created to make websites dynamic that is now used as a programming language for highly structured applications. Over the years, the development community around JavaScript has produced numerous supporting libraries, giving developers a complete language capable of building advanced web applications. Recent advances in the JavaScript engines found in browsers have also increased the language's performance, cementing its lead over competing languages.
In recent years, owing to the growing complexity of web applications, JavaScript has been widely called into question because, as a language, it does not offer the classic, time-tested abstractions for highly structured programming. For this reason, object-oriented languages for the web have emerged that aim to solve this problem. Some have the ambition of supplanting JavaScript, such as Dart, created by Google; others use JavaScript as a base language to which they add the missing features and, through compilation, produce pure JavaScript code compatible with the JavaScript engines in browsers. JavaScript was historically introduced as a language for both client-side programming and its server-side counterpart, but for various reasons (strong competition, low performance, etc.) it succeeded only as a client-side language. Recent developments have brought it back into favour for server-side programming as well, mainly because of performance improvements but also because of its natural fit for event-driven programming, a paradigm that is an alternative to multi-threading for concurrent programming. A highly complex web application can therefore now be developed entirely in JavaScript, taking on both its advantages and its disadvantages. The newly introduced technologies thus aim to become the solution to JavaScript's problems and consequently present themselves as potential new complete languages for the web programming of the future, also anticipating the upcoming evolutions of existing technologies announced by the web programming standards bodies, the W3C and ECMAScript.
This thesis addresses the topics just introduced by comparing the technologies involved, with the aim of giving a broad overview of the solutions a web developer must consider when building a large system. In particular, it examines in depth the TypeScript language proposed by Microsoft, which appeared after Dart with apparently the same purpose but which, thanks to its compatibility with JavaScript and above all with the vast ecosystem of related libraries that has grown up in recent years, presents itself on the market as a technology that is easy to learn for all developers who have long since built up skills in JavaScript programming.


An overview is given of the lessons learned from the introduction of multi-threading using OpenMP in tmLQCD. In particular, programming style, performance measurements, cache misses, scaling, thread distribution for hybrid codes, race conditions, the overlapping of communication and computation and the measurement and reduction of certain overheads are discussed. Performance measurements and sampling profiles are given for different implementations of the hopping matrix computational kernel.


We present a framework for the analysis of the decoding delay in multiview video coding (MVC). We show that in real-time applications, an accurate estimation of the decoding delay is essential to achieve a minimum communication latency. As opposed to single-view codecs, the complexity of the multiview prediction structure and the parallel decoding of several views require a systematic analysis of this decoding delay, which we carry out using graph theory and a model of the decoder hardware architecture. Our framework assumes a decoder implementation on general-purpose multi-core processors with multi-threading capabilities. For this hardware model, we show that frame processing times depend on the computational load of the decoder, and we provide an iterative algorithm to jointly compute frame processing times and decoding delay. Finally, we show that decoding delay analysis can be applied to design decoders with the objective of minimizing the communication latency of the MVC system.


In this paper, we present a formal model of Java concurrency using the Object-Z specification language. This model captures the Java thread synchronization concepts of locking, blocking, waiting and notification. In the model, we take a viewpoints approach, first capturing the role of the objects and threads, and then taking a system view where we capture the way the objects and threads cooperate and communicate. As a simple illustration of how the model can, in general, be applied, we use Object-Z inheritance to integrate the model with the classical producer-consumer system to create a specification directly incorporating the Java concurrency constructs.


The Java programming language supports concurrency. Concurrent programs are hard to test due to their inherent non-determinism. This paper presents a classification of concurrency failures that is based on a model of Java concurrency. The model and failure classification are used to justify coverage of synchronization primitives of concurrent components. This is achieved by constructing concurrency flow graphs for each method call. A producer-consumer monitor is used to demonstrate how the approach can be used to measure coverage of concurrency primitives and thereby assist in determining test sequences for deterministic execution.
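
A producer-consumer monitor of the kind mentioned above can be sketched in Python, whose `threading.Condition` plays the role of Java's `synchronized`, `wait` and `notifyAll`. This is an analogue chosen for illustration: the class and function names are hypothetical, and the sketch reproduces neither the Object-Z model nor the concurrency flow graphs of the papers.

```python
import threading
from collections import deque


class BoundedBuffer:
    """Monitor-style bounded buffer guarded by a single condition variable."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._monitor = threading.Condition()

    def put(self, item):
        with self._monitor:                      # locking
            while len(self._items) >= self._capacity:
                self._monitor.wait()             # waiting: buffer is full
            self._items.append(item)
            self._monitor.notify_all()           # notification: item available

    def get(self):
        with self._monitor:
            while not self._items:
                self._monitor.wait()             # waiting: buffer is empty
            item = self._items.popleft()
            self._monitor.notify_all()           # notification: space available
            return item


def run_demo(count=10, capacity=2):
    """Run one producer and one consumer; return items in consumption order."""
    buffer = BoundedBuffer(capacity)
    received = []

    def produce():
        for value in range(count):
            buffer.put(value)

    def consume():
        for _ in range(count):
            received.append(buffer.get())

    threads = [threading.Thread(target=produce), threading.Thread(target=consume)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received
```

The `while` loops around `wait()` re-check the condition after every wake-up, so a wake-up that arrives when the condition no longer holds cannot let a thread proceed. Omitting that re-check is a typical example of the kind of synchronization fault that coverage of concurrency primitives aims to expose.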


