16 resultados para HPC parallel computer architecture queues fault tolerance programmability ADAM
em AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Resumo:
The design optimization of industrial products has always been an essential activity to improve product quality while reducing time-to-market and production costs. Although cost management is very complex and comprises all phases of the product life cycle, the control of geometrical and dimensional variations, known as Dimensional Management (DM), allows compliance with product and process requirements. Hence, the tolerance-cost optimization becomes the main practice to provide an effective application of Design for Tolerancing (DfT) and Design to Cost (DtC) approaches by enabling a connection between product tolerances and associated manufacturing costs. However, despite the growing interest in this topic, a profitable application in the industry of these techniques is hampered by their complexity: the definition of a systematic framework is the key element to improving design optimization, enhancing the concurrent use of Computer-Aided tools and Model-Based Definition (MBD) practices. The present doctorate research aims to define and develop an integrated methodology for product/process design optimization, to better exploit the new capabilities of advanced simulations and tools. By implementing predictive models and multi-disciplinary optimization, a Computer-Aided Integrated framework for tolerance-cost optimization has been proposed to allow the integration of DfT and DtC approaches and their direct application for the design of automotive components. Several case studies have been considered, with the final application of the integrated framework on a high-performance V12 engine assembly, to achieve both functional targets and cost reduction. From a scientific point of view, the proposed methodology provides an improvement for the tolerance-cost optimization of industrial components. The integration of theoretical approaches and Computer-Aided tools allows to analyse the influence of tolerances on both product performance and manufacturing costs. The case studies proved the suitability of the methodology for its application in the industrial field, providing the identification of further areas for improvement and refinement.
Resumo:
The sustained demand for faster,more powerful chips has beenmet by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called SYSTEM-ON-CHIP (SOC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MPSOC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With a number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how to best provide on-chip communication resources is clearly felt. NETWORKS-ON-CHIPS (NOCS) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions: • The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the on-chip domain. • Simulation and verification infrastructure must be put in place to explore, validate and optimize the NoC performance. • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions. • Even more so given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to evaluate their suitability for next-generation designs and their area and power costs. This dissertation focuses on all of the above points, by describing a NoC architectural implementation called ×pipes; a NoC simulation environment within a cycle-accurate MPSoC emulator called MPARM; a NoC design flow consisting of a front-end tool for optimal NoC instantiation, called SunFloor, and a set of back-end facilities for the study of NoC physical implementations. This dissertation proves the viability of NoCs for current and upcoming designs, by outlining their advantages (alongwith a fewtradeoffs) and by providing a full NoC implementation framework. It also presents some examples of additional extensions of NoCs, allowing e.g. for increased fault tolerance, and outlines where NoCsmay find further application scenarios, such as in stacked chips.
Resumo:
The Peer-to-Peer network paradigm is drawing the attention of both final users and researchers for its features. P2P networks shift from the classic client-server approach to a high level of decentralization where there is no central control and all the nodes should be able not only to require services, but to provide them to other peers as well. While on one hand such high level of decentralization might lead to interesting properties like scalability and fault tolerance, on the other hand it implies many new problems to deal with. A key feature of many P2P systems is openness, meaning that everybody is potentially able to join a network with no need for subscription or payment systems. The combination of openness and lack of central control makes it feasible for a user to free-ride, that is to increase its own benefit by using services without allocating resources to satisfy other peers’ requests. One of the main goals when designing a P2P system is therefore to achieve cooperation between users. Given the nature of P2P systems based on simple local interactions of many peers having partial knowledge of the whole system, an interesting way to achieve desired properties on a system scale might consist in obtaining them as emergent properties of the many interactions occurring at local node level. Two methods are typically used to face the problem of cooperation in P2P networks: 1) engineering emergent properties when designing the protocol; 2) study the system as a game and apply Game Theory techniques, especially to find Nash Equilibria in the game and to reach them making the system stable against possible deviant behaviors. In this work we present an evolutionary framework to enforce cooperative behaviour in P2P networks that is alternative to both the methods mentioned above. Our approach is based on an evolutionary algorithm inspired by computational sociology and evolutionary game theory, consisting in having each peer periodically trying to copy another peer which is performing better. The proposed algorithms, called SLAC and SLACER, draw inspiration from tag systems originated in computational sociology, the main idea behind the algorithm consists in having low performance nodes copying high performance ones. The algorithm is run locally by every node and leads to an evolution of the network both from the topology and from the nodes’ strategy point of view. Initial tests with a simple Prisoners’ Dilemma application show how SLAC is able to bring the network to a state of high cooperation independently from the initial network conditions. Interesting results are obtained when studying the effect of cheating nodes on SLAC algorithm. In fact in some cases selfish nodes rationally exploiting the system for their own benefit can actually improve system performance from the cooperation formation point of view. The final step is to apply our results to more realistic scenarios. We put our efforts in studying and improving the BitTorrent protocol. BitTorrent was chosen not only for its popularity but because it has many points in common with SLAC and SLACER algorithms, ranging from the game theoretical inspiration (tit-for-tat-like mechanism) to the swarms topology. We discovered fairness, meant as ratio between uploaded and downloaded data, to be a weakness of the original BitTorrent protocol and we drew inspiration from the knowledge of cooperation formation and maintenance mechanism derived from the development and analysis of SLAC and SLACER, to improve fairness and tackle freeriding and cheating in BitTorrent. We produced an extension of BitTorrent called BitFair that has been evaluated through simulation and has shown the abilities of enforcing fairness and tackling free-riding and cheating nodes.
Resumo:
The wide diffusion of cheap, small, and portable sensors integrated in an unprecedented large variety of devices and the availability of almost ubiquitous Internet connectivity make it possible to collect an unprecedented amount of real time information about the environment we live in. These data streams, if properly and timely analyzed, can be exploited to build new intelligent and pervasive services that have the potential of improving people's quality of life in a variety of cross concerning domains such as entertainment, health-care, or energy management. The large heterogeneity of application domains, however, calls for a middleware-level infrastructure that can effectively support their different quality requirements. In this thesis we study the challenges related to the provisioning of differentiated quality-of-service (QoS) during the processing of data streams produced in pervasive environments. We analyze the trade-offs between guaranteed quality, cost, and scalability in streams distribution and processing by surveying existing state-of-the-art solutions and identifying and exploring their weaknesses. We propose an original model for QoS-centric distributed stream processing in data centers and we present Quasit, its prototype implementation offering a scalable and extensible platform that can be used by researchers to implement and validate novel QoS-enforcement mechanisms. To support our study, we also explore an original class of weaker quality guarantees that can reduce costs when application semantics do not require strict quality enforcement. We validate the effectiveness of this idea in a practical use-case scenario that investigates partial fault-tolerance policies in stream processing by performing a large experimental study on the prototype of our novel LAAR dynamic replication technique. Our modeling, prototyping, and experimental work demonstrates that, by providing data distribution and processing middleware with application-level knowledge of the different quality requirements associated to different pervasive data flows, it is possible to improve system scalability while reducing costs.
Resumo:
Durum wheat is the second most important wheat species worldwide and the most important crop in several Mediterranean countries including Italy. Durum wheat is primarily grown under rainfed conditions where episodes of drought and heat stress are major factors limiting grain yield. The research presented in this thesis aimed at the identification of traits and genes that underlie root system architecture (RSA) and tolerance to heat stress in durum wheat, in order to eventually contribute to the genetic improvement of this species. In the first two experiments we aimed at the identification of QTLs for root trait architecture at the seedling level by studying a bi-parental population of 176 recombinant inbred lines (from the cross Meridiano x Claudio) and a collection of 183 durum elite accessions. Forty-eight novel QTLs for RSA traits were identified in each of the two experiments, by means of linkage- and association mapping-based QTL analysis, respectively. Important QTLs controlling the angle of root growth in the seedling were identified. In a third experiment, we investigated the phenotypic variation of root anatomical traits by means of microscope-based analysis of root cross sections in 10 elite durum cultivars. The results showed the presence of sizeable genetic variation in aerenchyma-related traits, prompting for additional studies aimed at mapping the QTLs governing such variation and to test the role of aerenchyma in the adaptive response to abiotic stresses. In the fourth experiment, an association mapping experiment for cell membrane stability at the seedling stage (as a proxy trait for heat tolerance) was carried out by means of association mapping. A total of 34 QTLs (including five major ones), were detected. Our study provides information on QTLs for root architecture and heat tolerance which could potentially be considered in durum wheat breeding programs.
Resumo:
The 3-UPU three degrees of freedom fully parallel manipulator, where U and P are for universal and prismatic pair respectively, is a very well known manipulator that can provide the platform with three degrees of freedom of pure translation, pure rotation or mixed translation and rotation with respect to the base, according to the relative directions of the revolute pair axes (each universal pair comprises two revolute pairs with intersecting and perpendicular axes). In particular, pure translational parallel 3-UPU manipulators (3-UPU TPMs) received great attention. Many studies have been reported in the literature on singularities, workspace, and joint clearance influence on the platform accuracy of this manipulator. However, much work has still to be done to reveal all the features this topology can offer to the designer when different architecture, i.e. different geometry are considered. Therefore, this dissertation will focus on this type of the 3-UPU manipulators. The first part of the dissertation presents six new architectures of the 3-UPU TPMs which offer interesting features to the designer. In the second part, a procedure is presented which is based on some indexes, in order to allows the designer to select the best architecture of the 3-UPU TPMs for a given task. Four indexes are proposed as stiffness, clearance, singularity and size of the manipulator in order to apply the procedure.
Resumo:
Hybrid technologies, thanks to the convergence of integrated microelectronic devices and new class of microfluidic structures could open new perspectives to the way how nanoscale events are discovered, monitored and controlled. The key point of this thesis is to evaluate the impact of such an approach into applications of ion-channel High Throughput Screening (HTS)platforms. This approach offers promising opportunities for the development of new classes of sensitive, reliable and cheap sensors. There are numerous advantages of embedding microelectronic readout structures strictly coupled to sensing elements. On the one hand the signal-to-noise-ratio is increased as a result of scaling. On the other, the readout miniaturization allows organization of sensors into arrays, increasing the capability of the platform in terms of number of acquired data, as required in the HTS approach, to improve sensing accuracy and reliabiity. However, accurate interface design is required to establish efficient communication between ionic-based and electronic-based signals. The work made in this thesis will show a first example of a complete parallel readout system with single ion channel resolution, using a compact and scalable hybrid architecture suitable to be interfaced to large array of sensors, ensuring simultaneous signal recording and smart control of the signal-to-noise ratio and bandwidth trade off. More specifically, an array of microfluidic polymer structures, hosting artificial lipid bilayers blocks where single ion channel pores are embededed, is coupled with an array of ultra-low noise current amplifiers for signal amplification and data processing. As demonstrating working example, the platform was used to acquire ultra small currents derived by single non-covalent molecular binding between alpha-hemolysin pores and beta-cyclodextrin molecules in artificial lipid membranes.
Resumo:
The term "Brain Imaging" identi�es a set of techniques to analyze the structure and/or functional behavior of the brain in normal and/or pathological situations. These techniques are largely used in the study of brain activity. In addition to clinical usage, analysis of brain activity is gaining popularity in others recent �fields, i.e. Brain Computer Interfaces (BCI) and the study of cognitive processes. In this context, usage of classical solutions (e.g. f MRI, PET-CT) could be unfeasible, due to their low temporal resolution, high cost and limited portability. For these reasons alternative low cost techniques are object of research, typically based on simple recording hardware and on intensive data elaboration process. Typical examples are ElectroEncephaloGraphy (EEG) and Electrical Impedance Tomography (EIT), where electric potential at the patient's scalp is recorded by high impedance electrodes. In EEG potentials are directly generated from neuronal activity, while in EIT by the injection of small currents at the scalp. To retrieve meaningful insights on brain activity from measurements, EIT and EEG relies on detailed knowledge of the underlying electrical properties of the body. This is obtained from numerical models of the electric �field distribution therein. The inhomogeneous and anisotropic electric properties of human tissues make accurate modeling and simulation very challenging, leading to a tradeo�ff between physical accuracy and technical feasibility, which currently severely limits the capabilities of these techniques. Moreover elaboration of data recorded requires usage of regularization techniques computationally intensive, which influences the application with heavy temporal constraints (such as BCI). This work focuses on the parallel implementation of a work-flow for EEG and EIT data processing. The resulting software is accelerated using multi-core GPUs, in order to provide solution in reasonable times and address requirements of real-time BCI systems, without over-simplifying the complexity and accuracy of the head models.
Resumo:
The aim of this work was to show that refined analyses of background, low magnitude seismicity allow to delineate the main active faults and to accurately estimate the directions of the regional tectonic stress that characterize the Southern Apennines (Italy), a structurally complex area with high seismic potential. Thanks the presence in the area of an integrated dense and wide dynamic network, was possible to analyzed an high quality microearthquake data-set consisting of 1312 events that occurred from August 2005 to April 2011 by integrating the data recorded at 42 seismic stations of various networks. The refined seismicity location and focal mechanisms well delineate a system of NW-SE striking normal faults along the Apenninic chain and an approximately E-W oriented, strike-slip fault, transversely cutting the belt. The seismicity along the chain does not occur on a single fault but in a volume, delimited by the faults activated during the 1980 Irpinia M 6.9 earthquake, on sub-parallel predominant normal faults. Results show that the recent low magnitude earthquakes belongs to the background seismicity and they are likely generated along the major fault segments activated during the most recent earthquakes, suggesting that they are still active today thirty years after the mainshock occurrences. In this sense, this study gives a new perspective to the application of the high quality records of low magnitude background seismicity for the identification and characterization of active fault systems. The analysis of the stress tensor inversion provides two equivalent models to explain the microearthquake generation along both the NW-SE striking normal faults and the E- W oriented fault with a dominant dextral strike-slip motion, but having different geological interpretations. We suggest that the NW-SE-striking Africa-Eurasia convergence acts in the background of all these structures, playing a primary and unifying role in the seismotectonics of the whole region.
Resumo:
Mainstream hardware is becoming parallel, heterogeneous, and distributed on every desk, every home and in every pocket. As a consequence, in the last years software is having an epochal turn toward concurrency, distribution, interaction which is pushed by the evolution of hardware architectures and the growing of network availability. This calls for introducing further abstraction layers on top of those provided by classical mainstream programming paradigms, to tackle more effectively the new complexities that developers have to face in everyday programming. A convergence it is recognizable in the mainstream toward the adoption of the actor paradigm as a mean to unite object-oriented programming and concurrency. Nevertheless, we argue that the actor paradigm can only be considered a good starting point to provide a more comprehensive response to such a fundamental and radical change in software development. Accordingly, the main objective of this thesis is to propose Agent-Oriented Programming (AOP) as a high-level general purpose programming paradigm, natural evolution of actors and objects, introducing a further level of human-inspired concepts for programming software systems, meant to simplify the design and programming of concurrent, distributed, reactive/interactive programs. To this end, in the dissertation first we construct the required background by studying the state-of-the-art of both actor-oriented and agent-oriented programming, and then we focus on the engineering of integrated programming technologies for developing agent-based systems in their classical application domains: artificial intelligence and distributed artificial intelligence. Then, we shift the perspective moving from the development of intelligent software systems, toward general purpose software development. Using the expertise maturated during the phase of background construction, we introduce a general-purpose programming language named simpAL, which founds its roots on general principles and practices of software development, and at the same time provides an agent-oriented level of abstraction for the engineering of general purpose software systems.
Resumo:
Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase the data locality. It is therefore programmers’ responsibility to explicitly manage the memory transfers, and this make programming these platform cumbersome. Moreover, complex modern applications must be adequately parallelized before they can the parallel potential of the platform into actual performance. To support this, programming languages were proposed, which work at a high level of abstraction, and rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budget are constrained. This dissertation explores the applicability of the shared-memory paradigm on modern many-core systems, focusing on the ease-of-programming. It focuses on OpenMP, the de-facto standard for shared memory programming. In a first part, the cost of algorithms for synchronization and data partitioning are analyzed, and they are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. In the second part of the thesis, the focus is on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude of speedup and energy efficiency compared to the “pure software” version. However, three main issues rise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which share the memory banks with cores, and the template for a scalable architecture is shown, which integrates them through the shared-memory system. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploiting the accelerators of the platform. The OpenMP frontend is extended to interact with it.
Resumo:
Despite the several issues faced in the past, the evolutionary trend of silicon has kept its constant pace. Today an ever increasing number of cores is integrated onto the same die. Unfortunately, the extraordinary performance achievable by the many-core paradigm is limited by several factors. Memory bandwidth limitation, combined with inefficient synchronization mechanisms, can severely overcome the potential computation capabilities. Moreover, the huge HW/SW design space requires accurate and flexible tools to perform architectural explorations and validation of design choices. In this thesis we focus on the aforementioned aspects: a flexible and accurate Virtual Platform has been developed, targeting a reference many-core architecture. Such tool has been used to perform architectural explorations, focusing on instruction caching architecture and hybrid HW/SW synchronization mechanism. Beside architectural implications, another issue of embedded systems is considered: energy efficiency. Near Threshold Computing is a key research area in the Ultra-Low-Power domain, as it promises a tenfold improvement in energy efficiency compared to super-threshold operation and it mitigates thermal bottlenecks. The physical implications of modern deep sub-micron technology are severely limiting performance and reliability of modern designs. Reliability becomes a major obstacle when operating in NTC, especially memory operation becomes unreliable and can compromise system correctness. In the present work a novel hybrid memory architecture is devised to overcome reliability issues and at the same time improve energy efficiency by means of aggressive voltage scaling when allowed by workload requirements. Variability is another great drawback of near-threshold operation. The greatly increased sensitivity to threshold voltage variations in today a major concern for electronic devices. We introduce a variation-tolerant extension of the baseline many-core architecture. By means of micro-architectural knobs and a lightweight runtime control unit, the baseline architecture becomes dynamically tolerant to variations.
Resumo:
A High-Performance Computing job dispatcher is a critical software that assigns the finite computing resources to submitted jobs. This resource assignment over time is known as the on-line job dispatching problem in HPC systems. The fact the problem is on-line means that solutions must be computed in real-time, and their required time cannot exceed some threshold to do not affect the normal system functioning. In addition, a job dispatcher must deal with a lot of uncertainty: submission times, the number of requested resources, and duration of jobs. Heuristic-based techniques have been broadly used in HPC systems, at the cost of achieving (sub-)optimal solutions in a short time. However, the scheduling and resource allocation components are separated, thus generates a decoupled decision that may cause a performance loss. Optimization-based techniques are less used for this problem, although they can significantly improve the performance of HPC systems at the expense of higher computation time. Nowadays, HPC systems are being used for modern applications, such as big data analytics and predictive model building, that employ, in general, many short jobs. However, this information is unknown at dispatching time, and job dispatchers need to process large numbers of them quickly while ensuring high Quality-of-Service (QoS) levels. Constraint Programming (CP) has been shown to be an effective approach to tackle job dispatching problems. However, state-of-the-art CP-based job dispatchers are unable to satisfy the challenges of on-line dispatching, such as generate dispatching decisions in a brief period and integrate current and past information of the housing system. Given the previous reasons, we propose CP-based dispatchers that are more suitable for HPC systems running modern applications, generating on-line dispatching decisions in a proper time and are able to make effective use of job duration predictions to improve QoS levels, especially for workloads dominated by short jobs.
Resumo:
Embedding intelligence in extreme edge devices allows distilling raw data acquired from sensors into actionable information, directly on IoT end-nodes. This computing paradigm, in which end-nodes no longer depend entirely on the Cloud, offers undeniable benefits, driving a large research area (TinyML) to deploy leading Machine Learning (ML) algorithms on micro-controller class of devices. To fit the limited memory storage capability of these tiny platforms, full-precision Deep Neural Networks (DNNs) are compressed by representing their data down to byte and sub-byte formats, in the integer domain. However, the current generation of micro-controller systems can barely cope with the computing requirements of QNNs. This thesis tackles the challenge from many perspectives, presenting solutions both at software and hardware levels, exploiting parallelism, heterogeneity and software programmability to guarantee high flexibility and high energy-performance proportionality. The first contribution, PULP-NN, is an optimized software computing library for QNN inference on parallel ultra-low-power (PULP) clusters of RISC-V processors, showing one order of magnitude improvements in performance and energy efficiency, compared to current State-of-the-Art (SoA) STM32 micro-controller systems (MCUs) based on ARM Cortex-M cores. The second contribution is XpulpNN, a set of RISC-V domain specific instruction set architecture (ISA) extensions to deal with sub-byte integer arithmetic computation. The solution, including the ISA extensions and the micro-architecture to support them, achieves energy efficiency comparable with dedicated DNN accelerators and surpasses the efficiency of SoA ARM Cortex-M based MCUs, such as the low-end STM32M4 and the high-end STM32H7 devices, by up to three orders of magnitude. To overcome the Von Neumann bottleneck while guaranteeing the highest flexibility, the final contribution integrates an Analog In-Memory Computing accelerator into the PULP cluster, creating a fully programmable heterogeneous fabric that demonstrates end-to-end inference capabilities of SoA MobileNetV2 models, showing two orders of magnitude performance improvements over current SoA analog/digital solutions.