166 resultados para execution traces
Resumo:
Virtualization is one of the key enabling technologies for Cloud computing. Although it facilitates improved utilization of resources, virtualization can lead to performance degradation due to the sharing of physical resources like CPU, memory, network interfaces, disk controllers, etc. Multi-tenancy can cause highly unpredictable performance for concurrent I/O applications running inside virtual machines that share local disk storage in Cloud. Disk I/O requests in a typical Cloud setup may have varied requirements in terms of latency and throughput as they arise from a range of heterogeneous applications having diverse performance goals. This necessitates providing differential performance services to different I/O applications. In this paper, we present PriDyn, a novel scheduling framework which is designed to consider I/O performance metrics of applications such as acceptable latency and convert them to an appropriate priority value for disk access based on the current system state. This framework aims to provide differentiated I/O service to various applications and ensures predictable performance for critical applications in multi-tenant Cloud environment. We demonstrate through experimental validations on real world I/O traces that this framework achieves appreciable enhancements in I/O performance, indicating that this approach is a promising step towards enabling QoS guarantees on Cloud storage.
Resumo:
The correctness of a hard real-time system depends its ability to meet all its deadlines. Existing real-time systems use either a pure real-time scheduler or a real-time scheduler embedded as a real-time scheduling class in the scheduler of an operating system (OS). Existing implementations of schedulers in multicore systems that support real-time and non-real-time tasks, permit the execution of non-real-time tasks in all the cores with priorities lower than those of real-time tasks, but interrupts and softirqs associated with these non-real-time tasks can execute in any core with priorities higher than those of real-time tasks. As a result, the execution overhead of real-time tasks is quite large in these systems, which, in turn, affects their runtime. In order that the hard real-time tasks can be executed in such systems with minimal interference from other Linux tasks, we propose, in this paper, an integrated scheduler architecture, called SchedISA, which aims to considerably reduce the execution overhead of real-time tasks in these systems. In order to test the efficacy of the proposed scheduler, we implemented partitioned earliest deadline first (P-EDF) scheduling algorithm in SchedISA on Linux kernel, version 3.8, and conducted experiments on Intel core i7 processor with eight logical cores. We compared the execution overhead of real-time tasks in the above implementation of SchedISA with that in SCHED_DEADLINE's P-EDF implementation, which concurrently executes real-time and non-real-time tasks in Linux OS in all the cores. The experimental results show that the execution overhead of real-time tasks in the above implementation of SchedISA is considerably less than that in SCHED_DEADLINE. We believe that, with further refinement of SchedISA, the execution overhead of real-time tasks in SchedISA can be reduced to a predictable maximum, making it suitable for scheduling hard real-time tasks without affecting the CPU share of Linux tasks.
Resumo:
A block-structured adaptive mesh refinement (AMR) technique has been used to obtain numerical solutions for many scientific applications. Some block-structured AMR approaches have focused on forming patches of non-uniform sizes where the size of a patch can be tuned to the geometry of a region of interest. In this paper, we develop strategies for adaptive execution of block-structured AMR applications on GPUs, for hyperbolic directionally split solvers. While effective hybrid execution strategies exist for applications with uniform patches, our work considers efficient execution of non-uniform patches with different workloads. Our techniques include bin-packing work units to load balance GPU computations, adaptive asynchronism between CPU and GPU executions using a knapsack formulation, and scheduling communications for multi-GPU executions. Our experiments with synthetic and real data, for single-GPU and multi-GPU executions, on Tesla S1070 and Fermi C2070 clusters, show that our strategies result in up to a 3.23 speedup in performance over existing strategies.
Resumo:
Cache analysis plays a very important role in obtaining precise Worst Case Execution Time (WCET) estimates of programs for real-time systems. While Abstract Interpretation based approaches are almost universally used for cache analysis, they fail to take advantage of its unique requirement: it is not necessary to find the guaranteed cache behavior that holds across all executions of a program. We only need the cache behavior along one particular program path, which is the path with the maximum execution time. In this work, we introduce the concept of cache miss paths, which allows us to use the worst-case path information to improve the precision of AI-based cache analysis. We use Abstract Interpretation to determine the cache miss paths, and then integrate them in the IPET formulation. An added advantage is that this further allows us to use infeasible path information for cache analysis. Experimentally, our approach gives more precise WCETs as compared to AI-based cache analysis, and we also provide techniques to trade-off analysis time with precision to provide scalability.
Resumo:
Prediction of queue waiting times of jobs submitted to production parallel batch systems is important to provide overall estimates to users and can also help meta-schedulers make scheduling decisions. In this work, we have developed a framework for predicting ranges of queue waiting times for jobs by employing multi-class classification of similar jobs in history. Our hierarchical prediction strategy first predicts the point wait time of a job using dynamic k-Nearest Neighbor (kNN) method. It then performs a multi-class classification using Support Vector Machines (SVMs) among all the classes of the jobs. The probabilities given by the SVM for the class predicted using k-NN and its neighboring classes are used to provide a set of ranges of predicted wait times with probabilities. We have used these predictions and probabilities in a meta-scheduling strategy that distributes jobs to different queues/sites in a multi-queue/grid environment for minimizing wait times of the jobs. Experiments with different production supercomputer job traces show that our prediction strategies can give correct predictions for about 77-87% of the jobs, and also result in about 12% improved accuracy when compared to the next best existing method. Experiments with our meta-scheduling strategy using different production and synthetic job traces for various system sizes, partitioning schemes and different workloads, show that the meta-scheduling strategy gives much improved performance when compared to existing scheduling policies by reducing the overall average queue waiting times of the jobs by about 47%.
Resumo:
Branch divergence is a very commonly occurring performance problem in GPGPU in which the execution of diverging branches is serialized to execute only one control flow path at a time. Existing hardware mechanism to reconverge threads using a stack causes duplicate execution of code for unstructured control flow graphs. Also the stack mechanism cannot effectively utilize the available parallelism among diverging branches. Further, the amount of nested divergence allowed is also limited by depth of the branch divergence stack. In this paper we propose a simple and elegant transformation to handle all of the above mentioned problems. The transformation converts an unstructured CFG to a structured CFG without duplicating user code. It incurs only a linear increase in the number of basic blocks and also the number of instructions. Our solution linearizes the CFG using a predicate variable. This mechanism reconverges the divergent threads as early as possible. It also reduces the depth of the reconvergence stack. The available parallelism in nested branches can be effectively extracted by scheduling the basic blocks to reduce the effect of stalls due to memory accesses. It can also increase execution efficiency of nested loops with different trip counts for different threads. We implemented the proposed transformation at PTX level using the Ocelot compiler infrastructure. We evaluated the technique using various benchmarks to show that it can be effective in handling the performance problem due to divergence in unstructured CFGs.
Resumo:
We report the origin of room temperature weak ferromagnetic behavior of polycrystalline Pb(Fe2/3W1/3)O-3 (PFW) powder. The structure and magnetic properties of the ceramic powder prepared by a Columbite method were characterized by X-ray and neutron diffraction, Mossbauer spectroscopy and magnetization measurements. Rietveld analysis of diffraction data confirm the formation of single phase PFW, without traces of any parasitic pyrochlore phase. PFW was found to crystallize in the cubic structure at room temperature. The Rietveld refinement of neutron diffraction data measured at room temperature confirmed the G-type antiferromagnetic structure of PFW in our sample. However, along with the antiferromagnetic (AFM) ordering of the Fe spins, we have observed the existence of weak ferromagnetism at room temperature through: (i) a clear opening of hysteresis (M-H) loop, (ii) bifurcation of the field cooled and zero-field cooled susceptibility; supported by Mossbauer spectroscopy results. The P-E loop measurements showed a non-linear slim hysteresis loop at room temperature due to the electronic conduction through the local inhomogeneities in the PFW crystallites and the inter-particle regions. By corroborating all the magnetic measurements, especially the spin glass nature of the sample, with the conduction behavior of the sample, we report here that the observed ferromagnetism originates at these local inhomogeneous regions in the sample, where the Fe-spins are not perfectly aligned antiferromagnetically due to the compositional disordering. (C) 2015 Elsevier Ltd and Techna Group S.r.l. All rights reserved.
Resumo:
The growing number of applications and processing units in modern Multiprocessor Systems-on-Chips (MPSoCs) come along with reduced time to market. Different IP cores can come from different vendors, and their trust levels are also different, but typically they use Network-on-Chip (NoC) as their communication infrastructure. An MPSoC can have multiple Trusted Execution Environments (TEEs). Apart from performance, power, and area research in the field of MPSoC, robust and secure system design is also gaining importance in the research community. To build a secure system, the designer must know beforehand all kinds of attack possibilities for the respective system (MPSoC). In this paper we survey the possible attack scenarios on present-day MPSoCs and investigate a new attack scenario, i.e., router attack targeted toward NoC architecture. We show the validity of this attack by analyzing different present-day NoC architectures and show that they are all vulnerable to this type of attack. By launching a router attack, an attacker can control the whole chip very easily, which makes it a very serious issue. Both routing tables and routing logic-based routers are vulnerable to such attacks. In this paper, we address attacks on routing tables. We propose different monitoring-based countermeasures against routing table-based router attack in an MPSoC having multiple TEEs. Synthesis results show that proposed countermeasures, viz. Runtime-monitor, Restart-monitor, Intermediate manager, and Auditor, occupy areas that are 26.6, 22, 0.2, and 12.2 % of a routing table-based router area. Apart from these, we propose Ejection address checker and Local monitoring module inside a router that cause 3.4 and 10.6 % increase of a router area, respectively. Simulation results are also given, which shows effectiveness of proposed monitoring-based countermeasures.
Resumo:
It has been previously reported that the addition of boron to Ti-6Al-4V results in significant refinement of the as-cast microstructure and enhancement in the strain hardening. However, the mechanism for the latter effect has not been adequately studied. The aim of this study was to understand the reasons for the enhancement in room temperature strain hardening on addition of boron to as cast Ti-6Al-4V alloy. A study was conducted on slip transmission using SEM, TEM, optical profilometry and four point probe resistivity measurements on un-deformed and deformed samples of Ti-6Al-4V-xB with five levels of boron. Optical profilometry was used to quantify the magnitude of offsets on slip traces which in turn provided information about the extent of planar or multiple slip. Studies on deformed samples reveal that while lath boundaries appear to easily permit dislocation slip transmission, colony boundaries are potent barriers to slip. From TEM studies it was also observed that while alloys containing lower boron underwent planar slip, deformation was more homogeneous in higher boron alloys due to multiple slip resulting from large number of colony boundaries. Multiple slip is also proposed to be the prime cause of the enhanced strain hardening.
Resumo:
Scalable stream processing and continuous dataflow systems are gaining traction with the rise of big data due to the need for processing high velocity data in near real time. Unlike batch processing systems such as MapReduce and workflows, static scheduling strategies fall short for continuous dataflows due to the variations in the input data rates and the need for sustained throughput. The elastic resource provisioning of cloud infrastructure is valuable to meet the changing resource needs of such continuous applications. However, multi-tenant cloud resources introduce yet another dimension of performance variability that impacts the application's throughput. In this paper we propose PLAStiCC, an adaptive scheduling algorithm that balances resource cost and application throughput using a prediction-based lookahead approach. It not only addresses variations in the input data rates but also the underlying cloud infrastructure. In addition, we also propose several simpler static scheduling heuristics that operate in the absence of accurate performance prediction model. These static and adaptive heuristics are evaluated through extensive simulations using performance traces obtained from Amazon AWS IaaS public cloud. Our results show an improvement of up to 20% in the overall profit as compared to the reactive adaptation algorithm.
Resumo:
In this paper we present HyperCell as a reconfigurable datapath for Instruction Extensions (IEs). HyperCell comprises an array of compute units laid over a switch network. We present an IE synthesis methodology that enables post-silicon realization of IE datapaths on HyperCell. The synthesis methodology optimally exploits hardware resources in HyperCell to enable software pipelined execution of IEs. Exploitation of temporal reuse of data in HyperCell results in significant reduction of input/output bandwidth requirements of HyperCell.
Resumo:
This paper proposes a probabilistic prediction based approach for providing Quality of Service (QoS) to delay sensitive traffic for Internet of Things (IoT). A joint packet scheduling and dynamic bandwidth allocation scheme is proposed to provide service differentiation and preferential treatment to delay sensitive traffic. The scheduler focuses on reducing the waiting time of high priority delay sensitive services in the queue and simultaneously keeping the waiting time of other services within tolerable limits. The scheme uses the difference in probability of average queue length of high priority packets at previous cycle and current cycle to determine the probability of average weight required in the current cycle. This offers optimized bandwidth allocation to all the services by avoiding distribution of excess resources for high priority services and yet guaranteeing the services for it. The performance of the algorithm is investigated using MPEG-4 traffic traces under different system loading. The results show the improved performance with respect to waiting time for scheduling high priority packets and simultaneously keeping tolerable limits for waiting time and packet loss for other services. Crown Copyright (C) 2015 Published by Elsevier B.V.
Resumo:
Clock synchronization in a wireless sensor network (WSN) is quite essential as it provides a consistent and a coherent time frame for all the nodes across the network. Typically, clock synchronization is achieved by message passing using a contention-based scheme for media access, like carrier sense multiple access (CSMA). The nodes try to synchronize with each other, by sending synchronization request messages. If many nodes try to send messages simultaneously, contention-based schemes cannot efficiently avoid collisions. In such a situation, there are chances of collisions, and hence, message losses, which, in turn, affects the convergence of the synchronization algorithms. However, the number of collisions can be reduced with a frame based approach like time division multiple access (TDMA) for message passing. In this paper, we propose a design to utilize TDMA-based media access and control (MAC) protocol for the performance improvement of clock synchronization protocols. The basic idea is to use TDMA-based transmissions when the degree of synchronization improves among the sensor nodes during the execution of the clock synchronization algorithm. The design significantly reduces the collisions among the synchronization protocol messages. We have simulated the proposed protocol in Castalia network simulator. The simulation results show that the proposed protocol significantly reduces the time required for synchronization and also improves the accuracy of the synchronization algorithm.
Resumo:
Affine transformations have proven to be very powerful for loop restructuring due to their ability to model a very wide range of transformations. A single multi-dimensional affine function can represent a long and complex sequence of simpler transformations. Existing affine transformation frameworks like the Pluto algorithm, that include a cost function for modern multicore architectures where coarse-grained parallelism and locality are crucial, consider only a sub-space of transformations to avoid a combinatorial explosion in finding the transformations. The ensuing practical tradeoffs lead to the exclusion of certain useful transformations, in particular, transformation compositions involving loop reversals and loop skewing by negative factors. In this paper, we propose an approach to address this limitation by modeling a much larger space of affine transformations in conjunction with the Pluto algorithm's cost function. We perform an experimental evaluation of both, the effect on compilation time, and performance of generated codes. The evaluation shows that our new framework, Pluto+, provides no degradation in performance in any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes with periodic boundary conditions, it provides a mean speedup of 1.33x over Pluto. We also show that Pluto+ does not increase compile times significantly. Experimental results on Polybench show that Pluto+ increases overall polyhedral source-to-source optimization time only by 15%. In cases where it improves execution time significantly, it increased polyhedral optimization time only by 2.04x.
Resumo:
Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy-even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.