965 resultados para Computer Architecture
Resumo:
IEEE 802.11p is the new standard for inter-vehicular communications (IVC) using the 5.9 GHz frequency band; it is planned to be widely deployed to enable cooperative systems. 802.11p uses and performance have been studied theoretically and in simulations over the past years. Unfortunately, many of these results have not been confirmed by on-tracks experimentation. In this paper, we describe field trials of 802.11p technology with our test vehicles. Metrics such as maximum range, latency and frame loss are examined.
Resumo:
A modular, graphic-oriented Internet browser has been developed to enable non-technical client access to a literal spinning world of information and remotely sensed. The Earth Portal (www.earthportal.net) uses the ManyOne browser (www.manyone.net) to provide engaging point and click views of the Earth fully tessellated with remotely sensed imagery and geospatial data. The ManyOne browser technology use Mozilla with embedded plugins to apply multiple 3-D graphics engines, e.g. ArcGlobe or GeoFusion, that directly link with the open-systems architecture of the geo-spatial infrastructure. This innovation allows for rendering of satellite imagery directly over the Earth's surface and requires no technical training by the web user. Effective use of this global distribution system for the remote sensing community requires a minimal compliance with protocols and standards that have been promoted by NSDI and other open-systems standards organizations.
Resumo:
This paper demonstrates the use of a spreadsheet in exploring non-linear difference equations that describe digital control systems used in radio engineering, communication and computer architecture. These systems, being the focus of intensive studies of mathematicians and engineers over the last 40 years, may exhibit extremely complicated behaviour interpreted in contemporary terms as transition from global asymptotic stability to chaos through period-doubling bifurcations. The authors argue that embedding advanced mathematical ideas in the technological tool enables one to introduce fundamentals of discrete control systems in tertiary curricula without learners having to deal with complex machinery that rigorous mathematical methods of investigation require. In particular, in the appropriately designed spreadsheet environment, one can effectively visualize a qualitative difference in the behviour of systems with different types of non-linear characteristic.
Resumo:
Wi-Fi is a commonly available source of localization information in urban environments but is challenging to integrate into conventional mapping architectures. Current state of the art probabilistic Wi-Fi SLAM algorithms are limited by spatial resolution and an inability to remove the accumulation of rotational error, inherent limitations of the Wi-Fi architecture. In this paper we leverage the low quality sensory requirements and coarse metric properties of RatSLAM to localize using Wi-Fi fingerprints. To further improve performance, we present a novel sensor fusion technique that integrates camera and Wi-Fi to improve localization specificity, and use compass sensor data to remove orientation drift. We evaluate the algorithms in diverse real world indoor and outdoor environments, including an office floor, university campus and a visually aliased circular building loop. The algorithms produce topologically correct maps that are superior to those produced using only a single sensor modality.
Resumo:
Energy efficient embedded computing enables new application scenarios in mobile devices like software-defined radio and video processing. The hierarchical multiprocessor considered in this work may contain dozens or hundreds of resource efficient VLIW CPUs. Programming this number of CPU cores is a complex task requiring compiler support. The stream programming paradigm provides beneficial properties that help to support automatic partitioning. This work describes a compiler for streaming applications targeting the self-build hierarchical CoreVA-MPSoC multiprocessor platform. The compiler is supported by a programming model that is tailored to fit the streaming programming paradigm. We present a novel simulated-annealing (SA) based partitioning algorithm, called Smart SA. The overall speedup of Smart SA is 12.84 for an MPSoC with 16 CPU cores compared to a single CPU implementation. Comparison with a state of the art partitioning algorithm shows an average performance improvement of 34.07%.
Resumo:
Processor architects have a challenging task of evaluating a large design space consisting of several interacting parameters and optimizations. In order to assist architects in making crucial design decisions, we build linear regression models that relate Processor performance to micro-architecture parameters, using simulation based experiments. We obtain good approximate models using an iterative process in which Akaike's information criteria is used to extract a good linear model from a small set of simulations, and limited further simulation is guided by the model using D-optimal experimental designs. The iterative process is repeated until desired error bounds are achieved. We used this procedure to establish the relationship of the CPI performance response to 26 key micro-architectural parameters using a detailed cycle-by-cycle superscalar processor simulator The resulting models provide a significance ordering on all micro-architectural parameters and their interactions, and explain the performance variations of micro-architectural techniques.
Resumo:
Frequent accesses to the register file make it one of the major sources of energy consumption in ILP architectures. The large number of functional units connected to a large unified register file in VLIW architectures make power dissipation in the register file even worse because of the need for a large number of ports. High power dissipation in a relatively smaller area occupied by a register file leads to a high power density in the register file and makes it one of the prime hot-spots. This makes it highly susceptible to the possibility of a catastrophic heatstroke. This in turn impacts the performance and cost because of the need for periodic cool down and sophisticated packaging and cooling techniques respectively. Clustered VLIW architectures partition the register file among clusters of functional units and reduce the number of ports required thereby reducing the power dissipation. However, we observe that the aggregate accesses to register files in clustered VLIW architectures (and associated energy consumption) become very high compared to the centralized VLIW architectures and this can be attributed to a large number of explicit inter-cluster communications. Snooping based clustered VLIW architectures provide very limited but very fast way of inter-cluster communication by allowing some of the functional units to directly read some of the operands from the register file of some of the other clusters. In this paper, we propose instruction scheduling algorithms that exploit the limited snooping capability to reduce the register file energy consumption on an average by 12% and 18% and improve the overall performance by 5% and 11% for a 2-clustered and a 4-clustered machine respectively, over an earlier state-of-the-art clustered scheduling algorithm when evaluated in the context of snooping based clustered VLIW architectures.
Resumo:
Superscalar processors currently have the potential to fetch multiple basic blocks per cycle by employing one of several recently proposed instruction fetch mechanisms. However, this increased fetch bandwidth cannot be exploited unless pipeline stages further downstream correspondingly improve. In particular,register renaming a large number of instructions per cycle is diDcult. A large instruction window, needed to receive multiple basic blocks per cycle, will slow down dependence resolution and instruction issue. This paper addresses these and related issues by proposing (i) partitioning of the instruction window into multiple blocks, each holding a dynamic code sequence; (ii) logical partitioning of the registerjle into a global file and several local jles, the latter holding registers local to a dynamic code sequence; (iii) the dynamic recording and reuse of register renaming information for registers local to a dynamic code sequence. Performance studies show these mechanisms improve performance over traditional superscalar processors by factors ranging from 1.5 to a little over 3 for the SPEC Integer programs. Next, it is observed that several of the loops in the benchmarks display vector-like behavior during execution, even if the static loop bodies are likely complex for compile-time vectorization. A dynamic loop vectorization mechanism that builds on top of the above mechanisms is briefly outlined. The mechanism vectorizes up to 60% of the dynamic instructions for some programs, albeit the average number of iterations per loop is quite small.
Resumo:
Effective sharing of the last level cache has a significant influence on the overall performance of a multicore system. We observe that existing solutions control cache occupancy at a coarser granularity, do not scale well to large core counts and in some cases lack the flexibility to support a variety of performance goals. In this paper, we propose Probabilistic Shared Cache Management (PriSM), a framework to manage the cache occupancy of different cores at cache block granularity by controlling their eviction probabilities. The proposed framework requires only simple hardware changes to implement, can scale to larger core count and is flexible enough to support a variety of performance goals. We demonstrate the flexibility of PriSM, by computing the eviction probabilities needed to achieve goals like hit-maximization, fairness and QOS. PriSM-HitMax improves performance by 18.7% over LRU and 11.8% over previously proposed schemes in a sixteen core machine. PriSM-Fairness improves fairness over existing solutions by 23.3% along with a performance improvement of 19.0%. PriSM-QOS successfully achieves the desired QOS targets.
Resumo:
The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC - Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit analysis based algorithm that utilizes Next-Use information to select the set of PCs that can maximize the hits experienced in DeliWays. Performance evaluation reveals that NUcache improves the performance over a baseline design by 9.6%, 30% and 33% respectively for dual, quad and eight core workloads comprised of SPEC benchmarks. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.
Resumo:
142 p.
Resumo:
XII, 116 p.
Resumo:
Numerical modeling of groundwater is very important for understanding groundwater flow and solving hydrogeological problem. Today, groundwater studies require massive model cells and high calculation accuracy, which are beyond single-CPU computer’s capabilities. With the development of high performance parallel computing technologies, application of parallel computing method on numerical modeling of groundwater flow becomes necessary and important. Using parallel computing can improve the ability to resolve various hydro-geological and environmental problems. In this study, parallel computing method on two main types of modern parallel computer architecture, shared memory parallel systems and distributed shared memory parallel systems, are discussed. OpenMP and MPI (PETSc) are both used to parallelize the most widely used groundwater simulator, MODFLOW. Two parallel solvers, P-PCG and P-MODFLOW, were developed for MODFLOW. The parallelized MODFLOW was used to simulate regional groundwater flow in Beishan, Gansu Province, which is a potential high-level radioactive waste geological disposal area in China. 1. The OpenMP programming paradigm was used to parallelize the PCG (preconditioned conjugate-gradient method) solver, which is one of the main solver for MODFLOW. The parallel PCG solver, P-PCG, is verified using an 8-processor computer. Both the impact of compilers and different model domain sizes were considered in the numerical experiments. The largest test model has 1000 columns, 1000 rows and 1000 layers. Based on the timing results, execution times using the P-PCG solver are typically about 1.40 to 5.31 times faster than those using the serial one. In addition, the simulation results are the exact same as the original PCG solver, because the majority of serial codes were not changed. It is worth noting that this parallelizing approach reduces cost in terms of software maintenance because only a single source PCG solver code needs to be maintained in the MODFLOW source tree. 2. P-MODFLOW, a domain decomposition–based model implemented in a parallel computing environment is developed, which allows efficient simulation of a regional-scale groundwater flow. The basic approach partitions a large model domain into any number of sub-domains. Parallel processors are used to solve the model equations within each sub-domain. The use of domain decomposition method to achieve the MODFLOW program distributed shared memory parallel computing system will process the application of MODFLOW be extended to the fleet of the most popular systems, so that a large-scale simulation could take full advantage of hundreds or even thousands parallel processors. P-MODFLOW has a good parallel performance, with the maximum speedup of 18.32 (14 processors). Super linear speedups have been achieved in the parallel tests, indicating the efficiency and scalability of the code. Parallel program design, load balancing and full use of the PETSc were considered to achieve a highly efficient parallel program. 3. The characterization of regional ground water flow system is very important for high-level radioactive waste geological disposal. The Beishan area, located in northwestern Gansu Province, China, is selected as a potential site for disposal repository. The area includes about 80000 km2 and has complicated hydrogeological conditions, which greatly increase the computational effort of regional ground water flow models. In order to reduce computing time, parallel computing scheme was applied to regional ground water flow modeling. Models with over 10 million cells were used to simulate how the faults and different recharge conditions impact regional ground water flow pattern. The results of this study provide regional ground water flow information for the site characterization of the potential high-level radioactive waste disposal.
Resumo:
On 70~(th) SEG Annual meeting, many author have announced their result on the wave equation prestack depth migration. The methods of the wave-field imaging base on wave equation becomes mature and the main direction of seismic imaging. The direction of imaging the complex media has been the main one of the projects that the national "85" and "95" reservoir geophysics key projects and "Knowledge innovation key project of Chinese Academy of Science" have been supported. Furthermore, we began the study for special oil field situation of our nation with the international research groups. Under the background, the author combined the thoughts of symplectic with wave equation pre-stack depth migration, and develops and efficient wave equation pre-stack depth migration method. The purpose of this work is to find out a way to imaging the complex geological goals of Chinese oilfields and form a procedure of seismic data processing. The paper gives the approximation of one way wave equation operator, and shows the numerical results. The comparisons have been made between split-step phase method, Kirchhoff and Ray+FD methods on the pulse response, simple model and Marmousi model. The results shows that the method in this paper has an higher accuracy. Four field data examples have also be given in this paper. The results of field data demonstrate that the method can be usable. The velocity estimation is an important part of the wave equation pre-stack depth migration. A parallel velocity estimation program has been written and tested on the Beowulf clusters. The program can establish a velocity profile automatically. An example on Marmousi model has shown in the third part of the paper to demonstrate the method. Another field data was also given in the paper. Beowulf cluster is the converge of the high performance computer architecture. Today, Beowulf Cluster is a good choice for institutes and small companies to finish their task. The paper gives some comparison results the computation of the wave equation pre-stack migration on Beowulf cluster, IBM-SP2 (24 nodes) in Daqing and Shuguang 3000, and the comparison of their prize. The results show that the Beowulf cluster is an efficient way to finish the large amount computation of the wave equation pre-stack depth migration, especially for 3D.