781 resultados para big data storage
Resumo:
A computer-controlled laser writing system for optical integrated circuits and data storage is described. The system is characterized by holographic (649F) and high-resolution plates. A minimum linewidth of 2.5 mum is obtained by controlling the system parameters. We show that this system can also be used for data storage applications.
Resumo:
In big data image/video analytics, we encounter the problem of learning an over-complete dictionary for sparse representation from a large training dataset, which cannot be processed at once because of storage and computational constraints. To tackle the problem of dictionary learning in such scenarios, we propose an algorithm that exploits the inherent clustered structure of the training data and make use of a divide-and-conquer approach. The fundamental idea behind the algorithm is to partition the training dataset into smaller clusters, and learn local dictionaries for each cluster. Subsequently, the local dictionaries are merged to form a global dictionary. Merging is done by solving another dictionary learning problem on the atoms of the locally trained dictionaries. This algorithm is referred to as the split-and-merge algorithm. We show that the proposed algorithm is efficient in its usage of memory and computational complexity, and performs on par with the standard learning strategy, which operates on the entire data at a time. As an application, we consider the problem of image denoising. We present a comparative analysis of our algorithm with the standard learning techniques that use the entire database at a time, in terms of training and denoising performance. We observe that the split-and-merge algorithm results in a remarkable reduction of training time, without significantly affecting the denoising performance.
Resumo:
Storage systems are widely used and have played a crucial rule in both consumer and industrial products, for example, personal computers, data centers, and embedded systems. However, such system suffers from issues of cost, restricted-lifetime, and reliability with the emergence of new systems and devices, such as distributed storage and flash memory, respectively. Information theory, on the other hand, provides fundamental bounds and solutions to fully utilize resources such as data density, information I/O and network bandwidth. This thesis bridges these two topics, and proposes to solve challenges in data storage using a variety of coding techniques, so that storage becomes faster, more affordable, and more reliable.
We consider the system level and study the integration of RAID schemes and distributed storage. Erasure-correcting codes are the basis of the ubiquitous RAID schemes for storage systems, where disks correspond to symbols in the code and are located in a (distributed) network. Specifically, RAID schemes are based on MDS (maximum distance separable) array codes that enable optimal storage and efficient encoding and decoding algorithms. With r redundancy symbols an MDS code can sustain r erasures. For example, consider an MDS code that can correct two erasures. It is clear that when two symbols are erased, one needs to access and transmit all the remaining information to rebuild the erasures. However, an interesting and practical question is: What is the smallest fraction of information that one needs to access and transmit in order to correct a single erasure? In Part I we will show that the lower bound of 1/2 is achievable and that the result can be generalized to codes with arbitrary number of parities and optimal rebuilding.
We consider the device level and study coding and modulation techniques for emerging non-volatile memories such as flash memory. In particular, rank modulation is a novel data representation scheme proposed by Jiang et al. for multi-level flash memory cells, in which a set of n cells stores information in the permutation induced by the different charge levels of the individual cells. It eliminates the need for discrete cell levels, as well as overshoot errors, when programming cells. In order to decrease the decoding complexity, we propose two variations of this scheme in Part II: bounded rank modulation where only small sliding windows of cells are sorted to generated permutations, and partial rank modulation where only part of the n cells are used to represent data. We study limits on the capacity of bounded rank modulation and propose encoding and decoding algorithms. We show that overlaps between windows will increase capacity. We present Gray codes spanning all possible partial-rank states and using only ``push-to-the-top'' operations. These Gray codes turn out to solve an open combinatorial problem called universal cycle, which is a sequence of integers generating all possible partial permutations.
Resumo:
The work presented in this thesis revolves around erasure correction coding, as applied to distributed data storage and real-time streaming communications.
First, we examine the problem of allocating a given storage budget over a set of nodes for maximum reliability. The objective is to find an allocation of the budget that maximizes the probability of successful recovery by a data collector accessing a random subset of the nodes. This optimization problem is challenging in general because of its combinatorial nature, despite its simple formulation. We study several variations of the problem, assuming different allocation models and access models, and determine the optimal allocation and the optimal symmetric allocation (in which all nonempty nodes store the same amount of data) for a variety of cases. Although the optimal allocation can have nonintuitive structure and can be difficult to find in general, our results suggest that, as a simple heuristic, reliable storage can be achieved by spreading the budget maximally over all nodes when the budget is large, and spreading it minimally over a few nodes when it is small. Coding would therefore be beneficial in the former case, while uncoded replication would suffice in the latter case.
Second, we study how distributed storage allocations affect the recovery delay in a mobile setting. Specifically, two recovery delay optimization problems are considered for a network of mobile storage nodes: the maximization of the probability of successful recovery by a given deadline, and the minimization of the expected recovery delay. We show that the first problem is closely related to the earlier allocation problem, and solve the second problem completely for the case of symmetric allocations. It turns out that the optimal allocations for the two problems can be quite different. In a simulation study, we evaluated the performance of a simple data dissemination and storage protocol for mobile delay-tolerant networks, and observed that the choice of allocation can have a significant impact on the recovery delay under a variety of scenarios.
Third, we consider a real-time streaming system where messages created at regular time intervals at a source are encoded for transmission to a receiver over a packet erasure link; the receiver must subsequently decode each message within a given delay from its creation time. For erasure models containing a limited number of erasures per coding window, per sliding window, and containing erasure bursts whose maximum length is sufficiently short or long, we show that a time-invariant intrasession code asymptotically achieves the maximum message size among all codes that allow decoding under all admissible erasure patterns. For the bursty erasure model, we also show that diagonally interleaved codes derived from specific systematic block codes are asymptotically optimal over all codes in certain cases. We also study an i.i.d. erasure model in which each transmitted packet is erased independently with the same probability; the objective is to maximize the decoding probability for a given message size. We derive an upper bound on the decoding probability for any time-invariant code, and show that the gap between this bound and the performance of a family of time-invariant intrasession codes is small when the message size and packet erasure probability are small. In a simulation study, these codes performed well against a family of random time-invariant convolutional codes under a number of scenarios.
Finally, we consider the joint problems of routing and caching for named data networking. We propose a backpressure-based policy that employs virtual interest packets to make routing and caching decisions. In a packet-level simulation, the proposed policy outperformed a basic protocol that combines shortest-path routing with least-recently-used (LRU) cache replacement.
Resumo:
Electron microscopy (EM) has advanced in an exponential way since the first transmission electron microscope (TEM) was built in the 1930’s. The urge to ‘see’ things is an essential part of human nature (talk of ‘seeing is believing’) and apart from scanning tunnel microscopes which give information about the surface, EM is the only imaging technology capable of really visualising atomic structures in depth down to single atoms. With the development of nanotechnology the demand to image and analyse small things has become even greater and electron microscopes have found their way from highly delicate and sophisticated research grade instruments to key-turn and even bench-top instruments for everyday use in every materials research lab on the planet. The semiconductor industry is as dependent on the use of EM as life sciences and pharmaceutical industry. With this generalisation of use for imaging, the need to deploy advanced uses of EM has become more and more apparent. The combination of several coinciding beams (electron, ion and even light) to create DualBeam or TripleBeam instruments for instance enhances the usefulness from pure imaging to manipulating on the nanoscale. And when it comes to the analytic power of EM with the many ways the highly energetic electrons and ions interact with the matter in the specimen there is a plethora of niches which evolved during the last two decades, specialising in every kind of analysis that can be thought of and combined with EM. In the course of this study the emphasis was placed on the application of these advanced analytical EM techniques in the context of multiscale and multimodal microscopy – multiscale meaning across length scales from micrometres or larger to nanometres, multimodal meaning numerous techniques applied to the same sample volume in a correlative manner. In order to demonstrate the breadth and potential of the multiscale and multimodal concept an integration of it was attempted in two areas: I) Biocompatible materials using polycrystalline stainless steel and II) Semiconductors using thin multiferroic films. I) The motivation to use stainless steel (316L medical grade) comes from the potential modulation of endothelial cell growth which can have a big impact on the improvement of cardio-vascular stents – which are mainly made of 316L – through nano-texturing of the stent surface by focused ion beam (FIB) lithography. Patterning with FIB has never been reported before in connection with stents and cell growth and in order to gain a better understanding of the beam-substrate interaction during patterning a correlative microscopy approach was used to illuminate the patterning process from many possible angles. Electron backscattering diffraction (EBSD) was used to analyse the crystallographic structure, FIB was used for the patterning and simultaneously visualising the crystal structure as part of the monitoring process, scanning electron microscopy (SEM) and atomic force microscopy (AFM) were employed to analyse the topography and the final step being 3D visualisation through serial FIB/SEM sectioning. II) The motivation for the use of thin multiferroic films stems from the ever-growing demand for increased data storage at lesser and lesser energy consumption. The Aurivillius phase material used in this study has a high potential in this area. Yet it is necessary to show clearly that the film is really multiferroic and no second phase inclusions are present even at very low concentrations – ~0.1vol% could already be problematic. Thus, in this study a technique was developed to analyse ultra-low density inclusions in thin multiferroic films down to concentrations of 0.01%. The goal achieved was a complete structural and compositional analysis of the films which required identification of second phase inclusions (through elemental analysis EDX(Energy Dispersive X-ray)), localise them (employing 72 hour EDX mapping in the SEM), isolate them for the TEM (using FIB) and give an upper confidence limit of 99.5% to the influence of the inclusions on the magnetic behaviour of the main phase (statistical analysis).
Resumo:
It is estimated that the quantity of digital data being transferred, processed or stored at any one time currently stands at 4.4 zettabytes (4.4 × 2 70 bytes) and this figure is expected to have grown by a factor of 10 to 44 zettabytes by 2020. Exploiting this data is, and will remain, a significant challenge. At present there is the capacity to store 33% of digital data in existence at any one time; by 2020 this capacity is expected to fall to 15%. These statistics suggest that, in the era of Big Data, the identification of important, exploitable data will need to be done in a timely manner. Systems for the monitoring and analysis of data, e.g. stock markets, smart grids and sensor networks, can be made up of massive numbers of individual components. These components can be geographically distributed yet may interact with one another via continuous data streams, which in turn may affect the state of the sender or receiver. This introduces a dynamic causality, which further complicates the overall system by introducing a temporal constraint that is difficult to accommodate. Practical approaches to realising the system described above have led to a multiplicity of analysis techniques, each of which concentrates on specific characteristics of the system being analysed and treats these characteristics as the dominant component affecting the results being sought. The multiplicity of analysis techniques introduces another layer of heterogeneity, that is heterogeneity of approach, partitioning the field to the extent that results from one domain are difficult to exploit in another. The question is asked can a generic solution for the monitoring and analysis of data that: accommodates temporal constraints; bridges the gap between expert knowledge and raw data; and enables data to be effectively interpreted and exploited in a transparent manner, be identified? The approach proposed in this dissertation acquires, analyses and processes data in a manner that is free of the constraints of any particular analysis technique, while at the same time facilitating these techniques where appropriate. Constraints are applied by defining a workflow based on the production, interpretation and consumption of data. This supports the application of different analysis techniques on the same raw data without the danger of incorporating hidden bias that may exist. To illustrate and to realise this approach a software platform has been created that allows for the transparent analysis of data, combining analysis techniques with a maintainable record of provenance so that independent third party analysis can be applied to verify any derived conclusions. In order to demonstrate these concepts, a complex real world example involving the near real-time capturing and analysis of neurophysiological data from a neonatal intensive care unit (NICU) was chosen. A system was engineered to gather raw data, analyse that data using different analysis techniques, uncover information, incorporate that information into the system and curate the evolution of the discovered knowledge. The application domain was chosen for three reasons: firstly because it is complex and no comprehensive solution exists; secondly, it requires tight interaction with domain experts, thus requiring the handling of subjective knowledge and inference; and thirdly, given the dearth of neurophysiologists, there is a real world need to provide a solution for this domain
Resumo:
This short position paper considers issues in developing Data Architecture for the Internet of Things (IoT) through the medium of an exemplar project, Domain Expertise Capture in Authoring and Development Environments (DECADE). A brief discussion sets the background for IoT, and the development of the distinction between things and computers. The paper makes a strong argument to avoid reinvention of the wheel, and to reuse approaches to distributed heterogeneous data architectures and the lessons learned from that work, and apply them to this situation. DECADE requires an autonomous recording system, local data storage, semi-autonomous verification model, sign-off mechanism, qualitative and quantitative analysis carried out when and where required through web-service architecture, based on ontology and analytic agents, with a self-maintaining ontology model. To develop this, we describe a web-service architecture, combining a distributed data warehouse, web services for analysis agents, ontology agents and a verification engine, with a centrally verified outcome database maintained by certifying body for qualification/professional status.
Resumo:
The member states of the European Union are faced with the challenges of handling “big data” as well as with a growing impact of the supranational level. Given that the success of efforts at European level strongly depends on corresponding national and local activities, i.e., the quality of implementation and the degree of consistency, this chapter centers upon the coherence of European strategies and national implementations concerning the reuse of public sector information. Taking the City of Vienna’s open data activities as an illustrative example, we seek an answer to the question whether and to what extent developments at European level and other factors have an effect on local efforts towards open data. We find that the European Commission’s ambitions are driven by a strong economic argumentation, while the efforts of the City of Vienna have only very little to do with the European orientation and are rather dominated by lifestyle and administrative reform arguments. Hence, we observe a decoupling of supranational strategies and national implementation activities. The very reluctant attitude at Austrian federal level might be one reason for this, nationally induced barriers—such as the administrative culture—might be another. In order to enhance the correspondence between the strategies of the supranational level and those of the implementers at national and regional levels, the strengthening of soft law measures could be promising.
Resumo:
This special issue provides the latest research and development on wireless mobile wearable communications. According to a report by Juniper Research, the market value of connected wearable devices is expected to reach $1.5 billion by 2014, and the shipment of wearable devices may reach 70 million by 2017. Good examples of wearable devices are the prominent Google Glass and Microsoft HoloLens. As wearable technology is rapidly penetrating our daily life, mobile wearable communication is becoming a new communication paradigm. Mobile wearable device communications create new challenges compared to ordinary sensor networks and short-range communication. In mobile wearable communications, devices communicate with each other in a peer-to-peer fashion or client-server fashion and also communicate with aggregation points (e.g., smartphones, tablets, and gateway nodes). Wearable devices are expected to integrate multiple radio technologies for various applications' needs with small power consumption and low transmission delays. These devices can hence collect, interpret, transmit, and exchange data among supporting components, other wearable devices, and the Internet. Such data are not limited to people's personal biomedical information but also include human-centric social and contextual data. The success of mobile wearable technology depends on communication and networking architectures that support efficient and secure end-to-end information flows. A key design consideration of future wearable devices is the ability to ubiquitously connect to smartphones or the Internet with very low energy consumption. Radio propagation and, accordingly, channel models are also different from those in other existing wireless technologies. A huge number of connected wearable devices require novel big data processing algorithms, efficient storage solutions, cloud-assisted infrastructures, and spectrum-efficient communications technologies.
Resumo:
Emerging web applications like cloud computing, Big Data and social networks have created the need for powerful centres hosting hundreds of thousands of servers. Currently, the data centres are based on general purpose processors that provide high flexibility buts lack the energy efficiency of customized accelerators. VINEYARD aims to develop an integrated platform for energy-efficient data centres based on new servers with novel, coarse-grain and fine-grain, programmable hardware accelerators. It will, also, build a high-level programming framework for allowing end-users to seamlessly utilize these accelerators in heterogeneous computing systems by employing typical data-centre programming frameworks (e.g. MapReduce, Storm, Spark, etc.). This programming framework will, further, allow the hardware accelerators to be swapped in and out of the heterogeneous infrastructure so as to offer high flexibility and energy efficiency. VINEYARD will foster the expansion of the soft-IP core industry, currently limited in the embedded systems, to the data-centre market. VINEYARD plans to demonstrate the advantages of its approach in three real use-cases (a) a bio-informatics application for high-accuracy brain modeling, (b) two critical financial applications, and (c) a big-data analysis application.
Resumo:
This work is dedicated to comparison of open source as well as proprietary transport protocols for highspeed data transmission via IP networks. The contemporary common TCP needs significant improvement since it was developed as general-purpose transport protocol and firstly introduced four decades ago. In nowadays networks, TCP fits not all communication needs that society has. Caused of it another transport protocols have been developed and successfully used for e.g. Big Data movement. In scope of this research the following protocols have been investigated for its efficiency on 10Gbps links: UDT, RBUDP, MTP and RWTP. The protocols were tested under different impairments such as Round Trip Time up to 400 ms and packet losses up to 2%. Investigated parameters are the data rate under different conditions of the network, the CPU load by sender andreceiver during the experiments, size of feedback data, CPU usage per Gbps and the amount of feedback data per GiByte of effectively transmitted data. The best performance and fair resources consumption was observed by RWTP. From the opensource projects, the best behavior is showed by RBUDP.
Resumo:
Remote Data acquisition and analysing systems developed for fisheries and related environmental studies have been reported. It consists of three units. The first one namely multichannel remote data acquisition system is installed at the remote place powered by a rechargeable battery. It acquires and stores the 16 channel environmental data on a battery backed up RAM. The second unit called the Field data analyser is used for insitue display and analysis of the data stored in the backed up RAM. The third unit namely Laboratory data analyser is an IBM compatible PC based unit for detailed analysis and interpretation of the data after bringing the RAM unit to the laboratory. The data collected using the system has been analysed and presented in the form of a graph. The system timer operated at negligibly low current, switches on the power to the entire remote operated system at prefixed time interval of 2 hours.Data storage at remote site on low power battery backedupRAM and retrieval and analysis of data using PC are the special i ty of the system. The remote operated system takes about 7 seconds including the 5 second stabilization time to acquire and store data and is very ideal for remote operation on rechargeable bat tery. The system can store 16 channel data scanned at 2 hour interval for 10 days on 2K backed up RAM with memory expansion facility for 8K RAM.
Resumo:
This is a research discussion about the Hampshire Hub - see http://protohub.net/. The aim is to find out more about the project, and discuss future collaboration and sharing of ideas. Mark Braggins (Hampshire Hub Partnership) will introduce the Hampshire Hub programme, setting out its main objectives, work done to-date, next steps including the Hampshire data store (which will use the PublishMyData linked data platform), and opportunities for University of Southampton to engage with the programme , including the forthcoming Hampshire Hackathons Bill Roberts (Swirrl) will give an overview of the PublishMyData platform, and how it will help deliver the objectives of the Hampshire Hub. He will detail some of the new functionality being added to the platform Steve Peters (DCLG Open Data Communities) will focus on developing a web of data that blends and combines local and national data sources around localities, and common topics/themes. This will include observations on the potential employing emerging new, big data sources to help deliver more effective, better targeted public services. Steve will illustrate this with practical examples of DCLG’s work to publish its own data in a SPARQL end-point, so that it can be used over the web alongside related 3rd party sources. He will share examples of some of the practical challenges, particularly around querying and re-using geographic LinkedData in a federated world of SPARQL end-point.