818 results for big data


Relevance: 70.00%

Abstract:

Distributed systems are widely used for solving large-scale and data-intensive computing problems, including all-to-all comparison (ATAC) problems. However, when used for ATAC problems, existing computational frameworks such as Hadoop focus on load balancing for allocating comparison tasks, without careful consideration of data distribution and storage usage. While Hadoop-based solutions provide users with simplicity of implementation, their inherent MapReduce computing pattern does not match the ATAC pattern. This leads to load imbalances and poor data locality when Hadoop's data distribution strategy is used for ATAC problems. Here we present a data distribution strategy which considers data locality, load balancing and storage savings for ATAC computing problems in homogeneous distributed systems. A simulated annealing algorithm is developed for data distribution and task scheduling. Experimental results show a significant performance improvement for our approach over Hadoop-based solutions.
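
As a rough illustration of how simulated annealing can be applied to this kind of placement problem, the sketch below anneals a replica placement so that every pairwise comparison finds both of its inputs on some node while storage stays balanced across nodes. The cost weights, move operator and cooling schedule are illustrative assumptions, not the algorithm evaluated in the paper.

```python
import itertools
import math
import random

def anneal_placement(n_items, n_nodes, copies=2, steps=20000, seed=0):
    """Toy simulated annealing: place `copies` replicas of each data item on
    nodes so that every pairwise comparison (i, j) can run on some node that
    already holds both items, while keeping storage balanced."""
    rng = random.Random(seed)
    placement = {i: set(rng.sample(range(n_nodes), copies)) for i in range(n_items)}
    pairs = list(itertools.combinations(range(n_items), 2))

    def cost(pl):
        load = [0] * n_nodes
        for replicas in pl.values():
            for node in replicas:
                load[node] += 1
        imbalance = max(load) - min(load)           # storage imbalance across nodes
        non_local = sum(1 for i, j in pairs if not (pl[i] & pl[j]))
        return imbalance + 10 * non_local           # heavily penalise comparisons needing data movement

    current, temperature = cost(placement), 1.0
    for _ in range(steps):
        item = rng.randrange(n_items)
        old = set(placement[item])
        placement[item] = set(rng.sample(range(n_nodes), copies))   # candidate move: re-place one item
        candidate = cost(placement)
        accept = (candidate <= current or
                  rng.random() < math.exp((current - candidate) / max(temperature, 1e-9)))
        if accept:
            current = candidate
        else:
            placement[item] = old                   # reject and undo the move
        temperature *= 0.9995                       # geometric cooling
    return placement, current
```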

Relevance: 70.00%

Abstract:

As technological capabilities for capturing, aggregating, and processing large quantities of data continue to improve, the question becomes how to effectively utilise these resources. Whenever automatic methods fail, it is necessary to rely on human background knowledge, intuition, and deliberation. This creates demand for data exploration interfaces that support the analytical process, allowing users to absorb and derive knowledge from data. Such interfaces have historically been designed for experts. However, existing research has shown promise in involving a broader range of users who act as citizen scientists, which places high demands on usability. Visualisation is one of the most effective analytical tools for humans to process abstract information. Our research focuses on the development of interfaces to support collaborative, community-led inquiry into data, which we refer to as Participatory Data Analytics. The development of data exploration interfaces to support independent investigations by local communities around topics of their interest presents a unique set of challenges, which we discuss in this paper. We present our preliminary work towards suitable high-level abstractions and interaction concepts that allow users to construct and tailor visualisations to their own needs.

Relevance: 70.00%

Abstract:

This research studied distributed computing of all-to-all comparison problems with big data sets. The thesis formalised the problem and developed a high-performance, scalable computing framework with a programming model, data distribution strategies and task scheduling policies to solve it. The study considered storage usage, data locality and load balancing for performance improvement. The research outcomes can be applied in bioinformatics, biometrics, data mining and other domains in which all-to-all comparison is a typical computing pattern.
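
For context, the all-to-all comparison pattern itself is simple to state: evaluate a comparison function over every unordered pair of items. A minimal illustration with a placeholder comparison function follows; the difficulty addressed by the thesis arises when the items are large files distributed across a cluster.

```python
from itertools import combinations

def all_to_all(items, compare):
    """Evaluate `compare` on every unordered pair of items.

    For n items this yields n * (n - 1) / 2 comparisons, which is what makes
    data placement and task scheduling hard once each item is a large file
    spread across a cluster.
    """
    return {(i, j): compare(items[i], items[j])
            for i, j in combinations(range(len(items)), 2)}

# Example with a trivial position-wise similarity measure on strings.
sequences = ["GATTACA", "GATTTACA", "CATTACA"]
scores = all_to_all(sequences, lambda a, b: sum(x == y for x, y in zip(a, b)))
print(scores)
```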

Relevance: 70.00%

Abstract:

Big Data and Learning Analytics’ promise to revolutionise educational institutions, endeavours, and actions through more and better data is now compelling. Multiple, and continually updating, data sets produce a new sense of ‘personalised learning’. A crucial attribute of the datafication, and subsequent profiling, of learner behaviour and engagement is the continual modification of the learning environment to induce greater levels of investment on the part of each learner. The assumption is that more and better data, gathered faster and fed into ever-updating algorithms, provide more complete tools to understand, and therefore improve, learning experiences through adaptive personalisation. The argument in this paper is that Learning Personalisation names a new logistics of investment as the common ‘sense’ of the school, in which disciplinary education is ‘both disappearing and giving way to frightful continual training, to continual monitoring'.

Relevance: 70.00%

Abstract:

Solving large-scale all-to-all comparison problems using distributed computing is increasingly significant for various applications. Previous efforts to implement distributed all-to-all comparison frameworks have treated the two phases of data distribution and comparison task scheduling separately. This leads to high storage demands as well as poor data locality for the comparison tasks, thus creating a need to redistribute the data at runtime. Furthermore, most previous methods have been developed for homogeneous computing environments, so their overall performance is degraded even further when they are used in heterogeneous distributed systems. To tackle these challenges, this paper presents a data-aware task scheduling approach for solving all-to-all comparison problems in heterogeneous distributed systems. The approach formulates the requirements for data distribution and comparison task scheduling simultaneously as a constrained optimization problem. Then, metaheuristic data pre-scheduling and dynamic task scheduling strategies are developed along with an algorithmic implementation to solve the problem. The approach provides perfect data locality for all comparison tasks, avoiding rearrangement of data at runtime. It achieves load balancing among heterogeneous computing nodes, thus improving the overall computation time. It also reduces data storage requirements across the network. The effectiveness of the approach is demonstrated through experimental studies.
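
A toy sketch of the two ingredients the abstract combines, data locality as a hard constraint and speed-proportional load targets for heterogeneous nodes, is shown below. The function names and the proportional-share rule are illustrative assumptions rather than the paper's actual formulation.

```python
def locality_satisfied(placement, tasks):
    """Hard constraint in toy form: every comparison task (i, j) must have at
    least one node holding replicas of both item i and item j, so no data has
    to be moved at runtime."""
    return all(placement[i] & placement[j] for i, j in tasks)

def proportional_targets(task_count, node_speeds):
    """Load-balancing target for heterogeneous nodes: each node's share of the
    comparison tasks is proportional to its relative computing speed."""
    total = sum(node_speeds.values())
    return {node: task_count * speed / total for node, speed in node_speeds.items()}

# Example: item 0 and item 2 share no node, so the locality constraint fails.
placement = {0: {"node-a"}, 1: {"node-a", "node-b"}, 2: {"node-b"}}
tasks = [(0, 1), (1, 2), (0, 2)]
print(locality_satisfied(placement, tasks))          # False
print(proportional_targets(100, {"node-a": 1.0, "node-b": 2.0, "node-c": 1.5}))
```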

Relevance: 70.00%

Abstract:

It is estimated that the quantity of digital data being transferred, processed or stored at any one time currently stands at 4.4 zettabytes (4.4 × 2^70 bytes), and this figure is expected to have grown by a factor of 10 to 44 zettabytes by 2020. Exploiting this data is, and will remain, a significant challenge. At present there is the capacity to store 33% of digital data in existence at any one time; by 2020 this capacity is expected to fall to 15%. These statistics suggest that, in the era of Big Data, the identification of important, exploitable data will need to be done in a timely manner.

Systems for the monitoring and analysis of data, e.g. stock markets, smart grids and sensor networks, can be made up of massive numbers of individual components. These components can be geographically distributed yet may interact with one another via continuous data streams, which in turn may affect the state of the sender or receiver. This introduces a dynamic causality, which further complicates the overall system by introducing a temporal constraint that is difficult to accommodate. Practical approaches to realising the system described above have led to a multiplicity of analysis techniques, each of which concentrates on specific characteristics of the system being analysed and treats these characteristics as the dominant component affecting the results being sought. The multiplicity of analysis techniques introduces another layer of heterogeneity, that is, heterogeneity of approach, partitioning the field to the extent that results from one domain are difficult to exploit in another. The question asked is whether a generic solution for the monitoring and analysis of data can be identified that accommodates temporal constraints, bridges the gap between expert knowledge and raw data, and enables data to be effectively interpreted and exploited in a transparent manner.

The approach proposed in this dissertation acquires, analyses and processes data in a manner that is free of the constraints of any particular analysis technique, while at the same time facilitating these techniques where appropriate. Constraints are applied by defining a workflow based on the production, interpretation and consumption of data. This supports the application of different analysis techniques on the same raw data without the danger of incorporating hidden bias. To illustrate and realise this approach, a software platform has been created that allows for the transparent analysis of data, combining analysis techniques with a maintainable record of provenance so that independent third-party analysis can be applied to verify any derived conclusions.

In order to demonstrate these concepts, a complex real-world example involving the near real-time capturing and analysis of neurophysiological data from a neonatal intensive care unit (NICU) was chosen. A system was engineered to gather raw data, analyse that data using different analysis techniques, uncover information, incorporate that information into the system and curate the evolution of the discovered knowledge. The application domain was chosen for three reasons: firstly, because it is complex and no comprehensive solution exists; secondly, it requires tight interaction with domain experts, thus requiring the handling of subjective knowledge and inference; and thirdly, given the dearth of neurophysiologists, there is a real-world need to provide a solution for this domain.
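
A minimal sketch of the kind of provenance trail such a production/interpretation/consumption workflow could keep for each analysis step is given below; the record fields and step names are illustrative assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List

@dataclass
class ProvenanceRecord:
    """One step in a production -> interpretation -> consumption workflow:
    which inputs were read, which technique was applied, what was produced."""
    step: str                      # e.g. "artefact-removal" (hypothetical step name)
    technique: str                 # the analysis technique applied at this step
    inputs: List[str]              # identifiers of the raw or derived data consumed
    outputs: List[str]             # identifiers of the data produced
    performed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

trail: List[ProvenanceRecord] = []

def run_step(step: str, technique: str, inputs: List[str],
             analyse: Callable[[List[str]], List[str]]) -> List[str]:
    """Apply an analysis function and append a provenance record, so a third
    party can later audit or re-run the chain that led to a conclusion."""
    outputs = analyse(inputs)
    trail.append(ProvenanceRecord(step, technique, list(inputs), list(outputs)))
    return outputs
```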

Relevance: 70.00%

Abstract:

The member states of the European Union are faced with the challenges of handling “big data” as well as with a growing impact of the supranational level. Given that the success of efforts at European level strongly depends on corresponding national and local activities, i.e., the quality of implementation and the degree of consistency, this chapter centers upon the coherence of European strategies and national implementations concerning the reuse of public sector information. Taking the City of Vienna’s open data activities as an illustrative example, we seek an answer to the question whether and to what extent developments at European level and other factors have an effect on local efforts towards open data. We find that the European Commission’s ambitions are driven by a strong economic argumentation, while the efforts of the City of Vienna have only very little to do with the European orientation and are rather dominated by lifestyle and administrative reform arguments. Hence, we observe a decoupling of supranational strategies and national implementation activities. The very reluctant attitude at Austrian federal level might be one reason for this, nationally induced barriers—such as the administrative culture—might be another. In order to enhance the correspondence between the strategies of the supranational level and those of the implementers at national and regional levels, the strengthening of soft law measures could be promising.

Relevance: 70.00%

Abstract:

Emerging web applications like cloud computing, Big Data and social networks have created the need for powerful data centres hosting hundreds of thousands of servers. Currently, these data centres are based on general-purpose processors that provide high flexibility but lack the energy efficiency of customized accelerators. VINEYARD aims to develop an integrated platform for energy-efficient data centres based on new servers with novel, coarse-grain and fine-grain, programmable hardware accelerators. It will also build a high-level programming framework that allows end-users to seamlessly utilise these accelerators in heterogeneous computing systems by employing typical data-centre programming frameworks (e.g. MapReduce, Storm, Spark, etc.). This programming framework will further allow the hardware accelerators to be swapped in and out of the heterogeneous infrastructure so as to offer high flexibility and energy efficiency. VINEYARD will foster the expansion of the soft-IP core industry, currently limited to embedded systems, into the data-centre market. VINEYARD plans to demonstrate the advantages of its approach in three real use cases: (a) a bio-informatics application for high-accuracy brain modeling, (b) two critical financial applications, and (c) a big-data analysis application.
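
One way to picture the "swap accelerators in and out" idea is a dispatch layer that routes a named kernel to whichever backend implementation is currently plugged in, falling back to software otherwise. The sketch below is hypothetical and is not VINEYARD's actual API.

```python
from typing import Callable, Dict

class AcceleratorRegistry:
    """Hypothetical dispatch layer: applications call a named kernel and the
    framework picks a hardware-accelerated implementation if one is plugged
    in, falling back to a software version otherwise."""

    def __init__(self) -> None:
        self._impls: Dict[str, Dict[str, Callable]] = {}

    def register(self, kernel: str, backend: str, fn: Callable) -> None:
        self._impls.setdefault(kernel, {})[backend] = fn

    def run(self, kernel: str, *args, prefer=("fpga", "gpu", "cpu")):
        for backend in prefer:
            fn = self._impls.get(kernel, {}).get(backend)
            if fn is not None:
                return fn(*args)   # first available backend in preference order wins
        raise LookupError(f"no implementation registered for {kernel}")

# Usage: only a software fallback is registered here, so it is the one used.
registry = AcceleratorRegistry()
registry.register("dot-product", "cpu", lambda a, b: sum(x * y for x, y in zip(a, b)))
print(registry.run("dot-product", [1, 2, 3], [4, 5, 6]))
```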

Relevance: 70.00%

Abstract:

This work is dedicated to the comparison of open-source and proprietary transport protocols for high-speed data transmission over IP networks. The ubiquitous TCP needs significant improvement, since it was developed as a general-purpose transport protocol and first introduced four decades ago. In today's networks, TCP does not fit all communication needs, and for this reason other transport protocols have been developed and used successfully, e.g. for Big Data movement. Within the scope of this research, the following protocols were investigated for their efficiency on 10 Gbps links: UDT, RBUDP, MTP and RWTP. The protocols were tested under different impairments, such as round-trip times of up to 400 ms and packet loss of up to 2%. The investigated parameters are the data rate under different network conditions, the CPU load of sender and receiver during the experiments, the size of feedback data, the CPU usage per Gbps, and the amount of feedback data per GiByte of effectively transmitted data. The best performance and fairest resource consumption were observed with RWTP. Among the open-source projects, the best behaviour was shown by RBUDP.
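
The normalised metrics mentioned above can be derived from raw measurements roughly as follows; the exact definitions used in the study are not stated here, so treat these formulas as assumptions for illustration.

```python
def throughput_gbps(bytes_transferred: int, seconds: float) -> float:
    """Effective data rate in gigabits per second."""
    return bytes_transferred * 8 / seconds / 1e9

def cpu_per_gbps(avg_cpu_percent: float, rate_gbps: float) -> float:
    """CPU cost of the protocol, normalised by achieved throughput."""
    return avg_cpu_percent / rate_gbps

def feedback_per_gib(feedback_bytes: int, payload_bytes: int) -> float:
    """Bytes of protocol feedback per GiByte of effectively transmitted data."""
    return feedback_bytes / (payload_bytes / 2**30)

# Example: 500 GiB moved in 600 s with 35% average sender CPU and 12 MiB of feedback.
rate = throughput_gbps(500 * 2**30, 600.0)
print(rate, cpu_per_gbps(35.0, rate), feedback_per_gib(12 * 2**20, 500 * 2**30))
```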

Relevance: 70.00%

Abstract:

This is a research discussion about the Hampshire Hub - see http://protohub.net/. The aim is to find out more about the project, and discuss future collaboration and sharing of ideas.

Mark Braggins (Hampshire Hub Partnership) will introduce the Hampshire Hub programme, setting out its main objectives, work done to date, and next steps, including the Hampshire data store (which will use the PublishMyData linked data platform) and opportunities for the University of Southampton to engage with the programme, including the forthcoming Hampshire Hackathons.

Bill Roberts (Swirrl) will give an overview of the PublishMyData platform and how it will help deliver the objectives of the Hampshire Hub. He will detail some of the new functionality being added to the platform.

Steve Peters (DCLG Open Data Communities) will focus on developing a web of data that blends and combines local and national data sources around localities and common topics/themes. This will include observations on the potential of employing emerging new big data sources to help deliver more effective, better-targeted public services. Steve will illustrate this with practical examples of DCLG's work to publish its own data in a SPARQL end-point, so that it can be used over the web alongside related third-party sources. He will share examples of some of the practical challenges, particularly around querying and re-using geographic Linked Data in a federated world of SPARQL end-points.
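
For readers unfamiliar with consuming such data, the SPARQL 1.1 protocol lets a client send a query to an end-point over HTTP and receive JSON results via content negotiation. A minimal sketch follows; the end-point URL is a placeholder and the example query against the RDF Data Cube and Dublin Core vocabularies is illustrative only.

```python
import requests

# Placeholder end-point URL; substitute the address of an actual SPARQL service.
ENDPOINT = "https://example.org/sparql"

QUERY = """
SELECT ?dataset ?title WHERE {
  ?dataset a <http://purl.org/linked-data/cube#DataSet> ;
           <http://purl.org/dc/terms/title> ?title .
} LIMIT 10
"""

# SPARQL 1.1 Protocol: the query goes in a URL parameter, JSON results are
# requested through the Accept header.
resp = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["dataset"]["value"], "-", row["title"]["value"])
```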


Relevance: 70.00%

Abstract:

The CHARMe project enables the annotation of climate data with key pieces of supporting information that we term “commentary”. Commentary reflects the experience that has built up in the user community, and can help new or less-expert users (such as consultants, SMEs, experts in other fields) to understand and interpret complex data. In the context of global climate services, the CHARMe system will record, retain and disseminate this commentary on climate datasets, and provide a means for feeding back this experience to the data providers. Based on novel linked data techniques and standards, the project has developed a core system, data model and suite of open-source tools to enable this information to be shared, discovered and exploited by the community.
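
As a rough illustration of how a piece of commentary can be attached to a dataset using linked-data annotation vocabularies, the JSON-LD below uses W3C Web Annotation terms. Whether this matches CHARMe's own data model, and all identifiers and text shown, are assumptions for illustration only.

```python
import json

# Placeholder identifiers; a real annotation would point at actual climate
# dataset and user resources.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "http://example.org/annotations/1",
    "type": "Annotation",
    "motivation": "commenting",
    "creator": "http://example.org/users/reanalysis-expert",
    "bodyValue": "Sea-surface temperatures in this product drift after 2005; "
                 "compare against the in-situ record before using for trends.",
    "target": "http://example.org/datasets/sst-reanalysis-v2",
}

print(json.dumps(annotation, indent=2))
```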

Relevance: 70.00%

Abstract:

We present an overview of the MELODIES project, which is developing new data-intensive environmental services based on data from Earth Observation satellites, government databases, national and European agencies and more. We focus here on the capabilities and benefits of the project’s “technical platform”, which applies cloud computing and Linked Data technologies to enable the development of these services, providing flexibility and scalability.