875 resultados para big data storage


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The demand for data storage and processing is increasing at a rapid speed in the big data era. The management of such tremendous volume of data is a critical challenge to the data storage systems. Firstly, since 60% of the stored data is claimed to be redundant, data deduplication technology becomes an attractive solution to save storage space and traffic in a big data environment. Secondly, the security issues, such as confidentiality, integrity and privacy of the big data should also be considered for big data storage. To address these problems, convergent encryption is widely used to secure data deduplication for big data storage. Nonetheless, there still exist some other security issues, such as proof of ownership, key management and so on. In this chapter, we first introduce some major cyber attacks for big data storage. Then, we describe the existing fundamental security techniques, whose integration is essential for preventing data from existing and future security attacks. By discussing some interesting open problems, we finally expect to trigger more research efforts in this new research field.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In recent years, big data have become a hot research topic. The increasing amount of big data also increases the chance of breaching the privacy of individuals. Since big data require high computational power and large storage, distributed systems are used. As multiple parties are involved in these systems, the risk of privacy violation is increased. There have been a number of privacy-preserving mechanisms developed for privacy protection at different stages (e.g., data generation, data storage, and data processing) of a big data life cycle. The goal of this paper is to provide a comprehensive overview of the privacy preservation mechanisms in big data and present the challenges for existing mechanisms. In particular, in this paper, we illustrate the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms in each stage of the big data life cycle. Furthermore, we discuss the challenges and future research directions related to privacy preservation in big data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The concept of big data has already outperformed traditional data management efforts in almost all industries. Other instances it has succeeded in obtaining promising results that provide value from large-scale integration and analysis of heterogeneous data sources for example Genomic and proteomic information. Big data analytics have become increasingly important in describing the data sets and analytical techniques in software applications that are so large and complex due to its significant advantages including better business decisions, cost reduction and delivery of new product and services [1]. In a similar context, the health community has experienced not only more complex and large data content, but also information systems that contain a large number of data sources with interrelated and interconnected data attributes. That have resulted in challenging, and highly dynamic environments leading to creation of big data with its enumerate complexities, for instant sharing of information with the expected security requirements of stakeholders. When comparing big data analysis with other sectors, the health sector is still in its early stages. Key challenges include accommodating the volume, velocity and variety of healthcare data with the current deluge of exponential growth. Given the complexity of big data, it is understood that while data storage and accessibility are technically manageable, the implementation of Information Accountability measures to healthcare big data might be a practical solution in support of information security, privacy and traceability measures. Transparency is one important measure that can demonstrate integrity which is a vital factor in the healthcare service. Clarity about performance expectations is considered to be another Information Accountability measure which is necessary to avoid data ambiguity and controversy about interpretation and finally, liability [2]. According to current studies [3] Electronic Health Records (EHR) are key information resources for big data analysis and is also composed of varied co-created values [3]. Common healthcare information originates from and is used by different actors and groups that facilitate understanding of the relationship for other data sources. Consequently, healthcare services often serve as an integrated service bundle. Although a critical requirement in healthcare services and analytics, it is difficult to find a comprehensive set of guidelines to adopt EHR to fulfil the big data analysis requirements. Therefore as a remedy, this research work focus on a systematic approach containing comprehensive guidelines with the accurate data that must be provided to apply and evaluate big data analysis until the necessary decision making requirements are fulfilled to improve quality of healthcare services. Hence, we believe that this approach would subsequently improve quality of life.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Increasingly larger scale applications are generating an unprecedented amount of data. However, the increasing gap between computation and I/O capacity on High End Computing machines makes a severe bottleneck for data analysis. Instead of moving data from its source to the output storage, in-situ analytics processes output data while simulations are running. However, in-situ data analysis incurs much more computing resource contentions with simulations. Such contentions severely damage the performance of simulation on HPE. Since different data processing strategies have different impact on performance and cost, there is a consequent need for flexibility in the location of data analytics. In this paper, we explore and analyze several potential data-analytics placement strategies along the I/O path. To find out the best strategy to reduce data movement in given situation, we propose a flexible data analytics (FlexAnalytics) framework in this paper. Based on this framework, a FlexAnalytics prototype system is developed for analytics placement. FlexAnalytics system enhances the scalability and flexibility of current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and visualization, as well as for large-scale data transfer. Two use cases – scientific data compression and remote visualization – have been applied in the study to verify the performance of FlexAnalytics. Experimental results demonstrate that FlexAnalytics framework increases data transition bandwidth and improves the application end-to-end transfer performance.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Big Data and predictive analytics have received significant attention from the media and academic literature throughout the past few years, and it is likely that these emerging technologies will materially impact the mining sector. This short communication argues, however, that these technological forces will probably unfold differently in the mining industry than they have in many other sectors because of significant differences in the marginal cost of data capture and storage. To this end, we offer a brief overview of what Big Data and predictive analytics are, and explain how they are bringing about changes in a broad range of sectors. We discuss the “N=all” approach to data collection being promoted by many consultants and technology vendors in the marketplace but, by considering the economic and technical realities of data acquisition and storage, we then explain why a “n « all” data collection strategy probably makes more sense for the mining sector. Finally, towards shaping the industry’s policies with regards to technology-related investments in this area, we conclude by putting forward a conceptual model for leveraging Big Data tools and analytical techniques that is a more appropriate fit for the mining sector.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

JASMIN is a super-data-cluster designed to provide a high-performance high-volume data analysis environment for the UK environmental science community. Thus far JASMIN has been used primarily by the atmospheric science and earth observation communities, both to support their direct scientific workflow, and the curation of data products in the STFC Centre for Environmental Data Archival (CEDA). Initial JASMIN configuration and first experiences are reported here. Useful improvements in scientific workflow are presented. It is clear from the explosive growth in stored data and use that there was a pent up demand for a suitable big-data analysis environment. This demand is not yet satisfied, in part because JASMIN does not yet have enough compute, the storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The term 'big data' has recently emerged to describe a range of technological and commercial trends enabling the storage and analysis of huge amounts of customer data, such as that generated by social networks and mobile devices. Much of the commercial promise of big data is in the ability to generate valuable insights from collecting new types and volumes of data in ways that were not previously economically viable. At the same time a number of questions have been raised about the implications for individual privacy. This paper explores key perspectives underlying the emergence of big data, and considers both the opportunities and ethical challenges raised for market research.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The size and complexity of data sets generated within ecosystem-level programmes merits their capture, curation, storage and analysis, synthesis and visualisation using Big Data approaches. This review looks at previous attempts to organise and analyse such data through the International Biological Programme and draws on the mistakes made and the lessons learned for effective Big Data approaches to current Research Councils United Kingdom (RCUK) ecosystem-level programmes, using Biodiversity and Ecosystem Service Sustainability (BESS) and Environmental Virtual Observatory Pilot (EVOp) as exemplars. The challenges raised by such data are identified, explored and suggestions are made for the two major issues of extending analyses across different spatio-temporal scales and for the effective integration of quantitative and qualitative data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

With the explosion of big data, processing large numbers of continuous data streams, i.e., big data stream processing (BDSP), has become a crucial requirement for many scientific and industrial applications in recent years. By offering a pool of computation, communication and storage resources, public clouds, like Amazon's EC2, are undoubtedly the most efficient platforms to meet the ever-growing needs of BDSP. Public cloud service providers usually operate a number of geo-distributed datacenters across the globe. Different datacenter pairs are with different inter-datacenter network costs charged by Internet Service Providers (ISPs). While, inter-datacenter traffic in BDSP constitutes a large portion of a cloud provider's traffic demand over the Internet and incurs substantial communication cost, which may even become the dominant operational expenditure factor. As the datacenter resources are provided in a virtualized way, the virtual machines (VMs) for stream processing tasks can be freely deployed onto any datacenters, provided that the Service Level Agreement (SLA, e.g., quality-of-information) is obeyed. This raises the opportunity, but also a challenge, to explore the inter-datacenter network cost diversities to optimize both VM placement and load balancing towards network cost minimization with guaranteed SLA. In this paper, we first propose a general modeling framework that describes all representative inter-task relationship semantics in BDSP. Based on our novel framework, we then formulate the communication cost minimization problem for BDSP into a mixed-integer linear programming (MILP) problem and prove it to be NP-hard. We then propose a computation-efficient solution based on MILP. The high efficiency of our proposal is validated by extensive simulation based studies.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A coleta e o armazenamento de dados em larga escala, combinados à capacidade de processamento de dados que não necessariamente tenham relação entre si de forma a gerar novos dados e informações, é uma tecnologia amplamente usada na atualidade, conhecida de forma geral como Big Data. Ao mesmo tempo em que possibilita a criação de novos produtos e serviços inovadores, os quais atendem a demandas e solucionam problemas de diversos setores da sociedade, o Big Data levanta uma série de questionamentos relacionados aos direitos à privacidade e à proteção dos dados pessoais. Esse artigo visa proporcionar um debate sobre o alcance da atual proteção jurídica aos direitos à privacidade e aos dados pessoais nesse contexto, e consequentemente fomentar novos estudos sobre a compatibilização dos mesmos com a liberdade de inovação. Para tanto, abordará, em um primeiro momento, pontos positivos e negativos do Big Data, identificando como o mesmo afeta a sociedade e a economia de forma ampla, incluindo, mas não se limitando, a questões de consumo, saúde, organização social, administração governamental, etc. Em seguida, serão identificados os efeitos dessa tecnologia sobre os direitos à privacidade e à proteção dos dados pessoais, tendo em vista que o Big Data gera grandes mudanças no que diz respeito ao armazenamento e tratamento de dados. Por fim, será feito um mapeamento do atual quadro regulatório brasileiro de proteção a tais direitos, observando se o mesmo realmente responde aos desafios atuais de compatibilização entre inovação e privacidade.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Big data è il termine usato per descrivere una raccolta di dati così estesa in termini di volume,velocità e varietà da richiedere tecnologie e metodi analitici specifici per l'estrazione di valori significativi. Molti sistemi sono sempre più costituiti e caratterizzati da enormi moli di dati da gestire,originati da sorgenti altamente eterogenee e con formati altamente differenziati,oltre a qualità dei dati estremamente eterogenei. Un altro requisito in questi sistemi potrebbe essere il fattore temporale: sempre più sistemi hanno bisogno di ricevere dati significativi dai Big Data il prima possibile,e sempre più spesso l’input da gestire è rappresentato da uno stream di informazioni continuo. In questo campo si inseriscono delle soluzioni specifiche per questi casi chiamati Online Stream Processing. L’obiettivo di questa tesi è di proporre un prototipo funzionante che elabori dati di Instant Coupon provenienti da diverse fonti con diversi formati e protocolli di informazioni e trasmissione e che memorizzi i dati elaborati in maniera efficiente per avere delle risposte in tempo reale. Le fonti di informazione possono essere di due tipologie: XMPP e Eddystone. Il sistema una volta ricevute le informazioni in ingresso, estrapola ed elabora codeste fino ad avere dati significativi che possono essere utilizzati da terze parti. Lo storage di questi dati è fatto su Apache Cassandra. Il problema più grosso che si è dovuto risolvere riguarda il fatto che Apache Storm non prevede il ribilanciamento delle risorse in maniera automatica, in questo caso specifico però la distribuzione dei clienti durante la giornata è molto varia e ricca di picchi. Il sistema interno di ribilanciamento sfrutta tecnologie innovative come le metriche e sulla base del throughput e della latenza esecutiva decide se aumentare/diminuire il numero di risorse o semplicemente non fare niente se le statistiche sono all’interno dei valori di soglia voluti.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Dissertação apresentada à Escola Superior de Tecnologia do Instituto Politécnico de Castelo Branco para cumprimento dos requisitos necessários à obtenção do grau de Mestre em Desenvolvimento de Software e Sistemas Interactivos, realizada sob a orientação científica da categoria profissional do orientador Doutor Eurico Ribeiro Lopes, do Instituto Politécnico de Castelo Branco.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

O trabalho desenvolvido analisa a Comunicação Social no contexto da internet e delineia novas metodologias de estudo para a área na filtragem de significados no âmbito científico dos fluxos de informação das redes sociais, mídias de notícias ou qualquer outro dispositivo que permita armazenamento e acesso a informação estruturada e não estruturada. No intento de uma reflexão sobre os caminhos, que estes fluxos de informação se desenvolvem e principalmente no volume produzido, o projeto dimensiona os campos de significados que tal relação se configura nas teorias e práticas de pesquisa. O objetivo geral deste trabalho é contextualizar a área da Comunicação Social dentro de uma realidade mutável e dinâmica que é o ambiente da internet e fazer paralelos perante as aplicações já sucedidas por outras áreas. Com o método de estudo de caso foram analisados três casos sob duas chaves conceituais a Web Sphere Analysis e a Web Science refletindo os sistemas de informação contrapostos no quesito discursivo e estrutural. Assim se busca observar qual ganho a Comunicação Social tem no modo de visualizar seus objetos de estudo no ambiente das internet por essas perspectivas. O resultado da pesquisa mostra que é um desafio para o pesquisador da Comunicação Social buscar novas aprendizagens, mas a retroalimentação de informação no ambiente colaborativo que a internet apresenta é um caminho fértil para pesquisa, pois a modelagem de dados ganha corpus analítico quando o conjunto de ferramentas promovido e impulsionado pela tecnologia permite isolar conteúdos e possibilita aprofundamento dos significados e suas relações.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Big Data Analytics is an emerging field since massive storage and computing capabilities have been made available by advanced e-infrastructures. Earth and Environmental sciences are likely to benefit from Big Data Analytics techniques supporting the processing of the large number of Earth Observation datasets currently acquired and generated through observations and simulations. However, Earth Science data and applications present specificities in terms of relevance of the geospatial information, wide heterogeneity of data models and formats, and complexity of processing. Therefore, Big Earth Data Analytics requires specifically tailored techniques and tools. The EarthServer Big Earth Data Analytics engine offers a solution for coverage-type datasets, built around a high performance array database technology, and the adoption and enhancement of standards for service interaction (OGC WCS and WCPS). The EarthServer solution, led by the collection of requirements from scientific communities and international initiatives, provides a holistic approach that ranges from query languages and scalability up to mobile access and visualization. The result is demonstrated and validated through the development of lighthouse applications in the Marine, Geology, Atmospheric, Planetary and Cryospheric science domains.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Big Data Analytics is an emerging field since massive storage and computing capabilities have been made available by advanced e-infrastructures. Earth and Environmental sciences are likely to benefit from Big Data Analytics techniques supporting the processing of the large number of Earth Observation datasets currently acquired and generated through observations and simulations. However, Earth Science data and applications present specificities in terms of relevance of the geospatial information, wide heterogeneity of data models and formats, and complexity of processing. Therefore, Big Earth Data Analytics requires specifically tailored techniques and tools. The EarthServer Big Earth Data Analytics engine offers a solution for coverage-type datasets, built around a high performance array database technology, and the adoption and enhancement of standards for service interaction (OGC WCS and WCPS). The EarthServer solution, led by the collection of requirements from scientific communities and international initiatives, provides a holistic approach that ranges from query languages and scalability up to mobile access and visualization. The result is demonstrated and validated through the development of lighthouse applications in the Marine, Geology, Atmospheric, Planetary and Cryospheric science domains.