77 resultados para Big Data Hadoop Spark GPSJ

em Deakin Research Online - Australia


Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper is written through the vision on integrating Internet-of-Things (IoT) with the power of Cloud Computing and the intelligence of Big Data analytics. But integration of all these three cutting edge technologies is complex to understand. In this research we first provide a security centric view of three layered approach for understanding the technology, gaps and security issues. Then with a series of lab experiments on different hardware, we have collected performance data from all these three layers, combined these data together and finally applied modern machine learning algorithms to distinguish 18 different activities and cyber-attacks. From our experiments we find classification algorithm RandomForest can identify 93.9% attacks and activities in this complex environment. From the existing literature, no one has ever attempted similar experiment for cyber-attack detection for IoT neither with performance data nor with a three layered approach.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasingly attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependence of the Resilient Distributed Datasets (RDD) objects. Then, different task schedulers are invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy for large, high-dimensional, and noisy data, we perform a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over the relevant algorithms implemented by Spark MLlib and other studies in terms of the classification accuracy, performance, and scalability.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In big-data-driven traffic flow prediction systems, the robustness of prediction performance depends on accuracy and timeliness. This paper presents a new MapReduce-based nearest neighbor (NN) approach for traffic flow prediction using correlation analysis (TFPC) on a Hadoop platform. In particular, we develop a real-time prediction system including two key modules, i.e., offline distributed training (ODT) and online parallel prediction (OPP). Moreover, we build a parallel k-nearest neighbor optimization classifier, which incorporates correlation information among traffic flows into the classification process. Finally, we propose a novel prediction calculation method, combining the current data observed in OPP and the classification results obtained from large-scale historical data in ODT, to generate traffic flow prediction in real time. The empirical study on real-world traffic flow big data using the leave-one-out cross validation method shows that TFPC significantly outperforms four state-of-the-art prediction approaches, i.e., autoregressive integrated moving average, Naïve Bayes, multilayer perceptron neural networks, and NN regression, in terms of accuracy, which can be improved 90.07% in the best case, with an average mean absolute percent error of 5.53%. In addition, it displays excellent speedup, scaleup, and sizeup.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

With the arrival of Big Data Era, properly utilizing the power of big data is becoming increasingly essential for the strength and competitiveness of businesses and organizations. We are facing grand challenges from big data from different perspectives, such as processing, communication, security, and privacy. In this talk, we discuss the big data challenges in network traffic classification and our solutions to the challenges. The significance of the research lies in the fact that each year the network traffic increase exponentially on the current Internet. Traffic classification has wide applications in network management, from security monitoring to quality of service measurements. Recent research tends to apply machine-learning techniques to flow statistical feature based classification methods. In this talk, we propose a series of novel approaches for traffic classification, which can improve the classification performance effectively by incorporating correlated information into the classification process. We analyze the new classification approaches and their performance benefit from both theoretical and empirical perspectives. A large number of experiments are carried out on two real-world traffic datasets to validate the proposed approach. The results show the traffic classification performance can be improved significantly even under the extreme difficult circumstance of very few training samples. Our work has significant impact on security applications.

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper introduces and investigates large iterative multitier ensemble (LIME) classifiers specifically tailored for big data. These classifiers are very large, but are quite easy to generate and use. They can be so large that it makes sense to use them only for big data. They are generated automatically as a result of several iterations in applying ensemble meta classifiers. They incorporate diverse ensemble meta classifiers into several tiers simultaneously and combine them into one automatically generated iterative system so that many ensemble meta classifiers function as integral parts of other ensemble meta classifiers at higher tiers. In this paper, we carry out a comprehensive investigation of the performance of LIME classifiers for a problem concerning security of big data. Our experiments compare LIME classifiers with various base classifiers and standard ordinary ensemble meta classifiers. The results obtained demonstrate that LIME classifiers can significantly increase the accuracy of classifications. LIME classifiers performed better than the base classifiers and standard ensemble meta classifiers.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Big data presents a remarkable opportunity for organisations to obtain critical intelligence to drive decisions and obtain insights as never before. However, big data generates high network traffic. Moreover, the continuous growth in the variety of network traffic due to big data variety has rendered the network to be one of the key big data challenges. In this article, we present a comprehensive analysis of big data variety and its adverse effects on the network performance. We present taxonomy of big data variety and discuss various dimensions of the big data variety features. We also discuss how the features influence the interconnection network requirements. Finally, we discuss some of the challenges each big data variety dimension presents and possible approach to address them.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Stochastic search techniques such as evolutionary algorithms (EA) are known to be better explorer of search space as compared to conventional techniques including deterministic methods. However, in the era of big data like most other search methods and learning algorithms, suitability of evolutionary algorithms is naturally questioned. Big data pose new computational challenges including very high dimensionality and sparseness of data. Evolutionary algorithms' superior exploration skills should make them promising candidates for handling optimization problems involving big data. High dimensional problems introduce added complexity to the search space. However, EAs need to be enhanced to ensure that majority of the potential winner solutions gets the chance to survive and mature. In this paper we present an evolutionary algorithm with enhanced ability to deal with the problems of high dimensionality and sparseness of data. In addition to an informed exploration of the solution space, this technique balances exploration and exploitation using a hierarchical multi-population approach. The proposed model uses informed genetic operators to introduce diversity by expanding the scope of search process at the expense of redundant less promising members of the population. Next phase of the algorithm attempts to deal with the problem of high dimensionality by ensuring broader and more exhaustive search and preventing premature death of potential solutions. To achieve this, in addition to the above exploration controlling mechanism, a multi-tier hierarchical architecture is employed, where, in separate layers, the less fit isolated individuals evolve in dynamic sub-populations that coexist alongside the original or main population. Evaluation of the proposed technique on well known benchmark problems ascertains its superior performance. The algorithm has also been successfully applied to a real world problem of financial portfolio management. Although the proposed method cannot be considered big data-ready, it is certainly a move in the right direction.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Recently, the Big Data paradigm has received considerable attention since it gives a great opportunity to mine knowledge from massive amounts of data. However, the new mined knowledge will be useless if data is fake, or sometimes the massive amounts of data cannot be collected due to the worry on the abuse of data. This situation asks for new security solutions. On the other hand, the biggest feature of Big Data is "massive", which requires that any security solution for Big Data should be "efficient". In this paper, we propose a new identity-based generalized signcryption scheme to solve the above problems. In particular, it has the following two properties to fit the efficiency requirement. (1) It can work as an encryption scheme, a signature scheme or a signcryption scheme as per need. (2) It does not have the heavy burden on the complicated certificate management as the traditional cryptographic schemes. Furthermore, our proposed scheme can be proven-secure in the standard model. © 2014 Elsevier Inc. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The Hadoop framework provides a powerful way to handle Big Data. Since Hadoop has inherent defects of high memory overhead and low computing performance in processing massive small files, we implement three methods and propose two strategies for solving small files problem in this paper. First, we implement three methods, i.e., Hadoop Archives (HAR), Sequence Files (SF) and CombineFileInputFormat (CFIF), to compensate the existing defects of Hadoop. Moreover, we propose two strategies for meeting the actual needs of different users. Finally, we evaluate the efficiency of the implemented methods and the validity of the proposed strategies. The experimental results show that our methods and strategies can improve the efficiency of massive small files processing, thereby enhancing the overall performance of Hadoop. © 2014 ISSN 1881-803X.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Smart grid is a technological innovation that improves efficiency, reliability, economics, and sustainability of electricity services. It plays a crucial role in modern energy infrastructure. The main challenges of smart grids, however, are how to manage different types of front-end intelligent devices such as power assets and smart meters efficiently; and how to process a huge amount of data received from these devices. Cloud computing, a technology that provides computational resources on demands, is a good candidate to address these challenges since it has several good properties such as energy saving, cost saving, agility, scalability, and flexibility. In this paper, we propose a secure cloud computing based framework for big data information management in smart grids, which we call 'Smart-Frame.' The main idea of our framework is to build a hierarchical structure of cloud computing centers to provide different types of computing services for information management and big data analysis. In addition to this structural framework, we present a security solution based on identity-based encryption, signature and proxy re-encryption to address critical security issues of the proposed framework.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As a leading framework for processing and analyzing big data, MapReduce is leveraged by many enterprises to parallelize their data processing on distributed computing systems. Unfortunately, the all-to-all data forwarding from map tasks to reduce tasks in the traditional MapReduce framework would generate a large amount of network traffic. The fact that the intermediate data generated by map tasks can be combined with significant traffic reduction in many applications motivates us to propose a data aggregation scheme for MapReduce jobs in cloud. Specifically, we design an aggregation architecture under the existing MapReduce framework with the objective of minimizing the data traffic during the shuffle phase, in which aggregators can reside anywhere in the cloud. Some experimental results also show that our proposal outperforms existing work by reducing the network traffic significantly.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

With the explosion of big data, processing large numbers of continuous data streams, i.e., big data stream processing (BDSP), has become a crucial requirement for many scientific and industrial applications in recent years. By offering a pool of computation, communication and storage resources, public clouds, like Amazon's EC2, are undoubtedly the most efficient platforms to meet the ever-growing needs of BDSP. Public cloud service providers usually operate a number of geo-distributed datacenters across the globe. Different datacenter pairs are with different inter-datacenter network costs charged by Internet Service Providers (ISPs). While, inter-datacenter traffic in BDSP constitutes a large portion of a cloud provider's traffic demand over the Internet and incurs substantial communication cost, which may even become the dominant operational expenditure factor. As the datacenter resources are provided in a virtualized way, the virtual machines (VMs) for stream processing tasks can be freely deployed onto any datacenters, provided that the Service Level Agreement (SLA, e.g., quality-of-information) is obeyed. This raises the opportunity, but also a challenge, to explore the inter-datacenter network cost diversities to optimize both VM placement and load balancing towards network cost minimization with guaranteed SLA. In this paper, we first propose a general modeling framework that describes all representative inter-task relationship semantics in BDSP. Based on our novel framework, we then formulate the communication cost minimization problem for BDSP into a mixed-integer linear programming (MILP) problem and prove it to be NP-hard. We then propose a computation-efficient solution based on MILP. The high efficiency of our proposal is validated by extensive simulation based studies.

Relevância:

100.00% 100.00%

Publicador: