876 resultados para bigdata, data stream processing, dsp, apache storm, cyber security


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Most data stream classification techniques assume that the underlying feature space is static. However, in real-world applications the set of features and their relevance to the target concept may change over time. In addition, when the underlying concepts reappear, reusing previously learnt models can enhance the learning process in terms of accuracy and processing time at the expense of manageable memory consumption. In this paper, we propose mining recurring concepts in a dynamic feature space (MReC-DFS), a data stream classification system to address the challenges of learning recurring concepts in a dynamic feature space while simultaneously reducing the memory cost associated with storing past models. MReC-DFS is able to detect and adapt to concept changes using the performance of the learning process and contextual information. To handle recurring concepts, stored models are combined in a dynamically weighted ensemble. Incremental feature selection is performed to reduce the combined feature space. This contribution allows MReC-DFS to store only the features most relevant to the learnt concepts, which in turn increases the memory efficiency of the technique. In addition, an incremental feature selection method is proposed that dynamically determines the threshold between relevant and irrelevant features. Experimental results demonstrating the high accuracy of MReC-DFS compared with state-of-the-art techniques on a variety of real datasets are presented. The results also show the superior memory efficiency of MReC-DFS.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Los avances en el hardware permiten disponer de grandes volúmenes de datos, surgiendo aplicaciones que deben suministrar información en tiempo cuasi-real, la monitorización de pacientes, ej., el seguimiento sanitario de las conducciones de agua, etc. Las necesidades de estas aplicaciones hacen emerger el modelo de flujo de datos (data streaming) frente al modelo almacenar-para-despuésprocesar (store-then-process). Mientras que en el modelo store-then-process, los datos son almacenados para ser posteriormente consultados; en los sistemas de streaming, los datos son procesados a su llegada al sistema, produciendo respuestas continuas sin llegar a almacenarse. Esta nueva visión impone desafíos para el procesamiento de datos al vuelo: 1) las respuestas deben producirse de manera continua cada vez que nuevos datos llegan al sistema; 2) los datos son accedidos solo una vez y, generalmente, no son almacenados en su totalidad; y 3) el tiempo de procesamiento por dato para producir una respuesta debe ser bajo. Aunque existen dos modelos para el cómputo de respuestas continuas, el modelo evolutivo y el de ventana deslizante; éste segundo se ajusta mejor en ciertas aplicaciones al considerar únicamente los datos recibidos más recientemente, en lugar de todo el histórico de datos. En los últimos años, la minería de datos en streaming se ha centrado en el modelo evolutivo. Mientras que, en el modelo de ventana deslizante, el trabajo presentado es más reducido ya que estos algoritmos no sólo deben de ser incrementales si no que deben borrar la información que caduca por el deslizamiento de la ventana manteniendo los anteriores tres desafíos. Una de las tareas fundamentales en minería de datos es la búsqueda de agrupaciones donde, dado un conjunto de datos, el objetivo es encontrar grupos representativos, de manera que se tenga una descripción sintética del conjunto. Estas agrupaciones son fundamentales en aplicaciones como la detección de intrusos en la red o la segmentación de clientes en el marketing y la publicidad. Debido a las cantidades masivas de datos que deben procesarse en este tipo de aplicaciones (millones de eventos por segundo), las soluciones centralizadas puede ser incapaz de hacer frente a las restricciones de tiempo de procesamiento, por lo que deben recurrir a descartar datos durante los picos de carga. Para evitar esta perdida de datos, se impone el procesamiento distribuido de streams, en concreto, los algoritmos de agrupamiento deben ser adaptados para este tipo de entornos, en los que los datos están distribuidos. En streaming, la investigación no solo se centra en el diseño para tareas generales, como la agrupación, sino también en la búsqueda de nuevos enfoques que se adapten mejor a escenarios particulares. Como ejemplo, un mecanismo de agrupación ad-hoc resulta ser más adecuado para la defensa contra la denegación de servicio distribuida (Distributed Denial of Services, DDoS) que el problema tradicional de k-medias. En esta tesis se pretende contribuir en el problema agrupamiento en streaming tanto en entornos centralizados y distribuidos. Hemos diseñado un algoritmo centralizado de clustering mostrando las capacidades para descubrir agrupaciones de alta calidad en bajo tiempo frente a otras soluciones del estado del arte, en una amplia evaluación. Además, se ha trabajado sobre una estructura que reduce notablemente el espacio de memoria necesario, controlando, en todo momento, el error de los cómputos. Nuestro trabajo también proporciona dos protocolos de distribución del cómputo de agrupaciones. Se han analizado dos características fundamentales: el impacto sobre la calidad del clustering al realizar el cómputo distribuido y las condiciones necesarias para la reducción del tiempo de procesamiento frente a la solución centralizada. Finalmente, hemos desarrollado un entorno para la detección de ataques DDoS basado en agrupaciones. En este último caso, se ha caracterizado el tipo de ataques detectados y se ha desarrollado una evaluación sobre la eficiencia y eficacia de la mitigación del impacto del ataque. ABSTRACT Advances in hardware allow to collect huge volumes of data emerging applications that must provide information in near-real time, e.g., patient monitoring, health monitoring of water pipes, etc. The data streaming model emerges to comply with these applications overcoming the traditional store-then-process model. With the store-then-process model, data is stored before being consulted; while, in streaming, data are processed on the fly producing continuous responses. The challenges of streaming for processing data on the fly are the following: 1) responses must be produced continuously whenever new data arrives in the system; 2) data is accessed only once and is generally not maintained in its entirety, and 3) data processing time to produce a response should be low. Two models exist to compute continuous responses: the evolving model and the sliding window model; the latter fits best with applications must be computed over the most recently data rather than all the previous data. In recent years, research in the context of data stream mining has focused mainly on the evolving model. In the sliding window model, the work presented is smaller since these algorithms must be incremental and they must delete the information which expires when the window slides. Clustering is one of the fundamental techniques of data mining and is used to analyze data sets in order to find representative groups that provide a concise description of the data being processed. Clustering is critical in applications such as network intrusion detection or customer segmentation in marketing and advertising. Due to the huge amount of data that must be processed by such applications (up to millions of events per second), centralized solutions are usually unable to cope with timing restrictions and recur to shedding techniques where data is discarded during load peaks. To avoid discarding of data, processing of streams (such as clustering) must be distributed and adapted to environments where information is distributed. In streaming, research does not only focus on designing for general tasks, such as clustering, but also in finding new approaches that fit bests with particular scenarios. As an example, an ad-hoc grouping mechanism turns out to be more adequate than k-means for defense against Distributed Denial of Service (DDoS). This thesis contributes to the data stream mining clustering technique both for centralized and distributed environments. We present a centralized clustering algorithm showing capabilities to discover clusters of high quality in low time and we provide a comparison with existing state of the art solutions. We have worked on a data structure that significantly reduces memory requirements while controlling the error of the clusters statistics. We also provide two distributed clustering protocols. We focus on the analysis of two key features: the impact on the clustering quality when computation is distributed and the requirements for reducing the processing time compared to the centralized solution. Finally, with respect to ad-hoc grouping techniques, we have developed a DDoS detection framework based on clustering.We have characterized the attacks detected and we have evaluated the efficiency and effectiveness of mitigating the attack impact.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Optical communications receivers using wavelet signals processing is proposed in this paper for dense wavelength-division multiplexed (DWDM) systems and modal-division multiplexed (MDM) transmissions. The optical signal-to-noise ratio (OSNR) required to demodulate polarization-division multiplexed quadrature phase shift keying (PDM-QPSK) modulation format is alleviated with the wavelet denoising process. This procedure improves the bit error rate (BER) performance and increasing the transmission distance in DWDM systems. Additionally, the wavelet-based design relies on signal decomposition using time-limited basis functions allowing to reduce the computational cost in Digital-Signal-Processing (DSP) module. Attending to MDM systems, a new scheme of encoding data bits based on wavelets is presented to minimize the mode coupling in few-mode (FWF) and multimode fibers (MMF). The Shifted Prolate Wave Spheroidal (SPWS) functions are proposed to reduce the modal interference.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Stream-mining approach is defined as a set of cutting-edge techniques designed to process streams of data in real time, in order to extract knowledge. In the particular case of classification, stream-mining has to adapt its behaviour to the volatile underlying data distributions, what has been called concept drift. Moreover, it is important to note that concept drift may lead to situations where predictive models become invalid and have therefore to be updated to represent the actual concepts that data poses. In this context, there is a specific type of concept drift, known as recurrent concept drift, where the concepts represented by data have already appeared in the past. In those cases the learning process could be saved or at least minimized by applying a previously trained model. This could be extremely useful in ubiquitous environments that are characterized by the existence of resource constrained devices. To deal with the aforementioned scenario, meta-models can be used in the process of enhancing the drift detection mechanisms used by data stream algorithms, by representing and predicting when the change will occur. There are some real-world situations where a concept reappears, as in the case of intrusion detection systems (IDS), where the same incidents or an adaptation of them usually reappear over time. In these environments the early prediction of drift by means of a better knowledge of past models can help to anticipate to the change, thus improving efficiency of the model regarding the training instances needed. By means of using meta-models as a recurrent drift detection mechanism, the ability to share concepts representations among different data mining processes is open. That kind of exchanges could improve the accuracy of the resultant local model as such model may benefit from patterns similar to the local concept that were observed in other scenarios, but not yet locally. This would also improve the efficiency of training instances used during the classification process, as long as the exchange of models would aid in the application of already trained recurrent models, that have been previously seen by any of the collaborative devices. Which it is to say that the scope of recurrence detection and representation is broaden. In fact the detection, representation and exchange of concept drift patterns would be extremely useful for the law enforcement activities fighting against cyber crime. Being the information exchange one of the main pillars of cooperation, national units would benefit from the experience and knowledge gained by third parties. Moreover, in the specific scope of critical infrastructures protection it is crucial to count with information exchange mechanisms, both from a strategical and technical scope. The exchange of concept drift detection schemes in cyber security environments would aid in the process of preventing, detecting and effectively responding to threads in cyber space. Furthermore, as a complement of meta-models, a mechanism to assess the similarity between classification models is also needed when dealing with recurrent concepts. In this context, when reusing a previously trained model a rough comparison between concepts is usually made, applying boolean logic. The introduction of fuzzy logic comparisons between models could lead to a better efficient reuse of previously seen concepts, by applying not just equal models, but also similar ones. This work faces the aforementioned open issues by means of: the MMPRec system, that integrates a meta-model mechanism and a fuzzy similarity function; a collaborative environment to share meta-models between different devices; a recurrent drift generator that allows to test the usefulness of recurrent drift systems, as it is the case of MMPRec. Moreover, this thesis presents an experimental validation of the proposed contributions using synthetic and real datasets.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Frequent Itemsets mining is well explored for various data types, and its computational complexity is well understood. There are methods to deal effectively with computational problems. This paper shows another approach to further performance enhancements of frequent items sets computation. We have made a series of observations that led us to inventing data pre-processing methods such that the final step of the Partition algorithm, where a combination of all local candidate sets must be processed, is executed on substantially smaller input data. The paper shows results from several experiments that confirmed our general and formally presented observations.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A k-NN query finds the k nearest-neighbors of a given point from a point database. When it is sufficient to measure object distance using the Euclidian distance, the key to efficient k-NN query processing is to fetch and check the distances of a minimum number of points from the database. For many applications, such as vehicle movement along road networks or rover and animal movement along terrain surfaces, the distance is only meaningful when it is along a valid movement path. For this type of k-NN queries, the focus of efficient query processing is to minimize the cost of computing distances using the environment data (such as the road network data and the terrain data), which can be several orders of magnitude larger than that of the point data. Efficient processing of k-NN queries based on the Euclidian distance or the road network distance has been investigated extensively in the past. In this paper, we investigate the problem of surface k-NN query processing, where the distance is calculated from the shortest path along a terrain surface. This problem is very challenging, as the terrain data can be very large and the computational cost of finding shortest paths is very high. We propose an efficient solution based on multiresolution terrain models. Our approach eliminates the need of costly process of finding shortest paths by ranking objects using estimated lower and upper bounds of distance on multiresolution terrain models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Error free propagation of a single polarisation optical time division multiplexed 40 Gbit/s dispersion managed pulsed data stream over dispersion (non-shifted) fibre. This distance is twice the previous record at this data rate.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This thesis experimentally examines the use of different techniques for optical fibre transmission over ultra long haul distances. Its format firstly examines the use of dispersion management as a means of achieving long haul communications. Secondly, examining the use concatenated NOLMs for DM autosoliton ultra long haul propagation, by comparing their performance with a generic system without NOLMs. Thirdly, timing jitter in concatenated NOLM system is examined and compared to the generic system and lastly issues of OTDM amplitude non-uniformity from channel to channel in a saturable absorber, specifically a NOLM, are raised. Transmission at a rate of 40Gbit/s is studied in an all-Raman amplified standard fibre link with amplifier spacing of the order of 80km. We demonstrate in this thesis that the detrimental effects associated with high power Raman amplification can be minimized by dispersion map optimization. As a result, a transmission distance of 1600 km (2000km including dispersion compensating fibre) has been achieved in standard single mode fibre. The use of concatenated NOLMs to provide a stable propagation regime has been proposed theoretically. In this thesis, the observation experimentally of autosoliton propagation is shown for the first time in a dispersion managed optical transmission system. The system is based on a strong dispersion map with large amplifier spacing. Operation at transmission rates of 10, 40 and 80Gbit/s is demonstrated. With an insertion of a stabilizing element to the NOLM, the transmission of a 10 and 20Gbit/s data stream was extended and demonstrated experimentally. Error-free propagation over 100 and 20 thousand kilometres has been achieved at 10 and 20Gbit/s respectively, with terrestrial amplifier spacing. The monitor of timing jitter is of importance to all optical systems. Evolution of timing jitter in a DM autosoliton system has been studied in this thesis and analyzed at bit ranges from 10Gbit/s to 80Gbit/s. Non-linear guiding by in-line regenerators considerably changes the dynamics of jitter accumulation. As transmission systems require higher data rates, the use of OTDM will become more prolific. The dynamics of switching and transmission of an optical signal comprising individual OTDM channels of unequal amplitudes in a dispersion-managed link with in-line non-linear fibre loop mirrors is investigated.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We have recently proposed the framework of independent blind source separation as an advantageous approach to steganography. Amongst the several characteristics noted was a sensitivity to message reconstruction due to small perturbations in the sources. This characteristic is not common in most other approaches to steganography. In this paper we discuss how this sensitivity relates the joint diagonalisation inside the independent component approach, and reliance on exact knowledge of secret information, and how it can be used as an additional and inherent security mechanism against malicious attack to discovery of the hidden messages. The paper therefore provides an enhanced mechanism that can be used for e-document forensic analysis and can be applied to different dimensionality digital data media. In this paper we use a low dimensional example of biomedical time series as might occur in the electronic patient health record, where protection of the private patient information is paramount.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A novel architecture for microwave/millimeter-wave signal generation and data modulation using a fiber-grating-based distributed feedback laser has been proposed in this letter. For demonstration, a 155.52-Mb/s data stream on a 16.9-GHz subcarrier has been transmitted and recovered successfully. It has been proved that this technology would be of benefit to future microwave data transmission systems.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Error-free transmission of a single polarization optical time division multiplexed 40 Gbit/s dispersion managed pulse data stream over 1009 km has been achieved in dispersion-compensated standard (non-dispersion shifted) fibre. This distance is twice the previous record at this data rate.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A novel architecture for microwave/millimeter-wave signal generation and data modulation using a fiber-grating-based distributed feedback laser has been proposed in this letter. For demonstration, a 155.52-Mb/s data stream on a 16.9-GHz subcarrier has been transmitted and recovered successfully. It has been proved that this technology would be of benefit to future microwave data transmission systems. © 2006 IEEE.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Forests play a pivotal role in timber production, maintenance and development of biodiversity and in carbon sequestration and storage in the context of the Kyoto Protocol. Policy makers and forest experts therefore require reliable information on forest extent, type and change for management, planning and modeling purposes. It is becoming increasingly clear that such forest information is frequently inconsistent and unharmonised between countries and continents. This research paper presents a forest information portal that has been developed in line with the GEOSS and INSPIRE frameworks. The web portal provides access to forest resources data at a variety of spatial scales, from global through to regional and local, as well as providing analytical capabilities for monitoring and validating forest change. The system also allows for the utilisation of forest data and processing services within other thematic areas. The web portal has been developed using open standards to facilitate accessibility, interoperability and data transfer.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We demonstrate that the use of in-line nonlinear optical loop mirrors (NOLMs) in dispersion-managed (DM) transmission systems dominated by amplitude noise can achieve passive 2R regeneration of a 40 and 80 Gbit/s RZ data stream. This is an indication that the use of this approach could obviate the need for full-regeneration in high data rate, strong DM systems, when intra-channel four-wave mixing poses serious problems.