876 resultados para bigdata, data stream processing, dsp, apache storm, cyber security


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Internet of Things är ett samlingsbegrepp för den utveckling som innebär att olika typer av enheter kan förses med sensorer och datachip som är uppkopplade mot internet. En ökad mängd data innebär en ökad förfrågan på lösningar som kan lagra, spåra, analysera och bearbeta data. Ett sätt att möta denna förfrågan är att använda sig av molnbaserade realtidsanalystjänster. Multi-tenant och single-tenant är två typer av arkitekturer för molnbaserade realtidsanalystjänster som kan användas för att lösa problemen med hanteringen av de ökade datamängderna. Dessa arkitekturer skiljer sig åt när det gäller komplexitet i utvecklingen. I detta arbete representerar Azure Stream Analytics en multi-tenant arkitektur och HDInsight/Storm representerar en single-tenant arkitektur. För att kunna göra en jämförelse av molnbaserade realtidsanalystjänster med olika arkitekturer, har vi valt att använda oss av användbarhetskriterierna: effektivitet, ändamålsenlighet och användarnöjdhet. Vi kom fram till att vi ville ha svar på följande frågor relaterade till ovannämnda tre användbarhetskriterier: • Vilka likheter och skillnader kan vi se i utvecklingstider? • Kan vi identifiera skillnader i funktionalitet? • Hur upplever utvecklare de olika analystjänsterna? Vi har använt en design and creation strategi för att utveckla två Proof of Concept prototyper och samlat in data genom att använda flera datainsamlingsmetoder. Proof of Concept prototyperna inkluderade två artefakter, en för Azure Stream Analytics och en för HDInsight/Storm. Vi utvärderade dessa genom att utföra fem olika scenarier som var för sig hade 2-5 delmål. Vi simulerade strömmande data genom att låta en applikation kontinuerligt slumpa fram data som vi analyserade med hjälp av de två realtidsanalystjänsterna. Vi har använt oss av observationer för att dokumentera hur vi arbetade med utvecklingen av analystjänsterna samt för att mäta utvecklingstider och identifiera skillnader i funktionalitet. Vi har även använt oss av frågeformulär för att ta reda på vad användare tyckte om analystjänsterna. Vi kom fram till att Azure Stream Analytics initialt var mer användbart än HDInsight/Storm men att skillnaderna minskade efter hand. Azure Stream Analytics var lättare att arbeta med vid simplare analyser medan HDInsight/Storm hade ett bredare val av funktionalitet.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Digital signal processing (DSP) aims to extract specific information from digital signals. Digital signals are, by definition, physical quantities represented by a sequence of discrete values and from these sequences it is possible to extract and analyze the desired information. The unevenly sampled data can not be properly analyzed using standard techniques of digital signal processing. This work aimed to adapt a technique of DSP, the multiresolution analysis, to analyze unevenly smapled data, to aid the studies in the CoRoT laboratory at UFRN. The process is based on re-indexing the wavelet transform to handle unevenly sampled data properly. The was efective presenting satisfactory results

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Currently, many museums, botanic gardens and herbariums keep data of biological collections and using computational tools researchers digitalize and provide access to their data using data portals. The replication of databases in portals can be accomplished through the use of protocols and data schema. However, the implementation of this solution demands a large amount of time, concerning both the transfer of fragments of data and processing data within the portal. With the growth of data digitalization in institutions, this scenario tends to be increasingly exacerbated, making it hard to maintain the records updated on the portals. As an original contribution, this research proposes analysing the data replication process to evaluate the performance of portals. The Inter-American Biodiversity Information Network (IABIN) biodiversity data portal of pollinators was used as a study case, which supports both situations: conventional data replication of records of specimen occurrences and interactions between them. With the results of this research, it is possible to simulate a situation before its implementation, thus predicting the performance of replication operations. Additionally, these results may contribute to future improvements to this process, in order to decrease the time required to make the data available in portals. © Rinton Press.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Current scientific applications have been producing large amounts of data. The processing, handling and analysis of such data require large-scale computing infrastructures such as clusters and grids. In this area, studies aim at improving the performance of data-intensive applications by optimizing data accesses. In order to achieve this goal, distributed storage systems have been considering techniques of data replication, migration, distribution, and access parallelism. However, the main drawback of those studies is that they do not take into account application behavior to perform data access optimization. This limitation motivated this paper which applies strategies to support the online prediction of application behavior in order to optimize data access operations on distributed systems, without requiring any information on past executions. In order to accomplish such a goal, this approach organizes application behaviors as time series and, then, analyzes and classifies those series according to their properties. By knowing properties, the approach selects modeling techniques to represent series and perform predictions, which are, later on, used to optimize data access operations. This new approach was implemented and evaluated using the OptorSim simulator, sponsored by the LHC-CERN project and widely employed by the scientific community. Experiments confirm this new approach reduces application execution time in about 50 percent, specially when handling large amounts of data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Die Arbeit behandelt das Problem der Skalierbarkeit von Reinforcement Lernen auf hochdimensionale und komplexe Aufgabenstellungen. Unter Reinforcement Lernen versteht man dabei eine auf approximativem Dynamischen Programmieren basierende Klasse von Lernverfahren, die speziell Anwendung in der Künstlichen Intelligenz findet und zur autonomen Steuerung simulierter Agenten oder realer Hardwareroboter in dynamischen und unwägbaren Umwelten genutzt werden kann. Dazu wird mittels Regression aus Stichproben eine Funktion bestimmt, die die Lösung einer "Optimalitätsgleichung" (Bellman) ist und aus der sich näherungsweise optimale Entscheidungen ableiten lassen. Eine große Hürde stellt dabei die Dimensionalität des Zustandsraums dar, die häufig hoch und daher traditionellen gitterbasierten Approximationsverfahren wenig zugänglich ist. Das Ziel dieser Arbeit ist es, Reinforcement Lernen durch nichtparametrisierte Funktionsapproximation (genauer, Regularisierungsnetze) auf -- im Prinzip beliebig -- hochdimensionale Probleme anwendbar zu machen. Regularisierungsnetze sind eine Verallgemeinerung von gewöhnlichen Basisfunktionsnetzen, die die gesuchte Lösung durch die Daten parametrisieren, wodurch die explizite Wahl von Knoten/Basisfunktionen entfällt und so bei hochdimensionalen Eingaben der "Fluch der Dimension" umgangen werden kann. Gleichzeitig sind Regularisierungsnetze aber auch lineare Approximatoren, die technisch einfach handhabbar sind und für die die bestehenden Konvergenzaussagen von Reinforcement Lernen Gültigkeit behalten (anders als etwa bei Feed-Forward Neuronalen Netzen). Allen diesen theoretischen Vorteilen gegenüber steht allerdings ein sehr praktisches Problem: der Rechenaufwand bei der Verwendung von Regularisierungsnetzen skaliert von Natur aus wie O(n**3), wobei n die Anzahl der Daten ist. Das ist besonders deswegen problematisch, weil bei Reinforcement Lernen der Lernprozeß online erfolgt -- die Stichproben werden von einem Agenten/Roboter erzeugt, während er mit der Umwelt interagiert. Anpassungen an der Lösung müssen daher sofort und mit wenig Rechenaufwand vorgenommen werden. Der Beitrag dieser Arbeit gliedert sich daher in zwei Teile: Im ersten Teil der Arbeit formulieren wir für Regularisierungsnetze einen effizienten Lernalgorithmus zum Lösen allgemeiner Regressionsaufgaben, der speziell auf die Anforderungen von Online-Lernen zugeschnitten ist. Unser Ansatz basiert auf der Vorgehensweise von Recursive Least-Squares, kann aber mit konstantem Zeitaufwand nicht nur neue Daten sondern auch neue Basisfunktionen in das bestehende Modell einfügen. Ermöglicht wird das durch die "Subset of Regressors" Approximation, wodurch der Kern durch eine stark reduzierte Auswahl von Trainingsdaten approximiert wird, und einer gierigen Auswahlwahlprozedur, die diese Basiselemente direkt aus dem Datenstrom zur Laufzeit selektiert. Im zweiten Teil übertragen wir diesen Algorithmus auf approximative Politik-Evaluation mittels Least-Squares basiertem Temporal-Difference Lernen, und integrieren diesen Baustein in ein Gesamtsystem zum autonomen Lernen von optimalem Verhalten. Insgesamt entwickeln wir ein in hohem Maße dateneffizientes Verfahren, das insbesondere für Lernprobleme aus der Robotik mit kontinuierlichen und hochdimensionalen Zustandsräumen sowie stochastischen Zustandsübergängen geeignet ist. Dabei sind wir nicht auf ein Modell der Umwelt angewiesen, arbeiten weitestgehend unabhängig von der Dimension des Zustandsraums, erzielen Konvergenz bereits mit relativ wenigen Agent-Umwelt Interaktionen, und können dank des effizienten Online-Algorithmus auch im Kontext zeitkritischer Echtzeitanwendungen operieren. Wir demonstrieren die Leistungsfähigkeit unseres Ansatzes anhand von zwei realistischen und komplexen Anwendungsbeispielen: dem Problem RoboCup-Keepaway, sowie der Steuerung eines (simulierten) Oktopus-Tentakels.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Several countries have acquired, over the past decades, large amounts of area covering Airborne Electromagnetic data. Contribution of airborne geophysics has dramatically increased for both groundwater resource mapping and management proving how those systems are appropriate for large-scale and efficient groundwater surveying. We start with processing and inversion of two AEM dataset from two different systems collected over the Spiritwood Valley Aquifer area, Manitoba, Canada respectively, the AeroTEM III (commissioned by the Geological Survey of Canada in 2010) and the “Full waveform VTEM” dataset, collected and tested over the same survey area, during the fall 2011. We demonstrate that in the presence of multiple datasets, either AEM and ground data, due processing, inversion, post-processing, data integration and data calibration is the proper approach capable of providing reliable and consistent resistivity models. Our approach can be of interest to many end users, ranging from Geological Surveys, Universities to Private Companies, which are often proprietary of large geophysical databases to be interpreted for geological and\or hydrogeological purposes. In this study we deeply investigate the role of integration of several complimentary types of geophysical data collected over the same survey area. We show that data integration can improve inversions, reduce ambiguity and deliver high resolution results. We further attempt to use the final, most reliable output resistivity models as a solid basis for building a knowledge-driven 3D geological voxel-based model. A voxel approach allows a quantitative understanding of the hydrogeological setting of the area, and it can be further used to estimate the aquifers volumes (i.e. potential amount of groundwater resources) as well as hydrogeological flow model prediction. In addition, we investigated the impact of an AEM dataset towards hydrogeological mapping and 3D hydrogeological modeling, comparing it to having only a ground based TEM dataset and\or to having only boreholes data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Lo scopo dell'elaborato di tesi è l'analisi, progettazione e sviluppo di un prototipo di una infrastruttura cloud in grado di gestire un grande flusso di eventi generati da dispositivi mobili. Questi utilizzano informazioni come la posizione assunta e il valore dei sensori locali di cui possono essere equipaggiati al fine di realizzare il proprio funzionamento. Le informazioni così ottenute vengono trasmesse in modo da ottenere una rete di device in grado di acquisire autonomamente informazioni sull'ambiente ed auto-organizzarsi. La costruzione di tale struttura si colloca in un più ampio ambito di ricerca che punta a integrare metodi per la comunicazione ravvicinata con il cloud al fine di permettere la comunicazione tra dispositivi vicini in qualsiasi situazione che si potrebbe presentare in una situazione reale. A definire le specifiche della infrastruttura e quindi a impersonare il ruolo di committente è stato il relatore, Prof. Mirko Viroli, mentre lo sviluppo è stato portato avanti da me e dal correlatore, Ing. Pietro Brunetti. Visti gli studi precedenti riguardanti il cloud computing nell'area dei sistemi complessi distribuiti, Brunetti ha dato il maggiore contributo nella fase di analisi del problema e di progettazione mentre la parte riguardante la effettiva gestione degli eventi, le computazioni in cloud e lo storage dei dati è stata maggiormente affrontata da me. In particolare mi sono occupato dello studio e della implementazione del backend computazionale, basato sulla tecnologia Apache Storm, della componente di storage dei dati, basata su Neo4j, e della costruzione di un pannello di visualizzazione basato su AJAX e Linkurious. A questo va aggiunto lo studio su Apache Kafka, utilizzato come tecnologia per realizzare la comunicazione asincrona ad alte performance tra le componenti. Si è reso necessario costruire un simulatore al fine di condurre i test per verificare il funzionamento della infrastruttura prototipale e per saggiarne l'effettiva scalabilità, considerato il potenziale numero di dispositivi da sostenere che può andare dalle decine alle migliaia. La sfida più importante riguarda la gestione della vicinanza tra dispositivi e la possibilità di scalare la computazione su più macchine. Per questo motivo è stato necessario far uso di tecnologie per l'esecuzione delle operazioni di memorizzazione, calcolo e trasmissione dei dati in grado di essere eseguite su un cluster e garantire una accettabile fault-tolerancy. Da questo punto di vista i lavori che hanno portato alla costruzione della infrastruttura sono risultati essere un'ottima occasione per prendere familiarità con tecnologie prima sconosciute. Quasi tutte le tecnologie utilizzate fanno parte dell'ecosistema Apache e, come esposto all'interno della tesi, stanno ricevendo una grande attenzione da importanti realtà proprio in questo periodo, specialmente Apache Storm e Kafka. Il software prodotto per la costruzione della infrastruttura è completamente sviluppato in Java a cui si aggiunge la componente web di visualizzazione sviluppata in Javascript.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Transformer protection is one of the most challenging applications within the power system protective relay field. Transformers with a capacity rating exceeding 10 MVA are usually protected using differential current relays. Transformers are an aging and vulnerable bottleneck in the present power grid; therefore, quick fault detection and corresponding transformer de-energization is the key element in minimizing transformer damage. Present differential current relays are based on digital signal processing (DSP). They combine DSP phasor estimation and protective-logic-based decision making. The limitations of existing DSP-based differential current relays must be identified to determine the best protection options for sensitive and quick fault detection. The development, implementation, and evaluation of a DSP differential current relay is detailed. The overall goal is to make fault detection faster without compromising secure and safe transformer operation. A detailed background on the DSP differential current relay is provided. Then different DSP phasor estimation filters are implemented and evaluated based on their ability to extract desired frequency components from the measured current signal quickly and accurately. The main focus of the phasor estimation evaluation is to identify the difference between using non-recursive and recursive filtering methods. Then the protective logic of the DSP differential current relay is implemented and required settings made in accordance with transformer application. Finally, the DSP differential current relay will be evaluated using available transformer models within the ATP simulation environment. Recursive filtering methods were found to have significant advantage over non-recursive filtering methods when evaluated individually and when applied in the DSP differential relay. Recursive filtering methods can be up to 50% faster than non-recursive methods, but can cause false trip due to overshoot if the only objective is speed. The relay sensitivity is however independent of filtering method and depends on the settings of the relay’s differential characteristics (pickup threshold and percent slope).

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper considers a framework where data from correlated sources are transmitted with the help of network coding in ad hoc network topologies. The correlated data are encoded independently at sensors and network coding is employed in the intermediate nodes in order to improve the data delivery performance. In such settings, we focus on the problem of reconstructing the sources at decoder when perfect decoding is not possible due to losses or bandwidth variations. We show that the source data similarity can be used at decoder to permit decoding based on a novel and simple approximate decoding scheme. We analyze the influence of the network coding parameters and in particular the size of finite coding fields on the decoding performance. We further determine the optimal field size that maximizes the expected decoding performance as a trade-off between information loss incurred by limiting the resolution of the source data and the error probability in the reconstructed data. Moreover, we show that the performance of the approximate decoding improves when the accuracy of the source model increases even with simple approximate decoding techniques. We provide illustrative examples showing how the proposed algorithm can be deployed in sensor networks and distributed imaging applications.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

BACKGROUND Record linkage of existing individual health care data is an efficient way to answer important epidemiological research questions. Reuse of individual health-related data faces several problems: Either a unique personal identifier, like social security number, is not available or non-unique person identifiable information, like names, are privacy protected and cannot be accessed. A solution to protect privacy in probabilistic record linkages is to encrypt these sensitive information. Unfortunately, encrypted hash codes of two names differ completely if the plain names differ only by a single character. Therefore, standard encryption methods cannot be applied. To overcome these challenges, we developed the Privacy Preserving Probabilistic Record Linkage (P3RL) method. METHODS In this Privacy Preserving Probabilistic Record Linkage method we apply a three-party protocol, with two sites collecting individual data and an independent trusted linkage center as the third partner. Our method consists of three main steps: pre-processing, encryption and probabilistic record linkage. Data pre-processing and encryption are done at the sites by local personnel. To guarantee similar quality and format of variables and identical encryption procedure at each site, the linkage center generates semi-automated pre-processing and encryption templates. To retrieve information (i.e. data structure) for the creation of templates without ever accessing plain person identifiable information, we introduced a novel method of data masking. Sensitive string variables are encrypted using Bloom filters, which enables calculation of similarity coefficients. For date variables, we developed special encryption procedures to handle the most common date errors. The linkage center performs probabilistic record linkage with encrypted person identifiable information and plain non-sensitive variables. RESULTS In this paper we describe step by step how to link existing health-related data using encryption methods to preserve privacy of persons in the study. CONCLUSION Privacy Preserving Probabilistic Record linkage expands record linkage facilities in settings where a unique identifier is unavailable and/or regulations restrict access to the non-unique person identifiable information needed to link existing health-related data sets. Automated pre-processing and encryption fully protect sensitive information ensuring participant confidentiality. This method is suitable not just for epidemiological research but also for any setting with similar challenges.

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Sensor networks are increasingly becoming one of the main sources of Big Data on the Web. However, the observations that they produce are made available with heterogeneous schemas, vocabularies and data formats, making it difficult to share and reuse these data for other purposes than those for which they were originally set up. In this thesis we address these challenges, considering how we can transform streaming raw data to rich ontology-based information that is accessible through continuous queries for streaming data. Our main contribution is an ontology-based approach for providing data access and query capabilities to streaming data sources, allowing users to express their needs at a conceptual level, independent of implementation and language-specific details. We introduce novel query rewriting and data translation techniques that rely on mapping definitions relating streaming data models to ontological concepts. Specific contributions include: • The syntax and semantics of the SPARQLStream query language for ontologybased data access, and a query rewriting approach for transforming SPARQLStream queries into streaming algebra expressions. • The design of an ontology-based streaming data access engine that can internally reuse an existing data stream engine, complex event processor or sensor middleware, using R2RML mappings for defining relationships between streaming data models and ontology concepts. Concerning the sensor metadata of such streaming data sources, we have investigated how we can use raw measurements to characterize streaming data, producing enriched data descriptions in terms of ontological models. Our specific contributions are: • A representation of sensor data time series that captures gradient information that is useful to characterize types of sensor data. • A method for classifying sensor data time series and determining the type of data, using data mining techniques, and a method for extracting semantic sensor metadata features from the time series.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper presents a data-intensive architecture that demonstrates the ability to support applications from a wide range of application domains, and support the different types of users involved in defining, designing and executing data-intensive processing tasks. The prototype architecture is introduced, and the pivotal role of DISPEL as a canonical language is explained. The architecture promotes the exploration and exploitation of distributed and heterogeneous data and spans the complete knowledge discovery process, from data preparation, to analysis, to evaluation and reiteration. The architecture evaluation included large-scale applications from astronomy, cosmology, hydrology, functional genetics, imaging processing and seismology.