947 resultados para Data pre-processing
Resumo:
Discovery Driven Analysis (DDA) is a common feature of OLAP technology to analyze structured data. In essence, DDA helps analysts to discover anomalous data by highlighting 'unexpected' values in the OLAP cube. By giving indications to the analyst on what dimensions to explore, DDA speeds up the process of discovering anomalies and their causes. However, Discovery Driven Analysis (and OLAP in general) is only applicable on structured data, such as records in databases. We propose a system to extend DDA technology to semi-structured text documents, that is, text documents with a few structured data. Our system pipeline consists of two stages: first, the text part of each document is structured around user specified dimensions, using semi-PLSA algorithm; then, we adapt DDA to these fully structured documents, thus enabling DDA on text documents. We present some applications of this system in OLAP analysis and show how scalability issues are solved. Results show that our system can handle reasonable datasets of documents, in real time, without any need for pre-computation.
Resumo:
Edge-labeled graphs have proliferated rapidly over the last decade due to the increased popularity of social networks and the Semantic Web. In social networks, relationships between people are represented by edges and each edge is labeled with a semantic annotation. Hence, a huge single graph can express many different relationships between entities. The Semantic Web represents each single fragment of knowledge as a triple (subject, predicate, object), which is conceptually identical to an edge from subject to object labeled with predicates. A set of triples constitutes an edge-labeled graph on which knowledge inference is performed. Subgraph matching has been extensively used as a query language for patterns in the context of edge-labeled graphs. For example, in social networks, users can specify a subgraph matching query to find all people that have certain neighborhood relationships. Heavily used fragments of the SPARQL query language for the Semantic Web and graph queries of other graph DBMS can also be viewed as subgraph matching over large graphs. Though subgraph matching has been extensively studied as a query paradigm in the Semantic Web and in social networks, a user can get a large number of answers in response to a query. These answers can be shown to the user in accordance with an importance ranking. In this thesis proposal, we present four different scoring models along with scalable algorithms to find the top-k answers via a suite of intelligent pruning techniques. The suggested models consist of a practically important subset of the SPARQL query language augmented with some additional useful features. The first model called Substitution Importance Query (SIQ) identifies the top-k answers whose scores are calculated from matched vertices' properties in each answer in accordance with a user-specified notion of importance. The second model called Vertex Importance Query (VIQ) identifies important vertices in accordance with a user-defined scoring method that builds on top of various subgraphs articulated by the user. Approximate Importance Query (AIQ), our third model, allows partial and inexact matchings and returns top-k of them with a user-specified approximation terms and scoring functions. In the fourth model called Probabilistic Importance Query (PIQ), a query consists of several sub-blocks: one mandatory block that must be mapped and other blocks that can be opportunistically mapped. The probability is calculated from various aspects of answers such as the number of mapped blocks, vertices' properties in each block and so on and the most top-k probable answers are returned. An important distinguishing feature of our work is that we allow the user a huge amount of freedom in specifying: (i) what pattern and approximation he considers important, (ii) how to score answers - irrespective of whether they are vertices or substitution, and (iii) how to combine and aggregate scores generated by multiple patterns and/or multiple substitutions. Because so much power is given to the user, indexing is more challenging than in situations where additional restrictions are imposed on the queries the user can ask. The proposed algorithms for the first model can also be used for answering SPARQL queries with ORDER BY and LIMIT, and the method for the second model also works for SPARQL queries with GROUP BY, ORDER BY and LIMIT. We test our algorithms on multiple real-world graph databases, showing that our algorithms are far more efficient than popular triple stores.
Resumo:
This document does NOT address the issue of oxygen data quality control (either real-time or delayed mode). As a preliminary step towards that goal, this document seeks to ensure that all countries deploying floats equipped with oxygen sensors document the data and metadata related to these floats properly. We produced this document in response to action item 14 from the AST-10 meeting in Hangzhou (March 22-23, 2009). Action item 14: Denis Gilbert to work with Taiyo Kobayashi and Virginie Thierry to ensure DACs are processing oxygen data according to recommendations. If the recommendations contained herein are followed, we will end up with a more uniform set of oxygen data within the Argo data system, allowing users to begin analysing not only their own oxygen data, but also those of others, in the true spirit of Argo data sharing. Indications provided in this document are valid as of the date of writing this document. It is very likely that changes in sensors, calibrations and conversions equations will occur in the future. Please contact V. Thierry (vthierry@ifremer.fr) for any inconsistencies or missing information. A dedicated webpage on the Argo Data Management website (www) contains all information regarding Argo oxygen data management : current and previous version of this cookbook, oxygen sensor manuals, calibration sheet examples, examples of matlab code to process oxygen data, test data, etc..
Resumo:
This thesis reports on an investigation of the feasibility and usefulness of incorporating dynamic management facilities for managing sensed context data in a distributed contextaware mobile application. The investigation focuses on reducing the work required to integrate new sensed context streams in an existing context aware architecture. Current architectures require integration work for new streams and new contexts that are encountered. This means of operation is acceptable for current fixed architectures. However, as systems become more mobile the number of discoverable streams increases. Without the ability to discover and use these new streams the functionality of any given device will be limited to the streams that it knows how to decode. The integration of new streams requires that the sensed context data be understood by the current application. If the new source provides data of a type that an application currently requires then the new source should be connected to the application without any prior knowledge of the new source. If the type is similar and can be converted then this stream too should be appropriated by the application. Such applications are based on portable devices (phones, PDAs) for semi-autonomous services that use data from sensors connected to the devices, plus data exchanged with other such devices and remote servers. Such applications must handle input from a variety of sensors, refining the data locally and managing its communication from the device in volatile and unpredictable network conditions. The choice to focus on locally connected sensory input allows for the introduction of privacy and access controls. This local control can determine how the information is communicated to others. This investigation focuses on the evaluation of three approaches to sensor data management. The first system is characterised by its static management based on the pre-pended metadata. This was the reference system. Developed for a mobile system, the data was processed based on the attached metadata. The code that performed the processing was static. The second system was developed to move away from the static processing and introduce a greater freedom of handling for the data stream, this resulted in a heavy weight approach. The approach focused on pushing the processing of the data into a number of networked nodes rather than the monolithic design of the previous system. By creating a separate communication channel for the metadata it is possible to be more flexible with the amount and type of data transmitted. The final system pulled the benefits of the other systems together. By providing a small management class that would load a separate handler based on the incoming data, Dynamism was maximised whilst maintaining ease of code understanding. The three systems were then compared to highlight their ability to dynamically manage new sensed context. The evaluation took two approaches, the first is a quantitative analysis of the code to understand the complexity of the relative three systems. This was done by evaluating what changes to the system were involved for the new context. The second approach takes a qualitative view of the work required by the software engineer to reconfigure the systems to provide support for a new data stream. The evaluation highlights the various scenarios in which the three systems are most suited. There is always a trade-o↵ in the development of a system. The three approaches highlight this fact. The creation of a statically bound system can be quick to develop but may need to be completely re-written if the requirements move too far. Alternatively a highly dynamic system may be able to cope with new requirements but the developer time to create such a system may be greater than the creation of several simpler systems.
Resumo:
The CATARINA Leg1 cruise was carried out from June 22 to July 24 2012 on board the B/O Sarmiento de Gamboa, under the scientific supervision of Aida Rios (CSIC-IIM). It included the occurrence of the OVIDE hydrological section that was performed in June 2002, 2004, 2006, 2008 and 2010, as part of the CLIVAR program (name A25) ), and under the supervision of Herlé Mercier (CNRSLPO). This section begins near Lisbon (Portugal), runs through the West European Basin and the Iceland Basin, crosses the Reykjanes Ridge (300 miles north of Charlie-Gibbs Fracture Zone, and ends at Cape Hoppe (southeast tip of Greenland). The objective of this repeated hydrological section is to monitor the variability of water mass properties and main current transports in the basin, complementing the international observation array relevant for climate studies. In addition, the Labrador Sea was partly sampled (stations 101-108) between Greenland and Newfoundland, but heavy weather conditions prevented the achievement of the section south of 53°40’N. The quality of CTD data is essential to reach the first objective of the CATARINA project, i.e. to quantify the Meridional Overturning Circulation and water mass ventilation changes and their effect on the changes in the anthropogenic carbon ocean uptake and storage capacity. The CATARINA project was mainly funded by the Spanish Ministry of Sciences and Innovation and co-funded by the Fondo Europeo de Desarrollo Regional. The hydrological OVIDE section includes 95 surface-bottom stations from coast to coast, collecting profiles of temperature, salinity, oxygen and currents, spaced by 2 to 25 Nm depending on the steepness of the topography. The position of the stations closely follows that of OVIDE 2002. In addition, 8 stations were carried out in the Labrador Sea. From the 24 bottles closed at various depth at each stations, samples of sea water are used for salinity and oxygen calibration, and for measurements of biogeochemical components that are not reported here. The data were acquired with a Seabird CTD (SBE911+) and an SBE43 for the dissolved oxygen, belonging to the Spanish UTM group. The software SBE data processing was used after decoding and cleaning the raw data. Then, the LPO matlab toolbox was used to calibrate and bin the data as it was done for the previous OVIDE cruises, using on the one hand pre and post-cruise calibration results for the pressure and temperature sensors (done at Ifremer) and on the other hand the water samples of the 24 bottles of the rosette at each station for the salinity and dissolved oxygen data. A final accuracy of 0.002°C, 0.002 psu and 0.04 ml/l (2.3 umol/kg) was obtained on final profiles of temperature, salinity and dissolved oxygen, compatible with international requirements issued from the WOCE program.
Resumo:
Freeze drying technology can give good quality attributes of vegetables and fruits in terms of color, nutrition, volume, rehydration kinetics, stability during storage, among others, when compared with solely air dried ones. However, published scientific works showed that treatments applied before and after air dehydration are effective in food attributes, improving its quality. Therefore, the hypothesis of the present thesis was focus in a vast research of scientific work that showed the possibility to apply a pre-treatment and a post-treatment to food products combined with conventional air drying aiming being close, or even better, to the quality that a freeze dried product can give. Such attributes are the enzymatic inactivation, stability during storage, drying and rehydration kinetics, color, nutrition, volume and texture/structure. With regard to pre-treatments, the ones studied along the present work were: water blanching, steam blanching, ultrasound, freezing, high pressure and osmotic dehydration. High electric pulsed field was also studied but the food attributes were not explained on detailed. Basically, water and steam blanching showed to be adequate to inactivate enzymes in order to prevent enzymatic browning and preserve the product quality during long storage periods. With regard to ultrasound pre-treatment the published results pointed that ultrasound is an effective pre-treatment to reduce further drying times, improve rehydration kinetics and color retention. On the other hand, studies showed that ultrasound allow sugars losses and, in some cases, can lead to cell disruption. For freezing pre-treatment an overall conclusion was difficult to draw for some food attributes, since, each fruit or vegetable is unique and freezing comprises a lot of variables. However, for the studied cases, freezing showed to be a pre-treatment able to enhance rehydration kinetics and color attributes. High pressure pre-treatment showed to inactivate enzymes improving storage stability of food and showed to have a positive performance in terms of rehydration. For other attributes, when high pressure technology was applied, the literature showed divergent results according with the crops used. Finally, osmotic dehydration has been widely used in food processing to incorporate a desired salt or sugar present in aqueous solution into the cellular structure of food matrix (improvement of nutrition attribute). Moreover, osmotic dehydration lead to shorter drying times and the impregnation of solutes during osmose allow cellular strengthens of food. In case of post-treatments, puffing and a new technology denominated as instant controlled pressure drop (DIC) were reported in the literature as treatments able to improve diverse Abstract Effect of Pre-treatments and Post-treatments on Drying Products x food attributes. Basically, both technologies are similar where the product is submitted to a high pressure step and the process can make use of different heating mediums such as CO2, steam, air and N2. However, there exist a significant difference related with the final stage of both which can comprise the quality of the final product. On the other hand, puffing and DIC are used to expand cellular tissues improving the volume of food samples, helping in rehydration kinetics as posterior procedure, among others. The effectiveness of such pre and/or post-treatments is dependent on the state of the vegetables and fruits used which are also dependent of its cellular structure, variety, origin, state (fresh, ripe, raw), harvesting conditions, etc. In conclusion, as it was seen in the open literature, the application of pre-treatments and post-treatments coupled with a conventional air dehydration aim to give dehydrated food products with similar quality of freeze dried ones. Along the present Master thesis the experimental data was removed due to confidential reasons of the company Unilever R&D Vlaardingen
Resumo:
By providing vehicle-to-vehicle and vehicle-to-infrastructure wireless communications, vehicular ad hoc networks (VANETs), also known as the “networks on wheels”, can greatly enhance traffic safety, traffic efficiency and driving experience for intelligent transportation system (ITS). However, the unique features of VANETs, such as high mobility and uneven distribution of vehicular nodes, impose critical challenges of high efficiency and reliability for the implementation of VANETs. This dissertation is motivated by the great application potentials of VANETs in the design of efficient in-network data processing and dissemination. Considering the significance of message aggregation, data dissemination and data collection, this dissertation research targets at enhancing the traffic safety and traffic efficiency, as well as developing novel commercial applications, based on VANETs, following four aspects: 1) accurate and efficient message aggregation to detect on-road safety relevant events, 2) reliable data dissemination to reliably notify remote vehicles, 3) efficient and reliable spatial data collection from vehicular sensors, and 4) novel promising applications to exploit the commercial potentials of VANETs. Specifically, to enable cooperative detection of safety relevant events on the roads, the structure-less message aggregation (SLMA) scheme is proposed to improve communication efficiency and message accuracy. The scheme of relative position based message dissemination (RPB-MD) is proposed to reliably and efficiently disseminate messages to all intended vehicles in the zone-of-relevance in varying traffic density. Due to numerous vehicular sensor data available based on VANETs, the scheme of compressive sampling based data collection (CS-DC) is proposed to efficiently collect the spatial relevance data in a large scale, especially in the dense traffic. In addition, with novel and efficient solutions proposed for the application specific issues of data dissemination and data collection, several appealing value-added applications for VANETs are developed to exploit the commercial potentials of VANETs, namely general purpose automatic survey (GPAS), VANET-based ambient ad dissemination (VAAD) and VANET based vehicle performance monitoring and analysis (VehicleView). Thus, by improving the efficiency and reliability in in-network data processing and dissemination, including message aggregation, data dissemination and data collection, together with the development of novel promising applications, this dissertation will help push VANETs further to the stage of massive deployment.
Resumo:
Healthcare systems have assimilated information and communication technologies in order to improve the quality of healthcare and patient's experience at reduced costs. The increasing digitalization of people's health information raises however new threats regarding information security and privacy. Accidental or deliberate data breaches of health data may lead to societal pressures, embarrassment and discrimination. Information security and privacy are paramount to achieve high quality healthcare services, and further, to not harm individuals when providing care. With that in mind, we give special attention to the category of Mobile Health (mHealth) systems. That is, the use of mobile devices (e.g., mobile phones, sensors, PDAs) to support medical and public health. Such systems, have been particularly successful in developing countries, taking advantage of the flourishing mobile market and the need to expand the coverage of primary healthcare programs. Many mHealth initiatives, however, fail to address security and privacy issues. This, coupled with the lack of specific legislation for privacy and data protection in these countries, increases the risk of harm to individuals. The overall objective of this thesis is to enhance knowledge regarding the design of security and privacy technologies for mHealth systems. In particular, we deal with mHealth Data Collection Systems (MDCSs), which consists of mobile devices for collecting and reporting health-related data, replacing paper-based approaches for health surveys and surveillance. This thesis consists of publications contributing to mHealth security and privacy in various ways: with a comprehensive literature review about mHealth in Brazil; with the design of a security framework for MDCSs (SecourHealth); with the design of a MDCS (GeoHealth); with the design of Privacy Impact Assessment template for MDCSs; and with the study of ontology-based obfuscation and anonymisation functions for health data.
Resumo:
The key functional operability in the pre-Lisbon PJCCM pillar of the EU is the exchange of intelligence and information amongst the law enforcement bodies of the EU. The twin issues of data protection and data security within what was the EU’s third pillar legal framework therefore come to the fore. With the Lisbon Treaty reform of the EU, and the increased role of the Commission in PJCCM policy areas, and the integration of the PJCCM provisions with what have traditionally been the pillar I activities of Frontex, the opportunity for streamlining the data protection and data security provisions of the law enforcement bodies of the post-Lisbon EU arises. This is recognised by the Commission in their drafting of an amending regulation for Frontex , when they say that they would prefer “to return to the question of personal data in the context of the overall strategy for information exchange to be presented later this year and also taking into account the reflection to be carried out on how to further develop cooperation between agencies in the justice and home affairs field as requested by the Stockholm programme.” The focus of the literature published on this topic, has for the most part, been on the data protection provisions in Pillar I, EC. While the focus of research has recently sifted to the previously Pillar III PJCCM provisions on data protection, a more focused analysis of the interlocking issues of data protection and data security needs to be made in the context of the law enforcement bodies, particularly with regard to those which were based in the pre-Lisbon third pillar. This paper will make a contribution to that debate, arguing that a review of both the data protection and security provision post-Lisbon is required, not only in order to reinforce individual rights, but also inter-agency operability in combating cross-border EU crime. The EC’s provisions on data protection, as enshrined by Directive 95/46/EC, do not apply to the legal frameworks covering developments within the third pillar of the EU. Even Council Framework Decision 2008/977/JHA, which is supposed to cover data protection provisions within PJCCM expressly states that its provisions do not apply to “Europol, Eurojust, the Schengen Information System (SIS)” or to the Customs Information System (CIS). In addition, the post Treaty of Prüm provisions covering the sharing of DNA profiles, dactyloscopic data and vehicle registration data pursuant to Council Decision 2008/615/JHA, are not to be covered by the provisions of the 2008 Framework Decision. As stated by Hijmans and Scirocco, the regime is “best defined as a patchwork of data protection regimes”, with “no legal framework which is stable and unequivocal, like Directive 95/46/EC in the First pillar”. Data security issues are also key to the sharing of data in organised crime or counterterrorism situations. This article will critically analyse the current legal framework for data protection and security within the third pillar of the EU.
Resumo:
The fourth industrial revolution, also known as Industry 4.0, has rapidly gained traction in businesses across Europe and the world, becoming a central theme in small, medium, and large enterprises alike. This new paradigm shifts the focus from locally-based and barely automated firms to a globally interconnected industrial sector, stimulating economic growth and productivity, and supporting the upskilling and reskilling of employees. However, despite the maturity and scalability of information and cloud technologies, the support systems already present in the machine field are often outdated and lack the necessary security, access control, and advanced communication capabilities. This dissertation proposes architectures and technologies designed to bridge the gap between Operational and Information Technology, in a manner that is non-disruptive, efficient, and scalable. The proposal presents cloud-enabled data-gathering architectures that make use of the newest IT and networking technologies to achieve the desired quality of service and non-functional properties. By harnessing industrial and business data, processes can be optimized even before product sale, while the integrated environment enhances data exchange for post-sale support. The architectures have been tested and have shown encouraging performance results, providing a promising solution for companies looking to embrace Industry 4.0, enhance their operational capabilities, and prepare themselves for the upcoming fifth human-centric revolution.
Resumo:
In questo elaborato vengono analizzate differenti tecniche per la detection di jammer attivi e costanti in una comunicazione satellitare in uplink. Osservando un numero limitato di campioni ricevuti si vuole identificare la presenza di un jammer. A tal fine sono stati implementati i seguenti classificatori binari: support vector machine (SVM), multilayer perceptron (MLP), spectrum guarding e autoencoder. Questi algoritmi di apprendimento automatico dipendono dalle features che ricevono in ingresso, per questo motivo è stata posta particolare attenzione alla loro scelta. A tal fine, sono state confrontate le accuratezze ottenute dai detector addestrati utilizzando differenti tipologie di informazione come: i segnali grezzi nel tempo, le statistical features, le trasformate wavelet e lo spettro ciclico. I pattern prodotti dall’estrazione di queste features dai segnali satellitari possono avere dimensioni elevate, quindi, prima della detection, vengono utilizzati i seguenti algoritmi per la riduzione della dimensionalità: principal component analysis (PCA) e linear discriminant analysis (LDA). Lo scopo di tale processo non è quello di eliminare le features meno rilevanti, ma combinarle in modo da preservare al massimo l’informazione, evitando problemi di overfitting e underfitting. Le simulazioni numeriche effettuate hanno evidenziato come lo spettro ciclico sia in grado di fornire le features migliori per la detection producendo però pattern di dimensioni elevate, per questo motivo è stato necessario l’utilizzo di algoritmi di riduzione della dimensionalità. In particolare, l'algoritmo PCA è stato in grado di estrarre delle informazioni migliori rispetto a LDA, le cui accuratezze risentivano troppo del tipo di jammer utilizzato nella fase di addestramento. Infine, l’algoritmo che ha fornito le prestazioni migliori è stato il Multilayer Perceptron che ha richiesto tempi di addestramento contenuti e dei valori di accuratezza elevati.
Resumo:
The recording and processing of voice data raises increasing privacy concerns for users and service providers. One way to address these issues is to move processing on the edge device closer to the recording so that potentially identifiable information is not transmitted over the internet. However, this is often not possible due to hardware limitations. An interesting alternative is the development of voice anonymization techniques that remove individual speakers characteristics while preserving linguistic and acoustic information in the data. In this work, a state-of-the-art approach to sequence-to-sequence speech conversion, ini- tially based on x-vectors and bottleneck features for automatic speech recognition, is explored to disentangle the two acoustic information using different pre-trained speech and speakers representation. Furthermore, different strategies for selecting target speech representations are analyzed. Results on public datasets in terms of equal error rate and word error rate show that good privacy is achieved with limited impact on converted speech quality relative to the original method.
Resumo:
Artificial Intelligence (AI) has substantially influenced numerous disciplines in recent years. Biology, chemistry, and bioinformatics are among them, with significant advances in protein structure prediction, paratope prediction, protein-protein interactions (PPIs), and antibody-antigen interactions. Understanding PPIs is critical since they are responsible for practically everything living and have several uses in vaccines, cancer, immunology, and inflammatory illnesses. Machine Learning (ML) offers enormous potential for effectively simulating antibody-antigen interactions and improving in-silico optimization of therapeutic antibodies for desired features, including binding activity, stability, and low immunogenicity. This research looks at the use of AI algorithms to better understand antibody-antigen interactions, and it further expands and explains several difficulties encountered in the field. Furthermore, we contribute by presenting a method that outperforms existing state-of-the-art strategies in paratope prediction from sequence data.
Resumo:
To identify risk factors associated with post-operative temporomandibular joint dysfunction after craniotomy. The study sample included 24 patients, mean age of 37.3 ± 10 years; eligible for surgery for refractory epilepsy, evaluated according to RDC/TMD before and after surgery. The primary predictor was the time after the surgery. The primary outcome variable was maximal mouth opening. Other outcome variables were: disc displacement, bruxism, TMJ sound, TMJ pain, and pain associated to mandibular movements. Data analyses were performed using bivariate and multiple regression methods. The maximal mouth opening was significantly reduced after surgery in all patients (p = 0.03). In the multiple regression model, time of evaluation and pre-operative bruxism were significantly (p < .05) associated with an increased risk for TMD post-surgery. A significant correlation between surgery follow-up time and maximal opening mouth was found. Pre-operative bruxism was associated with increased risk for temporomandibular joint dysfunction after craniotomy.
Resumo:
A method using the ring-oven technique for pre-concentration in filter paper discs and near infrared hyperspectral imaging is proposed to identify four detergent and dispersant additives, and to determine their concentration in gasoline. Different approaches were used to select the best image data processing in order to gather the relevant spectral information. This was attained by selecting the pixels of the region of interest (ROI), using a pre-calculated threshold value of the PCA scores arranged as histograms, to select the spectra set; summing up the selected spectra to achieve representativeness; and compensating for the superimposed filter paper spectral information, also supported by scores histograms for each individual sample. The best classification model was achieved using linear discriminant analysis and genetic algorithm (LDA/GA), whose correct classification rate in the external validation set was 92%. Previous classification of the type of additive present in the gasoline is necessary to define the PLS model required for its quantitative determination. Considering that two of the additives studied present high spectral similarity, a PLS regression model was constructed to predict their content in gasoline, while two additional models were used for the remaining additives. The results for the external validation of these regression models showed a mean percentage error of prediction varying from 5 to 15%.