871 resultados para Data detection
Resumo:
This thesis stems from the project with real-time environmental monitoring company EMSAT Corporation. They were looking for methods to automatically ag spikes and other anomalies in their environmental sensor data streams. The problem presents several challenges: near real-time anomaly detection, absence of labeled data and time-changing data streams. Here, we address this problem using both a statistical parametric approach as well as a non-parametric approach like Kernel Density Estimation (KDE). The main contribution of this thesis is extending the KDE to work more effectively for evolving data streams, particularly in presence of concept drift. To address that, we have developed a framework for integrating Adaptive Windowing (ADWIN) change detection algorithm with KDE. We have tested this approach on several real world data sets and received positive feedback from our industry collaborator. Some results appearing in this thesis have been presented at ECML PKDD 2015 Doctoral Consortium.
Resumo:
Peer reviewed
Resumo:
Head motion during a Positron Emission Tomography (PET) brain scan can considerably degrade image quality. External motion-tracking devices have proven successful in minimizing this effect, but the associated time, maintenance, and workflow changes inhibit their widespread clinical use. List-mode PET acquisition allows for the retroactive analysis of coincidence events on any time scale throughout a scan, and therefore potentially offers a data-driven motion detection and characterization technique. An algorithm was developed to parse list-mode data, divide the full acquisition into short scan intervals, and calculate the line-of-response (LOR) midpoint average for each interval. These LOR midpoint averages, known as “radioactivity centroids,” were presumed to represent the center of the radioactivity distribution in the scanner, and it was thought that changes in this metric over time would correspond to intra-scan motion.
Several scans were taken of the 3D Hoffman brain phantom on a GE Discovery IQ PET/CT scanner to test the ability of the radioactivity to indicate intra-scan motion. Each scan incrementally surveyed motion in a different degree of freedom (2 translational and 2 rotational). The radioactivity centroids calculated from these scans correlated linearly to phantom positions/orientations. Centroid measurements over 1-second intervals performed on scans with ~1mCi of activity in the center of the field of view had standard deviations of 0.026 cm in the x- and y-dimensions and 0.020 cm in the z-dimension, which demonstrates high precision and repeatability in this metric. Radioactivity centroids are thus shown to successfully represent discrete motions on the submillimeter scale. It is also shown that while the radioactivity centroid can precisely indicate the amount of motion during an acquisition, it fails to distinguish what type of motion occurred.
Resumo:
Current state of the art techniques for landmine detection in ground penetrating radar (GPR) utilize statistical methods to identify characteristics of a landmine response. This research makes use of 2-D slices of data in which subsurface landmine responses have hyperbolic shapes. Various methods from the field of visual image processing are adapted to the 2-D GPR data, producing superior landmine detection results. This research goes on to develop a physics-based GPR augmentation method motivated by current advances in visual object detection. This GPR specific augmentation is used to mitigate issues caused by insufficient training sets. This work shows that augmentation improves detection performance under training conditions that are normally very difficult. Finally, this work introduces the use of convolutional neural networks as a method to learn feature extraction parameters. These learned convolutional features outperform hand-designed features in GPR detection tasks. This work presents a number of methods, both borrowed from and motivated by the substantial work in visual image processing. The methods developed and presented in this work show an improvement in overall detection performance and introduce a method to improve the robustness of statistical classification.
Resumo:
The main objective of this work was to develop a novel dimensionality reduction technique as a part of an integrated pattern recognition solution capable of identifying adulterants such as hazelnut oil in extra virgin olive oil at low percentages based on spectroscopic chemical fingerprints. A novel Continuous Locality Preserving Projections (CLPP) technique is proposed which allows the modelling of the continuous nature of the produced in-house admixtures as data series instead of discrete points. The maintenance of the continuous structure of the data manifold enables the better visualisation of this examined classification problem and facilitates the more accurate utilisation of the manifold for detecting the adulterants. The performance of the proposed technique is validated with two different spectroscopic techniques (Raman and Fourier transform infrared, FT-IR). In all cases studied, CLPP accompanied by k-Nearest Neighbors (kNN) algorithm was found to outperform any other state-of-the-art pattern recognition techniques.
Resumo:
Data mining can be defined as the extraction of implicit, previously un-known, and potentially useful information from data. Numerous re-searchers have been developing security technology and exploring new methods to detect cyber-attacks with the DARPA 1998 dataset for Intrusion Detection and the modified versions of this dataset KDDCup99 and NSL-KDD, but until now no one have examined the performance of the Top 10 data mining algorithms selected by experts in data mining. The compared classification learning algorithms in this thesis are: C4.5, CART, k-NN and Naïve Bayes. The performance of these algorithms are compared with accuracy, error rate and average cost on modified versions of NSL-KDD train and test dataset where the instances are classified into normal and four cyber-attack categories: DoS, Probing, R2L and U2R. Additionally the most important features to detect cyber-attacks in all categories and in each category are evaluated with Weka’s Attribute Evaluator and ranked according to Information Gain. The results show that the classification algorithm with best performance on the dataset is the k-NN algorithm. The most important features to detect cyber-attacks are basic features such as the number of seconds of a network connection, the protocol used for the connection, the network service used, normal or error status of the connection and the number of data bytes sent. The most important features to detect DoS, Probing and R2L attacks are basic features and the least important features are content features. Unlike U2R attacks, where the content features are the most important features to detect attacks.
Resumo:
Incidental findings on low-dose CT images obtained during hybrid imaging are an increasing phenomenon as CT technology advances. Understanding the diagnostic value of incidental findings along with the technical limitations is important when reporting image results and recommending follow-up, which may result in an additional radiation dose from further diagnostic imaging and an increase in patient anxiety. This study assessed lesions incidentally detected on CT images acquired for attenuation correction on two SPECT/CT systems. Methods: An anthropomorphic chest phantom containing simulated lesions of varying size and density was imaged on an Infinia Hawkeye 4 and a Symbia T6 using the low-dose CT settings applied for attenuation correction acquisitions in myocardial perfusion imaging. Twenty-two interpreters assessed 46 images from each SPECT/CT system (15 normal images and 31 abnormal images; 41 lesions). Data were evaluated using a jackknife alternative free-response receiver-operating-characteristic analysis (JAFROC). Results: JAFROC analysis showed a significant difference (P < 0.0001) in lesion detection, with the figures of merit being 0.599 (95% confidence interval, 0.568, 0.631) and 0.810 (95% confidence interval, 0.781, 0.839) for the Infinia Hawkeye 4 and Symbia T6, respectively. Lesion detection on the Infinia Hawkeye 4 was generally limited to larger, higher-density lesions. The Symbia T6 allowed improved detection rates for midsized lesions and some lower-density lesions. However, interpreters struggled to detect small (5 mm) lesions on both image sets, irrespective of density. Conclusion: Lesion detection is more reliable on low-dose CT images from the Symbia T6 than from the Infinia Hawkeye 4. This phantom-based study gives an indication of potential lesion detection in the clinical context as shown by two commonly used SPECT/CT systems, which may assist the clinician in determining whether further diagnostic imaging is justified.
Resumo:
In order to optimize frontal detection in sea surface temperature fields at 4 km resolution, a combined statistical and expert-based approach is applied to test different spatial smoothing of the data prior to the detection process. Fronts are usually detected at 1 km resolution using the histogram-based, single image edge detection (SIED) algorithm developed by Cayula and Cornillon in 1992, with a standard preliminary smoothing using a median filter and a 3 × 3 pixel kernel. Here, detections are performed in three study regions (off Morocco, the Mozambique Channel, and north-western Australia) and across the Indian Ocean basin using the combination of multiple windows (CMW) method developed by Nieto, Demarcq and McClatchie in 2012 which improves on the original Cayula and Cornillon algorithm. Detections at 4 km and 1 km of resolution are compared. Fronts are divided in two intensity classes (“weak” and “strong”) according to their thermal gradient. A preliminary smoothing is applied prior to the detection using different convolutions: three type of filters (median, average and Gaussian) combined with four kernel sizes (3 × 3, 5 × 5, 7 × 7, and 9 × 9 pixels) and three detection window sizes (16 × 16, 24 × 24 and 32 × 32 pixels) to test the effect of these smoothing combinations on reducing the background noise of the data and therefore on improving the frontal detection. The performance of the combinations on 4 km data are evaluated using two criteria: detection efficiency and front length. We find that the optimal combination of preliminary smoothing parameters in enhancing detection efficiency and preserving front length includes a median filter, a 16 × 16 pixel window size, and a 5 × 5 pixel kernel for strong fronts and a 7 × 7 pixel kernel for weak fronts. Results show an improvement in detection performance (from largest to smallest window size) of 71% for strong fronts and 120% for weak fronts. Despite the small window used (16 × 16 pixels), the length of the fronts has been preserved relative to that found with 1 km data. This optimal preliminary smoothing and the CMW detection algorithm on 4 km sea surface temperature data are then used to describe the spatial distribution of the monthly frequencies of occurrence for both strong and weak fronts across the Indian Ocean basin. In general strong fronts are observed in coastal areas whereas weak fronts, with some seasonal exceptions, are mainly located in the open ocean. This study shows that adequate noise reduction done by a preliminary smoothing of the data considerably improves the frontal detection efficiency as well as the global quality of the results. Consequently, the use of 4 km data enables frontal detections similar to 1 km data (using a standard median 3 × 3 convolution) in terms of detectability, length and location. This method, using 4 km data is easily applicable to large regions or at the global scale with far less constraints of data manipulation and processing time relative to 1 km data.
Resumo:
In recent years, there has been exponential growth in using virtual spaces, including dialogue systems, that handle personal information. The concept of personal privacy in the literature is discussed and controversial, whereas, in the technological field, it directly influences the degree of reliability perceived in the information system (privacy ‘as trust’). This work aims to protect the right to privacy on personal data (GDPR, 2018) and avoid the loss of sensitive content by exploring sensitive information detection (SID) task. It is grounded on the following research questions: (RQ1) What does sensitive data mean? How to define a personal sensitive information domain? (RQ2) How to create a state-of-the-art model for SID?(RQ3) How to evaluate the model? RQ1 theoretically investigates the concepts of privacy and the ontological state-of-the-art representation of personal information. The Data Privacy Vocabulary (DPV) is the taxonomic resource taken as an authoritative reference for the definition of the knowledge domain. Concerning RQ2, we investigate two approaches to classify sensitive data: the first - bottom-up - explores automatic learning methods based on transformer networks, the second - top-down - proposes logical-symbolic methods with the construction of privaframe, a knowledge graph of compositional frames representing personal data categories. Both approaches are tested. For the evaluation - RQ3 – we create SPeDaC, a sentence-level labeled resource. This can be used as a benchmark or training in the SID task, filling the gap of a shared resource in this field. If the approach based on artificial neural networks confirms the validity of the direction adopted in the most recent studies on SID, the logical-symbolic approach emerges as the preferred way for the classification of fine-grained personal data categories, thanks to the semantic-grounded tailor modeling it allows. At the same time, the results highlight the strong potential of hybrid architectures in solving automatic tasks.
Resumo:
Correctness of information gathered in production environments is an essential part of quality assurance processes in many industries, this task is often performed by human resources who visually take annotations in various steps of the production flow. Depending on the performed task the correlation between where exactly the information is gathered and what it represents is more than often lost in the process. The lack of labeled data places a great boundary on the application of deep neural networks aimed at object detection tasks, moreover supervised training of deep models requires a great amount of data to be available. Reaching an adequate large collection of labeled images through classic techniques of data annotations is an exhausting and costly task to perform, not always suitable for every scenario. A possible solution is to generate synthetic data that replicates the real one and use it to fine-tune a deep neural network trained on one or more source domains to a different target domain. The purpose of this thesis is to show a real case scenario where the provided data were both in great scarcity and missing the required annotations. Sequentially a possible approach is presented where synthetic data has been generated to address those issues while standing as a training base of deep neural networks for object detection, capable of working on images taken in production-like environments. Lastly, it compares performance on different types of synthetic data and convolutional neural networks used as backbones for the model.
Resumo:
In questo elaborato vengono analizzate differenti tecniche per la detection di jammer attivi e costanti in una comunicazione satellitare in uplink. Osservando un numero limitato di campioni ricevuti si vuole identificare la presenza di un jammer. A tal fine sono stati implementati i seguenti classificatori binari: support vector machine (SVM), multilayer perceptron (MLP), spectrum guarding e autoencoder. Questi algoritmi di apprendimento automatico dipendono dalle features che ricevono in ingresso, per questo motivo è stata posta particolare attenzione alla loro scelta. A tal fine, sono state confrontate le accuratezze ottenute dai detector addestrati utilizzando differenti tipologie di informazione come: i segnali grezzi nel tempo, le statistical features, le trasformate wavelet e lo spettro ciclico. I pattern prodotti dall’estrazione di queste features dai segnali satellitari possono avere dimensioni elevate, quindi, prima della detection, vengono utilizzati i seguenti algoritmi per la riduzione della dimensionalità: principal component analysis (PCA) e linear discriminant analysis (LDA). Lo scopo di tale processo non è quello di eliminare le features meno rilevanti, ma combinarle in modo da preservare al massimo l’informazione, evitando problemi di overfitting e underfitting. Le simulazioni numeriche effettuate hanno evidenziato come lo spettro ciclico sia in grado di fornire le features migliori per la detection producendo però pattern di dimensioni elevate, per questo motivo è stato necessario l’utilizzo di algoritmi di riduzione della dimensionalità. In particolare, l'algoritmo PCA è stato in grado di estrarre delle informazioni migliori rispetto a LDA, le cui accuratezze risentivano troppo del tipo di jammer utilizzato nella fase di addestramento. Infine, l’algoritmo che ha fornito le prestazioni migliori è stato il Multilayer Perceptron che ha richiesto tempi di addestramento contenuti e dei valori di accuratezza elevati.
Resumo:
During the last semester of the Master’s Degree in Artificial Intelligence, I carried out my internship working for TXT e-Solution on the ADMITTED project. This paper describes the work done in those months. The thesis will be divided into two parts representing the two different tasks I was assigned during the course of my experience. The First part will be about the introduction of the project and the work done on the admittedly library, maintaining the code base and writing the test suits. The work carried out is more connected to the Software engineer role, developing features, fixing bugs and testing. The second part will describe the experiments done on the Anomaly detection task using a Deep Learning technique called Autoencoder, this task is on the other hand more connected to the data science role. The two tasks were not done simultaneously but were dealt with one after the other, which is why I preferred to divide them into two separate parts of this paper.
Resumo:
In acquired immunodeficiency syndrome (AIDS) studies it is quite common to observe viral load measurements collected irregularly over time. Moreover, these measurements can be subjected to some upper and/or lower detection limits depending on the quantification assays. A complication arises when these continuous repeated measures have a heavy-tailed behavior. For such data structures, we propose a robust structure for a censored linear model based on the multivariate Student's t-distribution. To compensate for the autocorrelation existing among irregularly observed measures, a damped exponential correlation structure is employed. An efficient expectation maximization type algorithm is developed for computing the maximum likelihood estimates, obtaining as a by-product the standard errors of the fixed effects and the log-likelihood function. The proposed algorithm uses closed-form expressions at the E-step that rely on formulas for the mean and variance of a truncated multivariate Student's t-distribution. The methodology is illustrated through an application to an Human Immunodeficiency Virus-AIDS (HIV-AIDS) study and several simulation studies.