947 resultados para Data pre-processing


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Lymphoma is a type of cancer that affects the immune system, and is classified as Hodgkin or non-Hodgkin. It is one of the ten types of cancer that are the most common on earth. Among all malignant neoplasms diagnosed in the world, lymphoma ranges from three to four percent of them. Our work presents a study of some filters devoted to enhancing images of lymphoma at the pre-processing step. Here the enhancement is useful for removing noise from the digital images. We have analysed the noise caused by different sources like room vibration, scraps and defocusing, and in the following classes of lymphoma: follicular, mantle cell and B-cell chronic lymphocytic leukemia. The filters Gaussian, Median and Mean-Shift were applied to different colour models (RGB, Lab and HSV). Afterwards, we performed a quantitative analysis of the images by means of the Structural Similarity Index. This was done in order to evaluate the similarity between the images. In all cases we have obtained a certainty of at least 75%, which rises to 99% if one considers only HSV. Namely, we have concluded that HSV is an important choice of colour model at pre-processing histological images of lymphoma, because in this case the resulting image will get the best enhancement.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

La tesi propone una soluzione middleware per scenari in cui i sensori producono un numero elevato di dati che è necessario gestire ed elaborare attraverso operazioni di preprocessing, filtering e buffering al fine di migliorare l'efficienza di comunicazione e del consumo di banda nel rispetto di vincoli energetici e computazionali. E'possibile effettuare l'ottimizzazione di questi componenti attraverso operazioni di tuning remoto.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This is the first part of a study investigating a model-based transient calibration process for diesel engines. The motivation is to populate hundreds of parameters (which can be calibrated) in a methodical and optimum manner by using model-based optimization in conjunction with the manual process so that, relative to the manual process used by itself, a significant improvement in transient emissions and fuel consumption and a sizable reduction in calibration time and test cell requirements is achieved. Empirical transient modelling and optimization has been addressed in the second part of this work, while the required data for model training and generalization are the focus of the current work. Transient and steady-state data from a turbocharged multicylinder diesel engine have been examined from a model training perspective. A single-cylinder engine with external air-handling has been used to expand the steady-state data to encompass transient parameter space. Based on comparative model performance and differences in the non-parametric space, primarily driven by a high engine difference between exhaust and intake manifold pressures (ΔP) during transients, it has been recommended that transient emission models should be trained with transient training data. It has been shown that electronic control module (ECM) estimates of transient charge flow and the exhaust gas recirculation (EGR) fraction cannot be accurate at the high engine ΔP frequently encountered during transient operation, and that such estimates do not account for cylinder-to-cylinder variation. The effects of high engine ΔP must therefore be incorporated empirically by using transient data generated from a spectrum of transient calibrations. Specific recommendations on how to choose such calibrations, how many data to acquire, and how to specify transient segments for data acquisition have been made. Methods to process transient data to account for transport delays and sensor lags have been developed. The processed data have then been visualized using statistical means to understand transient emission formation. Two modes of transient opacity formation have been observed and described. The first mode is driven by high engine ΔP and low fresh air flowrates, while the second mode is driven by high engine ΔP and high EGR flowrates. The EGR fraction is inaccurately estimated at both modes, while EGR distribution has been shown to be present but unaccounted for by the ECM. The two modes and associated phenomena are essential to understanding why transient emission models are calibration dependent and furthermore how to choose training data that will result in good model generalization.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

BACKGROUND Record linkage of existing individual health care data is an efficient way to answer important epidemiological research questions. Reuse of individual health-related data faces several problems: Either a unique personal identifier, like social security number, is not available or non-unique person identifiable information, like names, are privacy protected and cannot be accessed. A solution to protect privacy in probabilistic record linkages is to encrypt these sensitive information. Unfortunately, encrypted hash codes of two names differ completely if the plain names differ only by a single character. Therefore, standard encryption methods cannot be applied. To overcome these challenges, we developed the Privacy Preserving Probabilistic Record Linkage (P3RL) method. METHODS In this Privacy Preserving Probabilistic Record Linkage method we apply a three-party protocol, with two sites collecting individual data and an independent trusted linkage center as the third partner. Our method consists of three main steps: pre-processing, encryption and probabilistic record linkage. Data pre-processing and encryption are done at the sites by local personnel. To guarantee similar quality and format of variables and identical encryption procedure at each site, the linkage center generates semi-automated pre-processing and encryption templates. To retrieve information (i.e. data structure) for the creation of templates without ever accessing plain person identifiable information, we introduced a novel method of data masking. Sensitive string variables are encrypted using Bloom filters, which enables calculation of similarity coefficients. For date variables, we developed special encryption procedures to handle the most common date errors. The linkage center performs probabilistic record linkage with encrypted person identifiable information and plain non-sensitive variables. RESULTS In this paper we describe step by step how to link existing health-related data using encryption methods to preserve privacy of persons in the study. CONCLUSION Privacy Preserving Probabilistic Record linkage expands record linkage facilities in settings where a unique identifier is unavailable and/or regulations restrict access to the non-unique person identifiable information needed to link existing health-related data sets. Automated pre-processing and encryption fully protect sensitive information ensuring participant confidentiality. This method is suitable not just for epidemiological research but also for any setting with similar challenges.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Frequent Itemsets mining is well explored for various data types, and its computational complexity is well understood. There are methods to deal effectively with computational problems. This paper shows another approach to further performance enhancements of frequent items sets computation. We have made a series of observations that led us to inventing data pre-processing methods such that the final step of the Partition algorithm, where a combination of all local candidate sets must be processed, is executed on substantially smaller input data. The paper shows results from several experiments that confirmed our general and formally presented observations.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Con l’avvento di Internet, il numero di utenti con un effettivo accesso alla rete e la possibilità di condividere informazioni con tutto il mondo è, negli anni, in continua crescita. Con l’introduzione dei social media, in aggiunta, gli utenti sono portati a trasferire sul web una grande quantità di informazioni personali mettendoli a disposizione delle varie aziende. Inoltre, il mondo dell’Internet Of Things, grazie al quale i sensori e le macchine risultano essere agenti sulla rete, permette di avere, per ogni utente, un numero maggiore di dispositivi, direttamente collegati tra loro e alla rete globale. Proporzionalmente a questi fattori anche la mole di dati che vengono generati e immagazzinati sta aumentando in maniera vertiginosa dando luogo alla nascita di un nuovo concetto: i Big Data. Nasce, di conseguenza, la necessità di far ricorso a nuovi strumenti che possano sfruttare la potenza di calcolo oggi offerta dalle architetture più complesse che comprendono, sotto un unico sistema, un insieme di host utili per l’analisi. A tal merito, una quantità di dati così vasta, routine se si parla di Big Data, aggiunta ad una velocità di trasmissione e trasferimento altrettanto alta, rende la memorizzazione dei dati malagevole, tanto meno se le tecniche di storage risultano essere i tradizionali DBMS. Una soluzione relazionale classica, infatti, permetterebbe di processare dati solo su richiesta, producendo ritardi, significative latenze e inevitabile perdita di frazioni di dataset. Occorre, perciò, far ricorso a nuove tecnologie e strumenti consoni a esigenze diverse dalla classica analisi batch. In particolare, è stato preso in considerazione, come argomento di questa tesi, il Data Stream Processing progettando e prototipando un sistema bastato su Apache Storm scegliendo, come campo di applicazione, la cyber security.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We propose a pre-processing mesh re-distribution algorithm based upon harmonic maps employed in conjunction with discontinuous Galerkin approximations of advection-diffusion-reaction problems. Extensive two-dimensional numerical experiments with different choices of monitor functions, including monitor functions derived from goal-oriented a posteriori error indicators are presented. The examples presented clearly demonstrate the capabilities and the benefits of combining our pre-processing mesh movement algorithm with both uniform, as well as, adaptive isotropic and anisotropic mesh refinement.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper, we propose a method based on association rule-mining to enhance the diagnosis of medical images (mammograms). It combines low-level features automatically extracted from images and high-level knowledge from specialists to search for patterns. Our method analyzes medical images and automatically generates suggestions of diagnoses employing mining of association rules. The suggestions of diagnosis are used to accelerate the image analysis performed by specialists as well as to provide them an alternative to work on. The proposed method uses two new algorithms, PreSAGe and HiCARe. The PreSAGe algorithm combines, in a single step, feature selection and discretization, and reduces the mining complexity. Experiments performed on PreSAGe show that this algorithm is highly suitable to perform feature selection and discretization in medical images. HiCARe is a new associative classifier. The HiCARe algorithm has an important property that makes it unique: it assigns multiple keywords per image to suggest a diagnosis with high values of accuracy. Our method was applied to real datasets, and the results show high sensitivity (up to 95%) and accuracy (up to 92%), allowing us to claim that the use of association rules is a powerful means to assist in the diagnosing task.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Genetic algorithm was used for variable selection in simultaneous determination of mixtures of glucose, maltose and fructose by mid infrared spectroscopy. Different models, using partial least squares (PLS) and multiple linear regression (MLR) with and without data pre-processing, were used. Based on the results obtained, it was verified that a simpler model (multiple linear regression with variable selection by genetic algorithm) produces results comparable to more complex methods (partial least squares). The relative errors obtained for the best model was around 3% for the sugar determination, which is acceptable for this kind of determination.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Learning Disability (LD) is a general term that describes specific kinds of learning problems. It is a neurological condition that affects a child's brain and impairs his ability to carry out one or many specific tasks. The learning disabled children are neither slow nor mentally retarded. This disorder can make it problematic for a child to learn as quickly or in the same way as some child who isn't affected by a learning disability. An affected child can have normal or above average intelligence. They may have difficulty paying attention, with reading or letter recognition, or with mathematics. It does not mean that children who have learning disabilities are less intelligent. In fact, many children who have learning disabilities are more intelligent than an average child. Learning disabilities vary from child to child. One child with LD may not have the same kind of learning problems as another child with LD. There is no cure for learning disabilities and they are life-long. However, children with LD can be high achievers and can be taught ways to get around the learning disability. In this research work, data mining using machine learning techniques are used to analyze the symptoms of LD, establish interrelationships between them and evaluate the relative importance of these symptoms. To increase the diagnostic accuracy of learning disability prediction, a knowledge based tool based on statistical machine learning or data mining techniques, with high accuracy,according to the knowledge obtained from the clinical information, is proposed. The basic idea of the developed knowledge based tool is to increase the accuracy of the learning disability assessment and reduce the time used for the same. Different statistical machine learning techniques in data mining are used in the study. Identifying the important parameters of LD prediction using the data mining techniques, identifying the hidden relationship between the symptoms of LD and estimating the relative significance of each symptoms of LD are also the parts of the objectives of this research work. The developed tool has many advantages compared to the traditional methods of using check lists in determination of learning disabilities. For improving the performance of various classifiers, we developed some preprocessing methods for the LD prediction system. A new system based on fuzzy and rough set models are also developed for LD prediction. Here also the importance of pre-processing is studied. A Graphical User Interface (GUI) is designed for developing an integrated knowledge based tool for prediction of LD as well as its degree. The designed tool stores the details of the children in the student database and retrieves their LD report as and when required. The present study undoubtedly proves the effectiveness of the tool developed based on various machine learning techniques. It also identifies the important parameters of LD and accurately predicts the learning disability in school age children. This thesis makes several major contributions in technical, general and social areas. The results are found very beneficial to the parents, teachers and the institutions. They are able to diagnose the child’s problem at an early stage and can go for the proper treatments/counseling at the correct time so as to avoid the academic and social losses.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Compute grids are used widely in many areas of environmental science, but there has been limited uptake of grid computing by the climate modelling community, partly because the characteristics of many climate models make them difficult to use with popular grid middleware systems. In particular, climate models usually produce large volumes of output data, and running them usually involves complicated workflows implemented as shell scripts. For example, NEMO (Smith et al. 2008) is a state-of-the-art ocean model that is used currently for operational ocean forecasting in France, and will soon be used in the UK for both ocean forecasting and climate modelling. On a typical modern cluster, a particular one year global ocean simulation at 1-degree resolution takes about three hours when running on 40 processors, and produces roughly 20 GB of output as 50000 separate files. 50-year simulations are common, during which the model is resubmitted as a new job after each year. Running NEMO relies on a set of complicated shell scripts and command utilities for data pre-processing and post-processing prior to job resubmission. Grid Remote Execution (G-Rex) is a pure Java grid middleware system that allows scientific applications to be deployed as Web services on remote computer systems, and then launched and controlled as if they are running on the user's own computer. Although G-Rex is general purpose middleware it has two key features that make it particularly suitable for remote execution of climate models: (1) Output from the model is transferred back to the user while the run is in progress to prevent it from accumulating on the remote system and to allow the user to monitor the model; (2) The client component is a command-line program that can easily be incorporated into existing model work-flow scripts. G-Rex has a REST (Fielding, 2000) architectural style, which allows client programs to be very simple and lightweight and allows users to interact with model runs using only a basic HTTP client (such as a Web browser or the curl utility) if they wish. This design also allows for new client interfaces to be developed in other programming languages with relatively little effort. The G-Rex server is a standard Web application that runs inside a servlet container such as Apache Tomcat and is therefore easy to install and maintain by system administrators. G-Rex is employed as the middleware for the NERC1 Cluster Grid, a small grid of HPC2 clusters belonging to collaborating NERC research institutes. Currently the NEMO (Smith et al. 2008) and POLCOMS (Holt et al, 2008) ocean models are installed, and there are plans to install the Hadley Centre’s HadCM3 model for use in the decadal climate prediction project GCEP (Haines et al., 2008). The science projects involving NEMO on the Grid have a particular focus on data assimilation (Smith et al. 2008), a technique that involves constraining model simulations with observations. The POLCOMS model will play an important part in the GCOMS project (Holt et al, 2008), which aims to simulate the world’s coastal oceans. A typical use of G-Rex by a scientist to run a climate model on the NERC Cluster Grid proceeds as follows :(1) The scientist prepares input files on his or her local machine. (2) Using information provided by the Grid’s Ganglia3 monitoring system, the scientist selects an appropriate compute resource. (3) The scientist runs the relevant workflow script on his or her local machine. This is unmodified except that calls to run the model (e.g. with “mpirun”) are simply replaced with calls to "GRexRun" (4) The G-Rex middleware automatically handles the uploading of input files to the remote resource, and the downloading of output files back to the user, including their deletion from the remote system, during the run. (5) The scientist monitors the output files, using familiar analysis and visualization tools on his or her own local machine. G-Rex is well suited to climate modelling because it addresses many of the middleware usability issues that have led to limited uptake of grid computing by climate scientists. It is a lightweight, low-impact and easy-to-install solution that is currently designed for use in relatively small grids such as the NERC Cluster Grid. A current topic of research is the use of G-Rex as an easy-to-use front-end to larger-scale Grid resources such as the UK National Grid service.