26 resultados para data pre-processing
em Repositório Científico do Instituto Politécnico de Lisboa - Portugal
Resumo:
Microarray allow to monitoring simultaneously thousands of genes, where the abundance of the transcripts under a same experimental condition at the same time can be quantified. Among various available array technologies, double channel cDNA microarray experiments have arisen in numerous technical protocols associated to genomic studies, which is the focus of this work. Microarray experiments involve many steps and each one can affect the quality of raw data. Background correction and normalization are preprocessing techniques to clean and correct the raw data when undesirable fluctuations arise from technical factors. Several recent studies showed that there is no preprocessing strategy that outperforms others in all circumstances and thus it seems difficult to provide general recommendations. In this work, it is proposed to use exploratory techniques to visualize the effects of preprocessing methods on statistical analysis of cancer two-channel microarray data sets, where the cancer types (classes) are known. For selecting differential expressed genes the arrow plot was used and the graph of profiles resultant from the correspondence analysis for visualizing the results. It was used 6 background methods and 6 normalization methods, performing 36 pre-processing methods and it was analyzed in a published cDNA microarray database (Liver) available at http://genome-www5.stanford.edu/ which microarrays were already classified by cancer type. All statistical analyses were performed using the R statistical software.
Resumo:
Relatório do Trabalho Final de Mestrado para obtenção do grau de Mestre em Engenharia de Electrónica e Telecomunicações
Resumo:
In research on Silent Speech Interfaces (SSI), different sources of information (modalities) have been combined, aiming at obtaining better performance than the individual modalities. However, when combining these modalities, the dimensionality of the feature space rapidly increases, yielding the well-known "curse of dimensionality". As a consequence, in order to extract useful information from this data, one has to resort to feature selection (FS) techniques to lower the dimensionality of the learning space. In this paper, we assess the impact of FS techniques for silent speech data, in a dataset with 4 non-invasive and promising modalities, namely: video, depth, ultrasonic Doppler sensing, and surface electromyography. We consider two supervised (mutual information and Fisher's ratio) and two unsupervised (meanmedian and arithmetic mean geometric mean) FS filters. The evaluation was made by assessing the classification accuracy (word recognition error) of three well-known classifiers (knearest neighbors, support vector machines, and dynamic time warping). The key results of this study show that both unsupervised and supervised FS techniques improve on the classification accuracy on both individual and combined modalities. For instance, on the video component, we attain relative performance gains of 36.2% in error rates. FS is also useful as pre-processing for feature fusion. Copyright © 2014 ISCA.
Resumo:
Discrete data representations are necessary, or at least convenient, in many machine learning problems. While feature selection (FS) techniques aim at finding relevant subsets of features, the goal of feature discretization (FD) is to find concise (quantized) data representations, adequate for the learning task at hand. In this paper, we propose two incremental methods for FD. The first method belongs to the filter family, in which the quality of the discretization is assessed by a (supervised or unsupervised) relevance criterion. The second method is a wrapper, where discretized features are assessed using a classifier. Both methods can be coupled with any static (unsupervised or supervised) discretization procedure and can be used to perform FS as pre-processing or post-processing stages. The proposed methods attain efficient representations suitable for binary and multi-class problems with different types of data, being competitive with existing methods. Moreover, using well-known FS methods with the features discretized by our techniques leads to better accuracy than with the features discretized by other methods or with the original features. (C) 2013 Elsevier B.V. All rights reserved.
Resumo:
Behavioral biometrics is one of the areas with growing interest within the biosignal research community. A recent trend in the field is ECG-based biometrics, where electrocardiographic (ECG) signals are used as input to the biometric system. Previous work has shown this to be a promising trait, with the potential to serve as a good complement to other existing, and already more established modalities, due to its intrinsic characteristics. In this paper, we propose a system for ECG biometrics centered on signals acquired at the subject's hand. Our work is based on a previously developed custom, non-intrusive sensing apparatus for data acquisition at the hands, and involved the pre-processing of the ECG signals, and evaluation of two classification approaches targeted at real-time or near real-time applications. Preliminary results show that this system leads to competitive results both for authentication and identification, and further validate the potential of ECG signals as a complementary modality in the toolbox of the biometric system designer.
Resumo:
Hyperspectral instruments have been incorporated in satellite missions, providing data of high spectral resolution of the Earth. This data can be used in remote sensing applications, such as, target detection, hazard prevention, and monitoring oil spills, among others. In most of these applications, one of the requirements of paramount importance is the ability to give real-time or near real-time response. Recently, onboard processing systems have emerged, in order to overcome the huge amount of data to transfer from the satellite to the ground station, and thus, avoiding delays between hyperspectral image acquisition and its interpretation. For this purpose, compact reconfigurable hardware modules, such as field programmable gate arrays (FPGAs) are widely used. This paper proposes a parallel FPGA-based architecture for endmember’s signature extraction. This method based on the Vertex Component Analysis (VCA) has several advantages, namely it is unsupervised, fully automatic, and it works without dimensionality reduction (DR) pre-processing step. The architecture has been designed for a low cost Xilinx Zynq board with a Zynq-7020 SoC FPGA based on the Artix-7 FPGA programmable logic and tested using real hyperspectral data sets collected by the NASA’s Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) over the Cuprite mining district in Nevada. Experimental results indicate that the proposed implementation can achieve real-time processing, while maintaining the methods accuracy, which indicate the potential of the proposed platform to implement high-performance, low cost embedded systems, opening new perspectives for onboard hyperspectral image processing.
Resumo:
The use of iris recognition for human authentication has been spreading in the past years. Daugman has proposed a method for iris recognition, composed by four stages: segmentation, normalization, feature extraction, and matching. In this paper we propose some modifications and extensions to Daugman's method to cope with noisy images. These modifications are proposed after a study of images of CASIA and UBIRIS databases. The major modification is on the computationally demanding segmentation stage, for which we propose a faster and equally accurate template matching approach. The extensions on the algorithm address the important issue of pre-processing that depends on the image database, being mandatory when we have a non infra-red camera, like a typical WebCam. For this scenario, we propose methods for reflection removal and pupil enhancement and isolation. The tests, carried out by our C# application on grayscale CASIA and UBIRIS images show that the template matching segmentation method is more accurate and faster than the previous one, for noisy images. The proposed algorithms are found to be efficient and necessary when we deal with non infra-red images and non uniform illumination.
Resumo:
This paper proposes an FPGA-based architecture for onboard hyperspectral unmixing. This method based on the Vertex Component Analysis (VCA) has several advantages, namely it is unsupervised, fully automatic, and it works without dimensionality reduction (DR) pre-processing step. The architecture has been designed for a low cost Xilinx Zynq board with a Zynq-7020 SoC FPGA based on the Artix-7 FPGA programmable logic and tested using real hyperspectral datasets. Experimental results indicate that the proposed implementation can achieve real-time processing, while maintaining the methods accuracy, which indicate the potential of the proposed platform to implement high-performance, low cost embedded systems.
Resumo:
Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering, by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that it approximates, sometimes even outperforms previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.
Resumo:
The characteristics of tunable wavelength filters based on a-SiC:H multilayered stacked pin cells are studied both theoretically and experimentally. The optical transducers were produced by PECVD and tested for a proper fine tuning of the cyan and yellow fluorescent proteins emission. The active device consists of a p-i'(a-SiC:H)-n/p-i(a-Si:H)-n heterostructures sandwiched between two transparent contacts. Experimental data on spectral response analysis, current-voltage characteristics and color and transmission rate discrimination are reported. Cyan and yellow fluorescent input channels were transmitted together, each one with a specific transmission rate and different intensities. The multiplexed optical signal was analyzed by reading out, under positive and negative applied voltages, the generated photocurrents. Results show that the optimized optical transducer has the capability of combining the transient fluorescent signals onto a single output signal without losing any specificity (color and intensity). It acts as a voltage controlled optical filter: when the applied voltages are chosen appropriately the transducer can select separately the cyan and yellow channel emissions (wavelength and frequency) and also to quantify their relative intensities. A theoretical analysis supported by a numerical simulation is presented.
Resumo:
Amorphous SiC tandem heterostructures are used to filter a specific band, in the visible range. Experimental and simulated results are compared to validate the use of SiC multilayered structures in applications where gain compensation is needed or to attenuate unwanted wavelengths. Spectral response data acquired under different frequencies, optical wavelength control and side irradiations are analyzed. Transfer function characteristics are discussed. Color pulsed communication channels are transmitted together and the output signal analyzed under different background conditions. Results show that under controlled wavelength backgrounds, the device sensitivity is enhanced in a precise wavelength range and quenched in the others, tuning or suppressing a specific band. Depending on the background wavelength and irradiation side, the device acts either as a long-, a short-, or a band-rejection pass filter. An optoelectronic model supports the experimental results and gives insight on the physics of the device.
Resumo:
In this work liver contour is semi-automatically segmented and quantified in order to help the identification and diagnosis of diffuse liver disease. The features extracted from the liver contour are jointly used with clinical and laboratorial data in the staging process. The classification results of a support vector machine, a Bayesian and a k-nearest neighbor classifier are compared. A population of 88 patients at five different stages of diffuse liver disease and a leave-one-out cross-validation strategy are used in the classification process. The best results are obtained using the k-nearest neighbor classifier, with an overall accuracy of 80.68%. The good performance of the proposed method shows a reliable indicator that can improve the information in the staging of diffuse liver disease.
Resumo:
Dissertação apresentada à Escola Superior de Educação de Lisboa para a obtenção do Grau de Mestre em Ciências da Educação - especialidade Supervisão em Educação
Resumo:
Trabalho de Projeto para obtenção do grau de Mestre em Engenharia Informática e de Computadores
Resumo:
Research on cluster analysis for categorical data continues to develop, new clustering algorithms being proposed. However, in this context, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters is done simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length criterion (MML) to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the (Figueiredo and Jain, 2002) approach. The novelty of the approach rests on the integration of the model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number based on a set of pre-estimated candidate models. The performance of our approach is compared with the use of Bayesian Information Criterion (BIC) (Schwarz, 1978) and Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The obtained results illustrate the capacity of the proposed algorithm to attain the true number of cluster while outperforming BIC and ICL since it is faster, which is especially relevant when dealing with large data sets.