215 resultados para preprocessing
Resumo:
In this paper, a novel method for power quality signal decomposition is proposed based on Independent Component Analysis (ICA). This method aims to decompose the power system signal (voltage or current) into components that can provide more specific information about the different disturbances occurring simultaneously in a multiple disturbance situation. ICA is originally a multichannel technique; however, the method proposes its use to blindly separate disturbances existing in a single measured signal (single channel). Therefore, a preprocessing step for the ICA is proposed using a filter bank. The proposed method was applied to synthetic data, simulated data and actual power system signals, showing very good performance. A comparison with the decomposition provided by the Discrete Wavelet Transform shows that the proposed method presented better decoupling for the analyzed data. (C) 2012 Elsevier Ltd. All rights reserved.
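A minimal sketch of the single-channel idea described above, assuming a band-pass filter bank as the preprocessing step and FastICA for the blind separation; the sampling rate, band edges and toy disturbance signal are illustrative choices, not the paper's actual settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.decomposition import FastICA

fs = 15360.0                       # sampling rate (Hz), assumed
t = np.arange(0, 0.2, 1 / fs)
# toy "multiple disturbance" signal: fundamental + harmonic + high-frequency transient
x = np.sin(2 * np.pi * 60 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
x += 0.5 * np.exp(-((t - 0.1) ** 2) / 1e-5) * np.sin(2 * np.pi * 2000 * t)

# Preprocessing: band-pass filter bank -> one pseudo-channel per band
bands = [(30, 90), (90, 600), (600, 3000)]
channels = []
for lo, hi in bands:
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    channels.append(filtfilt(b, a, x))
X = np.column_stack(channels)

# Blind separation of the pseudo-channels into independent components
ica = FastICA(n_components=len(bands), random_state=0)
components = ica.fit_transform(X)   # each column approximates one disturbance
print(components.shape)
```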
Resumo:
Neuroimaging studies suggest anterior-limbic structural brain abnormalities in patients with bipolar disorder (BD), but few studies have shown these abnormalities in unaffected but genetically liable family members. In this study, we report morphometric correlates of genetic risk for BD using voxel-based morphometry. Thirty-five BD type I (BD-I) patients, 20 unaffected first-degree relatives (UAR) of BD patients and 40 healthy control subjects underwent imaging on a 3 T magnetic resonance scanner. Preprocessing of images used DARTEL (diffeomorphic anatomical registration through exponentiated lie algebra) for voxel-based morphometry in SPM8 (Wellcome Department of Imaging Neuroscience, London, UK). The whole-brain analysis revealed that the gray matter (GM) volumes of the left anterior insula and right inferior frontal gyrus showed a significant main effect of diagnosis. Multiple comparison analysis showed that the BD-I patients and the UAR subjects had smaller left anterior insular GM volumes compared with the healthy subjects, and the BD-I patients had smaller right inferior frontal gyrus GM volumes compared with the healthy subjects. For white matter (WM) volumes, there was a significant main effect of diagnosis for the medial frontal gyrus. The UAR subjects had smaller right medial frontal WM volumes compared with the healthy subjects. These findings suggest that morphometric brain abnormalities of the anterior-limbic neural substrate are associated with a family history of BD, which may give insight into the pathophysiology of BD and be a potential candidate for a morphological endophenotype of BD. Molecular Psychiatry (2012) 17, 412-420; doi: 10.1038/mp.2011.3; published online 15 February 2011
Resumo:
Background: Atherosclerosis causes millions of deaths annually and yields billions in expenses around the world. Intravascular Optical Coherence Tomography (IVOCT) is a medical imaging modality that displays high-resolution images of coronary cross-sections. Nonetheless, quantitative information can only be obtained with segmentation; consequently, more adequate diagnostics, therapies and interventions can be provided. Since it is a relatively new modality, many different segmentation methods, available in the literature for other modalities, could be successfully applied to IVOCT images, improving accuracy and usefulness. Method: An automatic lumen segmentation approach, based on the Wavelet Transform and Mathematical Morphology, is presented. The methodology is divided into three main parts. First, the preprocessing stage attenuates undesirable information and enhances important information. Second, in the feature extraction block, the wavelet transform is combined with an adapted version of Otsu thresholding; hence, tissue information is discriminated and binarized. Finally, binary morphological reconstruction improves the binary information and constructs the binary lumen object. Results: The evaluation was carried out by segmenting 290 challenging images from human and pig coronaries and rabbit iliac arteries; the outcomes were compared with gold standards made by experts. The resulting accuracy was: True Positive (%) = 99.29 ± 2.96, False Positive (%) = 3.69 ± 2.88, False Negative (%) = 0.71 ± 2.96, Max False Positive Distance (mm) = 0.1 ± 0.07, Max False Negative Distance (mm) = 0.06 ± 0.1. Conclusions: By segmenting a number of IVOCT images with various features, the proposed technique proved to be robust and more accurate than published studies; in addition, the method is completely automatic, providing a new tool for IVOCT segmentation.
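A rough sketch of the three-stage pipeline outlined above (preprocessing, wavelet plus Otsu feature extraction, binary morphological reconstruction), under assumed parameters and a placeholder image rather than the authors' exact implementation:

```python
import numpy as np
import pywt
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu
from skimage.morphology import binary_erosion, disk, reconstruction

def segment_lumen(img):
    # 1) preprocessing: light smoothing to attenuate speckle noise
    smoothed = ndi.gaussian_filter(img.astype(float), sigma=2)

    # 2) feature extraction: keep the wavelet approximation band, then Otsu threshold
    cA, _ = pywt.dwt2(smoothed, "haar")
    approx = ndi.zoom(cA, 2, order=1)[: img.shape[0], : img.shape[1]]
    tissue = approx > threshold_otsu(approx)

    # 3) binary morphological reconstruction: clean the tissue mask,
    #    then take its complement as the lumen candidate
    seed = binary_erosion(tissue, disk(3)).astype(np.uint8)
    tissue_clean = reconstruction(seed, tissue.astype(np.uint8), method="dilation") > 0
    return ~tissue_clean

lumen = segment_lumen(np.random.rand(256, 256))   # placeholder image, not an IVOCT frame
print(lumen.sum(), "candidate lumen pixels")
```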
Resumo:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has led to enormous progress in the life sciences. Among the most important innovations is microarray technology. It allows the expression of thousands of genes to be quantified simultaneously by measuring the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousands, and a sample size in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex diseases such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. These samples are hybridized to microarrays in an effort to find a small number of genes which are strongly correlated with the groups of individuals. Even though methods to analyse these data are nowadays well developed and close to reaching a standard organization (through the effort of international projects like the Microarray Gene Expression Data (MGED) Society [1]), it is not infrequent to encounter a clinician's question for which there is no compelling statistical method that would permit an answer. The contribution of this dissertation to deciphering disease is the development of new approaches aimed at handling open problems posed by clinicians in specific experimental designs. Chapter 1, starting from a necessary biological introduction, reviews microarray technologies and all the important steps of an experiment, from the production of the array, to quality controls, to the preprocessing steps that will be used in the data analysis in the rest of the dissertation. Chapter 2 provides a critical review of standard analysis methods, stressing their main problems. Chapter 3 introduces a method to address the issue of unbalanced design in microarray experiments. In microarray experiments, experimental design is a crucial starting point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples should be collected for the two classes. However, in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists of a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC). A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each single probe is given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to the performance of SAM and LIMMA [3] over two simulated data sets generated via beta and exponential distributions. The results of all three algorithms over low-noise data sets seem acceptable. However, on a real unbalanced two-channel data set regarding Chronic Lymphocytic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering.
We also report extra-assay validation in terms of differentially expressed genes. Although standard algorithms perform well over low-noise simulated data sets, MultiSAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. Chapter 4 describes a method to address similarity evaluation in a three-class problem by means of the Relevance Vector Machine [4]. In fact, looking at microarray data in a prognostic and diagnostic clinical framework, not only differences can have a crucial role. In some cases similarities can give useful and sometimes even more important information. The goal, given three classes, could be to establish, with a certain level of confidence, whether the third one is more similar to the first or to the second. In this work we show that the Relevance Vector Machine (RVM) [2] could be a possible solution to the limitations of standard supervised classification. In fact, RVM offers many advantages compared, for example, with its well-known precursor (the Support Vector Machine, SVM [3]). Among these advantages, the estimate of the posterior probability of class membership represents a key feature for addressing the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on a tumor-grade three-class problem, with 67 samples of grade 1 (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, then evaluate the third class G2 as a test set to obtain, for each G2 sample, the probability of being a member of class G1 or class G3. The analysis showed that breast cancer samples of grade 2 have a molecular profile more similar to breast cancer samples of grade 1. Looking at the literature, this result had been conjectured, but no measure of significance had been given before.
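Two of the ideas in this abstract lend themselves to short illustrations. First, the MultiSAM resampling logic: for brevity the inner SAM analysis is replaced here by a plain two-sample t-test, while the balanced subsampling, the 1,000 reiterations and the score > 300 rule follow the description above; all data are simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_probes, n_lpc, n_mpc = 500, 8, 60
lpc = rng.normal(size=(n_probes, n_lpc))          # less populated class
mpc = rng.normal(size=(n_probes, n_mpc))          # more populated class
mpc[:20] += 1.5                                   # a few truly changed probes

n_iter, alpha = 1000, 0.01
score = np.zeros(n_probes, dtype=int)
for _ in range(n_iter):
    cols = rng.choice(n_mpc, size=n_lpc, replace=False)   # balanced subsample of MPC
    _, p = stats.ttest_ind(lpc, mpc[:, cols], axis=1)     # t-test stands in for SAM
    score += (p < alpha).astype(int)                       # recurrence count, 0..n_iter

selected = np.where(score > 300)[0]                        # the score > 300 rule
print(len(selected), "probes selected")
```

Second, the posterior-probability similarity idea of Chapter 4: since scikit-learn offers no RVM, a logistic-regression stand-in is used here to show how G2 samples can be scored by their probability of membership in G1 or G3; the data are synthetic placeholders, not the breast cancer cohort.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_genes = 50
G1 = rng.normal(0.0, 1, size=(67, n_genes))   # grade 1 profiles
G3 = rng.normal(1.0, 1, size=(54, n_genes))   # grade 3 profiles
G2 = rng.normal(0.3, 1, size=(100, n_genes))  # grade 2, the class to be located

X = np.vstack([G1, G3])
y = np.array([0] * len(G1) + [1] * len(G3))   # 0 = G1-like, 1 = G3-like

clf = LogisticRegression(max_iter=1000).fit(X, y)
p_g3 = clf.predict_proba(G2)[:, 1]            # posterior P(G3-like) per G2 sample
print("mean P(G3-like) for G2:", p_g3.mean())
print("G2 samples closer to G1:", (p_g3 < 0.5).sum(), "of", len(G2))
```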
Resumo:
Spam, or unsolicited bulk email, is a threat that affects email and other telematic communication media. Its high circulation volume generates considerable losses of time and money. A solution to this problem is presented: a hybrid intelligent anti-spam filtering system based on unsupervised artificial neural networks (ANNs). It consists of a preprocessing stage and a processing stage, based on different computation models: programmed (with two phases, manual and computational) and neural (using Kohonen self-organizing maps, SOM), respectively. This system has been optimized using, as data corpus, ham from "Enron Email" and spam from two different sources. Its quality and performance are analysed by means of different metrics.
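A toy sketch of the two-stage architecture just described, assuming the MiniSom package as the Kohonen SOM implementation and a simple bag-of-words preprocessing step; the four-message corpus is a placeholder, not the Enron/spam corpus.

```python
import numpy as np
from minisom import MiniSom
from sklearn.feature_extraction.text import CountVectorizer

ham = ["meeting agenda attached", "please review the quarterly report"]
spam = ["win money now click here", "cheap pills click now win"]
labels = ["ham"] * len(ham) + ["spam"] * len(spam)

# Preprocessing stage: tokenize and vectorize the messages
vec = CountVectorizer()
X = vec.fit_transform(ham + spam).toarray().astype(float)

# Processing stage: train a small self-organizing map and label its units by majority vote
som = MiniSom(4, 4, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 500)
unit_labels = {}
for x, lab in zip(X, labels):
    unit_labels.setdefault(som.winner(x), []).append(lab)

def classify(message):
    x = vec.transform([message]).toarray().astype(float)[0]
    votes = unit_labels.get(som.winner(x), ["ham"])   # default to ham if the unit is unseen
    return max(set(votes), key=votes.count)

print(classify("click here to win money"))
```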
Resumo:
Machine learning comprises a series of techniques for the automatic extraction of meaningful information from large collections of noisy data. In many real-world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among all the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree-structured data two issues become relevant: kernels for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too little information to make correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving like the nearest neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernels, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required in both the learning and classification phases. Such complexity can sometimes prevent the application of the kernel in scenarios involving large amounts of data. This thesis proposes three contributions for resolving the above issues of kernels for trees. A first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. A convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted to reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Directed Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.
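For illustration, a compact variant of a convolution tree kernel in the subset-tree (Collins-Duffy) style, the kernel family this work builds on; this is not the thesis's proposed kernels, and the decay factor and toy trees are assumptions. Trees are nested tuples (label, child, child, ...), with leaves as one-element tuples.

```python
from functools import lru_cache

LAMBDA = 0.5  # decay factor penalizing large shared fragments

def nodes(t):
    yield t
    for child in t[1:]:
        yield from nodes(child)

def production(n):
    # node label plus the ordered labels of its children
    return (n[0],) + tuple(c[0] for c in n[1:])

@lru_cache(maxsize=None)
def C(n1, n2):
    # contribution of a node pair: shared fragments rooted at these nodes
    if production(n1) != production(n2):
        return 0.0
    if len(n1) == 1:                      # matching leaves
        return LAMBDA
    prod = LAMBDA
    for c1, c2 in zip(n1[1:], n2[1:]):
        prod *= 1.0 + C(c1, c2)
    return prod

def tree_kernel(t1, t2):
    return sum(C(a, b) for a in nodes(t1) for b in nodes(t2))

t1 = ("S", ("NP", ("D", ("the",)), ("N", ("dog",))), ("VP", ("V", ("runs",))))
t2 = ("S", ("NP", ("D", ("the",)), ("N", ("cat",))), ("VP", ("V", ("runs",))))
print(tree_kernel(t1, t2))
```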
Resumo:
Some fundamental biological processes such as embryonic development have been preserved during evolution and are common to species belonging to different phylogenetic positions, but are nowadays largely unknown. The understanding of the cell morphodynamics leading to the formation of organized spatial distributions of cells, such as tissues and organs, can be achieved through the reconstruction of cell shapes and positions during the development of a live animal embryo. In this work we design a chain of image processing methods to automatically segment and track cell nuclei and membranes during the development of a zebrafish embryo, which has been largely validated as a model organism to understand vertebrate development, gene function and healing/repair mechanisms in vertebrates. The embryo is first labeled through the ubiquitous expression of fluorescent proteins addressed to cell nuclei and membranes, and temporal sequences of volumetric images are acquired with laser scanning microscopy. Cell positions are detected by processing the nuclei images either through the generalized form of the Hough transform or by identifying nuclei positions as local maxima after a smoothing preprocessing step. Membrane and nucleus shapes are reconstructed by using PDE-based variational techniques such as the Subjective Surfaces and the Chan-Vese methods. Cell tracking is performed by combining the previously detected information on cell shapes and positions with biological regularization constraints. Our results are manually validated and reconstruct the formation of the zebrafish brain at the 7-8 somite stage, with all the cells tracked starting from the late sphere stage with less than 2% error for at least 6 hours. Our reconstruction opens the way to a systematic investigation of cellular behaviors, of the clonal origin and clonal complexity of brain organs, as well as of the contribution of cell proliferation modes and cell movements to the formation of local patterns and morphogenetic fields.
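A small sketch of the simpler of the two nucleus-detection routes mentioned above (smoothing followed by local maxima), applied to a synthetic 3-D volume; the sigma and min_distance values are illustrative, not the calibrated parameters of the pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.feature import peak_local_max

rng = np.random.default_rng(0)
volume = rng.normal(0, 0.05, size=(40, 128, 128))              # noisy background
for z, y, x in [(10, 30, 40), (20, 80, 60), (30, 50, 100)]:    # three fake nuclei
    zz, yy, xx = np.ogrid[:40, :128, :128]
    volume += np.exp(-((zz - z) ** 2 + (yy - y) ** 2 + (xx - x) ** 2) / (2 * 3.0 ** 2))

smoothed = gaussian_filter(volume, sigma=2)                    # smoothing preprocessing step
centres = peak_local_max(smoothed, min_distance=10, threshold_abs=0.3)
print(centres)   # one (z, y, x) row per detected nucleus
```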
Resumo:
Long-term observation studies of landscape dynamics in Sahelian countries generally face a deficient supply of quantitative spatial information. The local- to regional-scale lack of data found in Mali led to a methodological study comprising the development of procedures for the multi-temporal acquisition and analysis of landscape change data. For the West African region, historical remote sensing material with large areal coverage exists in the form of high-resolution aerial photographs from the 1950s onward and the first Earth observation satellite data from Landsat MSS from the 1970s onward. Multi-temporal long-term analyses require, for digital reproducibility, data comparability and object detectability, an a priori consideration of data condition and quality. Two methodological approaches, developed without available or reconstructible ground control data, show not only the possibilities but also the limits of unambiguous radiometric and morphometric extraction of image information. Within the flood-prone area of the Inner Niger Delta in central Mali, two sub-studies on the extraction of quantitative Sahel vegetation data address the radiometric and atmospheric problems: 1. preprocessing homogenization of multi-temporal MSS archive data, with simulations of the impact of atmospheric and sensor-related effects; 2. development of a method for the semi-automatic detection and quantification of the dynamics of woody vegetation cover density on panchromatic archive aerial photographs. The first sub-study shows historical Landsat MSS satellite image data to be unusable for multi-temporal analyses of landscape dynamics. The second sub-study presents the methodological approach developed specifically for the automatic pattern recognition and quantification of Sahelian woody vegetation objects by means of morpho-mathematical filter operations. Finally, the demand for cost- and time-efficient methodological standards is discussed with regard to their representativeness for long-term monitoring of the resource inventory of semi-arid regions and their operational transferability to data from modern remote sensing sensors.
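As a rough illustration of the morpho-mathematical filtering idea of the second sub-study, the sketch below enhances small compact objects (tree crowns, assumed darker than the bright background) with a black top-hat, thresholds them and derives an object count and cover density; the structuring-element size and the synthetic image are assumptions, not the thesis's actual operator chain.

```python
import numpy as np
from skimage.morphology import black_tophat, disk
from skimage.filters import threshold_otsu
from skimage.measure import label

rng = np.random.default_rng(0)
image = rng.normal(0.8, 0.02, size=(200, 200))      # bright sandy background
for y, x in [(40, 60), (120, 90), (160, 170)]:      # three dark "crowns"
    yy, xx = np.ogrid[:200, :200]
    image -= 0.4 * (((yy - y) ** 2 + (xx - x) ** 2) < 5 ** 2)

crowns = black_tophat(image, disk(8))               # dark objects smaller than the disk
mask = crowns > threshold_otsu(crowns)
n_objects = label(mask).max()                       # number of detected woody objects
cover_density = mask.mean()                         # fraction of the area covered
print(n_objects, "objects, cover density", round(cover_density, 4))
```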
Resumo:
Magnetic Resonance Spectroscopy (MRS) is an advanced clinical and research application which guarantees a specific biochemical and metabolic characterization of tissues through the detection and quantification of key metabolites for diagnosis and disease staging. The "Associazione Italiana di Fisica Medica (AIFM)" has promoted the activity of the "Interconfronto di spettroscopia in RM" working group. The purpose of the study is to compare and analyze results obtained by performing MRS on scanners from different manufacturers in order to compile a robust protocol for spectroscopic examinations in clinical routine. This thesis is part of this project, using the GE Signa HDxt 1.5 T at Pavilion no. 11 of the S.Orsola-Malpighi hospital in Bologna. The spectral analyses have been performed with the jMRUI package, which includes a wide range of preprocessing and quantification algorithms for signal analysis in the time domain. After quality assurance on the scanner with standard and innovative methods, spectra both with and without suppression of the water peak have been acquired on the GE test phantom. The comparison of the ratios of the metabolite amplitudes over Creatine computed by the workstation software, which works in the frequency domain, and by jMRUI shows good agreement, suggesting that quantifications in both domains may lead to consistent results. The characterization of an in-house phantom provided by the working group has achieved its goal of assessing the solution content and the metabolite concentrations with good accuracy. The soundness of the experimental procedure and data analysis has been demonstrated by the correct estimation of the T2 of water, the observed biexponential relaxation curve of Creatine and the correct TE value at which the modulation by J coupling causes the Lactate doublet to be inverted in the spectrum. The work of this thesis has demonstrated that it is possible to perform measurements and establish protocols for data analysis, based on the physical principles of NMR, which are able to provide robust values for the spectral parameters of clinical use.
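A small worked example of one analysis step mentioned above, estimating T2 by fitting a mono-exponential decay of signal amplitude versus echo time; the TE values and amplitudes are made-up numbers, not the phantom measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(te, s0, t2):
    # mono-exponential transverse relaxation model S(TE) = S0 * exp(-TE / T2)
    return s0 * np.exp(-te / t2)

te = np.array([30.0, 60.0, 100.0, 150.0, 250.0, 400.0])       # echo times (ms), assumed
signal = 100.0 * np.exp(-te / 180.0) + np.random.default_rng(0).normal(0, 1, te.size)

(s0, t2), _ = curve_fit(decay, te, signal, p0=(signal[0], 100.0))
print(f"estimated T2 = {t2:.1f} ms")
```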
Resumo:
Nowadays communication is switching from a centralized scenario, where communication media like newspapers, radio and TV programs produce information and people are just consumers, to a completely different decentralized scenario, where everyone is potentially an information producer through the use of social networks, blogs and forums that allow real-time worldwide information exchange. These new instruments, as a result of their widespread diffusion, have started playing an important socio-economic role. They are the most used communication media and, as a consequence, they constitute the main source of information that enterprises, political parties and other organizations can rely on. Analyzing data stored in servers all over the world is feasible by means of Text Mining techniques like Sentiment Analysis, which aims to extract opinions from huge amounts of unstructured text. This could lead to determining, for instance, the degree of user satisfaction about products, services, politicians and so on. In this context, this dissertation presents new Document Sentiment Classification methods based on the mathematical theory of Markov Chains. All these approaches rely on a Markov Chain based model, which is language independent and whose killer features are simplicity and generality, making it interesting with respect to previous, more sophisticated techniques. Every discussed technique has been tested in both Single-Domain and Cross-Domain Sentiment Classification settings, comparing performance with that of two previous works. The performed analysis shows that some of the examined algorithms produce results comparable with the best methods in the literature, with reference to both single-domain and cross-domain tasks, in 2-class (i.e. positive and negative) Document Sentiment Classification. However, there is still room for improvement, because this work also shows the direction to follow in order to enhance performance: a good novel feature selection process would be enough to outperform the state of the art. Furthermore, since some of the proposed approaches show promising results in 2-class Single-Domain Sentiment Classification, another future work will regard validating these results in tasks with more than 2 classes.
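A toy illustration of the Markov-chain idea: one word-transition model per class, trained on labelled documents, with a new document assigned to the class whose chain gives it the higher smoothed log-likelihood; the corpus and smoothing constant are placeholders, not the dissertation's actual setup.

```python
import math
from collections import defaultdict

train = {
    "positive": ["great phone really great battery", "love the screen love the price"],
    "negative": ["terrible battery really bad screen", "bad price terrible support"],
}

def fit_chain(docs):
    # count word-to-word transitions per class
    counts, totals = defaultdict(lambda: defaultdict(int)), defaultdict(int)
    for doc in docs:
        words = doc.split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
            totals[a] += 1
    return counts, totals

models = {c: fit_chain(docs) for c, docs in train.items()}
vocab = {w for docs in train.values() for doc in docs for w in doc.split()}

def log_likelihood(doc, counts, totals, alpha=1.0):
    # Laplace-smoothed log-probability of the document's bigram transitions
    words = doc.split()
    score = 0.0
    for a, b in zip(words, words[1:]):
        score += math.log((counts[a][b] + alpha) / (totals[a] + alpha * len(vocab)))
    return score

def classify(doc):
    return max(models, key=lambda c: log_likelihood(doc, *models[c]))

print(classify("really great screen"))   # expected: positive
```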
Resumo:
The thesis proposes a middleware solution for scenarios in which sensors produce a large amount of data that must be managed and processed through preprocessing, filtering and buffering operations, in order to improve communication efficiency and bandwidth consumption while respecting energy and computational constraints. These components can be optimized through remote tuning operations.
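A hedged sketch of the kind of node-side component described above: readings are filtered and buffered locally and only an aggregate is transmitted, with the buffer size and filter threshold as the parameters one would tune remotely; names and values are illustrative, not the thesis's actual middleware API.

```python
class SensorBuffer:
    def __init__(self, window=10, min_delta=0.1):
        self.window = window          # remotely tunable: readings per transmission
        self.min_delta = min_delta    # remotely tunable: change needed to keep a reading
        self._buf, self._last = [], None

    def push(self, value):
        """Filter out near-duplicate readings, buffer the rest, flush when full."""
        if self._last is None or abs(value - self._last) >= self.min_delta:
            self._buf.append(value)
            self._last = value
        if len(self._buf) >= self.window:
            return self.flush()
        return None

    def flush(self):
        summary = {"n": len(self._buf), "mean": sum(self._buf) / len(self._buf),
                   "max": max(self._buf)}
        self._buf.clear()
        return summary                # this aggregate is what goes over the network

buf = SensorBuffer(window=5, min_delta=0.05)
for v in [20.0, 20.01, 20.2, 20.4, 20.9, 21.5, 22.0]:
    packet = buf.push(v)
    if packet:
        print("transmit:", packet)
```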
Resumo:
We have investigated the use of hierarchical clustering of flow cytometry data to classify samples of conventional central chondrosarcoma, a malignant cartilage-forming tumor of uncertain cellular origin, according to similarities with the surface marker profiles of several known cell types. Human primary chondrosarcoma cells, articular chondrocytes, mesenchymal stem cells, fibroblasts, and a panel of tumor cell lines of chondrocytic or epithelial origin were clustered based on the expression profile of eleven surface markers. For clustering, eight hierarchical clustering algorithms, three distance metrics, as well as several approaches for data preprocessing, including multivariate outlier detection, logarithmic transformation, and z-score normalization, were systematically evaluated. By selecting clustering approaches shown to give reproducible results for cluster recovery of known cell types, primary conventional central chondrosarcoma cells could be grouped into two main clusters with distinctive marker expression signatures: one group clustering together with mesenchymal stem cells (CD49b-high/CD10-low/CD221-high) and a second group clustering close to fibroblasts (CD49b-low/CD10-high/CD221-low). Hierarchical clustering also revealed substantial differences between primary conventional central chondrosarcoma cells and established chondrosarcoma cell lines, with the latter not only segregating from primary tumor cells and normal tissue cells, but also clustering together with cell lines of epithelial lineage. Our study provides a foundation for the use of hierarchical clustering applied to flow cytometry data as a powerful tool to classify samples according to marker expression patterns, which could lead to the discovery of new cancer subtypes.
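A compact sketch of the clustering workflow described above: log-transform and z-score the marker intensities, then compare hierarchical clustering under a couple of linkage/metric combinations; the data are simulated stand-ins for the eleven-marker panel, not the actual samples.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

rng = np.random.default_rng(0)
# 30 samples x 11 surface markers, two underlying groups
group_a = rng.lognormal(mean=1.0, sigma=0.3, size=(15, 11))
group_b = rng.lognormal(mean=2.0, sigma=0.3, size=(15, 11))
X = np.vstack([group_a, group_b])

# preprocessing: logarithmic transformation + z-score normalization
Xn = zscore(np.log(X), axis=0)

# compare two algorithm/metric combinations for cluster recovery
for method, metric in [("ward", "euclidean"), ("average", "correlation")]:
    Z = linkage(Xn, method=method, metric=metric)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, metric, "->", labels[:15].tolist(), "|", labels[15:].tolist())
```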
Resumo:
This article gives an overview of the methods used in the low-level analysis of gene expression data generated using DNA microarrays. This type of experiment allows the relative levels of nucleic acid abundance in a set of tissues or cell populations to be determined for thousands of transcripts or loci simultaneously. Careful statistical design and analysis are essential to improve the efficiency and reliability of microarray experiments throughout the data acquisition and analysis process. This includes the design of probes, the experimental design, the analysis of scanned microarray images, the normalization of fluorescence intensities, the assessment of the quality of microarray data and the incorporation of quality information in subsequent analyses, the combination of information across arrays and across sets of experiments, the discovery and recognition of patterns in expression at the single-gene and multiple-gene levels, and the assessment of the significance of these findings, given that the data contain a great deal of noise and thus random features. For all of these components, access to a flexible and efficient statistical computing environment is an essential aspect.
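One of the low-level steps listed above, sketched for illustration: quantile normalization of fluorescence intensities across arrays, which forces each array's intensity distribution onto a common reference; the matrix below is a synthetic stand-in for real probe intensities.

```python
import numpy as np

def quantile_normalize(X):
    """X: probes x arrays matrix of intensities."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)     # rank of each probe within its array
    reference = np.sort(X, axis=0).mean(axis=1)           # mean of the sorted columns
    return reference[ranks]                                # map each rank to the reference value

rng = np.random.default_rng(0)
X = rng.lognormal(mean=6, sigma=1, size=(1000, 4)) * np.array([1.0, 1.3, 0.8, 1.1])
Xn = quantile_normalize(X)
print(np.round(Xn.mean(axis=0), 2))    # column means coincide after normalization
```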
Resumo:
In most microarray technologies, a number of critical steps are required to convert raw intensity measurements into the data relied upon by data analysts, biologists and clinicians. These data manipulations, referred to as preprocessing, can influence the quality of the ultimate measurements. In the last few years, the high-throughput measurement of gene expression has been the most popular application of microarray technology. For this application, various groups have demonstrated that the use of modern statistical methodology can substantially improve the accuracy and precision of gene expression measurements, relative to ad hoc procedures introduced by designers and manufacturers of the technology. Currently, other applications of microarrays are becoming more and more popular. In this paper we describe a preprocessing methodology for a technology designed for the identification of DNA sequence variants in specific genes or regions of the human genome that are associated with phenotypes of interest such as disease. In particular we describe methodology useful for preprocessing Affymetrix SNP chips and obtaining genotype calls with the preprocessed data. We demonstrate how our procedure improves existing approaches using data from three relatively large studies, including one in which a large number of independent calls are available. Software implementing these ideas is available in the Bioconductor oligo package.
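A generic Python illustration (not the R/Bioconductor oligo workflow) of the idea behind genotype calling after preprocessing: summarize the two allele signals per SNP as a log-ratio and an average intensity, then assign three genotype clusters; all intensities below are simulated.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 300
true_genotype = rng.integers(0, 3, size=n)               # 0 = AA, 1 = AB, 2 = BB
a = rng.lognormal(mean=np.where(true_genotype < 2, 8, 5), sigma=0.3)   # allele A intensity
b = rng.lognormal(mean=np.where(true_genotype > 0, 8, 5), sigma=0.3)   # allele B intensity

# preprocessing: log-ratio (contrast) and average log-intensity (strength)
m = np.log2(a) - np.log2(b)
s = (np.log2(a) + np.log2(b)) / 2
features = np.column_stack([m, s])

# genotype calling as a three-cluster assignment in the (contrast, strength) plane
calls = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(calls))      # sizes of the three genotype clusters
```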
Resumo:
BACKGROUND: Starches are the major source of dietary glucose in weaned children and adults. However, small intestine alpha-glucogenesis by starch digestion is poorly understood due to substrate structural and chemical complexity, as well as the multiplicity of participating enzymes. Our objective was dissection of luminal and mucosal alpha-glucosidase activities participating in digestion of the soluble starch product maltodextrin (MDx). PATIENTS AND METHODS: Immunoprecipitated assays were performed on biopsy specimens and isolated enterocytes with MDx substrate. RESULTS: Mucosal sucrase-isomaltase (SI) and maltase-glucoamylase (MGAM) contributed 85% of total in vitro alpha-glucogenesis. Recombinant human pancreatic alpha-amylase alone contributed <15% of in vitro alpha-glucogenesis; however, alpha-amylase strongly amplified the mucosal alpha-glucogenic activities by preprocessing of starch to short glucose oligomer substrates. At low glucose oligomer concentrations, MGAM was 10 times more active than SI, but at higher concentrations it experienced substrate inhibition whereas SI was not affected. The in vitro results indicated that MGAM activity is inhibited by alpha-amylase digested starch product "brake" and contributes only 20% of mucosal alpha-glucogenic activity. SI contributes most of the alpha-glucogenic activity at higher oligomer substrate concentrations. CONCLUSIONS: MGAM primes and SI activity sustains and constrains prandial alpha-glucogenesis from starch oligomers at approximately 5% of the uninhibited rate. This coupled mucosal mechanism may contribute to highly efficient glucogenesis from low-starch diets and play a role in meeting the high requirement for glucose during children's brain maturation. The brake could play a constraining role on rates of glucose production from higher-starch diets consumed by an older population at risk for degenerative metabolic disorders.