187 results for Preprocessing
Abstract:
Machine learning comprises a series of techniques for the automatic extraction of meaningful information from large collections of noisy data. In many real-world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among the learning techniques for dealing with structured data, kernel methods are recognized as having a strong theoretical background and as being effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree-structured data two issues become relevant: kernels for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too little information for making correct predictions on unseen data; in fact, it tends to produce a discriminating function behaving like the nearest-neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernels, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required in both the learning and classification phases. Such complexity can sometimes prevent the application of the kernel in scenarios involving large amounts of data. This thesis proposes three contributions for resolving the above issues of kernels for trees. The first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees with an algorithm able to project the data onto a lower-dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower-dimensional representation, we are able to perform inexact matchings between different inputs in the original space. The second contribution is a novel kernel function based on the convolution kernel framework. A convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. The third contribution is devoted to reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Directed Acyclic Graphs can be used to compactly represent shared substructures across different trees, thus reducing the computational burden and storage requirements.
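A minimal sketch of the counting idea that convolution tree kernels build on, assuming trees are encoded as (label, children) tuples; this is an illustrative naive subtree kernel, not one of the kernels proposed in the thesis, and the canonical-string trick and toy trees are our own:

```python
from collections import Counter

def canonical(node):
    """Return a canonical string for the complete subtree rooted at `node`.
    A node is a (label, [children]) pair."""
    label, children = node
    return f"{label}({','.join(canonical(c) for c in children)})"

def all_subtrees(node):
    """Yield the canonical form of every complete subtree of the tree."""
    yield canonical(node)
    for child in node[1]:
        yield from all_subtrees(child)

def subtree_kernel(t1, t2):
    """Count the complete subtrees shared by t1 and t2 (naive subtree kernel)."""
    c1, c2 = Counter(all_subtrees(t1)), Counter(all_subtrees(t2))
    return sum(c1[s] * c2[s] for s in c1 if s in c2)

# toy example: two small trees sharing the subtree b(d, e)
t1 = ("a", [("b", [("d", []), ("e", [])]), ("c", [])])
t2 = ("f", [("b", [("d", []), ("e", [])])])
print(subtree_kernel(t1, t2))  # 3 shared subtrees: b(d,e), d, e
```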
Abstract:
Some fundamental biological processes such as embryonic development have been preserved during evolution and are common to species belonging to different phylogenetic positions, but are nowadays largely unknown. The understanding of the cell morphodynamics leading to the formation of organized spatial distributions of cells such as tissues and organs can be achieved through the reconstruction of cell shapes and positions during the development of a live animal embryo. In this work we design a chain of image processing methods to automatically segment and track cell nuclei and membranes during the development of a zebrafish embryo, which has been widely validated as a model organism to understand vertebrate development, gene function and healing/repair mechanisms in vertebrates. The embryo is first labeled through the ubiquitous expression of fluorescent proteins addressed to cell nuclei and membranes, and temporal sequences of volumetric images are acquired with laser scanning microscopy. Cell positions are detected by processing nuclei images either through the generalized form of the Hough transform or by identifying nuclei positions as local maxima after a smoothing preprocessing step. Membrane and nucleus shapes are reconstructed by using PDE-based variational techniques such as the Subjective Surfaces and the Chan-Vese method. Cell tracking is performed by combining the information previously detected on cell shapes and positions with biological regularization constraints. Our results are manually validated and reconstruct the formation of the zebrafish brain at the 7-8 somite stage, with all the cells tracked starting from the late sphere stage, for at least 6 hours and with less than 2% error. Our reconstruction opens the way to a systematic investigation of cellular behaviors, of the clonal origin and clonal complexity of brain organs, as well as of the contribution of cell proliferation modes and cell movements to the formation of local patterns and morphogenetic fields.
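A minimal sketch of the second detection route mentioned above (local maxima after a smoothing step), assuming a 3D NumPy volume of nuclear fluorescence; the function name, parameters and synthetic volume are illustrative assumptions, not the thesis pipeline:

```python
import numpy as np
from scipy import ndimage

def detect_nuclei(volume, sigma=2.0, min_intensity=0.2, size=5):
    """Detect candidate nucleus centres as local intensity maxima
    after Gaussian smoothing of a 3D fluorescence volume."""
    smoothed = ndimage.gaussian_filter(volume.astype(float), sigma=sigma)
    # a voxel is a local maximum if it equals the maximum of its neighbourhood
    local_max = ndimage.maximum_filter(smoothed, size=size) == smoothed
    # discard dim maxima caused by background noise
    candidates = local_max & (smoothed > min_intensity * smoothed.max())
    return np.argwhere(candidates)  # (z, y, x) coordinates

# usage on a synthetic volume containing two bright blobs
vol = np.zeros((40, 40, 40))
vol[10, 10, 10] = vol[30, 25, 20] = 1.0
vol = ndimage.gaussian_filter(vol, 3) + 0.01 * np.random.rand(40, 40, 40)
print(detect_nuclei(vol, sigma=1.5))
```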
Abstract:
Long-term observational studies of landscape dynamics in Sahelian countries generally face a deficient supply of quantitative spatial information. The local to regional scarcity of data encountered in Mali led to a methodological study comprising the development of procedures for the multi-temporal acquisition and analysis of landscape change data. For the West African region, historical remote sensing material with large areal coverage exists in the form of high-resolution aerial photographs from the 1950s onwards and the first Earth-observing satellite data from Landsat-MSS from the 1970s onwards. For digital reproducibility, data comparability and object detectability, multi-temporal long-term analyses demand an a priori examination of data characteristics and quality. Two methodological approaches, developed without available or reconstructable ground control data, demonstrate not only the possibilities but also the limits of unambiguous radiometric and morphometric extraction of image information. Within the flood-favoured zone of the Inner Niger Delta in central Mali, two sub-studies on the extraction of quantitative Sahelian vegetation data address the radiometric and atmospheric problems: (1) preprocessing homogenization of multi-temporal MSS archive data, with simulations of the impact of atmospheric and sensor-related effects, and (2) development of a method for the semi-automatic detection and quantification of the dynamics of woody vegetation cover density on panchromatic archival aerial photographs. The first sub-study shows that historical Landsat-MSS satellite image data are unusable for multi-temporal analyses of landscape dynamics. The second sub-study presents the methodological approach developed specifically for the automatic pattern recognition and quantification of Sahelian woody vegetation objects by means of morpho-mathematical filter operations. Finally, the demand for cost- and time-efficient methodological standards is discussed with regard to their representativeness for the long-term monitoring of the resource inventory of semi-arid regions and to their operational transferability to data from modern remote sensing sensors.
Abstract:
Magnetic Resonance Spectroscopy (MRS) is an advanced clinical and research application which guarantees a specific biochemical and metabolic characterization of tissues through the detection and quantification of key metabolites for diagnosis and disease staging. The "Associazione Italiana di Fisica Medica (AIFM)" has promoted the activity of the "Interconfronto di spettroscopia in RM" working group. The purpose of the study is to compare and analyze results obtained by performing MRS on scanners from different manufacturers in order to compile a robust protocol for spectroscopic examinations in clinical routine. This thesis takes part in this project by using the GE Signa HDxt 1.5 T at Pavilion no. 11 of the S.Orsola-Malpighi hospital in Bologna. The spectral analyses have been performed with the jMRUI package, which includes a wide range of preprocessing and quantification algorithms for signal analysis in the time domain. After quality assurance on the scanner with standard and innovative methods, spectra both with and without suppression of the water peak have been acquired on the GE test phantom. The comparison of the ratios of the metabolite amplitudes over Creatine computed by the workstation software, which works in the frequency domain, and by jMRUI shows good agreement, suggesting that quantifications in both domains may lead to consistent results. The characterization of an in-house phantom provided by the working group has achieved its goal of assessing the solution content and the metabolite concentrations with good accuracy. The soundness of the experimental procedure and data analysis has been demonstrated by the correct estimation of the T2 of water, the observed biexponential relaxation curve of Creatine, and the correct TE value at which the modulation by J coupling causes the Lactate doublet to be inverted in the spectrum. The work of this thesis has demonstrated that it is possible to perform measurements and establish protocols for data analysis, based on the physical principles of NMR, which are able to provide robust values for the spectral parameters of clinical use.
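As an illustration of the kind of relaxometry analysis mentioned above, a minimal sketch of T2 estimation by fitting a monoexponential decay to water-peak amplitudes acquired at several echo times; the TE values and noise level are made up for the example, and the actual analysis in the thesis was carried out with jMRUI:

```python
import numpy as np
from scipy.optimize import curve_fit

def monoexp(te, s0, t2):
    """Monoexponential transverse relaxation model S(TE) = S0 * exp(-TE / T2)."""
    return s0 * np.exp(-te / t2)

# hypothetical echo times (ms) and simulated water-peak amplitudes
te = np.array([30., 60., 90., 120., 200., 300.])
true_s0, true_t2 = 100.0, 150.0
signal = monoexp(te, true_s0, true_t2) * (1 + 0.02 * np.random.randn(te.size))

# least-squares fit of S0 and T2
(s0_fit, t2_fit), _ = curve_fit(monoexp, te, signal, p0=(signal[0], 100.0))
print(f"estimated T2 ~ {t2_fit:.1f} ms")
```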
Abstract:
Nowadays communication is switching from a centralized scenario, where communication media like newspapers, radio and TV programs produce information and people are just consumers, to a completely different decentralized scenario, where everyone is potentially an information producer through the use of social networks, blogs and forums that allow real-time worldwide information exchange. These new instruments, as a result of their widespread diffusion, have started playing an important socio-economic role. They are the most used communication media and, as a consequence, they constitute the main source of information that enterprises, political parties and other organizations can rely on. Analyzing data stored in servers all over the world is feasible by means of Text Mining techniques like Sentiment Analysis, which aims to extract opinions from huge amounts of unstructured text. This could make it possible to determine, for instance, the degree of user satisfaction with products, services, politicians and so on. In this context, this dissertation presents new Document Sentiment Classification methods based on the mathematical theory of Markov Chains. All these approaches rely on a Markov Chain based model, which is language independent and whose key features are simplicity and generality, making it attractive with respect to previous, more sophisticated techniques. Every discussed technique has been tested in both Single-Domain and Cross-Domain Sentiment Classification settings, comparing its performance with that of two previous works. The performed analysis shows that some of the examined algorithms produce results comparable with the best methods in the literature, with reference to both single-domain and cross-domain tasks, in 2-class (i.e., positive and negative) Document Sentiment Classification. However, there is still room for improvement, and this work also indicates the path to follow in order to enhance performance: a good novel feature selection process would be enough to outperform the state of the art. Furthermore, since some of the proposed approaches show promising results in 2-class Single-Domain Sentiment Classification, future work will also address validating these results in tasks with more than 2 classes.
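A minimal sketch of one plausible Markov Chain formulation of Document Sentiment Classification, assuming a first-order word-transition model per class with add-one smoothing and classification by log-likelihood; this is an illustrative toy, not the dissertation's exact model or feature set:

```python
from collections import defaultdict
import math

class MarkovSentimentClassifier:
    """Toy Markov-chain document classifier: one first-order word-transition
    table per class, add-one smoothing, prediction by log-likelihood."""

    def __init__(self):
        self.trans = {}   # class -> {prev_word -> {next_word -> count}}
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, y in zip(docs, labels):
            table = self.trans.setdefault(y, defaultdict(lambda: defaultdict(int)))
            words = doc.lower().split()
            self.vocab.update(words)
            for prev, nxt in zip(words, words[1:]):
                table[prev][nxt] += 1

    def _loglik(self, words, table):
        v = len(self.vocab) + 1
        ll = 0.0
        for prev, nxt in zip(words, words[1:]):
            row = table.get(prev, {})
            total = sum(row.values())
            ll += math.log((row.get(nxt, 0) + 1) / (total + v))  # add-one smoothing
        return ll

    def predict(self, doc):
        words = doc.lower().split()
        return max(self.trans, key=lambda y: self._loglik(words, self.trans[y]))

# toy usage with two tiny training documents
clf = MarkovSentimentClassifier()
clf.fit(["great movie really great", "awful plot really awful"], ["pos", "neg"])
print(clf.predict("really great plot"))  # -> 'pos'
```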
Abstract:
The thesis proposes a middleware solution for scenarios in which sensors produce large amounts of data that must be managed and processed through preprocessing, filtering and buffering operations, in order to improve communication efficiency and bandwidth consumption while respecting energy and computational constraints. These components can be optimized through remote tuning operations.
Abstract:
We have investigated the use of hierarchical clustering of flow cytometry data to classify samples of conventional central chondrosarcoma, a malignant cartilage-forming tumor of uncertain cellular origin, according to similarities with surface marker profiles of several known cell types. Human primary chondrosarcoma cells, articular chondrocytes, mesenchymal stem cells, fibroblasts, and a panel of tumor cell lines of chondrocytic or epithelial origin were clustered based on the expression profile of eleven surface markers. For clustering, eight hierarchical clustering algorithms, three distance metrics, as well as several approaches for data preprocessing, including multivariate outlier detection, logarithmic transformation, and z-score normalization, were systematically evaluated. By selecting clustering approaches shown to give reproducible results for cluster recovery of known cell types, primary conventional central chondrosarcoma cells could be grouped into two main clusters with distinctive marker expression signatures: one group clustering together with mesenchymal stem cells (CD49b-high/CD10-low/CD221-high) and a second group clustering close to fibroblasts (CD49b-low/CD10-high/CD221-low). Hierarchical clustering also revealed substantial differences between primary conventional central chondrosarcoma cells and established chondrosarcoma cell lines, with the latter not only segregating apart from primary tumor cells and normal tissue cells, but also clustering together with cell lines of epithelial lineage. Our study provides a foundation for the use of hierarchical clustering applied to flow cytometry data as a powerful tool to classify samples according to marker expression patterns, which could lead to the discovery of new cancer subtypes.
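A minimal sketch of the clustering pipeline described above (logarithmic transformation, z-score normalization, agglomerative clustering), using SciPy on a hypothetical samples-by-markers matrix; the synthetic data, linkage choice and cluster count are illustrative assumptions only:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# hypothetical matrix: rows = samples, columns = 11 surface-marker intensities
rng = np.random.default_rng(0)
expression = np.vstack([rng.normal(2.0, 0.3, (5, 11)),   # one cell-type-like group
                        rng.normal(0.5, 0.3, (5, 11))])  # a second group

# preprocessing: log transform and per-marker z-score normalization
logged = np.log1p(expression)
zscored = (logged - logged.mean(axis=0)) / logged.std(axis=0)

# agglomerative clustering (average linkage, Euclidean distance)
dist = pdist(zscored, metric="euclidean")
tree = linkage(dist, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # two recovered groups, e.g. [1 1 1 1 1 2 2 2 2 2]
```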
Abstract:
This article gives an overview of the methods used in the low-level analysis of gene expression data generated using DNA microarrays. This type of experiment makes it possible to determine relative levels of nucleic acid abundance in a set of tissues or cell populations for thousands of transcripts or loci simultaneously. Careful statistical design and analysis are essential to improve the efficiency and reliability of microarray experiments throughout the data acquisition and analysis process. This includes the design of probes, the experimental design, the image analysis of scanned microarray images, the normalization of fluorescence intensities, the assessment of the quality of microarray data and the incorporation of quality information in subsequent analyses, the combination of information across arrays and across sets of experiments, the discovery and recognition of patterns in expression at the single-gene and multiple-gene levels, and the assessment of the significance of these findings, given the considerable amount of noise and hence of random features in the data. For all of these components, access to a flexible and efficient statistical computing environment is an essential aspect.
Abstract:
In most microarray technologies, a number of critical steps are required to convert raw intensity measurements into the data relied upon by data analysts, biologists and clinicians. These data manipulations, referred to as preprocessing, can influence the quality of the ultimate measurements. In the last few years, the high-throughput measurement of gene expression has been the most popular application of microarray technology. For this application, various groups have demonstrated that the use of modern statistical methodology can substantially improve the accuracy and precision of gene expression measurements, relative to ad hoc procedures introduced by designers and manufacturers of the technology. Currently, other applications of microarrays are becoming more and more popular. In this paper we describe a preprocessing methodology for a technology designed for the identification of DNA sequence variants in specific genes or regions of the human genome that are associated with phenotypes of interest such as disease. In particular, we describe methodology useful for preprocessing Affymetrix SNP chips and obtaining genotype calls from the preprocessed data. We demonstrate how our procedure improves on existing approaches using data from three relatively large studies, including one in which a large number of independent calls are available. Software implementing these ideas is available in the Bioconductor oligo package.
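A toy sketch of how genotype calls could be derived from preprocessed allele intensities, here by clustering the log-contrast between the two alleles of a SNP into three groups; this is an illustrative simplification, not the procedure implemented in the Bioconductor oligo package:

```python
import numpy as np
from sklearn.cluster import KMeans

def call_genotypes(a_intensity, b_intensity):
    """Toy genotype caller for one SNP across many samples: cluster the
    contrast between preprocessed allele A and allele B intensities into
    three groups and label them AA / AB / BB by mean contrast."""
    contrast = np.log2(a_intensity) - np.log2(b_intensity)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(contrast.reshape(-1, 1))
    order = np.argsort(km.cluster_centers_.ravel())   # low contrast -> BB, high -> AA
    names = {order[0]: "BB", order[1]: "AB", order[2]: "AA"}
    return np.array([names[c] for c in km.labels_])

# usage with simulated intensities for 30 samples (10 per genotype)
rng = np.random.default_rng(1)
a = np.concatenate([rng.normal(1000, 50, 10), rng.normal(500, 50, 10), rng.normal(100, 20, 10)])
b = np.concatenate([rng.normal(100, 20, 10), rng.normal(500, 50, 10), rng.normal(1000, 50, 10)])
print(call_genotypes(a, b))
```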
Abstract:
BACKGROUND: Starches are the major source of dietary glucose in weaned children and adults. However, small intestine alpha-glucogenesis by starch digestion is poorly understood due to substrate structural and chemical complexity, as well as the multiplicity of participating enzymes. Our objective was dissection of luminal and mucosal alpha-glucosidase activities participating in digestion of the soluble starch product maltodextrin (MDx). PATIENTS AND METHODS: Immunoprecipitated assays were performed on biopsy specimens and isolated enterocytes with MDx substrate. RESULTS: Mucosal sucrase-isomaltase (SI) and maltase-glucoamylase (MGAM) contributed 85% of total in vitro alpha-glucogenesis. Recombinant human pancreatic alpha-amylase alone contributed <15% of in vitro alpha-glucogenesis; however, alpha-amylase strongly amplified the mucosal alpha-glucogenic activities by preprocessing of starch to short glucose oligomer substrates. At low glucose oligomer concentrations, MGAM was 10 times more active than SI, but at higher concentrations it experienced substrate inhibition whereas SI was not affected. The in vitro results indicated that MGAM activity is inhibited by alpha-amylase digested starch product "brake" and contributes only 20% of mucosal alpha-glucogenic activity. SI contributes most of the alpha-glucogenic activity at higher oligomer substrate concentrations. CONCLUSIONS: MGAM primes and SI activity sustains and constrains prandial alpha-glucogenesis from starch oligomers at approximately 5% of the uninhibited rate. This coupled mucosal mechanism may contribute to highly efficient glucogenesis from low-starch diets and play a role in meeting the high requirement for glucose during children's brain maturation. The brake could play a constraining role on rates of glucose production from higher-starch diets consumed by an older population at risk for degenerative metabolic disorders.
Abstract:
Electroencephalograms (EEG) are often contaminated with high-amplitude artifacts limiting the usability of the data. Methods that reduce these artifacts are often restricted to certain types of artifacts, or require manual interaction or large training data sets. In this paper we introduce a novel method which is able to eliminate many different types of artifacts without manual intervention. The algorithm first decomposes the signal into different sub-band signals in order to isolate different types of artifacts into specific frequency bands. After the signal is further decomposed with principal component analysis (PCA), an adaptive threshold is applied to eliminate components with high variance corresponding to the dominant artifact activity. Our results show that the algorithm is able to significantly reduce artifacts while preserving the EEG activity. The parameters of the algorithm do not have to be identified for every patient individually, making the method a good candidate for preprocessing in automatic seizure detection and prediction algorithms.
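A minimal sketch of the overall scheme described above (sub-band decomposition, PCA, adaptive variance threshold), assuming an EEG array of shape samples x channels; the band edges, filter order and threshold rule are illustrative choices, not the parameters of the published method:

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.decomposition import PCA

def clean_band(band_data, var_factor=3.0):
    """Project one sub-band (samples x channels) onto principal components,
    zero components whose variance exceeds an adaptive threshold, reconstruct."""
    pca = PCA()
    scores = pca.fit_transform(band_data)
    var = scores.var(axis=0)
    scores[:, var > var_factor * np.median(var)] = 0.0  # drop dominant artifact components
    return pca.inverse_transform(scores)

def reduce_artifacts(eeg, fs=256.0, bands=((1, 4), (4, 8), (8, 13), (13, 30))):
    """Split an EEG recording (samples x channels) into frequency sub-bands,
    clean each band with PCA variance thresholding, and sum the bands back up."""
    cleaned = np.zeros_like(eeg, dtype=float)
    for lo, hi in bands:
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, eeg, axis=0)
        cleaned += clean_band(band)
    return cleaned

# usage: 10 s of 8-channel EEG with a simulated high-amplitude artifact on one channel
fs, n = 256.0, 2560
eeg = np.random.randn(n, 8)
eeg[1000:1100, 2] += 50.0
print(reduce_artifacts(eeg, fs).shape)  # (2560, 8)
```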
Abstract:
Many methodologies dealing with the prediction or simulation of soft tissue deformations on medical image data require preprocessing of the data in order to produce a different shape representation that complies with standard methodologies, such as mass-spring networks or finite element methods (FEM). On the other hand, methodologies working directly in the image space normally do not take into account the mechanical behavior of tissues and tend to lack the physical foundations driving soft tissue deformations. This chapter presents a method to simulate soft tissue deformations based on coupled concepts from image analysis and mechanics theory. The proposed methodology is based on a robust stochastic approach that takes into account material properties retrieved directly from the image, concepts from continuum mechanics, and FEM. The optimization framework is solved within a hierarchical Markov random field (HMRF), which is implemented on the graphics processing unit (GPU).
Abstract:
In reverse logistics networks, products (e.g., bottles or containers) have to be transported from a depot to customer locations and, after use, from customer locations back to the depot. In order to operate in an economically beneficial way, companies prefer a simultaneous delivery and pick-up service. The resulting Vehicle Routing Problem with Simultaneous Delivery and Pick-up (VRPSDP) is an operational problem which has to be solved daily by many companies. We present two mixed-integer linear model formulations for the VRPSDP, namely a vehicle-flow and a commodity-flow model. In order to strengthen the models, domain-reducing preprocessing techniques and effective cutting planes are outlined. Symmetric benchmark instances known from the literature as well as new asymmetric instances derived from real-world problems are solved to optimality using CPLEX 12.1.
Abstract:
In process industries, make-and-pack production is used to produce food and beverages, chemicals, and metal products, among others. This type of production process allows the fabrication of a wide range of products in relatively small amounts using the same equipment. In this article, we consider a real-world production process (cf. Honkomp et al. 2000. The curse of reality – why process scheduling optimization problems are difficult in practice. Computers & Chemical Engineering, 24, 323–328.) comprising sequence-dependent changeover times, multipurpose storage units with limited capacities, quarantine times, batch splitting, partial equipment connectivity, and transfer times. The planning problem consists of computing a production schedule such that a given demand of packed products is fulfilled, all technological constraints are satisfied, and the production makespan is minimised. None of the models in the literature covers all of the technological constraints that occur in such make-and-pack production processes. To close this gap, we develop an efficient mixed-integer linear programming model that is based on a continuous time domain and general-precedence variables. We propose novel types of symmetry-breaking constraints and a preprocessing procedure to improve the model performance. In an experimental analysis, we show that small- and moderate-sized instances can be solved to optimality within short CPU times.
Abstract:
OBJECTIVES Molecular subclassification of non-small-cell lung cancer (NSCLC) is essential to improve clinical outcome. This study assessed the prognostic and predictive value of circulating micro-RNA (miRNA) in patients with non-squamous NSCLC enrolled in the phase II SAKK (Swiss Group for Clinical Cancer Research) trial 19/05, receiving uniform treatment with first-line bevacizumab and erlotinib followed by platinum-based chemotherapy at progression. MATERIALS AND METHODS Fifty patients from SAKK 19/05 with baseline and 24 h blood samples were included. The primary study endpoint was to identify prognostic miRNAs (overall survival, OS). Patient samples were analyzed with Agilent human miRNA 8x60K microarrays, each glass slide formatted with eight high-definition 60K arrays. Each array contained 40 probes targeting each of the 1347 miRNAs. Data preprocessing included quantile normalization using the robust multi-array average (RMA) algorithm. Prognostic and predictive miRNA expression profiles were identified by Spearman's rank correlation test (percentage tumor shrinkage) or log-rank testing (for time-to-event endpoints). RESULTS Data preprocessing kept 49 patients and 424 miRNAs for further analysis. Ten miRNAs were significantly associated with OS, with hsa-miR-29a being the strongest prognostic marker (HR=6.44, 95%-CI 2.39-17.33). Patients with high hsa-miR-29a expression had a significantly lower survival at 10 months compared to patients with low expression (54% versus 83%). Six of the 10 miRNAs (hsa-miR-29a, hsa-miR-542-5p, hsa-miR-502-3p, hsa-miR-376a, hsa-miR-500a, hsa-miR-424) were insensitive to perturbations according to jackknife cross-validation on their HR for OS. The respective principal component analysis (PCA) defined a meta-miRNA signature including the same 6 miRNAs, resulting in a HR of 0.66 (95%-CI 0.53-0.82). CONCLUSION Cell-free circulating miRNA profiling successfully identified a highly prognostic 6-gene signature in patients with advanced non-squamous NSCLC. Circulating miRNA profiling should be further validated in external cohorts for the selection and monitoring of systemic treatment in patients with advanced NSCLC.
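A minimal NumPy sketch of the quantile normalization step mentioned in the preprocessing (forcing every array to share the same empirical distribution); this is a generic illustration, not the RMA implementation used in the study:

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize a (probes x arrays) matrix: every array (column)
    is mapped onto the mean of the sorted columns, so all columns end up
    with the same empirical distribution."""
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)   # rank of each value within its column
    reference = np.sort(x, axis=0).mean(axis=1)         # mean distribution across arrays
    return reference[ranks]

# usage: 5 probes measured on 3 arrays with different overall intensity scales
raw = np.array([[2.0, 4.0, 8.0],
                [1.0, 2.0, 4.0],
                [3.0, 8.0, 6.0],
                [4.0, 6.0, 2.0],
                [5.0, 10.0, 10.0]])
print(quantile_normalize(raw))  # every column now contains the same set of values
```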