921 results for Data Storage Solutions
Abstract:
Background: The use of the knowledge produced by the sciences to promote human health is the main goal of translational medicine. To make this feasible, we need computational methods to handle the large amount of information that arises from bench to bedside and to deal with its heterogeneity. A computational challenge that must be faced is to promote the integration of clinical, socio-demographic and biological data. In this effort, ontologies play an essential role as a powerful artifact for knowledge representation. Chado is a modular ontology-oriented database model that gained popularity due to its robustness and flexibility as a generic platform to store biological data; however, it lacks support for representing clinical and socio-demographic information.
Results: We have implemented an extension of Chado, the Clinical Module, to allow the representation of this kind of information. Our approach consists of a framework for data integration through the use of a common reference ontology. The design of this framework has four levels: the data level, to store the data; the semantic level, to integrate and standardize the data through the use of ontologies; the application level, to manage clinical databases, ontologies and the data integration process; and the web interface level, to allow interaction between the user and the system. The Clinical Module was built based on the Entity-Attribute-Value (EAV) model. We also propose a methodology to migrate data from legacy clinical databases to the integrative framework. A Chado instance was initialized using a relational database management system. The Clinical Module was implemented and the framework was loaded using data from a real clinical research database. Clinical and demographic data, as well as biomaterial data, were obtained from patients with head and neck tumors. We implemented the IPTrans tool, a complete environment for data migration that comprises: the construction of a model describing the legacy clinical data, based on an ontology; an Extraction, Transformation and Load (ETL) process to extract the data from the source clinical database and load it into the Clinical Module of Chado; and the development of a web tool and a Bridge Layer to adapt the web tool to Chado, as well as to other applications.
Conclusions: Open-source computational solutions currently available for translational science do not have a model to represent biomolecular information and are not integrated with existing bioinformatics tools. On the other hand, existing genomic data models do not represent clinical patient data. A framework was developed to support translational research by integrating biomolecular information coming from different "omics" technologies with patients' clinical and socio-demographic data. Such a framework should offer flexibility, compression and robustness. The experiments performed on a use case demonstrated that the proposed system meets the requirements of flexibility and robustness, leading to the desired integration. The Clinical Module can be accessed at http://dcm.ffclrp.usp.br/caib/pg=iptrans.
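As an illustration only, and not the actual Chado Clinical Module schema, the following minimal Python/SQLite sketch shows how an Entity-Attribute-Value (EAV) layout of the kind described above can store heterogeneous clinical and socio-demographic records without schema changes; all table names, attributes and values are hypothetical.

    import sqlite3

    # Minimal EAV sketch: one row per (patient, attribute, value) triple,
    # with attributes drawn from a controlled vocabulary (ontology terms).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE attribute (
        attribute_id  INTEGER PRIMARY KEY,
        ontology_term TEXT NOT NULL            -- e.g. a term from a reference ontology
    );
    CREATE TABLE patient_attribute (
        patient_id   INTEGER NOT NULL,
        attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
        value        TEXT    NOT NULL
    );
    """)

    # Register two attributes and load values for one (hypothetical) patient.
    conn.executemany("INSERT INTO attribute VALUES (?, ?)",
                     [(1, "tumor_site"), (2, "smoking_status")])
    conn.executemany("INSERT INTO patient_attribute VALUES (?, ?, ?)",
                     [(42, 1, "larynx"), (42, 2, "former smoker")])

    # Query every fact recorded for patient 42.
    rows = conn.execute("""
        SELECT a.ontology_term, pa.value
        FROM patient_attribute pa JOIN attribute a USING (attribute_id)
        WHERE pa.patient_id = 42""")
    for term, value in rows:
        print(term, "=", value)

The point of such a layout, as in the Clinical Module, is that new clinical variables can be added by inserting rows rather than altering the schema.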
Abstract:
The development of cloud computing services is speeding up the rate at which organizations outsource their computational services or sell their idle computational resources. Even though migrating to the cloud remains a tempting trend from a financial perspective, there are several other aspects that must be taken into account by companies before they decide to do so. One of the most important aspects is security: while some cloud computing security issues are inherited from the solutions adopted to create such services, many new security questions that are particular to these solutions also arise, including those related to how the services are organized and which kinds of services/data can be placed in the cloud. Aiming to give a better understanding of this complex scenario, in this article we identify and classify the main security concerns and solutions in cloud computing, and propose a taxonomy of security in cloud computing, giving an overview of the current status of security in this emerging technology.
Abstract:
The effects of fluoride, which is present in different oral hygiene products, deserve more investigation because little is known about their impact on the surface of titanium, which is widely used in implantology. This study evaluated the surface of commercially pure titanium (cpTi) after exposure to different concentrations of sodium fluoride (NaF). The hypothesis tested was that different concentrations of NaF applied at different time intervals affect the titanium surface in different ways. The treatments resulted in the following groups: GA (control): immersion in distilled water; GB: immersion in 0.05% NaF for 3 min daily; GC: immersion in 0.2% NaF for 3 min daily; GD: immersion in 0.05% NaF for 3 min every 2 weeks; and GE: immersion in 0.2% NaF for 3 min every 2 weeks. The experiment lasted 60 days. Roughness was measured initially and every 15 days thereafter, up to 60 days. After 60 days, corrosion analysis and anodic polarization were performed. The samples were examined by scanning electron microscopy (SEM). The roughness data were analyzed by ANOVA, and no significant difference was found among groups or among time intervals. The corrosion data (i_corr) were analyzed by the Mann-Whitney test, and significant differences were found between GA and GC, GB and GC, GC and GD, and GC and GE. SEM micrographs showed that the titanium surface exposed to NaF presented corrosion that varied with the different concentrations. This study suggests that the use of a 0.05% NaF solution on cpTi is safe, whereas the 0.2% NaF solution should be carefully evaluated with regard to its daily use.
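For readers unfamiliar with the statistics mentioned above, the sketch below shows how a two-group Mann-Whitney comparison of corrosion currents could be run in Python; the numbers are invented placeholders, not the measurements of this study.

    from scipy.stats import mannwhitneyu

    # Hypothetical corrosion current values (i_corr) for two groups; these are
    # illustrative placeholders only, not the data reported in the study.
    ga_control  = [0.12, 0.10, 0.15, 0.11, 0.13]
    gc_daily_02 = [0.35, 0.41, 0.38, 0.44, 0.39]

    # Two-sided Mann-Whitney U test, the non-parametric comparison used above.
    stat, p_value = mannwhitneyu(ga_control, gc_daily_02, alternative="two-sided")
    print(f"U = {stat}, p = {p_value:.4f}")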
Abstract:
An important requirement for computer systems developed for the agricultural sector is the ability to handle the heterogeneity of data generated in different processes. Most problems related to this heterogeneity arise from the lack of a standard shared by the different computing solutions proposed. An efficient solution is to create a single standard for data exchange. The study of the actual process involved in cotton production was based on research developed by the Brazilian Agricultural Research Corporation (EMBRAPA) that describes all phases, compiled from several theoretical and practical studies related to the cotton crop. The proposition of a standard starts with the identification of the most important classes of data involved in the process and includes an ontology, i.e. a systematization of the concepts related to the production of cotton fiber, resulting in a set of classes, relations, functions and instances. The results are used as a reference for the development of computational tools, transforming implicit knowledge into applications that support the knowledge described. This research is based on data from the Midwest of Brazil. The choice of the cotton process as a case study comes from the fact that Brazil is one of the major players in this market and several improvements are required for system integration in this segment.
Abstract:
In this thesis some multivariate spectroscopic methods for the analysis of solutions are proposed. Spectroscopy and multivariate data analysis form a powerful combination for obtaining both quantitative and qualitative information, and it is shown how spectroscopic techniques in combination with chemometric data evaluation can be used to obtain rapid, simple and efficient analytical methods. These spectroscopic methods, consisting of spectroscopic analysis, a high level of automation and chemometric data evaluation, can lead to analytical methods with a high analytical capacity, for which the term high-capacity analysis (HCA) is suggested. It is further shown how chemometric evaluation of the multivariate data in chromatographic analyses decreases the need for baseline separation. The thesis is based on six papers, and the chemometric tools used are experimental design, principal component analysis (PCA), soft independent modelling of class analogy (SIMCA), partial least squares regression (PLS) and parallel factor analysis (PARAFAC). The analytical techniques utilised are scanning ultraviolet-visible (UV-Vis) spectroscopy, diode array detection (DAD) used in non-column chromatographic diode array UV spectroscopy, high-performance liquid chromatography with diode array detection (HPLC-DAD) and fluorescence spectroscopy. The methods proposed are exemplified in the analysis of pharmaceutical solutions and serum proteins. In Paper I a method is proposed for the determination of the content and identity of the active compound in pharmaceutical solutions by means of UV-Vis spectroscopy, orthogonal signal correction and multivariate calibration with PLS and SIMCA classification. Paper II proposes a new method for the rapid determination of pharmaceutical solutions by the use of non-column chromatographic diode array UV spectroscopy, i.e. a conventional HPLC-DAD system without any chromatographic column connected. In Paper III an investigation is made of the ability of a control sample of known content and identity to diagnose and correct errors in multivariate predictions, something that, together with the use of multivariate residuals, can make it possible to use the same calibration model over time. In Paper IV a method is proposed for the simultaneous determination of serum proteins with fluorescence spectroscopy and multivariate calibration. Paper V proposes a method for the determination of chromatographic peak purity by means of PCA of HPLC-DAD data. In Paper VI PARAFAC is applied to the decomposition of DAD data of some partially separated peaks into the pure chromatographic, spectral and concentration profiles.
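As a schematic illustration of the kind of chemometric evaluation described above (not the actual data or models of Papers I-VI), the sketch below applies PCA to a small synthetic matrix of UV-Vis-like spectra; all dimensions and numbers are invented.

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic "spectra": 20 samples x 50 wavelengths built from two latent
    # components plus noise, mimicking a DAD/UV-Vis data matrix (illustrative only).
    rng = np.random.default_rng(0)
    wl = np.linspace(0, 1, 50)
    band_a = np.exp(-((wl - 0.3) ** 2) / 0.01)     # hypothetical analyte band
    band_b = np.exp(-((wl - 0.7) ** 2) / 0.02)     # hypothetical interferent band
    conc = rng.uniform(0, 1, size=(20, 2))
    spectra = conc @ np.vstack([band_a, band_b]) + 0.01 * rng.normal(size=(20, 50))

    # PCA shows how many independent sources of variation the spectra contain;
    # two dominant components would be consistent with two co-varying species,
    # the kind of reasoning used e.g. for peak-purity assessment.
    pca = PCA(n_components=5).fit(spectra)
    print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))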
Abstract:
Water is one of the most common compounds on Earth and is essential for all biological activity. Water has, however, been a mystery for many years due to its large number of unusual chemical and physical properties, e.g. decreased volume during melting and maximum density at 4 °C. The origin of this anomalous behavior is the nature of the hydrogen bond. This thesis presents an x-ray absorption spectroscopy (XAS) study that reveals the hydrogen-bond structure in liquid water. The x-ray absorption process is faster than a femtosecond and thereby reflects the molecular orbital structure in a frozen geometry locally around the probed water molecules. The results indicate that the electronic structure of liquid water is significantly different from that of the solid and gaseous forms. The molecular arrangement in the first coordination shell of liquid water is in fact very similar to the two-hydrogen-bonded configurations at the surface of ice. This discovery suggests that most molecules in liquid water adopt two-hydrogen-bonded configurations, with one donor and one acceptor hydrogen bond, in contrast to the four-hydrogen-bonded tetrahedral structure in ice. This result is controversial, since the general picture is that the structure of liquid water is very similar to the structure of ice. The results are, however, consistent with x-ray and neutron diffraction data but reveal serious discrepancies with structures based on current molecular dynamics simulations. The two-hydrogen-bond configuration in liquid water is rigid, and heating from 25 °C to 90 °C introduces only a minor change in the hydrogen-bonded configurations. Furthermore, XAS studies of water in aqueous solutions show that ion hydration does not affect the hydrogen-bond configuration of the bulk. Only water molecules in the close vicinity of the ions show changes in hydrogen-bond formation. XAS data obtained with fluorescence yield are sensitive enough to resolve the electronic structure of water molecules in the first hydration sphere and to distinguish between different protonated species. Hence, XAS is a useful tool for providing insight into the local electronic structure of a hydrogen-bonded liquid; it is applied here for the first time to water, revealing unique information of high importance.
Abstract:
In the frame of the restoration of natural populations of Cymodocea nodosa in the Canary Islands, seeds are being collected at natural populations where germination is rather scarce and seasonal after dormancy. We have developed techniques for the in vitro propagation of the collected seeds, consisting of forced seed germination and seedling propagation to obtain mature 20-30 cm plantlets, which are eventually used for restoration. In order to improve the developed methodology, several experiments were conducted to adjust the conditions for seed storage under different temperature regimes without losing germinative potential, to fertilize during propagation with controlled-release NPK fertilizers, and to increase growth by dipping seedlings in solutions of the most common plant hormones.
Abstract:
Acoustic Doppler Current Profilers (ADCPs) have proven to be a useful oceanographic tool in the study of ocean dynamics. Data from D279, a transatlantic hydrographic cruise carried out in spring 2004 along 24.5°N, were processed, and lowered ADCP (LADCP) bottom-track data were used to assess the choice of reference velocity for geostrophic calculations. The reference velocities from different combinations of ADCP data were compared to one another, and a reference velocity was chosen based on the LADCP data. The barotropic tidal component was subtracted to provide a final reference velocity estimated from the LADCP data. The resulting velocity fields are also shown. Further studies involving inverse solutions will include the reference velocity calculated here.
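Purely as a schematic of the reference-velocity step described above, and not the cruise processing code, the sketch below removes a predicted barotropic tidal velocity from a depth-averaged LADCP profile; the numbers are invented placeholders.

    import numpy as np

    # Hypothetical eastward LADCP velocities at several depths (m/s) for one cast,
    # and a predicted barotropic tidal velocity at the cast time/position.
    ladcp_u = np.array([0.12, 0.10, 0.08, 0.05, 0.04])
    tidal_u = 0.03

    # Depth-average the profile and subtract the tide to obtain the reference
    # velocity used to constrain the geostrophic calculation.
    u_reference = ladcp_u.mean() - tidal_u
    print(f"reference velocity: {u_reference:.3f} m/s")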
Abstract:
The increasing diffusion of wireless-enabled portable devices is pushing toward the design of novel service scenarios, promoting temporary and opportunistic interactions in infrastructure-less environments. Mobile Ad Hoc Networks (MANETs) are the general model of these highly dynamic networks, which can be specialized, depending on the application case, into more specific and refined models such as Vehicular Ad Hoc Networks and Wireless Sensor Networks. Two interesting deployment cases are of increasing relevance: resource diffusion among users equipped with portable devices, such as laptops, smart phones or PDAs, in crowded areas (termed dense MANET), and the dissemination/indexing of monitoring information collected in Vehicular Sensor Networks. The extreme dynamicity of these scenarios calls for novel distributed protocols and services that facilitate application development. To this aim we have designed middleware solutions supporting these challenging tasks. REDMAN manages, retrieves, and disseminates replicas of software resources in a dense MANET; it implements novel lightweight protocols to maintain a desired replication degree despite participant mobility, and efficiently performs resource retrieval. REDMAN exploits the high-density assumption to achieve scalability with limited network overhead. Sensed-data gathering and distributed indexing in Vehicular Networks raise similar issues: we propose a specific middleware support, called MobEyes, that exploits node mobility to opportunistically diffuse data summaries among neighboring vehicles. MobEyes creates a low-cost opportunistic distributed index to query the distributed storage and to determine the location of the needed information. Extensive validation and testing of REDMAN and MobEyes prove the effectiveness of our solutions in limiting communication overhead while maintaining the required accuracy of replication degree and indexing completeness, and demonstrate the feasibility of the middleware approach.
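As a toy illustration of the replication-degree maintenance idea, and not REDMAN's actual protocol, the sketch below shows a node-local rule that creates or drops a replica depending on how many replicas the node currently observes among reachable peers; all names and thresholds are hypothetical.

    def replication_action(observed_replicas: int, target_degree: int,
                           holds_replica: bool) -> str:
        # Toy node-local rule keeping the observed replica count near the target:
        # observed_replicas - replicas advertised by currently reachable neighbours
        # target_degree     - desired replication degree for the resource
        # holds_replica     - whether this node already stores a copy
        if observed_replicas < target_degree and not holds_replica:
            return "create local replica"    # under-replicated: volunteer a copy
        if observed_replicas > target_degree and holds_replica:
            return "drop local replica"      # over-replicated: free local storage
        return "no action"

    # Example: 2 replicas seen, target is 4, this node has no copy yet.
    print(replication_action(observed_replicas=2, target_degree=4, holds_replica=False))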
Abstract:
This work makes a theoretical-experimental contribution to the study of ester and alkane solutions. Experimental data on isobaric vapor-liquid equilibria (VLE) are presented at 101.3 kPa for binary systems of methyl ethanoate with six alkanes (from C5 to C10), together with excess volumes and mixing enthalpies, v^E and h^E.
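For clarity on the notation, v^E and h^E denote excess (mixing) quantities, defined in the standard way as the difference between the property of the real mixture and the mole-fraction-weighted sum of the pure-component properties:

    m^{E} = m_{\mathrm{mix}} - \sum_{i} x_{i}\, m_{i}, \qquad m \in \{v, h\}

where x_i are the mole fractions; v^E is thus the excess molar volume and h^E the molar mixing enthalpy.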
Abstract:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has led to enormous progress in the life sciences. Among the most important innovations is microarray technology. It allows the expression of thousands of genes to be quantified simultaneously by measuring the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousands, and a sample size in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex diseases such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. These samples are hybridized to microarrays in an effort to find a small number of genes strongly correlated with the groups of individuals. Even though methods to analyse the data are now well developed and close to reaching a standard organization (through the effort of international projects such as the Microarray Gene Expression Data (MGED) Society [1]), it is not infrequent to encounter a clinician's question for which no compelling statistical method is available. The contribution of this dissertation to deciphering disease is the development of new approaches aimed at handling open problems posed by clinicians in specific experimental designs. Chapter 1, starting from a necessary biological introduction, reviews microarray technologies and all the important steps of an experiment, from the production of the array, through quality controls, to the preprocessing steps used in the data analysis in the rest of the dissertation. Chapter 2 provides a critical review of standard analysis methods, stressing their main open problems. Chapter 3 introduces a method to address the issue of unbalanced design in microarray experiments. In microarray experiments, the experimental design is a crucial starting point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples should be collected for the two classes. However, in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists of a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC). A list of differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each probe is given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to that of SAM and LIMMA [3] over two simulated data sets generated via beta and exponential distributions. The results of all three algorithms over low-noise data sets seem acceptable. However, on a real unbalanced two-channel data set regarding Chronic Lymphocytic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering.
We also report extra-assay validation in terms of differentially expressed genes. Although standard algorithms perform well over low-noise simulated data sets, MultiSAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. Chapter 4 describes a method to address the evaluation of similarities in a three-class problem by means of the Relevance Vector Machine [4]. In fact, looking at microarray data in a prognostic and diagnostic clinical framework, not only differences can play a crucial role. In some cases similarities can give useful, and sometimes even more important, information. The goal, given three classes, could be to establish, with a certain level of confidence, whether the third one is more similar to the first or to the second. In this work we show that the Relevance Vector Machine (RVM) [2] could be a possible solution to the limitations of standard supervised classification. In fact, RVM offers many advantages compared, for example, with its well-known precursor, the Support Vector Machine (SVM) [3]. Among these advantages, the estimate of the posterior probability of class membership represents a key feature for addressing the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on a three-class tumor-grade problem, with 67 samples of grade 1 (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, and then evaluate the third class, G2, as a test set to obtain the probability for each G2 sample of belonging to class G1 or class G3. The analysis showed that breast cancer samples of grade 2 have a molecular profile more similar to breast cancer samples of grade 1. This result had been conjectured in the literature, but no measure of significance had been given before.
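As a schematic of the MultiSAM resampling scheme described above, with an ordinary t-test standing in for SAM purely for illustration, the sketch below repeatedly compares the less populated class with equally sized random subsamples of the more populated class and scores each probe by how often it is called differentially expressed; the data, sizes and thresholds are invented.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)

    # Toy expression matrix: 200 probes, 10 samples in the less populated class (LPC)
    # and 60 in the more populated class (MPC); the first 20 probes are shifted in LPC.
    n_probes, n_lpc, n_mpc = 200, 10, 60
    lpc = rng.normal(size=(n_probes, n_lpc))
    mpc = rng.normal(size=(n_probes, n_mpc))
    lpc[:20] += 1.5                                  # planted differential expression

    n_iter, alpha = 1000, 0.01
    score = np.zeros(n_probes, dtype=int)
    for _ in range(n_iter):
        # Draw a random MPC subsample of the same size as the LPC (as in MultiSAM).
        subset = mpc[:, rng.choice(n_mpc, size=n_lpc, replace=False)]
        _, p = ttest_ind(lpc, subset, axis=1)        # stand-in for the SAM call
        score += (p < alpha).astype(int)             # count how often each probe is called

    print("probes with score > 300:", int((score > 300).sum()))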
Abstract:
Society's increasing aversion to technological risk requires the development of inherently safer and environmentally friendlier processes, while assuring the economic competitiveness of industrial activities. The different forms of impact (e.g. environmental, economic and societal) are frequently characterized by conflicting reduction strategies and must be taken into account holistically in order to identify the optimal solutions in process design. Though the literature reports an extensive discussion of strategies and specific principles, quantitative assessment tools are required to identify the marginal improvements of alternative design options, to allow trade-offs among contradictory aspects and to prevent "risk shift". In the present work a set of integrated quantitative tools for design assessment (i.e. a design support system) was developed. The tools were specifically dedicated to the implementation of sustainability and inherent safety in process and plant design activities, with respect to chemical and industrial processes in which substances dangerous for humans and the environment are used or stored. The tools were mainly devoted to application in the "conceptual" and "basic design" stages, when the project is still open to changes (due to the large number of degrees of freedom), which may include strategies to improve sustainability and inherent safety. The set of developed tools covers different phases of the design activities throughout the lifecycle of a project (inventories, process flow diagrams, preliminary plant layout plans). The development of such tools makes a substantial contribution to filling the present gap in sound support for implementing safety and sustainability in the early phases of process design. The proposed decision support system was based on the development of a set of leading key performance indicators (KPIs), which ensure the assessment of the economic, societal and environmental impacts of a process (i.e. its sustainability profile). The KPIs were based on impact models (including complex ones), but are easy and swift to apply in practice. Their full evaluation is possible even from the limited data available during early process design. Innovative reference criteria were developed to compare and aggregate the KPIs on the basis of the actual site-specific impact burden and the sustainability policy. Particular attention was devoted to the development of reliable criteria and tools for the assessment of inherent safety in different stages of the project lifecycle. The assessment follows an innovative approach to the analysis of inherent safety, based on both the calculation of the expected consequences of potential accidents and the evaluation of the hazards related to equipment. The methodology overcomes several problems present in previously proposed methods for quantitative inherent safety assessment (use of arbitrary indexes, subjective judgement, built-in assumptions, etc.). A specific procedure was defined for the assessment of the hazards related to the formation of undesired substances in chemical systems undergoing "out of control" conditions. In the assessment of layout plans, "ad hoc" tools were developed to account for the hazard of domino escalation and for safety economics.
The effectiveness and value of the tools were demonstrated by applying them to a large number of case studies concerning different kinds of design activities (choice of materials; design of the process, of the plant, and of the layout) and different types of processes/plants (chemical industry, storage facilities, waste disposal). An experimental survey (analysis of the thermal stability of isomers of nitrobenzaldehyde) provided the input data needed to demonstrate the method for the inherent safety assessment of materials.
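As a purely illustrative sketch of how a set of KPIs might be normalized against reference values and aggregated into a single sustainability score (the actual reference criteria and weights of the work are site-specific and different; every name and number below is invented):

    # Hypothetical KPI values for one design option, each paired with a reference
    # value representing the site-specific impact burden it is compared against.
    kpis = {
        "economic":      {"value": 1.2e6, "reference": 1.5e6},  # e.g. annualized cost
        "societal":      {"value": 0.8,   "reference": 1.0},    # e.g. expected accident consequences
        "environmental": {"value": 35.0,  "reference": 50.0},   # e.g. emission-based index
    }

    # Weights expressing the sustainability policy (hypothetical).
    weights = {"economic": 0.3, "societal": 0.4, "environmental": 0.3}

    # Normalize each KPI by its reference (lower is better here) and aggregate.
    normalized = {name: k["value"] / k["reference"] for name, k in kpis.items()}
    profile_score = sum(weights[name] * normalized[name] for name in kpis)

    print({name: round(x, 2) for name, x in normalized.items()})
    print("aggregated sustainability score:", round(profile_score, 3))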
Abstract:
Bioinformatics is a recent and emerging discipline which aims at studying biological problems through computational approaches. Most branches of bioinformatics, such as Genomics, Proteomics and Molecular Dynamics, are particularly computationally intensive, requiring huge amounts of computational resources for running algorithms of ever-increasing complexity over data of ever-increasing size. In the search for computational power, the EGEE Grid platform, the world's largest community of interconnected clusters load-balanced as a whole, seems particularly promising and is considered the new hope for satisfying the ever-increasing computational requirements of bioinformatics, as well as of physics and other computational sciences. The EGEE platform, however, is rather new and not yet free of problems. In addition, specific requirements of bioinformatics need to be addressed in order to use this new platform effectively for bioinformatics tasks. In my three years of Ph.D. work I addressed numerous aspects of this Grid platform, with particular attention to those needed by the bioinformatics domain. I created three major frameworks, Vnas, GridDBManager and SETest, plus an additional smaller standalone solution, to enhance the support for bioinformatics applications in the Grid environment and to reduce the effort needed to create new applications, additionally addressing numerous existing Grid issues and performing a series of optimizations. The Vnas framework is an advanced system for the submission and monitoring of Grid jobs that provides a reliable abstraction over the Grid platform. In addition, Vnas greatly simplifies the development of new Grid applications by providing a callback system that simplifies the creation of arbitrarily complex multi-stage computational pipelines, and provides an abstracted virtual sandbox which bypasses Grid limitations. Vnas also reduces the usage of Grid bandwidth and storage resources by transparently detecting the equality of virtual sandbox files based on content, across different submissions, even when performed by different users. BGBlast, an evolution of the earlier GridBlast project, now provides a Grid Database Manager (GridDBManager) component for managing and automatically updating biological flat-file databases in the Grid environment. GridDBManager offers novel features such as an adaptive replication algorithm that constantly optimizes the number of replicas of the managed databases in the Grid environment, balancing response times (performance) against storage costs according to a programmed cost formula. GridDBManager also provides highly optimized automated management of older versions of the databases based on reverse delta files, which reduces by two orders of magnitude the storage costs required to keep such older versions available in the Grid environment. The SETest framework provides a way for the user to test and regression-test Python applications riddled with side effects (a common case with Grid computational pipelines), which could not easily be tested using the more standard methods of unit testing or test cases. The technique is based on a new concept of datasets containing invocations and results of filtered calls. The framework hence significantly accelerates the development of new applications and computational pipelines for the Grid environment, and reduces the effort required for maintenance. An analysis of the impact of these solutions is provided in this thesis. This Ph.D.
work has resulted in various publications in journals and conference proceedings, as reported in the Appendix. I also presented my work orally at numerous international conferences related to Grid computing and bioinformatics.
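To make the idea of a programmed cost formula for adaptive replication concrete (a toy model, not GridDBManager's actual algorithm; all parameters are invented), the sketch below picks the number of database replicas that minimizes a combined cost of storage and expected response-time penalty:

    def total_cost(n_replicas: int, db_size_gb: float,
                   storage_cost_per_gb: float, response_penalty: float) -> float:
        # Toy cost formula: storage grows linearly with the number of replicas,
        # while the expected response-time penalty shrinks as replicas are added.
        storage = n_replicas * db_size_gb * storage_cost_per_gb
        latency = response_penalty / n_replicas
        return storage + latency

    # Choose the replica count with minimal total cost over a plausible range.
    best_n = min(range(1, 21),
                 key=lambda n: total_cost(n, db_size_gb=50.0,
                                          storage_cost_per_gb=0.1,
                                          response_penalty=40.0))
    print("replicas chosen by the toy cost model:", best_n)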
Abstract:
The dynamicity and heterogeneity that characterize pervasive environments raise new challenges in the design of mobile middleware. Pervasive environments are characterized by a significant degree of heterogeneity, variability, and dynamicity that conventional middleware solutions are not able to manage adequately. Originally designed for use in a relatively static context, such middleware systems tend to hide low-level details in order to provide applications with a transparent view of the underlying execution platform. In mobile environments, however, the context is extremely dynamic and cannot be managed by a priori assumptions. Novel middleware should therefore support mobile computing applications in the task of adapting their behavior to frequent changes in the execution context, that is, it should become context-aware. In particular, this thesis has identified the following key requirements for novel context-aware middleware that existing solutions do not yet fulfil. (i) Middleware solutions should support interoperability between possibly unknown entities by providing expressive representation models that allow interacting entities, their operating conditions and the surrounding world, i.e. their context, to be described according to an unambiguous semantics. (ii) Middleware solutions should support distributed applications in the task of reconfiguring and adapting their behavior/results to ongoing context changes. (iii) Context-aware middleware support should be deployable on heterogeneous devices under variable operating conditions, such as different user needs, application requirements, available connectivity and device computational capabilities, as well as changing environmental conditions. Our main claim is that the adoption of semantic metadata to represent context information and context-dependent adaptation strategies makes it possible to build context-aware middleware suitable for all dynamically available portable devices. Semantic metadata provide powerful knowledge representation means to model even complex context information, and allow automated reasoning to infer additional and/or more complex knowledge from the available context data. In addition, we suggest that, by adopting proper configuration and deployment strategies, semantic support features can be provided to differentiated users and devices according to their specific needs and current context. This thesis has investigated novel design guidelines and implementation options for semantic-based context-aware middleware solutions targeted at pervasive environments. These guidelines have been applied to different application areas within pervasive computing that would particularly benefit from the exploitation of context. Common to all applications is the key role of context in enabling mobile users to personalize applications based on their needs and current situation. The main contributions of this thesis are (i) the definition of a metadata model to represent and reason about context, (ii) the definition of a model for the design and development of context-aware middleware based on semantic metadata, (iii) the design of three novel middleware architectures and the development of a prototype implementation for each of them, and (iv) the proposal of a viable approach to the portability issues raised by the adoption of semantic support services in pervasive applications.
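As a minimal illustration of representing context as semantic metadata and inferring additional knowledge from it (not the thesis's actual metadata model; the vocabulary and the single rule are invented), the sketch below stores context facts as subject-predicate-object triples and applies one forward-chaining adaptation rule until no new facts are produced:

    # Context facts as (subject, predicate, object) triples (hypothetical vocabulary).
    facts = {
        ("deviceA", "hasBattery", "low"),
        ("deviceA", "connectedVia", "bluetooth"),
        ("bluetooth", "isA", "lowBandwidthLink"),
    }

    def infer(triples):
        # One rule: a device on a low-bandwidth link and with a low battery
        # should receive text-only content. Derive facts until a fixed point.
        derived, changed = set(triples), True
        while changed:
            changed = False
            for subj, pred, obj in list(derived):
                if (pred == "connectedVia"
                        and (obj, "isA", "lowBandwidthLink") in derived
                        and (subj, "hasBattery", "low") in derived
                        and (subj, "preferredContent", "textOnly") not in derived):
                    derived.add((subj, "preferredContent", "textOnly"))
                    changed = True
        return derived

    print(("deviceA", "preferredContent", "textOnly") in infer(facts))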
Abstract:
Machine learning comprises a series of techniques for the automatic extraction of meaningful information from large collections of noisy data. In many real-world applications data is naturally represented in structured form, but since traditional machine learning methods deal with vectorial information, they require an a priori preprocessing step. Among the learning techniques for dealing with structured data, kernel methods are recognized as having a strong theoretical background and being effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree-structured data two issues become relevant: kernels for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too little information to make correct predictions on unseen data; in fact, it tends to produce a discriminating function that behaves like the nearest-neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernels, when they are applied to datasets whose node labels belong to a large domain. A second drawback of using tree kernels is the time complexity required in both the learning and classification phases. Such complexity can sometimes prevent the application of the kernel in scenarios involving large amounts of data. This thesis proposes three contributions to resolving the above issues of kernels for trees. The first contribution aims at creating kernel functions that adapt to the statistical properties of the dataset, thus reducing sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower-dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower-dimensional representation, we are able to perform inexact matchings between different inputs in the original space. The second contribution is the proposal of a novel kernel function based on the convolution kernel framework. Convolution kernels measure the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure; the kernel function we propose is, instead, especially focused on this aspect. The third contribution is devoted to reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels and show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Directed Acyclic Graphs can be used to compactly represent shared substructures across different trees, thus reducing the computational burden and storage requirements.
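As a toy illustration of the convolution-style tree kernels discussed above (a plain shared-subtree count, not the kernels proposed in the thesis; the example trees are invented), the sketch below encodes every subtree of two trees canonically and counts the pairs of identical subtrees they share:

    from collections import Counter

    # A tree is a (label, [children]) pair; two small example trees.
    t1 = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", [])])])
    t2 = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", []), ("NP", [("N", [])])])])

    def subtree_counts(tree):
        # Return a Counter of canonical string encodings of every subtree.
        counts = Counter()
        def encode(node):
            label, children = node
            enc = "(" + label + "".join(encode(child) for child in children) + ")"
            counts[enc] += 1
            return enc
        encode(tree)
        return counts

    def subtree_kernel(a, b):
        # Count pairs of identical subtrees shared by the two trees.
        ca, cb = subtree_counts(a), subtree_counts(b)
        return sum(ca[s] * cb[s] for s in ca.keys() & cb.keys())

    print("K(t1, t2) =", subtree_kernel(t1, t2))

A DAG-based variant, as mentioned above, would merge identical encodings across a whole forest so that shared substructures are stored and matched only once.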