778 resultados para Machine Learning. Semissupervised learning. Multi-label classification. Reliability Parameter
Resumo:
Urban regeneration is more and more a “universal issue” and a crucial factor in the new trends of urban planning. It is no longer only an area of study and research; it became part of new urban and housing policies. Urban regeneration involves complex decisions as a consequence of the multiple dimensions of the problems that include special technical requirements, safety concerns, socio-economic, environmental, aesthetic, and political impacts, among others. This multi-dimensional nature of urban regeneration projects and their large capital investments justify the development and use of state-of-the-art decision support methodologies to assist decision makers. This research focuses on the development of a multi-attribute approach for the evaluation of building conservation status in urban regeneration projects, thus supporting decision makers in their analysis of the problem and in the definition of strategies and priorities of intervention. The methods presented can be embedded into a Geographical Information System for visualization of results. A real-world case study was used to test the methodology, whose results are also presented.
Resumo:
The study of electricity markets operation has been gaining an increasing importance in last years, as result of the new challenges that the electricity markets restructuring produced. This restructuring increased the competitiveness of the market, but with it its complexity. The growing complexity and unpredictability of the market’s evolution consequently increases the decision making difficulty. Therefore, the intervenient entities are forced to rethink their behaviour and market strategies. Currently, lots of information concerning electricity markets is available. These data, concerning innumerous regards of electricity markets operation, is accessible free of charge, and it is essential for understanding and suitably modelling electricity markets. This paper proposes a tool which is able to handle, store and dynamically update data. The development of the proposed tool is expected to be of great importance to improve the comprehension of electricity markets and the interactions among the involved entities.
Resumo:
An extensive set of machine learning and pattern classification techniques trained and tested on KDD dataset failed in detecting most of the user-to-root attacks. This paper aims to provide an approach for mitigating negative aspects of the mentioned dataset, which led to low detection rates. Genetic algorithm is employed to implement rules for detecting various types of attacks. Rules are formed of the features of the dataset identified as the most important ones for each attack type. In this way we introduce high level of generality and thus achieve high detection rates, but also gain high reduction of the system training time. Thenceforth we re-check the decision of the user-to- root rules with the rules that detect other types of attacks. In this way we decrease the false-positive rate. The model was verified on KDD 99, demonstrating higher detection rates than those reported by the state- of-the-art while maintaining low false-positive rate.
Resumo:
Multi-parametric and quantitative magnetic resonance imaging (MRI) techniques have come into the focus of interest, both as a research and diagnostic modality for the evaluation of patients suffering from mild cognitive decline and overt dementia. In this study we address the question, if disease related quantitative magnetization transfer effects (qMT) within the intra- and extracellular matrices of the hippocampus may aid in the differentiation between clinically diagnosed patients with Alzheimer disease (AD), patients with mild cognitive impairment (MCI) and healthy controls. We evaluated 22 patients with AD (n=12) and MCI (n=10) and 22 healthy elderly (n=12) and younger (n=10) controls with multi-parametric MRI. Neuropsychological testing was performed in patients and elderly controls (n=34). In order to quantify the qMT effects, the absorption spectrum was sampled at relevant off-resonance frequencies. The qMT-parameters were calculated according to a two-pool spin-bath model including the T1- and T2 relaxation parameters of the free pool, determined in separate experiments. Histograms (fixed bin-size) of the normalized qMT-parameter values (z-scores) within the anterior and posterior hippocampus (hippocampal head and body) were subjected to a fuzzy-c-means classification algorithm with downstreamed PCA projection. The within-cluster sums of point-to-centroid distances were used to examine the effects of qMT- and diffusion anisotropy parameters on the discrimination of healthy volunteers, patients with Alzheimer and MCIs. The qMT-parameters T2(r) (T2 of the restricted pool) and F (fractional pool size) differentiated between the three groups (control, MCI and AD) in the anterior hippocampus. In our cohort, the MT ratio, as proposed in previous reports, did not differentiate between MCI and AD or healthy controls and MCI, but between healthy controls and AD.
Resumo:
La tomografía axial computerizada (TAC) es la modalidad de imagen médica preferente para el estudio de enfermedades pulmonares y el análisis de su vasculatura. La segmentación general de vasos en pulmón ha sido abordada en profundidad a lo largo de los últimos años por la comunidad científica que trabaja en el campo de procesamiento de imagen; sin embargo, la diferenciación entre irrigaciones arterial y venosa es aún un problema abierto. De hecho, la separación automática de arterias y venas está considerado como uno de los grandes retos futuros del procesamiento de imágenes biomédicas. La segmentación arteria-vena (AV) permitiría el estudio de ambas irrigaciones por separado, lo cual tendría importantes consecuencias en diferentes escenarios médicos y múltiples enfermedades pulmonares o estados patológicos. Características como la densidad, geometría, topología y tamaño de los vasos sanguíneos podrían ser analizados en enfermedades que conllevan remodelación de la vasculatura pulmonar, haciendo incluso posible el descubrimiento de nuevos biomarcadores específicos que aún hoy en dípermanecen ocultos. Esta diferenciación entre arterias y venas también podría ayudar a la mejora y el desarrollo de métodos de procesamiento de las distintas estructuras pulmonares. Sin embargo, el estudio del efecto de las enfermedades en los árboles arterial y venoso ha sido inviable hasta ahora a pesar de su indudable utilidad. La extrema complejidad de los árboles vasculares del pulmón hace inabordable una separación manual de ambas estructuras en un tiempo realista, fomentando aún más la necesidad de diseñar herramientas automáticas o semiautomáticas para tal objetivo. Pero la ausencia de casos correctamente segmentados y etiquetados conlleva múltiples limitaciones en el desarrollo de sistemas de separación AV, en los cuales son necesarias imágenes de referencia tanto para entrenar como para validar los algoritmos. Por ello, el diseño de imágenes sintéticas de TAC pulmonar podría superar estas dificultades ofreciendo la posibilidad de acceso a una base de datos de casos pseudoreales bajo un entorno restringido y controlado donde cada parte de la imagen (incluyendo arterias y venas) está unívocamente diferenciada. En esta Tesis Doctoral abordamos ambos problemas, los cuales están fuertemente interrelacionados. Primero se describe el diseño de una estrategia para generar, automáticamente, fantomas computacionales de TAC de pulmón en humanos. Partiendo de conocimientos a priori, tanto biológicos como de características de imagen de CT, acerca de la topología y relación entre las distintas estructuras pulmonares, el sistema desarrollado es capaz de generar vías aéreas, arterias y venas pulmonares sintéticas usando métodos de crecimiento iterativo, que posteriormente se unen para formar un pulmón simulado con características realistas. Estos casos sintéticos, junto a imágenes reales de TAC sin contraste, han sido usados en el desarrollo de un método completamente automático de segmentación/separación AV. La estrategia comprende una primera extracción genérica de vasos pulmonares usando partículas espacio-escala, y una posterior clasificación AV de tales partículas mediante el uso de Graph-Cuts (GC) basados en la similitud con arteria o vena (obtenida con algoritmos de aprendizaje automático) y la inclusión de información de conectividad entre partículas. La validación de los fantomas pulmonares se ha llevado a cabo mediante inspección visual y medidas cuantitativas relacionadas con las distribuciones de intensidad, dispersión de estructuras y relación entre arterias y vías aéreas, los cuales muestran una buena correspondencia entre los pulmones reales y los generados sintéticamente. La evaluación del algoritmo de segmentación AV está basada en distintas estrategias de comprobación de la exactitud en la clasificación de vasos, las cuales revelan una adecuada diferenciación entre arterias y venas tanto en los casos reales como en los sintéticos, abriendo así un amplio abanico de posibilidades en el estudio clínico de enfermedades cardiopulmonares y en el desarrollo de metodologías y nuevos algoritmos para el análisis de imágenes pulmonares. ABSTRACT Computed tomography (CT) is the reference image modality for the study of lung diseases and pulmonary vasculature. Lung vessel segmentation has been widely explored by the biomedical image processing community, however, differentiation of arterial from venous irrigations is still an open problem. Indeed, automatic separation of arterial and venous trees has been considered during last years as one of the main future challenges in the field. Artery-Vein (AV) segmentation would be useful in different medical scenarios and multiple pulmonary diseases or pathological states, allowing the study of arterial and venous irrigations separately. Features such as density, geometry, topology and size of vessels could be analyzed in diseases that imply vasculature remodeling, making even possible the discovery of new specific biomarkers that remain hidden nowadays. Differentiation between arteries and veins could also enhance or improve methods processing pulmonary structures. Nevertheless, AV segmentation has been unfeasible until now in clinical routine despite its objective usefulness. The huge complexity of pulmonary vascular trees makes a manual segmentation of both structures unfeasible in realistic time, encouraging the design of automatic or semiautomatic tools to perform the task. However, this lack of proper labeled cases seriously limits in the development of AV segmentation systems, where reference standards are necessary in both algorithm training and validation stages. For that reason, the design of synthetic CT images of the lung could overcome these difficulties by providing a database of pseudorealistic cases in a constrained and controlled scenario where each part of the image (including arteries and veins) is differentiated unequivocally. In this Ph.D. Thesis we address both interrelated problems. First, the design of a complete framework to automatically generate computational CT phantoms of the human lung is described. Starting from biological and imagebased knowledge about the topology and relationships between structures, the system is able to generate synthetic pulmonary arteries, veins, and airways using iterative growth methods that can be merged into a final simulated lung with realistic features. These synthetic cases, together with labeled real CT datasets, have been used as reference for the development of a fully automatic pulmonary AV segmentation/separation method. The approach comprises a vessel extraction stage using scale-space particles and their posterior artery-vein classification using Graph-Cuts (GC) based on arterial/venous similarity scores obtained with a Machine Learning (ML) pre-classification step and particle connectivity information. Validation of pulmonary phantoms from visual examination and quantitative measurements of intensity distributions, dispersion of structures and relationships between pulmonary air and blood flow systems, show good correspondence between real and synthetic lungs. The evaluation of the Artery-Vein (AV) segmentation algorithm, based on different strategies to assess the accuracy of vessel particles classification, reveal accurate differentiation between arteries and vein in both real and synthetic cases that open a huge range of possibilities in the clinical study of cardiopulmonary diseases and the development of methodological approaches for the analysis of pulmonary images.
Resumo:
The blood types determination is essential to perform safe blood transfusions. In emergency situations isadministrated the “universal donor” blood type. However, sometimes, this blood type can cause incom-patibilities in the transfusion receptor. A mechatronic prototype was developed to solve this problem.The prototype was built to meet specific goals, incorporating all the necessary components. The obtainedsolution is close to the final system that will be produced later, at industrial scale, as a medical device.The prototype is a portable and low cost device, and can be used in remote locations. A computer appli-cation, previously developed is used to operate with the developed mechatronic prototype, and obtainautomatically test results. It allows image acquisition, processing and analysis, based on Computer Visionalgorithms, Machine Learning algorithms and deterministic algorithms. The Machine Learning algorithmsenable the classification of occurrence, or alack of agglutination in the mixture (blood/reagents), and amore reliable and a safer methodology as test data are stored in a database. The work developed allowsthe administration of a compatible blood type in emergency situations, avoiding the discontinuity of the“universal donor” blood type stocks, and reducing the occurrence of human errors in the transfusion practice.
Resumo:
Allergy is an overreaction by the immune system to a previously encountered, ordinarily harmless substance - typically proteins - resulting in skin rash, swelling of mucous membranes, sneezing or wheezing, or other abnormal conditions. The use of modified proteins is increasingly widespread: their presence in food, commercial products, such as washing powder, and medical therapeutics and diagnostics, makes predicting and identifying potential allergens a crucial societal issue. The prediction of allergens has been explored widely using bioinformatics, with many tools being developed in the last decade; many of these are freely available online. Here, we report a set of novel models for allergen prediction utilizing amino acid E-descriptors, auto- and cross-covariance transformation, and several machine learning methods for classification, including logistic regression (LR), decision tree (DT), naïve Bayes (NB), random forest (RF), multilayer perceptron (MLP) and k nearest neighbours (kNN). The best performing method was kNN with 85.3% accuracy at 5-fold cross-validation. The resulting model has been implemented in a revised version of the AllerTOP server (http://www.ddg-pharmfac.net/AllerTOP). © Springer-Verlag 2014.
Resumo:
Background: Allergy is a form of hypersensitivity to normally innocuous substances, such as dust, pollen, foods or drugs. Allergens are small antigens that commonly provoke an IgE antibody response. There are two types of bioinformatics-based allergen prediction. The first approach follows FAO/WHO Codex alimentarius guidelines and searches for sequence similarity. The second approach is based on identifying conserved allergenicity-related linear motifs. Both approaches assume that allergenicity is a linearly coded property. In the present study, we applied ACC pre-processing to sets of known allergens, developing alignment-independent models for allergen recognition based on the main chemical properties of amino acid sequences.Results: A set of 684 food, 1,156 inhalant and 555 toxin allergens was collected from several databases. A set of non-allergens from the same species were selected to mirror the allergen set. The amino acids in the protein sequences were described by three z-descriptors (z1, z2 and z3) and by auto- and cross-covariance (ACC) transformation were converted into uniform vectors. Each protein was presented as a vector of 45 variables. Five machine learning methods for classification were applied in the study to derive models for allergen prediction. The methods were: discriminant analysis by partial least squares (DA-PLS), logistic regression (LR), decision tree (DT), naïve Bayes (NB) and k nearest neighbours (kNN). The best performing model was derived by kNN at k = 3. It was optimized, cross-validated and implemented in a server named AllerTOP, freely accessible at http://www.pharmfac.net/allertop. AllerTOP also predicts the most probable route of exposure. In comparison to other servers for allergen prediction, AllerTOP outperforms them with 94% sensitivity.Conclusions: AllerTOP is the first alignment-free server for in silico prediction of allergens based on the main physicochemical properties of proteins. Significantly, as well allergenicity AllerTOP is able to predict the route of allergen exposure: food, inhalant or toxin. © 2013 Dimitrov et al.; licensee BioMed Central Ltd.
Resumo:
Dissertação de mestrado integrado em Engenharia Biomédica (área de especialização em Eletrónica Médica)
Resumo:
This paper presents a novel approach to the automatic classification of very large data sets composed of terahertz pulse transient signals, highlighting their potential use in biochemical, biomedical, pharmaceutical and security applications. Two different types of THz spectra are considered in the classification process. Firstly a binary classification study of poly-A and poly-C ribonucleic acid samples is performed. This is then contrasted with a difficult multi-class classification problem of spectra from six different powder samples that although have fairly indistinguishable features in the optical spectrum, they also possess a few discernable spectral features in the terahertz part of the spectrum. Classification is performed using a complex-valued extreme learning machine algorithm that takes into account features in both the amplitude as well as the phase of the recorded spectra. Classification speed and accuracy are contrasted with that achieved using a support vector machine classifier. The study systematically compares the classifier performance achieved after adopting different Gaussian kernels when separating amplitude and phase signatures. The two signatures are presented as feature vectors for both training and testing purposes. The study confirms the utility of complex-valued extreme learning machine algorithms for classification of the very large data sets generated with current terahertz imaging spectrometers. The classifier can take into consideration heterogeneous layers within an object as would be required within a tomographic setting and is sufficiently robust to detect patterns hidden inside noisy terahertz data sets. The proposed study opens up the opportunity for the establishment of complex-valued extreme learning machine algorithms as new chemometric tools that will assist the wider proliferation of terahertz sensing technology for chemical sensing, quality control, security screening and clinic diagnosis. Furthermore, the proposed algorithm should also be very useful in other applications requiring the classification of very large datasets.
Resumo:
Semi-supervised learning is applied to classification problems where only a small portion of the data items is labeled. In these cases, the reliability of the labels is a crucial factor, because mislabeled items may propagate wrong labels to a large portion or even the entire data set. This paper aims to address this problem by presenting a graph-based (network-based) semi-supervised learning method, specifically designed to handle data sets with mislabeled samples. The method uses teams of walking particles, with competitive and cooperative behavior, for label propagation in the network constructed from the input data set. The proposed model is nature-inspired and it incorporates some features to make it robust to a considerable amount of mislabeled data items. Computer simulations show the performance of the method in the presence of different percentage of mislabeled data, in networks of different sizes and average node degree. Importantly, these simulations reveals the existence of the critical points of the mislabeled subset size, below which the network is free of wrong label contamination, but above which the mislabeled samples start to propagate their labels to the rest of the network. Moreover, numerical comparisons have been made among the proposed method and other representative graph-based semi-supervised learning methods using both artificial and real-world data sets. Interestingly, the proposed method has increasing better performance than the others as the percentage of mislabeled samples is getting larger. © 2012 IEEE.
Resumo:
The computational power is increasing day by day. Despite that, there are some tasks that are still difficult or even impossible for a computer to perform. For example, while identifying a facial expression is easy for a human, for a computer it is an area in development. To tackle this and similar issues, crowdsourcing has grown as a way to use human computation in a large scale. Crowdsourcing is a novel approach to collect labels in a fast and cheap manner, by sourcing the labels from the crowds. However, these labels lack reliability since annotators are not guaranteed to have any expertise in the field. This fact has led to a new research area where we must create or adapt annotation models to handle these weaklylabeled data. Current techniques explore the annotators’ expertise and the task difficulty as variables that influences labels’ correction. Other specific aspects are also considered by noisy-labels analysis techniques. The main contribution of this thesis is the process to collect reliable crowdsourcing labels for a facial expressions dataset. This process consists in two steps: first, we design our crowdsourcing tasks to collect annotators labels; next, we infer the true label from the collected labels by applying state-of-art crowdsourcing algorithms. At the same time, a facial expression dataset is created, containing 40.000 images and respective labels. At the end, we publish the resulting dataset.
Resumo:
Learning of preference relations has recently received significant attention in machine learning community. It is closely related to the classification and regression analysis and can be reduced to these tasks. However, preference learning involves prediction of ordering of the data points rather than prediction of a single numerical value as in case of regression or a class label as in case of classification. Therefore, studying preference relations within a separate framework facilitates not only better theoretical understanding of the problem, but also motivates development of the efficient algorithms for the task. Preference learning has many applications in domains such as information retrieval, bioinformatics, natural language processing, etc. For example, algorithms that learn to rank are frequently used in search engines for ordering documents retrieved by the query. Preference learning methods have been also applied to collaborative filtering problems for predicting individual customer choices from the vast amount of user generated feedback. In this thesis we propose several algorithms for learning preference relations. These algorithms stem from well founded and robust class of regularized least-squares methods and have many attractive computational properties. In order to improve the performance of our methods, we introduce several non-linear kernel functions. Thus, contribution of this thesis is twofold: kernel functions for structured data that are used to take advantage of various non-vectorial data representations and the preference learning algorithms that are suitable for different tasks, namely efficient learning of preference relations, learning with large amount of training data, and semi-supervised preference learning. Proposed kernel-based algorithms and kernels are applied to the parse ranking task in natural language processing, document ranking in information retrieval, and remote homology detection in bioinformatics domain. Training of kernel-based ranking algorithms can be infeasible when the size of the training set is large. This problem is addressed by proposing a preference learning algorithm whose computation complexity scales linearly with the number of training data points. We also introduce sparse approximation of the algorithm that can be efficiently trained with large amount of data. For situations when small amount of labeled data but a large amount of unlabeled data is available, we propose a co-regularized preference learning algorithm. To conclude, the methods presented in this thesis address not only the problem of the efficient training of the algorithms but also fast regularization parameter selection, multiple output prediction, and cross-validation. Furthermore, proposed algorithms lead to notably better performance in many preference learning tasks considered.