970 resultados para hierarchical classification structures
Resumo:
Several popular Machine Learning techniques are originally designed for the solution of two-class problems. However, several classification problems have more than two classes. One approach to deal with multiclass problems using binary classifiers is to decompose the multiclass problem into multiple binary sub-problems disposed in a binary tree. This approach requires a binary partition of the classes for each node of the tree, which defines the tree structure. This paper presents two algorithms to determine the tree structure taking into account information collected from the used dataset. This approach allows the tree structure to be determined automatically for any multiclass dataset.
Resumo:
G protein-coupled receptors (GPCRs) play important physiological roles transducing extracellular signals into intracellular responses. Approximately 50% of all marketed drugs target a GPCR. There remains considerable interest in effectively predicting the function of a GPCR from its primary sequence.
Resumo:
We address the important bioinformatics problem of predicting protein function from a protein's primary sequence. We consider the functional classification of G-Protein-Coupled Receptors (GPCRs), whose functions are specified in a class hierarchy. We tackle this task using a novel top-down hierarchical classification system where, for each node in the class hierarchy, the predictor attributes to be used in that node and the classifier to be applied to the selected attributes are chosen in a data-driven manner. Compared with a previous hierarchical classification system selecting classifiers only, our new system significantly reduced processing time without significantly sacrificing predictive accuracy.
Resumo:
This dissertation investigates the very important and current problem of modelling human expertise. This is an apparent issue in any computer system emulating human decision making. It is prominent in Clinical Decision Support Systems (CDSS) due to the complexity of the induction process and the vast number of parameters in most cases. Other issues such as human error and missing or incomplete data present further challenges. In this thesis, the Galatean Risk Screening Tool (GRiST) is used as an example of modelling clinical expertise and parameter elicitation. The tool is a mental health clinical record management system with a top layer of decision support capabilities. It is currently being deployed by several NHS mental health trusts across the UK. The aim of the research is to investigate the problem of parameter elicitation by inducing them from real clinical data rather than from the human experts who provided the decision model. The induced parameters provide an insight into both the data relationships and how experts make decisions themselves. The outcomes help further understand human decision making and, in particular, help GRiST provide more accurate emulations of risk judgements. Although the algorithms and methods presented in this dissertation are applied to GRiST, they can be adopted for other human knowledge engineering domains.
Resumo:
MOTIVATION: G protein-coupled receptors (GPCRs) play an important role in many physiological systems by transducing an extracellular signal into an intracellular response. Over 50% of all marketed drugs are targeted towards a GPCR. There is considerable interest in developing an algorithm that could effectively predict the function of a GPCR from its primary sequence. Such an algorithm is useful not only in identifying novel GPCR sequences but in characterizing the interrelationships between known GPCRs. RESULTS: An alignment-free approach to GPCR classification has been developed using techniques drawn from data mining and proteochemometrics. A dataset of over 8000 sequences was constructed to train the algorithm. This represents one of the largest GPCR datasets currently available. A predictive algorithm was developed based upon the simplest reasonable numerical representation of the protein's physicochemical properties. A selective top-down approach was developed, which used a hierarchical classifier to assign sequences to subdivisions within the GPCR hierarchy. The predictive performance of the algorithm was assessed against several standard data mining classifiers and further validated against Support Vector Machine-based GPCR prediction servers. The selective top-down approach achieves significantly higher accuracy than standard data mining methods in almost all cases.
Resumo:
In this thesis we address a multi-label hierarchical text classification problem in a low-resource setting and explore different approaches to identify the best one for our case. The goal is to train a model that classifies English school exercises according to a hierarchical taxonomy with few labeled data. The experiments made in this work employ different machine learning models and text representation techniques: CatBoost with tf-idf features, classifiers based on pre-trained models (mBERT, LASER), and SetFit, a framework for few-shot text classification. SetFit proved to be the most promising approach, achieving better performance when during training only a few labeled examples per class are available. However, this thesis does not consider all the hierarchical taxonomy, but only the first two levels: to address classification with the classes at the third level further experiments should be carried out, exploring methods for zero-shot text classification, data augmentation, and strategies to exploit the hierarchical structure of the taxonomy during training.
Resumo:
Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies
Resumo:
Numerical analyses (correspondence analysis, ascending hierarchical classification, and cladistics) were done with morphological characters of adult phlebotomine sand flies. The resulting classification largely confirms that of classical taxonomy for supra-specific groups from the Old World, though the positions of some groups are adjusted. The taxa Spelaeophlebotomus Theodor 1948, Idiophlebotomus Quate & Fairchild 1961, Australophlebotomus Theodor 1948 and Chinius Leng 1987 are notably distinct from other Old World groups, particularly from the genus Phlebotomus Rondani & Berté 1840. Spelaeomyia Theodor 1948 and, in particular, Parvidens Theodor & Mesghali 1964 are clearly separate from Sergentomyia França & Parrot 1920.
Resumo:
Automatic classification of makams from symbolic data is a rarely studied topic. In this paper, first a review of an n-gram based approach is presented using various representations of the symbolic data. While a high degree of precision can be obtained, confusion happens mainly for makams using (almost) the same scale and pitch hierarchy but differ in overall melodic progression, seyir. To further improve the system, first n-gram based classification is tested for various sections of the piece to take into account a feature of the seyir that melodic progression starts in a certain region of the scale. In a second test, a hierarchical classification structure is designed which uses n-grams and seyir features in different levels to further improve the system.
Resumo:
The main objective of this study was todo a statistical analysis of ecological type from optical satellite data, using Tipping's sparse Bayesian algorithm. This thesis uses "the Relevence Vector Machine" algorithm in ecological classification betweenforestland and wetland. Further this bi-classification technique was used to do classification of many other different species of trees and produces hierarchical classification of entire subclasses given as a target class. Also, we carried out an attempt to use airborne image of same forest area. Combining it with image analysis, using different image processing operation, we tried to extract good features and later used them to perform classification of forestland and wetland.
Resumo:
MOTIVATION: Lipids are a large and diverse group of biological molecules with roles in membrane formation, energy storage and signaling. Cellular lipidomes may contain tens of thousands of structures, a staggering degree of complexity whose significance is not yet fully understood. High-throughput mass spectrometry-based platforms provide a means to study this complexity, but the interpretation of lipidomic data and its integration with prior knowledge of lipid biology suffers from a lack of appropriate tools to manage the data and extract knowledge from it. RESULTS: To facilitate the description and exploration of lipidomic data and its integration with prior biological knowledge, we have developed a knowledge resource for lipids and their biology-SwissLipids. SwissLipids provides curated knowledge of lipid structures and metabolism which is used to generate an in silico library of feasible lipid structures. These are arranged in a hierarchical classification that links mass spectrometry analytical outputs to all possible lipid structures, metabolic reactions and enzymes. SwissLipids provides a reference namespace for lipidomic data publication, data exploration and hypothesis generation. The current version of SwissLipids includes over 244 000 known and theoretically possible lipid structures, over 800 proteins, and curated links to published knowledge from over 620 peer-reviewed publications. We are continually updating the SwissLipids hierarchy with new lipid categories and new expert curated knowledge. AVAILABILITY: SwissLipids is freely available at http://www.swisslipids.org/. CONTACT: alan.bridge@isb-sib.ch SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Resumo:
Les milieux humides remplissent plusieurs fonctions écologiques d’importance et contribuent à la biodiversité de la faune et de la flore. Même s’il existe une reconnaissance croissante sur l’importante de protéger ces milieux, il n’en demeure pas moins que leur intégrité est encore menacée par la pression des activités humaines. L’inventaire et le suivi systématique des milieux humides constituent une nécessité et la télédétection est le seul moyen réaliste d’atteindre ce but. L’objectif de cette thèse consiste à contribuer et à améliorer la caractérisation des milieux humides en utilisant des données satellites acquises par des radars polarimétriques en bande L (ALOS-PALSAR) et C (RADARSAT-2). Cette thèse se fonde sur deux hypothèses (chap. 1). La première hypothèse stipule que les classes de physionomies végétales, basées sur la structure des végétaux, sont plus appropriées que les classes d’espèces végétales car mieux adaptées au contenu informationnel des images radar polarimétriques. La seconde hypothèse stipule que les algorithmes de décompositions polarimétriques permettent une extraction optimale de l’information polarimétrique comparativement à une approche multipolarisée basée sur les canaux de polarisation HH, HV et VV (chap. 3). En particulier, l’apport de la décomposition incohérente de Touzi pour l’inventaire et le suivi de milieux humides est examiné en détail. Cette décomposition permet de caractériser le type de diffusion, la phase, l’orientation, la symétrie, le degré de polarisation et la puissance rétrodiffusée d’une cible à l’aide d’une série de paramètres extraits d’une analyse des vecteurs et des valeurs propres de la matrice de cohérence. La région du lac Saint-Pierre a été sélectionnée comme site d’étude étant donné la grande diversité de ses milieux humides qui y couvrent plus de 20 000 ha. L’un des défis posés par cette thèse consiste au fait qu’il n’existe pas de système standard énumérant l’ensemble possible des classes physionomiques ni d’indications précises quant à leurs caractéristiques et dimensions. Une grande attention a donc été portée à la création de ces classes par recoupement de sources de données diverses et plus de 50 espèces végétales ont été regroupées en 9 classes physionomiques (chap. 7, 8 et 9). Plusieurs analyses sont proposées pour valider les hypothèses de cette thèse (chap. 9). Des analyses de sensibilité par diffusiogramme sont utilisées pour étudier les caractéristiques et la dispersion des physionomies végétales dans différents espaces constitués de paramètres polarimétriques ou canaux de polarisation (chap. 10 et 12). Des séries temporelles d’images RADARSAT-2 sont utilisées pour approfondir la compréhension de l’évolution saisonnière des physionomies végétales (chap. 12). L’algorithme de la divergence transformée est utilisé pour quantifier la séparabilité entre les classes physionomiques et pour identifier le ou les paramètres ayant le plus contribué(s) à leur séparabilité (chap. 11 et 13). Des classifications sont aussi proposées et les résultats comparés à une carte existante des milieux humide du lac Saint-Pierre (14). Finalement, une analyse du potentiel des paramètres polarimétrique en bande C et L est proposé pour le suivi de l’hydrologie des tourbières (chap. 15 et 16). Les analyses de sensibilité montrent que les paramètres de la 1re composante, relatifs à la portion dominante (polarisée) du signal, sont suffisants pour une caractérisation générale des physionomies végétales. Les paramètres des 2e et 3e composantes sont cependant nécessaires pour obtenir de meilleures séparabilités entre les classes (chap. 11 et 13) et une meilleure discrimination entre milieux humides et milieux secs (chap. 14). Cette thèse montre qu’il est préférable de considérer individuellement les paramètres des 1re, 2e et 3e composantes plutôt que leur somme pondérée par leurs valeurs propres respectives (chap. 10 et 12). Cette thèse examine également la complémentarité entre les paramètres de structure et ceux relatifs à la puissance rétrodiffusée, souvent ignorée et normalisée par la plupart des décompositions polarimétriques. La dimension temporelle (saisonnière) est essentielle pour la caractérisation et la classification des physionomies végétales (chap. 12, 13 et 14). Des images acquises au printemps (avril et mai) sont nécessaires pour discriminer les milieux secs des milieux humides alors que des images acquises en été (juillet et août) sont nécessaires pour raffiner la classification des physionomies végétales. Un arbre hiérarchique de classification développé dans cette thèse constitue une synthèse des connaissances acquises (chap. 14). À l’aide d’un nombre relativement réduit de paramètres polarimétriques et de règles de décisions simples, il est possible d’identifier, entre autres, trois classes de bas marais et de discriminer avec succès les hauts marais herbacés des autres classes physionomiques sans avoir recours à des sources de données auxiliaires. Les résultats obtenus sont comparables à ceux provenant d’une classification supervisée utilisant deux images Landsat-5 avec une exactitude globale de 77.3% et 79.0% respectivement. Diverses classifications utilisant la machine à vecteurs de support (SVM) permettent de reproduire les résultats obtenus avec l’arbre hiérarchique de classification. L’exploitation d’une plus forte dimensionalitée par le SVM, avec une précision globale maximale de 79.1%, ne permet cependant pas d’obtenir des résultats significativement meilleurs. Finalement, la phase de la décomposition de Touzi apparaît être le seul paramètre (en bande L) sensible aux variations du niveau d’eau sous la surface des tourbières ouvertes (chap. 16). Ce paramètre offre donc un grand potentiel pour le suivi de l’hydrologie des tourbières comparativement à la différence de phase entre les canaux HH et VV. Cette thèse démontre que les paramètres de la décomposition de Touzi permettent une meilleure caractérisation, de meilleures séparabilités et de meilleures classifications des physionomies végétales des milieux humides que les canaux de polarisation HH, HV et VV. Le regroupement des espèces végétales en classes physionomiques est un concept valable. Mais certaines espèces végétales partageant une physionomie similaire, mais occupant un milieu différent (haut vs bas marais), ont cependant présenté des différences significatives quant aux propriétés de leur rétrodiffusion.
Resumo:
Hierarchical knowledge structures are frequently used within clinical decision support systems as part of the model for generating intelligent advice. The nodes in the hierarchy inevitably have varying influence on the decisionmaking processes, which needs to be reflected by parameters. If the model has been elicited from human experts, it is not feasible to ask them to estimate the parameters because there will be so many in even moderately-sized structures. This paper describes how the parameters could be obtained from data instead, using only a small number of cases. The original method [1] is applied to a particular web-based clinical decision support system called GRiST, which uses its hierarchical knowledge to quantify the risks associated with mental-health problems. The knowledge was elicited from multidisciplinary mental-health practitioners but the tree has several thousand nodes, all requiring an estimation of their relative influence on the assessment process. The method described in the paper shows how they can be obtained from about 200 cases instead. It greatly reduces the experts’ elicitation tasks and has the potential for being generalised to similar knowledge-engineering domains where relative weightings of node siblings are part of the parameter space.
Resumo:
Technological advances combined with healthcare assistance bring increased risks related to patient safety, causing health institutions to be environments susceptible to losses in the provided care. Sectors of high complexity, such as Intensive Care Units have such characteristics highlighted due to being spaces designed for the care of patients in serious medical condition, when the use of advanced technological devices becomes a necessity. Thus, the aim of this study was to assess nursing care from the perspective of patient safety in intensive care units. This is an evaluative research, which combines various forms of data collection and analysis in order to conduct a deepened investigation. Data collection occurred in loco, from April to July 2014 in hospitals equipped with adult intensive care unit services. For this, a checklist instrument and semi-structured interviews conducted with patients, families, professionals were used in order to evaluate the structure-process-outcome triad. The instrument for nursing care assessment regarding Patient Safety included 97 questions related to structure and processes. Interviews provided data for outcome analysis. The selection of interviewees/participants was based on the willingness of potential participants. The following methods were used to collect data resulting from the instrument: statistical analysis of inter-rater reliability measure known as kappa (K); observations from judges resulting from the observation process; and added information obtained from the literature on the thematic. Data analysis from the interviews was carried out with IRAMUTEQ software, which used Descending Hierarchical Classification and Similarity analysis to aid in data interpretation. Research steps followed the ethical principles presented by Resolution No. 466 of December 12, 2012, and the results were presented in three manuscripts: 1) Evaluation of patient safety in Intensive Care Units: a focus on structure; 2) Health evaluation processes: a nursing care perspective on patient safety; 3) Patient safety in intensive care units: perception of nurses, family members and patients. The first article, related to the structure, refers to the use of 24 items of the employed instrument, showing that most of the findings were not aligned with the adequacy standards, which indicates poor conditions in structures offered in health services. The second article provides an analysis of the pillar of Processes, with the use of 73 items of the instrument, showing that 50 items did not meet the required standards for safe handling due to the absence of adequate scientific guidance and effective communication in nursing care process. For the third article, results indicate that intensive care units were safe places, yet urges for changes, especially in the physical structure and availability of materials and communication among professionals, patients and families. Therefore, our findings suggest that the nursing care being provided in the evaluated intensive care units contains troubling shortcomings with regards to patient safety, thereby evidencing an insecure setting for the assistance offered, in addition to a need for urgent measures to remedy the identified inadequacies with appropriate structures and implement protocols and care guidelines in order to consolidate an environment more favorable to patient safety.