771 results for Gender classification model
Abstract:
This thesis presents a boundary-setting method for addressing challenging issues in text classification and producing an effective text classifier. A classifier must identify the boundary between classes optimally. However, even after the features are selected, the boundary remains unclear where positive and negative documents are mixed. A classifier combination method to boost the effectiveness of the classification model is also presented. The experiments carried out in the study demonstrate that the proposed classifier is promising.
Abstract:
Objective: Death certificates provide an invaluable source for cancer mortality statistics; however, this value can only be realised if accurate, quantitative data can be extracted from certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This paper proposes an automatic classification system for identifying cancer-related causes of death from death certificates. Methods: Detailed features, including terms, n-grams and SNOMED CT concepts, were extracted from a collection of 447,336 death certificates. These features were used to train Support Vector Machine classifiers (one classifier for each cancer type). The classifiers were deployed in a cascaded architecture: the first level identified the presence of cancer (i.e., binary cancer/no cancer) and the second level identified the type of cancer (according to the ICD-10 classification system). A held-out test set was used to evaluate the effectiveness of the classifiers according to precision, recall and F-measure. In addition, detailed feature analysis was performed to reveal the characteristics of a successful cancer classification model. Results: The system was highly effective at identifying cancer as the underlying cause of death (F-measure 0.94). The system was also effective at determining the type of cancer for common cancers (F-measure 0.7). Rare cancers, for which there was little training data, were difficult to classify accurately (F-measure 0.12). Factors influencing performance were the amount of training data and certain ambiguous cancers (e.g., those in the stomach region). The feature analysis revealed that a combination of features was important for cancer type classification, with SNOMED CT concept and oncology-specific morphology features proving the most valuable. Conclusion: The system proposed in this study provides automatic identification and characterisation of cancers from large collections of free-text death certificates.
This allows organisations such as Cancer Registries to monitor and report on cancer mortality in a timely and accurate manner. In addition, the methods and findings are generally applicable beyond cancer classification and to other sources of medical text besides death certificates.
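The cascaded architecture and the evaluation measures named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the two keyword rules stand in for the trained Support Vector Machine classifiers, the label names and toy certificates are hypothetical, and none of the numbers come from the study.

```python
def cascade(certificate, is_cancer, cancer_type):
    """Level 1 decides cancer vs. no cancer; level 2 runs only on positives."""
    if not is_cancer(certificate):
        return "nocancer"
    return cancer_type(certificate)


def precision_recall_f1(true_labels, predicted_labels, positive):
    """Return (precision, recall, F-measure) for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical stand-ins for the two classifier levels (keyword rules, not SVMs).
def toy_is_cancer(text):
    return "carcinoma" in text or "cancer" in text


def toy_cancer_type(text):
    return "type:lung" if "lung" in text else "type:other"


# Hypothetical toy certificates and reference labels.
certificates = [
    "lung cancer with metastases",
    "myocardial infarction",
    "carcinoma of stomach",
    "pneumonia, lung infection",
]
truth = ["type:lung", "nocancer", "type:other", "nocancer"]
predictions = [cascade(c, toy_is_cancer, toy_cancer_type) for c in certificates]

# Level-1 evaluation: collapse every cancer type into a single positive class.
binary_truth = ["cancer" if t != "nocancer" else "nocancer" for t in truth]
binary_pred = ["cancer" if p != "nocancer" else "nocancer" for p in predictions]
precision, recall, f1 = precision_recall_f1(binary_truth, binary_pred, "cancer")
```

Running the second level only on first-level positives means an error at level 1 cannot be recovered at level 2, which is one reason the two levels are worth evaluating separately.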
Abstract:
The use of near infrared (NIR) hyperspectral imaging and hyperspectral image analysis for distinguishing between hard, intermediate and soft maize kernels from inbred lines was evaluated. NIR hyperspectral images of two sets (12 and 24 kernels) of whole maize kernels were acquired using a Spectral Dimensions MatrixNIR camera with a spectral range of 960-1662 nm and a sisuChema SWIR (short wave infrared) hyperspectral pushbroom imaging system with a spectral range of 1000-2498 nm. Exploratory principal component analysis (PCA) was used on absorbance images to remove background, bad pixels and shading. On the cleaned images, PCA could be used effectively to find histological classes including glassy (hard) and floury (soft) endosperm. PCA illustrated a distinct difference between glassy and floury endosperm along principal component (PC) three on the MatrixNIR and PC two on the sisuChema, with two distinguishable clusters. Subsequently, partial least squares discriminant analysis (PLS-DA) was applied to build a classification model. The PLS-DA model from the MatrixNIR image (12 kernels) resulted in a root mean square error of prediction (RMSEP) of 0.18. This was repeated on the MatrixNIR image of the 24 kernels, which also resulted in an RMSEP of 0.18. The sisuChema image yielded an RMSEP of 0.29. The reproducible results obtained with the different data sets indicate that the method proposed in this paper has real potential for future classification uses.
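The RMSEP figure reported above can be illustrated with a short calculation. This sketch assumes that class membership was coded as 0/1 dummy values, which is common practice in PLS-DA but is not stated in the abstract; the reference and predicted values are hypothetical.

```python
import math


def rmsep(reference, predicted):
    """Root mean square error of prediction over paired values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical dummy coding: 1.0 = glassy endosperm, 0.0 = floury endosperm.
reference = [1.0, 1.0, 0.0, 0.0]
# Hypothetical continuous model outputs for the same four pixels.
predicted = [0.8, 1.1, 0.2, -0.1]
error = rmsep(reference, predicted)
```

A lower RMSEP means the continuous predictions sit closer to the dummy class values, so the two classes are easier to separate with a threshold.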
Abstract:
Gender perceptions, religious belief systems, and political thought have excluded women from politics for ages around the world. Combining feminist and modernisation theorists in my theoretical framework, I examine the trends in patriarchal Europe and I highlight the gender-sensitive model of the Nordic countries. Retracing local gender patterns from precolonial to postcolonial eras in sub-Saharan Africa, I explore the links between perceptions, needs, resources, education and women's political participation in Cameroon. Democratisation is supposed to open up political participation and to grant equal opportunities to all adults. One ironic feature of the liberalisation process in Cameroon has been the decrease of women in parliamentary representation (14% in 1988, 6% in 1992, 5% in 1997 and 10% in 2002). What social, cultural and institutional mechanisms produced this paradoxical outcome, the exclusion of half the population? The gender complementarity of the indigenous context has been lost to male prevalence privileged by education, church, law, employment, economy and politics in the public sphere; most women are marginalised in the private sphere. Nation building and development have failed; ethnicism and individualism are growing. Some hope lies in the growing civil society. From two surveys and 21 focus groups across Cameroon, in 2000 and 2002, some significant results of the processed empirical data reveal low electoral registration (34.5% women and 65.9% men), contrasted by the willingness to run for municipal elections (33.3% women and 45.2% men). The co-existence of customary and statutory laws, the corrupt political system and fraudulent practices contribute to the marginalisation of women and men who are interested in politics. A large majority of female respondents consider female politicians more trustworthy and capable than their male counterparts; they even foresee the appointment of a female Prime Minister.
The Nordic countries have institutionalised gender equality in their legislation, policies and practices. France has improved women's political inclusion with the parity laws; Rwanda is another model of women's representation, thanks to its post-conflict constitution. From my analysis, Cameroonian institutions, men and more so women, may learn and borrow from these experiences, in order to design and implement a sustainable and gender-balanced democracy. Keywords: democratisation, politics, gender equality, feminism, citizenship, Cameroon, Nordic countries, Finland, France, United Kingdom, quotas, societal social psychology.
Abstract:
Myopathies are muscular diseases in which muscle fibers degenerate due to factors such as nutrient deficiency, infection and mutations in myofibrillar proteins. The objective of this study is to identify biomarkers that distinguish various muscle mutants in Drosophila (fruit fly) using Raman spectroscopy. A Principal Components based Linear Discriminant Analysis (PC-LDA) classification model yielding >95% accuracy was developed to classify the different mutants, which represent various myopathies, according to their physiopathology.
Abstract:
This paper describes a representation of the dynamics of human walking action for the purpose of person identification and classification by gait appearance. Our gait representation is based on simple features such as moments extracted from video silhouettes of human walking motion. We claim that our gait dynamics representation is rich enough for the task of recognition and classification. The use of our feature representation is demonstrated in the task of person recognition from video sequences of orthogonal views of people walking. We demonstrate the accuracy of recognition on gait video sequences collected over different days and times, and under varying lighting environments. In addition, preliminary results are shown on gender classification using our gait dynamics features.
Abstract:
This thesis reports the application of metabolomics to human tissues and biofluids (blood plasma and urine) to unveil the metabolic signature of primary lung cancer. In Chapter 1, a brief introduction on lung cancer epidemiology and pathogenesis, together with a review of the main metabolic dysregulations known to be associated with cancer, is presented. The metabolomics approach is also described, addressing the analytical and statistical methods employed, as well as the current state of the art on its application to clinical lung cancer studies. Chapter 2 provides the experimental details of this work, in regard to the subjects enrolled, sample collection and analysis, and data processing. In Chapter 3, the metabolic characterization of intact lung tissues (from 56 patients) by proton High Resolution Magic Angle Spinning (HRMAS) Nuclear Magnetic Resonance (NMR) spectroscopy is described. After careful assessment of acquisition conditions and thorough spectral assignment (over 50 metabolites identified), the metabolic profiles of tumour and adjacent control tissues were compared through multivariate analysis. The two tissue classes could be discriminated with 97% accuracy, with 13 metabolites significantly accounting for this discrimination: glucose and acetate (depleted in tumours), together with lactate, alanine, glutamate, GSH, taurine, creatine, phosphocholine, glycerophosphocholine, phosphoethanolamine, uracil nucleotides and peptides (increased in tumours). Some of these variations corroborated typical features of cancer metabolism (e.g., upregulated glycolysis and glutaminolysis), while others suggested less known pathways (e.g., antioxidant protection, protein degradation) to play important roles. Another major and novel finding described in this chapter was the dependence of this metabolic signature on tumour histological subtype. 
While the main alterations in adenocarcinomas (AdC) were related to phospholipid and protein metabolism, squamous cell carcinomas (SqCC) were found to have stronger glycolytic and glutaminolytic profiles, making it possible to build a valid classification model to discriminate these two subtypes. Chapter 4 reports the NMR metabolomic study of blood plasma from over 100 patients and nearly 100 healthy controls; the multivariate model built afforded a classification rate of 87%. The two groups were found to differ significantly in the levels of lactate, pyruvate, acetoacetate, LDL+VLDL lipoproteins and glycoproteins (increased in patients), together with glutamine, histidine, valine, methanol, HDL lipoproteins and two unassigned compounds (decreased in patients). Interestingly, these variations were detected from initial disease stages and the magnitude of some of them depended on the histological type, although not allowing AdC vs. SqCC discrimination. Moreover, it is shown in this chapter that age mismatch between control and cancer groups could not be ruled out as a possible confounding factor, and exploratory external validation afforded a classification rate of 85%. The NMR profiling of urine from lung cancer patients and healthy controls is presented in Chapter 5. Compared to plasma, the classification model built with urinary profiles resulted in a superior classification rate (97%). After careful assessment of possible bias from gender, age and smoking habits, a set of 19 metabolites was proposed to be cancer-related (out of which 3 were unknowns and 6 were partially identified as N-acetylated metabolites). As for plasma, these variations were detected regardless of disease stage and showed some dependency on histological subtype, with the AdC vs. SqCC model showing modest predictive power.
In addition, preliminary external validation of the urine-based classification model afforded 100% sensitivity and 90% specificity, which are exciting results in terms of potential for future clinical application. Chapter 6 describes the analysis of urine from a subset of patients by a different profiling technique, namely, Ultra-Performance Liquid Chromatography coupled to Mass Spectrometry (UPLC-MS). Although the identification of discriminant metabolites was very limited, multivariate models showed high classification rate and predictive power, thus reinforcing the value of urine in the context of lung cancer diagnosis. Finally, the main conclusions of this thesis are presented in Chapter 7, highlighting the potential of integrated metabolomics of tissues and biofluids to improve current understanding of lung cancer altered metabolism and to reveal new marker profiles with diagnostic value.
Abstract:
This paper describes a methodology developed for the classification of Medium Voltage (MV) electricity customers. Starting from a sample of databases resulting from a monitoring campaign, Data Mining (DM) techniques are used to discover a set of typical MV consumer load profiles and thereby to extract knowledge about electric energy consumption patterns. In the first stage, several hierarchical clustering algorithms were applied and their clustering performance was compared using adequacy measures. In the second stage, a classification model was developed to assign new consumers to one of the clusters obtained in the previous stage. Finally, the interpretation of the discovered knowledge is presented and discussed.
Abstract:
This thesis describes a representation of gait appearance for the purpose of person identification and classification. This gait representation is based on simple localized image features, such as moments, extracted from orthogonal-view video silhouettes of human walking motion. A suite of time-integration methods, spanning a range of coarseness of time aggregation and modeling of feature distributions, is applied to these image features to create a suite of gait sequence representations. Despite their simplicity, the resulting feature vectors contain enough information to perform well on human identification and gender classification tasks. We demonstrate the accuracy of recognition on gait video sequences collected over different days and times and under varying lighting environments. Each of the integration methods is investigated for its advantages and disadvantages. An improved gait representation is built based on our experiences with the initial set of gait representations. In addition, we show gender classification results using our gait appearance features, the effect of our heuristic feature selection method, and the significance of individual features.
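The moment features mentioned in this abstract can be sketched for a single binary silhouette. This is a generic illustration of image moments (area, centroid and second-order central moments), not the thesis's actual feature set; the function name and the toy mask are hypothetical.

```python
def silhouette_moments(mask):
    """Compute area, centroid and second-order central moments of a binary mask."""
    pixels = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not pixels:
        raise ValueError("empty silhouette")
    area = len(pixels)
    cx = sum(x for x, _ in pixels) / area
    cy = sum(y for _, y in pixels) / area
    # Central moments are normalised by area, so they describe shape spread.
    mu20 = sum((x - cx) ** 2 for x, _ in pixels) / area
    mu02 = sum((y - cy) ** 2 for _, y in pixels) / area
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels) / area
    return {"area": area, "centroid": (cx, cy), "mu20": mu20, "mu02": mu02, "mu11": mu11}


# Hypothetical 4x4 silhouette: a vertical bar two pixels wide.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
features = silhouette_moments(mask)
```

Computing such values per frame, possibly per image region, and then aggregating them over a walking sequence would give one fixed-length feature vector per sequence, which is the general shape of the representation the abstract describes.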
Abstract:
This paper is concerned with the use of a genetic algorithm to select financial ratios for corporate distress classification models. For this purpose, the fitness value associated with a set of ratios is made to reflect the requirements of maximizing the amount of information available for the model and minimizing the collinearity between the model inputs. A case study involving 60 failed and continuing British firms in the period 1997-2000 is used for illustration. The classification model based on ratios selected by the genetic algorithm compares favorably with a model employing ratios usually found in the financial distress literature.
Abstract:
Various studies suggest that inner facial features are not the only discriminative features for tasks such as person identification or gender classification. Indeed, they have shown that features belonging to the local face context, such as hair, influence these tasks. However, object-centered approaches that ignore local context dominate research in computational vision-based facial analysis. In this paper, we perform an analysis to study which areas and which resolutions are diagnostic for the gender classification problem. We first demonstrate the importance of contextual features in human observers for gender classification using a psychophysical "bubbles" technique.
Abstract:
Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually, and requires the use of adequate machine learning and data mining techniques. Classification is one of the most important techniques and has been successfully applied to several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from available training data, and second, classify new, unseen data instances using the learned classifier. Classification is supervised when all class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when all class values are missing in the training data (i.e., unlabeled data). Besides this taxonomy, the classification problem can be categorized as uni-dimensional or multi-dimensional depending on the number of class variables, one or more, respectively; it can also be categorized as stationary or streaming depending on the characteristics of the data and the rate of change underlying it. Throughout this thesis, we deal with the classification problem under three different settings, namely, supervised multi-dimensional stationary classification, semi-supervised uni-dimensional streaming classification, and supervised multi-dimensional streaming classification. To accomplish this task, we basically used Bayesian network classifiers as models. The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers from stationary data.
They are proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second one, named MB-MBC, is a filter constraint-based approach based on Markov blankets. Both methods are applied to two important real-world problems, namely, the prediction of the human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from the 39-item Parkinson's Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against commonly used methods for solving the Parkinson's disease prediction problem, namely, multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both case studies, results are promising in terms of classification accuracy as well as regarding the analysis of the learned MBC graphical structures, which identify known and novel interactions among variables. The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from stationary data sets in their very rapid generation process and their concept-drifting aspect. That is, the learned concepts and/or the underlying distribution are likely to change and evolve over time, which makes the current classification model out of date and in need of updating. CPL-DS uses the Kullback-Leibler divergence and a bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional or dual. Then, if any drift occurs, a new classification model is learned using the expectation-maximization algorithm; otherwise, the current classification model is kept unchanged.
CPL-DS is general, as it can be applied to several classification models. Using two different models, namely, the naive Bayes classifier and logistic regression, CPL-DS is tested with synthetic data streams and applied to the real-world problem of malware detection, where newly received files should be continuously classified as malware or goodware. Experimental results show that our approach is effective for detecting different kinds of drift from partially labeled data streams, and that it has good classification performance. Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely, Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor the concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. An experimental study carried out using synthetic multi-dimensional data streams shows the merits of both proposed adaptive methods.
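The Page-Hinkley test used for drift monitoring in the third contribution can be sketched as below. This is the common textbook form of the test, which raises an alarm when a monitored value increases; it is not the thesis's implementation. The parameter values, the class and function names, and the synthetic stream are all assumptions, and the stream is taken to be a negated average log-likelihood so that a loss of fit shows up as an increase.

```python
class PageHinkley:
    """Page-Hinkley change detector for an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (often written lambda)
        self.count = 0
        self.mean = 0.0
        self.cumulative = 0.0       # running sum of deviations from the mean
        self.minimum = 0.0          # smallest cumulative sum seen so far

    def update(self, value):
        """Consume one observation and return True if drift is signalled."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


def first_alarm(stream, delta=0.005, threshold=5.0):
    """Return the index of the first alarm, or None if no drift is signalled."""
    detector = PageHinkley(delta=delta, threshold=threshold)
    for index, value in enumerate(stream):
        if detector.update(value):
            return index
    return None


# Hypothetical monitored score: stable for 50 steps, then shifted upwards.
stable = [1.0] * 50
drifted = [3.0] * 50
alarm_index = first_alarm(stable + drifted)
no_alarm = first_alarm([1.0] * 100)
```

A larger threshold reduces false alarms at the cost of a longer detection delay, while delta sets how small a shift is ignored as noise.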
Abstract:
Exposure to counter-stereotypic gender role models (e.g., a woman engineer) has been shown to successfully reduce the application of biased gender stereotypes. We tested the hypothesis that such efforts may more generally lessen the application of stereotypic knowledge in other (non-gendered) domains. Specifically, based on the notion that counter-stereotypes can stimulate a lesser reliance on heuristic thinking, we predicted that contesting gender stereotypes would eliminate a more general group prototypicality bias in the selection of leaders. Three studies supported this hypothesis. After exposing participants to a counter-stereotypic gender role model, group prototypicality no longer predicted leadership evaluation and selection. We discuss the implications of these findings for groups and organizations seeking to capitalize on the benefits of an increasingly diverse workforce.
Abstract:
This study examined the association of theoretically guided and empirically identified psychosocial variables with the co-occurrence of risky sexual behavior and alcohol consumption among university students. The study utilized event analysis to determine whether risky sex occurred during the same event in which alcohol was consumed. Relevant conceptualizations included alcohol disinhibition, self-efficacy, and social network theories. Predictor variables included negative condom attitudes, general risk taking, drinking motives, mistrust, social group membership, and gender. Factor analysis was employed to identify dimensions of drinking motives. Measured risky sex behaviors were (a) sex without a condom, (b) sex with people not known very well, (c) sex with injecting drug users (IDUs), (d) sex with people without knowing whether they had an STD, and (e) sex while using drugs. A purposive sample was used and included 222 male and female students recruited from a major urban university. Chi-square analysis was used to determine whether participants were more likely to engage in risky sex behavior in different alcohol use contexts. These contexts were only when drinking, only when not drinking, and when drinking or not. The chi-square findings did not support the hypothesis that university students who use alcohol with sex will engage in riskier sex. These results added to the literature by extending other similar findings to a university student sample. For each of the observed risky sex behaviors, discriminant analysis was used to determine whether the predictor variables would differentiate the drinking contexts, or whether the behavior occurred. Results from the discriminant analyses indicated that sex with people not known very well was the only behavior for which there were significant discriminant functions. Gender and enhancement drinking motives were important constructs in the classification model.
Limitations of the study and implications for future research, social work practice and policy are discussed.
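The chi-square analysis mentioned in this abstract compares observed counts with the counts expected if behavior and drinking context were independent. The sketch below computes Pearson's chi-square statistic for a contingency table; the counts are hypothetical and are not taken from the study.

```python
def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = (len(row_totals) - 1) * (len(column_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts. Rows: behavior occurred / did not occur.
# Columns: only when drinking, only when not drinking, drinking or not.
table = [
    [20, 30, 50],
    [30, 30, 40],
]
statistic, degrees_of_freedom = chi_square_statistic(table)
```

The statistic is then compared with the chi-square distribution for the given degrees of freedom; a value below the chosen critical value means the data do not contradict independence.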