944 resultados para automated text classification


Relevância:

40.00% 40.00%

Publicador:

Resumo:

Affiliation: Centre Robert-Cedergren de l'Université de Montréal en bio-informatique et génomique & Département de biochimie, Université de Montréal

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Background: Since their inception, Twitter and related microblogging systems have provided a rich source of information for researchers and have attracted interest in their affordances and use. Since 2009 PubMed has included 123 journal articles on medicine and Twitter, but no overview exists as to how the field uses Twitter in research. // Objective: This paper aims to identify published work relating to Twitter indexed by PubMed, and then to classify it. This classification will provide a framework in which future researchers will be able to position their work, and to provide an understanding of the current reach of research using Twitter in medical disciplines. Limiting the study to papers indexed by PubMed ensures the work provides a reproducible benchmark. // Methods: Papers, indexed by PubMed, on Twitter and related topics were identified and reviewed. The papers were then qualitatively classified based on the paper’s title and abstract to determine their focus. The work that was Twitter focused was studied in detail to determine what data, if any, it was based on, and from this a categorization of the data set size used in the studies was developed. Using open coded content analysis additional important categories were also identified, relating to the primary methodology, domain and aspect. // Results: As of 2012, PubMed comprises more than 21 million citations from biomedical literature, and from these a corpus of 134 potentially Twitter related papers were identified, eleven of which were subsequently found not to be relevant. There were no papers prior to 2009 relating to microblogging, a term first used in 2006. Of the remaining 123 papers which mentioned Twitter, thirty were focussed on Twitter (the others referring to it tangentially). The early Twitter focussed papers introduced the topic and highlighted the potential, not carrying out any form of data analysis. The majority of published papers used analytic techniques to sort through thousands, if not millions, of individual tweets, often depending on automated tools to do so. Our analysis demonstrates that researchers are starting to use knowledge discovery methods and data mining techniques to understand vast quantities of tweets: the study of Twitter is becoming quantitative research. // Conclusions: This work is to the best of our knowledge the first overview study of medical related research based on Twitter and related microblogging. We have used five dimensions to categorise published medical related research on Twitter. This classification provides a framework within which researchers studying development and use of Twitter within medical related research, and those undertaking comparative studies of research relating to Twitter in the area of medicine and beyond, can position and ground their work.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

High-throughput gene expression technologies such as microarrays have been utilized in a variety of scientific applications. Most of the work has been on assessing univariate associations between gene expression with clinical outcome (variable selection) or on developing classification procedures with gene expression data (supervised learning). We consider a hybrid variable selection/classification approach that is based on linear combinations of the gene expression profiles that maximize an accuracy measure summarized using the receiver operating characteristic curve. Under a specific probability model, this leads to consideration of linear discriminant functions. We incorporate an automated variable selection approach using LASSO. An equivalence between LASSO estimation with support vector machines allows for model fitting using standard software. We apply the proposed method to simulated data as well as data from a recently published prostate cancer study.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

PURPOSE: To develop and implement a method for improved cerebellar tissue classification on the MRI of brain by automatically isolating the cerebellum prior to segmentation. MATERIALS AND METHODS: Dual fast spin echo (FSE) and fluid attenuation inversion recovery (FLAIR) images were acquired on 18 normal volunteers on a 3 T Philips scanner. The cerebellum was isolated from the rest of the brain using a symmetric inverse consistent nonlinear registration of individual brain with the parcellated template. The cerebellum was then separated by masking the anatomical image with individual FLAIR images. Tissues in both the cerebellum and rest of the brain were separately classified using hidden Markov random field (HMRF), a parametric method, and then combined to obtain tissue classification of the whole brain. The proposed method for tissue classification on real MR brain images was evaluated subjectively by two experts. The segmentation results on Brainweb images with varying noise and intensity nonuniformity levels were quantitatively compared with the ground truth by computing the Dice similarity indices. RESULTS: The proposed method significantly improved the cerebellar tissue classification on all normal volunteers included in this study without compromising the classification in remaining part of the brain. The average similarity indices for gray matter (GM) and white matter (WM) in the cerebellum are 89.81 (+/-2.34) and 93.04 (+/-2.41), demonstrating excellent performance of the proposed methodology. CONCLUSION: The proposed method significantly improved tissue classification in the cerebellum. The GM was overestimated when segmentation was performed on the whole brain as a single object.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Cervical cancer is the leading cause of death and disease from malignant neoplasms among women in developing countries. Even though the Pap smear has significantly decreased the number of deaths from cervical cancer in the past years, it has its limitations. Researchers have developed an automated screening machine which can potentially detect abnormal cases that are overlooked by conventional screening. The goal of quantitative cytology is to classify the patient's tissue sample based on quantitative measurements of the individual cells. It is also much cheaper and potentially can take less time. One of the major challenges of collecting cells with a cytobrush is the possibility of not sampling any existing dysplastic cells on the cervix. Being able to correctly classify patients who have disease without the presence of dysplastic cells could improve the accuracy of quantitative cytology algorithms. Subtle morphologic changes in normal-appearing tissues adjacent to or distant from malignant tumors have been shown to exist, but a comparison of various statistical methods, including many recent advances in the statistical learning field, has not previously been done. The objective of this thesis is to use different classification methods applied to quantitative cytology data for the detection of malignancy associated changes (MACs). In this thesis, Elastic Net is the best algorithm. When we applied the Elastic Net algorithm to the test set, we combined the training set and validation set as "training" set and used 5-fold cross validation to choose the parameter for Elastic Net. It has a sensitivity of 47% at 80% specificity, an AUC 0.52, and a partial AUC 0.10 (95% CI 0.09-0.11).^

Relevância:

40.00% 40.00%

Publicador:

Resumo:

El daño cerebral adquirido (DCA) es un problema social y sanitario grave, de magnitud creciente y de una gran complejidad diagnóstica y terapéutica. Su elevada incidencia, junto con el aumento de la supervivencia de los pacientes, una vez superada la fase aguda, lo convierten también en un problema de alta prevalencia. En concreto, según la Organización Mundial de la Salud (OMS) el DCA estará entre las 10 causas más comunes de discapacidad en el año 2020. La neurorrehabilitación permite mejorar el déficit tanto cognitivo como funcional y aumentar la autonomía de las personas con DCA. Con la incorporación de nuevas soluciones tecnológicas al proceso de neurorrehabilitación se pretende alcanzar un nuevo paradigma donde se puedan diseñar tratamientos que sean intensivos, personalizados, monitorizados y basados en la evidencia. Ya que son estas cuatro características las que aseguran que los tratamientos son eficaces. A diferencia de la mayor parte de las disciplinas médicas, no existen asociaciones de síntomas y signos de la alteración cognitiva que faciliten la orientación terapéutica. Actualmente, los tratamientos de neurorrehabilitación se diseñan en base a los resultados obtenidos en una batería de evaluación neuropsicológica que evalúa el nivel de afectación de cada una de las funciones cognitivas (memoria, atención, funciones ejecutivas, etc.). La línea de investigación en la que se enmarca este trabajo de investigación pretende diseñar y desarrollar un perfil cognitivo basado no sólo en el resultado obtenido en esa batería de test, sino también en información teórica que engloba tanto estructuras anatómicas como relaciones funcionales e información anatómica obtenida de los estudios de imagen. De esta forma, el perfil cognitivo utilizado para diseñar los tratamientos integra información personalizada y basada en la evidencia. Las técnicas de neuroimagen representan una herramienta fundamental en la identificación de lesiones para la generación de estos perfiles cognitivos. La aproximación clásica utilizada en la identificación de lesiones consiste en delinear manualmente regiones anatómicas cerebrales. Esta aproximación presenta diversos problemas relacionados con inconsistencias de criterio entre distintos clínicos, reproducibilidad y tiempo. Por tanto, la automatización de este procedimiento es fundamental para asegurar una extracción objetiva de información. La delineación automática de regiones anatómicas se realiza mediante el registro tanto contra atlas como contra otros estudios de imagen de distintos sujetos. Sin embargo, los cambios patológicos asociados al DCA están siempre asociados a anormalidades de intensidad y/o cambios en la localización de las estructuras. Este hecho provoca que los algoritmos de registro tradicionales basados en intensidad no funcionen correctamente y requieran la intervención del clínico para seleccionar ciertos puntos (que en esta tesis hemos denominado puntos singulares). Además estos algoritmos tampoco permiten que se produzcan deformaciones grandes deslocalizadas. Hecho que también puede ocurrir ante la presencia de lesiones provocadas por un accidente cerebrovascular (ACV) o un traumatismo craneoencefálico (TCE). Esta tesis se centra en el diseño, desarrollo e implementación de una metodología para la detección automática de estructuras lesionadas que integra algoritmos cuyo objetivo principal es generar resultados que puedan ser reproducibles y objetivos. Esta metodología se divide en cuatro etapas: pre-procesado, identificación de puntos singulares, registro y detección de lesiones. Los trabajos y resultados alcanzados en esta tesis son los siguientes: Pre-procesado. En esta primera etapa el objetivo es homogeneizar todos los datos de entrada con el objetivo de poder extraer conclusiones válidas de los resultados obtenidos. Esta etapa, por tanto, tiene un gran impacto en los resultados finales. Se compone de tres operaciones: eliminación del cráneo, normalización en intensidad y normalización espacial. Identificación de puntos singulares. El objetivo de esta etapa es automatizar la identificación de puntos anatómicos (puntos singulares). Esta etapa equivale a la identificación manual de puntos anatómicos por parte del clínico, permitiendo: identificar un mayor número de puntos lo que se traduce en mayor información; eliminar el factor asociado a la variabilidad inter-sujeto, por tanto, los resultados son reproducibles y objetivos; y elimina el tiempo invertido en el marcado manual de puntos. Este trabajo de investigación propone un algoritmo de identificación de puntos singulares (descriptor) basado en una solución multi-detector y que contiene información multi-paramétrica: espacial y asociada a la intensidad. Este algoritmo ha sido contrastado con otros algoritmos similares encontrados en el estado del arte. Registro. En esta etapa se pretenden poner en concordancia espacial dos estudios de imagen de sujetos/pacientes distintos. El algoritmo propuesto en este trabajo de investigación está basado en descriptores y su principal objetivo es el cálculo de un campo vectorial que permita introducir deformaciones deslocalizadas en la imagen (en distintas regiones de la imagen) y tan grandes como indique el vector de deformación asociado. El algoritmo propuesto ha sido comparado con otros algoritmos de registro utilizados en aplicaciones de neuroimagen que se utilizan con estudios de sujetos control. Los resultados obtenidos son prometedores y representan un nuevo contexto para la identificación automática de estructuras. Identificación de lesiones. En esta última etapa se identifican aquellas estructuras cuyas características asociadas a la localización espacial y al área o volumen han sido modificadas con respecto a una situación de normalidad. Para ello se realiza un estudio estadístico del atlas que se vaya a utilizar y se establecen los parámetros estadísticos de normalidad asociados a la localización y al área. En función de las estructuras delineadas en el atlas, se podrán identificar más o menos estructuras anatómicas, siendo nuestra metodología independiente del atlas seleccionado. En general, esta tesis doctoral corrobora las hipótesis de investigación postuladas relativas a la identificación automática de lesiones utilizando estudios de imagen médica estructural, concretamente estudios de resonancia magnética. Basándose en estos cimientos, se han abrir nuevos campos de investigación que contribuyan a la mejora en la detección de lesiones. ABSTRACT Brain injury constitutes a serious social and health problem of increasing magnitude and of great diagnostic and therapeutic complexity. Its high incidence and survival rate, after the initial critical phases, makes it a prevalent problem that needs to be addressed. In particular, according to the World Health Organization (WHO), brain injury will be among the 10 most common causes of disability by 2020. Neurorehabilitation improves both cognitive and functional deficits and increases the autonomy of brain injury patients. The incorporation of new technologies to the neurorehabilitation tries to reach a new paradigm focused on designing intensive, personalized, monitored and evidence-based treatments. Since these four characteristics ensure the effectivity of treatments. Contrary to most medical disciplines, it is not possible to link symptoms and cognitive disorder syndromes, to assist the therapist. Currently, neurorehabilitation treatments are planned considering the results obtained from a neuropsychological assessment battery, which evaluates the functional impairment of each cognitive function (memory, attention, executive functions, etc.). The research line, on which this PhD falls under, aims to design and develop a cognitive profile based not only on the results obtained in the assessment battery, but also on theoretical information that includes both anatomical structures and functional relationships and anatomical information obtained from medical imaging studies, such as magnetic resonance. Therefore, the cognitive profile used to design these treatments integrates information personalized and evidence-based. Neuroimaging techniques represent an essential tool to identify lesions and generate this type of cognitive dysfunctional profiles. Manual delineation of brain anatomical regions is the classical approach to identify brain anatomical regions. Manual approaches present several problems related to inconsistencies across different clinicians, time and repeatability. Automated delineation is done by registering brains to one another or to a template. However, when imaging studies contain lesions, there are several intensity abnormalities and location alterations that reduce the performance of most of the registration algorithms based on intensity parameters. Thus, specialists may have to manually interact with imaging studies to select landmarks (called singular points in this PhD) or identify regions of interest. These two solutions have the same inconvenient than manual approaches, mentioned before. Moreover, these registration algorithms do not allow large and distributed deformations. This type of deformations may also appear when a stroke or a traumatic brain injury (TBI) occur. This PhD is focused on the design, development and implementation of a new methodology to automatically identify lesions in anatomical structures. This methodology integrates algorithms whose main objective is to generate objective and reproducible results. It is divided into four stages: pre-processing, singular points identification, registration and lesion detection. Pre-processing stage. In this first stage, the aim is to standardize all input data in order to be able to draw valid conclusions from the results. Therefore, this stage has a direct impact on the final results. It consists of three steps: skull-stripping, spatial and intensity normalization. Singular points identification. This stage aims to automatize the identification of anatomical points (singular points). It involves the manual identification of anatomical points by the clinician. This automatic identification allows to identify a greater number of points which results in more information; to remove the factor associated to inter-subject variability and thus, the results are reproducible and objective; and to eliminate the time spent on manual marking. This PhD proposed an algorithm to automatically identify singular points (descriptor) based on a multi-detector approach. This algorithm contains multi-parametric (spatial and intensity) information. This algorithm has been compared with other similar algorithms found on the state of the art. Registration. The goal of this stage is to put in spatial correspondence two imaging studies of different subjects/patients. The algorithm proposed in this PhD is based on descriptors. Its main objective is to compute a vector field to introduce distributed deformations (changes in different imaging regions), as large as the deformation vector indicates. The proposed algorithm has been compared with other registration algorithms used on different neuroimaging applications which are used with control subjects. The obtained results are promising and they represent a new context for the automatic identification of anatomical structures. Lesion identification. This final stage aims to identify those anatomical structures whose characteristics associated to spatial location and area or volume has been modified with respect to a normal state. A statistical study of the atlas to be used is performed to establish which are the statistical parameters associated to the normal state. The anatomical structures that may be identified depend on the selected anatomical structures identified on the atlas. The proposed methodology is independent from the selected atlas. Overall, this PhD corroborates the investigated research hypotheses regarding the automatic identification of lesions based on structural medical imaging studies (resonance magnetic studies). Based on these foundations, new research fields to improve the automatic identification of lesions in brain injury can be proposed.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

While the retrieval of existing designs to prevent unnecessary duplication of parts is a recognised strategy in the control of design costs the available techniques to achieve this, even in product data management systems, are limited in performance or require large resources. A novel system has been developed based on a new version of an existing coding system (CAMAC) that allows automatic coding of engineering drawings and their subsequent retrieval using a drawing of the desired component as the input. The ability to find designs using a detail drawing rather than textual descriptions is a significant achievement in itself. Previous testing of the system has demonstrated this capability but if a means could be found to find parts from a simple sketch then its practical application would be much more effective. This paper describes the development and testing of such a search capability using a database of over 3000 engineering components.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The G-protein coupled receptors--or GPCRs--comprise simultaneously one of the largest and one of the most multi-functional protein families known to modern-day molecular bioscience. From a drug discovery and pharmaceutical industry perspective, the GPCRs constitute one of the most commercially and economically important groups of proteins known. The GPCRs undertake numerous vital metabolic functions and interact with a hugely diverse range of small and large ligands. Many different methodologies have been developed to efficiently and accurately classify the GPCRs. These range from motif-based techniques to machine learning as well as a variety of alignment-free techniques based on the physiochemical properties of sequences. We review here the available methodologies for the classification of GPCRs. Part of this work focuses on how we have tried to build the intrinsically hierarchical nature of sequence relations, implicit within the family, into an adaptive approach to classification. Importantly, we also allude to some of the key innate problems in developing an effective approach to classifying the GPCRs: the lack of sequence similarity between the six classes that comprise the GPCR family and the low sequence similarity to other family members evinced by many newly revealed members of the family.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Sentiment analysis or opinion mining aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework called joint sentiment-topic (JST) model based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. A reparameterized version of the JST model called Reverse-JST, obtained by reversing the sequence of sentiment and topic generation in the modeling process, is also studied. Although JST is equivalent to Reverse-JST without a hierarchical prior, extensive experiments show that when sentiment priors are added, JST performs consistently better than Reverse-JST. Besides, unlike supervised approaches to sentiment classification which often fail to produce satisfactory performance when shifting to other domains, the weakly supervised nature of JST makes it highly portable to other domains. This is verified by the experimental results on data sets from five different domains where the JST model even outperforms existing semi-supervised approaches in some of the data sets despite using no labeled documents. Moreover, the topics and topic sentiment detected by JST are indeed coherent and informative. We hypothesize that the JST model can readily meet the demand of large-scale sentiment analysis from the web in an open-ended fashion.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

SMS (Short Message Service) is now a hugely popular and a very powerful business communication technology for mobile phones. In order to respond correctly to a free form factual question given a large collection of texts, one needs to understand the question at a level that allows determining some of constraints the question imposes on a possible answer. These constraints may include a semantic classification of the sought after answer and may even suggest using different strategies when looking for and verifying a candidate answer. In this paper we focus on various attempts to overcome the major contradiction: the technical limitations of the SMS standard, and the huge number of found information for a possible answer.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

This study explores factors related to the prompt difficulty in Automated Essay Scoring. The sample was composed of 6,924 students. For each student, there were 1-4 essays, across 20 different writing prompts, for a total of 20,243 essays. E-rater® v.2 essay scoring engine developed by the Educational Testing Service was used to score the essays. The scoring engine employs a statistical model that incorporates 10 predictors associated with writing characteristics of which 8 were used. The Rasch partial credit analysis was applied to the scores to determine the difficulty levels of prompts. In addition, the scores were used as outcomes in the series of hierarchical linear models (HLM) in which students and prompts constituted the cross-classification levels. This methodology was used to explore the partitioning of the essay score variance.^ The results indicated significant differences in prompt difficulty levels due to genre. Descriptive prompts, as a group, were found to be more difficult than the persuasive prompts. In addition, the essay score variance was partitioned between students and prompts. The amount of the essay score variance that lies between prompts was found to be relatively small (4 to 7 percent). When the essay-level, student-level-and prompt-level predictors were included in the model, it was able to explain almost all variance that lies between prompts. Since in most high-stakes writing assessments only 1-2 prompts per students are used, the essay score variance that lies between prompts represents an undesirable or "noise" variation. Identifying factors associated with this "noise" variance may prove to be important for prompt writing and for constructing Automated Essay Scoring mechanisms for weighting prompt difficulty when assigning essay score.^

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The need to provide computers with the ability to distinguish the affective state of their users is a major requirement for the practical implementation of affective computing concepts. This dissertation proposes the application of signal processing methods on physiological signals to extract from them features that can be processed by learning pattern recognition systems to provide cues about a person's affective state. In particular, combining physiological information sensed from a user's left hand in a non-invasive way with the pupil diameter information from an eye-tracking system may provide a computer with an awareness of its user's affective responses in the course of human-computer interactions. In this study an integrated hardware-software setup was developed to achieve automatic assessment of the affective status of a computer user. A computer-based "Paced Stroop Test" was designed as a stimulus to elicit emotional stress in the subject during the experiment. Four signals: the Galvanic Skin Response (GSR), the Blood Volume Pulse (BVP), the Skin Temperature (ST) and the Pupil Diameter (PD), were monitored and analyzed to differentiate affective states in the user. Several signal processing techniques were applied on the collected signals to extract their most relevant features. These features were analyzed with learning classification systems, to accomplish the affective state identification. Three learning algorithms: Naïve Bayes, Decision Tree and Support Vector Machine were applied to this identification process and their levels of classification accuracy were compared. The results achieved indicate that the physiological signals monitored do, in fact, have a strong correlation with the changes in the emotional states of the experimental subjects. These results also revealed that the inclusion of pupil diameter information significantly improved the performance of the emotion recognition system. ^

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Day by day, machine learning is changing our lives in ways we could not have imagined just 5 years ago. ML expertise is more and more requested and needed, though just a limited number of ML engineers are available on the job market, and their knowledge is always limited by an inherent characteristic of theirs: they are humans. This thesis explores the possibilities offered by meta-learning, a new field in ML that takes learning a level higher: models are trained on other models' training data, starting from features of the dataset they were trained on, inference times, obtained performances, to try to understand the relationship between a good model and the way it was obtained. The so-called metamodel was trained on data collected by OpenML, the largest ML metadata platform that's publicly available today. Datasets were analyzed to obtain meta-features that describe them, which were then tied to model performances in a regression task. The obtained metamodel predicts the expected performances of a given model type (e.g., a random forest) on a given ML task (e.g., classification on the UCI census dataset). This research was then integrated into a custom-made AutoML framework, to show how meta-learning is not an end in itself, but it can be used to further progress our ML research. Encoding ML engineering expertise in a model allows better, faster, and more impactful ML applications across the whole world, while reducing the cost that is inevitably tied to human engineers.