974 results for Language processing


Relevance:

60.00%

Publisher:

Abstract:

Community-driven Question Answering (CQA) systems crowdsource experiential information in the form of questions and answers, and have accumulated valuable reusable knowledge. Clustering of QA datasets from CQA systems provides a means of organizing the content to ease tasks such as manual curation and tagging. In this paper, we present a clustering method that exploits the two-part question-answer structure in QA datasets to improve clustering quality. Our method, MixKMeans, composes question-space and answer-space similarities in a way that the space on which the match is higher is allowed to dominate. This construction is motivated by our observation that semantic similarity between question-answer pairs (QAs) can be localized in either space. We empirically evaluate our method on a variety of real-world labeled datasets. Our results indicate that our method significantly outperforms state-of-the-art clustering methods for the task of clustering question-answer archives.
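The abstract does not give the exact composition formula, so the following Python sketch only illustrates the stated idea that the space with the higher match is allowed to dominate; the power-mean composition, the alpha parameter, and the toy vectors are assumptions for illustration, not the published MixKMeans formula.

```python
import numpy as np

def composed_similarity(sim_q: float, sim_a: float, alpha: float = 2.0) -> float:
    """Compose question-space and answer-space similarities so that the
    space with the higher match dominates. A power mean with alpha > 1
    weights the larger of the two similarities more heavily; this is a
    sketch of the idea, not the published MixKMeans formula."""
    return ((sim_q ** alpha + sim_a ** alpha) / 2.0) ** (1.0 / alpha)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy example: two QA pairs whose questions barely match but whose
# answers match strongly; the composition lets the answer space dominate.
q1, a1 = np.array([1.0, 0.0, 0.2]), np.array([0.9, 0.8, 0.1])
q2, a2 = np.array([0.1, 1.0, 0.0]), np.array([0.8, 0.9, 0.2])
print(composed_similarity(cosine(q1, q2), cosine(a1, a2)))
```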

Relevance:

60.00%

Publisher:

Abstract:

The growth of networks, blogs, and social review sites has made the Internet an enormous source of data, in particular about how people think, feel, and act towards different issues. These days, people's opinions play an important role in politics, industry, education, and so on. Governments, large and small industries, academic institutes, companies, and individuals therefore seek automatic techniques for extracting the information they need from large volumes of data. Sentiment analysis is a direct answer to this need. It is an application of natural language processing and computational linguistics that draws on state-of-the-art techniques such as machine learning and language models to capture positive, negative, or neutral evaluations, with or without their strength, in plain text. In this thesis, we study a case-based approach to document-level sentiment analysis. Our case-based approach generates a binary classifier that uses a set of classified documents and five different sentiment lexicons to extract polarity scores for reviews. Since sentiment analysis is inherently a domain-dependent task, which makes the work difficult and costly, we apply a cross-domain approach, basing our classifier on six different domains instead of limiting it to a single one. To improve classification accuracy, we add negation detection to our algorithm. Furthermore, to improve the performance of our approach, a few innovative modifications are applied. It is worth mentioning that our approach opens the way to further developments by adding more sentiment lexicons and datasets in the future.
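As an illustration of the lexicon-plus-negation core of such a classifier, here is a minimal Python sketch; the tiny LEXICON and NEGATORS sets are hypothetical stand-ins for the five sentiment lexicons used in the thesis, and the flip-on-preceding-negator rule is one simple form of negation detection, not necessarily the thesis's.

```python
# Illustrative stand-ins for the full sentiment lexicons and negation cues.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}
NEGATORS = {"not", "never", "no"}

def polarity(text: str) -> float:
    """Sum lexicon scores over tokens, flipping the sign of a sentiment
    word when the preceding token is a negation cue."""
    tokens = text.lower().split()
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            sign = -1.0 if i > 0 and tokens[i - 1] in NEGATORS else 1.0
            score += sign * LEXICON[tok]
    return score

print(polarity("the plot was not good but the acting was great"))  # 1.0
```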

Relevance:

60.00%

Publisher:

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-08

Relevance:

60.00%

Publisher:

Abstract:

Objective: The study was designed to validate use of electronic health records (EHRs) for diagnosing bipolar disorder and classifying control subjects. Method: EHR data were obtained from a health care system of more than 4.6 million patients spanning more than 20 years. Experienced clinicians reviewed charts to identify text features and coded data consistent or inconsistent with a diagnosis of bipolar disorder. Natural language processing was used to train a diagnostic algorithm with 95% specificity for classifying bipolar disorder. Filtered coded data were used to derive three additional classification rules for case subjects and one for control subjects. The positive predictive value (PPV) of EHR-based bipolar disorder and subphenotype diagnoses was calculated against diagnoses from direct semi-structured interviews of 190 patients by trained clinicians blind to EHR diagnosis. Results: The PPV of bipolar disorder defined by natural language processing was 0.85. Coded classification based on strict filtering achieved a value of 0.79, but classifications based on less stringent criteria performed less well. No EHR-classified control subject received a diagnosis of bipolar disorder on the basis of direct interview (PPV=1.0). For most subphenotypes, values exceeded 0.80. The EHR-based classifications were used to accrue 4,500 bipolar disorder cases and 5,000 controls for genetic analyses. Conclusions: Semiautomated mining of EHRs can be used to ascertain bipolar disorder patients and control subjects with high specificity and predictive value compared with diagnostic interviews. EHRs provide a powerful resource for high-throughput phenotyping for genetic and clinical research.
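The reported figures rest on a simple formula: PPV is the fraction of algorithm-classified cases confirmed by the blinded interview. A minimal Python sketch with illustrative counts (the underlying confusion counts are not given in the abstract):

```python
# PPV as used in the validation: among subjects the EHR algorithm
# classifies as bipolar, the fraction confirmed on direct interview.
def ppv(true_positives: int, false_positives: int) -> float:
    return true_positives / (true_positives + false_positives)

# e.g., if 85 of 100 NLP-classified cases were confirmed on interview
# (illustrative counts only), the PPV matches the reported 0.85:
print(ppv(85, 15))  # 0.85
```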

Relevance:

60.00%

Publisher:

Abstract:

The overwhelming amount and unprecedented speed of publication in the biomedical domain make it difficult for life science researchers to acquire and maintain a broad view of the field and gather all information that would be relevant for their research. As a response to this problem, the BioNLP (Biomedical Natural Language Processing) community of researchers has emerged and strives to assist life science researchers by developing modern natural language processing (NLP), information extraction (IE) and information retrieval (IR) methods that can be applied at large scale, to scan the whole publicly available biomedical literature and extract and aggregate the information found within, while automatically normalizing the variability of natural language statements. Among different tasks, biomedical event extraction has recently received much attention within the BioNLP community. Biomedical event extraction constitutes the identification of biological processes and interactions described in biomedical literature, and their representation as a set of recursive event structures. The 2009–2013 series of BioNLP Shared Tasks on Event Extraction has given rise to a number of event extraction systems, several of which have been applied at a large scale (the full set of PubMed abstracts and PubMed Central Open Access full-text articles), leading to the creation of massive biomedical event databases, each containing millions of events. Since top-ranking event extraction systems are based on machine learning and are trained on the narrow-domain, carefully selected Shared Task training data, their performance drops when faced with the topically highly varied PubMed and PubMed Central documents. Specifically, false-positive predictions by these systems lead to the generation of incorrect biomolecular events, which are spotted by end-users. This thesis proposes a novel post-processing approach, utilizing a combination of supervised and unsupervised learning techniques, that can automatically identify and filter out a considerable proportion of incorrect events from large-scale event databases, thus increasing the general credibility of those databases. The second part of this thesis is dedicated to a system we developed for hypothesis generation from large-scale event databases, which is able to discover novel biomolecular interactions among genes/gene-products. We cast the hypothesis generation problem as supervised network topology prediction, i.e., predicting new edges in the network, as well as types and directions for these edges, utilizing a set of features that can be extracted from large biomedical event networks. Routine machine learning evaluation results, as well as manual evaluation results, suggest that the problem is indeed learnable. This work won the Best Paper Award at The 5th International Symposium on Languages in Biology and Medicine (LBM 2013).
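A minimal sketch of the edge-prediction framing described in the second part: known interactions in a toy event network serve as positive training examples, non-edges as negatives, and a classifier scores candidate pairs. The gene names, the graph features (shared neighbours, degree product), and the logistic-regression choice are illustrative assumptions, not the thesis's actual feature set.

```python
import itertools
import networkx as nx
from sklearn.linear_model import LogisticRegression

# Toy event network: nodes are genes/gene-products, edges are known
# interactions extracted from the literature.
G = nx.Graph([("geneA", "geneB"), ("geneB", "geneC"), ("geneA", "geneC"),
              ("geneC", "geneD")])

def pair_features(g, u, v):
    """Generic graph features for a candidate pair: shared neighbours
    and the product of node degrees."""
    common = len(list(nx.common_neighbors(g, u, v)))
    return [common, g.degree(u) * g.degree(v)]

X, y = [], []
for u, v in itertools.combinations(G.nodes, 2):
    X.append(pair_features(G, u, v))
    y.append(1 if G.has_edge(u, v) else 0)

clf = LogisticRegression().fit(X, y)
# Score a candidate interaction not yet in the network:
print(clf.predict_proba([pair_features(G, "geneB", "geneD")])[0, 1])
```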

Relevance:

60.00%

Publisher:

Abstract:

Sub-lexical linguistic units (e.g., the syllable, the phoneme, or the phone) play a crucial role in language processing. In particular, language processing is deeply influenced by the distribution of these units. For example, the most frequent syllables are articulated more rapidly. It is therefore important to have access to tools for creating experimental or clinical material for the study of normal or pathological language that is representative of how syllables and phones are used in the spoken language. Access to this kind of tool also makes it possible to compare linguistic stimuli according to their distributional statistics, or to study the impact of these statistics on language processing in different populations. Yet until now, no such tool was available for the sub-lexical units of spoken Quebec French. To fill this gap, a vast corpus of spontaneous spoken Quebec French was built from recordings of 184 Quebec speakers. A syllable database and a phone database were then constructed from this corpus, offering a wealth of information on the structure of the units and on their distributional statistics. The result of this project, called SyllabO+, will be made freely available online at http://speechneurolab.ca/fr/syllabo upon publication of the article describing it. This unique tool will be of great use in several fields, such as cognitive neuroscience, psycholinguistics, experimental psychology, phonetics, phonology, speech-language pathology, and the study of language acquisition.
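As a minimal illustration of the kind of distributional statistic such a database exposes, the Python sketch below computes relative syllable frequencies from pre-syllabified transcriptions; the transcriptions are made-up placeholders, not SyllabO+ data.

```python
from collections import Counter

# Made-up pre-syllabified utterances standing in for corpus transcriptions.
transcripts = [["bo", "ZU", "la"], ["la", "vi", "bo"], ["bo", "ZU"]]

counts = Counter(syll for utterance in transcripts for syll in utterance)
total = sum(counts.values())
for syll, n in counts.most_common():
    print(f"{syll}: {n} occurrences, relative frequency {n / total:.3f}")
```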

Relevance:

60.00%

Publisher:

Abstract:

A primary goal of context-aware systems is delivering the right information at the right place and right time to users, enabling them to make effective decisions and improve their quality of life. There are three key requirements for achieving this goal: determining what information is relevant, personalizing it based on the users' context (location, preferences, behavioral history, etc.), and delivering it to them in a timely manner without an explicit request. These requirements create a paradigm that we term "Proactive Context-aware Computing". Most existing context-aware systems fulfill only a subset of these requirements. Many of these systems focus only on personalizing the requested information based on users' current context. Moreover, they are often designed for specific domains. In addition, most existing systems are reactive: the users request some information and the system delivers it to them. These systems are not proactive, i.e., they cannot anticipate users' intent and behavior and act proactively without an explicit request. To overcome these limitations, we need to conduct a deeper analysis and enhance our understanding of context-aware systems that are generic, universal, proactive, and applicable to a wide variety of domains. To support this dissertation, we explore several directions. Clearly, the most significant sources of information about users today are smartphones. A large amount of users' context can be acquired through them, and they can be used as an effective means to deliver information to users. In addition, social media such as Facebook, Flickr, and Foursquare provide a rich and powerful platform to mine users' interests, preferences, and behavioral history. We employ the ubiquity of smartphones and the wealth of information available from social media to address the challenge of building proactive context-aware systems. We have implemented and evaluated several approaches, including some as part of the Rover framework, to achieve the paradigm of Proactive Context-aware Computing. Rover is a context-aware research platform that has been evolving for the last six years. Since location is one of the most important contexts for users, we have developed 'Locus', an indoor localization, tracking, and navigation system for multi-story buildings. Other important dimensions of users' context include the activities they are engaged in. To this end, we have developed 'SenseMe', a system that leverages the smartphone and its multiple sensors to perform multidimensional context and activity recognition for users. As part of the 'SenseMe' project, we also conducted an exploratory study of privacy, trust, risks, and other concerns of users with smartphone-based personal sensing systems and applications. To determine what information would be relevant to users' situations, we have developed 'TellMe', a system that employs a new, flexible, and scalable approach based on Natural Language Processing techniques to perform bootstrapped discovery and ranking of relevant information in context-aware systems. To personalize the relevant information, we have also developed an algorithm and system for mining a broad range of users' preferences from their social network profiles and activities. For recommending new information to users based on their past behavior and context history (such as visited locations, activities, and time), we have developed a recommender system and approach for performing multi-dimensional collaborative recommendations using tensor factorization. For timely delivery of personalized and relevant information, it is essential to anticipate and predict users' behavior. To this end, we have developed a unified infrastructure within the Rover framework and implemented several novel approaches and algorithms that employ various contextual features and state-of-the-art machine learning techniques to build diverse behavioral models of users. Examples of generated models include classifying users' semantic places and mobility states, predicting their availability for accepting calls on smartphones, and inferring their device charging behavior. Finally, to enable proactivity in context-aware systems, we have also developed a planning framework based on HTN planning. Together, these works provide a major push in the direction of proactive context-aware computing.
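A minimal sketch of the multi-dimensional recommendation step mentioned above, assuming the tensorly library: a user x location x time-slot tensor of visit counts is factorized (CP/PARAFAC) and the reconstruction scores unobserved cells. The data, rank, and dimensions are illustrative, not the dissertation's actual model.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# 5 users, 4 locations, 3 time slots; sparse observed visit counts.
T = np.zeros((5, 4, 3))
T[0, 1, 2] = 3; T[0, 2, 2] = 1; T[1, 1, 2] = 2; T[2, 0, 0] = 4; T[3, 3, 1] = 2

# Low-rank CP/PARAFAC decomposition, then reconstruct to fill in
# unobserved (user, location, time) cells with predicted affinities.
cp = parafac(tl.tensor(T), rank=2, init="random", random_state=0)
scores = tl.cp_to_tensor(cp)

# Recommend the highest-scoring unvisited location for user 0 at time 2:
user, time = 0, 2
candidates = [loc for loc in range(4) if T[user, loc, time] == 0]
print(max(candidates, key=lambda loc: scores[user, loc, time]))
```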

Relevance:

60.00%

Publisher:

Abstract:

Since the mid-2000s, a new approach in machine learning, deep learning, has been gaining popularity. Indeed, this approach has demonstrated its effectiveness in solving various problems, improving on the results obtained by other techniques that were then considered state of the art. This is the case in object recognition as well as in speech recognition. Given this, using deep networks in the field of Natural Language Processing (NLP) is a logical next step. This thesis explores different neural network structures for modeling written text, focusing on models that are simple, powerful, and fast to train.

Relevance:

60.00%

Publisher:

Abstract:

Socioeconomic status (SES) influences language and cognitive development, with discrepancies particularly noticeable in vocabulary development. This study examines how SES-related differences impact the development of syntactic processing, cognitive inhibition, and word learning. Thirty-eight 4- to 5-year-olds from higher- and lower-SES backgrounds completed a word-learning task in which novel words were embedded in active and passive sentences. Critically, unlike the active sentences, all passive sentences required a syntactic revision. Measures of cognitive inhibition were obtained through a modified Stroop task. Results indicate that lower-SES participants had more difficulty using inhibitory functions to resolve conflict compared to their higher-SES counterparts. However, SES did not impact language processing, as the language outcomes were similar across SES backgrounds. Additionally, stronger inhibitory processes were related to better language outcomes in the passive sentence condition. These results suggest that cognitive inhibition impacts language processing, but this function may vary across children from different SES backgrounds.

Relevance:

60.00%

Publisher:

Abstract:

A flexible and multipurpose bio-inspired hierarchical model for analyzing musical timbre is presented in this paper. Inspired by findings in the fields of neuroscience, computational neuroscience, and psychoacoustics, not only does the model extract spectral and temporal characteristics of a signal, but it also analyzes amplitude modulations on different timescales. It uses a cochlear filter bank to resolve the spectral components of a sound, lateral inhibition to enhance spectral resolution, and a modulation filter bank to extract the global temporal envelope and roughness of the sound from amplitude modulations. The model was evaluated in three applications. First, it was used to simulate subjective data from two roughness experiments. Second, it was used for musical instrument classification using the k-NN algorithm and a Bayesian network. Third, it was applied to find the features that characterize sounds whose timbres were labeled in an audiovisual experiment. The successful application of the proposed model in these diverse tasks revealed its potential in capturing timbral information.
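To make the amplitude-modulation idea concrete, here is a minimal Python/SciPy sketch of one stage of such an analysis: extracting a global temporal envelope via the Hilbert transform and reading off the dominant modulation rate. It is only an illustration of the principle, not the paper's cochlear and modulation filter-bank pipeline.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
# A 440 Hz tone amplitude-modulated at 8 Hz (a slow, flutter-like AM).
signal = (1 + 0.8 * np.sin(2 * np.pi * 8 * t)) * np.sin(2 * np.pi * 440 * t)

envelope = np.abs(hilbert(signal))  # global temporal envelope
# Spectrum of the mean-removed envelope peaks at the modulation rate.
spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
freqs = np.fft.rfftfreq(len(envelope), 1 / fs)
print(f"dominant modulation rate: {freqs[spectrum.argmax()]:.1f} Hz")  # ~8 Hz
```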

Relevance:

60.00%

Publisher:

Abstract:

This Bachelor's Thesis (TFG) aims to create a framework for use in recommender systems. It was carried out by two people in the team-work modality. The tasks of this TFG are divided into two parts, one done jointly and the other individually. The joint part focuses on building a system that, starting from comments and opinions about points of interest (POIs) and using the natural language processing tool AlchemyAPI, constructs formal contexts and many-valued formal contexts. Creating the latter requires the use of ontologies. The many-valued formal context is the starting point of the second (individual) part, which consists of using the many-valued context to obtain a set of functional dependencies through a Java implementation of the FDMine algorithm. These dependencies can then be used in a recommendation engine. The system has been implemented as a Java EE version 6 web application and an API for working with many-valued formal contexts. Current technologies such as Spring and jQuery were used for the web development. This project is presented as initial work in which, in addition to the system built, various problems related to the creation of valid datasets are discussed. Finally, lines for future TFGs are also proposed.
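As a minimal illustration of what a functional-dependency miner such as FDMine verifies, the Python sketch below checks whether X -> Y holds in a table (rows agreeing on X must agree on Y); the POI review table is a hypothetical stand-in for the many-valued context built from AlchemyAPI output, and the thesis's actual implementation is in Java.

```python
def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds:
    rows that agree on the lhs attributes also agree on the rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical POI review table standing in for the many-valued context.
rows = [
    {"poi": "museum", "city": "Madrid", "sentiment": "positive"},
    {"poi": "museum", "city": "Madrid", "sentiment": "positive"},
    {"poi": "park",   "city": "Madrid", "sentiment": "negative"},
]
print(holds(rows, ["poi"], ["sentiment"]))   # True
print(holds(rows, ["city"], ["sentiment"]))  # False
```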

Relevance:

60.00%

Publisher:

Abstract:

Tests of the four skills for young native speakers do not commonly generate correlational incongruence with respect to the cognitive strategies frequently reported. For non-native speakers, there is sparse evidence to determine which tasks are important for properly assessing cognitive and academic language proficiency (Cummins, 1980; 2012). Research questions: it is highly probable that young students with an immigrant background differ significantly in their communication strategies and skills in a second-language processing context (1); following from this first assumption, it is supposed that teachers differ significantly depending on their scientific area and previous training (2). Purpose: this study examines whether school teachers (K-12) with different scientific domains of teaching and training perceive an adapted four-skills scale, in European Portuguese, differently. Research methods: 77 teachers from five scientific areas, mean years of teaching service = 32 (SD = 2.7), 57 males and 46 females (from basic and high school levels). Main findings: ANOVA (effect size and post-hoc Tukey tests) and linear regression analysis (stepwise method) revealed statistically significant differences among teachers from different areas, mainly between language teachers and science teachers. Language teachers perceive more accurately, and in a more multifaceted manner, the tasks corresponding to the broad skills that need to be measured in non-native students. Conclusion: if teachers perceive the importance of the big-four tasks differently, there will be incongruence in the skills measurements that teachers select for immigrant pupils. Non-balanced tasks and teachers' perceptions of evaluation and of students' competence would likely limit the academic and cognitive development of non-native students. Furthermore, the results showed sufficient evidence to conclude that tasks are perceived differently by teachers with regard to the importance of specific skill subareas. Reading skills are considered more important than oral comprehension skills for non-native students.

Relevance:

60.00%

Publisher:

Abstract:

This dissertation applies statistical methods to the evaluation of automatic summarization using data from the Text Analysis Conferences in 2008-2011. Several aspects of the evaluation framework itself are studied, including the statistical testing used to determine significant differences, the assessors, and the design of the experiment. In addition, a family of evaluation metrics is developed to predict the score an automatically generated summary would receive from a human judge and its results are demonstrated at the Text Analysis Conference. Finally, variations on the evaluation framework are studied and their relative merits considered. An over-arching theme of this dissertation is the application of standard statistical methods to data that does not conform to the usual testing assumptions.
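One family of tests suited to data that violate the usual assumptions is the paired permutation (sign-flip) test; the sketch below shows it on illustrative per-topic scores for two hypothetical summarizers. Whether this particular test matches the dissertation's choice is an assumption.

```python
import numpy as np

def paired_permutation_test(a, b, n_iter=10000, seed=0):
    """Two-sided paired permutation test on per-topic score differences."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(a) - np.asarray(b)
    observed = abs(diffs.mean())
    # Randomly flip the sign of each paired difference and count how
    # often a mean at least as extreme arises by chance.
    signs = rng.choice([-1, 1], size=(n_iter, len(diffs)))
    null_means = np.abs((signs * diffs).mean(axis=1))
    return float((null_means >= observed).mean())

# Illustrative per-topic scores, not TAC data.
system_a = [0.41, 0.38, 0.45, 0.40, 0.43, 0.39]
system_b = [0.36, 0.37, 0.40, 0.38, 0.41, 0.35]
print(paired_permutation_test(system_a, system_b))
```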

Relevance:

60.00%

Publisher:

Abstract:

In language processing, automatically detecting and distinguishing the relations of the causal group (KAUSA 'cause', ONDORIOA 'result', and HELBURUA 'purpose') in coherent texts is useful when building automatic question-answering systems. For this we use Rhetorical Structure Theory (RST) and its relations, taking as corpus the RST Treebank (Iruskieta et al., 2013), a corpus composed of scientific abstract texts. The corpus is downloaded in XML format, and the most important information is extracted from it with the XPATH tool. This work has three main goals: first, to distinguish the causal-group relations from each other; second, to distinguish these causal-group relations from all the other relations; and finally, to distinguish the EBALUAZIOA 'evaluation' and INTERPRETAZIOA 'interpretation' relations so that they can be applied in sentiment analysis. To perform these tasks, we used the most significant patterns obtained with the RhetDB tool and developed two applications: on the one hand, a search tool that takes the patterns we want to find and runs searches over any kind of text with relational structure; on the other, a tagger that labels relations given the most significant patterns. We have, moreover, developed both applications to be as parameterizable as possible, so that anyone can use them for similar tasks without modifying the code. After evaluating the tagger, we found that HELBURUA is the easiest relation to identify, and we also concluded that we have more difficulty distinguishing KAUSA from ONDORIOA. Likewise, we found that EBALUAZIOA and INTERPRETAZIOA can also be distinguished from each other.
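A minimal sketch of the XPath extraction step, using Python's lxml rather than the XPATH tool named above; the XML element and attribute names (relation, name, segment) are hypothetical, since the RST Treebank's actual schema is not shown in the abstract.

```python
from lxml import etree

# Hypothetical schema: <relation name="KAUSA"> wrapping <segment> spans.
doc = etree.fromstring(b"""
<rst>
  <relation name="KAUSA"><segment>the reagent was heated</segment></relation>
  <relation name="HELBURUA"><segment>to speed up the reaction</segment></relation>
</rst>""")

# Select the text of every segment attached to a causal-group relation.
for seg in doc.xpath("//relation[@name='KAUSA' or @name='ONDORIOA' "
                     "or @name='HELBURUA']/segment"):
    print(seg.getparent().get("name"), "->", seg.text)
```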