926 results for automatic categorization
Abstract:
This paper describes a preprocessing module for improving the performance of a Spanish into Spanish Sign Language (Lengua de Signos Española: LSE) translation system when dealing with sparse training data. This preprocessing module replaces Spanish words with associated tags. The list of Spanish words (vocabulary) and associated tags used by this module is computed automatically by considering, for each Spanish word, the sign with the highest probability of being its translation. This automatic tag extraction has been compared to a manual strategy, achieving almost the same improvement. In this analysis, several alternatives for dealing with non-relevant words have been studied; non-relevant words are Spanish words not assigned to any sign. The preprocessing module has been incorporated into two well-known statistical translation architectures: a phrase-based system and a Statistical Finite State Transducer (SFST). The system has been developed for a specific application domain: the renewal of Identity Documents and Driver's Licenses. To evaluate the system, a parallel corpus made up of 4080 Spanish sentences and their LSE translations was used. The evaluation results revealed a significant performance improvement when this preprocessing module was included. In the phrase-based system, the proposed module raised BLEU (Bilingual Evaluation Understudy) from 73.8% to 81.0% and the human evaluation score from 0.64 to 0.83. In the case of the SFST, BLEU increased from 70.6% to 78.4% and the human evaluation score from 0.65 to 0.82.
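A minimal sketch of the word-to-tag replacement idea described above, assuming a hypothetical probability table p(sign | word) and threshold; the paper's actual tag-extraction procedure and data are not reproduced here:

```python
# Hedged sketch: replace source words with tags derived from the most
# probable sign translation; the probability table is illustrative only.
from typing import Dict, List

# Hypothetical p(sign | word) estimates; in the paper these are computed
# automatically from word-sign alignments of the training corpus.
SIGN_PROB: Dict[str, Dict[str, float]] = {
    "renovar":  {"RENEW": 0.92},
    "carnet":   {"CARD": 0.88},
    "de":       {"CARD": 0.10},   # weakly aligned: treated as non-relevant
    "conducir": {"DRIVE": 0.90},
}

def word_to_tag(word: str, threshold: float = 0.5) -> str:
    """Map a word to its most probable sign tag, or NONRELEVANT."""
    candidates = SIGN_PROB.get(word.lower(), {})
    if not candidates:
        return "NONRELEVANT"
    sign, prob = max(candidates.items(), key=lambda kv: kv[1])
    return sign if prob >= threshold else "NONRELEVANT"

def preprocess(sentence: str) -> List[str]:
    """Replace every Spanish word with its associated tag."""
    return [word_to_tag(w) for w in sentence.split()]

print(preprocess("renovar carnet de conducir"))
# ['RENEW', 'CARD', 'NONRELEVANT', 'DRIVE']
```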
Abstract:
This project deals with the automatic indexing of television content, a task that will gain importance with the imminent changes to television as we know it. The arrival of the new digital television will bring much more fluid interaction between the viewer and the broadcaster, along with a large number of channels, each carrying programmes of entirely different kinds. All of this will make search methods based on the content of these programmes absolutely indispensable. Our project is therefore fully focused on extracting some of the descriptors that will make it possible to categorize the different television programmes.
Abstract:
Our research aims to determine how textual genres can be exploited in the design of digital work environments in order to facilitate the textual practices of managers and secretaries in a Canadian municipality and a Canadian federal administration. To this end, the first objective is to assess the ability of digital work environments to support these employees' textual practices (reading, writing and manipulating texts). The second objective is to describe the roles of textual genres during textual practices. Using email as an example, the third objective is to examine how genre can be exploited to assist the accomplishment of textual practices in digital work environments. This qualitative research follows a two-stage methodology. The first stage consists of a close examination of textual practices, the difficulties encountered during them, the role of genre in digital work environments, and the cues drawn upon during email management. Three qualitative data collection methods were used with 17 managers and 17 secretaries from two public administrations: semi-structured interviews, logbooks and a cognitive survey. The results are examined using qualitative content analysis strategies. The second phase involves developing an email processing pipeline intended to support our reflection on textual genre and its exploitation in the design of digital work environments. A corpus of 1703 messages was built from a sample provided by two government managers. The results first give a general picture of the reading, writing and text manipulation practices common and specific to managers and secretaries. The importance of email, which accounts for roughly 40% of the systems recorded in the logbooks, is highlighted. The difficulties encountered in digital work environments are also described. Second, the roles of genre during textual practices are examined using a matrix that accounts for both its individual and collective dimensions, as well as its three main facets: form, content and function. We then present a framework for analysing the cues affecting email management, which synthesizes the process by which recipients interpret messages. A typology of managers' categorization patterns is also defined and then used in a statistical experiment aimed at describing and automatically categorizing email. At the end of this process, marked linguistic behaviours are observed across email categories. Automatic categorization based on the messages' lexicon also proves far more effective than non-lexical categorization. Following this research, we suggest enriching the traditional human-computer interaction paradigm with a semiotics of genre in digital work environments. The study also reflects on whether email belongs to a genre, drawing on the theoretical concepts of hypergenre, genre and subgenre.
The success of automatic email categorization along genre-dependent facets (content, form and function) offers interesting prospects for applying this concept to the design of digital work environments in order to facilitate employees' textual practices.
Abstract:
In this paper we investigate whether conventional text categorization methods suffice to infer different verbal intelligence levels. This research goal relies on the hypothesis that the vocabulary speakers use reflects their verbal intelligence level. Automatic estimation of the verbal intelligence of users of a spoken language dialog system may be useful when defining an optimal dialog strategy, by improving its adaptation capabilities. The work is based on a corpus containing descriptions (i.e. monologs) of a short film by test persons with different educational backgrounds, together with the speakers' verbal intelligence scores. First, a one-way analysis of variance was performed to compare the monologs with the film transcription and to demonstrate that there are differences in the vocabulary used by test persons with different verbal intelligence levels. Then, for the classification task, the monologs were represented as feature vectors using the classical TF–IDF weighting scheme. The Naive Bayes, k-nearest neighbors and Rocchio classifiers were tested. In this paper we describe and compare these classification approaches, define the optimal classification parameters and discuss the classification results obtained.
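A compact sketch of the classification setup named in the abstract, using scikit-learn; the monolog texts and intelligence labels below are placeholders, and NearestCentroid stands in for the Rocchio classifier:

```python
# Sketch: TF-IDF features with the three classifier families from the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

# Placeholder monologs and verbal-intelligence classes (assumption: two levels).
monologs = [
    "the man goes to the house",
    "he walks and then he sits",
    "the protagonist subsequently returns to his residence",
    "an elaborate sequence of events unfolds in the narrative",
]
labels = ["low", "low", "high", "high"]

X = TfidfVectorizer().fit_transform(monologs)

# NearestCentroid implements a Rocchio-style centroid classifier.
for clf in (MultinomialNB(), KNeighborsClassifier(n_neighbors=1), NearestCentroid()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))
```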
Abstract:
The automatic organization of email messages is a current challenge in machine learning. The excessive number of messages affects more and more users, especially those who use email as a communication and work tool. This thesis addresses the problem of automatic email organization, proposing a solution aimed at the automatic labeling of messages. Automatic labeling relies on the email folders previously created by users, treating them as labels, and on suggesting multiple labels for each message (top-N). Several learning techniques are studied, and the various fields that make up an email message are analysed to determine their suitability as classification features. The focus of this work is on the textual fields (the subject and the body of the messages), studying different forms of representation, feature selection, and classification algorithms. The participant fields are also evaluated using classification algorithms that represent them with the vector space model or as a graph. The various fields are combined for classification using the Majority Voting classifier-combination technique. Tests are carried out on a subset of Enron email messages and on a private dataset provided by the Institute for Systems and Technologies of Information, Control and Communication (INSTICC). These datasets are analysed in order to understand the characteristics of the data. The system is evaluated by the accuracy of the classifiers. The results obtained show significant improvements over related work.
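The top-N suggestion idea can be pictured with a short sketch; the messages, folder names, and model choice below are illustrative assumptions, not the thesis's exact setup:

```python
# Sketch: suggest the N most probable folder labels for a new message.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy messages with user-created folders treated as labels.
messages = [
    "meeting agenda for the project review",
    "invoice attached, payment due friday",
    "team lunch on thursday",
    "budget approval for q3 invoices",
]
folders = ["work", "finance", "social", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(messages)
clf = LogisticRegression(max_iter=1000).fit(X, folders)

def suggest_labels(text: str, n: int = 2) -> list:
    """Return the top-N candidate folders, most probable first."""
    probs = clf.predict_proba(vec.transform([text]))[0]
    order = np.argsort(probs)[::-1][:n]
    return [clf.classes_[i] for i in order]

print(suggest_labels("please approve the attached invoice"))
```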
Abstract:
Introduction: Email messages are currently considered an important means of communication. Email messages, commonly known as emails, are used easily and frequently to send and receive the most varied kinds of information. Their use serves many purposes, generating a large number of messages every day and, consequently, an enormous volume of information. This large volume of information requires constant manipulation of the messages in order to keep the collection organized. Typically, this manipulation consists of organizing the messages into a taxonomy, and the adopted taxonomy reflects the particular interests and preferences of the user.
Motivation: Manually organizing email is a laborious, time-consuming activity. Optimizing this process through an automatic method tends to improve user satisfaction. There is a growing need for new solutions for handling digital content that save users effort and cost; this need, specifically in the context of email handling, motivated this work.
Hypothesis: The main goal of this project is to enable the ad-hoc organization of email with reduced effort from the user. The proposed methodology organizes emails into a set of disjoint categories that reflect the user's preferences. The main aim of this process is to produce an organization in which messages are classified into appropriate classes while requiring the least possible effort from the user. To achieve these goals, this project uses text mining techniques, in particular automatic text categorization and active learning. To reduce the need to query the user (to label examples according to the desired categories), the d-confidence algorithm was used.
Automatic email organization process: The process of automatically organizing emails is carried out in three distinct phases: indexing, classification, and evaluation. In the first phase, the indexing phase, the emails go through a transformative cleaning process whose main purpose is to generate a representation of the emails suited to automatic processing. The second phase is the classification phase, which uses the dataset resulting from the previous phase to produce a classification model and then applies it to new emails. Starting from a matrix representing emails, terms, and their respective weights, together with a set of manually classified examples, a classifier is generated through a learning process. The resulting classifier is then applied to the set of emails, and a classification is obtained for all of them. Classification is based on a support vector machine classifier using the d-confidence active learning algorithm. The d-confidence algorithm aims to propose to the user the most informative examples for labeling; by identifying the emails carrying the most relevant information for the learning process, the number of iterations, and consequently the effort required from users, is reduced. The third and final phase is the evaluation phase, in which the performance of the classification process and the efficiency of the d-confidence algorithm are assessed. The evaluation method adopted is 10-fold cross-validation.
Conclusions: The automatic email organization process was developed successfully, and the performance of the generated classifier and of the d-confidence algorithm was reasonably good. On average, the categories show relatively low error rates, except for the most generic classes. The effort required from the user was reduced, since using the d-confidence algorithm achieved an error rate close to the final value even with fewer labeled cases than a supervised method would require. It is worth noting that, beyond the automatic email organization process itself, this project was an excellent opportunity to gain solid knowledge of text mining and of automatic classification and information retrieval processes. The study of such interesting areas sparked new interests that constitute real challenges for future work.
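The three-phase pipeline can be pictured with a short active-learning loop; the d-confidence criterion itself is defined in the original algorithm, so the sketch below substitutes a generic least-confidence query strategy over an SVM, with toy data standing in for the indexed emails:

```python
# Sketch: uncertainty-based active learning with an SVM.
# Least-confidence querying stands in for the d-confidence criterion,
# whose exact formulation follows the original algorithm.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # indexed email vectors (toy)
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # hidden "folder" labels

# Seed with two labeled examples per class; the rest form the query pool.
labeled = list(np.where(y == 0)[0][:2]) + list(np.where(y == 1)[0][:2])
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(10):                           # query budget
    clf = SVC(probability=True).fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])
    # The least confident prediction marks the most informative example.
    query = pool[int(np.argmin(probs.max(axis=1)))]
    labeled.append(query)                     # "ask the user" for y[query]
    pool.remove(query)

print("accuracy after 10 queries:", clf.score(X, y))
```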
Abstract:
The number of digital images has been increasing exponentially in the last few years. People have trouble managing their image collections and finding a specific image. An automatic image categorization system could help them manage images and find specific ones. In this thesis, an unsupervised visual object categorization system was implemented to categorize a set of unknown images. Because the system is unsupervised, it does not need labeled training images, which would have to be obtained manually; therefore, the number of possible categories and images can be huge. The system implemented in the thesis extracts local features from the images. These local features are used to build a codebook, and the local features and the codebook are then used to generate a feature vector for each image. Images are categorized based on these feature vectors. The system is able to categorize any given set of images based on their visual appearance: images that have similar image regions are grouped together in the same category, so, for example, images which contain cars are assigned to the same cluster. The unsupervised visual object categorization system can be used in many situations, e.g., in an Internet search engine, where the system can categorize images for a user who can then easily find a specific type of image.
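A condensed sketch of the pipeline the abstract describes (local features, codebook, per-image feature vector, clustering), using OpenCV's ORB detector and k-means as stand-ins; the thesis's actual feature detector, codebook size, and clustering method may differ:

```python
# Sketch: unsupervised bag-of-visual-words image categorization.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def bovw_categorize(image_paths, codebook_size=50, n_categories=5):
    orb = cv2.ORB_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = orb.detectAndCompute(img, None)
        if desc is None:                       # image with no keypoints
            desc = np.zeros((1, 32), np.uint8)
        per_image.append(desc.astype(np.float32))
    # Build the codebook from all local features pooled together.
    codebook = KMeans(n_clusters=codebook_size, n_init=10).fit(np.vstack(per_image))
    # Represent each image as a normalized histogram of codeword occurrences.
    hists = np.array([
        np.bincount(codebook.predict(d), minlength=codebook_size)
        for d in per_image
    ], dtype=float)
    hists /= hists.sum(axis=1, keepdims=True) + 1e-9
    # Group visually similar images into categories.
    return KMeans(n_clusters=n_categories, n_init=10).fit_predict(hists)
```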
Abstract:
Image categorization by means of bags of visual words has received increasing attention from the image processing and vision communities in recent years. In these approaches, each image is represented by invariant points of interest which are mapped to a Hilbert space representing a visual dictionary that aims to comprise the most discriminative features in a set of images. Nevertheless, the main problem of such approaches is finding a compact and representative dictionary, and finding such a dictionary automatically, with no user intervention, is an even more difficult task. In this paper, we propose a method to find such a dictionary automatically by employing a recently developed graph-based clustering algorithm called Optimum-Path Forest, which makes no assumption about the visual dictionary's size and is more efficient and effective than the state-of-the-art techniques used for dictionary generation. © 2012 IEEE.
Abstract:
This thesis concerns artificially intelligent natural language processing systems that are capable of learning the properties of lexical items (properties like verbal valency or inflectional class membership) autonomously while fulfilling the tasks for which they were deployed in the first place. Many of these tasks require a deep analysis of language input, which can be characterized as a mapping of utterances in a given input C to a set S of linguistically motivated structures with the help of linguistic information encoded in a grammar G and a lexicon L: G + L + C → S (1) The idea that underlies intelligent lexical acquisition systems is to modify this schematic formula in such a way that the system is able to exploit the information encoded in S to create a new, improved version of the lexicon: G + L + S → L' (2) Moreover, the thesis claims that a system can only be considered intelligent if it does not just make maximum use of the learning opportunities in C, but is also able to revise falsely acquired lexical knowledge. So, one of the central elements in this work is the formulation of a set of criteria for intelligent lexical acquisition systems subsumed under one paradigm: the Learn-Alpha design rule. The thesis describes the design and quality of a prototype for such a system, whose acquisition components have been developed from scratch and built on top of one of the state-of-the-art Head-driven Phrase Structure Grammar (HPSG) processing systems. The quality of this prototype is investigated in a series of experiments in which the system is fed with extracts of a large English corpus. While the idea of using machine-readable language input to automatically acquire lexical knowledge is not new, we are not aware of a system that fulfills Learn-Alpha and is able to deal with large corpora. To illustrate four major challenges in constructing such a system: a) the high number of possible structural descriptions caused by highly underspecified lexical entries demands a parser with a very effective ambiguity management system, b) the automatic construction of concise lexical entries out of a bulk of observed lexical facts requires a special technique of data alignment, c) the reliability of these entries depends on the system's decision on whether it has seen 'enough' input, and d) general properties of language might render some lexical features indeterminable if the system tries to acquire them with too high a precision. The cornerstone of this dissertation is the motivation and development of a general theory of automatic lexical acquisition that is applicable to every language and independent of any particular theory of grammar or lexicon. This work is divided into five chapters. The introductory chapter first contrasts three different and mutually incompatible approaches to (artificial) lexical acquisition: cue-based queries, head-lexicalized probabilistic context free grammars and learning by unification. Then the postulation of the Learn-Alpha design rule is presented. The second chapter outlines the theory that underlies Learn-Alpha and exposes all the related notions and concepts required for a proper understanding of artificial lexical acquisition. Chapter 3 develops the prototyped acquisition method, called ANALYZE-LEARN-REDUCE, a framework which implements Learn-Alpha.
The fourth chapter presents the design and results of a bootstrapping experiment conducted on this prototype, covering lexeme detection, learning of verbal valency, categorization into nominal count/mass classes, and selection of prepositions and sentential complements, among others. The thesis concludes with a summary of the findings, motivation for further improvements, and proposals for future research on the automatic induction of lexical features.
Abstract:
The amount of information contained within the Internet has exploded in recent decades. As more and more news stories, blog posts, and other kinds of articles are published on the Internet, categorization of articles and documents is increasingly desired. Among the approaches to categorizing articles, labeling is one of the most common methods: it provides a relatively intuitive and effective way to separate articles into different categories. However, manual labeling is limited by its efficiency, even though manually selected labels have relatively high quality. This report explores the topic modeling approach of Online Latent Dirichlet Allocation (Online-LDA). Additionally, a method to automatically label articles with their latent topics, combining the Online-LDA posterior with a probabilistic automatic labeling algorithm, is implemented. The goal of this report is to examine the accuracy of labels generated automatically by a topic model and a probabilistic relevance algorithm for a set of real-world, dynamically updated articles from an online Rich Site Summary (RSS) service.
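A sketch of the Online-LDA plus automatic-labeling idea using scikit-learn's online variational LDA; labeling each article with its dominant topic's highest-weight terms is a simplification of the probabilistic labeling algorithm the report implements, and the articles below are placeholders:

```python
# Sketch: online LDA over mini-batches of articles, then label each
# article with the top terms of its dominant topic.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

articles = [
    "stock markets fell as investors weighed inflation data",
    "the team won the championship after a late goal",
    "central bank raises interest rates to curb inflation",
    "injury forces star striker to miss the next match",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(articles)
lda = LatentDirichletAllocation(n_components=2, learning_method="online")
for start in range(0, X.shape[0], 2):          # stream in mini-batches
    lda.partial_fit(X[start:start + 2])

terms = np.array(vec.get_feature_names_out())
for doc, dist in zip(articles, lda.transform(X)):
    topic = int(np.argmax(dist))               # dominant latent topic
    label = ", ".join(terms[np.argsort(lda.components_[topic])[::-1][:3]])
    print(f"[{label}] {doc[:40]}...")
```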
Abstract:
During development, children become capable of categorically associating stimuli and of using these relationships for memory recall. Brain damage in childhood can interfere with this development. This study investigated categorical association of stimuli and recall in four children with brain damage. The etiology, topography, and timing of the lesions were diverse. Tasks included naming and immediate recall of 30 perceptually and semantically related figures, free sorting, delayed recall, and cued recall of the same material. Traditional neuropsychological tests were also employed. Two children whose brain damage was sustained in middle childhood relied on perceptual rather than categorical associations between figures and showed deficits in delayed or cued recall, in contrast to those with perinatal lesions. One child exhibited normal recall performance despite categorical association deficits. The present results suggest that brain-damaged children show deficits in categorization and recall that are not usually identified by traditional neuropsychological tests.
Abstract:
A long-standing debate in the literature is whether attention can form two or more independent spatial foci in addition to the well-known unique spatial focus. There is evidence that voluntary visual attention divides in space. The possibility that this also occurs for automatic visual attention was investigated here. Thirty-six female volunteers were tested. In each trial, a prime stimulus was presented in the left or right visual hemifield. This stimulus was characterized by the blinking of a superior, middle or inferior ring, the blinking of all these rings, or the blinking of the superior and inferior rings. A target stimulus to which the volunteer should respond with the same-side hand, or one to which she should not respond, was presented 100 ms later in a primed location, a location between two primed locations, or a location in the contralateral hemifield. Reaction time to the positive target stimulus in a primed location was consistently shorter than reaction time in the horizontally corresponding contralateral location. This attentional effect was significantly smaller or absent when the positive target stimulus appeared in the middle location after the double prime stimulus. These results suggest that automatic visual attention can focus on two separate locations simultaneously, to some extent sparing the region in between.
Abstract:
In this paper, we initially present an algorithm for automatic composition of melodies using chaotic dynamical systems. Afterward, we characterize chaotic music in a comprehensive way as comprising three perspectives: musical discrimination, dynamical influence on musical features, and musical perception. With respect to the first perspective, the coherence between generated chaotic melodies (continuous as well as discrete chaotic melodies) and a set of classical reference melodies is characterized by statistical descriptors and melodic measures. The significant differences among the three types of melodies are determined by discriminant analysis. Regarding the second perspective, the influence of dynamical features of chaotic attractors, e.g., Lyapunov exponent, Hurst coefficient, and correlation dimension, on melodic features is determined by canonical correlation analysis. The last perspective is related to perception of originality, complexity, and degree of melodiousness (Euler's gradus suavitatis) of chaotic and classical melodies by nonparametric statistical tests. © 2010 American Institute of Physics. [doi: 10.1063/1.3487516]
Abstract:
Complex networks have been characterised by their specific connectivity patterns (network motifs), but their building blocks can also be identified and described by node-motifs: a combination of local network features. One technique to identify single node-motifs has been presented by Costa et al. (L. D. F. Costa, F. A. Rodrigues, C. C. Hilgetag, and M. Kaiser, Europhys. Lett., 87, 1, 2009). Here, we first suggest improvements to the method, including how its parameters can be determined automatically. Such automatic routines make high-throughput studies of many networks feasible. Second, the new routines are validated on different network series. Third, we provide an example of how the method can be used to analyse network time-series. In conclusion, we provide a robust method for systematically discovering and classifying characteristic nodes of a network. In contrast to classical motif analysis, our approach can identify individual components (here: nodes) that are specific to a network. Such special nodes, as hubs before, might be found to play critical roles in real-world networks.
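A sketch of the underlying node-motif idea of describing each node by a vector of local features and then grouping nodes; the feature set and clustering below are illustrative assumptions, not the parameter-selection routine proposed in the paper:

```python
# Sketch: describe each node by local features, then cluster nodes
# to find groups of structurally similar "node-motifs".
import networkx as nx
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

G = nx.karate_club_graph()                    # example network

# Compute local features once for every node.
cc = nx.clustering(G)
bc = nx.betweenness_centrality(G)
features = np.array([[G.degree(n), cc[n], bc[n]] for n in G.nodes])

# Standardize and cluster the node-feature vectors (k chosen arbitrarily).
labels = KMeans(n_clusters=4, n_init=10).fit_predict(
    StandardScaler().fit_transform(features))

for k in range(4):
    members = [n for n, lab in zip(G.nodes, labels) if lab == k]
    print(f"node-motif {k}: nodes {members}")
```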