993 results for Chinese word segmentation
Abstract:
In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information" (NGMI), which segments Chinese documents into n-character words or phrases using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort required to prepare and maintain manually segmented Chinese text for training, and to manually maintain ever-expanding lexicons. Previously, mutual information was used to achieve automated segmentation into 2-character words; NGMI extends this approach to longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results.
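To make the underlying idea concrete, here is a minimal sketch of mutual-information segmentation as the abstract describes its 2-character predecessor: score adjacent character pairs by pointwise mutual information (PMI) estimated from a corpus, and cut wherever the score falls below a threshold. The function names, the threshold, and the cut heuristic are illustrative assumptions, not the paper's actual NGMI formulation.

```python
import math
from collections import Counter

def train_counts(corpus):
    """Count unigram and adjacent-bigram character frequencies from raw text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i+2] for i in range(len(sentence) - 1))
    return unigrams, bigrams

def pmi(pair, unigrams, bigrams, total):
    """Pointwise mutual information of an adjacent character pair."""
    p_xy = bigrams[pair] / max(total - 1, 1)
    p_x = unigrams[pair[0]] / total
    p_y = unigrams[pair[1]] / total
    if p_xy == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

def segment(sentence, unigrams, bigrams, total, threshold=2.0):
    """Cut the sentence wherever adjacent-character PMI drops below threshold."""
    words, start = [], 0
    for i in range(len(sentence) - 1):
        if pmi(sentence[i:i+2], unigrams, bigrams, total) < threshold:
            words.append(sentence[start:i+1])
            start = i + 1
    words.append(sentence[start:])
    return words

# Usage: unigrams, bigrams = train_counts(corpus)
#        total = sum(unigrams.values())
#        segment(text, unigrams, bigrams, total)
```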
Abstract:
The increasing diversity of the Internet has created a vast number of multilingual resources on the Web, and a huge number of these documents are written in languages other than English. Consequently, the demand for searching in non-English languages is growing rapidly, and it is desirable that a search engine can search collections of documents in other languages. This research investigates techniques for developing high-quality Chinese information retrieval systems. A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between words. This makes Chinese information retrieval more difficult: a retrieved document that contains the query term as a sequence of Chinese characters may not really be relevant, since that character sequence may not form a valid Chinese word in the document; conversely, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose two approaches to these problems. In the first, we propose a hybrid Chinese information retrieval model that incorporates word-based techniques into the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents by relevancy to the query, calculated by combining character-based and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if segmentation is only incorporated into the traditional character-based approach. In the second approach, we propose a novel query expansion method that applies text mining techniques to find the words most relevant to the query. Unlike most existing query expansion methods, which generally select highly frequent indexing terms from the retrieved documents, our approach uses text mining to find patterns in the retrieved documents that correlate strongly with the query term and then uses the relevant words in those patterns to expand the original query. This research project develops and implements a Chinese information retrieval system for evaluating the proposed approaches. The experiments have two stages. The first stage investigates whether high-accuracy segmentation can improve Chinese information retrieval. In the second stage, the text-mining-based query expansion approach is implemented, and a further experiment compares its performance with the standard Rocchio approach. The NTCIR5 Chinese collections are used in the experiments. The results show that incorporating text-mining-based query expansion into the hybrid model achieves significant improvement in both precision and recall.
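A minimal sketch of the kind of ranking combination the hybrid model implies: interpolate a character-based relevance score with a word-based one. The interpolation weight `alpha` and the callable scoring interface are assumptions for illustration, not the thesis's actual ranking formulas.

```python
def hybrid_score(doc, query, char_score, word_score, alpha=0.5):
    """Interpolate a character-based and a word-based relevance score.

    char_score and word_score are callables returning, e.g., BM25 or
    tf-idf scores; alpha weights the word-based evidence.
    """
    return alpha * word_score(doc, query) + (1 - alpha) * char_score(doc, query)

def rank(docs, query, char_score, word_score, alpha=0.5):
    """Return documents sorted by the combined relevance score."""
    return sorted(
        docs,
        key=lambda d: hybrid_score(d, query, char_score, word_score, alpha),
        reverse=True,
    )
```

Setting `alpha` to 0 recovers a purely character-based ranking, which matches the abstract's observation that segmentation alone adds little on top of the character-based baseline.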
Abstract:
This paper describes our participation in the Chinese word segmentation task of CIPS-SIGHAN 2010. We implemented an n-gram mutual information (NGMI) based segmentation algorithm that mixes features from unsupervised, supervised and dictionary-based segmentation methods. The algorithm is combined with a simple strategy for out-of-vocabulary (OOV) word recognition. Evaluation in both the open and closed training tracks shows encouraging results for our system, although the OOV word recognition results in the closed training evaluation were unsatisfactory.
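One ingredient such a mixed system typically includes is a dictionary-based segmenter. Below is a minimal sketch of the classic forward maximum matching baseline, not the submitted system itself; the dictionary and the maximum word length are placeholders.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the
    longest dictionary word, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i+length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Illustrative usage with a toy dictionary:
# forward_max_match("中文分词很难", {"中文", "分词"}) -> ['中文', '分词', '很', '难']
```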
Abstract:
Thai is one of the written languages that does not mark word boundaries. To discover the meaning of a document, the text must first be separated into syllables, words, sentences, and paragraphs. This paper develops a novel method for segmenting Thai text by combining a non-dictionary-based technique with a dictionary-based technique. The method first applies Thai grammar rules to the text to identify syllables. A hidden Markov model is then used to merge possible syllables into words. The identified words are verified against a lexical dictionary, and a decision tree is employed to discover words the lexical dictionary does not cover. Documents used in the litigation process of Thai court proceedings were used in the experiments. The segmented words obtained by the proposed method outperform the results obtained by other existing methods.
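A hedged sketch of the syllable-merging step: a two-state (begin/inside) hidden Markov model decoded with Viterbi, labelling each syllable as starting or continuing a word. The state inventory, the unseen-syllable penalty, and the probability tables are illustrative assumptions; the abstract does not specify the paper's actual model parameters.

```python
# States: "B" starts a new word, "I" continues the current word.
STATES = ("B", "I")

def viterbi_merge(syllables, trans, emit):
    """Decode the most likely B/I labelling of a syllable sequence and
    merge syllables into words accordingly.

    trans[s1][s2] and emit[s][syllable] are log-probabilities, assumed
    to have been estimated from segmented training text (not shown).
    Unseen syllables get a flat penalty of -20.0.
    """
    V = [{s: emit[s].get(syllables[0], -20.0) for s in STATES}]
    back = []
    for syl in syllables[1:]:
        scores, ptrs = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: V[-1][p] + trans[p][s])
            scores[s] = V[-1][best_prev] + trans[best_prev][s] + emit[s].get(syl, -20.0)
            ptrs[s] = best_prev
        V.append(scores)
        back.append(ptrs)
    # Backtrace the best state sequence.
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    # Merge syllables into words at every "B" label.
    words = []
    for syl, label in zip(syllables, path):
        if label == "B" or not words:
            words.append(syl)
        else:
            words[-1] += syl
    return words
```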
Abstract:
This thesis deals with the automatic processing of Chinese and with Chinese language technology. Within language technology it focuses on the word boundary detection, or segmentation, problem characteristic of Chinese, which stems from the special features of the Chinese writing system. The thesis is an introductory pilot study intended for students and researchers interested in Chinese language technology research. The source material consists of English- and Chinese-language literature, mainly conference papers. The thesis introduces the Chinese writing system from the viewpoint of automatic processing, discussing the differences between traditional and simplified characters, character encodings, and input systems based on different approaches. The introduction to the writing system provides background for understanding the structure of the language and lays the groundwork for the sections on word boundary detection. The word boundary detection (segmentation) problem arises because the Chinese writing system does not mark the boundaries between words with spaces; for language-technological processing, however, word boundaries must be determined. Segmentation systems are computer programs that find and mark these boundaries automatically. The task is not simple, owing to the ambiguities of the language and so-called unknown words, and in some situations there is no single unambiguously correct segmentation. The thesis presents two segmentation systems, concentrating on describing how they work in a form the reader can understand; the emphasis is on understanding the methods rather than on technical details. Finally, the thesis examines the problems of evaluating segmentation systems. Comparing segmentation programs is often difficult because in many cases the systems do not produce commensurable results. The thesis describes an attempt to establish commensurable evaluation methods in the form of the Chinese Word Segmentation Bakeoff competitions. The thesis concludes that the word boundary detection problem is an important research topic, but unsolved problems remain, the most important being evaluation. Keywords: Chinese language, word boundary detection, segmentation, Chinese characters, character encodings, Chinese input methods
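The Bakeoff evaluations the thesis discusses score segmenters with word-level precision, recall, and F1 against a gold-standard segmentation. A minimal sketch of that standard metric follows, computed over character spans; it is not any particular Bakeoff's scoring script.

```python
def spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold_words, pred_words):
    """Word-level precision, recall and F1 of a predicted segmentation."""
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```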
Abstract:
A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between words. This makes Chinese information retrieval more difficult: a retrieved document that contains the query term as a sequence of Chinese characters may not really be relevant, since that character sequence may not form a valid Chinese word in the document. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose a hybrid Chinese information retrieval model that incorporates word-based techniques into the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents by relevancy to the query, calculated by combining character-based and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if segmentation is only incorporated into the traditional character-based approach.
Abstract:
Statistical machine translation systems translate from a source language into a target language. In most reference translation systems, the basic unit considered in textual analysis is the surface form as observed in a text. This design performs well when translating between two morphologically poor languages, but it no longer holds when translating into a morphologically rich (or complex) language. The goal of our work is to develop a statistical machine translation system that addresses the challenges raised by morphological complexity. In this thesis, we first examine a number of methods considered as extensions of traditional translation systems and evaluate their performance against state-of-the-art baseline systems on English-Inuktitut and English-Finnish translation tasks. We then develop a new segmentation algorithm that takes into account information from the language pair being translated. This segmentation algorithm is integrated into the phrase-based ("Phrase-Based Models") translation model to form our segment-sequence-based translation system. Finally, we combine the resulting system with post-processing algorithms to obtain a complete translation system. The experimental results reported in this thesis show that the proposed segment-based translation system yields significant improvements in translation quality as measured by the BLEU evaluation metric (Papineni et al., 2002). In particular, our segmentation approach slightly improves translation quality over the reference system, and a significant improvement is observed over the baseline preprocessing techniques.
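Since translation quality here is reported in BLEU (Papineni et al., 2002), a compact sketch of sentence-level BLEU may help: the geometric mean of modified n-gram precisions times a brevity penalty. Real evaluations use corpus-level BLEU with smoothing; this unsmoothed single-reference version is for illustration only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform n-gram weights and a brevity
    penalty, after Papineni et al. (2002)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_precisions.append(math.log(overlap / total))
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)
```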
Abstract:
This paper presents the design of a full-fledged OCR system for printed Kannada text. Machine recognition of Kannada characters is difficult due to the similarity in shape of different characters, the complexity of the script, and non-uniqueness in the representation of diacritics. The document image is subjected to line segmentation, word segmentation and zone detection. From the zonal information, base characters, vowel modifiers and consonant conjuncts are separated. A knowledge-based approach is employed for recognizing the base characters, using various features that include the coefficients of the Discrete Cosine Transform, the Discrete Wavelet Transform and the Karhunen-Loève Transform. These features are fed to different classifiers, and structural features are used at subsequent levels to discriminate confusable characters; their use increases the recognition rate from 93% to 98%. Apart from the classical nearest-neighbour pattern classification technique, Artificial Neural Network (ANN) classifiers such as Back Propagation and Radial Basis Function (RBF) networks have also been studied. The ANN classifiers are trained in supervised mode using the transform features. The highest recognition rate of 99% is obtained with the RBF network, using second-level approximation coefficients of Haar wavelets as features on pre-segmented base characters.
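Line segmentation in such OCR pipelines is commonly done with projection profiles: summing ink pixels row by row and cutting at empty rows. A generic sketch of that technique follows, assuming a binarized numpy image with ink as 1; it is not the paper's actual implementation.

```python
import numpy as np

def segment_lines(binary_image):
    """Split a binary page image (ink = 1) into horizontal line bands.

    Rows with no ink are treated as inter-line gaps; each run of
    non-empty rows becomes one line image. The zero threshold is an
    illustrative simplification (real pages need noise tolerance).
    """
    profile = binary_image.sum(axis=1)          # ink count per row
    in_line, start, lines = False, 0, []
    for y, ink in enumerate(profile):
        if ink > 0 and not in_line:
            in_line, start = True, y
        elif ink == 0 and in_line:
            in_line = False
            lines.append(binary_image[start:y])
    if in_line:
        lines.append(binary_image[start:])
    return lines
```

The same idea applied column-wise within a line band yields the word segmentation step.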
Abstract:
In this work, we describe a system that recognizes open-vocabulary, isolated, online handwritten Tamil words, and we extend it to recognize a paragraph of writing. We explain in detail each step involved in the process: segmentation, preprocessing, feature extraction, classification and bigram-based post-processing. On our database of 45,000 handwritten words obtained through a tablet PC, we obtain symbol-level accuracies of 78.5% and 85.3% without and with post-processing using symbol-level language models, respectively; the corresponding word-level accuracies are 40.1% and 59.6%. A line- and word-level segmentation strategy is proposed, which gives promising results of 100% line segmentation and 98.1% word segmentation accuracy in our initial trials on 40 handwritten paragraphs. The two modules have been combined into a full-fledged page recognition system for online handwritten Tamil data. To the authors' knowledge, this is the first attempt at recognition of open-vocabulary, online handwritten paragraphs in any Indian language.
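A hedged sketch of the bigram-based post-processing step: rescore the recognizer's candidate symbol sequences with a symbol-level bigram language model and keep the best combined score. The interpolation weight and the score interfaces are assumptions, not the paper's exact formulation.

```python
def rescore(candidates, bigram_logprob, lm_weight=0.5):
    """Pick the candidate symbol sequence with the best combined score.

    candidates: list of (symbols, classifier_logprob) pairs, where
    symbols is a sequence of recognized symbol labels.
    bigram_logprob(prev, cur): symbol-level bigram log-probability.
    """
    def lm_score(symbols):
        return sum(bigram_logprob(p, c) for p, c in zip(symbols, symbols[1:]))

    return max(
        candidates,
        key=lambda c: (1 - lm_weight) * c[1] + lm_weight * lm_score(c[0]),
    )[0]
```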
Abstract:
Handwritten Chinese character segmentation merges and splits character components according to the spatial relationships of the input handwriting, forming complete character stroke groups for subsequent recognition. By jointly exploiting the structural positions of character components and the spatial relationships between strokes, this work segments online, continuously handwritten Chinese input using a minimal spanning tree (MST) over the strokes, and obtains good segmentation results, with a segmentation accuracy above 91.6%.
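A minimal sketch of the MST idea the abstract describes: build a minimum spanning tree over stroke positions (here, stroke centroids) with Kruskal's algorithm, then cut unusually long edges so the remaining components group strokes into characters. The centroid representation and the cut heuristic are illustrative assumptions, not the paper's exact criteria.

```python
import math

def mst_cluster(centroids, cut_factor=1.5):
    """Group stroke centroids into characters: build an MST (Kruskal)
    over pairwise distances, then cut edges much longer than the
    average MST edge. cut_factor is an illustrative parameter."""
    n = len(centroids)
    edges = sorted(
        (math.dist(centroids[i], centroids[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    mst = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # Kruskal: join components
            parent[ri] = rj
            mst.append((d, i, j))
    # Cut unusually long MST edges; remaining components are characters.
    mean_d = sum(d for d, _, _ in mst) / max(len(mst), 1)
    parent = list(range(n))
    for d, i, j in mst:
        if d <= cut_factor * mean_d:
            parent[find(i)] = find(j)
    clusters = {}
    for k in range(n):
        clusters.setdefault(find(k), []).append(k)
    return list(clusters.values())
```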
Abstract:
Using methods from the physics of music and from mathematics, this paper reveals the secret of the four tones of Mandarin Chinese: a frequency variation ratio of 3:2, a proportion named the "gem ratio". This finding provides an important scientific basis for speech research, pronunciation teaching, and computer speech recognition.
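For context, a 3:2 frequency ratio is the just perfect fifth of music theory; on the logarithmic pitch scale it corresponds to roughly seven semitones. This is a standard acoustics fact added for orientation, not a claim from the paper:

```latex
% Interval size of a 3:2 frequency ratio in cents (100 cents = 1 semitone):
\[
  1200 \log_2 \frac{3}{2} \approx 701.96 \text{ cents} \approx 7 \text{ semitones}
\]
```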
Abstract:
The biggest shortcoming of Chinese-character microcomputer databases is that they are too slow to be practical. This paper proposes a new Chinese character retrieval method, the flag-field (标志域) method, which solves this key problem and speeds up lookup by nearly a factor of ten. A series of further measures, such as single-layer continuous querying, also saves a considerable amount of storage space and broadens the range of microcomputer applications. The database is very well suited to end users: even someone who knows nothing about computers can learn to use it within a minute or two.
Abstract:
Previous studies have observed that some psychological or behavioral deviations (such as aggressive behavior) may be associated with cerebral hemisphere cooperative dysfunction; however, it is still unclear whether there is an association between social cognitive bias and interhemispheric cooperative function, especially when interhemispheric cooperative processing takes place under emotional interference. The purpose of this study is to explore the differences in interhemispheric cooperative functional activity between a social cognitive bias group and a normal group, with and without interference. Methods: Based on Dodge's (1993) model of "social-cognitive mechanisms in the development of conduct disorder and depression", a 51-item social cognitive bias scale was created and used to screen a high-score group. The high-score group comprised 20 male subjects, and 23 matched subjects formed the control group. Stimuli were presented tachistoscopically to the bilateral visual fields and compared with central presentation. Both groups' interhemispheric cooperative functional activity was observed and compared without interference (baseline) and under the emotional interference of white noise and of negative evaluative feedback speech, while subjects completed two experiments: experiment one, a Chinese word-figure Stroop analogue task; experiment two, a task combining two single Chinese characters. Heart rate and respiratory rate were recorded simultaneously as indices of emotional change. Results: (1) In experiment one, the high-score group showed lower processing accuracy than the normal group under white noise interference. (2) Under the same white noise interference, the high-score group showed longer reaction times and more errors than the normal group in experiment two. (3) In both experiments, both groups showed a speed-up effect and a strategic speed-accuracy trade-off under white noise interference. (4) No between-group differences in interhemispheric cooperative function were observed at baseline or under negative evaluative feedback speech in either experiment. Conclusion: The results suggest that interhemispheric cooperative function differs between the two groups, characterized as follows: (1) the groups differ in interhemispheric cooperative processing strategy, with the high-score group presenting a "hierarchic" deficiency strategy; (2) the between-group differences are condition-specific, appearing in this study only under white noise interference; (3) the differences manifest across multiple performance dimensions, with a deficit orientation in the high-score group; and (4) the differences vary with the cooperative task, the high-score group performing worse on the complementary cooperative task. In addition, the fact that both groups adjusted their processing strategies under white-noise-evoked emotional interference implies an interaction between interhemispheric cooperative processing and emotion.
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)