999 resultados para Segmentation-Free
Resumo:
In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information", or NGMI, which is used to segment Chinese documents into n-character words or phrases, using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort that is required in preparing and maintaining the manually segmented Chinese text for training purposes, and manually maintaining ever expanding lexicons. Previously, mutual information was used to achieve automated segmentation into 2-character words. The NGMI approach extends the approach to handle longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results.
Resumo:
The benefits of openness are widely apparent everywhere except, seemingly, in occupations. Yet the case against occupational licensing still remains strong. Consideration of dynamic costs strengthens the case further.
Resumo:
The Thai written language is one of the languages that does not have word boundaries. In order to discover the meaning of the document, all texts must be separated into syllables, words, sentences, and paragraphs. This paper develops a novel method to segment the Thai text by combining a non-dictionary based technique with a dictionary-based technique. This method first applies the Thai language grammar rules to the text for identifying syllables. The hidden Markov model is then used for merging possible syllables into words. The identified words are verified with a lexical dictionary and a decision tree is employed to discover the words unidentified by the lexical dictionary. Documents used in the litigation process of Thai court proceedings have been used in experiments. The results which are segmented words, obtained by the proposed method outperform the results obtained by other existing methods.
Resumo:
The automatic extraction of road features from remote sensed images has been a topic of great interest within the photogrammetric and remote sensing communities for over 3 decades. Although various techniques have been reported in the literature, it is still challenging to efficiently extract the road details with the increasing of image resolution as well as the requirement for accurate and up-to-date road data. In this paper, we will focus on the automatic detection of road lane markings, which are crucial for many applications, including lane level navigation and lane departure warning. The approach consists of four steps: i) data preprocessing, ii) image segmentation and road surface detection, iii) road lane marking extraction based on the generated road surface, and iv) testing and system evaluation. The proposed approach utilized the unsupervised ISODATA image segmentation algorithm, which segments the image into vegetation regions, and road surface based only on the Cb component of YCbCr color space. A shadow detection method based on YCbCr color space is also employed to detect and recover the shadows from the road surface casted by the vehicles and trees. Finally, the lane marking features are detected from the road surface using the histogram clustering. The experiments of applying the proposed method to the aerial imagery dataset of Gympie, Queensland demonstrate the efficiency of the approach.
Resumo:
Tagging has become one of the key activities in next generation websites which allow users selecting short labels to annotate, manage, and share multimedia information such as photos, videos and bookmarks. Tagging does not require users any prior training before participating in the annotation activities as they can freely choose any terms which best represent the semantic of contents without worrying about any formal structure or ontology. However, the practice of free-form tagging can lead to several problems, such as synonymy, polysemy and ambiguity, which potentially increase the complexity of managing the tags and retrieving information. To solve these problems, this research aims to construct a lightweight indexing scheme to structure tags by identifying and disambiguating the meaning of terms and construct a knowledge base or dictionary. News has been chosen as the primary domain of application to demonstrate the benefits of using structured tags for managing the rapidly changing and dynamic nature of news information. One of the main outcomes of this work is an automatically constructed vocabulary that defines the meaning of each named entity tag, which can be extracted from a news article (including person, location and organisation), based on experts suggestions from major search engines and the knowledge from public database such as Wikipedia. To demonstrate the potential applications of the vocabulary, we have used it to provide more functionalities in an online news website, including topic-based news reading, intuitive tagging, clipping and sharing of interesting news, as well as news filtering or searching based on named entity tags. The evaluation results on the impact of disambiguating tags have shown that the vocabulary can help to significantly improve news searching performance. The preliminary results from our user study have demonstrated that users can benefit from the additional functionalities on the news websites as they are able to retrieve more relevant news, clip and share news with friends and families effectively.
Resumo:
The increasing diversity of the Internet has created a vast number of multilingual resources on the Web. A huge number of these documents are written in various languages other than English. Consequently, the demand for searching in non-English languages is growing exponentially. It is desirable that a search engine can search for information over collections of documents in other languages. This research investigates the techniques for developing high-quality Chinese information retrieval systems. A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between Chinese words. This feature makes Chinese information retrieval more difficult since a retrieved document which contains the query term as a sequence of Chinese characters may not be really relevant to the query since the query term (as a sequence Chinese characters) may not be a valid Chinese word in that documents. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose two approaches to deal with the problems. In the first approach, we propose a hybrid Chinese information retrieval model by incorporating word-based techniques with the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents based on the relevancy to the query calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if it incorporates only Chinese segmentation with the traditional character-based approach. In the second approach, we propose a novel query expansion method which applies text mining techniques in order to find the most relevant words to extend the query. Unlike most existing query expansion methods, which generally select the highly frequent indexing terms from the retrieved documents to expand the query. In our approach, we utilize text mining techniques to find patterns from the retrieved documents that highly correlate with the query term and then use the relevant words in the patterns to expand the original query. This research project develops and implements a Chinese information retrieval system for evaluating the proposed approaches. There are two stages in the experiments. The first stage is to investigate if high accuracy segmentation can make an improvement to Chinese information retrieval. In the second stage, a text mining based query expansion approach is implemented and a further experiment has been done to compare its performance with the standard Rocchio approach with the proposed text mining based query expansion method. The NTCIR5 Chinese collections are used in the experiments. The experiment results show that by incorporating the text mining based query expansion with the hybrid model, significant improvement has been achieved in both precision and recall assessments.
Resumo:
Semi-automatic segmentation of still images has vast and varied practical applications. Recently, an approach "GrabCut" has managed to successfully build upon earlier approaches based on colour and gradient information in order to address the problem of efficient extraction of a foreground object in a complex environment. In this paper, we extend the GrabCut algorithm further by applying an unsupervised algorithm for modelling the Gaussian Mixtures that are used to define the foreground and background in the segmentation algorithm. We show examples where the optimisation of the GrabCut framework leads to further improvements in performance.
Resumo:
The use of animal sera for the culture of therapeutically important cells impedes the clinical use of the cells. We sought to characterize the functional response of human mesenchymal stem cells (hMSCs) to specific proteins known to exist in bone tissue with a view to eliminating the requirement of animal sera. Insulin-like growth factor-I (IGF-I), via IGF binding protein-3 or -5 (IGFBP-3 or -5) and transforming growth factor-beta 1 (TGF-beta(1)) are known to associate with the extracellular matrix (ECM) protein vitronectin (VN) and elicit functional responses in a range of cell types in vitro. We found that specific combinations of VN, IGFBP-3 or -5, and IGF-I or TGF-beta(1) could stimulate initial functional responses in hMSCs and that IGF-I or TGF-beta(1) induced hMSC aggregation, but VN concentration modulated this effect. We speculated that the aggregation effect may be due to endogenous protease activity, although we found that neither IGF-I nor TGF-beta(1) affected the functional expression of matrix metalloprotease-2 or -9, two common proteases expressed by hMSCs. In summary, combinations of the ECM and growth factors described herein may form the basis of defined cell culture media supplements, although the effect of endogenous protease expression on the function of such proteins requires investigation.