876 resultados para text analytics
Resumo:
The world we live in is well labeled for the benefit of humans but to date robots have made little use of this resource. In this paper we describe a system that allows robots to read and interpret visible text and use it to understand the content of the scene. We use a generative probabilistic model that explains spotted text in terms of arbitrary search terms. This allows the robot to understand the underlying function of the scene it is looking at, such as whether it is a bank or a restaurant. We describe the text spotting engine at the heart of our system that is able to detect and parse wild text in images, and the generative model, and present results from images obtained with a robot in a busy city setting.
Resumo:
Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase) based approaches should perform better than the term-based ones, but many experiments did not support this hypothesis. This paper presents an innovative technique, effective pattern discovery which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
Resumo:
This magazine, written by Melissa Giles, features three Brisbane-based media organisations: Radio 4RPH, Queensland Pride and 98.9FM. The PDF file on this website contains a text-only version of the magazine. Contact the author if you would like a copy of the text-only EPUB file or a copy of the full digital magazine with images. An audio version of the magazine is available at http://eprints.qut.edu.au/41729/
Resumo:
From the late sixteenth century, in response to the problem of how best to teach children to read, a variety of texts such as primers, spellers and readers were produced in England for vernacular instruction. This paper describes how these materials were used by teachers to develop first, a specific religious understanding according to the stricture of the time and second, a moral reading practice that provided the child with a guide to secular conduct. The analysis focuses on the use of these texts as a productive means for shaping the child-reader in the context of newly emerging educational spaces which fostered a particular, morally formative relation among teacher, child and text.
Resumo:
It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user preferences because of the large number of terms, patterns, and noise. Most existing popular text mining and classification methods have adopted term-based approaches. However, they have all suffered from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern-based methods should perform better than term- based ones in describing user preferences, but many experiments do not support this hypothesis. This research presents a promising method, Relevance Feature Discovery (RFD), for solving this challenging issue. It discovers both positive and negative patterns in text documents as high-level features in order to accurately weight low-level features (terms) based on their specificity and their distributions in the high-level features. The thesis also introduces an adaptive model (called ARFD) to enhance the exibility of using RFD in adaptive environment. ARFD automatically updates the system's knowledge based on a sliding window over new incoming feedback documents. It can efficiently decide which incoming documents can bring in new knowledge into the system. Substantial experiments using the proposed models on Reuters Corpus Volume 1 and TREC topics show that the proposed models significantly outperform both the state-of-the-art term-based methods underpinned by Okapi BM25, Rocchio or Support Vector Machine and other pattern-based methods.
Resumo:
A rule-based approach for classifying previously identified medical concepts in the clinical free text into an assertion category is presented. There are six different categories of assertions for the task: Present, Absent, Possible, Conditional, Hypothetical and Not associated with the patient. The assertion classification algorithms were largely based on extending the popular NegEx and Context algorithms. In addition, a health based clinical terminology called SNOMED CT and other publicly available dictionaries were used to classify assertions, which did not fit the NegEx/Context model. The data for this task includes discharge summaries from Partners HealthCare and from Beth Israel Deaconess Medical Centre, as well as discharge summaries and progress notes from University of Pittsburgh Medical Centre. The set consists of 349 discharge reports, each with pairs of ground truth concept and assertion files for system development, and 477 reports for evaluation. The system’s performance on the evaluation data set was 0.83, 0.83 and 0.83 for recall, precision and F1-measure, respectively. Although the rule-based system shows promise, further improvements can be made by incorporating machine learning approaches.
Resumo:
In the era of Web 2.0, huge volumes of consumer reviews are posted to the Internet every day. Manual approaches to detecting and analyzing fake reviews (i.e., spam) are not practical due to the problem of information overload. However, the design and development of automated methods of detecting fake reviews is a challenging research problem. The main reason is that fake reviews are specifically composed to mislead readers, so they may appear the same as legitimate reviews (i.e., ham). As a result, discriminatory features that would enable individual reviews to be classified as spam or ham may not be available. Guided by the design science research methodology, the main contribution of this study is the design and instantiation of novel computational models for detecting fake reviews. In particular, a novel text mining model is developed and integrated into a semantic language model for the detection of untruthful reviews. The models are then evaluated based on a real-world dataset collected from amazon.com. The results of our experiments confirm that the proposed models outperform other well-known baseline models in detecting fake reviews. To the best of our knowledge, the work discussed in this article represents the first successful attempt to apply text mining methods and semantic language models to the detection of fake consumer reviews. A managerial implication of our research is that firms can apply our design artifacts to monitor online consumer reviews to develop effective marketing or product design strategies based on genuine consumer feedback posted to the Internet.
Resumo:
It is a big challenge to acquire correct user profiles for personalized text classification since users may be unsure in providing their interests. Traditional approaches to user profiling adopt machine learning (ML) to automatically discover classification knowledge from explicit user feedback in describing personal interests. However, the accuracy of ML-based methods cannot be significantly improved in many cases due to the term independence assumption and uncertainties associated with them. This paper presents a novel relevance feedback approach for personalized text classification. It basically applies data mining to discover knowledge from relevant and non-relevant text and constraints specific knowledge by reasoning rules to eliminate some conflicting information. We also developed a Dempster-Shafer (DS) approach as the means to utilise the specific knowledge to build high-quality data models for classification. The experimental results conducted on Reuters Corpus Volume 1 and TREC topics support that the proposed technique achieves encouraging performance in comparing with the state-of-the-art relevance feedback models.
Resumo:
WHAT if you lost someone you loved? What if you had to let go for the sake of your own sanity? Lachlan Philpott's Colder and Dennis Kelly's Orphans, playing as part of La Boite's and Queensland Theatre Company's independents programs, are emotionally and textually dense theatrical works...
Resumo:
The development of text classification techniques has been largely promoted in the past decade due to the increasing availability and widespread use of digital documents. Usually, the performance of text classification relies on the quality of categories and the accuracy of classifiers learned from samples. When training samples are unavailable or categories are unqualified, text classification performance would be degraded. In this paper, we propose an unsupervised multi-label text classification method to classify documents using a large set of categories stored in a world ontology. The approach has been promisingly evaluated by compared with typical text classification methods, using a real-world document collection and based on the ground truth encoded by human experts.
Resumo:
It is a big challenge to clearly identify the boundary between positive and negative streams. Several attempts have used negative feedback to solve this challenge; however, there are two issues for using negative relevance feedback to improve the effectiveness of information filtering. The first one is how to select constructive negative samples in order to reduce the space of negative documents. The second issue is how to decide noisy extracted features that should be updated based on the selected negative samples. This paper proposes a pattern mining based approach to select some offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms. In this way, multiple revising strategies can be used to update extracted features. An iterative learning algorithm is also proposed to implement this approach on RCV1, and substantial experiments show that the proposed approach achieves encouraging performance.
Resumo:
Queensland University of Technology (QUT) was one of the first universities in Australia to establish an institutional repository. Launched in November 2003, the repository (QUT ePrints) uses the EPrints open source repository software (from Southampton) and has enjoyed the benefit of an institutional deposit mandate since January 2004. Currently (April 2012), the repository holds over 36,000 records, including 17,909 open access publications with another 2,434 publications embargoed but with mediated access enabled via the ‘Request a copy’ button which is a feature of the EPrints software. At QUT, the repository is managed by the library.QUT ePrints (http://eprints.qut.edu.au) The repository is embedded into a number of other systems at QUT including the staff profile system and the University’s research information system. It has also been integrated into a number of critical processes related to Government reporting and research assessment. Internally, senior research administrators often look to the repository for information to assist with decision-making and planning. While some statistics could be drawn from the advanced search feature and the existing download statistics feature, they were rarely at the level of granularity or aggregation required. Getting the information from the ‘back end’ of the repository was very time-consuming for the Library staff. In 2011, the Library funded a project to enhance the range of statistics which would be available from the public interface of QUT ePrints. The repository team conducted a series of focus groups and individual interviews to identify and prioritise functionality requirements for a new statistics ‘dashboard’. The participants included a mix research administrators, early career researchers and senior researchers. The repository team identified a number of business criteria (eg extensible, support available, skills required etc) and then gave each a weighting. After considering all the known options available, five software packages (IRStats, ePrintsStats, AWStats, BIRT and Google Urchin/Analytics) were thoroughly evaluated against a list of 69 criteria to determine which would be most suitable. The evaluation revealed that IRStats was the best fit for our requirements. It was deemed capable of meeting 21 out of the 31 high priority criteria. Consequently, IRStats was implemented as the basis for QUT ePrints’ new statistics dashboards which were launched in Open Access Week, October 2011. Statistics dashboards are now available at four levels; whole-of-repository level, organisational unit level, individual author level and individual item level. The data available includes, cumulative total deposits, time series deposits, deposits by item type, % fulltexts, % open access, cumulative downloads, time series downloads, downloads by item type, author ranking, paper ranking (by downloads), downloader geographic location, domains, internal v external downloads, citation data (from Scopus and Web of Science), most popular search terms, non-search referring websites. The data is displayed in charts, maps and table format. The new statistics dashboards are a great success. Feedback received from staff and students has been very positive. Individual researchers have said that they have found the information to be very useful when compiling a track record. It is now very easy for senior administrators (including the Deputy Vice Chancellor-Research) to compare the full-text deposit rates (i.e. mandate compliance rates) across organisational units. This has led to increased ‘encouragement’ from Heads of School and Deans in relation to the provision of full-text versions.
Resumo:
Real-time remote sales assistance is an underdeveloped component of online sales services. Solutions involving web page text chat, telephony and video support prove problematic when seeking to remotely guide customers in their sales processes, especially with configurations of physically complex artefacts. Recently, there has been great interest in the application of virtual worlds and augmented reality to create synthetic environments for remote sales of physical artefacts. However, there is a lack of analysis and development of appropriate software services to support these processes. We extend our previous work with the detailed design of configuration context services to support the management of an interactive sales session using augmented reality. We detail the context and configuration services required, presenting a novel data service streaming configuration information to the vendor for business analytics. We expect that a fully implemented configuration management service, based on our design, will improve the remote sales experience for both customers and vendors alike via analysis of the streamed information.
Resumo:
Much has been written on Michel Foucault’s reluctance to clearly delineate a research method, particularly with respect to genealogy (Harwood 2000; Meadmore, Hatcher, & McWilliam 2000; Tamboukou 1999). Foucault (1994, p. 288) himself disliked prescription stating, “I take care not to dictate how things should be” and wrote provocatively to disrupt equilibrium and certainty, so that “all those who speak for others or to others” no longer know what to do. It is doubtful, however, that Foucault ever intended for researchers to be stricken by that malaise to the point of being unwilling to make an intellectual commitment to methodological possibilities. Taking criticism of “Foucauldian” discourse analysis as a convenient point of departure to discuss the objectives of poststructural analyses of language, this paper develops what might be called a discursive analytic; a methodological plan to approach the analysis of discourses through the location of statements that function with constitutive effects.