994 results for Document description





The aim of this paper is to investigate the role of emotion features in diversifying document rankings to improve the effectiveness of Information Retrieval (IR) systems. For this purpose, two approaches are proposed to consider emotion features for diversification, and they are empirically tested on the TREC 678 Interactive Track collection. The results show that emotion features are capable of enhancing retrieval effectiveness.





In this thesis we investigate the use of quantum probability theory for ranking documents. Quantum probability theory is used to estimate the probability of relevance of a document given a user's query. We posit that quantum probability theory can lead to a better estimation of the probability of a document being relevant to a user's query than the common approach, i.e., the Probability Ranking Principle (PRP), which is based upon Kolmogorovian probability theory. Following our hypothesis, we formulate an analogy between the document retrieval scenario and a physical scenario, that of the double slit experiment. Through the analogy, we propose a novel ranking approach, the quantum probability ranking principle (qPRP). Key to our proposal is the presence of quantum interference. Mathematically, this is the statistical deviation between empirical observations and expected values predicted by the Kolmogorovian rule of additivity of probabilities of disjoint events in configurations such as that of the double slit experiment. We propose an interpretation of quantum interference in the document ranking scenario, and examine how quantum interference can be effectively estimated for document retrieval. To validate our proposal and to gain more insights about approaches for document ranking, we (1) analyse PRP, qPRP and other ranking approaches, exposing the assumptions underlying their ranking criteria and formulating the conditions for the optimality of the two ranking principles; (2) empirically compare three ranking principles (i.e., PRP, interactive PRP, and qPRP) and two state-of-the-art ranking strategies in two retrieval scenarios, those of ad-hoc retrieval and diversity retrieval; (3) analytically contrast the ranking criteria of the examined approaches, exposing similarities and differences; and (4) study the ranking behaviours of approaches alternative to PRP in terms of the kinematics they impose on relevant documents, i.e., by considering the extent and direction of the movements of relevant documents across the ranking recorded when comparing PRP against its alternatives. Our findings show that the effectiveness of the examined ranking approaches strongly depends upon the evaluation context. In the traditional evaluation context of ad-hoc retrieval, PRP is empirically shown to be better than or comparable to alternative ranking approaches. However, when we turn to examine evaluation contexts that account for interdependent document relevance (i.e., when the relevance of a document is assessed also with respect to other retrieved documents, as is the case in the diversity retrieval scenario), the use of quantum probability theory, and thus of qPRP, is shown to improve retrieval and ranking effectiveness over the traditional PRP and alternative ranking strategies, such as Maximal Marginal Relevance, Portfolio theory, and Interactive PRP. This work represents a significant step forward regarding the use of quantum theory in information retrieval. It demonstrates that the application of quantum theory to problems within information retrieval can lead to improvements both in modelling power and retrieval effectiveness, allowing the construction of models that capture the complexity of information retrieval situations. Furthermore, the thesis opens up a number of lines for future research. These include: (1) investigating estimations and approximations of quantum interference in qPRP; (2) exploiting complex numbers for the representation of documents and queries; and (3) applying the concepts underlying qPRP to tasks other than document ranking.
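
The interference idea in the abstract above can be illustrated with a small sketch. In quantum probability, the combined probability of two events gains an extra term, 2·sqrt(pA·pB)·cos θ, on top of the Kolmogorovian sum pA + pB. The code below is a hedged illustration and not the thesis's implementation: the names `interference` and `greedy_rank`, the toy data, and in particular the choice to approximate cos θ by the negative of a document similarity are assumptions made here for illustration only.

```python
import math


def interference(p_a, p_b, cos_theta):
    """Quantum interference term between two events with probabilities p_a and p_b."""
    return 2.0 * math.sqrt(p_a * p_b) * cos_theta


def greedy_rank(relevance, similarity):
    """Greedy ranking: at each step pick the document that maximises its own
    relevance estimate plus its interference with the documents already ranked.

    relevance:  dict doc_id -> estimated probability of relevance
    similarity: function (doc_a, doc_b) -> value in [0, 1]
    Assumption (illustrative): cos(theta) is approximated by -similarity,
    so similar documents interfere destructively.
    """
    remaining = set(relevance)
    ranking = []
    while remaining:
        def score(d):
            return relevance[d] + sum(
                interference(relevance[d], relevance[r], -similarity(d, r))
                for r in ranking
            )
        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=score)
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Toy data: d1 and d2 are near-duplicates, d3 covers a different aspect.
relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
pairs = {frozenset(("d1", "d2")): 0.9}


def sim(a, b):
    return pairs.get(frozenset((a, b)), 0.0)


ranking = greedy_rank(relevance, sim)
prp_ranking = sorted(relevance, key=relevance.get, reverse=True)
print(ranking)      # interference-aware order
print(prp_ranking)  # plain ordering by decreasing relevance
```

With this toy data the plain relevance ordering places the two near-duplicates first, while the interference-aware ordering promotes the document covering a different aspect, which is the kind of behaviour the abstract associates with the diversity retrieval scenario.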





We provide the first molecular phylogeny of the clerid lineage (Coleoptera: Cleridae, Thanerocleridae) within the superfamily Cleroidea to examine the two most recently proposed hypotheses of higher-level classification. Phylogenetic relationships of checkered beetles were inferred from approximately 5,000 nt of both nuclear and mitochondrial rDNA (28S, 16S, and 12S) and the mitochondrial protein-coding gene COI. A worldwide sample of ~70 genera representing almost a quarter of generic diversity of the clerid lineage was included and phylogenies were reconstructed using Bayesian and Maximum Likelihood approaches. Results support the monophyly of many proposed subfamilies but were not entirely congruent with either current classification system. The subfamilial relationships within the Cleridae are resolved with support for three main lineages. Tillinae are supported as the sister group to all other subfamilies within the Cleridae, whereas Thaneroclerinae, Korynetinae and a new subfamily formally described here, Epiclininae subf. n., form a sister group to Clerinae + Hydnocerinae.





Newsletter ACM SIGIR Forum: The Seventeenth Australian Document Computing Symposium was held in Dunedin, New Zealand on the 5th and 6th of December 2012. In total, twenty-four papers were submitted. Of those, eleven were accepted for full presentation and eight for short presentation. A poster session was held jointly with the Australasian Language Technology Workshop.





With the growing size and variety of social media files on the web, it’s becoming critical to efficiently organize them into clusters for further processing. This paper presents a novel scalable constrained document clustering method that harnesses the power of search engines capable of dealing with large text data. Instead of calculating the distance between each document and all of the clusters’ centroids, a neighborhood of best cluster candidates is chosen using a document ranking scheme. To make the method faster and less memory-dependent, in-memory and in-database processing are combined in a semi-incremental manner. This method has been extensively tested in the social event detection application. Empirical analysis shows that the proposed method is efficient both in computation and memory usage while producing notable accuracy.
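
The candidate-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: a small in-memory inverted index stands in for the search engine, centroids are simplified to sets of terms, overlap counting stands in for the ranking scheme, and all names (`build_index`, `candidate_clusters`, `assign`) are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(centroids):
    """Inverted index: term -> set of cluster ids whose centroid contains the term."""
    index = defaultdict(set)
    for cid, terms in centroids.items():
        for term in terms:
            index[term].add(cid)
    return index


def candidate_clusters(doc_terms, index, k):
    """Rank clusters by the number of terms shared with the document; keep the top k."""
    hits = Counter()
    for term in set(doc_terms):
        for cid in index.get(term, ()):
            hits[cid] += 1
    # Descending overlap, then cluster id, so ties are broken deterministically.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cid for cid, _ in ranked[:k]]


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign(doc_terms, centroids, index, k=2):
    """Compare the document only against its k best candidate clusters."""
    candidates = candidate_clusters(doc_terms, index, k)
    if not candidates:
        return None  # no cluster shares a term with the document
    return max(candidates, key=lambda cid: jaccard(doc_terms, centroids[cid]))


# Toy centroids (illustrative data only).
centroids = {
    "concert": {"music", "band", "stage", "live"},
    "football": {"match", "goal", "stadium", "team"},
    "protest": {"march", "street", "police", "crowd"},
}
index = build_index(centroids)
doc = ["live", "band", "crowd", "stage"]
print(candidate_clusters(doc, index, 2))
print(assign(doc, centroids, index))
```

The design point is that the similarity function is evaluated only for the shortlisted clusters, so the cost per document depends on `k` rather than on the total number of clusters.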





This article presents a study of how humans perceive and judge the relevance of documents. Humans are adept at making reasonably robust and quick decisions about what information is relevant to them, despite the ever-increasing complexity and volume of their surrounding information environment. The literature on document relevance has identified various dimensions of relevance (e.g., topicality, novelty, etc.); however, little is understood about how these dimensions may interact. We performed a crowdsourced study of how human subjects judge two relevance dimensions in relation to document snippets retrieved from an internet search engine. The order of the judgments was controlled. For those judgments exhibiting an order effect, a q-test was performed to determine whether the order effects can be explained by a quantum decision model based on incompatible decision perspectives. Some evidence of incompatibility was found, which suggests that incompatible decision perspectives are appropriate for explaining interacting dimensions of relevance in such instances.
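
As a rough illustration of the kind of check involved, the sketch below assumes that the q-test refers to the QQ-equality test from the quantum cognition literature, in which the proportion of respondents giving different answers to two yes/no questions is predicted to be the same in both question orders (q = 0). The abstract does not spell out the test, so this reading, the function names, and the toy counts are all assumptions; only the point estimate of q is computed, not its significance.

```python
def q_statistic(counts_ab, counts_ba):
    """Point estimate of q for two yes/no judgments A and B.

    counts_ab: dict (answer_A, answer_B) -> count, when A is judged first
    counts_ba: dict (answer_B, answer_A) -> count, when B is judged first
    q is the difference, between the two orders, in the proportion of
    respondents whose two answers disagree.
    """
    n_ab = sum(counts_ab.values())
    n_ba = sum(counts_ba.values())
    disagree_ab = (counts_ab[("y", "n")] + counts_ab[("n", "y")]) / n_ab
    disagree_ba = (counts_ba[("y", "n")] + counts_ba[("n", "y")]) / n_ba
    return disagree_ab - disagree_ba


def p_a_yes(counts_ab, counts_ba):
    """Proportion answering 'yes' to A in each order (to expose an order effect)."""
    first = (counts_ab[("y", "y")] + counts_ab[("y", "n")]) / sum(counts_ab.values())
    second = (counts_ba[("y", "y")] + counts_ba[("n", "y")]) / sum(counts_ba.values())
    return first, second


# Toy counts (illustrative only, 100 respondents per order).
counts_ab = {("y", "y"): 40, ("y", "n"): 10, ("n", "y"): 20, ("n", "n"): 30}
counts_ba = {("y", "y"): 30, ("y", "n"): 20, ("n", "y"): 10, ("n", "n"): 40}

q = q_statistic(counts_ab, counts_ba)
a_first, a_second = p_a_yes(counts_ab, counts_ba)
print(q, a_first, a_second)
```

In this toy data the share of "yes" answers to A changes with the order, so an order effect is present, while q is zero, which is the pattern the assumed test treats as consistent with a quantum account.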





This thesis presents new methods for classification and thematic grouping of billions of web pages, at scales previously not achievable. This process is also known as document clustering, where similar documents are automatically associated with clusters that represent various distinct topics. These automatically discovered topics are in turn used to improve search engine performance by only searching the topics that are deemed relevant to particular user queries.





Descriptions of patients' injuries are recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families such as decision tree, probabilistic, neural networks, and instance-based, ensemble-based and kernel-based linear classifiers. Extensive pre-processing is carried out to ensure the quality of the data and, hence, the quality of the classification outcome. Records with a null entry in the injury description are removed. Misspellings are corrected by finding each misspelt word and replacing it with a soundlike word. Meaningful phrases have been identified and kept, instead of removing part of a phrase as a stop word. Abbreviations appearing in many forms of entry are manually identified and only one form of each abbreviation is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically, from about 28,000 to 5,000. The medical narrative text injury dataset under consideration is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, matrix factorization techniques such as Singular Value Decomposition (SVD) and Non Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers with these reduced feature spaces have been built. In experiments, a set of tests is conducted to determine which classification method is best for medical text classification. The Non Negative Matrix Factorization with Support Vector Machine method can achieve 93% precision, which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting, which works well for long text classification, is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as it affects the classification performance.
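
The two term-weighting schemes compared above can be sketched in a few lines. This is a minimal illustration using one common TF-IDF variant (raw term frequency times log(N/df)); the exact variant used in the paper is not stated, and the function names and toy injury narratives are hypothetical.

```python
import math
from collections import Counter


def binary_weights(doc_tokens, vocabulary):
    """1.0 if the term occurs in the document, else 0.0."""
    present = set(doc_tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]


def tfidf_weights(docs, vocabulary):
    """Raw term frequency times log(N / document frequency), per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([
            tf[term] * math.log(n_docs / df[term]) if df[term] else 0.0
            for term in vocabulary
        ])
    return vectors


# Toy short narratives (illustrative data only).
docs = [
    ["fell", "from", "ladder"],
    ["cut", "finger", "knife"],
    ["fell", "off", "bike"],
]
vocabulary = sorted({term for doc in docs for term in doc})

binary = [binary_weights(doc, vocabulary) for doc in docs]
tfidf = tfidf_weights(docs, vocabulary)
print(vocabulary)
print(binary[0])
print(tfidf[0])
```

In very short documents almost every term occurs once, so the term-frequency part of TF-IDF carries little information and the weights are driven by the inverse document frequency; this may be one reason the two schemes behave differently on short text, though the abstract itself does not give an explanation.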





The use of ‘topic’ concepts has been shown to improve search performance, given a query, by bringing together relevant documents which use different terms to describe a higher-level concept. In this paper, we propose a method for discovering and utilizing concepts in indexing and search for a domain-specific document collection being utilized in industry. This approach differs from others in that we only collect focused concepts to build the concept space and that, instead of turning a user’s query into a concept-based query, we experiment with different techniques of combining the original query with a concept query. We apply the proposed approach to a real-world document collection and the results show that in this scenario the use of concept knowledge at index and search can improve the relevancy of results.





Objective: Individuals with chronic whiplash-associated disorders (WADs) often note driving as a difficult task. This study’s aims were to (1) compare, while driving, neck motor performance, mental effort, and fatigue in individuals with chronic WAD against healthy controls and (2) investigate the relationships of these variables and neck pain to self-reported driving difficulty in the WAD group. Design: This study involved 14 participants in each group (WAD and control). Measures included self-reported driving difficulty and measures of neck pain intensity, overall fatigue, mental effort, and neck motor performance (head rotation and upper trapezius activity) while driving a simulator. Results: The WAD group had greater absolute path of head rotation in a simulated city area and used greater mental effort (P = 0.04), but there were no differences in other measures while driving compared with the controls (all P ≥ 0.05). Self-reported driving difficulty correlated moderately with neck pain intensity, fatigue level, and maximum velocity of head rotation while driving in the WAD group (all P < 0.05). Conclusions: Individuals with chronic WAD do not seem to have impaired neck motor performance while driving yet use greater mental effort. Neck pain, fatigue, and maximum head rotation velocity could be potential contributors to self-reported driving difficulty in this group.





Clustering is an important technique in organising and categorising web-scale documents. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate the labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution where Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. This solution is based on the assumption that Wikipedia contains such a wide range of diverse topics that it represents a small-scale web. We found that it was possible to perform the web-scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1,000 clusters. It takes longer to execute a solution with finer-granularity clusters, such as 10,000 or 50,000. These results were evaluated using a set of external data.





A long-held assumption in entrepreneurship research is that normal (i.e., Gaussian) distributions characterize variables of interest for both theory and practice. We challenge this assumption by examining more than 12,000 nascent, young, and hyper-growth firms. Results reveal that variables which play central roles in resource-, cognition-, action-, and environment-based entrepreneurship theories exhibit highly skewed power law distributions, where a few outliers account for a disproportionate amount of the distribution's total output. Our results call for the development of new theory to explain and predict the mechanisms that generate these distributions and the outliers therein. We offer a research agenda, including a description of non-traditional methodological approaches, to answer this call.





The Australian species of the Orthocladiinae genus Cricotopus Wulp (Diptera: Chironomidae) are revised for larval, pupal, adult male and female life stages. Eleven species, ten of which are new, are recognised and keyed, namely Cricotopus acornis Drayson & Cranston sp. nov., Cricotopus albitarsis Hergstrom sp. nov., Cricotopus annuliventris (Skuse), Cricotopus brevicornis Drayson & Cranston sp. nov., Cricotopus conicornis Drayson & Cranston sp. nov., Cricotopus hillmani Drayson & Cranston sp. nov., Cricotopus howensis Cranston sp. nov., Cricotopus parbicinctus Hergstrom sp. nov., Cricotopus tasmania Drayson & Cranston sp. nov., Cricotopus varicornis Drayson & Cranston sp. nov. and Cricotopus wangi Cranston & Krosch sp. nov. Using data from this study, we consider the wider utility of morphological and molecular diagnostic tools in untangling species diversity in the Chironomidae. Morphological support for distinguishing Cricotopus from Paratrichocladius Santo-Abreu in larval and pupal stages appears lacking for Australian taxa and brief notes are provided concerning this matter.





We propose the use of optical flow information as a method for detecting and describing changes in the environment, from the perspective of a mobile camera. We analyze the characteristics of the optical flow signal and demonstrate how robust flow vectors can be generated and used for the detection of depth discontinuities and appearance changes at key locations. To successfully achieve this task, a full discussion on camera positioning, distortion compensation, noise filtering, and parameter estimation is presented. We then extract statistical attributes from the flow signal to describe the location of the scene changes. We also employ clustering and dominant shape of vectors to increase the descriptiveness. Once a database of nodes (where a node is a detected scene change) and their corresponding flow features is created, matching can be performed whenever nodes are encountered, such that topological localization can be achieved. We retrieve the most likely node according to the Mahalanobis and Chi-square distances between the current frame and the database. The results illustrate the applicability of the technique for detecting and describing scene changes in diverse lighting conditions, considering indoor and outdoor environments and different robot platforms.
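
The matching step described above, retrieving the most likely node by distance to stored features, can be sketched as follows. This is a hedged illustration, not the authors' implementation: the Chi-square distance is written in one common form (with a factor of 0.5), the Mahalanobis distance assumes a diagonal covariance matrix for simplicity, and the function names, node labels, and histograms are hypothetical.

```python
import math


def chi_square_distance(h1, h2):
    """Chi-square distance between two histograms of equal length.
    Bins where both histograms are zero are skipped."""
    return 0.5 * sum(
        (a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0
    )


def mahalanobis_diagonal(x, mean, variances):
    """Mahalanobis distance under the simplifying assumption of a
    diagonal covariance matrix (one variance per feature)."""
    return math.sqrt(
        sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))
    )


def most_likely_node(frame_hist, database):
    """Return the stored node whose histogram is closest to the current frame."""
    return min(
        database, key=lambda node: chi_square_distance(frame_hist, database[node])
    )


# Toy database of nodes and flow-direction histograms (illustrative only).
database = {
    "door": [0.7, 0.2, 0.1],
    "corridor": [0.2, 0.3, 0.5],
    "window": [0.1, 0.1, 0.8],
}
frame = [0.6, 0.3, 0.1]

best = most_likely_node(frame, database)
d = mahalanobis_diagonal([3.0, 4.0], [0.0, 0.0], [1.0, 1.0])
print(best, d)
```

A full covariance matrix would require a matrix inverse; the diagonal form is used here only to keep the sketch dependency-free.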