945 results for DOCUMENT COLLECTIONS


Relevance: 20.00%

Abstract:

This thesis presents novel techniques for addressing the problems of continuous change and inconsistencies in large process model collections. The developed techniques treat process models as a collection of fragments and facilitate version control, standardization and automated process model discovery using fragment-based concepts. Experimental results show that the presented techniques are beneficial in consolidating large process model collections, specifically when there is a high degree of redundancy.

Relevance: 20.00%

Abstract:

Newsletter ACM SIGIR Forum: The Seventeenth Australasian Document Computing Symposium was held in Dunedin, New Zealand on the 5th and 6th of December 2012. In total, twenty-four papers were submitted. Of those, eleven were accepted for full presentation and eight for short presentation. A poster session was held jointly with the Australasian Language Technology Workshop.

Relevance: 20.00%

Abstract:

We identify relation completion (RC) as one recurring problem that is central to the success of novel big data applications such as Entity Reconstruction and Data Enrichment. Given a semantic relation, RC attempts to link entity pairs between two entity lists under the relation. To accomplish the RC goals, we propose to formulate search queries for each query entity α based on some auxiliary information, so as to detect its target entity β from the set of retrieved documents. For instance, a pattern-based method (PaRE) uses extracted patterns as the auxiliary information in formulating search queries. However, high-quality patterns may decrease the probability of finding suitable target entities. As an alternative, we propose the CoRE method, which uses context terms learned from the text surrounding the expression of a relation as the auxiliary information in formulating queries. The experimental results based on several real-world web data collections demonstrate that CoRE reaches a much higher accuracy than PaRE for the purpose of RC.

Relevance: 20.00%

Abstract:

This paper presents a novel framework to further advance the recent trend of using query decomposition and high-order term relationships in query language modeling, which takes into account terms implicitly associated with different subsets of query terms. Existing approaches, most notably the language model based on the Information Flow method, are however unable to capture multiple levels of associations and also suffer from a high computational overhead. In this paper, we propose to compute association rules from pseudo feedback documents that are segmented into variable-length chunks via multiple sliding windows of different sizes. Extensive experiments have been conducted on various TREC collections, and our approach significantly outperforms a baseline Query Likelihood language model, the Relevance Model and the Information Flow model.
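The chunking and rule-mining step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the window sizes, and the use of plain rule confidence (chunks containing both the query subset and the term, divided by chunks containing the query subset) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def sliding_chunks(tokens, window_sizes, step=1):
    """Yield chunks (as sets of terms) from sliding windows of several sizes."""
    for size in window_sizes:
        # A document shorter than the window yields one chunk with all tokens.
        last_start = max(len(tokens) - size, 0)
        for start in range(0, last_start + 1, step):
            yield set(tokens[start:start + size])


def rule_confidences(docs, query_terms, window_sizes=(3, 5)):
    """Confidence of rules {subset of query terms} -> term, counted over chunks.

    `docs` is a list of token lists standing in for pseudo feedback documents.
    """
    query_terms = set(query_terms)
    antecedent_count = Counter()  # chunks containing a given query subset
    joint_count = Counter()       # chunks containing the subset and the term
    for tokens in docs:
        for chunk in sliding_chunks(tokens, window_sizes):
            present = sorted(chunk & query_terms)
            others = chunk - query_terms
            for r in range(1, len(present) + 1):
                for subset in combinations(present, r):
                    antecedent_count[subset] += 1
                    for term in others:
                        joint_count[(subset, term)] += 1
    return {
        rule: joint_count[rule] / antecedent_count[rule[0]]
        for rule in joint_count
    }
```

Rules whose antecedent is a larger query subset correspond to the higher-order associations mentioned in the abstract; how such rules are folded into the query language model is not specified there and is left out of this sketch.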

Relevance: 20.00%

Abstract:

With the growing size and variety of social media files on the web, it is becoming critical to efficiently organize them into clusters for further processing. This paper presents a novel scalable constrained document clustering method that harnesses the power of search engines capable of dealing with large text data. Instead of calculating the distance between each document and all of the clusters’ centroids, a neighborhood of best cluster candidates is chosen using a document ranking scheme. To make the method faster and less memory-dependent, in-memory and in-database processing are combined in a semi-incremental manner. This method has been extensively tested in the social event detection application. Empirical analysis shows that the proposed method is efficient both in computation and memory usage while producing notable accuracy.
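The candidate-neighborhood idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: a small in-memory inverted index stands in for the search engine, ranking is by number of shared terms, and the names `cluster_stream`, `k_candidates`, and `threshold` are hypothetical. The clustering constraints and the in-database processing mentioned in the abstract are not modelled.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_stream(docs, k_candidates=2, threshold=0.3):
    """Assign each document to the best of a few ranked candidate clusters."""
    centroids = []            # one Counter of summed term counts per cluster
    index = defaultdict(set)  # term -> ids of clusters that contain it
    labels = []
    for tokens in docs:
        vec = Counter(tokens)
        # Rank clusters by shared terms instead of scanning every centroid.
        hits = Counter(cid for t in vec for cid in index.get(t, ()))
        candidates = [cid for cid, _ in hits.most_common(k_candidates)]
        best = max(candidates,
                   key=lambda c: cosine(vec, centroids[c]),
                   default=None)
        if best is None or cosine(vec, centroids[best]) < threshold:
            # No sufficiently similar candidate: open a new cluster.
            best = len(centroids)
            centroids.append(Counter())
        centroids[best].update(vec)
        for t in vec:
            index[t].add(best)
        labels.append(best)
    return labels
```

Only the `k_candidates` top-ranked clusters are compared with each document, so the cost per document no longer grows with the total number of clusters.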

Relevance: 20.00%

Abstract:

This article presents a study of how humans perceive and judge the relevance of documents. Humans are adept at making reasonably robust and quick decisions about what information is relevant to them, despite the ever-increasing complexity and volume of their surrounding information environment. The literature on document relevance has identified various dimensions of relevance (e.g., topicality, novelty, etc.); however, little is understood about how these dimensions may interact. We performed a crowdsourced study of how human subjects judge two relevance dimensions in relation to document snippets retrieved from an internet search engine. The order of the judgment was controlled. For those judgments exhibiting an order effect, a q-test was performed to determine whether the order effects can be explained by a quantum decision model based on incompatible decision perspectives. Some evidence of incompatibility was found, which suggests that a model of incompatible decision perspectives is appropriate for explaining interacting dimensions of relevance in such instances.
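For orientation, the q-test in the quantum question-order literature is usually stated as below. The notation is an assumption, since the abstract does not define it: p_{AB}(x, y) denotes the probability of answering x to judgment A and then y to judgment B, and p_{BA} denotes the same for the reversed order.

```latex
q = \bigl[\, p_{AB}(\text{yes}, \text{no}) + p_{AB}(\text{no}, \text{yes}) \,\bigr]
  - \bigl[\, p_{BA}(\text{yes}, \text{no}) + p_{BA}(\text{no}, \text{yes}) \,\bigr]
```

A quantum model with incompatible perspectives predicts q = 0 even when order effects are present, so a q value that does not differ significantly from zero is read as consistent with that model.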

Relevance: 20.00%

Abstract:

This article reports on a study investigating academic librarians' varying experiences of archives in order to promote understanding and communication among librarians and archivists. A qualitative, phenomenographic approach was adopted for the study. Three different ways of experiencing archives were identified from analysis of interviews. Archives may be experienced by academic librarians as 1) a place which protects collections; 2) resources to be used in accomplishing tasks such as teaching, research, or outreach; or 3) manifestations of politics. The third way of experiencing archives is the most complex, incorporating both the other experiences. The results of this study may help librarians, especially academic librarians, and archivists communicate more clearly on joint projects involving archival collections, thereby enabling more collaboration.

Relevance: 20.00%

Abstract:

This thesis presents new methods for classification and thematic grouping of billions of web pages, at scales not previously achievable. This process is also known as document clustering, where similar documents are automatically associated with clusters that represent various distinct topics. These automatically discovered topics are in turn used to improve search engine performance by only searching the topics that are deemed relevant to particular user queries.

Relevance: 20.00%

Abstract:

Herbarium accession data offer a useful historical botanical perspective and have been used to track the spread of plant invasions through time and space. Nevertheless, few studies have utilised this resource for genetic analysis to reconstruct a more complete picture of historical invasion dynamics, including the occurrence of separate introduction events. In this study, we combined nuclear and chloroplast microsatellite analyses of contemporary and historical collections of Senecio madagascariensis, a globally invasive weed first introduced to Australia c. 1918 from its native South Africa. Analysis of nuclear microsatellites, together with temporal spread data and simulations of herbarium voucher sampling, revealed distinct introductions to south-eastern Australia and mid-eastern Australia. Genetic diversity of the south-eastern invasive population was lower than in the native range, but higher than in the mid-eastern invasion. In the invasive range, despite its low resolution, our chloroplast microsatellite data revealed the occurrence of new haplotypes over time, probably as the result of subsequent introduction(s) to Australia from the native range during the latter half of the 20th century. Our work demonstrates how molecular studies of contemporary and historical field collections can be combined to reconstruct a more complete picture of the invasion history of introduced taxa. Further, our study indicates that a survey of contemporary samples only (as undertaken for the majority of invasive species studies) would be insufficient to identify potential source populations and occurrence of multiple introductions.

Relevance: 20.00%

Abstract:

It is well established that the traditional taxonomy and nomenclature of Chironomidae rely on adult males, whose usually characteristic genitalia provide evidence of species distinction. In the early days some names were based on female adults of variable distinctiveness, but females are difficult to identify (Ekrem et al. 2010) and many of these names remain dubious. In Russia especially, a system based on larval morphology grew in parallel to the conventional adult-based system. The systems became reconciled with the studies that underlay the production of the Holarctic generic keys to Chironomidae, commencing notably with the larval volume (Wiederholm, 1983). Ever since Thienemann’s pioneering studies, it has been evident that the pupa, notably the cast skins (exuviae), provides a wealth of features that can aid in identification (e.g. Wiederholm, 1986). Furthermore, the pupae can be readily associated with name-bearing adults when a pharate (‘cloaked’) adult stage is visible within the pupa. Association of larvae with the name-bearing later stages has been much more difficult, time-consuming and fraught with risk of failure. Yet it is identification of the larval stage that is needed by most applied researchers due to the value of the immature stages of the family in aquatic monitoring for water quality, although the pupal stage also has advocates (reviewed by Sinclair & Gresens, 2008). Few use the adult stage for such purposes, as their provenance and association with the water body can be verified only by emergence trapping, and sampling of adults lies outside regular aquatic monitoring protocols.

Relevance: 20.00%

Abstract:

The use of ‘topic’ concepts has been shown to improve search performance for a given query by bringing together relevant documents which use different terms to describe a higher-level concept. In this paper, we propose a method for discovering and utilizing concepts in indexing and search for a domain-specific document collection being utilized in industry. This approach differs from others in that we only collect focused concepts to build the concept space and that, instead of turning a user’s query into a concept-based query, we experiment with different techniques of combining the original query with a concept query. We apply the proposed approach to a real-world document collection, and the results show that in this scenario the use of concept knowledge at index and search time can improve the relevancy of results.
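One simple way to combine an original query with a concept query is linear interpolation of two scores, sketched below. The abstract does not say which combination techniques were tried, so everything here is an assumption for illustration: the term-overlap scoring, the hypothetical `term_to_concept` mapping, and the `weight` parameter.

```python
def overlap_score(query_terms, doc_terms):
    """Fraction of query terms that occur in the document."""
    query_terms = set(query_terms)
    if not query_terms:
        return 0.0
    return len(query_terms & set(doc_terms)) / len(query_terms)


def combined_score(query, doc_terms, doc_concepts, term_to_concept, weight=0.5):
    """Interpolate a term-level score with a concept-level score.

    `term_to_concept` is a hypothetical lookup from a term to its concept;
    `doc_concepts` are the concepts attached to the document at index time.
    """
    # Build the concept query from the query terms that map to a concept.
    concept_query = {term_to_concept[t] for t in query if t in term_to_concept}
    text = overlap_score(query, doc_terms)
    concept = overlap_score(concept_query, doc_concepts)
    return (1 - weight) * text + weight * concept
```

With `weight=0` the score reduces to the original query alone, and with `weight=1` only the concept query counts, which makes the trade-off between the two easy to tune on a held-out set of queries.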

Relevance: 20.00%

Abstract:

Clustering is an important technique in organising and categorising web-scale documents. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate the labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution where Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. This solution is based on the assumption that Wikipedia contains such a wide range of diverse topics that it represents a small-scale web. We found that it was possible to perform the web-scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer-granularity clusters, such as 10,000 or 50,000. These results were evaluated using a set of external data.

Relevance: 20.00%

Abstract:

This article reports a survey that sought to capture a contemporary snapshot of curriculum collections in Australian universities. It highlights best practice and issues in collection organisation, development and access, the challenges facing these collections, and possible future directions. Many themes emerged, including: the need to make spaces a vibrant part of the teaching and learning environment; the need to integrate print and digital collections to raise students’ awareness and use of resources; the need to demonstrate a link between collections and services and the students’ learning experience; the difficulties resulting from reduced budgets; and the need to actively engage academics.