86 resultados para Document Ranking


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Intuitively, any ‘bag of words’ approach in IR should benefit from taking term dependencies into account. Unfortunately, for years the results of exploiting such dependencies have been mixed or inconclusive. To improve the situation, this paper shows how the natural language properties of the target documents can be used to transform and enrich the term dependencies to more useful statistics. This is done in three steps. The term co-occurrence statistics of queries and documents are each represented by a Markov chain. The paper proves that such a chain is ergodic, and therefore its asymptotic behavior is unique, stationary, and independent of the initial state. Next, the stationary distribution is taken to model queries and documents, rather than their initial distributions. Finally, ranking is achieved following the customary language modeling paradigm. The main contribution of this paper is to argue why the asymptotic behavior of the document model is a better representation then just the document’s initial distribution. A secondary contribution is to investigate the practical application of this representation in case the queries become increasingly verbose. In the experiments (based on Lemur’s search engine substrate) the default query model was replaced by the stable distribution of the query. Just modeling the query this way already resulted in significant improvements over a standard language model baseline. The results were on a par or better than more sophisticated algorithms that use fine-tuned parameters or extensive training. Moreover, the more verbose the query, the more effective the approach seems to become.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures cluster quality measures are performing work that prevents ineffective clusterings from giving high scores to clusterings that provide no useful result. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters to categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. This approach can be applied to any clustering evaluation. This paper describes its use in the context of document clustering evaluation.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The benefits of applying tree-based methods to the purpose of modelling financial assets as opposed to linear factor analysis are increasingly being understood by market practitioners. Tree-based models such as CART (classification and regression trees) are particularly well suited to analysing stock market data which is noisy and often contains non-linear relationships and high-order interactions. CART was originally developed in the 1980s by medical researchers disheartened by the stringent assumptions applied by traditional regression analysis (Brieman et al. [1984]). In the intervening years, CART has been successfully applied to many areas of finance such as the classification of financial distress of firms (see Frydman, Altman and Kao [1985]), asset allocation (see Sorensen, Mezrich and Miller [1996]), equity style timing (see Kao and Shumaker [1999]) and stock selection (see Sorensen, Miller and Ooi [2000])...

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances are compared to purely random signatures. It explains why TopSig is only competitive with state of the art retrieval models at early precision. Only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Load modeling plays an important role in power system dynamic stability assessment. One of the widely used methods in assessing load model impact on system dynamic response is through parametric sensitivity analysis. Load ranking provides an effective measure of such impact. Traditionally, load ranking is based on either static or dynamic load model alone. In this paper, composite load model based load ranking framework is proposed. It enables comprehensive investigation into load modeling impacts on system stability considering the dynamic interactions between load and system dynamics. The impact of load composition on the overall sensitivity and therefore on ranking of the load is also investigated. Dynamic simulations are performed to further elucidate the results obtained through sensitivity based load ranking approach.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Finding and labelling semantic features patterns of documents in a large, spatial corpus is a challenging problem. Text documents have characteristics that make semantic labelling difficult; the rapidly increasing volume of online documents makes a bottleneck in finding meaningful textual patterns. Aiming to deal with these issues, we propose an unsupervised documnent labelling approach based on semantic content and feature patterns. A world ontology with extensive topic coverage is exploited to supply controlled, structured subjects for labelling. An algorithm is also introduced to reduce dimensionality based on the study of ontological structure. The proposed approach was promisingly evaluated by compared with typical machine learning methods including SVMs, Rocchio, and kNN.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Enterprise Systems (ES) can be understood as the de facto standard for holistic operational and managerial support within an organization. Most commonly ES are offered as commercial off-the-shelf packages, requiring customization in the user organization. This process is a complex and resource-intensive task, which often prevents small and midsize enterprises (SME) from undertaking configuration projects. Especially in the SME market independent software vendors provide pre-configured ES for a small customer base. The problem of ES configuration is shifted from the customer to the vendor, but remains critical. We argue that the yet unexplored link between process configuration and business document configuration must be closer examined as both types of configuration are closely tied to one another.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Particulate matter research is essential because of the well known significant adverse effects of aerosol particles on human health and the environment. In particular, identification of the origin or sources of particulate matter emissions is of paramount importance in assisting efforts to control and reduce air pollution in the atmosphere. This thesis aims to: identify the sources of particulate matter; compare pollution conditions at urban, rural and roadside receptor sites; combine information about the sources with meteorological conditions at the sites to locate the emission sources; compare sources based on particle size or mass; and ultimately, provide the basis for control and reduction in particulate matter concentrations in the atmosphere. To achieve these objectives, data was obtained from assorted local and international receptor sites over long sampling periods. The samples were analysed using Ion Beam Analysis and Scanning Mobility Particle Sizer methods to measure the particle mass with chemical composition and the particle size distribution, respectively. Advanced data analysis techniques were employed to derive information from large, complex data sets. Multi-Criteria Decision Making (MCDM), a ranking method, drew on data variability to examine the overall trends, and provided the rank ordering of the sites and years that sampling was conducted. Coupled with the receptor model Positive Matrix Factorisation (PMF), the pollution emission sources were identified and meaningful information pertinent to the prioritisation of control and reduction strategies was obtained. This thesis is presented in the thesis by publication format. It includes four refereed papers which together demonstrate a novel combination of data analysis techniques that enabled particulate matter sources to be identified and sampling site/year ranked. The strength of this source identification process was corroborated when the analysis procedure was expanded to encompass multiple receptor sites. Initially applied to identify the contributing sources at roadside and suburban sites in Brisbane, the technique was subsequently applied to three receptor sites (roadside, urban and rural) located in Hong Kong. The comparable results from these international and national sites over several sampling periods indicated similarities in source contributions between receptor site-types, irrespective of global location and suggested the need to apply these methods to air pollution investigations worldwide. Furthermore, an investigation into particle size distribution data was conducted to deduce the sources of aerosol emissions based on particle size and elemental composition. Considering the adverse effects on human health caused by small-sized particles, knowledge of particle size distribution and their elemental composition provides a different perspective on the pollution problem. This thesis clearly illustrates that the application of an innovative combination of advanced data interpretation methods to identify particulate matter sources and rank sampling sites/years provides the basis for the prioritisation of future air pollution control measures. Moreover, this study contributes significantly to knowledge based on chemical composition of airborne particulate matter in Brisbane, Australia and on the identity and plausible locations of the contributing sources. Such novel source apportionment and ranking procedures are ultimately applicable to environmental investigations worldwide.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Grading osteoarthritic tissue has, until now, been a laboratory process confined to research activities. This thesis establishes a scientific protocol that extends osteoarthritic tissue ranking to surgical practice. The innovative protocol, which now incorporates the structural degeneration of collagen, enhances the traditional Modified Mankin ranking system, enabling its application to real time decision during surgery. Because it is fast and without time consuming laboratory process, it would potentially enable the cataloguing of tissues in osteoarthritic joints in all compartments of diseased joints during surgery for epistemological study and insight into the manifestation of osteoarthritis across age, gender, occupation, physical activities and race.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper elaborates the approach used by the Applied Data Mining Research Group (ADMRG) for the Social Event Detection (SED) Tasks of the 2013 MediaEval Benchmark. We extended the constrained clustering algorithm to apply to the first semi-supervised clustering task, and we compared several classifiers with Latent Dirichlet Allocation as feature selector in the second event classification task. The proposed approach focuses on scalability and efficient memory allocation when applied to a high dimensional data with large clusters. Results of the first task show the effectiveness of the proposed method. Results from task 2 indicate that attention on the imbalance categories distributions is needed.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Besides responding to challenges of rapid urbanization and growing traffic congestion, the development of smart transport systems has attracted much attention in recent times. Many promising initiatives have emerged over the years. Despite these initiatives, there is still a lack of understanding about an appropriate definition of smart transport system. As such, it is challenging to identify the appropriate indicators of ‘smartness’. This paper proposes a comprehensive and practical framework to benchmark cities according to the smartness in their transportation systems. The proposed methodology was illustrated using a set of data collected from 26 cities across the world through web search and contacting relevant transport authorities and agencies. Results showed that London, Seattle and Sydney were among the world’s top smart transport cities. In particular, Seattle and Paris ranked high in smart private transport services while London and Singapore scored high on public transport services. London also appeared to be the smartest in terms of emergency transport services. The key value of the proposed innovative framework lies in a comparative analysis among cities, facilitating city-to-city learning.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Cognitive impairment and physical disability are common in Parkinson’s disease (PD). As a result diet can be difficult to measure. This study aimed to evaluate the use of a photographic dietary record (PhDR) in people with PD. During a 12-week nutrition intervention study, 19 individuals with PD kept 3-day PhDRs on three occasions using point-and-shoot digital cameras. Details on food items present in the PhDRs and those not photographed were collected retrospectively during an interview. Following the first use of the PhDR method, the photographer completed a questionnaire (n=18). In addition, the quality of the PhDRs was evaluated at each time point. The person with PD was the sole photographer in 56% of the cases, with the remainder by the carer or combination of person with PD and the carer. The camera was rated as easy to use by 89%, keeping a PhDR was considered acceptable by 94% and none would rather use a “pen and paper” method. Eighty-three percent felt confident to use the camera again to record intake. Of the photos captured (n=730), 89% were of adequate quality (items visible, in-focus), while only 21% could be used alone (without interview information) to assess intake. Over the study, 22% of eating/drinking occasions were not photographed. PhDRs were considered an easy and acceptable method to measure intake among individuals with PD and their carers. The majority of PhDRs were of adequate quality, however in order to quantify intake the interview was necessary to obtain sufficient detail and capture missing items.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Topic modelling, such as Latent Dirichlet Allocation (LDA), was proposed to generate statistical models to represent multiple topics in a collection of documents, which has been widely utilized in the fields of machine learning and information retrieval, etc. But its effectiveness in information filtering is rarely known. Patterns are always thought to be more representative than single terms for representing documents. In this paper, a novel information filtering model, Pattern-based Topic Model(PBTM) , is proposed to represent the text documents not only using the topic distributions at general level but also using semantic pattern representations at detailed specific level, both of which contribute to the accurate document representation and document relevance ranking. Extensive experiments are conducted to evaluate the effectiveness of PBTM by using the TREC data collection Reuters Corpus Volume 1. The results show that the proposed model achieves outstanding performance.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

A known limitation of the Probability Ranking Principle (PRP) is that it does not cater for dependence between documents. Recently, the Quantum Probability Ranking Principle (QPRP) has been proposed, which implicitly captures dependencies between documents through “quantum interference”. This paper explores whether this new ranking principle leads to improved performance for subtopic retrieval, where novelty and diversity is required. In a thorough empirical investigation, models based on the PRP, as well as other recently proposed ranking strategies for subtopic retrieval (i.e. Maximal Marginal Relevance (MMR) and Portfolio Theory(PT)), are compared against the QPRP. On the given task, it is shown that the QPRP outperforms these other ranking strategies. And unlike MMR and PT, one of the main advantages of the QPRP is that no parameter estimation/tuning is required; making the QPRP both simple and effective. This research demonstrates that the application of quantum theory to problems within information retrieval can lead to significant improvements.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The Quantum Probability Ranking Principle (QPRP) has been recently proposed, and accounts for interdependent document relevance when ranking. However, to be instantiated, the QPRP requires a method to approximate the interference" between two documents. In this poster, we empirically evaluate a number of different methods of approximation on two TREC test collections for subtopic retrieval. It is shown that these approximations can lead to significantly better retrieval performance over the state of the art.