893 resultados para Ranking scores
Resumo:
The problem of clustering a large document collection is not only challenged by the number of documents and the number of dimensions, but it is also affected by the number and sizes of the clusters. Traditional clustering methods fail to scale when they need to generate a large number of clusters. Furthermore, when the clusters size in the solution is heterogeneous, i.e. some of the clusters are large in size, the similarity measures tend to degrade. A ranking based clustering method is proposed to deal with these issues in the context of the Social Event Detection task. Ranking scores are used to select a small number of most relevant clusters in order to compare and place a document. Additionally,instead of conventional cluster centroids, cluster patches are proposed to represent clusters, that are hubs-like set of documents. Text, temporal, spatial and visual content information collected from the social event images is utilized in calculating similarity. Results show that these strategies allow us to have a balance between performance and accuracy of the clustering solution gained by the clustering method.
Resumo:
We developed a model to calculate a quantitative risk score for individual aquaculture sites. The score indicates the risk of the site being infected with a specific fish pathogen (viral haemorrhagic septicaemia virus (VHSV); infectious haematopoietic necrosis virus, Koi herpes virus), and is intended to be used for risk ranking sites to support surveillance for demonstration of zone or member state freedom from these pathogens. The inputs to the model include a range of quantitative and qualitative estimates of risk factors organised into five risk themes (1) Live fish and egg movements; (2) Exposure via water; (3) On-site processing; (4) Short-distance mechanical transmission; (5) Distance-independent mechanical transmission. The calculated risk score for an individual aquaculture site is a value between zero and one and is intended to indicate the risk of a site relative to the risk of other sites (thereby allowing ranking). The model was applied to evaluate 76 rainbow trout farms in 3 countries (42 from England, 32 from Italy and 2 from Switzerland) with the aim to establish their risk of being infected with VHSV. Risk scores for farms in England and Italy showed great variation, clearly enabling ranking. Scores ranged from 0.002 to 0.254 (mean score 0.080) in England and 0.011 to 0.778 (mean of 0.130) for Italy, reflecting the diversity of infection status of farms in these countries. Requirements for broader application of the model are discussed. Cost efficient farm data collection is important to realise the benefits from a risk-based approach.
Resumo:
Document clustering is one of the prominent methods for mining important information from the vast amount of data available on the web. However, document clustering generally suffers from the curse of dimensionality. Providentially in high dimensional space, data points tend to be more concentrated in some areas of clusters. We take advantage of this phenomenon by introducing a novel concept of dynamic cluster representation named as loci. Clusters’ loci are efficiently calculated using documents’ ranking scores generated from a search engine. We propose a fast loci-based semi-supervised document clustering algorithm that uses clusters’ loci instead of conventional centroids for assigning documents to clusters. Empirical analysis on real-world datasets shows that the proposed method produces cluster solutions with promising quality and is substantially faster than several benchmarked centroid-based semi-supervised document clustering methods.
Resumo:
In this paper, we describe new results and improvements to a lan-guage identification (LID) system based on PPRLM previously introduced in [1] and [2]. In this case, we use as parallel phone recognizers the ones provided by the Brno University of Technology for Czech, Hungarian, and Russian lan-guages, and instead of using traditional n-gram language models we use a lan-guage model that is created using a ranking with the most frequent and discrim-inative n-grams. In this language model approach, the distance between the ranking for the input sentence and the ranking for each language is computed, based on the difference in relative positions for each n-gram. This approach is able to model reliably longer span information than in traditional language models obtaining more reliable estimations. We also describe the modifications that we have being introducing along the time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clus-tering technique for the ranking scores. Results show that this technique pro-vides a 12.9% relative improvement over PPRLM. Finally, we also describe re-sults where the traditional PPRLM and our ranking technique are combined.
Resumo:
Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in context of belonging categorical, numerical or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked-views allow to relate (or loosely speaking: correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, specially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster to metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst discovering interesting and visually understandable relationships.
Resumo:
Over the past ten years, a variety of microRNA target prediction methods has been developed, and many of the methods are constantly improved and adapted to recent insights into miRNA-mRNA interactions. In a typical scenario, different methods return different rankings of putative targets, even if the ranking is reduced to selected mRNAs that are related to a specific disease or cell type. For the experimental validation it is then difficult to decide in which order to process the predicted miRNA-mRNA bindings, since each validation is a laborious task and therefore only a limited number of mRNAs can be analysed. We propose a new ranking scheme that combines ranked predictions from several methods and - unlike standard thresholding methods - utilises the concept of Pareto fronts as defined in multi-objective optimisation. In the present study, we attempt a proof of concept by applying the new ranking scheme to hsa-miR-21, hsa-miR-125b, and hsa-miR-373 and prediction scores supplied by PITA and RNAhybrid. The scores are interpreted as a two-objective optimisation problem, and the elements of the Pareto front are ranked by the STarMir score with a subsequent re-calculation of the Pareto front after removal of the top-ranked mRNA from the basic set of prediction scores. The method is evaluated on validated targets of the three miRNA, and the ranking is compared to scores from DIANA-microT and TargetScan. We observed that the new ranking method performs well and consistent, and the first validated targets are elements of Pareto fronts at a relatively early stage of the recurrent procedure. which encourages further research towards a higher-dimensional analysis of Pareto fronts. (C) 2010 Elsevier Ltd. All rights reserved.
Resumo:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Resumo:
Document ranking is an important process in information retrieval (IR). It presents retrieved documents in an order of their estimated degrees of relevance to query. Traditional document ranking methods are mostly based on the similarity computations between documents and query. In this paper we argue that the similarity-based document ranking is insufficient in some cases. There are two reasons. Firstly it is about the increased information variety. There are far too many different types documents available now for user to search. The second is about the users variety. In many cases user may want to retrieve documents that are not only similar but also general or broad regarding a certain topic. This is particularly the case in some domains such as bio-medical IR. In this paper we propose a novel approach to re-rank the retrieved documents by incorporating the similarity with their generality. By an ontology-based analysis on the semantic cohesion of text, document generality can be quantified. The retrieved documents are then re-ranked by their combined scores of similarity and the closeness of documents’ generality to the query’s. Our experiments have shown an encouraging performance on a large bio-medical document collection, OHSUMED, containing 348,566 medical journal references and 101 test queries.
Resumo:
This paper reports the application of multicriteria decision making techniques, PROMETHEE and GAIA, and receptor models, PCA/APCS and PMF, to data from an air monitoring site located on the campus of Queensland University of Technology in Brisbane, Australia and operated by Queensland Environmental Protection Agency (QEPA). The data consisted of the concentrations of 21 chemical species and meteorological data collected between 1995 and 2003. PROMETHEE/GAIA separated the samples into those collected when leaded and unleaded petrol were used to power vehicles in the region. The number and source profiles of the factors obtained from PCA/APCS and PMF analyses were compared. There are noticeable differences in the outcomes possibly because of the non-negative constraints imposed on the PMF analysis. While PCA/APCS identified 6 sources, PMF reduced the data to 9 factors. Each factor had distinctive compositions that suggested that motor vehicle emissions, controlled burning of forests, secondary sulphate, sea salt and road dust/soil were the most important sources of fine particulate matter at the site. The most plausible locations of the sources were identified by combining the results obtained from the receptor models with meteorological data. The study demonstrated the potential benefits of combining results from multi-criteria decision making analysis with those from receptor models in order to gain insights into information that could enhance the development of air pollution control measures.
Resumo:
An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. To learn to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With the document selection criteria better defined based on the users’ needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness. Term-based approaches are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to an information overload problem. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing sematic information and have shown encouraging results for improving the effectiveness of the IF system. On the other hand, pattern discovery from large data streams is not computationally efficient. Also, these approaches had to deal with low frequency pattern issues. The measures used by the data mining technique (for example, “support” and “confidences”) to learn the profile have turned out to be not suitable for filtering. They can lead to a mismatch problem. This thesis uses the rough set-based reasoning (term-based) and pattern mining approach as a unified framework for information filtering to overcome the aforementioned problems. This system consists of two stages - topic filtering and pattern mining stages. The topic filtering stage is intended to minimize information overloading by filtering out the most likely irrelevant information based on the user profiles. A novel user-profiles learning method and a theoretical model of the threshold setting have been developed by using rough set decision theory. The second stage (pattern mining) aims at solving the problem of the information mismatch. This stage is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy. The most likely relevant documents were assigned higher scores by the ranking function. Because there is a relatively small amount of documents left after the first stage, the computational cost is markedly reduced; at the same time, pattern discoveries yield more accurate results. The overall performance of the system was improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments. Tests were based on the well-known IR bench-marking processes, using the latest version of the Reuters dataset, namely, the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both the term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system outperforms significantly the other IF systems, such as the traditional Rocchio IF model, the state-of-the-art term-based models, including the BM25, Support Vector Machines (SVM), and Pattern Taxonomy Model (PTM).
Resumo:
The World Wide Web has become a medium for people to share information. People use Web-based collaborative tools such as question answering (QA) portals, blogs/forums, email and instant messaging to acquire information and to form online-based communities. In an online QA portal, a user asks a question and other users can provide answers based on their knowledge, with the question usually being answered by many users. It can become overwhelming and/or time/resource consuming for a user to read all of the answers provided for a given question. Thus, there exists a need for a mechanism to rank the provided answers so users can focus on only reading good quality answers. The majority of online QA systems use user feedback to rank users’ answers and the user who asked the question can decide on the best answer. Other users who didn’t participate in answering the question can also vote to determine the best answer. However, ranking the best answer via this collaborative method is time consuming and requires an ongoing continuous involvement of users to provide the needed feedback. The objective of this research is to discover a way to recommend the best answer as part of a ranked list of answers for a posted question automatically, without the need for user feedback. The proposed approach combines both a non-content-based reputation method and a content-based method to solve the problem of recommending the best answer to the user who posted the question. The non-content method assigns a score to each user which reflects the users’ reputation level in using the QA portal system. Each user is assigned two types of non-content-based reputations cores: a local reputation score and a global reputation score. The local reputation score plays an important role in deciding the reputation level of a user for the category in which the question is asked. The global reputation score indicates the prestige of a user across all of the categories in the QA system. Due to the possibility of user cheating, such as awarding the best answer to a friend regardless of the answer quality, a content-based method for determining the quality of a given answer is proposed, alongside the non-content-based reputation method. Answers for a question from different users are compared with an ideal (or expert) answer using traditional Information Retrieval and Natural Language Processing techniques. Each answer provided for a question is assigned a content score according to how well it matched the ideal answer. To evaluate the performance of the proposed methods, each recommended best answer is compared with the best answer determined by one of the most popular link analysis methods, Hyperlink-Induced Topic Search (HITS). The proposed methods are able to yield high accuracy, as shown by correlation scores: Kendall correlation and Spearman correlation. The reputation method outperforms the HITS method in terms of recommending the best answer. The inclusion of the reputation score with the content score improves the overall performance, which is measured through the use of Top-n match scores.
Resumo:
This article examines one approach to promoting creative and flexible use of mathematical ideas within an interdisciplinary context in the primary curriculum, namely, through modelling. Three classes of fifth-grade children worked on a modelling problem, The First Fleet (Australia’s settlement), situated within the curriculum domains of science and studies of society and environment. Reported here are the cycles of development displayed by one group of children as they worked the problem, together with the range of models created across the classes. Children developed mathematisation processes that extended beyond their regular curriculum, including identifying and prioritising key problem elements, exploring relationships among elements, quantifying qualitative data, ranking and aggregating data, and creating and working with weighted scores. Aspects of Goldin’s (2000, 2007) affective structures also appeared to play an important role in the children's mathematical developments.