987 resultados para Ranking methods
Resumo:
Machine learning provides tools for automated construction of predictive models in data intensive areas of engineering and science. The family of regularized kernel methods have in the recent years become one of the mainstream approaches to machine learning, due to a number of advantages the methods share. The approach provides theoretically well-founded solutions to the problems of under- and overfitting, allows learning from structured data, and has been empirically demonstrated to yield high predictive performance on a wide range of application domains. Historically, the problems of classification and regression have gained the majority of attention in the field. In this thesis we focus on another type of learning problem, that of learning to rank. In learning to rank, the aim is from a set of past observations to learn a ranking function that can order new objects according to how well they match some underlying criterion of goodness. As an important special case of the setting, we can recover the bipartite ranking problem, corresponding to maximizing the area under the ROC curve (AUC) in binary classification. Ranking applications appear in a large variety of settings, examples encountered in this thesis include document retrieval in web search, recommender systems, information extraction and automated parsing of natural language. We consider the pairwise approach to learning to rank, where ranking models are learned by minimizing the expected probability of ranking any two randomly drawn test examples incorrectly. The development of computationally efficient kernel methods, based on this approach, has in the past proven to be challenging. Moreover, it is not clear what techniques for estimating the predictive performance of learned models are the most reliable in the ranking setting, and how the techniques can be implemented efficiently. The contributions of this thesis are as follows. First, we develop RankRLS, a computationally efficient kernel method for learning to rank, that is based on minimizing a regularized pairwise least-squares loss. In addition to training methods, we introduce a variety of algorithms for tasks such as model selection, multi-output learning, and cross-validation, based on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm, which is one of the most well established methods for learning to rank. Third, we study the combination of the empirical kernel map and reduced set approximation, which allows the large-scale training of kernel machines using linear solvers, and propose computationally efficient solutions to cross-validation when using the approach. Next, we explore the problem of reliable cross-validation when using AUC as a performance criterion, through an extensive simulation study. We demonstrate that the proposed leave-pair-out cross-validation approach leads to more reliable performance estimation than commonly used alternative approaches. Finally, we present a case study on applying machine learning to information extraction from biomedical literature, which combines several of the approaches considered in the thesis. The thesis is divided into two parts. Part I provides the background for the research work and summarizes the most central results, Part II consists of the five original research articles that are the main contribution of this thesis.
Resumo:
A cikk a páros összehasonlításokon alapuló pontozási eljárásokat tárgyalja axiomatikus megközelítésben. A szakirodalomban számos értékelő függvényt javasoltak erre a célra, néhány karakterizációs eredmény is ismert. Ennek ellenére a megfelelő módszer kiválasztása nem egy-szerű feladat, a különböző tulajdonságok bevezetése elsősorban ebben nyújthat segítséget. Itt az összehasonlított objektumok teljesítményén érvényesülő monotonitást tárgyaljuk az önkonzisztencia és önkonzisztens monotonitás axiómákból kiindulva. Bemutatásra kerülnek lehetséges gyengítéseik és kiterjesztéseik, illetve egy, az irreleváns összehasonlításoktól való függetlenséggel kapcsolatos lehetetlenségi tétel is. A tulajdonságok teljesülését három eljárásra, a klasszikus pontszám eljárásra, az ezt továbbfejlesztő általánosított sorösszegre és a legkisebb négyzetek módszerére vizsgáljuk meg, melyek mindegyike egy lineáris egyenletrendszer megoldásaként számítható. A kapott eredmények új szempontokkal gazdagítják a pontozási eljárás megválasztásának kérdését. _____ The paper provides an axiomatic analysis of some scoring procedures based on paired comparisons. Several methods have been proposed for these generalized tournaments, some of them have been also characterized by a set of properties. The choice of an appropriate method is supported by a discussion of their theoretical properties. In the paper we focus on the connections of self-consistency and self-consistent-monotonicity, two axioms based on the comparisons of object's performance. The contradiction of self-consistency and independence of irrel-evant matches is revealed, as well as some possible reductions and extensions of these properties. Their satisfiability is examined through three scoring procedures, the score, generalised row sum and least squares methods, each of them is calculated as a solution of a system of linear equations. Our results contribute to the problem of finding a proper paired comparison based scoring method.
Resumo:
A páronként összehasonlított alternatívák rangsorolásának problémája egyaránt felmerül a szavazáselmélet, a statisztika, a tudománymetria, a pszichológia és a sport területén. A nemzetközi szakirodalom alapján részletesen áttekintjük a megoldási lehetőségeket, bemutatjuk a gyakorlati alkalmazások során fellépő kérdések kezelésének, a valós adatoknak megfelelő matematikai környezet felépítésének módjait. Kiemelten tárgyaljuk a páros összehasonlítási mátrix megadását, az egyes pontozási eljárásokat és azok kapcsolatát. A tanulmány elméleti szempontból vizsgálja a Perron-Frobenius tételen alapuló invariáns, fair bets, PageRank, valamint az irányított gráfok csúcsainak rangsorolásra javasolt internal slackening és pozíciós erő módszereket. A közülük történő választáshoz az axiomatikus megközelítést ajánljuk, ennek keretében bemutatjuk az invariáns és a fair bets eljárások karakterizációját, és kitérünk a módszerek vitatható tulajdonságaira. _____ The ranking of the alternatives or selecting the best one are fundamental issues of social choice theory, statistics, psychology and sport. Different solution concepts, and various mathematical models of applications are reviewed based on the international literature. We are focusing on the de¯nition of paired comparison matrix, on main scoring procedures and their relation. The paper gives a theoretical analysis of the invariant, fair bets and PageRank methods, which are founded on Perron-Frobenius theorem, as well as the internal slackening and positional power procedures used for ranking the nodes of a directed graph. An axiomatic approach is proposed for the choice of an appropriate method. Besides some known characterizations for the invariant and fair bets methods, we also discuss the violation of some properties, meaning their main weakness.
Resumo:
Each item in a given collection is characterized by a set of possible performances. A (ranking) method is a function that assigns an ordering of the items to every performance profile. Ranking by Rating consists in evaluating each item’s performance by using an exogenous rating function, and ranking items according to their performance ratings. Any such method is separable: the ordering of two items does not depend on the performances of the remaining items. We prove that every separable method must be of the ranking-by-rating type if (i) the set of possible performances is the same for all items and the method is anonymous, or (ii) the set of performances of each item is ordered and the method is monotonic. When performances are m-dimensional vectors, a separable, continuous, anonymous, monotonic, and invariant method must rank items according to a weighted geometric mean of their performances along the m dimensions.
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
Simulation-based assessment is a popular and frequently necessary approach to evaluation of statistical procedures. Sometimes overlooked is the ability to take advantage of underlying mathematical relations and we focus on this aspect. We show how to take advantage of large-sample theory when conducting a simulation using the analysis of genomic data as a motivating example. The approach uses convergence results to provide an approximation to smaller-sample results, results that are available only by simulation. We consider evaluating and comparing a variety of ranking-based methods for identifying the most highly associated SNPs in a genome-wide association study, derive integral equation representations of the pre-posterior distribution of percentiles produced by three ranking methods, and provide examples comparing performance. These results are of interest in their own right and set the framework for a more extensive set of comparisons.
Resumo:
This study examines the question of whether the journal ranking VHB-JOURQUAL 2 can be considered as a good measure for the construct “scientific quality”. Various rankings in business research provide the database for the analysis. The correlations between theses rankings are used to assess the validity of VHB-JOURQUAL 2 along various validity criteria. The correlations with rankings that measure the same construct based on different methods show that VHB-JOURQUAL 2 has acceptable, but moderate convergent validity. The validity varies considerably across disciplines, showing that the heterogeneity of business administration is not sufficiently represented by this overall ranking. The variability is related to the variation in members per discipline represented by the German Association for Business Research. Furthermore, the measure shows a weak correlation with acceptance rates as an indicator of nomological validity in some disciplines.
Resumo:
Document ranking is an important process in information retrieval (IR). It presents retrieved documents in an order of their estimated degrees of relevance to query. Traditional document ranking methods are mostly based on the similarity computations between documents and query. In this paper we argue that the similarity-based document ranking is insufficient in some cases. There are two reasons. Firstly it is about the increased information variety. There are far too many different types documents available now for user to search. The second is about the users variety. In many cases user may want to retrieve documents that are not only similar but also general or broad regarding a certain topic. This is particularly the case in some domains such as bio-medical IR. In this paper we propose a novel approach to re-rank the retrieved documents by incorporating the similarity with their generality. By an ontology-based analysis on the semantic cohesion of text, document generality can be quantified. The retrieved documents are then re-ranked by their combined scores of similarity and the closeness of documents’ generality to the query’s. Our experiments have shown an encouraging performance on a large bio-medical document collection, OHSUMED, containing 348,566 medical journal references and 101 test queries.
Resumo:
Usually, data mining projects that are based on decision trees for classifying test cases will use the probabilities provided by these decision trees for ranking classified test cases. We have a need for a better method for ranking test cases that have already been classified by a binary decision tree because these probabilities are not always accurate and reliable enough. A reason for this is that the probability estimates computed by existing decision tree algorithms are always the same for all the different cases in a particular leaf of the decision tree. This is only one reason why the probability estimates given by decision tree algorithms can not be used as an accurate means of deciding if a test case has been correctly classified. Isabelle Alvarez has proposed a new method that could be used to rank the test cases that were classified by a binary decision tree [Alvarez, 2004]. In this paper we will give the results of a comparison of different ranking methods that are based on the probability estimate, the sensitivity of a particular case or both.
Resumo:
The paper reviews some axioms of additivity concerning ranking methods used for generalized tournaments with possible missing values and multiple comparisons. It is shown that one of the most natural properties, called consistency, has strong links to independence of irrelevant comparisons, an axiom judged unfavourable when players have different opponents. Therefore some directions of weakening consistency are suggested, and several ranking methods, the score, generalized row sum and least squares as well as fair bets and its two variants (one of them entirely new) are analysed whether they satisfy the properties discussed. It turns out that least squares and generalized row sum with an appropriate parameter choice preserve the relative ranking of two objects if the ranking problems added have the same comparison structure.
Resumo:
A set ranking method assigns to each tournament on a given set an ordering of the subsets of that set. Such a method is consistent if (i) the items in the set are ranked in the same order as the sets of items they beat and (ii) the ordering of the items fully determines the ordering of the sets of items. We describe two consistent set ranking methods.
Resumo:
Este documento se concentra en el estudio de las diferencias salariales mediante la comparación de las distribuciones de los salarios para las siete principales ciudades colombianas en el periodo 2001-2005 con datos de la Encuesta Continua de Hogares. Se detectan diferencias significativas que se explican a la luz de la teoría del capital humano y de segmentación laboral; mediante la estimación de ecuaciones de salarios a partir de las características socioeconómicas de los trabajadores junto con efectos particulares para región y rama de actividad económica que resultan significativos dando evidencia de segmentación del mercado laboralen Colombia. El componente de los salarios que es particular a la región y a la actividad económica se explica a partir de variables macroeconómicas como el crecimiento económico, la dotación sectorial de factores, el costo de vida y el desempleo a nivel regional.
Resumo:
This document describes the data collection and use for the computation of rankings within RePEc (Research Papers in Economics). This encompasses the determination of impact factors for journals and working paper series, as well as the ranking of authors, institutions and geographic regions. The various ranking methods are also compared, using a snapshot of the data.
Resumo:
This paper proposes a new multi-objective estimation of distribution algorithm (EDA) based on joint modeling of objectives and variables. This EDA uses the multi-dimensional Bayesian network as its probabilistic model. In this way it can capture the dependencies between objectives, variables and objectives, as well as the dependencies learnt between variables in other Bayesian network-based EDAs. This model leads to a problem decomposition that helps the proposed algorithm to find better trade-off solutions to the multi-objective problem. In addition to Pareto set approximation, the algorithm is also able to estimate the structure of the multi-objective problem. To apply the algorithm to many-objective problems, the algorithm includes four different ranking methods proposed in the literature for this purpose. The algorithm is applied to the set of walking fish group (WFG) problems, and its optimization performance is compared with an evolutionary algorithm and another multi-objective EDA. The experimental results show that the proposed algorithm performs significantly better on many of the problems and for different objective space dimensions, and achieves comparable results on some compared with the other algorithms.
Resumo:
Probabilistic modeling is the de�ning characteristic of estimation of distribution algorithms (EDAs) which determines their behavior and performance in optimization. Regularization is a well-known statistical technique used for obtaining an improved model by reducing the generalization error of estimation, especially in high-dimensional problems. `1-regularization is a type of this technique with the appealing variable selection property which results in sparse model estimations. In this thesis, we study the use of regularization techniques for model learning in EDAs. Several methods for regularized model estimation in continuous domains based on a Gaussian distribution assumption are presented, and analyzed from di�erent aspects when used for optimization in a high-dimensional setting, where the population size of EDA has a logarithmic scale with respect to the number of variables. The optimization results obtained for a number of continuous problems with an increasing number of variables show that the proposed EDA based on regularized model estimation performs a more robust optimization, and is able to achieve signi�cantly better results for larger dimensions than other Gaussian-based EDAs. We also propose a method for learning a marginally factorized Gaussian Markov random �eld model using regularization techniques and a clustering algorithm. The experimental results show notable optimization performance on continuous additively decomposable problems when using this model estimation method. Our study also covers multi-objective optimization and we propose joint probabilistic modeling of variables and objectives in EDAs based on Bayesian networks, speci�cally models inspired from multi-dimensional Bayesian network classi�ers. It is shown that with this approach to modeling, two new types of relationships are encoded in the estimated models in addition to the variable relationships captured in other EDAs: objectivevariable and objective-objective relationships. An extensive experimental study shows the e�ectiveness of this approach for multi- and many-objective optimization. With the proposed joint variable-objective modeling, in addition to the Pareto set approximation, the algorithm is also able to obtain an estimation of the multi-objective problem structure. Finally, the study of multi-objective optimization based on joint probabilistic modeling is extended to noisy domains, where the noise in objective values is represented by intervals. A new version of the Pareto dominance relation for ordering the solutions in these problems, namely �-degree Pareto dominance, is introduced and its properties are analyzed. We show that the ranking methods based on this dominance relation can result in competitive performance of EDAs with respect to the quality of the approximated Pareto sets. This dominance relation is then used together with a method for joint probabilistic modeling based on `1-regularization for multi-objective feature subset selection in classi�cation, where six di�erent measures of accuracy are considered as objectives with interval values. The individual assessment of the proposed joint probabilistic modeling and solution ranking methods on datasets with small-medium dimensionality, when using two di�erent Bayesian classi�ers, shows that comparable or better Pareto sets of feature subsets are approximated in comparison to standard methods.