905 results for Hierarchical document


Relevance:

100.00% 100.00%

Publisher:

Abstract:

One way to organize knowledge and make its search and retrieval easier is to create a structural representation divided by hierarchically related topics. Once this structure is built, it is necessary to find labels for each of the obtained clusters. In many cases the labels have to be built using only the terms in the documents of the collection. This paper presents the SeCLAR (Selecting Candidate Labels using Association Rules) method, which explores the use of association rules for the selection of good candidates for labels of hierarchical document clusters. The candidates are processed by a classical method to generate the labels. The idea of the proposed method is to process each parent-child relationship of the nodes as an antecedent-consequent relationship of association rules. The experimental results show that the proposed method can improve the precision and recall of labels obtained by classical methods. © 2010 Springer-Verlag.
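The abstract above treats each parent-child relationship between nodes as the antecedent and consequent of an association rule. The sketch below is not the SeCLAR method itself; it only illustrates the two standard measures for a rule `antecedent -> consequent` over a document collection, support and confidence. The function name `rule_stats` and the toy documents are assumptions made for illustration.

```python
def rule_stats(docs, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    docs is a list of term sets, one per document (hypothetical toy input).
    """
    n = len(docs)
    both = sum(1 for d in docs if antecedent in d and consequent in d)
    ante = sum(1 for d in docs if antecedent in d)
    support = both / n if n else 0.0          # share of documents containing both terms
    confidence = both / ante if ante else 0.0  # share of antecedent documents that also contain the consequent
    return support, confidence


# Toy collection: each document is reduced to its set of terms.
docs = [
    {"cluster", "label", "hierarchy"},
    {"cluster", "label"},
    {"cluster", "rule"},
    {"rule", "association"},
]
support, confidence = rule_stats(docs, "cluster", "label")
```

A term pair with high confidence in the parent-to-child direction would be a plausible label candidate under this reading, but the thresholds and the exact selection step are not given in the abstract.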

Relevance:

100.00% 100.00%

Publisher:

Abstract:

One way to organize knowledge and make its search and retrieval easier is to create a structural representation divided by hierarchically related topics. Once this structure is built, it is necessary to find labels for each of the obtained clusters. In many cases the labels must be built using all the terms in the documents of the collection. This paper presents the SeCLAR method, which explores the use of association rules in the selection of good candidates for labels of hierarchical document clusters. The purpose of this method is to select a subset of terms by exploring the relationship among the terms of each document. Thus, these candidates can be processed by a classical method to generate the labels. An experimental study demonstrates the potential of the proposed approach to improve the precision and recall of labels obtained by classical methods by considering only the terms that are potentially more discriminative. © 2012 - IOS Press and the authors. All rights reserved.

Relevance:

40.00% 40.00%

Publisher:

Abstract:

Point placement strategies aim at mapping data points represented in higher dimensions to two-dimensional spaces and are frequently used to visualize relationships amongst data instances. They have been valuable tools for analysis and exploration of data sets of various kinds. Many conventional techniques, however, do not behave well when the number of dimensions is high, as in the case of document collections. Later approaches handle that shortcoming, but may cause too much clutter to allow flexible exploration to take place. In this work we present a novel hierarchical point placement technique that is capable of dealing with these problems. While good grouping and separation of data with high similarity is maintained without increasing computation cost, its hierarchical structure lends itself both to exploration in various levels of detail and to handling data in subsets, improving analysis capability and also allowing manipulation of larger data sets.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

We analyze the influence of time-, firm-, industry- and country-level determinants of capital structure. First, we apply hierarchical linear modeling in order to assess the relative importance of those levels. We find that time and firm levels explain 78% of firm leverage. Second, we include random intercepts and random coefficients in order to analyze the direct and indirect influences of firm/industry/country characteristics on firm leverage. We document several important indirect influences of variables at the industry and country levels on firm determinants of leverage, as well as several structural differences in financial behavior between firms of developed and emerging countries. © 2010 Elsevier B.V. All rights reserved.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

A parts-based model is a parametrization of an object class using a collection of landmarks following the object structure. The matching of parts-based models is one of the problems where pairwise Conditional Random Fields have been successfully applied. The main reason for their effectiveness is tractable inference and learning due to the simplicity of the graphs involved, usually trees. However, these models do not consider possible patterns of statistics among sets of landmarks, and thus they suffer from using too myopic information. To overcome this limitation, we propose a novel structure based on hierarchical Conditional Random Fields, which we explain in the first part of this memory. We build a hierarchy of combinations of landmarks, where matching is performed taking into account the whole hierarchy. To preserve tractable inference we effectively sample the label set. We test our method on facial feature selection and human pose estimation on two challenging datasets: Buffy and MultiPIE. In the second part of this memory, we present a novel approach to multiple kernel combination that relies on stacked classification. This method can be used to evaluate the landmarks of the parts-based model approach. Our method is based on combining responses of a set of independent classifiers for each individual kernel. Unlike earlier approaches that linearly combine kernel responses, our approach uses them as inputs to another set of classifiers. We will show that we outperform state-of-the-art methods on most of the standard benchmark datasets.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Microsatellites are used to unravel the fine-scale genetic structure of a hybrid zone between chromosome races Valais and Cordon of the common shrew (Sorex araneus) located in the French Alps. A total of 269 individuals collected between 1992 and 1995 was typed for seven microsatellite loci. A modified version of the classical multiple correspondence analysis is carried out. This analysis clearly shows the dichotomy between the two races. Several approaches are used to study genetic structuring. Gene flow is clearly reduced between these chromosome races and is estimated at one migrant every two generations using R-statistics and one migrant per generation using F-statistics. Hierarchical F- and R-statistics are compared and their efficiency in detecting inter- and intraracial patterns of divergence is discussed. Within-race genetic structuring is significant, but remains weak. F-ST displays similar values on both sides of the hybrid zone, although no environmental barriers are found on the Cordon side, whereas the Valais side is divided by several mountain rivers. We introduce the exact G-test to microsatellite data, which proved to be a powerful test to detect genetic differentiation within as well as among races. The genetic background of karyotypic hybrids was compared with the genetic background of pure parental forms using a CRT-MCA. Our results indicate that, without knowledge of the karyotypes, we would not have been able to distinguish these hybrids from karyotypically pure samples.
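The abstract above converts F-statistics into a number of migrants per generation. As a rough illustration of how such a conversion can work, the sketch below computes a textbook F_ST from heterozygosities and applies Wright's island-model approximation Nm ≈ (1/F_ST − 1)/4. It is a simplification under stated assumptions (one biallelic locus, equally sized subpopulations, no sampling correction), not the hierarchical estimators used in the study; the function names and the allele frequencies are hypothetical.

```python
def fst_biallelic(freqs):
    """F_ST = (H_T - H_S) / H_T for one biallelic locus.

    freqs holds the frequency of one allele in each subpopulation
    (equal subpopulation sizes are assumed).
    """
    n = len(freqs)
    h_s = sum(2 * p * (1 - p) for p in freqs) / n  # mean within-subpopulation heterozygosity
    p_bar = sum(freqs) / n                         # allele frequency in the pooled population
    h_t = 2 * p_bar * (1 - p_bar)                  # expected heterozygosity of the pooled population
    return (h_t - h_s) / h_t if h_t else 0.0


def migrants_per_generation(fst):
    """Wright's island-model approximation: Nm = (1 / F_ST - 1) / 4."""
    return (1 / fst - 1) / 4


# Hypothetical frequencies of one allele in two races.
fst = fst_biallelic([0.2, 0.8])
nm = migrants_per_generation(fst)
```

With these made-up frequencies F_ST is about 0.36 and Nm about 0.44, i.e. roughly one migrant every two generations; real microsatellite data are multi-allelic and need estimators that correct for sample size.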

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Evidence of a sport-specific hierarchy of protective factors against doping would thus be a powerful aid in adapting information and prevention campaigns to target the characteristics of specific athlete groups, and especially those athletes most vulnerable for doping control. The contents of phone calls to a free and anonymous national anti-doping service called 'ecoute dopage' were analysed (192 bodybuilders, 124 cyclists and 44 footballers). The results showed that the protective factors that emerged from analysis could be categorised into two groups. The first comprised 'Health concerns', 'Respect for the law' and 'Doping controls from the environment' and the second comprised 'Doubts about the effectiveness of illicit products', 'Thinking skills' and 'Doubts about doctors'. The ranking of the factors for the cyclists differed from that of the other athletes. The ordering of factors was 1) respect for the law, 2) doping controls from the environment, 3) health concerns, 4) doubts about doctors, and 5) doubts about the effectiveness of illicit products. The results are analysed in terms of the ranking in each athlete group and the consequences on the athletes' experience and relationship to doping. Specific prevention campaigns are proposed to limit doping behaviour in general and for each sport.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

In this paper we address the issue of locating hierarchical facilities in the presence of congestion. Two hierarchical models are presented, where lower level servers attend requests first, and then some of the served customers are referred to higher level servers. In the first model, the objective is to find the minimum number of servers and their locations that will cover a given region with a distance or time standard. The second model is cast as a Maximal Covering Location formulation. A heuristic procedure is then presented together with computational experience. Finally, some extensions of these models that address other types of spatial configurations are offered.
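The abstract does not describe its heuristic, so the following is only a generic greedy heuristic for a plain (single-level, congestion-free) maximal covering problem, included to show what a covering formulation asks for: choose at most `p` sites so that the covered demand is as large as possible. The names `greedy_max_cover`, `coverage` and `demand`, and all the numbers, are assumptions for illustration.

```python
def greedy_max_cover(coverage, demand, p):
    """Pick up to p sites, each time taking the site that adds the most uncovered demand.

    coverage maps a candidate site to the set of demand nodes within the
    distance or time standard; demand maps a node to its demand weight.
    """
    chosen, covered = [], set()
    for _ in range(p):
        best_site, best_gain = None, 0
        for site in sorted(coverage):  # sorted for a deterministic tie-break
            if site in chosen:
                continue
            gain = sum(demand[node] for node in coverage[site] - covered)
            if gain > best_gain:
                best_site, best_gain = site, gain
        if best_site is None:  # no remaining site adds any demand
            break
        chosen.append(best_site)
        covered |= coverage[best_site]
    return chosen, sum(demand[node] for node in covered)


# Hypothetical instance: three candidate sites, six demand nodes.
coverage = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}}
demand = {1: 10, 2: 5, 3: 8, 4: 7, 5: 3, 6: 2}
sites, served = greedy_max_cover(coverage, demand, p=2)
```

Greedy selection is fast but not guaranteed optimal; the hierarchical referral between server levels and the congestion constraints described in the abstract would need additional modelling.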

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The package HIERFSTAT for the statistical software R, created by the R Development Core Team, allows the estimation of hierarchical F-statistics from a hierarchy with any number of levels. In addition, it allows testing the statistical significance of population differentiation for these different levels, using a generalized likelihood-ratio test. The package HIERFSTAT is available at http://www.unil.ch/popgen/softwares/hierfstat.htm.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

MOTIVATION: Analysis of millions of pyro-sequences is currently playing a crucial role in the advance of environmental microbiology. Taxonomy-independent, i.e. unsupervised, clustering of these sequences is essential for the definition of Operational Taxonomic Units. For this application, reproducibility and robustness should be the most sought after qualities, but have thus far largely been overlooked. RESULTS: More than 1 million hyper-variable internal transcribed spacer 1 (ITS1) sequences of fungal origin have been analyzed. The ITS1 sequences were first properly extracted from 454 reads using generalized profiles. Then, otupipe, cd-hit-454, ESPRIT-Tree and DBC454, a new algorithm presented here, were used to analyze the sequences. A numerical assay was developed to measure the reproducibility and robustness of these algorithms. DBC454 was the most robust, closely followed by ESPRIT-Tree. DBC454 features density-based hierarchical clustering, which complements the other methods by providing insights into the structure of the data. AVAILABILITY: An executable is freely available for non-commercial users at ftp://ftp.vital-it.ch/tools/dbc454. It is designed to run under MPI on a cluster of 64-bit Linux machines running Red Hat 4.x, or on a multi-core OSX system. CONTACT: dbc454@vital-it.ch or nicolas.guex@isb-sib.ch.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Peer-reviewed

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This paper presents a hierarchical clustering method for semantic Web service discovery. This method aims to improve the accuracy and efficiency of traditional service discovery using the vector space model. The Web service is converted into a standard vector format through the Web service description document. With the help of WordNet, a semantic analysis is conducted to reduce the dimension of the term vector and to make semantic expansion to meet the user’s service request. The process and algorithm of hierarchical clustering based semantic Web service discovery are discussed. Validation is carried out on the dataset.
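To make the vector space step concrete, the sketch below represents each service description as a term-frequency vector, compares vectors with cosine similarity, and finds the pair that an agglomerative (bottom-up) clustering would merge first. It leaves out the WordNet-based semantic analysis mentioned in the abstract, and the service names and descriptions are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical service descriptions reduced to bags of terms.
services = {
    "book_flight": Counter("book flight ticket airline".split()),
    "reserve_flight": Counter("reserve flight ticket".split()),
    "weather": Counter("weather forecast city".split()),
}

# First agglomerative step: the most similar pair is merged first.
closest_pair = max(
    combinations(sorted(services), 2),
    key=lambda pair: cosine(services[pair[0]], services[pair[1]]),
)
```

Repeating the merge step on the resulting clusters, with some linkage rule such as average similarity, yields the full hierarchy; the linkage used in the paper is not stated in the abstract.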

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This thesis applies a hierarchical latent trait model system to a large quantity of data. The motivation for it was the lack of viable approaches to analyse High Throughput Screening datasets, which may include thousands of data points with high dimensions. High Throughput Screening (HTS) is an important tool in the pharmaceutical industry for discovering leads which can be optimised and further developed into candidate drugs. Since the development of new robotic technologies, the ability to test the activities of compounds has considerably increased in recent years. Traditional methods, looking at tables and graphical plots for analysing relationships between measured activities and the structure of compounds, have not been feasible when facing a large HTS dataset. Instead, data visualisation provides a method for analysing such large datasets, especially with high dimensions. So far, a few visualisation techniques for drug design have been developed, but most of them just cope with several properties of compounds at one time. We believe that a latent variable model (LTM) with a non-linear mapping from the latent space to the data space is a preferred choice for visualising a complex high-dimensional data set. As a type of latent variable model, the latent trait model can deal with either continuous data or discrete data, which makes it particularly useful in this domain. In addition, with the aid of differential geometry, we can imagine the distribution of data from magnification factor and curvature plots. Rather than obtaining the useful information just from a single plot, a hierarchical LTM arranges a set of LTMs and their corresponding plots in a tree structure. We model the whole data set with an LTM at the top level, which is broken down into clusters at deeper levels of the hierarchy. In this manner, the refined visualisation plots can be displayed in deeper levels and sub-clusters may be found.
The hierarchy of LTMs is trained using the expectation-maximisation (EM) algorithm to maximise its likelihood with respect to the data sample. Training proceeds interactively in a recursive fashion (top-down). The user subjectively identifies interesting regions on the visualisation plot that they would like to model in greater detail. At each stage of hierarchical LTM construction, the EM algorithm alternates between the E- and M-step. Another problem that can occur when visualising a large data set is that there may be significant overlaps of data clusters. It is very difficult for the user to judge where centres of regions of interest should be put. We address this problem by employing the minimum message length technique, which can help the user to decide the optimal structure of the model. In this thesis we also demonstrate the applicability of the hierarchy of latent trait models in the field of document data mining.
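The alternation between the E- and M-step can be seen in a much simpler model than a latent trait model. The sketch below runs EM on a one-dimensional mixture of two unit-variance Gaussians; it is a stand-in chosen for brevity, not the thesis's hierarchical LTM, and the function name `em_two_gaussians` and the data are assumptions.

```python
import math


def em_two_gaussians(xs, iters=20):
    """EM for a mixture of two unit-variance Gaussians; returns (means, weights)."""
    mu = [min(xs), max(xs)]  # crude initialisation at the data extremes
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate means and mixing weights from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            w[k] = nk / len(xs)
    return mu, w


# Hypothetical data: two well-separated groups around 0 and 5.
data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
means, weights = em_two_gaussians(data)
```

Each iteration cannot decrease the likelihood, which is the property the training procedure above relies on; in a hierarchical setting the same loop would be rerun for each child model on the region the user selects.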