Biblioteca Digital

Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem: searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classic example is genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine, i.e. they should also hold in future data. This is an important distinction from traditional association rules, which, in spite of their name and a similar appearance to dependency rules, do not necessarily represent statistical dependencies at all, or represent only spurious connections that occur by chance. Therefore, the principal objective is to search for rules using statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence without any incidental extra factors. The extra factors do not add any new information on the dependence; they can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither statistical dependency nor statistical significance is a monotonic property, which means that the traditional pruning techniques do not work.
As a solution, we first derive the mathematical basis for pruning the search space with any well-behaved statistical significance measure. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measure, such as Fisher's exact test, the chi-squared measure, mutual information, and z-scores. According to our experiments, the algorithm scales well, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry about whether the dependencies hold in future data or whether the data still contains better but undiscovered dependencies.
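
To make the significance measure concrete, the following minimal sketch scores a single candidate rule X->A with a one-sided Fisher's exact test. It is not the search algorithm described in the abstract: it only evaluates one rule, and the function name `fisher_p_positive` and the row representation (each row as the set of attributes whose value is 1) are assumptions made for illustration.

```python
from math import comb


def fisher_p_positive(rows, x_attrs, a_attr):
    """One-sided Fisher's exact p-value for a positive dependency X -> A.

    rows    : list of sets; each set holds the attributes that are 1 in that row
              (hypothetical representation chosen for this sketch).
    x_attrs : attributes forming the rule antecedent X.
    a_attr  : the single consequent attribute A.
    """
    n = len(rows)
    x = set(x_attrs)
    n_x = sum(1 for r in rows if x <= r)                    # rows where X holds
    n_a = sum(1 for r in rows if a_attr in r)               # rows where A holds
    n_xa = sum(1 for r in rows if x <= r and a_attr in r)   # rows where both hold

    # Hypergeometric upper tail: probability of seeing at least n_xa joint
    # occurrences when the margins n_x and n_a are fixed and X, A are independent.
    upper = min(n_x, n_a)
    tail = sum(comb(n_a, k) * comb(n - n_a, n_x - k) for k in range(n_xa, upper + 1))
    return tail / comb(n, n_x)


# Toy data: X = {"gene", "smoking"} always co-occurs with A = "disease".
rows = [{"gene", "smoking", "disease"}] * 4 + [set()] * 4
p_value = fisher_p_positive(rows, {"gene", "smoking"}, "disease")
```

A smaller p-value indicates a more significant positive dependency. A negative rule X->not A could be scored the same way by summing the lower tail instead.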


Prediction of activities of oxygen in dilute quaternary solutions using binary data


Abstract:

Equations are developed for predicting the activity coefficients of oxygen dissolved in ternary liquid alloys. These are extensions of earlier treatments, and are based on a model in which each oxygen atom is assumed to make four bonds with neighboring metal atoms. It is also postulated that the strong oxygen-metal bonds distort the electronic configuration around the metal atoms bonded to oxygen, and that the quantitative reduction of the strength of bonds made by these atoms with all of the adjacent metal atoms is equivalent to a factor of approximately two. The predictions of the derived quasichemical equation agree satisfactorily with the partial molar free energies of oxygen in Ag-Cu-Sn solutions at 1200°C reported in the literature. An extension of this treatment to multicomponent solutions is also indicated.


The estimation of the thermodynamic properties of ternary alloys from binary data using the shortest distance composition path


Abstract:

A new composition path, Xi − Xj = constant, is suggested for the semi-empirical calculation of the thermodynamic properties of ternary ‘substitutional’ solutions from binary data, when the binary systems show deviations from the regular solution model. A comparison is made between the results obtained for integral and partial properties using this composition path and those calculated employing other composition paths suggested in the literature. It appears that the best estimate of the ternary properties is obtained when binary data at compositions closest to the ternary composition are used.
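
As a worked illustration of the path Xi − Xj = constant, the binary composition it selects follows from two constraints alone: the difference of the two mole fractions is preserved, and on the i-j binary they sum to one. The lowercase symbols for the ternary mole fractions and the superscript (ij) for the binary point are notation assumed here, not taken from the paper.

```latex
% Ternary composition: (x_1, x_2, x_3), with x_1 + x_2 + x_3 = 1.
% Constraints on the i-j binary point:
%   X_i^{(ij)} - X_j^{(ij)} = x_i - x_j   and   X_i^{(ij)} + X_j^{(ij)} = 1
\[
X_i^{(ij)} = \frac{1 + x_i - x_j}{2},
\qquad
X_j^{(ij)} = \frac{1 + x_j - x_i}{2}
\]
% Example: (x_1, x_2, x_3) = (0.5, 0.3, 0.2)
%   1-2 binary: (0.60, 0.40)
%   1-3 binary: (0.65, 0.35)
%   2-3 binary: (0.55, 0.45)
```

Geometrically, this is the foot of the perpendicular from the ternary point to the i-j edge of the composition triangle, which is why it is the closest binary composition.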


Computation of thermodynamic properties of quaternary and higher order systems from binary data using shortest distance composition paths


Abstract:

Equations for the computation of integral and partial thermodynamic properties of mixing in quaternary systems are derived using data on constituent binary systems and shortest distance composition paths to the binaries. The composition path from a quaternary composition to the i-j binary is characterized by a constant value of (Xi − Xj). The merits of this composition path over others with constant values for [expression missing in source] or Xi are discussed. Finally the equations are generalized for higher order systems. They are exact for regular solutions, but may be used in a semiempirical mode for non-regular solutions.
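
The abstract does not reproduce the equations, so the sketch below is an assumption rather than the paper's formulation. It uses a commonly cited weighting for the constant (Xi − Xj) path: each binary excess property is evaluated at the projected binary composition and scaled by the ratio of the multicomponent product of mole fractions to the binary one. The function name `excess_from_binaries` and the dictionary interface are hypothetical. The check at the end shows that this weighting reproduces a regular solution exactly, which is consistent with the abstract's claim.

```python
from itertools import combinations


def excess_from_binaries(x, binary_excess):
    """Estimate a multicomponent integral excess property from binary data.

    x             : mole fractions of all components (should sum to 1).
    binary_excess : dict mapping (i, j) with i < j to a function g(Xi, Xj)
                    returning the binary excess property at that composition.
    Assumed weighting for this sketch; not taken from the abstract.
    """
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        if x[i] * x[j] == 0.0:
            continue  # this binary contributes nothing and the weight is undefined
        # Binary composition with the same value of (Xi - Xj).
        xi_b = (1.0 + x[i] - x[j]) / 2.0
        xj_b = (1.0 + x[j] - x[i]) / 2.0
        weight = (x[i] * x[j]) / (xi_b * xj_b)
        total += weight * binary_excess[(i, j)](xi_b, xj_b)
    return total


# Regular-solution check with made-up interaction parameters (J/mol).
omega = {(0, 1): -1000.0, (0, 2): 2500.0, (0, 3): 400.0,
         (1, 2): -300.0, (1, 3): 1200.0, (2, 3): 800.0}
binaries = {pair: (lambda a, b, w=w: w * a * b) for pair, w in omega.items()}
x = [0.4, 0.3, 0.2, 0.1]
estimate = excess_from_binaries(x, binaries)
exact = sum(w * x[i] * x[j] for (i, j), w in omega.items())
```

For non-regular binaries, `estimate` is only a semi-empirical approximation, because the binary functions no longer factor as a constant times XiXj.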


Patient self-report and medical records: Measuring agreement for binary data


A score test for binary data with patient non-compliance


Abstract:

A score test is developed for binary clinical trial data, which incorporates patient non-compliance while respecting randomization. It is assumed in this paper that compliance is all-or-nothing, in the sense that a patient either accepts all of the treatment assigned as specified in the protocol, or none of it. Direct analytic comparisons of the adjusted test statistic for both the score test and the likelihood ratio test are made with the corresponding test statistics that adhere to the intention-to-treat principle. It is shown that no gain in power is possible over the intention-to-treat analysis, by adjusting for patient non-compliance. Sample size formulae are derived and simulation studies are used to demonstrate that the sample size approximation holds. Copyright © 2003 John Wiley & Sons, Ltd.


Meta-analysis of binary data using profile likelihood


Active-control trials with binary data: a comparison of methods for testing superiority or non-inferiority using the odds ratio


Abstract:

This paper considers methods for testing for superiority or non-inferiority in active-control trials with binary data, when the relative treatment effect is expressed as an odds ratio. Three asymptotic tests for the log-odds ratio based on the unconditional binary likelihood are presented, namely the likelihood ratio, Wald and score tests. All three tests can be implemented straightforwardly in standard statistical software packages, as can the corresponding confidence intervals. Simulations indicate that the three alternatives are similar in terms of the Type I error, with values close to the nominal level. However, when the non-inferiority margin becomes large, the score test slightly exceeds the nominal level. In general, the highest power is obtained from the score test, although all three tests are similar and the observed differences in power are not of practical importance. Copyright (C) 2007 John Wiley & Sons, Ltd.
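
Of the three tests named in the abstract, the Wald test is the simplest to write down. The sketch below uses the standard large-sample form for the log-odds ratio of a 2x2 table and is not the paper's code. It assumes that the event is favourable (a larger odds ratio means the new treatment is better), that the non-inferiority margin is an odds ratio below 1, and that no cell count is zero. The function name and argument names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def wald_noninferiority(events_new, n_new, events_ctl, n_ctl, margin_or):
    """Wald z statistic and one-sided p-value for H0: OR <= margin_or.

    Assumes a favourable event and non-zero cell counts (sketch only).
    Setting margin_or = 1 turns this into a one-sided superiority test.
    """
    a, b = events_new, n_new - events_new      # new treatment: events, non-events
    c, d = events_ctl, n_ctl - events_ctl      # active control: events, non-events
    log_or = log((a * d) / (b * c))            # estimated log-odds ratio
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # large-sample standard error
    z = (log_or - log(margin_or)) / se
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Equal observed response rates in both arms, margin odds ratio of 0.5.
z_stat, p_val = wald_noninferiority(80, 100, 80, 100, margin_or=0.5)
```

Non-inferiority would be concluded when `p_val` falls below the chosen one-sided significance level.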


Meta-analysis of binary data based upon dichotomized criteria


Sample size considerations in active-control non-inferiority trials with binary data based on the odds ratio


Abstract:

This paper presents an approximate closed form sample size formula for determining non-inferiority in active-control trials with binary data. We use the odds-ratio as the measure of the relative treatment effect, derive the sample size formula based on the score test and compare it with a second, well-known formula based on the Wald test. Both closed form formulae are compared with simulations based on the likelihood ratio test. Within the range of parameter values investigated, the score test closed form formula is reasonably accurate when non-inferiority margins are based on odds-ratios of about 0.5 or above and when the magnitude of the odds ratio under the alternative hypothesis lies between about 1 and 2.5. The accuracy generally decreases as the odds ratio under the alternative hypothesis moves upwards from 1. As the non-inferiority margin odds ratio decreases from 0.5, the score test closed form formula increasingly overestimates the sample size irrespective of the magnitude of the odds ratio under the alternative hypothesis. The Wald test closed form formula is also reasonably accurate in the cases where the score test closed form formula works well. Outside these scenarios, the Wald test closed form formula can either underestimate or overestimate the sample size, depending on the magnitude of the non-inferiority margin odds ratio and the odds ratio under the alternative hypothesis. Although neither approximation is accurate for all cases, both approaches lead to satisfactory sample size calculation for non-inferiority trials with binary data where the odds ratio is the parameter of interest.
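
For orientation, a textbook Wald-type closed-form sample size for the log-odds ratio can be sketched as follows. It may differ in detail from the "well-known formula" the abstract refers to, and it is not the score-test formula derived in the paper. Equal allocation to the two arms, a favourable event, and the default `alpha` and `power` values are assumptions made for this example.

```python
from math import ceil, log
from statistics import NormalDist


def wald_sample_size(p_control, or_alt, margin_or, alpha=0.025, power=0.9):
    """Approximate per-arm sample size for a non-inferiority trial (Wald form).

    p_control : assumed event probability in the active-control arm.
    or_alt    : odds ratio (new vs control) under the alternative hypothesis.
    margin_or : non-inferiority margin expressed as an odds ratio.
    Textbook approximation; the defaults are illustrative assumptions.
    """
    # Event probability in the new-treatment arm implied by or_alt.
    odds_new = or_alt * p_control / (1 - p_control)
    p_new = odds_new / (1 + odds_new)

    normal = NormalDist()
    z_sum = normal.inv_cdf(1 - alpha) + normal.inv_cdf(power)

    # Per-subject variance contribution of the estimated log-odds ratio.
    var_unit = 1 / (p_new * (1 - p_new)) + 1 / (p_control * (1 - p_control))
    effect = log(or_alt) - log(margin_or)
    return ceil(z_sum ** 2 * var_unit / effect ** 2)


n_per_arm = wald_sample_size(p_control=0.8, or_alt=1.0, margin_or=0.5)
```

As the abstract notes, closed-form approximations of this kind can under- or overestimate the required size outside a limited range of margins and alternative odds ratios, so a simulation check is advisable before relying on the result.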


981 results for clustered binary data
