88 results for Sample selection model
A hierarchical Bayesian model for predicting the functional consequences of amino-acid polymorphisms
Abstract:
Genetic polymorphisms in deoxyribonucleic acid coding regions may have a phenotypic effect on the carrier, e.g. by influencing susceptibility to disease. Detection of deleterious mutations via association studies is hampered by the large number of candidate sites; therefore methods are needed to narrow down the search to the most promising sites. For this, a possible approach is to use structural and sequence-based information about the encoded protein to predict whether a mutation at a particular site is likely to disrupt the functionality of the protein itself. We propose a hierarchical Bayesian multivariate adaptive regression spline (BMARS) model for supervised learning in this context and assess its predictive performance by using data from mutagenesis experiments on lac repressor and lysozyme proteins. In these experiments, about 12 amino-acid substitutions were performed at each native amino-acid position and the effect on protein functionality was assessed. The training data thus consist of repeated observations at each position, which the hierarchical framework is needed to account for. The model is trained on the lac repressor data and tested on the lysozyme mutations, and vice versa. In particular, we show that the hierarchical BMARS model, by allowing for the clustered nature of the data, yields lower out-of-sample misclassification rates than a non-hierarchical BMARS model, a frequentist MARS model, a support vector machine classifier and an optimally pruned classification tree.
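A hedged sketch of the evaluation protocol described above: train on one protein's mutation data, test on the other's, and compare out-of-sample misclassification rates. The code uses two of the baseline classifiers mentioned (an SVM and a pruned classification tree) on synthetic stand-ins for the clustered mutagenesis data; the features and data generator are invented, and the hierarchical BMARS model itself is not reproduced.

```python
# Synthetic stand-in for the cross-protein evaluation: repeated substitutions
# at each native position give clustered observations; train on one data set,
# test on the other, and compare out-of-sample misclassification rates.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_protein_like_data(n_positions=80, subs_per_position=12, shift=0.0):
    """Repeated substitutions per native position -> clustered observations."""
    pos_effect = rng.normal(shift, 1.0, n_positions)   # per-position latent effect
    X, y = [], []
    for p in range(n_positions):
        feats = rng.normal(pos_effect[p], 1.0, (subs_per_position, 3))
        labels = (feats.sum(axis=1) + rng.normal(0, 1, subs_per_position)) > 0
        X.append(feats); y.append(labels)
    return np.vstack(X), np.concatenate(y)

X_train, y_train = make_protein_like_data(shift=0.0)   # "lac repressor" stand-in
X_test,  y_test  = make_protein_like_data(shift=0.3)   # "lysozyme" stand-in

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("pruned tree", DecisionTreeClassifier(max_depth=4))]:
    err = 1.0 - clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: out-of-sample misclassification rate = {err:.3f}")
```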
Abstract:
A number of authors have proposed clinical trial designs involving the comparison of several experimental treatments with a control treatment in two or more stages. At the end of the first stage, the most promising experimental treatment is selected, and all other experimental treatments are dropped from the trial. Provided it is good enough, the selected experimental treatment is then compared with the control treatment in one or more subsequent stages. The analysis of data from such a trial is problematic because of the treatment selection and the possibility of stopping at interim analyses. These aspects lead to bias in the maximum-likelihood estimate of the advantage of the selected experimental treatment over the control and to inaccurate coverage for the associated confidence interval. In this paper, we evaluate the bias of the maximum-likelihood estimate and propose a bias-adjusted estimate. We also propose an approach to the construction of a confidence region for the vector of advantages of the experimental treatments over the control based on an ordering of the sample space. These regions are shown to have accurate coverage, although they are also shown to be necessarily unbounded. Confidence intervals for the advantage of the selected treatment are obtained from the confidence regions and are shown to have more accurate coverage than the standard confidence interval based upon the maximum-likelihood estimate and its asymptotic standard error.
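The selection bias described above is easy to reproduce numerically. The hedged Monte Carlo sketch below (not the paper's bias-adjusted estimator) simulates a select-the-best first stage and shows that the naive maximum-likelihood estimate of the selected arm's advantage over control is biased upward; arm counts and effect sizes are invented.

```python
# Simulate: several experimental arms vs control, select the best at stage 1,
# record the naive MLE of its advantage minus the true advantage.
import numpy as np

rng = np.random.default_rng(1)
K, n, true_delta, sigma = 4, 50, 0.2, 1.0   # arms, per-arm n, common true advantage
n_sim = 20000

bias_samples = []
for _ in range(n_sim):
    control = rng.normal(0.0, sigma, n).mean()
    arms = rng.normal(true_delta, sigma, (K, n)).mean(axis=1)
    best = arms.argmax()                     # treatment selection at interim
    naive_mle = arms[best] - control         # standard estimate of the advantage
    bias_samples.append(naive_mle - true_delta)

print(f"mean bias of naive MLE after selection: {np.mean(bias_samples):+.3f}")
# A clearly positive value illustrates the selection bias the paper corrects.
```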
Abstract:
Most statistical methodology for phase III clinical trials focuses on the comparison of a single experimental treatment with a control. An increasing desire to reduce the time before regulatory approval of a new drug is sought has led to the development of two-stage or sequential designs for trials that combine the definitive analysis associated with phase III with the treatment selection element of a phase II study. In this paper we consider a trial in which the most promising of a number of experimental treatments is selected at the first interim analysis. This considerably reduces the computational load associated with the construction of stopping boundaries compared to the approach proposed by Follmann, Proschan and Geller (Biometrics 1994; 50: 325-336). The computational requirement does not exceed that for the sequential comparison of a single experimental treatment with a control. Existing methods are extended in two ways. First, the use of the efficient score as a test statistic makes the analysis of binary, normal or failure-time data, as well as adjustment for covariates or stratification, straightforward. Second, the question of trial power is also considered, enabling the determination of the sample size required to give specified power. Copyright © 2003 John Wiley & Sons, Ltd.
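As a hedged illustration of the efficient-score approach mentioned above, the sketch below computes the standard efficient score Z and Fisher information V for the log-odds ratio with binary data at a single interim look. These are the textbook formulas (as in Whitehead's sequential-trial framework), not the paper's stopping-boundary construction, and the counts are invented.

```python
# Efficient score Z and information V for a binary outcome, comparing one
# experimental arm with control; Z/sqrt(V) is the usual test statistic and
# Z/V approximates the log-odds ratio.
import math

def score_and_information(s_t, n_t, s_c, n_c):
    """Efficient score Z and information V for the log-odds ratio, binary data."""
    n, s = n_t + n_c, s_t + s_c
    z = (n_c * s_t - n_t * s_c) / n
    v = n_t * n_c * s * (n - s) / n**3
    return z, v

# Example interim look: 40/100 responders on treatment vs 28/100 on control.
z, v = score_and_information(40, 100, 28, 100)
print(f"Z = {z:.2f}, V = {v:.2f}, Z/sqrt(V) = {z / math.sqrt(v):.2f}")
print(f"approx. log-odds ratio Z/V = {z / v:.2f}")
```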
Abstract:
The proportional odds model provides a powerful tool for analysing ordered categorical data and setting sample size, although for many clinical trials its validity is questionable. The purpose of this paper is to present a new class of constrained odds models which includes the proportional odds model. The efficient score and Fisher's information are derived from the profile likelihood for the constrained odds model. These results are new even for the special case of proportional odds where the resulting statistics define the Mann-Whitney test. A strategy is described involving selecting one of these models in advance, requiring assumptions as strong as those underlying proportional odds, but allowing a choice of such models. The accuracy of the new procedure and its power are evaluated.
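The special case noted above, that under proportional odds the resulting statistics define the Mann-Whitney test, can be illustrated directly. The hedged sketch below compares two arms on an ordered categorical outcome with a tie-corrected Mann-Whitney test; the category counts are invented, and the paper's constrained odds machinery is not reproduced.

```python
# Ordered categorical outcomes compared with the Mann-Whitney test, the
# statistic that the efficient score reduces to under proportional odds.
import numpy as np
from scipy.stats import mannwhitneyu

# Ordinal outcome coded 1 (worst) .. 4 (best); expand per-category counts.
control   = np.repeat([1, 2, 3, 4], [20, 30, 30, 20])
treatment = np.repeat([1, 2, 3, 4], [10, 25, 35, 30])

u, p = mannwhitneyu(treatment, control, alternative="two-sided")
print(f"Mann-Whitney U = {u:.0f}, two-sided p = {p:.4f}")
```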
Abstract:
Background: Selecting the highest quality 3D model of a protein structure from a number of alternatives remains an important challenge in the field of structural bioinformatics. Many Model Quality Assessment Programs (MQAPs) have been developed which adopt various strategies in order to tackle this problem, ranging from the so-called "true" MQAPs capable of producing a single energy score based on a single model, to methods which rely on structural comparisons of multiple models or additional information from meta-servers. However, it is clear that no current method can consistently separate the highest accuracy models from the lowest. In this paper, a number of the top performing MQAP methods are benchmarked in the context of the potential value that they add to protein fold recognition. Two novel methods are also described: ModSSEA, which is based on the alignment of predicted secondary structure elements, and ModFOLD, which combines several true MQAP methods using an artificial neural network. Results: The ModSSEA method is found to be an effective model quality assessment program for ranking multiple models from many servers; however, further accuracy can be gained by using the consensus approach of ModFOLD. The ModFOLD method is shown to significantly outperform the true MQAPs tested and is competitive with methods which make use of clustering or additional information from multiple servers. Several of the true MQAPs are also shown to add value to most individual fold recognition servers by improving model selection, when applied as a post filter in order to re-rank models. Conclusion: MQAPs should be benchmarked appropriately for the practical context in which they are intended to be used. Clustering based methods are the top performing MQAPs where many models are available from many servers; however, they often do not add value to individual fold recognition servers when limited models are available. Conversely, the true MQAP methods tested can often be used as effective post filters for re-ranking few models from individual fold recognition servers, and further improvements can be achieved using a consensus of these methods.
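As a hedged sketch of the consensus idea behind ModFOLD, the code below combines the outputs of several hypothetical single-model quality scores with a small neural network trained against an observed quality measure, and compares the consensus with the best single score. Everything here (scores, noise levels, network size) is an invented stand-in; this is not the ModFOLD network or its training data.

```python
# Combine several noisy model-quality scores with a small neural network and
# compare the consensus against the best individual score on held-out models.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n_models = 500
true_quality = rng.uniform(0, 1, n_models)              # unseen "true" quality
# Three noisy single-model quality scores, each correlated with true quality.
scores = np.column_stack([true_quality + rng.normal(0, s, n_models)
                          for s in (0.15, 0.20, 0.25)])

net = MLPRegressor(hidden_layer_sizes=(5,), max_iter=5000, random_state=0)
net.fit(scores[:400], true_quality[:400])               # train on 400 models
consensus = net.predict(scores[400:])                   # score held-out models

best_single = max(np.corrcoef(scores[400:, j], true_quality[400:])[0, 1]
                  for j in range(3))
print(f"best single score corr: {best_single:.3f}")
print(f"consensus network corr: {np.corrcoef(consensus, true_quality[400:])[0, 1]:.3f}")
```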
Abstract:
1. Jerdon's courser Rhinoptilus bitorquatus is a nocturnally active cursorial bird that is only known to occur in a small area of scrub jungle in Andhra Pradesh, India, and is listed as critically endangered by the IUCN. Information on its habitat requirements is needed urgently to underpin conservation measures. We quantified the habitat features that correlated with the use of different areas of scrub jungle by Jerdon's coursers, and developed a model to map potentially suitable habitat over large areas from satellite imagery and facilitate the design of surveys of Jerdon's courser distribution.
2. We used 11 arrays of 5-m long tracking strips consisting of smoothed fine soil to detect the footprints of Jerdon's coursers, and measured tracking rates (tracking events per strip night). We counted the number of bushes and trees, and described other attributes of vegetation and substrate in a 10-m square plot centred on each strip. We obtained reflectance data from Landsat 7 satellite imagery for the pixel within which each strip lay.
3. We used logistic regression models to describe the relationship between tracking rate by Jerdon's coursers and characteristics of the habitat around the strips, using ground-based survey data and satellite imagery (a minimal sketch of this modelling step follows the abstract).
4. Jerdon's coursers were most likely to occur where the density of large (>2 m tall) bushes was in the range 300-700 ha⁻¹ and where the density of smaller bushes was less than 1000 ha⁻¹. This habitat was detectable using satellite imagery.
5. Synthesis and applications. The occurrence of Jerdon's courser is strongly correlated with the density of bushes and trees, which is in turn affected by grazing with domestic livestock, woodcutting and mechanical clearance of bushes to create pasture, orchards and farmland. It is likely that there is an optimal level of grazing and woodcutting that would maintain or create suitable conditions for the species. Knowledge of the species' distribution is incomplete and there is considerable pressure from human use of apparently suitable habitats. Hence, distribution mapping is a high conservation priority. A two-step procedure is proposed, involving the use of ground surveys of bush density to calibrate satellite image-based mapping of potential habitat. These maps could then be used to select priority areas for Jerdon's courser surveys. The use of tracking strips to study habitat selection and distribution has potential in studies of other scarce and secretive species.
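The modelling step in point 3 can be illustrated with a hedged sketch: a logistic regression of whether a strip-night records footprints on bush density, with a quadratic term so that an intermediate optimum (like the 300-700 ha⁻¹ band reported in point 4) can be captured. The data generator, sample sizes and simulated peak are invented; this is not the survey data or the paper's fitted model.

```python
# Logistic regression with a quadratic density term, fitted to simulated
# strip-night presence/absence data with a suitability peak at ~500 per ha.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
density = rng.uniform(0, 1500, n)                # large bushes per ha
logit_p = 1.0 - ((density - 500.0) / 350.0) ** 2 # simulated peak near 500 per ha
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_p)))

x = density / 100.0                              # rescale for stable fitting
X = sm.add_constant(np.column_stack([x, x ** 2]))
fit = sm.Logit(y, X).fit(disp=0)
b0, b1, b2 = fit.params
print(fit.params)                                # const, density, density^2
print(f"estimated optimum: {100.0 * (-b1 / (2.0 * b2)):.0f} bushes per ha")
```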
Abstract:
This paper highlights the key role played by solubility in influencing gelation and demonstrates that many facets of the gelation process depend on this vital parameter. In particular, we relate the thermal stability (Tgel) and minimum gelation concentration (MGC) values of small-molecule gelation to the solubility and cooperative self-assembly of the gelator building blocks. By employing a van't Hoff analysis of solubility data, determined from simple NMR measurements, we are able to generate Tcalc values that reflect the calculated temperature for complete solubilization of the networked gelator. The concentration dependence of Tcalc allows the previously difficult to rationalize "plateau-region" thermal stability values to be elucidated in terms of gelator molecular design. This is demonstrated for a family of four gelators with lysine units attached to each end of an aliphatic diamine, with different peripheral protecting groups (Z or Boc) in different locations on the periphery of the molecule. By tuning the peripheral protecting groups of the gelators, the solubility of the system is modified, which in turn controls the saturation point of the system and hence the concentration at which network formation takes place. We report that the critical concentration (Ccrit) of gelator incorporated into the solid-phase, sample-spanning network within the gel is invariant of gelator structural design. However, because some systems have higher solubilities, they are less effective gelators and require the application of higher total concentrations to achieve gelation, hence shedding light on the role of the MGC parameter in gelation. Furthermore, gelator structural design also modulates the level of cooperative self-assembly through solubility effects, as determined by applying a cooperative binding model to NMR data. Finally, the effect of gelator chemical design on the spatial organization of the networked gelator was probed by small-angle neutron and X-ray scattering (SANS/SAXS) on the native gel, and a tentative self-assembly model is proposed.
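The van't Hoff treatment described above lends itself to a short worked sketch: fit ln(solubility) against 1/T to extract a dissolution enthalpy and entropy, then invert the relation to obtain Tcalc, the temperature at which the solubility equals a chosen total gelator concentration. The solubility numbers below are invented stand-ins for the NMR-derived data, and concentrations must share one unit throughout.

```python
# van't Hoff analysis: ln s = -dH/(R T) + dS/R, so a straight-line fit of
# ln s vs 1/T gives dH (slope) and dS (intercept); inverting gives Tcalc.
import numpy as np

R = 8.314  # J mol^-1 K^-1

# Illustrative solubility data: temperature (K) and dissolved gelator (mM).
T = np.array([293.0, 303.0, 313.0, 323.0])
s = np.array([2.1, 4.0, 7.2, 12.5])

slope, intercept = np.polyfit(1.0 / T, np.log(s), 1)
dH = -slope * R            # dissolution enthalpy
dS = intercept * R         # dissolution entropy
print(f"dH = {dH / 1000:.1f} kJ/mol, dS = {dS:.1f} J/(mol K)")

def t_calc(total_conc_mM):
    """Temperature at which solubility reaches the total concentration."""
    return dH / (dS - R * np.log(total_conc_mM))

for c in (5.0, 10.0, 20.0):
    print(f"Tcalc at {c:>4.1f} mM = {t_calc(c):.1f} K")
```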
Abstract:
Sunflower oil-in-water emulsions containing TBHQ, caffeic acid, epigallocatechin gallate (EGCG), or 6-hydroxy-2,5,7,8-tetramethylchroman-2-carboxylic acid (Trolox), both with and without BSA, were stored at 50 and 30 °C. Oxidation of the oil was monitored by determination of the peroxide value (PV), conjugated diene content, and hexanal formation. Emulsions containing EGCG, caffeic acid, and, to a lesser extent, Trolox were much more stable during storage in the presence of BSA than in its absence, even though BSA itself did not provide an antioxidant effect. BSA did not have a synergistic effect on the antioxidant activity of TBHQ. The BSA structure changed, with a considerable loss of fluorescent tryptophan groups during storage of solutions containing BSA and antioxidants, and a BSA-antioxidant adduct with radical-scavenging activity was formed. The highest radical-scavenging activity observed was for the protein isolated from a sample containing EGCG and BSA incubated at 30 °C for 10 d. This fraction contained unchanged BSA as well as BSA-antioxidant adduct, but 95.7% of the initial fluorescence had been lost, showing that most of the BSA had been altered. It can be concluded that BSA exerts its synergistic effect with antioxidants because of the formation of a protein-antioxidant adduct during storage, which is concentrated at the oil-water interface owing to the surface-active nature of the protein.
Abstract:
Background: High rates of co-morbidity between Generalized Social Phobia (GSP) and Generalized Anxiety Disorder (GAD) have been documented. The reason for this is unclear. Family studies are one means of clarifying the nature of co-morbidity between two disorders. Methods: Six models of co-morbidity between GSP and GAD were investigated in a family aggregation study of 403 first-degree relatives of non-clinical probands: 37 with GSP, 22 with GAD, 15 with co-morbid GSP/GAD, and 41 controls with no history of GSP or GAD. Psychiatric data were collected for probands and relatives. Mixed methods (direct and family history interviews) were utilised. Results: Primary contrasts (against controls) found an increased rate of pure GSP in the relatives of both GSP probands and co-morbid GSP/GAD probands, and found relatives of co-morbid GSP/GAD probands to have an increased rate of both pure GAD and co-morbid GSP/GAD. Secondary contrasts found (i) increased GSP in the relatives of GSP-only probands compared to the relatives of GAD-only probands; and (ii) increased GAD in the relatives of co-morbid GSP/GAD probands compared to the relatives of GSP-only probands. Limitations: The study did not directly interview all relatives, although the reliability of family history data was assessed. The study was based on an all-female proband sample. The implications of both these limitations are discussed. Conclusions: The results were most consistent with a co-morbidity model indicating independent familial transmission of GSP and GAD. This has clinical implications for the treatment of patients with both disorders. (C) 2006 Elsevier B.V. All rights reserved.
Abstract:
Supplier selection has a great impact on supply chain management. The quality of supplier selection also affects the profitability of organisations which work in the supply chain. As suppliers can provide a variety of services and customers demand a higher quality of service provision, organisations face the challenge of making the right choice of supplier for the right needs. Existing methods for supplier selection, such as data envelopment analysis (DEA) and the analytic hierarchy process (AHP), can automatically perform the selection of competitive suppliers and decide the winning supplier(s). However, these methods are not capable of determining the right selection criteria, which should be derived from the business strategy. The ontology model described in this paper integrates the strengths of DEA and AHP with new mechanisms which ensure that the right supplier is selected by the right criteria for the right customer's needs.
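One of the two building blocks named above, AHP, reduces to a small linear-algebra computation that can be sketched directly: derive criterion weights from a pairwise comparison matrix via its principal eigenvector and check Saaty's consistency ratio. The criteria and judgements below are hypothetical, and the paper's ontology layer for deriving the criteria from business strategy is not reproduced.

```python
# AHP core: priority weights from the principal eigenvector of a pairwise
# comparison matrix, plus a consistency check.
import numpy as np

criteria = ["quality", "price", "delivery"]
# A[i, j] = how much more important criterion i is than j (Saaty's 1-9 scale).
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()

n = len(criteria)
ci = (eigvals[k].real - n) / (n - 1)   # consistency index
cr = ci / 0.58                         # Saaty's random index for n = 3
for name, w in zip(criteria, weights):
    print(f"{name:>8}: weight = {w:.3f}")
print(f"consistency ratio = {cr:.3f} (acceptable if < 0.1)")
```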
Abstract:
This paper is concerned with the selection of inputs for classification models based on ratios of measured quantities. For this purpose, all possible ratios are built from the quantities involved and variable selection techniques are used to choose a convenient subset of ratios. In this context, two selection techniques are proposed: one based on a pre-selection procedure and another based on a genetic algorithm. In an example involving the financial distress prediction of companies, the models obtained from ratios selected by the proposed techniques compare favorably to a model using ratios usually found in the financial distress literature.
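The construction step described above is mechanical enough to sketch: form every ordered ratio of the measured quantities and pre-select a subset with a simple filter. In the hedged sketch below, a univariate F-test filter stands in for the paper's pre-selection procedure, the genetic algorithm is not reproduced, and the financial quantities and distress flag are simulated.

```python
# Build all ordered ratios quantity_i / quantity_j, then filter to a small
# subset of candidate inputs for a classification model.
import numpy as np
from itertools import permutations
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(4)
n, q = 200, 5                                    # companies, measured quantities
Q = rng.lognormal(0.0, 0.5, (n, q))              # positive financial quantities
y = (Q[:, 0] / Q[:, 1] + rng.normal(0, 0.3, n) > 1.1).astype(int)  # distress flag

pairs = list(permutations(range(q), 2))
ratios = np.column_stack([Q[:, i] / Q[:, j] for i, j in pairs])

selector = SelectKBest(f_classif, k=3).fit(ratios, y)
for idx in selector.get_support(indices=True):
    i, j = pairs[idx]
    print(f"selected ratio: q{i} / q{j}")
```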
Abstract:
An efficient model identification algorithm for a large class of linear-in-the-parameters models is introduced that simultaneously optimises the model approximation ability, sparsity and robustness. The derived model parameters in each forward regression step are initially estimated via orthogonal least squares (OLS) and then tuned with a new gradient-descent learning algorithm, based on basis pursuit, that minimises the l1 norm of the parameter estimate vector. The model subset selection cost function includes a D-optimality design criterion that maximises the determinant of the design matrix of the subset, to ensure model robustness and to enable the model selection procedure to terminate automatically at a sparse model. The proposed approach is based on the forward OLS algorithm using the modified Gram-Schmidt procedure. Both the parameter tuning procedure, based on basis pursuit, and the model selection criterion, based on D-optimality, which is effective in ensuring model robustness, are integrated with the forward regression. As a consequence, the inherent computational efficiency associated with the conventional forward OLS approach is maintained in the proposed algorithm. Examples demonstrate the effectiveness of the new approach.
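A simplified, hedged sketch of the selection core described above: forward OLS with modified Gram-Schmidt orthogonalisation, where each candidate is scored by its error reduction plus a D-optimality term beta*log(w'w), and selection stops when no candidate improves the combined criterion. The beta value, the stopping rule as coded and the collinear test data are all simplifying assumptions, and the basis-pursuit (l1) parameter-tuning loop is omitted.

```python
# Greedy forward OLS with a D-optimality term; as candidates become nearly
# dependent their orthogonalised energy w'w shrinks, the log term turns
# strongly negative, and selection terminates at a sparse model.
import numpy as np

def forward_ols_doptimality(P, y, beta=0.5):
    n, m = P.shape
    W = P.astype(float).copy()          # orthogonalised candidate columns
    selected, resid = [], y.astype(float).copy()
    while True:
        best, best_gain = None, 0.0
        for j in range(m):
            if j in selected:
                continue
            w = W[:, j]
            ww = w @ w
            if ww < 1e-12:
                continue
            err_reduction = (w @ resid) ** 2 / ww   # SSE drop if j is added
            gain = err_reduction + beta * np.log(ww)
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:                 # no positive combined gain: stop
            return selected
        w = W[:, best]
        resid -= ((w @ resid) / (w @ w)) * w
        # Modified Gram-Schmidt: orthogonalise remaining candidates against w.
        for j in range(m):
            if j not in selected and j != best:
                W[:, j] -= ((w @ W[:, j]) / (w @ w)) * w
        selected.append(best)

rng = np.random.default_rng(5)
latent = rng.normal(size=(100, 3))
mix = rng.normal(size=(3, 10))
P = latent @ mix + 0.05 * rng.normal(size=(100, 10))   # collinear candidates
y = 1.5 * latent[:, 0] - 2.0 * latent[:, 2] + 0.1 * rng.normal(size=100)
print("selected term indices:", forward_ols_doptimality(P, y))
```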
Abstract:
This correspondence introduces a new orthogonal forward regression (OFR) model identification algorithm that uses D-optimality for model structure selection and is based on M-estimation of the model parameters. The M-estimator is a classical robust parameter estimation technique for tackling bad data conditions such as outliers. Computationally, the M-estimator can be derived using an iteratively reweighted least squares (IRLS) algorithm. D-optimality is a model structure robustness criterion from experimental design that tackles ill-conditioning in the model structure. OFR, often based on the modified Gram-Schmidt procedure, is an efficient method that incorporates structure selection and parameter estimation simultaneously. The basic idea of the proposed approach is to incorporate an IRLS inner loop into the modified Gram-Schmidt procedure. In this manner, the OFR algorithm for parsimonious model structure determination is extended to bad data conditions, with improved performance via the derivation of parameter M-estimators with inherent robustness to outliers. Numerical examples are included to demonstrate the effectiveness of the proposed algorithm.
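The IRLS derivation of the M-estimator mentioned above can be sketched on a plain linear model. The code below implements iteratively reweighted least squares with Huber weights, one standard M-estimator, and contrasts it with OLS on data containing gross outliers; the embedding of this loop inside the modified Gram-Schmidt forward regression is not reproduced.

```python
# IRLS for Huber's M-estimator: reweight residuals each pass so that large
# residuals (outliers) are down-weighted, then re-solve weighted least squares.
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """Iteratively reweighted least squares for Huber's M-estimator."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]               # OLS start
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745  # robust sigma
        u = r / (scale + 1e-12)
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))      # Huber weights
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)            # weighted normal eqs
    return beta

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0, 0.3, 100)
y[:10] += 15.0                                                # gross outliers
print("OLS   :", np.linalg.lstsq(X, y, rcond=None)[0])
print("Huber :", huber_irls(X, y))
```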
Abstract:
Many kernel classifier construction algorithms adopt classification accuracy as the performance metric in model evaluation. Moreover, equal weighting is often applied to each data sample in parameter estimation. These modeling practices often become problematic if the data sets are imbalanced. We present a kernel classifier construction algorithm using orthogonal forward selection (OFS) in order to optimize model generalization for imbalanced two-class data sets. This kernel classifier identification algorithm is based on a new regularized orthogonal weighted least squares (ROWLS) estimator and a model selection criterion of maximal leave-one-out area under the curve (LOO-AUC) of the receiver operating characteristic (ROC). It is shown that, owing to the orthogonalization procedure, the LOO-AUC can be calculated via an analytic formula based on the new regularized orthogonal weighted least squares parameter estimator, without actually splitting the estimation data set. The proposed algorithm can achieve minimal computational expense via a set of forward recursive updating formulae when searching for model terms with maximal incremental LOO-AUC value. Numerical examples are used to demonstrate the efficacy of the algorithm.
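A hedged sketch of the ingredients named above: an RBF-kernel, linear-in-the-parameters classifier fitted by regularized, class-weighted least squares and judged by ROC AUC on an imbalanced two-class set. The kernel width, regularizer, class weights and data are invented, and the paper's analytic LOO-AUC formula and forward selection are not reproduced.

```python
# Class-weighted, ridge-regularized least squares over an RBF design matrix,
# evaluated by AUC, which is the metric the paper optimizes instead of accuracy.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n_pos, n_neg = 30, 300                                    # imbalanced classes
X = np.vstack([rng.normal(1.0, 1.0, (n_pos, 2)),
               rng.normal(-1.0, 1.0, (n_neg, 2))])
y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])

def rbf_design(X, centres, width=1.0):
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

centres = X[rng.choice(len(X), 25, replace=False)]        # kernel centres
P = rbf_design(X, centres)

w = np.where(y > 0, len(y) / (2 * n_pos), len(y) / (2 * n_neg))  # class weights
lam = 1e-3                                                # ridge regularizer
A = P.T @ (P * w[:, None]) + lam * np.eye(P.shape[1])
theta = np.linalg.solve(A, P.T @ (w * y))

print(f"training AUC = {roc_auc_score(y, P @ theta):.3f}")
```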
Abstract:
We propose a simple and computationally efficient construction algorithm for two-class linear-in-the-parameters classifiers. In order to optimize model generalization, an orthogonal forward selection (OFS) procedure is used to minimize the leave-one-out (LOO) misclassification rate directly. An analytic formula and a set of forward recursive updating formulae for the LOO misclassification rate are developed and applied in the proposed algorithm. Numerical examples are used to demonstrate that the proposed algorithm is an excellent alternative approach for constructing sparse two-class classifiers in terms of performance and computational efficiency.
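The analytic LOO idea referred to above has a well-known core for least-squares fits: the leave-one-out prediction for sample i is (ŷ_i − h_ii·y_i)/(1 − h_ii), with h_ii the i-th diagonal of the hat matrix, so the LOO misclassification rate needs no refitting. The hedged sketch below verifies this against brute-force refits on invented data; the paper's forward recursive updating is not shown.

```python
# Analytic leave-one-out predictions for a least-squares linear classifier via
# the hat matrix, checked against n explicit leave-one-out refits.
import numpy as np

rng = np.random.default_rng(8)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = np.sign(X @ np.array([0.2, 1.0, -1.0, 0.5]) + rng.normal(0, 0.8, n))

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
y_hat = H @ y
h = np.diag(H)
y_loo = (y_hat - h * y) / (1.0 - h)            # analytic LOO predictions
loo_err_analytic = np.mean(y * y_loo <= 0)

# Brute-force check: refit n times, each time leaving one sample out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append(y[i] * (X[i] @ beta) <= 0)
print(f"analytic LOO error    = {loo_err_analytic:.3f}")
print(f"brute-force LOO error = {np.mean(errs):.3f}")
```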