9 results for clustered binary data
in BORIS: Bern Open Repository and Information System - Bern - Switzerland
Abstract:
Index tracking has become one of the most common strategies in asset management. The index-tracking problem consists of constructing a portfolio that replicates the future performance of an index by including only a subset of the index constituents in the portfolio. Finding the most representative subset is challenging when the number of stocks in the index is large. We introduce a new three-stage approach that first identifies promising subsets by employing data-mining techniques, then determines the stock weights in the subsets using mixed-binary linear programming, and finally evaluates the subsets based on cross-validation. The best subset is returned as the tracking portfolio. Our approach outperforms state-of-the-art methods in terms of out-of-sample performance and running times.
Abstract:
We have investigated the use of hierarchical clustering of flow cytometry data to classify samples of conventional central chondrosarcoma, a malignant cartilage-forming tumor of uncertain cellular origin, according to similarities with surface marker profiles of several known cell types. Human primary chondrosarcoma cells, articular chondrocytes, mesenchymal stem cells, fibroblasts, and a panel of tumor cell lines of chondrocytic or epithelial origin were clustered based on the expression profile of eleven surface markers. For clustering, eight hierarchical clustering algorithms, three distance metrics, as well as several approaches for data preprocessing, including multivariate outlier detection, logarithmic transformation, and z-score normalization, were systematically evaluated. By selecting clustering approaches shown to give reproducible results for cluster recovery of known cell types, primary conventional central chondrosarcoma cells could be grouped in two main clusters with distinctive marker expression signatures: one group clustering together with mesenchymal stem cells (CD49b-high/CD10-low/CD221-high) and a second group clustering close to fibroblasts (CD49b-low/CD10-high/CD221-low). Hierarchical clustering also revealed substantial differences between primary conventional central chondrosarcoma cells and established chondrosarcoma cell lines, with the latter not only segregating apart from primary tumor cells and normal tissue cells, but also clustering together with cell lines of epithelial lineage. Our study provides a foundation for the use of hierarchical clustering applied to flow cytometry data as a powerful tool to classify samples according to marker expression patterns, which could help uncover new cancer subtypes.
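The preprocessing and clustering steps named in this abstract can be illustrated with a minimal sketch. It assumes z-score normalization followed by average-linkage agglomerative clustering with Euclidean distance, which is only one of the many combinations the authors say they evaluated. The marker intensities and the function names `zscore_columns` and `average_linkage` are hypothetical and do not come from the study.

```python
import math
from itertools import combinations

def zscore_columns(rows):
    """Standardize each column (marker) to mean 0 and population SD 1.
    Assumes no column is constant, otherwise the SD would be zero."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def average_linkage(points, n_clusters):
    """Repeatedly merge the two clusters with the smallest mean pairwise
    Euclidean distance until n_clusters remain. Returns lists of row indices."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical intensities for three markers in six samples (invented data).
samples = [
    [900, 50, 800], [950, 60, 780], [880, 40, 820],
    [100, 700, 90], [120, 750, 110], [90, 720, 100],
]
groups = average_linkage(zscore_columns(samples), n_clusters=2)
print(groups)
```

Normalizing first keeps a marker with a wide intensity range from dominating the distance. For real data, a library routine for hierarchical clustering would be a more practical choice than this quadratic-time loop.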
Does published orthodontic research account for clustering effects during statistical data analysis?
Abstract:
In orthodontics, multiple site observations within patients or multiple observations collected at consecutive time points are often encountered. Clustered designs require larger sample sizes compared to individual randomized trials and special statistical analyses that account for the fact that observations within clusters are correlated. It is the purpose of this study to assess to what degree clustering effects are considered during design and data analysis in the three major orthodontic journals. The contents of the most recent 24 issues of the American Journal of Orthodontics and Dentofacial Orthopedics (AJODO), Angle Orthodontist (AO), and European Journal of Orthodontics (EJO) from December 2010 backwards were hand searched. Articles with clustering effects were identified, along with whether the authors accounted for them. Additionally, information was collected on: involvement of a statistician, single or multicenter study, number of authors in the publication, geographical area, and statistical significance. From the 1584 articles, after exclusions, 1062 were assessed for clustering effects, of which 250 (23.5 per cent) were considered to have clustering effects in the design (kappa = 0.92, 95 per cent CI: 0.67-0.99 for inter-rater agreement). Of the studies with clustering effects, only 63 (25.20 per cent) indicated that they had accounted for them. There was evidence that the studies published in the AO have higher odds of accounting for clustering effects [AO versus AJODO: odds ratio (OR) = 2.17, 95 per cent confidence interval (CI): 1.06-4.43, P = 0.03; EJO versus AJODO: OR = 1.90, 95 per cent CI: 0.84-4.24, non-significant; and EJO versus AO: OR = 1.15, 95 per cent CI: 0.57-2.33, non-significant]. The results of this study indicate that only about a quarter of the studies with clustering effects account for this in statistical data analysis.
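The figures reported in this abstract can be checked with a few lines of arithmetic. The sketch assumes the confidence interval is a Wald interval, symmetric on the log-odds scale, which the abstract does not state. The helper name `p_from_or_ci` is hypothetical.

```python
import math

def p_from_or_ci(or_hat, lo, hi, z_crit=1.96):
    """Recover the standard error of log(OR) from a 95% CI, assuming a
    Wald interval on the log scale, and return (se, z, two-sided p)."""
    se = (math.log(hi) - math.log(lo)) / (2 * z_crit)
    z = math.log(or_hat) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return se, z, p

# AO versus AJODO, as reported: OR = 2.17, 95% CI 1.06-4.43.
se, z, p = p_from_or_ci(2.17, 1.06, 4.43)

# Proportions reported in the abstract.
share_clustered = 250 / 1062   # articles judged to have clustering effects
share_accounted = 63 / 250     # of those, articles that accounted for them

print(f"p = {p:.3f}")
print(f"{share_clustered:.1%} clustered, {share_accounted:.1%} accounted for")
```

Under that assumption, the recovered p-value is consistent with the reported P = 0.03, and both proportions match the percentages in the text.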
Abstract:
Publication bias and related bias in meta-analysis is often examined by visually checking for asymmetry in funnel plots of treatment effect against its standard error. Formal statistical tests of funnel plot asymmetry have been proposed, but when applied to binary outcome data these can give false-positive rates that are higher than the nominal level in some situations (large treatment effects, or few events per trial, or all trials of similar sizes). We develop a modified linear regression test for funnel plot asymmetry based on the efficient score and its variance, Fisher's information. The performance of this test is compared to the other proposed tests in simulation analyses based on the characteristics of published controlled trials. When there is little or no between-trial heterogeneity, this modified test has a false-positive rate close to the nominal level while maintaining similar power to the original linear regression test ('Egger' test). When the degree of between-trial heterogeneity is large, none of the tests that have been proposed has uniformly good properties.
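As a point of reference, the original linear regression ('Egger') test mentioned in this abstract regresses each trial's standardized effect (effect divided by its standard error) on its precision (one over the standard error). An intercept far from zero suggests funnel plot asymmetry. The sketch below shows only that baseline regression. It is not the modified score-based test the abstract proposes, it omits the significance test on the intercept, and the trial data and the function name `egger_regression` are hypothetical.

```python
def egger_regression(effects, std_errors):
    """Ordinary least squares of effect/SE on 1/SE.
    Returns (intercept, slope); the intercept measures asymmetry and the
    slope estimates the underlying effect."""
    y = [e / s for e, s in zip(effects, std_errors)]
    x = [1 / s for s in std_errors]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

std_errors = [0.1, 0.2, 0.3, 0.4, 0.5]      # hypothetical trials, large to small

# Symmetric case: every trial estimates the same log odds ratio.
sym_intercept, sym_slope = egger_regression([-0.5] * 5, std_errors)

# Asymmetric case: smaller trials (larger SE) report larger effects.
asym_intercept, asym_slope = egger_regression(
    [-0.2, -0.3, -0.5, -0.8, -1.2], std_errors)

print(sym_intercept, sym_slope)
print(asym_intercept, asym_slope)
```

In the symmetric case the intercept is zero up to rounding and the slope recovers the common effect. In the asymmetric case the intercept moves away from zero.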
Abstract:
OBJECTIVES: This paper is concerned with checking goodness-of-fit of binary logistic regression models. For the practitioners of data analysis, the broad classes of procedures for checking goodness-of-fit available in the literature are described. The challenges of model checking in the context of binary logistic regression are reviewed. As a viable solution, a simple graphical procedure for checking goodness-of-fit is proposed. METHODS: The graphical procedure proposed relies on pieces of information available from any logistic analysis; the focus is on combining and presenting these in an informative way. RESULTS: The information gained using this approach is presented with three examples. In the discussion, the proposed method is put into context and compared with other graphical procedures for checking goodness-of-fit of binary logistic models available in the literature. CONCLUSION: A simple graphical method can significantly improve the understanding of any logistic regression analysis and help to prevent faulty conclusions.
Abstract:
Well-known data mining algorithms rely on inputs in the form of pairwise similarities between objects. For large datasets it is computationally impossible to perform all pairwise comparisons. We therefore propose a novel approach that uses approximate Principal Component Analysis to efficiently identify groups of similar objects. The effectiveness of the approach is demonstrated in the context of binary classification using the supervised normalized cut as a classifier. For large datasets from the UCI repository, the approach significantly improves run times with minimal loss in accuracy.
Abstract:
BACKGROUND: The aim of the present study was to evaluate the feasibility of using a telephone survey to gain an understanding of the possible herd and management factors influencing the performance (i.e. safety and efficacy) of a vaccine against porcine circovirus type 2 (PCV2) in a large number of herds and to estimate customers' satisfaction. RESULTS: Datasets from 227 pig herds that currently applied or had applied a PCV2 vaccine were analysed. Since 1-, 2- and 3-site production systems were surveyed, the herds were allocated to one of two subsets, where only applicable variables out of 180 were analysed. Group 1 comprised herds with sows, suckling pigs and nursery pigs, whereas herds in Group 2 in all cases kept fattening pigs. Overall, 14 variables evaluating the subjective satisfaction with one particular PCV2 vaccine were combined into an abstract dependent variable for further models, which was characterized by a binary outcome from a cluster analysis: good/excellent satisfaction (green cluster) and moderate satisfaction (red cluster). The other 166 variables, comprising information about diagnostics, vaccination, housing, and management, were considered as independent variables. In Group 1, herds using the vaccine due to recognised PCV2-related health problems (wasting, mortality or porcine dermatitis and nephropathy syndrome) had a 2.4-fold increased chance (1/OR) of belonging to the green cluster. In the final model for Group 1, the diagnosis of diseases other than PCV2, the reason for vaccine administration being other than PCV2-associated diseases and using a single injection of iron had significant influence on allocation to the green cluster (P < 0.05). In Group 2, only unchanged time or delay of time of vaccination influenced the satisfaction (P < 0.05). CONCLUSION: The methodology and statistical approach used in this study were feasible to scientifically assess 'satisfaction', and to determine factors influencing farmers' and vets' opinion about the safety and efficacy of a new vaccine.