28 results for Parameter tuning
at Université de Lausanne, Switzerland
Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.
Abstract:
BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, are performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.
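The nested cross-validation scheme mentioned in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy two-group data, the small kNN classifier, the grid of k values and the fold counts are hypothetical and are not the study's simulation or code. The point it shows is that the tuning parameter is chosen in an inner loop that only sees the outer training data, so the outer test fold is never used for tuning.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k nearest training points."""
    nearest = sorted(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test, k):
    """Fraction of test points classified correctly by kNN fitted on train."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def folds(data, n_folds):
    """Yield (train, test) splits; fold i holds out every n_folds-th sample."""
    for i in range(n_folds):
        test = data[i::n_folds]
        train = [p for j, p in enumerate(data) if j % n_folds != i]
        yield train, test

def nested_cv(data, k_grid=(1, 3, 5), n_outer=5, n_inner=4):
    outer_scores = []
    for outer_train, outer_test in folds(data, n_outer):
        # Inner loop: choose k using the outer training data only.
        def inner_score(k):
            return sum(accuracy(tr, te, k) for tr, te in folds(outer_train, n_inner)) / n_inner
        best_k = max(k_grid, key=inner_score)
        # Outer loop: score the tuned classifier on data never used for tuning.
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return sum(outer_scores) / n_outer

rng = random.Random(0)
# Toy data: two Gaussian groups in two dimensions (hypothetical, no batch effect).
data = [((rng.gauss(g * 2.0, 1.0), rng.gauss(g * 2.0, 1.0)), g) for g in (0, 1) for _ in range(30)]
rng.shuffle(data)
estimate = nested_cv(data)
print(f"nested CV accuracy estimate: {estimate:.2f}")
```

To probe the kind of bias the abstract describes, one could add a batch offset that is confounded with the group label and compare this estimate with the accuracy on independently generated data.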
Abstract:
A recurring task in the analysis of mass genome annotation data from high-throughput technologies is the identification of peaks or clusters in a noisy signal profile. Examples of such applications are the definition of promoters on the basis of transcription start site profiles, the mapping of transcription factor binding sites based on ChIP-chip data and the identification of quantitative trait loci (QTL) from whole genome SNP profiles. Input to such an analysis is a set of genome coordinates associated with counts or intensities. The output consists of a discrete number of peaks with respective volumes, extensions and center positions. We have developed for this purpose a flexible one-dimensional clustering tool, called MADAP, which we make available as a web server and as standalone program. A set of parameters enables the user to customize the procedure to a specific problem. The web server, which returns results in textual and graphical form, is useful for small to medium-scale applications, as well as for evaluation and parameter tuning in view of large-scale applications, requiring a local installation. The program written in C++ can be freely downloaded from ftp://ftp.epd.unil.ch/pub/software/unix/madap. The MADAP web server can be accessed at http://www.isrec.isb-sib.ch/madap/.
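The input and output described in the abstract above (genome coordinates with counts in, peaks with volumes, extensions and center positions out) can be sketched as follows. This is not MADAP's algorithm, which the abstract does not specify; it is a simple gap-based grouping used purely for illustration, and the function names, the `max_gap` parameter and the sample signal are hypothetical.

```python
def cluster_1d(signal, max_gap):
    """Group (position, count) pairs into clusters separated by gaps larger than max_gap."""
    clusters, current = [], []
    for pos, count in sorted(signal):
        if current and pos - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append((pos, count))
    if current:
        clusters.append(current)
    return clusters

def summarize(cluster):
    """Report volume, extension (start, end) and count-weighted center of one cluster.

    Assumes all counts are positive, so the volume is never zero.
    """
    volume = sum(c for _, c in cluster)
    center = sum(p * c for p, c in cluster) / volume
    return {"volume": volume, "start": cluster[0][0], "end": cluster[-1][0], "center": center}

# Hypothetical signal: (genome coordinate, count) pairs.
signal = [(100, 2), (102, 5), (103, 3), (250, 1), (252, 4), (254, 1)]
peaks = [summarize(c) for c in cluster_1d(signal, max_gap=10)]
for peak in peaks:
    print(peak)
```

Here `max_gap` plays the role of a tuning parameter: a larger value merges neighboring peaks, a smaller one splits them.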
Abstract:
This thesis aims to investigate the extremal properties of certain risk models of interest in various applications from insurance, finance and statistics. This thesis develops along two principal lines, namely: In the first part, we focus on two univariate risk models, i.e., deflated risk and reinsurance risk models. Therein we investigate their tail expansions under certain tail conditions of the common risks. Our main results are illustrated by some typical examples and numerical simulations as well. Finally, the findings are formulated into some applications in insurance fields, for instance, the approximations of Value-at-Risk, conditional tail expectations etc. The second part of this thesis is devoted to the following three bivariate models: The first model is concerned with bivariate censoring of extreme events. For this model, we first propose a class of estimators for both tail dependence coefficient and tail probability. These estimators are flexible due to a tuning parameter and their asymptotic distributions are obtained under some second order bivariate slowly varying conditions of the model. Then, we give some examples and present a small Monte Carlo simulation study followed by an application on a real-data set from insurance. The objective of our second bivariate risk model is the investigation of tail dependence coefficient of bivariate skew slash distributions. Such skew slash distributions are extensively useful in statistical applications and they are generated mainly by normal mean-variance mixture and scaled skew-normal mixture, which distinguish the tail dependence structure as shown by our principal results.
The third bivariate risk model is concerned with the approximation of the component-wise maxima of skew elliptical triangular arrays. The theoretical results are based on certain tail assumptions on the underlying random radius.
Abstract:
PURPOSE: All kinds of blood manipulations aim to increase the total hemoglobin mass (tHb-mass). To establish tHb-mass as an effective screening parameter for detecting blood doping, the knowledge of its normal variation over time is necessary. The aim of the present study, therefore, was to determine the intraindividual variance of tHb-mass in elite athletes during a training year emphasizing off, training, and race seasons at sea level. METHODS: tHb-mass and hemoglobin concentration ([Hb]) were determined in 24 endurance athletes five times during a year and were compared with a control group (n = 6). An analysis of covariance was used to test the effects of training phases, age, gender, competition level, body mass, and training volume. Three error models, based on 1) a total percentage error of measurement, 2) the combination of a typical percentage error (TE) of analytical origin with an absolute SD of biological origin, and 3) between-subject and within-subject variance components as obtained by an analysis of variance, were tested. RESULTS: In addition to the expected influence of performance status, the main results were that the effects of training volume (P = 0.20) and training phases (P = 0.81) on tHb-mass were not significant. We found that within-subject variations mainly have an analytical origin (TE approximately 1.4%) and a very small SD (7.5 g) of biological origin. CONCLUSION: tHb-mass shows very low individual oscillations during a training year (<6%), and these oscillations are below the expected changes in tHb-mass due to erythropoietin (EPO) application or blood infusion (approximately 10%). The high stability of tHb-mass over a period of 1 year suggests that it should be included in an athlete's biological passport and analyzed by recently developed probabilistic inference techniques that define subject-based reference ranges.
Abstract:
X-ray is a technology that is used for numerous applications in the medical field. The process of X-ray projection gives a 2-dimensional (2D) grey-level texture from a 3-dimensional (3D) object. Until now no clear demonstration or correlation has positioned 2D texture analysis as a valid indirect evaluation of the 3D microarchitecture. TBS is a new texture parameter based on the measure of the experimental variogram. TBS evaluates the variation between 2D image grey-levels. The aim of this study was to evaluate existing correlations between 3D bone microarchitecture parameters, evaluated from μCT reconstructions, and the TBS value, calculated on 2D projected images. 30 dried human cadaveric vertebrae were acquired on a micro-scanner (eXplorer Locus, GE) at an isotropic resolution of 93 μm. 3D vertebral body models were used. The following 3D microarchitecture parameters were used: bone volume fraction (BV/TV), trabecular thickness (TbTh), trabecular space (TbSp), trabecular number (TbN) and connectivity density (ConnD). 3D/2D projections were computed by taking into account the Beer-Lambert law at X-ray energies of 50, 100 and 150 keV. TBS was assessed on 2D projected images. Correlations between TBS and the 3D microarchitecture parameters were evaluated using a linear regression analysis. A paired t-test was used to assess the X-ray energy effects on TBS. Multiple linear regressions (backward) were used to evaluate relationships between TBS and 3D microarchitecture parameters using a bootstrap process. BV/TV of the sample ranged from 18.5 to 37.6% with an average value of 28.8%. Correlation analysis showed that TBS was strongly correlated with ConnD (0.856≤r≤0.862; p<0.001), with TbN (0.805≤r≤0.810; p<0.001) and negatively with TbSp (−0.714≤r≤−0.726; p<0.001), regardless of X-ray energy. Results show that lower TBS values are related to "degraded" microarchitecture, with low ConnD, low TbN and a high TbSp. The opposite is also true.
X-ray energy has no effect on TBS or on the correlations between TBS and the 3D microarchitecture parameters. In this study, we demonstrated that TBS was significantly correlated with the 3D microarchitecture parameters ConnD and TbN, and negatively with TbSp, regardless of the X-ray energy used. This article is part of a Special Issue entitled ECTS 2011. Disclosure of interest: None declared.
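The two ingredients named in the abstract above, a Beer-Lambert 3D-to-2D projection and an experimental variogram of the resulting grey levels, can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny binary volume, the attenuation coefficient `mu` and the function names are hypothetical, and the sketch does not reproduce how TBS itself is derived from the variogram.

```python
import math

def project(volume, mu, voxel_size):
    """Beer-Lambert projection along the first axis.

    Each pixel is the transmitted fraction I/I0 = exp(-mu * path length through bone),
    where volume[z][r][c] is 1 for bone and 0 otherwise.
    """
    depth, rows, cols = len(volume), len(volume[0]), len(volume[0][0])
    return [[math.exp(-mu * voxel_size * sum(volume[z][r][c] for z in range(depth)))
             for c in range(cols)] for r in range(rows)]

def variogram(image, lag):
    """Experimental variogram at a horizontal lag: half the mean squared grey-level difference."""
    diffs = [(row[c + lag] - row[c]) ** 2 for row in image for c in range(len(row) - lag)]
    return sum(diffs) / (2 * len(diffs))

# Hypothetical 2 x 2 x 3 binary volume (depth x rows x columns).
volume = [[[1, 0, 1], [0, 0, 1]],
          [[1, 1, 0], [0, 0, 1]]]
# mu (per mm) is an assumed value; 0.093 mm matches the 93 um voxel size in the abstract.
image = project(volume, mu=0.5, voxel_size=0.093)
print(image)
print(variogram(image, lag=1))
```

A flat image gives a variogram of zero at every lag, while a strongly textured projection gives larger values at short lags.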
Abstract:
In the context of Systems Biology, computer simulations of gene regulatory networks provide a powerful tool to validate hypotheses and to explore possible system behaviors. Nevertheless, modeling a system poses some challenges of its own: especially the step of model calibration is often difficult due to insufficient data. For example when considering developmental systems, mostly qualitative data describing the developmental trajectory is available while common calibration techniques rely on high-resolution quantitative data. Focusing on the calibration of differential equation models for developmental systems, this study investigates different approaches to utilize the available data to overcome these difficulties. More specifically, the fact that developmental processes are hierarchically organized is exploited to increase convergence rates of the calibration process as well as to save computation time. Using a gene regulatory network model for stem cell homeostasis in Arabidopsis thaliana the performance of the different investigated approaches is evaluated, documenting considerable gains provided by the proposed hierarchical approach.
Abstract:
Part I of this series of articles focused on the construction of graphical probabilistic inference procedures, at various levels of detail, for assessing the evidential value of gunshot residue (GSR) particle evidence. The proposed models - in the form of Bayesian networks - address the issues of background presence of GSR particles, analytical performance (i.e., the efficiency of evidence searching and analysis procedures) and contamination. The use and practical implementation of Bayesian networks for case pre-assessment is also discussed. This paper, Part II, concentrates on Bayesian parameter estimation. This topic complements Part I in that it offers means for producing estimates useable for the numerical specification of the proposed probabilistic graphical models. Bayesian estimation procedures are given a primary focus of attention because they allow the scientist to combine (his/her) prior knowledge about the problem of interest with newly acquired experimental data. The present paper also considers further topics such as the sensitivity of the likelihood ratio due to uncertainty in parameters and the study of likelihood ratio values obtained for members of particular populations (e.g., individuals with or without exposure to GSR).
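The idea of combining prior knowledge with newly acquired experimental data, as described in the abstract above, can be illustrated with the standard conjugate Beta-binomial update. This is a generic textbook sketch, not the estimation procedure of the paper; the prior, the counts and the function names are hypothetical.

```python
def beta_posterior(prior_a, prior_b, successes, trials):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + s, b + n - s)."""
    return prior_a + successes, prior_b + (trials - successes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical numbers: uniform Beta(1, 1) prior on a background-presence probability,
# then 3 positive findings among 50 sampled individuals.
a, b = beta_posterior(1, 1, successes=3, trials=50)
posterior_mean = beta_mean(a, b)
print(a, b, round(posterior_mean, 4))
```

A more informative prior (larger `prior_a + prior_b`) pulls the posterior mean toward the prior, which is one simple way to examine how sensitive a downstream likelihood ratio is to parameter uncertainty.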
Abstract:
Biochemical systems are commonly modelled by systems of ordinary differential equations (ODEs). A particular class of such models called S-systems have recently gained popularity in biochemical system modelling. The parameters of an S-system are usually estimated from time-course profiles. However, finding these estimates is a difficult computational problem. Moreover, although several methods have been recently proposed to solve this problem for ideal profiles, relatively little progress has been reported for noisy profiles. We describe a special feature of a Newton-flow optimisation problem associated with S-system parameter estimation. This enables us to significantly reduce the search space, and also lends itself to parameter estimation for noisy data. We illustrate the applicability of our method by applying it to noisy time-course data synthetically produced from previously published 4- and 30-dimensional S-systems. In addition, we propose an extension of our method that allows the detection of network topologies for small S-systems. We introduce a new method for estimating S-system parameters from time-course profiles. We show that the performance of this method compares favorably with competing methods for ideal profiles, and that it also allows the determination of parameters for noisy profiles.
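For readers unfamiliar with the model class in the abstract above, an S-system has the power-law form dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij. The sketch below simulates a small system of this form with forward Euler; the two-variable parameter set is hypothetical (chosen so that the steady state is X = (2, 2)) and the code does not implement the paper's Newton-flow estimation method.

```python
import math

def s_system_rhs(x, alpha, beta, g, h):
    """dX_i/dt = alpha_i * prod_j X_j**g_ij - beta_i * prod_j X_j**h_ij."""
    n = len(x)
    return [alpha[i] * math.prod(x[j] ** g[i][j] for j in range(n))
            - beta[i] * math.prod(x[j] ** h[i][j] for j in range(n))
            for i in range(n)]

def simulate(x0, alpha, beta, g, h, dt=0.01, steps=1000):
    """Forward-Euler integration; returns the trajectory as a list of states."""
    traj = [list(x0)]
    for _ in range(steps):
        x = traj[-1]
        dx = s_system_rhs(x, alpha, beta, g, h)
        traj.append([xi + dt * di for xi, di in zip(x, dx)])
    return traj

# Hypothetical 2-dimensional S-system (not one of the published benchmark systems).
alpha, beta = [2.0, 1.0], [1.0, 1.0]
g = [[0.0, -0.5], [0.5, 0.0]]   # kinetic orders of the production terms
h = [[0.5, 0.0], [0.0, 0.5]]    # kinetic orders of the degradation terms
trajectory = simulate([1.0, 1.0], alpha, beta, g, h, dt=0.01, steps=3000)
final_state = trajectory[-1]
print(final_state)
```

A time-course profile for parameter estimation would be a sampled, possibly noise-perturbed version of such a trajectory; the estimation task is to recover `alpha`, `beta`, `g` and `h` from it.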
Abstract:
Abnormal adipokine production, along with defective uptake and metabolism of glucose within adipocytes, contributes to insulin resistance and altered glucose homeostasis. Recent research has highlighted one of the mechanisms that accounts for impaired production of adiponectin (ADIPOQ) and adipocyte glucose uptake in obesity. In adipocytes of obese human subjects and of mice fed a high-fat diet, the level of the inducible cAMP early repressor (ICER) is diminished. Reduction of ICER elevates the cAMP response element binding protein (CREB) activity, which in turn increases the repressor activating transcription factor 3. This cascade reduces ADIPOQ and GLUT4 levels, which ultimately hampers insulin-mediated glucose uptake. The c-Jun N-terminal kinase (JNK) interacting-protein 1, also called islet brain 1 (IB1), is a target of CREB/ICER that promotes JNK-mediated insulin resistance in adipocytes. A rise in IB1 and c-Jun levels accompanies the drop of ICER in white adipose tissues of obese mice when compared with mice fed a chow diet. Beyond the expression of ADIPOQ and glucose transport, a decline in ICER expression might impact insulin signaling. Impairment of ICER is a critical issue that will need major consideration in future therapeutic strategies.
Abstract:
In Quantitative Microbial Risk Assessment, it is vital to understand how lag times of individual cells are distributed over a bacterial population. Such identified distributions can be used to predict the time by which, in a growth-supporting environment, a few pathogenic cells can multiply to a poisoning concentration level. We model the lag time of a single cell, inoculated into a new environment, by the delay of the growth function characterizing the generated subpopulation. We introduce an easy-to-implement procedure, based on the method of moments, to estimate the parameters of the distribution of single cell lag times. The advantage of the method is especially apparent for cases where the initial number of cells is small and random, and the culture is detectable only in the exponential growth phase.
Abstract:
BACKGROUND: Pneumocystis jirovecii dihydropteroate synthase (DHPS) mutations are associated with failure of prophylaxis with sulfa drugs. This retrospective study sought to better understand the geographical variation in the prevalence of these mutations. METHODS: DHPS polymorphisms in 394 clinical specimens from immunosuppressed patients who received a diagnosis of P. jirovecii pneumonia and who were hospitalized in 3 European cities were examined using polymerase chain reaction (PCR) single-strand conformation polymorphism. Demographic and clinical characteristics were obtained from patients' medical charts. RESULTS: Of the 394 patients, 79 (20%) were infected with a P. jirovecii strain harboring one or both of the previously reported DHPS mutations. The prevalence of DHPS mutations was significantly higher in Lyon than in Switzerland (33.0% vs 7.5%; P < .001). The proportion of patients with no evidence of sulfa exposure who harbored a mutant P. jirovecii DHPS genotype was significantly higher in Lyon than in Switzerland (29.7% vs 3.0%; P < .001). During the study period in Lyon, in contrast to the Swiss hospitals, measures to prevent dissemination of P. jirovecii from patients with P. jirovecii pneumonia were generally not implemented, and most patients received suboptimal prophylaxis, the failure of which was strictly associated with mutated P. jirovecii. Thus, nosocomial interhuman transmission of mutated strains directly or indirectly from other individuals in whom selection of mutants occurred may explain the high proportion of mutations without sulfa exposure in Lyon. CONCLUSIONS: Interhuman transmission of P. jirovecii, rather than selection pressure by sulfa prophylaxis, may play a predominant role in the geographical variation in the prevalence of the P. jirovecii DHPS mutations.
Abstract:
The paper proposes an approach aimed at detecting optimal model parameter combinations to achieve the most representative description of uncertainty in the model performance. A classification problem is posed to find the regions of good fitting models according to the values of a cost function. Support Vector Machine (SVM) classification in the parameter space is applied to decide if a forward model simulation is to be computed for a particular generated model. SVM is particularly designed to tackle classification problems in high-dimensional space in a non-parametric and non-linear way. SVM decision boundaries determine the regions that are subject to the largest uncertainty in the cost function classification, and, therefore, provide guidelines for further iterative exploration of the model space. The proposed approach is illustrated by a synthetic example of fluid flow through porous media, which features highly variable response due to the parameter values' combination.
Abstract:
The potential of type-2 fuzzy sets for managing high levels of uncertainty in the subjective knowledge of experts or in numerical information has attracted attention in control and pattern classification systems in recent years. One of the main challenges in designing a type-2 fuzzy logic system is how to estimate the parameters of the type-2 fuzzy membership function (T2MF) and the Footprint of Uncertainty (FOU) from imperfect and noisy datasets. This paper presents an automatic approach for learning and tuning Gaussian interval type-2 membership functions (IT2MFs) with application to multi-dimensional pattern classification problems. T2MFs and their FOUs are tuned according to the uncertainties in the training dataset by a combination of genetic algorithm (GA) and cross-validation techniques. In our GA-based approach, the structure of the chromosome has fewer genes than other GA methods and chromosome initialization is more precise. The proposed approach addresses the application of the interval type-2 fuzzy logic system (IT2FLS) for the problem of nodule classification in a lung Computer Aided Detection (CAD) system. The designed IT2FLS is compared with its type-1 fuzzy logic system (T1FLS) counterpart. The results demonstrate that the IT2FLS outperforms the T1FLS by more than 30% in terms of classification accuracy.
Abstract:
CD34/QBEND10 immunostaining has been assessed in 150 bone marrow biopsies (BMB) including 91 myelodysplastic syndromes (MDS), 16 MDS-related AML, 25 reactive BMB, and 18 cases where RA could neither be established nor ruled out. All cases were reviewed and classified according to the clinical and morphological FAB criteria. The percentage of CD34-positive (CD34 +) hematopoietic cells and the number of clusters of CD34+ cells in 10 HPF were determined. In most cases the CD34+ cell count was similar to the blast percentage determined morphologically. In RA, however, not only typical blasts but also less immature hemopoietic cells lying morphologically between blasts and promyelocytes were stained with CD34. The CD34+ cell count and cluster values were significantly higher in RA than in BMB with reactive changes (p<0.0001 for both), in RAEB than in RA (p=0.0006 and p=0.0189, respectively), in RAEBt than in RAEB (p=0.0001 and p=0.0038), and in MDS-AML than in RAEBt (p<0.0001 and p=0.0007). Presence of CD34+ cell clusters in RA correlated with increased risk of progression of the disease. We conclude that CD34 immunostaining in BMB is a useful tool for distinguishing RA from other anemias, assessing blast percentage in MDS cases, classifying them according to FAB, and following their evolution.