906 results for "Exploratory statistical data analysis"
Abstract:
INTRODUCTION/OBJECTIVES: Detection rates for adenoma and early colorectal cancer (CRC) are insufficient due to low compliance with invasive screening procedures such as colonoscopy. Available non-invasive screening tests unfortunately have low sensitivity and specificity. There is therefore a large unmet need for a cost-effective, reliable and non-invasive test to screen for early neoplastic and pre-neoplastic lesions. AIMS & METHODS: The objective is to develop a screening test able to detect early CRCs and adenomas. This test is based on a nucleic acid multi-gene assay performed on peripheral blood mononuclear cells (PBMCs). A colonoscopy-controlled feasibility study was conducted on 179 subjects. The first 92 subjects were used as a training set to generate a statistically significant signature. Colonoscopy revealed 21 subjects with CRC, 30 with adenoma larger than 1 cm and 41 with no neoplastic or inflammatory lesions. The second group of 48 subjects (controls, CRC and polyps) was used as a test set and will be kept blinded for the entire data analysis. To determine organ and disease specificity, 38 subjects were used: 24 with inflammatory bowel disease (IBD) and 14 with cancers other than CRC (OC). Blood samples were taken from each patient on the day of the colonoscopy and PBMCs were purified. Total RNA was extracted following standard procedures. Multiplex RT-qPCR was applied to 92 different candidate biomarkers. Different univariate and multivariate statistical methods were applied to these candidates, and 60 biomarkers with significant p-values (<0.01) were selected. These biomarkers are involved in several different biological functions such as cellular movement, cell signaling and interaction, tissue and cellular development, cancer, and cell growth and proliferation. Two distinct biomarker signatures, named COLOX CRC and COLOX POL respectively, are used to separate patients without lesions from those with cancer or with adenoma. COLOX performance was validated using a random resampling method, the bootstrap. RESULTS: The COLOX CRC and POL tests successfully separate patients without lesions from those with CRC (Se 67%, Sp 93%, AUC 0.87) and from those with adenoma larger than 1 cm (Se 63%, Sp 83%, AUC 0.77), respectively. 6/24 patients in the IBD group and 1/14 patients in the OC group had a positive COLOX CRC test. CONCLUSION: The two COLOX tests demonstrated high sensitivity and specificity for detecting CRCs and adenomas larger than 1 cm. A prospective, multicenter, pivotal study is underway to confirm these promising results in a larger cohort.
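The reported performance figures (Se, Sp, AUC) were validated by bootstrap resampling. Below is a minimal sketch of how such a bootstrap validation could be set up, using a synthetic biomarker matrix and a placeholder logistic-regression classifier rather than the actual COLOX signature or study data:

```python
# Hypothetical sketch of bootstrap validation of a biomarker classifier
# (synthetic data; not the COLOX pipeline itself).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(92, 60))     # 92 training subjects x 60 selected biomarkers (placeholder values)
y = rng.integers(0, 2, size=92)   # 1 = lesion (CRC or adenoma), 0 = no lesion

aucs, sens, specs = [], [], []
for _ in range(200):                                # bootstrap iterations
    idx = resample(np.arange(len(y)))               # sample subjects with replacement
    oob = np.setdiff1d(np.arange(len(y)), idx)      # out-of-bag subjects for evaluation
    if len(np.unique(y[idx])) < 2 or len(np.unique(y[oob])) < 2:
        continue
    clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    prob = clf.predict_proba(X[oob])[:, 1]
    pred = (prob >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y[oob], pred).ravel()
    aucs.append(roc_auc_score(y[oob], prob))
    sens.append(tp / (tp + fn))
    specs.append(tn / (tn + fp))

print(f"AUC {np.mean(aucs):.2f}, Se {np.mean(sens):.2f}, Sp {np.mean(specs):.2f}")
```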
Abstract:
SUMMARY: Eukaryotic DNA interacts with nuclear proteins through non-covalent ionic interactions. Proteins can recognize specific nucleotide sequences through steric interactions with the DNA, and these specific protein-DNA interactions are the basis for many nuclear processes, e.g. gene transcription, chromosomal replication, and recombination. A new technology termed ChIP-Seq has recently been developed for the analysis of protein-DNA interactions on a whole-genome scale; it is based on chromatin immunoprecipitation followed by high-throughput DNA sequencing. ChIP-Seq is a novel technique with great potential to replace older techniques for mapping protein-DNA interactions. In this thesis, we bring some new insights into ChIP-Seq data analysis. First, we point out some common and previously unrecognized artifacts of the method. The distribution of sequence tags in the genome is not uniform, and we found extreme hot-spots of tag accumulation over specific loci in the human and mouse genomes. These artifactual tag accumulations create false peaks in every ChIP-Seq dataset, and we propose different filtering methods to reduce the number of false positives. Next, we propose random sampling as a powerful analytical tool for ChIP-Seq data analysis that can be used to infer biological knowledge from massive ChIP-Seq datasets. We created an unbiased random sampling algorithm and used this methodology to reveal some important biological properties of Nuclear Factor I (NFI) DNA-binding proteins. Finally, by analyzing the ChIP-Seq data in detail, we revealed that Nuclear Factor I transcription factors mainly act as activators of transcription and are associated with specific chromatin modifications that are markers of open chromatin. We speculate that NFI factors only interact with DNA wrapped around the nucleosome. We also found multiple loci that indicate possible chromatin barrier activity of NFI proteins, which could suggest the use of NFI binding sequences as chromatin insulators in biotechnology applications.
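Two of the analytical ideas above, filtering artifactual tag hot-spots and unbiased random sampling of tags, can be sketched generically as follows. The bin size, the Poisson background model and the cutoff are illustrative assumptions, not the thesis' actual filtering criteria:

```python
# Minimal sketch: flag artifactual hot-spot bins and randomly subsample tags.
# Bin size, threshold and data are placeholder assumptions, not the thesis pipeline.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)
tag_positions = rng.integers(0, 1_000_000, size=50_000)   # toy tag coordinates on one chromosome

bin_size = 1_000
counts = np.bincount(tag_positions // bin_size)

# Hot-spot filter: bins whose counts are wildly improbable under a Poisson background.
lam = counts.mean()
threshold = poisson.ppf(1 - 1e-6, lam)      # 1e-6 tail probability, chosen arbitrarily here
hotspots = np.where(counts > threshold)[0]
keep = ~np.isin(tag_positions // bin_size, hotspots)
filtered_tags = tag_positions[keep]

# Unbiased random sampling: draw a fixed-size subset of tags without replacement,
# e.g. to compare datasets sequenced to different depths.
sample = rng.choice(filtered_tags, size=10_000, replace=False)
print(len(filtered_tags), len(sample), len(hotspots))
```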
Abstract:
The aim of this study was to analyze the associations of plasma aldosterone and plasma renin activity with the metabolic syndrome and each of its components. We analyzed data from a family-based study in the Seychelles comprising 356 participants (160 men and 196 women) from 69 families of African descent. In multivariable models, plasma aldosterone was associated positively (P < 0.05) with blood pressure in older individuals (interaction with age, P < 0.05) and with waist circumference in men (interaction with sex, P < 0.05), and negatively with high-density lipoprotein cholesterol, in particular in individuals with elevated urinary potassium excretion (interaction with urinary potassium, P < 0.05); plasma renin activity was significantly associated with triglycerides and fasting blood glucose. Plasma aldosterone, but not plasma renin activity, was associated with the metabolic syndrome per se, independently of the association with its separate components. The observation that plasma renin activity was associated with some components of the metabolic syndrome, whereas plasma aldosterone was associated with other components, suggests different underlying mechanisms. These findings reinforce previous observations suggesting that aldosterone is associated with several cardiovascular risk factors and also suggest that aldosterone might contribute to the increased cardiovascular disease risk in individuals of African descent with the metabolic syndrome.
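The reported interactions (aldosterone x age, aldosterone x sex, aldosterone x urinary potassium) correspond to product terms in a multivariable regression. The following is a hedged sketch on synthetic data with statsmodels; the variable names and the single interaction shown are illustrative, and the real family-based analysis would additionally need to account for family clustering:

```python
# Illustrative interaction model on synthetic data (not the Seychelles dataset).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 356
df = pd.DataFrame({
    "aldosterone": rng.lognormal(mean=4, sigma=0.5, size=n),
    "age": rng.uniform(25, 75, size=n),
    "sex": rng.choice(["M", "F"], size=n),
    "waist": rng.normal(90, 12, size=n),
})
# Toy outcome: systolic blood pressure with an aldosterone-by-age interaction built in.
df["sbp"] = (110 + 0.02 * df.aldosterone + 0.3 * df.age
             + 0.001 * df.aldosterone * df.age + rng.normal(0, 8, size=n))

# The product term 'aldosterone:age' tests whether the aldosterone effect grows with age.
model = smf.ols("sbp ~ aldosterone * age + C(sex) + waist", data=df).fit()
print(model.summary().tables[1])
```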
Abstract:
BACKGROUND: Sudden cardiac death (SCD) among the young is a rare and devastating event, but its exact incidence in many countries remains unknown. An autopsy is recommended in every case because some of the cardiac pathologies may have a genetic origin, which can have an impact on living family members. The aims of this retrospective study, completed in the canton of Vaud, Switzerland, were to determine both the incidence of SCD and the autopsy rate for individuals from 5 to 39 years of age. METHODS: The study was conducted from 2000 to 2007 on the basis of official statistics and analysis of the International Classification of Diseases codes for potential SCDs and other deaths that might have been due to cardiac disease. RESULTS: During the 8-year study period there was an average of 292'546 persons aged 5-39 and a total of 1122 deaths, certified as potential SCDs in 3.6% of cases. The calculated incidence is 1.71/100'000 person-years (2.73 for men and 0.69 for women). If all possible cases of SCD (unexplained deaths, drowning, traffic accidents, etc.) are included, the incidence increases to 13.67/100'000 person-years. However, the quality of the officially available data was insufficient to provide an accurate incidence of SCD as well as autopsy rates. The presumed autopsy rate for sudden deaths classified as diseases of the circulatory system is 47.5%. For deaths of unknown cause (11.1% of the deaths), an autopsy was conducted in 13.7% of cases according to the codified data. CONCLUSIONS: The incidence of presumed SCD in the canton of Vaud, Switzerland, is comparable to the data published for other geographic regions, but may be underestimated because it does not take into account other potential SCDs, such as unexplained deaths. Increasing the autopsy rate for SCD in the young, better management of information obtained from autopsies, and the development of a structured registry could improve the reliability of the statistical data, optimize diagnostic procedures, and improve preventive measures for family members.
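The headline incidence follows directly from the counts given above, with average population x 8 study years as the person-time denominator; the short calculation below reproduces it:

```python
# Reproducing the reported incidence from the abstract's own figures.
avg_population = 292_546          # persons aged 5-39
study_years = 8
person_years = avg_population * study_years

total_deaths = 1122
presumed_scd = round(total_deaths * 0.036)   # 3.6% certified as potential SCD

incidence = presumed_scd / person_years * 100_000
print(f"{incidence:.2f} per 100'000 person-years")   # ~1.71, matching the abstract
```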
Abstract:
Background and purpose: Decision making (DM) has been defined as the process through which a person forms preferences, selects and executes actions, and evaluates the outcome related to a selected choice. This ability is an important factor for adequate behaviour in everyday life. DM impairment in multiple sclerosis (MS) has been reported previously. The purpose of the present study was to assess DM in patients with MS at the earliest clinically detectable time point of the disease. Methods: Patients with definite (n=109) or possible (clinically isolated syndrome, CIS; n=56) MS, a short disease duration (mean 2.3 years) and minor neurological disability (mean EDSS 1.8) were compared with 50 healthy controls aged 18 to 60 years (mean age 32.2) using the Iowa Gambling Task (IGT). Subjects had to select a card from any of 4 decks (A/B [disadvantageous]; C/D [advantageous]). The game consisted of 100 trials, grouped into blocks of 20 cards for data analysis. Skill in DM was assessed by means of a learning index (LI), defined as the difference between the average of the last three block indexes and the average of the first two block indexes (LI = [(BI3 + BI4 + BI5)/3 - (BI1 + BI2)/2]). Non-parametric tests were used for statistical analysis. Results: LI was higher in the control group (0.24, SD 0.44) than in the MS group (0.21, SD 0.38), although without reaching statistical significance (p=0.7). Interesting differences were detected when MS patients were grouped according to phenotype. A trend toward a difference between MS subgroups and controls was observed for LI (p=0.06), which became significant between MS subgroups (p=0.03). CIS patients whose MS diagnosis was confirmed by a second relapse after study entry showed a dysfunction in the IGT in comparison with the other CIS (p=0.01) and definite MS (p=0.04) patients. In contrast, CIS patients who did not entirely fulfil the McDonald criteria at inclusion and had no relapse during the study showed a normal learning pattern on the IGT. Finally, comparing MS patients who developed relapses after study entry, those who remained clinically stable, and controls, we observed impaired performance only in relapsing patients in comparison with stable patients (p=0.008) and controls (p=0.03). Discussion: These results suggest a role for both MS relapse activity and disease heterogeneity (i.e. subclinical severity or activity of MS) in the impairment of decision making.
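The learning index is a simple difference of block means, as the following sketch makes explicit; the block indexes in the example are made up:

```python
# Learning index LI = mean(BI3, BI4, BI5) - mean(BI1, BI2), as defined in the abstract.
def learning_index(block_indexes):
    """block_indexes: the five IGT block indexes BI1..BI5."""
    bi1, bi2, bi3, bi4, bi5 = block_indexes
    return (bi3 + bi4 + bi5) / 3 - (bi1 + bi2) / 2

# Made-up example: a participant who gradually shifts toward the advantageous decks.
print(learning_index([-0.2, 0.0, 0.2, 0.3, 0.4]))   # 0.40
```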
Abstract:
This paper describes the development of an analytical technique for arsenic analysis based on genetically modified bioreporter bacteria bearing a gene encoding a green fluorescent protein (gfp). Upon exposure to arsenic (as aqueous arsenite), the bioreporters' production of the fluorescent reporter molecule is monitored spectroscopically. We compared the response measured as a function of time and concentration by steady-state fluorimetry (SSF) to that measured by epi-fluorescent microscopy (EFM). SSF is a bulk technique and as such inherently yields less information, whereas EFM monitors the response of many individual cells simultaneously, so data can be processed in terms of population averages or subpopulations. For the bioreporter strain used here, as well as for the literature we cite, the two techniques exhibit similar performance characteristics. The results presented here show that the EFM technique competes with SSF and shows substantially more promise for future improvement; developing optimized methods of EFM image analysis and statistical data treatment is a matter of research interest. EFM is a conduit for understanding the dynamics of individual-cell response vs. population response, which is not only of research interest but also promising in practical terms for developing micro-scale analysis.
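To make the population-average versus subpopulation distinction concrete, here is a toy sketch on simulated per-cell fluorescence intensities; the two-subpopulation assumption and the threshold are illustrative and not the paper's image-analysis pipeline:

```python
# Toy per-cell fluorescence data: EFM-style single-cell readout vs SSF-style bulk average.
import numpy as np

rng = np.random.default_rng(3)
# Assume a responding subpopulation (70% of cells) and a non-responding one (30%).
responders = rng.normal(loc=1200, scale=150, size=700)
non_responders = rng.normal(loc=300, scale=80, size=300)
cells = np.concatenate([responders, non_responders])

bulk_like = cells.mean()           # what a bulk (SSF-like) measurement averages over
frac_on = (cells > 700).mean()     # fraction of induced cells, visible only per cell
print(f"bulk mean {bulk_like:.0f} a.u., induced fraction {frac_on:.2f}")
```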
Abstract:
This paper presents general problems and approaches for spatial data analysis using machine learning algorithms. Machine learning is a very powerful approach to adaptive data analysis, modelling and visualisation. The key feature of machine learning algorithms is that they learn from empirical data and can be used in cases where the modelled environmental phenomena are hidden, nonlinear, noisy and highly variable in space and time. Most machine learning algorithms are universal, adaptive modelling tools developed to solve the basic problems of learning from data: classification/pattern recognition, regression/mapping and probability density modelling. In the present report, some widely used machine learning algorithms, namely artificial neural networks (ANN) of different architectures and Support Vector Machines (SVM), are adapted to the analysis and modelling of geo-spatial data. Machine learning algorithms have an important advantage over traditional models of spatial statistics when problems are considered in high-dimensional geo-feature spaces, i.e. when the dimension of the space exceeds 5. Such features are usually generated, for example, from digital elevation models, remote sensing images, etc. An important extension of the models concerns the consideration of real-space constraints such as geomorphology, networks, and other natural structures. Recent developments in semi-supervised learning can improve the modelling of environmental phenomena by taking geo-manifolds into account. An important part of the study deals with the analysis of relevant variables and model inputs. This problem is approached using different nonlinear feature selection/feature extraction tools. To demonstrate the application of machine learning algorithms, several interesting case studies are considered: digital soil mapping using SVM; automatic mapping of soil and water system pollution using ANN; natural hazard risk analysis (avalanches, landslides); and assessment of renewable resources (wind fields) with SVM and ANN models. The dimensionality of the spaces considered varies from 2 to more than 30. Figures 1, 2 and 3 demonstrate some results of the studies and their outputs. Finally, the results of environmental mapping are discussed and compared with traditional geostatistical models.
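As an illustration of the kind of workflow described (kernel-based regression on high-dimensional geo-features), here is a minimal scikit-learn sketch on synthetic data; the features, kernel and hyperparameters are placeholders rather than any of the case studies' actual settings:

```python
# Sketch: SVM regression on synthetic geo-feature data
# (placeholder for features derived from DEMs, remote sensing images, etc.).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
n, n_features = 500, 12                      # e.g. x, y plus 10 derived geo-features
X = rng.normal(size=(n, n_features))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)   # nonlinear toy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
model.fit(X_tr, y_tr)
print(f"test R^2: {r2_score(y_te, model.predict(X_te)):.2f}")
```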
Abstract:
AIM: Antidoping procedures are expected to benefit greatly from untargeted metabolomic approaches through the discovery of new biomarkers of prohibited-substance abuse. RESULTS: Endogenous steroid metabolites were monitored in urine samples from a controlled elimination study of testosterone undecanoate after ingestion. A platform coupling ultra-high-pressure LC with high-resolution quadrupole TOF MS was used, and high between-subject metabolic variability was successfully handled using a multiblock data analysis strategy. Links between specific subsets of metabolites and influential genetic polymorphisms of the UGT2B17 enzyme were highlighted. CONCLUSION: This exploratory metabolomic strategy constitutes a first step toward a better understanding of the underlying patterns driving the high interindividual variability of steroid metabolism. Promising biomarkers were selected for further targeted study.
Abstract:
The geographic information system (GIS) approach permits the integration of demographic, socio-economic and environmental data, providing correlations between information from several databases. In the present work, occurrences of human and canine visceral leishmaniasis and of the insect vector (Lutzomyia longipalpis), as well as biogeographic information for the 9 areas that comprise the city of Belo Horizonte, Brazil, between April 2001 and March 2002, were correlated and georeferenced. Using this technique it was possible to define concentration loci of canine leishmaniasis in the following regions: East, Northeast, Northwest, West, and Venda Nova. For human leishmaniasis, however, it was not possible to perform the same analysis. Data analysis also showed that 84.2% of the human leishmaniasis cases were related to canine leishmaniasis cases. Among the biogeographic variables analysed (altitude, area of vegetation influence, hydrography, and areas of poverty), only altitude was shown to influence the emergence of leishmaniasis cases. A total of 4673 canine and 64 human leishmaniasis cases were georeferenced, of which 67.5% and 71.9%, respectively, occurred between 780 and 880 m above sea level. At these same altitudes, a large number of phlebotomine sand flies were collected. We therefore suggest that control measures for leishmaniasis in the city of Belo Horizonte give priority to canine leishmaniasis foci and to regions at altitudes between 780 and 880 m.
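The altitude result amounts to counting georeferenced cases falling within the 780-880 m band; here is a toy version of that calculation with simulated altitudes (not the Belo Horizonte data):

```python
# Toy calculation: share of georeferenced cases lying between 780 and 880 m altitude.
import numpy as np

rng = np.random.default_rng(5)
altitudes_canine = rng.normal(830, 60, size=4673)   # placeholder altitudes for canine cases
altitudes_human = rng.normal(830, 60, size=64)      # placeholder altitudes for human cases

def share_in_band(alt, low=780, high=880):
    return np.mean((alt >= low) & (alt <= high))

print(f"canine: {share_in_band(altitudes_canine):.1%}, human: {share_in_band(altitudes_human):.1%}")
```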
Abstract:
In this paper we look at how web-based social software can be used to carry out qualitative data analysis of online peer-to-peer learning experiences. Specifically, we propose to use Cohere, a web-based social sense-making tool, to observe, track, annotate and visualize discussion-group activities in online courses. We define a specific methodology for data observation and structuring, and present results of the analysis of peer interactions conducted in a discussion forum in a real case study of a P2PU course. Finally, we discuss how network visualization and analysis can be used to gain a better understanding of the peer-to-peer learning experience. To do so, we provide preliminary insights into the social, dialogical and conceptual connections generated within one online discussion group.
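A minimal sketch of the kind of network view described, with learners as nodes and replies or annotations as directed edges; the interaction list is invented and this is not Cohere's data model:

```python
# Toy peer-interaction network: nodes are learners, edges are replies/annotations.
import networkx as nx

interactions = [("ana", "ben"), ("ben", "carla"), ("carla", "ana"),
                ("dave", "ana"), ("dave", "ben"), ("eve", "dave")]

G = nx.DiGraph()
G.add_edges_from(interactions)

# Simple indicators of who connects the discussion: in-degree and betweenness centrality.
print(dict(G.in_degree()))
print({n: round(c, 2) for n, c in nx.betweenness_centrality(G).items()})
```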
Abstract:
Planners in public and private institutions would like coherent forecasts of the components of age-specific mortality, such as causes of death. This has been difficult to achieve because the relative values of the forecast components often fail to behave in a way that is coherent with historical experience. In addition, when the group forecasts are combined the result is often incompatible with an all-groups forecast. It has been shown that cause-specific mortality forecasts are pessimistic when compared with all-cause forecasts (Wilmoth, 1995). This paper abandons the conventional approach of using log mortality rates and forecasts the density of deaths in the life table. Since these values obey a unit sum constraint for both conventional single-decrement life tables (only one absorbing state) and multiple-decrement tables (more than one absorbing state), they are intrinsically relative rather than absolute values across decrements as well as ages. Using the methods of Compositional Data Analysis pioneered by Aitchison (1986), death densities are transformed into the real space so that the full range of multivariate statistics can be applied, then back-transformed to positive values so that the unit sum constraint is honoured. The structure of the best-known single-decrement mortality-rate forecasting model, devised by Lee and Carter (1992), is expressed in compositional form and the results from the two models are compared. The compositional model is extended to a multiple-decrement form and used to forecast mortality by cause of death for Japan.
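The central compositional step, transforming death densities to real space, modelling there, and back-transforming under the unit-sum constraint, can be illustrated with a centred log-ratio (clr) transform. This is a generic sketch, not the paper's full compositional Lee-Carter model:

```python
# Sketch: clr-transform a life-table death density, perturb it in real space,
# then back-transform and re-close so it sums to 1 (generic CoDA step, not the full model).
import numpy as np

def clr(x):
    g = np.exp(np.mean(np.log(x)))        # geometric mean
    return np.log(x / g)

def clr_inverse(z):
    x = np.exp(z)
    return x / x.sum()                     # closure: restore the unit-sum constraint

ages = np.arange(0, 101)
dx = np.exp(-0.5 * ((ages - 80) / 12.0) ** 2)   # toy death density peaked near age 80
dx = dx / dx.sum()

z = clr(dx)
z_forecast = z + 0.02 * (ages - ages.mean()) / 100.0   # toy drift applied in real space
dx_forecast = clr_inverse(z_forecast)
print(dx_forecast.sum())                                # 1.0, constraint honoured
```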
Abstract:
The theory of compositional data analysis is often focused on the composition only. However, in practical applications we often treat a composition together with covariables on some other scale. This contribution systematically gathers and develops statistical tools for this situation. For instance, for the graphical display of the dependence of a composition on a categorical variable, a colored set of ternary diagrams might be a good idea for a first look at the data, but it will quickly hide important aspects if the composition has many parts or takes extreme values. On the other hand, colored scatterplots of ilr components may not be very instructive for the analyst if the conventional, black-box ilr is used. Thinking in terms of the Euclidean structure of the simplex, we suggest setting up appropriate projections which on the one hand show the compositional geometry and on the other hand are still comprehensible by a non-expert analyst, readable for all locations and scales of the data. This is done, for example, by defining special balance displays with carefully selected axes. Following this idea, we need to ask systematically how to display, explore, describe, and test the relation to complementary or explanatory data of categorical, real, ratio or again compositional scales. This contribution shows that it is sufficient to use some basic concepts and very few advanced tools from multivariate statistics (principal covariances, multivariate linear models, trellis or parallel plots, etc.) to build appropriate procedures for all these combinations of scales. This has some fundamental implications for their software implementation and for how they might be taught to analysts who are not already experts in multivariate analysis.
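A balance between two groups of parts, the building block of the balance displays mentioned above, is a single ilr coordinate. A minimal numpy sketch follows, with an arbitrary example composition and partition:

```python
# One balance coordinate between two groups of parts of a composition:
# b = sqrt(r*s/(r+s)) * ln( g(x_R) / g(x_S) ), with g(.) the geometric mean.
import numpy as np

def balance(x, group_r, group_s):
    r, s = len(group_r), len(group_s)
    g_r = np.exp(np.mean(np.log(x[group_r])))
    g_s = np.exp(np.mean(np.log(x[group_s])))
    return np.sqrt(r * s / (r + s)) * np.log(g_r / g_s)

# Example composition with 5 parts (arbitrary values, closed to 1).
x = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
print(balance(x, group_r=[0, 1], group_s=[2, 3, 4]))
```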
Abstract:
Functional Data Analysis (FDA) deals with samples where a whole function is observed for each individual. A particular case of FDA arises when the observed functions are density functions, which are also an example of infinite-dimensional compositional data. In this work we compare several methods of dimensionality reduction for this particular type of data: functional principal component analysis (PCA), with or without a previous data transformation, and multidimensional scaling (MDS) for different inter-density distances, one of them taking into account the compositional nature of density functions. The different methods are applied to both artificial and real data (household income distributions).
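One of the approaches compared, dimensionality reduction after a transformation that respects the compositional nature of densities, can be sketched as clr-transforming discretized densities and applying ordinary PCA. This is a generic illustration, not the paper's exact procedure:

```python
# Sketch: dimensionality reduction of density functions via clr transform + PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
grid = np.linspace(0, 1, 50)

# Toy sample of 100 discretized densities (e.g. income distributions), positive and summing to 1.
centers = rng.uniform(0.3, 0.7, size=100)
densities = np.exp(-0.5 * ((grid[None, :] - centers[:, None]) / 0.1) ** 2) + 1e-6
densities = densities / densities.sum(axis=1, keepdims=True)

clr = np.log(densities) - np.log(densities).mean(axis=1, keepdims=True)   # centred log-ratio
scores = PCA(n_components=2).fit_transform(clr)
print(scores.shape)        # (100, 2): each density summarised by two coordinates
```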
Abstract:
In this paper we examine the problem of compositional data from a different starting point. Chemical compositional data, as used in provenance studies of archaeological materials, will be approached from measurement theory. The results will show, in a very intuitive way, that chemical data can only be treated using the approach developed for compositional data. It will be shown that compositional data analysis is a particular case of projective geometry, when the projective coordinates are in the positive orthant and have the properties of logarithmic interval metrics. Moreover, it will be shown that this approach can be extended to a very large number of applications, including shape analysis. This will be exemplified with a case study on the architecture of Early Christian churches dating back to the 5th-7th centuries AD.
Abstract:
This analysis was stimulated by the real data-analysis problem of household expenditure data. The full dataset contains expenditure data for a sample of 1224 households. The expenditure is broken down at 2 hierarchical levels: 9 major levels (e.g. housing, food, utilities, etc.) and 92 minor levels. There are also 5 factors and 5 covariates at the household level. Not surprisingly, there are a small number of zeros at the major level, but many zeros at the minor level. The question is how best to model the zeros. Clearly, models that try to add a small amount to the zero terms are not appropriate in general, as at least some of the zeros are clearly structural, e.g. alcohol/tobacco for households that are teetotal. The key question then is how to build suitable conditional models. For example, is the sub-composition of spending excluding alcohol/tobacco similar for teetotal and non-teetotal households? In other words, we are looking for sub-compositional independence. Also, what determines whether a household is teetotal? Can we assume that it is independent of the composition? In general, whether a household is teetotal will clearly depend on the household-level variables, so we need to be able to model this dependence. The other tricky question is that, with zeros on more than one component, we need to be able to model dependence and independence of zeros on the different components. Lastly, while some zeros are structural, others may not be; for example, for expenditure on durables, it may be a matter of chance whether a particular household spends money on durables within the sample period. This would clearly be distinguishable if we had longitudinal data, but may still be distinguishable by looking at the distribution, on the assumption that random zeros will usually arise in situations where any non-zero expenditure is not small. While this analysis is based around economic data, the ideas carry over to many other situations, including geological data, where minerals may be missing for structural reasons (similar to alcohol), or missing because they occur only in random regions which may be missed in a sample (similar to the durables).
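The conditional-modelling question raised above, what determines whether a household records a structural zero such as teetotalism given the household-level variables, is naturally the first part of a two-part model. Below is a sketch of that zero-versus-nonzero part as a logistic regression on synthetic data; all variables and coefficients are illustrative:

```python
# Sketch of the zero part of a two-part model:
# P(zero alcohol/tobacco spending | household covariates), on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1224
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=0.4, size=n),
    "household_size": rng.integers(1, 6, size=n),
    "urban": rng.integers(0, 2, size=n),
})
# Toy data-generating process for being teetotal (a structural zero).
logit_p = -1.0 + 0.3 * df.household_size - 0.2 * np.log(df.income) + 0.4 * df.urban
df["zero_alcohol"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit_p))).astype(int)

model = smf.logit("zero_alcohol ~ np.log(income) + household_size + urban", data=df).fit(disp=0)
print(model.params)
```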