85 resultados para data harmonization


Relevância:

20.00% 20.00%

Publicador:

Resumo:

A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a ‘coding statistic’ is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C.elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Background: Systematic approaches for identifying proteins involved in different types of cancer are needed. Experimental techniques such as microarrays are being used to characterize cancer, but validating their results can be a laborious task. Computational approaches are used to prioritize between genes putatively involved in cancer, usually based on further analyzing experimental data. Results: We implemented a systematic method using the PIANA software that predicts cancer involvement of genes by integrating heterogeneous datasets. Specifically, we produced lists of genes likely to be involved in cancer by relying on: (i) protein-protein interactions; (ii) differential expression data; and (iii) structural and functional properties of cancer genes. The integrative approach that combines multiple sources of data obtained positive predictive values ranging from 23% (on a list of 811 genes) to 73% (on a list of 22 genes), outperforming the use of any of the data sources alone. We analyze a list of 20 cancer gene predictions, finding that most of them have been recently linked to cancer in literature. Conclusion: Our approach to identifying and prioritizing candidate cancer genes can be used to produce lists of genes likely to be involved in cancer. Our results suggest that differential expression studies yielding high numbers of candidate cancer genes can be filtered using protein interaction networks.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The increasing volume of data describing humandisease processes and the growing complexity of understanding, managing, and sharing such data presents a huge challenge for clinicians and medical researchers. This paper presents the@neurIST system, which provides an infrastructure for biomedical research while aiding clinical care, by bringing together heterogeneous data and complex processing and computing services. Although @neurIST targets the investigation and treatment of cerebral aneurysms, the system’s architecture is generic enough that it could be adapted to the treatment of other diseases.Innovations in @neurIST include confining the patient data pertaining to aneurysms inside a single environment that offers cliniciansthe tools to analyze and interpret patient data and make use of knowledge-based guidance in planning their treatment. Medicalresearchers gain access to a critical mass of aneurysm related data due to the system’s ability to federate distributed informationsources. A semantically mediated grid infrastructure ensures that both clinicians and researchers are able to seamlessly access andwork on data that is distributed across multiple sites in a secure way in addition to providing computing resources on demand forperforming computationally intensive simulations for treatment planning and research.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The spectral efficiency achievable with joint processing of pilot and data symbol observations is compared with that achievable through the conventional (separate) approach of first estimating the channel on the basis of the pilot symbols alone, and subsequently detecting the datasymbols. Studied on the basis of a mutual information lower bound, joint processing is found to provide a non-negligible advantage relative to separate processing, particularly for fast fading. It is shown that, regardless of the fading rate, only a very small number of pilot symbols (at most one per transmit antenna and per channel coherence interval) shouldbe transmitted if joint processing is allowed.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

When continuous data are coded to categorical variables, two types of coding are possible: crisp coding in the form of indicator, or dummy, variables with values either 0 or 1; or fuzzy coding where each observation is transformed to a set of "degrees of membership" between 0 and 1, using co-called membership functions. It is well known that the correspondence analysis of crisp coded data, namely multiple correspondence analysis, yields principal inertias (eigenvalues) that considerably underestimate the quality of the solution in a low-dimensional space. Since the crisp data only code the categories to which each individual case belongs, an alternative measure of fit is simply to count how well these categories are predicted by the solution. Another approach is to consider multiple correspondence analysis equivalently as the analysis of the Burt matrix (i.e., the matrix of all two-way cross-tabulations of the categorical variables), and then perform a joint correspondence analysis to fit just the off-diagonal tables of the Burt matrix - the measure of fit is then computed as the quality of explaining these tables only. The correspondence analysis of fuzzy coded data, called "fuzzy multiple correspondence analysis", suffers from the same problem, albeit attenuated. Again, one can count how many correct predictions are made of the categories which have highest degree of membership. But here one can also defuzzify the results of the analysis to obtain estimated values of the original data, and then calculate a measure of fit in the familiar percentage form, thanks to the resultant orthogonal decomposition of variance. Furthermore, if one thinks of fuzzy multiple correspondence analysis as explaining the two-way associations between variables, a fuzzy Burt matrix can be computed and the same strategy as in the crisp case can be applied to analyse the off-diagonal part of this matrix. In this paper these alternative measures of fit are defined and applied to a data set of continuous meteorological variables, which are coded crisply and fuzzily into three categories. Measuring the fit is further discussed when the data set consists of a mixture of discrete and continuous variables.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This article reviews the methodology of the studies on drug utilization with particular emphasis on primary care. Population based studies of drug inappropriateness can be done with microdata from Health Electronic Records and e-prescriptions. Multilevel models estimate the influence of factors affecting the appropriateness of drug prescription at different hierarchical levels: patient, doctor, health care organization and regulatory environment. Work by the GIUMAP suggest that patient characteristics are the most important factor in the appropriateness of prescriptions with significant effects at the general practicioner level.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The aim of this paper is to analyse the impact of university knowledge and technology transfer activities on academic research output. Specifically, we study whether researchers with collaborative links with the private sector publish less than their peers without such links, once controlling for other sources of heterogeneity. We report findings from a longitudinal dataset on researchers from two engineering departments in the UK between 1985 until 2006. Our results indicate that researchers with industrial links publish significantly more than their peers. Academic productivity, though, is higher for low levels of industry involvement as compared to high levels.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

One of the disadvantages of old age is that there is more past than future: this,however, may be turned into an advantage if the wealth of experience and, hopefully,wisdom gained in the past can be reflected upon and throw some light on possiblefuture trends. To an extent, then, this talk is necessarily personal, certainly nostalgic,but also self critical and inquisitive about our understanding of the discipline ofstatistics. A number of almost philosophical themes will run through the talk: searchfor appropriate modelling in relation to the real problem envisaged, emphasis onsensible balances between simplicity and complexity, the relative roles of theory andpractice, the nature of communication of inferential ideas to the statistical layman, theinter-related roles of teaching, consultation and research. A list of keywords might be:identification of sample space and its mathematical structure, choices betweentransform and stay, the role of parametric modelling, the role of a sample spacemetric, the underused hypothesis lattice, the nature of compositional change,particularly in relation to the modelling of processes. While the main theme will berelevance to compositional data analysis we shall point to substantial implications forgeneral multivariate analysis arising from experience of the development ofcompositional data analysis…

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Modern methods of compositional data analysis are not well known in biomedical research.Moreover, there appear to be few mathematical and statistical researchersworking on compositional biomedical problems. Like the earth and environmental sciences,biomedicine has many problems in which the relevant scienti c information isencoded in the relative abundance of key species or categories. I introduce three problemsin cancer research in which analysis of compositions plays an important role. Theproblems involve 1) the classi cation of serum proteomic pro les for early detection oflung cancer, 2) inference of the relative amounts of di erent tissue types in a diagnostictumor biopsy, and 3) the subcellular localization of the BRCA1 protein, and it'srole in breast cancer patient prognosis. For each of these problems I outline a partialsolution. However, none of these problems is \solved". I attempt to identify areas inwhich additional statistical development is needed with the hope of encouraging morecompositional data analysts to become involved in biomedical research

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The aim of this talk is to convince the reader that there are a lot of interesting statisticalproblems in presentday life science data analysis which seem ultimately connected withcompositional statistics.Key words: SAGE, cDNA microarrays, (1D-)NMR, virus quasispecies