213 results for imputation


Relevance: 10.00%

Abstract:

Most genome-wide association studies to date have been performed in populations of European descent, but there is increasing interest in expanding these studies to other populations. The performance of genotyping chips in Asian populations is not well established. Therefore, we sought to test the performance of widely used fixed-marker, genome-wide association study chips in the Han Chinese population. Non-HapMap Chinese samples (n = 396) were genotyped using the Illumina OmniExpress and Affymetrix 6.0 platforms, and a subset was also genotyped using the Immunochip. Genotyped markers from the Affymetrix 6.0 and Illumina OmniExpress were used for full genome imputation based on the HapMap 2 JPT+CHB (Japanese from Tokyo, Japan and Chinese from Beijing, China) reference panel. The concordance between marker genotypes for the three platforms was very high, whether directly genotyped single nucleotide polymorphisms (SNPs) or genotyped and imputed SNPs were compared (>99.8% for directly genotyped and >99.5% for genotyped and imputed SNPs, respectively). The OmniExpress chip data enabled more SNPs to be imputed, particularly SNPs with minor allele frequency >5%. The OmniExpress chip achieved better coverage of HapMap SNPs than the Affymetrix 6.0 chip (73.6% vs. 65.9%, respectively, for minor allele frequency >5%). The Affymetrix 6.0 and Illumina OmniExpress chips have similar genotyping accuracy and provide similar accuracy of imputed SNPs. The OmniExpress chip, however, provides better coverage of Asian HapMap SNPs, although its coverage of HapMap SNPs is moderate. © 2013 Jiang et al.
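The concordance figures quoted above are, in essence, the share of markers on which two platforms give the same genotype call. The following is a minimal Python sketch of that calculation, not the authors' pipeline. It assumes genotypes are coded as minor-allele counts (0, 1, 2) and that a no-call is represented by `None`; the function name and the sample data are hypothetical.

```python
def concordance(calls_a, calls_b):
    """Fraction of markers with identical genotype calls on two platforms.

    Markers with a missing call (None) on either platform are skipped.
    """
    compared = 0
    matched = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:  # no-call on at least one platform
            continue
        compared += 1
        if a == b:
            matched += 1
    if compared == 0:
        raise ValueError("no markers were called on both platforms")
    return matched / compared


# Hypothetical calls for six markers in one sample (illustrative only).
chip_a = [0, 1, 2, 1, None, 0]
chip_b = [0, 1, 2, 2, 1, 0]
rate = concordance(chip_a, chip_b)
print(f"concordance: {rate:.1%}")
```

In practice the comparison would run over hundreds of thousands of markers and all samples, and imputed genotypes would first be converted to hard calls.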

Relevance: 10.00%

Abstract:

Environmental data usually include measurements, such as water quality data, that fall below detection limits because of limitations of the instruments or of the analytical methods used. The fact that some responses are not detected needs to be properly taken into account in the statistical analysis of such data. However, it is well known that analyzing a data set with detection limits is challenging, and we often have to rely on traditional parametric methods or simple imputation methods. Distributional assumptions can lead to biased inference, and justifying a distribution is often not possible when the data are correlated and a large proportion of the data lies below detection limits. The extent of the bias is usually unknown. To draw valid conclusions and hence provide useful advice for environmental management authorities, it is essential to develop and apply an appropriate statistical methodology. This paper proposes rank-based procedures for analyzing non-normally distributed data collected at different sites over a period of time in the presence of multiple detection limits. To take account of temporal correlations within each site, we propose an optimal linear combination of estimating functions and apply the induced smoothing method to reduce the computational burden. Finally, we apply the proposed method to water quality data collected in the Susquehanna River Basin in the United States of America, which clearly demonstrates the advantages of the rank regression models.
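For context, one widely used "simple imputation" approach of the kind this abstract cautions against is to substitute half the detection limit for each non-detect. The Python sketch below illustrates that baseline only; it is not the rank-based procedure the paper proposes. It assumes non-detects are stored as `None` and that each reading carries its own detection limit; the function name and values are hypothetical.

```python
def substitute_half_dl(values, detection_limits):
    """Replace each non-detect (None) with half of its detection limit."""
    if len(values) != len(detection_limits):
        raise ValueError("values and detection_limits must have equal length")
    return [
        dl / 2 if value is None else value
        for value, dl in zip(values, detection_limits)
    ]


# Hypothetical concentrations with two different detection limits.
readings = [0.8, None, 1.5, None]
limits = [0.5, 0.5, 0.2, 0.2]
filled = substitute_half_dl(readings, limits)
print(filled)
```

Because every non-detect under a given limit receives the same value, this substitution creates artificial ties and distorts the variance, which is one reason the bias it introduces is hard to quantify.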

Relevance: 10.00%

Abstract:

Genome-wide association studies (GWAS) have identified numerous common prostate cancer (PrCa) susceptibility loci. We have fine-mapped 64 GWAS regions known at the conclusion of the iCOGS study using large-scale genotyping and imputation in 25 723 PrCa cases and 26 274 controls of European ancestry. We detected evidence for multiple independent signals at 16 regions, 12 of which contained additional newly identified significant associations. A single signal comprising a spectrum of correlated variation was observed at 39 regions, 35 of which are now described by a novel, more significantly associated lead SNP, while the originally reported variant remained the lead SNP in only 4 regions. We also confirmed two association signals in Europeans that had previously been reported only in East Asian GWAS. Based on statistical evidence and linkage disequilibrium (LD) structure, we have curated and narrowed down the list of the most likely candidate causal variants for each region. Functional annotation using data from ENCODE filtered for PrCa cell lines and eQTL analysis demonstrated significant enrichment for overlap with bio-features within this set. By incorporating the novel risk variants identified here alongside the refined data for existing association signals, we estimate that these loci now explain ∼38.9% of the familial relative risk of PrCa, an 8.9% improvement over the previously reported GWAS tag SNPs. This suggests that a significant fraction of the heritability of PrCa may have been hidden during the discovery phase of GWAS, in particular due to the presence of multiple independent signals within the same region.

Relevance: 10.00%

Abstract:

The impact of erroneous genotypes that have passed standard quality control (QC) can be severe in genome-wide association studies, genotype imputation, and the estimation of heritability and prediction of genetic risk based on single nucleotide polymorphisms (SNPs). To detect such genotyping errors, a simple two-locus QC method, based on the difference in the test statistic of association between single SNPs and pairs of SNPs, was developed and applied. In real data, the proposed approach detected many problematic SNPs with statistical significance even when standard single-SNP QC analyses failed to detect them. Depending on the data set used, the number of erroneous SNPs that were not filtered out by standard single-SNP QC but were detected by the proposed approach varied from a few hundred to thousands. Using simulated data, it was shown that the proposed method was powerful and performed better than the other existing methods tested. The power of the proposed approach to detect erroneous genotypes was approximately 80% for a 3% error rate per SNP. This novel QC approach is easy to implement and computationally efficient, and can lead to better-quality genotypes for subsequent genotype-phenotype investigations.

Relevance: 10.00%

Abstract:

The Northern Demersal Scalefish Fishery has historically comprised a small fleet (≤10 vessels year⁻¹) operating over a relatively large area off the northwest coast of Australia. This multispecies fishery primarily harvests two species of snapper: goldband snapper, Pristipomoides multidens, and red emperor, Lutjanus sebae. A key input to age-structured assessments of these stocks has been the annual time series of the catch rate. We used an approach that combined Generalized Linear Models, spatio-temporal imputation, and computer-intensive methods to standardize the fishery catch rates and report uncertainty in the indices. These analyses, which represent one of the first attempts to standardize fish trap catch rates, were also augmented to gain additional insights into the effects of targeting, historical effort creep, and the spatio-temporal resolution of catch and effort data on trap fishery dynamics. Results from monthly reported catches (i.e. from 1993 onward) were compared with those reported daily in the more recently enhanced catch and effort logbooks (i.e. from 2008 onward). Model effects of catches of one species on the catch rates of another became more conspicuous when the daily data were analysed, and the daily data produced estimates with greater precision. The rate of putative effort creep estimated for standardized catch rates was much lower than that estimated for nominal catch rates. These results therefore demonstrate how important additional insights into fishery and fish population dynamics can be elucidated from such “pre-assessment” analyses.

Relevance: 10.00%

Abstract:

Snapper (Pagrus auratus) is widely distributed throughout subtropical and temperate southern oceans and forms a significant recreational and commercial fishery in Queensland, Australia. Using data from government reports, media sources, popular publications and a government fisheries survey carried out in 1910, we compiled information on individual snapper fishing trips that took place from 1871 to 1939, prior to the commencement of fishery-wide organized data collection. In addition to extracting all available quantitative data, we translated qualitative information into bounded estimates and used multiple imputation to handle missing values, forming 287 records for which catch rate (snapper fisher⁻¹ h⁻¹) could be derived. Uncertainty was handled through a parametric maximum likelihood framework (a transformed trivariate Gaussian), which facilitated statistical comparisons between data sources. No statistically significant differences in catch rates were found among media sources and the government fisheries survey. Catch rates remained stable throughout the time series, averaging 3.75 snapper fisher⁻¹ h⁻¹ (95% confidence interval, 3.42–4.09) as the fishery expanded into new grounds. In comparison, a contemporary (1993–2002) south-east Queensland charter fishery produced an average catch rate of 0.4 snapper fisher⁻¹ h⁻¹ (95% confidence interval, 0.31–0.58). These data illustrate the productivity of a fishery during its earliest years of development and represent the earliest catch rate data globally for this species. By adopting a formalized approach to address issues common to many historical records (missing data, a lack of quantitative information and reporting bias), our analysis demonstrates the potential for historical narratives to contribute to contemporary fisheries management.

Relevance: 10.00%

Abstract:

The extent to which low-frequency (minor allele frequency (MAF) between 1-5%) and rare (MAF [...] imputation of genotyped samples using a combined UK10K/1000 Genomes reference panel (n = 26,534), and de novo replication genotyping (n = 20,271). We identified a low-frequency non-coding variant near a novel locus, EN1, with an effect size fourfold larger than the mean of previously reported common variants for lumbar spine bone mineral density (BMD) (rs11692564(T), MAF = 1.6%, replication effect size = +0.20 s.d., Pmeta = 2 × 10⁻¹⁴), which was also associated with a decreased risk of fracture (odds ratio = 0.85; P = 2 × 10⁻¹¹; ncases = 98,742 and ncontrols = 409,511). Using an En1(cre/flox) mouse model, we observed that conditional loss of En1 results in low bone mass, probably as a consequence of high bone turnover. We also identified a novel low-frequency non-coding variant with large effects on BMD near WNT16 (rs148771817(T), MAF = 1.2%, replication effect size = +0.41 s.d., Pmeta = 1 × 10⁻¹¹). In general, there was an excess of association signals arising from deleterious coding and conserved non-coding variants. These findings provide evidence that low-frequency non-coding variants have large effects on BMD and fracture, thereby providing rationale for whole-genome sequencing and improved imputation reference panels to study the genetic architecture of complex traits and disease in the general population.

Relevance: 10.00%

Abstract:

We propose a novel second-order cone programming formulation for designing robust classifiers that can handle uncertainty in observations. Similar formulations are also derived for designing regression functions that are robust to uncertainties in the regression setting. The proposed formulations are independent of the underlying distribution, requiring only the existence of second-order moments. These formulations are then specialized to the case of missing values in observations for both classification and regression problems. Experiments show that the proposed formulations outperform imputation.

Relevance: 10.00%

Abstract:

In recent years, thanks to developments in information technology, large-dimensional datasets have become increasingly available. Researchers now have access to thousands of economic series, and the information contained in them can be used to create accurate forecasts and to test economic theories. To exploit this large amount of information, researchers and policymakers need an appropriate econometric model. Usual time series models, vector autoregression for example, cannot incorporate more than a few variables. There are two ways to solve this problem: use variable selection procedures, or gather the information contained in the series to create an index model. This thesis focuses on one of the most widespread index models, the dynamic factor model (the theory behind this model, based on previous literature, is the core of the first part of this study), and its use in forecasting Finnish macroeconomic indicators (which is the focus of the second part of the thesis). In particular, I forecast economic activity indicators (e.g. GDP) and price indicators (e.g. consumer price index) from three large Finnish datasets. The first dataset contains a large series of aggregated data obtained from the Statistics Finland database. The second dataset is composed of economic indicators from the Bank of Finland. The last dataset is formed of disaggregated data from Statistics Finland, which I call the micro dataset. The forecasts are computed following a two-step procedure: in the first step, I estimate a set of common factors from the original dataset; the second step consists of formulating forecasting equations that include the factors extracted previously. The predictions are evaluated using the relative mean squared forecast error, where the benchmark model is a univariate autoregressive model. The results are dataset-dependent. The forecasts based on factor models are very accurate for the first dataset (the Statistics Finland one), while they are considerably worse for the Bank of Finland dataset. The forecasts derived from the micro dataset are still good, but less accurate than the ones obtained in the first case. This work opens up multiple directions for further research. The results obtained here can be replicated for longer datasets. The non-aggregated data can be represented in an even more disaggregated form (firm level). Finally, the use of the micro data, one of the major contributions of this thesis, can be useful in the imputation of missing values and the creation of flash estimates of macroeconomic indicators (nowcasting).
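The evaluation metric named above, the relative mean squared forecast error, is the mean squared forecast error of the candidate model divided by that of the benchmark, so values below 1 favour the candidate. The Python sketch below shows the arithmetic under that standard definition. The function names and the three-period series are hypothetical, and the forecasts are made-up numbers rather than output from any factor or autoregressive model.

```python
def msfe(actual, forecast):
    """Mean squared forecast error over the evaluation sample."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


def relative_msfe(actual, model_forecast, benchmark_forecast):
    """MSFE of the model divided by MSFE of the benchmark (< 1 favours the model)."""
    return msfe(actual, model_forecast) / msfe(actual, benchmark_forecast)


# Hypothetical out-of-sample values and forecasts (illustrative only).
actual = [1.0, 2.0, 3.0]
factor_model = [1.0, 2.5, 3.0]  # stands in for the factor-based forecast
ar_benchmark = [2.0, 2.0, 2.0]  # stands in for the univariate AR forecast
ratio = relative_msfe(actual, factor_model, ar_benchmark)
print(f"relative MSFE: {ratio:.3f}")
```

A ratio well below 1 corresponds to the "very accurate" case described for the Statistics Finland dataset, while a ratio near or above 1 means the factor model adds little over the autoregressive benchmark.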

Relevance: 10.00%

Abstract:

Clustering techniques that can handle incomplete data have become increasingly important due to varied applications in marketing research, medical diagnosis and survey data analysis. Existing techniques cope with missing values either by data modification/imputation or by partial distance computation, which is often unreliable depending on the number of features available. In this paper, we propose a novel approach for clustering data with missing values, which performs the task by Symmetric Non-Negative Matrix Factorization (SNMF) of a complete pair-wise similarity matrix computed from the given incomplete data. To accomplish this, we define a novel similarity measure, based on the Average Overlap similarity metric, which can effectively handle missing values without modification of the data. Further, the similarity measure is more reliable than partial distances and inherently possesses the properties required to perform SNMF. The experimental evaluation on real-world datasets demonstrates that the proposed approach is efficient and scalable, and shows significantly better performance than the existing techniques.
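The partial distance baseline mentioned above is commonly computed by taking the squared Euclidean distance over the features observed in both records and rescaling it to the full dimensionality. The Python sketch below follows that common formulation. It is an illustration of the existing baseline, not the similarity measure proposed in the paper, and it assumes missing features are stored as `None`; the function name and vectors are hypothetical.

```python
import math


def partial_distance(x, y):
    """Euclidean distance over jointly observed features, rescaled to full dimension.

    Features missing (None) in either vector are ignored; the squared sum is
    multiplied by (total features / jointly observed features).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of features")
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no feature is observed in both vectors")
    scale = len(x) / len(pairs)
    return math.sqrt(scale * sum((a - b) ** 2 for a, b in pairs))


# Hypothetical records: only the first and last features are jointly observed.
x = [1.0, None, 3.0, 4.0]
y = [1.0, 2.0, None, 0.0]
d = partial_distance(x, y)
print(f"partial distance: {d:.4f}")
```

With only two of four features shared, the result rests on half the information and is then doubled, which shows why this estimate becomes unreliable as the number of jointly observed features shrinks.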

Relevance: 10.00%

Abstract:

A criminal offence is that which tradition, lived in the experience of family and community, allows each person to recognize as a grave departure from what is true, good, and right. The foundation of criminal punishability is imputation: the recognition that the offence belongs to the subject as its cause. The article seeks to show how this link between criminal law and legal tradition suffered two ruptures: with the legal Enlightenment and Kantianism, legal imputation was separated from its moral foundation, and with codification, the unity of a universal criminal law founded on the bonum et aequum was broken, giving prevalence to an idea of legality enslaved to the interests of States. At the same time, a second rupture occurred: a responsibility is asserted that is attributed from outside, objectively, to centres of imputation, frequently collective, that carry out industrial production. Thus, according to the demands of safety and health, criminal law becomes an instrument of criminal policy, and the contours of the offence definition are shaped by judges and prosecutors in order to prevent the future consequences of the progressive “risk” of industrial production. The dominant note is collective “risk” rather than the “act”.

Relevance: 10.00%

Abstract:

The article examines the concept of person, in its aporetic sources and in the thought of Saint Thomas Aquinas, in order finally to analyse some of its properties with a view to grounding a theory of imputation. The first part begins by considering the semantic, historical, and theological origins of the terms persona and hypóstasis. It then addresses the theoretical problem that this concept poses for modern thought, taking into account the impoverishment of its metaphysics as a consequence of nominalism, the principle of immanence, and a deficient conception of experience. Lastly, it considers some consequences of this impoverishment for contemporary thought. The second part sets out the Thomist doctrine which, in a theoretically definitive way and using Aristotelian ontological elements, resolved the problems raised in the Patristic era. It analyses the concept of individual substance (or suppositum), which in the definition of person operates analogically as the proximate genus, and that of individuated spiritual nature, which operates as the specific difference. Finally, texts by Aquinas are provided on the definition and on the conceptual and ontological differences between nature and person. The third part presents some properties, dividing them into four groups of theses: (1) the person as subject of attribution and master of his or her life projects; (2) as a subject ontologically open to the world, to fellow human beings, and to God; (3) as a conscious, free, and autonomous subject; (4) the ethical character of the person, as subject of imputation and responsibility.

Relevance: 10.00%

Abstract:

Background: Malignancies arising in the large bowel cause the second largest number of deaths from cancer in the Western world. Despite the progress made during recent decades, colorectal cancer remains one of the most frequent and deadly neoplasias in Western countries. Methods: A genomic study of human colorectal cancer was carried out on a total of 31 tumoral samples, corresponding to different stages of the disease, and 33 non-tumoral samples. The study was carried out by hybridisation of the tumour samples against a reference pool of non-tumoral samples using Agilent Human 1A 60-mer oligo microarrays. The results obtained were validated by qRT-PCR. In the subsequent bioinformatics analysis, gene networks were built by means of Bayesian classifiers, variable selection and bootstrap resampling. The consensus among all the induced models produced a hierarchy of dependences and, thus, of variables. Results: After an exhaustive pre-processing stage to ensure data quality (missing-value imputation, probe quality, data smoothing and intraclass variability filtering), the final dataset comprised a total of 8,104 probes. Next, a supervised classification approach and data analysis were carried out to obtain the most relevant genes. Two of them are directly involved in cancer progression and in particular in colorectal cancer. Finally, a supervised classifier was induced to classify new unseen samples. Conclusions: We have developed a tentative model for the diagnosis of colorectal cancer based on a biomarker panel. Our results indicate that the gene profile described herein can discriminate between non-cancerous and cancerous samples with 94.45% accuracy using different supervised classifiers (AUC values between 0.955 and 0.997).

Relevance: 10.00%

Abstract:

The aim of this study was to correct the magnitude of registered deaths from cervical cancer in Brazil, and to analyse the magnitude of mortality from this cancer and its association with social indicators in the states of the Northeast region of Brazil from 1996 to 2005. To correct for under-registration, the factors created by the Global Burden of Disease Project in Brazil-1998 were used. A proportional redistribution methodology was used to redistribute the unknown, incomplete, or ill-defined diagnostic categories of deaths identified in the mortality information system, except for missing age data, which were corrected through imputation. The corrections were applied to each Federative Unit of the country, by sex and age group, and the results are presented for Brazil and for each major region and its respective geographic areas (capital, other municipalities of the metropolitan regions, and interior). Temporal trends in mortality were analysed through simple linear regression for each state of the Northeast region. A percentage variation index was used to determine the variability of the magnitude of the rates before and after correction of the deaths. Linear regression was used to analyse the behaviour of the correction and the correlations between socioeconomic indicators and cervical cancer mortality rates with and without correction. After the corrections, cervical cancer mortality rates in Brazil showed a percentage increase of 103.4%, ranging from 35% for the capitals of the South region to 339% for the interior of the Northeast region. Positive correlations were found between some socioeconomic indicators and uncorrected rates, and negative correlations between those same indicators and corrected rates. With other socioeconomic indicators, the inverse was observed. The results of the correction were consistent geographically and with the findings of the literature, allowing the conclusion that the proposed methodology was adequate for correcting the magnitude of cervical cancer mortality rates in the country. If comparative analyses of socioeconomic conditions and the behaviour of this cancer are made without any knowledge of the coverage and quality of death registration, mistaken conclusions may be reached. Considering the corrected magnitude of cervical cancer mortality, we can state that the problem of this disease in the Northeast region and in the country is more serious than what is observed in official reports. Nevertheless, the results indicate that the control and early detection programmes developed in the country are already showing positive results.
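Proportional redistribution, as used in the study above, generally means assigning deaths with an unknown or ill-defined cause to the well-defined causes in proportion to each cause's share of the well-defined deaths. The Python sketch below shows that arithmetic for a single stratum. It is a simplified illustration that assumes a single pool of ill-defined deaths; the function name, cause labels, and counts are hypothetical and not taken from the study.

```python
def redistribute(deaths_by_cause, ill_defined):
    """Allocate ill-defined deaths to defined causes in proportion to their counts."""
    total_defined = sum(deaths_by_cause.values())
    if total_defined == 0:
        raise ValueError("no well-defined deaths to redistribute onto")
    return {
        cause: count + ill_defined * count / total_defined
        for cause, count in deaths_by_cause.items()
    }


# Hypothetical counts for one stratum (e.g. one state, sex, and age group).
defined = {"cervical cancer": 40, "other causes": 160}
corrected = redistribute(defined, ill_defined=50)
print(corrected)
```

The total number of deaths is preserved (200 defined plus 50 ill-defined gives 250 after correction), and applying the step separately within each stratum keeps the correction sensitive to local registration quality.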

Relevance: 10.00%

Abstract:

[...] effects are frequently observed in morbidity and mortality from respiratory and cardiovascular diseases, lung cancer, reduced respiratory function, school absenteeism, and pregnancy-related problems. Studies also suggest that the most susceptible groups are children and the elderly. This thesis presents studies on the effect of air pollution on health in the city of Rio de Janeiro and addresses methodological aspects of data analysis and the imputation of missing data in epidemiological time series. Time series analysis was used to estimate the effect of air pollution on mortality from lung cancer among elderly people, using data from 2000 and 2001. This study aimed to assess whether air pollution is associated with earlier deaths among people who already belong to an at-risk population. Another study was carried out to assess the effect of air pollution on low birth weight in full-term births. This was a cross-sectional study using the data available for 2002. Both studies estimated moderate effects of air pollution. Methodological aspects of epidemiological studies of air pollution and health are also addressed in the thesis. A method for imputing missing data is proposed and implemented in a library for the R application. The imputation methodology is evaluated and compared, by means of simulation techniques, with other methods frequently used to impute time series of atmospheric pollutant concentrations. The proposed method performed better than those traditionally used. A brief review is also given of the methodology used in time series studies of the effects of air pollution on health. The topics covered in the review are implemented in a library for the analysis of epidemiological time series data in the statistical application R. The use of the library is illustrated with data on hospital admissions of children for respiratory diseases in Rio de Janeiro. The methodological studies were developed within the multicentre study for evaluating the effects of air pollution in Latin America, the ESCALA Project.
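As a point of reference for the comparison described above, linear interpolation between the nearest observed neighbours is one simple gap-filling method for pollutant series. The Python sketch below illustrates that baseline only. It is not the method proposed in the thesis, and the thesis does not say which traditional methods it compared against, so treating interpolation as one of them is an assumption. Missing readings are assumed to be stored as `None`; the function name and values are hypothetical.

```python
def interpolate_gaps(series):
    """Fill interior runs of None by linear interpolation.

    Gaps at the start or end of the series have no neighbour on one side
    and are left as None.
    """
    filled = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Hypothetical daily pollutant concentrations with missing readings.
pm10 = [None, 40.0, None, None, 70.0, 65.0, None]
filled = interpolate_gaps(pm10)
print(filled)
```

Interpolation ignores weekly and seasonal patterns and any information from neighbouring monitors, which is the kind of limitation a purpose-built imputation method for pollutant series would aim to address.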