836 results for Regression imputation


Relevance: 20.00%

Abstract:

Virtual metrology (VM) aims to predict metrology values using sensor data from production equipment and physical metrology values of preceding samples. VM is a promising technology for the semiconductor manufacturing industry as it can reduce the frequency of in-line metrology operations and provide supportive information for other operations such as fault detection, predictive maintenance and run-to-run control. The prediction models for VM can be from a large variety of linear and nonlinear regression methods and the selection of a proper regression method for a specific VM problem is not straightforward, especially when the candidate predictor set is of high dimension, correlated and noisy. Using process data from a benchmark semiconductor manufacturing process, this paper evaluates the performance of four typical regression methods for VM: multiple linear regression (MLR), least absolute shrinkage and selection operator (LASSO), neural networks (NN) and Gaussian process regression (GPR). It is observed that GPR performs the best among the four methods and that, remarkably, the performance of linear regression approaches that of GPR as the subset of selected input variables is increased. The observed competitiveness of high-dimensional linear regression models, which does not hold true in general, is explained in the context of extreme learning machines and functional link neural networks.

Relevance: 20.00%

Abstract:

Retrospective clinical datasets are often characterized by a relatively small sample size and a large amount of missing data. In this case, a common way of handling the missingness is to discard patients with missing covariates from the analysis, further reducing the sample size. Alternatively, if the mechanism that generated the missingness allows, incomplete data can be imputed on the basis of the observed data, avoiding the reduction of the sample size and allowing methods that require complete data to be applied later on. Moreover, methodologies for data imputation might depend on the particular purpose and might achieve better results by considering specific characteristics of the domain. The problem of missing data treatment is studied in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem by using surrogate splits, that is, splitting rules that use other variables yielding similar results to the original ones. Instead, our methodology consists in modeling the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied on the completed dataset. The Bayesian network is directly learned from the incomplete data using a structural expectation–maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is due to the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation and the imputation so obtained improved the stratification estimated by the survival tree (especially with respect to using surrogate splits).
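The contrast between discarding incomplete patients and imputing from the observed data can be sketched as follows. This is not the Bayesian network method of the abstract; it swaps in plain single-predictor regression imputation as the simplest stand-in, and the records, field names and helper functions are hypothetical.

```python
from statistics import mean

def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def regression_impute(rows, predictor, target):
    """Return copies of rows with missing target values (None) predicted from predictor."""
    complete = [r for r in rows if r[target] is not None]
    intercept, slope = fit_simple_regression(
        [r[predictor] for r in complete], [r[target] for r in complete]
    )
    imputed = []
    for r in rows:
        r = dict(r)  # copy so the input records are left untouched
        if r[target] is None:
            r[target] = intercept + slope * r[predictor]
        imputed.append(r)
    return imputed

# Hypothetical toy records: age is always observed, a marker is sometimes missing.
patients = [
    {"age": 40, "marker": 2.0},
    {"age": 50, "marker": 2.5},
    {"age": 60, "marker": 3.0},
    {"age": 70, "marker": None},
]

# Complete-case analysis drops the fourth patient; imputation keeps all four.
complete_cases = [p for p in patients if p["marker"] is not None]
completed = regression_impute(patients, "age", "marker")
```

Deterministic regression imputation of this kind understates variability in the imputed values, which is one reason model-based approaches such as the one described above are of interest.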

Relevance: 20.00%

Abstract:

A forward and backward least angle regression (LAR) algorithm is proposed to construct the nonlinear autoregressive model with exogenous inputs (NARX) that is widely used to describe a large class of nonlinear dynamic systems. The main objective of this paper is to improve model sparsity and generalization performance of the original forward LAR algorithm. This is achieved by introducing a replacement scheme using an additional backward LAR stage. The backward stage replaces insignificant model terms selected by forward LAR with more significant ones, leading to an improved model in terms of the model compactness and performance. A numerical example to construct four types of NARX models, namely polynomials, radial basis function (RBF) networks, neuro fuzzy and wavelet networks, is presented to illustrate the effectiveness of the proposed technique in comparison with some popular methods.

Relevance: 20.00%

Abstract:

In many applications, and especially those where batch processes are involved, a target scalar output of interest is often dependent on one or more time series of data. With the exponential growth in data logging in modern industries such time series are increasingly available for statistical modeling in soft sensing applications. In order to exploit time series data for predictive modelling, it is necessary to summarise the information they contain as a set of features to use as model regressors. Typically this is done in an unsupervised fashion using simple techniques such as computing statistical moments, principal components or wavelet decompositions, often leading to significant information loss and hence suboptimal predictive models. In this paper, a functional learning paradigm is exploited in a supervised fashion to derive continuous, smooth estimates of time series data (yielding aggregated local information), while simultaneously estimating a continuous shape function yielding optimal predictions. The proposed Supervised Aggregative Feature Extraction (SAFE) methodology can be extended to support nonlinear predictive models by embedding the functional learning framework in a Reproducing Kernel Hilbert Spaces setting. SAFE has a number of attractive features including closed form solution and the ability to explicitly incorporate first and second order derivative information. Using simulation studies and a practical semiconductor manufacturing case study we highlight the strengths of the new methodology with respect to standard unsupervised feature extraction approaches.

Relevance: 20.00%

Abstract:

Both polygenicity (many small genetic effects) and confounding biases, such as cryptic relatedness and population stratification, can yield an inflated distribution of test statistics in genome-wide association studies (GWAS). However, current methods cannot distinguish between inflation from a true polygenic signal and bias. We have developed an approach, LD Score regression, that quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD). The LD Score regression intercept can be used to estimate a more powerful and accurate correction factor than genomic control. We find strong evidence that polygenicity accounts for the majority of the inflation in test statistics in many GWAS of large sample size.
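For orientation, the relationship between test statistics and LD that this approach regresses on is commonly written as below. The symbols are not defined in the abstract and are assumed here from the standard formulation: N is the sample size, M the number of variants, h^2 the heritability, \ell_j the LD Score of variant j, and a the contribution of confounding bias.

```latex
\[
  \mathbb{E}\!\left[\chi^2_j \mid \ell_j\right]
  = \frac{N h^2}{M}\,\ell_j + N a + 1
\]
```

Under this form the slope carries the polygenic signal, while the intercept N a + 1 exceeds 1 only when confounding is present, which is why the intercept can serve as a correction factor.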

Relevance: 20.00%

Abstract:

Background: Around 10-15% of patients with locally advanced rectal cancer (LARC) undergo a pathologically complete response (TRG4) to neoadjuvant chemoradiotherapy; the remaining patients exhibit a spectrum of tumour regression (TRG1-3). Understanding therapy-related genomic alterations may help us to identify underlying biology or novel targets associated with response that could increase the efficacy of therapy in patients who do not benefit from the current standard of care.
Methods: 48 FFPE rectal cancer biopsies and matched resections were analysed using the WG-DASL HumanHT-12_v4 BeadChip array on the Illumina iScan. Bioinformatic analysis was conducted in Partek Genomics Suite and RStudio. The limma and glmnet packages were used to identify genes differentially expressed between tumour regression grades. Validation of microarray results will be carried out using IHC, RNAscope and RT-PCR.
Results: Supervised analysis of the biopsies identified immune response genes that may have predictive value. Differential gene expression from the resections, as well as pre- and post-therapy analysis, revealed induction of genes in a tumour regression-dependent manner. Pathway mapping and Gene Ontology analysis of these genes suggested antigen processing and natural killer cell-mediated cytotoxicity, respectively. The natural killer-like gene signature was switched off in non-responders and on in responders. IHC has confirmed the presence of natural killer cells through CD56+ staining.
Conclusion: Identification of NK cell genes and CD56+ cells in patients responding to neoadjuvant chemoradiotherapy warrants further investigation into their association with tumour regression grade in LARC. NK cells are known to lyse malignant cells and determining whether their presence is a cause or consequence of response is crucial. Interrogation of the cytokines upregulated in our NK-like signature will help guide future in vitro models.

Relevance: 20.00%

Abstract:

Histone deacetylases (HDACs) are enzymes involved in transcriptional repression. We aimed to examine the significance of HDAC1 and HDAC2 gene expression in the prediction of recurrence and survival in 156 patients with hepatocellular carcinoma (HCC) among a South East Asian population who underwent curative surgical resection in Singapore. We found that HDAC1 and HDAC2 were upregulated in the majority of HCC tissues. The presence of HDAC1 in tumor tissues was correlated with poor tumor differentiation. Notably, HDAC1 expression in adjacent non-tumor hepatic tissues was correlated with the presence of satellite nodules and multiple lesions, suggesting that HDAC1 upregulation within the field of HCC may contribute to tumor spread. Using competing risk regression analysis, we found that increased cancer-specific mortality was significantly associated with HDAC2 expression. Mortality was also increased with high HDAC1 expression. In the liver cancer cell lines, HEP3B, HEPG2, PLC5, and a colorectal cancer cell line, HCT116, the combined knockdown of HDAC1 and HDAC2 increased cell death and reduced cell proliferation as well as colony formation. In contrast, knockdown of either HDAC1 or HDAC2 alone had minimal effects on cell death and proliferation. Taken together, our study suggests that both HDAC1 and HDAC2 exert pro-survival effects in HCC cells, and the combination of isoform-specific HDAC inhibitors against both HDACs may be effective in targeting HCC to reduce mortality.

Relevance: 20.00%

Abstract:

Since July 2014, the Office for National Statistics has committed to a predominantly online 2021 UK Census. Item-level imputation will play an important role in adjusting the 2021 Census database. Research indicates that the internet may yield cleaner data than paper-based capture and attract people with particular characteristics. Here, we provide preliminary results from research directed at understanding how we might manage these features in a 2021 UK Census imputation strategy. Our findings suggest that a donor-based imputation method may need to include response mode as a matching variable in the underlying imputation model.
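What it means to use response mode as a matching variable in donor-based imputation can be illustrated with a minimal hot-deck sketch. This is an assumption-laden toy, not the Census methodology: the function `donor_impute`, the field names and the records are all hypothetical.

```python
import random

def donor_impute(records, item, match_vars, rng):
    """Fill missing `item` values (None) from a random donor that agrees on match_vars."""
    # Group observed values of the item by the donors' matching-variable values.
    donors = {}
    for rec in records:
        if rec[item] is not None:
            key = tuple(rec[v] for v in match_vars)
            donors.setdefault(key, []).append(rec[item])
    result = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left untouched
        if rec[item] is None:
            pool = donors.get(tuple(rec[v] for v in match_vars))
            if pool:  # leave the item missing when no matching donor exists
                rec[item] = rng.choice(pool)
        result.append(rec)
    return result

# Hypothetical census-like records; names and values are illustrative only.
records = [
    {"age_band": "25-34", "mode": "online", "tenure": "rent"},
    {"age_band": "25-34", "mode": "paper", "tenure": "own"},
    {"age_band": "25-34", "mode": "online", "tenure": None},
    {"age_band": "25-34", "mode": "paper", "tenure": None},
]

rng = random.Random(0)
# Matching on response mode restricts donors to respondents who used the same mode.
with_mode = donor_impute(records, "tenure", ["age_band", "mode"], rng)
```

Dropping `"mode"` from `match_vars` would let an online respondent receive a value from a paper respondent, which is the situation the findings above caution against when the two groups differ systematically.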

Relevance: 20.00%

Abstract:

BACKGROUND: Preclinical studies have shown that statins, particularly simvastatin, can prevent growth in breast cancer cell lines and animal models. We investigated whether statins used after breast cancer diagnosis reduced the risk of breast cancer-specific, or all-cause, mortality in a large cohort of breast cancer patients.

METHODS: A cohort of 17,880 breast cancer patients, newly diagnosed between 1998 and 2009, was identified from English cancer registries (from the National Cancer Data Repository). This cohort was linked to the UK Clinical Practice Research Datalink, providing prescription records, and to the Office for National Statistics mortality data (up to 2013), identifying 3694 deaths, including 1469 deaths attributable to breast cancer. Unadjusted and adjusted hazard ratios (HRs) for breast cancer-specific, and all-cause, mortality in statin users after breast cancer diagnosis were calculated using time-dependent Cox regression models. Sensitivity analyses were conducted using multiple imputation methods, propensity score methods and a case-control approach.

RESULTS: There was some evidence that statin use after a diagnosis of breast cancer was associated with reduced mortality due to breast cancer and all causes (fully adjusted HR = 0.84 [95% confidence interval = 0.68-1.04] and 0.84 [0.72-0.97], respectively). These associations were more marked for simvastatin (HR = 0.79 [0.63-1.00] and 0.81 [0.70-0.95], respectively).

CONCLUSIONS: In this large population-based breast cancer cohort, there was some evidence of reduced mortality in statin users after breast cancer diagnosis. However, these associations were weak in magnitude and were attenuated in some sensitivity analyses.

Relevance: 20.00%

Abstract:

Master's dissertation, Gestão da Água e da Costa (Water and Coastal Management), Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2010

Relevance: 20.00%

Abstract:

Airborne concentrations of Poaceae pollen have been monitored in Poznań for more than ten years and the length of the dataset is now considered sufficient for statistical analysis. The objective of this paper is to produce long-range forecasts that predict certain characteristics of the grass pollen season (such as the start, peak and end dates of the grass pollen season) as well as short-term forecasts that predict daily variations in grass pollen counts for the next day or next few days throughout the main grass pollen season. The method of forecasting was regression analysis. Correlation analysis was used to examine the relationship between grass pollen counts and the factors that affect its production, release and dispersal. The models were constructed with data from 1994-2004 and tested on data from 2005 and 2006. The forecast models predicted the start of the grass pollen season to within 2 days and achieved 61% and 70% accuracy on a scale of 1-4 when forecasting variations in daily grass pollen counts in 2005 and 2006 respectively. This study has emphasised how important the weather during the few weeks or months preceding pollination is to grass pollen production, and draws attention to the importance of considering large-scale patterns of climate variability (indices of the North Atlantic Oscillation) when constructing forecast models for allergenic pollen.

Relevance: 20.00%

Abstract:

The mesoscale (10⁰–10² m) of river habitats has been identified as the scale that simultaneously offers insights into ecological structure and falls within the practical bounds of river management. Mesoscale habitat (mesohabitat) classifications for relatively large rivers, however, are underdeveloped compared with those produced for smaller streams. Approaches to habitat modelling have traditionally focused on individual species or proceeded on a species-by-species basis. This is particularly problematic in larger rivers where the effects of biological interactions are more complex and intense. Community-level approaches can rapidly model many species simultaneously, thereby integrating the effects of biological interactions while providing information on the relative importance of environmental variables in structuring the community. One such community-level approach, multivariate regression trees, was applied in order to determine the relative influences of abiotic factors on fish assemblages within shoreline mesohabitats of San Pedro River, Chile, and to define reference communities prior to the planned construction of a hydroelectric power plant. Flow depth, bank materials and the availability of riparian and instream cover, including woody debris, were the main variables driving differences between the assemblages. Species strongly indicative of distinctive mesohabitat types included the endemic Galaxias platei. Among other outcomes, the results provide information on the impact of non-native salmonids on river-dwelling Galaxias platei, suggesting a degree of habitat segregation between these taxa based on flow depth. The results support the use of the mesohabitat concept in large, relatively pristine river systems, and they represent a basis for assessing the impact of any future hydroelectric power plant construction and operation.
By combining community classifications with simple sets of environmental rules, the multivariate regression trees produced can be used to predict the community structure of any mesohabitat along the reach.

Relevance: 20.00%

Abstract:

Long-term contractual decisions are the basis of efficient risk management. However, such decisions have to be supported by a robust price forecast methodology. This paper reports a different approach to long-term price forecasting that tries to answer that need. Making use of regression models, the proposed methodology has as its main objective to find the maximum and minimum Market Clearing Price (MCP) for a specific programming period, with a desired confidence level α. Due to the complexity of the problem, the meta-heuristic Particle Swarm Optimization (PSO) was used to find the best regression parameters, and the results were compared with those obtained using a Genetic Algorithm (GA). To validate these models, results from realistic data are presented and discussed in detail.
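The idea of searching for regression parameters with PSO can be sketched on a toy problem. The following is a minimal, generic PSO that fits a straight line by minimizing the sum of squared errors; it does not reproduce the paper's MCP bounds or confidence-level formulation, and the data, function names and swarm settings are assumptions chosen for illustration.

```python
import random

def sse(params, xs, ys):
    """Sum of squared errors of the line y = a + b * x."""
    a, b = params
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def pso_fit(xs, ys, n_particles=30, n_iter=200, seed=0,
            inertia=0.7, c_personal=1.5, c_global=1.5):
    """Search for (a, b) minimizing sse with a basic global-best particle swarm."""
    rng = random.Random(seed)
    dim = 2
    pos = [[rng.uniform(-10, 10) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]                 # each particle's best position
    best_val = [sse(p, xs, ys) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]     # swarm-wide best so far
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull to personal best, pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = sse(pos[i], xs, ys)
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val

# Hypothetical data lying close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
(a, b), loss = pso_fit(xs, ys)
```

For a line fit like this, ordinary least squares gives the answer in closed form; a swarm search becomes worthwhile only when the objective, such as one built around price bounds at a confidence level, has no convenient closed-form solution.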