963 results for Predictive regression
Abstract:
For clustered survival data, the traditional Gehan-type estimator is asymptotically equivalent to using only the between-cluster ranks, and the within-cluster ranks are ignored. The contribution of this paper is twofold: (i) incorporating within-cluster ranks in censored data analysis, and (ii) applying the induced smoothing of Brown and Wang (2005, Biometrika) for computational convenience. Asymptotic properties of the resulting estimating functions are given. We also carry out numerical studies to assess the performance of the proposed approach and conclude that the proposed approach can lead to much improved estimators when strong clustering effects exist. A dataset from a litter-matched tumorigenesis experiment is used for illustration.
Abstract:
With a growing population and fast urbanization in Australia, it is a challenging task to maintain our water quality. It is essential to develop an appropriate statistical methodology for analyzing water quality data in order to draw valid conclusions and hence provide useful advice in water management. This paper develops robust rank-based procedures for analyzing nonnormally distributed data collected over time at different sites. To take account of temporal correlations of the observations within sites, we consider the optimally combined estimating functions proposed by Wang and Zhu (Biometrika, 93:459-464, 2006), which lead to more efficient parameter estimation. Furthermore, we apply the induced smoothing method to reduce the computational burden. Smoothing leads to easy calculation of the parameter estimates and their variance-covariance matrix. Analysis of water quality data from Total Iron and Total Cyanophytes shows the differences between the traditional generalized linear mixed models and rank regression models. Our analysis also demonstrates the advantages of the rank regression models for analyzing nonnormal data.
Abstract:
Environmental data usually include measurements, such as water quality data, which fall below detection limits, because of limitations of the instruments or of certain analytical methods used. The fact that some responses are not detected needs to be properly taken into account in statistical analysis of such data. However, it is well known that it is challenging to analyze a data set with detection limits, and we often have to rely on the traditional parametric methods or simple imputation methods. Distributional assumptions can lead to biased inference, and justification of distributions is often not possible when the data are correlated and there is a large proportion of data below detection limits. The extent of bias is usually unknown. To draw valid conclusions and hence provide useful advice for environmental management authorities, it is essential to develop and apply an appropriate statistical methodology. This paper proposes rank-based procedures for analyzing non-normally distributed data collected at different sites over a period of time in the presence of multiple detection limits. To take account of temporal correlations within each site, we propose an optimal linear combination of estimating functions and apply the induced smoothing method to reduce the computational burden. Finally, we apply the proposed method to the water quality data collected in the Susquehanna River Basin in the United States of America, which clearly demonstrates the advantages of the rank regression models.
Abstract:
We consider rank regression for clustered data analysis and investigate the induced smoothing method for obtaining the asymptotic covariance matrices of the parameter estimators. We prove that the induced estimating functions are asymptotically unbiased and the resulting estimators are strongly consistent and asymptotically normal. The induced smoothing approach provides an effective way of obtaining asymptotic covariance matrices for between- and within-cluster estimators and for a combined estimator to take account of within-cluster correlations. We also carry out extensive simulation studies to assess the performance of different estimators. The proposed methodology is substantially faster in computation and more stable in numerical results than the existing methods. We apply the proposed methodology to a dataset from a randomized clinical trial.
Abstract:
We consider rank-based regression models for clustered data analysis. A weighted Wilcoxon rank method is proposed to take account of within-cluster correlations and varying cluster sizes. The asymptotic normality of the resulting estimators is established. A method to estimate the covariance of the estimators is also given, which can bypass estimation of the density function. Simulation studies are carried out to compare different estimators for a number of scenarios on the correlation structure, presence/absence of outliers and different correlation values. The proposed methods appear to perform well; in particular, the one incorporating the correlation in the weighting achieves the highest efficiency and robustness against misspecification of correlation structure and outliers. A real example is provided for illustration.
Abstract:
We consider rank-based regression models for repeated measures. To account for possible within-subject correlations, we decompose the total ranks into between- and within-subject ranks and obtain two different estimators based on between- and within-subject ranks. A simple perturbation method is then introduced to generate bootstrap replicates of the estimating functions and the parameter estimates. This provides a convenient way of combining the corresponding two types of estimating functions for more efficient estimation.
Abstract:
Adaptations of weighted rank regression to the accelerated failure time model for censored survival data have been successful in yielding asymptotically normal estimates and flexible weighting schemes to increase statistical efficiencies. However, for only one simple weighting scheme, Gehan or Wilcoxon weights, are estimating equations guaranteed to be monotone in parameter components, and even in this case they are step functions, requiring the equivalent of linear programming for computation. The lack of smoothness makes standard error or covariance matrix estimation even more difficult. An induced smoothing technique overcame these difficulties in various problems involving monotone but pure jump estimating equations, including conventional rank regression. The present paper applies induced smoothing to the Gehan-Wilcoxon weighted rank regression for the accelerated failure time model, for the more difficult case of survival time data subject to censoring, where the inapplicability of permutation arguments necessitates a new method of estimating the null variance of estimating functions. Smooth monotone parameter estimation and rapid, reliable standard error or covariance matrix estimation are obtained.
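To make the contrast between a step-function and a smoothed estimating function concrete, here is a minimal one-covariate sketch in Python. It assumes the common textbook form of the Gehan estimating function and replaces each indicator with a normal distribution function whose scale shrinks with the sample size; the scale formula, the parameter `sigma2`, and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def residuals(beta, log_t, x):
    """AFT residuals e_i = log T_i - beta * x_i for a single covariate."""
    return [lt - beta * xi for lt, xi in zip(log_t, x)]

def gehan_step(beta, log_t, x, delta):
    """Gehan-type estimating function: sum of delta_i (x_i - x_j) 1{e_j >= e_i}."""
    e = residuals(beta, log_t, x)
    total = 0.0
    for i in range(len(x)):
        if not delta[i]:          # censored cases contribute only as comparators j
            continue
        for j in range(len(x)):
            if e[j] >= e[i]:
                total += x[i] - x[j]
    return total

def gehan_smoothed(beta, log_t, x, delta, sigma2=1.0):
    """Smoothed version: the indicator is replaced by Phi((e_j - e_i) / r_ij).

    r_ij^2 = sigma2 * (x_i - x_j)^2 / n is an assumed scale, chosen so the
    smoothing vanishes as n grows.
    """
    e = residuals(beta, log_t, x)
    n = len(x)
    total = 0.0
    for i in range(n):
        if not delta[i]:
            continue
        for j in range(n):
            d = x[i] - x[j]
            if d == 0.0:          # such pairs contribute nothing in either version
                continue
            r = math.sqrt(sigma2 * d * d / n)
            total += d * norm_cdf((e[j] - e[i]) / r)
    return total

# Toy data (hypothetical): log survival times, one covariate, event indicators.
log_t = [0.5, 1.2, 0.3, 2.0, 1.7]
x = [0.0, 1.0, 0.5, 2.0, 1.5]
delta = [1, 0, 1, 1, 1]

grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
smooth_values = [gehan_smoothed(b, log_t, x, delta) for b in grid]
```

With this sign convention both versions are nondecreasing in `beta`, but only the smoothed one is differentiable, which is what makes Newton-type root finding and sandwich-type variance estimates practical.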
Abstract:
This article is motivated by a lung cancer study where a regression model is involved and the response variable is too expensive to measure but the predictor variable can be measured easily with relatively negligible cost. This situation occurs quite often in medical studies, quantitative genetics, and ecological and environmental studies. In this article, by using the idea of ranked-set sampling (RSS), we develop sampling strategies that can reduce cost and increase efficiency of the regression analysis for the above-mentioned situation. The developed method is applied retrospectively to a lung cancer study. In the lung cancer study, the interest is to investigate the association between smoking status and three biomarkers: polyphenol DNA adducts, micronuclei, and sister chromatid exchanges. Optimal sampling schemes with different optimality criteria such as A-, D-, and integrated mean square error (IMSE)-optimality are considered in the application. With set size 10 in RSS, the improvement of the optimal schemes over simple random sampling (SRS) is great. For instance, by using the optimal scheme with IMSE-optimality, the IMSEs of the estimated regression functions for the three biomarkers are reduced to about half of those incurred by using SRS.
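As a rough illustration of the basic ranked-set sampling idea, the Python sketch below implements plain balanced RSS: repeatedly draw a small set of units, rank them on the cheap predictor, and keep only the unit with a prescribed rank for the expensive response measurement. This is the textbook balanced design, not the optimal schemes developed in the article; the function name, the population, and the unit identifiers are hypothetical.

```python
import random

def ranked_set_sample(population, set_size, cycles, rng):
    """Balanced ranked-set sampling on a cheap predictor.

    population: list of (x, unit_id) pairs, where x is the cheap predictor.
    Returns a list of (rank, (x, unit_id)) pairs; only these units would
    have the expensive response measured.
    """
    selected = []
    for _ in range(cycles):
        for r in range(set_size):
            # Draw a fresh set; sets may overlap across draws in this sketch.
            candidates = rng.sample(population, set_size)
            candidates.sort(key=lambda unit: unit[0])   # rank by predictor only
            selected.append((r + 1, candidates[r]))     # keep the r-th ranked unit
    return selected

rng = random.Random(0)
population = [(rng.gauss(0.0, 1.0), unit_id) for unit_id in range(200)]
sample = ranked_set_sample(population, set_size=3, cycles=4, rng=rng)
```

Each cycle inspects `set_size * set_size` units on the predictor but measures the response on only `set_size` of them, which is where the cost saving comes from.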
Abstract:
Predictive models based on near infra-red spectroscopy for the assessment of fruit internal quality attributes must exhibit a degree of robustness across the parameters of variety, district and time to be of practical use in fruit grading. At the time this thesis was initiated, while there were a number of published reports on the development of near infra-red based calibration models for the assessment of internal quality attributes of intact fruit, there were no reports of the reliability ("robustness") of such models across time, cultivars or growing regions. As existing published reports varied in instrumentation employed, a re-analysis of existing data was not possible. An instrument platform, based on partial transmittance optics, a halogen light source and a (Zeiss MMS 1) detector operating in the short wavelength near infra-red region, was developed for use in the assessment of intact fruit. This platform was used to assess populations of macadamia kernels, melons and mandarin fruit for total soluble solids, dry matter and oil concentration. Calibration procedures were optimised and robustness assessed across growing areas, time of harvest, season and variety. In general, global modified partial least squares regression (MPLS) calibration models based on derivatised absorbance data were better than either multiple linear regression or 'local' MPLS models in the prediction of independent validation populations. Robustness was most affected by growing season, relative to the growing district or variety. Various calibration updating procedures were evaluated in terms of calibration robustness. Random selection of samples from the validation population for addition to the calibration population was equivalent to or better than other methods of sample addition (methods based on the Mahalanobis distance of samples from either the centroid of the population or neighbourhood samples).
In these exercises the global Mahalanobis distance (GH) was calculated using the scores and loadings from the calibration population on the independent validation population. In practice, it is recommended that model predictive performance be monitored in terms of predicted sample GH, with model updating using as few as 10 samples from the new population undertaken when the average GH value exceeds 1.0.
Abstract:
Volatile chemical compounds responsible for the aroma of wine are derived from a number of different biochemical and chemical pathways. These chemical compounds are formed during grape berry metabolism, crushing of the berries, fermentation processes (i.e. yeast and malolactic bacteria) and also from the ageing and storage of wine. Not surprisingly, there are a large number of chemical classes of compounds found in wine which are present at varying concentrations (ng L−1 to mg L−1), exhibit differing potencies, and have a broad range of volatilities and boiling points. The aim of this work was to investigate the potential use of near infrared (NIR) spectroscopy combined with chemometrics as a rapid and low-cost technique to measure volatile compounds in Riesling wines. Samples of commercial Riesling wine were analyzed using an NIR instrument, and volatile compounds were measured by gas chromatography (GC) coupled with selected ion monitoring mass spectrometry. Correlations between the NIR and GC data were developed using partial least-squares (PLS) regression with full cross validation (leave one out). Coefficients of determination in cross validation (R²) and the standard error in cross validation (SECV) were 0.74 (SECV: 313.6 μg L−1) for esters, 0.90 (SECV: 20.9 μg L−1) for monoterpenes and 0.80 (SECV: 1658 μg L−1) for short-chain fatty acids. This study has shown that volatile chemical compounds present in wine can be measured by NIR spectroscopy. Further development with larger data sets will be required to test the predictive ability of the NIR calibration models developed.
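The leave-one-out procedure behind cross-validated statistics of this kind can be sketched as follows. To stay dependency-free, the sketch swaps PLS for a one-predictor least-squares line, so it only illustrates the validation loop, not the calibration model used in the study. It reports a cross-validated R² and the root mean square error of cross validation; the exact SECV formula used by the authors is not given in the abstract and may differ, for example by a bias correction or a different divisor. All data values are made up.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loo_predictions(xs, ys):
    """Predict each sample from a model fitted to all the other samples."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds

def cv_stats(ys, preds):
    """Return (R^2 in cross validation, RMSECV) from held-out predictions."""
    n = len(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))  # prediction error sum of squares
    mean_y = sum(ys) / n
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / total_ss, math.sqrt(press / n)

# Hypothetical spectral feature (xs) and reference concentration (ys).
xs = [0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.80]
ys = [12.0, 19.5, 33.0, 41.0, 52.5, 71.0, 78.0]
r2_cv, rmsecv = cv_stats(ys, loo_predictions(xs, ys))
```

Because every prediction comes from a model that never saw the sample in question, these statistics are less optimistic than the corresponding calibration statistics.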
Abstract:
Gaussian processes (GPs) are promising Bayesian methods for classification and regression problems. Design of a GP classifier and making predictions using it is, however, computationally demanding, especially when the training set size is large. Sparse GP classifiers are known to overcome this limitation. In this letter, we propose and study a validation-based method for sparse GP classifier design. The proposed method uses a negative log predictive (NLP) loss measure, which is easy to compute for GP models. We use this measure for both basis vector selection and hyperparameter adaptation. The experimental results on several real-world benchmark data sets show better or comparable generalization performance over existing methods.
Abstract:
Overprocessing waste occurs in a business process when effort is spent in a way that does not add value to the customer or to the business. Previous studies have identified a recurrent overprocessing pattern in business processes with so-called "knockout checks", meaning activities that classify a case into "accepted" or "rejected", such that if the case is accepted it proceeds forward, while if rejected, it is cancelled and all work performed in the case is considered unnecessary. Thus, when a knockout check rejects a case, the effort spent in other (previous) checks becomes overprocessing waste. Traditional process redesign methods propose to order knockout checks according to their mean effort and rejection rate. This paper presents a more fine-grained approach where knockout checks are ordered at runtime based on predictive machine learning models. Experiments on two real-life processes show that this predictive approach outperforms traditional methods while incurring minimal runtime overhead.
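The traditional design-time ordering can be sketched in a few lines of Python. The sketch assumes that checks reject independently of one another and that the rule is "ascending mean effort divided by rejection rate"; the abstract only says that both quantities are used, so the exact ratio rule, the check names, and the numbers are assumptions for illustration.

```python
from itertools import permutations

def expected_effort(order):
    """Expected effort per case when checks run in the given order.

    order: sequence of (name, mean_effort, rejection_rate) tuples.
    A check is only executed if every earlier check accepted the case.
    """
    total = 0.0
    p_reach = 1.0
    for _, effort, rejection_rate in order:
        total += p_reach * effort
        p_reach *= 1.0 - rejection_rate
    return total

def order_checks(checks):
    """Static ordering: cheapest effort per unit of rejection probability first."""
    return sorted(checks, key=lambda c: c[1] / c[2])

# Hypothetical checks: (name, mean effort, rejection rate).
checks = [("credit", 10.0, 0.2), ("identity", 2.0, 0.1), ("fraud", 5.0, 0.5)]
ordered = order_checks(checks)
best_by_brute_force = min(permutations(checks), key=expected_effort)
```

Under the independence assumption, swapping any two adjacent checks that violate the ratio order can only lower the expected effort, which is why the sorted order matches the brute-force optimum here. A runtime approach of the kind the abstract describes would replace the fixed rejection rates with per-case predicted probabilities.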
Abstract:
This paper addresses the following predictive business process monitoring problem: Given the execution trace of an ongoing case, and given a set of traces of historical (completed) cases, predict the most likely outcome of the ongoing case. In this context, a trace refers to a sequence of events with corresponding payloads, where a payload consists of a set of attribute-value pairs. Meanwhile, an outcome refers to a label associated with completed cases, for example, a label indicating that a given case completed “on time” (with respect to a given desired duration) or “late”, or a label indicating that a given case led to a customer complaint or not. The paper tackles this problem via a two-phased approach. In the first phase, prefixes of historical cases are encoded using complex symbolic sequences and clustered. In the second phase, a classifier is built for each of the clusters. To predict the outcome of an ongoing case at runtime given its (uncompleted) trace, we select the closest cluster(s) to the trace in question and apply the respective classifier(s), taking into account the Euclidean distance of the trace from the center of the clusters. We consider two families of clustering algorithms – hierarchical clustering and k-medoids – and use random forests for classification. The approach was evaluated on four real-life datasets.
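The cluster-then-classify structure can be sketched in plain Python under heavy simplifications. Prefixes are encoded as activity-frequency vectors (an assumed encoding, much simpler than the complex symbolic sequences in the paper), clustering is a basic alternating k-medoids started from user-supplied medoids, and each cluster's "classifier" is just its majority label, standing in for the random forests the paper uses. Activity names and cases are hypothetical.

```python
import math
from collections import Counter

ALPHABET = ["register", "check", "rework", "escalate"]  # hypothetical activities

def encode(prefix):
    """Frequency-encode a trace prefix (a list of activity names)."""
    counts = Counter(prefix)
    return tuple(counts[a] for a in ALPHABET)

def nearest(point, medoids):
    """Index of the medoid closest to point in Euclidean distance."""
    return min(range(len(medoids)), key=lambda k: math.dist(point, medoids[k]))

def k_medoids(points, initial_medoids, max_iter=20):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = [[] for _ in medoids]
        for p in points:
            clusters[nearest(p, medoids)].append(p)
        updated = [
            min(c, key=lambda m: sum(math.dist(m, q) for q in c)) if c else medoids[k]
            for k, c in enumerate(clusters)
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids

def train(prefixes, labels, initial_medoids):
    """Phase 1: cluster encoded prefixes. Phase 2: one majority label per cluster."""
    points = [encode(p) for p in prefixes]
    medoids = k_medoids(points, initial_medoids)
    votes = [Counter() for _ in medoids]
    for p, label in zip(points, labels):
        votes[nearest(p, medoids)][label] += 1
    classifiers = [v.most_common(1)[0][0] if v else None for v in votes]
    return medoids, classifiers

def predict(prefix, medoids, classifiers):
    """Route an ongoing case to its closest cluster and apply that cluster's model."""
    return classifiers[nearest(encode(prefix), medoids)]

prefixes = [
    ["register", "check"],
    ["register", "check", "check"],
    ["register", "rework", "escalate"],
    ["register", "rework", "rework"],
]
labels = ["on time", "on time", "late", "late"]
medoids, classifiers = train(prefixes, labels, [encode(prefixes[0]), encode(prefixes[2])])
outcome = predict(["register", "rework"], medoids, classifiers)
```

Training one model per cluster lets each model specialise on structurally similar prefixes; the routing step at prediction time only needs distances to the cluster centres.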
Abstract:
Understanding the effects of different types and quality of data on bioclimatic modeling predictions is vital to ascertaining the value of existing models, and to improving future models. Bioclimatic models were constructed using the CLIMEX program, using different data types – seasonal dynamics, geographic (overseas) distribution, and a combination of the two – for two biological control agents for the major weed Lantana camara L. in Australia. The models for one agent, Teleonemia scrupulosa Stål (Hemiptera: Tingidae), were based on a higher quality and quantity of data than the models for the other agent, Octotoma scabripennis Guérin-Méneville (Coleoptera: Chrysomelidae). Predictions of the geographic distribution for Australia showed that T. scrupulosa models exhibited greater accuracy, with a progressive improvement from the model based on seasonal dynamics data, to the model based on overseas distribution, and finally the model combining the two data types. In contrast, O. scabripennis models were of low accuracy, and showed no clear trends across the various model types. These case studies demonstrate the importance of high quality data for developing models, and of supplementing distributional data with species seasonal dynamics data wherever possible. Seasonal dynamics data allow the modeller to focus on the species response to climatic trends, while distributional data enable easier fitting of stress parameters by restricting the species envelope to the described distribution. It is apparent that CLIMEX models based on low quality seasonal dynamics data, together with a small quantity of distributional data, are of minimal value in predicting the spatial extent of species distribution.
Abstract:
BACKGROUND: The inability to consistently guarantee internal quality of horticulture produce is of major importance to the primary producer, marketers and ultimately the consumer. Currently, commercial avocado maturity estimation is based on the destructive assessment of percentage dry matter (%DM), and sometimes percentage oil, both of which are highly correlated with maturity. In this study the utility of Fourier transform (FT) near-infrared spectroscopy (NIRS) was investigated for the first time as a non-invasive technique for estimating %DM of whole intact 'Hass' avocado fruit. Partial least squares regression models were developed from the diffuse reflectance spectra to predict %DM, taking into account effects of intra-seasonal variation and orchard conditions. RESULTS: It was found that combining three harvests (early, mid and late) from a single farm in the major production district of central Queensland yielded a predictive model for %DM with a coefficient of determination for the validation set of 0.76 and a root mean square error of prediction of 1.53% for DM in the range 19.4-34.2%. CONCLUSION: The results of the study indicate the potential of FT-NIRS in diffuse reflectance mode to non-invasively predict %DM of whole 'Hass' avocado fruit. When the FT-NIRS system was assessed on whole avocados, the results compared favourably against data from other NIRS systems identified in the literature that have been used in research applications on avocados.