791 results for Missing values
Abstract:
The multivariate normal distribution is commonly encountered across fields, and missing values are a frequent issue in practice. The purpose of this research was to estimate the parameters of the three-dimensional normal distribution with permutation-symmetric covariance, using complete data and all possible patterns of incomplete data. In this study, the MLEs under missing data were derived, and the properties of the MLEs as well as their sampling distributions were obtained. A Monte Carlo simulation study was used to evaluate the performance of the considered estimators both when ρ was known and when it was unknown. All results indicated that, compared with estimators obtained by omitting observations with missing data, the estimators derived in this article performed better. Furthermore, when ρ was unknown, using an estimate of ρ led to the same conclusion.
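A minimal Monte Carlo sketch of the comparison described above, assuming a trivariate equicorrelated normal with known ρ and values missing completely at random in one coordinate; the paper's closed-form MLEs are replaced by a simple available-case mean estimator, so this only illustrates why discarding incomplete observations loses precision, not the exact estimators derived in the article.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n, reps = 0.5, 60, 2000
mu = np.array([1.0, 2.0, 3.0])
cov = np.full((3, 3), rho) + np.diag([1 - rho] * 3)   # equicorrelated, unit variances

mse_cc, mse_ac = 0.0, 0.0
for _ in range(reps):
    x = rng.multivariate_normal(mu, cov, size=n)
    # make the third coordinate missing completely at random for ~30% of rows
    miss = rng.random(n) < 0.3
    x_miss = x.copy()
    x_miss[miss, 2] = np.nan

    # complete-case estimator: drop any row with a missing value
    cc = x_miss[~miss].mean(axis=0)
    # available-case estimator: use every observed value per coordinate
    ac = np.nanmean(x_miss, axis=0)

    mse_cc += np.sum((cc - mu) ** 2)
    mse_ac += np.sum((ac - mu) ** 2)

print("complete-case MSE:", mse_cc / reps)
print("available-case MSE:", mse_ac / reps)
```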
Abstract:
Background: The two-stage Total Laparoscopic Hysterectomy (TLH) versus Total Abdominal Hysterectomy (TAH) for stage I endometrial cancer (LACE) randomised controlled trial was initiated in 2005. The primary objective of stage 1 was to assess whether TLH results in equivalent or improved QoL up to 6 months after surgery compared to TAH. The primary objective of stage 2 was to test the hypothesis that disease-free survival at 4.5 years is equivalent for TLH and TAH. Results addressing the primary objective of stage 1 of the LACE trial are presented here. Methods: The first 361 LACE participants (TAH n=142, TLH n=190) were enrolled in the QoL substudy at 19 centres across Australia, New Zealand and Hong Kong, and 332 completed the QoL analysis. Randomisation was performed centrally and independently from other study procedures via a computer-generated, web-based system (providing concealment of the next assigned treatment) using stratified permuted blocks of 3 and 6, and assigned patients with histologically confirmed stage 1 endometrioid endometrial adenocarcinoma and ECOG performance status <2 to TLH or TAH, stratified by histological grade and study centre. No blinding of patients or study personnel was attempted. QoL was measured at baseline, 1 and 4 weeks (early), and 3 and 6 months (late) after surgery using the Functional Assessment of Cancer Therapy-General (FACT-G) questionnaire. The primary endpoint was the difference between the groups in QoL change from baseline at the early and late time points (a 5% difference was considered clinically significant). Analysis was performed according to the intention-to-treat principle using generalised estimating equations on differences from baseline for early and late QoL recovery. The LACE trial is registered with clinicaltrials.gov (NCT00096408) and the Australian New Zealand Clinical Trials Registry (CTRN12606000261516). Patients for both stages of the trial have now been recruited and are being followed up for disease-specific outcomes. Findings: The proportion of missing values at the 5%, 10%, 15% and 20% differences in the FACT-G scale was 6% (12/190) in the TLH group and 14% (20/142) in the TAH group. There were 8/332 conversions (2.4%, 7 of which were from TLH to TAH). In the early phase of recovery, patients undergoing TLH reported significantly greater improvement in QoL from baseline compared to TAH in all subscales except the emotional and social well-being subscales. Improvements in QoL up to 6 months post-surgery continued to favour TLH, except for the emotional and social well-being subscales of the FACT-G and the visual analogue scale of the EuroQoL five dimensions (EuroQoL-VAS). Operating time was significantly longer in the TLH group (138±43 min) than in the TAH group (109±34 min; p=0.001). While the proportion of intraoperative adverse events was similar between the treatment groups (TAH 8/142, 5.6%; TLH 14/190, 7.4%; p=0.55), postoperatively twice as many patients in the TAH group experienced adverse events of CTC grade 3+ as in the TLH group (33/142, 23.2% and 22/190, 11.6%, respectively; p=0.004). Postoperative serious adverse events occurred more frequently in patients who had a TAH (27/142, 19.0%) than a TLH (15/190, 7.9%) (p=0.002). Interpretation: QoL improvements from baseline during the early and later phases of recovery, and the adverse event profile, significantly favour TLH over TAH for patients treated for stage I endometrial cancer.
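As a rough illustration of the analysis described under Methods, the sketch below fits a generalised estimating equation with an exchangeable working correlation to simulated QoL changes from baseline; the patient counts, effect sizes, and variable names such as qol_change, group, and visit are invented for the example and are not trial data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_patients, visits = 100, 4          # illustrative sizes, not trial numbers
rows = []
for pid in range(n_patients):
    group = rng.integers(0, 2)       # 0 = TAH-like, 1 = TLH-like (labels are placeholders)
    for visit in range(visits):
        change = 2.0 * group + rng.normal(scale=5.0)   # simulated QoL change from baseline
        rows.append({"patient": pid, "group": group, "visit": visit, "qol_change": change})
df = pd.DataFrame(rows)

# GEE on differences from baseline with an exchangeable working correlation,
# mirroring the kind of model used for the early/late recovery endpoints.
model = smf.gee("qol_change ~ group + visit", groups="patient", data=df,
                cov_struct=sm.cov_struct.Exchangeable(),
                family=sm.families.Gaussian())
result = model.fit()
print(result.summary())
```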
Abstract:
The purpose of this review is to integrate and summarize specific measurement topics (instrument and metric choice, validity, reliability, how many and what types of days, reactivity, and data treatment) appropriate to the study of youth physical activity. Research-quality pedometers are necessary to aid interpretation of steps per day collected in a range of young populations under a variety of circumstances. Steps per day is the most appropriate metric choice, but steps per minute can be used to interpret time-in-intensity in specifically delimited time periods (e.g., physical education class). Reported intraclass correlations (ICCs) have ranged from .65 over 2 days (although higher values also have been reported for 2 days) to .87 over 8 days (and higher values have been reported for fewer days). Reported ICCs are lower on weekend days (.59) versus weekdays (.75) and lower over vacation days (.69) versus school days (.74). There is no objective evidence of reactivity at this time. Data treatment includes (a) identifying and addressing missing values, (b) identifying outliers and reducing data appropriately if necessary, and (c) transforming the data as required in preparation for inferential analysis. As more pedometry studies in young populations are published, these preliminary methodological recommendations should be modified and refined.
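The three data-treatment steps listed above can be sketched as follows; the step-count bounds and the minimum number of valid days are illustrative placeholders, not the review's recommendations.

```python
import numpy as np
import pandas as pd

# Illustrative daily step counts for a few children over three days (NaN = missing day).
steps = pd.DataFrame({
    "child": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "day":   [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "steps": [11200, np.nan, 9400, 60500, 8700, 10100, 7600, 8200, np.nan],
})

# (a) identify and address missing values: flag them and count valid days per child
steps["missing"] = steps["steps"].isna()
valid_days = steps.groupby("child")["steps"].count()

# (b) identify outliers and reduce the data: the bounds below are placeholders, not the
#     review's cut-offs; implausible counts are truncated to the nearer bound
low, high = 1000, 30000
steps["steps_clean"] = steps["steps"].clip(lower=low, upper=high)

# (c) transform skewed counts (e.g. log) before inferential analysis
steps["log_steps"] = np.log(steps["steps_clean"])

print(valid_days)
print(steps)
```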
Abstract:
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. To this end, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is also proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, to improve the accuracy and efficiency of WebPut. Experiments based on several real-world data collections demonstrate that WebPut outperforms existing approaches.
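The abstract does not spell out how imputation queries are formulated or how their confidence is estimated, so both are stubbed out below; the sketch only illustrates the greedy, confidence-ordered scheduling idea, and the helpers estimate_confidence and run_query are hypothetical placeholders.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class ImputationTask:
    neg_confidence: float                 # negated so the heap pops the most confident task first
    cell: tuple = field(compare=False)    # (row_id, attribute) with the missing value
    query: str = field(compare=False)     # the web search query that would be issued

def estimate_confidence(cell, query):
    """Placeholder for WebPut's confidence scheme (not specified in the abstract)."""
    return 0.5

def run_query(query):
    """Placeholder for issuing the web query and extracting a candidate value."""
    return "extracted-value"

def greedy_impute(tasks):
    """Process missing cells in decreasing order of estimated confidence."""
    heap = [ImputationTask(-estimate_confidence(c, q), c, q) for c, q in tasks]
    heapq.heapify(heap)
    filled = {}
    while heap:
        task = heapq.heappop(heap)
        filled[task.cell] = run_query(task.query)
        # In the real system, newly filled values could refine later queries/confidences here.
    return filled

print(greedy_impute([((1, "city"), "company X headquarters city"),
                     ((2, "year"), "movie Y release year")]))
```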
Abstract:
Modern mobile computing devices are versatile, but bring the burden of constantly adjusting settings to the current conditions of the environment. While this task has so far had to be carried out by the human user, the variety of sensors typically deployed in such a handset provides enough data for autonomous self-configuration by a learning, adaptive system. However, this data is not fully available at certain points in time, or can contain false values. Handling potentially incomplete sensor data to detect context changes without a semantic layer represents a scientific challenge which we address with our approach. A novel machine learning technique is presented - the Missing-Values-SOM - which solves this problem by predicting setting adjustments based on context information. Our method is centered on a self-organizing map, extending it to provide a means of handling missing values. We demonstrate the performance of our approach on mobile context snapshots, as well as on classical machine learning datasets.
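The abstract does not detail how the Missing-Values-SOM handles missing inputs; a common way to make a self-organizing map tolerate them is to compute best-matching-unit distances and weight updates over the observed components only, which is what this assumed sketch does.

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    """Tiny SOM that tolerates NaNs by using only observed components."""
    rng = np.random.default_rng(seed)
    h, w = grid
    dim = data.shape[1]
    weights = rng.random((h, w, dim))
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1)

    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)
        sigma = sigma0 * (1 - epoch / epochs) + 0.5
        for x in data:
            obs = ~np.isnan(x)                              # mask of observed components
            d = np.sum((weights[:, :, obs] - x[obs]) ** 2, axis=2)
            bmu = np.unravel_index(np.argmin(d), d.shape)   # best-matching unit
            dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
            nb = np.exp(-dist2 / (2 * sigma ** 2))          # neighbourhood function
            # update only the observed dimensions, leaving the rest untouched
            weights[:, :, obs] += lr * nb[:, :, None] * (x[obs] - weights[:, :, obs])
    return weights

# toy context snapshots with missing sensor readings
snapshots = np.array([[0.1, 0.9, np.nan],
                      [0.2, np.nan, 0.7],
                      [0.8, 0.1, 0.3]])
som = train_som(snapshots)
print(som.shape)   # (5, 5, 3)
```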
Abstract:
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. To this end, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, to improve the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple level and the database level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques.
Abstract:
This chapter investigates a variety of water quality assessment tools for reservoirs with balanced or unbalanced monitoring designs and focuses on providing informative water quality assessments to ensure decision-makers are able to make risk-informed management decisions about reservoir health. In particular, two water quality assessment methods are described: non-compliance (the proportion of times the indicator exceeds the recommended guideline) and amplitude (the degree of departure from the guideline). Strengths and weaknesses of current and alternative water quality methods are discussed. The proposed methodology is particularly applicable to unbalanced designs with or without missing values; it reflects general conditions and is not swayed too heavily by the occasional extreme value (very high or very low quality). To investigate the issues in greater detail, we use a reservoir within South-East Queensland (SEQ), Australia, as a case study. The purpose is to obtain an annual score that reflects the overall water quality, temporally, spatially and across water quality indicators, for each reservoir.
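A minimal sketch of the two assessment metrics described above, using illustrative readings and a placeholder guideline value; missing samples are simply skipped, and the amplitude score uses a median so a single extreme value does not dominate.

```python
import numpy as np

# Monthly indicator readings for one reservoir site (NaN = missing sample);
# the values and the guideline threshold below are illustrative only.
readings = np.array([4.2, 5.1, np.nan, 7.8, 6.3, np.nan, 9.4, 5.0, 4.8, 6.1, 8.2, 5.5])
guideline = 6.0

observed = readings[~np.isnan(readings)]

# Non-compliance: proportion of samples in which the indicator exceeds the guideline.
non_compliance = np.mean(observed > guideline)

# Amplitude: degree of departure from the guideline among exceedances,
# expressed here as the median relative exceedance so one extreme value
# does not dominate the score.
exceedances = observed[observed > guideline]
amplitude = np.median((exceedances - guideline) / guideline) if exceedances.size else 0.0

print(f"non-compliance: {non_compliance:.2f}, amplitude: {amplitude:.2f}")
```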
Abstract:
In recent years, considerable research efforts have been directed to microarray technologies and their role in providing simultaneous information on expression profiles for thousands of genes. These data, when subjected to clustering and classification procedures, can assist in identifying patterns and providing insight into biological processes. To understand the properties of complex gene expression datasets, graphical representations can be used. Intuitively, the data can be represented as a bipartite graph, with weighted edges corresponding to gene-sample node couples in the dataset. Biologically meaningful subgraphs can be sought, but performance can be influenced both by the search algorithm and by the graph-weighting scheme, and both merit rigorous investigation. In this paper, we focus on edge-weighting schemes for bipartite graphical representation of gene expression. Two novel methods are presented: the first is based on empirical evidence; the second on a geometric distribution. The schemes are compared on several real datasets, assessing performance based on four essential properties: robustness to noise and missing values, discrimination, parameter influence on scheme efficiency, and reusability. Recommendations and limitations are briefly discussed. Keywords: edge-weighting; weighted graphs; gene expression; bi-clustering
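The paper's empirical and geometric-distribution weighting schemes cannot be reproduced from the abstract, so the sketch below uses a simple |z-score| edge weight as a stand-in, only to show the weighted bipartite gene-sample representation and how missing values drop out as absent edges.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
genes = [f"g{i}" for i in range(4)]
samples = [f"s{j}" for j in range(3)]
expr = rng.normal(size=(4, 3))
expr[1, 2] = np.nan                      # a missing expression value

# Edge-weight stand-in: |z-score| of each expression value within its gene,
# NOT the empirical or geometric schemes proposed in the paper.
z = np.abs((expr - np.nanmean(expr, axis=1, keepdims=True))
           / np.nanstd(expr, axis=1, keepdims=True))

G = nx.Graph()
G.add_nodes_from(genes, bipartite=0)
G.add_nodes_from(samples, bipartite=1)
for i, g in enumerate(genes):
    for j, s in enumerate(samples):
        if not np.isnan(z[i, j]):        # skip edges for missing values
            G.add_edge(g, s, weight=float(z[i, j]))

print(G.number_of_edges(), "weighted gene-sample edges")
```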
Abstract:
A new database, called the World Resource Table, is constructed in this study. Missing values are known to produce complications when constructing global databases; this study addresses them by applying multiple imputation techniques and estimates the global environmental Kuznets curve (EKC) for CO2, SO2, PM10, and BOD. Policy implications for each type of emission are derived based on the results of the EKC using WRI. Finally, we predict the future emissions trend and the regional share of CO2 emissions. We find that East Asia and South Asia will increase their emissions share, while other major CO2 emitters will still produce large shares of total global emissions.
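A sketch of the general workflow implied above, on simulated data: multiply impute an incomplete income-emissions panel, fit a quadratic (EKC-style) regression on each completed dataset, and pool the estimates; scikit-learn's IterativeImputer is used here as a stand-in for the paper's multiple imputation technique.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300
ln_gdp = rng.normal(9.0, 1.0, n)
ln_co2 = 1.5 * ln_gdp - 0.08 * ln_gdp**2 + rng.normal(0, 0.2, n)   # inverted-U shape

X = np.column_stack([ln_gdp, ln_co2])
X[rng.random((n, 2)) < 0.15] = np.nan      # 15% of cells missing at random

# Multiple imputation: several stochastic completions of the data matrix.
betas = []
for m in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    Xc = imp.fit_transform(X)
    g, e = Xc[:, 0], Xc[:, 1]
    design = np.column_stack([np.ones(n), g, g**2])
    beta, *_ = np.linalg.lstsq(design, e, rcond=None)
    betas.append(beta)

pooled = np.mean(betas, axis=0)            # pooling of point estimates across imputations
turning_point = np.exp(-pooled[1] / (2 * pooled[2]))
print("pooled EKC coefficients:", pooled)
print("implied income turning point:", turning_point)
```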
Abstract:
Objective Child maltreatment is a problem that has a longer history of recognition in the northern hemisphere and in high-income countries. Recent work has highlighted the nearly universal nature of the problem in other countries but demonstrated the lack of comparability of studies because of the variations in definitions and measures used. The International Society for the Prevention of Child Abuse and Neglect has developed instrumentation that may be used for cross-cultural and cross-national benchmarking by local investigators. Design and sampling The instrument design began with a team of experts in Brisbane in 2004. A large bank of questions was subjected to two rounds of Delphi review to develop the fielded version of the instrument. Convenience samples included approximately 120 parent respondents with children under the age of 18 in each of six countries (697 total). Results This paper presents an instrument that measures parental behaviors directed at children and reports data from pilot work in 6 countries and 7 languages. Patterns of response revealed few missing values and distributions of responses that were generally similar in the six countries. Subscales performed well in terms of internal consistency, with Cronbach's alpha in the very good range (0.77–0.88), with the exception of the neglect and sexual abuse subscales. Results varied by child age and gender in expected directions but with large variations among the samples. About 15% of children were shaken, 24% hit on the buttocks with an object, and 37% spanked. Reports of choking and smothering were made by 2% of parents. Conclusion These pilot data demonstrate that the instrument is well tolerated and captures variations in, and potentially harmful forms of, child discipline. Practice implications The ISPCAN Child Abuse Screening Tool – Parent Version (ICAST-P) has been developed as a survey instrument to be administered to parents for the assessment of child maltreatment in a multi-national and multi-cultural context. It was developed with broad input from international experts and subjected to Delphi review, translation, and pilot testing in six countries. The results of the Delphi study and pilot testing are presented. This study demonstrates that a single instrument can be used in a broad range of cultures and languages with low rates of missing data and moderate to high internal consistency.
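Cronbach's alpha, the internal-consistency measure quoted above, can be computed as in the minimal sketch below; the item responses are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items):
    """items: respondents x items matrix of numeric responses."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative responses from 6 parents to a 4-item subscale (0-3 frequency codes).
subscale = [[0, 1, 0, 1],
            [2, 2, 1, 2],
            [1, 1, 1, 0],
            [3, 2, 3, 3],
            [0, 0, 1, 0],
            [2, 3, 2, 2]]
print(f"Cronbach's alpha: {cronbach_alpha(subscale):.2f}")
```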
Abstract:
Snapper (Pagrus auratus) is widely distributed throughout subtropical and temperate southern oceans and forms a significant recreational and commercial fishery in Queensland, Australia. Using data from government reports, media sources, popular publications and a government fisheries survey carried out in 1910, we compiled information on individual snapper fishing trips that took place prior to the commencement of fishery-wide organized data collection, from 1871 to 1939. In addition to extracting all available quantitative data, we translated qualitative information into bounded estimates and used multiple imputation to handle missing values, forming 287 records for which catch rate (snapper fisher⁻¹ h⁻¹) could be derived. Uncertainty was handled through a parametric maximum likelihood framework (a transformed trivariate Gaussian), which facilitated statistical comparisons between data sources. No statistically significant differences in catch rates were found among media sources and the government fisheries survey. Catch rates remained stable throughout the time series, averaging 3.75 snapper fisher⁻¹ h⁻¹ (95% confidence interval, 3.42–4.09) as the fishery expanded into new grounds. In comparison, a contemporary (1993–2002) south-east Queensland charter fishery produced an average catch rate of 0.4 snapper fisher⁻¹ h⁻¹ (95% confidence interval, 0.31–0.58). These data illustrate the productivity of a fishery during its earliest years of development and represent the earliest catch rate data globally for this species. By adopting a formalized approach to address issues common to many historical records – missing data, a lack of quantitative information and reporting bias – our analysis demonstrates the potential for historical narratives to contribute to contemporary fisheries management.
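A minimal sketch of the catch-rate unit used above (snapper per fisher per hour) with a bootstrap confidence interval on made-up trip records; the paper's parametric trivariate-Gaussian likelihood and multiple imputation of bounded estimates are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative trip records: (snapper caught, number of fishers, hours fished).
trips = np.array([[30, 3, 4.0],
                  [12, 2, 2.5],
                  [45, 4, 3.0],
                  [8,  1, 2.0],
                  [60, 5, 4.5]])

rates = trips[:, 0] / (trips[:, 1] * trips[:, 2])   # snapper per fisher per hour

# Simple nonparametric bootstrap for a 95% CI around the mean catch rate.
boot_means = [rng.choice(rates, size=rates.size, replace=True).mean()
              for _ in range(5000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean catch rate {rates.mean():.2f} (95% CI {lo:.2f}-{hi:.2f}) snapper per fisher-hour")
```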
Abstract:
Water quality data are often collected at different sites over time to improve water quality management. Water quality data usually exhibit the following characteristics: non-normal distribution, presence of outliers, missing values, values below detection limits (censored), and serial dependence. It is essential to apply appropriate statistical methodology when analyzing water quality data to draw valid conclusions and hence provide useful advice for water management. In this chapter, we provide and demonstrate various statistical tools for analyzing such water quality data, and show how to use the statistical software R to apply these methods. A dataset collected from the Susquehanna River Basin is used to demonstrate the statistical methods covered in this chapter. The dataset can be downloaded from http://www.srbc.net/programs/CBP/nutrientprogram.htm.
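The chapter itself works in R, but for consistency with the other examples here is a Python sketch of two of the listed issues, substituting half the detection limit for censored values and estimating a robust Theil-Sen trend; the concentrations, censoring flags, and detection limit are illustrative.

```python
import numpy as np
from scipy.stats import theilslopes

# Illustrative monthly concentrations (NaN = missing sample) plus lab-reported
# "below detection limit" flags; all values here are invented.
conc = np.array([0.12, 0.08, np.nan, 0.05, 0.04, 0.09,
                 0.11, np.nan, 0.07, 0.06, 0.05, 0.10])
below_dl = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
detection_limit = 0.05

# Simple DL/2 substitution for censored values; dedicated censored-data methods
# are preferable when censoring is heavy.
work = conc.copy()
work[below_dl] = detection_limit / 2

# Robust trend estimate (Theil-Sen) that tolerates outliers and skewness;
# months with missing values are dropped.
t = np.arange(conc.size)
ok = ~np.isnan(work)
slope, intercept, lo, hi = theilslopes(work[ok], t[ok])
print(f"Theil-Sen slope {slope:.4f} per month (95% CI {lo:.4f} to {hi:.4f})")
```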
Abstract:
In recent years, thanks to developments in information technology, large-dimensional datasets have become increasingly available. Researchers now have access to thousands of economic series, and the information contained in them can be used to create accurate forecasts and to test economic theories. To exploit this large amount of information, researchers and policymakers need an appropriate econometric model. Usual time series models, vector autoregressions for example, cannot incorporate more than a few variables. There are two ways to solve this problem: use variable selection procedures, or gather the information contained in the series into an index model. This thesis focuses on one of the most widespread index models, the dynamic factor model (the theory behind this model, based on previous literature, is the core of the first part of this study), and its use in forecasting Finnish macroeconomic indicators (the focus of the second part of the thesis). In particular, I forecast economic activity indicators (e.g. GDP) and price indicators (e.g. the consumer price index) from 3 large Finnish datasets. The first dataset contains a large set of aggregated series obtained from the Statistics Finland database. The second dataset is composed of economic indicators from the Bank of Finland. The last dataset is formed by disaggregated data from Statistics Finland, which I call the micro dataset. The forecasts are computed following a two-step procedure: in the first step, I estimate a set of common factors from the original dataset; the second step consists of formulating forecasting equations that include the previously extracted factors. The predictions are evaluated using the relative mean squared forecast error, where the benchmark model is a univariate autoregressive model. The results are dataset-dependent. The forecasts based on factor models are very accurate for the first dataset (the Statistics Finland one), while they are considerably worse for the Bank of Finland dataset. The forecasts derived from the micro dataset are still good, but less accurate than those obtained in the first case. This work opens up several avenues for further research. The results obtained here can be replicated for longer datasets. The non-aggregated data can be represented in an even more disaggregated form (firm level). Finally, the use of micro data, one of the major contributions of this thesis, can be useful for the imputation of missing values and the creation of flash estimates of macroeconomic indicators (nowcasting).
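A compact sketch of the two-step procedure described above on simulated series: principal-component factors are extracted from a standardized panel (estimated once on the full sample, for brevity) and fed into a forecasting regression whose one-step-ahead errors are compared with a univariate autoregressive benchmark via the relative mean squared forecast error.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
T, N, r = 120, 40, 2                       # periods, series, number of factors

# Simulate a panel driven by two common factors plus noise, and a target series.
F = rng.normal(size=(T, r))
loadings = rng.normal(size=(r, N))
panel = F @ loadings + rng.normal(scale=0.5, size=(T, N))
y = 0.8 * F[:, 0] - 0.5 * F[:, 1] + rng.normal(scale=0.3, size=T)

# Step 1: extract common factors from the standardized panel by principal components.
Z = (panel - panel.mean(0)) / panel.std(0)
fhat = PCA(n_components=r).fit_transform(Z)

# Step 2: one-step-ahead forecasts from a factor-augmented regression,
# compared with a univariate AR(1) benchmark via relative MSFE.
def ols_forecast(X, y_target):
    # regress y(t+1) on X(t) over the sample, then forecast from the latest X row
    beta, *_ = np.linalg.lstsq(X[:-1], y_target[1:], rcond=None)
    return X[-1] @ beta

err_f, err_ar = [], []
for t in range(80, T - 1):
    Xf = np.column_stack([np.ones(t + 1), fhat[: t + 1]])
    Xa = np.column_stack([np.ones(t + 1), y[: t + 1]])
    err_f.append(y[t + 1] - ols_forecast(Xf, y[: t + 1]))
    err_ar.append(y[t + 1] - ols_forecast(Xa, y[: t + 1]))

rel_msfe = np.mean(np.square(err_f)) / np.mean(np.square(err_ar))
print(f"relative MSFE (factor model / AR benchmark): {rel_msfe:.2f}")
```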
Abstract:
The paper deals with a model-theoretic approach to clustering. The approach can be used to generate cluster descriptions based on knowledge alone. Such a process of generating descriptions would be extremely useful in clustering partially specified objects. A natural byproduct of the proposed approach is that missing values of an object's attributes can be estimated with ease in a meaningful fashion. An important feature of the approach is that noisy objects can be detected effectively, leading to the formation of natural groups. The proposed algorithm is applied to a library database consisting of a collection of books.
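The model-theoretic machinery itself is not described in the abstract; the sketch below only illustrates the stated byproduct with a generic stand-in, assigning a partially specified object to its nearest cluster using the observed attributes and filling the missing one from that cluster's centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Fully specified objects form the clusters (e.g. numeric features of catalogued books).
complete = np.vstack([rng.normal([0, 0, 0], 0.3, size=(20, 3)),
                      rng.normal([3, 3, 3], 0.3, size=(20, 3))])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(complete)

# A partially specified object: the last attribute is unknown.
partial = np.array([2.9, 3.1, np.nan])
obs = ~np.isnan(partial)

# Assign it to the nearest cluster using only the observed attributes...
d = np.linalg.norm(km.cluster_centers_[:, obs] - partial[obs], axis=1)
nearest = int(np.argmin(d))

# ...and estimate the missing attribute from that cluster's description (its centroid).
filled = partial.copy()
filled[~obs] = km.cluster_centers_[nearest, ~obs]
print("cluster:", nearest, "completed object:", filled)
```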