965 results for INCOMPLETE-DATA


Relevance:

80.00%

Publisher:

Abstract:

In the context of either Bayesian or classical sensitivity analyses of over-parametrized models for incomplete categorical data, it is well known that prior dependence of posterior inferences for nonidentifiable parameters, or the choice of overly parsimonious over-parametrized models, may lead to erroneous conclusions. Nevertheless, some authors either pay no attention to which parameters are nonidentifiable or do not appropriately account for possible prior dependence. We review the literature on this topic and consider simple examples to emphasize that in both inferential frameworks, the subjective components can influence results in nontrivial ways, irrespective of the sample size. Specifically, we show that prior distributions commonly regarded as slightly informative or noninformative may actually be too informative for nonidentifiable parameters, and that the choice of over-parametrized models may drastically affect the results, suggesting that a careful examination of their effects should precede any conclusions.
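The point about "noninformative" priors still driving the answer can be seen in a toy nonidentifiable model (made-up numbers, not an example from the paper): if y ~ Binomial(n, θ1·θ2), only the product θ1·θ2 is identified, so the posterior for θ1 alone keeps tracking its prior no matter how large n gets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonidentifiable model: y ~ Binomial(n, theta1 * theta2).
# Only the product is identified, so inference on theta1 stays prior-driven.
n, y = 10_000, 4_000                       # observed proportion 0.40

def posterior_mean_theta1(a, b, draws=200_000):
    """ABC-style check: draw (theta1, theta2) from the prior and keep draws
    whose product lies in a narrow window around the observed proportion."""
    t1 = rng.beta(a, b, draws)             # Beta(a, b) prior on theta1
    t2 = rng.uniform(0.0, 1.0, draws)      # flat prior on theta2
    keep = np.abs(t1 * t2 - y / n) < 0.005
    return t1[keep].mean()

# Two priors that both look vague give clearly different posteriors,
# and increasing n would not reconcile them.
m_flat = posterior_mean_theta1(1, 1)       # uniform prior, mean ~0.65
m_skew = posterior_mean_theta1(2, 1)       # mildly informative, mean ~0.70
print(round(m_flat, 2), round(m_skew, 2))
```

The gap between the two posterior means is the prior dependence the abstract warns about: it reflects the priors, not the data.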

Relevance:

70.00%

Publisher:

Abstract:

In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. To this end, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods to formulate web search queries that are capable of retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is also proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, to improve the accuracy and efficiency of WebPut. Experiments based on several real-world data collections demonstrate that WebPut outperforms existing approaches.

Relevance:

70.00%

Publisher:

Abstract:

In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. To this end, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods to formulate web search queries that are capable of retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, to improve the accuracy and efficiency of WebPut. In addition, several optimization techniques are proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple level and the database level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques.
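The greedy, confidence-based scheduling idea can be caricatured in a few lines (all names, scores, and the dependency model here are hypothetical; WebPut's actual scoring is not described in this excerpt): issue the most confident imputation query first, since a newly filled value may raise the confidence of queries for related missing cells.

```python
# Hypothetical missing cells with initial query confidences (toy values).
missing = {"r1.city": 0.9, "r1.zip": 0.4, "r2.city": 0.7}
# Toy dependency model: filling one cell boosts related queries.
boosts = {"r1.city": {"r1.zip": 0.3}}

order = []
while missing:
    best = max(missing, key=missing.get)   # most confident query first
    order.append(best)
    # Propagate the (hypothetical) confidence gain to dependent cells.
    for dep, gain in boosts.get(best, {}).items():
        if dep in missing:
            missing[dep] = min(1.0, missing[dep] + gain)
    del missing[best]

print(order)
```

Here "r1.zip" jumps the queue once "r1.city" is filled, which is the point of scheduling imputations greedily rather than in a fixed order.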

Relevance:

70.00%

Publisher:

Abstract:

An increasing number of parameter estimation tasks involve the use of at least two information sources, one complete but limited, the other abundant but incomplete. Standard algorithms such as EM (or em) used in this context are unfortunately not stable in the sense that they can lead to a dramatic loss of accuracy with the inclusion of incomplete observations. We provide a more controlled solution to this problem through differential equations that govern the evolution of locally optimal solutions (fixed points) as a function of the source weighting. This approach permits us to explicitly identify any critical (bifurcation) points leading to choices unsupported by the available complete data. The approach readily applies to any graphical model in O(n^3) time where n is the number of parameters. We use the naive Bayes model to illustrate these ideas and demonstrate the effectiveness of our approach in the context of text classification problems.
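The instability the abstract describes can be probed with a minimal re-implementation of source weighting (a sketch, not the authors' differential-equation method): EM for a two-component Gaussian mixture in which the incomplete (unlabeled) source's log-likelihood is weighted by lam, re-run over values of lam instead of tracked continuously.

```python
import numpy as np

rng = np.random.default_rng(1)

# Complete source: a few labeled points; incomplete source: unlabeled points.
x_lab = np.array([-2.1, -1.9, 2.0, 2.2])
y_lab = np.array([0, 0, 1, 1])
x_unl = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

def weighted_em(lam, iters=200):
    """EM for a two-component, unit-variance Gaussian mixture where the
    unlabeled (incomplete) log-likelihood is weighted by lam in [0, 1].
    This re-runs EM per lam; the paper instead follows the fixed-point
    path with differential equations to expose bifurcation points."""
    mu = np.array([-1.0, 1.0])
    for _ in range(iters):
        # E-step: responsibilities of each component for unlabeled points.
        d = -(x_unl[:, None] - mu[None, :]) ** 2 / 2
        r = np.exp(d - d.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: means from labeled counts plus lam-weighted soft counts.
        for k in (0, 1):
            m = (y_lab == k)
            num = x_lab[m].sum() + lam * (r[:, k] * x_unl).sum()
            den = m.sum() + lam * r[:, k].sum()
            mu[k] = num / den
    return mu

mu0 = weighted_em(0.0)   # lam = 0: labeled (complete) data only
mu1 = weighted_em(1.0)   # lam = 1: standard semi-supervised EM
print(mu0, mu1)
```

Sweeping lam from 0 to 1 and watching where the fixed point jumps is the discrete analogue of the bifurcation tracking described above.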

Relevance:

70.00%

Publisher:

Abstract:

Retrospective clinical datasets are often characterized by a relatively small sample size and many missing values. In this case, a common way of handling the missingness is to discard patients with missing covariates from the analysis, further reducing the sample size. Alternatively, if the mechanism that generated the missingness allows, incomplete data can be imputed on the basis of the observed data, avoiding the reduction of the sample size and allowing complete-data methods to be applied afterwards. Moreover, methodologies for data imputation may depend on the particular purpose and may achieve better results by exploiting specific characteristics of the domain. Here, the problem of missing data treatment is studied in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem with surrogate splits, that is, splitting rules that use other variables to yield results similar to the original ones. Instead, our methodology consists in modeling the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied to the completed dataset. The Bayesian network is learned directly from the incomplete data using a structural expectation-maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is the EM formulation itself. On both simulated and real data, our proposed methodology usually outperformed several existing methods for data imputation, and the imputation so obtained improved the stratification estimated by the survival tree (especially with respect to using surrogate splits).

Relevance:

70.00%

Publisher:

Abstract:

When missing data occur in studies designed to compare the accuracy of diagnostic tests, a common, though naive, practice is to base the comparison of sensitivity and specificity, as well as of positive and negative predictive values, on some subset of the data that fits into methods implemented in standard statistical packages. Such methods are usually valid only under the strong missing completely at random (MCAR) assumption and may generate biased and less precise estimates. We review some models that use the dependence structure of the completely observed cases to incorporate the information from the partially categorized observations into the analysis and show how they may be fitted via a two-stage hybrid process involving maximum likelihood in the first stage and weighted least squares in the second. We indicate how computational subroutines written in R may be used to fit the proposed models and illustrate the different analysis strategies with observational data collected to compare the accuracy of three distinct non-invasive diagnostic methods for endometriosis. The results indicate that even when the MCAR assumption is plausible, the naive partial analyses should be avoided.
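The benefit of using the partially categorized observations can be sketched with a small EM fit under MCAR (illustrative counts; this is the textbook EM for a partially classified 2x2 table, not the paper's two-stage ML/WLS procedure): the test margins of the incompletely classified cases sharpen the cell probabilities, and sensitivity and specificity follow.

```python
import numpy as np

# Fully classified disease-by-test counts (illustrative, not from the paper).
full = np.array([[40.0, 10.0],    # diseased: 40 test+, 10 test-
                 [ 5.0, 45.0]])   # healthy:   5 test+, 45 test-
# Partially classified: test result known, disease status missing (MCAR).
partial = np.array([20.0, 30.0])  # test+ / test- counts, disease unknown

p = full / full.sum()             # initial cell probabilities
for _ in range(500):
    # E-step: split each partially classified count over disease status
    # in proportion to the current cell probabilities (valid under MCAR).
    frac = p / p.sum(axis=0, keepdims=True)
    filled = full + frac * partial[None, :]
    # M-step: re-estimate the cell probabilities from the completed table.
    p = filled / filled.sum()

sens = p[0, 0] / p[0].sum()       # P(test+ | diseased)
spec = p[1, 1] / p[1].sum()       # P(test- | healthy)
print(round(sens, 3), round(spec, 3))
```

The estimates differ slightly from the complete-case values (0.8 and 0.9) because the extra margins re-weight P(test+), which is exactly the information a complete-case analysis throws away.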

Relevance:

70.00%

Publisher:

Abstract:

We review some issues related to the implications of different missing data mechanisms on statistical inference for contingency tables and consider simulation studies to compare the results obtained under such models to those where the units with missing data are disregarded. We confirm that although, in general, analyses under the correct missing at random (MAR) and missing completely at random (MCAR) models are more efficient even for small sample sizes, there are exceptions where they may not improve the results obtained by ignoring the partially classified data. We show that under the missing not at random (MNAR) model, estimates on the boundary of the parameter space, as well as lack of identifiability of the parameters of saturated models, may be associated with undesirable asymptotic properties of maximum likelihood estimators and likelihood ratio tests; even in standard cases the bias of the estimators may be low only for very large samples. We also show that the probability of a boundary solution obtained under the correct MNAR model may be large even for large samples and that, consequently, we may not always conclude that an MNAR model is misspecified because the estimate is on the boundary of the parameter space.

Relevance:

70.00%

Publisher:

Abstract:

A new version of the TomoRebuild data reduction software package is presented for the reconstruction of scanning transmission ion microscopy tomography (STIMT) and particle-induced X-ray emission tomography (PIXET) images. First, we present a state of the art of the reconstruction codes available for ion beam microtomography. The algorithm proposed here brings several advantages. It is a portable, multi-platform code, designed in C++ with well-separated classes for easier use and evolution. Data reduction is separated into different steps, and the intermediate results may be checked if necessary. Although no additional graphics library or numerical tool is required to run the program from the command line, a user-friendly interface was designed in Java as an ImageJ plugin. All experimental and reconstruction parameters may be entered either through this plugin or directly in text files. A simple standard format is proposed for the input of experimental data. Optional graphic applications using the ROOT interface may be used separately to display and fit energy spectra. Regarding the reconstruction process, the filtered backprojection (FBP) algorithm, already present in the previous version of the code, was optimized to run about 10 times faster. In addition, the Maximum Likelihood Expectation Maximization (MLEM) algorithm and its accelerated version, Ordered Subsets Expectation Maximization (OSEM), were implemented. A detailed user guide in English is available. A reconstruction example using experimental data from a biological sample is given. It shows the capability of the code to reduce noise in the sinograms and to deal with incomplete data, which opens a new perspective on tomography with a low number of projections or a limited angle.
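The MLEM algorithm mentioned above takes one multiplicative update per iteration; the following is a generic sketch on a toy system (a random matrix standing in for the STIMT projector, not code from TomoRebuild). OSEM applies the same update to subsets of the projections in turn.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy forward model: 24 projection bins, 16 unknown voxels. A dense random
# matrix stands in for the tomographic geometry in this sketch.
A = rng.uniform(0.1, 1.0, (24, 16))
x_true = rng.uniform(0.5, 2.0, 16)
y = A @ x_true                         # noiseless sinogram

# MLEM multiplicative update: x <- x * A^T(y / Ax) / A^T 1
x = np.ones(16)                        # positive start keeps x positive
norm = A.T @ np.ones(24)               # sensitivity (normalization) image
for _ in range(5000):
    x *= (A.T @ (y / (A @ x))) / norm

rel_err = np.linalg.norm(A @ x - y) / np.linalg.norm(y)
print(rel_err)
```

Because the update is multiplicative, nonnegativity is preserved automatically, which is one reason MLEM degrades gracefully on incomplete or few-projection data compared with FBP.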

Relevance:

70.00%

Publisher:

Abstract:

The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data is used appropriately and effectively, knowledge discovery can be better achieved than is possible from only a single source.

Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources, representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain.

The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools SVM and KNN were used to successfully distinguish between several soil samples.

The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to perform better than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources.

The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from the K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm.

The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database, called PlasmoTFBM and focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface that allows biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools.

The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.
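The must-link flavor of constraint-based clustering that GCC builds on can be sketched as follows (illustrative data and a simplified must-link-only scheme, not the GCC algorithm itself): must-linked points are collapsed into weighted super-points, which is exact for the k-means objective, and weighted k-means is run on the result.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two 1-D blobs; the must-link pairs stand in for constraints extracted
# from an incomplete source (toy data, hypothetical constraints).
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(6, 1, 50)])
must_link = [(0, 1), (2, 3), (50, 51)]

# Union-find to collapse each must-link group into one weighted super-point.
parent = list(range(len(x)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
for a, b in must_link:
    parent[find(a)] = find(b)

groups = {}
for i in range(len(x)):
    groups.setdefault(find(i), []).append(i)
pts = np.array([x[idx].mean() for idx in groups.values()])
wts = np.array([len(idx) for idx in groups.values()], dtype=float)

# Weighted k-means (k = 2) on the super-points; a group's members always
# share a label, so the constraints hold by construction.
centers = np.array([x.min(), x.max()])
for _ in range(50):
    lab = np.abs(pts[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([np.average(pts[lab == k], weights=wts[lab == k])
                        for k in (0, 1)])

centers = np.sort(centers)
print(centers)
```

Since the within-cluster cost of a constrained group decomposes as its size times the centroid-to-center distance plus a constant, clustering the weighted centroids optimizes the same objective as constrained assignment of the raw points.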

Relevance:

70.00%

Publisher:

Abstract:

With the advent of peer-to-peer networks and, more importantly, sensor networks, the desire to extract useful information from continuous and unbounded streams of data has become more prominent. For example, in tele-health applications, sensor-based data streaming systems are used to continuously and accurately monitor Alzheimer's patients and their surrounding environment. Typically, the requirements of such applications necessitate the cleaning and filtering of continuous, corrupted, and incomplete data streams gathered wirelessly in dynamically varying conditions. Yet existing data stream cleaning and filtering schemes are incapable of capturing the dynamics of the environment while simultaneously suppressing the losses and corruption introduced by uncertain environmental, hardware, and network conditions. Consequently, existing data cleaning and filtering paradigms are being challenged. This dissertation develops novel schemes for cleaning data streams received from a wireless sensor network operating under non-linear and dynamically varying conditions. The study establishes a paradigm for validating spatio-temporal associations among data sources to enhance data cleaning. To simplify the complexity of the validation process, the developed solution maps the requirements of the application onto a geometric space and identifies the potential sensor nodes of interest. Additionally, this dissertation models a wireless sensor network data reduction system by ascertaining that segregating the data adaptation and prediction processes augments the data reduction rates. The schemes presented in this study are evaluated using simulation and information theory concepts. The results demonstrate that dynamic conditions of the environment are better managed when validation is used for data cleaning. They also show that when a fast-convergent adaptation process is deployed, data reduction rates are significantly improved.
Targeted applications of the developed methodology include machine health monitoring, tele-health, environment and habitat monitoring, intermodal transportation, and homeland security.


Relevance:

70.00%

Publisher:

Abstract:

Global databases of calcium carbonate concentrations and mass accumulation rates in Holocene and Last Glacial Maximum sediments were used to estimate the deep-sea sedimentary calcium carbonate burial rate during these two time intervals. Sparse calcite mass accumulation rate data were extrapolated across regions of varying calcium carbonate concentration using a gridded map of calcium carbonate concentrations and the assumption that accumulation of noncarbonate material is uncorrelated with calcite concentration within a given geographical region. Mean noncarbonate accumulation rates were estimated within each of nine regions, determined by the distribution and nature of the accumulation rate data. For core-top sediments, the regions of reasonable data coverage encompass 67% of the high-calcite (>75%) sediments globally, and within these regions we estimate an accumulation rate of 55.9 ± 3.6 × 10^11 mol/yr. The same regions cover 48% of glacial high-CaCO3 sediments (the smaller fraction is due to a shift of calcite deposition to the poorly sampled South Pacific) and total 44.1 ± 6.0 × 10^11 mol/yr. Projecting both estimates to 100% coverage yields accumulation estimates of 8.3 × 10^12 mol/yr today and 9.2 × 10^12 mol/yr during glacial time. This is little better than a guess given the incomplete data coverage, but it suggests that the glacial deep-sea calcite burial rate was probably not considerably faster than today's, in spite of a presumed decrease in shallow-water burial during glacial time.
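The projection to full coverage quoted above can be checked by simple proportional scaling (assuming, as the abstract implies, that the within-region rate per unit of high-CaCO3 area applies to the unsampled fraction):

```python
# Consistency check of the coverage projection: dividing the within-region
# totals by the areal coverage fraction reproduces the quoted global values.
holocene_regional = 55.9e11   # mol/yr within regions covering 67% of area
glacial_regional = 44.1e11    # mol/yr within regions covering 48% of area

holocene_global = holocene_regional / 0.67   # ~8.3e12 mol/yr
glacial_global = glacial_regional / 0.48     # ~9.2e12 mol/yr
print(f"{holocene_global:.3g} {glacial_global:.3g}")
```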

Relevance:

70.00%

Publisher:

Abstract:

The multivariate normal distribution is commonly encountered in many fields, and missing values are a frequent issue in practice. The purpose of this research was to estimate the parameters of the trivariate normal distribution with permutation-symmetric covariance, using complete data and all possible patterns of incomplete data. In this study, maximum likelihood estimators (MLEs) with missing data were derived, and the properties of the MLEs as well as their sampling distributions were obtained. A Monte Carlo simulation study was used to evaluate the performance of the considered estimators both when ρ was known and when it was unknown. All results indicated that, compared to estimators obtained by omitting observations with missing data, the estimators derived in this article performed better. Furthermore, when ρ was unknown, using the estimate of ρ led to the same conclusion.
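A minimal illustration of the complete-case versus all-available-data contrast (a sketch with arbitrary parameter values, not the paper's MLE derivation): under MCAR both mean estimators are unbiased, but discarding incomplete rows throws away roughly two thirds of the sample here.

```python
import numpy as np

rng = np.random.default_rng(4)

# Trivariate normal with permutation-symmetric covariance: unit variances,
# common correlation rho (illustrative values).
rho, mu = 0.5, np.array([1.0, 2.0, 3.0])
S = np.full((3, 3), rho) + (1 - rho) * np.eye(3)
X = rng.multivariate_normal(mu, S, size=2000)

# MCAR: delete ~30% of entries completely at random.
mask = rng.uniform(size=X.shape) < 0.3
Xm = np.where(mask, np.nan, X)

# Complete-case estimate discards any row with a missing entry (~66% of
# rows here); the available-case estimate uses every observed entry.
cc = Xm[~np.isnan(Xm).any(axis=1)].mean(axis=0)
ac = np.nanmean(Xm, axis=0)
print(np.round(cc, 2), np.round(ac, 2))
```

Both vectors land near (1, 2, 3), but the available-case estimate is built from roughly twice as many observations per coordinate, which is the efficiency gap the simulation study in the abstract quantifies for the full MLE.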

Relevance:

60.00%

Publisher:

Abstract:

The quality of early life experiences is known to influence a child’s capacities for emotional, social, cognitive and physical competence throughout their life (Peterson, 1996; Zubrick et al., 2008). These early life experiences are directly affected by parenting and family environments. A lack of positive parenting has significant implications both for children, and the broader communities in which they live (Davies & Cummings, 1994; Dryfoos, 1990; Sanders, 1995). Young parents are known to be at risk of experiencing adverse circumstances that affect their ability to provide positive parenting to their children (Milan et al., 2004; Trad, 1995). There is a need to provide parenting support programs to young parents that offer opportunities for them to come together, support each other and learn ways to provide for their children’s developmental needs in a friendly, engaging and non-judgemental environment. This research project examines the effectiveness of a 10-week group music therapy program, Sing & Grow, as an early parenting intervention for 535 young parents. Sing & Grow is a national early parenting intervention program funded by the Australian Government and delivered by Playgroup Queensland. It is designed and delivered by Registered Music Therapists for families at risk of marginalisation with children aged from birth to three years. The aim of the program is to improve parenting skills and parent-child interactions, and increase social support networks through participation in a group that is strengths-based and structured in a way that lends itself to modelling, peer learning and facilitated learning. During the 10 weeks parents have opportunities to learn practical, hands-on ways to interact and play with their children that are conducive to positive parent-child relationships and ongoing child development.
A range of interactive, nurturing, stimulating and developmental music activities provide the framework for parents to interact and play with their children. This research uses data collected through the Sing & Grow National Evaluation Study to examine outcomes for all participants aged 25 years and younger who attended programs during the Sing & Grow pilot study and main study from mid-2005 to the end of 2007. The research examines the change from pre to post in self-reported parent behaviours, parent mental health and parent social support, and therapist-observed parent-child interactions. A range of statistical analyses are used to address each Research Objective for the young parent population, and for subgroups within this population. Research Objective 1 explored the patterns of attendance in the Sing & Grow program for young parents, and for subgroups within this population. Results showed that levels of attendance were lower than expected and influenced by Indigenous status and source of family income. Patterns of attendance showed a decline over time, and incomplete data rates were high, which may indicate high dropout rates. Research Objective 2 explored perceived satisfaction, benefits and social support links made. Satisfaction levels with the program and staff were very high. Indigenous status was associated with lower levels of reported satisfaction with both the program and staff. Perceived benefits from participation in the program were very high. Employment status was associated with perceived benefits: parents who were not employed were more likely than employed parents to report that their understanding of child development had increased as a result of participation in the program. Social support connections were reported for participants with other professionals, services and parents. In particular, families were more likely to link up with playgroup staff and services.
Those parents who attended six or more sessions were significantly more likely to attend a playgroup than those who attended five sessions or fewer. Social support connections were related to source of family income, level of education, Indigenous status and language background. Research Objective 3 investigated pre to post change in self-reported parenting skills and parent mental health. Results indicated that participation in the Sing & Grow program was associated with improvements in parent mental health. No improvements were found for self-reported parenting skills. Research Objective 4 investigated pre to post change in therapist-observed measures of parent-child interactions. Results indicated that participation in the Sing & Grow program was associated with large and significant improvements in parent sensitivity to, engagement with, and acceptance of the child. There were significant interactions across time (pre to post) for the parent characteristics of Indigenous status, family income and level of education. Research Objective 5 explored the relationship between the number of sessions attended and extent of change on self-report outcomes and therapist-observed outcomes, respectively. For each, an overall change score was devised to ascertain those parents who had made any positive changes over time. Results showed that there was no significant relationship between high attendance and positive change in either the self-report or therapist-observed behavioural measures. A risk index was also constructed to test for a relationship between the risk status of the parent and attendance and outcomes. Parents with the highest risk status were significantly more likely to attend six or more sessions than other parents, but risk status was not associated with any differences in parent-reported outcomes or therapist observations.
The results of this research study indicate that Sing & Grow is effective in improving outcomes for young parents’ mental health, parent-child interactions and social support connections. High attendance by families in the highest category for risk factors may indicate that the program is effective at engaging and retaining parents who are most at-risk and therefore traditionally hard to reach. Very high levels of satisfaction and perceived benefits support this. Further research is required to help confirm the promising evidence from the current study that a short term group music therapy program can support young parents and improve their parenting outcomes. In particular, this needs to address the more disappointing outcomes of the current research study to improve attendance and engagement of all young parents in the program and especially the needs of young Indigenous parents.

Relevance:

60.00%

Publisher:

Abstract:

We formalise and present a new generic multifaceted complex-system approach for modelling complex business enterprises. Our method has a strong focus on integrating the various data types available in an enterprise, which represent the diverse perspectives of various stakeholders. We explain the challenges faced and define a novel approach to converting diverse data types into usable Bayesian probability forms. The data types that can be integrated include historic data, survey data, management planning data, expert knowledge, and incomplete data. The structural complexities of the complex-system modelling process, based on various decision contexts, are also explained along with a solution. This new application of complex-system models as a management tool for decision making is demonstrated using a railway transport case study. The case study demonstrates how the new approach can be utilised to develop a customised decision support model for a specific enterprise. Various decision scenarios are also provided to illustrate the versatility of the decision model at different phases of enterprise operations, such as planning and control.