910 results for Missing Data


Relevance:

30.00%

Publisher:

Abstract:

Clare, A. and King, R.D. (2002) Machine learning of functional class from phenotype data. Bioinformatics 18(1): 160-166.

Relevance:

30.00%

Publisher:

Abstract:

We propose a novel data-delivery method for delay-sensitive traffic that significantly reduces the energy consumption in wireless sensor networks without reducing the number of packets that meet end-to-end real-time deadlines. The proposed method, referred to as SensiQoS, leverages the spatial and temporal correlation between the data generated by events in a sensor network and realizes energy savings through application-specific in-network aggregation of the data. SensiQoS maximizes energy savings by adaptively waiting for packets from upstream nodes to perform in-network processing without missing the real-time deadline for the data packets. SensiQoS is a distributed packet scheduling scheme, where nodes make localized decisions on when to schedule a packet for transmission to meet its end-to-end real-time deadline and to which neighbor they should forward the packet to save energy. We also present a localized algorithm for nodes to adapt to network traffic to maximize energy savings in the network. Simulation results show that SensiQoS improves energy savings in sensor networks where events are sensed by multiple nodes and spatial and/or temporal correlation exists among the data packets. Energy savings due to SensiQoS increase with the density of the sensor nodes and the size of the sensed events. © 2010 Harshavardhan Sabbineni and Krishnendu Chakrabarty.
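The core scheduling idea, holding a packet for aggregation only as long as the end-to-end deadline allows, can be sketched as below. This is a hypothetical illustration; the function names and the simple per-hop delay budget are assumptions, not details from the paper.

```python
def max_wait_time(deadline_remaining, hops_remaining, per_hop_delay):
    """Longest time a node can hold a packet for in-network aggregation
    while still leaving enough time for the remaining hops to the sink."""
    slack = deadline_remaining - hops_remaining * per_hop_delay
    return max(0.0, slack)

def should_wait_for_upstream(deadline_remaining, hops_remaining,
                             per_hop_delay, expected_upstream_arrival):
    """Hold the packet only if an upstream packet is expected within the
    slack budget; otherwise forward immediately to meet the deadline."""
    return expected_upstream_arrival <= max_wait_time(
        deadline_remaining, hops_remaining, per_hop_delay)
```

With a 1.0 s deadline, 4 hops remaining, and a 0.1 s per-hop delay, a node may hold a packet for up to 0.6 s before forwarding.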

Relevance:

30.00%

Publisher:

Abstract:

BACKGROUND: The National Comprehensive Cancer Network and the American Society of Clinical Oncology have established guidelines for the treatment and surveillance of colorectal cancer (CRC), respectively. Considering these guidelines, an accurate and efficient method is needed to measure receipt of care. METHODS: The accuracy and completeness of Veterans Health Administration (VA) administrative data were assessed by comparing them with data manually abstracted during the Colorectal Cancer Care Collaborative (C4) quality improvement initiative for 618 patients with stage I-III CRC. RESULTS: The VA administrative data contained gender, marital, and birth information for all patients, but race information was missing for 62.1% of patients. The percent agreement for demographic variables ranged from 98.1% to 100%. The kappa statistic for receipt of treatments ranged from 0.21 to 0.60, and there was 96.9% agreement for the date of surgical resection. The percentages of post-diagnosis surveillance events in C4 that were also in VA administrative data were 76.0% for colonoscopy, 84.6% for physician visit, and 26.3% for carcinoembryonic antigen (CEA) test. CONCLUSIONS: VA administrative data are accurate and complete for non-race demographic variables, receipt of CRC treatment, colonoscopy, and physician visits; but alternative data sources may be necessary to capture patient race and receipt of CEA tests.
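The chance-corrected agreement statistic (kappa) reported above is computed from observed versus expected agreement. A minimal sketch for a square agreement table; the counts in the usage example are illustrative, not the study's data:

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (list of lists):
    table[i][j] = cases classified i by source A and j by source B."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_obs = sum(table[i][i] for i in range(k)) / n          # observed agreement
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)
```

For example, a table [[4, 1], [1, 4]] (80% raw agreement, 50% expected by chance) gives kappa = 0.6.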

Relevance:

30.00%

Publisher:

Abstract:

The Symposium, “Towards the sustainable use of Europe’s forests”, with the sub-title “Forest ecosystem and landscape research: scientific challenges and opportunities”, lists three fundamental substantive areas of research: Forest management and practices, Ecosystem processes and functional ecology, and Environmental economics and sociology. This paper argues that essential catalytic elements are missing! Without these elements there is great danger that the aimed-for world leadership in the forest sciences will not materialize. What are the missing elements? All the sciences, and in particular biology, environmental sciences, sociology, economics, and forestry, have evolved so that they include good scientific methodology. Good methodology is imperative in the design and analysis of research studies, the management of research data, and the interpretation of research findings. The methodological disciplines of Statistics, Modelling and Informatics (“SMI”) are crucial elements in a proposed Centre of European Forest Science, and the full involvement of professionals in these methodological disciplines is needed if the research of the Centre is to be world-class. The Distributed Virtual Institute (DVI) for Statistics, Modelling and Informatics in Forestry and the Environment (SMIFE) is a consortium with the aim of providing world-class methodological support and collaboration to European research in the areas of Forestry and the Environment. It is suggested that DVI: SMIFE should be a formal partner in the proposed Centre for European Forest Science.

Relevance:

30.00%

Publisher:

Abstract:

Baited cameras are often used for abundance estimation wherever alternative techniques are precluded, e.g. in abyssal systems and areas such as reefs. This method has thus far used models of the arrival process that are deterministic and, therefore, permit no estimate of precision. Furthermore, errors due to multiple counting of fish and missing those not seen by the camera have restricted the technique to using only the time of first arrival, leaving a lot of data redundant. Here, we reformulate the arrival process using a stochastic model, which allows the precision of abundance estimates to be quantified. Assuming a non-gregarious, cross-current-scavenging fish, we show that prediction of abundance from first arrival time is extremely uncertain. Using example data, we show that simple regression-based prediction from the initial (rising) slope of numbers at the bait gives good precision, accepting certain assumptions. The most precise abundance estimates were obtained by including the declining phase of the time series, using a simple model of departures, and taking account of scavengers beyond the camera’s view, using a hidden Markov model.
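The regression-from-the-rising-slope idea amounts to fitting a line through the early counts at the bait: under a constant-arrival-rate model the slope estimates the arrival rate, which is taken as proportional to local abundance. A hypothetical sketch, not the authors' code:

```python
def initial_slope(times, counts):
    """Least-squares slope of fish counts vs. time over the rising phase
    of a baited-camera time series. Under a constant-arrival-rate model
    the slope estimates the arrival rate at the bait."""
    n = len(times)
    mean_t = sum(times) / n
    mean_c = sum(counts) / n
    cov = sum((t - mean_t) * (c - mean_c) for t, c in zip(times, counts))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var
```

For counts rising as 0, 2, 4, 6 over four equal time steps, the estimated arrival rate is 2 fish per step.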

Relevance:

30.00%

Publisher:

Abstract:

This paper presents a novel method of audio-visual feature-level fusion for person identification where both the speech and facial modalities may be corrupted, and there is a lack of prior knowledge about the corruption. Furthermore, we assume there is a limited amount of training data for each modality (e.g., a short training speech segment and a single training facial image for each person). A new multimodal feature representation and a modified cosine similarity are introduced to combine and compare bimodal features with limited training data, as well as vastly differing data rates and feature sizes. Optimal feature selection and multicondition training are used to reduce the mismatch between training and testing, thereby making the system robust to unknown bimodal corruption. Experiments have been carried out on a bimodal dataset created from the SPIDRE speaker recognition database and the AR face recognition database with variable noise corruption of speech and occlusion in the face images. The system's speaker identification performance on the SPIDRE database, and facial identification performance on the AR database, are comparable with the literature. Combining both modalities using the new method of multimodal fusion leads to significantly improved accuracy over the unimodal systems, even when both modalities have been corrupted. The new method also shows improved identification accuracy compared with bimodal systems based on multicondition model training or missing-feature decoding alone.
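The basic mechanics of feature-level fusion with a cosine score can be sketched as below. This uses a plain cosine similarity over weighted, concatenated modality vectors; the paper's specific modified similarity and feature representation are not reproduced here, and the weighting scheme is an assumption for illustration.

```python
import math

def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fuse(speech_feat, face_feat, w=0.5):
    """Naive feature-level fusion: weight each modality's feature vector
    and concatenate into a single bimodal feature."""
    return [w * x for x in speech_feat] + [(1 - w) * x for x in face_feat]
```

Identification then reduces to scoring a probe's fused vector against each enrolled person's fused template and picking the highest similarity.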

Relevance:

30.00%

Publisher:

Abstract:

Over the last 15 years, the supernova community has endeavoured to directly identify progenitor stars for core-collapse supernovae discovered in nearby galaxies. These precursors are often visible as resolved stars in high-resolution images from space- and ground-based telescopes. The discovery rate of progenitor stars is limited by the local supernova rate and the availability and depth of archive images of galaxies, with 18 detections of precursor objects and 27 upper limits. This review compiles these results (from 1999 to 2013) in a distance-limited sample and discusses the implications of the findings. The vast majority of the detections of progenitor stars are of type II-P, II-L, or IIb, with one type Ib progenitor system detected and many more upper limits for progenitors of Ibc supernovae (14 in all). The data for these 45 supernova progenitors illustrate a remarkable deficit of high-luminosity stars above an apparent limit of log L/L⊙ ≃ 5.1 dex. For a typical Salpeter initial mass function, one would expect to have found 13 high-luminosity and high-mass progenitors by now. There is, possibly, only one object in this time- and volume-limited sample that is unambiguously high-mass (the progenitor of SN2009ip), although the nature of that supernova is still debated. The possible biases due to the influence of circumstellar dust, the luminosity analysis, and sample selection methods are reviewed. It does not appear likely that these can explain the missing high-mass progenitor stars. This review concludes that the community's work to date shows that the observed populations of supernovae in the local Universe are not, on the whole, produced by high-mass (M ≳ 18 M⊙) stars. Theoretical explosions of model stars also predict that black hole formation and failed supernovae tend to occur above an initial mass of M ≃ 18 M⊙. The models also suggest there is no simple single mass division for neutron star or black hole formation and that there are islands of explodability for stars in the 8–120 M⊙ range. The observational constraints are quite consistent with the bulk of stars above M ≃ 18 M⊙ collapsing to form black holes with no visible supernovae.
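The "missing high-mass progenitors" arithmetic can be reproduced with a Salpeter initial mass function, dN/dM ∝ M^(-2.35). The sketch below computes the number-weighted fraction of progenitors above 18 M⊙, assuming all stars between 8 M⊙ and an (assumed) 100 M⊙ upper cut explode as supernovae; it roughly reproduces the ~13 expected detections out of a 45-object sample.

```python
def salpeter_fraction_above(m_cut, m_lo=8.0, m_hi=100.0, alpha=2.35):
    """Fraction of stars in [m_lo, m_hi] with mass above m_cut, for a
    power-law IMF dN/dM ~ M**(-alpha), by number."""
    # Integral of M**(-alpha) is M**(1-alpha)/(1-alpha); the constant
    # cancels in the ratio, so compare M**(1-alpha) differences directly.
    p = 1.0 - alpha
    total = m_hi ** p - m_lo ** p
    above = m_hi ** p - m_cut ** p
    return above / total

frac = salpeter_fraction_above(18.0)   # ~0.31 of core-collapse progenitors
expected = 45 * frac                   # expected high-mass detections in 45
```

With these assumed limits, roughly 31% of progenitors should exceed 18 M⊙, i.e. about 14 of 45, against the 1 (possible) detection discussed above.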

Relevance:

30.00%

Publisher:

Abstract:

In this brief, a hybrid filter algorithm is developed to deal with the state estimation (SE) problem for power systems by taking into account the impact from phasor measurement units (PMUs). Our aim is to include PMU measurements when designing dynamic state estimators for power systems with traditional measurements. Also, as data dropouts inevitably occur in the transmission channels of traditional measurements from the meters to the control center, the missing-measurement phenomenon is also tackled in the state estimator design. In the framework of the extended Kalman filter (EKF) algorithm, the PMU measurements are treated as inequality constraints on the states with the aid of a statistical criterion, and the addressed SE problem then becomes a constrained optimization one based on the probability-maximization method. The resulting constrained optimization problem is solved using the particle swarm optimization algorithm together with the penalty function approach. The proposed algorithm is applied to estimate the states of power systems with both traditional and PMU measurements in the presence of probabilistic missing measurements. Extensive simulations are carried out on the IEEE 14-bus test system, and it is shown that the proposed algorithm gives much improved estimation performance over the traditional EKF method.
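The penalty-function treatment of inequality constraints can be sketched as below: each violated constraint g(x) ≤ 0 adds a quadratic penalty to the estimation cost, turning the constrained problem into an unconstrained one. A crude seeded random search stands in for particle swarm optimization here; the function names and penalty weight are illustrative assumptions, not the paper's implementation.

```python
import random

def penalized_cost(x, residual, constraints, mu=1e3):
    """Quadratic penalty method: add mu * violation**2 for every
    inequality constraint g(x) <= 0 that is violated (g(x) > 0)."""
    cost = residual(x)
    for g in constraints:
        v = g(x)
        if v > 0:
            cost += mu * v * v
    return cost

def random_search(residual, constraints, x0, steps=2000, scale=0.1, seed=0):
    """Crude stand-in for PSO: perturb the current best estimate with
    Gaussian noise and keep only improvements in penalized cost."""
    rng = random.Random(seed)
    best = list(x0)
    best_cost = penalized_cost(best, residual, constraints)
    for _ in range(steps):
        cand = [xi + rng.gauss(0, scale) for xi in best]
        c = penalized_cost(cand, residual, constraints)
        if c < best_cost:
            best, best_cost = cand, c
    return best
```

Minimizing (x - 2)² subject to x ≤ 1 this way drives the estimate to the constraint boundary near x = 1, as the penalty dominates any gain from moving toward the unconstrained minimum.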

Relevance:

30.00%

Publisher:

Abstract:

Learning disability (LD) is a neurological condition that affects a child’s brain and impairs the ability to carry out one or many specific tasks. LD affects about 10% of children enrolled in schools. There is no cure for learning disabilities, and they are lifelong. The problems of children with specific learning disabilities have been a cause of concern to parents and teachers for some time. Just as there are many different types of LD, there is a variety of tests that may be done to pinpoint the problem. The information gained from an evaluation is crucial for finding out how the parents and the school authorities can provide the best possible learning environment for the child. This paper proposes a new artificial neural network (ANN) approach for identifying LD in children at early stages, so as to solve the problems faced by them and to benefit the students, their parents, and school authorities. In this study, we propose a closest-fit data-preprocessing algorithm with ANN classification to handle missing attribute values. This algorithm imputes the missing values in the preprocessing stage. Ignoring missing attribute values is a common trend in classification algorithms; here, however, we use the algorithm in a systematic approach to classification, which gives a satisfactory result in the prediction of LD. It acts as a tool for predicting LD accurately, and good information about the child is made available to those concerned.
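The closest-fit idea, imputing a missing attribute from the most similar complete record before classification, can be sketched as follows. This is a hypothetical illustration of the general technique, not the authors' exact algorithm; the distance on shared attributes is an assumed choice.

```python
def closest_fit_impute(records, missing=None):
    """For each record with a missing attribute, copy that attribute
    from the complete record with the smallest squared distance on the
    attributes both records have."""
    complete = [r for r in records if missing not in r]

    def distance(a, b):
        # Compare only attribute pairs where both values are present.
        return sum((x - y) ** 2 for x, y in zip(a, b)
                   if x is not missing and y is not missing)

    imputed = []
    for r in records:
        if missing in r:
            donor = min(complete, key=lambda c: distance(r, c))
            r = [donor[i] if v is missing else v for i, v in enumerate(r)]
        imputed.append(list(r))
    return imputed
```

The imputed dataset can then be fed to any classifier (here, an ANN) that cannot itself handle missing values.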

Relevance:

30.00%

Publisher:

Abstract:

This analysis was stimulated by the real data analysis problem of household expenditure data. The full dataset contains expenditure data for a sample of 1224 households. The expenditure is broken down at 2 hierarchical levels: 9 major levels (e.g. housing, food, utilities etc.) and 92 minor levels. There are also 5 factors and 5 covariates at the household level. Not surprisingly, there are a small number of zeros at the major level, but many zeros at the minor level. The question is how best to model the zeros. Clearly, models that try to add a small amount to the zero terms are not appropriate in general, as at least some of the zeros are clearly structural, e.g. alcohol/tobacco for households that are teetotal. The key question then is how to build suitable conditional models. For example, is the sub-composition of spending excluding alcohol/tobacco similar for teetotal and non-teetotal households? In other words, we are looking for sub-compositional independence. Also, what determines whether a household is teetotal? Can we assume that it is independent of the composition? In general, whether a household is teetotal will clearly depend on the household-level variables, so we need to be able to model this dependence. The other tricky question is that, with zeros on more than one component, we need to be able to model dependence and independence of zeros on the different components. Lastly, while some zeros are structural, others may not be: for example, for expenditure on durables, it may be chance whether a particular household spends money on durables within the sample period. This would clearly be distinguishable if we had longitudinal data, but may still be distinguishable by looking at the distribution, on the assumption that random zeros will usually be for situations where any non-zero expenditure is not small.
While this analysis is based on economic data, the ideas carry over to many other situations, including geological data, where minerals may be missing for structural reasons (similar to alcohol), or missing because they occur only in random regions which may be missed in a sample (similar to the durables).

Relevance:

30.00%

Publisher:

Abstract:

As stated in Aitchison (1986), a proper study of relative variation in a compositional data set should be based on logratios, and dealing with logratios excludes dealing with zeros. Nevertheless, it is clear that zero observations might be present in real data sets, either because the corresponding part is completely absent –essential zeros– or because it is below detection limit –rounded zeros. Because the second kind of zeros is usually understood as “a trace too small to measure”, it seems reasonable to replace them by a suitable small value, and this has been the traditional approach. As stated, e.g. by Tauber (1999) and by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000), the principal problem in compositional data analysis is related to rounded zeros. One should be careful to use a replacement strategy that does not seriously distort the general structure of the data. In particular, the covariance structure of the involved parts –and thus the metric properties– should be preserved, as otherwise further analysis on subpopulations could be misleading. Following this point of view, a non-parametric imputation method is introduced in Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000). This method is analyzed in depth by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2003) where it is shown that the theoretical drawbacks of the additive zero replacement method proposed in Aitchison (1986) can be overcome using a new multiplicative approach on the non-zero parts of a composition. The new approach has reasonable properties from a compositional point of view. In particular, it is “natural” in the sense that it recovers the “true” composition if replacement values are identical to the missing values, and it is coherent with the basic operations on the simplex. This coherence implies that the covariance structure of subcompositions with no zeros is preserved. 
As a generalization of the multiplicative replacement, a substitution method for missing values in compositional data sets is introduced in the same paper.
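The multiplicative replacement described above can be sketched as follows: each rounded zero is replaced by a small value δ, and the non-zero parts are rescaled multiplicatively so the composition still sums to its total (here 1). The δ value is illustrative; in practice it is chosen relative to the detection limit.

```python
def multiplicative_replacement(x, delta=0.001, total=1.0):
    """Replace zeros in a composition x (summing to `total`) by delta,
    scaling non-zero parts by (total - z*delta)/total, where z is the
    number of zeros. The multiplicative rescaling preserves the ratios,
    and hence the covariance structure, of the non-zero parts."""
    z = sum(1 for v in x if v == 0)
    scale = (total - z * delta) / total
    return [delta if v == 0 else v * scale for v in x]
```

For example, [0.5, 0.5, 0.0] with δ = 0.1 becomes [0.45, 0.45, 0.1]: the ratio between the first two parts is untouched and the total remains 1.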

Relevance:

30.00%

Publisher:

Abstract:

R, available from http://www.r-project.org/, is ‘GNU S’: a language and environment for statistical computing and graphics in which many classical and modern statistical techniques have been implemented; many more are supplied as packages. There are 8 standard packages, and many others are available through the CRAN family of Internet sites, http://cran.r-project.org. We started to develop a library of functions in R to support the analysis of mixtures, and our goal is a MixeR package for compositional data analysis that provides support for: operations on compositions (perturbation and power multiplication, subcomposition with or without residuals, centering of the data, computing Aitchison, Euclidean and Bhattacharyya distances, compositional Kullback-Leibler divergence, etc.); graphical presentation of compositions in ternary diagrams and tetrahedrons with additional features (barycenter, geometric mean of the data set, percentile lines, marking and coloring of subsets of the data set and their geometric means, annotation of individual data points); dealing with zeros and missing values in compositional data sets, with R procedures for the simple and multiplicative replacement strategies; and time series analysis of compositional data. We will present the current status of MixeR development and illustrate its use on selected data sets.
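Two of the operations listed above, the Aitchison distance (the Euclidean distance between clr-transformed compositions) and perturbation (the simplex group operation), can be sketched in a few lines. MixeR itself is an R package; this Python sketch only illustrates the definitions.

```python
import math

def clr(x):
    """Centred log-ratio transform of a composition with positive parts."""
    g = math.exp(sum(math.log(v) for v in x) / len(x))  # geometric mean
    return [math.log(v / g) for v in x]

def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

def perturb(x, p):
    """Perturbation: component-wise product of two compositions,
    closed (re-normalized) to sum to 1."""
    prod = [a * b for a, b in zip(x, p)]
    s = sum(prod)
    return [v / s for v in prod]
```

Note that the Aitchison distance is scale-invariant: [1, 2, 3] and [2, 4, 6] represent the same composition and have distance zero.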

Relevance:

30.00%

Publisher:

Abstract:

The common GIS-based approach to regional analyses of soil organic carbon (SOC) stocks and changes is to define geographic layers for which unique sets of driving variables are derived, which include land use, climate, and soils. These GIS layers, with their associated attribute data, can then be fed into a range of empirical and dynamic models. Common methodologies for collating and formatting regional data sets on land use, climate, and soils were adopted for the project Assessment of Soil Organic Carbon Stocks and Changes at National Scale (GEFSOC). This permitted the development of a uniform protocol for handling the various input for the dynamic GEFSOC Modelling System. Consistent soil data sets for Amazon-Brazil, the Indo-Gangetic Plains (IGP) of India, Jordan and Kenya, the case study areas considered in the GEFSOC project, were prepared using methodologies developed for the World Soils and Terrain Database (SOTER). The approach involved three main stages: (1) compiling new soil geographic and attribute data in SOTER format; (2) using expert estimates and common sense to fill selected gaps in the measured or primary data; (3) using a scheme of taxonomy-based pedotransfer rules and expert-rules to derive soil parameter estimates for similar soil units with missing soil analytical data. The most appropriate approach varied from country to country, depending largely on the overall accessibility and quality of the primary soil data available in the case study areas. The secondary SOTER data sets discussed here are appropriate for a wide range of environmental applications at national scale. These include agro-ecological zoning, land evaluation, modelling of soil C stocks and changes, and studies of soil vulnerability to pollution. Estimates of national-scale stocks of SOC, calculated using SOTER methods, are presented as a first example of database application. 
Independent estimates of SOC stocks are needed to evaluate the outcome of the GEFSOC Modelling System for current conditions of land use and climate. (C) 2007 Elsevier B.V. All rights reserved.

Relevance:

30.00%

Publisher:

Abstract:

Constructing biodiversity richness maps from Environmental Niche Models (ENMs) of thousands of species is time-consuming. A separate species occurrence data pre-processing phase enables the experimenter to control test AUC score variance due to species dataset size. Besides removing duplicate occurrences and points with missing environmental data, we discuss the need for coordinate precision, wide dispersion, temporal and synonymity filters. After species data filtering, the final task of a pre-processing phase should be the automatic generation of species occurrence datasets which can then be directly ‘plugged in’ to the ENM. A software application capable of carrying out all these tasks would be a valuable time-saver, particularly for large-scale biodiversity studies.
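Three of the filters above (duplicates, missing environmental data, and coordinate precision) can be sketched as a single pass over the occurrence records. This is an illustrative sketch under assumed record layout and thresholds, not the authors' software.

```python
def preprocess_occurrences(records, min_decimals=2):
    """Occurrence pre-processing sketch. Each record is (lat, lon, env),
    where env may be None. Drops records with missing environmental
    data, coordinates recorded with too few decimal places (a crude
    precision filter), and duplicate coordinates."""
    def decimals(v):
        s = repr(v)
        return len(s.split(".")[1]) if "." in s else 0

    seen, kept = set(), []
    for lat, lon, env in records:
        if env is None:
            continue  # missing environmental data
        if decimals(lat) < min_decimals or decimals(lon) < min_decimals:
            continue  # insufficient coordinate precision
        if (lat, lon) in seen:
            continue  # duplicate occurrence
        seen.add((lat, lon))
        kept.append((lat, lon, env))
    return kept
```

The surviving records would then be split automatically into per-species datasets for the ENM.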

Relevance:

30.00%

Publisher:

Abstract:

This contribution investigates the problem of estimating the size of a population, also known as the missing cases problem. Suppose a registration system aims to identify all cases having a certain characteristic, such as a specific disease (cancer, heart disease, ...), a disease-related condition (HIV, heroin use, ...) or a specific behaviour (driving a car without a license). Every case in such a registration system has a certain notification history, in that it might have been identified several times (at least once), which can be understood as a particular capture-recapture situation. Typically, cases that have never been listed on any occasion are left out, and it is this frequency one wants to estimate. In this paper, modelling concentrates on the counting distribution, i.e. the distribution of the variable that counts how often a given case has been identified by the registration system. Besides very simple models like the binomial or Poisson distribution, finite (nonparametric) mixtures of these are considered, providing rather flexible modelling tools. Estimation is done by maximum likelihood by means of the EM algorithm. A case study on heroin users in Bangkok in the year 2001 completes the contribution.
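For the simplest member of the model family above, a zero-truncated Poisson counting distribution, the population size can be estimated by solving for the rate from the truncated mean and then inflating the observed count by the probability of being seen at least once. A minimal sketch of that base case (not the paper's mixture/EM machinery), assuming not all cases are singletons:

```python
import math

def zt_poisson_population(counts, iters=100):
    """counts[i] >= 1: how often each observed case was identified.
    Fit the rate lambda of a zero-truncated Poisson by iterating the
    fixed point lambda = mean * (1 - exp(-lambda)), which comes from
    the truncated mean mean = lambda / (1 - exp(-lambda)). Then the
    total population is N = n / P(observed) = n / (1 - exp(-lambda))."""
    n = len(counts)
    mean = sum(counts) / n
    lam = mean  # starting value for the fixed-point iteration
    for _ in range(iters):
        lam = mean * (1.0 - math.exp(-lam))
    p_observed = 1.0 - math.exp(-lam)
    return lam, n / p_observed
```

For four observed cases identified 1, 2, 3, and 2 times (mean 2), the fitted rate is about 1.59, the chance of being observed at all is about 0.80, and the estimated population is about 5: roughly one case was never registered.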