910 results for Missing data
Abstract:
Background: Bovine respiratory disease complex (BRDC) is a multi-factorial disease in which numerous factors, such as animal management, pathogen exposure and environmental conditions, contribute to the development of acute respiratory illness in feedlot cattle. The role of specific pathogens in the development of BRDC has been difficult to define because of the complex nature of the disease and the presence of implicated bacterial pathogens in the upper respiratory tract of healthy animals. Mycoplasma bovis is an important pathogen of cattle and recognised as a major contributor to cases of mastitis, caseonecrotic bronchopneumonia, arthritis and otitis media. To date, the role of M. bovis in the development of BRDC in Australian feeder cattle has not been investigated. Methods: In this review, the current literature pertaining to the role of M. bovis in BRDC is evaluated. In addition, preliminary data are presented that identify M. bovis as a potential contributor to BRDC in Australian feedlots, which has not been considered previously. Results and Conclusion: The preliminary results demonstrate detection of M. bovis in samples from all feedlots studied. When considered in the context of the reviewed literature, they support the inclusion of M. bovis on the list of pathogens to be considered during investigations into BRDC in Australia. © 2014 Australian Veterinary Association.
Abstract:
Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem: searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classic example is genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine, i.e. they should also hold in future data. This is an important distinction from traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all, or represent only spurious connections that occur by chance. Therefore, the principal objective is to search for rules using statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of the dependence without any occasional extra factors. The extra factors do not add any new information on the dependence; they can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither statistical dependency nor statistical significance is a monotonic property, which means that the traditional pruning techniques do not work.
As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measure. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measure, such as Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm scales well, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or whether the data still contains better, but undiscovered, dependencies.
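To make the scoring step concrete, here is a minimal Python sketch (not the thesis algorithm itself; the row-wise data layout and toy counts are my own assumptions) that evaluates one candidate rule X -> A with Fisher's exact test on binary data:

```python
# Score a candidate dependency rule X -> A on binary data with the
# one-sided Fisher's exact test (hypergeometric upper tail).
from math import comb

def fisher_p(n, n_x, n_a, n_xa):
    """P-value for observing >= n_xa rows with both X and A, given
    n rows in total, n_x rows with X, and n_a rows with A."""
    hi = min(n_x, n_a)
    tail = sum(comb(n_a, k) * comb(n - n_a, n_x - k)
               for k in range(n_xa, hi + 1))
    return tail / comb(n, n_x)

# Toy binary data: each row is one transaction (assumed layout).
rows = [
    {"X1": 1, "X2": 1, "A": 1},
    {"X1": 1, "X2": 1, "A": 1},
    {"X1": 1, "X2": 0, "A": 1},
    {"X1": 0, "X2": 1, "A": 0},
    {"X1": 0, "X2": 0, "A": 0},
    {"X1": 0, "X2": 1, "A": 0},
]
X, A = {"X1"}, "A"
n = len(rows)
n_x = sum(all(r[x] for x in X) for r in rows)        # rows satisfying X
n_a = sum(r[A] for r in rows)                        # rows satisfying A
n_xa = sum(all(r[x] for x in X) and r[A] for r in rows)
p = fisher_p(n, n_x, n_a, n_xa)
print(p)  # 0.05 for this toy table
```

The search problem the thesis solves is then to find, among all exponentially many candidate sets X, the rules minimizing such a p-value without enumerating them exhaustively.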
Abstract:
Water quality data are often collected at different sites over time to improve water quality management. Water quality data usually exhibit the following characteristics: non-normal distribution, presence of outliers, missing values, values below detection limits (censored), and serial dependence. It is essential to apply appropriate statistical methodology when analyzing water quality data to draw valid conclusions and hence provide useful advice for water management. In this chapter, we provide and demonstrate various statistical tools for analyzing such water quality data, and introduce how to use the statistical software R to apply these methods. A dataset collected from the Susquehanna River Basin is used to demonstrate the statistical methods presented in this chapter. The dataset can be downloaded from http://www.srbc.net/programs/CBP/nutrientprogram.htm.
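As a rough illustration of the data characteristics listed above (the values and the DL/2 substitution are my own crude placeholder, not the chapter's R-based methods, which handle censoring far more rigorously), a minimal Python sketch:

```python
# Summarize a water quality series containing missing values and
# censored ("<DL") observations. Censored values are crudely
# substituted at half the detection limit; robust statistics
# (median) limit the influence of the outlier.
import statistics

raw = ["0.12", "<0.05", "0.30", None, "0.18", "<0.05", "1.90", "0.25"]  # mg/L

values = []
for v in raw:
    if v is None:                 # missing value: drop it
        continue
    if v.startswith("<"):         # censored: substitute DL/2 (crude!)
        values.append(float(v[1:]) / 2)
    else:
        values.append(float(v))

med = statistics.median(values)   # robust to the 1.90 outlier
print(med)
```

Proper methods for censored data (e.g. those demonstrated in R in this chapter) should be preferred over simple substitution in real analyses.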
Abstract:
We report a search for single top quark production with the CDF II detector using 2.1 fb-1 of integrated luminosity of pbar p collisions at sqrt{s}=1.96 TeV. The data selected consist of events characterized by large energy imbalance in the transverse plane and hadronic jets, and no identified electrons or muons, so the sample is enriched in W -> tau nu decays. In order to suppress backgrounds, additional kinematic and topological requirements are imposed through a neural network, and at least one of the jets must be identified as a b-quark jet. We measure an excess of signal-like events in agreement with the standard model prediction, but inconsistent with a model without single top quark production by 2.1 standard deviations (sigma), with a median expected sensitivity of 1.4 sigma. Assuming a top quark mass of 175 GeV/c2 and ascribing the excess to single top quark production, the cross section is measured to be 4.9 +2.5/-2.2 (stat+syst) pb, consistent with measurements performed in independent datasets and with the standard model prediction.
Abstract:
Using data from 2.9 fb-1 of integrated luminosity collected with the CDF II detector at the Tevatron, we search for resonances decaying into a pair of on-shell gauge bosons, WW or WZ, where one W decays into an electron and a neutrino, and the other boson decays into two jets. We observe no statistically significant excess above the expected standard model background, and we set cross section limits at the 95% confidence level on G* (Randall-Sundrum graviton), Z′, and W′ bosons. By comparing these limits to theoretical cross sections, mass exclusion regions for the three particles are derived. The mass exclusion regions for Z′ and W′ are further evaluated as a function of their gauge coupling strength.
Abstract:
We report on a search for the standard model Higgs boson produced in association with a $W$ or $Z$ boson in $p\bar{p}$ collisions at $\sqrt{s} = 1.96$ TeV recorded by the CDF II experiment at the Tevatron in a data sample corresponding to an integrated luminosity of 2.1 fb$^{-1}$. We consider events which have no identified charged leptons, an imbalance in transverse momentum, and two or three jets where at least one jet is consistent with originating from the decay of a $b$ hadron. We find good agreement between data and predictions. We place 95% confidence level upper limits on the production cross section for several Higgs boson masses ranging from 110 to 150 GeV/$c^2$. For a mass of 115 GeV/$c^2$ the observed (expected) limit is 6.9 (5.6) times the standard model prediction.
Abstract:
We present a signature-based search for anomalous production of events containing a photon, two jets, of which at least one is identified as originating from a b quark, and missing transverse energy. The search uses data corresponding to 2.0/fb of integrated luminosity from p-pbar collisions at a center-of-mass energy of sqrt(s) = 1.96 TeV, collected with the CDF II detector at the Fermilab Tevatron. From 6,697,466 events with a photon candidate with transverse energy ET > 25 GeV, we find 617 events with missing transverse energy > 25 GeV and two or more jets with ET > 15 GeV, at least one identified as originating from a b quark, versus an expectation of 607 +- 113 events. Increasing the requirement on missing transverse energy to 50 GeV, we find 28 events versus an expectation of 30 +- 11 events. We find no indications of non-standard-model phenomena.
Abstract:
We present results of a signature-based search for new physics using a dijet plus missing transverse energy data sample collected in 2 fb-1 of p-pbar collisions at sqrt(s) = 1.96 TeV with the CDF II detector at the Fermilab Tevatron. We observe no significant event excess with respect to the standard model prediction and extract a 95% C.L. upper limit on the cross section times acceptance for a potential contribution from a non-standard model process. Based on this limit the mass of a first or second generation scalar leptoquark is constrained to be above 187 GeV/c^2.
Abstract:
Computerized tomography is an imaging technique that produces a cross-sectional map of an object from its line integrals. Image reconstruction algorithms require a collection of line integrals covering the whole measurement range. However, in many practical situations part of the projection data is inaccurately measured or not measured at all. In such incomplete-projection situations, conventional image reconstruction algorithms such as the convolution back projection (CBP) algorithm and the Fourier reconstruction algorithm, which assume the projection data to be complete, produce degraded images. In this paper, multiresolution multiscale modeling of the wavelet transform coefficients of the projections is proposed for projection completion. The missing coefficients are then predicted from these models at each scale, followed by an inverse wavelet transform to obtain the estimated projection data.
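A drastically simplified, single-scale Python sketch of the idea (the paper works across multiple scales with learned models; here an unnormalized Haar transform and neighbour interpolation stand in, and the projection values are made up):

```python
# Complete a projection with a gap by predicting its wavelet
# coefficients from neighbouring coefficients, then inverting.
def haar_fwd(x):
    """One level of the (unnormalized) Haar wavelet transform."""
    a = [(x[2*i] + x[2*i+1]) / 2 for i in range(len(x) // 2)]  # approximation
    d = [(x[2*i] - x[2*i+1]) / 2 for i in range(len(x) // 2)]  # detail
    return a, d

def haar_inv(a, d):
    """Inverse of haar_fwd."""
    out = []
    for ai, di in zip(a, d):
        out += [ai + di, ai - di]
    return out

proj = [0, 1, 4, 9, 16, 25, 36, 49]   # toy 1-D projection profile
# Suppose samples 4-5 were not measured; they map onto coefficient
# index 2. Predict that coefficient pair from its neighbours at the
# same scale (simple linear interpolation stands in for the models).
a, d = haar_fwd(proj)
a[2] = (a[1] + a[3]) / 2
d[2] = (d[1] + d[3]) / 2
completed = haar_inv(a, d)
# Samples outside the gap are reconstructed exactly; the gap is
# filled with estimates (20.0, 29.0 vs. true 16, 25 here).
```

The measured samples survive the round trip unchanged, while the gap receives a smooth multiscale estimate instead of the artifacts a CBP reconstruction of truncated data would produce.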
Abstract:
The objective of this work is to develop downscaling methodologies that yield a long time record of inundation extent at high spatial resolution from the existing low-spatial-resolution results of the Global Inundation Extent from Multi-Satellites (GIEMS) dataset. In semiarid regions, high-spatial-resolution a priori information can be provided by visible and infrared observations from the Moderate Resolution Imaging Spectroradiometer (MODIS). The study concentrates on the Inner Niger Delta, where MODIS-derived inundation extent has been estimated at a 500-m resolution. The space-time variability is first analyzed using a principal component analysis (PCA). This is particularly effective for understanding the inundation variability, interpolating in time, or filling in missing values. Two innovative methods are developed (linear regression and matrix inversion), both based on the PCA representation. These GIEMS downscaling techniques have been calibrated using the 500-m MODIS data. The downscaled fields show the expected space-time behaviors from MODIS. A 20-yr dataset of the inundation extent at 500 m is derived from this analysis for the Inner Niger Delta. The methods are very general and may be applied to many basins and to variables other than inundation, provided enough a priori high-spatial-resolution information is available. The derived high-spatial-resolution dataset will be used in the framework of the Surface Water and Ocean Topography (SWOT) mission to develop and test the instrument simulator, as well as to select the calibration/validation sites (with high space-time inundation variability). In addition, once SWOT observations are available, the downscaling methodology will be calibrated on them in order to downscale the GIEMS datasets and extend the SWOT benefits back in time to 1993.
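The PCA-based gap filling mentioned above can be sketched in a few lines of Python (with a synthetic rank-1 space-time field standing in for the GIEMS/MODIS data, which this example does not use): iteratively impute the gaps with a truncated reconstruction from the leading principal component.

```python
# Fill a missing value in a space-time matrix (rows = time steps,
# columns = pixels) by iterating a rank-1 SVD reconstruction, i.e.
# projection onto the leading principal component.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 40)
field = np.outer(np.sin(t), rng.uniform(0.5, 1.5, 8))  # rank-1 toy field
obs = field.copy()
obs[10, 3] = np.nan                                    # one missing observation

mask = np.isnan(obs)
filled = np.where(mask, 0.0, obs)                      # initial guess: zero
for _ in range(50):
    # rank-1 truncated SVD == leading principal component reconstruction
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    approx = s[0] * np.outer(u[:, 0], vt[0])
    filled[mask] = approx[mask]                        # update only the gaps
# "filled" now recovers the missing entry because the field's
# variability is captured by its leading component.
```

Because the toy field is exactly rank 1, the iteration converges to the true value; on real inundation data one would retain several components.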
Abstract:
In this study, we applied the integration methodology developed in the companion paper by Aires (2014) to real satellite observations over the Mississippi Basin. The methodology provides basin-scale estimates of the four water budget components (precipitation P, evapotranspiration E, water storage change Delta S, and runoff R) in a two-step process: a Simple Weighting (SW) integration and a Postprocessing Filtering (PF) that imposes the water budget closure. A comparison with in situ observations of P and E demonstrated that PF improves the estimation of both components. A Closure Correction Model (CCM) has been derived from the integrated product (SW+PF) that makes it possible to correct each observational data set independently, unlike the SW+PF method, which requires simultaneous estimates of the four components. The CCM standardizes the various data sets for each component and greatly decreases the budget residual (P - E - Delta S - R). As a direct application, the CCM was combined with the water budget equation to reconstruct missing values in any component. Results of a Monte Carlo experiment with synthetic gaps demonstrated the good performance of the method, except for the runoff data, whose variability is of the same order of magnitude as the budget residual. Similarly, we propose a reconstruction of Delta S between 1990 and 2002, where no Gravity Recovery and Climate Experiment data are available. Unlike most studies dealing with the water budget closure at the basin scale, only satellite observations and in situ runoff measurements are used. Consequently, the integrated data sets are model independent and can be used for model calibration or validation.
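In the same spirit as the closure constraint above (this is a generic constrained least-squares correction with invented numbers and variances, not the paper's actual SW+PF filter), a budget can be closed exactly by moving each component in proportion to its error variance:

```python
# Adjust (P, E, dS, R) estimates so that P - E - dS - R = 0 exactly,
# minimizing the variance-weighted squared changes (Lagrange multiplier
# solution of the constrained least-squares problem).
def close_budget(p, e, ds, r, var):
    """var: assumed error variances for (P, E, dS, R)."""
    a = (1.0, -1.0, -1.0, -1.0)            # closure constraint coefficients
    residual = p - e - ds - r              # current non-closure
    lam = residual / sum(var)              # since a_i**2 == 1 for every i
    y = (p, e, ds, r)
    return tuple(y[i] - var[i] * a[i] * lam for i in range(4))

# Toy monthly estimates in mm: a +10 mm residual gets redistributed,
# mostly onto the least certain component (P, variance 16).
P, E, dS, R = close_budget(80.0, 45.0, 5.0, 20.0, var=(16.0, 9.0, 4.0, 1.0))
print(P - E - dS - R)   # zero up to floating-point rounding
```

Given such a closure correction, a single missing component can then be reconstructed from the water budget equation and the three observed ones, which is the reconstruction application described above.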
Abstract:
Clustering techniques that can handle incomplete data have become increasingly important due to varied applications in marketing research, medical diagnosis and survey data analysis. Existing techniques cope with missing values either by data modification/imputation or by partial distance computation, both of which can be unreliable depending on the number of features available. In this paper, we propose a novel approach for clustering data with missing values, which performs the task by Symmetric Non-negative Matrix Factorization (SNMF) of a complete pairwise similarity matrix computed from the given incomplete data. To accomplish this, we define a novel similarity measure based on the Average Overlap similarity metric, which can effectively handle missing values without modifying the data. Further, the similarity measure is more reliable than partial distances and inherently possesses the properties required to perform SNMF. Experimental evaluation on real-world datasets demonstrates that the proposed approach is efficient, scalable and performs significantly better than the existing techniques.
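The general strategy (a complete similarity matrix from incomplete points) can be sketched as follows; the agreement-based measure below is my own simplified stand-in, not the paper's Average Overlap metric:

```python
# Build a complete pairwise similarity matrix from incomplete data by
# comparing only the features that both points observe (None = missing).
def similarity(x, y, tol=0.5):
    shared = [i for i in range(len(x))
              if x[i] is not None and y[i] is not None]
    if not shared:
        return 0.0
    # fraction of shared features on which the two points (nearly) agree
    agree = sum(abs(x[i] - y[i]) < tol for i in shared)
    return agree / len(shared)

data = [
    [1.0, 0.2, None, 0.9],
    [1.1, None, 0.4, 1.0],
    [5.0, 4.8, 5.1, None],
]
S = [[similarity(a, b) for b in data] for a in data]
# S is symmetric and fully populated even though every point has gaps;
# a symmetric factorization S ~ H H^T would then yield the clusters.
```

No value is ever imputed: missingness only shrinks the set of features a pair is compared on, which is what makes the resulting matrix usable as SNMF input.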
Abstract:
This study presents a comprehensive evaluation of five widely used multisatellite precipitation estimates (MPEs) against a 1 degree x 1 degree gridded rain gauge data set as ground truth over India. One decade of observations is used to assess the performance of the various MPEs (Climate Prediction Center (CPC)-South Asia data set, CPC Morphing Technique (CMORPH), Precipitation Estimation From Remotely Sensed Information Using Artificial Neural Networks, Tropical Rainfall Measuring Mission's Multisatellite Precipitation Analysis (TMPA-3B42), and Global Precipitation Climatology Project). All MPEs have high rain-detection skill, with large probability of detection (POD) and small "missing" values. However, the detection sensitivity differs from one product (and also one region) to another. While CMORPH has the lowest sensitivity for detecting rain, CPC shows the highest sensitivity and often overdetects rain, as evidenced by its large POD and false alarm ratio and small missing values. All MPEs show higher rain sensitivity over eastern India than over western India. These differential sensitivities are found to alter the biases in rain amount differently. All MPEs show similar spatial patterns of seasonal rain bias and root-mean-square error, but their spatial variability across India is complex and pronounced. The MPEs overestimate rainfall over the dry regions (northwest and southeast India) and severely underestimate it over mountainous regions (the west coast and northeast India), whereas the bias is relatively small over the core monsoon zone. Higher occurrence of virga rain due to subcloud evaporation and the possible missing of small-scale convective events by gauges over the dry regions are the main reasons for the observed overestimation of rain by the MPEs. The decomposed components of the total bias show that the major part of the overestimation is due to false precipitation.
The severe underestimation of rain along the west coast is attributed to the predominant occurrence of shallow rain and underestimation of moderate to heavy rain by MPEs. The decomposed components suggest that the missed precipitation and hit bias are the leading error sources for the total bias along the west coast. All evaluation metrics are found to be nearly equal in two contrasting monsoon seasons (southwest and northeast), indicating that the performance of MPEs does not change with the season, at least over southeast India. Among various MPEs, the performance of TMPA is found to be better than others, as it reproduced most of the spatial variability exhibited by the reference.
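The detection metrics used in this evaluation come from a standard 2x2 contingency table of satellite versus gauge rain occurrence; a minimal Python sketch (the counts are illustrative, not the study's data):

```python
# Categorical rain-detection skill scores from a 2x2 contingency table:
#   hits          = rain observed by gauge and detected by satellite
#   false_alarms  = satellite reports rain, gauge does not
#   misses        = gauge reports rain, satellite does not ("missing")
def skill_scores(hits, false_alarms, misses):
    pod = hits / (hits + misses)                 # probability of detection
    far = false_alarms / (hits + false_alarms)   # false alarm ratio
    return pod, far

pod, far = skill_scores(hits=820, false_alarms=140, misses=60)
# A product like CPC that "overdetects" shows both a high POD and a
# high FAR; low misses alone do not imply good skill.
```
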
Abstract:
Cluster analysis of ranking data, which occurs in consumer questionnaires, voting forms and other surveys of preferences, attempts to identify typical groups of rank choices. Empirically measured rankings are often incomplete, i.e. different numbers of filled rank positions cause heterogeneity in the data. We propose a mixture approach for clustering heterogeneous rank data. Rankings of different lengths can be described and compared by means of a single probabilistic model. A maximum entropy approach avoids hidden assumptions about missing rank positions. Parameter estimators and an efficient EM algorithm for unsupervised inference are derived for the ranking mixture model. Experiments on both synthetic and real-world data demonstrate significantly improved parameter estimates on heterogeneous data when the incomplete rankings are included in the inference process.
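The EM structure of such a mixture can be illustrated with a drastically simplified stand-in (this is not the paper's maximum entropy ranking model: only each ranking's top choice is modeled, with a mixture of multinomials, and all data and initial parameters are invented). Incomplete rankings still contribute, since only the first position is used:

```python
# EM for a two-component mixture of multinomials over the top-ranked
# item of (possibly incomplete) rankings.
rankings = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"],
            ["b", "c"], ["a"], ["b"]]          # varying lengths
items = ["a", "b", "c"]
K = 2

pi = [0.5, 0.5]                                # mixture weights
theta = [{"a": 0.6, "b": 0.3, "c": 0.1},       # per-cluster top-choice
         {"a": 0.1, "b": 0.6, "c": 0.3}]       # probabilities

for _ in range(30):
    # E-step: responsibility of each cluster for each ranking
    resp = []
    for r in rankings:
        w = [pi[k] * theta[k][r[0]] for k in range(K)]
        z = sum(w)
        resp.append([wi / z for wi in w])
    # M-step: re-estimate weights and multinomial parameters
    for k in range(K):
        nk = sum(resp[i][k] for i in range(len(rankings)))
        pi[k] = nk / len(rankings)
        for it in items:
            theta[k][it] = sum(resp[i][k]
                               for i, r in enumerate(rankings)
                               if r[0] == it) / nk
```

The paper's contribution is precisely what this sketch omits: a maximum entropy likelihood over the unfilled rank positions, so that rankings of different lengths enter the E-step without any hidden completion assumption.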