875 resultados para statistical data analysis
Resumo:
Many techniques in information retrieval produce counts from a sample, and it is common to analyse these counts as proportions of the whole - term frequencies are a familiar example. Proportions carry only relative information and are not free to vary independently of one another: for the proportion of one term to increase, one or more others must decrease. These constraints are hallmarks of compositional data. While there has long been discussion in other fields of how such data should be analysed, to our knowledge, Compositional Data Analysis (CoDA) has not been considered in IR. In this work we explore compositional data in IR through the lens of distance measures, and demonstrate that common measures, naïve to compositions, have some undesirable properties which can be avoided with composition-aware measures. As a practical example, these measures are shown to improve clustering. Copyright 2014 ACM.
Resumo:
Critical to the research of urban morphologists is the availability of historical records that document the urban transformation of the study area. However, thus far little work has been done towards an empirical approach to the validation of archival data in this field. Outlined in this paper, therefore, is a new methodology for validating the accuracy of archival records and mapping data, accrued through the process of urban morphological research, so as to establish a reliable platform from which analysis can proceed. The paper particularly addresses the problems of inaccuracies in existing curated historical information, as well as errors in archival research by student assistants, which together give rise to unacceptable levels of uncertainty in the documentation. The paper discusses the problems relating to the reliability of historical information, demonstrates the importance of data verification in urban morphological research, and proposes a rigorous method for objective testing of collected archival data through the use of qualitative data analysis software.
Resumo:
The development of microfinance in Vietnam since 1990s has coincided with a remarkable progress in poverty reduction. Numerous descriptive studies have illustrated that microfinance is an effective tool to eradicate poverty in Vietnam but evidence from quantitative studies is mixed. This study contributes to the literature by providing new evidence on the impact of microfinance to poverty reduction in Vietnam using the repeated cross - sectional data from the Vietnam Living Standard s Survey (VLSS) during period 1992 - 2010. Our results show that micro - loans contribute significantly to household consumption.
Resumo:
Most real-life data analysis problems are difficult to solve using exact methods, due to the size of the datasets and the nature of the underlying mechanisms of the system under investigation. As datasets grow even larger, finding the balance between the quality of the approximation and the computing time of the heuristic becomes non-trivial. One solution is to consider parallel methods, and to use the increased computational power to perform a deeper exploration of the solution space in a similar time. It is, however, difficult to estimate a priori whether parallelisation will provide the expected improvement. In this paper we consider a well-known method, genetic algorithms, and evaluate on two distinct problem types the behaviour of the classic and parallel implementations.
Resumo:
There is a current lack of understanding regarding the use of unregistered vehicles on public roads and road-related areas, and the links between the driving of unregistered vehicles and a range of dangerous driving behaviours. This report documents the findings of data analysis conducted to investigate the links between unlicensed driving and the driving of unregistered vehicles, and is an important initial undertaking into understanding these behaviours. This report examines de-identified data from two sources: crash data; and offence data. The data was extracted from the Queensland Department of Transport and Main Roads (TMR) databases and covered the period from 2003 to 2008.
Resumo:
Big Datasets are endemic, but they are often notoriously difficult to analyse because of their size, heterogeneity, history and quality. The purpose of this paper is to open a discourse on the use of modern experimental design methods to analyse Big Data in order to answer particular questions of interest. By appealing to a range of examples, it is suggested that this perspective on Big Data modelling and analysis has wide generality and advantageous inferential and computational properties. In particular, the principled experimental design approach is shown to provide a flexible framework for analysis that, for certain classes of objectives and utility functions, delivers near equivalent answers compared with analyses of the full dataset under a controlled error rate. It can also provide a formalised method for iterative parameter estimation, model checking, identification of data gaps and evaluation of data quality. Finally, it has the potential to add value to other Big Data sampling algorithms, in particular divide-and-conquer strategies, by determining efficient sub-samples.
Resumo:
Selection criteria and misspecification tests for the intra-cluster correlation structure (ICS) in longitudinal data analysis are considered. In particular, the asymptotical distribution of the correlation information criterion (CIC) is derived and a new method for selecting a working ICS is proposed by standardizing the selection criterion as the p-value. The CIC test is found to be powerful in detecting misspecification of the working ICS structures, while with respect to the working ICS selection, the standardized CIC test is also shown to have satisfactory performance. Some simulation studies and applications to two real longitudinal datasets are made to illustrate how these criteria and tests might be useful.
Resumo:
A modeling paradigm is proposed for covariate, variance and working correlation structure selection for longitudinal data analysis. Appropriate selection of covariates is pertinent to correct variance modeling and selecting the appropriate covariates and variance function is vital to correlation structure selection. This leads to a stepwise model selection procedure that deploys a combination of different model selection criteria. Although these criteria find a common theoretical root based on approximating the Kullback-Leibler distance, they are designed to address different aspects of model selection and have different merits and limitations. For example, the extended quasi-likelihood information criterion (EQIC) with a covariance penalty performs well for covariate selection even when the working variance function is misspecified, but EQIC contains little information on correlation structures. The proposed model selection strategies are outlined and a Monte Carlo assessment of their finite sample properties is reported. Two longitudinal studies are used for illustration.
Resumo:
The method of generalized estimating equations (GEEs) provides consistent estimates of the regression parameters in a marginal regression model for longitudinal data, even when the working correlation model is misspecified (Liang and Zeger, 1986). However, the efficiency of a GEE estimate can be seriously affected by the choice of the working correlation model. This study addresses this problem by proposing a hybrid method that combines multiple GEEs based on different working correlation models, using the empirical likelihood method (Qin and Lawless, 1994). Analyses show that this hybrid method is more efficient than a GEE using a misspecified working correlation model. Furthermore, if one of the working correlation structures correctly models the within-subject correlations, then this hybrid method provides the most efficient parameter estimates. In simulations, the hybrid method's finite-sample performance is superior to a GEE under any of the commonly used working correlation models and is almost fully efficient in all scenarios studied. The hybrid method is illustrated using data from a longitudinal study of the respiratory infection rates in 275 Indonesian children.
Resumo:
We consider ranked-based regression models for clustered data analysis. A weighted Wilcoxon rank method is proposed to take account of within-cluster correlations and varying cluster sizes. The asymptotic normality of the resulting estimators is established. A method to estimate covariance of the estimators is also given, which can bypass estimation of the density function. Simulation studies are carried out to compare different estimators for a number of scenarios on the correlation structure, presence/absence of outliers and different correlation values. The proposed methods appear to perform well, in particular, the one incorporating the correlation in the weighting achieves the highest efficiency and robustness against misspecification of correlation structure and outliers. A real example is provided for illustration.
Resumo:
We consider the problem of estimating a population size from successive catches taken during a removal experiment and propose two estimating functions approaches, the traditional quasi-likelihood (TQL) approach for dependent observations and the conditional quasi-likelihood (CQL) approach using the conditional mean and conditional variance of the catch given previous catches. Asymptotic covariance of the estimates and the relationship between the two methods are derived. Simulation results and application to the catch data from smallmouth bass show that the proposed estimating functions perform better than other existing methods, especially in the presence of overdispersion.
Resumo:
A graphical method is presented for Hall data analysis, including the temperature variation of activation energy due to screening. This method removes the discrepancies noted in the analysis of recently reported Hall data on Si(In).
Resumo:
The majority of Australian weeds are exotic plant species that were intentionally introduced for a variety of horticultural and agricultural purposes. A border weed risk assessment system (WRA) was implemented in 1997 in order to reduce the high economic costs and massive environmental damage associated with introducing serious weeds. We review the behaviour of this system with regard to eight years of data collected from the assessment of species proposed for importation or held within genetic resource centres in Australia. From a taxonomic perspective, species from the Chenopodiaceae and Poaceae were most likely to be rejected and those from the Arecaceae and Flacourtiaceae were most likely to be accepted. Dendrogram analysis and classification and regression tree (TREE) models were also used to analyse the data. The latter revealed that a small subset of the 35 variables assessed was highly associated with the outcome of the original assessment. The TREE model examining all of the data contained just five variables: unintentional human dispersal, congeneric weed, weed elsewhere, tolerates or benefits from mutilation, cultivation or fire, and reproduction by vegetative propagation. It gave the same outcome as the full WRA model for 71% of species. Weed elsewhere was not the first splitting variable in this model, indicating that the WRA has a capacity for capturing species that have no history of weediness. A reduced TREE model (in which human-mediated variables had been removed) contained four variables: broad climate suitability, reproduction in less or than equal to 1 year, self-fertilisation, and tolerates and benefits from mutilation, cultivation or fire. It yielded the same outcome as the full WRA model for 65% of species. Data inconsistencies and the relative importance of questions are discussed, with some recommendations made for improving the use of the system.
Resumo:
The emerging carbon economy will have a major impact on grazing businesses because of significant livestock methane and land-use change emissions. Livestock methane emissions alone account for similar to 11% of Australia's reported greenhouse gas emissions. Grazing businesses need to develop an understanding of their greenhouse gas impact and be able to assess the impact of alternative management options. This paper attempts to generate a greenhouse gas budget for two scenarios using a spread sheet model. The first scenario was based on one land-type '20-year-old brigalow regrowth' in the brigalow bioregion of southern-central Queensland. The 50 year analysis demonstrated the substantially different greenhouse gas outcomes and livestock carrying capacity for three alternative regrowth management options: retain regrowth (sequester 71.5 t carbon dioxide equivalents per hectare, CO2-e/ha), clear all regrowth (emit 42.8 t CO2-e/ha) and clear regrowth strips (emit 5.8 t CO2-e/ha). The second scenario was based on a 'remnant eucalypt savanna-woodland' land type in the Einasleigh Uplands bioregion of north Queensland. The four alternative vegetation management options were: retain current woodland structure (emit 7.4 t CO2-e/ha), allow woodland to thicken increasing tree basal area (sequester 20.7 t CO2-e/ha), thin trees less than 10 cm diameter (emit 8.9 t CO2-e/ha), and thin trees <20 cm diameter (emit 12.4 t CO2-e/ha). Significant assumptions were required to complete the budgets due to gaps in current knowledge on the response of woody vegetation, soil carbon and non-CO2 soil emissions to management options and land-type at the property scale. The analyses indicate that there is scope for grazing businesses to choose alternative management options to influence their greenhouse gas budget. However, a key assumption is that accumulation of carbon or avoidance of emissions somewhere on a grazing business (e.g. in woody vegetation or soil) will be recognised as an offset for emissions elsewhere in the business (e.g. livestock methane). This issue will be a challenge for livestock industries and policy makers to work through in the coming years.
Resumo:
The use of near infrared (NIR) hyperspectral imaging and hyperspectral image analysis for distinguishing between hard, intermediate and soft maize kernels from inbred lines was evaluated. NIR hyperspectral images of two sets (12 and 24 kernels) of whole maize kernels were acquired using a Spectral Dimensions MatrixNIR camera with a spectral range of 960-1662 nm and a sisuChema SWIR (short wave infrared) hyperspectral pushbroom imaging system with a spectral range of 1000-2498 nm. Exploratory principal component analysis (PCA) was used on absorbance images to remove background, bad pixels and shading. On the cleaned images. PCA could be used effectively to find histological classes including glassy (hard) and floury (soft) endosperm. PCA illustrated a distinct difference between glassy and floury endosperm along principal component (PC) three on the MatrixNIR and PC two on the sisuChema with two distinguishable clusters. Subsequently partial least squares discriminant analysis (PLS-DA) was applied to build a classification model. The PLS-DA model from the MatrixNIR image (12 kernels) resulted in root mean square error of prediction (RMSEP) value of 0.18. This was repeated on the MatrixNIR image of the 24 kernels which resulted in RMSEP of 0.18. The sisuChema image yielded RMSEP value of 0.29. The reproducible results obtained with the different data sets indicate that the method proposed in this paper has a real potential for future classification uses.