Abstract:
In any investigation in optometry involving more than two treatment or patient groups, an investigator should use ANOVA to analyse the results, assuming that the data conform reasonably well to the assumptions of the analysis. Ideally, specific null hypotheses should be built into the experiment from the start so that the treatment variation can be partitioned to test these effects directly. If 'post-hoc' tests are used, then an experimenter should examine the degree of protection offered by the test against the possibility of making either a type 1 or a type 2 error. All experimenters should be aware of the complexity of ANOVA. The present article describes only one common form of the analysis, viz., that which applies to a single classification of the treatments in a randomised design. There are many different forms of the analysis, each of which is appropriate to the analysis of a specific experimental design. The uses of some of the most common forms of ANOVA in optometry are described in a further article. If in any doubt, an investigator should consult a statistician with experience of the analysis of experiments in optometry, since once embarked upon an experiment with an unsuitable design, there may be little that a statistician can do to help.
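As an illustration of the single-classification (one-way) ANOVA for a randomised design described above, here is a minimal Python sketch; the three treatment groups and their values are invented for illustration.

```python
# Minimal sketch: one-way ANOVA comparing three hypothetical treatment groups.
# The data are invented for illustration only.
from scipy import stats

group_a = [12.1, 13.4, 11.8, 12.9, 13.1]  # treatment A responses
group_b = [14.2, 15.0, 13.8, 14.7, 15.3]  # treatment B responses
group_c = [11.9, 12.2, 12.8, 11.5, 12.4]  # control responses

# H0: all group means are equal; assumes approximate normality and
# homogeneity of variances, as discussed above.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```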
Abstract:
Visualising data for exploratory analysis is a major challenge in many applications. Visualisation allows scientists to gain insight into the structure and distribution of the data, for example by finding common patterns and relationships between samples as well as variables. Typically, visualisation methods like principal component analysis and multi-dimensional scaling are employed. These methods are favoured because of their simplicity, but they cannot cope with missing data, and it is difficult to incorporate prior knowledge about properties of the variable space into the analysis; this is particularly important in the high-dimensional, sparse datasets typical in geochemistry. In this paper we show how to utilise a block-structured correlation matrix using a modification of a well-known non-linear probabilistic visualisation model, the Generative Topographic Mapping (GTM), which can cope with missing data. The block structure supports direct modelling of strongly correlated variables. We show that by including prior structural information it is possible to improve both the data visualisation and the model fit. These benefits are demonstrated on artificial data as well as a real geochemical dataset used for oil exploration, where the proposed modifications improved the missing data imputation results by 3 to 13%.
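The block-structured correlation matrix used as prior knowledge can be illustrated with a small numpy construction: variables within a block are strongly correlated, while blocks are mutually uncorrelated. This is a minimal sketch of the structure only, not the authors' GTM modification; the block sizes and correlation strengths are assumptions.

```python
# Minimal sketch of a block-structured correlation matrix: high correlation
# within each block of variables, zero correlation between blocks.
# Block sizes and within-block correlations are illustrative assumptions.
import numpy as np
from scipy.linalg import block_diag

def correlation_block(size, rho):
    """A size x size matrix with 1 on the diagonal and rho elsewhere."""
    return rho * np.ones((size, size)) + (1.0 - rho) * np.eye(size)

# e.g. three groups of strongly correlated geochemical variables
blocks = [correlation_block(4, 0.8),
          correlation_block(3, 0.9),
          correlation_block(5, 0.7)]
C = block_diag(*blocks)  # 12 x 12 block-structured correlation matrix
print(C.shape)
```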
Abstract:
Exploratory analysis of data seeks to find common patterns that give insight into the structure and distribution of the data. In geochemistry it is a valuable means of gaining insight into the complicated processes making up a petroleum system. Typically, linear visualisation methods like principal components analysis, linked plots, or brushing are used. These methods cannot directly be employed when dealing with missing data, and they struggle to capture global non-linear structures in the data, although they can do so locally. This thesis discusses a complementary approach based on a non-linear probabilistic model. The generative topographic mapping (GTM) enables the visualisation of the effects of very many variables on a single plot, which is able to incorporate more structure than a two-dimensional principal components plot. The model can deal with uncertainty and missing data, and allows for the exploration of the non-linear structure in the data. In this thesis a novel approach to initialising the GTM with arbitrary projections is developed. This makes it possible to combine GTM with algorithms like Isomap and to fit complex non-linear structures like the Swiss roll. Another novel extension is the incorporation of prior knowledge about the structure of the covariance matrix. This extension greatly enhances the modelling capabilities of the algorithm, resulting in a better fit to the data and better imputation of missing values. Additionally, an extensive benchmark study of the missing-data imputation capabilities of GTM is performed. Further, a novel approach based on missing data is introduced to benchmark the fit of probabilistic visualisation algorithms on unlabelled data. Finally, the work is complemented by evaluating the algorithms on real-life datasets from geochemical projects.
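The initialisation of GTM with an arbitrary projection can be sketched with scikit-learn: an Isomap embedding of Swiss-roll data yields two-dimensional coordinates that could serve as initial latent positions in place of the usual PCA initialisation. This only illustrates obtaining such a projection; wiring it into a GTM implementation is not shown, and the parameter choices are assumptions.

```python
# Minimal sketch: an Isomap projection of Swiss-roll data that could serve
# as an alternative GTM initialisation (instead of PCA-based initialisation).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Two-dimensional Isomap embedding of the non-linear Swiss-roll structure.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1000, 2): coordinates usable as initial latent positions
```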
Abstract:
Common approaches to IP-traffic modelling have featured the use of stochastic models, based on the Markov property, which can be classified into black-box and white-box models according to the approach used for modelling traffic. White-box models are simple to understand, transparent, and have a physical meaning attributed to each of the associated parameters. To exploit this key advantage, this thesis explores the use of simple classic continuous-time Markov models based on a white-box approach to model not only the network traffic statistics but also the source behaviour with respect to the network and application. The thesis is divided into two parts. The first part focuses on the use of simple Markov and semi-Markov traffic models, starting from the simplest two-state model and moving upwards to n-state models with Poisson and non-Poisson statistics. The thesis then introduces the convenient-to-use, mathematically derived Gaussian Markov models, which are used to model the measured network IP traffic statistics. As one of its most significant contributions, the thesis establishes the significance of second-order density statistics, revealing that, in contrast to first-order density statistics, they carry much more distinctive information on traffic sources and behaviour. The thesis then exploits Gaussian Markov models to capture these unique features and finally shows how simple classic Markov models, coupled with second-order density statistics, provide an excellent tool for capturing maximum traffic detail, which in itself is the essence of good traffic modelling. The second part of the thesis studies the ON-OFF characteristics of VoIP traffic with reference to accurate measurements of the ON and OFF periods, made from a large multilingual database of over 100 hours' worth of VoIP call recordings. The impact of the language, prosodic structure and speech rate of the speaker on the statistics of the ON-OFF periods is analysed and relevant conclusions are presented. Finally, an ON-OFF VoIP source model with log-normal transitions is contributed as an ideal candidate for modelling VoIP traffic, and the results of this model are compared with those of previously published work.
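The ON-OFF source model with log-normal transitions can be sketched as a simple simulation: alternating ON (talk spurt) and OFF (silence) periods whose durations are drawn from log-normal distributions. The (mu, sigma) parameters below are illustrative assumptions, not values fitted to the VoIP call database.

```python
# Minimal sketch: simulate an ON-OFF VoIP source whose ON and OFF period
# durations follow log-normal distributions. Parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def simulate_on_off(n_cycles, mu_on, sigma_on, mu_off, sigma_off):
    """Return alternating (state, duration) pairs for n_cycles ON/OFF cycles."""
    periods = []
    for _ in range(n_cycles):
        periods.append(("ON", rng.lognormal(mu_on, sigma_on)))    # talk spurt
        periods.append(("OFF", rng.lognormal(mu_off, sigma_off))) # silence
    return periods

trace = simulate_on_off(1000, mu_on=0.0, sigma_on=0.6, mu_off=0.4, sigma_off=0.8)
on_durations = [d for state, d in trace if state == "ON"]
print(f"mean ON period: {np.mean(on_durations):.2f} s")
```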
Abstract:
Exploratory analysis of petroleum geochemical data seeks to find common patterns to help distinguish between different source rocks, oils and gases, and to explain their source, maturity and any intra-reservoir alteration. However, at the outset, one is typically faced with (a) a large matrix of samples, each with a range of molecular and isotopic properties, (b) a spatially and temporally unrepresentative sampling pattern, (c) noisy data and (d) often, a large number of missing values. This inhibits analysis using conventional statistical methods. Typically, visualisation methods like principal components analysis are used, but these methods are not easily able to deal with missing data, nor can they capture non-linear structure in the data. One approach to discovering complex, non-linear structure in the data is through the use of linked plots, or brushing, while ignoring the missing data. In this paper we introduce a complementary approach based on a non-linear probabilistic model. Generative topographic mapping enables the visualisation of the effects of very many variables on a single plot, while also dealing with missing data. We show how using generative topographic mapping also provides an optimal method with which to replace missing values in two geochemical datasets, particularly where a large proportion of the data is missing.
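A standard way to test an imputation claim of this kind is to mask a proportion of known values, impute them, and score the result against the hidden truth. The sketch below shows such a benchmark with a naive column-mean imputer as the baseline against which a GTM-based imputer would be compared; the data are synthetic and the protocol is an illustration, not the paper's exact procedure.

```python
# Minimal sketch of an imputation benchmark: hide a fraction of known values,
# impute them (here with a naive column-mean baseline), and measure RMSE.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))             # synthetic "complete" dataset

mask = rng.random(X.shape) < 0.2           # hide 20% of the entries
X_missing = np.where(mask, np.nan, X)

col_means = np.nanmean(X_missing, axis=0)  # column-mean imputation baseline
X_imputed = np.where(mask, col_means, X_missing)

# RMSE on the artificially hidden entries only.
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
print(f"baseline imputation RMSE: {rmse:.3f}")
```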
Abstract:
In previous Statnotes, many of the statistical tests described rely on the assumption that the data are a random sample from a normal or Gaussian distribution. These include most of the tests in common usage, such as the ‘t’ test, the various types of analysis of variance (ANOVA), and Pearson’s correlation coefficient (‘r’). In microbiology research, however, not all variables can be assumed to follow a normal distribution. Yeast populations, for example, are a notable feature of freshwater habitats, representatives of over 100 genera having been recorded. Most common are the ‘red yeasts’ such as Rhodotorula, Rhodosporidium, and Sporobolomyces and ‘black yeasts’ such as Aureobasidium pullulans, together with species of Candida. Despite the abundance of genera and species, the overall density of an individual species in freshwater is likely to be low and hence samples taken from such a population will contain very low numbers of cells. A rare organism living in an aquatic environment may be distributed more or less at random in a volume of water and therefore samples taken from such an environment may result in counts that are more likely to be distributed according to the Poisson than the normal distribution. The Poisson distribution was named after the French mathematician Siméon Poisson (1781-1840) and has many applications in biology, especially in describing rare or randomly distributed events, e.g., the number of mutations in a given sequence of DNA after exposure to a fixed amount of radiation or the number of cells infected by a virus given a fixed level of exposure. This Statnote describes how to fit the Poisson distribution to counts of yeast cells in samples taken from a freshwater lake.
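Fitting the Poisson distribution to such counts amounts to estimating the mean count per sample (the maximum-likelihood estimate of lambda) and comparing observed with expected frequencies. A minimal sketch, with invented yeast cell counts:

```python
# Minimal sketch: fit a Poisson distribution to yeast cell counts per sample
# and compare observed with expected frequencies. The counts are invented.
import numpy as np
from scipy import stats

counts = np.array([0, 1, 0, 2, 1, 0, 0, 3, 1, 0, 2, 1, 0, 0, 1])  # cells/sample

lam = counts.mean()                        # ML estimate of the Poisson mean
k = np.arange(0, counts.max() + 1)
expected = stats.poisson.pmf(k, lam) * len(counts)
observed = np.array([(counts == ki).sum() for ki in k])

for ki, obs, exp in zip(k, observed, expected):
    print(f"count {ki}: observed {obs}, expected {exp:.1f}")
```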
Abstract:
One of the aims of the Science and Technology Committee (STC) of the Group on Earth Observations (GEO) was to establish a GEO Label: a label to certify geospatial datasets and their quality. As proposed, the GEO Label will be used as a value indicator for geospatial data and datasets accessible through the Global Earth Observation System of Systems (GEOSS). It is suggested that the development of such a label will significantly improve user recognition of the quality of geospatial datasets and that its use will help promote trust in datasets that carry the established GEO Label. Furthermore, the GEO Label is seen as an incentive to data providers. At the moment GEOSS contains a large amount of data and is constantly growing. Taking this into account, a GEO Label could assist in searching by providing users with visual cues of dataset quality and possibly relevance; a GEO Label could effectively stand as a decision support mechanism for dataset selection. Currently our project, GeoViQua, together with EGIDA and ID-03, is undertaking research to define and evaluate the concept of a GEO Label. The development and evaluation process will be carried out in three phases. In phase I we have conducted an online survey (GEO Label Questionnaire) to identify the initial user and producer views on a GEO Label and its potential role. In phase II we will conduct a further study presenting some GEO Label examples based on the results of phase I, eliciting feedback on these examples under controlled conditions. In phase III we will create physical prototypes which will be used in a human subject study. The most successful prototypes will then be put forward as potential GEO Label options. At the moment we are in phase I, where we have developed an online questionnaire to collect the initial GEO Label requirements and to identify the role that a GEO Label should serve from the user and producer standpoint. The GEO Label Questionnaire consists of generic questions to identify whether users and producers believe a GEO Label is relevant to geospatial data; whether they want a single "one-for-all" label or separate labels that each serve a particular role; the function that would be most relevant for a GEO Label to carry; and the functionality that users and producers would like to see in the common rating and review systems they use. To distribute the questionnaire, relevant user and expert groups were contacted at meetings or by email. At this stage we have successfully collected over 80 valid responses from geospatial data users and producers. This communication will provide a comprehensive analysis of the survey results, indicating to what extent the users surveyed in phase I value a GEO Label, and suggesting in what directions a GEO Label may develop. Potential GEO Label examples based on the results of the survey will be presented for use in phase II.
Abstract:
We present a data-based statistical study of the effects of seasonal variations on the growth rates of gastro-intestinal (GI) parasitic infection in livestock. The alluded growth rate is estimated through the variation in the number of eggs per gram (EPG) of faeces in animals. In accordance with earlier studies, our analysis too shows that rainfall is the dominant variable in determining EPG infection rates compared to other macro-parameters like temperature and humidity. Our statistical analysis clearly indicates an oscillatory dependence of EPG levels on rainfall fluctuations. Monsoon recorded the highest infection, with a comparative increase of at least 2.5 times over the next most infected period (summer). A least-squares fit of the EPG versus rainfall data indicates an approach towards a super-diffusive infection growth regime (i.e. root mean square displacement growing faster than the square root of the elapsed time obtained for simple diffusion) for low rainfall regimes (technically defined as zeroth-level dependence) that gets remarkably augmented for large rainfall zones. Our analysis further indicates that for low fluctuations in temperature (true of the bulk data), the EPG level saturates beyond a critical value of the rainfall, a threshold that is expected to indicate the onset of the nonlinear regime. The probability density functions (PDFs) of the EPG data show oscillatory behaviour in the large rainfall regime (greater than 500 mm), the frequency of oscillation, once again, being determined by the ambient wetness (rainfall and humidity). Data recorded over three pilot projects spanning three measures of rainfall and humidity bear testimony to the universality of this statistical argument.
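The least-squares fit of EPG against rainfall can be sketched as a power-law regression, EPG ≈ a · rainfall^b, where a fitted exponent b above 0.5 would point towards the super-diffusive growth regime mentioned above. The data points below are invented for illustration.

```python
# Minimal sketch: least-squares fit of a power law, EPG ~ a * rainfall**b;
# b > 0.5 would suggest super-diffusive growth. Data are invented.
import numpy as np
from scipy.optimize import curve_fit

rainfall = np.array([50, 120, 200, 310, 420, 550, 700])  # mm
epg = np.array([180, 310, 420, 610, 760, 980, 1100])     # eggs per gram

def power_law(x, a, b):
    return a * np.power(x, b)

(a, b), _ = curve_fit(power_law, rainfall, epg, p0=(1.0, 0.5))
print(f"fitted exponent b = {b:.2f}")  # compare with 0.5 for simple diffusion
```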
Abstract:
Background - Recent studies have implicated variants of the transcription factor 7-like 2 (TCF7L2) gene in genetic susceptibility to type 2 diabetes mellitus in several different populations. The aim of this study was to determine whether variants of this gene are also risk factors for type 2 diabetes development in a UK-resident South Asian cohort of Punjabi ancestry. Methods - We genotyped four single nucleotide polymorphisms (SNPs) of TCF7L2 (rs7901695, rs7903146, rs11196205 and rs12255372) in 831 subjects with diabetes and 437 control subjects. Results - The minor allele of each variant was significantly associated with type 2 diabetes; the greatest risk of developing the disease was conferred by rs7903146, with an allelic odds ratio (OR) of 1.31 (95% CI: 1.11 – 1.56, p = 1.96 × 10⁻³). For each variant, disease risk associated with homozygosity for the minor allele was greater than that for heterozygotes, with the exception of rs12255372. To determine the effect on the observed associations of including young control subjects in our data set, we reanalysed the data using subsets of the control group defined by different minimum age thresholds. Increasing the minimum age of our control subjects resulted in a corresponding increase in OR for all variants of the gene (p ≤ 1.04 × 10⁻⁷). Conclusion - Our results support recent findings that TCF7L2 is an important genetic risk factor for the development of type 2 diabetes in multiple ethnic groups.
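An allelic odds ratio of the kind reported above is computed from a 2 × 2 table of minor and major allele counts in cases and controls, with a 95% confidence interval obtained on the log scale (Woolf's method). The counts below are hypothetical, not the study's data.

```python
# Minimal sketch: allelic odds ratio and 95% CI from a 2 x 2 table of allele
# counts. The counts are hypothetical, not taken from the study.
import math

# rows: cases / controls; columns: minor / major allele counts
case_minor, case_major = 620, 1042
ctrl_minor, ctrl_major = 270, 604

odds_ratio = (case_minor * ctrl_major) / (case_major * ctrl_minor)

# Woolf's method: standard error of the log odds ratio
se = math.sqrt(1/case_minor + 1/case_major + 1/ctrl_minor + 1/ctrl_major)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.2f} (95% CI: {lo:.2f} - {hi:.2f})")
```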
Abstract:
Background and Objective: To maximise the benefit from statin therapy, patients must maintain regular therapy indefinitely. Non-compliance is thought to be common in those taking medication at regular intervals over long periods of time, especially where they may perceive no immediate benefit (News editorial, 2002). This study extends previous work in which commonly held prescribing data are used as a surrogate marker of compliance, and was designed to examine compliance in those stabilised on statins in a large general practice. Design: Following ethical approval, details were obtained of all patients who had a single statin for 12 consecutive months, with no changes in drug, frequency or dose, between December 1999 and March 2003. Setting: An Eastern Birmingham Primary Care Trust GP surgery. Main Outcome Measures: A compliance ratio was calculated by dividing the number of days of treatment by the number of doses prescribed. For a once-daily regimen the ratio for full compliance = 1. Results: 324 patients were identified. The average compliance ratio for the first six months of the study was 1.06 ± 0.01 (range 0.46 – 2.13) and for the full twelve months was 1.05 ± 0.01 (range 0.58 – 2.08). Conclusions: The data shown here indicate that, as a group, long-term stabilised statin users appear compliant. However, the range of values obtained shows that there are identifiable subsets of patients who are not taking their therapy as prescribed. Although the apparent use of more doses than prescribed in some patients may result from medication hoarding, this cannot be the case in the patients who apparently take less. It has been demonstrated here that the compliance ratio can be used as an early indicator of problems, allowing targeted compliance advice to be given where it will have the most benefit. References: News Editorial. Pharmacy records could be used to enhance statin compliance in elderly. Pharm. J. 2002; 269: 121.
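The compliance ratio is a one-line calculation: days of treatment divided by doses prescribed, so that a once-daily regimen yields a ratio of 1 under full compliance. A minimal sketch with a hypothetical prescription record:

```python
# Minimal sketch: compliance ratio = days of treatment / doses prescribed.
# For a once-daily regimen, full compliance gives a ratio of 1. A ratio > 1
# suggests fewer doses collected than days of therapy (apparent under-use);
# a ratio < 1 suggests more (possible hoarding). Figures are hypothetical.
def compliance_ratio(days_of_treatment, doses_prescribed):
    return days_of_treatment / doses_prescribed

# e.g. 365 days of therapy covered by 12 prescriptions of 28 doses each
ratio = compliance_ratio(365, 12 * 28)
print(f"compliance ratio = {ratio:.2f}")
```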
Abstract:
Heterogeneous and incomplete datasets are common in many real-world visualisation applications. The probabilistic nature of the Generative Topographic Mapping (GTM), which was originally developed for complete continuous data, can be extended to model heterogeneous (i.e. containing both continuous and discrete values) and missing data. This paper describes and assesses the resulting model on both synthetic and real-world heterogeneous data with missing values.
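The heterogeneous extension amounts to mixing likelihood terms by data type: Gaussian terms for continuous values, Bernoulli terms for binary ones, with missing entries simply dropped from the product. The sketch below illustrates such a per-observation log-likelihood under an independence assumption; it is a simplification for illustration, not the GTM model itself.

```python
# Minimal sketch: log-likelihood of one heterogeneous observation under
# independent Gaussian (continuous) and Bernoulli (binary) terms, with
# missing values (None) skipped. A simplification, not the GTM model.
import math

def mixed_log_likelihood(x_cont, mu, sigma, x_bin, p):
    """x_cont and x_bin may contain None for missing values, which are skipped."""
    ll = 0.0
    for x, m in zip(x_cont, mu):
        if x is not None:  # Gaussian term for an observed continuous value
            ll += -0.5 * math.log(2 * math.pi * sigma**2) - (x - m)**2 / (2 * sigma**2)
    for x, q in zip(x_bin, p):
        if x is not None:  # Bernoulli term for an observed binary value
            ll += x * math.log(q) + (1 - x) * math.log(1 - q)
    return ll

print(mixed_log_likelihood([0.3, None], [0.0, 1.0], 1.0, [1, None], [0.7, 0.5]))
```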
Abstract:
This article presents a new method for data collection in regional dialectology based on site-restricted web searches. The method measures the usage and determines the distribution of lexical variants across a region of interest using common web search engines, such as Google or Bing. It involves estimating the proportions of the variants of a lexical alternation variable over a series of cities by counting the number of webpages that contain the variants on newspaper websites originating from these cities through site-restricted web searches. The method is evaluated by mapping the 26 variants of 10 lexical variables with known distributions in American English. In almost all cases, the maps based on site-restricted web searches align closely with traditional dialect maps based on data gathered through questionnaires, demonstrating the accuracy of this method for observing regional linguistic variation. However, unlike traditional methods of collecting dialect data, which are relatively slow, site-restricted web searches allow dialect data to be collected from across a region as large as the United States in a matter of days.
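The proportion estimation step reduces to simple arithmetic once the site-restricted hit counts are in hand: for each city, divide each variant's webpage count by the total across variants. The counts, newspaper sites, and variant pair ("soda" vs "pop") below are hypothetical stand-ins for real search results.

```python
# Minimal sketch: estimate per-city proportions of lexical variants from
# site-restricted web search hit counts. All figures are hypothetical;
# in practice each count would come from a query such as
#   "soda site:<newspaper-website>" on a common search engine.
hit_counts = {
    "Boston":  {"soda": 940, "pop": 60},
    "Detroit": {"soda": 210, "pop": 790},
}

for city, counts in hit_counts.items():
    total = sum(counts.values())
    proportions = {variant: n / total for variant, n in counts.items()}
    print(city, proportions)
```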
Abstract:
Imagining a familiar environment is different from imagining an environmental map, and clinical evidence has demonstrated the existence of double dissociations in brain-damaged patients according to the contents of their mental images. Here, we assessed a large sample of young and old participants, considering their ability to generate different kinds of mental images, namely buildings or common objects. As buildings are environmental stimuli that play an important role in human navigation, we expected that elderly participants would have greater difficulty generating images of buildings than of common objects. We found that young and older participants differed in generating both buildings and common objects. For young participants there were no differences between buildings and common objects, but older participants found it easier to generate common objects than buildings. Buildings are a special type of visual stimulus because in urban environments they are commonly used as landmarks for navigational purposes. Considering that topographical orientation is one of the abilities most affected in normal and pathological aging, the present data throw some light on the impaired processes underlying human navigation.