919 results for VLE data sets
Abstract:
We propose a new model for estimating the size of a population from successive catches taken during a removal experiment. The data from these experiments often have excessive variation, known as overdispersion, as compared with that predicted by the multinomial model. The new model allows catchability to vary randomly among samplings, which accounts for overdispersion. When the catchability is assumed to have a beta distribution, the likelihood function, which is referred to as beta-multinomial, is derived, and hence the maximum likelihood estimates can be evaluated. Simulations show that in the presence of extravariation in the data, the confidence intervals have been substantially underestimated in previous models (Leslie-DeLury, Moran) and that the new model provides more reliable confidence intervals. The performance of these methods was also demonstrated using two real data sets: one with overdispersion, from smallmouth bass (Micropterus dolomieu), and the other without overdispersion, from rat (Rattus rattus).
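The likelihood idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' parameterisation: it assumes that the catchability at each sampling is an independent Beta(a, b) draw, so that each catch is beta-binomial given the animals still present. The function names (removal_loglik, profile_N) and the grid search over N are hypothetical.

```python
from math import lgamma

def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def removal_loglik(N, catches, a, b):
    """Log-likelihood of population size N for successive removal catches.

    Assumption (not taken from the abstract): the catchability of each
    sampling is an independent Beta(a, b) draw, which has been integrated out.
    """
    remaining = N
    total = 0.0
    for c in catches:
        if c > remaining:
            return float("-inf")  # cannot catch more animals than remain
        total += (log_choose(remaining, c)
                  + log_beta(c + a, remaining - c + b)
                  - log_beta(a, b))
        remaining -= c  # caught animals are removed from the population
    return total

def profile_N(catches, a, b, n_max):
    """Grid search for the N that maximises the likelihood at fixed (a, b)."""
    candidates = range(sum(catches), n_max + 1)
    return max(candidates, key=lambda N: removal_loglik(N, catches, a, b))
```

In a full analysis a, b and N would be estimated jointly; the grid search over N at fixed (a, b) only shows the shape of the computation.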
Abstract:
This article develops a method for analysis of growth data with multiple recaptures when the initial ages for all individuals are unknown. The existing approaches either impute the initial ages or model them as random effects. Assumptions about the initial age are not verifiable because all the initial ages are unknown. We present an alternative approach that treats all the lengths including the length at first capture as correlated repeated measures for each individual. Optimal estimating equations are developed using the generalized estimating equations approach that only requires the first two moment assumptions. Explicit expressions for estimation of both mean growth parameters and variance components are given to minimize the computational complexity. Simulation studies indicate that the proposed method works well. Two real data sets are analyzed for illustration, one from whelks (Dicathais aegaota) and the other from southern rock lobster (Jasus edwardsii) in South Australia.
Abstract:
Criminological theories of cross-national studies of homicide have underestimated the effects of quality governance of liberal democracy and region. Data sets from several sources are combined and a comprehensive model of homicide is proposed. Results of the spatial regression model, which controls for the effect of spatial autocorrelation, show that quality governance, human development, economic inequality, and ethnic heterogeneity are statistically significant in predicting homicide. In addition, regions of Latin America and non-Muslim Sub-Saharan Africa have significantly higher rates of homicides ceteris paribus while the effects of East Asian countries and Islamic societies are not statistically significant. These findings are consistent with the expectation of the new modernization and regional theories.
Abstract:
The use of near infrared (NIR) hyperspectral imaging and hyperspectral image analysis for distinguishing between hard, intermediate and soft maize kernels from inbred lines was evaluated. NIR hyperspectral images of two sets (12 and 24 kernels) of whole maize kernels were acquired using a Spectral Dimensions MatrixNIR camera with a spectral range of 960-1662 nm and a sisuChema SWIR (short wave infrared) hyperspectral pushbroom imaging system with a spectral range of 1000-2498 nm. Exploratory principal component analysis (PCA) was used on absorbance images to remove background, bad pixels and shading. On the cleaned images, PCA could be used effectively to find histological classes including glassy (hard) and floury (soft) endosperm. PCA illustrated a distinct difference between glassy and floury endosperm along principal component (PC) three on the MatrixNIR and PC two on the sisuChema, with two distinguishable clusters. Subsequently, partial least squares discriminant analysis (PLS-DA) was applied to build a classification model. The PLS-DA model from the MatrixNIR image (12 kernels) resulted in a root mean square error of prediction (RMSEP) value of 0.18. This was repeated on the MatrixNIR image of the 24 kernels, which resulted in an RMSEP of 0.18. The sisuChema image yielded an RMSEP value of 0.29. The reproducible results obtained with the different data sets indicate that the method proposed in this paper has real potential for future classification uses.
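For readers unfamiliar with the error measure quoted above, the RMSEP is the square root of the mean squared difference between predicted and reference values on a prediction set. The sketch below assumes a 0/1 dummy coding of the two endosperm classes, which the abstract does not specify; the function name rmsep is illustrative.

```python
from math import sqrt

def rmsep(y_true, y_pred):
    """Root mean square error of prediction over paired reference/predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(y_true))

# Hypothetical class codes: 1 = glassy, 0 = floury (an assumption for illustration).
reference = [0, 1, 1, 0]
predicted = [0.1, 0.8, 0.9, 0.3]
error = rmsep(reference, predicted)
```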
Abstract:
A study was performed to investigate the value of near infrared reflectance spectroscopy (NIRS) as an alternative to analytical techniques for identifying QTL associated with feed quality traits. Milled samples from an F6-derived recombinant inbred Tallon/Scarlett population were incubated in the rumen of fistulated cattle, recovered, washed and dried to determine the in-situ dry matter digestibility (DMD). Both pre- and post-digestion samples were analysed using NIRS to quantify key quality components relating to acid detergent fibre, starch and protein. These phenotypic data were used to identify trait-associated QTL and compare them to previously identified QTL. Though a number of genetic correlations were identified between the phenotypic data sets, the correlation of most interest was between DMD and starch digested (r = -0.382). The significance of this genetic correlation was that the NIRS data set identified a putative QTL on chromosome 7H (LOD = 3.3) associated with starch digested. A QTL for DMD occurred in the same region of chromosome 7H, with flanking markers fAG/CAT63 and bPb-0758. The significant correlation and identification of this putative QTL highlight the potential of technologies like NIRS in QTL analysis.
Abstract:
NeEstimator v2 is a completely revised and updated implementation of software that produces estimates of contemporary effective population size, using several different methods and a single input file. NeEstimator v2 includes three single-sample estimators (updated versions of the linkage disequilibrium and heterozygote-excess methods, and a new method based on molecular coancestry), as well as the two-sample (moment-based temporal) method. New features include the following: (i) an improved method for accounting for missing data; (ii) options for screening out rare alleles; (iii) confidence intervals for all methods; (iv) the ability to analyse data sets with large numbers of genetic markers (10000 or more); (v) options for batch processing large numbers of different data sets, which will facilitate cross-method comparisons using simulated data; and (vi) correction for temporal estimates when individuals sampled are not removed from the population (Plan I sampling). The user is given considerable control over input data and composition, and format of output files. The freely available software has a new JAVA interface and runs under MacOS, Linux and Windows.
Abstract:
This research studied distributed computing of all-to-all comparison problems with big data sets. The thesis formalised the problem, and developed a high-performance and scalable computing framework with a programming model, data distribution strategies and task scheduling policies to solve the problem. The study considered storage usage, data locality and load balancing for performance improvement in solving the problem. The research outcomes can be applied in bioinformatics, biometrics and data mining and other domains in which all-to-all comparisons are a typical computing pattern.
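To make the all-to-all comparison pattern concrete, the sketch below enumerates every unordered pair of items and hands the pairs out round-robin to workers. It is a naive baseline under assumed names (assign_pairs, items_needed), not the framework developed in the thesis. It balances the number of comparisons per worker but ignores data locality, which is one of the concerns the abstract raises.

```python
from itertools import combinations

def assign_pairs(n_items, n_workers):
    """Distribute all unordered pairs (i, j) with i < j round-robin over workers."""
    tasks = [[] for _ in range(n_workers)]
    for k, pair in enumerate(combinations(range(n_items), 2)):
        tasks[k % n_workers].append(pair)
    return tasks

def items_needed(task_list):
    """Data items a worker must hold locally to run its assigned comparisons."""
    return sorted({i for pair in task_list for i in pair})

tasks = assign_pairs(n_items=6, n_workers=3)
storage = [items_needed(t) for t in tasks]
```

Inspecting storage shows how many items each worker has to store under this assignment; a locality-aware distribution strategy would aim to reduce that while keeping the load balanced.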
Abstract:
Advancements in analysis techniques have led to a rapid accumulation of biological data in databases. Such data are often in the form of sequences of observations, examples including DNA sequences and amino acid sequences of proteins. The scale and quality of the data promise to answer various biologically relevant questions in more detail than has been possible before. For example, one may wish to identify areas in an amino acid sequence that are important for the function of the corresponding protein, or investigate how characteristics on the level of DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with the understanding of the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet the challenge of deriving meaning from the increasing amounts of data. Our main concern is modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which is used to describe the structure of data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets also pose a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. The problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters.
The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
Abstract:
Large-scale chromosome rearrangements such as copy number variants (CNVs) and inversions encompass a considerable proportion of the genetic variation between human individuals. In a number of cases, they have been closely linked with various inheritable diseases. Single-nucleotide polymorphisms (SNPs) are another large part of the genetic variance between individuals. They are also typically abundant, and measuring them is straightforward and cheap. This thesis presents computational means of using SNPs to detect the presence of inversions and deletions, a particular variety of CNVs. Technically, the inversion-detection algorithm detects the suppressed recombination rate between inverted and non-inverted haplotype populations, whereas the deletion-detection algorithm uses the EM algorithm to estimate the haplotype frequencies of a window with and without a deletion haplotype. As a contribution to population biology, a coalescent simulator for simulating inversion polymorphisms has been developed. Coalescent simulation is a backward-in-time method of modelling population ancestry. Technically, the simulator also models multiple crossovers by using the Counting model as the chiasma interference model. Finally, this thesis includes an experimental section. The aforementioned methods were tested on synthetic data to evaluate their power and specificity. They were also applied to the HapMap Phase II and Phase III data sets, yielding a number of candidates for previously unknown inversions and deletions and also correctly detecting known rearrangements of these kinds.
Abstract:
Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example is genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence, without any occasional extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither the statistical dependency nor the statistical significance is a monotonic property, which means that the traditional pruning techniques do not work.
As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measures. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measures, like Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm scales well, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or whether the data still contain better but undiscovered dependencies.
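As an illustration of one of the significance measures named above, the sketch below scores a single rule X->A with a one-sided Fisher's exact test computed from four counts. It covers only the scoring of one candidate rule and says nothing about the pruning or search algorithm of the thesis. The function name fisher_rule_pvalue and its argument names are hypothetical.

```python
from math import comb

def fisher_rule_pvalue(n_xa, n_x, n_a, n):
    """One-sided Fisher's exact p-value for a positive dependency X -> A.

    n_xa: rows where X and A both hold
    n_x:  rows where X holds
    n_a:  rows where A holds
    n:    total number of rows

    Returns the hypergeometric probability of seeing at least n_xa joint
    occurrences if X and A were independent, given the fixed margins.
    """
    upper = min(n_x, n_a)
    denominator = comb(n, n_x)
    numerator = sum(comb(n_a, k) * comb(n - n_a, n_x - k)
                    for k in range(n_xa, upper + 1))
    return numerator / denominator

# 8 rows, X holds in 4, A holds in 4, and they always coincide.
p = fisher_rule_pvalue(n_xa=4, n_x=4, n_a=4, n=8)
```

A rule of the form X->not A can be scored with the same function by passing the counts for the complement of A, that is, n_x - n_xa and n - n_a in place of n_xa and n_a.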
Abstract:
It has long been thought that tropical rainfall retrievals from satellites have large errors. Here we show, using a new daily 1 degree gridded rainfall data set based on about 1800 gauges from the India Meteorology Department (IMD), that modern satellite estimates are reasonably close to observed rainfall over the Indian monsoon region. Daily satellite rainfalls from the Global Precipitation Climatology Project (GPCP 1DD) and the Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA) are available since 1998. The high summer monsoon (June-September) rain over the Western Ghats and Himalayan foothills is captured in TMPA data. Away from hilly regions, the seasonal mean and intraseasonal variability of rainfall (averaged over regions of a few hundred kilometers linear dimension) from both satellite products are about 15% of observations. Satellite data generally underestimate both the mean and variability of rain, but the phase of intraseasonal variations is accurate. On synoptic timescales, TMPA gives a reasonable depiction of the pattern and intensity of torrential rain from individual monsoon low-pressure systems and depressions. A pronounced biennial oscillation of seasonal total central India rain is seen in all three data sets, with GPCP 1DD being closest to IMD observations. The new satellite data are a promising resource for the study of tropical rainfall variability.
Abstract:
Big Data and Learning Analytics’ promise to revolutionise educational institutions, endeavours, and actions through more and better data is now compelling. Multiple, and continually updating, data sets produce a new sense of ‘personalised learning’. A crucial attribute of the datafication, and subsequent profiling, of learner behaviour and engagement is the continual modification of the learning environment to induce greater levels of investment on the part of each learner. The assumption is that more and better data, gathered faster and fed into ever-updating algorithms, provide more complete tools to understand, and therefore improve, learning experiences through adaptive personalisation. The argument in this paper is that Learning Personalisation names a new logistics of investment as the common ‘sense’ of the school, in which disciplinary education is ‘both disappearing and giving way to frightful continual training, to continual monitoring’.
Abstract:
Changes in alcohol pricing have been documented as inversely associated with changes in consumption and alcohol-related problems. Evidence of the association between price changes and health problems is nevertheless patchy and is based to a large extent on cross-sectional state-level data, or time series of such cross-sectional analyses. Natural experimental studies have been called for. There was a substantial reduction in the price of alcohol in Finland in 2004 due to a reduction in alcohol taxes of one third, on average, and the abolition of duty-free allowances for travellers from the EU. These changes in the Finnish alcohol policy could be considered a natural experiment, which offered a good opportunity to study what happens with regard to alcohol-related problems when prices go down. The present study investigated the effects of this reduction in alcohol prices on (1) alcohol-related and all-cause mortality, and mortality due to cardiovascular diseases, (2) alcohol-related morbidity in terms of hospitalisation, (3) socioeconomic differentials in alcohol-related mortality, and (4) small-area differences in interpersonal violence in the Helsinki Metropolitan area. Differential trends in alcohol-related mortality prior to the price reduction were also analysed. A variety of population-based register data was used in the study. Time-series intervention analysis modelling was applied to monthly aggregations of deaths and hospitalisation for the period 1996-2006. These and other mortality analyses were carried out for men and women aged 15 years and over. Socioeconomic differentials in alcohol-related mortality were assessed on a before/after basis, mortality being followed up in 2001-2003 (before the price reduction) and 2004-2005 (after). Alcohol-related mortality was defined in all the studies on mortality on the basis of information on both underlying and contributory causes of death. 
Hospitalisation related to alcohol meant that there was a reference to alcohol in the primary diagnosis. Data on interpersonal violence were gathered from 86 administrative small areas in the Helsinki Metropolitan area and were also assessed on a before/after basis, followed up in 2002-2003 and 2004-2005. The statistical methods employed to analyse these data sets included time-series analysis, and Poisson and linear regression. The results of the study indicate that alcohol-related deaths increased substantially among men aged 40-69 years and among women aged 50-69 after the price reduction when trends and seasonal variation were taken into account. The increase was mainly attributable to chronic causes, particularly liver diseases. Mortality due to cardiovascular diseases and all-cause mortality, on the other hand, decreased considerably among the over-69-year-olds. The increase in alcohol-related mortality in absolute terms among the 30-59-year-olds was largest among the unemployed and early-age pensioners, and those with a low level of education, social class or income. The relative differences in change between the education and social class subgroups were small. The employed and those under the age of 35 did not suffer from increased alcohol-related mortality in the two years following the price reduction. The gap between the age and education groups, which was substantial in the 1980s, thus further broadened. With regard to alcohol-related hospitalisation, there was an increase in both chronic and acute causes among men under the age of 70, and among women in the 50-69-year age group when trends and seasonal variation were taken into account. Alcohol dependence and other alcohol-related mental and behavioural disorders were the largest category in both the total number of chronic hospitalisations and in the increase. There was no increase in the rate of interpersonal violence in the Helsinki Metropolitan area, and even a decrease in domestic violence.
There was a significant relationship between the measures of social disadvantage on the area level and interpersonal violence, although the differences in the effects of the price reduction between the different areas were small. The findings of the present study suggest that a reduction in alcohol prices may lead to a substantial increase in alcohol-related mortality and morbidity. However, large population group differences were observed regarding responsiveness to the price changes. In particular, the less privileged, such as the unemployed, were most sensitive. In contrast, at least in the Finnish context, the younger generations and the employed do not appear to be adversely affected, and those in the older age groups may even benefit from cheaper alcohol in terms of decreased rates of CVD mortality. The results also suggest that reductions in alcohol prices do not necessarily affect interpersonal violence. The population group differences in the effects of the price changes on alcohol-related harm should be acknowledged, and therefore the policy actions should focus on the population subgroups that are primarily responsive to the price reduction.
Abstract:
The core aim of machine learning is to make a computer program learn from experience. Learning from data is usually defined as a task of learning regularities or patterns in data in order to extract useful information, or to learn the underlying concept. An important sub-field of machine learning is called multi-view learning, where the task is to learn from multiple data sets or views describing the same underlying concept. A typical example of such a scenario would be to study a biological concept using several biological measurements like gene expression, protein expression and metabolic profiles, or to classify web pages based on their content and the contents of their hyperlinks. In this thesis, novel problem formulations and methods for multi-view learning are presented. The contributions include a linear data fusion approach during exploratory data analysis, a new measure to evaluate different kinds of representations for textual data, and an extension of multi-view learning for novel scenarios where the correspondence of samples in the different views or data sets is not known in advance. In order to infer the one-to-one correspondence of samples between two views, a novel concept of multi-view matching is proposed. The matching algorithm is completely data-driven and is demonstrated in several applications such as matching of metabolites between humans and mice, and matching of sentences between documents in two languages.
Abstract:
We revise and extend the extreme value statistic, introduced in Gupta et al., to study direction dependence in the high-redshift supernova data, arising either from departures from the cosmological principle or due to direction-dependent statistical systematics in the data. We introduce a likelihood function that analytically marginalizes over the Hubble constant and use it to extend our previous statistic. We also introduce a new statistic that is sensitive to direction dependence arising from living off-centre inside a large void as well as from previously mentioned reasons for anisotropy. We show that for large data sets, this statistic has a limiting form that can be computed analytically. We apply our statistics to the gold data sets from Riess et al., as in our previous work. Our revision and extension of the previous statistic show that the effect of marginalizing over the Hubble constant instead of using its best-fitting value on our results is only marginal. However, correction of errors in our previous work reduces the level of non-Gaussianity in the 2004 gold data that were found in our earlier work. The revised results for the 2007 gold data show that the data are consistent with isotropy and Gaussianity. Our second statistic confirms these results.
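For context, analytic marginalization over the Hubble constant in supernova fits is commonly written as below. This is the standard textbook form under a flat prior, given as a sketch and not necessarily the exact likelihood used by the authors. The symbols are assumptions introduced here: \Delta_i is the residual between observed and model distance modulus for supernova i, \sigma_i is its uncertainty, and M is the additive offset that absorbs the Hubble constant.

```latex
\begin{aligned}
\chi^2(M) &= \sum_i \frac{(\Delta_i - M)^2}{\sigma_i^2}
           = A - 2BM + CM^2, \\
A &= \sum_i \frac{\Delta_i^2}{\sigma_i^2}, \qquad
B = \sum_i \frac{\Delta_i}{\sigma_i^2}, \qquad
C = \sum_i \frac{1}{\sigma_i^2}, \\
\int_{-\infty}^{\infty} e^{-\chi^2(M)/2}\,\mathrm{d}M
  &= \sqrt{\frac{2\pi}{C}}\,
     \exp\!\left[-\frac{1}{2}\left(A - \frac{B^2}{C}\right)\right], \\
-2\ln\mathcal{L}_{\mathrm{marg}} &= A - \frac{B^2}{C} + \ln\frac{C}{2\pi}.
\end{aligned}
```

Using the best-fitting offset M = B/C instead gives a minimum chi-square of A - B^2/C, so under these assumptions the two approaches differ only by a term that does not depend on the residuals, which is in line with the abstract's remark that the effect of marginalizing is small.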