907 resultados para random graphs
Resumo:
Quantitative studies of nascent entrepreneurs such as GEM and PSED are required to generate their samples by screening the adult population, usually by phone in developed economies. Phone survey research has recently been challenged by shifting patterns of ownership and response rates of landline versus mobile (cell) phones, particularly for younger respondents. This challenge is acutely intense for entrepreneurship which is a strongly age-dependent phenomenon. Although shifting ownership rates have received some attention, shifting response rates have remained largely unexplored. For the Australian GEM 2010 adult population study we conducted a dual-frame approach that allows comparison between samples of mobile and landline phones. We find a substantial response bias towards younger, male and metropolitan respondents for mobile phones – far greater than explained by ownership rates. We also found these response rate differences significantly biases the estimates of the prevalence of early stage entrepreneurship by both samples, even when each sample is weighted to match the Australian population.
Resumo:
In cloud computing resource allocation and scheduling of multiple composite web services is an important challenge. This is especially so in a hybrid cloud where there may be some free resources available from private clouds but some fee-paying resources from public clouds. Meeting this challenge involves two classical computational problems. One is assigning resources to each of the tasks in the composite web service. The other is scheduling the allocated resources when each resource may be used by more than one task and may be needed at different points of time. In addition, we must consider Quality-of-Service issues, such as execution time and running costs. Existing approaches to resource allocation and scheduling in public clouds and grid computing are not applicable to this new problem. This paper presents a random-key genetic algorithm that solves new resource allocation and scheduling problem. Experimental results demonstrate the effectiveness and scalability of the algorithm.
Resumo:
Objective: The global implementation of oral random roadside drug testing is relatively limited, and correspondingly, the literature that focuses on the effectiveness of this intervention is scant. This study aims to provide a preliminary indication of the impact of roadside drug testing in Queensland. Methods: A sample of Queensland motorists’ (N= 922) completed a self-report questionnaire to investigate their drug driving behaviour, as well as examine the perceived affect of legal sanctions (certainty, severity and swiftness) and knowledge of the countermeasure on their subsequent offending behaviour. Results: Analysis of the collected data revealed that approximately 20% of participants reported drug driving at least once in the last six months. Overall, there was considerable variability in respondent’s perceptions regarding the certainty, severity and swiftness of legal sanctions associated with the testing regime and a considerable proportion remained unaware of testing practices. In regards to predicting those who intended to drug driving again in the future, perceptions of apprehension certainty, more specifically low certainty of apprehension, were significantly associated with self-reported intentions to offend. Additionally, self-reported recent drug driving activity and frequent drug consumption were also identified as significant predictors, which indicates that in the current context, past behaviour is a prominent predictor of future behaviour. To a lesser extent, awareness of testing practices was a significant predictor of intending not to drug drive in the future. Conclusion: The results indicate that drug driving is relatively prevalent on Queensland roads, and a number of factors may influence such behaviour. Additionally, while the roadside testing initiative is beginning to have a deterrent impact, its success will likely be linked with targeted intelligence-led implementation in order to increase apprehension levels as well as the general deterrent effect.
Resumo:
Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.
Resumo:
Analytical expressions are derived for the mean and variance, of estimates of the bispectrum of a real-time series assuming a cosinusoidal model. The effects of spectral leakage, inherent in discrete Fourier transform operation when the modes present in the signal have a nonintegral number of wavelengths in the record, are included in the analysis. A single phase-coupled triad of modes can cause the bispectrum to have a nonzero mean value over the entire region of computation owing to leakage. The variance of bispectral estimates in the presence of leakage has contributions from individual modes and from triads of phase-coupled modes. Time-domain windowing reduces the leakage. The theoretical expressions for the mean and variance of bispectral estimates are derived in terms of a function dependent on an arbitrary symmetric time-domain window applied to the record. the number of data, and the statistics of the phase coupling among triads of modes. The theoretical results are verified by numerical simulations for simple test cases and applied to laboratory data to examine phase coupling in a hypothesis testing framework
Resumo:
The CDKN2 gene, encoding the cyclin-dependent kinase inhibitor p16, is a tumour suppressor gene that maps to chromosome band 9p21-p22. The most common mechanism of inactivation of this gene in human cancers is through homozygous deletion; however, in a smaller proportion of tumours and tumour cell lines intragenic mutations occur. In this study we have compiled a database of over 120 published point mutations in the CDKN2 gene from a wide variety of tumour types. A further 50 deletions, insertions, and splice mutations in CDKN2 have also been compiled. Furthermore, we have standardised the numbering of all mutations according to the full-length 156 amino acid form of p16. From this study we are able to define several hot spots, some of which occur at conserved residues within the ankyrin domains of p16. While many of the hotspots are shared by a number of cancers, the relative importance of each position varies, possibly reflecting the role of different carcinogens in the development of certain tumours. As reported previously, the mutational spectrum of CDKN2 in melanomas differs from that of internal malignancies and supports the involvement of UV in melanoma tumorigenesis. Notably, 52% of all substitutions in melanoma-derived samples occurred at just six nucleotide positions. Nonsense mutations comprise a comparatively high proportion of mutations present in the CDKN2 gene, and possible explanations for this are discussed.
Resumo:
The research objectives of this thesis were to contribute to Bayesian statistical methodology by contributing to risk assessment statistical methodology, and to spatial and spatio-temporal methodology, by modelling error structures using complex hierarchical models. Specifically, I hoped to consider two applied areas, and use these applications as a springboard for developing new statistical methods as well as undertaking analyses which might give answers to particular applied questions. Thus, this thesis considers a series of models, firstly in the context of risk assessments for recycled water, and secondly in the context of water usage by crops. The research objective was to model error structures using hierarchical models in two problems, namely risk assessment analyses for wastewater, and secondly, in a four dimensional dataset, assessing differences between cropping systems over time and over three spatial dimensions. The aim was to use the simplicity and insight afforded by Bayesian networks to develop appropriate models for risk scenarios, and again to use Bayesian hierarchical models to explore the necessarily complex modelling of four dimensional agricultural data. The specific objectives of the research were to develop a method for the calculation of credible intervals for the point estimates of Bayesian networks; to develop a model structure to incorporate all the experimental uncertainty associated with various constants thereby allowing the calculation of more credible credible intervals for a risk assessment; to model a single day’s data from the agricultural dataset which satisfactorily captured the complexities of the data; to build a model for several days’ data, in order to consider how the full data might be modelled; and finally to build a model for the full four dimensional dataset and to consider the timevarying nature of the contrast of interest, having satisfactorily accounted for possible spatial and temporal autocorrelations. This work forms five papers, two of which have been published, with two submitted, and the final paper still in draft. The first two objectives were met by recasting the risk assessments as directed, acyclic graphs (DAGs). In the first case, we elicited uncertainty for the conditional probabilities needed by the Bayesian net, incorporated these into a corresponding DAG, and used Markov chain Monte Carlo (MCMC) to find credible intervals, for all the scenarios and outcomes of interest. In the second case, we incorporated the experimental data underlying the risk assessment constants into the DAG, and also treated some of that data as needing to be modelled as an ‘errors-invariables’ problem [Fuller, 1987]. This illustrated a simple method for the incorporation of experimental error into risk assessments. In considering one day of the three-dimensional agricultural data, it became clear that geostatistical models or conditional autoregressive (CAR) models over the three dimensions were not the best way to approach the data. Instead CAR models are used with neighbours only in the same depth layer. This gave flexibility to the model, allowing both the spatially structured and non-structured variances to differ at all depths. We call this model the CAR layered model. Given the experimental design, the fixed part of the model could have been modelled as a set of means by treatment and by depth, but doing so allows little insight into how the treatment effects vary with depth. Hence, a number of essentially non-parametric approaches were taken to see the effects of depth on treatment, with the model of choice incorporating an errors-in-variables approach for depth in addition to a non-parametric smooth. The statistical contribution here was the introduction of the CAR layered model, the applied contribution the analysis of moisture over depth and estimation of the contrast of interest together with its credible intervals. These models were fitted using WinBUGS [Lunn et al., 2000]. The work in the fifth paper deals with the fact that with large datasets, the use of WinBUGS becomes more problematic because of its highly correlated term by term updating. In this work, we introduce a Gibbs sampler with block updating for the CAR layered model. The Gibbs sampler was implemented by Chris Strickland using pyMCMC [Strickland, 2010]. This framework is then used to consider five days data, and we show that moisture in the soil for all the various treatments reaches levels particular to each treatment at a depth of 200 cm and thereafter stays constant, albeit with increasing variances with depth. In an analysis across three spatial dimensions and across time, there are many interactions of time and the spatial dimensions to be considered. Hence, we chose to use a daily model and to repeat the analysis at all time points, effectively creating an interaction model of time by the daily model. Such an approach allows great flexibility. However, this approach does not allow insight into the way in which the parameter of interest varies over time. Hence, a two-stage approach was also used, with estimates from the first-stage being analysed as a set of time series. We see this spatio-temporal interaction model as being a useful approach to data measured across three spatial dimensions and time, since it does not assume additivity of the random spatial or temporal effects.
Resumo:
With the growing number of XML documents on theWeb it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses only on one feature of the XML documents, this being either their structure or their content due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content is not suitable for reallife datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher order model, namely a 3- order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose largesized tensor models to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in the information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large scaled datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
Resumo:
Fusion techniques have received considerable attention for achieving performance improvement with biometrics. While a multi-sample fusion architecture reduces false rejects, it also increases false accepts. This impact on performance also depends on the nature of subsequent attempts, i.e., random or adaptive. Expressions for error rates are presented and experimentally evaluated in this work by considering the multi-sample fusion architecture for text-dependent speaker verification using HMM based digit dependent speaker models. Analysis incorporating correlation modeling demonstrates that the use of adaptive samples improves overall fusion performance compared to randomly repeated samples. For a text dependent speaker verification system using digit strings, sequential decision fusion of seven instances with three random samples is shown to reduce the overall error of the verification system by 26% which can be further reduced by 6% for adaptive samples. This analysis novel in its treatment of random and adaptive multiple presentations within a sequential fused decision architecture, is also applicable to other biometric modalities such as finger prints and handwriting samples.
Resumo:
Poisson distribution has often been used for count like accident data. Negative Binomial (NB) distribution has been adopted in the count data to take care of the over-dispersion problem. However, Poisson and NB distributions are incapable of taking into account some unobserved heterogeneities due to spatial and temporal effects of accident data. To overcome this problem, Random Effect models have been developed. Again another challenge with existing traffic accident prediction models is the distribution of excess zero accident observations in some accident data. Although Zero-Inflated Poisson (ZIP) model is capable of handling the dual-state system in accident data with excess zero observations, it does not accommodate the within-location correlation and between-location correlation heterogeneities which are the basic motivations for the need of the Random Effect models. This paper proposes an effective way of fitting ZIP model with location specific random effects and for model calibration and assessment the Bayesian analysis is recommended.