971 resultados para clustered binary data
Resumo:
Various time-memory tradeoffs attacks for stream ciphers have been proposed over the years. However, the claimed success of these attacks assumes the initialisation process of the stream cipher is one-to-one. Some stream cipher proposals do not have a one-to-one initialisation process. In this paper, we examine the impact of this on the success of time-memory-data tradeoff attacks. Under the circumstances, some attacks are more successful than previously claimed while others are less. The conditions for both cases are established.
Resumo:
With the increasing number of XML documents in varied domains, it has become essential to identify ways of finding interesting information from these documents. Data mining techniques were used to derive this interesting information. Mining on XML documents is impacted by its model due to the semi-structured nature of these documents. Hence, in this chapter we present an overview of the various models of XML documents, how these models were used for mining and some of the issues and challenges in these models. In addition, this chapter also provides some insights into the future models of XML documents for effectively capturing the two important features namely structure and content of XML documents for mining.
Resumo:
This special issue of the Journal of Urban Technology brings together five articles that are based on presentations given at the Street Computing workshop held on 24 November 2009 in Melbourne in conjunction with the Australian Computer-Human Interaction conference (OZCHI 2009). Our own article introduces the Street Computing vision and explores the potential, challenges and foundations of this research vision. In order to do so, we first look at the currently available sources of information and discuss their link to existing research efforts. Section 2 then introduces the notion of Street Computing and our research approach in more detail. Section 3 looks beyond the core concept itself and summarises related work in this field of interest.
Resumo:
In this paper we present a sequential Monte Carlo algorithm for Bayesian sequential experimental design applied to generalised non-linear models for discrete data. The approach is computationally convenient in that the information of newly observed data can be incorporated through a simple re-weighting step. We also consider a flexible parametric model for the stimulus-response relationship together with a newly developed hybrid design utility that can produce more robust estimates of the target stimulus in the presence of substantial model and parameter uncertainty. The algorithm is applied to hypothetical clinical trial or bioassay scenarios. In the discussion, potential generalisations of the algorithm are suggested to possibly extend its applicability to a wide variety of scenarios
Resumo:
Several authors stress the importance of data’s crucial foundation for operational, tactical and strategic decisions (e.g., Redman 1998, Tee et al. 2007). Data provides the basis for decision making as data collection and processing is typically associated with reducing uncertainty in order to make more effective decisions (Daft and Lengel 1986). While the first series of investments of Information Systems/Information Technology (IS/IT) into organizations improved data collection, restricted computational capacity and limited processing power created challenges (Simon 1960). Fifty years on, capacity and processing problems are increasingly less relevant; in fact, the opposite exists. Determining data relevance and usefulness is complicated by increased data capture and storage capacity, as well as continual improvements in information processing capability. As the IT landscape changes, businesses are inundated with ever-increasing volumes of data from both internal and external sources available on both an ad-hoc and real-time basis. More data, however, does not necessarily translate into more effective and efficient organizations, nor does it increase the likelihood of better or timelier decisions. This raises questions about what data managers require to assist their decision making processes.
Resumo:
Handling information overload online, from the user's point of view is a big challenge, especially when the number of websites is growing rapidly due to growth in e-commerce and other related activities. Personalization based on user needs is the key to solving the problem of information overload. Personalization methods help in identifying relevant information, which may be liked by a user. User profile and object profile are the important elements of a personalization system. When creating user and object profiles, most of the existing methods adopt two-dimensional similarity methods based on vector or matrix models in order to find inter-user and inter-object similarity. Moreover, for recommending similar objects to users, personalization systems use the users-users, items-items and users-items similarity measures. In most cases similarity measures such as Euclidian, Manhattan, cosine and many others based on vector or matrix methods are used to find the similarities. Web logs are high-dimensional datasets, consisting of multiple users, multiple searches with many attributes to each. Two-dimensional data analysis methods may often overlook latent relationships that may exist between users and items. In contrast to other studies, this thesis utilises tensors, the high-dimensional data models, to build user and object profiles and to find the inter-relationships between users-users and users-items. To create an improved personalized Web system, this thesis proposes to build three types of profiles: individual user, group users and object profiles utilising decomposition factors of tensor data models. A hybrid recommendation approach utilising group profiles (forming the basis of a collaborative filtering method) and object profiles (forming the basis of a content-based method) in conjunction with individual user profiles (forming the basis of a model based approach) is proposed for making effective recommendations. A tensor-based clustering method is proposed that utilises the outcomes of popular tensor decomposition techniques such as PARAFAC, Tucker and HOSVD to group similar instances. An individual user profile, showing the user's highest interest, is represented by the top dimension values, extracted from the component matrix obtained after tensor decomposition. A group profile, showing similar users and their highest interest, is built by clustering similar users based on tensor decomposed values. A group profile is represented by the top association rules (containing various unique object combinations) that are derived from the searches made by the users of the cluster. An object profile is created to represent similar objects clustered on the basis of their similarity of features. Depending on the category of a user (known, anonymous or frequent visitor to the website), any of the profiles or their combinations is used for making personalized recommendations. A ranking algorithm is also proposed that utilizes the personalized information to order and rank the recommendations. The proposed methodology is evaluated on data collected from a real life car website. Empirical analysis confirms the effectiveness of recommendations made by the proposed approach over other collaborative filtering and content-based recommendation approaches based on two-dimensional data analysis methods.
Resumo:
Mixture models are a flexible tool for unsupervised clustering that have found popularity in a vast array of research areas. In studies of medicine, the use of mixtures holds the potential to greatly enhance our understanding of patient responses through the identification of clinically meaningful clusters that, given the complexity of many data sources, may otherwise by intangible. Furthermore, when developed in the Bayesian framework, mixture models provide a natural means for capturing and propagating uncertainty in different aspects of a clustering solution, arguably resulting in richer analyses of the population under study. This thesis aims to investigate the use of Bayesian mixture models in analysing varied and detailed sources of patient information collected in the study of complex disease. The first aim of this thesis is to showcase the flexibility of mixture models in modelling markedly different types of data. In particular, we examine three common variants on the mixture model, namely, finite mixtures, Dirichlet Process mixtures and hidden Markov models. Beyond the development and application of these models to different sources of data, this thesis also focuses on modelling different aspects relating to uncertainty in clustering. Examples of clustering uncertainty considered are uncertainty in a patient’s true cluster membership and accounting for uncertainty in the true number of clusters present. Finally, this thesis aims to address and propose solutions to the task of comparing clustering solutions, whether this be comparing patients or observations assigned to different subgroups or comparing clustering solutions over multiple datasets. To address these aims, we consider a case study in Parkinson’s disease (PD), a complex and commonly diagnosed neurodegenerative disorder. In particular, two commonly collected sources of patient information are considered. The first source of data are on symptoms associated with PD, recorded using the Unified Parkinson’s Disease Rating Scale (UPDRS) and constitutes the first half of this thesis. The second half of this thesis is dedicated to the analysis of microelectrode recordings collected during Deep Brain Stimulation (DBS), a popular palliative treatment for advanced PD. Analysis of this second source of data centers on the problems of unsupervised detection and sorting of action potentials or "spikes" in recordings of multiple cell activity, providing valuable information on real time neural activity in the brain.
Resumo:
This paper argues for a renewed focus on statistical reasoning in the beginning school years, with opportunities for children to engage in data modelling. Results are reported from the first year of a 3-year longitudinal study in which three classes of first-grade children (6-year-olds) and their teachers engaged in data modelling activities. The theme of Looking after our Environment, part of the children’s science curriculum, provided the task context. The goals for the two activities addressed here included engaging children in core components of data modelling, namely, selecting attributes, structuring and representing data, identifying variation in data, and making predictions from given data. Results include the various ways in which children represented and re represented collected data, including attribute selection, and the metarepresentational competence they displayed in doing so. The “data lenses” through which the children dealt with informal inference (variation and prediction) are also reported.
Resumo:
In response to the need to leverage private finance and the lack of competition in some parts of the Australian public sector infrastructure market, especially in the very large economic infrastructure sector procured using Pubic Private Partnerships, the Australian Federal government has demonstrated its desire to attract new sources of in-bound foreign direct investment (FDI). This paper aims to report on progress towards an investigation into the determinants of multinational contractors’ willingness to bid for Australian public sector major infrastructure projects. This research deploys Dunning’s eclectic theory for the first time in terms of in-bound FDI by multinational contractors into Australia. Elsewhere, the authors have developed Dunning’s principal hypothesis to suit the context of this research and to address a weakness arising in this hypothesis that is based on a nominal approach to the factors in Dunning's eclectic framework and which fails to speak to the relative explanatory power of these factors. In this paper, a first stage test of the authors' development of Dunning's hypothesis is presented by way of an initial review of secondary data vis-à-vis the selected sector (roads and bridges) in Australia (as the host location) and with respect to four selected home countries (China; Japan; Spain; and US). In doing so, the next stage in the research method concerning sampling and case studies is also further developed and described in this paper. In conclusion, the extent to which the initial review of secondary data suggests the relative importance of the factors in the eclectic framework is considered. It is noted that more robust conclusions are expected following the future planned stages of the research including primary data from the case studies and a global survey of the world’s largest contractors and which is briefly previewed. Finally, and beyond theoretical contributions expected from the overall approach taken to developing and testing Dunning’s framework, other expected contributions concerning research method and practical implications are mentioned.
Resumo:
A rule-based approach for classifying previously identified medical concepts in the clinical free text into an assertion category is presented. There are six different categories of assertions for the task: Present, Absent, Possible, Conditional, Hypothetical and Not associated with the patient. The assertion classification algorithms were largely based on extending the popular NegEx and Context algorithms. In addition, a health based clinical terminology called SNOMED CT and other publicly available dictionaries were used to classify assertions, which did not fit the NegEx/Context model. The data for this task includes discharge summaries from Partners HealthCare and from Beth Israel Deaconess Medical Centre, as well as discharge summaries and progress notes from University of Pittsburgh Medical Centre. The set consists of 349 discharge reports, each with pairs of ground truth concept and assertion files for system development, and 477 reports for evaluation. The system’s performance on the evaluation data set was 0.83, 0.83 and 0.83 for recall, precision and F1-measure, respectively. Although the rule-based system shows promise, further improvements can be made by incorporating machine learning approaches.
Resumo:
This paper investigates the determinants of China’s regional innovation capacity (RIC) and variations in these determinants between different types of regions. Based on the framework of national innovation capacity (NIC) and research on innovation system, this paper develops a framework of RIC in the Chinese context. Using panel data from 1991 to 2009, clustering analysis is first employed to classify regions according to their innovation development path. Panel data regressions with fixed effect model are conducted to explore the determinants of RIC and how these vary across the different regional clusters. We find that the 30 regions can be clustered into three groups, and there are considerable differences in the drivers of RIC between these different regional groups.
Resumo:
Projects funded by the Australian National Data Service(ANDS). The specific projects that were funded included: a) Greenhouse Gas Emissions Project (N2O) with Prof. Peter Grace from QUT’s Institute of Sustainable Resources. b) Q150 Project for the management of multimedia data collected at Festival events with Prof. Phil Graham from QUT’s Institute of Creative Industries. c) Bio-diversity environmental sensing with Prof. Paul Roe from the QUT Microsoft eResearch Centre. For the purposes of these projects the Eclipse Rich Client Platform (Eclipse RCP) was chosen as an appropriate software development framework within which to develop the respective software. This poster will present a brief overview of the requirements of the projects, an overview of the experiences of the project team in using Eclipse RCP, report on the advantages and disadvantages of using Eclipse and it’s perspective on Eclipse as an integrated tool for supporting future data management requirements.
Resumo:
A novel method for genotyping the clustered, regularly interspaced short-palindromic-repeat (CRISPR) locus of Campylobacter jejuni is described. Following real-time PCR, CRISPR products were subjected to high-resolution melt (HRM) analysis, a new technology that allows precise melt profile determination of amplicons. This investigation shows that the CRISPR HRM assay provides a powerful addition to existing C. jejuni genotyping methods and emphasizes the potential of HRM for genotyping short sequence repeats in other species
Resumo:
It is a big challenge to acquire correct user profiles for personalized text classification since users may be unsure in providing their interests. Traditional approaches to user profiling adopt machine learning (ML) to automatically discover classification knowledge from explicit user feedback in describing personal interests. However, the accuracy of ML-based methods cannot be significantly improved in many cases due to the term independence assumption and uncertainties associated with them. This paper presents a novel relevance feedback approach for personalized text classification. It basically applies data mining to discover knowledge from relevant and non-relevant text and constraints specific knowledge by reasoning rules to eliminate some conflicting information. We also developed a Dempster-Shafer (DS) approach as the means to utilise the specific knowledge to build high-quality data models for classification. The experimental results conducted on Reuters Corpus Volume 1 and TREC topics support that the proposed technique achieves encouraging performance in comparing with the state-of-the-art relevance feedback models.
Resumo:
This paper studies the missing covariate problem which is often encountered in survival analysis. Three covariate imputation methods are employed in the study, and the effectiveness of each method is evaluated within the hazard prediction framework. Data from a typical engineering asset is used in the case study. Covariate values in some time steps are deliberately discarded to generate an incomplete covariate set. It is found that although the mean imputation method is simpler than others for solving missing covariate problems, the results calculated by it can differ largely from the real values of the missing covariates. This study also shows that in general, results obtained from the regression method are more accurate than those of the mean imputation method but at the cost of a higher computational expensive. Gaussian Mixture Model (GMM) method is found to be the most effective method within these three in terms of both computation efficiency and predication accuracy.