905 results for Web Mining, Data Mining, User Topic Model, Web User Profiles
Abstract:
In this paper, we describe the evaluation of a method for building detection by the Dempster-Shafer fusion of LIDAR data and multispectral images. For that purpose, ground truth was digitised for two test sites with quite different characteristics. Using these data sets, the heuristic model for the probability mass assignments of the method is validated, and rules for the tuning of the parameters of this model are discussed. Further we evaluate the contributions of the individual cues used in the classification process to the quality of the classification results. Our results show the degree to which the overall correctness of the results can be improved by fusing LIDAR data with multispectral images.
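The fusion step named in this abstract rests on Dempster's rule of combination. The sketch below shows only that general rule. The two-class frame (`building`, `other`) and the mass values are hypothetical and are not taken from the paper, whose heuristic mass assignment model is not reproduced here.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass,
    and the masses of each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass falling on the empty set measures the conflict between sources.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalise so that the combined masses sum to 1 again.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical two-class frame of discernment (an assumption for illustration).
BUILDING = frozenset({"building"})
OTHER = frozenset({"other"})
THETA = BUILDING | OTHER  # "don't know": mass not committed to either class

# Hypothetical masses for one pixel from each data source.
lidar = {BUILDING: 0.6, THETA: 0.4}
spectral = {OTHER: 0.3, THETA: 0.7}

fused = combine(lidar, spectral)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

Mass left on `THETA` represents uncertainty that neither cue resolves, which is what distinguishes this scheme from a plain product of probabilities.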
Abstract:
Visualising data for exploratory analysis is a big challenge in scientific and engineering domains where there is a need to gain insight into the structure and distribution of the data. Typically, visualisation methods like principal component analysis and multi-dimensional scaling are used, but it is difficult to incorporate prior knowledge about structure of the data into the analysis. In this technical report we discuss a complementary approach based on an extension of a well known non-linear probabilistic model, the Generative Topographic Mapping. We show that by including prior information of the covariance structure into the model, we are able to improve both the data visualisation and the model fit.
Abstract:
Recently, Drăgulescu and Yakovenko proposed an analytical formula for computing the probability density function of stock log returns, based on the Heston model, which they tested empirically. Their research design inadvertently biased the fit of the data in favour of the Heston model, thus overstating their empirical results. Furthermore, Drăgulescu and Yakovenko did not perform any statistical goodness-of-fit tests. This study employs a research design that facilitates statistical tests of the goodness-of-fit of the Heston model to empirical returns. Robustness checks are also performed. In brief, the Heston model outperformed the Gaussian model only at high frequencies and even so does not provide a statistically acceptable fit to the data. The Gaussian model performed (marginally) better at medium and low frequencies, at which points the extra parameters of the Heston model have adverse impacts on the test statistics. © 2005 Taylor & Francis Group Ltd.
Abstract:
Overlaying maps using a desktop GIS is often the first step of a multivariate spatial analysis. The potential of this operation has increased considerably as data sources and Web services to manipulate them are becoming widely available via the Internet. Standards from the OGC enable such geospatial ‘mashups’ to be seamless and user driven, involving discovery of thematic data. The user is naturally inclined to look for spatial clusters and ‘correlation’ of outcomes. Using classical cluster detection scan methods to identify multivariate associations can be problematic in this context, because of a lack of control over, or knowledge about, background populations. For public health and epidemiological mapping, this limiting factor can be critical but often the focus is on spatial identification of risk factors associated with health or clinical status. In this article we point out that this association itself can ensure some control on underlying populations, and develop an exploratory scan statistic framework for multivariate associations. Inference using statistical map methodologies can be used to test the clustered associations. The approach is illustrated with a hypothetical data example and an epidemiological study on community MRSA. Scenarios of potential use for online mashups are introduced but full implementation is left for further research.
Abstract:
1. Pearson's correlation coefficient only tests whether the data fit a linear model. With large numbers of observations, quite small values of r become significant and the X variable may only account for a minute proportion of the variance in Y. Hence, the value of r squared should always be calculated and included in a discussion of the significance of r. 2. The use of r assumes that a bivariate normal distribution is present and this assumption should be examined prior to the study. If Pearson's r is not appropriate, then a non-parametric correlation coefficient such as Spearman's rs may be used. 3. A significant correlation should not be interpreted as indicating causation, especially in observational studies in which there is a high probability that the two variables are correlated because of their mutual correlations with other variables. 4. In studies of measurement error, there are problems in using r as a test of reliability and the ‘intra-class correlation coefficient’ should be used as an alternative. A correlation test provides only limited information as to the relationship between two variables. Fitting a regression line to the data using the method known as ‘least squares’ provides much more information and the methods of regression and their application in optometry will be discussed in the next article.
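Points 1 and 2 of this abstract can be illustrated with a short, dependency-free sketch that computes Pearson's r, r squared, and Spearman's rs. The data are made up for illustration: a monotonic but non-linear relationship, for which the rank-based coefficient is exactly 1 while Pearson's r is not.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Made-up data: y rises with x, but along a cubic rather than a line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

r = pearson_r(x, y)
r_squared = r ** 2  # proportion of the variance in y accounted for by a linear fit
rs = spearman_rs(x, y)
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}, rs = {rs:.4f}")
```

The sketch does not compute significance levels; it only shows why r squared and a rank-based alternative are worth reporting alongside r.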
Abstract:
Visualising data for exploratory analysis is a major challenge in many applications. Visualisation allows scientists to gain insight into the structure and distribution of the data, for example finding common patterns and relationships between samples as well as variables. Typically, visualisation methods like principal component analysis and multi-dimensional scaling are employed. These methods are favoured because of their simplicity, but they cannot cope with missing data and it is difficult to incorporate prior knowledge about properties of the variable space into the analysis; this is particularly important in the high-dimensional, sparse datasets typical in geochemistry. In this paper we show how to utilise a block-structured correlation matrix using a modification of a well known non-linear probabilistic visualisation model, the Generative Topographic Mapping (GTM), which can cope with missing data. The block structure supports direct modelling of strongly correlated variables. We show that by including prior structural information it is possible to improve both the data visualisation and the model fit. These benefits are demonstrated on artificial data as well as a real geochemical dataset used for oil exploration, where the proposed modifications improved the missing data imputation results by 3 to 13%.
Abstract:
In this paper we present a novel method for emulating a stochastic, or random output, computer model and show its application to a complex rabies model. The method is evaluated both in terms of accuracy and computational efficiency on synthetic data and the rabies model. We address the issue of experimental design and provide empirical evidence on the effectiveness of utilizing replicate model evaluations compared to a space-filling design. We employ the Mahalanobis error measure to validate the heteroscedastic Gaussian process based emulator predictions for both the mean and (co)variance. The emulator allows efficient screening to identify important model inputs and better understanding of the complex behaviour of the rabies model.
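The Mahalanobis error measure mentioned in this abstract compares held-out model outputs with the emulator's predictive mean while accounting for the predictive covariance. The sketch below shows the general squared Mahalanobis distance only. The numbers are hypothetical and the function names are illustrative, not taken from the paper.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def mahalanobis_sq(observed, mean, cov):
    """Squared Mahalanobis distance (observed - mean)^T cov^{-1} (observed - mean)."""
    diff = [o - m for o, m in zip(observed, mean)]
    weighted = solve(cov, diff)  # cov^{-1} @ diff without forming the inverse
    return sum(d * w for d, w in zip(diff, weighted))


# Hypothetical validation data: two held-out simulator outputs, plus the
# emulator's predictive mean and covariance at the same inputs.
observed = [1.0, 1.0]
predicted_mean = [0.0, 0.0]
predicted_cov = [[2.0, 1.0], [1.0, 2.0]]

d2 = mahalanobis_sq(observed, predicted_mean, predicted_cov)
print(round(d2, 4))
```

If the predictive distribution is Gaussian and well specified, this quantity follows a chi-squared distribution with as many degrees of freedom as there are validation points, so values far from that count suggest a poorly calibrated mean or covariance.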
Abstract:
This paper examines the problems in the definition of the General Non-Parametric Corporate Performance (GNCP) and introduces a multiplicative linear programming formulation as an alternative model for corporate performance. We verified and tested a statistically significant difference between the two models based on an application to 27 UK industries using six performance ratios. Our new model is found to be a more robust performance model than the previous standard Data Envelopment Analysis (DEA) model.
Abstract:
Social streams have proven to be the most up-to-date and inclusive information on current events. In this paper we propose a novel probabilistic modelling framework, called violence detection model (VDM), which enables the identification of text containing violent content and extraction of violence-related topics over social media data. The proposed VDM model does not require any labeled corpora for training; instead, it only needs the incorporation of word prior knowledge which captures whether a word indicates violence or not. We propose a novel approach of deriving word prior knowledge using the relative entropy measurement of words based on the intuition that low entropy words are indicative of semantically coherent topics and therefore more informative, while high entropy words indicate words whose usage is more topically diverse and therefore less informative. Our proposed VDM model has been evaluated on the TREC Microblog 2011 dataset to identify topics related to violence. Experimental results show that deriving word priors using our proposed relative entropy method is more effective than the widely-used information gain method. Moreover, VDM gives higher violence classification results and produces more coherent violence-related topics compared to a few competitive baselines.
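The abstract does not state how relative entropy is applied to individual words, so the sketch below shows only the general quantity, the Kullback-Leibler divergence between two discrete distributions. The counts, the two-category split, and the choice of a uniform reference distribution are all assumptions for illustration and should not be read as the VDM procedure.

```python
from math import log2


def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    p and q are aligned lists of probabilities; q is assumed to be
    strictly positive wherever p is positive.
    """
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical counts of two words across [violent, non-violent] reference texts.
focused = normalise([9, 1])  # usage concentrated in one category
spread = normalise([5, 5])   # usage spread evenly over both categories
uniform = [0.5, 0.5]         # assumed reference distribution

focused_score = relative_entropy(focused, uniform)
spread_score = relative_entropy(spread, uniform)
print(round(focused_score, 4), round(spread_score, 4))
```

Under these assumptions, a word whose usage is concentrated in one category diverges from the reference, while an evenly spread word scores zero.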
Abstract:
The Resource Space Model is a data model that can effectively and flexibly manage the digital resources in cyber-physical systems from multidimensional and hierarchical perspectives. This paper focuses on constructing resource space automatically. We propose a framework that organizes a set of digital resources according to different semantic dimensions combining human background knowledge in WordNet and Wikipedia. The construction process includes four steps: extracting candidate keywords, building semantic graphs, detecting semantic communities and generating resource space. An unsupervised statistical language topic model (i.e., Latent Dirichlet Allocation) is applied to extract candidate keywords of the facets. To better interpret meanings of the facets found by LDA, we map the keywords to Wikipedia concepts, calculate word relatedness using WordNet's noun synsets and construct corresponding semantic graphs. Moreover, semantic communities are identified by the GN algorithm. After extracting candidate axes based on Wikipedia concept hierarchy, the final axes of resource space are sorted and picked out through three different ranking strategies. The experimental results demonstrate that the proposed framework can organize resources automatically and effectively. © 2013 Published by Elsevier Ltd. All rights reserved.
Abstract:
Personalized recommender systems aim to assist users in retrieving and accessing interesting items by automatically acquiring user preferences from the historical data and matching items with the preferences. In the last decade, recommendation services have gained great attention due to the problem of information overload. However, despite recent advances of personalization techniques, several critical issues in modern recommender systems have not been well studied. These issues include: (1) understanding the accessing patterns of users (i.e., how to effectively model users' accessing behaviors); (2) understanding the relations between users and other objects (i.e., how to comprehensively assess the complex correlations between users and entities in recommender systems); and (3) understanding the interest change of users (i.e., how to adaptively capture users' preference drift over time). To meet the needs of users in modern recommender systems, it is imperative to provide solutions to address the aforementioned issues and apply the solutions to real-world applications.
The major goal of this dissertation is to provide integrated recommendation approaches to tackle the challenges of the current generation of recommender systems. In particular, three user-oriented aspects of recommendation techniques were studied, including understanding accessing patterns, understanding complex relations and understanding temporal dynamics. To this end, we made three research contributions. First, we presented various personalized user profiling algorithms to capture click behaviors of users from both coarse- and fine-grained granularities; second, we proposed graph-based recommendation models to describe the complex correlations in a recommender system; third, we studied temporal recommendation approaches in order to capture the preference changes of users, by considering both long-term and short-term user profiles. In addition, a versatile recommendation framework was proposed, in which the proposed recommendation techniques were seamlessly integrated. Different evaluation criteria were implemented in this framework for evaluating recommendation techniques in real-world recommendation applications.
In summary, the frequent changes of user interests and item repository lead to a series of user-centric challenges that are not well addressed in the current generation of recommender systems. My work proposed reasonable solutions to these challenges and provided insights on how to address these challenges using a simple yet effective recommendation framework.
Abstract:
We are in an era of unprecedented data volumes generated from observations and model simulations. This is particularly true of satellite Earth Observations (EO) and global scale oceanographic models. This presents us with an opportunity to evaluate large scale oceanographic model outputs using EO data. Previous work on model skill evaluation has led to a plethora of metrics. The paper defines two new model skill evaluation metrics. The metrics are based on the theory of universal multifractals and their purpose is to measure the structural similarity between the model predictions and the EO data. The two metrics have the following advantages over the standard techniques: a) they are scale-free, b) they carry an important part of the information about how the model represents different oceanographic drivers. These two metrics are then used in the paper to evaluate the performance of the FVCOM model in the shelf seas around the south-west coast of the UK.
Abstract:
Developing a theoretical framework for pervasive information environments is an enormous goal. This paper aims to provide a small step towards such a goal. The following pages report on our initial investigations to devise a framework that will continue to support locative, experiential and evaluative data from ‘user feedback’ in an increasingly pervasive information environment. We loosely attempt to outline this framework by developing a methodology capable of moving from rapid deployment of software and hardware technologies, towards a goal of realistic immersive experience of pervasive information. We propose various technical solutions and address a range of problems such as information capture through a novel model of sensing, processing, visualization and cognition.
Abstract:
Data leakage is a serious issue and can result in the loss of sensitive data, compromising user accounts and details, potentially affecting millions of internet users. This paper contributes to research in online security and reducing personal footprint by evaluating the levels of privacy provided by the Firefox browser. The aim of identifying conditions that would minimize data leakage and maximize data privacy is addressed by assessing and comparing data leakage in the four possible browsing modes: normal and private modes using a browser installed on the host PC or using a portable browser from a connected USB device respectively. To provide a firm foundation for analysis, a series of carefully designed, pre-planned browsing sessions were repeated in each of the various modes of Firefox. This included low RAM environments to determine any effects low RAM may have on browser data leakage. The results show that considerable data leakage may occur within Firefox. In normal mode, all of the browsing information is stored within the Mozilla profile folder in Firefox-specific SQLite databases and sessionstore.js. While passwords were not stored as plain text, other confidential information such as credit card numbers could be recovered from the Form history under certain conditions. There is no difference when using a portable browser in normal mode, except that the Mozilla profile folder is located on the USB device rather than the host's hard disk. By comparison, private browsing reduces data leakage. Our findings confirm that no information is written to the Firefox-related locations on the hard disk or USB device during private browsing, implying that no deletion would be necessary and no remnants of data would be forensically recoverable from unallocated space. However, two aspects of data leakage occurred equally in all four browsing modes. 
Firstly, all of the browsing history was stored in the live RAM and was therefore accessible while the browser remained open. Secondly, in low RAM situations, the operating system pages RAM out to pagefile.sys on the host's hard disk. Irrespective of the browsing mode used, this may include Firefox history elements which can then remain forensically recoverable for a considerable time.