18 results for categorical and mix datasets

in Aston University Research Archive


Relevance:

100.00%

Abstract:

Very large spatially-referenced datasets, for example those derived from satellite-based sensors which sample across the globe or from large monitoring networks of individual sensors, are becoming increasingly common and more widely available for use in environmental decision making. In large or dense sensor networks, huge quantities of data can be collected over small time periods. In many applications the generation of maps, or predictions at specific locations, from the data in (near) real-time is crucial. Geostatistical operations such as interpolation are vital in this map-generation process, and in emergency situations the resulting predictions need to be available almost instantly, so that decision makers can make informed decisions and define risk and evacuation zones. It is also helpful when analysing data in less time-critical applications, for example when interacting directly with the data for exploratory analysis, that the algorithms are responsive within a reasonable time frame. Performing geostatistical analysis on such large spatial datasets can present a number of problems, particularly in the case where maximum likelihood methods are used. Although the storage requirements scale only linearly with the number of observations in the dataset, the computational complexity in terms of memory and speed scales quadratically and cubically, respectively. Most modern commodity hardware has at least two processor cores. Other mechanisms for allowing parallel computation, such as Grid-based systems, are also becoming increasingly available. However, there currently seems to be little interest in exploiting this extra processing power within the context of geostatistics. In this paper we review the existing parallel approaches for geostatistics. 
By recognising that different natural parallelisms exist and can be exploited depending on whether the dataset is sparsely or densely sampled with respect to the range of variation, we introduce two contrasting novel implementations of parallel algorithms based on approximating the data likelihood, extending the methods of Vecchia [1988] and Tresp [2000]. Using parallel maximum likelihood variogram estimation and parallel prediction algorithms, we show that computational time can be significantly reduced. We demonstrate this with both sparsely sampled data and densely sampled data on a variety of architectures ranging from the common dual-core processor, found in many modern desktop computers, to large multi-node supercomputers. To highlight the strengths and weaknesses of the different methods we employ synthetic data sets and go on to show how the methods allow maximum likelihood based inference on the exhaustive Walker Lake data set.
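
The quadratic memory and cubic time scaling mentioned in this abstract can be made concrete with a little arithmetic. The sketch below is only an illustration of why likelihood approximations that work on smaller pieces of the data are attractive; it is not the paper's algorithm. Treating the blocks as fully independent is a simplifying assumption, and the function names `dense_cost` and `blocked_cost` are hypothetical.

```python
def dense_cost(n):
    """Relative memory and time for one exact likelihood evaluation on n points."""
    return {"memory": n ** 2, "time": n ** 3}


def blocked_cost(n, blocks):
    """Same relative costs when the n points are split into equal blocks
    that are treated independently (a simplifying assumption)."""
    m = -(-n // blocks)  # ceiling division: points per block
    return {"memory": blocks * m ** 2, "time": blocks * m ** 3}


n = 10_000
full = dense_cost(n)
split = blocked_cost(n, blocks=10)

# 10**12 / (10 * 10**9) = 100.0: ten blocks cut the cubic term a hundredfold,
# before any parallel execution of the blocks is taken into account.
speedup = full["time"] / split["time"]
```

In this simplified picture the blocks could also be evaluated on separate cores, which is where the parallelism discussed in the abstract would come in.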

Relevance:

100.00%

Abstract:

Magnetoencephalography (MEG) is the measurement of the magnetic fields generated outside the head by the brain’s electrical activity. The technique offers the promise of high temporal and spatial resolution. There is, however, an ambiguity in the inversion process of estimating what goes on inside the head from what is measured outside. Other techniques, such as functional Magnetic Resonance Imaging (fMRI), have no such inversion problems yet suffer from poorer temporal resolution. In this study we examined metrics of mutual information and linear correlation between volumetric images from the two modalities. Measures of mutual information reveal a significant, non-linear relationship between MEG and fMRI datasets across a number of frequency bands.
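
To see why mutual information can reveal a relationship that linear correlation misses, consider the minimal sketch below. It uses synthetic values rather than MEG or fMRI images, and the histogram estimator with 8 bins is an assumption chosen for illustration, not the estimator used in the study. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mutual_information(x, y, bins=8):
    """Histogram estimate of the mutual information (in nats) between x and y."""
    def digitize(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0  # guard against a constant sequence
        return [min(int((a - lo) / width), bins - 1) for a in v]

    bx, by = digitize(x), digitize(y)
    n = len(x)
    pxy = Counter(zip(bx, by))       # joint bin counts
    px, py = Counter(bx), Counter(by)  # marginal bin counts
    return sum(
        (c / n) * math.log((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in pxy.items()
    )


# Synthetic example: y depends on x deterministically but not linearly.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
y = [a * a for a in x]

r = pearson(x, y)              # close to zero: no linear trend
mi = mutual_information(x, y)  # clearly positive: strong dependence
```

Here the correlation is near zero because the dependence is symmetric about zero, while the mutual information is well above zero because knowing `x` determines `y`.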

Relevance:

100.00%

Abstract:

This paper presents a statistical comparison of regional phonetic and lexical variation in American English. Both the phonetic and lexical datasets were first subjected to separate multivariate spatial analyses in order to identify the most common dimensions of spatial clustering in these two datasets. The dimensions of phonetic and lexical variation extracted by these two analyses were then correlated with each other, after being interpolated over a shared set of reference locations, in order to measure the similarity of regional phonetic and lexical variation in American English. This analysis shows that regional phonetic and lexical variation are remarkably similar in Modern American English.

Relevance:

100.00%

Abstract:

A recent novel approach to the visualisation and analysis of datasets, and one which is particularly applicable to those of a high dimension, is discussed in the context of real applications. A feed-forward neural network is utilised to effect a topographic, structure-preserving, dimension-reducing transformation of the data, with an additional facility to incorporate different degrees of associated subjective information. The properties of this transformation are illustrated on synthetic and real datasets, including the 1992 UK Research Assessment Exercise for funding in higher education. The method is compared and contrasted to established techniques for feature extraction, and related to topographic mappings, the Sammon projection and the statistical field of multidimensional scaling.
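
The Sammon projection mentioned in this abstract is usually described through its stress measure, which penalises differences between pairwise distances in the data space and in the low-dimensional map. The sketch below computes that stress for a given pair of configurations. It is an illustration only and does not reproduce the neural-network method of the paper; the function name `sammon_stress` and the toy points are assumptions.

```python
import math
from itertools import combinations


def sammon_stress(data, latent):
    """Sammon stress between original points and their low-dimensional images.

    Zero means every pairwise distance is preserved exactly; larger values
    mean more distortion, with short original distances weighted more heavily.
    """
    weighted_error = 0.0
    total = 0.0
    for i, j in combinations(range(len(data)), 2):
        d_star = math.dist(data[i], data[j])    # distance in data space
        d_map = math.dist(latent[i], latent[j])  # distance in the map
        if d_star == 0.0:
            continue  # coincident points carry no distance information
        weighted_error += (d_star - d_map) ** 2 / d_star
        total += d_star
    return weighted_error / total


# Four points lying in the z = 0 plane of a 3-D space.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
flat = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # drops z: lossless
line = [(0.0,), (1.0,), (0.0,), (1.0,)]                   # drops y too: lossy

stress_flat = sammon_stress(data, flat)
stress_line = sammon_stress(data, line)
```

Dropping the unused coordinate preserves all distances, so `stress_flat` is zero, while collapsing onto a line distorts several distances and gives a positive stress.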

Relevance:

100.00%

Abstract:

Analysing the molecular polymorphism and interactions of DNA, RNA and proteins is of fundamental importance in biology. Predicting the functions of polymorphic molecules is important for designing more effective medicines. Analysing major histocompatibility complex (MHC) polymorphism is important for mate choice, epitope-based vaccine design, transplantation rejection and so on. Most of the existing exploratory approaches cannot analyse these datasets because of the large number of molecules with a high number of descriptors per molecule. This thesis develops novel methods for data projection in order to explore high-dimensional biological datasets by visualising them in a low-dimensional space. With increasing dimensionality, some existing data visualisation methods such as generative topographic mapping (GTM) become computationally intractable. We propose variants of these methods, in which log-transformations are used at certain steps of the expectation maximisation (EM) based parameter learning process, to make them tractable for high-dimensional datasets. We demonstrate these proposed variants on both a synthetic dataset and an electrostatic potential dataset of MHC class-I. We also propose to extend a latent trait model (LTM), suitable for visualising high-dimensional discrete data, to simultaneously estimate feature saliency as an integrated part of the parameter learning process of a visualisation model. This LTM variant not only gives better visualisation by modifying the projection map based on feature relevance, but also helps users to assess the significance of each feature. Another problem that is not much addressed in the literature is the visualisation of mixed-type data. We propose to combine GTM and LTM in a principled way, where appropriate noise models are used for each type of data, in order to visualise mixed-type data in a single plot. We call this model a generalised GTM (GGTM). 
We also propose to extend the GGTM model to estimate feature saliencies while training a visualisation model; this is called GGTM with feature saliency (GGTM-FS). We demonstrate the effectiveness of these proposed models on both synthetic and real datasets. We evaluate visualisation quality using metrics such as a distance distortion measure and rank-based measures: trustworthiness, continuity, and mean relative rank errors with respect to data space and latent space. In cases where the labels are known, we also use KL divergence and nearest-neighbour classification error to determine the separation between classes. We demonstrate the efficacy of these proposed models on both synthetic and real biological datasets, with a main focus on the MHC class-I dataset.
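
Of the rank-based quality measures listed in this abstract, trustworthiness is perhaps the easiest to sketch. It penalises points that appear among the k nearest neighbours in the latent space without being among the k nearest neighbours in the data space. The code below follows the usual definition of the measure, valid for k < n/2; the helper names and the toy data are assumptions, and the thesis may compute it differently.

```python
import math


def _ranks(points):
    """For each point i, map every other index j to its distance rank (1 = nearest)."""
    n = len(points)
    ranks = []
    for i in range(n):
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        ranks.append({j: r for r, j in enumerate(order, start=1)})
    return ranks


def trustworthiness(data, latent, k):
    """Trustworthiness of a projection for neighbourhood size k (requires k < n/2)."""
    n = len(data)
    data_rank = _ranks(data)
    latent_rank = _ranks(latent)
    penalty = 0
    for i in range(n):
        for j, r_latent in latent_rank[i].items():
            r_data = data_rank[i][j]
            # j is a neighbour in the map but not in the original space
            if r_latent <= k and r_data > k:
                penalty += r_data - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


# Toy data: points on a line in 2-D, spaced so that no distances tie.
ts = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
data = [(t, 2.0 * t) for t in ts]
good = [(t,) for t in ts]  # keeps the neighbour order
bad = [(t,) for t in [8.0, 128.0, 1.0, 64.0, 2.0, 32.0, 4.0, 16.0]]  # scrambles it

t_good = trustworthiness(data, good, k=2)
t_bad = trustworthiness(data, bad, k=2)
```

A projection that keeps every neighbourhood intact scores 1.0, and scrambling the latent coordinates lowers the score. Continuity is computed in the same way with the roles of the data space and latent space swapped.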

Relevance:

100.00%

Abstract:

The UK government aims to achieve an 80% reduction in CO2 emissions by 2050, which requires collective efforts across all UK industry sectors. In particular, the housing sector has a large potential to contribute to this aim because it alone accounts for 27% of total UK CO2 emissions and, furthermore, 87% of the housing responsible for that 27% will still be standing in 2050. Therefore, it is essential to improve the energy efficiency of the existing housing stock, which was built to low energy efficiency standards. To do this, a whole house needs to be refurbished in a sustainable way by considering the lifetime financial and environmental impacts of the refurbished house. However, the current refurbishment process struggles to generate a financially and environmentally affordable refurbishment solution due to the highly fragmented nature of refurbishment practice and a lack of knowledge and skills about whole-house refurbishment in the construction industry. In order to generate an affordable refurbishment solution, diverse information regarding the costs and environmental impacts of refurbishment measures and materials should be collected and integrated in the right sequence throughout the refurbishment project life cycle among key project stakeholders. Consequently, researchers are increasingly studying ways of utilizing Building Information Modelling (BIM) to tackle current problems in the construction industry, because BIM can support construction professionals in managing construction projects collaboratively by integrating diverse information, and in determining the best refurbishment solution among various alternatives by calculating the life cycle costs and lifetime CO2 performance of a refurbishment solution. Despite the capability of BIM, its adoption rate in the housing sector is low, at 25%, and the use of BIM for housing refurbishment projects has rarely been studied. 
Therefore, this research aims to develop a BIM framework to formulate a financially and environmentally affordable whole-house refurbishment solution based on the Life Cycle Costing (LCC) and Life Cycle Assessment (LCA) methods simultaneously. To achieve this aim, a BIM feasibility study was conducted as a pilot study to examine whether BIM is suitable for housing refurbishment, and a BIM framework was developed based on grounded theory because there was no precedent research. After its development, the framework was examined through a hypothetical case study using BIM input data collected from a questionnaire survey on homeowners’ preferences for housing refurbishment. Finally, the BIM framework was validated by academics and professionals, who were provided with the framework and a refurbishment solution formulated through it based on the LCC and LCA studies. As a result, BIM was identified as suitable for housing refurbishment as a management tool, and the development of the BIM framework was found to be timely. The BIM framework, with seven project stages, was developed to formulate an affordable refurbishment solution. Through the case study, the Building Regulation was identified as the most affordable energy efficiency standard, rendering the best LCC and LCA results when applied to a whole-house refurbishment solution. In addition, the Fabric Energy Efficiency Standard (FEES) is recommended when customers are willing to adopt a high energy standard, and a maximum of 60% of CO2 emissions can be reduced through whole-house fabric refurbishment with the FEES. Furthermore, limitations and challenges to fully utilizing the BIM framework for housing refurbishment were revealed, such as a lack of BIM objects with proper cost and environmental information, limited interoperability between different BIM software and limited LCC and LCA dataset information in BIM systems. 
Finally, the BIM framework was validated as suitable for housing refurbishment projects, and reviewers commented that the framework could be more practical if a specific BIM library for housing refurbishment with proper LCC and LCA datasets were developed. This research is expected to provide a systematic way of formulating a refurbishment solution using BIM, and to become a basis for further research on BIM for the housing sector to resolve the current limitations and challenges. Future research should enhance the BIM framework by developing a more detailed process map and should develop BIM objects with proper LCC and LCA information.

Relevance:

100.00%

Abstract:

In this paper we present LEAPS, a Semantic Web and Linked Data framework for searching and visualising datasets from the domain of algal biomass. LEAPS provides tailored interfaces for stakeholders in this domain to explore algal biomass datasets via REST services and a SPARQL endpoint. The rich suite of datasets includes data about potential algal biomass cultivation sites, sources of CO2, the pipelines connecting the cultivation sites to the CO2 sources and a subset of the biological taxonomy of algae derived from the world's largest online information source on algae.

Relevance:

100.00%

Abstract:

Heterogeneous datasets arise naturally in most applications due to the use of a variety of sensors and measuring platforms. Such datasets can be heterogeneous in terms of the error characteristics and sensor models. Treating such data is most naturally accomplished using a Bayesian or model-based geostatistical approach; however, such methods generally scale rather badly with the size of dataset, and require computationally expensive Monte Carlo based inference. Recently within the machine learning and spatial statistics communities many papers have explored the potential of reduced rank representations of the covariance matrix, often referred to as projected or fixed rank approaches. In such methods the covariance function of the posterior process is represented by a reduced rank approximation which is chosen such that there is minimal information loss. In this paper a sequential Bayesian framework for inference in such projected processes is presented. The observations are considered one at a time which avoids the need for high dimensional integrals typically required in a Bayesian approach. A C++ library, gptk, which is part of the INTAMAP web service, is introduced which implements projected, sequential estimation and adds several novel features. In particular the library includes the ability to use a generic observation operator, or sensor model, to permit data fusion. It is also possible to cope with a range of observation error characteristics, including non-Gaussian observation errors. Inference for the covariance parameters is explored, including the impact of the projected process approximation on likelihood profiles. We illustrate the projected sequential method in application to synthetic and real datasets. Limitations and extensions are discussed. © 2010 Elsevier Ltd.

Relevance:

100.00%

Abstract:

Presents a simulation study of the costing of police custody operations at a UK police force. The custody operation incorporates the arrest, booking-in, interview, detention and court appearance activities. The Activity Based Costing (ABC) approach is used as a framework to show how costs are generated by the three “drivers” of cost, activity and resource. These relate to the design efficiency of the process, the timing and mix of demand on the process and the cost of resources used to undertake the process respectively. The use of discrete-event simulation allows the incorporation of dynamic (time-dependent) and stochastic (variability) elements in the cost analysis. This enables both the amount and timing of the use of capacity and the generation of cost to be established. The concept of committed and flexible resources directs management decisions to the redeployment of unused capacity or alternatively the identification of additional capacity requirements.

Relevance:

100.00%

Abstract:

Practitioners assess the performance of entities in increasingly large and complicated datasets. If non-parametric models, such as Data Envelopment Analysis, were ever considered simple push-button technologies, this is impossible when many variables are available or when data have to be compiled from several sources. This paper introduces the 'COOPER-framework', a comprehensive model for carrying out non-parametric projects. The framework consists of six interrelated phases: Concepts and objectives, On structuring data, Operational models, Performance comparison model, Evaluation, and Result and deployment. Each phase describes necessary steps that a researcher should examine for a well-defined and repeatable analysis. The COOPER-framework provides the novice analyst with guidance, structure and advice for a sound non-parametric analysis. The more experienced analyst benefits from a checklist so that important issues are not forgotten. In addition, by using a standardized framework, non-parametric assessments will be more reliable, more repeatable, more manageable, faster and less costly. © 2010 Elsevier B.V. All rights reserved.

Relevance:

100.00%

Abstract:

Heterogeneous and incomplete datasets are common in many real-world visualisation applications. The probabilistic nature of the Generative Topographic Mapping (GTM), which was originally developed for complete continuous data, can be extended to model heterogeneous (i.e. containing both continuous and discrete values) and missing data. This paper describes and assesses the resulting model on both synthetic and real-world heterogeneous data with missing values.

Relevance:

100.00%

Abstract:

Most machine-learning algorithms are designed for datasets with features of a single type whereas very little attention has been given to datasets with mixed-type features. We recently proposed a model to handle mixed types with a probabilistic latent variable formalism. This proposed model describes the data by type-specific distributions that are conditionally independent given the latent space and is called generalised generative topographic mapping (GGTM). It has often been observed that visualisations of high-dimensional datasets can be poor in the presence of noisy features. In this paper we therefore propose to extend the GGTM to estimate feature saliency values (GGTMFS) as an integrated part of the parameter learning process with an expectation-maximisation (EM) algorithm. The efficacy of the proposed GGTMFS model is demonstrated both for synthetic and real datasets.

Relevance:

100.00%

Abstract:

The focus of this thesis is the extension of topographic visualisation mappings to allow for the incorporation of uncertainty. Few visualisation algorithms in the literature are capable of mapping uncertain data, and fewer still are able to represent observation uncertainties in visualisations. As such, modifications are made to NeuroScale, Locally Linear Embedding, Isomap and Laplacian Eigenmaps to incorporate uncertainty in the observation and visualisation spaces. The proposed mappings are called Normally-distributed NeuroScale (N-NS), T-distributed NeuroScale (T-NS), Probabilistic LLE (PLLE), Probabilistic Isomap (PIso) and Probabilistic Weighted Neighbourhood Mapping (PWNM). These algorithms generate a probabilistic visualisation space with each latent visualised point transformed to a multivariate Gaussian or T-distribution, using a feed-forward RBF network. Two types of uncertainty are then characterised, dependent on the data and the mapping procedure. Data-dependent uncertainty is the inherent observation uncertainty, whereas mapping uncertainty is defined by the Fisher Information of a visualised distribution. The latter indicates how well the data has been interpolated, offering a level of ‘surprise’ for each observation. These new probabilistic mappings are tested on three datasets of vectorial observations and three datasets of real-world time series observations for anomaly detection. In order to visualise the time series data, a method for analysing observed signals and noise distributions, Residual Modelling, is introduced. The performance of the new algorithms on the tested datasets is compared qualitatively with the latent space generated by the Gaussian Process Latent Variable Model (GPLVM). A quantitative comparison using existing evaluation measures from the literature allows the performance of each mapping function to be compared. Finally, the mapping uncertainty measure is combined with NeuroScale to build a deep learning classifier, the Cascading RBF. 
This new structure is tested on the MNIST dataset, achieving world record performance whilst avoiding the flaws seen in other Deep Learning Machines.

Relevance:

100.00%

Abstract:

Tensor analysis plays an important role in modern image and vision computing problems. Most of the existing tensor analysis approaches are based on the Frobenius norm, which makes them sensitive to outliers. In this paper, we propose L1-norm-based tensor analysis (TPCA-L1), which is robust to outliers. Experimental results on face and other datasets demonstrate the advantages of the proposed approach. © 2006 IEEE.

Relevance:

40.00%

Abstract:

Data Envelopment Analysis (DEA) is one of the most widely used methods for measuring the efficiency and productivity of Decision Making Units (DMUs). DEA for a large dataset with many inputs/outputs would require huge computer resources in terms of memory and CPU time. This paper proposes a neural network back-propagation Data Envelopment Analysis to address this problem for the very large scale datasets now emerging in practice. Neural network requirements for computer memory and CPU time are far lower than those of conventional DEA methods, so the approach can be a useful tool for measuring efficiency in large datasets. Finally, the back-propagation DEA algorithm is applied to five large datasets and compared with the results obtained by conventional DEA.