17 results for Missing-data
in Aston University Research Archive
Abstract:
Exploratory analysis of data in all sciences seeks to find common patterns to gain insights into the structure and distribution of the data. Typically, visualisation methods such as principal components analysis are used, but these methods cannot easily deal with missing data, nor can they capture non-linear structure in the data. One approach to discovering complex, non-linear structure in the data is through the use of linked plots, or brushing, while ignoring the missing data. In this technical report we discuss a complementary approach based on a non-linear probabilistic model. The generative topographic mapping enables the visualisation of the effects of very many variables on a single plot, which can incorporate far more structure than a two-dimensional principal components plot while dealing with missing data at the same time. We show that the generative topographic mapping provides an optimal method to explore the data while replacing missing values in a dataset, particularly where a large proportion of the data is missing.
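For readers unfamiliar with the model, a minimal sketch of the GTM construction follows: a regular grid of latent points is mapped through a radial basis function (RBF) expansion into data space, where each mapped grid point becomes the centre of a shared-variance Gaussian mixture component. All names and parameter choices below are illustrative, not taken from the report.

```python
import numpy as np

def latent_grid(n_side):
    """Regular 2-D grid of K = n_side**2 latent points in [-1, 1]^2."""
    g = np.linspace(-1.0, 1.0, n_side)
    zx, zy = np.meshgrid(g, g)
    return np.column_stack([zx.ravel(), zy.ravel()])        # (K, 2)

def rbf_basis(Z, centres, width):
    """Gaussian RBF features of the latent points, plus a bias column."""
    d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2.0 * width ** 2))
    return np.hstack([Phi, np.ones((len(Z), 1))])           # (K, M+1)

def responsibilities(X, mu, beta):
    """E-step: posterior p(k | x_n) under the shared-variance mixture
    whose centres mu = Phi @ W are the images of the latent grid."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # (N, K)
    log_r = -0.5 * beta * d2
    log_r -= log_r.max(axis=1, keepdims=True)               # stabilise exp
    R = np.exp(log_r)
    return R / R.sum(axis=1, keepdims=True)

def posterior_mean_projection(R, Z):
    """Each sample is plotted at its posterior mean over the latent grid."""
    return R @ Z                                            # (N, 2)
```

Fitting alternates this E-step with a regularised least-squares M-step for the weight matrix W; the single summary plot the abstract describes is the output of `posterior_mean_projection`.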
Abstract:
Exploratory analysis of data seeks to find common patterns to gain insights into the structure and distribution of the data. In geochemistry it is a valuable means of gaining insight into the complicated processes making up a petroleum system. Typically, linear visualisation methods such as principal components analysis, linked plots, or brushing are used. These methods cannot be employed directly when dealing with missing data, and while they struggle to capture global non-linear structure in the data, they can do so locally. This thesis discusses a complementary approach based on a non-linear probabilistic model. The generative topographic mapping (GTM) enables the visualisation of the effects of very many variables on a single plot, which can incorporate more structure than a two-dimensional principal components plot. The model can deal with uncertainty and missing data, and allows the non-linear structure in the data to be explored. In this thesis a novel approach to initialising the GTM with arbitrary projections is developed. This makes it possible to combine GTM with algorithms such as Isomap and to fit complex non-linear structures such as the Swiss roll. Another novel extension is the incorporation of prior knowledge about the structure of the covariance matrix. This extension greatly enhances the modelling capabilities of the algorithm, resulting in a better fit to the data and better imputation of missing values. Additionally, an extensive benchmark study of the missing-data imputation capabilities of GTM is performed. Further, a novel approach based on missing data is introduced to benchmark the fit of probabilistic visualisation algorithms on unlabelled data. Finally, the work is complemented by evaluating the algorithms on real-life datasets from geochemical projects.
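The initialisation from an arbitrary projection mentioned above could plausibly be sketched as follows: given any 2-D embedding of the data (e.g. Isomap coordinates rescaled to the range of the latent grid), each latent grid point receives a data-space target by kernel regression, and the weight matrix is fitted by regularised least squares. This is a hypothetical reconstruction of the idea, not the thesis' exact procedure; all names are illustrative.

```python
import numpy as np

def init_gtm_from_embedding(X, E, Z, Phi, lam=1e-3):
    """Hypothetical sketch: initialise the GTM weights W from an arbitrary
    2-D embedding E of the data X (e.g. Isomap output rescaled to the
    range of the latent grid Z). Each grid point gets a data-space target
    by kernel regression from embedding space, and W solves the
    regularised least-squares problem Phi @ W ~= Y."""
    h = np.linalg.norm(Z[1] - Z[0])                      # ~ grid spacing
    d2 = ((Z[:, None, :] - E[None, :, :]) ** 2).sum(-1)  # (K, N)
    Kw = np.exp(-d2 / (2.0 * h ** 2))
    Kw /= Kw.sum(axis=1, keepdims=True)                  # weights over data
    Y = Kw @ X                                           # (K, D) targets
    W = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)
    return W
```

With a PCA embedding this reduces to the standard GTM initialisation; substituting Isomap coordinates is what would let the map unfold structures like the Swiss roll.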
Abstract:
Exploratory analysis of petroleum geochemical data seeks to find common patterns to help distinguish between different source rocks, oils and gases, and to explain their source, maturity and any intra-reservoir alteration. However, at the outset, one is typically faced with (a) a large matrix of samples, each with a range of molecular and isotopic properties, (b) a spatially and temporally unrepresentative sampling pattern, (c) noisy data and (d) often, a large number of missing values. This inhibits analysis using conventional statistical methods. Typically, visualisation methods such as principal components analysis are used, but these methods cannot easily deal with missing data, nor can they capture non-linear structure in the data. One approach to discovering complex, non-linear structure in the data is through the use of linked plots, or brushing, while ignoring the missing data. In this paper we introduce a complementary approach based on a non-linear probabilistic model. Generative topographic mapping enables the visualisation of the effects of very many variables on a single plot, while also dealing with missing data. We show how generative topographic mapping also provides an optimal method with which to replace missing values in two geochemical datasets, particularly where a large proportion of the data is missing.
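The imputation step that this abstract and the report above rely on can be written compactly. Assuming spherical mixture components with shared inverse variance β (the standard GTM noise model), responsibilities are computed from the observed dimensions only, and the missing block of each sample is replaced by a responsibility-weighted average of the corresponding centre coordinates:

```latex
r_{nk} = \frac{\exp\!\left(-\tfrac{\beta}{2}\,\lVert x_n^{o} - \mu_k^{o}\rVert^{2}\right)}
              {\sum_{k'} \exp\!\left(-\tfrac{\beta}{2}\,\lVert x_n^{o} - \mu_{k'}^{o}\rVert^{2}\right)},
\qquad
\hat{x}_n^{m} = \sum_{k} r_{nk}\,\mu_k^{m}
```

Superscripts o and m select the observed and missing dimensions of sample x_n; for spherical components the conditional mean of the missing block of component k is simply μ_k^m, which is what makes the imputation a weighted average of centres.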
Abstract:
One of the main challenges of classifying clinical data is determining how to handle missing features. Most research favours imputing missing values or discarding records that include missing data, both of which can degrade accuracy when missing values exceed a certain level. In this research we propose a methodology for handling datasets with a large percentage of missing values and with high variability in which particular data are missing. Feature selection is performed by picking variables sequentially in order of maximum correlation with the dependent variable and minimum correlation with the variables already selected. Classification models are generated individually for each test case, based on its particular feature set and the matching data values available in the training population. The method was applied to anonymised mental-health data from real patients, where the task was to predict the suicide risk judgement clinicians would give for each patient's data, with eleven possible outcome classes: zero to ten, representing no risk to maximum risk. The results compare favourably with alternative methods and have the advantage of ensuring that explanations of risk are based only on the data given, not on imputed data. This is important for clinical decision support systems that use human expertise for modelling and for explaining predictions.
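The sequential selection rule described above resembles a greedy minimum-redundancy criterion and might be sketched as below. The scoring function and its weighting are assumptions, and for brevity the sketch computes correlations on complete columns rather than on the pairwise-available cases the paper's incomplete data would require.

```python
import numpy as np

def select_features(X, y, n_select, redundancy_weight=1.0):
    """Pick variables in order of high |corr(x_j, y)| and low mean
    |corr| with the variables already selected (a greedy, mRMR-style
    criterion; the exact trade-off used in the paper is not given)."""
    n_features = X.shape[1]
    corr_y = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(n_features)])
    selected = [int(np.argmax(corr_y))]          # start from the best single variable
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = corr_y[j] - redundancy_weight * redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

Per the abstract, this selection would then be re-run per test case against whichever of its features are actually present, so each case gets its own model built from matching training records.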
Abstract:
Visualising data for exploratory analysis is a major challenge in many applications. Visualisation allows scientists to gain insight into the structure and distribution of the data, for example by finding common patterns and relationships between samples as well as between variables. Typically, visualisation methods such as principal component analysis and multi-dimensional scaling are employed. These methods are favoured because of their simplicity, but they cannot cope with missing data, and it is difficult to incorporate prior knowledge about properties of the variable space into the analysis; this is particularly important for the high-dimensional, sparse datasets typical in geochemistry. In this paper we show how to utilise a block-structured correlation matrix using a modification of a well-known non-linear probabilistic visualisation model, the Generative Topographic Mapping (GTM), which can cope with missing data. The block structure supports direct modelling of strongly correlated variables. We show that by including prior structural information it is possible to improve both the data visualisation and the model fit. These benefits are demonstrated on artificial data as well as on a real geochemical dataset used for oil exploration, where the proposed modifications improved the missing-data imputation results by 3 to 13%.
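One plausible reading of the block-structured prior is sketched below: variables known to be strongly correlated share a common within-block correlation, blocks are mutually independent, and the resulting matrix replaces the spherical noise term in the GTM likelihood. The exact parameterisation in the paper may differ; the group indices and rho value here are illustrative.

```python
import numpy as np

def block_correlation(groups, rho, n_vars):
    """Block-structured correlation prior: variables in the same group
    share a common within-block correlation rho; distinct blocks are
    treated as independent (off-block entries stay zero)."""
    C = np.eye(n_vars)
    for g in groups:                      # e.g. [[0, 1, 2], [3, 4]]
        for i in g:
            for j in g:
                if i != j:
                    C[i, j] = rho
    return C

# In the GTM likelihood, the spherical term beta * ||x - mu_k||^2 would
# then be replaced by the Mahalanobis distance
# (x - mu_k)^T inv(Sigma) (x - mu_k), with Sigma proportional to C.
Sigma = block_correlation([[0, 1, 2], [3, 4]], rho=0.8, n_vars=5)
precision = np.linalg.inv(Sigma)          # used in the modified E-step
```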
Abstract:
Heterogeneous and incomplete datasets are common in many real-world visualisation applications. The probabilistic nature of the Generative Topographic Mapping (GTM), which was originally developed for complete continuous data, allows it to be extended to model heterogeneous (i.e. containing both continuous and discrete values) and missing data. This paper describes the resulting model and assesses it on both synthetic and real-world heterogeneous data with missing values.
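One way such an extension can work, sketched here under assumptions rather than taken from the paper, is to give each dimension its own likelihood term (Gaussian for continuous values, Bernoulli for binary ones) and simply drop missing dimensions from the per-component log-likelihood:

```python
import numpy as np

def log_lik_mixed(x, mu_cont, p_disc, beta, cont_idx, disc_idx):
    """Per-component log-likelihood of one sample with mixed-type
    features. NaN marks a missing value; missing dimensions simply
    drop out of the sum, which is what lets the EM fit proceed on
    incomplete records."""
    ll = 0.0
    for d in cont_idx:                       # Gaussian dimensions
        if not np.isnan(x[d]):
            ll += -0.5 * beta * (x[d] - mu_cont[d]) ** 2
    for d in disc_idx:                       # Bernoulli dimensions, x[d] in {0, 1}
        if not np.isnan(x[d]):
            ll += x[d] * np.log(p_disc[d]) + (1 - x[d]) * np.log(1 - p_disc[d])
    return ll
```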
Abstract:
This research develops a low-cost remote sensing system for use in agricultural applications. The important features of the system are that it monitors the near infrared and that it incorporates position- and attitude-measuring equipment, allowing geo-rectified images to be produced without the use of ground control points. The equipment is designed to be hand-held and hence requires no structural modification to the aircraft. The portable remote sensing system consists of an accelerometer-based inertial measurement unit (IMU), a low-cost GPS device and a small-format false-colour composite digital camera. The total cost of producing such a system is below GBP 3000, far cheaper than equivalent existing systems. The design of the portable remote sensing device has eliminated boresight misalignment errors from the direct geo-referencing process. A new processing technique has been introduced for the data obtained from these low-cost devices, and it is found that using this technique the image can be matched (overlaid) onto Ordnance Survey MasterMap at an accuracy compatible with precision agriculture requirements. The direct geo-referencing has also been improved by introducing an algorithm capable of correcting oblique images directly. Because this algorithm alters pixel values, it is advisable to perform image analysis before geo-rectification. The drawback of this research is that the low-cost GPS device suffered checksum errors, which resulted in missing data. The Wide Area Augmentation System (WAAS) correction could not be employed because the satellites could not be locked onto whilst flying. The best GPS data were obtained from the Garmin eTrex instruments (15 m kinematic and 2 m static), which have a high-sensitivity receiver with good lock-on capability. The limitation of this GPS device is its inability to receive the P-code signal effectively, which is needed to obtain the best accuracy when undertaking differential GPS processing. Pairing the L1 carrier phase with the received C/A-code pseudorange, in order to determine the image coordinates by the differential technique, is still under investigation. To improve the position accuracy, it is recommended that a GPS base station be established near the survey area, instead of using a permanent GPS base station established by the Ordnance Survey.
Abstract:
In this chapter we present the relevant mathematical background for addressing two well-defined signal and image processing problems: structured noise filtering and interpolation of missing data. The former is addressed by recourse to oblique-projection-based techniques, whilst the latter, which can be considered equivalent to impulsive noise filtering, is tackled by appropriate interpolation methods.
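As a reminder of the standard construction (the chapter's exact setting may differ), the oblique projector that suppresses structured noise lying in range(S) while passing a signal lying in range(H) untouched is:

```latex
% Model: x = H\theta + S\phi, with the signal in range(H), the
% structured noise in range(S), and range(H) \cap range(S) = \{0\}.
P_S^{\perp} = I - S\,(S^{\mathsf{H}} S)^{-1} S^{\mathsf{H}}, \qquad
E_{HS} = H\,\big(H^{\mathsf{H}} P_S^{\perp} H\big)^{-1} H^{\mathsf{H}} P_S^{\perp}
% Then E_{HS} H = H and E_{HS} S = 0, so E_{HS}\, x = H\theta:
% the structured noise component is annihilated, the signal preserved.
```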
Abstract:
Background - Specialist Lifestyle Management (SLiM) is a structured patient education and self-management group weight management programme. Sessions are run monthly over a 6-month period, providing a less intensive, long-term approach. The groups are patient-centred, incorporating educational, motivational, behavioural and cognitive elements. The theoretical background, programme structure and preliminary results of SLiM are presented. Subjects/methods - The study was a pragmatic service evaluation of obese patients with a body mass index (BMI) ≥35 kg/m2 with comorbidity or ≥40 kg/m2 without comorbidity, referred to a specialist weight management service in the West Midlands, UK. 828 patients were enrolled in SLiM over a 48-month period. Trained facilitators delivered the programme. Preliminary anonymised data were analysed using the intention-to-treat principle. The primary outcome measure was weight loss at 3 and 6 months, with comparisons between completers and non-completers performed. Last observation carried forward (LOCF) was used for missing data. Results - Of the 828 patients enrolled, 464 (56%) completed the programme. The mean baseline weight was 135 kg (BMI=49.1 kg/m2), with 87.2% of patients having a BMI≥40 kg/m2 and 12.4% a BMI≥60 kg/m2. The mean weight change of all patients enrolled was −4.1 kg (95% CI −3.6 to −4.6 kg, p=0.0001) at the end of SLiM, with completers (n=464) achieving −5.5 kg (95% CI −4.2 to −6.2 kg, p=0.0001) and non-completers achieving −2.3 kg (p=0.0001). The majority (78.6%) of those who attended the 6-month programme achieved weight loss, with 32.3% achieving a ≥5% weight loss. Conclusions - The SLiM programme is an effective group intervention for the management of severe and complex obesity.
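For readers unfamiliar with the imputation rule, last observation carried forward simply propagates each patient's most recent measurement into later gaps. A minimal sketch with hypothetical data follows; the column layout is illustrative, not taken from the study.

```python
import pandas as pd

# Hypothetical layout: one row per patient, one column per visit.
weights = pd.DataFrame({
    "month_0": [135.0, 128.0],
    "month_3": [131.5, None],      # a missed visit
    "month_6": [None,  None],      # dropped out before month 6
})

# Last observation carried forward: fill each gap with the most
# recent preceding measurement for that patient.
locf = weights.ffill(axis=1)
```

Under LOCF a non-completer's last recorded weight stands in for the missing 6-month value, which is what allows the intention-to-treat analysis above to include all 828 patients.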
Abstract:
Concurrent coding is an encoding scheme with 'holographic'-type properties that are shown here to be robust against a significant amount of noise and signal loss. This single encoding scheme is able to correct for random errors and burst errors simultaneously, without relying on cyclic codes. A simple and practical scheme has been tested that displays perfect decoding when the signal-to-noise ratio is of the order of −18 dB. The same scheme also displays perfect reconstruction when a contiguous block of 40% of the transmission is missing. In addition, this scheme is 50% more efficient in terms of transmitted power requirements than equivalent cyclic codes. A simple model is presented that describes the process of decoding, determines the expected computational load, and identifies the critical levels of noise and missing data at which false messages begin to be generated.
Abstract:
We propose a novel template matching approach for the discrimination of handwritten and machine-printed text. We first pre-process the scanned document images by performing denoising, circle/line exclusion and word-block level segmentation. We then align and match characters in a flexibly sized gallery with the segmented regions, using parallelised normalised cross-correlation. Experimental results on the Pattern Recognition & Image Analysis Research Lab-Natural History Museum (PRImA-NHM) dataset show the algorithm to be remarkably robust in classifying cluttered, occluded and noisy samples, as well as samples with a significant amount of missing data. The algorithm, which achieves an 84.0% classification rate with a 0.16 false positive rate on the dataset, requires no training samples and compares favourably with training-based approaches evaluated on the same benchmark.
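Normalised cross-correlation, the matching score at the core of the method, can be sketched as follows. The gallery structure and the equal-shape assumption are illustrative simplifications, not details from the paper.

```python
import numpy as np

def ncc(patch, template):
    """Normalised cross-correlation between an image patch and a
    template of the same shape; mean-centring and normalising make
    the score lie in [-1, 1] and makes it invariant to brightness
    and contrast changes."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

def best_match(region, gallery):
    """Score a segmented region against a gallery of character
    templates (each a dict with an 'image' array resized to the
    region's shape); the best-scoring template's class would drive
    the handwritten vs machine-printed decision."""
    return max(gallery, key=lambda g: ncc(region, g["image"]))
```

Because every pixel contributes independently to the sums, occluded or missing regions degrade the score gracefully rather than catastrophically, which is consistent with the robustness the abstract reports.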
Abstract:
The K-means algorithm is one of the most popular clustering algorithms in current use, as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a posteriori Dirichlet process mixtures), is statistically rigorous, being based on nonparametric Bayesian Dirichlet process mixture modelling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example binary, count or ordinal data. It can also efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared with K-means, with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross-validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
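The full MAP-DP algorithm uses conjugate exponential-family models; as a deliberately crude illustration of the nonparametric idea only (not the authors' algorithm), the sketch below assigns each point to the cluster maximising a Chinese-restaurant-process-weighted spherical Gaussian score, with the option to open a new cluster, so that K emerges from the data. All hyperparameters are assumptions.

```python
import numpy as np

def map_dp_like(X, alpha=1.0, sigma2=1.0, sigma2_new=10.0, n_iter=20):
    """Crude nonparametric clustering sketch (NOT the published MAP-DP):
    each point joins the existing cluster k maximising
    log N_k + spherical Gaussian log-density at the cluster mean, or
    opens a new cluster with CRP weight alpha and a broad density
    centred on the data mean (a stand-in for the prior predictive)."""
    N, D = X.shape
    z = np.zeros(N, dtype=int)          # all points start in one cluster
    xbar = X.mean(axis=0)
    for _ in range(n_iter):
        for n in range(N):
            z[n] = -1                   # hold point n out of the counts
            labels = [k for k in np.unique(z) if k >= 0]
            scores = []
            for k in labels:
                members = X[z == k]
                mu = members.mean(axis=0)
                scores.append(np.log(len(members))
                              - 0.5 * D * np.log(2 * np.pi * sigma2)
                              - 0.5 * ((X[n] - mu) ** 2).sum() / sigma2)
            scores.append(np.log(alpha)
                          - 0.5 * D * np.log(2 * np.pi * sigma2_new)
                          - 0.5 * ((X[n] - xbar) ** 2).sum() / sigma2_new)
            best = int(np.argmax(scores))
            z[n] = labels[best] if best < len(labels) else max(labels, default=-1) + 1
    return z
```

Note how the update differs from a K-means assignment only by the log N_k occupancy term and the new-cluster option; it is these two terms that let the number of clusters adapt to the data.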
Abstract:
We analyse how the Generative Topographic Mapping (GTM) can be modified to cope with missing values in the training data. Our approach is based on an Expectation-Maximisation (EM) method which estimates the parameters of the mixture components and at the same time deals with the missing values. We incorporate this algorithm into a hierarchical GTM. We verify the method on a toy dataset (using a single GTM) and a realistic dataset (using a hierarchical GTM). The results show that our algorithm can help construct informative visualisation plots, even when some of the training points are corrupted by missing values.
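A minimal sketch of one such EM step for the GTM's shared-variance mixture follows, assuming spherical components and NaN-marked gaps; a full GTM would additionally regress the updated centres onto its latent basis functions rather than keep them free.

```python
import numpy as np

def em_missing_step(X, mu, beta):
    """One EM step for a shared-variance Gaussian mixture (the GTM
    noise model) when X contains NaNs. The E-step uses only the
    observed dimensions of each point; in the M-step each missing
    entry is replaced by its conditional expectation, which for
    spherical components is just the corresponding component mean."""
    obs = ~np.isnan(X)                                   # (N, D) mask
    X0 = np.where(obs, X, 0.0)
    miss = (~obs).astype(float)

    # E-step: squared distances accumulated over observed dims only.
    d2 = (((X0[:, None, :] - mu[None, :, :]) ** 2) * obs[:, None, :]).sum(-1)
    log_r = -0.5 * beta * d2
    log_r -= log_r.max(axis=1, keepdims=True)            # stabilise exp
    R = np.exp(log_r)
    R /= R.sum(axis=1, keepdims=True)                    # (N, K)

    # M-step: responsibility-weighted means of the expected completed
    # data, E[x_nd | k] = x_nd if observed, else mu_kd.
    Nk = R.sum(axis=0)                                   # (K,)
    mu_new = (R.T @ X0 + (R.T @ miss) * mu) / Nk[:, None]
    return R, mu_new
```

Dropping missing dimensions leaves the responsibilities well defined because, for a fixed point, the omitted normalising terms are constant across components; this is what lets the same pass over the data both fit the mixture and fill the gaps.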
Abstract:
This paper contrasts the effects of trade, inward FDI and technological development on the demand for skilled and unskilled workers in the UK. By focusing on industry-level panel data on smaller firms, the paper also contrasts these effects with those generated by large-scale domestic investment. The analysis is placed within the broader context of shifts in British industrial policy, which has moved significantly from sectoral to horizontal measures and towards stressing the importance of SMEs, clusters and new technology, all delivered at the regional scale. This, however, is contrasted with continuing elements of British and EU regional policy which have emphasised the attraction of inward investment in order to alleviate regional unemployment. The results suggest that such policies are not naturally compatible: while both trade and FDI benefit skilled workers, they have adverse effects on the demand for unskilled labour in the UK. At the very least this suggests the need for a range of policies to tackle various targets (including, in this case, unemployment and social inclusion) and the need to integrate these into a coherent industrial strategy at various levels of governance, whether regional and/or national. This has important implications for the form of any 'new' industrial policy.