39 resultados para High-Dimensional Space Geometrical Informatics (HDSGI)
Resumo:
Failure to detect patients at risk of attempting suicide can result in tragic consequences. Identifying risks earlier and more accurately helps prevent serious incidents occurring and is the objective of the GRiST clinical decision support system (CDSS). One of the problems it faces is high variability in the type and quantity of data submitted for patients, who are assessed in multiple contexts along the care pathway. Although GRiST identifies up to 138 patient cues to collect, only about half of them are relevant for any one patient and their roles may not be for risk evaluation but more for risk management. This paper explores the data collection behaviour of clinicians using GRiST to see whether it can elucidate which variables are important for risk evaluations and when. The GRiST CDSS is based on a cognitive model of human expertise manifested by a sophisticated hierarchical knowledge structure or tree. This structure is used by the GRiST interface to provide top-down controlled access to the patient data. Our research explores relationships between the answers given to these higher-level 'branch' questions to see whether they can help direct assessors to the most important data, depending on the patient profile and assessment context. The outcome is a model for dynamic data collection driven by the knowledge hierarchy. It has potential for improving other clinical decision support systems operating in domains with high dimensional data that are only partially collected and in a variety of combinations.
Resumo:
Analysing the molecular polymorphism and interactions of DNA, RNA and proteins is of fundamental importance in biology. Predicting functions of polymorphic molecules is important in order to design more effective medicines. Analysing major histocompatibility complex (MHC) polymorphism is important for mate choice, epitope-based vaccine design and transplantation rejection etc. Most of the existing exploratory approaches cannot analyse these datasets because of the large number of molecules with a high number of descriptors per molecule. This thesis develops novel methods for data projection in order to explore high dimensional biological dataset by visualising them in a low-dimensional space. With increasing dimensionality, some existing data visualisation methods such as generative topographic mapping (GTM) become computationally intractable. We propose variants of these methods, where we use log-transformations at certain steps of expectation maximisation (EM) based parameter learning process, to make them tractable for high-dimensional datasets. We demonstrate these proposed variants both for synthetic and electrostatic potential dataset of MHC class-I. We also propose to extend a latent trait model (LTM), suitable for visualising high dimensional discrete data, to simultaneously estimate feature saliency as an integrated part of the parameter learning process of a visualisation model. This LTM variant not only gives better visualisation by modifying the project map based on feature relevance, but also helps users to assess the significance of each feature. Another problem which is not addressed much in the literature is the visualisation of mixed-type data. We propose to combine GTM and LTM in a principled way where appropriate noise models are used for each type of data in order to visualise mixed-type data in a single plot. We call this model a generalised GTM (GGTM). We also propose to extend GGTM model to estimate feature saliencies while training a visualisation model and this is called GGTM with feature saliency (GGTM-FS). We demonstrate effectiveness of these proposed models both for synthetic and real datasets. We evaluate visualisation quality using quality metrics such as distance distortion measure and rank based measures: trustworthiness, continuity, mean relative rank errors with respect to data space and latent space. In cases where the labels are known we also use quality metrics of KL divergence and nearest neighbour classifications error in order to determine the separation between classes. We demonstrate the efficacy of these proposed models both for synthetic and real biological datasets with a main focus on the MHC class-I dataset.
Resumo:
This paper considers the problem of low-dimensional visualisation of very high dimensional information sources for the purpose of situation awareness in the maritime environment. In response to the requirement for human decision support aids to reduce information overload (and specifically, data amenable to inter-point relative similarity measures) appropriate to the below-water maritime domain, we are investigating a preliminary prototype topographic visualisation model. The focus of the current paper is on the mathematical problem of exploiting a relative dissimilarity representation of signals in a visual informatics mapping model, driven by real-world sonar systems. A realistic noise model is explored and incorporated into non-linear and topographic visualisation algorithms building on the approach of [9]. Concepts are illustrated using a real world dataset of 32 hydrophones monitoring a shallow-water environment in which targets are present and dynamic.
Resumo:
Most machine-learning algorithms are designed for datasets with features of a single type whereas very little attention has been given to datasets with mixed-type features. We recently proposed a model to handle mixed types with a probabilistic latent variable formalism. This proposed model describes the data by type-specific distributions that are conditionally independent given the latent space and is called generalised generative topographic mapping (GGTM). It has often been observed that visualisations of high-dimensional datasets can be poor in the presence of noisy features. In this paper we therefore propose to extend the GGTM to estimate feature saliency values (GGTMFS) as an integrated part of the parameter learning process with an expectation-maximisation (EM) algorithm. The efficacy of the proposed GGTMFS model is demonstrated both for synthetic and real datasets.
Resumo:
The generative topographic mapping (GTM) model was introduced by Bishop et al. (1998, Neural Comput. 10(1), 215-234) as a probabilistic re- formulation of the self-organizing map (SOM). It offers a number of advantages compared with the standard SOM, and has already been used in a variety of applications. In this paper we report on several extensions of the GTM, including an incremental version of the EM algorithm for estimating the model parameters, the use of local subspace models, extensions to mixed discrete and continuous data, semi-linear models which permit the use of high-dimensional manifolds whilst avoiding computational intractability, Bayesian inference applied to hyper-parameters, and an alternative framework for the GTM based on Gaussian processes. All of these developments directly exploit the probabilistic structure of the GTM, thereby allowing the underlying modelling assumptions to be made explicit. They also highlight the advantages of adopting a consistent probabilistic framework for the formulation of pattern recognition algorithms.
Resumo:
This thesis describes the Generative Topographic Mapping (GTM) --- a non-linear latent variable model, intended for modelling continuous, intrinsically low-dimensional probability distributions, embedded in high-dimensional spaces. It can be seen as a non-linear form of principal component analysis or factor analysis. It also provides a principled alternative to the self-organizing map --- a widely established neural network model for unsupervised learning --- resolving many of its associated theoretical problems. An important, potential application of the GTM is visualization of high-dimensional data. Since the GTM is non-linear, the relationship between data and its visual representation may be far from trivial, but a better understanding of this relationship can be gained by computing the so-called magnification factor. In essence, the magnification factor relates the distances between data points, as they appear when visualized, to the actual distances between those data points. There are two principal limitations of the basic GTM model. The computational effort required will grow exponentially with the intrinsic dimensionality of the density model. However, if the intended application is visualization, this will typically not be a problem. The other limitation is the inherent structure of the GTM, which makes it most suitable for modelling moderately curved probability distributions of approximately rectangular shape. When the target distribution is very different to that, theaim of maintaining an `interpretable' structure, suitable for visualizing data, may come in conflict with the aim of providing a good density model. The fact that the GTM is a probabilistic model means that results from probability theory and statistics can be used to address problems such as model complexity. Furthermore, this framework provides solid ground for extending the GTM to wider contexts than that of this thesis.
Resumo:
Using techniques from Statistical Physics, the annealed VC entropy for hyperplanes in high dimensional spaces is calculated as a function of the margin for a spherical Gaussian distribution of inputs.
Resumo:
The modem digital communication systems are made transmission reliable by employing error correction technique for the redundancies. Codes in the low-density parity-check work along the principles of Hamming code, and the parity-check matrix is very sparse, and multiple errors can be corrected. The sparseness of the matrix allows for the decoding process to be carried out by probability propagation methods similar to those employed in Turbo codes. The relation between spin systems in statistical physics and digital error correcting codes is based on the existence of a simple isomorphism between the additive Boolean group and the multiplicative binary group. Shannon proved general results on the natural limits of compression and error-correction by setting up the framework known as information theory. Error-correction codes are based on mapping the original space of words onto a higher dimensional space in such a way that the typical distance between encoded words increases.
Resumo:
We consider the problem of illusory or artefactual structure from the visualisation of high-dimensional structureless data. In particular we examine the role of the distance metric in the use of topographic mappings based on the statistical field of multidimensional scaling. We show that the use of a squared Euclidean metric (i.e. the SSTRESs measure) gives rise to an annular structure when the input data is drawn from a high-dimensional isotropic distribution, and we provide a theoretical justification for this observation.
Resumo:
PURPOSE. It is well documented that myopia is associated with an increase in axial length or, more specifically, in vitreous chamber depth. Whether the transverse dimensions of the eye also increase in myopia is relevant to further understanding of its development. METHODS. The posterior retinal surface was localized in two-dimensional space in both eyes of young adult white and Taiwanese-Chinese iso- and anisomyopes (N = 56), from measured keratometry, A-scan ultrasonography, and central and peripheral refraction (±35°) data, with the aid of a computer modeling program designed for this purpose. Anisomyopes had 2 D or more interocular difference in their refractive errors, with mean values in their more myopic eyes of -5.57 D and in their less myopic eyes of -3.25 D, similar to the means of the two isomyopic groups. The derived retinal contours for the more and less myopic eyes were compared by way of investigating ocular shape changes that accompany myopia, in the posterior region of the vitreous chamber. The presence and size of optic disc crescents were also investigated as an index of retinal stretching in myopia. RESULTS. Relative to the less myopic eyes of anisometropic subjects, the more myopic eyes were more elongated and also distorted into a more prolate shape in both the white and Chinese groups. However, the Chinese eyes showed a greater and more uniform relative expansion of the posterior retinal surface in their more myopic eyes, and this was associated with larger optic disc crescents. The changes in the eyes of whites displayed a nasal-temporal axial asymmetry, reflecting greater enlargement of the nasal retinal sector. CONCLUSIONS. Myopia is associated with increased axial length and a prolate shape. This prolate shape is consistent with the proposed idea that axial and transverse dimensions of the eye are regulated differently. The observations that ocular shape changes are larger but more symmetrical in Chinese eyes than in eyes of whites warrant further investigation.
Resumo:
Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. Most existing systems concentrate either on mining algorithms or on visualization techniques. Though visual methods developed in information visualization have been helpful, for improved understanding of a complex large high-dimensional dataset, there is a need for an effective projection of such a dataset onto a lower-dimension (2D or 3D) manifold. This paper introduces a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain and visual techniques developed in the information visualization domain. The framework follows Shneiderman’s mantra to provide an effective user interface. The advantage of such an interface is that the user is directly involved in the data mining process. We integrate principled projection methods, such as Generative Topographic Mapping (GTM) and Hierarchical GTM (HGTM), with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, billboarding, and user interaction facilities, to provide an integrated visual data mining framework. Results on a real life high-dimensional dataset from the chemoinformatics domain are also reported and discussed. Projection results of GTM are analytically compared with the projection results from other traditional projection methods, and it is also shown that the HGTM algorithm provides additional value for large datasets. The computational complexity of these algorithms is discussed to demonstrate their suitability for the visual data mining framework.
Resumo:
This thesis describes the development of a complete data visualisation system for large tabular databases, such as those commonly found in a business environment. A state-of-the-art 'cyberspace cell' data visualisation technique was investigated and a powerful visualisation system using it was implemented. Although allowing databases to be explored and conclusions drawn, it had several drawbacks, the majority of which were due to the three-dimensional nature of the visualisation. A novel two-dimensional generic visualisation system, known as MADEN, was then developed and implemented, based upon a 2-D matrix of 'density plots'. MADEN allows an entire high-dimensional database to be visualised in one window, while permitting close analysis in 'enlargement' windows. Selections of records can be made and examined, and dependencies between fields can be investigated in detail. MADEN was used as a tool for investigating and assessing many data processing algorithms, firstly data-reducing (clustering) methods, then dimensionality-reducing techniques. These included a new 'directed' form of principal components analysis, several novel applications of artificial neural networks, and discriminant analysis techniques which illustrated how groups within a database can be separated. To illustrate the power of the system, MADEN was used to explore customer databases from two financial institutions, resulting in a number of discoveries which would be of interest to a marketing manager. Finally, the database of results from the 1992 UK Research Assessment Exercise was analysed. Using MADEN allowed both universities and disciplines to be graphically compared, and supplied some startling revelations, including empirical evidence of the 'Oxbridge factor'.
Resumo:
This thesis describes a novel connectionist machine utilizing induction by a Hilbert hypercube representation. This representation offers a number of distinct advantages which are described. We construct a theoretical and practical learning machine which lies in an area of overlap between three disciplines - neural nets, machine learning and knowledge acquisition - hence it is refered to as a "coalesced" machine. To this unifying aspect is added the various advantages of its orthogonal lattice structure as against less structured nets. We discuss the case for such a fundamental and low level empirical learning tool and the assumptions behind the machine are clearly outlined. Our theory of an orthogonal lattice structure the Hilbert hypercube of an n-dimensional space using a complemented distributed lattice as a basis for supervised learning is derived from first principles on clearly laid out scientific principles. The resulting "subhypercube theory" was implemented in a development machine which was then used to test the theoretical predictions again under strict scientific guidelines. The scope, advantages and limitations of this machine were tested in a series of experiments. Novel and seminal properties of the machine include: the "metrical", deterministic and global nature of its search; complete convergence invariably producing minimum polynomial solutions for both disjuncts and conjuncts even with moderate levels of noise present; a learning engine which is mathematically analysable in depth based upon the "complexity range" of the function concerned; a strong bias towards the simplest possible globally (rather than locally) derived "balanced" explanation of the data; the ability to cope with variables in the network; and new ways of reducing the exponential explosion. Performance issues were addressed and comparative studies with other learning machines indicates that our novel approach has definite value and should be further researched.
Resumo:
This thesis is a study of low-dimensional visualisation methods for data visualisation under certainty of the input data. It focuses on the two main feed-forward neural network algorithms which are NeuroScale and Generative Topographic Mapping (GTM) by trying to make both algorithms able to accommodate the uncertainty. The two models are shown not to work well under high levels of noise within the data and need to be modified. The modification of both models, NeuroScale and GTM, are verified by using synthetic data to show their ability to accommodate the noise. The thesis is interested in the controversy surrounding the non-uniqueness of predictive gene lists (PGL) of predicting prognosis outcome of breast cancer patients as available in DNA microarray experiments. Many of these studies have ignored the uncertainty issue resulting in random correlations of sparse model selection in high dimensional spaces. The visualisation techniques are used to confirm that the patients involved in such medical studies are intrinsically unclassifiable on the basis of provided PGL evidence. This additional category of ‘unclassifiable’ should be accommodated within medical decision support systems if serious errors and unnecessary adjuvant therapy are to be avoided.
Resumo:
Background: The controversy surrounding the non-uniqueness of predictive gene lists (PGL) of small selected subsets of genes from very large potential candidates as available in DNA microarray experiments is now widely acknowledged 1. Many of these studies have focused on constructing discriminative semi-parametric models and as such are also subject to the issue of random correlations of sparse model selection in high dimensional spaces. In this work we outline a different approach based around an unsupervised patient-specific nonlinear topographic projection in predictive gene lists. Methods: We construct nonlinear topographic projection maps based on inter-patient gene-list relative dissimilarities. The Neuroscale, the Stochastic Neighbor Embedding(SNE) and the Locally Linear Embedding(LLE) techniques have been used to construct two-dimensional projective visualisation plots of 70 dimensional PGLs per patient, classifiers are also constructed to identify the prognosis indicator of each patient using the resulting projections from those visualisation techniques and investigate whether a-posteriori two prognosis groups are separable on the evidence of the gene lists. A literature-proposed predictive gene list for breast cancer is benchmarked against a separate gene list using the above methods. Generalisation ability is investigated by using the mapping capability of Neuroscale to visualise the follow-up study, but based on the projections derived from the original dataset. Results: The results indicate that small subsets of patient-specific PGLs have insufficient prognostic dissimilarity to permit a distinction between two prognosis patients. Uncertainty and diversity across multiple gene expressions prevents unambiguous or even confident patient grouping. Comparative projections across different PGLs provide similar results. Conclusion: The random correlation effect to an arbitrary outcome induced by small subset selection from very high dimensional interrelated gene expression profiles leads to an outcome with associated uncertainty. This continuum and uncertainty precludes any attempts at constructing discriminative classifiers. However a patient's gene expression profile could possibly be used in treatment planning, based on knowledge of other patients' responses. We conclude that many of the patients involved in such medical studies are intrinsically unclassifiable on the basis of provided PGL evidence. This additional category of 'unclassifiable' should be accommodated within medical decision support systems if serious errors and unnecessary adjuvant therapy are to be avoided.