22 resultados para Data exploration

em Aston University Research Archive


Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper describes how the statistical technique of cluster analysis and the machine learning technique of rule induction can be combined to explore a database. The ways in which such an approach alleviates the problems associated with other techniques for data analysis are discussed. We report the results of experiments carried out on a database from the medical diagnosis domain. Finally we describe the future developments which we plan to carry out to build on our current work.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This thesis introduces a flexible visual data exploration framework which combines advanced projection algorithms from the machine learning domain with visual representation techniques developed in the information visualisation domain to help a user to explore and understand effectively large multi-dimensional datasets. The advantage of such a framework to other techniques currently available to the domain experts is that the user is directly involved in the data mining process and advanced machine learning algorithms are employed for better projection. A hierarchical visualisation model guided by a domain expert allows them to obtain an informed segmentation of the input space. Two other components of this thesis exploit properties of these principled probabilistic projection algorithms to develop a guided mixture of local experts algorithm which provides robust prediction and a model to estimate feature saliency simultaneously with the training of a projection algorithm.Local models are useful since a single global model cannot capture the full variability of a heterogeneous data space such as the chemical space. Probabilistic hierarchical visualisation techniques provide an effective soft segmentation of an input space by a visualisation hierarchy whose leaf nodes represent different regions of the input space. We use this soft segmentation to develop a guided mixture of local experts (GME) algorithm which is appropriate for the heterogeneous datasets found in chemoinformatics problems. Moreover, in this approach the domain experts are more involved in the model development process which is suitable for an intuition and domain knowledge driven task such as drug discovery. We also derive a generative topographic mapping (GTM) based data visualisation approach which estimates feature saliency simultaneously with the training of a visualisation model.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Clustering techniques such as k-means and hierarchical clustering are commonly used to analyze DNA microarray derived gene expression data. However, the interactions between processes underlying the cell activity suggest that the complexity of the microarray data structure may not be fully represented with discrete clustering methods.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Visualising data for exploratory analysis is a major challenge in many applications. Visualisation allows scientists to gain insight into the structure and distribution of the data, for example finding common patterns and relationships between samples as well as variables. Typically, visualisation methods like principal component analysis and multi-dimensional scaling are employed. These methods are favoured because of their simplicity, but they cannot cope with missing data and it is difficult to incorporate prior knowledge about properties of the variable space into the analysis; this is particularly important in the high-dimensional, sparse datasets typical in geochemistry. In this paper we show how to utilise a block-structured correlation matrix using a modification of a well known non-linear probabilistic visualisation model, the Generative Topographic Mapping (GTM), which can cope with missing data. The block structure supports direct modelling of strongly correlated variables. We show that including prior structural information it is possible to improve both the data visualisation and the model fit. These benefits are demonstrated on artificial data as well as a real geochemical dataset used for oil exploration, where the proposed modifications improved the missing data imputation results by 3 to 13%.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

N-tuple recognition systems (RAMnets) are normally modeled using a small number of input lines to each RAM, because the address space grows exponentially with the number of inputs. It is impossible to implement an arbitrarily-large address space as physical memory. But given modest amounts of training data, correspondingly modest numbers of bits will be set in that memory. Hash arrays can therefore be used instead of a direct implementation of the required address space. This paper describes some exploratory experiments using the hash array technique to investigate the performance of RAMnets with very large numbers of input lines. An argument is presented which concludes that performance should peak at a relatively small n-tuple size, but the experiments carried out so far contradict this. Further experiments are needed to confirm this unexpected result.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Multidimensional compound optimization is a new paradigm in the drug discovery process, yielding efficiencies during early stages and reducing attrition in the later stages of drug development. The success of this strategy relies heavily on understanding this multidimensional data and extracting useful information from it. This paper demonstrates how principled visualization algorithms can be used to understand and explore a large data set created in the early stages of drug discovery. The experiments presented are performed on a real-world data set comprising biological activity data and some whole-molecular physicochemical properties. Data visualization is a popular way of presenting complex data in a simpler form. We have applied powerful principled visualization methods, such as generative topographic mapping (GTM) and hierarchical GTM (HGTM), to help the domain experts (screening scientists, chemists, biologists, etc.) understand and draw meaningful decisions. We also benchmark these principled methods against relatively better known visualization approaches, principal component analysis (PCA), Sammon's mapping, and self-organizing maps (SOMs), to demonstrate their enhanced power to help the user visualize the large multidimensional data sets one has to deal with during the early stages of the drug discovery process. The results reported clearly show that the GTM and HGTM algorithms allow the user to cluster active compounds for different targets and understand them better than the benchmarks. An interactive software tool supporting these visualization algorithms was provided to the domain experts. The tool facilitates the domain experts by exploration of the projection obtained from the visualization algorithms providing facilities such as parallel coordinate plots, magnification factors, directional curvatures, and integration with industry standard software. © 2006 American Chemical Society.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. miniDVMS v1.8 provides a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain and visual techniques developed in the information visualisation domain. The advantage of this interface is that the user is directly involved in the data mining process. Principled projection methods, such as generative topographic mapping (GTM) and hierarchical GTM (HGTM), are integrated with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, and user interaction facilities, to provide this integrated visual data mining framework. The software also supports conventional visualisation techniques such as principal component analysis (PCA), Neuroscale, and PhiVis. This user manual gives an overview of the purpose of the software tool, highlights some of the issues to be taken care while creating a new model, and provides information about how to install and use the tool. The user manual does not require the readers to have familiarity with the algorithms it implements. Basic computing skills are enough to operate the software.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This field work study furthers understanding about expatriate management, in particular, the nature of cross-cultural management in Hong Kong involving Anglo-American expatriate and Chinese host national managers, the important features of adjustment for expatriates living and working there, and the type of training which will assist them to adjust and to work successfully in this Asian environment. Qualitative and quantitative data on each issue was gathered during in-depth interviews in Hong Kong, using structured interview schedules, with 39 expatriate and 31 host national managers drawn from a cross-section of functional areas and organizations. Despite the adoption of Western technology and the influence of Western business practices, micro-level management in Hong Kong retains a cultural specificity which is consistent with the norms and values of Chinese culture. There are differences in how expatriates and host nationals define their social roles, and Hong Kong's recent colonial history appears to influence cross-cultural interpersonal interactions. The inability of the spouse and/or family to adapt to Hong Kong is identified as a major reason for expatriate assignments to fail, though the causes have less to do with living away from family and friends, than with Hong Kong's highly urbanized environment and the heavy demands of work. Culture shock is not identified as a major problem, but in Hong Kong micro-level social factors require greater adjustment than macro-level societal factors. The adjustment of expatriate managers is facilitated by a strong orientation towards career development and hard work, possession of technical/professional expertise, and a willingness to engage in a process of continuous 'active learning' with respect to the host national society and culture. A four-part model of manager training suitable for Hong Kong is derived from the study data. It consists of a pre-departure briefing, post-arrival cross-cultural training, language training in basic Cantonese and in how to communicate more effectively in English with non-native speakers, and the assignment of a mentor to newly arrived expatriate managers.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Computer simulated trajectories of bulk water molecules form complex spatiotemporal structures at the picosecond time scale. This intrinsic complexity, which underlies the formation of molecular structures at longer time scales, has been quantified using a measure of statistical complexity. The method estimates the information contained in the molecular trajectory by detecting and quantifying temporal patterns present in the simulated data (velocity time series). Two types of temporal patterns are found. The first, defined by the short-time correlations corresponding to the velocity autocorrelation decay times (â‰0.1â€ps), remains asymptotically stable for time intervals longer than several tens of nanoseconds. The second is caused by previously unknown longer-time correlations (found at longer than the nanoseconds time scales) leading to a value of statistical complexity that slowly increases with time. A direct measure based on the notion of statistical complexity that describes how the trajectory explores the phase space and independent from the particular molecular signal used as the observed time series is introduced. © 2008 The American Physical Society.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. Most existing systems concentrate either on mining algorithms or on visualization techniques. Though visual methods developed in information visualization have been helpful, for improved understanding of a complex large high-dimensional dataset, there is a need for an effective projection of such a dataset onto a lower-dimension (2D or 3D) manifold. This paper introduces a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain and visual techniques developed in the information visualization domain. The framework follows Shneiderman’s mantra to provide an effective user interface. The advantage of such an interface is that the user is directly involved in the data mining process. We integrate principled projection methods, such as Generative Topographic Mapping (GTM) and Hierarchical GTM (HGTM), with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, billboarding, and user interaction facilities, to provide an integrated visual data mining framework. Results on a real life high-dimensional dataset from the chemoinformatics domain are also reported and discussed. Projection results of GTM are analytically compared with the projection results from other traditional projection methods, and it is also shown that the HGTM algorithm provides additional value for large datasets. The computational complexity of these algorithms is discussed to demonstrate their suitability for the visual data mining framework.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Most parametric software cost estimation models used today evolved in the late 70's and early 80's. At that time, the dominant software development techniques being used were the early 'structured methods'. Since then, several new systems development paradigms and methods have emerged, one being Jackson Systems Development (JSD). As current cost estimating methods do not take account of these developments, their non-universality means they cannot provide adequate estimates of effort and hence cost. In order to address these shortcomings two new estimation methods have been developed for JSD projects. One of these methods JSD-FPA, is a top-down estimating method, based on the existing MKII function point method. The other method, JSD-COCOMO, is a sizing technique which sizes a project, in terms of lines of code, from the process structure diagrams and thus provides an input to the traditional COCOMO method.The JSD-FPA method allows JSD projects in both the real-time and scientific application areas to be costed, as well as the commercial information systems applications to which FPA is usually applied. The method is based upon a three-dimensional view of a system specification as opposed to the largely data-oriented view traditionally used by FPA. The method uses counts of various attributes of a JSD specification to develop a metric which provides an indication of the size of the system to be developed. This size metric is then transformed into an estimate of effort by calculating past project productivity and utilising this figure to predict the effort and hence cost of a future project. The effort estimates produced were validated by comparing them against the effort figures for six actual projects.The JSD-COCOMO method uses counts of the levels in a process structure chart as the input to an empirically derived model which transforms them into an estimate of delivered source code instructions.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Some researchers argue that the top team, rather than the CEO, is a better predictor of an organisation’s fate (Finkelstein & Hambrick, 1996; Knight et al., 1999). However, others suggest that the importance of the top management team (TMT) composition literature is exaggerated (West & Schwenk, 1996). This has stimulated a need for further research on TMTs. While the importance of TMT is well documented in the innovation literature, the organisational environment also plays a key role in determining organisational outcomes. Therefore, the inclusion of both TMT characteristics and organisational variables (climate and organisational learning) in this study provides a more holistic picture of innovation. The research methodologies employed includes (i) interviews with TMT members in 35 Irish software companies (ii) a survey completed by managerial respondents and core workers in these companies (iii) in-depth interviews with TMT members from five companies. Data were gathered in two phases, time 1 (1998-2000) and time 2 (2003). The TMT played an important part in fostering innovation. However, it was a group process, rather than team demography, that was most strongly associated with innovation. Task reflexivity was an important predictor of innovation time 1, time 2). Only one measure of TMT diversity was associated with innovation - tenure diversity -in time 2 only. Organisational context played an important role in determining innovation. This was positively associated with innovation - but with one dimension of organisational learning only. The ability to share information (access to information) was not associated with innovation but the motivation to share information was (perceiving the sharing of information to be valuable). Innovative climate was also associated with innovation. This study suggests that this will lead to innovative outcomes if employees perceive the organisation to support risk, experimentation and other innovative behaviours.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The thesis began as a study of new firm formation. Preliminary research suggested that infant death rate was considered to be a closely related problem and the search was for a theory of new firm formation which would explain both. The thesis finds theories of exit and entry inadequate in this respect and focusses instead on theories of entrepreneurship, particularly those which concentrate on entrepreneurship as an agent of change. The role of information is found to be fundamental to economic change and an understanding of information generation and dissemination and the nature and direction of information flows is postulated to lead coterminously to an understanding of entrepreneurhsip and economic change. The economics of information is applied to theories of entrepreneurhsip and some testable hypotheses are derived. The testing relies on etablishing and measuring the information bases of the founders of new firms and then testing for certain hypothesised differences between the information bases of survivors and non-survivors. No theory of entrepreneurship is likely to be straightforwardly testable and many postulates have to be established to bring the theory to a testable stage. A questionnaire is used to gather information from a sample of firms taken from a new micro-data set established as part of the work of the thesis. Discriminant Analysis establishes the variables which best distinguish between survivors and non-survivors. The variables which emerge as important discriminators are consistent with the theory which the analysis is testing. While there are alternative interpretations of the important variables, collective consistency with the theory under test is established. The thesis concludes with an examination of the implications of the theory for policy towards stimulating new firm formation.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Exploratory analysis of data seeks to find common patterns to gain insights into the structure and distribution of the data. In geochemistry it is a valuable means to gain insights into the complicated processes making up a petroleum system. Typically linear visualisation methods like principal components analysis, linked plots, or brushing are used. These methods can not directly be employed when dealing with missing data and they struggle to capture global non-linear structures in the data, however they can do so locally. This thesis discusses a complementary approach based on a non-linear probabilistic model. The generative topographic mapping (GTM) enables the visualisation of the effects of very many variables on a single plot, which is able to incorporate more structure than a two dimensional principal components plot. The model can deal with uncertainty, missing data and allows for the exploration of the non-linear structure in the data. In this thesis a novel approach to initialise the GTM with arbitrary projections is developed. This makes it possible to combine GTM with algorithms like Isomap and fit complex non-linear structure like the Swiss-roll. Another novel extension is the incorporation of prior knowledge about the structure of the covariance matrix. This extension greatly enhances the modelling capabilities of the algorithm resulting in better fit to the data and better imputation capabilities for missing data. Additionally an extensive benchmark study of the missing data imputation capabilities of GTM is performed. Further a novel approach, based on missing data, will be introduced to benchmark the fit of probabilistic visualisation algorithms on unlabelled data. Finally the work is complemented by evaluating the algorithms on real-life datasets from geochemical projects.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Despite the growth of spoken academic corpora in recent years, relatively little is known about the language of seminar discussions in higher education. This thesis compares seminar discussions across three disciplinary areas. The aim of this thesis is to uncover the functions and patterns of talk used in different disciplinary discussions and to highlight language on a macro and micro level that would be useful for materials design and teaching purposes. A framework for identifying and analysing genres in spoken language based on Hallidayan Systemic Functional Linguistics (SFL) is used. Stretches of talk sharing a similar purpose and predictable functional staging, termed Discussion Macro Genres (DMGs) are identified. Language is compared across DMGs and across disciplines through use of corpus techniques in conjunction with SFL genre theory. Data for the study comprises just over 180,000 tokens and is drawn from the British Academic Spoken English corpus (BASE), recorded at two universities in the UK. The discipline areas investigated are Arts and Humanities, Social Sciences and Physical Sciences. Findings from this study make theoretical, empirical and methodological contributions to the field of spoken EAP. The empirical findings are firstly, that the majority of the seminar discussion can be assigned to one of the three main DMG in the corpus: Responding, Debating and Problem Solving. Secondly, it characterises each discipline area according to two DMGs. Thirdly, the majority of the discussion is non-oppositional in nature, suggesting that ‘debate’ is not the only form of discussion that students need to be prepared for. Finally, while some characteristics of the discussion are tied to the DMG and common across disciplines, others are discipline specific. On a theoretical level, this study shows that an SFL genre model for investigating spoken discourse can be successfully extended to investigate longer stretches of discourse than have previously been identified. The methodological contribution is to demonstrate how corpus techniques can be combined with SFL genre theory to investigate extended stretches of spoken discussion. The thesis will be of value to those working in the field of teaching spoken EAP/ ESAP as well as to materials developers.