965 results for Latent variable models
Abstract:
This thesis consists of three articles. Two of them investigate topics related to taxation, and the third is an article on saving. Although their objects of analysis differ, the three share a common feature: the application of panel data econometric techniques to previously unexplored databases. Two of the articles use GMM estimation in dynamic models; the remaining one is an application of latent dependent variable models. A brief summary of each article follows, beginning with the two taxation articles, which share a common section on the ICMS (the Brazilian state value-added tax), and ending with the article on saving. The first article analyses the importance of enforcement as an instrument to deter tax evasion and increase tax revenue, in the case of a value-added tax in a developing country. The study uses data from the state of São Paulo. To address endogeneity and inertia in the tax revenue series, dynamic panel techniques are employed. The control variables are the level of regional GDP and two proxies for enforcement effort: the number and the value of tax fines. The results indicate a significant impact of enforcement effort on tax revenues. The article also provides indirect evidence of how tax evasion responds to the penalties applied to evasion cases. Its conclusions are also relevant to the debate on Brazilian fiscal federalism, especially in the context of a potential tax reform. The second article examines one of the main tasks of tax administrations: the periodic selection of taxpayers for audit. Improving the efficiency of the selection mechanisms can raise the probability of detecting tax fraud, allowing a better allocation of scarce enforcement resources. In this article, we attempt to develop such a mechanism by computing the probability of evasion associated with each taxpayer. This is done, within the restricted universe of audited firms, by combining several existing fiscal indicators with information on audit outcomes in an "optimal" way, using latent dependent variable models. Once the coefficients are estimated, the probability of evasion is computed for the entire universe of taxpayers. The method was applied to a panel of micro-data on firms subject to the ICMS within the Delegacia Tributária de Guarulhos (the Guarulhos tax office), in the state of São Paulo. The third article analyses the low saving rates of Latin American countries over the last decades. Using panel data techniques, the determinants of the saving rate are identified. A counterfactual analysis is then carried out using China, which has posted high saving rates over the same period, as a benchmark. Special attention is given to Brazil, which has lagged well behind its BRIC peers in this respect.
The article contributes to the existing literature in several ways: it uses two large databases to analyse the influence of a wide variety of determinants of the saving rate, including demographic and social security variables; it confirms results previously found in the literature, with the robustness afforded by richer databases; and, for some Latin American countries, it shows that their saving rates would tend to rise if their behaviour in other areas were more similar to China's, although the increase would not be as dramatic.
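A minimal sketch of the scoring idea in the second article, assuming a probit latent dependent variable model fitted on the audited subsample and then used to score every taxpayer; the file names and fiscal-indicator columns below are hypothetical placeholders, not the thesis's actual data.

    import pandas as pd
    import statsmodels.api as sm

    # Audited firms: existing fiscal indicators plus the audit outcome (1 = evasion detected).
    audited = pd.read_csv("audited_firms.csv")          # hypothetical file
    universe = pd.read_csv("all_icms_taxpayers.csv")    # hypothetical file
    indicators = ["indicator_1", "indicator_2", "indicator_3"]  # hypothetical fiscal indicators

    # Fit the latent dependent variable (probit) model on the audited subsample only.
    X = sm.add_constant(audited[indicators])
    probit = sm.Probit(audited["evasion_found"], X).fit()

    # With the estimated coefficients, score the evasion probability of every taxpayer.
    X_all = sm.add_constant(universe[indicators])
    universe["evasion_prob"] = probit.predict(X_all)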
Abstract:
In this article we use factor models to describe a certain class of covariance structure for financial time series models. More specifically, we concentrate on situations where the factor variances are modeled by a multivariate stochastic volatility structure. We build on previous work by allowing the factor loadings, in the factor model structure, to have a time-varying structure and to capture changes in asset weights over time, motivated by applications with multiple time series of daily exchange rates. We explore and discuss potential extensions to the models exposed here in the prediction area. This discussion leads to open issues on real time implementation and natural model comparisons.
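One common way to write down a factor model with multivariate stochastic volatility and time-varying loadings is sketched below; this is a generic parameterisation (random-walk loadings, AR(1) log-volatilities), not necessarily the exact specification used in the article.

    y_t = B_t f_t + \epsilon_t,                  \epsilon_t \sim N(0, \Psi)
    f_t \sim N(0, H_t),                          H_t = \mathrm{diag}(e^{\lambda_{1t}}, \dots, e^{\lambda_{kt}})
    \lambda_{jt} = \mu_j + \phi_j(\lambda_{j,t-1} - \mu_j) + \eta_{jt},   \eta_{jt} \sim N(0, \sigma_{\eta j}^2)
    \mathrm{vec}(B_t) = \mathrm{vec}(B_{t-1}) + \omega_t,                 \omega_t \sim N(0, W)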
Abstract:
The past decade has witnessed a series of (well accepted and defined) financial crisis periods in the world economy. Most of these events are country specific and eventually spread across neighboring countries, with the concept of vicinity extrapolating the geographic maps and entering the contagion maps. Unfortunately, what contagion represents and how to measure it are still unanswered questions. In this article we measure the transmission of shocks by cross-market correlation coefficients, following Forbes and Rigobon's (2000) notion of shift-contagion. Our main contribution relies upon the use of traditional factor model techniques combined with stochastic volatility models to study the dependence among Latin American stock price indexes and the North American index. More specifically, we concentrate on situations where the factor variances are modeled by a multivariate stochastic volatility structure. From a theoretical perspective, we improve currently available methodology by allowing the factor loadings, in the factor model structure, to have a time-varying structure and to capture changes in the series' weights over time. By doing this, we believe that changes and interventions experienced by those five countries are well accommodated by our models, which learn and adapt reasonably fast to those economic and idiosyncratic shocks. We empirically show that the time-varying covariance structure can be modeled by one or two common factors and that some sort of contagion is present in most of the series' covariances during periods of economic instability, or crisis. Open issues on real time implementation and natural model comparisons are thoroughly discussed.
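Under a one-factor version of this structure, the implied time-varying covariance and correlation between two market indexes i and j follow directly from the loadings and the factor volatility; a sketch, using the generic notation of the previous formulation (an assumption, not the article's exact notation), is:

    \mathrm{Cov}(y_{it}, y_{jt}) = b_{it} b_{jt} e^{\lambda_t}
    \mathrm{Corr}(y_{it}, y_{jt}) = \frac{b_{it} b_{jt} e^{\lambda_t}}{\sqrt{(b_{it}^2 e^{\lambda_t} + \psi_i)(b_{jt}^2 e^{\lambda_t} + \psi_j)}}

Shift-contagion can then be assessed by checking whether these model-implied correlations rise during crisis periods.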
Abstract:
Questionnaire data may contain missing values because certain questions do not apply to all respondents. For instance, questions addressing particular attributes of a symptom, such as frequency, triggers or seasonality, are only applicable to those who have experienced the symptom, while for those who have not, responses to these items will be missing. This missing information does not fall into the category 'missing by design'; rather, the features of interest do not exist and cannot be measured regardless of survey design. Analysis of responses to such conditional items is therefore typically restricted to the subpopulation in which they apply. This article is concerned with joint multivariate modelling of responses to both unconditional and conditional items without restricting the analysis to this subpopulation. Such an approach is of interest when the distributions of both types of responses are thought to be determined by common parameters affecting the whole population. By integrating the conditional item structure into the model, inference can be based both on unconditional data from the entire population and on conditional data from subjects for whom they exist. This approach opens new possibilities for multivariate analysis of such data. We apply this approach to latent class modelling and provide an example using data on respiratory symptoms (wheeze and cough) in children. Conditional data structures such as that considered here are common in medical research settings and, although our focus is on latent class models, the approach can be applied to other multivariate models.
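As an illustration of the idea, consider a latent class model with C classes, an unconditional binary item Y (e.g. wheeze) and conditional items Z_1, ..., Z_m that exist only when Y = 1; a sketch of the likelihood contribution of subject i, assuming the usual conditional independence given class, is:

    L_i = \sum_{c=1}^{C} \pi_c \, P(Y_i \mid c) \left[ \prod_{m} P(Z_{im} \mid c) \right]^{Y_i}

Subjects with Y_i = 0 contribute only through the unconditional item, subjects with Y_i = 1 also contribute through the conditional items, and both contributions inform the shared class parameters.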
Abstract:
Brain tumor is one of the most aggressive types of cancer in humans, with an estimated median survival time of 12 months and only 4% of the patients surviving more than 5 years after disease diagnosis. Until recently, brain tumor prognosis has been based only on clinical information such as tumor grade and patient age, but there are reports indicating that molecular profiling of gliomas can reveal subgroups of patients with distinct survival rates. We hypothesize that coupling molecular profiling of brain tumors with clinical information might improve predictions of patient survival time and, consequently, better guide future treatment decisions. In order to evaluate this hypothesis, the general goal of this research is to build models for survival prediction of glioma patients using DNA molecular profiles (U133 Affymetrix gene expression microarrays) along with clinical information. First, a predictive Random Forest model is built for binary outcomes (i.e. short vs. long-term survival) and a small subset of genes whose expression values can be used to predict survival time is selected. Next, a new statistical methodology is developed for predicting time-to-death outcomes using Bayesian ensemble trees. Due to the large heterogeneity observed within prognostic classes obtained by the Random Forest model, prediction can be improved by relating time-to-death with the gene expression profile directly. We propose a Bayesian ensemble model for survival prediction which is appropriate for high-dimensional data such as gene expression data. Our approach is based on the ensemble "sum-of-trees" model, which is flexible enough to incorporate additive and interaction effects between genes. We specify a fully Bayesian hierarchical approach and illustrate our methodology for the CPH, Weibull, and AFT survival models. We overcome the lack of conjugacy using a latent variable formulation to model the covariate effects, which decreases computation time for model fitting. Also, our proposed models provide a model-free way to select important predictive prognostic markers based on controlling false discovery rates. We compare the performance of our methods with baseline reference survival methods and apply our methodology to an unpublished data set of brain tumor survival times and gene expression data, selecting genes potentially related to the development of the disease under study. A closing discussion compares results obtained by Random Forest and Bayesian ensemble methods under the biological/clinical perspectives and highlights the statistical advantages and disadvantages of the new methodology in the context of DNA microarray data analysis.
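A minimal sketch of the kind of latent-variable data augmentation used for censored outcomes in an AFT-style sum-of-trees model: right-censored log survival times are replaced by draws from a normal distribution centred at the current ensemble prediction and truncated below at the log censoring time. The function below is illustrative only, not the thesis's implementation.

    import numpy as np
    from scipy.stats import truncnorm

    def impute_censored_log_times(f_hat, log_c, sigma, rng=None):
        """Draw latent log survival times for right-censored cases from a
        normal truncated below at the (log) censoring time."""
        a = (log_c - f_hat) / sigma  # lower truncation point in standard-deviation units
        return truncnorm.rvs(a, np.inf, loc=f_hat, scale=sigma, random_state=rng)

    # f_hat: current "sum-of-trees" prediction of log time for censored patients
    # log_c: log censoring times; sigma: current error standard deviation
    rng = np.random.default_rng(1)
    z = impute_censored_log_times(np.array([2.0, 2.5]), np.array([1.8, 2.7]), 0.5, rng)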
Abstract:
The present study investigated the relationship between psychometric intelligence and temporal resolution power (TRP) as simultaneously assessed by auditory and visual psychophysical timing tasks. In addition, three different theoretical models of the functional relationship between TRP and psychometric intelligence as assessed by means of the Adaptive Matrices Test (AMT) were developed. To test the validity of these models, structural equation modeling was applied. Empirical data supported a hierarchical model that assumed auditory and visual modality-specific temporal processing at a first level and amodal temporal processing at a second level. This second-order latent variable was substantially correlated with psychometric intelligence. Therefore, the relationship between psychometric intelligence and psychophysical timing performance can be explained best by a hierarchical model of temporal information processing.
Abstract:
Numerous studies have reported a strong link between working memory capacity (WMC) and fluid intelligence (Gf), although views differ with respect to how closely these two constructs are related to each other. In the present study, we used a WMC task with five levels of task demands to assess the relationship between WMC and Gf by means of a new methodological approach referred to as fixed-links modeling. Fixed-links models belong to the family of confirmatory factor analysis (CFA) and are of particular interest for experimental, repeated-measures designs. With this technique, processes systematically varying across task conditions can be disentangled from processes unaffected by the experimental manipulation. Proceeding from the assumption that experimental manipulation in a WMC task leads to increasing demands on WMC, the processes systematically varying across task conditions can be assumed to be WMC-specific. Processes not varying across task conditions, on the other hand, are probably independent of WMC. Fixed-links models allow for representing these two kinds of processes by two independent latent variables. In contrast to traditional CFA, where a common latent variable is derived from the different task conditions, fixed-links models facilitate a more precise or purified representation of the WMC-related processes of interest. By using fixed-links modeling to analyze data of 200 participants, we identified a non-experimental latent variable, representing processes that remained constant irrespective of the WMC task conditions, and an experimental latent variable which reflected processes that varied as a function of experimental manipulation. This latter variable represents the increasing demands on WMC and, hence, was considered a purified measure of WMC controlled for the constant processes. Fixed-links modeling showed that both the purified measure of WMC (β = .48) as well as the constant processes involved in the task (β = .45) were related to Gf. Taken together, these two latent variables explained the same portion of variance of Gf as a single latent variable obtained by traditional CFA (β = .65), indicating that traditional CFA causes an overestimation of the effective relationship between WMC and Gf. Thus, fixed-links modeling provides a feasible method for a more valid investigation of the functional relationship between specific constructs.
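A sketch of a fixed-links measurement model for the five task conditions X_1, ..., X_5; the particular fixed loading values below are illustrative, not necessarily those used in the study:

    X_j = 1 \cdot \eta_{\text{const}} + c_j \cdot \eta_{\text{exp}} + \varepsilon_j, \qquad j = 1, \dots, 5

The loadings on η_const are fixed to 1 for all conditions, the loadings c_j on η_exp are fixed to increase with task demand (e.g. c_j = j), and the two latent variables are specified as uncorrelated; Gf is then regressed on both latent variables.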
Abstract:
Research on lifestyle physical activity interventions suggests that they help individuals meet the new recommendations for physical activity made by the Centers for Disease Control and Prevention (CDC) and the American College of Sports Medicine (ACSM). The purpose of this research was to describe the rates of adherence to two lifestyle physical activity intervention arms and to examine the association between adherence and outcome variables, using data from Project PRIME, a lifestyle physical activity intervention based on the transtheoretical model and conducted by the Cooper Institute of Aerobics Research, Dallas, Texas. Participants were 250 sedentary healthy adults, aged 35 to 70 years, primarily non-Hispanic White, and in the contemplation and preparation stages of readiness to change. They were randomized to a group (PRIME G) or a mail- and telephone-delivered condition (PRIME C). Adherence measures included attending class (PRIME G), completing a monthly telephone call with a health educator (PRIME C), and completing homework assignments and self-monitoring minutes of moderate- to vigorous physical activity (both groups). In the first results paper, adherence over time and between conditions was examined: Attendance in group, completing the monthly telephone call, and homework completion decreased over time, and participants in PRIME G were more likely to complete homework than those in PRIME C. Paper 2 aimed to determine whether the adherence measures predicted achievement of the CDC/ACSM physical activity guideline. In separate models for the two conditions, a latent variable measuring adherence was found to predict achievement of the guideline. Paper 3 examined the association between adherence measures and the transtheoretical model's processes of change within each condition. For both, participants who completed at least two thirds of the homework assignments improved their use of the processes of change more than those who completed less than that amount. These results suggest that encouraging adherence to a lifestyle physical activity intervention, at least among already motivated volunteers, may increase the likelihood of beneficial changes in the outcomes.
Abstract:
It is widely acknowledged in the theoretical and empirical literature that social relationships, comprising structural measures (social networks) and functional measures (perceived social support), have an undeniable effect on health outcomes. However, the actual mechanism of this effect has yet to be clearly understood or explicated. In addition, comorbidity is found to adversely affect social relationships and health-related quality of life (a valued outcome measure in cancer patients and survivors).
This cross-sectional study uses selected baseline data (N=3088) from the Women's Healthy Eating and Living (WHEL) study. Lisrel 8.72 was used for the latent variable structural equation modeling. Due to the ordinal nature of the data, the Weighted Least Squares (WLS) method of estimation using Asymptotic Distribution Free covariance matrices was chosen for this analysis. The primary exogenous predictor variables are Social Networks and Comorbidity; Perceived Social Support is the endogenous predictor variable. Three dimensions of HRQoL, physical, mental and satisfaction with current quality of life, were the outcome variables.
This study hypothesizes and tests the mechanism and pathways between comorbidity, social relationships and HRQoL using latent variable structural equation modeling. After testing the measurement models of social networks and perceived social support, a structural model hypothesizing associations between the latent exogenous and endogenous variables was tested. The results of the study after listwise deletion (N=2131) mostly confirmed the hypothesized relationships (TLI, CFI > 0.95, RMSEA = 0.05, p = 0.15). Comorbidity was adversely associated with all three HRQoL outcomes. Strong ties were negatively associated with perceived social support; social network had a strong positive association with perceived social support, which served as a mediator between social networks and HRQoL. Mental health quality of life was the most adversely affected by the predictor variables.
This study is a preliminary look at the integration of structural and functional measures of social relationships, comorbidity and three HRQoL indicators using LVSEM. Developing stronger social networks and forming supportive relationships is beneficial for health outcomes such as the HRQoL of cancer survivors. Thus, the medical community treating cancer survivors as well as the survivors' social networks need to be informed and cognizant of these possible relationships.
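In standard LISREL notation, the kind of latent variable structural equation model described here can be sketched as follows (a generic formulation, not the study's exact parameter matrices):

    x = \Lambda_x \xi + \delta, \qquad y = \Lambda_y \eta + \varepsilon, \qquad \eta = B\eta + \Gamma\xi + \zeta

Here ξ contains the exogenous latent variables (social networks, comorbidity) and η the endogenous ones (perceived social support and the HRQoL outcomes), so that perceived social support can mediate the effect of social networks on HRQoL through the B and Γ matrices.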
Abstract:
Organizations are social systems or units composed of people who interact with each other to achieve common goals. One of these objectives is productivity, a multidimensional construct influenced by technological, economic, organizational and human aspects. Several studies support the influence on productivity of personal motivation, of the skills and abilities of individuals, of their talent for the job, as well as of the work environment present in the organization. Therefore, the overall objective of this research is to analyze the influence between human factors and productivity. The emphasis is on the person as a key productive factor, in order to answer the research questions concerning which human variables affect productivity, the possibility of proposing a productivity model that considers the impact of the human factor, and the possibility of finding a method for measuring productivity that includes the perception of the human factor. To address these questions, this research seeks to establish the relationships between human variables and productivity, as seen from the perspective of three different units of analysis: the individual, the group and the organization, in order to formulate a model of human productivity and to design an instrument for its measurement. A major source of research for choosing the human variables, formulating the model, and selecting the method of measuring productivity was the review of the available literature on productivity and the human factor in organizations, which facilitated the design of the theoretical and conceptual framework. Another source for the selection was the opinion of experts and specialists directly involved in the Venezuelan electricity sector, which facilitated obtaining a model whose variables reflect the reality of the area under study. To provide an explanatory interpretation of the phenomenon, the Human Factors vs. Productivity Model (HFPM) was proposed. This model was analyzed from the perspective of causal analysis and is composed of three exogenous latent variables, denominated individual, group and organizational factors, which are related to an endogenous latent variable denominated productivity. The HFPM was formulated using the methodology of structural equation modeling (SEM). The relationships initially proposed between the latent variables were corroborated by the global fit of the model, and the relationships between the latent variables and their associated indicators were confirmed, which enabled the statement of 26 hypotheses, of which 24 were verified. The model was validated using the strategy of rival models, used for comparing several SEM models and selecting the one with the best fit and theoretical support. The acceptance of the model was performed through the joint evaluation of the global goodness-of-fit indices.
Additionally, for the development of an instrument to measure productivity (IMPH), an exploratory factor analysis was performed prior to the application of a confirmatory factor analysis, using SEM. The review of the concepts of productivity, the impact of the human factor, and the measurement methods led to a subjective-methods approach that incorporated the perception of the main actors of the production process, both for the selection of variables and for the formulation of a productivity model and the design of an instrument to measure productivity. The methodological contribution of this research has been the use of SEM to relate variables that have to do with human behavior in the organization and with productivity, opening new possibilities for research in this area.
Abstract:
In recent years, cities around the world have invested substantial amounts of money in measures to reduce congestion and car trips. These investments are potential responses to the well-known urban sprawl phenomenon, also called the "development trap", which leads to further congestion and a higher proportion of our time spent in slow-moving cars. In this search for solutions, the complex relationship between the urban environment and travel behaviour has been studied in a number of cases. The main question under discussion is how to encourage multi-stop tours. Thus, the objective of this paper is to verify whether unobserved factors influence tour complexity. For this purpose, we use a database from a survey conducted in 2006-2007 in Madrid, a suitable case study for analyzing urban sprawl due to new urban developments and substantial changes in mobility patterns in recent years. A total of 943 individuals were interviewed from three selected neighbourhoods (CBD, urban and suburban). We study the effect of unobserved factors on trip frequency. This paper presents the estimation of a hybrid model in which the latent variable is called propensity to travel and the discrete choice model is composed of five tour-type alternatives. The results show that the characteristics of the neighbourhoods in Madrid are important for explaining trip frequency. The influence of land use variables on trip generation is clear, in particular the presence of commercial retail. Through the estimation of elasticities and forecasting, we determine to what extent land-use policy measures modify travel demand. Comparing aggregate elasticities with percentage variations, it can be seen that percentage variations could lead to inconsistent results. The results show that hybrid models explain travel behaviour better than traditional discrete choice models.
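The hybrid (integrated choice and latent variable) structure described here can be sketched as follows, where η_n is the latent propensity to travel of individual n; this is the generic ICLV formulation, with the paper's specific indicators and error assumptions left unspecified:

    \eta_n = \gamma' z_n + \omega_n                                (structural equation for the latent variable)
    I_{rn} = \alpha_r + \lambda_r \eta_n + \nu_{rn}                (measurement equations for the indicators)
    U_{jn} = \beta' x_{jn} + \theta_j \eta_n + \varepsilon_{jn}    (utility of tour type j among the five alternatives)

Estimation then integrates the choice probabilities over the distribution of ω_n, and elasticities are obtained from the fitted choice probabilities.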
Abstract:
Principal component analysis (PCA) is one of the most popular techniques for processing, compressing and visualising data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Previous attempts to formulate mixture models for PCA have therefore to some extent been ad hoc. In this paper, PCA is formulated within a maximum-likelihood framework, based on a specific form of Gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analysers, whose parameters can be determined using an EM algorithm. We discuss the advantages of this model in the context of clustering, density modelling and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.
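As a small illustration of the underlying latent variable model x = Wz + μ + ε, the sketch below computes the maximum-likelihood probabilistic PCA solution in closed form from the sample covariance eigendecomposition; a mixture of probabilistic principal component analysers would fit one such model per component inside an EM loop over responsibilities. The code is a generic sketch, not the authors' implementation.

    import numpy as np

    def ppca_ml(X, q):
        """Closed-form ML estimates for probabilistic PCA:
        noise variance = mean of the discarded eigenvalues,
        W = top-q eigenvectors scaled by sqrt(eigenvalue - noise variance)."""
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(S)            # eigh returns ascending order
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
        sigma2 = eigvals[q:].mean()
        W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
        return mu, W, sigma2

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    mu, W, sigma2 = ppca_ml(X, q=2)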
Abstract:
This paper introduces a new technique for the investigation of limited-dependent variable models. It illustrates that variable precision rough set theory (VPRS), allied with the use of a modern method of classification, or discretisation of data, can outperform the more standard approaches that are employed in economics, such as a probit model. These approaches and certain inductive decision tree methods are compared (through a Monte Carlo simulation approach) in the analysis of the decisions reached by the UK Monopolies and Mergers Commission. We show that, particularly in small samples, the VPRS model can improve on more traditional models, both in-sample, and particularly in out-of-sample prediction. A similar improvement in out-of-sample prediction over the decision tree methods is also shown.
Abstract:
The principled statistical application of Gaussian random field models used in geostatistics has historically been limited to data sets of a small size. This limitation is imposed by the requirement to store and invert the covariance matrix of all the samples to obtain a predictive distribution at unsampled locations, or to use likelihood-based covariance estimation. Various ad hoc approaches to solve this problem have been adopted, such as selecting a neighborhood region and/or a small number of observations to use in the kriging process, but these have no sound theoretical basis and it is unclear what information is being lost. In this article, we present a Bayesian method for estimating the posterior mean and covariance structures of a Gaussian random field using a sequential estimation algorithm. By imposing sparsity in a well-defined framework, the algorithm retains a subset of “basis vectors” that best represent the “true” posterior Gaussian random field model in the relative entropy sense. This allows a principled treatment of Gaussian random field models on very large data sets. The method is particularly appropriate when the Gaussian random field model is regarded as a latent variable model, which may be nonlinearly related to the observations. We show the application of the sequential, sparse Bayesian estimation in Gaussian random field models and discuss its merits and drawbacks.
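For reference, the full (non-sparse) Gaussian random field predictive distribution at an unsampled location x_* given observations y at the sampled locations is:

    m(x_*) = k_*^\top (K + \sigma^2 I)^{-1} y, \qquad
    v(x_*) = k(x_*, x_*) - k_*^\top (K + \sigma^2 I)^{-1} k_*

whose cost is dominated by storing and inverting the n × n matrix K + σ²I. The sparse sequential scheme described here instead retains only a subset of "basis vectors" and the corresponding reduced matrices, so that storage and prediction scale with the size of that subset rather than with n; the exact relative-entropy projection used to update the subset is given in the article.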
Resumo:
This thesis applies a hierarchical latent trait model system to a large quantity of data. The motivation for it was lack of viable approaches to analyse High Throughput Screening datasets which maybe include thousands of data points with high dimensions. High Throughput Screening (HTS) is an important tool in the pharmaceutical industry for discovering leads which can be optimised and further developed into candidate drugs. Since the development of new robotic technologies, the ability to test the activities of compounds has considerably increased in recent years. Traditional methods, looking at tables and graphical plots for analysing relationships between measured activities and the structure of compounds, have not been feasible when facing a large HTS dataset. Instead, data visualisation provides a method for analysing such large datasets, especially with high dimensions. So far, a few visualisation techniques for drug design have been developed, but most of them just cope with several properties of compounds at one time. We believe that a latent variable model (LTM) with a non-linear mapping from the latent space to the data space is a preferred choice for visualising a complex high-dimensional data set. As a type of latent variable model, the latent trait model can deal with either continuous data or discrete data, which makes it particularly useful in this domain. In addition, with the aid of differential geometry, we can imagine the distribution of data from magnification factor and curvature plots. Rather than obtaining the useful information just from a single plot, a hierarchical LTM arranges a set of LTMs and their corresponding plots in a tree structure. We model the whole data set with a LTM at the top level, which is broken down into clusters at deeper levels of t.he hierarchy. In this manner, the refined visualisation plots can be displayed in deeper levels and sub-clusters may be found. Hierarchy of LTMs is trained using expectation-maximisation (EM) algorithm to maximise its likelihood with respect to the data sample. Training proceeds interactively in a recursive fashion (top-down). The user subjectively identifies interesting regions on the visualisation plot that they would like to model in a greater detail. At each stage of hierarchical LTM construction, the EM algorithm alternates between the E- and M-step. Another problem that can occur when visualising a large data set is that there may be significant overlaps of data clusters. It is very difficult for the user to judge where centres of regions of interest should be put. We address this problem by employing the minimum message length technique, which can help the user to decide the optimal structure of the model. In this thesis we also demonstrate the applicability of the hierarchy of latent trait models in the field of document data mining.