836 resultados para Probabilistic latent semantic analysis (PLSA)
Resumo:
Derivational morphology proposes meaningful connections between words and is largely unrepresented in lexical databases. This thesis presents a project to enrich a lexical database with morphological links and to evaluate their contribution to disambiguation. A lexical database with sense distinctions was required. WordNet was chosen because of its free availability and widespread use. Its suitability was assessed through critical evaluation with respect to specifications and criticisms, using a transparent, extensible model. The identification of serious shortcomings suggested a portable enrichment methodology, applicable to alternative resources. Although 40% of the most frequent words are prepositions, they have been largely ignored by computational linguists, so addition of prepositions was also required. The preferred approach to morphological enrichment was to infer relations from phenomena discovered algorithmically. Both existing databases and existing algorithms can capture regular morphological relations, but cannot capture exceptions correctly; neither of them provide any semantic information. Some morphological analysis algorithms are subject to the fallacy that morphological analysis can be performed simply by segmentation. Morphological rules, grounded in observation and etymology, govern associations between and attachment of suffixes and contribute to defining the meaning of morphological relationships. Specifying character substitutions circumvents the segmentation fallacy. Morphological rules are prone to undergeneration, minimised through a variable lexical validity requirement, and overgeneration, minimised by rule reformulation and restricting monosyllabic output. Rules take into account the morphology of ancestor languages through co-occurrences of morphological patterns. Multiple rules applicable to an input suffix need their precedence established. The resistance of prefixations to segmentation has been addressed by identifying linking vowel exceptions and irregular prefixes. The automatic affix discovery algorithm applies heuristics to identify meaningful affixes and is combined with morphological rules into a hybrid model, fed only with empirical data, collected without supervision. Further algorithms apply the rules optimally to automatically pre-identified suffixes and break words into their component morphemes. To handle exceptions, stoplists were created in response to initial errors and fed back into the model through iterative development, leading to 100% precision, contestable only on lexicographic criteria. Stoplist length is minimised by special treatment of monosyllables and reformulation of rules. 96% of words and phrases are analysed. 218,802 directed derivational links have been encoded in the lexicon rather than the wordnet component of the model because the lexicon provides the optimal clustering of word senses. Both links and analyser are portable to an alternative lexicon. The evaluation uses the extended gloss overlaps disambiguation algorithm. The enriched model outperformed WordNet in terms of recall without loss of precision. Failure of all experiments to outperform disambiguation by frequency reflects on WordNet sense distinctions.
Resumo:
The retrieval of wind vectors from satellite scatterometer observations is a non-linear inverse problem. A common approach to solving inverse problems is to adopt a Bayesian framework and to infer the posterior distribution of the parameters of interest given the observations by using a likelihood model relating the observations to the parameters, and a prior distribution over the parameters. We show how Gaussian process priors can be used efficiently with a variety of likelihood models, using local forward (observation) models and direct inverse models for the scatterometer. We present an enhanced Markov chain Monte Carlo method to sample from the resulting multimodal posterior distribution. We go on to show how the computational complexity of the inference can be controlled by using a sparse, sequential Bayes algorithm for estimation with Gaussian processes. This helps to overcome the most serious barrier to the use of probabilistic, Gaussian process methods in remote sensing inverse problems, which is the prohibitively large size of the data sets. We contrast the sampling results with the approximations that are found by using the sparse, sequential Bayes algorithm.
Resumo:
This work explores the relevance of semantic and linguistic description to translation, theory and practice. It is aimed towards a practical model of approach to texts to translate. As literary texts [poetry mainly] are the focus of attention, so are stylistic matters. Note, however, that 'style', and, to some extent, the conclusions of the work, are not limited to so-called literary texts. The study of semantic description reveals that most translation problems do not stem from the cognitive (langue-related), but rather from the contextual (parole-related) aspects of meaning. Thus, any linguistic model that fails to account for the latter is bound to fall short. T.G.G. does, whereas Systemics, concerned with both the 'Iangue' and 'parole' (stylistic and sociolinguistic mainly) aspects of meaning, provides a useful framework of approach to texts to translate. Two essential semantic principles for translation are: that meaning is the property of a language (Firth); and the 'relativity of meaning assignments' (Tymoczko). Both imply that meaning can only be assessed, correctly, in the relevant socio-cultural background. Translation is seen as a restricted creation, and the translator's encroach as a three-dimensional critical one. To encompass the most technical to the most literary text, and account for variations in emphasis in any text, translation theory must be based on typology of function Halliday's ideational, interpersonal and textual, or, Buhler's symbol, signal, symptom, Functions3. Function Coverall and specific] will dictate aims and method, and also provide the critic with criteria to assess translation Faithfulness. Translation can never be reduced to purely objective methods, however. Intuitive procedures intervene, in textual interpretation and analysis, in the choice of equivalents, and in the reception of a translation. Ultimately, translation, theory and practice, may perhaps constitute the touchstone as regards the validity of linguistic and semantic theories.
Resumo:
Simplification of texts has traditionally been carried out by replacing words and structures with appropriate semantic equivalents in the learner's interlanguage, omitting whichever items prove intractable, and thereby bringing the language of the original within the scope of the learner's transitional linguistic competence. This kind of simplification focuses mainly on the formal features of language. The simplifier can, on the other hand, concentrate on making explicit the propositional content and its presentation in the original in order to bring what is communicated in the original within the scope of the learner's transitional communicative competence. In this case, simplification focuses on the communicative function of the language. Up to now, however, approaches to the problem of simplification have been mainly concerned with the first kind, using the simplifier’s intuition as to what constitutes difficulty for the learner. There appear to be few objective principles underlying this process. The main aim of this study is to investigate the effect of simplification on the communicative aspects of narrative texts, which includes the manner in which narrative units at higher levels of organisation are structured and presented and also the temporal and logical relationships between lower level structures such as sentences/clauses, with the intention of establishing an objective approach to the problem of simplification based on a set of principled procedures which could be used as a guideline in the simplification of material for foreign students at an advanced level.
Resumo:
In a series of experiments, we tested category-specific activation in normal parti¬cipants using magnetoencephalography (MEG). Our experiments explored the temporal processing of objects, as MEG characterises neural activity on the order of milliseconds. Our experiments explored object-processing, including assessing the time-course of ob¬ject naming, early differences in processing living compared with nonliving objects and processing objects at the basic compared with the domain level, and late differences in processing living compared with nonliving objects and processing objects at the basic compared with the domain level. In addition to studies using normal participants, we also utilised MEG to explore category-specific processing in a patient with a deficit for living objects. Our findings support the cascade model of object naming (Humphreys et al., 1988). In addition, our findings using normal participants demonstrate early, category-specific perceptual differences. These findings are corroborated by our patient study. In our assessment of the time-course of category-specific effects as well as a separate analysis designed to measure semantic differences between living and nonliving objects, we found support for the sensory/motor model of object naming (Martin, 1998), in addition to support for the cascade model of object naming. Thus, object processing in normal participants appears to be served by a distributed network in the brain, and there are both perceptual and semantic differences between living and nonliving objects. A separate study assessing the influence of the level at which you are asked to identify an object on processing in the brain found evidence supporting the convergence zone hypothesis (Damasio, 1989). Taken together, these findings indicate the utility of MEG in exploring the time-course of object processing, isolating early perceptual and later semantic effects within the brain.
Resumo:
The topic of this thesis is the development of knowledge based statistical software. The shortcomings of conventional statistical packages are discussed to illustrate the need to develop software which is able to exhibit a greater degree of statistical expertise, thereby reducing the misuse of statistical methods by those not well versed in the art of statistical analysis. Some of the issues involved in the development of knowledge based software are presented and a review is given of some of the systems that have been developed so far. The majority of these have moved away from conventional architectures by adopting what can be termed an expert systems approach. The thesis then proposes an approach which is based upon the concept of semantic modelling. By representing some of the semantic meaning of data, it is conceived that a system could examine a request to apply a statistical technique and check if the use of the chosen technique was semantically sound, i.e. will the results obtained be meaningful. Current systems, in contrast, can only perform what can be considered as syntactic checks. The prototype system that has been implemented to explore the feasibility of such an approach is presented, the system has been designed as an enhanced variant of a conventional style statistical package. This involved developing a semantic data model to represent some of the statistically relevant knowledge about data and identifying sets of requirements that should be met for the application of the statistical techniques to be valid. Those areas of statistics covered in the prototype are measures of association and tests of location.
Resumo:
This thesis presents the results from an investigation into the merits of analysing Magnetoencephalographic (MEG) data in the context of dynamical systems theory. MEG is the study of both the methods for the measurement of minute magnetic flux variations at the scalp, resulting from neuro-electric activity in the neocortex, as well as the techniques required to process and extract useful information from these measurements. As a result of its unique mode of action - by directly measuring neuronal activity via the resulting magnetic field fluctuations - MEG possesses a number of useful qualities which could potentially make it a powerful addition to any brain researcher's arsenal. Unfortunately, MEG research has so far failed to fulfil its early promise, being hindered in its progress by a variety of factors. Conventionally, the analysis of MEG has been dominated by the search for activity in certain spectral bands - the so-called alpha, delta, beta, etc that are commonly referred to in both academic and lay publications. Other efforts have centred upon generating optimal fits of "equivalent current dipoles" that best explain the observed field distribution. Many of these approaches carry the implicit assumption that the dynamics which result in the observed time series are linear. This is despite a variety of reasons which suggest that nonlinearity might be present in MEG recordings. By using methods that allow for nonlinear dynamics, the research described in this thesis avoids these restrictive linearity assumptions. A crucial concept underpinning this project is the belief that MEG recordings are mere observations of the evolution of the true underlying state, which is unobservable and is assumed to reflect some abstract brain cognitive state. Further, we maintain that it is unreasonable to expect these processes to be adequately described in the traditional way: as a linear sum of a large number of frequency generators. One of the main objectives of this thesis will be to prove that much more effective and powerful analysis of MEG can be achieved if one were to assume the presence of both linear and nonlinear characteristics from the outset. Our position is that the combined action of a relatively small number of these generators, coupled with external and dynamic noise sources, is more than sufficient to account for the complexity observed in the MEG recordings. Another problem that has plagued MEG researchers is the extremely low signal to noise ratios that are obtained. As the magnetic flux variations resulting from actual cortical processes can be extremely minute, the measuring devices used in MEG are, necessarily, extremely sensitive. The unfortunate side-effect of this is that even commonplace phenomena such as the earth's geomagnetic field can easily swamp signals of interest. This problem is commonly addressed by averaging over a large number of recordings. However, this has a number of notable drawbacks. In particular, it is difficult to synchronise high frequency activity which might be of interest, and often these signals will be cancelled out by the averaging process. Other problems that have been encountered are high costs and low portability of state-of-the- art multichannel machines. The result of this is that the use of MEG has, hitherto, been restricted to large institutions which are able to afford the high costs associated with the procurement and maintenance of these machines. In this project, we seek to address these issues by working almost exclusively with single channel, unaveraged MEG data. We demonstrate the applicability of a variety of methods originating from the fields of signal processing, dynamical systems, information theory and neural networks, to the analysis of MEG data. It is noteworthy that while modern signal processing tools such as independent component analysis, topographic maps and latent variable modelling have enjoyed extensive success in a variety of research areas from financial time series modelling to the analysis of sun spot activity, their use in MEG analysis has thus far been extremely limited. It is hoped that this work will help to remedy this oversight.
Resumo:
Exploratory analysis of data seeks to find common patterns to gain insights into the structure and distribution of the data. In geochemistry it is a valuable means to gain insights into the complicated processes making up a petroleum system. Typically linear visualisation methods like principal components analysis, linked plots, or brushing are used. These methods can not directly be employed when dealing with missing data and they struggle to capture global non-linear structures in the data, however they can do so locally. This thesis discusses a complementary approach based on a non-linear probabilistic model. The generative topographic mapping (GTM) enables the visualisation of the effects of very many variables on a single plot, which is able to incorporate more structure than a two dimensional principal components plot. The model can deal with uncertainty, missing data and allows for the exploration of the non-linear structure in the data. In this thesis a novel approach to initialise the GTM with arbitrary projections is developed. This makes it possible to combine GTM with algorithms like Isomap and fit complex non-linear structure like the Swiss-roll. Another novel extension is the incorporation of prior knowledge about the structure of the covariance matrix. This extension greatly enhances the modelling capabilities of the algorithm resulting in better fit to the data and better imputation capabilities for missing data. Additionally an extensive benchmark study of the missing data imputation capabilities of GTM is performed. Further a novel approach, based on missing data, will be introduced to benchmark the fit of probabilistic visualisation algorithms on unlabelled data. Finally the work is complemented by evaluating the algorithms on real-life datasets from geochemical projects.
Resumo:
Component-based development (CBD) has become an important emerging topic in the software engineering field. It promises long-sought-after benefits such as increased software reuse, reduced development time to market and, hence, reduced software production cost. Despite the huge potential, the lack of reasoning support and development environment of component modeling and verification may hinder its development. Methods and tools that can support component model analysis are highly appreciated by industry. Such a tool support should be fully automated as well as efficient. At the same time, the reasoning tool should scale up well as it may need to handle hundreds or even thousands of components that a modern software system may have. Furthermore, a distributed environment that can effectively manage and compose components is also desirable. In this paper, we present an approach to the modeling and verification of a newly proposed component model using Semantic Web languages and their reasoning tools. We use the Web Ontology Language and the Semantic Web Rule Language to precisely capture the inter-relationships and constraints among the entities in a component model. Semantic Web reasoning tools are deployed to perform automated analysis support of the component models. Moreover, we also proposed a service-oriented architecture (SOA)-based semantic web environment for CBD. The adoption of Semantic Web services and SOA make our component environment more reusable, scalable, dynamic and adaptive.
Resumo:
Sentiment analysis or opinion mining aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text. This paper proposes a novel probabilistic modeling framework called joint sentiment-topic (JST) model based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. A reparameterized version of the JST model called Reverse-JST, obtained by reversing the sequence of sentiment and topic generation in the modeling process, is also studied. Although JST is equivalent to Reverse-JST without a hierarchical prior, extensive experiments show that when sentiment priors are added, JST performs consistently better than Reverse-JST. Besides, unlike supervised approaches to sentiment classification which often fail to produce satisfactory performance when shifting to other domains, the weakly supervised nature of JST makes it highly portable to other domains. This is verified by the experimental results on data sets from five different domains where the JST model even outperforms existing semi-supervised approaches in some of the data sets despite using no labeled documents. Moreover, the topics and topic sentiment detected by JST are indeed coherent and informative. We hypothesize that the JST model can readily meet the demand of large-scale sentiment analysis from the web in an open-ended fashion.
Resumo:
Structural analysis in handwritten mathematical expressions focuses on interpreting the recognized symbols using geometrical information such as relative sizes and positions of the symbols. Most existing approaches rely on hand-crafted grammar rules to identify semantic relationships among the recognized mathematical symbols. They could easily fail when writing errors occurred. Moreover, they assume the availability of the whole mathematical expression before being able to analyze the semantic information of the expression. To tackle these problems, we propose a progressive structural analysis (PSA) approach for dynamic recognition of handwritten mathematical expressions. The proposed PSA approach is able to provide analysis result immediately after each written input symbol. This has an advantage that users are able to detect any recognition errors immediately and correct only the mis-recognized symbols rather than the whole expression. Experiments conducted on 57 most commonly used mathematical expressions have shown that the PSA approach is able to achieve very good performance results.
Resumo:
Web APIs have gained increasing popularity in recent Web service technology development owing to its simplicity of technology stack and the proliferation of mashups. However, efficiently discovering Web APIs and the relevant documentations on the Web is still a challenging task even with the best resources available on the Web. In this paper we cast the problem of detecting the Web API documentations as a text classification problem of classifying a given Web page as Web API associated or not. We propose a supervised generative topic model called feature latent Dirichlet allocation (feaLDA) which offers a generic probabilistic framework for automatic detection of Web APIs. feaLDA not only captures the correspondence between data and the associated class labels, but also provides a mechanism for incorporating side information such as labelled features automatically learned from data that can effectively help improving classification performance. Extensive experiments on our Web APIs documentation dataset shows that the feaLDA model outperforms three strong supervised baselines including naive Bayes, support vector machines, and the maximum entropy model, by over 3% in classification accuracy. In addition, feaLDA also gives superior performance when compared against other existing supervised topic models.
Resumo:
Previous research into formulaic language has focussed on specialised groups of people (e.g. L1 acquisition by infants and adult L2 acquisition) with ordinary adult native speakers of English receiving less attention. Additionally, whilst some features of formulaic language have been used as evidence of authorship (e.g. the Unabomber’s use of you can’t eat your cake and have it too) there has been no systematic investigation into this as a potential marker of authorship. This thesis reports the first full-scale study into the use of formulaic sequences by individual authors. The theory of formulaic language hypothesises that formulaic sequences contained in the mental lexicon are shaped by experience combined with what each individual has found to be communicatively effective. Each author’s repertoire of formulaic sequences should therefore differ. To test this assertion, three automated approaches to the identification of formulaic sequences are tested on a specially constructed corpus containing 100 short narratives. The first approach explores a limited subset of formulaic sequences using recurrence across a series of texts as the criterion for identification. The second approach focuses on a word which frequently occurs as part of formulaic sequences and also investigates alternative non-formulaic realisations of the same semantic content. Finally, a reference list approach is used. Whilst claiming authority for any reference list can be difficult, the proposed method utilises internet examples derived from lists prepared by others, a procedure which, it is argued, is akin to asking large groups of judges to reach consensus about what is formulaic. The empirical evidence supports the notion that formulaic sequences have potential as a marker of authorship since in some cases a Questioned Document was correctly attributed. Although this marker of authorship is not universally applicable, it does promise to become a viable new tool in the forensic linguist’s tool-kit.
Resumo:
While much of a company's knowledge can be found in text repositories, current content management systems have limited capabilities for structuring and interpreting documents. In the emerging Semantic Web, search, interpretation and aggregation can be addressed by ontology-based semantic mark-up. In this paper, we examine semantic annotation, identify a number of requirements, and review the current generation of semantic annotation systems. This analysis shows that, while there is still some way to go before semantic annotation tools will be able to address fully all the knowledge management needs, research in the area is active and making good progress.
Resumo:
In this paper, we explore the idea of social role theory (SRT) and propose a novel regularized topic model which incorporates SRT into the generative process of social media content. We assume that a user can play multiple social roles, and each social role serves to fulfil different duties and is associated with a role-driven distribution over latent topics. In particular, we focus on social roles corresponding to the most common social activities on social networks. Our model is instantiated on microblogs, i.e., Twitter and community question-answering (cQA), i.e., Yahoo! Answers, where social roles on Twitter include "originators" and "propagators", and roles on cQA are "askers" and "answerers". Both explicit and implicit interactions between users are taken into account and modeled as regularization factors. To evaluate the performance of our proposed method, we have conducted extensive experiments on two Twitter datasets and two cQA datasets. Furthermore, we also consider multi-role modeling for scientific papers where an author's research expertise area is considered as a social role. A novel application of detecting users' research interests through topical keyword labeling based on the results of our multi-role model has been presented. The evaluation results have shown the feasibility and effectiveness of our model.