911 resultados para Statistical Language Model


Relevância:

80.00% 80.00%

Publicador:

Resumo:

Perez-Losada et al. [1] analyzed 72 complete genomes corresponding to nine mammalian (67 strains) and 2 avian (5 strains) polyomavirus species using maximum likelihood and Bayesian methods of phylogenetic inference. Because some data of 2 genomes in their work are now not available in GenBank, in this work, we analyze the phylogenetic relationship of the remaining 70 complete genomes corresponding to nine mammalian (65 strains) and two avian (5 strains) polyomavirus species using a dynamical language model approach developed by our group (Yu et al., [26]). This distance method does not require sequence alignment for deriving species phylogeny based on overall similarities of the complete genomes. Our best tree separates the bird polyomaviruses (avian polyomaviruses and goose hemorrhagic polymaviruses) from the mammalian polyomaviruses, which supports the idea of splitting the genus into two subgenera. Such a split is consistent with the different viral life strategies of each group. In the mammalian polyomavirus subgenera, mouse polyomaviruses (MPV), simian viruses 40 (SV40), BK viruses (BKV) and JC viruses (JCV) are grouped as different branches as expected. The topology of our best tree is quite similar to that of the tree constructed by Perez-Losada et al.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In recent times, the improved levels of accuracy obtained by Automatic Speech Recognition (ASR) technology has made it viable for use in a number of commercial products. Unfortunately, these types of applications are limited to only a few of the world’s languages, primarily because ASR development is reliant on the availability of large amounts of language specific resources. This motivates the need for techniques which reduce this language-specific, resource dependency. Ideally, these approaches should generalise across languages, thereby providing scope for rapid creation of ASR capabilities for resource poor languages. Cross Lingual ASR emerges as a means for addressing this need. Underpinning this approach is the observation that sound production is largely influenced by the physiological construction of the vocal tract, and accordingly, is human, and not language specific. As a result, a common inventory of sounds exists across languages; a property which is exploitable, as sounds from a resource poor, target language can be recognised using models trained on resource rich, source languages. One of the initial impediments to the commercial uptake of ASR technology was its fragility in more challenging environments, such as conversational telephone speech. Subsequent improvements in these environments has gained consumer confidence. Pragmatically, if cross lingual techniques are to considered a viable alternative when resources are limited, they need to perform under the same types of conditions. Accordingly, this thesis evaluates cross lingual techniques using two speech environments; clean read speech and conversational telephone speech. Languages used in evaluations are German, Mandarin, Japanese and Spanish. Results highlight that previously proposed approaches provide respectable results for simpler environments such as read speech, but degrade significantly when in the more taxing conversational environment. Two separate approaches for addressing this degradation are proposed. The first is based on deriving better target language lexical representation, in terms of the source language model set. The second, and ultimately more successful approach, focuses on improving the classification accuracy of context-dependent (CD) models, by catering for the adverse influence of languages specific phonotactic properties. Whilst the primary research goal in this thesis is directed towards improving cross lingual techniques, the catalyst for investigating its use was based on expressed interest from several organisations for an Indonesian ASR capability. In Indonesia alone, there are over 200 million speakers of some Malay variant, provides further impetus and commercial justification for speech related research on this language. Unfortunately, at the beginning of the candidature, limited research had been conducted on the Indonesian language in the field of speech science, and virtually no resources existed. This thesis details the investigative and development work dedicated towards obtaining an ASR system with a 10000 word recognition vocabulary for the Indonesian language.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

For the first time in human history, large volumes of spoken audio are being broadcast, made available on the internet, archived, and monitored for surveillance every day. New technologies are urgently required to unlock these vast and powerful stores of information. Spoken Term Detection (STD) systems provide access to speech collections by detecting individual occurrences of specified search terms. The aim of this work is to develop improved STD solutions based on phonetic indexing. In particular, this work aims to develop phonetic STD systems for applications that require open-vocabulary search, fast indexing and search speeds, and accurate term detection. Within this scope, novel contributions are made within two research themes, that is, accommodating phone recognition errors and, secondly, modelling uncertainty with probabilistic scores. A state-of-the-art Dynamic Match Lattice Spotting (DMLS) system is used to address the problem of accommodating phone recognition errors with approximate phone sequence matching. Extensive experimentation on the use of DMLS is carried out and a number of novel enhancements are developed that provide for faster indexing, faster search, and improved accuracy. Firstly, a novel comparison of methods for deriving a phone error cost model is presented to improve STD accuracy, resulting in up to a 33% improvement in the Figure of Merit. A method is also presented for drastically increasing the speed of DMLS search by at least an order of magnitude with no loss in search accuracy. An investigation is then presented of the effects of increasing indexing speed for DMLS, by using simpler modelling during phone decoding, with results highlighting the trade-off between indexing speed, search speed and search accuracy. The Figure of Merit is further improved by up to 25% using a novel proposal to utilise word-level language modelling during DMLS indexing. Analysis shows that this use of language modelling can, however, be unhelpful or even disadvantageous for terms with a very low language model probability. The DMLS approach to STD involves generating an index of phone sequences using phone recognition. An alternative approach to phonetic STD is also investigated that instead indexes probabilistic acoustic scores in the form of a posterior-feature matrix. A state-of-the-art system is described and its use for STD is explored through several experiments on spontaneous conversational telephone speech. A novel technique and framework is proposed for discriminatively training such a system to directly maximise the Figure of Merit. This results in a 13% improvement in the Figure of Merit on held-out data. The framework is also found to be particularly useful for index compression in conjunction with the proposed optimisation technique, providing for a substantial index compression factor in addition to an overall gain in the Figure of Merit. These contributions significantly advance the state-of-the-art in phonetic STD, by improving the utility of such systems in a wide range of applications.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This paper details the participation of the Australian e- Health Research Centre (AEHRC) in the ShARe/CLEF 2013 eHealth Evaluation Lab { Task 3. This task aims to evaluate the use of information retrieval (IR) systems to aid consumers (e.g. patients and their relatives) in seeking health advice on the Web. Our submissions to the ShARe/CLEF challenge are based on language models generated from the web corpus provided by the organisers. Our baseline system is a standard Dirichlet smoothed language model. We enhance the baseline by identifying and correcting spelling mistakes in queries, as well as expanding acronyms using AEHRC's Medtex medical text analysis platform. We then consider the readability and the authoritativeness of web pages to further enhance the quality of the document ranking. Measures of readability are integrated in the language models used for retrieval via prior probabilities. Prior probabilities are also used to encode authoritativeness information derived from a list of top-100 consumer health websites. Empirical results show that correcting spelling mistakes and expanding acronyms found in queries signi cantly improves the e ectiveness of the language model baseline. Readability priors seem to increase retrieval e ectiveness for graded relevance at early ranks (nDCG@5, but not precision), but no improvements are found at later ranks and when considering binary relevance. The authoritativeness prior does not appear to provide retrieval gains over the baseline: this is likely to be because of the small overlap between websites in the corpus and those in the top-100 consumer-health websites we acquired.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This paper presents our system to address the CogALex-IV 2014 shared task of identifying a single word most semantically related to a group of 5 words (queries). Our system uses an implementation of a neural language model and identifies the answer word by finding the most semantically similar word representation to the sum of the query representations. It is a fully unsupervised system which learns on around 20% of the UkWaC corpus. It correctly identifies 85 exact correct targets out of 2,000 queries, 285 approximate targets in lists of 5 suggestions.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Due to significant increase in vehicular accident and traffic congestions, vehicle to vehicle (V2V) communication based on the intelligent transport system (ITS) was introduced. However, to carry out efficient design and implementation of a reliable vehicular communication systems,a deep knowledge of the propagation channel characteristics in different environments is crucial, in particular the Doppler and pathloss parameters. Therefore, this paper presents an empirical V2V channel characterization and measurement performed under realistic urban, suburban and highway driving conditions in Brisbane, Australia. Based on Lin Cheng statistical Doppler Model (LCDM), values for the RMS Doppler spread and coherence time due to time selective nature of V2V channels were presented. Also, based on Log-distance power law model, values for the mean pathloss exponent and the standard deviation of shadowing were reported for urban, suburban and highway environments. The V2V channel parameters can be useful to system designers for the purpose of evaluating, simulating and developing new protocols and systems.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Recent advances in neural language models have contributed new methods for learning distributed vector representations of words (also called word embeddings). Two such methods are the continuous bag-of-words model and the skipgram model. These methods have been shown to produce embeddings that capture higher order relationships between words that are highly effective in natural language processing tasks involving the use of word similarity and word analogy. Despite these promising results, there has been little analysis of the use of these word embeddings for retrieval. Motivated by these observations, in this paper, we set out to determine how these word embeddings can be used within a retrieval model and what the benefit might be. To this aim, we use neural word embeddings within the well known translation language model for information retrieval. This language model captures implicit semantic relations between the words in queries and those in relevant documents, thus producing more accurate estimations of document relevance. The word embeddings used to estimate neural language models produce translations that differ from previous translation language model approaches; differences that deliver improvements in retrieval effectiveness. The models are robust to choices made in building word embeddings and, even more so, our results show that embeddings do not even need to be produced from the same corpus being used for retrieval.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Planar curves arise naturally as interfaces between two regions of the plane. An important part of statistical physics is the study of lattice models. This thesis is about the interfaces of 2D lattice models. The scaling limit is an infinite system limit which is taken by letting the lattice mesh decrease to zero. At criticality, the scaling limit of an interface is one of the SLE curves (Schramm-Loewner evolution), introduced by Oded Schramm. This family of random curves is parametrized by a real variable, which determines the universality class of the model. The first and the second paper of this thesis study properties of SLEs. They contain two different methods to study the whole SLE curve, which is, in fact, the most interesting object from the statistical physics point of view. These methods are applied to study two symmetries of SLE: reversibility and duality. The first paper uses an algebraic method and a representation of the Virasoro algebra to find common martingales to different processes, and that way, to confirm the symmetries for polynomial expected values of natural SLE data. In the second paper, a recursion is obtained for the same kind of expected values. The recursion is based on stationarity of the law of the whole SLE curve under a SLE induced flow. The third paper deals with one of the most central questions of the field and provides a framework of estimates for describing 2D scaling limits by SLE curves. In particular, it is shown that a weak estimate on the probability of an annulus crossing implies that a random curve arising from a statistical physics model will have scaling limits and those will be well-described by Loewner evolutions with random driving forces.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Recommender systems assist users in finding what they want. The challenging issue is how to efficiently acquire user preferences or user information needs for building personalized recommender systems. This research explores the acquisition of user preferences using data taxonomy information to enhance personalized recommendations for alleviating cold-start problem. A concept hierarchy model is proposed, which provides a two-dimensional hierarchy for acquiring user preferences. The language model is also extended for the proposed hierarchy in order to generate an effective recommender algorithm. Both Amazon.com book and music datasets are used to evaluate the proposed approach, and the experimental results show that the proposed approach is promising.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In a statistical downscaling model, it is important to remove the bias of General Circulations Model (GCM) outputs resulting from various assumptions about the geophysical processes. One conventional method for correcting such bias is standardisation, which is used prior to statistical downscaling to reduce systematic bias in the mean and variances of GCM predictors relative to the observations or National Centre for Environmental Prediction/ National Centre for Atmospheric Research (NCEP/NCAR) reanalysis data. A major drawback of standardisation is that it may reduce the bias in the mean and variance of the predictor variable but it is much harder to accommodate the bias in large-scale patterns of atmospheric circulation in GCMs (e.g. shifts in the dominant storm track relative to observed data) or unrealistic inter-variable relationships. While predicting hydrologic scenarios, such uncorrected bias should be taken care of; otherwise it will propagate in the computations for subsequent years. A statistical method based on equi-probability transformation is applied in this study after downscaling, to remove the bias from the predicted hydrologic variable relative to the observed hydrologic variable for a baseline period. The model is applied in prediction of monsoon stream flow of Mahanadi River in India, from GCM generated large scale climatological data.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We present a statistical methodology for leakage power estimation, due to subthreshold and gate tunneling leakage, in the presence of process variations, for 65 nm CMOS. The circuit leakage power variations is analyzed by Monte Carlo (MC) simulations, by characterizing NAND gate library. A statistical “hybrid model” is proposed, to extend this methodology to a generic library. We demonstrate that hybrid model based statistical design results in up to 95% improvement in the prediction of worst to best corner leakage spread, with an error of less than 0.5%, with respect to worst case design.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Climate change would significantly affect many hydrologic systems, which in turn would affect the water availability, runoff, and the flow in rivers. This study evaluates the impacts of possible future climate change scenarios on the hydrology of the catchment area of the TungaBhadra River, upstream of the Tungabhadra dam. The Hydrologic Engineering Center's Hydrologic Modeling System version 3.4 (HEC-HMS 3.4) is used for the hydrological modelling of the study area. Linear-regression-based Statistical DownScaling Model version 4.2 (SDSM 4.2) is used to downscale the daily maximum and minimum temperature, and daily precipitation in the four sub-basins of the study area. The large-scale climate variables for the A2 and B2 scenarios obtained from the Hadley Centre Coupled Model version 3 are used. After model calibration and testing of the downscaling procedure, the hydrological model is run for the three future periods: 20112040, 20412070, and 20712099. The impacts of climate change on the basin hydrology are assessed by comparing the present and future streamflow and the evapotranspiration estimates. Results of the water balance study suggest increasing precipitation and runoff and decreasing actual evapotranspiration losses over the sub-basins in the study area.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This paper presents an approach to model the expected impacts of climate change on irrigation water demand in a reservoir command area. A statistical downscaling model and an evapotranspiration model are used with a general circulation model (GCM) output to predict the anticipated change in the monthly irrigation water requirement of a crop. Specifically, we quantify the likely changes in irrigation water demands at a location in the command area, as a response to the projected changes in precipitation and evapotranspiration at that location. Statistical downscaling with a canonical correlation analysis is carried out to develop the future scenarios of meteorological variables (rainfall, relative humidity (RH), wind speed (U-2), radiation, maximum (Tmax) and minimum (Tmin) temperatures) starting with simulations provided by a GCM for a specified emission scenario. The medium resolution Model for Interdisciplinary Research on Climate GCM is used with the A1B scenario, to assess the likely changes in irrigation demands for paddy, sugarcane, permanent garden and semidry crops over the command area of Bhadra reservoir, India. Results from the downscaling model suggest that the monthly rainfall is likely to increase in the reservoir command area. RH, Tmax and Tmin are also projected to increase with small changes in U-2. Consequently, the reference evapotranspiration, modeled by the Penman-Monteith equation, is predicted to increase. The irrigation requirements are assessed on monthly scale at nine selected locations encompassing the Bhadra reservoir command area. The irrigation requirements are projected to increase, in most cases, suggesting that the effect of projected increase in rainfall on the irrigation demands is offset by the effect due to projected increase/change in other meteorological variables (viz., Tmax and Tmin, solar radiation, RH and U-2). The irrigation demand assessment study carried out at a river basin will be useful for future irrigation management systems. Copyright (c) 2012 John Wiley & Sons, Ltd.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We describe a method for text entry based on inverse arithmetic coding that relies on gaze direction and which is faster and more accurate than using an on-screen keyboard. These benefits are derived from two innovations: the writing task is matched to the capabilities of the eye, and a language model is used to make predictable words and phrases easier to write.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This paper discusses the development of the CU-HTK Mandarin Broadcast News (BN) transcription system. The Mandarin BN task includes a significant amount of English data. Hence techniques have been investigated to allow the same system to handle both Mandarin and English by augmenting the Mandarin training sets with English acoustic and language model training data. A range of acoustic models were built including models based on Gaussianised features, speaker adaptive training and feature-space MPE. A multi-branch system architecture is described in which multiple acoustic model types, alternate phone sets and segmentations can be used in a system combination framework to generate the final output. The final system shows state-of-the-art performance over a range of test sets. ©2006 British Crown Copyright.