30 results for Old Norse language.
Abstract:
We present a new approach to spoken language modeling for language identification (LID) using the Lempel-Ziv-Welch (LZW) algorithm. The LZW technique is applicable to any kind of tokenization of the speech signal. Because the LZW algorithm efficiently extracts variable-length symbol strings from the training data, the LZW codebook captures the essentials of a language effectively. We develop two new deterministic measures for LID based on the LZW algorithm, namely (i) the compression ratio score (LZW-CR) and (ii) the weighted discriminant score (LZW-WDS). To assess these measures, we consider error-free tokenization of speech as well as artificially induced noise in the tokenization. It is shown that for a 6-language LID task on the OGI-TS database with clean tokenization, the new model (LZW-WDS) performs slightly better than the conventional bigram model. For noisy tokenization, which is the more realistic case, LZW-WDS significantly outperforms the bigram technique.
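The abstract does not define LZW-CR precisely, so the following is only a minimal sketch of how a compression-ratio score could work. It assumes the score is the number of input tokens divided by the number of codes emitted when a test sequence is parsed with a frozen, per-language LZW codebook. The function names `build_lzw_codebook` and `compression_ratio` are hypothetical and are not taken from the paper.

```python
def build_lzw_codebook(tokens, alphabet):
    """Standard LZW dictionary construction over a token sequence."""
    # Start with one entry per single symbol of the (assumed known) alphabet.
    codebook = {(s,): i for i, s in enumerate(sorted(alphabet))}
    current = ()
    for tok in tokens:
        candidate = current + (tok,)
        if candidate in codebook:
            current = candidate                    # keep extending the phrase
        else:
            codebook[candidate] = len(codebook)    # learn the new phrase
            current = (tok,)
    return codebook


def compression_ratio(tokens, codebook):
    """Greedy longest-match parse with a frozen codebook.

    Returns tokens per emitted code (an assumed definition of the score).
    Unseen single symbols are consumed one at a time.
    """
    n = len(tokens)
    emitted = 0
    i = 0
    while i < n:
        length = 1
        # LZW codebooks are prefix-closed, so extending one token at a time
        # finds the longest phrase starting at position i.
        while i + length < n and tuple(tokens[i:i + length + 1]) in codebook:
            length += 1
        emitted += 1
        i += length
    return n / emitted if emitted else 0.0


alphabet = set("abxyz")
book_a = build_lzw_codebook(list("abababababab"), alphabet)  # toy "language A"
book_b = build_lzw_codebook(list("xyzxyzxyzxyz"), alphabet)  # toy "language B"
test = list("ababab")
cr_a = compression_ratio(test, book_a)
cr_b = compression_ratio(test, book_b)
```

Under this assumed definition, the test sequence would be assigned to the language whose codebook compresses it best (here, toy language A).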
Abstract:
By means of N-body simulations we investigate the impact of minor mergers on the angular momentum and dynamical properties of the merger remnant. Our simulations cover a range of initial orbital characteristics and gas-to-stellar mass fractions (from 0 to 20%), and include star formation and supernova feedback. We confirm and extend previous results by showing that the specific angular momentum of the stellar component always decreases independently of the orbital parameters or morphology of the satellite, and that the decrease in the rotation velocity of the primary galaxy is accompanied by a change in the anisotropy of the orbits. However, the decrease affects only the old stellar population, and not the new population formed from gas during the merging process. This means that the merging process induces an increasing difference in the rotational support of the old and young stellar components, with the old one lagging with respect to the new. Even if our models are not intended specifically to reproduce the Milky Way and its accretion history, we find that, under certain conditions, the modeled rotational lag found is compatible with that observed in the Milky Way disk, thus indicating that minor mergers can be a viable way to produce it. The lag can increase with the vertical distance from the disk midplane, but only if the satellite is accreted along a direct orbit, and in all cases the main contribution to the lag comes from stars originally in the primary disk rather than from stars in the satellite galaxy. We also discuss the possibility of creating counter-rotating stars in the remnant disk, their fraction as a function of the vertical distance from the galaxy midplane, and the cumulative effect of multiple mergers on their creation.
Abstract:
Current scientific research is characterized by increasing specialization, accumulating knowledge at a high speed due to parallel advances in a multitude of sub-disciplines. Recent estimates suggest that human knowledge doubles every two to three years, and with the advances in information and communication technologies, this wide body of scientific knowledge is available to anyone, anywhere, anytime. This may also be referred to as ambient intelligence: an environment characterized by plentiful and available knowledge. The bottleneck in utilizing this knowledge for specific applications is not accessing the information but assimilating it and transforming it to suit the needs of a specific application. The increasingly specialized areas of scientific research often share the common goal of converting data into insight, allowing the identification of solutions to scientific problems. Due to this common goal, there are strong parallels between different areas of application that can be exploited and used to cross-fertilize different disciplines. For example, the same fundamental statistical methods are used extensively in speech and language processing, in materials science applications, in visual processing and in biomedicine. Each sub-discipline has found its own specialized methodologies that make these statistical methods successful for the given application. The unification of specialized areas is possible because many different problems share strong analogies, making the theories developed for one problem applicable to other areas of research. The goal of this paper is to demonstrate the utility of merging two disparate areas of application to advance scientific research. The merging process requires cross-disciplinary collaboration to allow maximal exploitation of advances in one sub-discipline for the benefit of another. We will demonstrate this general concept with the specific example of merging language technologies and computational biology.
Abstract:
Parallel sub-word recognition (PSWR) is a new model proposed for language identification (LID) that does not need elaborate phonetic labeling of the speech data in a foreign language. The new approach performs a front-end tokenization in terms of sub-word units, which are designed by automatic segmentation, segment clustering and segment HMM modeling. We develop PSWR-based LID in a framework similar to the parallel phone recognition (PPR) approach in the literature. This includes a front-end tokenizer and a back-end language model for each language to be identified. Considering various combinations of the statistical evaluation scores, it is found that PSWR can perform as well as PPR, even with broad acoustic sub-word tokenization, thus making it an efficient alternative to the PPR system.
Abstract:
In this paper we approach the problem of computing the characteristic polynomial of a matrix from the combinatorial viewpoint. We present several combinatorial characterizations of the coefficients of the characteristic polynomial, in terms of walks and closed walks of different kinds in the underlying graph. We develop algorithms based on these characterizations, and show that they tally with well-known algorithms arrived at independently from considerations in linear algebra.
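One way to see the link between closed walks and the characteristic polynomial is that the trace of A^k sums the weights of all closed walks of length k in the underlying graph, and Newton's identities turn those traces into coefficients. The sketch below uses that classical linear-algebra route. It is not the combinatorial algorithm developed in the paper, and the function names are hypothetical.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def char_poly_from_closed_walks(a):
    """Coefficients [1, c_1, ..., c_n] of det(xI - A) = x^n + c_1 x^(n-1) + ... + c_n.

    traces[k-1] = tr(A^k) is the total weight of closed walks of length k.
    Newton's identities give c_k = -(1/k) * sum_{i=1..k} c_{k-i} * tr(A^i).
    """
    n = len(a)
    power = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    traces = []
    for _ in range(n):
        power = mat_mul(power, a)
        traces.append(sum(power[i][i] for i in range(n)))

    coeffs = [Fraction(1)]
    for k in range(1, n + 1):
        s = sum(coeffs[k - i] * traces[i - 1] for i in range(1, k + 1))
        coeffs.append(-s / k)
    return coeffs


# Adjacency-style example: a 2-node graph with self-loops of weight 2.
example = char_poly_from_closed_walks([[2, 1], [1, 2]])  # x^2 - 4x + 3
```

Exact rational arithmetic (`Fraction`) is used because the division by k would otherwise introduce floating-point error for integer matrices.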
Abstract:
This paper presents the first stable isotope (δ¹⁸O and δ¹³C) data from a ~400-year-long (1590-2006 AD) speleothem record of annual to decadal resolution collected from the Indian Lesser Himalaya. The data range from -2.7 to -5.9‰ in δ¹⁸O and from -5.3 to -8.8‰ in δ¹³C. The isotopic analyses indicate that the climate during this period can be divided into two stages: a wet phase during the Little Ice Age (LIA) (1590-1850 AD) and a comparatively dry phase during the post-LIA period after 1850 AD. However, the record also documents minor dry events during the LIA and a wet episode after the LIA. Within the age uncertainty, the dry spells during the LIA are linked with historical drought events in the Indian subcontinent and at similar latitudes. The isotopic record is consistent with a number of previous studies in areas influenced by the Westerlies but appears to conflict with records from regions dominated by the Indian Summer Monsoon (ISM). This may be due to possible changes in the strength of the Westerlies in the study area, aided by a negative anomaly of the North Atlantic Oscillation (NAO) during the LIA. © 2012 Elsevier Ltd and INQUA. All rights reserved.
Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences
Abstract:
Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most genomic sequence patterns. We extended the suffix-array-based Biological Language Modeling Toolkit to compute n-gram frequencies as well as n-gram language-model-based perplexity in windows over the whole genome sequence to find biologically relevant patterns. We present the suite of tools and its application to the analysis of the whole human genome sequence.
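The idea of scanning a sequence with windowed n-gram perplexity can be sketched as follows. This is an illustration only: it counts n-grams with a plain dictionary rather than the suffix arrays the toolkit uses, and the add-one smoothing, window sizes, and function names are assumptions that do not come from the abstract.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Frequency of every n-gram (substring of length n) in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def window_perplexities(seq, n=3, window=20, step=10, alphabet="ACGT"):
    """Perplexity of each window under an add-one-smoothed n-gram model
    trained on the whole sequence. Returns a list of (start, perplexity)."""
    full = ngram_counts(seq, n)
    # Context counts: how often each (n-1)-gram is followed by some symbol.
    context = Counter()
    for gram, count in full.items():
        context[gram[:-1]] += count
    vocab = len(alphabet)

    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        log_prob = 0.0
        m = 0
        for i in range(len(chunk) - n + 1):
            gram = chunk[i:i + n]
            prob = (full[gram] + 1) / (context[gram[:-1]] + vocab)
            log_prob += math.log(prob)
            m += 1
        results.append((start, math.exp(-log_prob / m)))
    return results


# A perfectly periodic toy sequence: every window should look highly predictable.
scores = window_perplexities("ACGT" * 15)
```

Windows whose perplexity is unusually low relative to the rest of the sequence would be candidates for repetitive regions. What threshold counts as unusual is left open here.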
Abstract:
N-gram language models and lexicon-based word recognition are popular methods in the literature for improving recognition accuracy on online and offline handwritten data. However, very few works apply these techniques to online Tamil handwritten data. In this paper, we explore methods of developing symbol-level language models and a lexicon from a large Tamil text corpus, and their application to improving symbol and word recognition accuracies. On a test database of around 2000 words, we find that bigram language models improve symbol (3%) and word (8%) recognition accuracies. Lexicon methods offer much greater improvements (30%) in word recognition, but depend heavily on choosing the right lexicon. For comparison with lexicon and language-model-based methods, we have also explored re-evaluation techniques, which use expert classifiers to improve symbol and word recognition accuracies.
Abstract:
Polyhedral techniques for program transformation are now used in several proprietary and open source compilers. However, most of the research on polyhedral compilation has focused on imperative languages such as C, where the computation is specified in terms of statements with zero or more nested loops and other control structures around them. Graphical dataflow languages, where there is no notion of statements or a schedule specifying their relative execution order, have so far not been studied using a powerful transformation or optimization approach. The execution semantics and referential transparency of dataflow languages impose a different set of challenges. In this paper, we attempt to bridge this gap by presenting techniques that can be used to extract polyhedral representation from dataflow programs and to synthesize them from their equivalent polyhedral representation. We then describe PolyGLoT, a framework for automatic transformation of dataflow programs which we built using our techniques and other popular research tools such as Clan and Pluto. For the purpose of experimental evaluation, we used our tools to compile LabVIEW, one of the most widely used dataflow programming languages. Results show that dataflow programs transformed using our framework are able to outperform those compiled otherwise by up to a factor of seventeen, with a mean speed-up of 2.30x while running on an 8-core Intel system.
Abstract:
Carbon isotope compositions of carbonate rocks from the ~2.7-Ga-old Neoarchean Vanivilas Formation of the Dharwar Supergroup, presented earlier by us, are re-evaluated in this study, along with oxygen isotope compositions of a few silica-dolomite pairs. Such a revisit assumes significance in view of recent field evidence that suggests a glaciomarine origin for the matrix-supported conglomerate member, the Talya conglomerate, which underlies the carbonate rocks of the Vanivilas Formation. An in-depth analysis of the carbon isotope data reveals preservation of their pristine character despite the rocks having been subjected to metamorphism to different degrees (from lower greenschist to lower amphibolite facies). The dolomitic member of the Vanivilas Formation in the Marikanive area is characterized by highly depleted δ¹³C values (up to -5‰ VPDB) and merits consideration as the Indian example of a ca. 2.7-Ga-old cap carbonate. This inference is further supported by the estimated low temperature of equilibration documented by a few silica-dolomite pairs from the Vanivilas Formation collected near the Kalche area. These pairs show evidence for oxygen isotopic equilibrium at low temperatures (~0-20 °C) with depleted water (δ¹⁸O = -21‰ to -15‰ VSMOW) of glacial origin. We propose that the mineral pairs were deposited during the deglaciation period, when the ocean temperature was in its gradual restoration phase. The dolomite of the Marikanive area is the first record of cap carbonates of Neoarchean antiquity from the Indian subcontinent.
Abstract:
Many bacterial transcription factors do not behave as per the textbook operon model. We draw on whole genome work, as well as reported diversity across different bacteria, to argue that transcription factors may have evolved from nucleoid-associated proteins. This view would explain a large amount of recent data gleaned from high-throughput sequencing and bioinformatic analyses.
Abstract:
Identifying translations from comparable corpora is a well-known problem with several applications, e.g. dictionary creation in resource-scarce languages. Scarcity of high quality corpora, especially in Indian languages, makes this problem hard, e.g. state-of-the-art techniques achieve a mean reciprocal rank (MRR) of 0.66 for English-Italian, and a mere 0.187 for Telugu-Kannada. There exist comparable corpora in many Indian languages with other "auxiliary" languages. We observe that translations have many topically related words in common in the auxiliary language. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for translation induction. Extensive experiments on 35 comparable corpora using English and French as auxiliary languages show that this approach can yield dramatic improvements in performance (e.g. MRR improves by 124% to 0.419 for Telugu-Kannada). A user study on WikiTSu, a system for cross-lingual Wikipedia title suggestion that uses our approach, shows a 20% improvement in the quality of titles suggested.
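For readers unfamiliar with the metric, mean reciprocal rank can be computed as in the short sketch below. The data layout (a dictionary of ranked candidate lists and a dictionary of gold translations) and the placeholder words are assumptions made for illustration, and the paper's evaluation protocol may differ in detail.

```python
def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the correct translation over all source words.

    ranked_candidates: source word -> list of candidates, best first.
    gold: source word -> correct translation.
    A source word whose correct translation is missing contributes 0.
    """
    total = 0.0
    for source, candidates in ranked_candidates.items():
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == gold[source]:
                total += 1.0 / rank
                break
    return total / len(ranked_candidates)


# Hypothetical toy data: correct answer at rank 1, at rank 2, and absent.
ranked = {
    "w1": ["t1", "x", "y"],
    "w2": ["x", "t2", "y"],
    "w3": ["x", "y", "z"],
}
gold = {"w1": "t1", "w2": "t2", "w3": "t3"}
mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 0) / 3
```

The reported figures are consistent under this definition: a 124% improvement over an MRR of 0.187 gives 0.187 × 2.24 ≈ 0.419.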
Abstract:
Graph algorithms have been shown to possess enough parallelism to keep several computing resources busy, even hundreds of cores on a GPU. Unfortunately, tuning their implementation for efficient execution on a particular hardware configuration of heterogeneous systems consisting of multicore CPUs and GPUs is challenging, time consuming, and error prone. To address these issues, we propose a domain-specific language (DSL), Falcon, for implementing graph algorithms that (i) abstracts the hardware, (ii) provides constructs to write explicitly parallel programs at a higher level, and (iii) can work with general algorithms that may change the graph structure (morph algorithms). We illustrate the usage of our DSL to implement local computation algorithms (that do not change the graph structure) and morph algorithms such as Delaunay mesh refinement, survey propagation, and dynamic SSSP on GPU and multicore CPUs. Using a set of benchmark graphs, we illustrate that the generated code performs close to the state-of-the-art hand-tuned implementations.