9 results for Comparable corpora


Relevance:

20.00%

Publisher:

Abstract:

Experiments show that, for a large corpus, Zipf’s law does not hold for all word ranks: the frequencies fall below those predicted by Zipf’s law for ranks greater than about 5,000 word types in English and about 30,000 word types in the inflected languages Irish and Latin. It also does not hold for syllables or words in the syllable-based languages Chinese and Vietnamese. However, when single words are combined with word n-grams in one list and put in rank order, the frequency of tokens in the combined list extends Zipf’s law with a slope close to -1 on a log-log plot in all five languages. Further experiments have demonstrated the validity of this extension of Zipf’s law to n-grams of letters, phonemes or binary bits in English. It is shown theoretically that probability theory alone can predict this behavior in randomly created n-grams of binary bits.
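The combined rank-frequency list described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`ngram_counts`, `rank_frequency`, `loglog_slope`), the whitespace tokenization, and the plain least-squares fit are all assumptions made for illustration.

```python
from collections import Counter
import math

def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a single combined Counter."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def rank_frequency(counts):
    """Return frequencies sorted in descending order (index 0 is rank 1)."""
    return sorted(counts.values(), reverse=True)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks; a slope near -1 matches Zipf's law.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy input only; a real experiment would use a large corpus.
tokens = "a b a b a c".split()
combined = rank_frequency(ngram_counts(tokens, 2))
```

On a real corpus, `loglog_slope(combined)` would be compared with the slope of the words-only list to see whether adding n-grams brings it closer to -1.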

Relevance:

20.00%

Publisher:

Abstract:

To create smiling virtual characters, the morphological and dynamic characteristics of virtual characters’ smiles, and the impact of their smiling behavior on users, need to be identified. For this purpose, we have collected two corpora: one created directly by users and the other resulting from interaction between virtual characters and users. We present these two corpora in detail in the article.

Relevance:

20.00%

Publisher:

Abstract:

In this paper, a coupling of fluorophore-DNA barcodes and a bead-based immunoassay for detecting avian influenza virus (AIV) with PCR-like sensitivity is reported. The assay is based on the use of a sandwich immunoassay and fluorophore-tagged oligonucleotides as representative barcodes. The detection involves the sandwiching of the target AIV between magnetic immunoprobes and barcode-carrying immunoprobes. Because each barcode-carrying immunoprobe is functionalized with a multitude of fluorophore-DNA barcode strands, many DNA barcodes are released for each positive binding event, resulting in amplification of the signal. Using an inactivated H16N3 AIV as a model, a linear response over five orders of magnitude was obtained, and the sensitivity of the detection was comparable to that of conventional RT-PCR. Moreover, the entire detection required less than 2 h. The results indicate that the method has great potential as an alternative for surveillance of epidemic outbreaks caused by AIV, other viruses and microorganisms.

Relevance:

20.00%

Publisher:

Abstract:

We address the problem of mining interesting phrases from subsets of a text corpus, where the subset is specified using a set of features, such as keywords, that form a query. Previous algorithms for the problem sift through either a phrase-dictionary-based index or a document-based index, so the cost of the solution is linear in either the phrase dictionary size or the size of the document subset. We propose an independence assumption between query keywords given the top correlated phrases, under which the pre-processing can be reduced to discovering phrases from among the top phrases for each feature in the query. We then outline an indexing mechanism in which per-keyword phrase lists are stored either on disk or in memory, so that popular aggregation algorithms such as No Random Access and Sort-merge Join can be adapted to do the scoring in real time and identify the top interesting phrases. Though such an approach is expected to be approximate, we empirically illustrate that very high accuracies (of over 90%) are achieved against the results of exact algorithms. Due to the simplified list aggregation, we are also able to provide response times that are orders of magnitude better than those of state-of-the-art algorithms. Interestingly, our disk-based approach outperforms the in-memory baselines by up to a hundred times and sometimes more, confirming the superiority of the proposed method.
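The idea of aggregating per-keyword phrase lists can be illustrated with a minimal Python sketch. It is not the authors' index or scoring function. The `index` layout, the additive scoring (which assumes per-keyword scores can simply be summed, as log-probabilities would be under independence), and the name `top_phrases` are hypothetical. It uses a plain full merge of the lists rather than No Random Access or Sort-merge Join.

```python
import heapq
from collections import defaultdict

def top_phrases(index, query, k):
    """Return the k phrases with the highest summed per-keyword score.

    index: dict mapping keyword -> list of (phrase, score) pairs
           (hypothetical layout; a phrase missing from a list adds 0).
    query: iterable of keywords.
    """
    totals = defaultdict(float)
    for keyword in query:
        for phrase, score in index.get(keyword, []):
            totals[phrase] += score
    # Rank by total score; break ties by phrase text so output is deterministic.
    return heapq.nlargest(k, totals.items(), key=lambda kv: (kv[1], kv[0]))

# Toy per-keyword lists; the scores are made up for illustration.
index = {
    "a": [("x y", 0.75), ("p q", 0.5)],
    "b": [("p q", 0.5), ("x y", 0.125)],
}
best = top_phrases(index, ["a", "b"], 1)
```

A threshold-style algorithm such as No Random Access would read the sorted lists in parallel and stop early once the top k are certain. The full merge above trades that early termination for brevity.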