888 results for Corpus Theognideum
Abstract:
N-gram language models and lexicon-based word recognition are popular methods in the literature for improving the recognition accuracy of online and offline handwritten data. However, very few works apply these techniques to online Tamil handwritten data. In this paper, we explore methods for developing symbol-level language models and a lexicon from a large Tamil text corpus, and their application to improving symbol and word recognition accuracies. On a test database of around 2000 words, we find that bigram language models improve symbol (3%) and word (8%) recognition accuracies, and that while lexicon methods offer much greater improvements (30%) in word recognition, they depend heavily on choosing the right lexicon. For comparison with lexicon and language model based methods, we have also explored re-evaluation techniques, which use expert classifiers to improve symbol and word recognition accuracies.
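As an illustration of how a symbol-level bigram model can rescore recognizer output, here is a minimal sketch. It is not the paper's implementation: the add-one smoothing, the `lm_weight` parameter, the function names, and the toy Latin-letter data are all assumptions made for the example. Real Tamil input would need to be passed as pre-segmented symbol sequences rather than raw strings.

```python
import math
from collections import Counter

def train_bigram(words):
    """Count symbol contexts and symbol bigrams, with start/end markers."""
    contexts, bigrams = Counter(), Counter()
    for w in words:
        symbols = ["<s>"] + list(w) + ["</s>"]
        contexts.update(symbols[:-1])               # every symbol that has a successor
        bigrams.update(zip(symbols, symbols[1:]))
    vocab = {s for w in words for s in w} | {"</s>"}
    return contexts, bigrams, vocab

def log_prob(word, model):
    """Add-one smoothed bigram log-probability of a symbol sequence (assumed smoothing)."""
    contexts, bigrams, vocab = model
    symbols = ["<s>"] + list(word) + ["</s>"]
    total = 0.0
    for prev, cur in zip(symbols, symbols[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab)))
    return total

def rescore(candidates, model, lm_weight=1.0):
    """Return the candidate with the best recognizer score plus weighted LM score."""
    return max(candidates, key=lambda c: c[1] + lm_weight * log_prob(c[0], model))[0]

# Toy usage: (candidate, recognizer log-score) pairs, hypothetical values.
model = train_bigram(["tamil", "tamal", "tail", "mail"])
best = rescore([("tmail", -1.0), ("tamil", -1.2)], model)
```

In this toy case the language model outweighs the recognizer's slight preference for the implausible symbol order, so `best` is `"tamil"`.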
Abstract:
In this paper, we present a novel approach that uses topic models based on Latent Dirichlet Allocation (LDA) to generate single-document summaries. Our approach is distinguished from other LDA-based approaches in that we identify the summary topics that best describe a given document and extract sentences only from those paragraphs within the document that are highly correlated with the summary topics. This ensures that our summaries always highlight the crux of the document without paying any attention to the grammar and structure of the document. Finally, we evaluate our summaries on the DUC 2002 single-document summarization corpus using ROUGE measures. Our summaries had higher ROUGE values and better semantic similarity with the documents than the DUC summaries.
Abstract:
When a document corpus is very large, we often need to reduce the number of features. However, conventional Non-negative Matrix Factorization (NMF) cannot be applied to a billion-by-million matrix, as the matrix may not fit in memory. Here we present a novel Online NMF algorithm. Using Online NMF, we reduce the original high-dimensional space to a low-dimensional space and then cluster all the documents in the reduced space using the k-means algorithm. We show experimentally that good performance can be achieved by processing small subsets of documents. The proposed method outperforms existing algorithms.
Abstract:
There are many popular models for document classification, such as the Naïve Bayes classifier, k-Nearest Neighbors and the Support Vector Machine. In all these cases, the representation is based on the “Bag of Words” model. This model does not capture the actual semantic meaning of a word in a particular document. Semantics are better captured by the proximity of words and their occurrence in the document. We propose a new “Bag of Phrases” model to capture this discriminative power of phrases for text classification. We present a novel algorithm that extracts phrases from the corpus using the well-known topic model Latent Dirichlet Allocation (LDA) and integrates them into the vector space model for classification. Experiments show that classifiers perform better with the new Bag of Phrases model than with related representation models.
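The idea of extending a word-count vector with phrase counts can be sketched as follows. This is only an illustration of the representation: the paper extracts phrases with LDA, whereas this sketch swaps in a much simpler stand-in (adjacent word pairs that occur at least `min_count` times). The function names, the threshold, and the toy documents are assumptions.

```python
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer, kept deliberately simple."""
    return text.lower().split()

def frequent_phrases(docs, min_count=2):
    """Stand-in phrase extractor (not LDA): adjacent word pairs seen >= min_count times."""
    pairs = Counter()
    for doc in docs:
        toks = tokenize(doc)
        pairs.update(zip(toks, toks[1:]))
    return {pair for pair, count in pairs.items() if count >= min_count}

def bag_of_phrases(doc, phrases):
    """Feature counts over single words plus any known phrases found in the document."""
    toks = tokenize(doc)
    features = Counter(toks)
    for pair in zip(toks, toks[1:]):
        if pair in phrases:
            features[" ".join(pair)] += 1
    return features

# Toy corpus, hypothetical.
docs = ["support vector machine classifier",
        "support vector machine training",
        "vector space model"]
phrases = frequent_phrases(docs)
features = bag_of_phrases(docs[0], phrases)
```

The resulting `features` counter keeps the ordinary word counts and adds entries such as `"support vector"`, so any vector-space classifier can consume it unchanged.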
Abstract:
Latent variable methods, such as PLCA (Probabilistic Latent Component Analysis), have been successfully used for the analysis of non-negative signal representations. In this paper, we formulate PLCS (Probabilistic Latent Component Segmentation), which models each time frame of a spectrogram as a spectral distribution. Given the signal spectrogram, the segmentation boundaries are estimated using a maximum-likelihood (ML) approach. For an efficient solution, the algorithm imposes a hard constraint that each segment is modelled by a single latent component. The hard constraint allows ML boundary estimation to be solved using dynamic programming. Unlike earlier ML segmentation techniques, the PLCS framework does not impose a parametric assumption. PLCS can be naturally extended to model coarticulation between successive phones. Experiments on the TIMIT corpus show that the proposed technique is promising compared to most state-of-the-art speech segmentation algorithms.
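The role of the hard constraint can be illustrated with a small dynamic-programming sketch. It is written under stated assumptions and is not the PLCS algorithm itself: each segment's single component is taken to be the pooled, normalized spectrum of its frames, the number of segments is given in advance, and the function name and toy frames are hypothetical.

```python
import math

def segment(frames, num_segments):
    """Return the end index of each segment in the maximum-likelihood segmentation."""
    n, dim = len(frames), len(frames[0])
    # prefix[t][f] holds the sum of bin f over frames[0:t], so any segment sum is O(dim).
    prefix = [[0.0] * dim]
    for x in frames:
        prefix.append([p + v for p, v in zip(prefix[-1], x)])

    def seg_loglik(i, j):
        # Log-likelihood of frames[i:j] under their own pooled distribution.
        s = [b - a for a, b in zip(prefix[i], prefix[j])]
        total = sum(s)
        return sum(v * math.log(v / total) for v in s if v > 0)

    neg_inf = float("-inf")
    best = [[neg_inf] * (n + 1) for _ in range(num_segments + 1)]
    back = [[0] * (n + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                if best[k - 1][i] == neg_inf:
                    continue
                score = best[k - 1][i] + seg_loglik(i, j)
                if score > best[k][j]:
                    best[k][j], back[k][j] = score, i
    # Walk the back-pointers from the last frame to recover the boundaries.
    bounds, j = [], n
    for k in range(num_segments, 0, -1):
        bounds.append(j)
        j = back[k][j]
    return bounds[::-1]

# Toy spectrogram: three frames with low-frequency energy, then four with high-frequency energy.
frames = [[9, 1, 0]] * 3 + [[0, 1, 9]] * 4
boundaries = segment(frames, 2)
```

Because each segment's likelihood depends only on its own frames, the total likelihood decomposes over segments, which is what makes the exact search tractable.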
Abstract:
Scatter/Gather systems are becoming increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries or machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.
Abstract:
Automatic and accurate detection of the closure-burst transition events of stops and affricates serves many applications in speech processing. A temporal measure named the plosion index is proposed to detect such events, which are characterized by an abrupt increase in energy. Using the maxima of the pitch-synchronous normalized cross correlation as an additional temporal feature, a rule-based algorithm is designed that aims at selecting only those events associated with the closure-burst transitions of stops and affricates. The performance of the algorithm, characterized by receiver operating characteristic curves and temporal accuracy, is evaluated using the labeled closure-burst transitions of stops and affricates of the entire TIMIT test and training databases. The robustness of the algorithm is studied with respect to global white and babble noise as well as local noise using the TIMIT test set and on telephone quality speech using the NTIMIT test set. For these experiments, the proposed algorithm, which does not require explicit statistical training and is based on two one-dimensional temporal measures, gives a performance comparable to or better than the state-of-the-art methods. In addition, to test the scalability, the algorithm is applied on the Buckeye conversational speech corpus and databases of two Indian languages. (C) 2014 Acoustical Society of America.
Abstract:
This paper describes a spatio-temporal registration approach for speech articulation data obtained from electromagnetic articulography (EMA) and real-time Magnetic Resonance Imaging (rtMRI). This is motivated by the potential for combining the complementary advantages of both types of data. The registration method is validated on EMA and rtMRI datasets obtained at different times, but using the same stimuli. The aligned corpus offers the advantages of high temporal resolution (from EMA) and a complete mid-sagittal view (from rtMRI). The co-registration also yields optimum placement of EMA sensors as articulatory landmarks on the magnetic resonance images, thus providing richer spatio-temporal information about articulatory dynamics. (C) 2014 Acoustical Society of America
Abstract:
In several species, including the buffalo cow, prostaglandin (PG) F2α is the key molecule responsible for regression of the corpus luteum (CL). Experiments were carried out to characterize gene expression changes in the CL tissue at various time points after administration of a luteolytic dose of PGF2α in buffalo cows. Circulating progesterone levels decreased within 1 h of PGF2α treatment, and evidence of apoptosis was demonstrable at 18 h post treatment. Microarray analysis indicated expression changes in several immediate early genes and transcription factors within 3 h of treatment. Changes in the expression of genes associated with cell-to-cell signaling, cytokine signaling, steroidogenesis, PG synthesis and apoptosis were also observed. Analysis of various components of LH/CGR signaling in CL tissues indicated decreased LH/CGR protein expression, pCREB levels and PKA activity post PGF2α treatment. The novel finding of this study is the downregulation of CYP19A1 gene expression, accompanied by a decrease in the expression of E2 receptors and in circulating and intraluteal E2, post PGF2α treatment. Mining of microarray data revealed several differentially expressed E2-responsive genes. Since CYP19A1 gene expression is low in the bovine CL, microarray data from PGF2α-treated macaques, a species with high luteal CYP19A1 expression, were also mined; the differentially expressed E2-responsive genes correlated well between the two species. Taken together, the results of this study suggest that PGF2α interferes with luteotrophic signaling, impairs intraluteal E2 levels and regulates various signaling pathways before the effects on structural luteolysis are manifest.
Abstract:
USC-TIMIT is an extensive database of multimodal speech production data, developed to complement existing resources available to the speech research community and with the intention of being continuously refined and augmented. The database currently includes real-time magnetic resonance imaging data from five male and five female speakers of American English. Electromagnetic articulography data have so far also been collected from four of these speakers. The two modalities were recorded in two independent sessions while the subjects produced the same 460-sentence corpus used previously in the MOCHA-TIMIT database. In both cases the audio signal was recorded and synchronized with the articulatory data. The database and companion software are freely available to the research community. (C) 2014 Acoustical Society of America.
Abstract:
The electromagnetic articulography (EMA) technique is used to record the kinematics of different articulators while a person speaks. EMA data often contain missing segments due to sensor failure. In this work, we propose maximum a posteriori (MAP) estimation with a continuity constraint to recover the missing samples in articulatory trajectories recorded using EMA. This approach combines the benefits of statistical MAP estimation with the temporal continuity of the articulatory trajectories. Experiments on an articulatory corpus using different missing-segment durations show that the proposed continuity constraint yields a 30% reduction in average root mean squared estimation error compared with statistical estimation of missing segments without any continuity constraint.
Abstract:
In subject-independent acoustic-to-articulatory inversion, the articulatory kinematics of a test subject are estimated assuming that the training corpus does not include data from the test subject. The training corpus in subject-independent inversion (SII) is formed from acoustic and articulatory kinematics data, and the acoustic mismatch between training and test subjects is then estimated by an acoustic normalization using acoustic data drawn from a large pool of speakers called the generic acoustic space (GAS). In this work, we focus on improving SII performance through better acoustic normalization and adaptation. We propose unsupervised and several supervised ways of clustering the GAS for acoustic normalization. We adapt the acoustic models of the GAS using the acoustic data of the training and test subjects in SII. We find that SII performance improves significantly (~25% relative on average) over the subject-dependent inversion when the acoustic clusters in the GAS correspond to phonetic units (or states of 3-state phonetic HMMs) and when the acoustic model built on the GAS is adapted to the training and test subjects while optimizing the inversion criterion. (C) 2014 Elsevier B.V. All rights reserved.
Abstract:
Speech enhancement in stationary noise is addressed using the ideal channel selection framework. To estimate the binary mask, we propose to classify each time-frequency (T-F) bin of the noisy signal as speech or noise using Discriminative Random Fields (DRF). The DRF function contains two terms: an enhancement function and a smoothing term. On each T-F bin, we propose to use an enhancement function based on a likelihood ratio test for speech presence, while an Ising model is used as the smoothing function for spectro-temporal continuity in the estimated binary mask. Over successive iterations, the smoothing function is found to reduce musical noise compared with using the enhancement function alone. The binary mask is inferred from the noisy signal using the Iterated Conditional Modes (ICM) algorithm. Sentences from the NOIZEUS corpus are evaluated from 0 dB to 15 dB Signal to Noise Ratio (SNR) in four kinds of additive noise settings: additive white Gaussian noise, car noise, street noise and pink noise. The speech reconstructed using the proposed technique is evaluated in terms of average segmental SNR, Perceptual Evaluation of Speech Quality (PESQ) and Mean Opinion Score (MOS).
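The interplay between a per-bin likelihood-ratio term and an Ising smoothing term under ICM can be sketched as below. This is a minimal illustration under assumptions, not the paper's DRF formulation: the log-likelihood ratios are taken as given inputs, a 4-neighbour grid is used, and the weight `beta`, the function name, and the toy values are hypothetical.

```python
def icm_binary_mask(llr, beta=1.0, max_iters=10):
    """Estimate a binary speech mask from per-bin log-likelihood ratios llr[t][f]."""
    rows, cols = len(llr), len(llr[0])
    # Start from the unary decision alone: speech (1) where the ratio is positive.
    mask = [[1 if v > 0 else 0 for v in row] for row in llr]
    for _ in range(max_iters):
        changed = False
        for t in range(rows):
            for f in range(cols):
                # Ising term: neighbour labels mapped to -1/+1 and summed.
                agreement = 0
                for dt, df in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    tt, ff = t + dt, f + df
                    if 0 <= tt < rows and 0 <= ff < cols:
                        agreement += 2 * mask[tt][ff] - 1
                # ICM step: pick the label that is locally best given current neighbours.
                new_label = 1 if llr[t][f] + beta * agreement > 0 else 0
                if new_label != mask[t][f]:
                    mask[t][f] = new_label
                    changed = True
        if not changed:
            break
    return mask

# Toy grid: one weakly "noise" bin surrounded by confident "speech" bins.
llr = [[2.0, 2.0, 2.0],
       [2.0, -0.5, 2.0],
       [2.0, 2.0, 2.0]]
smoothed = icm_binary_mask(llr, beta=1.0)
unsmoothed = icm_binary_mask(llr, beta=0.0)
```

With `beta=0.0` the isolated bin keeps its own decision, which is the kind of isolated mask error associated with musical noise; with smoothing it is absorbed into the surrounding speech region.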
Abstract:
To evaluate the reproductive performance, production dynamics and milk quality of dairy genotypes under intensive management at the El Corpus del Meneo farm, data from the period 1997-2004 were used. The analysis covered 181 records of age at incorporation (EDADINC), 157 records of age at first calving (EPP), 543 records of number of services per conception (NSC), 341 records of calving interval (IEP), and 233 records from two milk weighings (December 2004 and January 2005) with their respective quality analyses, expressed as percentages of fat (%FAT), protein (%PROT), lactose (%LACT) and dry matter (%DRYM). The additive linear models included the effects of breed group (GRUPO), year of birth (ANACV), year of incorporation (AINCV), season of incorporation (EI), year of calving (APART), calving number (NUMPA), season of calving (EP), lactation period (PERL) and sex of the calf (SEXC), together with important interactions. For EDINCV, relevant differences were found among GRUPOS (P<0.068), ANACV (P<0.0001), AINCV (P<0.0001) and EN (P<0.0001). For EPP, there were important differences among AINCV (P<0.0001) and APART (P<0.0028) and for the interaction APARTxEP (P<0.0361). For NSC, there were statistical differences among APART (P<0.0138) and NUMPA (P<0.0074). For IEP, there were important differences among APART (P<0.0076) and NUMPA (P<0.0004) and for the interaction GRUPOxSEXC (P<0.0882). Least-squares means for EDADINC, EPP, NSC and IEP were 25.75±0.72 months, 35.71±0.88 months, 1.34±0.11 units and 12.82±0.35 months, respectively. For the production and quality variables, important differences were found among GRUPOS (P<0.0001 to P<0.026), NUMPA (P<0.0003 to P<0.0024), PERL (P<0.0000) and the interaction NUMPA*PERL (P<0.0000 to P<0.0082), although not among NUMPA for %DRYM. Estimated values were 8.79±0.29 kg, 4.14±0.09, 3.48±0.04, 4.31±0.02 and 12.56±0.11 for PLD, %FAT, %PROT, %LACT and %DRYM, respectively.
GRUPOS was not significant, but there was a marked tendency for GRUPO 5 (Brown Swiss) toward a lower EDADINC; GRUPO 1 (Holstein and crosses) showed lower EPP, and GRUPO 3 (Jersey and crosses) had the lowest NSC (greater technical efficiency) and IEP. The GRUPO with Holstein showed higher milk yields but lower quality, whereas GRUPOS 2 and 4 (Jersey and crosses, Brown Swiss and crosses) showed lower yields but higher overall quality. Over the years, the average weekly milk production per cow (PLVD) followed certain climatic events that determine feed availability and quality. Three production peaks (9.5, 10.0 and 9.7 kg) and three similar critical points (9.0-9.1 kg) were identified. The dairy genotypes studied show that, under dry-tropic conditions and intensive management, it is possible to achieve acceptable reproduction, production and milk quality parameters that exceed the national parameters.
Abstract:
When the media refer to political candidates during an election campaign, they project a certain image of them through the verbalizations that predominate in news messages. This work analyzes these media verbalizations within the framework of Agenda Setting theory, specifically its second level, which concerns the attributes or aspects that characterize the protagonists of the news, in this study politicians. The theory has been tested repeatedly since its authors, Maxwell McCombs and Donald Shaw, applied it to the 1968 United States elections, and it has spread steadily from its country of origin to other parts of the world. The general objective is to describe the image of the presidential candidates based on the expressions that predominated in the mass media during the presidential campaign held in Argentina in October 2011. The procedure consists of a survey carried out during the months preceding the presidential elections. This required assembling a corpus made up of the selected mass media to be analyzed. A content analysis of the corpus is then performed, followed by analysis of the data, which yields a database in which the unit of analysis was the mention of the various aspects or characteristics of the political candidates. These aspects or characteristics were taken from earlier studies that applied the same methodology and were also carried out in our country in electoral contexts. Comparisons over time are therefore feasible, since one of the great strengths of Agenda Setting theory is that it lends itself to comparison, being a systematic method of analysis.
The review of Agenda Setting theory that frames this research first introduces research on mass media in general, then focuses on the theory itself, and in particular on its second level, which deals with the attributes of public figures. A brief synopsis follows of research in Latin America and Argentina on topics related to electoral and non-electoral campaigns, leading finally to an account of work carried out in our country during political campaigns.