995 results for identification schemes


Relevance:

70.00%

Publisher:

Abstract:

The scope of writer identification extends to broad domains such as digital rights administration, forensic expert decision-making systems, and document analysis systems. Because the success rate of a writer identification scheme depends heavily on the features extracted from the documents, feature extraction and the subsequent feature selection are critical phases of any such scheme. This paper addresses writer identification for the Malayalam language using a feature extraction technique, the Scale Invariant Feature Transform (SIFT). The schemes are tested on a test bed of 280 writers and performance evaluated

Relevance:

70.00%

Publisher:

Abstract:

The modelling of nonlinear stochastic dynamical processes from data involves solving the problems of data gathering, preprocessing, model architecture selection, learning or adaptation, parametric evaluation and model validation. For a given model architecture such as associative memory networks, a common problem in nonlinear modelling is "the curse of dimensionality". A series of complementary data-based constructive identification schemes, mainly based on but not limited to operating-point-dependent fuzzy models, are introduced in this paper with the aim of overcoming the curse of dimensionality. These include (i) a mixture of experts algorithm based on a forward constrained regression algorithm; (ii) an inherently parsimonious piecewise local linear modelling concept based on a Delaunay partition of the input space; (iii) a neurofuzzy model constructive approach based on forward orthogonal least squares and optimal experimental design; and finally (iv) a neurofuzzy model construction algorithm based on Bézier-Bernstein polynomial basis functions and additive decomposition. Illustrative examples demonstrate their applicability, showing that the final major hurdle in data-based modelling has almost been removed.

Relevance:

60.00%

Publisher:

Abstract:

It is very common in mathematics to construct surfaces by identifying the sides of a polygon together in pairs: For example, identifying opposite sides of a square yields a torus. In this article the construction is considered in the case where infinitely many pairs of segments around the boundary of the polygon are identified. The topological, metric, and complex structures of the resulting surfaces are discussed: In particular, a condition is given under which the surface has a global complex structure (i.e., is a Riemann surface). In this case, a modulus of continuity for a uniformizing map is given. The motivation for considering this construction comes from dynamical systems theory: If the modulus of continuity is uniform across a family of such constructions, each with an iteration defined on it, then it is possible to take limits in the family and hence to complete it. Such an application is briefly discussed.

Relevance:

30.00%

Publisher:

Abstract:

A dynamic modelling methodology, which combines on-line variable estimation and parameter identification with physical laws to form an adaptive model for rotary sugar drying processes, is developed in this paper. In contrast to the conventional rate-based models using empirical transfer coefficients, the heat and mass transfer rates are estimated by using on-line measurements in the new model. Furthermore, a set of improved sectional solid transport equations with localized parameters is developed in this work to replace the global correlation for the computation of solid retention time. Since a number of key model variables and parameters are identified on-line using measurement data, the model is able to closely track the dynamic behaviour of rotary drying processes within a broad range of operational conditions. This adaptive model is validated against experimental data obtained from a pilot-scale rotary sugar dryer. The proposed modelling methodology can be easily incorporated into nonlinear model-based control schemes to form a unified modelling and control framework.

Relevance:

30.00%

Publisher:

Abstract:

The use of molecular tools for genotyping Mycobacterium tuberculosis isolates in epidemiological surveys in order to identify clustered and orphan strains requires faster response times than those offered by the reference method, IS6110 restriction fragment length polymorphism (RFLP) genotyping. A method based on PCR, the mycobacterial interspersed repetitive-unit-variable-number tandem-repeat (MIRU-VNTR) genotyping technique, is an option for fast fingerprinting of M. tuberculosis, although precise evaluations of correlation between MIRU-VNTR and RFLP findings in population-based studies in different contexts are required before the methods are switched. In this study, we evaluated MIRU-VNTR genotyping (with a set of 15 loci [MIRU-15]) in parallel to RFLP genotyping in a 39-month universal population-based study in a challenging setting with a high proportion of immigrants. For 81.9% (281/343) of the M. tuberculosis isolates, both RFLP and MIRU-VNTR types were obtained. The percentages of clustered cases were 39.9% (112/281) and 43.1% (121/281) for RFLP and MIRU-15 analyses, and the numbers of clusters identified were 42 and 45, respectively. For 85.4% of the cases, the RFLP and MIRU-15 results were concordant, identifying the same cases as clustered and orphan (kappa, 0.7). However, for the remaining 14.6% of the cases, discrepancies were observed: 16 of the cases clustered by RFLP analysis were identified as orphan by MIRU-15 analysis, and 25 cases identified as orphan by RFLP analysis were clustered by MIRU-15 analysis. When discrepant cases showing subtle genotypic differences were tolerated, the discrepancies fell from 14.6% to 8.6%. Epidemiological links were found for 83.8% of the cases clustered by both RFLP and MIRU-15 analyses, whereas for the cases clustered by RFLP or MIRU-VNTR analysis alone, links were identified for only 30.8% or 38.9% of the cases, respectively. 
The latter group of cases mainly comprised isolates that could also have been clustered, if subtle genotypic differences had been tolerated. MIRU-15 genotyping seems to be a good alternative to RFLP genotyping for real-time interventional schemes. The correlation between MIRU-15 and IS6110 RFLP findings was reasonable, although some uncertainties as to the assignation of clusters by MIRU-15 analysis were identified.
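The agreement figures quoted above can be cross-checked from the counts given in the abstract. The following is a minimal sketch, not the authors' analysis: it assumes that the reported kappa is Cohen's kappa on the binary clustered/orphan classification, and it rebuilds the 2x2 table from the reported totals (281 isolates, 112 clustered by RFLP, 121 by MIRU-15, 16 and 25 discrepant cases). The function and variable names are illustrative.

```python
def cohen_kappa_2x2(both_pos, only_a, only_b, both_neg):
    """Cohen's kappa for two binary classifications given the four cells of a 2x2 table."""
    n = both_pos + only_a + only_b + both_neg
    observed = (both_pos + both_neg) / n          # observed agreement
    a_pos = both_pos + only_a                     # positives according to method A
    b_pos = both_pos + only_b                     # positives according to method B
    # Agreement expected by chance from the marginal totals.
    expected = (a_pos * b_pos + (n - a_pos) * (n - b_pos)) / (n * n)
    return (observed - expected) / (1 - expected)


# Counts taken from the abstract.
n_total = 281          # isolates typed by both methods
rflp_clustered = 112   # clustered by RFLP
miru_clustered = 121   # clustered by MIRU-15
rflp_only = 16         # clustered by RFLP, orphan by MIRU-15
miru_only = 25         # orphan by RFLP, clustered by MIRU-15

# Cells derived from the counts above (not stated directly in the abstract).
both_clustered = rflp_clustered - rflp_only
both_orphan = n_total - both_clustered - rflp_only - miru_only

concordance = (both_clustered + both_orphan) / n_total
kappa = cohen_kappa_2x2(both_clustered, rflp_only, miru_only, both_orphan)
print(round(concordance * 100, 1), round(kappa, 2))
```

Under these assumptions the derived table is internally consistent (96 cases clustered by both methods plus 25 clustered by MIRU-15 alone gives the reported 121), and the computed concordance and kappa agree with the 85.4% and 0.7 reported in the abstract.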

Relevance:

30.00%

Publisher:

Abstract:

Résumé Le cancer du sein est le cancer le plus commun chez les femmes et est responsable de presque 30% de tous les nouveaux cas de cancer en Europe. On estime que le nombre de décès liés au cancer du sein en Europe est de plus de 130.000 par an. Ces chiffres expliquent l'impact social considérable de cette maladie. Les objectifs de cette thèse étaient: (1) d'identifier les prédispositions et les mécanismes biologiques responsables de l'établissement des sous-types spécifiques de cancer du sein; (2) de les valider dans un modèle in vivo "humain-dans-souris"; et (3) de développer des traitements spécifiques à chaque sous-type de cancer du sein identifié. Le premier objectif a été atteint par l'intermédiaire de l'analyse des données d'expression de gènes des tumeurs, produites dans notre laboratoire. Les données obtenues par puces à ADN ont été produites à partir de 49 biopsies de tumeurs du sein provenant des patientes participant à l'essai clinique EORTC 10994/BIG00-01. Les données étaient très riches en information et m'ont permis de valider des données précédentes d'autres études d'expression des gènes dans des tumeurs du sein. De plus, cette analyse m'a permis d'identifier un nouveau sous-type biologique de cancer du sein. Dans la première partie de la thèse, je décris l'identification des tumeurs apocrines du sein par l'analyse des puces à ADN et les implications potentielles de cette découverte pour les applications cliniques. Le deuxième objectif a été atteint par l'établissement d'un modèle de cancer du sein humain, basé sur des cellules épithéliales mammaires humaines primaires (HMECs) dérivées de réductions mammaires. J'ai choisi d'adapter un système de culture des cellules en suspension basé sur des mammosphères précédemment décrit et j'ai décidé d'exprimer des gènes en utilisant des lentivirus. Dans la deuxième partie de ma thèse je décris l'établissement d'un système de culture cellulaire qui permet la transformation quantitative des HMECs. 
Par la suite, j'ai établi un modèle de xénogreffe dans les souris immunodéficientes NOD/SCID, qui permet de modéliser la maladie humaine chez la souris. Dans la troisième partie de ma thèse je décris et je discute les résultats que j'ai obtenus en établissant un modèle estrogène-dépendant de cancer du sein par transformation quantitative des HMECs avec des gènes définis, identifiés par analyse de données d'expression des gènes dans le cancer du sein. Les cellules transformées dans notre modèle étaient estrogène-dépendantes pour la croissance, diploïdes et génétiquement normales même après la culture cellulaire in vitro prolongée. Les cellules formaient des tumeurs dans notre modèle de xénogreffe et constituaient des métastases péritonéales disséminées et du foie. Afin d'atteindre le troisième objectif de ma thèse, j'ai défini et examiné des stratégies de traitement qui permettent de réduire les tumeurs et les métastases. J'ai produit un modèle de cancer du sein génétiquement défini et positif pour le récepteur de l'estrogène qui permet de modéliser le cancer du sein estrogène-dépendant humain chez la souris. Ce modèle permet l'étude des mécanismes impliqués dans la formation des tumeurs et des métastases. Abstract Breast cancer is the most common cancer in women and accounts for nearly 30% of all new cancer cases in Europe. The number of deaths from breast cancer in Europe is estimated to be over 130,000 each year, which underlines the social impact of the disease. The goals of this thesis were, first, to identify biological features and mechanisms responsible for the establishment of specific breast cancer subtypes; second, to validate them in a human-in-mouse in vivo model; and third, to develop specific treatments for the identified breast cancer subtypes. The first objective was achieved via the analysis of tumour gene expression data produced in our lab. 
The microarray data were generated from 49 breast tumour biopsies that were collected from patients enrolled in the clinical trial EORTC 10994/BIG00-01. The data set was very rich in information and allowed me to validate data of previous breast cancer gene expression studies and to identify biological features of a novel breast cancer subtype. In the first part of the thesis I focus on the identification of molecular apocrine breast tumours by microarray analysis and the potential implications of this finding for the clinic. The second objective was attained by the production of a human breast cancer model system based on primary human mammary epithelial cells (HMECs) derived from reduction mammoplasties. I chose to adopt a previously described suspension culture system based on mammospheres and expressed selected target genes using lentiviral expression constructs. In the second part of my thesis I mainly focus on the establishment of a cell culture system allowing for quantitative transformation of HMECs. I then established a xenograft model in immunodeficient NOD/SCID mice, which allows human disease to be modelled in a mouse. In the third part of my thesis I describe and discuss the results that I obtained while establishing an oestrogen-dependent model of breast cancer by quantitative transformation of HMECs with defined genes identified after breast cancer gene expression data analysis. The transformed cells in our model are oestrogen-dependent for growth and remain diploid and genetically normal even after prolonged cell culture in vitro. The cells form tumours and disseminated peritoneal and liver metastases in our xenograft model. In line with the third objective of my thesis, I defined and tested treatment schemes that reduce tumours and metastases. 
I have generated a genetically defined model of oestrogen receptor alpha positive human breast cancer that allows human oestrogen-dependent breast cancer to be modelled in a mouse and enables the study of mechanisms involved in tumorigenesis and metastasis.

Relevance:

30.00%

Publisher:

Abstract:

In November 2008, Colombian authorities dismantled a network of Ponzi schemes, causing hundreds of thousands of investors throughout the country to lose tens of millions of dollars. Using original data on the geographical incidence of the Ponzi schemes, this paper estimates the impact of their breakdown on crime. We find that the crash of the Ponzi schemes differentially exacerbated crime in affected districts. Confirming the intuition of the standard economic model of crime, this effect is only present in places with relatively weak judicial and law enforcement institutions, and with little access to consumption smoothing mechanisms such as microcredit. In addition, we show that, with the exception of economically motivated felonies such as robbery, violent crime is not affected by the negative shock.

Relevance:

30.00%

Publisher:

Abstract:

The development of protocols for the identification of metal phosphates in phosphate-treated, metal-contaminated soils is a necessary yet problematical step in the validation of remediation schemes involving immobilization of metals as phosphate phases. The potential for Raman spectroscopy to be applied to the identification of these phosphates in soils has yet to be fully explored. With this in mind, a range of synthetic mixed-metal hydroxylapatites has been characterized and added to soils at known concentrations for analysis using both bulk X-ray powder diffraction (XRD) and Raman spectroscopy. Mixed-metal hydroxylapatites in the binary series Ca-Cd, Ca-Pb, Ca-Sr and Cd-Pb, synthesized in the presence of acetate and carbonate ions, were characterized using a range of analytical techniques including XRD, analytical scanning electron microscopy (SEM), infrared spectroscopy (IR), inductively coupled plasma-atomic emission spectrometry (ICP-AES) and Raman spectroscopy. Only the Ca-Cd series displays complete solid solution, although under the synthesis conditions of this study the Cd5(PO4)3OH end member could not be synthesized as a pure phase. Within the Ca-Cd series the cell parameters, IR active modes and Raman active bands vary linearly as a function of Cd content. X-ray diffraction and extended X-ray absorption fine structure spectroscopy (EXAFS) suggest that the Cd is distributed across both the Ca(1) and Ca(2) sites, even at low Cd concentrations. In order to explore the likely detection limits for mixed-metal phosphates in soils for XRD and Raman spectroscopy, soils doped with mixed-metal hydroxylapatites at concentrations of 5, 1 and 0.5 wt.% were then studied. X-ray diffraction could not confirm unambiguously the presence or identity of mixed-metal phosphates in soils at concentrations below 5 wt.%. 
Raman spectroscopy proved a far more sensitive method for the identification of mixed-metal hydroxylapatites in soils, which could positively identify the presence of such phases in soils at all the dopant concentrations used in this study. Moreover, Raman spectroscopy could also provide an accurate assessment of the degree of chemical substitution in the hydroxylapatites even when present in soils at concentrations as low as 0.1%.

Relevance:

30.00%

Publisher:

Abstract:

Asynchronous Optical Sampling (ASOPS) [1,2] and frequency comb spectrometry [3] based on dual Ti:sapphire resonators operated in a master/slave mode have the potential to improve the signal-to-noise ratio in THz transient and IR spectrometry. The multimode Brownian oscillator time-domain response function described by state-space models is a mathematically robust framework that can be used to describe the dispersive phenomena governed by Lorentzian, Debye and Drude responses. In addition, the optical properties of an arbitrary medium can be expressed as a linear combination of simple multimode Brownian oscillator functions. The suitability of a range of signal processing schemes adopted from the Systems Identification and Control Theory community for further processing the recorded THz transients in the time or frequency domain will be outlined [4,5]. Since a femtosecond duration pulse is capable of persistent excitation of the medium within which it propagates, such an approach is perfectly justifiable. Several de-noising routines based on system identification will be shown. Furthermore, specifically developed apodization structures will be discussed. These are necessary because dispersion makes the time-domain background and sample interferograms non-symmetrical [6-8]. These procedures can lead to a more precise estimation of the complex insertion loss function. The algorithms are applicable to femtosecond spectroscopies across the EM spectrum. Finally, a methodology for femtosecond pulse shaping using genetic algorithms aiming to map and control molecular relaxation processes will be mentioned.

Relevance:

30.00%

Publisher:

Abstract:

Asynchronous Optical Sampling has the potential to improve the signal-to-noise ratio in THz transient spectrometry. The design of an inexpensive control scheme for synchronising two femtosecond pulse frequency comb generators at an offset frequency of 20 kHz is discussed. The suitability of a range of signal processing schemes adopted from the Systems Identification and Control Theory community for further processing recorded THz transients in the time and frequency domain is outlined. Finally, possibilities for femtosecond pulse shaping using genetic algorithms are mentioned.

Relevance:

30.00%

Publisher:

Abstract:

With the rapid development of proteomics, a number of different methods have appeared for the basic task of protein identification. We made a simple comparison between a common liquid chromatography-tandem mass spectrometry (LC-MS/MS) workflow using an ion trap mass spectrometer and a combined LC-MS and LC-MS/MS method using Fourier transform ion cyclotron resonance (FTICR) mass spectrometry and accurate peptide masses. To compare the two methods for protein identification, we grew E. coli and extracted its proteins using established protocols. Cystines were reduced and alkylated, and proteins digested by trypsin. The resulting peptide mixtures were separated by reversed-phase liquid chromatography using a 4 h gradient from 0 to 50% acetonitrile over a C18 reversed-phase column. The LC separation was coupled on-line to either a Bruker Esquire HCT ion trap or a Bruker 7 tesla APEX-Qe Qh-FTICR hybrid mass spectrometer. Data-dependent Qh-FTICR-MS/MS spectra were acquired using the quadrupole mass filter and collisionally induced dissociation into the external hexapole trap. In both schemes, proteins were identified by Mascot MS/MS ion searches, and the peptides identified from these proteins in the FTICR MS/MS data were used for automatic internal calibration of the FTICR-MS data, together with ambient polydimethylcyclosiloxane ions.

Relevance:

30.00%

Publisher:

Abstract:

The Agri-Environment Footprint Index (AFI) has been developed as a generic methodology to assess changes in the overall environmental impacts from agriculture at the farm level and to assist in the evaluation of European agri-environmental schemes (AES). The methodology is based on multicriteria analysis (MCA) and involves stakeholder participation to provide a locally customised evaluation based on weighted environmental indicators. The methodology was subjected to a feasibility assessment in a series of case studies across the EU. The AFI approach was able to measure significant differences in environmental status between farms that participated in an AES and nonparticipants. Wider environmental concerns, beyond the scheme objectives, were also considered in some case studies and the benefits for identification of unintentional (and often beneficial) impacts of AESs are presented. The participatory approach to AES valuation proved efficient in different environments and administrative contexts. The approach proved to be appropriate for environmental evaluation of complex agri-environment systems and can complement any evaluation conducted under the Common Monitoring and Evaluation Framework. The applicability of the AFI in routine monitoring of AES impacts and in providing feedback to improve policy design is discussed.

Relevance:

30.00%

Publisher:

Abstract:

The aim of this work was to identify markers associated with production traits in the pig genome using different approaches. We focused on the Italian Large White pig breed, using Genome Wide Association Studies (GWAS) and applying a selective genotyping approach to increase the power of the analyses. Furthermore, we searched the pig genome using Next Generation Sequencing (NGS) Ion Torrent technology to combine the selective genotyping approach with deep sequencing for SNP discovery. Two other studies were carried out with a different approach. Allele frequency changes for SNPs affecting candidate genes and at the genome-wide level were analysed to identify selection signatures driven by the selection program during the last 20 years. This approach confirmed that a great number of markers may affect production traits and that they are captured by the classical selection programs. GWAS revealed 123 significant or suggestively significant SNPs associated with Back Fat Thickness and 229 associated with Average Daily Gain. Sixteen Copy Number Variant Regions were more frequent in lean or fat pigs and showed that different copies of those regions could have a limited impact on fat. These often appear to be involved in food intake and behavior, besides affecting genes involved in metabolic pathways and their expression. By combining NGS sequencing with the selective genotyping approach, new variants were discovered, and at least 54 are worth analysing in association studies. The study of groups of pigs that had undergone stringent selection showed that the allele frequency of some loci can change drastically if they are close to traits that are of interest for selection schemes. These approaches could, in future, be integrated into genomic selection plans.

Relevance:

30.00%

Publisher:

Abstract:

Wind energy has been one of the fastest-growing sectors of the nation’s renewable energy portfolio for the past decade, and the same tendency is projected for the upcoming years given the aggressive governmental policies for the reduction of fossil fuel dependency. So-called Horizontal Axis Wind Turbine (HAWT) technologies have shown great technological promise and outstanding commercial penetration. Given this acceptance, the size of wind turbines has grown exponentially over time. However, safety and economic concerns have emerged as a result of the new design tendencies for massive-scale wind turbine structures with high slenderness ratios and complex shapes, typically located in remote areas (e.g. offshore wind farms). In this regard, safe operation requires not only first-hand information regarding actual structural dynamic conditions under aerodynamic action, but also a deep understanding of the environmental factors in which these multibody rotating structures operate. Given the cyclo-stochastic patterns of the wind loading exerting pressure on a HAWT, a probabilistic framework is appropriate to characterize the risk of failure in terms of resistance and serviceability conditions at any given time. Furthermore, sources of uncertainty such as material imperfections, buffeting and flutter, aeroelastic damping, gyroscopic effects and turbulence, among others, call for a more sophisticated mathematical framework that can properly handle all these sources of indetermination. The modeling complexity that arises as a result of these characterizations demands a data-driven experimental validation methodology to calibrate and corroborate the model. 
To this end, System Identification (SI) techniques offer a spectrum of well-established numerical methods appropriate for stationary, deterministic, and data-driven numerical schemes, capable of predicting actual dynamic states (eigenrealizations) of traditional time-invariant dynamic systems. Consequently, a modified data-driven SI metric is proposed, based on the so-called Subspace Realization Theory and now adapted for stochastic, non-stationary and time-varying systems, as is the case for a HAWT’s complex aerodynamics. Simultaneously, this investigation explores the characterization of the turbine loading and response envelopes for critical failure modes of the wind turbine's structural components. In the long run, both the aerodynamic framework (theoretical model) and system identification (experimental model) will be merged in a numerical engine formulated as a search algorithm for model updating, also known as the Adaptive Simulated Annealing (ASA) process. This iterative engine is based on a set of function minimizations computed by a metric called the Modal Assurance Criterion (MAC). In summary, the Thesis is composed of four major parts: (1) development of an analytical aerodynamic framework that predicts interacted wind-structure stochastic loads on wind turbine components; (2) development of a novel tapered-swept-curved Spinning Finite Element (SFE) that includes damped-gyroscopic effects and axial-flexural-torsional coupling; (3) a novel data-driven structural health monitoring (SHM) algorithm via stochastic subspace identification methods; and (4) a numerical search (optimization) engine based on ASA and MAC capable of updating the SFE aerodynamic model.
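For readers unfamiliar with the Modal Assurance Criterion, the sketch below shows its standard definition for real-valued mode shapes, MAC = (a·b)² / ((a·a)(b·b)), which equals 1 for shapes that differ only by a scale factor and 0 for orthogonal shapes. The abstract does not state how the thesis turns MAC into an objective for the ASA search; the `mac_cost` function (1 minus MAC) and the example vectors are assumptions made purely for illustration.

```python
def mac(phi_a, phi_b):
    """Modal Assurance Criterion between two real-valued mode-shape vectors."""
    if len(phi_a) != len(phi_b):
        raise ValueError("mode shapes must have the same number of components")
    dot_ab = sum(a * b for a, b in zip(phi_a, phi_b))
    dot_aa = sum(a * a for a in phi_a)
    dot_bb = sum(b * b for b in phi_b)
    return dot_ab ** 2 / (dot_aa * dot_bb)


def mac_cost(phi_model, phi_measured):
    """Hypothetical mismatch measure for model updating: 0 when the shapes coincide."""
    return 1.0 - mac(phi_model, phi_measured)


# Illustrative vectors only; they are not taken from the thesis.
identified_shape = [1.0, 2.0, 3.0]
scaled_model_shape = [2.0, 4.0, 6.0]     # same shape, different scaling
orthogonal_shape = [3.0, 0.0, -1.0]      # orthogonal to identified_shape

print(mac(identified_shape, scaled_model_shape))
print(mac(identified_shape, orthogonal_shape))
print(mac_cost(identified_shape, scaled_model_shape))
```

Because MAC is insensitive to scaling and sign, it compares the shape of an identified mode with that of a model mode without requiring a common normalization, which is one reason it is commonly used to pair experimental and analytical modes.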

Relevance:

30.00%

Publisher:

Abstract:

La última década ha sido testigo de importantes avances en el campo de la tecnología de reconocimiento de voz. Los sistemas comerciales existentes actualmente poseen la capacidad de reconocer habla continua de múltiples locutores, consiguiendo valores aceptables de error, y sin la necesidad de realizar procedimientos explícitos de adaptación. A pesar del buen momento que vive esta tecnología, el reconocimiento de voz dista de ser un problema resuelto. La mayoría de estos sistemas de reconocimiento se ajustan a dominios particulares y su eficacia depende de manera significativa, entre otros muchos aspectos, de la similitud que exista entre el modelo de lenguaje utilizado y la tarea específica para la cual se está empleando. Esta dependencia cobra aún más importancia en aquellos escenarios en los cuales las propiedades estadísticas del lenguaje varían a lo largo del tiempo, como por ejemplo, en dominios de aplicación que involucren habla espontánea y múltiples temáticas. En los últimos años se ha evidenciado un constante esfuerzo por mejorar los sistemas de reconocimiento para tales dominios. Esto se ha hecho, entre otros muchos enfoques, a través de técnicas automáticas de adaptación. Estas técnicas son aplicadas a sistemas ya existentes, dado que exportar el sistema a una nueva tarea o dominio puede requerir tiempo a la vez que resultar costoso. Las técnicas de adaptación requieren fuentes adicionales de información, y en este sentido, el lenguaje hablado puede aportar algunas de ellas. El habla no sólo transmite un mensaje, también transmite información acerca del contexto en el cual se desarrolla la comunicación hablada (e.g. acerca del tema sobre el cual se está hablando). Por tanto, cuando nos comunicamos a través del habla, es posible identificar los elementos del lenguaje que caracterizan el contexto, y al mismo tiempo, rastrear los cambios que ocurren en estos elementos a lo largo del tiempo. 
This information could be captured and exploited through information retrieval and machine learning techniques. Within the development of better automatic speech recognition systems, this could allow us to improve the adaptation of language models to the conditions of the context and therefore make the recognition system more robust in domains with variable conditions (such as potential variations in vocabulary, style, and topic).

In this sense, the main contribution of this Thesis is the proposal and evaluation of a contextualization framework motivated by topic analysis and based on the dynamic, unsupervised adaptation of language models to make an automatic speech recognition system more robust. This adaptation builds on different approaches from the systems mentioned above (information retrieval and machine learning), through which we seek to identify the topics being discussed in an audio recording. This identification, in turn, makes it possible to adapt the language model to the conditions of the context. The proposed contextualization framework can be divided into two main systems: a topic identification system and a dynamic language model adaptation system. This Thesis can be described in detail from the perspective of the particular contributions made in each of the fields that make up the proposed framework:

• Regarding the topic identification system, we have focused on improving document preprocessing techniques and on contributing to the definition of more robust criteria for the selection of index-terms.
  – The efficiency of systems based on either information retrieval or machine learning techniques, and specifically of those devoted to the topic identification task, depends to a large extent on the preprocessing mechanisms applied to the documents. Among the many operations that make up a preprocessing scheme, an adequate selection of index-terms is crucial for establishing semantic and conceptual relationships between terms and documents. This process can also be affected either by a poor choice of stopwords or by a lack of precision in the definition of lemmatization rules. In this regard, we compare and evaluate different criteria for preprocessing the documents, as well as different strategies for selecting the index-terms. This allows us not only to reduce the size of the indexing structure but also to improve the topic identification process.
  – One of the most important aspects affecting the performance of topic identification systems is the assignment of different weights to terms according to their contribution to the content of the document. In this work we evaluate and propose alternative approaches to traditional term weighting schemes (such as tf-idf) that allow us to improve the specificity of the terms and to better discriminate the topics of the documents.
• Regarding the dynamic adaptation of language models, we have divided the contextualization process into several steps.
  – For the generation of topic-based language models, we propose two approaches: a supervised one and an unsupervised one. In the first, we rely on the topic labels that originally accompany the documents of the corpus we use. From these labels, we group the documents that belong to the same topic and generate language models from those groups. However, one of the objectives of this Thesis is to evaluate whether using these labels to generate models is optimal in terms of recognizer performance. For this reason, we propose a second, unsupervised approach, in which the objective is to group the documents automatically into topic clusters based on the semantic similarity between them. By means of clustering approaches we improved the conceptual and semantic cohesion within each cluster, which in turn allowed us to refine the topic-based language models and improve the performance of the recognition system.
  – We develop several strategies for generating a context-dependent language model. Our aim is for this model to reflect the semantic context of the speech, i.e. the most relevant topics being discussed. This model is generated by linear interpolation between the topic-based language models related to the most relevant topics. The estimation of the interpolation weights is based mainly on the outcome of the topic identification process.
  – Finally, we propose a methodology for the dynamic adaptation of a general language model. The adaptation process takes into account not only the context-dependent model but also the information provided by the topic identification process. The scheme used for the adaptation is a linear interpolation between the general model and the context-dependent model. We also study different approaches for determining the interpolation weights between the two models.
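As a rough illustration of the term weighting and topic identification steps listed above, the following sketch computes the traditional tf-idf weights that the Thesis takes as its baseline and compares documents by cosine similarity. It is not the weighting scheme proposed in the Thesis; the function names and the toy documents are hypothetical, and documents are assumed to be non-empty lists of tokens that have already been preprocessed (stopword removal, lemmatization).

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: one token list per document.
docs = [["budget", "tax", "budget"], ["fishing", "quota", "tax"]]
vectors = tf_idf_vectors(docs)
```

In this toy corpus the term "tax" occurs in every document, so its idf factor is log(1) = 0 and it contributes nothing to the comparison. That is the lack of specificity that alternative weighting schemes try to address.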
Having defined the theoretical basis of our contextualization framework, we propose its application within an automatic speech recognition system. To this end, we focus on two aspects: the contextualization of the language models used by the system and the incorporation of semantic information into the topic-based adaptation process. In this Thesis we propose an experimental framework based on a ‘two-stage’ recognition architecture. In the first stage, we use systems based on information retrieval and machine learning techniques to identify the topics discussed in a transcription of an audio segment. This transcription is generated by the recognition system using a general language model. The dynamic adaptation of the language model is then carried out according to the relevance of the topics that have been identified. In the second stage of the recognition architecture, we use this adapted model to recognize the audio segment again.

To determine the benefits of the proposed framework, we evaluate each of the main systems mentioned above. This evaluation is carried out on speeches in the political domain using the EPPS (European Parliamentary Plenary Sessions) database from the European TC-STAR project. We analyse several metrics of system performance and evaluate the proposed improvements with respect to the reference systems.

ABSTRACT

The last decade has witnessed major advances in speech recognition technology. Today’s commercial systems are able to recognize continuous speech from numerous speakers, with acceptable levels of error and without the need for an explicit adaptation procedure. Despite this progress, speech recognition is far from being a solved problem.
Most of these systems are adjusted to a particular domain, and their efficacy depends significantly, among many other aspects, on the similarity between the language model used and the task being addressed. This dependence is even more important in scenarios where the statistical properties of the language fluctuate over time, for example, in application domains involving spontaneous and multitopic speech. In recent years there has been an increasing effort to enhance speech recognition systems for such domains. This has been done, among other approaches, by means of automatic adaptation techniques. These techniques are applied to existing systems, especially since porting a system to a new task or domain may be both time-consuming and expensive. Adaptation techniques require additional sources of information, and spoken language can provide some of them. Speech not only conveys a message; it also provides information on the context in which the spoken communication takes place (e.g. on the subject being talked about). Therefore, when we communicate through speech, it could be feasible to identify the elements of the language that characterize the context and, at the same time, to track the changes that occur in those elements over time. This information can be extracted and exploited through information retrieval and machine learning techniques. Within the development of more robust speech recognition systems, this allows us to enhance the adaptation of language models to the conditions of the context, thus strengthening the recognition system for domains under changing conditions (such as potential variations in vocabulary, style and topic).
In this sense, the main contribution of this Thesis is the proposal and evaluation of a framework of topic-motivated contextualization based on the dynamic and unsupervised adaptation of language models for the enhancement of an automatic speech recognition system. This adaptation is based on a combined approach (from the perspective of both the information retrieval and machine learning fields) whereby we identify the topics that are being discussed in an audio recording. The topic identification, therefore, enables the system to adapt the language model according to the contextual conditions. The proposed framework can be divided into two major systems: a topic identification system and a dynamic language model adaptation system. This Thesis can be outlined from the perspective of the particular contributions made in each of the fields that compose the proposed framework:

• Regarding the topic identification system, we have focused on enhancing the document preprocessing techniques and on contributing to the definition of more robust criteria for the selection of index-terms.
  – Within both information retrieval and machine learning based approaches, the efficiency of topic identification systems depends, to a large extent, on the preprocessing mechanisms applied to the documents. Among the many operations that make up the preprocessing procedures, an adequate selection of index-terms is critical to establish conceptual and semantic relationships between terms and documents. This process might also be weakened by a poor choice of stopwords or a lack of precision in defining stemming rules. In this regard we compare and evaluate different criteria for preprocessing the documents, as well as for improving the selection of the index-terms. This allows us not only to reduce the size of the indexing structure but also to strengthen the topic identification process.
  – One of the most crucial aspects for the performance of topic identification systems is the assignment of different weights to different terms depending on their contribution to the content of the document. In this sense we evaluate and propose alternative approaches to traditional weighting schemes (such as tf-idf) that allow us to improve the specificity of terms and to better identify the topics that are related to documents.
• Regarding the dynamic language model adaptation, we divide the contextualization process into different steps.
  – We propose supervised and unsupervised approaches for the generation of topic-based language models. The first is intended to generate topic-based language models by grouping the documents in the training set according to the original topic labels of the corpus. Nevertheless, a goal of this Thesis is to evaluate whether the use of these labels to generate language models is optimal in terms of recognition accuracy. For this reason, we propose a second, unsupervised approach, in which the objective is to group the data in the training set into automatic topic clusters based on the semantic similarity between the documents. By means of clustering approaches we expect to obtain a more cohesive association of the documents that are related by similar concepts, thus improving the coverage of the topic-based language models and enhancing the performance of the recognition system.
  – We develop various strategies to create a context-dependent language model. Our aim is for this model to reflect the semantic context of the current utterance, i.e. the most relevant topics that are being discussed. This model is generated by means of a linear interpolation between the topic-based language models related to the most relevant topics. The estimation of the interpolation weights is based mainly on the outcome of the topic identification process.
  – Finally, we propose a methodology for the dynamic adaptation of a background language model. The adaptation process takes into account the context-dependent model as well as the information provided by the topic identification process. The scheme used for the adaptation is a linear interpolation between the background model and the context-dependent one. We also study different approaches to determine the interpolation weights used in this adaptation scheme.

Having defined the basis of our topic-motivated contextualization framework, we propose its application in an automatic speech recognition system. We focus on two aspects: the contextualization of the language models used by the system, and the incorporation of semantic information into a topic-based adaptation process. To achieve this, we propose an experimental framework based on a ‘two-stage’ recognition architecture. In the first stage of the architecture, information retrieval and machine learning techniques are used to identify the topics in a transcription of an audio segment. This transcription is generated by the recognition system using a background language model. The dynamic language model adaptation is then carried out according to the confidence in the topics that have been identified. In the second stage of the recognition architecture, the adapted language model is used to re-decode the utterance.

To test the benefits of the proposed framework, we evaluate each of the major systems mentioned above. The evaluation is conducted on speeches in the political domain using the EPPS (European Parliamentary Plenary Sessions) database from the European TC-STAR project. We analyse several performance metrics that allow us to compare the improvements of the proposed systems against the baseline ones.
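The two interpolation steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the Thesis: language models are reduced to unigram probability dictionaries (a real system would interpolate n-gram conditional probabilities), the topic scores are assumed to come from the first-stage topic identification, and the names `topic_weights`, `adapt`, `top_k` and the fixed weight `lam` are hypothetical. The Thesis itself studies several ways of estimating these weights.

```python
def interpolate(models, weights):
    """Linear interpolation: P(w) = sum_i weights[i] * P_i(w)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocab = set().union(*models)
    return {
        word: sum(lam * model.get(word, 0.0)
                  for lam, model in zip(weights, models))
        for word in vocab
    }


def topic_weights(scores, top_k=2):
    """Normalize the top_k topic scores into interpolation weights."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(score for _, score in top)
    if total <= 0.0:
        raise ValueError("at least one topic score must be positive")
    return {topic: score / total for topic, score in top}


def adapt(background, topic_models, scores, lam=0.5, top_k=2):
    """Build the context-dependent model, then mix it with the background."""
    weights = topic_weights(scores, top_k)
    context = interpolate([topic_models[t] for t in weights],
                          list(weights.values()))
    # lam is the (assumed fixed) weight given to the context-dependent model.
    return interpolate([background, context], [1.0 - lam, lam])


# Hypothetical toy models and first-stage topic scores.
background = {"the": 0.5, "tax": 0.25, "fish": 0.25}
topic_models = {
    "economy": {"the": 0.5, "tax": 0.5},
    "fisheries": {"the": 0.5, "fish": 0.5},
    "other": {"the": 1.0},
}
scores = {"economy": 0.6, "fisheries": 0.2, "other": 0.1}
adapted = adapt(background, topic_models, scores)
```

Because every component model is a probability distribution and each set of weights sums to one, the adapted model is again a distribution. In the two-stage architecture, a model of this kind would replace the background model for the second decoding pass.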