71 results for speech recognition
Abstract:
Speaker diarization is the process of segmenting speech according to who is speaking. Diarization makes it possible to search for and retrieve what a particular speaker said in a meeting. Applications of diarization systems extend beyond meetings to other domains, such as lectures, telephone calls, television, and radio. Diarization also improves the performance of several speech technologies, such as speaker recognition, automatic transcription, and speaker tracking. Methodologies previously used to develop diarization systems are discussed, and prior results and techniques are studied and compared. Methods used in speaker recognition and other speech technologies, such as Hidden Markov Models and Gaussian Mixture Models, are also used in speaker diarization. The objective of this thesis is to develop a speaker diarization system for the meeting domain. The experimental part of this work indicates that the zero-crossing rate can be used effectively to break the audio stream into segments, and that adaptive Gaussian models adequately fit short audio segments. The results show that 35 Gaussian models and an average segment length of one second are the optimal values for building a diarization system for the tested data. Segments uttered by the same speaker are merged by bottom-up clustering, using a new approach to categorizing the mixture weights.
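As a rough illustration of how the zero-crossing rate (ZCR) can flag candidate segment boundaries, the sketch below computes a frame-wise ZCR and marks a boundary wherever it jumps by more than a threshold. The frame length, the jump threshold and the boundary rule itself are assumptions for illustration; the abstract does not describe the segmentation rule actually used in the thesis.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def frame_signal(samples, frame_len):
    """Split samples into non-overlapping frames; an incomplete tail is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def zcr_boundaries(samples, frame_len=160, jump=0.2):
    """Return frame indices where the ZCR changes by more than `jump`.

    Hypothetical defaults: frame_len=160 is 10 ms at 16 kHz.
    """
    rates = [zero_crossing_rate(f) for f in frame_signal(samples, frame_len)]
    return [i for i in range(1, len(rates)) if abs(rates[i] - rates[i - 1]) > jump]
```

A slowly varying signal followed by a rapidly alternating one produces a single boundary at the frame where the change happens. In a real system these candidate boundaries would then be passed to the Gaussian-model stage rather than used directly.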
Abstract:
The flow of information within modern information society has increased rapidly over the last decade. Most of this information flow relies on the individual's ability to handle text or speech input. For most of us this presents no problems, but some individuals would benefit from other means of conveying information, e.g. signed information flow. Over recent decades, new results from various disciplines have all pointed towards a common background and processing for sign and speech, and this was one of the key issues that I wanted to investigate further in this thesis. The basis of this thesis lies firmly within speech research, which is why I wanted to design test batteries for signers that are analogous to widely used speech perception tests, to find out whether signers' results would be the same as speakers' results in perception tests. One of the key findings within biology, and more precisely in its effects on speech and communication research, is the mirror neuron system. That finding has enabled us to form new theories about the evolution of communication, and they all seem to converge on the hypothesis that all human communication has a common core. In this thesis, speech and sign are discussed as equal and analogous counterparts of communication, and all research methods used for speech are adapted for sign. Both speech and sign are thus investigated using similar test batteries. Furthermore, the production and perception of speech and sign are studied separately. An additional framework for studying production is provided by gesture research using cry sounds. The results of the cry sound research are then compared with results from children acquiring sign language. These results show that individuality manifests itself very early in human development. Articulation in adults, in both speech and sign, is studied from two perspectives: normal production, and re-learning production when the articulatory apparatus has been changed.
Normal production is studied in both speech and sign, and the effects of changed articulation are studied for speech. Both studies use carrier sentences. In addition, sign production is studied by giving the informants the opportunity for spontaneous production. The production data from the signing informants is also used as the basis for the sign synthesis stimuli in the sign perception test battery. Speech and sign perception were studied using the informants' answers in forced-choice identification and discrimination tasks. These answers were then compared across language modalities. Three informant groups participated in the sign perception tests: native signers, sign language interpreters, and Finnish adults with no knowledge of any signed language. This made it possible to investigate which characteristics of the results were due to the language per se and which were due to the change in modality itself. As the analogous test batteries yielded similar results across informant groups, some common threads could be observed. From very early on in the acquisition of speech and sign, the results were highly individual. However, the results within one individual were the same when the same test was repeated. This individuality followed the same patterns across language modalities and, on some occasions, across language groups. Because both modalities yield similar answers to analogous study questions, this has led us to provide methods for basic input for sign language applications, i.e. signing avatars. It has also answered questions about the precision of the animation and its intelligibility for users: which parameters govern the intelligibility of synthesised speech or sign, and how precise the animation or synthetic speech must be in order to be intelligible.
The results also give further support to the well-known fact that intelligibility is not the same as naturalness. In some cases, as shown in the design of the sign perception test battery, naturalness decreases intelligibility. This also has to be taken into account when designing applications. All in all, the results from each test battery, whether for signers or speakers, yield strikingly similar patterns, which lends further support to a common core for all human communication. Thus, the phonetic framework models for human communication can be modified and deepened on the basis of the knowledge obtained from the test batteries in this thesis.
Abstract:
During a possible loss-of-coolant accident in a BWR (boiling water reactor), a large amount of steam will be released from the reactor pressure vessel into the suppression pool. The steam condenses in the suppression pool, causing dynamic and structural loads on the pool. The formation and break-up of bubbles can be measured by visual observation using a suitable pattern recognition algorithm. The aim of this study was to improve, using MATLAB, the preliminary pattern recognition algorithm developed by Vesa Tanskanen in his doctoral dissertation. Video material from the PPOOLEX test facility, recorded during thermal stratification and mixing experiments, was used as a reference in developing the algorithm. The developed algorithm consists of two parts: pattern recognition of the bubbles, and analysis of the recognized bubble images. Bubble recognition works well, but some errors appear because of the complex structure of the pool. The results of the image analysis were reasonable, although the volume and surface area of the bubbles were not evaluated. Chugging frequencies calculated using the FFT agreed well with the oscillation frequencies measured in the experiments. The pattern recognition algorithm works under the conditions it was designed for; if the measurement configuration is changed, some modifications will be needed. Numerous improvements are proposed for the future 3D equipment.
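The chugging-frequency step amounts to finding the dominant peak in the spectrum of a per-frame quantity extracted from the bubble images. The sketch below assumes that quantity is something like projected bubble area and that the video frame rate is known; both are assumptions, since the abstract does not say which signal was transformed. It uses a direct discrete Fourier transform, which gives the same spectrum as an FFT (only more slowly), to stay dependency-free; the thesis itself used MATLAB.

```python
import cmath
import math


def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the largest non-DC spectral peak.

    Direct O(n^2) DFT over the positive-frequency bins; numerically the
    same spectrum an FFT would produce.
    """
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]  # remove the DC component
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(centred[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * sample_rate / n  # bin index -> Hz
```

For example, a hypothetical bubble-area signal sampled at 25 frames per second for 4 s and oscillating at 2 Hz yields 2.0. The frequency resolution is `sample_rate / n`, so longer recordings give finer estimates.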
Abstract:
Human activity recognition in everyday environments is a critical but challenging task in Ambient Intelligence applications for achieving proper Ambient Assisted Living, and key challenges remain before robust methods can be realized. One of the major limitations of current Ambient Intelligence systems is the lack of semantic models of the activities in the environment that would let the system recognize the specific activity being performed by the user(s) and act accordingly. In this context, this thesis addresses the general problem of knowledge representation in Smart Spaces. The main objective is to develop knowledge-based models, equipped with semantics, to learn, infer and monitor human behaviours in Smart Spaces. Moreover, some aspects of this problem clearly involve a high degree of uncertainty, so the developed models must be equipped with mechanisms to manage this type of information. A fuzzy ontology and a semantic hybrid system are presented that allow the modelling and recognition of a set of complex real-life scenarios in which vagueness and uncertainty are inherent to the human nature of the users performing them. The handling of uncertain, incomplete and vague data (i.e., missing sensor readings and variations in activity execution, since human behaviour is non-deterministic) is approached for the first time through a fuzzy ontology validated in real-time settings within a hybrid data-driven and knowledge-based architecture. The semantics of activities, sub-activities and real-time object interaction are taken into consideration. The proposed framework consists of two main modules: the low-level sub-activity recognizer and the high-level activity recognizer. The first module detects sub-activities (i.e., actions or basic activities) from input data taken directly from a depth sensor (Kinect).
The main contribution of this thesis concerns the second component of the hybrid system, which sits on top of the first at a higher level of abstraction: it takes its input from the first module's output and performs ontological inference to endow users, activities and their influence on the environment with semantics. This component is thus knowledge-based, and a fuzzy ontology was designed to model the high-level activities. Since activity recognition requires context-awareness and the ability to discriminate among activities in different environments, the semantic framework allows common-sense knowledge to be modelled as a rule-based system that supports expressions close to natural language in the form of fuzzy linguistic labels. The framework was evaluated on CAD-120, a new and challenging public dataset, achieving accuracies of 90.1% and 91.1% for low- and high-level activities, respectively. This is an improvement over both purely data-driven and purely ontology-based approaches. In addition, to keep the system simple and flexible enough to be managed by non-expert users, and thus to ease the transfer of research to industry, a development framework was built to make the system more usable in the final application; it is composed of a programming toolbox, a hybrid crisp and fuzzy architecture, and graphical models to represent and configure human behaviour in Smart Spaces. As a result, human behaviour recognition can help assist people with special needs, for example in healthcare, independent living for the elderly, remote rehabilitation monitoring and industrial process guideline control, among many other cases. This thesis presents use cases in these areas.
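To make "fuzzy linguistic labels" in a rule-based system concrete, the sketch below defines trapezoidal membership functions for two labels and evaluates one conjunctive rule with the minimum t-norm. The labels (NEAR, LONG), their breakpoints, the input variables and the "drinking" rule are all hypothetical; this is a generic fuzzy-rule illustration, not the thesis's fuzzy ontology or its inference engine.

```python
def trapezoid(a, b, c, d):
    """Membership function: 0 below a, rising on [a, b], 1 on [b, c], falling on [c, d]."""
    def mu(x):
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu


# Hypothetical linguistic labels (units: metres and seconds).
NEAR = trapezoid(0.0, 0.0, 0.1, 0.3)                   # hand-to-object distance
LONG = trapezoid(1.0, 3.0, float("inf"), float("inf"))  # contact duration


def rule_strength(*degrees):
    """Firing strength of an AND rule under the minimum t-norm."""
    return min(degrees)


def drinking_degree(hand_cup_distance_m, contact_s):
    """Hypothetical rule: IF hand is NEAR cup AND contact is LONG THEN drinking."""
    return rule_strength(NEAR(hand_cup_distance_m), LONG(contact_s))
```

With the hand 5 cm from the cup for 2 s, NEAR is fully satisfied (1.0) and LONG only partly (0.5), so the rule fires with degree 0.5 instead of a crisp yes or no. This graded output is what lets such a system tolerate variations in how an activity is executed.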
Abstract:
This Master’s thesis discusses the problem of automatically recognizing fish in video sequences. This is a pressing issue for many fish-farming organizations in Finland and Russia, because automating the control and counting of individual species is a turning point for the industry. The difficulties and specific features of the problem were identified in order to find a solution and to propose recommendations for the components of an automated fish recognition system. Methods such as background subtraction, Kalman filtering and the Viola-Jones method were implemented in this work for the detection, tracking and estimation of fish parameters. Both the experimental results and the choice of appropriate methods depend strongly on the quality and type of the input video. Practical experiments showed that not all methods produce good results on real data, even though they perform satisfactorily on synthetic data.
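For the tracking step, a Kalman filter smooths noisy detections and predicts where a fish will be in the next frame. The sketch below is a generic constant-velocity filter for one coordinate of a detected fish centroid; the state layout, the simplified process noise `Q = q * I` and the noise variances are assumptions, not the thesis's configuration. A 2D tracker could run one filter per axis or use a four-state model.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state [position, velocity] and position-only measurements."""

    def __init__(self, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.dt, self.q, self.r = dt, process_var, meas_var
        self.x = [0.0, 0.0]                      # [position, velocity]
        self.P = [[1000.0, 0.0], [0.0, 1000.0]]  # large initial uncertainty

    def predict(self):
        """x <- F x and P <- F P F^T + Q, with F = [[1, dt], [0, 1]]."""
        dt, P = self.dt, self.P
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + self.q
        p01 = P[0][1] + dt * P[1][1]
        p10 = P[1][0] + dt * P[1][1]
        p11 = P[1][1] + self.q
        self.P = [[p00, p01], [p10, p11]]
        return self.x[0]

    def update(self, z):
        """Fuse a measured position z (H = [1, 0])."""
        P = self.P
        s = P[0][0] + self.r                 # innovation variance
        k0, k1 = P[0][0] / s, P[1][0] / s    # Kalman gain
        y = z - self.x[0]                    # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
                  [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]]]
        return self.x[0]
```

Calling `predict()` then `update(z)` once per frame lets the filter bridge frames in which the detector (for example, background subtraction) misses the fish: on such frames only `predict()` is called.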
Abstract:
This minor-subject thesis examined the development of English vocabulary during an advanced oral English course in upper secondary school. The study investigated how the lexical richness of students' spoken language changes. Lexical richness was analysed using measures of lexical variation and lexical density. The study used a longitudinal design, comparing the speech of one group of students before and after the upper secondary school oral English course. There were nine participants in total, all in their second year of upper secondary school. The participants' oral tests were part of a body of material collected by the University of Turku for research use. Transcripts of the recordings were adapted for this study, after which lexical richness was measured from them using several measures. The data were analysed using quantitative methods. The results show that, on average, both the lexical variation and the lexical density of the students' speech increased slightly during the course. In other words, in the test taken after the course the students used slightly more varied vocabulary, and the proportion of content words relative to grammatical words was slightly higher than before the course. However, the development of the students' active vocabulary during the course was not statistically significant. The study also showed that the differences between participants were large, but that they evened out somewhat after the course. Based on the results, it can be assumed that the oral English course, together with many other possible factors, both increased the students' lexical richness and reduced individual differences. Owing to the small sample, however, the results cannot be generalised. In future, it would be interesting to extend the scope of the research to other aspects of lexical richness, such as lexical sophistication.
It would also be interesting to include a measurement of the students' passive vocabulary, and possibly to examine the effects of the oral English course on the development of the students' oral proficiency more broadly than vocabulary alone.
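The abstract does not give the exact formulas, but two common operationalisations are the type-token ratio (TTR) for lexical variation and the share of content words for lexical density. The sketch below computes both from a transcript; the tokenizer and the deliberately tiny function-word list are hypothetical stand-ins for a proper stop list or part-of-speech tagging.

```python
import re

# Hypothetical, deliberately small list of grammatical (function) words.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "is", "are", "was", "were", "be", "i", "you", "he", "she", "it",
    "we", "they", "this", "that", "with", "for", "my", "do", "did",
}


def tokenize(text):
    """Lower-case word tokens; an internal apostrophe (don't, it's) is kept."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(tokens):
    """Lexical variation: distinct word forms divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens):
    """Lexical density: proportion of content (non-function) words."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in FUNCTION_WORDS) / len(tokens)
```

For "The cat sat on the mat." the TTR is 5/6 and the lexical density 0.5. Raw TTR falls as samples get longer, so comparing before-and-after transcripts of different lengths usually calls for a length-corrected variant.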
Abstract:
Metal-ion-mediated base pairing of nucleic acids has attracted considerable attention during the past decade, since it offers a means to expand the genetic code with artificial base pairs, to create predesigned molecular architectures through metal-ion-mediated inter- or intra-strand cross-links, or to convert double-stranded DNA into a nanoscale wire. Such applications largely depend on the presence of a modified nucleobase in both strands engaged in duplex formation. Hybridization of metal-ion-binding oligonucleotide analogs with natural nucleic acid sequences has received much less attention, despite obvious applications. While natural oligonucleotides hybridize with high selectivity, their affinity for complementary sequences is inadequate for a number of applications. In the case of DNA, for example, more than 10 consecutive Watson-Crick base pairs are required for a duplex to be stable at room temperature, which makes it challenging to target shorter sequences. For instance, many types of cancer exhibit distinctive profiles of oncogenic miRNA, but their diagnosis is difficult because only short single-stranded loop structures are present. Metallo-oligonucleotides, with their superior affinity towards their natural complements, would offer a way to overcome the low stability of short duplexes. In this study, a number of metal-ion-binding surrogate nucleosides were prepared, and their interaction with nucleoside 5′-monophosphates (NMPs) was investigated by ¹H NMR spectroscopy. To find metal ion complexes that could discriminate between natural nucleobases upon double helix formation, glycol nucleic acid (GNA) sequences carrying a Pd(II) ion with vacant coordination sites at a predetermined position were synthesized, and their affinity for complementary as well as mismatched counterparts was quantified by UV-melting measurements.
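A common way to read a melting temperature (Tm) from a UV-melting curve is to take the temperature at which the first derivative of absorbance with respect to temperature is largest. The sketch below approximates dA/dT with finite differences between consecutive readings. It is a generic illustration; the abstract does not say how Tm was extracted in this study, and curve fitting to a two-state model is a common, more precise alternative.

```python
def melting_temperature(temps, absorbance):
    """Estimate Tm as the midpoint of the interval with the steepest rise in A.

    temps must be strictly increasing; dA/dT is approximated by forward
    differences between consecutive readings.
    """
    if len(temps) != len(absorbance) or len(temps) < 2:
        raise ValueError("need two or more paired readings")
    best_i, best_slope = 0, float("-inf")
    for i in range(len(temps) - 1):
        slope = (absorbance[i + 1] - absorbance[i]) / (temps[i + 1] - temps[i])
        if slope > best_slope:
            best_i, best_slope = i, slope
    return (temps[best_i] + temps[best_i + 1]) / 2
```

The resolution of this estimate is limited by the temperature step, so finely spaced readings (or a fitted curve) matter when comparing complementary and mismatched duplexes whose Tm values differ by only a few degrees.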
Abstract:
Convolutional Neural Networks (CNNs) have become the state-of-the-art methods for many large-scale visual recognition tasks. For many practical applications, however, CNN architectures have a restrictive requirement: a huge amount of labeled data is needed for training. The idea of generative pretraining is to obtain initial weights for the network by training it in a completely unsupervised way, and then to fine-tune the weights for the task at hand using supervised learning. This thesis gives a general introduction to Deep Neural Networks and their algorithms, and applies these methods to the classification of handwritten digits and natural images to develop unsupervised feature learning. The goal of this thesis is to find out whether the effect of pretraining is diminished by recent practical advances in the optimization and regularization of CNNs. The experimental results show that pretraining is still a substantial regularizer, but not a necessary step in training Convolutional Neural Networks with rectified activations. On handwritten digits, the proposed pretraining model achieved classification accuracy comparable to state-of-the-art methods.
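The two-stage idea (unsupervised pretraining, then supervised fine-tuning) can be sketched on a toy problem. This sketch swaps in a small dense ReLU autoencoder for the generative pretraining model and 2-D points for images; it is not the thesis's CNN or its proposed pretraining model, and all sizes and learning rates are arbitrary assumptions.

```python
import math
import random


def encode(W, b, x):
    """Return pre-activations z = W x + b and rectified activations h."""
    z = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    return z, [max(0.0, v) for v in z]


def decode(V, c, h):
    """Linear decoder: x_hat = V h + c."""
    return [sum(v * hj for v, hj in zip(row, h)) + ci for row, ci in zip(V, c)]


def pretrain_autoencoder(data, hidden, epochs=50, lr=0.01, seed=0):
    """Unsupervised stage: SGD on squared reconstruction error; no labels used.
    Returns encoder weights W, b and the mean squared error before and after."""
    rng = random.Random(seed)
    d = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b = [0.1] * hidden
    V = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(d)]
    c = [0.0] * d

    def mse():
        errs = [sum((p - t) ** 2 for p, t in zip(decode(V, c, encode(W, b, x)[1]), x))
                for x in data]
        return sum(errs) / len(errs)

    before = mse()
    for _ in range(epochs):
        for x in data:
            z, h = encode(W, b, x)
            e = [p - t for p, t in zip(decode(V, c, h), x)]  # dL/dx_hat for L = 0.5*||x_hat - x||^2
            dh = [sum(V[i][j] * e[i] for i in range(d)) for j in range(hidden)]  # uses V before update
            for i in range(d):
                for j in range(hidden):
                    V[i][j] -= lr * e[i] * h[j]
                c[i] -= lr * e[i]
            for j in range(hidden):
                dz = dh[j] if z[j] > 0 else 0.0  # ReLU gradient
                for k in range(d):
                    W[j][k] -= lr * dz * x[k]
                b[j] -= lr * dz
    return W, b, before, mse()


def finetune(W, b, data, labels, epochs=50, lr=0.05):
    """Supervised stage: add a logistic output unit and backpropagate into the
    pretrained encoder (W and b are updated in place)."""
    hidden, d = len(W), len(W[0])
    u, a = [0.0] * hidden, 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            z, h = encode(W, b, x)
            p = 1.0 / (1.0 + math.exp(-(sum(ui * hj for ui, hj in zip(u, h)) + a)))
            ds = p - y  # gradient of binary cross-entropy w.r.t. the logit
            for j in range(hidden):
                dz = ds * u[j] if z[j] > 0 else 0.0  # uses u[j] before its update
                u[j] -= lr * ds * h[j]
                for k in range(d):
                    W[j][k] -= lr * dz * x[k]
                b[j] -= lr * dz
            a -= lr * ds
    return u, a


def predict(W, b, u, a, x):
    """Class 1 if the logit is positive, else class 0."""
    h = encode(W, b, x)[1]
    return int(sum(ui * hj for ui, hj in zip(u, h)) + a > 0)
```

The only thing pretraining changes here is where fine-tuning starts: the encoder weights come from reconstruction rather than random initialization. That is exactly the effect the thesis measures against modern regularization and rectified activations.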
Abstract:
Statistics on spoken-word recordings published in Finland since 1995, based on legal deposit accessions