186 results for Utterance
Abstract:
Automatic spoken Language Identification (LID) is the process of identifying the language spoken within an utterance. The challenge that this task presents is that no prior information is available indicating the content of the utterance or the identity of the speaker. The trend of globalization and the pervasive popularity of the Internet will amplify the need for the capabilities spoken language identification systems provide. A prominent application arises in call centers dealing with speakers speaking different languages. Another important application is to index or search huge speech data archives and corpora that contain multiple languages. The aim of this research is to develop techniques targeted at producing a fast and more accurate automatic spoken LID system compared to the previous National Institute of Standards and Technology (NIST) Language Recognition Evaluation. Acoustic and phonetic speech information are targeted as the most suitable features for representing the characteristics of a language. To model the acoustic speech features a Gaussian Mixture Model based approach is employed. Phonetic speech information is extracted using existing speech recognition technology. Various techniques to improve LID accuracy are also studied. One approach examined is the employment of Vocal Tract Length Normalization to reduce the speech variation caused by different speakers. A linear data fusion technique is adopted to combine the various aspects of information extracted from speech. As a result of this research, a LID system was implemented and presented for evaluation in the 2003 Language Recognition Evaluation conducted by the NIST.
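As a rough illustration of the Gaussian Mixture Model approach to acoustic language identification mentioned above, the sketch below scores an utterance against one diagonal-covariance GMM per language and picks the best-scoring language. It is a minimal sketch, not the system described in the abstract: the function names, the diagonal-covariance assumption, the two-dimensional toy features and all model parameters are invented for illustration.

```python
import math

def diag_gmm_loglik(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))

def utterance_score(frames, gmm):
    """Average per-frame log-likelihood of an utterance under one GMM."""
    weights, means, variances = gmm
    total = sum(diag_gmm_loglik(f, weights, means, variances) for f in frames)
    return total / len(frames)

def identify_language(frames, language_gmms):
    """Return the best-scoring language and the score of every language."""
    scores = {lang: utterance_score(frames, gmm)
              for lang, gmm in language_gmms.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy models: (weights, means, variances) per language.
toy_gmms = {
    "lang_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "lang_b": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
toy_frames = [[0.1, -0.2], [0.9, 1.1]]
best_language, all_scores = identify_language(toy_frames, toy_gmms)
print(best_language, all_scores)
```

In a real system the frames would be acoustic feature vectors and the models would be trained on labelled speech; neither step is shown here.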
Abstract:
The cascading appearance-based (CAB) feature extraction technique has established itself as the state-of-the-art in extracting dynamic visual speech features for speech recognition. In this paper, we will focus on investigating the effectiveness of this technique for the related speaker verification application. By investigating the speaker verification ability of each stage of the cascade, we will demonstrate that the same steps taken to reduce static speaker and environmental information for the visual speech recognition application also provide similar improvements for visual speaker recognition. A further study is conducted comparing synchronous HMM (SHMM) based fusion of CAB visual features and traditional perceptual linear predictive (PLP) acoustic features to show that the higher complexity inherent in the SHMM approach does not appear to provide any improvement in the final audio-visual speaker verification system over simpler utterance-level score fusion.
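To make the idea of utterance-level score fusion concrete, the following minimal sketch combines one acoustic score and one visual score per utterance with a weighted sum and compares the result with a threshold. The weighted-sum rule, the default weight, the threshold and the function names are assumptions made for illustration; the abstract does not specify how the scores were fused.

```python
def fuse_scores(audio_score, visual_score, audio_weight=0.5):
    """Weighted sum of two utterance-level verification scores.

    audio_weight is a hypothetical tuning parameter in [0, 1]; the visual
    score receives the remaining weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score

def verify(audio_score, visual_score, threshold=0.0, audio_weight=0.5):
    """Accept the identity claim when the fused score reaches the threshold."""
    return fuse_scores(audio_score, visual_score, audio_weight) >= threshold

# Toy scores, assumed to be already normalized to comparable ranges.
print(verify(2.0, -1.0))                      # equal weights
print(verify(-2.0, 1.0, audio_weight=0.75))   # audio weighted more heavily
```

Because fusion happens once per utterance, the two modalities can be modeled independently, which is what makes this simpler than a synchronous HMM that couples them frame by frame.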
Abstract:
Much of the research on the delivery of advice by professionals such as physicians, health workers and counsellors, both on the telephone and in face to face interaction more generally, has focused on the theme of client resistance and the consequent need for professionals to adopt particular formats to assist in the uptake of the advice. In this paper we consider one setting, Kid’s Helpline, the national Australian counselling service for children and young people, where there is an institutional mandate not to give explicit advice in accordance with the values of self-direction and empowerment. The paper examines one practice, the use of script proposals by counsellors, which appears to offer a way of providing support which is consistent with these values. Script proposals entail the counsellors packaging their advice as something that the caller might say – at some future time – to a third party such as a friend, teacher, parent, or partner, and involve the counsellor adopting the speaking position of the caller in what appears as a rehearsal of a forthcoming strip of interaction. Although the core feature of a script proposal is the counsellor’s use of direct reported speech they appear to be delivered, not so much as exact words to be followed, but as the type of conversation that the client needs to have with the 3rd party. Script proposals, in short, provide models of what to say as well as alluding to how these could be emulated by the client. In their design script proposals invariably incorporate one or more of the most common rhetorical formats for maximising the persuasive force of an utterance such as a three part list or a contrastive pair. Script proposals, moreover, stand in a complex relation to the prior talk and one of their functions appears to be to summarise, respecify or expand upon the client’s own ideas or suggestions for problem solving that have emerged in these preceding sequences.
Abstract:
Robust speaker verification on short utterances remains a key consideration when deploying automatic speaker recognition, as many real world applications often have access to only limited duration speech data. This paper explores how the recent technologies focused around total variability modeling behave when training and testing utterance lengths are reduced. Results are presented which provide a comparison of Joint Factor Analysis (JFA) and i-vector based systems including various compensation techniques: Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA). Speaker verification performance for utterances with as little as 2 sec of data taken from the NIST Speaker Recognition Evaluations is presented to provide a clearer picture of the current performance characteristics of these techniques in short utterance conditions.
Abstract:
This paper investigates the effects of limited speech data in the context of speaker verification using a probabilistic linear discriminant analysis (PLDA) approach. Being able to reduce the length of required speech data is important to the development of automatic speaker verification systems in real world applications. When sufficient speech is available, previous research has shown that heavy-tailed PLDA (HTPLDA) modeling of speakers in the i-vector space provides state-of-the-art performance; however, the robustness of HTPLDA to the limited speech resources in development, enrolment and verification is an important issue that has not yet been investigated. In this paper, we analyze the speaker verification performance with regard to the duration of utterances used for both speaker evaluation (enrolment and verification) and score normalization and PLDA modeling during development. Two different approaches to total-variability representation are analyzed within the PLDA approach to show improved performance in short-utterance mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development. The results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset suggest that the HTPLDA system can continue to achieve better performance than Gaussian PLDA (GPLDA) as evaluation utterance lengths are decreased. We also highlight the importance of matching durations for score normalization and PLDA modeling to the expected evaluation conditions. Finally, we found that a pooled total-variability approach to PLDA modeling can achieve better performance than the traditional concatenated total-variability approach for short utterances in mismatched evaluation conditions and conditions for which insufficient speech resources are available for adequate system development.
Abstract:
Reliability of the performance of biometric identity verification systems remains a significant challenge. Individual biometric samples of the same person (identity class) are not identical at each presentation and performance degradation arises from intra-class variability and inter-class similarity. These limitations lead to false accepts and false rejects that are dependent. It is therefore difficult to reduce the rate of one type of error without increasing the other. The focus of this dissertation is to investigate a method based on classifier fusion techniques to better control the trade-off between the verification errors using text-dependent speaker verification as the test platform. A sequential classifier fusion architecture that integrates multi-instance and multi-sample fusion schemes is proposed. This fusion method enables a controlled trade-off between false alarms and false rejects. For statistically independent classifier decisions, analytical expressions for each type of verification error are derived using base classifier performances. As this assumption may not always be valid, these expressions are modified to incorporate the correlation between statistically dependent decisions from clients and impostors. The architecture is empirically evaluated by applying it to text-dependent speaker verification, using Hidden Markov Model based digit-dependent speaker models in each stage with multiple attempts for each digit utterance. The trade-off between the verification errors is controlled using two parameters, the number of decision stages (instances) and the number of attempts at each decision stage (samples), fine-tuned on an evaluation/tuning set. The statistical validation of the derived expressions for error estimates is evaluated on test data. The performance of the sequential method is further demonstrated to depend on the order of the combination of digits (instances) and the nature of repetitive attempts (samples).
The false rejection and false acceptance rates for the proposed fusion are estimated using the base classifier performances, the variance in correlation between classifier decisions and the sequence of classifiers with favourable dependence selected using the 'Sequential Error Ratio' criterion. The error rates are better estimated by incorporating user-dependent (such as speaker-dependent thresholds and speaker-specific digit combinations) and class-dependent (such as client-impostor dependent favourable combinations and class-error based threshold estimation) information. The proposed architecture is desirable in most speaker verification applications such as remote authentication, telephone and internet shopping applications. The tuning of parameters, namely the number of instances and samples, serves both the security and user convenience requirements of speaker-specific verification. The architecture investigated here is applicable to verification using other biometric modalities such as handwriting, fingerprints and key strokes.
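The trade-off between the number of stages and the number of attempts can be illustrated for the statistically independent case. The sketch below assumes one plausible combination rule that the abstract does not spell out: a claimant is accepted only if every stage is passed, and a stage is passed if any one of its attempts is accepted. The function name, the rule and the example rates are all assumptions for illustration, not the dissertation's derived expressions.

```python
def sequential_fusion_errors(base_far, base_frr, stages, attempts):
    """Overall (FAR, FRR) for independent decisions under an assumed rule.

    Assumed rule: accept only if all `stages` are passed; a stage is passed
    if at least one of its `attempts` is accepted by the base classifier.
    """
    for rate in (base_far, base_frr):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("error rates must lie in [0, 1]")
    if stages < 1 or attempts < 1:
        raise ValueError("stages and attempts must be at least 1")
    # An impostor passes a stage if any attempt is falsely accepted.
    stage_far = 1.0 - (1.0 - base_far) ** attempts
    # A client fails a stage only if every attempt is falsely rejected.
    stage_frr = base_frr ** attempts
    # An impostor must pass every stage; a client must not fail any stage.
    total_far = stage_far ** stages
    total_frr = 1.0 - (1.0 - stage_frr) ** stages
    return total_far, total_frr

# Hypothetical base classifier with 10% FAR and 10% FRR.
for n_stages, n_attempts in [(1, 1), (2, 1), (1, 2), (2, 2)]:
    far, frr = sequential_fusion_errors(0.10, 0.10, n_stages, n_attempts)
    print(n_stages, n_attempts, round(far, 4), round(frr, 4))
```

Under this rule, adding stages lowers false acceptance at the cost of more false rejections, while adding attempts does the opposite, which is why the two parameters together give a controllable trade-off. Correlated decisions, which the dissertation also addresses, are not modeled here.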
Abstract:
Chinese modal particles feature prominently in Chinese people’s daily use of the language, but their pragmatic and semantic functions are elusive as commonly recognised by Chinese linguists and teachers of Chinese as a foreign language. This book originates from an extensive and intensive empirical study of the Chinese modal particle a (啊), one of the most frequently used modal particles in Mandarin Chinese. In order to capture all the uses and the underlying meanings of the particle, the author transcribed the first 20 episodes, about 20 hours in length, of the popular Chinese TV drama series Kewang ‘Expectations’, which yielded a corpus of more than 142,000 Chinese characters with a total of 1829 instances of the particle, all used in meaningful communicative situations. Within its context of use, every single occurrence of the particle was analysed in terms of its pragmatic and semantic contributions to the hosting utterance. On this basis, the core meanings were identified which were seen as constituting the modal nature of the particle.
Abstract:
This PhD research has provided novel solutions to three major challenges which have prevented the widespread deployment of speaker recognition technology: (1) combating enrolment/verification mismatch, (2) reducing the large amount of development and training data that is required and (3) reducing the duration of speech required to verify a speaker. A range of applications of speaker recognition technology from forensics in criminal investigations to secure access in banking will benefit from the research outcomes.
Abstract:
In How to Do Things with Words, Austin (1975) described marriages, sentencings and ship launchings as prototypes of performative utterance. What’s the appropriate speech act for launching an academic journal? First editions of journals tend to take a field as formed a priori, as having “come of age”, and state good intents to capture its best or most innovative work.
Abstract:
The interpretation of irony in this study is seen as being crucially dependent on the notion of coherence. Coherence depends on a complex interplay of contextual features, which is why all interpretations must be seen as socio-cultural processes. An utterance is perceived as coherent if it makes sense and if it hangs together. Incoherent utterances can result in an ironic interpretation; however, the incoherence must also be perceived as being intentional, and intentionality in turn is a sign of the ironist's rejecting stance. The study does not encompass the notion of irony of fate nor situational irony that is unintentional. Irony is defined in this study as a combination of five components. It is seen as (1) a negative attitude that reflects (2) the intention of the ironist, and (3) has a target and most often (4) a victim too. Essential to irony is its fifth component, the fact that one or more of these four components must be inferred from co- or context. The componential definition of irony is crucial in deciding whether an interpretation is ironic or not, and the definition makes it possible to discern the differences as well as the similarities between different kinds of irony. The method of the study is experimental: 12 Finnish newspaper texts that could be considered to be ironic were interpreted by 107 informants. The interpretation of one of the texts was based on unelicited feedback given by readers of a weekly magazine. The responses were analyzed to determine (a) whether the texts were perceived as being coherent or incoherent and (b) whether the informants appealed to any of the five components of irony. The results of the analyses of the informants' responses indicate that differences between the ironic and non-ironic interpretations of the texts can be explained in terms of whether or not the informant regarded the text as being coherent. 
The thesis also discusses the shortcomings of other accounts of irony: the Gricean theory of conversational implicature, speech act theory, irony as rhetoric, irony as pretense, irony as echoic mention, and irony as framing. In contrast to these other accounts, the study focuses on irony as a textual phenomenon and underlines the importance of socio-cultural context in the interpretation of irony. Key words: irony, coherence, incoherence, the componential definition of irony, interpretation of linguistic utterances.
Abstract:
This thesis is an empirical study of how two words in Icelandic, "nú" and "núna", are used in contemporary Icelandic conversation. My aims in this study are, first, to explain the differences between the temporal functions of "nú" and "núna", and, second, to describe the non-temporal functions of "nú". In the analysis, a focus is placed on comparing the sequential placement of the two words, on their syntactical distribution, and on their prosodic realization. The empirical data comprise 14 hours and 11 minutes of naturally occurring conversation recorded between 1996 and 2003. The selected conversations represent a wide range of interactional contexts including informal dinner parties, institutional and non-institutional telephone conversations, radio programs for teenagers, phone-in programs, and, finally, a political debate on television. The theoretical and methodological framework is interactional linguistics, which can be described as linguistically oriented conversation analysis (CA). A comparison of "nú" and "núna" shows that the two words have different syntactic distributions. "Nú" has a clear tendency to occur in the front field, before the finite verb, while "núna" typically occurs in the end field, after the object. It is argued that this syntactic difference reflects a functional difference between "nú" and "núna". A sequential analysis of "núna" shows that the word refers to an unspecified period of time which includes the utterance time as well as some time in the past and in the future. This temporal relation is referred to as reference time. "Nú", by contrast, is mainly used in three different environments: 1) in temporal comparisons, 2) in transitions, and 3) when the speaker is taking an affective stance. The non-temporal functions of "nú" are divided into three categories: 1) "nú" as a tone particle, 2) "nú" as an utterance particle, and 3) "nú" as a dialogue particle.
"Nú" as a tone particle is syntactically integrated and can occur in two syntactic positions: pre-verbally and post-verbally. I argue that these instances are employed in utterances in which a speaker is foregrounding information or marking it as particularly important. The study shows that, although these instances are typically prosodically non-prominent and unstressed, they are in some cases delivered with stress and with a higher pitch than the surrounding talk. "Nú" as an utterance particle occurs turn-initially and is syntactically non-integrated. By using "nú", speakers show continuity between turns and link new turns to prior ones. These instances initiate either continuations by the same speaker or new turns after speaker shifts. "Nú" as a dialogue particle occurs as a turn of its own. The study shows that these instances register informings in prior turns as unexpected or as a departure from the normal state of affairs. "Nú" as a dialogue particle is often delivered with a prolonged vowel and a recognizable intonation contour. A comparative sequential and prosodic analysis shows that in these cases there is a correlation between the function of "nú" and the intonation contour by which it is delivered. Finally, I argue that despite the many functions of "nú", all the instances can be said to have a common denominator, which is to display attention towards the present moment and the utterances which are produced before or after the production of "nú". Instead of anchoring the utterances in external time or reference time, these instances position the utterance in discourse internal time, or discourse time.
Abstract:
This 'project' investigates Janet Cardiff's Whispering Room. It examines how Cardiff deconstructs the privileging of the visual over all other corporeal senses in her work, the Whispering Room. Using sound as a fulcrum, Cardiff explores the links between subjects, collective narratives, memories, experiences and performances. Janet Cardiff destabilizes time and space and fractures the continuum through the use of sound. My 'project' celebrates sound as a transgressive medium: sound not as a gendered medium but as a vehicle in which to speak (to) gender. It explores how sound can destabilize notions of perception and reception and question art and museal practices. In the process this 'project' reveals the complexity of interpreting and representing art as an object. My aim is to reflect the very intertextual and expressionist collage that Cardiff has created in Whispering Room in my own text. Cardiff solicits the viewer's intimacy and participation. Whispering Room is a physical yet metonymic space in which Cardiff creates a place for performativity, experience, memory, desire and speech; thus she opens up a space for the utterance and performance of the viewer. Viewers construct and create meaning/s for themselves within this mnemonic space by digging up their own memories, desires and reveries. The strength of Cardiff's work is that it relies on a viewer to perform, a body to trigger the pseudo-spectacle and a voice to interrupt the whispers. One might ask of Whispering Room where the illusionistic space begins and where the physical space ends. This 'project' investigates how in Whispering Room there is no one experience but many experiences.
Abstract:
This research deals with direct speech quotations in magazine articles through two questions. As my major research question, I study the functions of speech quotations based on data consisting of six literary-journalistic magazine articles. My minor research question builds on the fact that there is no absolute relation between the sound waves of the spoken language and the graphemes of the written one. Hence, I study the general thoughts on how utterances should be arranged in the written form based on a large review of literature and textbooks on journalistic writing as well as interviews I have made with magazine writers and editors, and the Council of Mass Media in Finland. To support my main research questions, I also examine the reference system of the Finnish language, define the aspects of the literary-journalistic article and study vernacular cues in written speech quotations. FUNCTIONS OF QUOTATIONS. I demonstrate the results of my analysis with a six-pointed apparatus. It is a continuum which extends from the structural level of text, all the way through the explicit functions, to the implicit functions of the quotation. The explicit functions deal with the question of what the content is, whereas the implicit ones are based mainly on the question of how the content is presented. 1. The speech quotation is a distinctive element in the structure of the magazine article. Thereby it creates a rhythm for the text, such as episodes, paragraphs and clauses. 2. All stories are told through a plot, and in magazine articles, the speech quotations are one of the narrative elements that propel the plot forward. 3. The speech quotations create and intensify the location written in the story. This location can be a physical one but also a social one, in which case it describes the atmosphere and mood in the physical environment and of the story characters. 4.
The quotations enhance the plausibility of the facts and assumptions presented in the article, and moreover, when a text is placed between quotation marks, the reader can be assured that the text has been reproduced in the authentic verbatim way. 5. Speech quotations tell about the speaker's unique way of using language and the first-hand experiences of the person quoted. 6. The sixth function of speech quotations is probably the most essential one: the quotations characterize the quoted speaker. In other words, in addition to the propositional content of the utterance, the way in which it has been said transmits a lot of the speaker's character (e.g. nature, generation, behaviour, education, attitudes etc.). It is important to note that these six functions of my speech quotation apparatus do not exclude one another. It means that every speech quotation basically includes all of the functions discussed above. However, in practice one or more of them have a principal role, while the others play a subsidiary role. HOW TO MAKE QUOTATIONS? It is not surprising that the field of journalism (textbooks, literature and interviews) holds heterogeneous and unestablished thoughts on how the spoken language should be arranged in written quotations, which is my minor research question. However, the most frequent and distinctive aspects can be depicted in a couple of words: serve the reader and respect the target person. Very common advice on how to arrange the quotations is, firstly, to delete such vernacular cues (e.g. repetitions and "expletives") that are common in spoken communication but purposeless in the written language; secondly, to complete the phonetic word forms of the spoken language into a more reader-friendly form (e.g. punanen → punainen, 'red'); and thirdly, to enhance the independence of clauses from the (authentic) context and to strengthen the reciprocal links between them.
According to the knowledge of the journalistic field, utterances recorded at different points in time of an interview or a data-collecting session can be transferred as consecutive quotations or even merged together. However, if there is any temporal-spatial location written in the story, the dialogue of the story characters should also be situated in an authentic context, chronologically in the right place in the continuum of the events. To summarize, the way in which the utterances should be arranged into written speech quotations is always situation-specific, and it is strongly based on the author's discretion.
Abstract:
This thesis presents an experimental study of the speech prosody of identical and non-identical twins. Speech fluency, pauses, speech rate, utterance length and speech frequency were examined phonetically, auditorily, semantically and statistically. The methods included both reading tasks (reading the alphabet, numerical lists, sentences with foreign loan words, holiday theme questions as well as 1.5 pages of text with long sentences and complex words) and spontaneous speech tasks (picture description and answering holiday theme questions). The subjects were Finnish-speaking 22-28-year-old female twins: 8 identical (monozygotic) and 10 non-identical (dizygotic) pairs. One pair was male-female. Comparisons were made between twin groups and between sisters. The data was regathered from four twin pairs, to make it possible to investigate some subjects intra-individually. In addition phoneticians, phonetic students and people without knowledge of phonetic science were tested in two listening experiments. The results showed that the dizygotic twins differed more from each other than monozygotic twins and that monozygotic twin sisters shared more similarities than dizygotic twin sisters. For example, between monozygotic twin sisters smaller differences were found between word count, utterance length and speech rate in spontaneous speech tasks. Dizygotic twin sisters made more different kinds of reading mistakes with the same target words than monozygotic twin sisters, while monozygotic twin sisters made more of the same reading mistakes with the same target words than dizygotic twin sisters. The listening experiments showed that only professional phoneticians were able to recognize the twin sisters. 
Even though the twins had the possibility to freely choose their speech rate, pausing and speech frequency, they used their own speech patterns; these included the same average speech frequency, average speech rate, type of pausing routine or filled pauses, and other speech mannerisms throughout their speech.
Abstract:
Goals. This study aims to map the effect of interrogative function on the intonation of spontaneous and read Finnish. Earlier research shows that the most prominent feature in Finnish question intonation is an appeal to the listener. Question word questions typically start with a high peak which is followed by falling intonation. In yes/no questions, F0 remains on a high level until the word carrying sentence stress and then falls. Final rises are mainly found in intonation clichés such as "Ai mitä?" ("What?"). These earlier results are based on read speech and enacted dialogues. In this study, questions and statements found in spontaneous dialogues were compared. These utterances were also compared with read versions of the same utterances. Fundamental frequency values were compared using a mixed model. Contours were also grouped using auditory and visual inspection. Thus it was possible to compare frequencies of contour types according to utterance type and speech style. The position of questions in the F0 distribution of the whole material was also investigated in this study. Method. The material consisted of four spontaneous dialogues and their read versions. The speakers were young adults from the Helsinki metropolitan area, four females and four males. The whole material was first divided into broad dialogue function categories arising from the material and F0 curves were calculated for each category. After this, 277 questions and 244 statements were selected for closer inspection. Values reflecting F0 distribution and contour shape were measured from the F0 contours of these utterances. A mixed model was used to analyse the differences. Utterance type, question type, speech style and speaker gender were used as fixed effects. The frequencies of F0 contour types were compared using a chi-square test. Additional material in this study came from eight young female speakers in central Finland.
Results and conclusions. In the mixed model analysis, significant differences were found both between questions and statements and between spontaneous and read speech. Generally, utterance type affected the variables reflecting contour type while speech style affected the variables reflecting F0 distribution. The effect of question type was not clearly visible. In read speech the contours resembled earlier results more closely. Speakers had different strategies in differentiating between questions and statements. In the whole material, F0 was slightly higher in questions than in statements. The effect of dialectal background could be seen in the contour types. The results show that interrogative function affects intonation in both spontaneous and read Finnish.
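The comparison of contour-type frequencies with a chi-square test can be sketched as follows. This is a minimal illustration of Pearson's chi-square statistic for a contingency table; the function name is hypothetical and the counts are invented toy numbers, not data from the study.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table.

    `table` is a list of rows of observed counts, for example one row per
    utterance type and one column per F0 contour type.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column categories were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Invented toy counts: rows = questions, statements;
# columns = two hypothetical contour types.
toy_counts = [[30, 10], [20, 40]]
statistic, dof = chi_square_statistic(toy_counts)
print(statistic, dof)
```

Turning the statistic into a p-value requires the chi-square distribution with `dof` degrees of freedom, which is not in the Python standard library; a statistics package would normally supply it.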