886 results for Text-to-speech
Abstract:
Speech is often a multimodal process, presented audiovisually through a talking face. One area of speech perception influenced by visual speech is speech segmentation, the process of breaking a stream of speech into individual words. Mitchel and Weiss (2013) demonstrated that a talking face contains specific cues to word boundaries and that subjects can correctly segment a speech stream when given a silent video of a speaker. The current study expanded upon these results, using an eye tracker to identify highly attended facial features of the audiovisual display used in Mitchel and Weiss (2013). In Experiment 1, subjects were found to spend the most time watching the eyes and mouth, with a trend suggesting that the mouth was viewed more than the eyes. Although subjects displayed significant learning of word boundaries, performance was not correlated with gaze duration on any individual feature, nor was performance correlated with a behavioral measure of autistic-like traits (the Social Responsiveness Scale; SRS). However, trends suggested that as autistic-like traits increased, gaze duration on the mouth increased and gaze duration on the eyes decreased, similar to significant trends seen in autistic populations (Boraston & Blakemore, 2007). In Experiment 2, the same video was modified so that a black bar covered the eyes or mouth. Both videos elicited learning of word boundaries equivalent to that seen in the first experiment. Again, no correlations were found between segmentation performance and SRS scores in either condition. These results, taken with those of Experiment 1, suggest that neither the eyes nor the mouth are critical to speech segmentation and that perhaps more global head movements indicate word boundaries (see Graf, Cosatto, Strom, & Huang, 2002). Future work will elucidate the contribution of individual features relative to global head movements, as well as extend these results to additional types of speech tasks.
Abstract:
Comprehending speech is one of the most important human behaviors, but we are only beginning to understand how the brain accomplishes this difficult task. One key to speech perception seems to be that the brain integrates the independent sources of information available in the auditory and visual modalities in a process known as multisensory integration. This allows speech perception to be accurate, even in environments in which one modality or the other is ambiguous in the context of noise. Previous electrophysiological and functional magnetic resonance imaging (fMRI) experiments have implicated the posterior superior temporal sulcus (STS) in auditory-visual integration of both speech and non-speech stimuli. While prior imaging studies have found increases in STS activity for audiovisual speech compared with unisensory auditory or visual speech, these studies do not provide a clear mechanism as to how the STS communicates with early sensory areas to integrate the two streams of information into a coherent audiovisual percept. Furthermore, it is currently unknown whether activity within the STS is directly correlated with the strength of audiovisual perception. In order to better understand the cortical mechanisms that underlie audiovisual speech perception, we first studied STS activity and connectivity during the perception of speech with auditory and visual components of varying intelligibility. By studying fMRI activity during these noisy audiovisual speech stimuli, we found that STS connectivity with auditory and visual cortical areas mirrored perception; when the information from one modality is unreliable and noisy, the STS interacts less with the cortex processing that modality and more with the cortex processing the reliable information.
We next characterized the role of STS activity during a striking audiovisual speech illusion, the McGurk effect, to determine if activity within the STS predicts how strongly a person integrates auditory and visual speech information. Subjects with greater susceptibility to the McGurk effect exhibited stronger fMRI activation of the STS during perception of McGurk syllables, implying a direct correlation between the strength of audiovisual integration of speech and activity within the multisensory STS.
Abstract:
"An Introduction to Public Health is about the discipline of public health and the nature and scope of public health activity set within the challenges of the twenty-first century. It is an introductory text to the principles and practice of public health written in a way that is easy to understand. Of what relevance is public health to the many allied health disciplines that contribute to it? How might an understanding of public health contribute to a range of health professionals who use the principles and practices of public health in their professional activities? These are the questions that this book addresses. An Introduction to Public Health leads the reader on a journey of discovery that concludes with an understanding not only of the nature and scope of public health but also of the challenges that face the field into the future." — Provided by publisher.
Abstract:
Purpose: The classic study of Sumby and Pollack (1954, JASA, 26(2), 212-215) demonstrated that visual information aided speech intelligibility under noisy auditory conditions. Their work showed that visual information is especially useful under low signal-to-noise conditions where the auditory signal leaves greater margins for improvement. We investigated whether simulated cataracts interfered with the ability of participants to use visual cues to help disambiguate the auditory signal in the presence of auditory noise. Methods: Participants in the study were screened to ensure normal visual acuity (mean of 20/20) and normal hearing (auditory threshold ≤ 20 dB HL). Speech intelligibility was tested under an auditory-only condition and two visual conditions: normal vision and simulated cataracts. The light-scattering effects of cataracts were imitated using cataract-simulating filters. Participants wore blacked-out glasses in the auditory-only condition and lens-free frames in the normal auditory-visual condition. Individual sentences were spoken by a live speaker in the presence of prerecorded four-person background babble set to a speech-to-noise ratio (SNR) of -16 dB. The SNR was determined in a preliminary experiment to support 50% correct identification of sentences under the auditory-only conditions. The speaker was trained to match the rate, intensity and inflections of a prerecorded audio track of everyday speech sentences. The speaker was blind to the visual conditions of the participant to control for bias. Participants’ speech intelligibility was measured by comparing the accuracy of their written account of what they believed the speaker to have said to the actual spoken sentence. Results: Relative to the normal vision condition, speech intelligibility was significantly poorer when participants wore simulated cataracts. Conclusions: The results suggest that cataracts may interfere with the acquisition of visual cues to speech perception.
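The -16 dB speech-to-noise ratio described above is a matter of scaling the babble relative to the speech power. As a rough illustration of how such a target SNR can be set (not the authors' procedure; the tone and white-noise stand-ins and function names below are invented for the sketch):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio of the mix equals `snr_db`."""
    p_speech = np.mean(speech ** 2)          # mean power of the speech signal
    p_noise = np.mean(noise ** 2)            # mean power of the raw noise
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Toy stand-ins: a 440 Hz tone as "speech", white noise as "babble".
rng = np.random.default_rng(0)
fs = 16000
speech = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
babble = rng.standard_normal(fs)
mixed = mix_at_snr(speech, babble, -16.0)
```

At -16 dB the noise power is roughly forty times the speech power, which is consistent with the roughly 50% auditory-only intelligibility the preliminary experiment targeted.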
Abstract:
Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In “noise-free” environments, word recognition performance of these systems has been shown to approach 100%; however, this performance degrades rapidly as the level of background noise is increased. Speech enhancement is a popular method for making ASR systems more robust. Single-channel spectral subtraction was originally designed to improve human speech intelligibility, and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics are used to assess enhancement performance for intelligibility, not speech recognition, therefore making them sub-optimal for ASR applications. This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally-efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) introducing phase spectrum information to enable spectral subtraction in the complex frequency domain. Likelihood-maximisation uses gradient descent to optimise parameters of the enhancement algorithm to best fit the acoustic speech model given a word sequence known a priori. Whilst this technique is shown to improve ASR word accuracy, it is also identified to be particularly sensitive to non-noise mismatches between the training and testing data. Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of clean speech magnitudes, which are typically used in ASR feature extraction.
Phase Estimation via Delay Projection is proposed based on the stationarity of sinusoidal signals, and demonstrates the potential to produce improvements in ASR word accuracy over a wide range of SNRs. Throughout the dissertation, consideration is given to practical implementation in vehicular environments, which resulted in two novel contributions: a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware. The techniques proposed in this dissertation were evaluated using the Australian English In-Car Speech Corpus, which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.
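Single-channel spectral subtraction of the kind this thesis builds on is easy to sketch. Below is a minimal magnitude-domain version applied to one spectral frame; the over-subtraction factor, spectral floor and toy signal are illustrative assumptions, not the thesis's LIMA or complex-domain formulations. Note that it reuses the noisy phase, which is exactly the simplification the thesis challenges:

```python
import numpy as np

def spectral_subtract(noisy, noise_est, alpha=2.0, beta=0.01):
    """Basic magnitude spectral subtraction on one STFT frame.

    noisy: complex spectrum of the noisy frame
    noise_est: estimated noise magnitude spectrum (same length)
    alpha: over-subtraction factor; beta: spectral floor
    """
    mag = np.abs(noisy)
    phase = np.angle(noisy)                 # noisy phase reused unchanged
    clean_mag = mag - alpha * noise_est     # subtract the noise estimate
    # Flooring avoids negative magnitudes and limits "musical noise".
    clean_mag = np.maximum(clean_mag, beta * mag)
    return clean_mag * np.exp(1j * phase)

# Toy frame: a windowed sinusoid with a small constant complex offset as "noise".
frame = np.hanning(256) * np.sin(2 * np.pi * 20 * np.arange(256) / 256)
noisy = np.fft.rfft(frame) + 0.1 * (1 + 1j)
enhanced = spectral_subtract(noisy, np.full(noisy.shape, 0.1))
```

The clean-magnitude estimate feeding ASR feature extraction comes from `clean_mag`; the thesis's argument is that better phase handling sharpens precisely this estimate.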
Abstract:
While close-talking microphones give the best signal quality and produce the highest accuracy from current Automatic Speech Recognition (ASR) systems, the speech signal enhanced by a microphone array has been shown to be an effective alternative in a noisy environment. The use of microphone arrays, in contrast to close-talking microphones, alleviates the feeling of discomfort and distraction for the user. For this reason, microphone arrays are popular and have been used in a wide range of applications such as teleconferencing, hearing aids, speaker tracking, and as the front-end to speech recognition systems. With advances in sensor and sensor network technology, there is considerable potential for applications that employ ad-hoc networks of microphone-equipped devices collaboratively as a virtual microphone array. By allowing such devices to be distributed throughout the users’ environment, the microphone positions are no longer constrained to traditional fixed geometrical arrangements. This flexibility in the means of data acquisition allows different audio scenes to be captured to give a complete picture of the working environment. In such ad-hoc deployment of microphone sensors, however, the lack of information about the location of devices and active speakers poses technical challenges for array signal processing algorithms which must be addressed to allow deployment in real-world applications. While not an ad-hoc sensor network, conditions approaching this have in effect been imposed in recent National Institute of Standards and Technology (NIST) ASR evaluations on distant microphone recordings of meetings. The NIST evaluation data comes from multiple sites, each with different and often loosely specified distant microphone configurations. This research investigates how microphone array methods can be applied to ad-hoc microphone arrays.
A particular focus is on devising methods that are robust to unknown microphone placements in order to improve the overall speech quality and recognition performance provided by the beamforming algorithms. In ad-hoc situations, microphone positions and likely source locations are not known and beamforming must be achieved blindly. There are two general approaches that can be employed to blindly estimate the steering vector for beamforming. The first is direct estimation without regard to the microphone and source locations. An alternative approach is instead to first determine the unknown microphone positions through array calibration methods and then to use the traditional geometrical formulation for the steering vector. Following these two major approaches investigated in this thesis, a novel clustered approach which includes clustering the microphones and selecting the clusters based on their proximity to the speaker is proposed. Novel experiments are conducted to demonstrate that the proposed method to automatically select clusters of microphones (i.e., a subarray), closely located both to each other and to the desired speech source, may in fact provide more robust speech enhancement and recognition than the full array could.
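The benefit of combining microphones is easiest to see with the classic delay-and-sum beamformer that the blind and clustered methods above build on. A toy sketch, assuming the per-microphone delays are already known (in the ad-hoc setting they would have to be estimated blindly, which is the hard part the thesis addresses); the scene, sampling rate and integer-sample alignment are invented simplifications:

```python
import numpy as np

def delay_and_sum(signals, delays, fs):
    """Align each channel by its (known) delay and average the channels.

    signals: (n_mics, n_samples) array; delays: per-mic delays in seconds.
    """
    n_mics, n = signals.shape
    out = np.zeros(n)
    for sig, d in zip(signals, delays):
        shift = int(round(d * fs))
        out += np.roll(sig, -shift)   # advance the channel to undo its delay
    return out / n_mics              # averaging attenuates uncorrelated noise

# Toy scene: one source reaching three mics with different delays plus noise.
fs = 8000
t = np.arange(fs) / fs
source = np.sin(2 * np.pi * 300 * t)
delays = [0.0, 0.001, 0.002]
rng = np.random.default_rng(1)
mics = np.stack([np.roll(source, int(d * fs)) + 0.3 * rng.standard_normal(fs)
                 for d in delays])
enhanced = delay_and_sum(mics, delays, fs)
```

Because the source components add coherently and the noise does not, averaging N aligned channels improves SNR by roughly a factor of N, which is why selecting a well-placed subarray can beat a larger but poorly placed full array.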
Abstract:
This thesis investigates aspects of encoding the speech spectrum at low bit rates, with extensions to the effect of such coding on automatic speaker identification. Vector quantization (VQ) is a technique for jointly quantizing a block of samples at once, in order to reduce the bit rate of a coding system. The major drawback in using VQ is the complexity of the encoder. Recent research has indicated the potential applicability of the VQ method to speech when product code vector quantization (PCVQ) techniques are utilized. The focus of this research is the efficient representation, calculation and utilization of the speech model as stored in the PCVQ codebook. In this thesis, several VQ approaches are evaluated, and the efficacy of two training algorithms is compared experimentally. It is then shown that these product-code vector quantization algorithms may be augmented with lossless compression algorithms, thus yielding an improved overall compression rate. An approach using a statistical model for the vector codebook indices for subsequent lossless compression is introduced. This coupling of lossy compression and lossless compression enables further compression gain. It is demonstrated that this approach is able to reduce the bit rate requirement from the current 24 bits per 20 millisecond frame to below 20, using a standard spectral distortion metric for comparison. Several fast-search VQ methods for use in speech spectrum coding have been evaluated. The usefulness of fast-search algorithms is highly dependent upon the source characteristics and, although previous research has been undertaken for coding of images using VQ codebooks trained with the source samples directly, the product-code structured codebooks for speech spectrum quantization place new constraints on the search methodology. The second major focus of the research is an investigation of the effect of low-rate spectral compression methods on the task of automatic speaker identification.
The motivation for this aspect of the research arose from a need to simultaneously preserve the speech quality and intelligibility and to provide for machine-based automatic speaker recognition using the compressed speech. This is important because there are several emerging applications of speaker identification where compressed speech is involved. Examples include mobile communications where the speech has been highly compressed, or where a database of speech material has been assembled and stored in compressed form. Although these two application areas have the same objective - that of maximizing the identification rate - the starting points are quite different. On the one hand, the speech material used for training the identification algorithm may or may not be available in compressed form. On the other hand, the new test material on which identification is to be based may only be available in compressed form. Using the spectral parameters which have been stored in compressed form, two main classes of speaker identification algorithm are examined. Some studies have been conducted in the past on bandwidth-limited speaker identification, but the use of short-term spectral compression deserves separate investigation. Combining the major aspects of the research, some important design guidelines for the construction of an identification model based on the use of compressed speech are put forward.
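Codebook training is the core of any VQ scheme like the PCVQ systems described above. The following is a minimal sketch using plain k-means as a stand-in for the structured training algorithms the thesis compares; the toy 4-dimensional "spectral parameter" vectors and the simple strided initialization are invented for illustration:

```python
import numpy as np

def train_codebook(vectors, n_code, iters=20):
    """Train a VQ codebook by alternating assignment and centroid update."""
    # Spread initial codewords across the training set (a simple stand-in
    # for LBG-style splitting initialization).
    codebook = vectors[:: max(1, len(vectors) // n_code)][:n_code].copy()
    for _ in range(iters):
        # Nearest-codeword assignment under squared-Euclidean distortion.
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        # Centroid update; keep the old codeword if a cell is empty.
        for k in range(n_code):
            if np.any(idx == k):
                codebook[k] = vectors[idx == k].mean(0)
    return codebook, idx

# Toy "spectral parameter" vectors drawn from two well-separated clusters.
rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0, 0.1, (100, 4)),
                       rng.normal(1, 0.1, (100, 4))])
codebook, idx = train_codebook(data, n_code=2)
```

In a product-code scheme, each sub-vector of the spectral parameters gets its own small codebook of this kind, and the transmitted indices (here `idx`) are exactly the symbols whose statistics the thesis's lossless stage models for further compression.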
Abstract:
The availability and use of online counseling approaches has increased rapidly over the last decade. While research has suggested a range of potential affordances and limitations of online counseling modalities, very few studies have offered detailed examinations of how counselors and clients manage asynchronous email counseling exchanges. In this paper we examine email exchanges involving clients and counselors through Kids Helpline, a national Australian counseling service that offers free online, email and telephone counseling for young people up to the age of 25. We employ tools from the traditions of ethnomethodology and conversation analysis to analyze the ways in which counselors from Kids Helpline request that their clients call them, and hence change the modality of their counseling relationship from email to telephone counseling. This paper shows three multi-layered approaches the counselors take in these emails as they negotiate the potentially delicate task of requesting and persuading a client to change the trajectory of their counseling relationship from text to talk without placing that relationship in jeopardy.
Abstract:
The need for special education (SE) is increasing. The majority of those whose problems are due to neurodevelopmental disorders have no specific aetiology. The aim of this study was to evaluate the contribution of prenatal and perinatal factors and factors associated with growth and development to later need for full-time SE and to assess joint structural and volumetric brain alterations among subjects with unexplained, familial need for SE. A random sample of 900 subjects in full-time SE allocated into three levels of neurodevelopmental problems and 301 controls in mainstream education (ME) provided data on socioeconomic factors, pregnancy, delivery, growth, and development. Of those, 119 subjects belonging to a sibling-pair in full-time SE with unexplained aetiology and 43 controls in ME underwent brain magnetic resonance imaging (MRI). Analyses of structural brain alterations and midsagittal area and diameter measurements were made. Voxel-based morphometry (VBM) analysis provided detailed information on regional grey matter, white matter, and cerebrospinal fluid (CSF) volume differences. Father’s age ≥ 40 years, low birth weight, male sex, and lower socio-economic status all increased the probability of SE placement. At age 1 year, a one-standard-deviation decrease in height raised the probability of SE placement by 40%, and a corresponding decrease in head circumference raised it by 28%. In infancy, the gross motor milestones differentiated the children. From age 18 months, the fine motor milestones and those related to speech and social skills became more important. Brain MRI revealed no specific aetiology for subjects in SE. However, they more often had ≥ 3 abnormal findings in MRIs (thin corpus callosum and enlarged cerebral and cerebellar CSF spaces). In VBM, subjects in full-time SE had smaller global white matter, CSF, and total brain volumes than controls.
Compared with controls, subjects with intellectual disabilities had regional volume alterations (greater grey matter volumes in the anterior cingulate cortex bilaterally, smaller grey matter volume in left thalamus and left cerebellar hemisphere, greater white matter volume in the left fronto-parietal region, and smaller white matter volumes bilaterally in the posterior limbs of the internal capsules). In conclusion, the epidemiological studies emphasized several factors that increased the probability of SE placement, useful as a framework for interventional studies. The global and regional brain MRI findings provide an interesting basis for future investigations of learning-related brain structures in young subjects with cognitive impairments or intellectual disabilities of unexplained, familial aetiology.
Abstract:
Why is public health important? An Introduction to Public Health is about the discipline of public health, the nature and scope of public health activity, and the challenges that face public health in the twenty-first century. The book is designed as an introductory text to the principles and practice of public health. This is a complex and multifaceted area. What we have tried to do in this book is make public health easy to understand without making it simplistic. As many authors have stated, public health is essentially about the organised efforts of society to promote, protect and restore the public’s health (Brownson 2011, Last 2001, Schneider 2011, Turnock 2012, Winslow 1920). It is multidisciplinary in nature, and it is influenced by genetic, physical, social, cultural, economic and political determinants of health. How do we define public health, and what are the disciplines that contribute to public health? How has the area changed over time? Are there health issues in the twenty-first century that change the focus and activity of public health? Yes, there are! There are many challenges facing public health now and in the future, just as there have been over the course of the history of organised public health efforts, dating from around 1850 in the Western world. Of what relevance is public health to the many health disciplines that contribute to it? How might an understanding of public health contribute to a range of health professionals who use the principles and practices of public health in their professional activities? These are the questions that this book addresses. An Introduction to Public Health leads the reader on a journey of discovery that concludes with an understanding of the nature and scope of public health and the challenges facing the field into the future. In this edition we have included one new chapter, ‘Public health and social policy’, in order to broaden our understanding of the policy influences on public health.
The book is designed for a range of students undertaking health courses where there is a focus on advancing the health of the population. While it is imperative that people wanting to be public health professionals understand the theory and practice of public health, many other health workers contribute to effective public health practice. The book would also be relevant to a range of undergraduate students who want an introductory understanding of public health and its practice.
Abstract:
The Internet has made possible the cost-effective dissemination of scientific journals in the form of electronic versions, usually in parallel with the printed versions. At the same time the electronic medium also makes possible totally new open access (OA) distribution models, funded by author charges, sponsorship, advertising, voluntary work, etc., where the end product is free in full text to the readers. Although more than 2,000 new OA journals have been founded in the last 15 years, the uptake of open access has been rather slow, with currently around 5% of all peer-reviewed articles published in OA journals. The slow growth can to a large extent be explained by the fact that open access has predominantly emerged via newly founded journals and startup publishers. Established journals and publishers have not had strong enough incentives to change their business models, and the commercial risks in doing so have been high. In this paper we outline and discuss two different scenarios for how scholarly publishers could change their operating model to open access. The first is based on an instantaneous change and the second on a gradual change. We propose a way to manage the gradual change by bundling traditional “big deal” licenses and author charges for opening access to individual articles.
Abstract:
Self-similarity, a concept taken from mathematics, is gradually becoming a keyword in musicology. Although a polysemic term, self-similarity often refers to the multi-scalar feature repetition in a set of relationships, and it is commonly valued as an indication of musical coherence and consistency. This investigation provides a theory of musical meaning formation in the context of intersemiosis, that is, the translation of meaning from one cognitive domain to another cognitive domain (e.g. from mathematics to music, or to speech or graphic forms). From this perspective, the degree of coherence of a musical system relies on a synecdochic intersemiosis: a system of related signs within other comparable and correlated systems. This research analyzes the modalities of such correlations, exploring their general and particular traits, and their operational bounds. Looking forward in this direction, the notion of analogy is used as a rich concept through its two definitions quoted by the Classical literature: proportion and paradigm, enormously valuable in establishing measurement, likeness and affinity criteria. Using quantitative and qualitative methods, evidence is presented to justify a parallel study of different modalities of musical self-similarity. For this purpose, original arguments by Benoît B. Mandelbrot are revised, alongside a systematic critique of the literature on the subject. Furthermore, connecting Charles S. Peirce’s synechism with Mandelbrot’s fractality is one of the main developments of the present study. This study provides elements for explaining Bolognesi’s (1983) conjecture, which states that the most primitive, intuitive and basic musical device is self-reference, extending its functions and operations to self-similar surfaces.
In this sense, this research suggests that, with various modalities of self-similarity, synecdochic intersemiosis acts as a system of systems in coordination with greater or lesser development of structural consistency, and with greater or lesser contextual dependence.
Abstract:
In the field of second language (L2) acquisition, the term ‘foreign accent’ is often used to refer to speech characteristics that differ from the pronunciation of native speakers. Foreign accent may affect the intelligibility and perceived comprehensibility of speech, and it is also sometimes associated with negative attitudes. The degree of L2 learners’ foreign accent and the speech characteristics that account for it have previously been studied through speech perception experiments and acoustic measurements. Perception experiments have shown that native listeners are easily able to identify foreign accent in speech. However, to date, no studies have been done on the assessment of foreign accent in the speech of non-native speakers of Finnish. The aim of this study is to examine how native speakers of Finnish rate the degree of foreign accentedness in the speech of Russian L2 learners of Finnish. Furthermore, phonetic analysis is used to study the characteristics of speech that affect the perceived strength of foreign accent. Altogether 96 native speakers of Finnish listened to excerpts of read-aloud and spontaneous Finnish speech from ten Russian and six Finnish female speakers. The Russian speakers were intermediate and advanced learners of Finnish and had all immigrated to Finland as adults. Among the listeners was a group of teachers of Finnish as an L2, and it was presumed that these teachers had been exposed to foreign accent in Finnish and were used to hearing it. The temporal aspects and segmental properties of speech were phonetically analysed in the speech of the Russian speakers in order to measure their effect on the perceived degree of accent. Although wide differences were observed in the use of the rating scale among the listeners, they were still quite unanimous on which speakers had the strongest foreign accent and which had the mildest.
The listeners’ background factors had little effect on their ratings, and the ratings of the teachers of Finnish as an L2 did not differ from those of the other listeners. However, a clear difference was noted in the ratings of the two types of stimuli used in the perception experiment: the read-aloud speech was rated as more strongly accented than the spontaneous speech. It is important to note that the assessment of foreign accent is affected by many factors and their complex interactions in the experimental setting. Further, the study found that both the temporal aspects of speech, often associated with fluency, and the number of single deviant phonetic segments contributed to the perceived degree of accentedness in the speech of the native Russian speakers.
Abstract:
We formulate a two-stage Iterative Wiener filtering (IWF) approach to speech enhancement, improving on the performance of constrained IWF reported in the literature. The codebook-constrained IWF (CCIWF) has been shown to be effective in achieving convergence of IWF in the presence of both stationary and non-stationary noise. To this we add a second stage of unconstrained IWF and show that the speech enhancement performance can be improved in terms of average segmental SNR (SSNR), Itakura-Saito (IS) distance and Linear Prediction Coefficient (LPC) parameter coincidence. We also explore the tradeoff between the number of CCIWF iterations and the number of second-stage IWF iterations.
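The unconstrained second stage can be illustrated with a frequency-domain sketch in which the Wiener gain H = S / (S + N) is reapplied with a re-estimated speech power spectrum on each pass. This toy omits the LPC/codebook machinery of CCIWF entirely, and the power spectral densities below are invented for illustration:

```python
import numpy as np

def iterative_wiener(noisy_psd, noise_psd, iters=3):
    """Unconstrained iterative Wiener filtering on a power spectrum.

    Each pass re-estimates the clean-speech PSD from the previous
    output and reapplies the Wiener gain H = S / (S + N).
    """
    # Initial clean-speech PSD estimate by simple power subtraction.
    speech_psd = np.maximum(noisy_psd - noise_psd, 1e-10)
    for _ in range(iters):
        gain = speech_psd / (speech_psd + noise_psd)
        speech_psd = gain ** 2 * noisy_psd   # filtered power as new estimate
    return gain

# Toy PSDs: "speech" energy concentrated in a few bins, flat noise floor.
true_speech = np.array([4.0, 9.0, 1.0, 0.1, 0.05])
noise = np.full(5, 0.5)
noisy = true_speech + noise
g = iterative_wiener(noisy, noise)
```

Iterating sharpens the gain toward 1 in speech-dominated bins and toward 0 in noise-dominated bins; the constraint in CCIWF exists precisely to keep this re-estimation from drifting away from valid speech spectra, which is why the paper applies the unconstrained stage only after CCIWF has converged.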