57 results for Parties of speech
Abstract:
Acoustically, car cabins are extremely noisy and, as a consequence, existing audio-only speech recognition systems, used for voice-based control of vehicle functions such as the GPS-based navigator, perform poorly. Audio-only speech recognition systems fail to make use of the visual modality of speech (e.g., lip movements). As the visual modality is immune to acoustic noise, utilising this visual information in conjunction with an audio-only speech recognition system has the potential to improve the accuracy of the system. The field of recognising speech using both auditory and visual inputs is known as Audio Visual Speech Recognition (AVSR). Continuous research in the AVSR field has been ongoing for the past twenty-five years, with notable progress being made. However, the practical deployment of AVSR systems in a variety of real-world applications has not yet emerged. The main reason is that most research to date has neglected to address variabilities in the visual domain, such as illumination and viewpoint, in the design of the visual front-end of the AVSR system. In this paper we present an AVSR system for a real-world car environment using the AVICAR database [1], a publicly available in-car database, and we show that using visual speech in conjunction with the audio modality is a better approach for improving the robustness and effectiveness of voice-only recognition systems in car cabin environments.
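The abstract does not specify a fusion scheme; as a minimal illustration of why a noise-immune visual stream helps, here is a sketch of simple weighted late fusion of per-word audio and visual log-likelihoods. The weights, scores and words are invented for the example.

```python
def late_fusion_score(audio_loglik, visual_loglik, alpha=0.7):
    """Weighted combination of per-stream log-likelihoods; alpha is the
    audio stream weight (a tunable assumption, not a value from the paper)."""
    return alpha * audio_loglik + (1.0 - alpha) * visual_loglik

def decode(word_scores, alpha):
    """Pick the word whose fused score is highest.
    word_scores maps word -> (audio_loglik, visual_loglik)."""
    return max(word_scores,
               key=lambda w: late_fusion_score(*word_scores[w], alpha=alpha))

# Toy scores: cabin noise makes the audio stream favour the wrong word,
# while the noise-immune visual stream still favours "left".
scores = {"left": (-12.0, -2.0), "right": (-10.0, -9.0)}
print(decode(scores, alpha=0.9))  # audio-dominant fusion picks "right"
print(decode(scores, alpha=0.4))  # visually weighted fusion picks "left"
```

In practice the audio weight would be lowered as the estimated acoustic noise level rises.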
Abstract:
In recent times, the improved levels of accuracy obtained by Automatic Speech Recognition (ASR) technology have made it viable for use in a number of commercial products. Unfortunately, these applications are limited to only a few of the world's languages, primarily because ASR development relies on the availability of large amounts of language-specific resources. This motivates the need for techniques which reduce this language-specific resource dependency. Ideally, these approaches should generalise across languages, thereby providing scope for the rapid creation of ASR capabilities for resource-poor languages. Cross-lingual ASR emerges as a means of addressing this need. Underpinning this approach is the observation that sound production is largely influenced by the physiological construction of the vocal tract and, accordingly, is human- rather than language-specific. As a result, a common inventory of sounds exists across languages; a property which is exploitable, as sounds from a resource-poor target language can be recognised using models trained on resource-rich source languages. One of the initial impediments to the commercial uptake of ASR technology was its fragility in more challenging environments, such as conversational telephone speech. Subsequent improvements in these environments have gained consumer confidence. Pragmatically, if cross-lingual techniques are to be considered a viable alternative when resources are limited, they need to perform under the same types of conditions. Accordingly, this thesis evaluates cross-lingual techniques in two speech environments: clean read speech and conversational telephone speech. The languages used in the evaluations are German, Mandarin, Japanese and Spanish. Results highlight that previously proposed approaches provide respectable results in simpler environments such as read speech, but degrade significantly in the more taxing conversational environment.
Two separate approaches for addressing this degradation are proposed. The first is based on deriving a better lexical representation of the target language in terms of the source language model set. The second, and ultimately more successful, approach focuses on improving the classification accuracy of context-dependent (CD) models by catering for the adverse influence of language-specific phonotactic properties. Whilst the primary research goal in this thesis is directed towards improving cross-lingual techniques, the catalyst for this investigation was expressed interest from several organisations in an Indonesian ASR capability. In Indonesia alone there are over 200 million speakers of some Malay variant, which provides further impetus and commercial justification for speech-related research on this language. Unfortunately, at the beginning of the candidature, limited research had been conducted on the Indonesian language in the field of speech science, and virtually no resources existed. This thesis details the investigative and development work dedicated towards obtaining an ASR system with a 10,000-word recognition vocabulary for the Indonesian language.
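The cross-lingual idea described above, recognising target-language sounds with source-language models, can be sketched in a few lines: each target phone is mapped onto a source-language phone inventory when building the lexicon. The phone symbols and the mapping below are hypothetical, chosen only for illustration.

```python
# Hypothetical phone mapping for illustration: the Indonesian palatal nasal
# has no dedicated model, so it is mapped to an assumed closest source phone.
PHONE_MAP = {
    "ny": "n",  # assumed nearest source-language phone
    "a": "a", "t": "t", "s": "s", "u": "u",
}

def cross_lingual_lexicon_entry(target_phones, phone_map):
    """Rewrite a target-language pronunciation in terms of source-language
    models; phones with no mapping fall back to themselves."""
    return [phone_map.get(p, p) for p in target_phones]

# "nyata" (Indonesian for "real"), transcribed with illustrative phones:
print(cross_lingual_lexicon_entry(["ny", "a", "t", "a"], PHONE_MAP))
# ['n', 'a', 't', 'a']
```

Data-driven approaches would derive such a mapping from acoustic model distances rather than hand-crafting it.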
Abstract:
This thesis investigates aspects of encoding the speech spectrum at low bit rates, with extensions to the effect of such coding on automatic speaker identification. Vector quantization (VQ) is a technique for jointly quantizing a block of samples in order to reduce the bit rate of a coding system. The major drawback in using VQ is the complexity of the encoder. Recent research has indicated the potential applicability of the VQ method to speech when product code vector quantization (PCVQ) techniques are utilized. The focus of this research is the efficient representation, calculation and utilization of the speech model as stored in the PCVQ codebook. In this thesis, several VQ approaches are evaluated, and the efficacy of two training algorithms is compared experimentally. It is then shown that these product-code vector quantization algorithms may be augmented with lossless compression algorithms, thus yielding an improved overall compression rate. An approach using a statistical model of the vector codebook indices for subsequent lossless compression is introduced. This coupling of lossy and lossless compression enables further compression gain. It is demonstrated that this approach is able to reduce the bit rate requirement from the current 24 bits per 20-millisecond frame to below 20, using a standard spectral distortion metric for comparison. Several fast-search VQ methods for use in speech spectrum coding have been evaluated. The usefulness of fast-search algorithms is highly dependent upon the source characteristics and, although previous research has been undertaken for coding of images using VQ codebooks trained with the source samples directly, the product-code structured codebooks for speech spectrum quantization place new constraints on the search methodology. The second major focus of the research is an investigation of the effect of low-rate spectral compression methods on the task of automatic speaker identification.
The motivation for this aspect of the research arose from a need to simultaneously preserve the speech quality and intelligibility and to provide for machine-based automatic speaker recognition using the compressed speech. This is important because there are several emerging applications of speaker identification where compressed speech is involved. Examples include mobile communications where the speech has been highly compressed, or where a database of speech material has been assembled and stored in compressed form. Although these two application areas have the same objective - that of maximizing the identification rate - the starting points are quite different. On the one hand, the speech material used for training the identification algorithm may or may not be available in compressed form. On the other hand, the new test material on which identification is to be based may only be available in compressed form. Using the spectral parameters which have been stored in compressed form, two main classes of speaker identification algorithm are examined. Some studies have been conducted in the past on bandwidth-limited speaker identification, but the use of short-term spectral compression deserves separate investigation. Combining the major aspects of the research, some important design guidelines for the construction of an identification model when based on the use of compressed speech are put forward.
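The coupling of lossy VQ with lossless compression of the codebook indices rests on the index stream being statistically non-uniform. A quick way to see the available gain is the empirical entropy of the indices; this sketch uses toy data and first-order statistics only, whereas the thesis uses a proper statistical model of the indices.

```python
import math
from collections import Counter

def index_entropy_bits(indices):
    """First-order empirical entropy (bits per symbol) of a VQ index stream.
    When this falls below log2(codebook size), entropy coding the indices
    losslessly recovers the difference."""
    counts = Counter(indices)
    n = len(indices)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy stream over a 4-entry codebook: raw cost is 2 bits per index,
# but the skewed distribution needs only 1.75 bits per index.
stream = [0, 0, 0, 0, 1, 1, 2, 3]
print(index_entropy_bits(stream))  # 1.75
```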
Abstract:
This thesis presents an original approach to parametric speech coding at rates below 1 kbit/sec, primarily for speech storage applications. Essential processes considered in this research encompass efficient characterization of the evolving configuration of the vocal tract to follow phonemic features with high fidelity, representation of speech excitation using minimal parameters with minor degradation in the naturalness of synthesized speech, and finally, quantization of the resulting parameters at the nominated rates. For encoding speech spectral features, a new method relying on Temporal Decomposition (TD) is developed which efficiently compresses spectral information through interpolation between the most steady points on the time trajectories of spectral parameters, using a new basis function. The compression ratio provided by the method is independent of the updating rate of the feature vectors, and hence allows high resolution in tracking significant temporal variations of speech formants with no effect on the spectral data rate. Accordingly, regardless of the quantization technique employed, the method yields a high compression ratio without sacrificing speech intelligibility. Several new techniques for improving the performance of the interpolation of spectral parameters through phonetically-based analysis are proposed and implemented in this research, comprising event-approximated TD, near-optimal shaping of event-approximating functions, efficient speech parametrization for TD on the basis of an extensive investigation originally reported in this thesis, and a hierarchical error minimization algorithm for decomposition of feature parameters which significantly reduces the complexity of the interpolation process.
Speech excitation in this work is characterized using a novel Multi-Band Excitation paradigm which accurately determines the harmonic structure in the LPC (linear predictive coding) residual spectra, within individual bands, using the concept of Instantaneous Frequency (IF) estimation in the frequency domain. The model yields an effective two-band approximation to excitation and computes pitch and voicing with high accuracy as well. New methods for interpolative coding of pitch and gain contours are also developed in this thesis. For pitch, relying on the correlation between phonetic evolution and pitch variations during voiced speech segments, TD is employed to interpolate the pitch contour between critical points introduced by event centroids. This compresses the pitch contour by a ratio of about 1/10 with negligible error. To approximate the gain contour, a set of uniformly-distributed Gaussian event-like functions is used, which reduces the amount of gain information to about 1/6 with acceptable accuracy. The thesis also addresses a new quantization method applied to spectral features on the basis of the statistical properties and spectral sensitivity of the spectral parameters extracted from TD-based analysis. The experimental results show that good quality speech, comparable to that of conventional coders at rates over 2 kbits/sec, can be achieved at rates of 650-990 bits/sec.
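As a rough illustration of the gain-contour approximation described above, the sketch below spreads uniformly spaced Gaussian event-like functions over the contour and reconstructs it by normalised weighting. The event amplitudes here are simply sampled at the event centres rather than fitted as in the thesis, and the widths, counts and test contour are assumptions, so this only shows the shape of the idea: 60 gain frames reduced to 10 event amplitudes, roughly the 1/6 figure quoted.

```python
import math

def gaussian_events(length, num_events, width):
    """Uniformly spaced Gaussian event-like functions over the frame axis.
    Spacing and width are assumptions for this sketch."""
    centres = [(k + 0.5) * length / num_events for k in range(num_events)]
    basis = [[math.exp(-0.5 * ((t - c) / width) ** 2) for t in range(length)]
             for c in centres]
    return centres, basis

def approximate_gain(gain, num_events, width):
    """Reconstruct the contour as a normalised weighting of the events.
    Amplitudes are sampled at the event centres; a least-squares fit,
    as in the thesis, would be more accurate."""
    centres, basis = gaussian_events(len(gain), num_events, width)
    amps = [gain[min(len(gain) - 1, int(c))] for c in centres]
    approx = []
    for t in range(len(gain)):
        weights = [b[t] for b in basis]
        approx.append(sum(a * w for a, w in zip(amps, weights)) / sum(weights))
    return approx

# Invented sinusoidal gain contour: 60 frames stored as 10 event amplitudes.
gain = [1.0 + 0.5 * math.sin(2 * math.pi * t / 30) for t in range(60)]
approx = approximate_gain(gain, num_events=10, width=3.0)
mean_err = sum(abs(g - a) for g, a in zip(gain, approx)) / len(gain)
print(round(mean_err, 3))
```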
Abstract:
Keyword Spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability. This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting. The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high-level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, in particular for the problematic short-duration target word class. The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains significant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds. The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.
The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.
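The cohort idea can be caricatured in a few lines: instead of thresholding the keyword's raw score, the score is normalised against the best-scoring cohort (confusable) word, which is what rescues the short-duration target word class. The scores, threshold and scenario below are invented for illustration and do not reflect the thesis's actual verifiers.

```python
def cohort_score(target_loglik, cohort_logliks):
    """Cohort-style normalisation (a sketch): the putative keyword's score
    minus the best-scoring cohort word's score."""
    return target_loglik - max(cohort_logliks)

def verify(target_loglik, cohort_logliks, threshold=1.0):
    """Accept the detection only if it clearly beats its cohort.
    The threshold is an invented operating point."""
    return cohort_score(target_loglik, cohort_logliks) > threshold

# A short word often scores almost as well as a confusable neighbour;
# cohort normalisation rejects it unless the margin is clear.
print(verify(-50.0, [-50.5, -62.0]))  # False: margin 0.5, too confusable
print(verify(-50.0, [-58.0, -62.0]))  # True: margin 8.0 over the cohort
```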
Abstract:
The internet by its very nature challenges an individual’s notions of propriety, moral acuity and social correctness. A tension will always exist between the censorship of obscene and sensitive information and the freedom to publish and/or access such information. Freedom of expression and communication on the internet is not a static concept: ‘Its continual regeneration is the product of particular combinations of political, legal, cultural and philosophical conditions’.
Abstract:
The use of visual features in the form of lip movements to improve the performance of acoustic speech recognition has been shown to work well, particularly in noisy acoustic conditions. However, whether this technique can outperform speech recognition incorporating well-known acoustic enhancement techniques, such as spectral subtraction, or multi-channel beamforming is not known. This is an important question to be answered especially in an automotive environment, for the design of an efficient human-vehicle computer interface. We perform a variety of speech recognition experiments on a challenging automotive speech dataset and results show that synchronous HMM-based audio-visual fusion can outperform traditional single as well as multi-channel acoustic speech enhancement techniques. We also show that further improvement in recognition performance can be obtained by fusing speech-enhanced audio with the visual modality, demonstrating the complementary nature of the two robust speech recognition approaches.
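For reference, one of the acoustic enhancement baselines mentioned above, spectral subtraction, can be sketched in its simplest magnitude-domain form. The magnitudes and the floor factor below are illustrative assumptions, not values from the paper.

```python
def spectral_subtraction(noisy_mag, noise_mag, floor=0.05):
    """Magnitude-domain spectral subtraction in its simplest form: subtract
    a noise magnitude estimate bin by bin, flooring the result at a small
    fraction of the noisy magnitude so no bin goes negative."""
    return [max(x - d, floor * x) for x, d in zip(noisy_mag, noise_mag)]

noisy = [1.0, 0.8, 0.3, 0.2]    # |X(f)|, toy values for four bins
noise = [0.2, 0.2, 0.25, 0.25]  # noise estimate from non-speech frames
print(spectral_subtraction(noisy, noise))
```

The spectral floor is the standard remedy for the "musical noise" that raw subtraction would otherwise produce in low-energy bins.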
Abstract:
According to the diagnosis of schizophrenia in the DSM-IV-TR (American Psychiatric Association, 2000), negative symptoms are those personal characteristics that are thought to be reduced from normal functioning, while positive symptoms are aspects of functioning that exist as an excess or distortion of normal functioning. Negative symptoms are generally considered to be a core feature of people diagnosed with schizophrenia. However, negative symptoms are not always present in those diagnosed, and a diagnosis can be made with only negative or only positive symptoms, or with a combination of both. Negative symptoms include an observed loss of emotional expression (affective flattening), loss of motivation or self-directedness (avolition), loss of speech (alogia), and also a loss of interests and pleasures (anhedonia). Positive symptoms include the perception of things that others do not perceive (hallucinations), and extraordinary explanations for ordinary events (delusions) (American Psychiatric Association, 2000). Both negative and positive symptoms are derived from watching the patient and thus do not consider the patient’s subjective experience. However, aspects of negative symptoms, such as observed affective flattening, are highly contended. Within conventional psychiatry, the absence of emotional expression is assumed to coincide with an absence of emotional experience. Contrasting research findings suggest that patients who were observed to score low on displayed emotional expression scored high on self-ratings of emotional experience. Patients were also observed to be significantly lower on emotional expression when compared with others (Aghevli, Blanchard, & Horan, 2003; Selton, van der Bosch, & Sijben, 1998). It appears that there is little correlation between emotional experience and emotional expression in patients, and that observer ratings cannot help us to understand the subjective experience of the negative symptoms.
This chapter will focus on research into the subjective experiences of negative symptoms. A framework for these experiences will be used from the qualitative research findings of the primary author (Le Lievre, 2010). In this study, the primary author found that subjective experiences of the negative symptoms belonged to one of the two phases of the illness experience: “transitioning into emotional shutdown” or “recovering from emotional shutdown”. This chapter will use the six themes from the phase of “transitioning into emotional shutdown”. This phase described the experience of turning the focus of attention away from the world and onto the self and the past, thus losing contact with the world and others (emotional shutdown). Transitioning into emotional shutdown involved: “not being acknowledged”, “relational confusion”, “not being expressive”, “reliving the past”, “detachment”, and “no sense of direction” (Le Lievre, 2010). Detail will be added to this framework of experience from other qualitative research in this area. We will now review the six themes that constitute a “transition into emotional shutdown” and corresponding previous research findings.
Abstract:
This paper demonstrates, following Vygotsky, that language and tool use has a critical role in the collaborative problem-solving behaviour of school-age children. It reports original ethnographic classroom research examining the convergence of speech and practical activity in children’s collaborative problem solving with robotics programming tasks. The researchers analysed children’s interactions during a series of problem solving experiments in which Lego Mindstorms toolsets were used by teachers to create robotics design challenges among 24 students in a Year 4 Australian classroom (students aged 8.5–9.5 years). The design challenges were incrementally difficult, beginning with basic programming of straight line movement, and progressing to more complex challenges involving programming of the robots to raise Lego figures from conduit pipes using robots as pulleys with string and recycled materials. Data collection involved micro-genetic analysis of students’ speech interactions with tools, peers, and other experts, teacher interviews, and student focus group data. Coding the repeated patterns in the transcripts, the authors outline the structure of the children’s social speech in joint problem solving, demonstrating the patterns of speech and interaction that play an important role in the socialisation of the school-age child’s practical intellect.
Abstract:
Background Recent initiatives within an Australian public healthcare service have seen a focus on increasing the research capacity of its workforce. One of the key initiatives involves encouraging clinicians to be research generators rather than solely research consumers. As a result, baseline data on current research capacity are essential to determine whether initiatives encouraging clinicians to undertake research have been effective. Speech pathologists have previously been shown to be interested in conducting research within their clinical role; therefore they are well positioned to benefit from such initiatives. The present study examined the current research interest, confidence and experience of speech-language pathologists (SLPs) in a public healthcare workforce, as well as factors that predicted clinician research engagement. Methods Data were collected via an online survey emailed to an estimated 330 SLPs working within Queensland, Australia. The survey consisted of 30 questions relating to current levels of interest, confidence and experience performing specific research tasks, as well as how frequently SLPs had performed these tasks in the last 5 years. Results Although 158 SLPs responded to the survey, complete data were available for only 137. Respondents were more confident and experienced with basic research tasks (e.g., finding literature) and less confident and experienced with complex research tasks (e.g., analysing and interpreting results, publishing results). For most tasks, SLPs displayed higher levels of interest in the task than confidence and experience. Research engagement was predicted by highest qualification obtained, current job classification level and overall interest in research. Conclusions Respondents generally reported levels of interest in research higher than their confidence and experience, with many respondents reporting limited experience in most research tasks.
Therefore, SLPs have the potential to benefit from research capacity building activities that increase their research skills and so meet organisational research engagement objectives. However, these findings must be interpreted with the caveats that the response rate was relatively low and that participants were recruited from a single state-wide health service; they may therefore not be representative of the wider SLP workforce.
Abstract:
Oscillatory entrainment to the speech signal is important for language processing, but has not yet been studied in developmental disorders of language. Developmental dyslexia, a difficulty in acquiring efficient reading skills linked to difficulties with phonology (the sound structure of language), has been associated with behavioural entrainment deficits. It has been proposed that the phonological ‘deficit’ that characterises dyslexia across languages is related to impaired auditory entrainment to speech at lower frequencies via neuroelectric oscillations (<10 Hz, ‘temporal sampling theory’). Impaired entrainment to temporal modulations at lower frequencies would affect the recovery of the prosodic and syllabic structure of speech. Here we investigated event-related oscillatory EEG activity and the contingent negative variation (CNV) to auditory rhythmic tone streams delivered at frequencies within the delta band (2 Hz, 1.5 Hz), relevant to sampling stressed syllables in speech. Given prior behavioural entrainment findings at these rates, we predicted functionally atypical entrainment of delta oscillations in dyslexia. Participants performed a rhythmic expectancy task, detecting occasional white noise targets interspersed with tones occurring regularly at rates of 2 Hz or 1.5 Hz. Both groups showed significant entrainment of delta oscillations to the rhythmic stimulus stream; however, the strength of inter-trial delta phase coherence (ITC, ‘phase locking’) and the CNV were both significantly weaker in dyslexics, suggestive of weaker entrainment and less preparatory brain activity. Both ITC strength and CNV amplitude were significantly related to individual differences in language processing and reading. Additionally, the instantaneous phase of the prestimulus delta oscillation predicted behavioural responding (response time) for control participants only.
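Inter-trial phase coherence, the 'phase locking' measure reported above, is the magnitude of the mean unit phasor of the per-trial phases at the stimulation frequency. A minimal sketch with invented phase values:

```python
import cmath
import math

def itc(phases):
    """Inter-trial phase coherence: magnitude of the mean unit phasor across
    trials. 1 means perfectly phase-locked; near 0 means random phases."""
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))

tight = [0.1, -0.05, 0.08, 0.0]                        # clustered phases
spread = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]  # uniformly spread
print(itc(tight) > 0.99, itc(spread) < 1e-9)  # True True
```

In EEG practice the phases come from a time-frequency decomposition at the delta stimulation rate, one phase per trial at each time point.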
Abstract:
The support for typically out-of-vocabulary query terms such as names, acronyms, and foreign words is an important requirement of many speech indexing applications. However, to date many unrestricted vocabulary indexing systems have struggled to provide a balance between good detection rates and fast query speeds. This paper presents a fast and accurate unrestricted vocabulary speech indexing technique named Dynamic Match Lattice Spotting (DMLS). The proposed method augments the conventional lattice spotting technique with dynamic sequence matching, together with a number of other novel algorithmic enhancements, to obtain a system that is capable of searching hours of speech in seconds while maintaining excellent detection performance.
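The core of the dynamic-matching idea can be sketched as follows: a query phone sequence is accepted if some sequence read from the lattice lies within a small edit distance of it, so individual recognition errors no longer cause misses. The phone symbols, lattice contents, uniform costs and threshold are illustrative assumptions; DMLS itself uses richer costs and searches actual lattices rather than flat lists.

```python
def min_edit_distance(a, b):
    """Standard Levenshtein distance between two phone sequences,
    with unit insertion/deletion/substitution costs (an assumption)."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,       # deletion
                                     dp[j - 1] + 1,   # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]

def dynamic_match_search(query, lattice_sequences, max_cost=1):
    """Accept lattice phone sequences within max_cost edits of the query,
    so erroneous lattice realisations can still be detected."""
    return [s for s in lattice_sequences
            if min_edit_distance(query, s) <= max_cost]

query = ["k", "ae", "t"]  # target: "cat"
lattice = [["k", "ae", "t"], ["k", "eh", "t"], ["b", "ae", "d"]]
print(dynamic_match_search(query, lattice))  # exact hit plus one near-miss
```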
Abstract:
Design process phases of development, evaluation and implementation were used to create a garment that simultaneously collects reliable data on the speech production and movement intensity of toddlers (18-36 months). A series of prototypes that housed accelerometer-based motion sensors and a digital transmitter with microphone were developed and evaluated. The approved test garment was a top constructed from loop-faced fabric with interior pockets to house the devices. Extended side panels allowed for sizing. In total, 56 toddlers (28 male; 28 female; 16-36 months of age) participated in the study, providing pilot and baseline data. The test garment was effective in collecting data, as evaluated for accuracy and reliability using ANOVA for accelerometer data, transcription of video for type of movement, and the number and length of utterances for speech production. The data collection garment has been implemented in various studies across disciplines.
Abstract:
Spoken word production is assumed to involve stages of processing in which activation spreads through layers of units comprising lexical-conceptual knowledge and their corresponding phonological word forms. Using high-field (4T) functional magnetic resonance imaging (fMRI), we assessed whether the relationship between these stages is strictly serial or involves cascaded-interactive processing, and whether central (decision/control) processing mechanisms are involved in lexical selection. Participants performed the competitor priming paradigm in which distractor words, named from a definition and semantically related to a subsequently presented target picture, slow picture-naming latency compared to that with unrelated words. The paradigm intersperses two trials between the definition and the picture to be named, temporally separating activation in the word perception and production networks. Priming semantic competitors of target picture names significantly increased activation in the left posterior temporal cortex, and to a lesser extent the left middle temporal cortex, consistent with the predictions of cascaded-interactive models of lexical access. In addition, extensive activation was detected in the anterior cingulate and pars orbitalis of the inferior frontal gyrus. The findings indicate that lexical selection during competitor priming is biased by top-down mechanisms to reverse associations between primed distractor words and target pictures, so as to select words that meet the current goal of speech.
Abstract:
This chapter considers the legal ramifications of Wikipedia, and other online media, such as the Encyclopedia of Life. Nathaniel Tkacz (2007) has observed: 'Wikipedia is an ideal entry-point from which to approach the shifting character of knowledge in contemporary society.' He observes: 'Scholarship on Wikipedia from computer science, history, philosophy, pedagogy and media studies has moved beyond speculation regarding its considerable potential, to the task of interpreting - and potentially intervening in - the significance of Wikipedia's impact' (Tkacz 2007). After an introduction, Part II considers the evolution and development of Wikipedia, and the legal troubles that have attended it. It also considers the establishment of rival online encyclopedias - such as Citizendium, set up by Larry Sanger, the co-founder of Wikipedia; and Knol, the mysterious new project of Google. Part III explores the use of mass, collaborative authorship in the field of science. In particular, it looks at the development of the Encyclopedia of Life, which seeks to document the world's biodiversity. This chapter expresses concern that Wiki-based software had to develop in a largely hostile and inimical legal environment. It contends that copyright law and related fields of intellectual property need to be reformed in order to better accommodate users of copyright material (Rimmer 2007). This chapter makes a number of recommendations. First, there is a need to acknowledge and recognize forms of mass, collaborative production and consumption - not just individual authorship. Second, the view of a copyright 'work' and other subject matter as a complete and closed piece of cultural production also should be reconceptualised. Third, the defense of fair use should be expanded to accommodate a wide range of amateur, peer-to-peer production activities - not only in the United States, but in other jurisdictions as well.
Fourth, the safe harbor protections accorded to Internet intermediaries, such as Wikipedia, should be strengthened. Fifth, there should be a defense in respect of the use of 'orphan works' - especially in cases of large-scale digitization. Sixth, the innovations of open source licensing should be expressly incorporated and entrenched within the formal framework of copyright laws. Finally, courts should craft judicial remedies to take into account concerns about political censorship and freedom of speech.