5 results for speech signals
at Massachusetts Institute of Technology
Abstract:
This work addresses two related questions. The first question is what joint time-frequency energy representations are most appropriate for auditory signals, in particular, for speech signals in sonorant regions. The quadratic transforms of the signal are examined, a large class that includes, for example, the spectrograms and the Wigner distribution. Quasi-stationarity is not assumed, since this would neglect dynamic regions. A set of desired properties is proposed for the representation: (1) shift-invariance, (2) positivity, (3) superposition, (4) locality, and (5) smoothness. Several relations among these properties are proved: shift-invariance and positivity imply the transform is a superposition of spectrograms; positivity and superposition are equivalent conditions when the transform is real; positivity limits the simultaneous time and frequency resolution (locality) possible for the transform, defining an uncertainty relation for joint time-frequency energy representations; and locality and smoothness trade off by the 2-D generalization of the classical uncertainty relation. The transform that best meets these criteria is derived, which consists of two-dimensionally smoothed Wigner distributions with (possibly oriented) 2-D Gaussian kernels. These transforms are then related to time-frequency filtering, a method for estimating the time-varying 'transfer function' of the vocal tract, which is somewhat analogous to cepstral filtering generalized to the time-varying case. Natural speech examples are provided. The second question addressed is how to obtain a rich, symbolic description of the phonetically relevant features in these time-frequency energy surfaces, the so-called schematic spectrogram. Time-frequency ridges, the 2-D analog of spectral peaks, are one feature that is proposed. 
If non-oriented kernels are used for the energy representation, then the ridge tops can be identified with zero-crossings in the inner product of the gradient vector and the direction of greatest downward curvature. If oriented kernels are used, the method can be generalized to give better orientation selectivity (e.g., at intersecting ridges) at the cost of poorer time-frequency locality. Many speech examples are given showing the performance for some traditionally difficult cases: semi-vowels and glides, nasalized vowels, consonant-vowel transitions, female speech, and imperfect transmission channels.
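The ridge-top criterion described in this abstract (zero-crossings of the inner product between the gradient and the direction of greatest downward curvature) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the author's implementation: the surface is a plain grid `E[i][j]` (time index `i`, frequency index `j`), derivatives are central finite differences, the direction of greatest downward curvature is taken as the Hessian eigenvector with the most negative eigenvalue, and the names `min_curvature_eigenpair` and `ridge_tops` are hypothetical.

```python
import math

def min_curvature_eigenpair(hxx, hxy, hyy):
    """Smallest eigenvalue of a symmetric 2x2 Hessian and its unit eigenvector."""
    mean = 0.5 * (hxx + hyy)
    root = math.hypot(0.5 * (hxx - hyy), hxy)
    lam = mean - root
    vx, vy = hxy, lam - hxx            # solves (H - lam*I) v = 0
    if vx == 0.0 and vy == 0.0:
        vx, vy = lam - hyy, hxy        # use the other row of (H - lam*I)
    if vx == 0.0 and vy == 0.0:
        vx, vy = 1.0, 0.0              # isotropic curvature: any direction works
    n = math.hypot(vx, vy)
    return lam, vx / n, vy / n

def ridge_tops(E):
    """Return neighbour pairs ((i, j), (k, l)) between which a ridge top lies."""
    rows, cols = len(E), len(E[0])
    info = {}
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Central-difference gradient and Hessian of the energy surface.
            gx = 0.5 * (E[i + 1][j] - E[i - 1][j])
            gy = 0.5 * (E[i][j + 1] - E[i][j - 1])
            hxx = E[i + 1][j] - 2.0 * E[i][j] + E[i - 1][j]
            hyy = E[i][j + 1] - 2.0 * E[i][j] + E[i][j - 1]
            hxy = 0.25 * ((E[i + 1][j + 1] - E[i - 1][j + 1])
                          - (E[i + 1][j - 1] - E[i - 1][j - 1]))
            lam, vx, vy = min_curvature_eigenpair(hxx, hxy, hyy)
            # Inner product of the gradient with the curvature direction.
            info[(i, j)] = (lam, vx, vy, gx * vx + gy * vy)
    found = []
    for (i, j), (lam_a, ax, ay, d_a) in info.items():
        for nb in ((i + 1, j), (i, j + 1)):
            if nb not in info:
                continue
            lam_b, bx, by, d_b = info[nb]
            if ax * bx + ay * by < 0.0:
                d_b = -d_b             # eigenvector sign is arbitrary; align it
            # Ridge top: downward curvature on both sides and a sign change.
            if lam_a < 0.0 and lam_b < 0.0 and d_a * d_b < 0.0:
                found.append(((i, j), nb))
    return found

# Synthetic surface (assumption): one horizontal ridge centred at frequency bin 10.5.
f0, sigma = 10.5, 2.0
E = [[math.exp(-((j - f0) ** 2) / (2 * sigma ** 2)) for j in range(21)]
     for _ in range(6)]
tops = ridge_tops(E)
```

On this synthetic surface the sketch reports a crossing between frequency bins 10 and 11 at every interior time index. Requiring a negative eigenvalue on both sides keeps valleys and flat regions from being reported as ridges.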
Abstract:
This research is concerned with the development of tactual displays to supplement the information available through lipreading. Because voicing carries a high informational load in speech and is not well transmitted through lipreading, the efforts are focused on providing tactual displays of voicing to supplement the information available on the lips of the talker. This research includes exploration of 1) signal-processing schemes to extract information about voicing from the acoustic speech signal, 2) methods of displaying this information through a multi-finger tactual display, and 3) perceptual evaluations of voicing reception through the tactual display alone (T), lipreading alone (L), and the combined condition (L+T). Signal processing for the extraction of voicing information used amplitude-envelope signals derived from filtered bands of speech (i.e., envelopes derived from a lowpass-filtered band at 350 Hz and from a highpass-filtered band at 3000 Hz). Acoustic measurements made on the envelope signals of a set of 16 initial consonants represented through multiple tokens of C1VC2 syllables indicate that the onset-timing difference between the low- and high-frequency envelopes (EOA: envelope-onset asynchrony) provides a reliable and robust cue for distinguishing voiced from voiceless consonants. This acoustic cue was presented through a two-finger tactual display such that the envelope of the high-frequency band was used to modulate a 250-Hz carrier signal delivered to the index finger (250-I) and the envelope of the low-frequency band was used to modulate a 50-Hz carrier delivered to the thumb (50T). The temporal-onset order threshold for these two signals, measured with roving signal amplitude and duration, averaged 34 msec, sufficiently small for use of the EOA cue. 
Perceptual evaluations of the tactual display of EOA with speech signals indicated: 1) that the cue was highly effective for discrimination of pairs of voicing contrasts; 2) that the identification of 16 consonants was improved by roughly 15 percentage points with the addition of the tactual cue over L alone; and 3) that no improvements in L+T over L were observed for reception of words in sentences, indicating the need for further training on this task.
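As a rough illustration of the envelope-onset asynchrony (EOA) measurement described in this abstract, the sketch below computes the onset-time difference between two amplitude envelopes that have already been extracted. The abstract does not say how onsets are detected or which sign convention is used, so the fixed-threshold onset rule, the low-minus-high sign, and the names `onset_index` and `envelope_onset_asynchrony` are all assumptions.

```python
def onset_index(envelope, threshold):
    """Index of the first sample at or above threshold, or None if never reached."""
    for n, value in enumerate(envelope):
        if value >= threshold:
            return n
    return None

def envelope_onset_asynchrony(env_low, env_high, fs, threshold):
    """EOA in seconds: low-band onset time minus high-band onset time.

    The sign convention and the threshold rule are assumptions of this sketch.
    """
    n_low = onset_index(env_low, threshold)
    n_high = onset_index(env_high, threshold)
    if n_low is None or n_high is None:
        return None
    return (n_low - n_high) / fs

# Toy envelopes sampled at 1 kHz (made-up values, not measured data).
env_low = [0.0, 0.0, 0.0, 0.5, 1.0]
env_high = [0.0, 0.5, 1.0, 1.0, 1.0]
eoa = envelope_onset_asynchrony(env_low, env_high, fs=1000, threshold=0.25)
```

With these toy values the high-band envelope crosses the threshold two samples before the low-band envelope, so `eoa` is about 0.002 s under the assumed convention.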
Abstract:
Does knowledge of language consist of symbolic rules? How do children learn and use their linguistic knowledge? To elucidate these questions, we present a computational model that acquires phonological knowledge from a corpus of common English nouns and verbs. In our model the phonological knowledge is encapsulated as Boolean constraints operating on classical linguistic representations of speech sounds in terms of distinctive features. The learning algorithm compiles a corpus of words into increasingly sophisticated constraints. The algorithm is incremental, greedy, and fast. It yields one-shot learning of phonological constraints from a few examples. Our system exhibits behavior similar to that of young children learning phonological knowledge. As a bonus the constraints can be interpreted as classical linguistic rules. The computational model can be implemented by a surprisingly simple hardware mechanism. Our mechanism also sheds light on a fundamental AI question: How are signals related to symbols?
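To make the idea of a Boolean constraint over distinctive features concrete, here is a toy sketch. It is not the learning algorithm from this abstract: the constraint is hand-written rather than learned, the feature table covers only the [voiced] feature for a few segments, and the names `VOICED`, `agrees_in_voicing`, and `choose_plural_suffix` are hypothetical.

```python
# Hypothetical feature table: only the [voiced] feature, for a handful of segments.
VOICED = {"t": False, "k": False, "s": False, "g": True, "d": True, "z": True}

def agrees_in_voicing(stem_final, suffix):
    """Boolean constraint: True when both segments share the same [voiced] value."""
    return VOICED[stem_final] == VOICED[suffix]

def choose_plural_suffix(stem_final):
    """Pick the plural suffix ('s' or 'z') that satisfies the voicing constraint."""
    return next(s for s in ("s", "z") if agrees_in_voicing(stem_final, s))

suffix_for_cat = choose_plural_suffix("t")   # stem ends in voiceless /t/
suffix_for_dog = choose_plural_suffix("g")   # stem ends in voiced /g/
```

The single constraint selects /s/ after voiceless /t/ and /z/ after voiced /g/. It deliberately ignores sibilant-final stems, which take a different plural form in English.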
Abstract:
With the rapid increase in low-cost and sophisticated digital technology, the need for techniques to authenticate digital material will become more urgent. In this paper we address the problem of authenticating digital signals assuming no explicit prior knowledge of the original. The basic approach that we take is to assume that in the frequency domain a "natural" signal has weak higher-order statistical correlations. We then show that "un-natural" correlations are introduced if this signal is passed through a non-linearity (which would almost surely occur in the creation of a forgery). Techniques from polyspectral analysis are then used to detect the presence of these correlations. We review the basics of polyspectral analysis, show how and why these tools can be used in detecting forgeries, and show their effectiveness in analyzing human speech.
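For readers unfamiliar with polyspectral analysis, the standard third-order quantity is the bispectrum, sketched below together with one common normalization (the bicoherence). The abstract does not state which statistics the authors use, so the symbols $X$, $B$, $b$, $\omega_1$, $\omega_2$, $\phi_1$, $\phi_2$ and the squaring example are textbook conventions assumed here for illustration.

```latex
% Bispectrum of a zero-mean signal with Fourier transform X(\omega):
B(\omega_1, \omega_2) = E\left[ X(\omega_1)\, X(\omega_2)\, X^{*}(\omega_1 + \omega_2) \right]

% One common normalization (bicoherence), bounded between 0 and 1:
b(\omega_1, \omega_2) =
  \frac{\left| B(\omega_1, \omega_2) \right|}
       {\sqrt{ E\left[ |X(\omega_1) X(\omega_2)|^{2} \right]\,
               E\left[ |X(\omega_1 + \omega_2)|^{2} \right] }}

% Why a non-linearity leaves a trace: squaring two sinusoids with
% independent phases creates a component at the sum frequency whose
% phase is the sum of the original phases.
\left[ \cos(\omega_1 t + \phi_1) + \cos(\omega_2 t + \phi_2) \right]^{2}
  = 1 + \tfrac{1}{2}\cos(2\omega_1 t + 2\phi_1)
      + \tfrac{1}{2}\cos(2\omega_2 t + 2\phi_2)
      + \cos\big((\omega_1 + \omega_2)t + \phi_1 + \phi_2\big)
      + \cos\big((\omega_1 - \omega_2)t + \phi_1 - \phi_2\big)
```

Because the phase at $\omega_1 + \omega_2$ is tied to the phases at $\omega_1$ and $\omega_2$, the expectation in $B(\omega_1, \omega_2)$ no longer averages toward zero, which is the kind of higher-order correlation the abstract calls "un-natural".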
Abstract:
We report a 75dB, 2.8µW, 100Hz-10kHz envelope detector in a 1.5µm 2.8V CMOS technology. The envelope detector performs input-dc-insensitive voltage-to-current-converting rectification followed by novel nanopower current-mode peak detection. The use of a subthreshold wide-linear-range transconductor (WLR OTA) allows greater than 1.7Vpp input voltage swings. We show theoretically that this optimal performance is technology-independent for the given topology and may be improved only by spending more power. A novel circuit topology is used to perform 140nW peak detection with controllable attack and release time constants. The lower limits of envelope detection are determined by the more dominant of two effects: The first effect is caused by the inability of amplified high-frequency signals to exceed the deadzone created by exponential nonlinearities in the rectifier. The second effect is due to an output current caused by thermal noise rectification. We demonstrate good agreement of experimentally measured results with theory. The envelope detector is useful in low-power bionic implants for the deaf, hearing aids, and speech-recognition front ends. Extension of the envelope detector to higher-frequency applications is straightforward if power consumption is inc