5 results for Audio-Visual Automatic Speech Recognition
at Duke University
Abstract:
While cochlear implants (CIs) usually provide high levels of speech recognition in quiet, speech recognition in noise remains challenging. To overcome these difficulties, it is important to understand how implanted listeners separate a target signal from interferers. Stream segregation has been studied extensively in both normal and electric hearing as a function of place of stimulation. However, the effects of pulse rate, independent of place, on the perceptual grouping of sequential sounds in electric hearing have not yet been investigated. A rhythm detection task was used to measure stream segregation. The results of this study suggest that while CI listeners can segregate streams based on differences in pulse rate alone, the amount of stream segregation observed decreases as the base pulse rate increases. Further investigation of the perceptual dimensions encoded by pulse rate, and of the effect of sequentially presenting different stimulation rates, could benefit the future development of speech processing strategies for CIs.
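To make the task concrete, the following minimal Python sketch builds interleaved pulse-train stimuli of the kind a rhythm detection task might use; the rates, duration, and jitter values are illustrative assumptions, not the study's parameters.

```python
# Illustrative stimuli for a rhythm detection task (hypothetical parameters):
# target stream A keeps a regular rhythm unless jittered; interferer stream B
# runs at a different pulse rate. Detecting the jitter in A is easier when
# A and B are perceived as separate streams.
import numpy as np

def pulse_times(rate_hz, duration_s, jitter_s=0.0, rng=None):
    """Onset times of a pulse train; nonzero jitter breaks the regular rhythm."""
    times = np.arange(0.0, duration_s, 1.0 / rate_hz)
    if jitter_s and rng is not None:
        times = times + rng.uniform(-jitter_s, jitter_s, size=times.size)
    return times

rng = np.random.default_rng(0)
a_times = pulse_times(100, 1.0, jitter_s=0.002, rng=rng)  # target stream A
b_times = pulse_times(300, 1.0)                           # interferer stream B
```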
Abstract:
Our ability to track an object as the same persisting entity over time and motion may primarily rely on spatiotemporal representations which encode some, but not all, of an object's features. Previous researchers using the 'object reviewing' paradigm have demonstrated that such representations can store featural information of well-learned stimuli such as letters and words at a highly abstract level. However, it is unknown whether these representations can also store purely episodic information (i.e. information obtained from a single, novel encounter) that does not correspond to pre-existing type-representations in long-term memory. Here, in an object-reviewing experiment with novel face images as stimuli, observers still produced reliable object-specific preview benefits in dynamic displays: a preview of a novel face on a specific object speeded recognition of that particular face when it later reappeared on the same object, compared to when it reappeared on a different object (beyond display-wide priming), even when all objects moved to new positions in the intervening delay. This case study demonstrates that the mid-level visual representations which keep track of persisting identity over time (e.g. 'object files', in one popular framework) can store not only abstract types from long-term memory, but also specific tokens from online visual experience.
Abstract:
This dissertation focuses on two central challenges posed by whale acoustic signals: detection and classification.
In detection, we evaluated the influence of the uncertain ocean environment on the spectrogram-based detector and derived the likelihood ratio for the proposed Short-Time Fourier Transform (STFT) detector. Experimental results showed that the proposed detector outperforms detectors based on the spectrogram. Because it retains phase information, the proposed detector is also more sensitive to environmental changes.
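As a toy illustration of the role of phase, the sketch below contrasts a coherent STFT statistic with a magnitude-only spectrogram statistic; the chirp template, white-noise model, and all parameters are assumptions for illustration and do not reproduce the dissertation's likelihood-ratio derivation under an uncertain ocean environment.

```python
# Coherent (phase-preserving) vs. spectrogram (magnitude-only) detection
# statistics for a known toy template in white noise. All signals and
# parameters are hypothetical.
import numpy as np
from scipy.signal import stft

fs = 4000                                                # assumed sampling rate, Hz
t = np.arange(0, 1.0, 1.0 / fs)
template = np.sin(2 * np.pi * (200 * t + 150 * t**2))    # toy chirp "call"
received = template + 0.5 * np.random.randn(t.size)      # call in white noise

_, _, Z_tpl = stft(template, fs=fs, nperseg=256)
_, _, Z_rx = stft(received, fs=fs, nperseg=256)

# Coherent statistic: complex inner product retains phase information.
stat_coherent = np.abs(np.vdot(Z_tpl, Z_rx))

# Spectrogram statistic: correlates magnitudes only, discarding phase.
stat_spectrogram = np.sum(np.abs(Z_tpl) * np.abs(Z_rx))

# Each statistic would be compared against a threshold set from noise-only data.
print(stat_coherent, stat_spectrogram)
```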
In classification, our focus is on finding a robust and sparse representation of whale vocalizations. Because whale vocalizations can be modeled as polynomial phase signals, we can represent whale calls by their polynomial phase coefficients. In this dissertation, we used the Weyl transform to capture chirp rate information and used a two-dimensional feature set to represent whale vocalizations globally. Experimental results showed that our Weyl feature set outperforms chirplet coefficients and Mel-frequency cepstral coefficients (MFCCs) when applied to our collected data.
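A minimal sketch of the polynomial-phase model itself: the phase of the analytic signal is unwrapped and fit with a low-order polynomial whose coefficients summarize the call. This generic estimator is shown for illustration only; it is not the Weyl-transform feature extraction used in the dissertation.

```python
# Estimating polynomial phase coefficients of a toy quadratic-phase "call".
# Signal parameters are hypothetical.
import numpy as np
from scipy.signal import hilbert

fs = 4000
t = np.arange(0, 0.5, 1.0 / fs)
call = np.cos(2 * np.pi * (300 * t + 400 * t**2))  # instantaneous freq 300 + 800*t Hz

analytic = hilbert(call)                 # complex analytic signal
phase = np.unwrap(np.angle(analytic))    # instantaneous phase, radians

# Fit phase(t) ~= 2*pi*(c0 + c1*t + c2*t^2); c2 captures the chirp rate.
c2, c1, c0 = np.polyfit(t, phase / (2 * np.pi), deg=2)
print(c0, c1, c2)  # c1 ~= 300, c2 ~= 400 up to edge effects
```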
Since whale vocalizations can be represented by polynomial phase coefficients, it is plausible that the signals lie on a manifold parameterized by these coefficients. We therefore studied the intrinsic structure of the high-dimensional whale data by exploiting its geometry. Experimental results showed that nonlinear mappings such as Laplacian Eigenmaps and Isomap outperform linear mappings such as PCA and MDS, suggesting that the whale acoustic data lie on a nonlinear manifold.
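The comparison described above can be sketched with scikit-learn, where Laplacian Eigenmaps is available as SpectralEmbedding; the feature matrix below is a random placeholder standing in for the whale data.

```python
# Linear vs. nonlinear dimensionality reduction on a placeholder feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, SpectralEmbedding

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))   # placeholder: 500 calls, 64-dim features

X_pca = PCA(n_components=2).fit_transform(X)                      # linear
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)   # nonlinear
X_lap = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)
# The embeddings would then be judged by, e.g., separability of known call types.
```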
We also explored deep learning algorithms on whale acoustic data. We built each layer as convolutions with either a PCA filter bank (PCANet) or a DCT filter bank (DCTNet). With the DCT filter bank, each layer has a different time-frequency scale representation, from which different physical information can be extracted. Experimental results showed that our PCANet and DCTNet achieve high classification rates on the whale vocalization data set. The word error rate of the DCTNet features is similar to that of MFSC features in speech recognition tasks, suggesting that the convolutional network is able to reveal the acoustic content of speech signals.
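A minimal sketch of one DCTNet-style layer, convolving a signal with a fixed (data-independent) DCT filter bank; the filter length and bank size are illustrative choices, not the dissertation's settings.

```python
# One convolutional layer with a DCT filter bank instead of learned filters.
import numpy as np
from scipy.fftpack import dct
from scipy.signal import fftconvolve

filter_len, n_filters = 64, 8
# Rows of the orthonormal DCT-II matrix serve as the filter bank.
bank = dct(np.eye(filter_len), norm='ortho', axis=0)[:n_filters]

x = np.random.randn(4000)        # placeholder input signal
layer_out = np.stack([fftconvolve(x, h, mode='same') for h in bank])
# layer_out has shape (n_filters, len(x)); stacking layers built with different
# filter lengths yields representations at different time-frequency scales.
```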
Abstract:
Sexual risk behavior among young adults is a serious public health concern; 50% will contract a sexually transmitted infection (STI) before the age of 25. The current study collected self-report personality and sexual history data, as well as neuroimaging, experimental behavioral (e.g., real-time hypothetical sexual decision-making), and self-report sexual arousal data from 120 heterosexual young adults ages 18-26. In addition, longitudinal changes in self-reported sexual behavior were collected from a subset (n = 70) of the participants. The primary aims of the study were (1) to predict differences in self-reported sexual behavior and hypothetical sexual decision-making (in response to sexually explicit audio-visual cues) as a function of ventral striatum (VS) and amygdala activity, (2) to test whether the association between sexual behavior/decision-making and brain function is moderated by gender, self-reported sexual arousal, and/or trait-level personality factors (i.e., self-control, impulsivity, and sensation seeking), and (3) to examine how the main effects of neural function and the interaction effects predict sexual risk behavior over time. Our hypotheses were mostly supported across the sexual behavior and decision-making outcome variables: neural risk phenotypes (heightened reward-related ventral striatum activity coupled with decreased threat-related amygdala activity) were associated with a greater number of lifetime sexual partners both at baseline and over time (longitudinal analyses). Impulsivity moderated the relationship between neural function and self-reported number of sexual partners at baseline and follow-up, as well as experimental condom use decision-making. Sexual arousal and sensation seeking moderated the relationship between neural function and baseline and follow-up self-reports of number of sexual partners. Finally, unique gender differences were observed in the relationship between threat- and reward-related neural reactivity and self-reported sexual risk behavior. The results of this study provide initial evidence for the potential of neurobiological approaches to understanding sexual decision-making and risk behavior. With continued research, establishing biomarkers for sexual risk behavior could help inform the development of novel, more effective, individually tailored sexual health prevention and intervention efforts.
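For readers unfamiliar with moderation analyses, the pattern reported above corresponds to testing an interaction term in a regression model. The sketch below uses simulated data and hypothetical variable names; it does not reproduce the study's actual models.

```python
# Moderation as a brain-by-trait interaction in ordinary least squares.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "vs_reactivity": rng.normal(size=n),   # reward-related ventral striatum activity
    "impulsivity": rng.normal(size=n),     # trait-level moderator
})
# Simulated outcome with a built-in interaction effect.
df["partners"] = (0.5 * df.vs_reactivity
                  + 0.4 * df.vs_reactivity * df.impulsivity
                  + rng.normal(size=n))

# "vs_reactivity * impulsivity" expands to both main effects plus their product;
# a significant product term indicates moderation.
model = smf.ols("partners ~ vs_reactivity * impulsivity", data=df).fit()
print(model.summary())
```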
Abstract:
Although people do not normally try to remember associations between faces and physical contexts, these associations are established automatically, as indicated by the difficulty of recognizing familiar faces in different contexts ("butcher-on-the-bus" phenomenon). The present fMRI study investigated the automatic binding of faces and scenes. In the face-face (F-F) condition, faces were presented alone during both encoding and retrieval, whereas in the face/scene-face (FS-F) condition, they were presented overlaid on scenes during encoding but alone during retrieval (context change). Although participants were instructed to focus only on the faces during both encoding and retrieval, recognition performance was worse in the FS-F than in the F-F condition ("context shift decrement" [CSD]), confirming automatic face-scene binding during encoding. This binding was mediated by the hippocampus as indicated by greater subsequent memory effects (remembered > forgotten) in this region for the FS-F than the F-F condition. Scene memory was mediated by right parahippocampal cortex, which was reactivated during successful retrieval when the faces were associated with a scene during encoding (FS-F condition). Analyses using the CSD as a regressor yielded a clear hemispheric asymmetry in medial temporal lobe activity during encoding: Left hippocampal and parahippocampal activity was associated with a smaller CSD, indicating more flexible memory representations immune to context changes, whereas right hippocampal/rhinal activity was associated with a larger CSD, indicating less flexible representations sensitive to context change. Taken together, the results clarify the neural mechanisms of context effects on face recognition.
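As defined above, the CSD is a per-participant difference score (F-F recognition minus FS-F recognition), which the regressor analyses then relate to encoding activity; a minimal sketch with simulated hit rates follows.

```python
# Context shift decrement (CSD) as a difference score; hit rates are simulated.
import numpy as np

rng = np.random.default_rng(0)
hit_ff = rng.uniform(0.70, 0.95, size=20)            # F-F accuracy, 20 subjects
hit_fsf = hit_ff - rng.uniform(0.00, 0.15, size=20)  # FS-F: context changes at test

csd = hit_ff - hit_fsf   # larger CSD = less flexible, context-bound representation
print(csd.mean())
```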