14 results for Vocal quartets.
at Indian Institute of Science - Bangalore - India
Abstract:
This paper presents comparative data on the vocal communication of two Asian leaf monkeys, the Nilgiri langur (Presbytis johnii) and the South Indian common langur (Presbytis entellus), based on sound recordings and behavioural observations of free-ranging groups. Spectrographic analyses revealed a repertoire of 18 basic patterns for Nilgiri langurs and 21 basic patterns for common langurs. The repertoires of the two langur species consist of both discretely structured vocal patterns, in which alterations of the physical parameters are restricted to intra-class variation, and patterns in which structural variations cause intergradation between different sections of the repertoire. Qualitative assessments of group scans indicate that in both species vocal behaviour is characterized by pronounced sex differences in the use of the different elements of the vocal repertoire. Comparison of data available from different populations of P. entellus suggests population-specific modifications on both the structural and behavioural levels. Moreover, characteristic elements of the vocal systems of the two Asian species show striking similarities to those described for the African black-and-white colobus.
Abstract:
Field observations and spectrographic analyses of sound recordings of South Indian bonnet macaques revealed a vocal repertoire of at least 25 basic patterns. The repertoire consists of well-separated sound classes and acoustic categories connected by structural intergradation. Besides structural variations within and between different elements of the repertoire, the vocal system of Macaca radiata is characterized by regular combinations of particular basic patterns. These combinations occurred not only between calls of similar structure and function but also between calls usually emitted in entirely different social contexts. According to the qualitative analysis, sex-specific asymmetries of the vocal behaviour were less pronounced than age-dependent characteristics. The comparison of clear call vocalizations of Macaca radiata and M. fuscata revealed significant species-specific differences on the structural and the behavioural level. Evaluations of the structural features of alarm calls of various macaque species imply marked differences between members of the fascicularis group and sinica group on the one hand and the silenus group and arctoides group on the other.
Abstract:
Sound recordings and behavioural data were collected from four primate species of two genera (Macaca, Presbytis). Comparative analyses of structural and behavioural aspects of vocal communication revealed a high degree of intrageneric similarity but striking intergeneric differences. In the two macaque species (Macaca silenus, Macaca radiata), males and females shared the major part of the repertoire. In contrast, in the two langurs (Presbytis johnii, Presbytis entellus), many calls were exclusive to adult males. Striking differences between both species groups occurred with respect to age-specific patterns of vocal behaviour. The diversity of vocal behaviour was assessed from the number of different calls used and the proportion of each call in relation to total vocal output for a given age/sex class. In Macaca, diversity decreases with the age of the vocalizer, whereas in Presbytis the age of the vocalizer and the diversity of vocal behaviour are positively correlated. A comparison of the data of the two genera does not suggest any causal relationship between group composition (e.g. multi-male vs. one-male group) and communication system. Within each genus, interspecific differences in vocal behaviour can be explained by differences in social behaviour (e.g. group cohesion, intergroup relation, mating behaviour) and functional disparities. Possible factors responsible for the pronounced intergeneric differences in vocal behaviour between Macaca and Presbytis are discussed.
Abstract:
Objective identification and description of mimicked calls is a primary component of any study on avian vocal mimicry, but few studies have adopted a quantitative approach. We used spectral feature representations commonly used in human speech analysis, in combination with various distance metrics, to distinguish between mimicked and non-mimicked calls of the greater racket-tailed drongo, Dicrurus paradiseus, and cross-validated the results with human assessment of spectral similarity. We found that the automated method and human subjects performed similarly in terms of the overall number of correct matches of mimicked calls to putative model calls. However, the two methods also misclassified different subsets of calls, and we achieved a maximum accuracy of 95% only when we combined the results of both methods. This study is the first to use Mel-frequency Cepstral Coefficients and Relative Spectral Amplitude-filtered Linear Predictive Coding coefficients to quantify vocal mimicry. Our findings also suggest that, in spite of several advances in automated methods of song analysis, corresponding cross-validation by humans remains essential.
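The matching step described in this abstract (comparing spectral feature sequences of mimicked calls against putative model calls with a distance metric) can be illustrated with a generic dynamic-time-warping distance over feature frames. This is a minimal numpy sketch under assumed inputs, not the authors' pipeline; the feature extraction (e.g. MFCC computation) is taken as already done elsewhere.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic-time-warping distance between two feature sequences.

    X, Y: arrays of shape (n_frames, n_features), e.g. MFCC frames of
    a mimicked call and a putative model call.
    """
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def best_match(mimic, models):
    """Index of the model call whose features are closest to the mimicked call."""
    return int(np.argmin([dtw_distance(mimic, M) for M in models]))
```

DTW is only one of several plausible distance metrics here; the point is that matching reduces to a nearest-neighbour search in feature-sequence space.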
Abstract:
This paper presents speaker normalization approaches for the audio search task. The conventional state-of-the-art feature set, viz., Mel Frequency Cepstral Coefficients (MFCC), is known to implicitly contain speaker-specific as well as linguistic information. This can create problems for a speaker-independent audio search task. In this paper, a universal warping-based approach is used for vocal tract length normalization in audio search. In particular, features such as the scale transform and warped linear prediction are used to compensate for speaker variability in audio matching. The advantage of these features over the conventional feature set is that they apply universal frequency warping to both of the templates to be matched during audio search. The performance of Scale Transform Cepstral Coefficients (STCC) and Warped Linear Prediction Cepstral Coefficients (WLPCC) is about 3% higher than that of the state-of-the-art MFCC feature set on the TIMIT database.
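The core idea of vocal tract length normalization (warping the frequency axis so that spectra from different speakers line up) can be shown with a toy linear warp of a magnitude spectrum. This is a simplification for illustration only; the scale transform and warped linear prediction used in the paper are more involved.

```python
import numpy as np

def vtln_warp(spectrum, alpha):
    """Linearly warp the frequency axis of a magnitude spectrum.

    alpha is the warp factor (roughly, the vocal tract length ratio);
    alpha = 1 leaves the spectrum unchanged.
    """
    n = len(spectrum)
    bins = np.arange(n, dtype=float)
    warped_bins = np.clip(bins * alpha, 0, n - 1)  # warped frequency positions
    return np.interp(warped_bins, bins, spectrum)
```

Applying the same universal warp to both templates before matching, as the abstract describes, removes part of the speaker-dependent spectral scaling.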
Abstract:
Elephants use vocalizations for both long and short distance communication. Whereas the acoustic repertoire of the African elephant (Loxodonta africana) has been extensively studied in its savannah habitat, very little is known about the structure and social context of the vocalizations of the Asian elephant (Elephas maximus), which is mostly found in forests. In this study, the vocal repertoire of wild Asian elephants in southern India was examined. The calls could be classified into four mutually exclusive categories, namely, trumpets, chirps, roars, and rumbles, based on quantitative analyses of their spectral and temporal features. One of the call types, the rumble, exhibited high structural diversity, particularly in the direction and extent of frequency modulation of calls. Juveniles produced three of the four call types, including trumpets, roars, and rumbles, in the context of play and distress. Adults produced trumpets and roars in the context of disturbance, aggression, and play. Chirps were typically produced in situations of confusion and alarm. Rumbles were used for contact calling within and among herds, by matriarchs to assemble the herd, in close-range social interactions, and during disturbance and aggression. Spectral and temporal features of the four call types were similar between Asian and African elephants.
Abstract:
Adult male Nilgiri langurs (Presbytis johnii) utter loud call bouts consisting of one or more phrases. Phrases are made up of several units showing similar or different structural features. The units involved differ with respect to not only their physical structure but also their overall utilization: three vocal patterns are uttered exclusively by mature males living in bisexual groups or all-male bands and, in addition to being part of loud call bouts, are given during encounters with terrestrial predators; two vocal patterns are uttered by males and females, again not just as constituents of loud calls; and one vocal pattern is given exclusively by mature males living in bisexual groups. Within a given bout, phrases differ not only with respect to their composition but also in their temporal organization. In addition to the acoustic components, loud calls are regularly accompanied by stereotyped motoric displays. The motoric and acoustic components of loud call displays appear independently of each other and at different times during ontogeny. The development of the display is characterized by the combination of units with different structural features and the synchronization of vocal and motoric components. Although more evidence is needed, our observations suggest that the development of loud call displays coincides with the acquisition of social maturity and competence and requires not only social experience but also a certain amount of motoric training. In spite of the high degree of ritualization, loud call displays are not completely fixed in form, but instead are open to individual- and population-specific variation.
Abstract:
In voiced speech analysis epochal information is useful in accurate estimation of pitch periods and the frequency response of the vocal tract system. Ideally, linear prediction (LP) residual should give impulses at epochs. However, there are often ambiguities in the direct use of LP residual since samples of either polarity occur around epochs. Further, since the digital inverse filter does not compensate the phase response of the vocal tract system exactly, there is an uncertainty in the estimated epoch position. In this paper we present an interpretation of LP residual by considering the effect of the following factors: 1) the shape of glottal pulses, 2) inaccurate estimation of formants and bandwidths, 3) phase angles of formants at the instants of excitation, and 4) zeros in the vocal tract system. A method for the unambiguous identification of epochs from LP residual is then presented. The accuracy of the method is tested by comparing the results with the epochs obtained from the estimated glottal pulse shapes for several vowel segments. The method is used to identify the closed glottis interval for the estimation of the true frequency response of the vocal tract system.
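The starting point of the analysis above (the LP residual, which ideally shows impulses at epochs) can be sketched with the standard autocorrelation method of linear prediction. This is a minimal numpy illustration of the classical decomposition, not the epoch-identification method the paper develops.

```python
import numpy as np

def lp_residual(s, order=10):
    """Linear prediction residual via the autocorrelation method.

    Returns the predictor coefficients a and the residual e, where
    e[n] = s[n] - sum_k a[k] * s[n-k]. For voiced speech, the residual
    ideally exhibits large samples near epochs (instants of excitation).
    """
    s = np.asarray(s, dtype=float)
    r = np.correlate(s, s, mode="full")[len(s) - 1 : len(s) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1 : order + 1])  # normal equations
    e = s.copy()
    for k in range(1, order + 1):
        e[k:] -= a[k - 1] * s[:-k]            # subtract the prediction
    return a, e
```

As the abstract notes, the residual is ambiguous around epochs (samples of either polarity occur), which is precisely what motivates the unambiguous identification method presented in the paper.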
Abstract:
We analyze the AlApana of a Carnatic music piece without prior knowledge of the singer or the rAga. The AlApana is a means of communicating to the audience the flavor, or bhAva, of the rAga through the permitted notes and their phrases. The input to our analysis is a recording of the vocal AlApana along with the accompanying instrument. The AdhAra shadja (base note) of the singer for that AlApana is estimated through a stochastic model of note frequencies. Based on the shadja, we identify the notes (swaras) used in the AlApana using a semi-continuous GMM. Using the probabilities of each note interval, we recognize the swaras of the AlApana. For sampurNa rAgas, we can identify the possible rAga based on the swaras. We achieved correct shadja identification, which is crucial to all further steps, in 88.8% of 55 AlApanas. Among them (48 AlApanas of 7 rAgas), we obtain 91.5% correct swara identification and 62.13% correct rAga identification.
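Once the shadja is fixed, relating a pitch estimate to a note interval reduces to measuring its distance from the tonic in cents. The sketch below hard-quantizes that distance to a semitone bin; the paper instead uses a semi-continuous GMM over note frequencies, so this is only a simplified stand-in for the mapping step.

```python
import numpy as np

def hz_to_semitone_index(f_hz, shadja_hz):
    """Quantize a pitch to one of 12 semitone bins relative to the shadja.

    Returns 0 for the shadja (Sa) itself or any octave of it;
    e.g. the perfect fifth (Pa) maps to 7.
    """
    cents = 1200.0 * np.log2(f_hz / shadja_hz)  # distance from tonic in cents
    return int(round(cents / 100.0)) % 12
```

This also shows why shadja identification is crucial to all further steps: an error in the tonic shifts every subsequent swara label.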
Abstract:
The instants at which significant excitation of the vocal tract takes place during voicing are referred to as epochs. Epochs and the strengths of the excitation pulses at epochs are useful in characterizing the voice source. The epoch filtering technique proposed by the authors determines epochs from the speech waveform. In this paper we propose zero-phase inverse filtering to obtain the strengths of excitation pulses at epochs. The zero-phase inverse filter compensates for the gross spectral envelope of the short-time spectrum of speech without affecting the phase characteristics. Linear prediction analysis is used to realize the zero-phase inverse filter. Source characteristics that can be derived from speech using this technique are illustrated with examples.
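The central idea (flatten the spectral envelope while leaving the phase untouched) can be sketched by applying only the magnitude response of the LP inverse filter in the frequency domain. This is a simplified numpy illustration under these assumptions, not the authors' exact realization.

```python
import numpy as np

def zero_phase_inverse_filter(s, a):
    """Apply the LP inverse filter A(z) = 1 - sum_k a[k] z^-k with zero phase.

    Only the magnitude |A(e^jw)| multiplies the spectrum, so the gross
    spectral envelope is compensated without altering the phase of s.
    """
    s = np.asarray(s, dtype=float)
    A = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a))), n=len(s))
    S = np.fft.rfft(s)
    return np.fft.irfft(S * np.abs(A), n=len(s))  # zero-phase filtering
```

Because the effective filter response is real and non-negative, its impulse response is (circularly) symmetric, which is exactly the zero-phase property the abstract relies on.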
Abstract:
Real-time services are traditionally supported on circuit-switched networks. However, there is a need to port these services to packet-switched networks. An architecture for an audio conferencing application over the Internet, in the light of the ITU-T H.323 recommendations, is considered. In a conference, considering packets only from a set of selected clients can reduce speech quality degradation, because mixing packets from all clients can lead to a lack of speech clarity. A distributed algorithm and architecture for selecting clients for mixing is suggested here, based on a new quantifier of voice activity called the “Loudness Number” (LN). The proposed system distributes the computational load and reduces the load on client terminals. The highlights of this architecture are scalability, bandwidth saving and speech quality enhancement. Client selection for playout tries to mimic a physical conference, where the most vocal participants attract more attention. The contributions of the paper are expected to aid implementations of the H.323 recommendations for Multipoint Processors (MP). A working prototype based on the proposed architecture is already functional.
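The selection rule (mix only the most vocal participants) can be sketched as picking the top-N clients ranked by their Loudness Number. This hypothetical minimal version ignores the distributed aspects of the proposed architecture and assumes the LN values are already computed per client.

```python
def select_mixing_clients(loudness_numbers, n_select=3):
    """Pick the clients with the highest Loudness Number (LN) for mixing.

    loudness_numbers: dict mapping client id -> LN.
    Mixing only these clients limits the clarity loss that mixing
    every client's packets would cause.
    """
    ranked = sorted(loudness_numbers, key=loudness_numbers.get, reverse=True)
    return ranked[:n_select]
```

In the paper's distributed setting, this ranking is what the mixing nodes would cooperatively maintain instead of a single server.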
Abstract:
We consider the speech production mechanism and the associated linear source-filter model. For voiced speech sounds in particular, the source/glottal excitation is modeled as a stream of impulses and the filter as a cascade of second-order resonators. We show that the process of sampling speech signals can be modeled as filtering a stream of Dirac impulses (a model for the excitation) with a kernel function (the vocal tract response), and then sampling uniformly. We show that the problem of estimating the excitation is equivalent to the problem of recovering a stream of Dirac impulses from samples of a filtered version. We present associated algorithms based on the annihilating filter and also make a comparison with the classical linear prediction technique, which is well known in speech analysis. Results on synthesized as well as natural speech data are presented.
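The annihilating-filter step can be demonstrated on a toy periodic stream of K Diracs: from 2K+1 Fourier coefficients, the null vector of a small Toeplitz system yields a filter whose roots encode the Dirac locations. The sketch below assumes idealized, noise-free coefficients and is only an illustration of the principle, not the paper's full algorithm.

```python
import numpy as np

def dirac_locations(fourier_coeffs, K):
    """Recover K Dirac locations in [0, 1) from Fourier coefficients.

    fourier_coeffs: X[m] for m = -K..K, where
    X[m] = sum_k a_k * exp(-2j*pi*m*t_k).
    """
    X = np.asarray(fourier_coeffs)
    # Toeplitz system: sum_l h[l] X[m-l] = 0 for the annihilating filter h
    A = np.array([X[i : i + K + 1][::-1] for i in range(len(X) - K)])
    h = np.linalg.svd(A)[2][-1].conj()     # right null vector of A
    roots = np.roots(h)                    # roots are exp(-2j*pi*t_k)
    return np.sort(np.mod(-np.angle(roots) / (2 * np.pi), 1.0))
```

With noisy samples, the null vector is replaced by the singular vector of the smallest singular value, which is why the SVD form is used here rather than an exact nullspace computation.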
Abstract:
We propose a two-dimensional (2-D) multicomponent amplitude-modulation, frequency-modulation (AM-FM) model for a spectrogram patch corresponding to voiced speech, and develop a new demodulation algorithm to effectively separate the AM, which is related to the vocal tract response, and the carrier, which is related to the excitation. The demodulation algorithm is based on the Riesz transform and is developed along the lines of Hilbert-transform-based demodulation for 1-D AM-FM signals. We compare the performance of the Riesz transform technique with that of the sinusoidal demodulation technique on real speech data. Experimental results show that the Riesz-transform-based demodulation technique represents spectrogram patches accurately. The spectrograms reconstructed from the demodulated AM and carrier are inverted and the corresponding speech signal is synthesized. The signal-to-noise ratio (SNR) of the reconstructed speech signal, with respect to clean speech, was found to be 2 to 4 dB higher in the case of the Riesz transform technique than with the sinusoidal demodulation technique.
Abstract:
We address the problem of separating a speech signal into its excitation and vocal-tract filter components, which falls within the framework of blind deconvolution. Typically, the excitation in the case of voiced speech is assumed to be sparse and the vocal-tract filter stable. We develop an alternating l_p - l_2 projections algorithm (ALPA) to perform deconvolution taking these constraints into account. The algorithm is iterative, and alternates between two solution spaces. The initialization is based on the standard linear prediction decomposition of a speech signal into an autoregressive filter and prediction residue. In every iteration, a sparse excitation is estimated by optimizing an l_p-norm-based cost, and the vocal-tract filter is derived as the solution to a standard least-squares minimization problem. We validate the algorithm on voiced segments of natural speech signals and show applications to epoch estimation. We also present comparisons with state-of-the-art techniques and show that ALPA gives a sparser, impulse-like excitation, where the impulses directly denote the epochs or instants of significant excitation.
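The alternation described here can be caricatured with a simplified loop: a soft-thresholding step as a stand-in for the l_p-norm excitation projection, and a least-squares refit of the autoregressive filter. This is an illustrative simplification under those substitutions, not the published ALPA.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink toward zero; promotes a sparse, impulse-like excitation."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def alternating_deconvolution(s, order=8, iters=10, lam=0.05):
    """Alternate between a sparse excitation estimate and an LS filter fit.

    Initialization is the standard linear-prediction decomposition;
    soft thresholding here stands in for the l_p-norm projection step.
    """
    s = np.asarray(s, dtype=float)
    n = len(s)
    # design matrix of past samples: row i predicts s[order + i]
    S = np.column_stack([s[order - 1 - k : n - 1 - k] for k in range(order)])
    y = s[order:]
    a = np.linalg.lstsq(S, y, rcond=None)[0]          # LP initialization
    for _ in range(iters):
        e = soft_threshold(y - S @ a, lam)            # sparse excitation step
        a = np.linalg.lstsq(S, y - e, rcond=None)[0]  # least-squares (l_2) step
    return a, e
```

The returned excitation is sparse by construction, mirroring the property the abstract highlights: the surviving impulses mark candidate epochs.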