11 results for audio visual speech recognition
in Dalarna University College Electronic Archive
Abstract:
In this bachelor's thesis, the film sound of the war films Apocalypse Now and Saving Private Ryan has been examined. The aim is to contribute to a deeper understanding of the uses and functions of film sound, primarily in the two films in question, but also in war films in general. Film sound in this context comprises all the sound in the films, excluding non-diegetic music. Both films were examined through an audio-visual analysis: the sound and image content of each film is first scrutinized in detail separately, after which the same sequence is examined as a whole once sound and image have been recombined. The audio-visual analysis method used in the thesis is Michel Chion's method, Masking. The 30 minutes of film analysed were then placed into different film-sound zones, whose sound content revealed, among other things, the main functions the film sound served in these films. These functions are to sustain the viewer's focus and interest, to create closeness to the characters, and to add a strong sense of realism and presence. The intention behind the film sound appeared to be to move the viewer into the reality of the film, to let the viewer become one with it. Conveying this sense of realism, presence, focus and interest also proved to be the intention already present in both films' pre-production stages, which shows that the filmmakers achieved what they set out to do. Whether film sound is used in the same way, or serves the same functions, in war films in general cannot be said.
Abstract:
Background: Voice processing in real time is challenging. A drawback of previous work on Hypokinetic Dysarthria (HKD) recognition is the requirement of controlled settings in a laboratory environment. A personal digital assistant (PDA) has been developed for home assessment of PD patients. The PDA offers sound processing capabilities, which allow for developing a module for recognition and quantification of HKD. Objective: To compose an algorithm for assessment of PD speech severity in the home environment based on a review synthesis. Methods: A two-tier review methodology is utilized. The first tier focuses on real-time problems in speech detection. In the second tier, acoustic features that are robust to medication changes in Levodopa-responsive patients are investigated for HKD recognition. Keywords such as "Hypokinetic Dysarthria" and "speech recognition in real time" were used in the search engines. IEEE Xplore produced the most useful search hits compared to Google Scholar, ELIN, EBRARY, PubMed and LIBRIS. Results: Vowel and consonant formants are the acoustic parameters most relevant for reflecting PD medication changes. Since the relevant speech segments (consonants and vowels) contain only a minority of the speech energy, intelligibility can be improved by amplifying the voice signal using amplitude compression. Pause detection and peak-to-average power ratio calculations for voice segmentation produce rich voice features in real time. Voice segmentation can be further enhanced by introducing the zero-crossing rate (ZCR): consonants have a high ZCR, whereas vowels have a low ZCR. The wavelet transform is found to be promising for voice analysis, since it characterizes non-stationary voice signals over time using scale and translation parameters; in this way, voice intelligibility in the waveforms can be analysed in each time frame. Conclusions: This review evaluated HKD recognition algorithms in order to develop a tool for PD speech home assessment using modern mobile technology.
An algorithm that tackles the real-time constraints of HKD recognition, based on the review synthesis, is proposed. We suggest that speech features may be further processed using wavelet transforms and used with a neural network for detection and quantification of speech anomalies related to PD. Based on this model, patients' speech can be automatically categorized according to UPDRS speech ratings.
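The segmentation cues described in the review (low energy indicates a pause; among active frames, a high ZCR suggests a consonant and a low ZCR a vowel) can be sketched in a minimal numpy example. The thresholds and the synthetic test signals below are illustrative assumptions, not values taken from the reviewed studies:

```python
import numpy as np

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # count exact zeros as positive
    return float(np.mean(signs[1:] != signs[:-1]))

def short_time_energy(frame: np.ndarray) -> float:
    """Mean squared amplitude of the frame."""
    return float(np.mean(frame ** 2))

def label_frame(frame, zcr_thresh=0.25, energy_thresh=1e-4):
    """Pause if energy is low; otherwise a high ZCR suggests a
    consonant-like frame and a low ZCR a vowel-like frame."""
    if short_time_energy(frame) < energy_thresh:
        return "pause"
    return "consonant-like" if zero_crossing_rate(frame) > zcr_thresh else "vowel-like"

# Synthetic 0.1 s frames: a 100 Hz tone stands in for a vowel,
# white noise for a consonant
fs = 8000
t = np.arange(fs // 10) / fs
vowel = 0.5 * np.sin(2 * np.pi * 100 * t)
consonant = 0.5 * np.random.default_rng(0).standard_normal(len(t))
```

On these frames the tone yields a ZCR near 0.025 and the noise a ZCR near 0.5, so the crude thresholds separate them cleanly; real consonant/vowel boundaries would need tuned, possibly adaptive thresholds.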
Abstract:
This paper examines a popular music song (Heartbeats by Jose Gonzalez) as a sign system in television advertising. The study was conducted through qualitative questionnaires in connection with an audio-visual method of analysis called Masking. The method facilitates the analysis of isolated parts of the audio-visual spectrum by masking/hiding parts of the audio-visual totality. The survey had seven respondents, and a hermeneutic epistemological approach was used. For the analysis, Cooper's theory of brand identity (Practical and Symbolic Attitudes to Buying Brands) was used together with an interaction model for music in audio-visual advertising called "Modes of music-image interaction". The results showed that the music was associated with values such as genuineness, honesty, responsibility, purity, independence and innovation. The music's symbolic values helped to position the brand in a lifestyle context. The music also helped to express the target group's identity and attitudes by being innovative and independent, and it enhanced the perception of the visual colour rendition in the film. In general, the television advertisement was perceived as more positive and entertaining when the music was present; in other words, the music's social and cultural position contributed to raising the film's credibility. A deeper social and cultural value was created in the film through resonance between the symbolic values of the music and those of the film.
Abstract:
Literacy is an invaluable asset that has allowed for communication, documentation and the spreading of ideas since the beginning of written language. With technological advancements and new possibilities to communicate, it is important to question the degree to which people's abilities to utilise these new methods have developed in relation to the emerging technologies. The purpose of this bachelor's thesis is to analyse the state of the multimodal literacy of students at Dalarna University, as well as their experience of multimodality in their education. This has led to two main research questions: What is the state of Dalarna University students' multimodal literacy? And: How have the students at Dalarna University experienced multimodality in education? The paper is based on a mixed-method study incorporating both quantitative and qualitative aspects. The main thrust of the research is, however, a quantitative survey that was conducted online and emailed to students via their programme coordinators. The scope of the research is audio-visual modes, i.e. audio, video and images, while textual literacy is presumed and serves as an inspiration for the study. The study revealed that the students at Dalarna University are most skilled in image editing, while not being very literate in audio or video editing. The students seem to have had mediocre experience of creating meaning through multimodality, both in private use and in their respective educational institutions. The study also reveals that students prefer learning by means of video (rather than text or audio), yet are not able to create meaning (communicate) through it.
Abstract:
As applications and systems evolve, so does the way in which we interact with them. Until now, navigation and use of applications and systems has mostly been done by hand, via mouse and keyboard. More recently, navigation via touch screens and by voice has become increasingly common. When an application is to be controlled by voice, it is important that anyone can control it, regardless of their dialect. To examine how accurately a speech recognition API (Application Programming Interface) interprets Swedish dialects, this study began with document studies of the dialects' characteristics and sound combinations. These characteristics and sound combinations formed the basis for the words chosen to test the API: each dialect was assigned a word constructed to be especially difficult for the API to interpret when pronounced in that particular dialect. A prototype was then developed, specifically an Android application that served as a tool in the data collection. Since the work comprises both a prototype and a study, Design and Creation Research was chosen as the research strategy, with document studies and observations as data collection methods. The empirical data recorded through the observations, with the application as an aid, showed that some dialects were easier for the API to interpret correctly. In some cases the results were expected, since certain words, built from sound combinations that according to theory would be pronounced very distinctively in a given dialect, sometimes scored very low; in other cases, however, they scored surprisingly high. Our conclusion was that the words chosen with the expectation of low scores for a particular dialect behaved as expected on only two occasions. Instead, it was the word containing the sje and tje sounds, which according to theory are characteristics common to all dialects, that received the lowest scores overall.
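The per-dialect scoring that the prototype's observation log supports can be sketched as below. The dialect names, test words and log format are hypothetical examples for illustration, not the study's actual data:

```python
from collections import defaultdict

def accuracy_per_dialect(observations):
    """observations: iterable of (dialect, word, recognized_correctly)
    tuples, as an Android prototype might log them. Returns a mapping
    dialect -> fraction of utterances the API interpreted correctly."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dialect, _word, correct in observations:
        totals[dialect] += 1
        hits[dialect] += int(correct)
    return {d: hits[d] / totals[d] for d in totals}

# Hypothetical log entries (dialects and words are illustrative only)
sample_log = [
    ("skånska", "sjuksköterska", False),
    ("skånska", "körsbär", True),
    ("dalmål", "sjuksköterska", False),
    ("dalmål", "tjugo", True),
    ("dalmål", "stjärna", True),
]
rates = accuracy_per_dialect(sample_log)
```

Aggregating per dialect *and* per word in the same pass would also expose the pattern the study reports, where sje/tje words score low across all dialects.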
Abstract:
Loop-based performance in solo contexts is today a well-established form of music-making, but the possibilities of using the technique in an ensemble setting remain a more or less untried field. The project "Audiovisuella loopar" (Audiovisual Loops), carried out at Dalarna University, presents a system for collective live looping in which up to four people loop-play together, and in which the looping can simultaneously be recorded and played back as video clips. The technique has proved to have strong creative potential: with only a brief instruction, a collective process begins in which the immediate feedback produces a "flow" that brings out the musicians' joy of creation. These experiences point to exciting possibilities for using the technique in music education and music therapy.
Abstract:
For those who are not new to the world of Japanese animation, known mainly as anime, the "dub vs. sub" debate is by no means anything out of the ordinary, but rather a very heated argument amongst fans. The study focuses on the differences in the US English version between the two approaches to translating audio-visual media, namely subtitling (official subtitles and fan-made subtitles) and dubbing, in a qualitative context; more precisely, on which of the two approaches can retain the most information from the same audio-visual segment, in order to satisfy the needs of the anime audience. In order to draw substantial conclusions, the analysis is conducted on a corpus of one episode from the first season of the popular mid-nineties animated TV series Sailor Moon. The main objective of this research is to analyse the three versions and compare the findings to what anime fans expect each of them to provide, in terms of how culture-specific terms are handled, how accurate the translation is, localization, censorship, and omission. As for the fans' opinions, the study includes a survey regarding fans' personal preference when choosing between the official subtitled version, the fan-made subtitles and the dubbed version.
Abstract:
Johansson, Fredrik (2012). Filmljudets funktioner i dramafilm – En audio-visuell analys av filmen The King's Speech [The functions of film sound in drama film: an audio-visual analysis of the film The King's Speech]. Bachelor's thesis in Audio Production, Dalarna University, School of Languages and Media, Falun. This thesis examines the film sound of the drama film The King's Speech, in order to determine which functions the film sound fulfils in the selected sequences from the film, and how the sound is placed in the film's multichannel mix. The film was examined using an audio-visual analysis, a method in which sound and image are studied separately and then recombined and analysed as a whole. The audio-visual analysis method used comes from the sound theorist Michel Chion and is called Masking. The results of the audio-visual analysis indicated that the sound's main functions were to create a realistic depiction of characters and environments, to create a sense of presence, and to establish and maintain different perspectives in the narrative world. The vast majority of sounds proved to be placed in the centre channel, while primarily non-diegetic music and ambience were placed in the front and surround channels. This channel usage appeared to support the functions found, chiefly by contributing to the sense of presence and realism by enveloping the audience in ambient sound.
Abstract:
Parkinson's disease (PD) is a degenerative illness whose cardinal symptoms include rigidity, tremor and slowness of movement. In addition to its widely recognized effects, PD can have a profound effect on speech and voice. The speech symptoms most commonly demonstrated by patients with PD are reduced vocal loudness, monopitch, disruptions of voice quality and an abnormally fast rate of speech; this cluster of speech symptoms is often termed Hypokinetic Dysarthria. The disease can be difficult to diagnose accurately, especially in its early stages, so automatic techniques based on artificial intelligence can increase diagnostic accuracy and help doctors make better decisions. The aim of this thesis work is to predict PD based on audio files collected from various patients. The audio files are preprocessed to obtain the features; the preprocessed data contains 23 attributes and 195 instances, with on average six voice recordings per person. Using a data compression technique such as the Discrete Cosine Transform (DCT), the number of instances can be reduced. After data compression, attribute selection is performed using several WEKA built-in methods such as ChiSquared, GainRatio and InfoGain; after identifying the important attributes, we evaluate them one by one using stepwise regression. Based on the selected attributes, we proceed in WEKA using a cost-sensitive classifier with various algorithms such as MultiPass LVQ, Logistic Model Tree (LMT) and K-Star. The classification results average around 80%; using these features, approximately 95% correct classification of PD is achieved. This shows that, using the audio dataset, PD can be predicted with a high level of accuracy.
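The instance-reduction step can be illustrated with a minimal numpy sketch of the DCT-II. The thesis performs this pipeline in WEKA; the direct-summation transform, the per-subject matrix layout and the number of coefficients kept here are illustrative assumptions:

```python
import numpy as np

def dct2(x: np.ndarray) -> np.ndarray:
    """Unnormalized DCT-II of a 1-D signal, by direct summation:
    X_k = sum_n x_n * cos(pi/N * (n + 1/2) * k)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return (x * np.cos(np.pi / N * (n + 0.5) * k)).sum(axis=1)

def compress_recordings(recordings: np.ndarray, keep: int) -> np.ndarray:
    """Collapse repeated recordings of one subject into fewer instances.
    recordings: (n_recordings, n_features) matrix, e.g. six 23-attribute
    feature vectors for one patient. The DCT is taken per feature along
    the recording axis and only the first `keep` (low-frequency summary)
    coefficients are retained."""
    coeffs = np.apply_along_axis(dct2, 0, recordings)
    return coeffs[:keep]

# Six hypothetical 22-dimensional feature vectors for one subject
rng = np.random.default_rng(0)
recs = rng.standard_normal((6, 22))
reduced = compress_recordings(recs, keep=2)  # 6 instances -> 2
```

Because the k = 0 coefficient of the DCT-II is simply the sum over recordings, keeping the lowest coefficients retains the per-subject average behaviour while discarding recording-to-recording fluctuation.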
Abstract:
The aim of this thesis is to investigate computerized voice assessment methods to classify between normal and dysarthric speech signals. In the proposed system, computerized assessment methods equipped with signal processing and artificial intelligence techniques are introduced. Each subject read the sentences used for the measurement of inter-stress intervals (ISI), and these were then compared between normal and impaired voices. A band-pass filter was used for the preprocessing of the speech samples. Speech segmentation is performed using signal energy and the spectral centroid to separate voiced and unvoiced regions in the speech signal. Acoustic features are extracted from the LPC model and from the speech segments of each audio signal to find the anomalies. The speech features assessed for classification are energy entropy, zero-crossing rate (ZCR), spectral centroid, mean fundamental frequency (meanF0), jitter (RAP), jitter (PPQ) and shimmer (APQ). Naïve Bayes (NB) was used for speech classification. For speech tests 1 and 2, classification accuracies of 72% and 80% respectively between healthy and impaired speech samples were achieved using NB; for speech test 3, 64% correct classification was achieved. The results indicate the possibility of speech impairment classification in PD patients based on the clinical rating scale.
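The energy/spectral-centroid segmentation step can be sketched as follows; the frame length, thresholds and synthetic test signals are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, fs: int) -> float:
    """Magnitude-weighted mean frequency of the frame's spectrum (Hz)."""
    spectrum = np.abs(np.fft.rfft(frame))
    if spectrum.sum() == 0.0:
        return 0.0
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return float((freqs * spectrum).sum() / spectrum.sum())

def segment(frames, fs, energy_thresh=1e-4, centroid_thresh=1000.0):
    """Label each frame: low energy -> silence; otherwise a low spectral
    centroid suggests a voiced region, a high centroid an unvoiced one."""
    labels = []
    for frame in frames:
        if np.mean(frame ** 2) < energy_thresh:
            labels.append("silence")
        elif spectral_centroid(frame, fs) < centroid_thresh:
            labels.append("voiced")
        else:
            labels.append("unvoiced")
    return labels

# Synthetic 0.1 s frames: a 150 Hz tone (voiced-like), white noise
# (unvoiced-like, centroid near fs/4), and silence
fs = 8000
t = np.arange(800) / fs
tone = 0.5 * np.sin(2 * np.pi * 150 * t)
noise = 0.5 * np.random.default_rng(1).standard_normal(800)
labels = segment([tone, noise, np.zeros(800)], fs)
```

The voiced/unvoiced labels produced this way would mark the segments from which the LPC and perturbation features (jitter, shimmer, meanF0) are subsequently extracted.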