41 results for Chitra Mudgal


In this paper, we propose a novel solution for segmenting an instructional video into hierarchical topical sections. Combining knowledge of education-oriented film theory with our previous study of expressive functions, namely content density and thematic functions, we develop an algorithm that structures an instructional video into a two-tiered hierarchy of topical sections at the main-topic and sub-topic levels. Our experimental results on a set of ten industrial instructional videos demonstrate the validity of the detection scheme.


Using film grammar as the underpinning, we study the extraction of structures in video based on color, using a wide range of clustering methods combined with existing and new similarity measures. We study the visualisation of these structures, which we call Scene-Cluster Temporal Charts, and show how they can bring out the interweaving of different themes and settings in a film. We also extract color events that filmmakers use to draw or force a viewer's attention to a shot or scene. This is done by first extracting a set of colors used rarely in film, and then building a probabilistic model for color event detection. We demonstrate with experimental results from ten movies that our algorithms are effective in the extraction of both scene-cluster temporal charts and color events.


Automatically partitioning instructional videos into topic sections is a challenging problem in e-learning environments for efficient content management and cataloging. This paper addresses this problem by proposing a novel density function to delineate sections underscored by changes in topics in instructional and training videos. The content density function draws guidance from the observation that topic boundaries coincide with the ebb and flow of the 'density' of content shown in these videos. Based on this function, we propose two methods for high-level segmentation by determining topic boundaries. We study the performance of the two methods on eight training videos, and our experimental results demonstrate the effectiveness and robustness of the two proposed high-level segmentation algorithms for learning media.


This paper deals with the problem of structuralizing education and training videos for high-level semantics extraction and nonlinear media presentation in e-learning applications. Drawing guidance from production knowledge in instructional media, we propose six main narrative structures employed in education and training videos for both motivation and demonstration during learning and practical training. We devise a powerful audiovisual feature set, accompanied by a hierarchical decision tree-based classification system to determine and discriminate between these structures. Based on a two-tiered hierarchical model, we demonstrate that we can achieve an accuracy of 84.7% on a comprehensive set of education and training video data.


In order to enable high-level semantics-based video annotation and interpretation, we tackle the problem of automatic decomposition of motion pictures into meaningful story units, namely scenes. Since a scene is a complicated and subjective concept, we first propose guidelines from film production to determine when a scene change occurs in film. We examine different rules and conventions followed as part of Film Grammar to guide and shape our algorithmic solution for determining a scene boundary. Two different techniques are proposed as new solutions in this paper. Our experimental results on 10 full-length movies show that our technique based on shot sequence coherence performs well and reasonably better than the color edges-based approach.


We examine localised sound energy patterns, or events, that we associate with high-level affect experienced with films. The study of sound energy events in conjunction with their intended affect enables the analysis of film at a higher conceptual level, such as genre. The various affect/emotional responses we investigate in this paper are brought about by well-established patterns of sound energy dynamics employed in the audio tracks of horror films. This allows the examination of the thematic content of the films in relation to horror elements. We analyse the frequency of sound energy and affect events at a film level as well as at a scene level, and propose measures indicative of the film genre and scene content. Using four horror and two non-horror movies as experimental data, we establish a correlation between the sound energy event types and horrific thematic content within film, thus enabling an automated mechanism for genre typing and scene content labeling in film.


In this paper, we focus on the ‘reverse editing’ problem in movie analysis, i.e., the extraction of film takes, the original camera shots that a film editor extracts and arranges to produce a finished scene. The ability to disassemble final scenes and shots into takes is essential for nonlinear browsing, content annotation and the extraction of higher order cinematic constructs from film. In this work, we investigate agglomerative hierarchical clustering methods along with different similarity metrics and group distances for this task, and demonstrate our findings with 10 movies.
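
As a rough illustration of agglomerative hierarchical clustering applied to shots, the following minimal sketch merges the two closest clusters until no pair is closer than a stopping distance. The feature vectors, the Euclidean metric, the average-linkage group distance and the `stop_distance` parameter are all assumptions made for this example; the abstract does not state which metrics or group distances were evaluated.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(c1, c2, feats):
    """Group distance: mean pairwise distance between members of two clusters."""
    total = sum(euclidean(feats[i], feats[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(feats, stop_distance):
    """Merge the closest pair of clusters until the closest pair is farther
    apart than stop_distance. Returns clusters as sorted lists of shot indices."""
    clusters = [[i] for i in range(len(feats))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], feats)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > stop_distance:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical 2-D shot features (e.g. two color statistics per shot).
shots = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
takes = agglomerate(shots, stop_distance=1.0)
```

Here shots 0, 1 and 4 end up in one group and shots 2 and 3 in another, which is the kind of grouping one would hope for when shots cut from the same take share similar features.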


This paper addresses the area of video annotation, indexing and retrieval, and shows how a set of tools can be employed, along with domain knowledge, to detect narrative structure in broadcast news. The initial structure is detected using low-level audio visual processing in conjunction with domain knowledge. Higher level processing may then utilize the initial structure detected to direct processing to improve and extend the initial classification.

The structure detected breaks a news broadcast into segments, each of which contains a single topic of discussion. Further, the segments are labeled as a) anchor person or reporter, b) footage with a voice over or c) sound bite. This labeling may be used to provide a summary, for example by presenting a thumbnail for each reporter present in a section of the video. The inclusion of domain knowledge in computation allows more directed application of high-level processing, giving much greater efficiency of effort expended. This allows valid deductions to be made about the structure and semantics of the contents of a news video stream, as demonstrated by our experiments on CNN news broadcasts.


This work constitutes the first attempt to extract an important narrative structure, the 3-Act story telling paradigm, in film. This narrative structure is prevalent in the domain of film as it forms the foundation and framework in which the film can be made to function as an effective tool for story telling, and its extraction is a vital step in automatic content management for film data. A novel act boundary likelihood function for Act 1 is derived using a Bayesian formulation under guidance from film grammar, tested under many configurations and the results are reported for experiments involving 25 full length movies. The formulation is shown to be a useful tool in both the automatic and semi-interactive setting for semantic analysis of film.


The identification of useful structures in home video is difficult because this class of video is distinguished from other video sources by its unrestricted, unedited content and the absence of a regulated storyline. In addition, home videos contain a lot of motion and erratic camera movements, with shots of the same character being captured from various angles and viewpoints. In this paper, we present a solution to the challenging problem of clustering shots and faces in home videos, based on the use of SIFT features. SIFT features are known to be robust for object recognition; however, in dealing with the complexities of the home video setting, the matching process needs to be augmented and adapted. This paper describes various techniques that can improve the number of matches returned as well as the correctness of matches. For example, existing methods for verification of matches are inadequate when only a small number of matches is returned, a common situation in home videos. We address this by constructing a robust classifier that works on matching sets instead of individual matches, allowing the exploitation of the geometric constraints between matches. Finally, we propose techniques for robustly extracting target clusters from individual feature matches.


In this paper, we propose novel computational models for the extraction of high-level expressive constructs, namely the thematic and dramatic functions of the content shown in educational and training videos. Drawing on existing knowledge of film theory and the media production rules and conventions used by filmmakers, we hypothesize key aesthetic elements that contribute to conveying these functions of the content. Computational models to extract them are then formulated, and their performance is evaluated on a set of ten educational and training videos.


Motivated by existing cinematic conventions known as film grammar, we proposed a computational approach to determine tempo as a high-level movie content descriptor as well as a means for deriving dramatic story sections and events occurring in movies. Movie tempo is extracted from two easily computed aspects in our approach: shot length and motion. Story sections and events are generally associated with changes in tempo, and are thus identified by edges located in the tempo function. In this paper, we analyze our initial formulation of the tempo function, which was founded on the assumption that the distributions of both shot length and motion in movies are normal. Given that the distribution of shot length is approximately Weibull, as confirmed in our experiments, we examine the impact of modelling and modifying the contribution of shot length to tempo. We derive an appropriate normalization function that faithfully encapsulates the role of shot length in tempo perception, and analyze the changes to the story sections identified in films.
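
The idea of a tempo function built from shot length and motion, with story sections marked by edges in that function, can be sketched as follows. This is only an illustrative sketch: the z-score normalization (which corresponds to the normal-distribution assumption that the abstract revisits), the weights `alpha` and `beta`, the moving-average smoother and the simple difference-based edge detector are assumptions for this example, not the formulation used in the paper.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def tempo(shot_lengths, motions, alpha=0.5, beta=0.5):
    """Per-shot tempo: shorter shots and higher motion both raise tempo.
    alpha and beta are hypothetical weights."""
    z_len = zscore(shot_lengths)
    z_mot = zscore(motions)
    return [beta * m - alpha * s for s, m in zip(z_len, z_mot)]


def smooth(series, radius=1):
    """Moving average whose window is clipped at the ends of the series."""
    return [
        mean(series[max(0, i - radius): i + radius + 1])
        for i in range(len(series))
    ]


def tempo_edges(series, threshold):
    """Indices i where the jump from series[i - 1] to series[i] exceeds threshold."""
    return [
        i for i in range(1, len(series))
        if abs(series[i] - series[i - 1]) > threshold
    ]


# Hypothetical data: four long, calm shots followed by four short, busy ones.
lengths = [10, 10, 10, 10, 2, 2, 2, 2]   # shot length in seconds
motion = [1, 1, 1, 1, 5, 5, 5, 5]        # motion magnitude per shot
t = tempo(lengths, motion)
edges = tempo_edges(t, threshold=1.0)
```

With this toy input the only large jump in tempo is at shot index 4, where the pacing switches. Replacing `zscore` for shot length with a normalization suited to a Weibull-like distribution is the kind of change the abstract describes.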


The aim of this work is to devise an effective method for static summarization of home video sequences. Based on the premise that a user watching a summary is interested in people-related aspects (how many, who, emotional state) or activity-related aspects, we formulate a novel approach to video summarization that specifically exposes the video frames that make such content-spotting tasks possible. Existing approaches work on low-level features and often produce summaries that do not appeal to the viewer because of the semantic gap between low-level features and high-level concepts. In contrast, our approach is driven by various utility functions (identity count, identity recognition, emotion recognition, activity recognition, sense of space) that use the results of face detection, face clustering, shot clustering and within-cluster frame alignment. The summarization problem is then treated as the problem of extracting the set of key frames that have the maximum combined utility.
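
One simple way to read "maximum combined utility" is sketched below: each candidate frame carries a score per utility function, the scores are combined with a weighted sum, and the top `k` frames are kept. The weighted-sum combination, the score names, the weights and the fixed budget `k` are assumptions for illustration; the abstract does not specify how the utilities are combined or how the key frame set is optimized.

```python
def combined_utility(scores, weights):
    """Weighted sum of per-function utility scores; missing scores count as 0."""
    return sum(w * scores.get(name, 0.0) for name, w in weights.items())


def select_key_frames(frames, weights, k):
    """Return the ids of the k frames with the highest combined utility,
    listed in temporal (ascending id) order."""
    ranked = sorted(
        frames,
        key=lambda fid: combined_utility(frames[fid], weights),
        reverse=True,
    )
    return sorted(ranked[:k])


# Hypothetical per-frame utility scores keyed by frame number.
frames = {
    0: {"identity_count": 0.2, "emotion": 0.1},
    10: {"identity_count": 0.9, "emotion": 0.8},
    20: {"identity_count": 0.5},
    30: {"identity_count": 0.7, "emotion": 0.9},
}
weights = {"identity_count": 1.0, "emotion": 1.0}
summary = select_key_frames(frames, weights, k=2)
```

Because a plain sum treats frames independently, this sketch does not penalize near-duplicate frames; a fuller treatment would need some notion of redundancy between selected frames.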


In this paper, we investigate the use of a wavelet transform-based analysis of audio tracks accompanying videos for the problem of automatic program genre detection. We compare the classification performance of wavelet-based audio features with that of conventional features derived from Fourier and time analysis for the task of discriminating TV programs such as news, commercials, music shows, concerts, motor racing games, and animated cartoons. Three different classifiers, namely Decision Trees, SVMs, and k-Nearest Neighbours, are studied to analyse the reliability of the performance of our wavelet feature-based approach. Further, we investigate the appropriate duration of the audio clip to be analyzed for automatic genre determination. Our experimental results show that features derived from the wavelet transform of the audio signal separate the six video genres studied well. We also find no significant difference in performance across the classifiers as the audio clip duration varies.
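
To give a concrete sense of what wavelet-based audio features can look like, the sketch below computes subband energies from a multi-level Haar decomposition of a signal. The choice of the Haar wavelet, the number of levels and the use of subband energy as the feature are assumptions for this example; the abstract does not say which wavelet or which derived features were used.

```python
def haar_step(signal):
    """One level of the orthonormal Haar transform.
    Returns (approximation, detail) coefficients; a trailing odd sample is dropped."""
    s = 2 ** 0.5
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) / s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / s for i in pairs]
    return approx, detail


def subband_energies(signal, levels):
    """Energy of the detail coefficients at each level (finest first),
    followed by the energy of the final approximation."""
    energies = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, detail = haar_step(approx)
        energies.append(sum(d * d for d in detail))
    energies.append(sum(a * a for a in approx))
    return energies


# Hypothetical clips of 8 samples (a power of two, so no samples are dropped).
fast = [1, -1, 1, -1, 1, -1, 1, -1]   # rapidly alternating signal
flat = [1, 1, 1, 1, 1, 1, 1, 1]       # constant signal
fast_features = subband_energies(fast, levels=3)
flat_features = subband_energies(flat, levels=3)
```

The alternating signal puts essentially all of its energy in the finest detail band, while the constant signal puts it in the final approximation; vectors like these could then be passed to any of the classifiers the abstract mentions.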


In this paper, we investigate the problem of classifying a subset of environmental sounds in movie audio tracks that indicate specific indexical semiotic use. These environmental sounds are used to signify and enhance events occurring in film scenes. We propose a classification system for detecting the presence of violence and car chase scenes in film by classifying ten different environmental sounds that form the constituent audio events of these scenes, using a number of existing and new audio features. Experiments with our classification system on pure test sounds resulted in a correct event classification rate of 88.9%. We also present the results of the classifier on the mixed audio tracks of several scenes taken from The Mummy and Lethal Weapon 2. The classification of sound events is the first step towards determining the presence of complex sound scenes within film audio and describing the thematic content of the scenes.