941 results for Visual Speaker Recognition, Visual Speech Recognition, Cascading Appearance-Based Features


Relevance: 50.00%

Abstract:

Multiple system atrophy (MSA) is a rare movement disorder and a member of the 'parkinsonian syndromes', which also include Parkinson's disease (PD), progressive supranuclear palsy (PSP), dementia with Lewy bodies (DLB) and corticobasal degeneration (CBD). Multiple system atrophy is a complex syndrome, in which patients exhibit a variety of signs and symptoms, including parkinsonism, ataxia and autonomic dysfunction. It can be difficult to separate MSA from the other parkinsonian syndromes but if ocular signs and symptoms are present, they may aid differential diagnosis. Typical ocular features of MSA include blepharospasm, excessive square-wave jerks, mild to moderate hypometria of saccades, impaired vestibulo-ocular reflex (VOR), nystagmus and impaired event-related evoked potentials. Less typical features include slowing of saccadic eye movements, the presence of vertical gaze palsy, visual hallucinations and an impaired electroretinogram (ERG). Aspects of primary vision such as visual acuity, colour vision or visual fields are usually unaffected. Management of the disease to deal with problems of walking, movement, daily tasks and speech problems is important in MSA. Optometrists can work in collaboration with the patient and health-care providers to identify and manage the patient's visual deficits. A more specific role for the optometrist is to correct vision to prevent falls and to monitor the anterior eye to prevent dry eye and control blepharospasm.

Relevance: 50.00%

Abstract:

Most existing color-based tracking algorithms use the statistical color information of the object as the tracking cue, without maintaining the spatial structure within a single chromatic image. Recently, research on multilinear algebra has made it possible to preserve spatial structural relationships in a representation of image ensembles. In this paper, a third-order color tensor is constructed to represent the object to be tracked. Considering the influence of the changing environment on tracking, biased discriminant analysis (BDA) is extended to tensor biased discriminant analysis (TBDA) for distinguishing the object from the background. At the same time, an incremental scheme for TBDA is developed for online learning of the tensor biased discriminant subspace, which can be used to adapt to appearance variations of both the object and the background. The experimental results show that the proposed method can precisely track objects undergoing large pose, scale and lighting changes, as well as partial occlusion. © 2009 Elsevier B.V.
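The third-order color tensor representation above can be illustrated with a minimal sketch. The tensor construction is the paper's; the helper below is an assumed implementation of the standard mode-n unfolding that tensor discriminant methods such as TBDA apply before computing scatter matrices:

```python
import numpy as np

def mode_n_unfold(tensor, mode):
    """Unfold (matricize) a tensor along the given mode.

    Mode-n unfolding arranges the mode-n fibres of the tensor as
    columns of a matrix -- the basic operation tensor-based
    discriminant methods use before building scatter matrices.
    """
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# A hypothetical 20x20 RGB object patch as a third-order color tensor.
patch = np.random.rand(20, 20, 3)

# Unfoldings along rows, columns, and color channels.
for mode in range(3):
    unfolded = mode_n_unfold(patch, mode)
    print(mode, unfolded.shape)
```

Each unfolding exposes one factor of the spatial/chromatic structure, which is exactly the information a vectorized (flattened) color histogram would discard.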

Relevance: 50.00%

Abstract:

Many object recognition techniques perform some flavour of point pattern matching between a model and a scene. Such points are usually selected through a feature detection algorithm that is robust to a class of image transformations, and a suitable descriptor is computed over them in order to obtain reliable matching. Moreover, some approaches take an additional step by casting the correspondence problem as a matching between graphs defined over feature points. The motivation is that the relational model would add more discriminative power; however, the overall effectiveness strongly depends on the ability to build a graph that is stable with respect to both changes in the object appearance and the spatial distribution of interest points. In fact, widely used graph-based representations have been shown to suffer from some limitations, especially with respect to changes in the Euclidean organization of the feature points. In this paper we introduce a technique to build relational structures over corner points that does not depend on the spatial distribution of the features. © 2012 ICPR Org Committee.

Relevance: 50.00%

Abstract:

More information is now readily available to computer users than at any time in human history; however, much of this information is often inaccessible to people with blindness or low vision, for whom information must be presented non-visually. Currently, screen readers are able to verbalize on-screen text using text-to-speech (TTS) synthesis; however, much of this vocalization is inadequate for browsing the Internet. An auditory interface that incorporates auditory-spatial orientation was created and tested. For information that can be structured as a two-dimensional table, links can be semantically grouped as cells in a row within an auditory table, which provides a consistent structure for auditory navigation. An auditory display prototype was tested. Sixteen legally blind subjects participated in this research study. Results demonstrated that stereo panning was an effective technique for audio-spatially orienting non-visual navigation in a five-row, six-column HTML table, as compared to a centered, stationary synthesized voice. These results were based on measuring the time-to-target (TTT), the amount of time elapsed from the first prompting to the selection of each tabular link. Preliminary analysis of the TTT values recorded during the experiment showed that the populations did not conform to the ANOVA requirements of normality and equality of variances. Therefore, the data were transformed using the natural logarithm. The repeated-measures two-factor ANOVA results show that the logarithmically transformed TTTs were significantly affected by the tonal variation method, F(1,15) = 6.194, p = 0.025. Similarly, the results show that the logarithmically transformed TTTs were marginally affected by the stereo spatialization method, F(1,15) = 4.240, p = 0.057. The results show that the logarithmically transformed TTTs were not significantly affected by the interaction of both methods, F(1,15) = 1.381, p = 0.258.
These results suggest that some confusion may be caused in the subject when employing both of these methods simultaneously. The significant effect of tonal variation indicates that the effect actually increases the average TTT; in other words, the presence of preceding tones increases task completion time on average. The marginally significant effect of stereo spatialization decreases the average log(TTT) from 2.405 to 2.264.
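The stereo spatialization the study evaluates can be sketched with a standard constant-power pan law. The study does not specify its panning implementation, so the function and parameters below are illustrative assumptions:

```python
import numpy as np

def constant_power_pan(mono, pan):
    """Pan a mono signal into a stereo pair using the constant-power law.

    pan: -1.0 = full left, 0.0 = centre, +1.0 = full right.
    Cosine/sine gains keep total power constant across positions,
    a common way to provide audio-spatial cues such as column
    position in an auditory table.
    """
    theta = (pan + 1.0) * np.pi / 4.0  # map [-1, 1] -> [0, pi/2]
    left = np.cos(theta) * mono
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=-1)

# A hypothetical 440 Hz tone panned hard left, centre, and hard right.
t = np.linspace(0, 0.1, 4410, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
for pan in (-1.0, 0.0, 1.0):
    stereo = constant_power_pan(tone, pan)
    print(pan, stereo.shape)
```

Mapping each of the six table columns to an evenly spaced `pan` value would reproduce the kind of left-to-right orientation the experiment compares against a stationary centred voice.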

Relevance: 50.00%

Abstract:

Corticobasal degeneration is a rare, progressive neurodegenerative disease and a member of the 'parkinsonian' group of disorders, which also includes Parkinson's disease, progressive supranuclear palsy, dementia with Lewy bodies and multiple system atrophy. The most common initial symptom is limb clumsiness, usually affecting one side of the body, with or without accompanying rigidity or tremor. Subsequently, the disease affects gait and slowly progresses to involve the ipsilateral arms and legs. Apraxia and dementia are the most common cortical signs. Corticobasal degeneration can be difficult to distinguish from other parkinsonian syndromes but if ocular signs and symptoms are present, they may aid clinical diagnosis. Typical ocular features include increased latency of saccadic eye movements ipsilateral to the side exhibiting apraxia, impaired smooth pursuit movements and visuo-spatial dysfunction, especially involving spatial rather than object-based tasks. Less typical features include reduction in saccadic velocity, vertical gaze palsy, visual hallucinations, sleep disturbance and an impaired electroretinogram. Aspects of primary vision such as visual acuity and colour vision are usually unaffected. Management of the condition to deal with problems of walking, movement, daily tasks and speech problems is an important aspect of the disease.

Relevance: 50.00%

Abstract:

It is widely accepted that infants begin learning their native language not by learning words, but by discovering features of the speech signal: consonants, vowels, and combinations of these sounds. Learning to understand words, as opposed to just perceiving their sounds, is said to come later, between 9 and 15 mo of age, when infants develop a capacity for interpreting others' goals and intentions. Here, we demonstrate that this consensus about the developmental sequence of human language learning is flawed: in fact, infants already know the meanings of several common words from the age of 6 mo onward. We presented 6- to 9-mo-old infants with sets of pictures to view while their parent named a picture in each set. Over this entire age range, infants directed their gaze to the named pictures, indicating their understanding of spoken words. Because the words were not trained in the laboratory, the results show that even young infants learn ordinary words through daily experience with language. This surprising accomplishment indicates that, contrary to prevailing beliefs, either infants can already grasp the referential intentions of adults at 6 mo or infants can learn words before this ability emerges. The precocious discovery of word meanings suggests a perspective in which learning vocabulary and learning the sound structure of spoken language go hand in hand as language acquisition begins.

Relevance: 50.00%

Abstract:

The gargoyle has since ancient times populated the cornices of religious and civil buildings, serving to drain water and, according to different interpretations, fulfilling various other missions. Its strong symbolic component makes it a very interesting draw for today's audiovisual world. Among the manifestations found, we will study the vision of the gargoyle in these media and assess whether or not it is faithful to the medieval vision. Through the appearance of the medieval gargoyle in today's world in the form of characters and ornaments, and thanks to the application of current imagery to new gargoyles, we can take a visual and symbolic journey through the different visions that contemporary culture creates of them. But we cannot stop there; it is necessary to relate these new visions and meanings to the original ones, in order to establish whether the image has been used correctly or whether the structure of the primal symbol has been broken.

Relevance: 50.00%

Abstract:

This paper presents a novel theory for performing multi-agent activity recognition without requiring large training corpora. The reduced need for data means that robust probabilistic recognition can be performed within domains where annotated datasets are traditionally unavailable. Complex human activities are composed of sequences of underlying primitive activities. We do not assume that the exact temporal ordering of primitives is necessary, so we can represent a complex activity using an unordered bag. Our three-tier architecture comprises low-level video tracking, event analysis and high-level inference. High-level inference is performed using a new, cascading extension of the Rao–Blackwellised Particle Filter. Simulated annealing is used to identify pairs of agents involved in multi-agent activity. We validate our framework using the benchmark PETS 2006 video surveillance dataset and our own sequences, achieving a mean recognition F-score of 0.82. Our approach achieves a mean improvement of 17% over a Hidden Markov Model baseline.
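The unordered-bag representation of primitive activities can be sketched as a multiset comparison. The paper's actual inference uses a cascading Rao–Blackwellised Particle Filter; the Jaccard-style bag score below is only an illustrative assumption showing why temporal order drops out:

```python
from collections import Counter

def bag_similarity(observed, model):
    """Compare an observed sequence of primitive activities with a
    model activity, ignoring temporal order: multiset intersection
    over multiset union (a Jaccard-style score on bags)."""
    obs, mod = Counter(observed), Counter(model)
    inter = sum((obs & mod).values())
    union = sum((obs | mod).values())
    return inter / union if union else 0.0

# Hypothetical primitives for a 'luggage handover' complex activity.
model = ["approach", "stop", "put_down", "walk_away"]
observed = ["stop", "approach", "put_down", "walk_away"]  # different order

# Same primitives, different order -> identical bags, perfect score.
print(bag_similarity(observed, model))
```

Because `Counter` discards ordering, any permutation of the same primitives scores identically, which is the point of the bag representation.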

Relevance: 50.00%

Abstract:

We propose a novel bolt-on module capable of boosting the robustness of various single compact 2D gait representations. Gait recognition is negatively influenced by covariate factors, including clothing and time, which alter the natural gait appearance and motion. Contrary to traditional gait recognition, our bolt-on module remedies this through a dedicated covariate factor detection and removal procedure, which we evaluate quantitatively and qualitatively. The fundamental concept of the bolt-on module is founded on exploiting the pixel-wise composition of covariate factors. Results demonstrate that our bolt-on module is a powerful component, leading to significant improvements across gait representations and datasets and yielding state-of-the-art results.
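A minimal sketch of pixel-wise covariate detection and removal, assuming a Gait Energy Image (GEI) representation and a simple deviation threshold. Both are assumptions for illustration; the paper's actual procedure is more sophisticated:

```python
import numpy as np

def remove_covariate_pixels(gei, gallery_mean, threshold=0.3):
    """Zero out pixels of a Gait Energy Image (GEI) that deviate
    strongly from a covariate-free gallery average -- a simple
    pixel-wise stand-in for a detection-and-removal step."""
    deviation = np.abs(gei - gallery_mean)
    mask = deviation <= threshold   # keep pixels that look 'natural'
    return gei * mask

# Hypothetical 4x4 GEIs: a carried bag shows up as extra bright pixels.
gallery_mean = np.full((4, 4), 0.5)
probe = gallery_mean.copy()
probe[0, 0] = 1.0                   # simulated covariate pixel
cleaned = remove_covariate_pixels(probe, gallery_mean)
print(cleaned[0, 0])
```

The key idea carried over from the abstract is that covariates are localized in pixel space, so a per-pixel decision can strip them while leaving the natural gait signature intact.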

Relevance: 50.00%

Abstract:

With the world of professional sports shifting towards better sport analytics, the demand for vision-based performance analysis has been growing in recent years. In addition, the nature of many sports does not allow the use of sensors or other wearable markers attached to players for monitoring their performance during competitions. This opens up a potential application for systematic observations, such as tracking information about the players, to help coaches develop the visual skills and perceptual awareness needed to make decisions about team strategy or training plans. My PhD project is part of a bigger ongoing project between sport scientists and computer scientists, also involving industry partners and sports organisations. The overall idea is to investigate the contribution technology can make to the analysis of sports performance, using the example of team sports such as rugby, football or hockey. A particular focus is on vision-based tracking, so that information about the location and dynamics of the players can be gained without any additional sensors on the players. To start with, prior approaches to visual tracking are extensively reviewed and analysed. In this thesis, methods are proposed to deal with the difficulties in visual tracking of handling target appearance changes caused by intrinsic factors (e.g. pose variation) and extrinsic factors, such as occlusion. This analysis highlights the importance of the proposed visual tracking algorithms, which address these challenges and offer robust and accurate frameworks to estimate the target state in a complex tracking scenario such as a sports scene, thereby facilitating the tracking process. Next, a framework for continuously tracking multiple targets is proposed. Compared to single-target tracking, multi-target tracking, such as tracking the players on a sports field, poses an additional difficulty, namely data association, which needs to be addressed.
Here, the aim is to locate all targets of interest, infer their trajectories and decide which observation corresponds to which target trajectory. In this thesis, an efficient framework is proposed to handle this particular problem, especially in sports scenes, where the players of the same team tend to look similar and exhibit complex interactions and unpredictable movements, resulting in matching ambiguity between the players. The presented approach is also evaluated on different sports datasets and shows promising results. Finally, information from the proposed tracking system is utilised as the basic input for further higher-level performance analysis, such as tactics and team formations, which can help coaches design a better training plan. Due to the continuous nature of many team sports (e.g. soccer, hockey), it is not straightforward to infer high-level team behaviours, such as players' interactions. The proposed framework relies on two distinct levels of performance analysis: low-level performance analysis, such as identifying player positions on the playing field, as well as high-level analysis, where the aim is to estimate the density of player locations or detect their possible interaction groups. The related experiments show that the proposed approach can effectively extract this high-level information, which has many potential applications.
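The data association step described above is commonly posed as a minimum-cost assignment between track predictions and detections. A sketch using the Hungarian algorithm follows; this is a standard baseline, not necessarily the thesis's exact method, and the cost matrix is hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cost matrix: rows are existing track predictions,
# columns are detections in the current frame; entries are distances
# (low cost = likely the same player).
cost = np.array([
    [0.1, 5.0, 4.0],
    [6.0, 0.2, 3.0],
    [5.0, 4.0, 0.3],
])

# The Hungarian algorithm finds the minimum-cost one-to-one matching,
# resolving the ambiguity when similar-looking players must each be
# assigned to exactly one trajectory.
track_idx, det_idx = linear_sum_assignment(cost)
for t, d in zip(track_idx, det_idx):
    print(f"track {t} -> detection {d} (cost {cost[t, d]})")
```

In practice the costs would combine appearance similarity and motion prediction, which is where the team-uniform ambiguity mentioned in the abstract enters the problem.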

Relevance: 50.00%

Abstract:

The production and perception of music is a multimodal activity involving auditory, visual and conceptual processing, integrating these with prior knowledge and environmental experience. Musicians utilise expressive physical nuances to highlight salient features of the score. The question arises within the literature as to whether performers’ non-technical, non-sound-producing movements may be communicatively meaningful and convey important structural information to audience members and co-performers. In the light of previous performance research (Vines et al., 2006; Wanderley, 2002; Davidson, 1993), and considering findings within co-speech gestural research and auditory and audio-visual neuroscience, this thesis examines the nature of those movements not directly necessary for the production of sound, and their particular influence on audience perception. Within the current research, 3D performance analysis is conducted using the Vicon 12-camera system and Nexus data-processing software. Performance gestures are identified as repeated patterns of motion relating to musical structure, which not only express phrasing and structural hierarchy but are consistently and accurately interpreted as such by a perceiving audience. Gestural characteristics are analysed across performers and performance styles using two Chopin preludes selected for their diverse yet comparable structures (Opus 28 Nos. 7 and 6). Effects on perceptual judgements of presentation modes (visual-only, auditory-only, audiovisual, full- and point-light) and viewing conditions are explored. This thesis argues that while performance style is highly idiosyncratic, piano performers reliably generate structural gestures through repeated patterns of upper-body movement. The shapes and locations of phrasing motions particular to the sample of performers investigated are identified.
Findings demonstrate that, despite the personalised nature of the gestures, performers use increased velocity of movement to emphasise musical structure, and that observers accurately and consistently locate phrasing junctures where these patterns and variations in motion magnitude, shape and velocity occur. By viewing performance motions in polar (spherical) rather than Cartesian coordinate space, it is possible to get mathematically closer to the movement generated by each of the nine performers, revealing distinct patterns of motion relating to phrasing structures, regardless of intended performance style. These patterns are highly individualised both to each performer and to each performed piece. Instantaneous velocity analysis indicates a right-directed bias of performance motion variation at salient structural features within individual performances. Perceptual analyses demonstrate that audience members are able to accurately and effectively detect phrasing structure from performance motion alone. This ability persists even for degraded point-light performances, where all extraneous environmental information has been removed. The relative contributions of audio, visual and audiovisual judgements demonstrate that the visual component of a performance does positively impact the overall accuracy of phrasing judgements, indicating that receivers are most effective in their recognition of structural segmentations when they can both see and hear a performance. Observers appear to make use of a rapid online judgement heuristic, adjusting response processes quickly to adapt and perform accurately across multiple modes of presentation and performance style. In line with existing theories in the literature, it is proposed that this processing ability may be related to cognitive and perceptual interpretation of syntax within gestural communication during social interaction and speech.
Findings of this research may have future impact on performance pedagogy, computational analysis and performance research, as well as potentially influencing future investigations of the cognitive aspects of musical and gestural understanding.
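The polar (spherical) view of performance motion described above amounts to a coordinate conversion of the motion-capture data. A minimal sketch follows; the axis conventions (polar angle from +z, azimuth in the xy-plane) are assumptions, since the thesis does not fix them here:

```python
import numpy as np

def cartesian_to_spherical(xyz):
    """Convert Nx3 Cartesian motion-capture coordinates to spherical
    (r, theta, phi): radius, polar angle from +z, azimuth in the
    xy-plane."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    # Guard against division by zero at the origin.
    theta = np.arccos(np.divide(z, r, out=np.zeros_like(r), where=r > 0))
    phi = np.arctan2(y, x)
    return np.stack([r, theta, phi], axis=1)

# A hypothetical marker trajectory: three samples of a hand position.
traj = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
print(cartesian_to_spherical(traj))
```

Expressing each marker relative to a body-centred origin in (r, theta, phi) separates the extent of a gesture (radius) from its direction (angles), which is what makes arcing phrasing motions easier to characterise than in raw Cartesian coordinates.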

Relevance: 50.00%

Abstract:

Nowadays, the new generation of computers provides high performance that makes it possible to build computationally expensive computer vision applications for mobile robotics. Building a map of the environment is a common task for a robot and an essential prerequisite for moving through these environments. Traditionally, mobile robots have used a combination of several sensors based on different technologies. Lasers, sonars and contact sensors have typically been used in mobile robotic architectures; however, color cameras are an important sensor because we want robots to use the same information as humans to sense and move through different environments. Color cameras are cheap and flexible, but a lot of work needs to be done to give robots enough visual understanding of the scenes. Computer vision algorithms are computationally complex, but nowadays robots have access to different and powerful architectures that can be used for mobile robotics purposes. The advent of low-cost RGB-D sensors like the Microsoft Kinect, which provide 3D colored point clouds at high frame rates, has made computer vision even more relevant in the mobile robotics field. The combination of visual and 3D data allows systems to use both computer vision and 3D processing and therefore to be aware of more details of the surrounding environment. The research described in this thesis was motivated by the need for scene mapping. Being aware of the surrounding environment is a key feature in many mobile robotics applications, from simple robotic navigation to complex surveillance applications. In addition, the acquisition of a 3D model of a scene is useful in many areas, such as video game scene modeling, where well-known places are reconstructed and added to game systems, or advertising, where once you have the 3D model of a room the system can add furniture using augmented reality techniques.
In this thesis we perform an experimental study of state-of-the-art registration methods to find which one best fits our scene mapping purposes. Different methods are tested and analyzed on different scene distributions of visual and geometric appearance. In addition, this thesis proposes two methods for 3D data compression and representation of 3D maps. Our 3D representation proposal is based on the Growing Neural Gas (GNG) method. This self-organizing map (SOM) has been successfully used for clustering, pattern recognition and topology representation of various kinds of data. Until now, self-organizing maps have been computed primarily offline, and their application to 3D data has mainly focused on noise-free models without considering time constraints. Self-organizing neural models have the ability to provide a good representation of the input space. In particular, the Growing Neural Gas (GNG) is a suitable model because of its flexibility, rapid adaptation and excellent quality of representation. However, this type of learning is time consuming, especially for high-dimensional input data. Since real applications often work under time constraints, it is necessary to adapt the learning process in order to complete it within a predefined time. This thesis proposes a hardware implementation that leverages the computing power of modern GPUs, taking advantage of a paradigm coined General-Purpose Computing on Graphics Processing Units (GPGPU). Our proposed geometric 3D compression method seeks to reduce the 3D information using plane detection as the basic structure to compress the data, because our target environments are man-made and therefore contain many points that belong to planar surfaces. Our proposed method obtains good compression results in those man-made scenarios. The detected and compressed planes can also be used in other applications such as surface reconstruction or plane-based registration algorithms.
Finally, we have also demonstrated the benefits of GPU technologies by obtaining a high-performance implementation of a common CAD/CAM technique called Virtual Digitizing.
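Plane detection as a compression primitive rests on fitting planes to point clouds. A minimal least-squares sketch via SVD follows; the thesis's full pipeline presumably adds plane segmentation and robust estimation on top of this building block:

```python
import numpy as np

def fit_plane(points):
    """Fit a plane to an Nx3 point cloud by least squares.

    Returns (centroid, unit normal). The normal is the right singular
    vector with the smallest singular value of the centred points --
    the basic primitive behind plane-based compression of man-made
    scenes, where many points lie on planar surfaces.
    """
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return centroid, normal

# Hypothetical noisy samples from the plane z = 0 (e.g. a floor).
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-1, 1, 100),
                       rng.uniform(-1, 1, 100),
                       rng.normal(0, 0.01, 100)])
centroid, normal = fit_plane(pts)
print(np.abs(normal))  # expected to be close to (0, 0, 1)
```

Once a plane is detected, its inlier points can be replaced by the plane parameters plus a 2D boundary, which is where the compression gain for man-made environments comes from.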

Relevance: 50.00%

Abstract:

The evolution of retail in Brazil, also attributed to the economic growth of recent decades, was marked by the emergence of new commercial formats and strategies and by a surprising process of social transformation. This social transformation was accompanied by an increase in the purchasing power of the different social classes that make up the economic landscape. Economic growth, brought about by economic and social-inclusion policies, gave rise to a market niche formed by the low-income social class, which until then had only restricted access to consumption. The emergence of a new market scenario made up of low-income individuals, together with saturated competition in consolidated markets, awakened the interest of many companies in expanding their activities by investing in this emerging consumer market. To consolidate their presence in a little-known market, some companies perceived the need to align their commercial and marketing strategies with the new profile and behaviour of these consumers. However, many of these strategies are still developed on the basis of myths about the consumption motivations and behaviours of the low-income class, and therefore do not correspond to the lifestyle and cultural background of this target audience. The present study aims to contribute to the improvement of low-income retail and, accordingly, to show the importance of planning commercial and consumer-persuasion actions coordinated with marketing strategies, notably visual merchandising, adapted to this audience. To achieve this goal, it was necessary to understand the lifestyle of the target audience through investigative research methods, allowing a detailed and assertive analysis of their consumption motivations.

Relevance: 50.00%

Abstract:

This thesis examines the state of audiovisual translation (AVT) in the aftermath of the COVID-19 emergency, highlighting new trends with regard to the implementation of AI technologies as well as their strengths, constraints, and ethical implications. It starts with an overview of the current AVT landscape, focusing on future projections about its evolution and its critical aspects, such as the worsening working conditions lamented by AVT professionals – especially freelancers – in recent years and how they might be affected by the advent of AI technologies in the industry. The second chapter delves into the history and development of three AI technologies which are used in combination with neural machine translation in automatic AVT tools: automatic speech recognition, speech synthesis and deepfakes (voice cloning and visual deepfakes for lip syncing), including real examples of start-up companies that utilize them – or are planning to do so – to localize audiovisual content automatically or semi-automatically. The third chapter explores the many ethical concerns around these innovative technologies, which extend far beyond the field of translation; at the same time, it attempts to vindicate their potential to bring about immense progress in terms of accessibility and international cooperation, provided that their use is properly regulated. Lastly, the fourth chapter describes two experiments testing the efficacy of the currently available tools for automatic subtitling and automatic dubbing respectively, in order to take a closer look at their perks and limitations compared to more traditional approaches. This analysis aims to help distinguish legitimate concerns from unfounded speculation with regard to the AI technologies which are entering the field of AVT; the intention behind it is to humbly suggest a constructive and optimistic view of the technological transformations that appear to be underway, whilst also acknowledging their potential risks.

Relevance: 50.00%

Abstract:

Throughout the years, technology has had an undeniable impact on the AVT field. It has revolutionized the way audiovisual content is consumed by allowing audiences to easily access it at any time and on any device. Especially after the introduction of OTT streaming platforms such as Netflix, Amazon Prime Video, Disney+, Apple TV+, and HBO Max, which offer a vast catalog of national and international products, the consumption of audiovisual products has been on a constant rise and, consequently, so has the demand for localized content. In turn, the AVT industry resorts to new technologies and practices to handle the ever-growing workload and the faster turnaround times. Due to the numerous implications that it has for the industry, technological advancement can be considered an area of research of particular interest for AVT studies. However, in the case of dubbing, research and discussion regarding the topic are lagging behind because of the more limited impact that technology has had on the very conservative dubbing industry. Therefore, the aim of the dissertation is to offer an overview of some of the latest technological innovations and practices that have already been implemented (i.e. cloud dubbing and DeepDub technology) or that are still under development and research (i.e. automatic speech recognition and respeaking, machine translation and post-editing, audio-based and visual-based dubbing techniques, text-based editing of talking-head videos, and automatic dubbing), to discuss their reception by industry professionals, and to offer projections about their future implementation in the dubbing field.