254 results for distance measures
Abstract:
Many techniques in information retrieval produce counts from a sample, and it is common to analyse these counts as proportions of the whole - term frequencies are a familiar example. Proportions carry only relative information and are not free to vary independently of one another: for the proportion of one term to increase, one or more others must decrease. These constraints are hallmarks of compositional data. While there has long been discussion in other fields of how such data should be analysed, to our knowledge, Compositional Data Analysis (CoDA) has not been considered in IR. In this work we explore compositional data in IR through the lens of distance measures, and demonstrate that common measures, naïve to compositions, have some undesirable properties which can be avoided with composition-aware measures. As a practical example, these measures are shown to improve clustering. Copyright 2014 ACM.
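The contrast between a composition-naive and a composition-aware measure can be sketched as follows. The abstract does not say which measures the authors evaluated; this sketch assumes the Aitchison distance (Euclidean distance after a centred log-ratio transform), a standard choice in the CoDA literature. The term counts are hypothetical.

```python
import math


def closure(counts):
    """Scale non-negative counts so that they sum to 1 (a composition)."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(parts):
    """Centred log-ratio transform. Every part must be strictly positive,
    so zero counts need smoothing or replacement first (not shown)."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def euclidean(a, b):
    """Ordinary Euclidean distance, naive to compositional constraints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def aitchison(a, b):
    """Aitchison distance: Euclidean distance between clr-transformed inputs."""
    return euclidean(clr(a), clr(b))


# Hypothetical term proportions: the rare first term doubles from doc_a to doc_b.
doc_a = closure([1, 49, 50])
doc_b = closure([2, 48, 50])

naive_distance = euclidean(doc_a, doc_b)   # driven by absolute differences
aware_distance = aitchison(doc_a, doc_b)   # driven by ratios between parts
```

Because the clr transform depends only on ratios between parts, `aitchison([1, 2, 3], [10, 20, 30])` is zero up to rounding: rescaling a sample does not change its composition. The doubling of the rare term, which barely moves the Euclidean distance, contributes much more to the Aitchison distance.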
Abstract:
In this paper we present pyktree, an implementation of the K-tree algorithm in the Python programming language. The K-tree algorithm provides highly balanced search trees for vector quantization that scale up to very large data sets. Pyktree is highly modular and well suited for rapid prototyping of novel distance measures and centroid representations. It is easy to install and provides a Python package for library use as well as command line tools.
Abstract:
Speaker diarization is the process of annotating input audio with information that attributes temporal regions of the audio signal to their respective sources, which may include both speech and non-speech events. For speech regions, the diarization system also specifies the locations of speaker boundaries and assigns relative speaker labels to each homogeneous segment of speech. In short, speaker diarization systems effectively answer the question of ‘who spoke when’. There are several important applications for speaker diarization technology, such as facilitating speaker indexing systems to allow users to directly access the relevant segments of interest within a given audio recording, and assisting with other downstream processes such as summarizing and parsing. When combined with automatic speech recognition (ASR) systems, the metadata extracted from a speaker diarization system can provide complementary information for ASR transcripts, including the location of speaker turns and relative speaker segment labels, making the transcripts more readable. Speaker diarization output can also be used to localize the instances of specific speakers to pool data for model adaptation, which in turn boosts transcription accuracy. Speaker diarization therefore plays an important role as a preliminary step in automatic transcription of audio data. The aim of this work is to improve the usefulness and practicality of speaker diarization technology through the reduction of diarization error rates. In particular, this research is focused on the segmentation and clustering stages within a diarization system. Although particular emphasis is placed on the broadcast news audio domain, and the systems developed throughout this work are trained and tested on broadcast news data, the techniques proposed in this dissertation are also applicable to other domains, including telephone conversations and meeting audio.
Three main research themes were pursued: heuristic rules for speaker segmentation, modelling uncertainty in speaker model estimates, and modelling uncertainty in eigenvoice speaker modelling. The use of heuristic approaches for the speaker segmentation task was first investigated, with emphasis placed on minimizing missed boundary detections. A set of heuristic rules was proposed to govern the detection and heuristic selection of candidate speaker segment boundaries. A second pass, using the same heuristic algorithm with a smaller window, was also proposed with the aim of improving detection of boundaries around short speaker segments. Compared to single-threshold-based methods, the proposed heuristic approach was shown to provide improved segmentation performance, leading to a reduction in the overall diarization error rate. Methods to model the uncertainty in speaker model estimates were developed to address the difficulties associated with making segmentation and clustering decisions with limited data in the speaker segments. The Bayes factor, derived specifically for multivariate Gaussian speaker modelling, was introduced to account for the uncertainty of the speaker model estimates. The use of the Bayes factor also enabled the incorporation of prior information regarding the audio to aid segmentation and clustering decisions. The idea of modelling uncertainty in speaker model estimates was also extended to the eigenvoice speaker modelling framework for the speaker clustering task. Building on the application of Bayesian approaches to the speaker diarization problem, the proposed approach takes into account the uncertainty associated with the explicit estimation of the speaker factors. The proposed decision criteria, based on Bayesian theory, were shown to generally outperform their non-Bayesian counterparts.
Abstract:
Semantic space models of word meaning derived from co-occurrence statistics within a corpus of documents, such as the Hyperspace Analogous to Language (HAL) model, have been proposed in the past. While word similarity can be computed using these models, it is not clear how semantic spaces derived from different sets of documents can be compared. In this paper, we focus on this problem, and we revisit the proposal of using semantic subspace distance measurements [1]. In particular, we outline the research questions that still need to be addressed to investigate and validate these distance measures. Then, we describe our plans for future research.
Abstract:
This paper investigates the correlation between transit route passenger loading and travel distance, and its implications for quality of service (QoS) and resource productivity. It uses Automatic Fare Collection (AFC) data across a weekday on a premium bus line in Brisbane, Australia. A composite load-distance factor is proposed as a new measure for profiling transit route on-board passenger comfort QoS. Understanding these measures and their correlation is important for planning, design, and operational activities.
Abstract:
This article presents the field applications and validations for the controlled Monte Carlo data generation scheme. This scheme was previously derived to assist the Mahalanobis squared distance-based damage identification method to cope with data-shortage problems, which often cause inadequate data multinormality and unreliable identification outcomes. To do so, real-vibration datasets from two actual civil engineering structures with such data (and identification) problems are selected as the test objects, which are then shown to be in need of enhancement to consolidate their conditions. By utilizing the robust probability measures of the data condition indices in controlled Monte Carlo data generation and statistical sensitivity analysis of the Mahalanobis squared distance computational system, well-conditioned synthetic data generated by an optimal controlled Monte Carlo data generation configuration can be unbiasedly evaluated against those generated by other set-ups and against the original data. The analysis results reconfirm that controlled Monte Carlo data generation is able to overcome the shortage of observations, improve the data multinormality and enhance the reliability of the Mahalanobis squared distance-based damage identification method, particularly with respect to false-positive errors. The results also highlight the dynamic structure of controlled Monte Carlo data generation that makes this scheme readily adaptable to any type of input data with any (original) distributional condition.
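For orientation, the Mahalanobis squared distance itself can be sketched as below. This is only the distance computation, not the authors' controlled Monte Carlo data generation scheme, and it is restricted to two features so that the covariance inverse can be written out by hand. The baseline observations are hypothetical.

```python
def mean_2d(rows):
    """Mean of a list of (x, y) observations."""
    n = len(rows)
    return sum(x for x, _ in rows) / n, sum(y for _, y in rows) / n


def covariance_2d(rows):
    """Sample covariance terms (n - 1 denominator) of (x, y) observations."""
    n = len(rows)
    mx, my = mean_2d(rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_squared(point, rows):
    """Squared Mahalanobis distance d^T S^-1 d of `point` from the baseline `rows`."""
    mx, my = mean_2d(rows)
    sxx, sxy, syy = covariance_2d(rows)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        # Too few or degenerate observations leave the covariance singular.
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # The inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


# Hypothetical baseline features with positive correlation between x and y.
baseline = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]

along_trend = mahalanobis_squared((5.0, 5.0), baseline)
against_trend = mahalanobis_squared((5.0, 1.0), baseline)
```

The two test points are equally far from the baseline mean in Euclidean terms, but the one that departs from the correlation in the baseline data receives a much larger squared distance. The singular-covariance check hints at why a shortage of observations is a problem for this kind of method.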
Abstract:
This study measures the efficiencies incorporating waste generation using Japanese prefecture level data. We apply and compare several models using directional distance functions. There are wide variations in the efficiency scores between the two orientations, "input, desirable and undesirable output orientation" and "undesirable output orientation". However, the difference in abatement factor does not result in wide variations in the efficiency scores. Our results show that there are wide differences in the efficiency scores among prefectures. © 2012 Springer.
Abstract:
Work zone safety studies have traditionally relied on historical crash records, an approach that is reactive in nature because it requires crashes to accumulate before any preventive action can be taken. However, detailed and accurate data on work zone crashes are often not available, as is the case for Australian road work zones. The lack of reliable safety records and the reactive nature of the crash-based safety analysis approach motivated this research to seek alternative and proactive measures of safety. Various surrogate measures of safety have been developed in the traffic safety literature, including time to collision, time to accident, gap time, post encroachment time, required deceleration rate, proportion of stopping distances, lateral distance to departure, and time to departure. These measures express how close road users are to a potential crash by analysing their movement trajectories. A review of this fast-growing literature is presented in this paper from the viewpoint of applying the measures to untangle work zone safety issues. The review revealed that the use of the surrogate measures is very limited for analysing work zone safety, although numerous studies have used these measures for analysing safety in other parts of the road network, such as intersections and motorway ramps. There exist great opportunities for adopting this proactive safety assessment approach to transform work zone safety for both roadworkers and motorists.
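As an illustration of one of the surrogate measures listed above, time to collision for a simple car-following situation can be sketched as follows. This assumes two vehicles in the same lane travelling at constant speeds, which is the basic textbook form rather than any specific formulation from the reviewed studies. The function and parameter names are hypothetical.

```python
import math


def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """Seconds until the follower reaches the leader if both hold their speed.

    gap_m is the current distance between the vehicles in metres.
    Returns math.inf when the follower is not closing on the leader.
    """
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


# Hypothetical trajectory sample: a 30 m gap closing at 10 m/s.
ttc = time_to_collision(gap_m=30.0, follower_speed_mps=20.0, leader_speed_mps=10.0)
```

Smaller values indicate that the vehicles are closer to a potential crash; studies typically flag a conflict when the value drops below a chosen threshold, which is left out here because it varies by study.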
Abstract:
Adolescent Idiopathic Scoliosis (AIS) is the most common deformity of the spine, affecting 2-4% of the population. Previous studies have shown that the vertebrae in scoliotic spines undergo abnormal shape changes; however, there has been little exploration of how scoliosis affects bone density distribution within the vertebrae. In this study, existing CT scans of 53 female idiopathic scoliosis patients with right-sided main thoracic curves were used to measure the lateral (right to left) bone density profile at mid-height through each vertebral body. Five key bone density profile measures were identified from each normalised bone density distribution, and multiple regression analysis was performed to explore the relationship between bone density distribution and patient demographics (age, height, weight, body mass index (BMI), skeletal maturity, time since menarche, vertebral level, and scoliosis curve severity). Results showed a marked convex/concave asymmetry in bone density for vertebral levels at or near the apex of the scoliotic curve. At the apical vertebra, mean bone density at the left side (concave) cortical shell was 23.5% higher than for the right (convex) cortical shell, and cancellous bone density along the central 60% of the lateral path from convex to concave increased by 13.8%. The centre of mass of the bone density profile at the thoracic curve apex was located 53.8% of the distance along the lateral path, indicating a shift of nearly 4% toward the concavity of the deformity. These lateral bone density gradients tapered off when moving away from the apical vertebra. Multi-linear regressions showed that the right cortical shell peak bone density is significantly correlated with skeletal maturity, with each Risser increment corresponding to an increase in mineral equivalent bone density of 4-5%.
There were also statistically significant relationships between patient height, weight and BMI, and the gradient of cancellous bone density along the central 60% of the lateral path. Bone density gradient is positively correlated with weight, and negatively correlated with height and BMI, such that at the apical vertebra, a unit decrease in BMI corresponds to an almost 100% increase in bone density gradient.
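The centre-of-mass measure reported above can be illustrated with a short sketch. It assumes a density profile sampled at equally spaced points from the convex (right) edge to the concave (left) edge; the abstract does not describe the authors' sampling or normalisation, and the profile values below are hypothetical.

```python
def centre_of_mass_percent(profile):
    """Density-weighted mean position along the path, as a percentage.

    profile holds densities sampled at equal spacing; the first sample
    sits at 0% of the path and the last at 100%.
    """
    if len(profile) < 2:
        raise ValueError("need at least two samples")
    last = len(profile) - 1
    total = sum(profile)
    weighted = sum((i / last) * density for i, density in enumerate(profile))
    return 100.0 * weighted / total


# Hypothetical profiles: one symmetric, one denser toward the far (concave) side.
symmetric = [1.0, 2.0, 3.0, 2.0, 1.0]
skewed = [1.0, 2.0, 3.0, 2.0, 2.0]

symmetric_position = centre_of_mass_percent(symmetric)
skewed_position = centre_of_mass_percent(skewed)
```

A symmetric profile gives a centre of mass at the midpoint of the path, so a value above 50% indicates a shift toward the far side, in the same sense as the 53.8% figure in the abstract.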
Abstract:
Cutaneous malignant melanoma (CMM) is a major health issue in Queensland, Australia, which has the world’s highest incidence. Recent molecular and epidemiologic studies suggest that CMM arises through multiple etiological pathways involving gene-environment interactions. Understanding the potential mechanisms leading to CMM requires larger studies than those previously conducted. This article describes the design and baseline characteristics of Q-MEGA, the Queensland Study of Melanoma: Environmental and Genetic Associations, which followed up 4 population-based samples of CMM patients in Queensland, including children, adolescents, men aged over 50, and a large sample of adult cases and their families, including twins. Q-MEGA aims to investigate the roles of genetic and environmental factors, and their interaction, in the etiology of melanoma. Three thousand, four hundred and seventy-one participants took part in the follow-up study and were administered a computer-assisted telephone interview in 2002-2005. Updated data on environmental and phenotypic risk factors, and 2777 blood samples were collected from interviewed participants as well as a subset of relatives. This study provides a large and well-described population-based sample of CMM cases with follow-up data. Characteristics of the cases and repeatability of sun exposure and phenotype measures between the baseline and the follow-up surveys, from 6 to 17 years later, are also described.