990 results for imbalanced data
Abstract:
Bioacoustic data can provide an important base for environmental monitoring. To explore large collections of field recordings, this paper presents an automated similarity search algorithm. A user specifies a region of an audio recording by frequency and time bounds, and the content of that region is used to construct a query. During retrieval, the algorithm automatically scans through recordings to search for similar regions. In detail, we present a feature extraction approach based on the visual content of vocalisations (in this case, ridges) and develop a generic regional representation of vocalisations for indexing. Our feature extraction method works best for bird vocalisations showing ridge characteristics. The regional representation method allows the content of an arbitrary region of a continuous recording to be described in a compressed format.
Abstract:
Speaker attribution is the task of annotating a spoken audio archive based on speaker identities. This can be achieved using speaker diarization and speaker linking. In our previous work, we proposed an efficient attribution system, using complete-linkage clustering, for conducting attribution of large sets of two-speaker telephone data. In this paper, we build on our proposed approach to achieve a robust system, applicable to multiple recording domains. To do this, we first extend the diarization module of our system to accommodate multi-speaker (>2) recordings. We achieve this through using a robust cross-likelihood ratio (CLR) threshold stopping criterion for clustering, as opposed to the original stopping criterion of two speakers used for telephone data. We evaluate this baseline diarization module across a dataset of Australian broadcast news recordings, showing a significant lack of diarization accuracy without previous knowledge of the true number of speakers within a recording. We thus propose applying an additional pass of complete-linkage clustering to the diarization module, demonstrating an absolute improvement of 20% in diarization error rate (DER). We then evaluate our proposed multi-domain attribution system across the broadcast news data, demonstrating achievable attribution error rates (AER) as low as 17%.
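The threshold-based stopping criterion described above can be illustrated with a minimal sketch. It assumes a generic symmetric distance matrix in place of the cross-likelihood ratio (CLR) scores used in the paper, and the function name `complete_linkage_clusters` and the `threshold` parameter are hypothetical names chosen for illustration. It is not the authors' implementation.

```python
from itertools import combinations


def complete_linkage_clusters(dist, threshold):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Merging stops when the closest pair of clusters is farther apart
    than threshold, so the number of clusters is not fixed in advance.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: distance between the two most distant members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[x], clusters[y]) > threshold:
            break  # Stopping criterion: no pair is similar enough to merge.
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    return sorted(sorted(c) for c in clusters)
```

With a threshold in place of a fixed target of two clusters, the same routine can return any number of speakers, which is the property the abstract relies on for multi-speaker recordings.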
Abstract:
Techniques to improve the automated analysis of natural and spontaneous facial expressions have been developed. The outcome of the research has applications in several fields, including national security (e.g., expression-invariant face recognition), education (e.g., affect-aware interfaces), and mental and physical health (e.g., depression and pain recognition).
Abstract:
This project was a step forward in developing intrusion detection systems in distributed environments such as web services. It investigates a new approach of detection based on so-called "taint-marking" techniques and introduces a theoretical framework along with its implementation in the Linux kernel.
Abstract:
Purpose: Intensity modulated radiotherapy (IMRT) treatments require more beam-on time and produce more linac head leakage than conventional, unmodulated radiotherapy treatments delivering similar doses. It is necessary to take this increased leakage into account when evaluating the results of radiation surveys around bunkers that are, or will be, used for IMRT. The recommended procedure of applying a monitor-unit based workload correction factor to secondary barrier survey measurements, to account for this increased leakage when evaluating radiation survey measurements around IMRT bunkers, can lead to potentially costly overestimation of the required barrier thickness. This study aims to provide initial guidance on the validity of reducing the value of the correction factor when applied to different radiation barriers (primary barriers, doors, maze walls and other walls) by evaluating three different bunker designs. Methods: Radiation survey measurements of primary, scattered and leakage radiation were obtained at each of five survey points around each of three different radiotherapy bunkers, and the contribution of leakage to the total measured radiation dose at each point was evaluated. Measurements at each survey point were made with the linac gantry set to 12 equidistant positions from 0 to 330°, to assess the effects of radiation beam direction on the results. Results: For all three bunker designs, less than 0.5% of the dose measured at and alongside the primary barriers, less than 25% of the dose measured outside the bunker doors and up to 100% of the dose measured outside other secondary barriers was found to be caused by linac head leakage. Conclusions: Results of this study suggest that IMRT workload corrections are unnecessary for survey measurements made at and alongside primary barriers. 
Use of reduced IMRT workload correction factors is recommended when evaluating survey measurements around a bunker door, provided that a subset of the measurements used in this study are repeated for the bunker in question. Reduction of the correction factor for other secondary barrier survey measurements is not recommended unless the contribution from leakage is separately evaluated.
Abstract:
This paper evaluates the efficiency of a number of popular corpus-based distributional models in performing discovery on very large document sets, including online collections. Literature-based discovery is the process of identifying previously unknown connections from text, often published literature, that could lead to the development of new techniques or technologies. Literature-based discovery has attracted growing research interest ever since Swanson's serendipitous discovery of the therapeutic effects of fish oil on Raynaud's disease in 1986. The successful application of distributional models in automating the identification of indirect associations underpinning literature-based discovery has been demonstrated extensively in the medical domain. However, we wish to investigate the computational complexity of distributional models for literature-based discovery on much larger document collections, as they may provide computationally tractable solutions to tasks such as predicting future disruptive innovations. In this paper we perform a computational complexity analysis on four successful corpus-based distributional models to evaluate their fitness for such tasks. Our results indicate that corpus-based distributional models that store their representations in fixed dimensions provide superior efficiency on literature-based discovery tasks.
Abstract:
This research aims to develop a reliable density estimation method for signalised arterials based on cumulative counts from upstream and downstream detectors. In order to overcome counting errors associated with urban arterials with mid-link sinks and sources, CUmulative plots and Probe Integration for Travel timE estimation (CUPRITE) is employed for density estimation. The method, by utilizing probe vehicles’ samples, reduces or cancels the counting inconsistencies when vehicles’ conservation is not satisfied within a section. The method is tested in a controlled environment, and the authors demonstrate the effectiveness of CUPRITE for density estimation in a signalised section, and discuss issues associated with the method.
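As a rough illustration of how density follows from cumulative counts, the sketch below computes the number of vehicles inside a section as the difference between the upstream and downstream cumulative counts, divided by the section length. It assumes perfect vehicle conservation (no mid-link sinks or sources) and does not include the probe-vehicle correction that CUPRITE applies. The function name, parameter names and units are hypothetical.

```python
def density_from_cumulative_counts(upstream, downstream, length_km,
                                   initial_vehicles=0):
    """Estimate section density (vehicles per km) at each time step.

    upstream and downstream are cumulative vehicle counts sampled at the
    same instants. initial_vehicles is the number of vehicles already in
    the section when counting starts (an assumed input).
    """
    if len(upstream) != len(downstream):
        raise ValueError("count series must have the same length")
    if length_km <= 0:
        raise ValueError("section length must be positive")
    # Vehicles in the section = entries minus exits, plus the initial stock.
    return [
        (initial_vehicles + u - d) / length_km
        for u, d in zip(upstream, downstream)
    ]
```

When detectors miscount, or vehicles enter or leave mid-link, the difference between the two series drifts over time. That drift is the inconsistency the abstract says the probe-vehicle samples are used to reduce or cancel.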
Abstract:
Numerous statements and declarations have been made over recent decades in support of open access to research data. The growing recognition of the importance of open access to research data has been accompanied by calls on public research funding agencies and universities to facilitate better access to publicly funded research data so that it can be re-used and redistributed as public goods. International and inter-governmental bodies such as the ICSU/CODATA, the OECD and the European Union are strong supporters of open access to and re-use of publicly funded research data. This thesis focuses on the research data created by university researchers in Malaysian public universities whose research activities are funded by the Federal Government of Malaysia. Malaysia, like many countries, has not yet formulated a policy on open access to and re-use of publicly funded research data. Therefore, the aim of this thesis is to develop a policy to support the objective of enabling open access to and re-use of publicly funded research data in Malaysian public universities. Policy development is very important if the objective of enabling open access to and re-use of publicly funded research data is to be successfully achieved. In developing the policy, this thesis identifies a myriad of legal impediments arising from intellectual property rights, confidentiality, privacy and national security laws, novelty requirements in patent law and lack of a legal duty to ensure data quality. Legal impediments such as these have the effect of restricting, obstructing, hindering or slowing down the objective of enabling open access to and re-use of publicly funded research data. A key focus in the formulation of the policy was the need to resolve the various legal impediments that have been identified. This thesis analyses the existing policies and guidelines of Malaysian public universities to ascertain to what extent the legal impediments have been resolved. 
An international perspective is adopted by making a comparative analysis of the policies of public research funding agencies and universities in the United Kingdom, the United States and Australia to understand how they have dealt with the identified legal impediments. These countries have led the way in introducing policies which support open access to and re-use of publicly funded research data. As well as proposing a policy supporting open access to and re-use of publicly funded research data in Malaysian public universities, this thesis provides procedures for the implementation of the policy and guidelines for addressing the legal impediments to open access and re-use.
Abstract:
Collecting regular personal reflections from first year teachers in rural and remote schools is challenging as they are busily absorbed in their practice, and separated from each other and the researchers by thousands of kilometres. In response, an innovative web-based solution was designed to both collect data and be a responsive support system for early career teachers as they came to terms with their new professional identities within rural and remote school settings. Using an emailed link to a web-based application named goingok.com, the participants are charting their first year plotlines using a sliding scale from ‘distressed’, ‘ok’ to ‘soaring’ and describing their self-assessment in short descriptive posts. These reflections are visible to the participants as a developing online journal, while the collections of de-identified developing plotlines are visible to the research team, alongside numerical data. This paper explores important aspects of the design process, together with the challenges and opportunities encountered in its implementation. A number of the key considerations for choosing to develop a web application for data collection are initially identified, and the resultant application features and scope are then examined. Examples are then provided about how a responsive software development approach can be part of a supportive feedback loop for participants while being an effective data collection process. Opportunities for further development are also suggested with projected implications for future research.
Abstract:
In contemporary game development circles the ‘game making jam’ has become an important rite of passage and baptism event, an exploration space and a central indie lifestyle affirmation and community event. Game jams have recently become a focus for design researchers interested in the creative process. In this paper we tell the story of an established local game jam and our various documentation and data collection methods. We present the beginnings of the current project, which seeks to map the creative teams and their process in the space of the challenge, and which aims to enable participants to be more than the objects of the data collection. A perceived issue is that typical documentation approaches are ‘about’ the event as opposed to ‘made by’ the participants and are thus both at odds with the spirit of the jam as a phenomenon and do not really access the rich playful potential of participant experience. In the data collection and visualisation projects described here, we focus on using collected data to re-include the participants in telling stories about their experiences of the event as a place-based experience. Our goal is to find a means to encourage production of ‘anecdata’ - data based on individual story telling that is subjective, malleable, and resists collection via formal mechanisms - and to enable mimesis, or active narrating, on the part of the participants. We present a concept design for data as game based on the logic of early medieval maps and we reflect on how we could enable participation in the data collection itself.
Abstract:
A 3-year longitudinal study, Transforming Children’s Mathematical and Scientific Development, integrates, through data modelling, a pedagogical approach focused on mathematical patterns and structural relationships with learning in science. As part of this study, a purposive sample of 21 highly able Grade 1 students was engaged in an innovative data modelling program. Representational development was observed in the majority of students. Their complex graphs depicting categorical and continuous data revealed a high level of structure and enabled identification of structural features critical to this development.
Abstract:
The activities introduced here were used in association with a research project in four Year 4 classrooms and are suggested as a motivating way to address several criteria for Measurement and Data in the Australian Curriculum: Mathematics. The activities involve measuring the arm span of one student in a class many times and then of all students once.
Abstract:
Trees can represent the semi-structured data that is common in the web domain. Finding similarities between trees is essential for several applications that deal with semi-structured data. Existing similarity methods compare the nodes and paths of a pair of trees to measure the similarity between them. However, these methods give poor results for unordered tree data and have NP-hard or MAX-SNP-hard complexity. In this paper, we present a novel method that first encodes a tree using an optimal traversal approach and then uses the encoding to model the tree with its equivalent matrix representation, so that similarity between unordered trees can be found efficiently. Empirical analysis shows that the proposed method achieves high accuracy even on large data sets.
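The abstract does not spell out its encoding or its matrix representation, so the sketch below swaps in a simpler, well-known idea to show why an order-independent encoding helps with unordered trees: a canonical string in which the encodings of a node's children are sorted before being concatenated. The tree format (a `(label, children)` tuple) and the function name `canonical` are assumptions for illustration, and labels are assumed not to contain parentheses.

```python
def canonical(tree):
    """Return an order-independent string encoding of a labelled tree.

    tree is a (label, children) pair, where children is a list of trees.
    Sorting the child encodings means that two trees differing only in
    sibling order receive the same encoding.
    """
    label, children = tree
    return "(" + label + "".join(sorted(canonical(c) for c in children)) + ")"


def same_unordered_tree(a, b):
    """True if the two trees are identical up to reordering of siblings."""
    return canonical(a) == canonical(b)
```

This only tests exact equality up to sibling order. A graded similarity score, as the abstract describes, would need a further comparison step over the encodings.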