936 resultados para Data Analytics
Resumo:
As support grows for greater access to information and data held by governments, so does awareness of the need for appropriate policy, technical and legal frameworks to achieve the desired economic and societal outcomes. Since the late 2000s numerous international organizations, inter-governmental bodies and governments have issued open government data policies, which set out key principles underpinning access to, and the release and reuse of data. These policies reiterate the value of government data and establish the default position that it should be openly accessible to the public under transparent and non-discriminatory conditions, which are conducive to innovative reuse of the data. A key principle stated in open government data policies is that legal rights in government information must be exercised in a manner that is consistent with and supports the open accessibility and reusability of the data. In particular, where government information and data is protected by copyright, access should be provided under licensing terms which clearly permit its reuse and dissemination. This principle has been further developed in the policies issued by Australian Governments into a specific requirement that Government agencies are to apply the Creative Commons Attribution licence (CC BY) as the default licensing position when releasing government information and data. A wide-ranging survey of the practices of Australian Government agencies in managing their information and data, commissioned by the Office of the Australian Information Commissioner in 2012, provides valuable insights into progress towards the achievement of open government policy objectives and the adoption of open licensing practices. The survey results indicate that Australian Government agencies are embracing open access and a proactive disclosure culture and that open licensing under Creative Commons licences is increasingly prevalent. However, the finding that ‘[t]he default position of open access licensing is not clearly or robustly stated, nor properly reflected in the practice of Government agencies’ points to the need to further develop the policy framework and the principles governing information access and reuse, and to provide practical guidance tools on open licensing if the broadest range of government information and data is to be made available for innovative reuse.
Resumo:
This study considered the problem of predicting survival, based on three alternative models: a single Weibull, a mixture of Weibulls and a cure model. Instead of the common procedure of choosing a single “best” model, where “best” is defined in terms of goodness of fit to the data, a Bayesian model averaging (BMA) approach was adopted to account for model uncertainty. This was illustrated using a case study in which the aim was the description of lymphoma cancer survival with covariates given by phenotypes and gene expression. The results of this study indicate that if the sample size is sufficiently large, one of the three models emerge as having highest probability given the data, as indicated by the goodness of fit measure; the Bayesian information criterion (BIC). However, when the sample size was reduced, no single model was revealed as “best”, suggesting that a BMA approach would be appropriate. Although a BMA approach can compromise on goodness of fit to the data (when compared to the true model), it can provide robust predictions and facilitate more detailed investigation of the relationships between gene expression and patient survival. Keywords: Bayesian modelling; Bayesian model averaging; Cure model; Markov Chain Monte Carlo; Mixture model; Survival analysis; Weibull distribution
Resumo:
The Bluetooth technology is being increasingly used, among the Automated Vehicle Identification Systems, to retrieve important information about urban networks. Because the movement of Bluetooth-equipped vehicles can be monitored, throughout the network of Bluetooth sensors, this technology represents an effective means to acquire accurate time dependant Origin Destination information. In order to obtain reliable estimations, however, a number of issues need to be addressed, through data filtering and correction techniques. Some of the main challenges inherent to Bluetooth data are, first, that Bluetooth sensors may fail to detect all of the nearby Bluetooth-enabled vehicles. As a consequence, the exact journey for some vehicles may become a latent pattern that will need to be estimated. Second, sensors that are in close proximity to each other may have overlapping detection areas, thus making the task of retrieving the correct travelled path even more challenging. The aim of this paper is twofold: to give an overview of the issues inherent to the Bluetooth technology, through the analysis of the data available from the Bluetooth sensors in Brisbane; and to propose a method for retrieving the itineraries of the individual Bluetooth vehicles. We argue that estimating these latent itineraries, accurately, is a crucial step toward the retrieval of accurate dynamic Origin Destination Matrices.
Resumo:
Transit passenger market segmentation enables transit operators to target different classes of transit users to provide customized information and services. The Smart Card (SC) data, from Automated Fare Collection system, facilitates the understanding of multiday travel regularity of transit passengers, and can be used to segment them into identifiable classes of similar behaviors and needs. However, the use of SC data for market segmentation has attracted very limited attention in the literature. This paper proposes a novel methodology for mining spatial and temporal travel regularity from each individual passenger’s historical SC transactions and segments them into four segments of transit users. After reconstructing the travel itineraries from historical SC transactions, the paper adopts the Density-Based Spatial Clustering of Application with Noise (DBSCAN) algorithm to mine travel regularity of each SC user. The travel regularity is then used to segment SC users by an a priori market segmentation approach. The methodology proposed in this paper assists transit operators to understand their passengers and provide them oriented information and services.
Resumo:
Over the past decade, vision-based tracking systems have been successfully deployed in professional sports such as tennis and cricket for enhanced broadcast visualizations as well as aiding umpiring decisions. Despite the high-level of accuracy of the tracking systems and the sheer volume of spatiotemporal data they generate, the use of this high quality data for quantitative player performance and prediction has been lacking. In this paper, we present a method which predicts the location of a future shot based on the spatiotemporal parameters of the incoming shots (i.e. shot speed, location, angle and feet location) from such a vision system. Having the ability to accurately predict future short-term events has enormous implications in the area of automatic sports broadcasting in addition to coaching and commentary domains. Using Hawk-Eye data from the 2012 Australian Open Men's draw, we utilize a Dynamic Bayesian Network to model player behaviors and use an online model adaptation method to match the player's behavior to enhance shot predictability. To show the utility of our approach, we analyze the shot predictability of the top 3 players seeds in the tournament (Djokovic, Federer and Nadal) as they played the most amounts of games.
Resumo:
Technological advances have led to an influx of affordable hardware that supports sensing, computation and communication. This hardware is increasingly deployed in public and private spaces, tracking and aggregating a wealth of real-time environmental data. Although these technologies are the focus of several research areas, there is a lack of research dealing with the problem of making these capabilities accessible to everyday users. This thesis represents a first step towards developing systems that will allow users to leverage the available infrastructure and create custom tailored solutions. It explores how this notion can be utilized in the context of energy monitoring to improve conventional approaches. The project adopted a user-centered design process to inform the development of a flexible system for real-time data stream composition and visualization. This system features an extensible architecture and defines a unified API for heterogeneous data streams. Rather than displaying the data in a predetermined fashion, it makes this information available as building blocks that can be combined and shared. It is based on the insight that individual users have diverse information needs and presentation preferences. Therefore, it allows users to compose rich information displays, incorporating personally relevant data from an extensive information ecosystem. The prototype was evaluated in an exploratory study to observe its natural use in a real-world setting, gathering empirical usage statistics and conducting semi-structured interviews. The results show that a high degree of customization does not warrant sustained usage. Other factors were identified, yielding recommendations for increasing the impact on energy consumption.
Resumo:
The study of the relationship between macroscopic traffic parameters, such as flow, speed and travel time, is essential to the understanding of the behaviour of freeway and arterial roads. However, the temporal dynamics of these parameters are difficult to model, especially for arterial roads, where the process of traffic change is driven by a variety of variables. The introduction of the Bluetooth technology into the transportation area has proven exceptionally useful for monitoring vehicular traffic, as it allows reliable estimation of travel times and traffic demands. In this work, we propose an approach based on Bayesian networks for analyzing and predicting the complex dynamics of flow or volume, based on travel time observations from Bluetooth sensors. The spatio-temporal relationship between volume and travel time is captured through a first-order transition model, and a univariate Gaussian sensor model. The two models are trained and tested on travel time and volume data, from an arterial link, collected over a period of six days. To reduce the computational costs of the inference tasks, volume is converted into a discrete variable. The discretization process is carried out through a Self-Organizing Map. Preliminary results show that a simple Bayesian network can effectively estimate and predict the complex temporal dynamics of arterial volumes from the travel time data. Not only is the model well suited to produce posterior distributions over single past, current and future states; but it also allows computing the estimations of joint distributions, over sequences of states. Furthermore, the Bayesian network can achieve excellent prediction, even when the stream of travel time observation is partially incomplete.
Resumo:
This paper analyses the probabilistic linear discriminant analysis (PLDA) speaker verification approach with limited development data. This paper investigates the use of the median as the central tendency of a speaker’s i-vector representation, and the effectiveness of weighted discriminative techniques on the performance of state-of-the-art length-normalised Gaussian PLDA (GPLDA) speaker verification systems. The analysis within shows that the median (using a median fisher discriminator (MFD)) provides a better representation of a speaker when the number of representative i-vectors available during development is reduced, and that further, usage of the pair-wise weighting approach in weighted LDA and weighted MFD provides further improvement in limited development conditions. Best performance is obtained using a weighted MFD approach, which shows over 10% improvement in EER over the baseline GPLDA system on mismatched and interview-interview conditions.
Resumo:
The geographic location of cloud data storage centres is an important issue for many organisations and individuals due to various regulations that require data and operations to reside in specific geographic locations. Thus, cloud users may want to be sure that their stored data have not been relocated into unknown geographic regions that may compromise the security of their stored data. Albeshri et al. (2012) combined proof of storage (POS) protocols with distance-bounding protocols to address this problem. However, their scheme involves unnecessary delay when utilising typical POS schemes due to computational overhead at the server side. The aim of this paper is to improve the basic GeoProof protocol by reducing the computation overhead at the server side. We show how this can maintain the same level of security while achieving more accurate geographic assurance.
Resumo:
Prophylactic surgery including hysterectomy and bilateral salpingo-oophorectomy (BSO) is recommended in BRCA positive women, while in women from the general population, hysterectomy plus BSO may increase the risk of overall mortality. The effect of hysterectomy plus BSO on women previously diagnosed with breast cancer is unknown. We used data from a population-base data linkage study of all women diagnosed with primary breast cancer in Queensland, Australia between 1997 and 2008 (n=21,067). We fitted flexible parametric breast cancer specific and overall survival models with 95% confidence intervals (also known as Royston-Parmar models) to assess the impact of risk-reducing surgery (removal of uterus, one or both ovaries). We also stratified analyses by age 20-49 and 50-79 years, respectively. Overall, 1,426 women (7%) underwent risk-reducing surgery (13% of premenopausal women and 3% of postmenopausal women). No women who had risk-reducing surgery, compared to 171 who did not have risk-reducing surgery developed a gynaecological cancer. Overall, 3,165 (15%) women died, including 2,195 (10%) from breast cancer. Hysterectomy plus BSO was associated with significantly reduced risk of death overall (adjusted HR = 0.69, 95% CI 0.53-0.89; P =0.005). Risk reduction was greater among premenopausal women, whose risk of death halved (HR, 0.45; 95% CI, 0.25-0.79; P < 0.006). This was largely driven by reduction in breast cancer-specific mortality (HR, 0.43; 95% CI, 0.24-0.79; P < 0.006). This population-based study found that risk-reducing surgery halved the mortality risk for premenopausal breast cancer patients. Replication of our results in independent cohorts, and subsequently randomised trials are needed to confirm these findings.
Resumo:
For industrial wireless sensor networks, maintaining the routing path for a high packet delivery ratio is one of the key objectives in network operations. It is important to both provide the high data delivery rate at the sink node and guarantee a timely delivery of the data packet at the sink node. Most proactive routing protocols for sensor networks are based on simple periodic updates to distribute the routing information. A faulty link causes packet loss and retransmission at the source until periodic route update packets are issued and the link has been identified as broken. We propose a new proactive route maintenance process where periodic update is backed-up with a secondary layer of local updates repeating with shorter periods for timely discovery of broken links. Proposed route maintenance scheme improves reliability of the network by decreasing the packet loss due to delayed identification of broken links. We show by simulation that proposed mechanism behaves better than the existing popular routing protocols (AODV, AOMDV and DSDV) in terms of end-to-end delay, routing overhead, packet reception ratio.
Resumo:
Facial expression recognition (FER) systems must ultimately work on real data in uncontrolled environments although most research studies have been conducted on lab-based data with posed or evoked facial expressions obtained in pre-set laboratory environments. It is very difficult to obtain data in real-world situations because privacy laws prevent unauthorized capture and use of video from events such as funerals, birthday parties, marriages etc. It is a challenge to acquire such data on a scale large enough for benchmarking algorithms. Although video obtained from TV or movies or postings on the World Wide Web may also contain ‘acted’ emotions and facial expressions, they may be more ‘realistic’ than lab-based data currently used by most researchers. Or is it? One way of testing this is to compare feature distributions and FER performance. This paper describes a database that has been collected from television broadcasts and the World Wide Web containing a range of environmental and facial variations expected in real conditions and uses it to answer this question. A fully automatic system that uses a fusion based approach for FER on such data is introduced for performance evaluation. Performance improvements arising from the fusion of point-based texture and geometry features, and the robustness to image scale variations are experimentally evaluated on this image and video dataset. Differences in FER performance between lab-based and realistic data, between different feature sets, and between different train-test data splits are investigated.
Resumo:
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Moreover, several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple-level and the database-level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques.
Resumo:
A retrospective, descriptive analysis of a sample of children under 18 years presenting to a hospital emergency department (ED) for treatment of an injury was conducted. The aim was to explore characteristics and identify differences between children assigned abuse codes and children assigned unintentional injury codes using an injury surveillance database. Only 0.1% of children had been assigned the abuse code and 3.9% a code indicating possible abuse. Children between 2-5 years formed the largest proportion of those coded to abuse. Superficial injury and bruising were the most common types of injury seen in children in the abuse group and the possible abuse group (26.9% and 18.8% respectively), whereas those with unintentional injury were most likely to present with open wounds (18.4%). This study demonstrates that routinely collected injury surveillance data can be a useful source of information for describing injury characteristics in children assigned abuse codes compared to those assigned no abuse codes.
Resumo:
Aims: To compare different methods for identifying alcohol involvement in injury-related emergency department presentation in Queensland youth, and to explore the alcohol terminology used in triage text. Methods: Emergency Department Information System data were provided for patients aged 12-24 years with an injury-related diagnosis code for a 5 year period 2006-2010 presenting to a Queensland emergency department (N=348895). Three approaches were used to estimate alcohol involvement: 1) analysis of coded data, 2) mining of triage text, and 3) estimation using an adaptation of alcohol attributable fractions (AAF). Cases were identified as ‘alcohol-involved’ by code and text, as well as AAF weighted. Results: Around 6.4% of these injury presentations overall had some documentation of alcohol involvement, with higher proportions of alcohol involvement documented for 18-24 year olds, females, indigenous youth, where presentations occurred on a Saturday or Sunday, and where presentations occurred between midnight and 5am. The most common alcohol terms identified for all subgroups were generic alcohol terms (eg. ETOH or alcohol) with almost half of the cases where alcohol involvement was documented having a generic alcohol term recorded in the triage text. Conclusions: Emergency department data is a useful source of information for identification of high risk sub-groups to target intervention opportunities, though it is not a reliable source of data for incidence or trend estimation in its current unstandardised form. Improving the accuracy and consistency of identification, documenting and coding of alcohol-involvement at the point of data capture in the emergency department is the most desirable long term approach to produce a more solid evidence base to support policy and practice in this field.