812 results for Big data, Spark, Hadoop
Abstract:
In the context of learning paradigms of identification in the limit, we address the question: why is uncertainty sometimes desirable? We use mind change bounds on the output hypotheses as a measure of uncertainty and interpret ‘desirable’ as reduction in data memorization, also defined in terms of mind change bounds. The resulting model is closely related to iterative learning with bounded mind change complexity, but the dual use of mind change bounds — for hypotheses and for data — is a key distinctive feature of our approach. We show that situations exist where the more mind changes the learner is willing to accept, the less the amount of data it needs to remember in order to converge to the correct hypothesis. We also investigate relationships between our model and learning from good examples, set-driven, monotonic and strong-monotonic learners, as well as class-comprising versus class-preserving learnability.
Abstract:
Background: Birth weight and length have seasonal fluctuations. Previous analyses of birth weight by latitude effects identified seemingly contradictory results, showing both 6 and 12 monthly periodicities in weight. The aims of this paper are twofold: (a) to explore seasonal patterns in a large Danish Medical Birth Register, and (b) to explore models based on seasonal exposures and a non-linear exposure-risk relationship. Methods: Birth weight and birth length for over 1.5 million Danish singleton live births were examined for seasonality. We modelled seasonal patterns based on linear, U- and J-shaped exposure-risk relationships. We then added an extra layer of complexity by modelling weighted population-based exposure patterns. Results: The Danish data showed clear seasonal fluctuations for both birth weight and birth length. A bimodal model best fits the data; however, the amplitude of the 6 and 12 month peaks changed over time. In the modelling exercises, U- and J-shaped exposure-risk relationships generate time series with both 6 and 12 month periodicities. Changing the weightings of the population exposure risks results in unexpected properties. A J-shaped exposure-risk relationship with a diminishing population exposure over time fitted the observed seasonal pattern in the Danish birth weight data. Conclusion: In keeping with many other studies, Danish birth anthropometric data show complex and shifting seasonal patterns. We speculate that annual periodicities with non-linear exposure-risk models may underlie these findings. Understanding the nature of seasonal fluctuations can help generate candidate exposures.
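The claim that a purely annual exposure can produce a 6 month periodicity once it passes through a non-linear exposure-risk curve can be illustrated with a small sketch. This is not the authors' model: the cosine exposure, the three response functions, and the offset of 0.3 are all hypothetical choices made only to show the mechanism.

```python
import math

def annual_exposure(months=120):
    # Hypothetical exposure: exactly one cosine cycle per 12 months.
    return [math.cos(2 * math.pi * m / 12) for m in range(months)]

def amplitude(series, period):
    # Amplitude of the sinusoidal component with the given period (in months),
    # obtained by projecting the series onto a cosine and a sine of that period.
    n = len(series)
    c = sum(x * math.cos(2 * math.pi * t / period) for t, x in enumerate(series))
    s = sum(x * math.sin(2 * math.pi * t / period) for t, x in enumerate(series))
    return 2 * math.hypot(c, s) / n

# Hypothetical exposure-risk curves (assumptions, not taken from the paper).
def linear(e):
    return e

def u_shaped(e):
    return (e - 0.3) ** 2      # minimum risk at a mid-range exposure

def j_shaped(e):
    return max(0.0, e) ** 2    # flat at low exposure, rising at high exposure

exposure = annual_exposure()
for name, curve in [("linear", linear), ("U-shaped", u_shaped), ("J-shaped", j_shaped)]:
    outcome = [curve(e) for e in exposure]
    print(name, round(amplitude(outcome, 12), 3), round(amplitude(outcome, 6), 3))
```

Under these assumptions the linear curve yields only a 12 month component, while the U- and J-shaped curves yield both 12 and 6 month components, even though the exposure itself cycles once per year.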
Abstract:
Objective: To determine whether primary care management of chronic heart failure (CHF) differed between rural and urban areas in Australia. Design: A cross-sectional survey stratified by Rural, Remote and Metropolitan Areas (RRMA) classification. The primary source of data was the Cardiac Awareness Survey and Evaluation (CASE) study. Setting: Secondary analysis of data obtained from 341 Australian general practitioners and 23 845 adults aged 60 years or more in 1998. Main outcome measures: CHF determined by criteria recommended by the World Health Organization, diagnostic practices, use of pharmacotherapy, and CHF-related hospital admissions in the 12 months before the study. Results: There was a significantly higher prevalence of CHF among general practice patients in large and small rural towns (16.1%) compared with capital city and metropolitan areas (12.4%) (P < 0.001). Echocardiography was used less often for diagnosis in rural towns compared with metropolitan areas (52.0% v 67.3%, P < 0.001). Rates of specialist referral were also significantly lower in rural towns than in metropolitan areas (59.1% v 69.6%, P < 0.001), as were prescribing rates of angiotensin-converting enzyme inhibitors (51.4% v 60.1%, P < 0.001). There was no geographical variation in prescribing rates of β-blockers (12.6% [rural] v 11.8% [metropolitan], P = 0.32). Overall, few survey participants received recommended “evidence-based practice” diagnosis and management for CHF (metropolitan, 4.6%; rural, 3.9%; and remote areas, 3.7%). Conclusions: This study found a higher prevalence of CHF, and significantly lower use of recommended diagnostic methods and pharmacological treatment among patients in rural areas.
Abstract:
Objectives: To quantify the concordance of hospital child maltreatment data with child protection service (CPS) records and identify factors associated with linkage. Methods: Multivariable logistic regression analysis was conducted following retrospective medical record review and database linkage of 884 child records from 20 hospitals and the CPS in Queensland, Australia. Results: Nearly all children with hospital assigned maltreatment codes (93.1%) had a CPS record. Of these, 85.1% had a recent notification. 29% of the linked maltreatment group (n=113) were not known to CPS prior to the hospital presentation. Almost 1/3 of children with unintentional injury hospital codes were known to CPS. Just over 24% of the linked unintentional injury group (n=34) were not known to CPS prior to the hospital presentation but became known during or after discharge from hospital. These estimates are higher than the 2006/07 annual rate of 2.39% of children being notified to CPS. Rural children were more likely to link to CPS, and children were over 3 times more likely to link if the index injury documentation included additional diagnoses or factors affecting their health. Conclusions: The system for referring maltreatment cases to CPS is generally efficient, although up to 1 in 15 children had codes for maltreatment but could not be linked to CPS data. The high proportion of children with unintentional injury codes who linked to CPS suggests clinicians and hospital-based child protection staff should be supported by further education and training to ensure children at risk are being detected by the child protection system.
Abstract:
Data preprocessing is widely recognized as an important stage in anomaly detection. This paper reviews the data preprocessing techniques used by anomaly-based network intrusion detection systems (NIDS), concentrating on which aspects of the network traffic are analyzed, and what feature construction and selection methods have been used. Motivation for the paper comes from the large impact data preprocessing has on the accuracy and capability of anomaly-based NIDS. The review finds that many NIDS limit their view of network traffic to the TCP/IP packet headers. Time-based statistics can be derived from these headers to detect network scans, network worm behavior, and denial of service attacks. A number of other NIDS perform deeper inspection of request packets to detect attacks against network services and network applications. More recent approaches analyze full service responses to detect attacks targeting clients. The review covers a wide range of NIDS, highlighting which classes of attack are detectable by each of these approaches. Data preprocessing is found to predominantly rely on expert domain knowledge for identifying the most relevant parts of network traffic and for constructing the initial candidate set of traffic features. On the other hand, automated methods have been widely used for feature extraction to reduce data dimensionality, and feature selection to find the most relevant subset of features from this candidate set. The review shows a trend toward deeper packet inspection to construct more relevant features through targeted content parsing. These context sensitive features are required to detect current attacks.
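As a rough illustration of the time-based header statistics mentioned above, the sketch below counts how many distinct destination ports each source address contacts within a time window, a simple feature often associated with port scans. The packet tuple layout, the function name, the window length, and the threshold are all hypothetical and are not drawn from any system covered by the review.

```python
from collections import defaultdict

SCAN_THRESHOLD = 4  # hypothetical: distinct ports per window that we treat as suspicious

def distinct_ports_per_source(packets, window_start, window_len):
    """Count distinct destination ports contacted by each source IP in one time window.

    Each packet is a (timestamp_seconds, source_ip, destination_port) tuple,
    i.e. fields that are available from the TCP/IP headers alone.
    """
    ports = defaultdict(set)
    for ts, src, dst_port in packets:
        if window_start <= ts < window_start + window_len:
            ports[src].add(dst_port)
    return {src: len(seen) for src, seen in ports.items()}

# Hypothetical header records: one source probes several ports in quick succession.
packets = [
    (0.0, "10.0.0.5", 22), (0.1, "10.0.0.5", 23),
    (0.2, "10.0.0.5", 80), (0.3, "10.0.0.5", 443),
    (0.5, "10.0.0.9", 80), (1.5, "10.0.0.9", 80),
    (12.0, "10.0.0.5", 8080),  # falls outside the first 10-second window
]

counts = distinct_ports_per_source(packets, window_start=0.0, window_len=10.0)
suspects = [src for src, n in counts.items() if n >= SCAN_THRESHOLD]
print(counts, suspects)
```

A feature like this uses no payload content, which is why header-only approaches can flag scans but not attacks hidden inside application requests.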
Abstract:
Acoustic sensors play an important role in augmenting the traditional biodiversity monitoring activities carried out by ecologists and conservation biologists. With this ability, however, comes the burden of analysing large volumes of complex acoustic data. Given the complexity of acoustic sensor data, fully automated analysis for a wide range of species is still a significant challenge. This research investigates the use of citizen scientists to analyse large volumes of environmental acoustic data in order to identify bird species. Specifically, it investigates ways in which the efficiency of a user can be improved through the use of species identification tools and the use of reputation models to predict the accuracy of users with unidentified skill levels. Initial experimental results are reported.
Abstract:
This thesis explores the proposition that growth and development in the screen and creative industries is not confined to the major capital cities. Lifestyle considerations, combined with advances in digital technology, convergence and greater access to broadband are altering requirements for geographic location, and creative workers are being drawn away from the big metropolises to certain regional areas. Regional screen industry enclaves are emerging outside of London, in the Highlands and Islands of Scotland, in Nova Scotia in Canada and in New Zealand. In the Australian context, the proposition is tested in an area regarded as a ‘special case’ in creative industry expansion: the Northern Rivers region of NSW. A key feature of the ‘specialness’ of this region is the large number of experienced, credited producers who live and operate their businesses within the region. The development of screen and creative industries in the Northern Rivers over the decade 2000 – 2010 has implications for regional regeneration and offers new insights into the rapidly changing screen industry landscape. This development also has implications for creative industry discourse, especially the dominance of the urban in creative industries thought. The research is pioneering in a number of ways. Building on the work conducted for my Masters thesis in 2000, a second study was conducted during the research phase, adapting creative industries theory and mapping methods, which have been largely city and nation-centric, and applying them to a regional context. The study adopted an action research approach as an industry development strategy for screen industries, while at the same time developing fine-grained ground up methods for collecting primary quantitative data on the size and scope of the creative industries. In accordance with the action research framework, the researcher also acted in the dual roles of industry activist and screen industry producer in the region. 
The central focus of the research has been both to document and contribute to the growth and development of screen and creative industries over the past decade in the Northern Rivers region. These interventions, along with policy developments at both a local and national level, and broader global shifts, have had the effect of repositioning the sector from a marginal one to a priority area considered integral to the future economic and cultural life of the region. The research includes a detailed mapping study undertaken in 2005 with comparisons to an earlier 2000 study and to ABS data for 2001 and 2006 to reveal growth trends. It also includes two case studies of projects that developed from idea to production and completion in the region during the decade in question. The studies reveal the drivers, impediments and policy implications for sustaining the development of screen industries in a regional area. A major finding of the research was the large and increasing number of experienced producers who operate within the region and the leadership role they play in driving the development of the emerging local industry. The two case studies demonstrate the impact of policy decisions on local screen industry producers and their enterprises. A brief overview of research in other regional areas is presented, including two international examples, and what they reveal about regional regeneration. Implications are drawn for creative industries discourse and regional development policy challenges for the future.
Abstract:
Manually constructing domain-specific sentiment lexicons is extremely time consuming, and it may not even be feasible for domains where linguistic expertise is not available. Research on the automatic construction of domain-specific sentiment lexicons has therefore become a hot topic in recent years. The main contribution of this paper is the illustration of a novel semi-supervised learning method which exploits both term-to-term and document-to-term relations hidden in a corpus for the construction of domain-specific sentiment lexicons. More specifically, the proposed two-pass pseudo labeling method combines shallow linguistic parsing and corpus-based statistical learning to make domain-specific sentiment extraction scalable with respect to the sheer volume of opinionated documents archived on the Internet these days. Another novelty of the proposed method is that it can utilize the readily available user-contributed labels of opinionated documents (e.g., the user ratings of product reviews) to bootstrap the performance of sentiment lexicon construction. Our experiments show that the proposed method can generate high quality domain-specific sentiment lexicons as directly assessed by human experts. Moreover, the system-generated domain-specific sentiment lexicons can improve polarity prediction tasks at the document level by 2.18% when compared to other well-known baseline methods. Our research opens the door to the development of practical and scalable methods for domain-specific sentiment analysis.
Abstract:
This thesis investigates profiling and differentiating customers through the use of statistical data mining techniques. The business application of our work centres on examining individuals’ seldom studied yet critical consumption behaviour over an extensive time period within the context of the wireless telecommunication industry; consumption behaviour (as opposed to purchasing behaviour) is behaviour that has been performed so frequently that it becomes habitual and involves minimal intention or decision making. Key variables investigated are the activity initialised timestamp and cell tower location as well as the activity type and usage quantity (e.g., voice call with duration in seconds); the research focuses on customers’ spatial and temporal usage behaviour. The main methodological emphasis is on the development of clustering models based on Gaussian mixture models (GMMs) which are fitted with the use of the recently developed variational Bayesian (VB) method. VB is an efficient deterministic alternative to the popular but computationally demanding Markov chain Monte Carlo (MCMC) methods. The standard VB-GMM algorithm is extended by allowing component splitting such that it is robust to initial parameter choices and can automatically and efficiently determine the number of components. The new algorithm we propose allows more effective modelling of individuals’ highly heterogeneous and spiky spatial usage behaviour, or more generally human mobility patterns; the term spiky describes data patterns with large areas of low probability mixed with small areas of high probability. Customers are then characterised and segmented based on the fitted GMM, which corresponds to how each of them uses the products/services spatially in their daily lives; this is essentially their likely lifestyle and occupational traits.
Other significant research contributions include fitting GMMs using VB to circular data, i.e., the temporal usage behaviour, and developing clustering algorithms suitable for high dimensional data based on the use of VB-GMM.
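A short sketch may help show why temporal usage behaviour counts as circular data and needs special treatment. It is not the thesis's VB-GMM method, only a circular mean of hour-of-day values; the function name and the sample call times are hypothetical.

```python
import math

def circular_mean_hour(hours):
    # Map each hour of day to an angle on a 24-hour circle, average the unit
    # vectors, and map the resulting direction back to an hour in [0, 24].
    angles = [2 * math.pi * h / 24 for h in hours]
    mean_sin = sum(math.sin(a) for a in angles) / len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / len(angles)
    return (math.atan2(mean_sin, mean_cos) * 24 / (2 * math.pi)) % 24

# Hypothetical call start times: one at 23:00 and one at 01:00.
late_night_calls = [23.0, 1.0]

arithmetic = sum(late_night_calls) / len(late_night_calls)
circular = circular_mean_hour(late_night_calls)
print(arithmetic, circular)
```

The arithmetic mean places these two late-night calls at midday, whereas the circular mean places them at midnight (up to floating-point rounding), which is why a mixture model for time of day has to respect the wrap-around at 24 hours.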
Abstract:
High levels of sitting have been linked with poor health outcomes. Previously, a pragmatic MTI accelerometer data cut-point (100 counts/min) has been used to estimate sitting. Data on the accuracy of this cut-point are unavailable. PURPOSE: To ascertain whether the 100 counts/min cut-point accurately isolates sitting from standing activities. METHODS: Participants fitted with an MTI accelerometer were observed performing a range of sitting, standing, light and moderate activities. 1-min epoch MTI data were matched to observed activities, then re-categorized as either sitting or not using the 100 counts/min cut-point. Self-reported demographics and current physical activity were collected. Generalized estimating equation (GEE) analyses for repeated measures with a binary logistic model, corrected for age, gender and BMI, were conducted to ascertain the odds of the MTI data being misclassified. RESULTS: Data were from 26 healthy subjects (8 men; 50% aged <25 years; mean BMI (SD) 22.7 (3.8) kg/m²). The mode of the MTI sitting and standing data was 0 counts/min, with 46% of sitting activities and 21% of standing activities recording 0 counts/min. The GEE was unable to accurately isolate sitting from standing activities using the 100 counts/min cut-point, since all sitting activities were incorrectly predicted as standing (p=0.05). To further explore the sensitivity of MTI data to delineate sitting from standing, the upper 95% confidence interval of the mean for the sitting activities (46 counts/min) was used to re-categorise the data; this resulted in the GEE correctly classifying 49% of sitting and 69% of standing activities. Using the 100 counts/min cut-point, the data were re-categorised into a combined ‘sit/stand’ category and tested against other light activities: 88% of sit/stand and 87% of light activities were accurately predicted. Using Freedson’s moderate cut-point of 1952 counts/min, the GEE accurately predicted 97% of light vs. 90% of moderate activities.
CONCLUSION: The distributions of MTI-recorded sitting and standing data overlap considerably; as such, the 100 counts/min cut-point did not accurately isolate sitting from other static standing activities. The 100 counts/min cut-point more accurately predicted sit/stand vs. other movement-orientated activities.
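The cut-point logic described in this abstract can be sketched as a simple threshold classifier. The two thresholds come from the abstract; the function name, the category labels, the handling of values exactly at a threshold, and the sample epochs are assumptions made for illustration.

```python
SIT_STAND_CUT = 100   # counts/min, the pragmatic cut-point named in the abstract
MODERATE_CUT = 1952   # counts/min, Freedson's moderate cut-point named in the abstract

def classify_epoch(counts_per_min):
    # Assumption: values exactly at a cut-point belong to the higher category.
    if counts_per_min < SIT_STAND_CUT:
        return "sit/stand"
    if counts_per_min < MODERATE_CUT:
        return "light"
    return "moderate"

# Hypothetical 1-min epochs, not data from the study.
epochs = [0, 46, 99, 100, 850, 1952, 3000]
labels = [classify_epoch(c) for c in epochs]
print(list(zip(epochs, labels)))
```

An epoch of 0 counts/min receives the same label whether the participant was sitting or standing still, which is consistent with the finding that the cut-point separates sit/stand from movement better than it separates sitting from standing.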