942 resultados para Blog datasets
Resumo:
Today’s evolving networks are experiencing a large number of different attacks ranging from system break-ins, infection from automatic attack tools such as worms, viruses, trojan horses and denial of service (DoS). One important aspect of such attacks is that they are often indiscriminate and target Internet addresses without regard to whether they are bona fide allocated or not. Due to the absence of any advertised host services the traffic observed on unused IP addresses is by definition unsolicited and likely to be either opportunistic or malicious. The analysis of large repositories of such traffic can be used to extract useful information about both ongoing and new attack patterns and unearth unusual attack behaviors. However, such an analysis is difficult due to the size and nature of the collected traffic on unused address spaces. In this dissertation, we present a network traffic analysis technique which uses traffic collected from unused address spaces and relies on the statistical properties of the collected traffic, in order to accurately and quickly detect new and ongoing network anomalies. Detection of network anomalies is based on the concept that an anomalous activity usually transforms the network parameters in such a way that their statistical properties no longer remain constant, resulting in abrupt changes. In this dissertation, we use sequential analysis techniques to identify changes in the behavior of network traffic targeting unused address spaces to unveil both ongoing and new attack patterns. Specifically, we have developed a dynamic sliding window based non-parametric cumulative sum change detection techniques for identification of changes in network traffic. Furthermore we have introduced dynamic thresholds to detect changes in network traffic behavior and also detect when a particular change has ended. Experimental results are presented that demonstrate the operational effectiveness and efficiency of the proposed approach, using both synthetically generated datasets and real network traces collected from a dedicated block of unused IP addresses.
Resumo:
This report explains the objectives, datasets and evaluation criteria of both the clustering and classification tasks set in the INEX 2009 XML Mining track. The report also describes the approaches and results obtained by the different participants.
Resumo:
The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by the increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia related complications. Stress electrocardiography/exercise testing is predictive of 10 year risk of CVD events and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and predictive data mining methods are applied to this data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for removal of outliers, calculation of moving averages as well as data summarisation and data abstraction methods. Feature selection methods of both wrapper and filter types are applied to derived physiological time-series variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed. The major dataset is derived from the entire anaesthesia population and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads). Because of the unbalanced class distribution in the data, majority class under-sampling and Kappa statistic together with misclassification rate and area under the ROC curve (AUC) are used for evaluation of models generated using different prediction algorithms. The performance based on models derived from feature reduced datasets reveal the filter method, Cfs subset evaluation, to be most consistently effective although Consistency derived subsets tended to slightly increased accuracy but markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This could be eliminated by consideration of the AUC or Kappa statistic as well by evaluation of subsets with under-sampled majority class. The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62 with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection results in reduction in MR to 9.8 to 10.16 with time segmented summary data (dataset F) MR being 9.8 and raw time-series summary data (dataset A) being 9.92. However, for all time-series only based datasets, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series alone datasets but models derived from these subsets are of one leaf only. MR values are consistent with class distribution in the subset folds evaluated in the n-cross validation method. For models based on Cfs selected time-series derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36 with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time segmented time-series variables and RF) being 9.09. The models based on counts of outliers and counts of data points outside normal range (Dataset RF_E) and derived variables based on time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (Dataset RF_ G) perform the least well with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28 while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are most comprehensible and clinically relevant. The predictive accuracy increase achieved by addition of risk factor variables to time-series variable based models is significant. The addition of time-series derived variables to models based on risk factor variables alone is associated with a trend to improved performance. Data mining of feature reduced, anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone. The limited additional value provided by electrocardiographic variables when compared to use of risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The effect of the pre-processing used in this study had limited effect when time-series variables and risk factor variables are used as model input. In the absence of risk factor input, the use of time-series variables after outlier removal and time series variables based on physiological variable values’ being outside the accepted normal range is associated with some improvement in model performance.
Resumo:
Presentation describling a project in data intensive research in the humanities. Measuring activity of publically available data in social networks such as Blogosphere, Twitter, Flickr, YouTube
Resumo:
Nature Refuges encompass the second largest extent of protected area estate in Queensland. Major problems exist in the data capture, map presentation, data quality and integrity of these boundaries. The spatial accuracies/inaccuracies of the Nature Refuge administrative boundaries directly influence the ability to preserve valuable ecosystems by challenging negative environmental impacts on these properties. This research work is about supporting the Nature Refuge Programs efforts to secure Queensland’s natural and cultural values on private land by utilising GIS and its advanced functionalities. The research design organizes and enters Queensland’s Nature Refuge boundaries into a spatial environment. Survey quality data collection techniques such as the Global Positioning Systems (GPS) are investigated to capture Nature Refuge boundary information. Using the concepts of map communication GIS Cartography is utilised for the protected area plan design. New spatial datasets are generated facilitating the effectiveness of investigative data analysis. The geodatabase model developed by this study adds rich GIS behaviour providing the capability to store, query, and manipulate geographic information. It provides the ability to leverage data relationships and enforces topological integrity creating savings in customization and productivity. The final phase of the research design incorporates the advanced functions of ArcGIS. These functions facilitate building spatial system models. The geodatabase and process models developed by this research can be easily modified and the data relating to mining can be replaced by other negative environmental impacts affecting the Nature Refuges. Results of the research are presented as graphs and maps providing visual evidence supporting the usefulness of GIS as means for capturing, visualising and enhancing spatial quality and integrity of Nature Refuge boundaries.
Resumo:
Many randomised controlled trials (RCT) have been conducted using Piper methysticum (kava), however no qualitative research exploring the experience of taking kava during a clinical trial has previously been reported. ---------- Patients and methods: A qualitative research component (in the form of semi structured and open ended written questions) was incorporated into an RCT to explore the experiences of those participating in a clinical trial of kava. The written questions were provided to participants at weeks 2 and 3 (after randomisation, after each controlled phase). The researcher and participants were blinded as to whether they were taking kava or placebo. Two open ended questions were posed to elucidate their experiences from taking either kava or placebo. Thematic analysis was undertaken and researcher triangulation employed to ensure analytical rigour. Key themes after the kava phases were a reduction in anxiety and stress, and calming or relaxing mental effects. Other themes related to improvement in sleep and in somatic anxiety symptoms. ---------- Results: Kava use did not cause any serious adverse reactions although a few respondents reported nausea or other gastrointestinal side effects. This represents the first documented qualitative investigation of the experience of taking kava during a clinical trial. The primary themes involved anxiolytic and calming effects, with only a minor theme reflecting side effects. Our exploratory qualitative data was consistent with the significant quantitative results revealed in the study and provides additional support to suggest the trial results did not exclude any important positive or negative effects (at least as experienced by the trial participants).
Resumo:
Predicting safety on roadways is standard practice for road safety professionals and has a corresponding extensive literature. The majority of safety prediction models are estimated using roadway segment and intersection (microscale) data, while more recently efforts have been undertaken to predict safety at the planning level (macroscale). Safety prediction models typically include roadway, operations, and exposure variables—factors known to affect safety in fundamental ways. Environmental variables, in particular variables attempting to capture the effect of rain on road safety, are difficult to obtain and have rarely been considered. In the few cases weather variables have been included, historical averages rather than actual weather conditions during which crashes are observed have been used. Without the inclusion of weather related variables researchers have had difficulty explaining regional differences in the safety performance of various entities (e.g. intersections, road segments, highways, etc.) As part of the NCHRP 8-44 research effort, researchers developed PLANSAFE, or planning level safety prediction models. These models make use of socio-economic, demographic, and roadway variables for predicting planning level safety. Accounting for regional differences - similar to the experience for microscale safety models - has been problematic during the development of planning level safety prediction models. More specifically, without weather related variables there is an insufficient set of variables for explaining safety differences across regions and states. Furthermore, omitted variable bias resulting from excluding these important variables may adversely impact the coefficients of included variables, thus contributing to difficulty in model interpretation and accuracy. This paper summarizes the results of an effort to include weather related variables, particularly various measures of rainfall, into accident frequency prediction and the prediction of the frequency of fatal and/or injury degree of severity crash models. The purpose of the study was to determine whether these variables do in fact improve overall goodness of fit of the models, whether these variables may explain some or all of observed regional differences, and identifying the estimated effects of rainfall on safety. The models are based on Traffic Analysis Zone level datasets from Michigan, and Pima and Maricopa Counties in Arizona. Numerous rain-related variables were found to be statistically significant, selected rain related variables improved the overall goodness of fit, and inclusion of these variables reduced the portion of the model explained by the constant in the base models without weather variables. Rain tends to diminish safety, as expected, in fairly complex ways, depending on rain frequency and intensity.
Resumo:
There has been considerable research conducted over the last 20 years focused on predicting motor vehicle crashes on transportation facilities. The range of statistical models commonly applied includes binomial, Poisson, Poisson-gamma (or negative binomial), zero-inflated Poisson and negative binomial models (ZIP and ZINB), and multinomial probability models. Given the range of possible modeling approaches and the host of assumptions with each modeling approach, making an intelligent choice for modeling motor vehicle crash data is difficult. There is little discussion in the literature comparing different statistical modeling approaches, identifying which statistical models are most appropriate for modeling crash data, and providing a strong justification from basic crash principles. In the recent literature, it has been suggested that the motor vehicle crash process can successfully be modeled by assuming a dual-state data-generating process, which implies that entities (e.g., intersections, road segments, pedestrian crossings, etc.) exist in one of two states—perfectly safe and unsafe. As a result, the ZIP and ZINB are two models that have been applied to account for the preponderance of “excess” zeros frequently observed in crash count data. The objective of this study is to provide defensible guidance on how to appropriate model crash data. We first examine the motor vehicle crash process using theoretical principles and a basic understanding of the crash process. It is shown that the fundamental crash process follows a Bernoulli trial with unequal probability of independent events, also known as Poisson trials. We examine the evolution of statistical models as they apply to the motor vehicle crash process, and indicate how well they statistically approximate the crash process. We also present the theory behind dual-state process count models, and note why they have become popular for modeling crash data. A simulation experiment is then conducted to demonstrate how crash data give rise to “excess” zeros frequently observed in crash data. It is shown that the Poisson and other mixed probabilistic structures are approximations assumed for modeling the motor vehicle crash process. Furthermore, it is demonstrated that under certain (fairly common) circumstances excess zeros are observed—and that these circumstances arise from low exposure and/or inappropriate selection of time/space scales and not an underlying dual state process. In conclusion, carefully selecting the time/space scales for analysis, including an improved set of explanatory variables and/or unobserved heterogeneity effects in count regression models, or applying small-area statistical methods (observations with low exposure) represent the most defensible modeling approaches for datasets with a preponderance of zeros
Resumo:
Queensland University of Technology (QUT) completed an Australian National Data Service (ANDS) funded “Seeding the Commons Project” to contribute metadata to Research Data Australia. The project employed two Research Data Librarians from October 2009 through to July 2010. Technical support for the project was provided by QUT’s High Performance Computing and Research Support Specialists. ---------- The project identified and described QUT’s category 1 (ARC / NHMRC) research datasets. Metadata for the research datasets was stored in QUT’s Research Data Repository (Architecta Mediaflux). Metadata which was suitable for inclusion in Research Data Australia was made available to the Australian Research Data Commons (ARDC) in RIF-CS format. ---------- Several workflows and processes were developed during the project. 195 data interviews took place in connection with 424 separate research activities which resulted in the identification of 492 datasets. ---------- The project had a high level of technical support from QUT High Performance Computing and Research Support Specialists who developed the Research Data Librarian interface to the data repository that enabled manual entry of interview data and dataset metadata, creation of relationships between repository objects. The Research Data Librarians mapped the QUT metadata repository fields to RIF-CS and an application was created by the HPC and Research Support Specialists to generate RIF-CS files for harvest by the Australian Research Data Commons (ARDC). ---------- This poster will focus on the workflows and processes established for the project including: ---------- • Interview processes and instruments • Data Ingest from existing systems (including mapping to RIF-CS) • Data entry and the Data Librarian interface to Mediaflux • Verification processes • Mapping and creation of RIF-CS for the ARDC
Resumo:
The Paediatric Spine Research group was formed in 2002 to perform high quality research into the prevention and management of spinal deformity, with an emphasis on scoliosis. The group has successfully built collaborative bridges between the scientific and research expertise at QUT, and the clinical skills and experience of the spinal orthopaedic surgeons at the Mater Children’s Hospital in Brisbane. Clinical and biomechanical research is now possible as a result of the development of detailed databases of patients who have innovative and unique surgical interventions for spinal deformity such as thoracoscopic scoliosis correction, thoracoscopic staple insertion for juvenile idiopathic scoliosis and minimally invasive growing rods. The Mater in Brisbane provides these unique datasets of spinal deformity surgery patients, whose procedures are not being performed anywhere else in the Southern Hemisphere. The most detailed is a database of thoracoscopic scoliosis correction surgery which now contains 180 patients with electronic collections of X-Rays, photographs and patient questionnaires. With ethics approval, a subset of these patients has had CT scans, and a further subset have had MRI scans with and without a compressive load to simulate the erect standing position. This database has to date contributed to 17 international refereed journal papers, a further 7 journal papers either under review or in final preparation, 53 national conference presentations and 35 international conference presentations. Major findings from selected journal publications will be presented. It is anticipated that as the surgical databases grow they will continue to provide invaluable clinical data which will feed into clinically relevant projects driven by both medical and engineering researchers whose findings will benefit spinal deformity patients and scientific knowledge worldwide.
Resumo:
Recent studies have detected a dominant accumulation mode (~100 nm) in the Sea Spray Aerosol (SSA) number distribution. There is evidence to suggest that particles in this mode are composed primarily of organics. To investigate this hypothesis we conducted experiments on NaCl, artificial SSA and natural SSA particles with a Volatility-Hygroscopicity-Tandem-Differential-Mobility-Analyser (VH-TDMA). NaCl particles were atomiser generated and a bubble generator was constructed to produce artificial and natural SSA particles. Natural seawater samples for use in the bubble generator were collected from biologically active, terrestrially-affected coastal water in Moreton Bay, Australia. Differences in the VH-TDMA-measured volatility curves of artificial and natural SSA particles were used to investigate and quantify the organic fraction of natural SSA particles. Hygroscopic Growth Factor (HGF) data, also obtained by the VH-TDMA, were used to confirm the conclusions drawn from the volatility data. Both datasets indicated that the organic fraction of our natural SSA particles evaporated in the VH-TDMA over the temperature range 170–200°C. The organic volume fraction for 71–77 nm natural SSA particles was 8±6%. Organic volume fraction did not vary significantly with varying water residence time (40 secs to 24 hrs) in the bubble generator or SSA particle diameter in the range 38–173 nm. At room temperature we measured shape- and Kelvin-corrected HGF at 90% RH of 2.46±0.02 for NaCl, 2.35±0.02 for artifical SSA and 2.26±0.02 for natural SSA particles. Overall, these results suggest that the natural accumulation mode SSA particles produced in these experiments contained only a minor organic fraction, which had little effect on hygroscopic growth. Our measurement of 8±6% is an order of magnitude below two previous measurements of the organic fraction in SSA particles of comparable sizes. We stress that our results were obtained using coastal seawater and they can’t necessarily be applied on a regional or global ocean scale. Nevertheless, considering the order of magnitude discrepancy between this and previous studies, further research with independent measurement techniques and a variety of different seawaters is required to better quantify how much organic material is present in accumulation mode SSA.
Resumo:
The Guardian reportage of the United Kingdom Member of Parliament (MP) expenses scandal of 2009 used crowdsourcing and computational journalism techniques. Computational journalism can be broadly defined as the application of computer science techniques to the activities of journalism. Its foundation lies in computer assisted reporting techniques and its importance is increasing due to the: (a) increasing availability of large scale government datasets for scrutiny; (b) declining cost, increasing power and ease of use of data mining and filtering software; and Web 2.0; and (c) explosion of online public engagement and opinion.. This paper provides a case study of the Guardian MP expenses scandal reportage and reveals some key challenges and opportunities for digital journalism. It finds journalists may increasingly take an active role in understanding, interpreting, verifying and reporting clues or conclusions that arise from the interrogations of datasets (computational journalism). Secondly a distinction should be made between information reportage and computational journalism in the digital realm, just as a distinction might be made between citizen reporting and citizen journalism. Thirdly, an opportunity exists for online news providers to take a ‘curatorial’ role, selecting and making easily available the best data sources for readers to use (information reportage). These activities have always been fundamental to journalism, however the way in which they are undertaken may change. Findings from this paper may suggest opportunities and challenges for the implementation of computational journalism techniques in practice by digital Australian media providers, and further areas of research.
Resumo:
As organizations reach higher levels of Business Process Management maturity, they tend to accumulate large collections of process models. These repositories may contain thousands of activities and be managed by different stakeholders with varying skills and responsibilities. However, while being of great value, these repositories induce high management costs. Thus, it becomes essential to keep track of the various model versions as they may mutually overlap, supersede one another and evolve over time. We propose an innovative versioning model and associated storage structure, specifically designed to maximize sharing across process model versions, and to automatically handle change propagation. The focal point of this technique is to version single process model fragments, rather than entire process models. Indeed empirical evidence shows that real-life process model repositories have numerous duplicate fragments. Experiments on two industrial datasets confirm the usefulness of our technique.
Resumo:
Queensland University of Technology’s Institutional Repository, QUT ePrints (http://eprints.qut.edu.au/), was established in 2003. With the help of an institutional mandate (endorsed in 2004) the repository now holds over 11,000 open access publications. The repository’s success is celebrated within the University and acknowledged nationally and internationally. QUT ePrints was built on GNU EPrints open source repository software (currently running v.3.1.3) and was originally configured to accommodate open access versions of the traditional range of research publications (journal articles, conference papers, books, book chapters and working papers). However, in 2009, the repository’s scope, content and systems were broadened and the ‘QUT Digital repository’ is now a service encompassing a range of digital collections, services and systems. For a work to be accepted in to the institutional repository, at least one of the authors/creators must have a current affiliation with QUT. However, the success of QUT ePrints in terms of its capacity to increase the visibility and accessibility of our researchers' scholarly works resulted in requests to accept digital collections of works which were out of scope. To address this need, a number of parallel digital collections have been developed. These collections include, OZcase, a collection of legal research materials and ‘The Sugar Industry Collection’; a digitsed collection of books and articles on sugar cane production and processing. Additionally, the Library has responded to requests from academics for a service to support the publication of new, and existing, peer reviewed open access journals. A project is currently underway to help a group of senior QUT academics publish a new international peer reviewed journal. The QUT Digital Repository website will be a portal for access to a range of resources to support copyright management. It is likely that it will provide an access point for the institution’s data repository. The data repository, provisionally named the ‘QUT Data Commons’, is currently a work-in-progress. The metadata for some QUT datasets will also be harvested by and discoverable via ‘Research Data Australia’, the dataset discovery service managed by the Australian National Data Service (ANDS). QUT Digital repository will integrate a range of technologies and services related to scholarly communication. This paper will discuss the development of the QUT Digital Repository, its strategic functions, the stakeholders involved and lessons learned.