282 resultados para Data anonymization and sanitization


Relevância:

100.00% 100.00%

Publicador:

Resumo:

This research studied distributed computing of all-to-all comparison problems with big data sets. The thesis formalised the problem, and developed a high-performance and scalable computing framework with a programming model, data distribution strategies and task scheduling policies to solve the problem. The study considered storage usage, data locality and load balancing for performance improvement in solving the problem. The research outcomes can be applied in bioinformatics, biometrics and data mining and other domains in which all-to-all comparisons are a typical computing pattern.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The idea of extracting knowledge in process mining is a descendant of data mining. Both mining disciplines emphasise data flow and relations among elements in the data. Unfortunately, challenges have been encountered when working with the data flow and relations. One of the challenges is that the representation of the data flow between a pair of elements or tasks is insufficiently simplified and formulated, as it considers only a one-to-one data flow relation. In this paper, we discuss how the effectiveness of knowledge representation can be extended in both disciplines. To this end, we introduce a new representation of the data flow and dependency formulation using a flow graph. The flow graph solves the issue of the insufficiency of presenting other relation types, such as many-to-one and one-to-many relations. As an experiment, a new evaluation framework is applied to the Teleclaim process in order to show how this method can provide us with more precise results when compared with other representations.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The Body Area Network (BAN) is an emerging technology that focuses on monitoring physiological data in, on and around the human body. BAN technology permits wearable and implanted sensors to collect vital data about the human body and transmit it to other nodes via low-energy communication. In this paper, we investigate interactions in terms of data flows between parties involved in BANs under four different scenarios targeting outdoor and indoor medical environments: hospital, home, emergency and open areas. Based on these scenarios, we identify data flow requirements between BAN elements such as sensors and control units (CUs) and parties involved in BANs such as the patient, doctors, nurses and relatives. Identified requirements are used to generate BAN data flow models. Petri Nets (PNs) are used as the formal modelling language. We check the validity of the models and compare them with the existing related work. Finally, using the models, we identify communication and security requirements based on the most common active and passive attack scenarios.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Sequences of two chloroplast photosystem genes, psaA and psbB, together comprising about 3,500 bp, were obtained for all five major groups of extant seed plants and several outgroups among other vascular plants. Strongly supported, but significantly conflicting, phylogenetic signals were obtained in parsimony analyses from partitions of the data into first and second codon positions versus third positions. In the former, both genes agreed on a monophyletic gymnosperms, with Gnetales closely related to certain conifers. In the latter, Gnetales are inferred to be the sister group of all other seed plants, with gymnosperms paraphyletic. None of the data supported the modern ‘‘anthophyte hypothesis,’’ which places Gnetales as the sister group of flowering plants. A series of simulation studies were undertaken to examine the error rate for parsimony inference. Three kinds of errors were examined: random error, systematic bias (both properties of finite data sets), and statistical inconsistency owing to long-branch attraction (an asymptotic property). Parsimony reconstructions were extremely biased for third-position data for psbB. Regardless of the true underlying tree, a tree in which Gnetales are sister to all other seed plants was likely to be reconstructed for these data. None of the combinations of genes or partitions permits the anthophyte tree to be reconstructed with high probability. Simulations of progressively larger data sets indicate the existence of long-branch attraction (statistical inconsistency) for third-position psbB data if either the anthophyte tree or the gymnosperm tree is correct. This is also true for the anthophyte tree using either psaA third positions or psbB first and second positions. A factor contributing to bias and inconsistency is extremely short branches at the base of the seed plant radiation, coupled with extremely high rates in Gnetales and nonseed plant outgroups. M. J. Sanderson,* M. F. Wojciechowski,*† J.-M. Hu,* T. Sher Khan,* and S. G. Brady

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This project report presents the results of a study on wireless communication data transfer rates for a mobile device running a custombuilt construction defect reporting application. The study measured the time taken to transmit data about a construction defect, which included digital imagery and text, in order to assess the feasibility of transferring various types and sizes of data and the ICT-supported construction management applications that could be developed as a consequence. Data transfer rates over GPRS through the Telstra network and WiFi over a private network were compared. Based on the data size and data transfer time, the rate of transfer was calculated to determine the actual data transmission speeds at which the information was being sent using the wireless mobile communication protocols. The report finds that the transmission speeds vary considerably when using GPRS and can be significantly slower than what is advertised by mobile network providers. While WiFi is much faster than GPRS, the limited range of WiFi limits the protocol to residential-scale construction sites.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

ANDS Guides http://ands.org.au/guides/index.html These guides provide information about ANDS services and some fundamental issues in data-intensive research and research data management. These are not rules, prescriptions or proscriptions. They are guidelines and checklists to inform and broaden the range of possibilities for researchers, data managers, and research organisations.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Problem-based learning (PBL) is a pedagogical methodology that presents the learner with a problem to be solved to stimulate and situate learning. This paper presents key characteristics of a problem-based learning environment that determines its suitability as a data source for workrelated research studies. To date, little has been written about the availability and validity of PBL environments as a data source and its suitability for work-related research. We describe problembased learning and use a research project case study to illustrate the challenges associated with industry work samples. We then describe the PBL course used in our research case study and use this example to illustrate the key attributes of problem-based learning environments and show how the chosen PBL environment met the work-related research requirements of the research case study. We propose that the more realistic the PBL work context and work group composition, the better the PBL environment as a data source for a work-related research. The work context is more realistic when relevant and complex project-based problems are tackled in industry-like work conditions over longer time frames. Work group composition is more realistic when participants with industry-level education and experience enact specialized roles in different disciplines within a professional community.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We present the design and deployment results for PosNet - a large-scale, long-duration sensor network that gathers summary position and status information from mobile nodes. The mobile nodes have a fixed-sized memory buffer to which position data is added at a constant rate, and from which data is downloaded at a non-constant rate. We have developed a novel algorithm that performs online summarization of position data within the buffer, where the algorithm naturally accommodates data input and output rate mismatch, and also provides a delay-tolerant approach to data transport. The algorithm has been extensively tested in a large-scale long-duration cattle monitoring and control application.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper we present a novel platform for underwater sensor networks to be used for long-term monitoring of coral reefs and �sheries. The sensor network consists of static and mobile underwater sensor nodes. The nodes communicate point-to-point using a novel high-speed optical communication system integrated into the TinyOS stack, and they broadcast using an acoustic protocol integrated in the TinyOS stack. The nodes have a variety of sensing capabilities, including cameras, water temperature, and pressure. The mobile nodes can locate and hover above the static nodes for data muling, and they can perform network maintenance functions such as deployment, relocation, and recovery. In this paper we describe the hardware and software architecture of this underwater sensor network. We then describe the optical and acoustic networking protocols and present experimental networking and data collected in a pool, in rivers, and in the ocean. Finally, we describe our experiments with mobility for data muling in this network.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A teaching and learning development project is currently under way at Queens-land University of Technology to develop advanced technology videotapes for use with the delivery of structural engineering courses. These tapes consist of integrated computer and laboratory simulations of important concepts, and behaviour of structures and their components for a number of structural engineering subjects. They will be used as part of the regular lectures and thus will not only improve the quality of lectures and learning environment, but also will be able to replace the ever-dwindling laboratory teaching in these subjects. The use of these videotapes, developed using advanced computer graphics, data visualization and video technologies, will enrich the learning process of the current diverse engineering student body. This paper presents the details of this new method, the methodology used, the results and evaluation in relation to one of the structural engineering subjects, steel structures.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by the increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia related complications. Stress electrocardiography/exercise testing is predictive of 10 year risk of CVD events and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and predictive data mining methods are applied to this data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for removal of outliers, calculation of moving averages as well as data summarisation and data abstraction methods. Feature selection methods of both wrapper and filter types are applied to derived physiological time-series variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed. The major dataset is derived from the entire anaesthesia population and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads). Because of the unbalanced class distribution in the data, majority class under-sampling and Kappa statistic together with misclassification rate and area under the ROC curve (AUC) are used for evaluation of models generated using different prediction algorithms. The performance based on models derived from feature reduced datasets reveal the filter method, Cfs subset evaluation, to be most consistently effective although Consistency derived subsets tended to slightly increased accuracy but markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This could be eliminated by consideration of the AUC or Kappa statistic as well by evaluation of subsets with under-sampled majority class. The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62 with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection results in reduction in MR to 9.8 to 10.16 with time segmented summary data (dataset F) MR being 9.8 and raw time-series summary data (dataset A) being 9.92. However, for all time-series only based datasets, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series alone datasets but models derived from these subsets are of one leaf only. MR values are consistent with class distribution in the subset folds evaluated in the n-cross validation method. For models based on Cfs selected time-series derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36 with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time segmented time-series variables and RF) being 9.09. The models based on counts of outliers and counts of data points outside normal range (Dataset RF_E) and derived variables based on time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (Dataset RF_ G) perform the least well with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28 while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are most comprehensible and clinically relevant. The predictive accuracy increase achieved by addition of risk factor variables to time-series variable based models is significant. The addition of time-series derived variables to models based on risk factor variables alone is associated with a trend to improved performance. Data mining of feature reduced, anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone. The limited additional value provided by electrocardiographic variables when compared to use of risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The effect of the pre-processing used in this study had limited effect when time-series variables and risk factor variables are used as model input. In the absence of risk factor input, the use of time-series variables after outlier removal and time series variables based on physiological variable values’ being outside the accepted normal range is associated with some improvement in model performance.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Queensland University of Technology (QUT) completed an Australian National Data Service (ANDS) funded “Seeding the Commons Project” to contribute metadata to Research Data Australia. The project employed two Research Data Librarians from October 2009 through to July 2010. Technical support for the project was provided by QUT’s High Performance Computing and Research Support Specialists. ---------- The project identified and described QUT’s category 1 (ARC / NHMRC) research datasets. Metadata for the research datasets was stored in QUT’s Research Data Repository (Architecta Mediaflux). Metadata which was suitable for inclusion in Research Data Australia was made available to the Australian Research Data Commons (ARDC) in RIF-CS format. ---------- Several workflows and processes were developed during the project. 195 data interviews took place in connection with 424 separate research activities which resulted in the identification of 492 datasets. ---------- The project had a high level of technical support from QUT High Performance Computing and Research Support Specialists who developed the Research Data Librarian interface to the data repository that enabled manual entry of interview data and dataset metadata, creation of relationships between repository objects. The Research Data Librarians mapped the QUT metadata repository fields to RIF-CS and an application was created by the HPC and Research Support Specialists to generate RIF-CS files for harvest by the Australian Research Data Commons (ARDC). ---------- This poster will focus on the workflows and processes established for the project including: ---------- • Interview processes and instruments • Data Ingest from existing systems (including mapping to RIF-CS) • Data entry and the Data Librarian interface to Mediaflux • Verification processes • Mapping and creation of RIF-CS for the ARDC

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In a seminal data mining article, Leo Breiman [1] argued that to develop effective predictive classification and regression models, we need to move away from the sole dependency on statistical algorithms and embrace a wider toolkit of modeling algorithms that include data mining procedures. Nevertheless, many researchers still rely solely on statistical procedures when undertaking data modeling tasks; the sole reliance on these procedures has lead to the development of irrelevant theory and questionable research conclusions ([1], p.199). We will outline initiatives that the HPC & Research Support group is undertaking to engage researchers with data mining tools and techniques; including a new range of seminars, workshops, and one-on-one consultations covering data mining algorithms, the relationship between data mining and the research cycle, and limitations and problems with these new algorithms. Organisational limitations and restrictions to these initiatives are also discussed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The present rate of technological advance continues to place significant demands on data storage devices. The sheer amount of digital data being generated each year along with consumer expectations, fuels these demands. At present, most digital data is stored magnetically, in the form of hard disk drives or on magnetic tape. The increase in areal density (AD) of magnetic hard disk drives over the past 50 years has been of the order of 100 million times, and current devices are storing data at ADs of the order of hundreds of gigabits per square inch. However, it has been known for some time that the progress in this form of data storage is approaching fundamental limits. The main limitation relates to the lower size limit that an individual bit can have for stable storage. Various techniques for overcoming these fundamental limits are currently the focus of considerable research effort. Most attempt to improve current data storage methods, or modify these slightly for higher density storage. Alternatively, three dimensional optical data storage is a promising field for the information storage needs of the future, offering very high density, high speed memory. There are two ways in which data may be recorded in a three dimensional optical medium; either bit-by-bit (similar in principle to an optical disc medium such as CD or DVD) or by using pages of bit data. Bit-by-bit techniques for three dimensional storage offer high density but are inherently slow due to the serial nature of data access. Page-based techniques, where a two-dimensional page of data bits is written in one write operation, can offer significantly higher data rates, due to their parallel nature. Holographic Data Storage (HDS) is one such page-oriented optical memory technique. This field of research has been active for several decades, but with few commercial products presently available. Another page-oriented optical memory technique involves recording pages of data as phase masks in a photorefractive medium. A photorefractive material is one by which the refractive index can be modified by light of the appropriate wavelength and intensity, and this property can be used to store information in these materials. In phase mask storage, two dimensional pages of data are recorded into a photorefractive crystal, as refractive index changes in the medium. A low-intensity readout beam propagating through the medium will have its intensity profile modified by these refractive index changes and a CCD camera can be used to monitor the readout beam, and thus read the stored data. The main aim of this research was to investigate data storage using phase masks in the photorefractive crystal, lithium niobate (LiNbO3). Firstly the experimental methods for storing the two dimensional pages of data (a set of vertical stripes of varying lengths) in the medium are presented. The laser beam used for writing, whose intensity profile is modified by an amplitudemask which contains a pattern of the information to be stored, illuminates the lithium niobate crystal and the photorefractive effect causes the patterns to be stored as refractive index changes in the medium. These patterns are read out non-destructively using a low intensity probe beam and a CCD camera. A common complication of information storage in photorefractive crystals is the issue of destructive readout. This is a problem particularly for holographic data storage, where the readout beam should be at the same wavelength as the beam used for writing. Since the charge carriers in the medium are still sensitive to the read light field, the readout beam erases the stored information. A method to avoid this is by using thermal fixing. Here the photorefractive medium is heated to temperatures above 150�C; this process forms an ionic grating in the medium. This ionic grating is insensitive to the readout beam and therefore the information is not erased during readout. A non-contact method for determining temperature change in a lithium niobate crystal is presented in this thesis. The temperature-dependent birefringent properties of the medium cause intensity oscillations to be observed for a beam propagating through the medium during a change in temperature. It is shown that each oscillation corresponds to a particular temperature change, and by counting the number of oscillations observed, the temperature change of the medium can be deduced. The presented technique for measuring temperature change could easily be applied to a situation where thermal fixing of data in a photorefractive medium is required. Furthermore, by using an expanded beam and monitoring the intensity oscillations over a wide region, it is shown that the temperature in various locations of the crystal can be monitored simultaneously. This technique could be used to deduce temperature gradients in the medium. It is shown that the three dimensional nature of the recording medium causes interesting degradation effects to occur when the patterns are written for a longer-than-optimal time. This degradation results in the splitting of the vertical stripes in the data pattern, and for long writing exposure times this process can result in the complete deterioration of the information in the medium. It is shown in that simply by using incoherent illumination, the original pattern can be recovered from the degraded state. The reason for the recovery is that the refractive index changes causing the degradation are of a smaller magnitude since they are induced by the write field components scattered from the written structures. During incoherent erasure, the lower magnitude refractive index changes are neutralised first, allowing the original pattern to be recovered. The degradation process is shown to be reversed during the recovery process, and a simple relationship is found relating the time at which particular features appear during degradation and recovery. A further outcome of this work is that the minimum stripe width of 30 ìm is required for accurate storage and recovery of the information in the medium, any size smaller than this results in incomplete recovery. The degradation and recovery process could be applied to an application in image scrambling or cryptography for optical information storage. A two dimensional numerical model based on the finite-difference beam propagation method (FD-BPM) is presented and used to gain insight into the pattern storage process. The model shows that the degradation of the patterns is due to the complicated path taken by the write beam as it propagates through the crystal, and in particular the scattering of this beam from the induced refractive index structures in the medium. The model indicates that the highest quality pattern storage would be achieved with a thin 0.5 mm medium; however this type of medium would also remove the degradation property of the patterns and the subsequent recovery process. To overcome the simplistic treatment of the refractive index change in the FD-BPM model, a fully three dimensional photorefractive model developed by Devaux is presented. This model shows significant insight into the pattern storage, particularly for the degradation and recovery process, and confirms the theory that the recovery of the degraded patterns is possible since the refractive index changes responsible for the degradation are of a smaller magnitude. Finally, detailed analysis of the pattern formation and degradation dynamics for periodic patterns of various periodicities is presented. It is shown that stripe widths in the write beam of greater than 150 ìm result in the formation of different types of refractive index changes, compared with the stripes of smaller widths. As a result, it is shown that the pattern storage method discussed in this thesis has an upper feature size limit of 150 ìm, for accurate and reliable pattern storage.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Objective: to assess the accuracy of data linkage across the spectrum of emergency care in the absence of a unique patient identifier, and to use the linked data to examine service delivery outcomes in an emergency department setting. Design: automated data linkage and manual data linkage were compared to determine their relative accuracy. Data were extracted from three separate health information systems: ambulance, ED and hospital inpatients, then linked to provide information about the emergency journey of each patient. The linking was done manually through physical review of records and automatically using a data linking tool (Health Data Integration) developed by the CSIRO. Match rate and quality of the linking were compared. Setting: 10, 835 patient presentations to a large, regional teaching hospital ED over a two month period (August-September 2007). Results: comparison of the manual and automated linkage outcomes for each pair of linked datasets demonstrated a sensitivity of between 95% and 99%; a specificity of between 75% and 99%; and a positive predictive value of between 88% and 95%. Conclusions: Our results indicate that automated linking provides a sound basis for health service analysis, even in the absence of a unique patient identifier. The use of an automated linking tool yields accurate data suitable for planning and service delivery purposes and enables the data to be linked regularly to examine service delivery outcomes.