934 results for Data anonymization and sanitization
Abstract:
Recent advances in technology have produced a significant increase in the availability of free sensor data over the Internet. With affordable weather monitoring stations now available to individual meteorology enthusiasts, a reservoir of real-time data such as temperature, rainfall and wind speed can now be obtained for most of the United States and Europe. Despite the abundance of available data, obtaining usable information about the weather in your local neighbourhood requires complex processing that poses several challenges. This paper discusses a collection of technologies and applications that harvest, refine and process this data, culminating in information that has been tailored toward the user. In this case we are particularly interested in allowing a user to make direct queries about the weather at any location, even when this is not directly instrumented, using interpolation methods. We also consider how the uncertainty that the interpolation introduces can then be communicated to the user of the system, using UncertML, a developing standard for uncertainty representation.
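A minimal sketch of one such interpolation method, inverse-distance weighting, is given below; the abstract does not name a specific method, and the station coordinates, readings and function names here are purely illustrative.

```python
# Illustrative sketch: inverse-distance weighting (IDW) of nearby station
# readings at a query location. Station data are hypothetical; the paper
# does not specify which interpolation method it uses.
import math

def idw_interpolate(stations, query, power=2.0):
    """stations: list of (lat, lon, value); query: (lat, lon)."""
    num, den = 0.0, 0.0
    for lat, lon, value in stations:
        d = math.hypot(lat - query[0], lon - query[1])
        if d < 1e-9:                 # query coincides with a station
            return value
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Example: estimate temperature at an un-instrumented point
readings = [(52.45, -1.93, 14.2), (52.48, -1.90, 13.8), (52.41, -1.88, 14.6)]
print(idw_interpolate(readings, (52.46, -1.91)))
```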
Abstract:
We address the important bioinformatics problem of predicting protein function from a protein's primary sequence. We consider the functional classification of G-Protein-Coupled Receptors (GPCRs), whose functions are specified in a class hierarchy. We tackle this task using a novel top-down hierarchical classification system where, for each node in the class hierarchy, the predictor attributes to be used at that node and the classifier to be applied to the selected attributes are chosen in a data-driven manner. Compared with a previous hierarchical classification system that selected classifiers only, our new system significantly reduced processing time without significantly sacrificing predictive accuracy.
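The following is a schematic illustration, not the authors' actual system, of how per-node attribute and classifier selection can be made data-driven in a top-down hierarchical classifier; the candidate models, feature-selection scores and cross-validation settings are assumptions.

```python
# Illustrative sketch of per-node, data-driven attribute and classifier
# selection for a top-down hierarchical classifier. Candidate models and
# selection criteria are hypothetical, not those of the paper.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

def fit_node(X, y, candidates=(GaussianNB(), DecisionTreeClassifier())):
    """Pick the attribute subset and classifier that maximise CV accuracy."""
    best = None
    for k in (10, 50, 100):
        for clf in candidates:
            model = make_pipeline(SelectKBest(f_classif, k=min(k, X.shape[1])), clf)
            score = cross_val_score(model, X, y, cv=3).mean()
            if best is None or score > best[0]:
                best = (score, model)
    return best[1].fit(X, y)

# At prediction time, instances are routed from the root down: each node's
# fitted model assigns a child class, and the child's model is applied next.
```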
Abstract:
The Securities and Exchange Commission (SEC) in the United States mandated a new digital reporting system for US companies in late 2008. The new generation of information provision was dubbed ‘interactive data’ by Chairman Cox (SEC, 2006a). Despite the promise of its name, we find that in the development of the project retail investors are invoked as calculative actors rather than engaged in dialogue. Similarly, the potential for the underlying technology to be applied in ways that encourage new forms of accountability appears to be forfeited in the interests of enrolling company filers. We theorise the activities of the SEC and in particular its chairman at the time, Christopher Cox, over a three-year period, both prior to and following the ‘credit crisis’. We argue that individuals and institutions play a central role in advancing the socio-technical project that is constituted by interactive data. We adopt insights from ANT (Callon, 1986; Latour, 1987, 2005b) and governmentality (Miller, 2008; Miller and Rose, 2008) to show how regulators and the proponents of the technology have acted as spokespersons for the interactive data technology and the retail investor. We examine the way in which calculative accountability has been privileged in the SEC’s construction of the retail investor as concerned with atomised, quantitative data (Kamuf, 2007; Roberts, 2009; Tsoukas, 1997). We find that the possibilities for the democratising effects of digital information on the Internet have not been realised in the interactive data project and that it contains risks for the very investors the SEC claims to seek to protect.
Abstract:
Although crisp data are fundamentally indispensable for determining the profit Malmquist productivity index (MPI), the observed values in real-world problems are often imprecise or vague. These imprecise or vague data can be suitably characterized with fuzzy and interval methods. In this paper, we reformulate the conventional profit MPI problem as an imprecise data envelopment analysis (DEA) problem, and propose two novel methods for measuring the overall profit MPI when the inputs, outputs, and price vectors are fuzzy or vary in intervals. We develop a fuzzy version of the conventional MPI model by using a ranking method, and solve the model with a commercial off-the-shelf DEA software package. In addition, we define an interval for the overall profit MPI of each decision-making unit (DMU) and divide the DMUs into six groups according to the intervals obtained for their overall profit efficiency and MPIs. We also present two numerical examples to demonstrate the applicability of the two proposed models and exhibit the efficacy of the procedures and algorithms. © 2011 Elsevier Ltd.
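For reference, the conventional Malmquist productivity index between periods t and t+1 is usually written in terms of distance functions as below; the profit MPI studied in the paper replaces these with profit-efficiency measures, and the proposed methods further allow fuzzy or interval data. This is a generic textbook form, not the paper's exact formulation.

```latex
MPI^{t,t+1} = \left[
  \frac{D^{t}\!\left(x^{t+1}, y^{t+1}\right)}{D^{t}\!\left(x^{t}, y^{t}\right)}
  \cdot
  \frac{D^{t+1}\!\left(x^{t+1}, y^{t+1}\right)}{D^{t+1}\!\left(x^{t}, y^{t}\right)}
\right]^{1/2}
```

Here D^t denotes the distance (efficiency) function evaluated against the period-t frontier, and a value greater than one indicates productivity growth.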
Abstract:
One of the aims of the Science and Technology Committee (STC) of the Group on Earth Observations (GEO) was to establish a GEO Label: a label to certify geospatial datasets and their quality. As proposed, the GEO Label will be used as a value indicator for geospatial data and datasets accessible through the Global Earth Observation System of Systems (GEOSS). It is suggested that the development of such a label will significantly improve user recognition of the quality of geospatial datasets and that its use will help promote trust in datasets that carry the established GEO Label. Furthermore, the GEO Label is seen as an incentive to data providers. At the moment GEOSS contains a large amount of data and is constantly growing. Taking this into account, a GEO Label could assist in searching by providing users with visual cues of dataset quality and possibly relevance; a GEO Label could effectively stand as a decision support mechanism for dataset selection. Currently our project, GeoViQua, together with EGIDA and ID-03, is undertaking research to define and evaluate the concept of a GEO Label. The development and evaluation process will be carried out in three phases. In Phase I we have conducted an online survey (GEO Label Questionnaire) to identify initial user and producer views on a GEO Label and its potential role. In Phase II we will conduct a further study presenting some GEO Label examples based on Phase I, and will elicit feedback on these examples under controlled conditions. In Phase III we will create physical prototypes which will be used in a human subject study. The most successful prototypes will then be put forward as potential GEO Label options. At the moment we are in Phase I, where we developed an online questionnaire to collect the initial GEO Label requirements and to identify the role that a GEO Label should serve from the user and producer standpoint. The GEO Label Questionnaire consists of generic questions to identify whether users and producers believe a GEO Label is relevant to geospatial data; whether they want a single "one-for-all" label or separate labels that each serve a particular role; the function that would be most relevant for a GEO Label to carry; and the functionality that users and producers would like to see from the common rating and review systems they use. To distribute the questionnaire, relevant user and expert groups were contacted at meetings or by email. At this stage we have successfully collected over 80 valid responses from geospatial data users and producers. This communication will provide a comprehensive analysis of the survey results, indicating to what extent the users surveyed in Phase I value a GEO Label, and suggesting in what directions a GEO Label may develop. Potential GEO Label examples based on the results of the survey will be presented for use in Phase II.
Abstract:
Mobile technologies have yet to be widely adopted by the Architectural, Engineering, and Construction (AEC) industry despite being one of the major growth areas in computing in recent years. This lack of uptake in the AEC industry is likely due, in large part, to the combination of small screen size and inappropriate interaction demands of current mobile technologies. This paper discusses the scope for multimodal interaction design, with a specific focus on speech-based interaction, to enhance the suitability of mobile technology use within the AEC industry by broadening the field data input capabilities of such technologies. To investigate the appropriateness of using multimodal technology for field data collection in the AEC industry, we have developed a prototype Multimodal Field Data Entry (MFDE) application. This application, which allows concrete testing technicians to record quality control data in the field, has been designed to support two different modalities of data input: speech-based data entry and stylus-based data entry. To compare the effectiveness and usability of, and user preference for, the different input options, we have designed a comprehensive lab-based evaluation of the application. To appropriately reflect the anticipated context of use within the study design, careful consideration had to be given to the key elements of a construction site that would potentially influence a test technician's ability to use the input techniques. These considerations and the resultant evaluation design are discussed in detail in this paper.
Abstract:
This thesis describes advances in the characterisation, calibration and data processing of optical coherence tomography (OCT) systems. Femtosecond (fs) laser inscription was used for producing OCT-phantoms. Transparent materials are generally inert to infra-red radiation, but with fs lasers material modification occurs via non-linear processes when the highly focused light source interacts with the materials. This modification is confined to the focal volume and is highly reproducible. In order to select the best inscription parameters, combinations of different inscription parameters were tested, using three fs laser systems with different operating properties, on a variety of materials. This facilitated the understanding of the key characteristics of the produced structures with the aim of producing viable OCT-phantoms. Finally, OCT-phantoms were successfully designed and fabricated in fused silica. The use of these phantoms to characterise many properties (resolution, distortion, sensitivity decay, scan linearity) of an OCT system was demonstrated. Quantitative methods were developed to support the characterisation of an OCT system collecting images from phantoms and also to improve the quality of the OCT images. Characterisation methods include the measurement of the spatially variant resolution (point spread function (PSF) and modulation transfer function (MTF)), sensitivity and distortion. Processing of OCT data is a computationally intensive process. Standard central processing unit (CPU) based processing might take several minutes to a few hours to process acquired data, thus data processing is a significant bottleneck. An alternative choice is to use expensive hardware-based processing such as field programmable gate arrays (FPGAs). However, graphics processing unit (GPU) based data processing methods have recently been developed to minimise this data processing and rendering time. These processing techniques include standard processing methods, which comprise a set of algorithms to process the raw (interference) data obtained by the detector and generate A-scans. The work presented here describes accelerated data processing and post-processing techniques for OCT systems. The GPU-based processing developed during the PhD was later implemented into a custom-built Fourier domain optical coherence tomography (FD-OCT) system. This system currently processes and renders data in real time; processing throughput is currently limited by the camera capture rate. OCT-phantoms have been heavily used for the qualitative characterisation and adjustment/fine tuning of the operating conditions of the OCT system, and investigations are under way to characterise OCT systems using our phantoms. The work presented in this thesis demonstrates several novel techniques for fabricating OCT-phantoms and accelerating OCT data processing using GPUs. In the process of developing phantoms and quantitative methods, a thorough understanding and practical knowledge of OCT and fs laser processing systems was developed. This understanding led to several novel pieces of research that are not only relevant to OCT but have broader importance. For example, extensive understanding of the properties of fs-inscribed structures will be useful in other photonic applications such as the fabrication of phase masks, waveguides and microfluidic channels. Acceleration of data processing with GPUs is also useful in other fields.
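A sketch of the standard FD-OCT processing chain that turns a detector spectrum into an A-scan is shown below (resampling to linear wavenumber is omitted for brevity). The array names are hypothetical; swapping numpy for a GPU array library such as cupy is one common way to accelerate this pipeline, whereas the thesis uses its own custom GPU implementation.

```python
# Illustrative sketch of standard FD-OCT processing: background removal,
# spectral windowing, inverse FFT to depth space, log-magnitude A-scan.
import numpy as np

def spectra_to_ascans(spectra, background):
    """spectra: (n_alines, n_pixels) raw camera data; background: (n_pixels,)."""
    fringes = spectra - background                     # remove DC / reference term
    window = np.hanning(spectra.shape[1])              # suppress side lobes
    depth = np.fft.ifft(fringes * window, axis=1)      # spectrum -> depth profile
    ascans = 20 * np.log10(np.abs(depth) + 1e-12)      # log-scale magnitude
    return ascans[:, : spectra.shape[1] // 2]          # keep positive depths only

# A B-scan image is then simply the stack of consecutive A-scans.
```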
Abstract:
The evaluation of geospatial data quality and trustworthiness presents a major challenge to geospatial data users when making a dataset selection decision. The research presented here therefore focused on defining and developing a GEO label – a decision support mechanism to assist data users in efficient and effective geospatial dataset selection on the basis of quality, trustworthiness and fitness for use. This thesis thus presents six phases of research and development conducted to: (1) identify the informational aspects upon which users rely when assessing geospatial dataset quality and trustworthiness; (2) elicit initial user views on the GEO label role in supporting dataset comparison and selection; (3) evaluate prototype label visualisations; (4) develop a Web service to support GEO label generation; (5) develop a prototype GEO label-based dataset discovery and intercomparison decision support tool; and (6) evaluate the prototype tool in a controlled human-subject study. The results of the studies revealed, and subsequently confirmed, eight geospatial data informational aspects that were considered important by users when evaluating geospatial dataset quality and trustworthiness, namely: producer information, producer comments, lineage information, compliance with standards, quantitative quality information, user feedback, expert reviews, and citations information. Following an iterative user-centred design (UCD) approach, it was established that the GEO label should visually summarise availability and allow interrogation of these key informational aspects. A Web service was developed to support generation of dynamic GEO label representations and integrated into a number of real-world GIS applications. The service was also utilised in the development of the GEO LINC tool – a GEO label-based dataset discovery and intercomparison decision support tool. The results of the final evaluation study indicated that (a) the GEO label effectively communicates the availability of dataset quality and trustworthiness information and (b) GEO LINC successfully facilitates ‘at a glance’ dataset intercomparison and fitness for purpose-based dataset selection.
Abstract:
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure for reducing the noise of textual data is to remove stopwords, either by using pre-compiled stopword lists or by more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space.
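The sketch below illustrates, on assumed toy data, the treatments compared in the paper: no stopword removal, a pre-compiled stopword list, and dynamic removal of terms that occur only once in the corpus. The tweets and parameter choices are hypothetical.

```python
# Illustrative comparison of feature-space size under three stopword
# treatments; corpus-singleton removal is the strategy the paper favours.
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

tweets = ["so happy with the new phone :)", "the service was awful", "awful day today"]

# Dynamic stopword list: tokens occurring exactly once in the whole corpus.
tokenize = CountVectorizer().build_analyzer()
counts = Counter(t for tweet in tweets for t in tokenize(tweet))
singletons = [t for t, c in counts.items() if c == 1]

for name, vec in [
    ("no removal",        CountVectorizer()),
    ("precompiled list",  CountVectorizer(stop_words="english")),
    ("singleton removal", CountVectorizer(stop_words=singletons)),
]:
    X = vec.fit_transform(tweets)
    print(f"{name:18s} features={len(vec.vocabulary_)} nonzeros={X.nnz}")
```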
Abstract:
The correct modelling of long- and short-term seasonality is an important issue. The choice between deterministic and stochastic modelling of trend and seasonality, and its implications, is as relevant as the choice between deterministic and stochastic trends itself. The study considers the special case in which the stochastic trend and seasonality do not evolve independently and the usual differencing filters do not apply. The results are applied to the day-ahead (spot) trading data of some of the main European energy exchanges (power and natural gas).
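As a generic illustration, not the paper's specification, seasonality of period s can be modelled deterministically with seasonal dummies or stochastically as a seasonal random walk, the latter being removed by the seasonal differencing filter:

```latex
y_t = \mu + \sum_{j=1}^{s-1} \gamma_j D_{j,t} + \varepsilon_t \;\;\text{(deterministic)},
\qquad
y_t = y_{t-s} + \varepsilon_t \;\;\text{(stochastic)},
\qquad
\Delta_s y_t = y_t - y_{t-s}.
```

The special case considered in the paper, in which stochastic trend and seasonality do not evolve independently, is one where such standard differencing filters no longer apply.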
Abstract:
The microarray technology provides a high-throughput technique to study gene expression. Microarrays can help us diagnose different types of cancers, understand biological processes, assess host responses to drugs and pathogens, find markers for specific diseases, and much more. Microarray experiments generate large amounts of data. Thus, effective data processing and analysis are critical for making reliable inferences from the data. The first part of the dissertation addresses the problem of finding an optimal set of genes (biomarkers) to classify a set of samples as diseased or normal. Three statistical gene selection methods (GS, GS-NR, and GS-PCA) were developed to identify a set of genes that best differentiate between samples. A comparative study on different classification tools was performed and the best combinations of gene selection and classifiers for multi-class cancer classification were identified. For most of the benchmark cancer data sets, the gene selection method proposed in this dissertation, GS, outperformed other gene selection methods. The classifiers based on Random Forests, neural network ensembles, and K-nearest neighbor (KNN) showed consistently good performance. A striking commonality among these classifiers is that they all use a committee-based approach, suggesting that ensemble classification methods are superior. The same biological problem may be studied at different research labs and/or performed using different lab protocols or samples. In such situations, it is important to combine results from these efforts. The second part of the dissertation addresses the problem of pooling the results from different independent experiments to obtain improved results. Four statistical pooling techniques (Fisher's inverse chi-square method, the Logit method, Stouffer's Z-transform method, and the Liptak-Stouffer weighted Z-method) were investigated in this dissertation. These pooling techniques were applied to the problem of identifying cell cycle-regulated genes in two different yeast species. As a result, improved sets of cell cycle-regulated genes were identified. The last part of the dissertation explores the effectiveness of wavelet data transforms for the task of clustering. Discrete wavelet transforms, with an appropriate choice of wavelet bases, were shown to be effective in producing clusters that were biologically more meaningful.
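Two of the pooling techniques named in the abstract can be sketched as follows; the implementation details and example p-values are illustrative rather than taken from the dissertation.

```python
# Illustrative implementations of Fisher's inverse chi-square method and
# Stouffer's Z-transform method for combining independent p-values.
import numpy as np
from scipy import stats

def fisher_pool(pvals):
    """Combined p-value: -2*sum(ln p) ~ chi-square with 2k degrees of freedom."""
    pvals = np.asarray(pvals, dtype=float)
    statistic = -2.0 * np.log(pvals).sum()
    return stats.chi2.sf(statistic, df=2 * len(pvals))

def stouffer_pool(pvals, weights=None):
    """Combined p-value from the (optionally weighted) sum of z-scores."""
    pvals = np.asarray(pvals, dtype=float)
    w = np.ones_like(pvals) if weights is None else np.asarray(weights, float)
    z = (w * stats.norm.isf(pvals)).sum() / np.sqrt((w ** 2).sum())
    return stats.norm.sf(z)

# Hypothetical p-values for one gene from three independent experiments
print(fisher_pool([0.04, 0.10, 0.03]), stouffer_pool([0.04, 0.10, 0.03]))
```

Passing non-uniform weights to stouffer_pool gives the Liptak-Stouffer weighted Z-method.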
Abstract:
Groundwater systems of different densities are often mathematically modeled to understand and predict environmental behavior such as seawater intrusion or submarine groundwater discharge. Additional data collection may be justified if it will cost-effectively aid in reducing the uncertainty of a model's prediction. The collection of salinity as well as temperature data could aid in reducing predictive uncertainty in a variable-density model. However, before numerical models can be created, rigorous testing of the modeling code needs to be completed. This research documents the benchmark testing of a new modeling code, SEAWAT Version 4. The benchmark problems include various combinations of density-dependent flow resulting from variations in concentration and temperature. The verified code, SEAWAT, was then applied to two different hydrological analyses to explore the capacity of a variable-density model to guide data collection. The first analysis tested a linear method to guide data collection by quantifying the contribution of different data types and locations toward reducing predictive uncertainty in a nonlinear variable-density flow and transport model. The relative contributions of temperature and concentration measurements, at different locations within a simulated carbonate platform, for predicting movement of the saltwater interface were assessed. Results from the method showed that concentration data had greater worth than temperature data in reducing predictive uncertainty in this case. Results also indicated that a linear method could be used to quantify data worth in a nonlinear model. The second hydrological analysis utilized a model to identify the transient response of the salinity, temperature, age, and amount of submarine groundwater discharge to changes in tidal ocean stage, seasonal temperature variations, and different types of geology. The model was compared to multiple kinds of data to (1) calibrate and verify the model, and (2) explore the potential for the model to be used to guide the collection of data using techniques such as electromagnetic resistivity, thermal imagery, and seepage meters. Results indicated that the model can be used to give insight into submarine groundwater discharge and to guide data collection.
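A minimal sketch of a first-order (linear) data-worth calculation of the general kind described, i.e. the reduction in predictive variance obtained by adding candidate observations, is given below; the matrices and their dimensions are hypothetical and this is not the SEAWAT workflow itself.

```python
# Illustrative first-order (linear) data-worth calculation: condition a prior
# parameter covariance on candidate observations and compare the variance of
# a model prediction before and after.
import numpy as np

def predictive_variance(pred_sens, prior_cov, obs_jac=None, noise_cov=None):
    """pred_sens: (n_par,) prediction sensitivities; obs_jac: (n_obs, n_par)."""
    cov = prior_cov
    if obs_jac is not None:
        gain = prior_cov @ obs_jac.T @ np.linalg.inv(
            obs_jac @ prior_cov @ obs_jac.T + noise_cov)
        cov = prior_cov - gain @ obs_jac @ prior_cov   # conditioned covariance
    return pred_sens @ cov @ pred_sens

# Data worth of a candidate measurement set = prior minus posterior variance.
```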
Abstract:
This dissertation established a software-hardware integrated design for a multisite data repository in pediatric epilepsy. A total of 16 institutions formed a consortium for this web-based application. This innovative, fully operational web application allows users to upload and retrieve information through a unique human-computer graphical interface that is remotely accessible to all users of the consortium. A solution based on a Linux platform with MySQL and Personal Home Page (PHP) scripts was selected. Research was conducted to evaluate mechanisms to electronically transfer diverse datasets from different hospitals and collect the clinical data in concert with their related functional magnetic resonance imaging (fMRI). What was unique in the approach considered is that all pertinent clinical information about patients is synthesized with input from clinical experts into four different forms: Clinical, fMRI scoring, Image information, and Neuropsychological data entry forms. A first contribution of this dissertation was in proposing an integrated processing platform that was site and scanner independent in order to uniformly process the varied fMRI datasets and to generate comparative brain activation patterns. The data collection from the consortium complied with the IRB requirements and provides all the safeguards for security and confidentiality. An fMRI-based software library was used to perform data processing and statistical analysis to obtain the brain activation maps. The Lateralization Index (LI) of healthy control (HC) subjects was evaluated in contrast to that of localization-related epilepsy (LRE) subjects. Over 110 activation maps were generated, and their respective LIs were computed, yielding the following groups: (a) strong right lateralization: (HC=0%, LRE=18%), (b) right lateralization: (HC=2%, LRE=10%), (c) bilateral: (HC=20%, LRE=15%), (d) left lateralization: (HC=42%, LRE=26%), (e) strong left lateralization: (HC=36%, LRE=31%). Moreover, nonlinear multidimensional decision functions were used to seek an optimal separation between typical and atypical brain activations on the basis of the demographics as well as the extent and intensity of these brain activations. The intent was not to seek the highest output measures given the inherent overlap of the data, but rather to assess which of the many dimensions were critical in the overall assessment of typical and atypical language activations, with the freedom to select any number of dimensions and impose any degree of complexity in the nonlinearity of the decision space.
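A commonly used definition of the Lateralization Index, which the dissertation's exact computation may refine (e.g. in its choice of thresholds and regions), is:

```latex
LI = \frac{A_{\mathrm{left}} - A_{\mathrm{right}}}{A_{\mathrm{left}} + A_{\mathrm{right}}},
\qquad -1 \le LI \le 1,
```

where A_left and A_right denote the activation (for example, the count or sum of supra-threshold voxels) in homologous left- and right-hemisphere regions; strongly positive values indicate left lateralization of language function.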
Abstract:
With the advent of peer-to-peer networks and, more importantly, sensor networks, the desire to extract useful information from continuous and unbounded streams of data has become more prominent. For example, in tele-health applications, sensor-based data streaming systems are used to continuously and accurately monitor Alzheimer's patients and their surrounding environment. Typically, the requirements of such applications necessitate the cleaning and filtering of continuous, corrupted and incomplete data streams gathered wirelessly in dynamically varying conditions. Yet, existing data stream cleaning and filtering schemes are incapable of capturing the dynamics of the environment while simultaneously suppressing the losses and corruption introduced by uncertain environmental, hardware, and network conditions. Consequently, existing data cleaning and filtering paradigms are being challenged. This dissertation develops novel schemes for cleaning data streams received from a wireless sensor network operating under non-linear and dynamically varying conditions. The study establishes a paradigm for validating spatio-temporal associations among data sources to enhance data cleaning. To simplify the complexity of the validation process, the developed solution maps the requirements of the application onto a geometrical space and identifies the potential sensor nodes of interest. Additionally, this dissertation models a wireless sensor network data reduction system by ascertaining that segregating the data adaptation and prediction processes augments the data reduction rates. The schemes presented in this study are evaluated using simulation and information theory concepts. The results demonstrate that dynamic conditions of the environment are better managed when validation is used for data cleaning. They also show that when a fast-convergent adaptation process is deployed, data reduction rates are significantly improved. Targeted applications of the developed methodology include machine health monitoring, tele-health, environment and habitat monitoring, intermodal transportation and homeland security.
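One widely used way to realise prediction-based data reduction, sketched below purely for illustration (the dissertation's scheme differs in its adaptation and validation machinery), is for the sensor node to keep an adaptive predictor that the sink mirrors and to transmit a sample only when the prediction error exceeds a tolerance.

```python
# Illustrative prediction-based data reduction with a simple LMS predictor.
# Parameters and predictor order are hypothetical, not the dissertation's.
def reduce_stream(readings, tol=0.5, mu=0.01, order=3):
    weights = [0.0] * order
    history = [0.0] * order
    transmitted = []
    for t, x in enumerate(readings):
        x_hat = sum(w * h for w, h in zip(weights, history))   # node-side prediction
        sent = abs(x - x_hat) > tol
        if sent:
            transmitted.append((t, x))        # only these samples reach the sink
        known = x if sent else x_hat          # value that the sink also knows
        err = known - x_hat                   # zero when nothing was sent
        weights = [w + mu * err * h for w, h in zip(weights, history)]  # LMS update
        history = [known] + history[:-1]
        # the sink runs the identical predictor/update, so both stay in sync
    return transmitted
```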
Abstract:
With the exponentially increasing demands on and uses of GIS data visualization systems, such as urban planning, environment and climate change monitoring, weather simulation, hydrographic gauging and so forth, research, application and technology for geospatial vector and raster data visualization have become prevalent. However, we observe that current web GIS techniques are only suitable for static vector and raster data with no dynamically overlaid layers. While it is desirable to enable visual exploration of large-scale dynamic vector and raster geospatial data in a web environment, improving the performance between backend datasets and the vector and raster applications remains a challenging technical issue. This dissertation addresses these challenging and previously unimplemented areas: how to provide a large-scale dynamic vector and raster data visualization service, with dynamically overlaid layers, accessible from various client devices through a standard web browser, and how to make this large-scale dynamic visualization service as rapid as a static one. To accomplish this, a large-scale dynamic vector and raster data visualization geographic information system based on parallel map tiling, together with a comprehensive performance improvement solution, is proposed, designed and implemented. The components include: quadtree-based indexing and parallel map tiling, the Legend String, vector data visualization with dynamic layer overlaying, vector data time series visualization, an algorithm for vector data rendering, an algorithm for raster data re-projection, an algorithm for the elimination of superfluous levels of detail, an algorithm for vector data gridding and re-grouping, and server-side vector and raster data caching on the cluster.
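One common quadtree tile-indexing scheme, in which each zoom level contributes one base-4 digit so that a tile's key is a prefix of all of its descendants' keys, can be sketched as follows; it is illustrative and not necessarily the dissertation's exact implementation.

```python
# Illustrative quadtree tile index ("quadkey") of the kind often used for
# parallel map tiling and tile caching.
def tile_to_quadkey(x, y, zoom):
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = (1 if x & mask else 0) + (2 if y & mask else 0)
        digits.append(str(digit))
    return "".join(digits)

print(tile_to_quadkey(3, 5, 3))   # -> '213'
```

Because keys share prefixes with their parent tiles, range queries and cache lookups over a region reduce to simple string-prefix matches, which is convenient for distributing tiling work across cluster servers.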