990 results for 0804 Data Format
Abstract:
The General Election for the 56th United Kingdom Parliament was held on 7 May 2015. Tweets related to UK politics, not only those with the specific hashtag "#GE2015", were collected between 1 March and 31 May 2015. The resulting dataset contains over 28 million tweets, totalling 118 GB uncompressed or 15 GB compressed. This study describes the method used to collect the tweets, presents some analysis, including a political sentiment index, and outlines interesting research directions on Big Social Data based on Twitter microblogging.
Abstract:
The Semantic Web is a concept concerned with making data available in a way that allows computers to search, interpret, and place data in context. Since much of today's data storage takes place in relational databases, new ways of transforming and storing data are needed to make it available to the Semantic Web. Previous research has shown that transforming data from relational databases into RDF, the format that makes data searchable on the Semantic Web, is possible, but there is currently no standard for how this should be done. For transformed data to acquire the correct meaning in RDF, ontologies that describe the relationships between concepts are required. The national road database NVDB is a relational database that manages geospatial data used in various geographic information systems (GIS). For the project partner Triona, it was of interest to describe how this type of data can be transformed to fit the Semantic Web. The purpose was to analyze how geospatial data is transferred from a relational database to the Semantic Web. The goal of the study was to create a model for transferring geospatial data from a relational database to an RDF store and for creating an ontology suited to NVDB's data and data structure. A case study was conducted using document studies based on an initial literature review. An ontology was created for the specific case and, based on this, a model was created for transferring geospatial data from NVDB to RDF using the TripleGeo software. The analysis was performed by examining the transformed data against existing theory about RDF and its structure, and then comparing to verify that the data acquires the correct meaning. The result was also validated using the W3C's RDF validation service. The result shows how data from a relational database containing geospatial data is transformed into RDF and how an ontology for this was created. The result also presents a model that describes how this is carried out, and it can be seen as an attempt to generalize and standardize a method for transferring geospatial data to RDF.
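To make the relational-to-RDF step above concrete, here is a minimal sketch (not taken from the thesis) of how a single road-segment row from a relational table could be expressed as RDF triples with Python's rdflib. The table layout, the ex: namespace and the property names are illustrative assumptions; TripleGeo performs a far more complete, ontology-driven mapping.

```python
# Hedged sketch: mapping one hypothetical NVDB-style road-segment row to RDF.
# Namespace, property names and table layout are illustrative assumptions,
# not the ontology or mapping actually produced in the study.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/nvdb#")                 # hypothetical ontology namespace
GEO = Namespace("http://www.opengis.net/ont/geosparql#")   # GeoSPARQL vocabulary

# A row as it might come out of a relational query: (id, road number, speed limit, WKT geometry)
row = (42, "E4", 110, "LINESTRING(17.64 59.86, 17.65 59.87)")

g = Graph()
g.bind("ex", EX)
g.bind("geo", GEO)

segment = URIRef(EX[f"RoadSegment/{row[0]}"])
geom = URIRef(EX[f"Geometry/{row[0]}"])

g.add((segment, RDF.type, EX.RoadSegment))
g.add((segment, EX.roadNumber, Literal(row[1])))
g.add((segment, EX.speedLimit, Literal(row[2], datatype=XSD.integer)))
g.add((segment, GEO.hasGeometry, geom))
g.add((geom, RDF.type, GEO.Geometry))
g.add((geom, GEO.asWKT, Literal(row[3], datatype=GEO.wktLiteral)))

print(g.serialize(format="turtle"))
```

The serialized Turtle can then be checked with the W3C RDF validation service mentioned in the abstract.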
Abstract:
The practitioners of bioinformatics require increasing sophistication from their software tools to take into account the particular characteristics that make their domain complex. For example, there is great variation in the experience of researchers, ranging from novices, who would like guidance from experts on the best resources to use, to experts, who wish to take greater management control of the tools used in their experiments. Also, the range of available, and conflicting, data formats is growing, and there is a desire to automate the many trivial manual stages of in-silico experiments. Agent-oriented software development is one approach to tackling the design of complex applications. In this paper, we argue that agent-oriented development is, in fact, a particularly well-suited approach to developing bioinformatics tools that take into account the wider domain characteristics. To illustrate this, we design a data curation tool, which manages the format of experimental data, extend it to better account for the extra requirements placed by the domain characteristics, and show how the characteristics lead to a system well suited to an agent-oriented view.
Abstract:
The Short-term Water Information and Forecasting Tools (SWIFT) is a suite of tools for flood and short-term streamflow forecasting, consisting of a collection of hydrologic model components and utilities. Catchments are modeled using conceptual subareas and a node-link structure for channel routing. The tools comprise modules for calibration, model state updating, output error correction, ensemble runs and data assimilation. Given the combinatorial nature of the modelling experiments and the sub-daily time steps typically used for simulations, the volume of model configurations and time series data is substantial and its management is not trivial. SWIFT is currently used mostly for research purposes but has also been used operationally, with intersecting but significantly different requirements. Early versions of SWIFT used mostly ad hoc text files handled via Fortran code, with limited use of netCDF for time series data. The configuration and data handling modules have since been redesigned. The model configuration now follows a design where the data model is decoupled from the on-disk persistence mechanism. For research purposes the preferred on-disk format is JSON, to leverage numerous software libraries in a variety of languages, while retaining the legacy option of custom tab-separated text formats where researchers prefer that access arrangement. Decoupling the data model from data persistence makes it much easier to use, for instance, relational databases interchangeably to provide stricter provenance and audit-trail capabilities in an operational flood forecasting context. For the time series data, given the volume and required throughput, text-based formats are usually inadequate. A schema derived from the CF conventions has been designed to handle time series efficiently for SWIFT.
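A minimal sketch of the decoupling idea described above, assuming hypothetical class and field names rather than SWIFT's actual API: the in-memory configuration model knows nothing about persistence, and JSON is just one interchangeable back end behind a common interface.

```python
# Hedged sketch of a data model decoupled from on-disk persistence.
# Class names and fields are illustrative assumptions, not SWIFT's actual API.
import json
from dataclasses import dataclass, asdict
from typing import Protocol


@dataclass
class SubareaConfig:
    name: str
    area_km2: float
    routing_link: str


class ConfigStore(Protocol):
    def save(self, cfg: SubareaConfig, target: str) -> None: ...
    def load(self, target: str) -> SubareaConfig: ...


class JsonConfigStore:
    """Research-friendly persistence: plain JSON on disk."""
    def save(self, cfg: SubareaConfig, target: str) -> None:
        with open(target, "w") as f:
            json.dump(asdict(cfg), f, indent=2)

    def load(self, target: str) -> SubareaConfig:
        with open(target) as f:
            return SubareaConfig(**json.load(f))


# A relational or tab-separated store could implement the same protocol,
# so model code never changes when the persistence mechanism does.
store: ConfigStore = JsonConfigStore()
store.save(SubareaConfig("upper_catchment", 125.4, "node_3->node_7"), "subarea.json")
print(store.load("subarea.json"))
```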
Abstract:
Non-audio signals were recorded in the flash ROM memory of a portable MP3 player, as a WAV-format file, to examine the possibility of using these cheap and small instruments as general-purpose portable data loggers. A 1200-Hz FM carrier modulated by the non-audio signal replaced the microphone signal while using the REC operating mode of the MP3 player, which triggers the voice recording function. Signal recovery was carried out by a PLL-based FM demodulator whose input is the FM signal captured at the coil leads of the MP3 player's earphone. Sinusoidal and electrocardiogram signals were used in the system evaluation. Although the quality of low-frequency signals needs improvement, overall the results indicate the viability of the proposal. Suggestions are made for improvements and extensions of the work.
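A short sketch of the modulation side of this scheme: a slow non-audio test signal frequency-modulates a 1200-Hz carrier, which is then written as a mono WAV file, roughly what the player's voice recorder would store. The sample rate, frequency deviation and test signal are illustrative assumptions; recovery in the paper is done in hardware with a PLL-based demodulator.

```python
# Hedged sketch: a slow "non-audio" signal frequency-modulates a 1200-Hz carrier,
# which is written as a mono WAV file. Sample rate, deviation and signal shape
# are illustrative assumptions, not the paper's exact parameters.
import numpy as np
from scipy.io import wavfile

fs = 8000                        # sample rate (Hz), assumed
duration = 5.0                   # seconds
t = np.arange(int(fs * duration)) / fs

f_carrier = 1200.0               # FM carrier from the paper
f_dev = 200.0                    # peak frequency deviation (Hz), assumed
message = np.sin(2 * np.pi * 2.0 * t)    # 2-Hz test signal standing in for e.g. an ECG

# FM: instantaneous phase is the running integral of the instantaneous frequency.
phase = 2 * np.pi * np.cumsum(f_carrier + f_dev * message) / fs
fm_signal = np.cos(phase)

wavfile.write("fm_logger.wav", fs, (0.8 * fm_signal * 32767).astype(np.int16))
# Recovery would mirror this step with an FM demodulator (a PLL in the paper's hardware).
```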
Abstract:
In soil surveys, several sampling systems can be used to define the most representative sites for sample collection and description of soil profiles. In recent years, the conditioned Latin hypercube sampling system has gained prominence for soil surveys. In Brazil, most soil maps are at small scales and in paper format, which hinders their refinement. The objectives of this work were: (i) to compare two sampling systems based on the conditioned Latin hypercube to map soil classes and soil properties; (ii) to retrieve information from a detailed-scale soil map of a pilot watershed for its refinement, comparing two data mining tools, and to validate the new soil map; and (iii) to create and validate a soil map of a much larger, similar area by extrapolating the information extracted from the existing soil map. Two sampling designs were created, one by the conditioned Latin hypercube and one by the cost-constrained conditioned Latin hypercube. At each prospection site, soil classification and measurement of the A horizon thickness were performed. Maps were generated and validated for each sampling system, comparing the efficiency of these methods. The conditioned Latin hypercube captured greater variability of soils and properties than the cost-constrained conditioned Latin hypercube, although the former involved greater difficulty in field work. The conditioned Latin hypercube can capture greater soil variability, and the cost-constrained conditioned Latin hypercube presents great potential for use in soil surveys, especially in areas of difficult access. From an existing detailed-scale soil map of a pilot watershed, topographical information for each soil class was extracted from a Digital Elevation Model and its derivatives using two data mining tools. Maps were generated with each tool. The more accurate of these tools was used to extrapolate soil information to a much larger, similar area, and the generated map was validated. It was possible to retrieve the existing soil map information and apply it to a larger area containing similar soil-forming factors at much lower financial cost. The KnowledgeMiner data mining tool and ArcSIE, used to create the soil map, presented better results and enabled the use of the existing soil map to extract soil information and apply it to similar, larger areas at reduced cost, which is especially important in developing countries with limited financial resources for such activities, such as Brazil.
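A toy illustration of the conditioned Latin hypercube idea referred to above: choose n sites from a pool of candidates so that, for each terrain covariate, every one of n quantile strata is occupied roughly once. Real cLHS implementations optimize this objective with simulated annealing over real covariates; the crude random search and synthetic data below are stand-in assumptions, not the tools used in the study.

```python
# Hedged toy sketch of conditioned Latin hypercube site selection.
# Synthetic covariates and a random search stand in for a real optimizer.
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.normal(size=(500, 3))   # 500 candidate sites, 3 terrain covariates
n = 10                                    # desired number of field samples

# Quantile-based stratum edges per covariate (n strata each).
edges = np.quantile(candidates, np.linspace(0, 1, n + 1), axis=0)

def objective(idx):
    """Sum over covariates of |stratum count - 1|; 0 is a perfect Latin hypercube."""
    total = 0
    for j in range(candidates.shape[1]):
        counts = np.histogram(candidates[idx, j], bins=edges[:, j])[0]
        total += np.abs(counts - 1).sum()
    return total

best_idx = rng.choice(len(candidates), n, replace=False)
best_obj = objective(best_idx)
for _ in range(5000):                     # crude random search instead of annealing
    trial = rng.choice(len(candidates), n, replace=False)
    if (obj := objective(trial)) < best_obj:
        best_idx, best_obj = trial, obj

print("selected site indices:", sorted(best_idx), "objective:", best_obj)
```

A cost-constrained variant would add a term penalizing sites that are expensive or hard to reach.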
Abstract:
The Internet of Things (IoT) is the next industrial revolution: we will interact naturally with real and virtual devices as a key part of our daily life. This technology shift is expected to be greater than the Web and Mobile combined. As extremely different technologies are needed to build connected devices, the Internet of Things field is a junction between electronics, telecommunications and software engineering. Internet of Things application development happens in silos, often using proprietary and closed communication protocols. There is a common belief that only by solving the interoperability problem can we have a real Internet of Things. After a deep analysis of the IoT protocols, we identified a set of primitives for IoT applications. We argue that each IoT protocol can be expressed in terms of those primitives, thus solving the interoperability problem at the application protocol level. Moreover, the primitives are network and transport independent and make no assumption in that regard. This dissertation presents our implementation of an IoT platform: the Ponte project. Privacy issues follow the rise of the Internet of Things: it is clear that the IoT must ensure resilience to attacks, data authentication, access control and client privacy. We argue that it is not possible to solve the privacy issue without solving the interoperability problem: enforcing privacy rules implies the need to limit and filter the data delivery process. However, filtering data requires knowledge of the format and the semantics of the data: after an analysis of the possible data formats and representations for the IoT, we identify JSON-LD and the Semantic Web as the best solution for IoT applications. Finally, this dissertation presents our approach to increasing the throughput of filtering semantic data by a factor of ten.
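A hedged sketch of the data-representation choice argued for above: an IoT observation expressed as JSON-LD, so that a privacy filter can act on semantic types and properties rather than ad-hoc field names. The vocabulary mapping (here the W3C SOSA ontology), device identifiers and filter rule are illustrative assumptions, not the Ponte project's actual data model.

```python
# Hedged sketch: an IoT reading as JSON-LD plus a toy semantic filter.
# Vocabulary URLs, identifiers and the filter rule are illustrative assumptions.
import json

reading = {
    "@context": {
        "sosa": "http://www.w3.org/ns/sosa/",
        "value": "sosa:hasSimpleResult",
        "observedProperty": {"@id": "sosa:observedProperty", "@type": "@id"},
        "madeBySensor": {"@id": "sosa:madeBySensor", "@type": "@id"},
    },
    "@type": "sosa:Observation",
    "madeBySensor": "urn:example:device:thermostat-42",
    "observedProperty": "http://example.org/properties/indoorTemperature",
    "value": 21.7,
}

def allowed(doc: dict, blocked_property: str) -> bool:
    """Toy privacy filter: reject observations of a blocked semantic property."""
    return doc.get("observedProperty") != blocked_property

print(json.dumps(reading, indent=2))
print(allowed(reading, "http://example.org/properties/gpsPosition"))  # True: not blocked
```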
Abstract:
BACKGROUND: Physiologic data display is essential to decision making in critical care. Current displays echo first-generation hemodynamic monitors dating to the 1970s and have not kept pace with new insights into physiology or the needs of clinicians who must make progressively more complex decisions about their patients. The effectiveness of any redesign must be tested before deployment. Tools that compare current displays with novel presentations of processed physiologic data are required. Regenerating conventional physiologic displays from archived physiologic data is an essential first step. OBJECTIVES: The purposes of the study were to (1) describe the SSSI (single sensor single indicator) paradigm that is currently used for physiologic signal displays, (2) identify and discuss possible extensions and enhancements of the SSSI paradigm, and (3) develop a general approach and a software prototype to construct such "extended SSSI displays" from raw data. RESULTS: We present the Multi Wave Animator (MWA) framework, a set of open-source MATLAB (MathWorks, Inc., Natick, MA, USA) scripts aimed at creating dynamic visualizations (e.g., video files in AVI format) of patient vital signs recorded from bedside (intensive care unit or operating room) monitors. Multi Wave Animator creates animations in which vital signs are displayed to mimic their appearance on current bedside monitors. The source code of MWA is freely available online together with a detailed tutorial and sample data sets.
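As an illustration of what regenerating a monitor-style display from archived data involves, here is a minimal sketch in Python (MWA itself is a set of MATLAB scripts) that replays a synthetic waveform as a scrolling animation and writes it to a video file. The sample rate, window length and placeholder signal are assumptions, and saving requires ffmpeg to be available.

```python
# Hedged sketch, in Python rather than MWA's MATLAB: replay an archived-style
# waveform as a scrolling, monitor-like animation and save it as a video.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

fs = 250                                   # samples per second, assumed
t = np.arange(0, 30, 1 / fs)
signal = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(t.size)  # placeholder waveform

window = 5.0                               # seconds visible at once, like a monitor sweep
fig, ax = plt.subplots()
line, = ax.plot([], [], lw=1)
ax.set_xlim(0, window)
ax.set_ylim(-2, 2)
ax.set_xlabel("time (s)")

def update(frame):
    start = frame / 25                     # advance 1/25 s per video frame
    mask = (t >= start) & (t < start + window)
    line.set_data(t[mask] - start, signal[mask])
    return line,

anim = FuncAnimation(fig, update, frames=int((t[-1] - window) * 25), interval=40, blit=True)
anim.save("vitals.mp4", fps=25)            # needs ffmpeg; MWA writes AVI files instead
```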
Abstract:
PURPOSE: To describe the implementation and use of an electronic patient-referral system as an aid to the efficient referral of patients to a remote and specialized treatment center. METHODS AND MATERIALS: A system for the exchange of radiotherapy data between different commercial planning systems and a specially developed planning system for proton therapy has been developed through the use of the PAPYRUS diagnostic image standard as an intermediate format. To ensure the cooperation of the different TPS manufacturers, the number of data sets defined for transfer has been restricted to the three core data sets of CT, VOIs, and three-dimensional dose distributions. As a complement to the exchange of data, network-wide application-sharing (video-conferencing) technologies have been adopted to provide methods for the interactive discussion and assessment of treatment plans with one or more partner clinics. RESULTS: Through the use of evaluation plans based on the exchanged data, referring clinics can accurately assess the advantages offered by proton therapy on a patient-by-patient basis, while the practicality or otherwise of the proposed treatments can simultaneously be assessed by the proton therapy center. Such a system, along with the interactive capabilities provided by video-conferencing methods, has been found to be an efficient solution to the problem of patient assessment and selection at a specialized treatment center, and is a necessary first step toward the full electronic integration of such centers with their remotely situated referral centers.
Abstract:
Source materials like fine art, over-sized, fragile maps, and delicate artifacts have traditionally been digitally converted through the use of controlled lighting and high-resolution scanners and camera backs. In addition, the capture of items such as general and special collections bound monographs has recently grown, both through consortial efforts like the Internet Archive's Open Content Alliance and locally at the individual institution level. These projects, in turn, have introduced increasingly higher-resolution consumer-grade digital single lens reflex cameras, or "DSLRs", as a significant part of the general cultural heritage digital conversion workflow. Central to the authors' discussion is the fact that both camera backs and DSLRs commonly share the ability to capture native raw file formats. Because these formats include such advantages as access to an image's raw mosaic sensor data within their architecture, many institutions choose raw for initial capture due to its high bit depth and unprocessed nature. However, to date these same raw formats, so important to many at the point of capture, have yet to be considered "archival" within most published still-imaging standards, if they are considered at all. Throughout many workflows raw files are deleted and thrown away after more traditionally "archival" uncompressed TIFF or JPEG 2000 files have been derived downstream from their raw source formats [1][2]. As a result, the authors examine the nature of raw anew and consider the basic questions: Should raw files be retained? What might their role be? Might they in fact form a new archival format space? Included in the discussion is a survey of assorted raw file types and their attributes. Also addressed are various sustainability issues as they pertain to archival formats, with a special emphasis on both raw's positive and negative characteristics as they apply to archival practices. Current common archival workflows versus possible raw-based ones are investigated as well. These comparisons are noted in the context of each approach's differing levels of usable captured image data, various preservation virtues, and the divergent ideas of strictly fixed renditions versus the potential for improved renditions over time. Special attention is given to the DNG raw format through a detailed inspection of a number of its structural components and the roles that they play in the format's latest specification. Finally, an evaluation is drawn of both proprietary raw formats in general and DNG in particular as possible alternative archival formats for still imaging.
Abstract:
BACKGROUND Record linkage of existing individual health care data is an efficient way to answer important epidemiological research questions. Reuse of individual health-related data faces several problems: either a unique personal identifier, like a social security number, is not available, or non-unique person-identifiable information, like names, is privacy protected and cannot be accessed. A solution to protect privacy in probabilistic record linkage is to encrypt this sensitive information. Unfortunately, encrypted hash codes of two names differ completely if the plain names differ by only a single character. Therefore, standard encryption methods cannot be applied. To overcome these challenges, we developed the Privacy Preserving Probabilistic Record Linkage (P3RL) method. METHODS In this Privacy Preserving Probabilistic Record Linkage method we apply a three-party protocol, with two sites collecting individual data and an independent trusted linkage center as the third partner. Our method consists of three main steps: pre-processing, encryption and probabilistic record linkage. Data pre-processing and encryption are done at the sites by local personnel. To guarantee similar quality and format of variables and an identical encryption procedure at each site, the linkage center generates semi-automated pre-processing and encryption templates. To retrieve the information (i.e. data structure) needed to create the templates without ever accessing plain person-identifiable information, we introduced a novel method of data masking. Sensitive string variables are encrypted using Bloom filters, which enables the calculation of similarity coefficients. For date variables, we developed special encryption procedures to handle the most common date errors. The linkage center performs probabilistic record linkage with encrypted person-identifiable information and plain non-sensitive variables. RESULTS In this paper we describe step by step how to link existing health-related data using encryption methods to preserve the privacy of persons in the study. CONCLUSION Privacy Preserving Probabilistic Record Linkage expands record linkage facilities in settings where a unique identifier is unavailable and/or regulations restrict access to the non-unique person-identifiable information needed to link existing health-related data sets. Automated pre-processing and encryption fully protect sensitive information, ensuring participant confidentiality. This method is suitable not just for epidemiological research but also for any setting with similar challenges.
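A small sketch of the Bloom-filter encoding step described above: each name is split into bigrams, the bigrams are hashed into a fixed-length bit set, and a Dice coefficient over the bit sets approximates name similarity without exchanging plain text. The filter length, number of hash functions and unkeyed hashing are illustrative assumptions; a real P3RL deployment would use keyed hashes agreed between the sites.

```python
# Hedged sketch of Bloom-filter name encoding for privacy-preserving linkage.
# Filter size, hash count and unkeyed SHA-256 are illustrative assumptions only.
import hashlib

FILTER_BITS = 256
NUM_HASHES = 4

def bigrams(name: str) -> set:
    padded = f" {name.lower().strip()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(name: str) -> set:
    """Hash every bigram into NUM_HASHES positions of a FILTER_BITS-long filter."""
    bits = set()
    for gram in bigrams(name):
        for k in range(NUM_HASHES):
            digest = hashlib.sha256(f"{k}:{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % FILTER_BITS)
    return bits

def dice(a: set, b: set) -> float:
    """Similarity coefficient computed on encodings only, never on plain names."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

# Similar names keep a high score even though only encodings are compared.
print(dice(bloom_encode("Margaret"), bloom_encode("Margret")))   # clearly high
print(dice(bloom_encode("Margaret"), bloom_encode("Johnson")))   # clearly low
```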
Abstract:
Introduction: Clinical reasoning is essential for the practice of medicine. The theory of the development of medical expertise states that clinical reasoning starts from analytical processes, namely the storage of isolated facts and the logical application of the 'rules' of diagnosis. Learners then successively develop so-called semantic networks and illness scripts, which are finally used in an intuitive, non-analytic fashion [1], [2]. The script concordance test (SCT) is an example of a format for assessing clinical reasoning [3]. However, the aggregate scoring [3] of the SCT is recognized as problematic [4]: the SCT's scoring leads to logical inconsistencies and is likely to reflect construct-irrelevant differences in examinees' response styles [4]. The expert panel judgments might also introduce an unintended error of measurement [4]. In this PhD project the following research questions will be addressed: 1. What could a format look like that assesses clinical reasoning (similarly to the SCT) with multiple true-false questions or other formats with unambiguously correct answers, and thereby addresses the above-mentioned pitfalls in the traditional scoring of the SCT? 2. How well does this format fulfill the Ottawa criteria for good assessment, with special regard to educational and catalytic effects [5]? Methods: 1. A first study will assess whether designing a new format that uses multiple true-false items to assess clinical reasoning, similar to the SCT format, is defensible in a theoretically and practically sound fashion. For this study, focus groups or interviews with assessment experts and students will be undertaken. 2. In a study using focus groups and psychometric data, Norcini and colleagues' criteria for good assessment [5] will be evaluated for the new format in a real assessment. Furthermore, the scoring method for this new format will be optimized using real and simulated data.
Abstract:
A wide variety of spatial data collection efforts are ongoing throughout local, state and federal agencies, private firms and non-profit organizations. Each effort is established for a different purpose, but organizations and individuals often collect and maintain the same or similar information. The United States federal government has undertaken many initiatives such as the National Spatial Data Infrastructure, the National Map and Geospatial One-Stop to reduce duplicative spatial data collection and promote the coordinated use, sharing, and dissemination of spatial data nationwide. A key premise in most of these initiatives is that no national government will be able to gather and maintain more than a small percentage of the geographic data that users want and desire. Thus, national initiatives typically depend on the cooperation of those already gathering spatial data and those using GIS to meet specific needs to help construct and maintain these spatial data infrastructures and geo-libraries for their nations (Onsrud 2001). Some of the impediments to widespread spatial data sharing are well known from directly asking GIS data producers why they are not currently involved in creating datasets that are of common or compatible formats, documenting their datasets in a standardized metadata format, or making their datasets more readily available to others through data clearinghouses or geo-libraries. The research described in this thesis addresses the impediments to wide-scale spatial data sharing faced by GIS data producers and explores a new conceptual data-sharing approach, the Public Commons for Geospatial Data, that supports user-friendly metadata creation, open access licenses, archival services and documentation of the parent lineage of the contributors and value-adders of digital spatial data sets.
Abstract:
Our paper asks the question: Does the mode of instruction (live or online format) affect test scores in principles of macroeconomics classes? Our data are from several sections of principles of macroeconomics, some in live format, some in online format, and all taught by the same instructor. We find that test scores for the online format, when corrected for sample selection bias, are four points higher than for the live format, and the difference is statistically significant. One possible explanation is that there was slightly higher human capital in the classes that had the online format. An Oaxaca decomposition of this difference in grades was conducted to see how much was due to human capital and how much was due to differences in the rates of return to human capital. This analysis reveals that 25% of the difference was due to the higher human capital, with the remaining 75% due to differences in the returns to human capital. It is possible that for the relatively older student with the appropriate online learning skill set, and with schedule constraints created by family and job, the online format provides a more productive learning environment than does the traditional live class format. Also, because our data are limited to the student's academic transcript, we recommend that future research include data on learning style characteristics and the constraints formed by family and job choices.
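For readers unfamiliar with the technique, here is a hedged sketch of a two-fold Oaxaca decomposition like the one mentioned above, run on synthetic data rather than the authors' transcripts: the gap in mean scores splits exactly into a part explained by differences in human-capital endowments and a part due to differences in the estimated returns.

```python
# Hedged sketch of a two-fold Oaxaca decomposition on synthetic data.
# The single human-capital proxy (GPA) and all coefficients are assumptions.
import numpy as np

rng = np.random.default_rng(1)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def design(gpa):                           # intercept + one human-capital proxy
    return np.column_stack([np.ones_like(gpa), gpa])

gpa_online = rng.normal(3.1, 0.4, 200)     # slightly higher human capital online
gpa_live = rng.normal(3.0, 0.4, 200)
score_online = 62 + 7.0 * gpa_online + rng.normal(0, 5, 200)
score_live = 61 + 6.5 * gpa_live + rng.normal(0, 5, 200)

b_online = ols(design(gpa_online), score_online)
b_live = ols(design(gpa_live), score_live)
x_online = design(gpa_online).mean(axis=0)
x_live = design(gpa_live).mean(axis=0)

gap = score_online.mean() - score_live.mean()
endowments = (x_online - x_live) @ b_live          # explained by human capital
returns = x_online @ (b_online - b_live)           # due to different returns

print(f"gap={gap:.2f}, endowments={endowments:.2f}, returns={returns:.2f}")
```

With an intercept in both regressions the two components sum exactly to the mean gap, which is how shares such as the 25%/75% split above are obtained.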