940 results for "Data anonymization and sanitization"


Relevance: 100.00%

Publisher:

Abstract:

The Tara Oceans Expedition (2009-2013) sampled the world oceans on board a 36 m long schooner, collecting environmental data and organisms from viruses to planktonic metazoans for later analyses using modern sequencing and state-of-the-art imaging technologies. Tara Oceans data are particularly suited to studying the genetic, morphological and functional diversity of plankton. The present dataset contains navigation and meteorological data measured during one campaign of the Tara Oceans Expedition. Latitude and longitude were obtained from thermosalinograph (TSG) data.

Relevance: 100.00%

Publisher:

Abstract:

Acknowledgements: The authors would like to thank Jonathan Dick, Josie Geris, Jason Lessels, and Claire Tunaley for data collection and Audrey Innes for lab sample preparation. We also thank Christian Birkel for discussions about the model structure and comments on an earlier draft of the paper. Climatic data were provided by Iain Malcolm and Marine Scotland Fisheries at the Freshwater Lab, Pitlochry. Additional precipitation data were provided by the UK Meteorological Office and the British Atmospheric Data Centre (BADC). We thank the European Research Council (ERC; project GA 335910 VEWA) for funding the VeWa project.

Relevance: 100.00%

Publisher:

Abstract:

Online Social Network (OSN) services provided by Internet companies bring people together to chat and share information. Meanwhile, huge amounts of data are generated by those services (which can be regarded as social media) every day, every hour, even every minute and every second. Researchers are currently interested in analyzing OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, the large scale of OSN data makes it difficult to analyze effectively. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components of social media data: users and user-generated content. Specifically, it addresses three problems related to social media users and content: (1) how does one organize the users and the content? (2) how does one summarize the textual content so that users do not have to read every post to capture the general idea? (3) how does one identify influential users in social media to benefit other applications, e.g., marketing campaigns? The contributions of this dissertation are briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze users and user-generated content from social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and content. (3) It proposes multi-document summarization methods to extract core information from social network content. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.
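The hierarchical co-clustering algorithm itself is not described in the abstract. Purely as an illustration of the general idea of jointly grouping users and content, the following minimal sketch applies scikit-learn's standard spectral co-clustering to a hypothetical user-by-term count matrix; the toy data, cluster count, and variable names are assumptions, not the dissertation's method.

    # Toy co-clustering of users and terms; a standard spectral baseline,
    # not the hierarchical algorithm proposed in the dissertation.
    import numpy as np
    from sklearn.cluster import SpectralCoclustering

    rng = np.random.default_rng(0)
    user_term_counts = rng.poisson(lam=1.0, size=(100, 50)) + 1  # rows = users, cols = terms

    model = SpectralCoclustering(n_clusters=3, random_state=0)
    model.fit(user_term_counts)

    print("user cluster labels:", model.row_labels_[:10])
    print("term cluster labels:", model.column_labels_[:10])

Each user row and each term column receives a cluster label, so users and the vocabulary they favour are grouped simultaneously, which is the basic effect a co-clustering of users and content aims for.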

Relevance: 100.00%

Publisher:

Abstract:

The MAREDAT atlas covers 11 types of plankton, ranging in size from bacteria to jellyfish. Together, these plankton groups determine the health and productivity of the global ocean and play a vital role in the global carbon cycle. Working within a uniform and consistent spatial and depth grid (map) of the global ocean, the researchers compiled thousands to tens of thousands of data points to identify regions of plankton abundance and scarcity as well as areas of data abundance and scarcity. At many of the grid points, the MAREDAT team accomplished the difficult conversion from abundance (numbers of organisms) to biomass (carbon mass of organisms). The MAREDAT atlas provides an unprecedented global data set for ecological and biochemical analysis and modeling, as well as a clear mandate for compiling additional existing data and for focusing future data-gathering efforts on key groups in key areas of the ocean. This is a gridded data product about diazotrophic organisms. There are 6 variables, each gridded on dimensions of 360 (longitude) * 180 (latitude) * 33 (depth) * 12 (month). The first group of 3 variables is: (1) number of biomass observations, (2) biomass, and (3) nifH-gene-based biomass. The second group of 3 variables is the same as the first except that it grids only non-zero data. We have constructed a database on diazotrophic organisms in the global pelagic upper ocean by compiling more than 11,000 direct field measurements comprising 3 sub-databases: (1) nitrogen fixation rates, (2) cyanobacterial diazotroph abundances from cell counts, and (3) cyanobacterial diazotroph abundances from qPCR assays targeting nifH genes. Biomass conversion factors are estimated based on cell sizes to convert abundance data to diazotrophic biomass. Data are assigned to 3 groups: Trichodesmium, unicellular diazotrophic cyanobacteria (groups A, B and C when applicable), and heterocystous cyanobacteria (Richelia and Calothrix). Total nitrogen fixation rates and diazotrophic biomass are calculated by summing the values from all groups. Some of the nitrogen fixation rates are whole-seawater measurements and are used as total nitrogen fixation rates. Both volumetric and depth-integrated values were reported. Depth-integrated values are also calculated for vertical profiles with values at 3 or more depths.
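As a concrete illustration of the grid layout described above (six variables, each on a 360 x 180 x 33 x 12 longitude-latitude-depth-month grid), here is a minimal sketch of how such a product might be held in memory. The array names, fill value, and toy numbers are assumptions for illustration, not part of the MAREDAT distribution.

    # Sketch of the 360 x 180 x 33 x 12 grid described in the abstract.
    # Variable names and the NaN fill value are illustrative assumptions.
    import numpy as np

    shape = (360, 180, 33, 12)               # longitude, latitude, depth, month
    n_obs = np.zeros(shape)                  # (1) number of biomass observations
    biomass = np.full(shape, np.nan)         # (2) biomass
    biomass_nifh = np.full(shape, np.nan)    # (3) nifH-gene-based biomass

    # Example: record a single toy observation in one grid cell.
    i_lon, i_lat, i_depth, i_month = 180, 90, 0, 6
    n_obs[i_lon, i_lat, i_depth, i_month] += 1
    biomass[i_lon, i_lat, i_depth, i_month] = 0.42  # toy value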

Relevance: 100.00%

Publisher:

Abstract:

Site 1103 was one of a transect of three sites drilled across the Antarctic Peninsula continental shelf during Leg 178. The aim of drilling on the shelf was to determine the age of the sedimentary sequences and to ground-truth previous interpretations of the depositional environment (i.e., topsets and foresets) of progradational seismostratigraphic sequences S1, S2, S3, and S4. The ultimate objective was to obtain a better understanding of the history of glacial advances and retreats on this west Antarctic margin. Drilling the topsets of the progradational wedge (0-247 m below seafloor [mbsf]), which consist of unsorted and unconsolidated materials of seismic Unit S1, was very unfavorable, resulting in very low (2.3%) core recovery. Recovery improved (34%) below 247 mbsf, corresponding to sediments of seismic Unit S3, which have a consolidated matrix. Logs were obtained only from the interval between 75 and 244 mbsf, and inconsistencies in the automatic analog picking of the signals received from the sonic log at the array and at the two other receivers prevented accurate shipboard time-depth conversions. This, in turn, limited the capacity for making seismic stratigraphic interpretations at this site and regionally. This study is an attempt to compile all available data sources, perform quality checks, and introduce nonstandard processing techniques for the logging data in order to arrive at a reliable and continuous depth-versus-velocity profile. We defined 13 data categories using differential traveltime information. Polynomial exclusion techniques of various orders and low-pass filtering reduced the noise of the initial data pool and produced a definitive velocity-depth profile that is synchronous with the resistivity logging data. A comparison of the resulting velocity profile with various other logs from Site 1103 further validates the presented data. All major logging units are expressed within the new velocity data. A depth-migrated section based on the new velocity data is presented together with the original time section and the initial depth estimates published in the Leg 178 Initial Reports volume. The presented data confirm the location of the shelf unconformity at 222 ms two-way traveltime (TWT), or 243 mbsf, and allow its seismic identification as a strong negative and subsequent positive reflection.
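The abstract does not give the polynomial orders, exclusion thresholds, or filter settings used. Purely as a hedged illustration of the general approach (fit a polynomial trend to the velocity log, exclude strongly deviating points, then low-pass filter), here is a minimal sketch with assumed parameters and synthetic data, not the study's actual processing chain.

    # Illustrative noise reduction for a sonic velocity log: polynomial trend fit,
    # outlier exclusion, then low-pass filtering. All parameters are assumed
    # demonstration values, not those used in the Site 1103 study.
    import numpy as np
    from scipy.signal import butter, filtfilt

    rng = np.random.default_rng(1)
    depth = np.linspace(75.0, 244.0, 500)                                 # mbsf
    velocity = 1800.0 + 2.5 * depth + rng.normal(0.0, 80.0, depth.size)   # synthetic m/s

    # 1) Fit a low-order polynomial trend and exclude strong outliers.
    trend = np.polyval(np.polyfit(depth, velocity, deg=3), depth)
    residual = velocity - trend
    keep = np.abs(residual) < 2.0 * residual.std()

    # 2) Low-pass filter the retained samples to suppress remaining noise.
    b, a = butter(N=4, Wn=0.1)          # normalized cutoff frequency (assumed)
    velocity_smooth = filtfilt(b, a, velocity[keep])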

Relevance: 100.00%

Publisher:

Abstract:

In the last several years there has been an increase in the amount of qualitative research in sport psychology using in-depth interviews and comprehensive content analyses. However, no explicit method has been provided to deal with the large amount of unstructured data. This article provides common guidelines for organizing and interpreting unstructured data. Two main operations are suggested and discussed: first, coding meaningful text segments, or creating tags, and second, regrouping similar text segments, or creating categories. Furthermore, software programs for the microcomputer are presented as a way to facilitate the organization and interpretation of qualitative data.
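The two operations described above (tagging meaningful text segments, then regrouping tags into higher-level categories) map naturally onto simple data structures. The sketch below is an illustrative assumption in Python, not the software discussed in the article; the example segments, tags, and categories are invented.

    # Illustrative sketch of the two coding operations: (1) tag text segments,
    # (2) regroup tags into higher-level categories. All content is invented.
    from collections import defaultdict

    tagged_segments = [
        ("I felt nervous before the race", "pre-competition anxiety"),
        ("My coach calmed me down", "coach support"),
        ("Breathing exercises helped me focus", "relaxation strategies"),
    ]

    categories = {
        "pre-competition anxiety": "stress sources",
        "coach support": "coping resources",
        "relaxation strategies": "coping resources",
    }

    grouped = defaultdict(list)
    for segment, tag in tagged_segments:
        grouped[categories[tag]].append(segment)

    for category, segments in grouped.items():
        print(category, "->", segments)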

Relevance: 100.00%

Publisher:

Abstract:

The generation of heterogeneous big data sources with ever-increasing volumes, velocities and veracities over the last few years has inspired the data science and research community to address the challenge of extracting knowledge from big data. Such a wealth of generated data across the board can be intelligently exploited to advance our knowledge about our environment, public health, critical infrastructure and security. In recent years we have developed generic approaches to process such big data at multiple levels to advance decision support, specifically data processing with semantic harmonisation, low-level fusion, analytics, and knowledge modelling with high-level fusion and reasoning. These approaches will be introduced and presented in the context of the TRIDEC project results on critical oil and gas industry drilling operations and of the ongoing large-scale eVacuate project on critical crowd behaviour detection in confined spaces.

Relevance: 100.00%

Publisher:

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-08

Relevance: 100.00%

Publisher:

Abstract:

This dissertation contains four essays that share a common purpose: developing new methodologies to exploit the potential of high-frequency data for measuring, modeling and forecasting the volatility and correlations of financial assets. The first two chapters provide useful tools for univariate applications, while the last two chapters develop multivariate methodologies.

In Chapter 1, we introduce a new class of univariate volatility models named FloGARCH models. FloGARCH models provide a parsimonious joint model for low-frequency returns and realized measures, and are sufficiently flexible to capture long memory as well as asymmetries related to leverage effects. We analyze the performance of the models in a realistic numerical study and on the basis of a data set composed of 65 equities. Using more than 10 years of high-frequency transactions, we document significant statistical gains related to the FloGARCH models in terms of in-sample fit, out-of-sample fit and forecasting accuracy compared to classical and Realized GARCH models.

In Chapter 2, using 12 years of high-frequency transactions for 55 U.S. stocks, we argue that combining low-frequency exogenous economic indicators with high-frequency financial data improves the ability of conditionally heteroskedastic models to forecast the volatility of returns, their full multi-step-ahead conditional distribution and the multi-period Value-at-Risk. Using a refined version of the Realized LGARCH model allowing for a time-varying intercept and implemented with realized kernels, we document that nominal corporate profits and term spreads have strong long-run predictive ability and generate accurate risk-measure forecasts over long horizons. The results are based on several loss functions and tests, including the Model Confidence Set.

Chapter 3 is joint work with David Veredas. We study the class of disentangled realized estimators for the integrated covariance matrix of Brownian semimartingales with finite-activity jumps. These estimators separate correlations and volatilities. We analyze different combinations of quantile- and median-based realized volatilities, and four estimators of realized correlations with three synchronization schemes. Their finite-sample properties are studied under four data-generating processes, in the presence or absence of microstructure noise, and under synchronous and asynchronous trading. The main finding is that the pre-averaged version of disentangled estimators based on Gaussian ranks (for the correlations) and median deviations (for the volatilities) provides a precise, computationally efficient, and easy alternative for measuring integrated covariances on the basis of noisy and asynchronous prices. Along these lines, a minimum-variance portfolio application shows the superiority of this disentangled realized estimator in terms of numerous performance metrics.

Chapter 4 is co-authored with Niels S. Hansen, Asger Lunde and Kasper V. Olesen, all affiliated with CREATES at Aarhus University. We propose to use the Realized Beta GARCH model to exploit the potential of high-frequency data in commodity markets. The model produces high-quality forecasts of pairwise correlations between commodities, which can be used to construct a composite covariance matrix. We evaluate the quality of this matrix in a portfolio context and compare it to models used in the industry. We demonstrate significant economic gains in a realistic setting including short-selling constraints and transaction costs.
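The FloGARCH specification itself is not reproduced in the abstract. For orientation only, a sketch of the standard log-linear Realized GARCH recursion (Hansen, Huang and Shek, 2012), against which Chapter 1 benchmarks its models, can be written as:

    r_t = \sqrt{h_t}\, z_t, \qquad z_t \overset{\mathrm{iid}}{\sim} (0, 1),
    \log h_t = \omega + \beta \log h_{t-1} + \gamma \log x_{t-1},
    \log x_t = \xi + \varphi \log h_t + \tau_1 z_t + \tau_2 (z_t^2 - 1) + u_t,

where h_t is the conditional variance of the return r_t, x_t is a realized measure of volatility (e.g. a realized kernel), u_t is measurement noise, and the leverage function \tau_1 z_t + \tau_2 (z_t^2 - 1) captures the asymmetry mentioned in the abstract. How FloGARCH extends this recursion, in particular how it introduces long memory, is not specified in the abstract.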

Relevance: 100.00%

Publisher:

Abstract:

In 2005, the University of Maryland acquired over 70 digital videos spanning 35 years of Jim Henson’s groundbreaking work in television and film. To support in-house discovery and use, the collection was cataloged in detail using AACR2 and MARC21, and a web-based finding aid was also created. In the past year, I created an "r-ball" (a linked data set described using RDA) of these same resources. The presentation will compare and contrast these three ways of accessing the Jim Henson Works collection, with insights gleaned from providing resource discovery using RIMMF (RDA in Many Metadata Formats).

Relevance: 100.00%

Publisher:

Abstract:

Americans are accustomed to a wide range of data collection in their lives: censuses, polls, surveys, user registrations, and disclosure forms. When logging onto the Internet, users' actions are tracked everywhere: clicking, typing, tapping, swiping, searching, and placing orders. All of this data is stored to create data-driven profiles of each user. Social network sites, furthermore, set the voluntary sharing of personal data as the default mode of engagement. But the time and energy people devote to creating this massive amount of data, on paper and online, are taken for granted. Few people would consider their time and energy spent on data production as labor. Even if some people do acknowledge their labor for data, they believe it is accessory to the activities at hand. In the face of pervasive data collection and the rising time spent on screens, why do people keep ignoring their labor for data? How has labor for data become invisible, something disregarded by many users? What does invisible labor for data imply for everyday cultural practices in the United States? Invisible Labor for Data addresses these questions. I argue that three intertwined forces contribute to framing data production as being void of labor: data production institutions throughout history, the Internet's technological infrastructure (especially the implementation of algorithms), and the multiplication of virtual spaces. There is a common tendency in the framework of human interactions with computers to deprive data and bodies of their materiality. My Introduction and Chapter 1 offer theoretical interventions by reinstating embodied materiality and redefining labor for data as an ongoing process. The middle chapters present case studies explaining how labor for data is pushed to the margin of narratives about data production. I focus on a nationwide debate in the 1960s on whether the U.S. should build a databank, contemporary Big Data practices in the data broker and Internet industries, and the group of people who are hired to produce data for other people's avatars in virtual games. I conclude with a discussion of how the new development of crowdsourcing projects may usher in a new chapter in exploiting invisible and discounted labor for data.