895 results for "data types and operators"


Abstract:

Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, may be continuous measurements or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Therefore, efficient statistical approaches are crucial in this big data era.

Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent Gaussian-distributed multivariate vector; under this assumption, the factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, which assumes that the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions of these statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimation of shared topics among newsgroups, analysis of promoter sequences, analysis of political-economic risk data, and estimation of population structure from genotype data.
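A minimal sketch of the low-rank covariance idea behind the factor model mentioned above, using scikit-learn's FactorAnalysis on simulated data; the dimensions, noise level, and choice of library are illustrative assumptions, not the dissertation's implementation.

```python
# Minimal sketch (assumed data and dimensions): a factor model yields a
# low-rank-plus-diagonal estimate of the covariance of the observed variables.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_samples, n_features, n_factors = 500, 50, 5

# Simulate observations x = W z + noise, with a low-dimensional latent z.
W = rng.normal(size=(n_features, n_factors))
Z = rng.normal(size=(n_samples, n_factors))
X = Z @ W.T + 0.5 * rng.normal(size=(n_samples, n_features))

fa = FactorAnalysis(n_components=n_factors).fit(X)

# Covariance estimate: loadings' cross-product (rank <= n_factors) plus diagonal noise.
cov_lowrank = fa.components_.T @ fa.components_ + np.diag(fa.noise_variance_)
print(np.linalg.matrix_rank(fa.components_.T @ fa.components_))  # -> n_factors
```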

Abstract:

Many studies have shown the considerable potential for the application of remote-sensing-based methods for deriving estimates of lake water quality. However, the reliable application of these methods across time and space is complicated by the diversity of lake types, sensor configuration, and the multitude of different algorithms proposed. This study tested one operational and 46 empirical algorithms sourced from the peer-reviewed literature that have individually shown potential for estimating lake water quality properties in the form of chlorophyll-a (algal biomass) and Secchi disc depth (SDD) (water transparency) in independent studies. Nearly half (19) of the algorithms were unsuitable for use with the remote-sensing data available for this study. The remaining 28 were assessed using the Terra/Aqua satellite archive to identify the best performing algorithms in terms of accuracy and transferability within the period 2001–2004 in four test lakes, namely Vänern, Vättern, Geneva, and Balaton. These lakes represent the broad continuum of large European lake types, varying in terms of eco-region (latitude/longitude and altitude), morphology, mixing regime, and trophic status. All algorithms were tested for each lake separately and combined to assess the degree of their applicability in ecologically different sites. None of the algorithms assessed in this study exhibited promise when all four lakes were combined into a single data set and most algorithms performed poorly even for specific lake types. A chlorophyll-a retrieval algorithm originally developed for eutrophic lakes showed the most promising results (R2 = 0.59) in oligotrophic lakes. Two SDD retrieval algorithms, one originally developed for turbid lakes and the other for lakes with various characteristics, exhibited promising results in relatively less turbid lakes (R2 = 0.62 and 0.76, respectively). The results presented here highlight the complexity associated with remotely sensed lake water quality estimates and the high degree of uncertainty due to various limitations, including the lake water optical properties and the choice of methods.
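A minimal sketch of how such an assessment could be scored per lake and for the combined data set; the band-ratio algorithm, reflectance values, and in situ chlorophyll-a concentrations below are hypothetical placeholders, not any of the 47 algorithms tested.

```python
# Sketch (hypothetical algorithm and data): score a chlorophyll-a retrieval
# algorithm per lake and for all lakes combined, using R^2 against in situ data.
import numpy as np
from sklearn.metrics import r2_score

def band_ratio_algo(red, nir):
    # Placeholder empirical form: chlorophyll-a proxy from a NIR/red band ratio.
    return 14.0 * (nir / red) - 12.0

lakes = {
    "Vänern":  {"red": np.array([0.02, 0.03, 0.04]),
                "nir": np.array([0.021, 0.035, 0.05]),
                "chl_insitu": np.array([2.1, 3.4, 5.0])},
    "Balaton": {"red": np.array([0.05, 0.06]),
                "nir": np.array([0.08, 0.10]),
                "chl_insitu": np.array([14.0, 20.0])},
}

all_obs, all_pred = [], []
for name, d in lakes.items():
    pred = band_ratio_algo(d["red"], d["nir"])
    print(name, "R2 =", round(r2_score(d["chl_insitu"], pred), 2))
    all_obs.extend(d["chl_insitu"])
    all_pred.extend(pred)

print("combined R2 =", round(r2_score(all_obs, all_pred), 2))
```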

Abstract:

Online Social Network (OSN) services provided by Internet companies bring people together to chat, share information, and consume information. Meanwhile, these services (which can be regarded as social media) generate huge amounts of data every day, every hour, even every minute and every second. Researchers are currently interested in analyzing OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, the large scale of OSN data makes it difficult to analyze effectively. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components of social media data: users and user-generated content. Specifically, it aims to address three problems related to social media users and content: (1) how does one organize the users and the content? (2) how does one summarize the textual content so that users do not have to read every post to capture the general idea? (3) how does one identify the influential users in social media to benefit other applications, e.g., marketing campaigns? The contributions of this dissertation are briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze users and user-generated content from social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and content. (3) It proposes multi-document summarization methods to extract core information from social network content. (4) It introduces three important dimensions of social influence and a dynamic influence model for identifying influential users.
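A minimal sketch of co-clustering users and content terms from a toy user-by-term matrix; scikit-learn's SpectralCoclustering is used here as a simple stand-in for the hierarchical co-clustering algorithm the dissertation proposes, and the data are invented.

```python
# Sketch (toy data): jointly cluster users (rows) and content terms (columns)
# from a user-by-term count matrix, as a stand-in for hierarchical co-clustering.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(1)
# 6 users x 8 terms; two blocks of users posting about two disjoint topics.
X = np.zeros((6, 8))
X[:3, :4] = rng.poisson(5, size=(3, 4))   # users 0-2 mostly use terms 0-3
X[3:, 4:] = rng.poisson(5, size=(3, 4))   # users 3-5 mostly use terms 4-7
X += rng.poisson(0.5, size=X.shape)       # background noise

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
print("user clusters:", model.row_labels_)
print("term clusters:", model.column_labels_)
```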

Abstract:

The MAREDAT atlas covers 11 types of plankton, ranging in size from bacteria to jellyfish. Together, these plankton groups determine the health and productivity of the global ocean and play a vital role in the global carbon cycle. Working within a uniform and consistent spatial and depth grid (map) of the global ocean, the researchers compiled thousands and tens of thousands of data points to identify regions of plankton abundance and scarcity as well as areas of data abundance and scarcity. At many of the grid points, the MAREDAT team accomplished the difficult conversion from abundance (numbers of organisms) to biomass (carbon mass of organisms). The MAREDAT atlas provides an unprecedented global data set for ecological and biochemical analysis and modeling as well as a clear mandate for compiling additional existing data and for focusing future data gathering efforts on key groups in key areas of the ocean. The present data set provides depth-integrated values of diazotroph Gamma-A nifH gene abundance, computed from a collection of source data sets.

Abstract:

The MAREDAT atlas covers 11 types of plankton, ranging in size from bacteria to jellyfish. Together, these plankton groups determine the health and productivity of the global ocean and play a vital role in the global carbon cycle. Working within a uniform and consistent spatial and depth grid (map) of the global ocean, the researchers compiled thousands and tens of thousands of data points to identify regions of plankton abundance and scarcity as well as areas of data abundance and scarcity. At many of the grid points, the MAREDAT team accomplished the difficult conversion from abundance (numbers of organisms) to biomass (carbon mass of organisms). The MAREDAT atlas provides an unprecedented global data set for ecological and biochemical analysis and modeling as well as a clear mandate for compiling additional existing data and for focusing future data gathering efforts on key groups in key areas of the ocean. The present data set provides depth-integrated values of diazotroph nitrogen fixation rates, computed from a collection of source data sets.

Abstract:

Site 1103 was one of a transect of three sites drilled across the Antarctic Peninsula continental shelf during Leg 178. The aim of drilling on the shelf was to determine the age of the sedimentary sequences and to ground truth previous interpretations of the depositional environment (i.e., topsets and foresets) of progradational seismostratigraphic sequences S1, S2, S3, and S4. The ultimate objective was to obtain a better understanding of the history of glacial advances and retreats in this west Antarctic margin. Drilling the topsets of the progradational wedge (0-247 m below seafloor [mbsf]), which consist of unsorted and unconsolidated materials of seismic Unit S1, was very unfavorable, resulting in very low (2.3%) core recovery. Recovery improved (34%) below 247 mbsf, corresponding to sediments of seismic Unit S3, which have a consolidated matrix. Logs were only obtained from the interval between 75 and 244 mbsf, and inconsistencies in the automatic analog picking of the signals received from the sonic log at the array and at the two other receivers prevented accurate shipboard time-depth conversions. This, in turn, limited the capacity for making seismic stratigraphic interpretations at this site and regionally. This study is an attempt to compile all available data sources, perform quality checks, and introduce nonstandard processing techniques for the logging data obtained to arrive at a reliable and continuous depth vs. velocity profile. We defined 13 data categories using differential traveltime information. Polynomial exclusion techniques with various orders and low-pass filtering reduced the noise of the initial data pool and produced a definitive velocity-depth profile that is synchronous with the resistivity logging data. A comparison of the velocity profile produced with various other logs of Site 1103 further validates the presented data. All major logging units are expressed within the new velocity data. A depth-migrated section with the new velocity data is presented together with the original time section and initial depth estimates published within the Leg 178 Initial Reports volume. The presented data confirm the location of the shelf unconformity at 222 ms two-way traveltime (TWT), or 243 mbsf, and allow its seismic identification as a strong negative and subsequent positive reflection.
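A minimal sketch (with assumed interval velocities) of the basic depth-to-two-way-traveltime conversion that underlies a continuous velocity-depth profile of this kind; it is not the nonstandard processing workflow described above.

```python
# Sketch (assumed velocities): convert a depth-indexed velocity profile from a
# sonic log into two-way traveltime (TWT), the basis of a time-depth relation.
import numpy as np

depth_mbsf = np.arange(0.0, 251.0, 10.0)       # depth below seafloor, m (0-250 m)
velocity = 1600.0 + 2.0 * depth_mbsf           # assumed interval velocity, m/s

dz = np.diff(depth_mbsf)                       # interval thicknesses, m
v_mid = 0.5 * (velocity[:-1] + velocity[1:])   # mean velocity in each interval
twt_s = np.concatenate(([0.0], 2.0 * np.cumsum(dz / v_mid)))  # two-way time, s

# Interpolate TWT at an arbitrary depth, e.g. near the unconformity depth quoted above.
print("TWT at 243 mbsf ~", round(1000 * float(np.interp(243.0, depth_mbsf, twt_s)), 1), "ms")
```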

Abstract:

In the last several years there has been an increase in the amount of qualitative research in sport psychology using in-depth interviews and comprehensive content analyses. However, no explicit method has been provided for dealing with the large amount of unstructured data. This article provides common guidelines for organizing and interpreting unstructured data. Two main operations are suggested and discussed: first, coding meaningful text segments, or creating tags, and second, regrouping similar text segments, or creating categories. Furthermore, software programs for the microcomputer are presented as a way to facilitate the organization and interpretation of qualitative data.
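A minimal sketch of the two operations described, tagging meaningful text segments and regrouping tags into categories; the interview excerpts, tags, and categories are invented for illustration.

```python
# Sketch (invented data): tag raw interview segments, then regroup tags
# into higher-level categories, mirroring the two operations described.
from collections import defaultdict

segments = [
    ("I felt nervous before the race", "pre-competition anxiety"),
    ("My coach helped me refocus",     "coach support"),
    ("I imagined the perfect start",   "imagery"),
]

# Regroup similar tags under broader categories (normally an analyst's judgment).
tag_to_category = {
    "pre-competition anxiety": "emotional state",
    "imagery": "mental preparation",
    "coach support": "social support",
}

by_category = defaultdict(list)
for text, tag in segments:
    by_category[tag_to_category[tag]].append((tag, text))

for category, items in by_category.items():
    print(category, "->", items)
```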

Abstract:

The domestication of plants and animals marks one of the most significant transitions in human, and indeed global, history. Traditionally, study of the domestication process was the exclusive domain of archaeologists and agricultural scientists; today it is an increasingly multidisciplinary enterprise that has come to involve the skills of evolutionary biologists and geneticists. Although the application of new information sources and methodologies has dramatically transformed our ability to study and understand domestication, it has also generated increasingly large and complex datasets, the interpretation of which is not straightforward. In particular, challenges of equifinality, evolutionary variance, and emergence of unexpected or counter-intuitive patterns all face researchers attempting to infer past processes directly from patterns in data. We argue that explicit modeling approaches, drawing upon emerging methodologies in statistics and population genetics, provide a powerful means of addressing these limitations. Modeling also offers an approach to analyzing datasets that avoids conclusions steered by implicit biases, and makes possible the formal integration of different data types. Here we outline some of the modeling approaches most relevant to current problems in domestication research, and demonstrate the ways in which simulation modeling is beginning to reshape our understanding of the domestication process.

Abstract:

The study of the Upper Jurassic-Lower Cretaceous deposits (Higueruelas, Villar del Arzobispo and Aldea de Cortés Formations) of the South Iberian Basin (NW Valencia, Spain) reveals new stratigraphic and sedimentological data, which have significant implications for the stratigraphic framework, depositional environments and age of these units. The Higueruelas Fm was deposited in a mid-inner carbonate platform where oncolitic bars migrated by the action of storms and where oncoid production progressively decreased towards the uppermost part of the unit. The overlying Villar del Arzobispo Fm has been traditionally interpreted as an inner platform-lagoon evolving into a tidal-flat. Here it is interpreted as an inner carbonate platform affected by storms, where oolitic shoals protected a lagoon, which had siliciclastic inputs from the continent. The Aldea de Cortés Fm has been previously interpreted as a lagoon surrounded by tidal-flats and fluvial-deltaic plains. Here it is reinterpreted as a coastal wetland where siliciclastic muddy deposits interacted with shallow fresh to marine water bodies, aeolian dunes and continental siliciclastic inputs. The contact between the Higueruelas and Villar del Arzobispo Fms, classically defined as gradual, is also interpreted here as rapid. More importantly, the contact between the Villar del Arzobispo and Aldea de Cortés Fms, previously considered as unconformable, is here interpreted as gradual. The presence of Alveosepta in the Villar del Arzobispo Fm suggests that at least part of this unit is Kimmeridgian, unlike the previously assigned Late Tithonian-Middle Berriasian age. Consequently, the underlying Higueruelas Fm, previously considered Tithonian, should not be younger than Kimmeridgian. Accordingly, sedimentation of the Aldea de Cortés Fm, previously considered Valanginian-Hauterivian, probably started during the Tithonian, and it may be considered part of the regressive trend of the Late Jurassic-Early Cretaceous cycle. This is consistent with the dinosaur faunas, typically Jurassic, described in the Villar del Arzobispo and Aldea de Cortés Fms.

Abstract:

Identifying 20th-century periodic coastal surge variation is strategic for 21st-century coastal surge estimates, as surge periodicities may amplify or reduce future MSL-enhanced surge forecasts. Extreme coastal surge data from Belfast Harbour (UK) tide gauges are available for 1901–2010 and provide the potential for decadal-plus periodic coastal surge analysis. Annual extreme surge-elevation distributions (sampled every 10 min) are analysed using PCA and cluster analysis to decompose variation within- and between-years, to assess the similarity of years in terms of Surge Climate Types, and to establish the significance of any transitions in Type occurrence over time using non-parametric Markov analysis. Annual extreme surge variation is shown to be periodically organised across the 20th century. Extreme surge magnitude and distribution show a number of significant cyclonically induced multi-annual (2, 3, 5 and 6 year) cycles, as well as dominant multi-decadal (15–25 year) cycles of variation superimposed on an 80-year fluctuation in atmospheric-oceanic variation across the North Atlantic (relative to NAO/AMO interaction). The top 30 extreme surge events show some relationship with the NAO per se, given that 80% are associated with westerly dominant atmospheric flows (+NAO), but 20% of the events are associated with blocking air masses (−NAO). Although 20% of the top 30 ranked positive surges occurred within the last twenty years, there is no unequivocal evidence of recent acceleration in extreme surge magnitude beyond the scale of natural periodic variation.
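A minimal sketch (on simulated annual surge-elevation distributions) of the decomposition step described: PCA on per-year distributions, clustering of years into Surge Climate Types, and a year-to-year transition count matrix as input to a Markov-style analysis. The data, the number of types, and the use of k-means are assumptions.

```python
# Sketch (simulated data): reduce annual extreme-surge distributions with PCA,
# cluster years into "Surge Climate Types", and tabulate year-to-year transitions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n_years, n_bins = 110, 20                    # e.g. 1901-2010, binned surge elevations
X = rng.gamma(shape=2.0, scale=1.0, size=(n_years, n_bins))  # stand-in distributions

scores = PCA(n_components=3).fit_transform(X)            # main modes of variation
types = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)

# Transition counts between consecutive years' types (basis of a Markov test).
transitions = np.zeros((4, 4), dtype=int)
for a, b in zip(types[:-1], types[1:]):
    transitions[a, b] += 1
print(transitions)
```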

Abstract:

The application of custom classification techniques and posterior probability modeling (PPM) using Worldview-2 multispectral imagery to archaeological field survey is presented in this paper. Research is focused on the identification of Neolithic felsite stone tool workshops in the North Mavine region of the Shetland Islands in Northern Scotland. Sample data from known workshops surveyed using differential GPS are used alongside known non-sites to train a linear discriminant analysis (LDA) classifier based on a combination of datasets including Worldview-2 bands, band difference ratios (BDR) and topographical derivatives. Principal components analysis is further used to test and reduce dimensionality caused by redundant datasets. Probability models were generated by LDA using principal components and tested with sites identified through geological field survey. Testing shows the prospective ability of this technique and significance between 0.05 and 0.01, and gain statistics between 0.90 and 0.94, higher than those obtained using maximum likelihood and random forest classifiers. Results suggest that this approach is best suited to relatively homogenous site types, and performs better with correlated data sources. Finally, by combining posterior probability models and least-cost analysis, a survey least-cost efficacy model is generated showing the utility of such approaches to archaeological field survey.
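A minimal sketch (on random stand-in features) of the classification chain described: principal components to reduce correlated predictors, then an LDA classifier whose class posterior probabilities can be mapped as a probability surface. The feature counts, labels, and data are invented.

```python
# Sketch (random stand-in features): PCA to reduce correlated predictors, then
# LDA to classify site vs. non-site cells and output posterior probabilities.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n_cells, n_features = 400, 12                # e.g. WV-2 bands + BDRs + terrain derivatives
X = rng.normal(size=(n_cells, n_features))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_cells) > 0).astype(int)

model = make_pipeline(PCA(n_components=5), LinearDiscriminantAnalysis())
model.fit(X, y)

# Posterior probability of the "workshop" class for each cell.
posterior = model.predict_proba(X)[:, 1]
print(posterior[:5].round(3))
```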

Abstract:

Objectives This paper describes the methods used in the International Cancer Benchmarking Partnership Module 4 Survey (ICBPM4), which examines time intervals and routes to cancer diagnosis in 10 jurisdictions. We present the study design, covering the definition and measurement of time intervals, identification of patients with cancer, questionnaire development, data management and analyses.
Design and setting Recruitment of participants to the ICBPM4 survey is based on cancer registries in each jurisdiction. Questionnaires draw on previous instruments and have been through a process of cognitive testing and piloting in three jurisdictions followed by standardised translation and adaptation. Data analysis focuses on comparing differences in time intervals and routes to diagnosis in the jurisdictions.
Participants Our target is 200 patients with symptomatic breast, lung, colorectal and ovarian cancer in each jurisdiction. Patients are approached directly or via their primary care physician (PCP). Patients’ PCPs and cancer treatment specialists (CTSs) are surveyed, and ‘data rules’ are applied to combine and reconcile conflicting information. Where CTS information is unavailable, audit information is sought from treatment records and databases.
Main outcomes Reliability testing of the patient questionnaire showed that agreement was complete (κ=1) in four items and substantial (κ=0.8, 95% CI 0.333 to 1) in one item. The identification of eligible patients is sufficient to meet the targets for breast, lung and colorectal cancer. Initial patient and PCP survey response rates from the UK and Sweden are comparable with similar published surveys. Data collection was completed in early 2016 for all cancer types.
Conclusion An international questionnaire-based survey of patients with cancer, PCPs and CTSs has been developed and launched in 10 jurisdictions. ICBPM4 will help to further understand international differences in cancer survival by comparing time intervals and routes to cancer diagnosis.
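A minimal sketch (with invented test-retest answers) of Cohen's kappa, the chance-corrected agreement statistic reported for the questionnaire reliability testing above.

```python
# Sketch (invented ratings): Cohen's kappa, the chance-corrected agreement
# statistic used in the questionnaire reliability testing reported above.
from sklearn.metrics import cohen_kappa_score

test_answers   = ["yes", "yes", "no", "no", "yes", "no"]
retest_answers = ["yes", "yes", "no", "yes", "yes", "no"]

print(round(cohen_kappa_score(test_answers, retest_answers), 2))
```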

Abstract:

Innovation is a strategic necessity for the survival of today’s organizations. The wide recognition of innovation as a competitive necessity, particularly in dynamic market environments, makes it an evergreen domain for research. This dissertation deals with innovation in small Information Technology (IT) firms in India. The IT industry in India has been a phenomenal success story of the last three decades, and is today facing a crucial phase in its history characterized by the need for fundamental changes in strategies, driven by innovation. This study, while motivated by the dynamics of changing times, importantly addresses the research gap on small-firm innovation in Indian IT. The study addresses three main objectives: (a) drivers of innovation in small IT firms in India, (b) the impact of innovation on firm performance, and (c) variation in the extent of innovation adoption in small firms. Product and process innovation were identified as the two most contextually relevant types of innovation for small IT firms. The antecedents of innovation were identified as Intellectual Capital, Creative Capability, Top Management Support, Organization Learning Capability, Customer Involvement, External Networking and Employee Involvement. The survey method was adopted for data collection, and the study unit was the firm. Surveys were conducted in 2014 across five South Indian cities. A small firm was defined as one with 10-499 employees. Responses from 205 firms were chosen for analysis. Rigorous statistical analysis was done to generate meaningful insights. The set of drivers of product innovation (Intellectual Capital, Creative Capability, Top Management Support, Customer Involvement, External Networking, and Employee Involvement) was different from that of process innovation (Creative Capability, Organization Learning Capability, External Networking, and Employee Involvement). Both product and process innovation had a strong impact on firm performance. Firms that adopted a combination of product and process innovation had the highest levels of firm performance. Product and process innovation fully mediated the relationship between all seven antecedents and firm performance. The results of this study have several important theoretical and practical implications. To the best of the researcher’s knowledge, this is the first time that an empirical study of firm-level innovation of this kind has been undertaken in India. A measurement model for product and process innovation was developed, and the drivers of innovation were established statistically. Customer Involvement, External Networking and Employee Involvement are elements of Open Innovation; all three had a strong association with product innovation, and the latter two had a strong association with process innovation. The results showed that the proclivity for Open Innovation is healthy in the Indian context. Practical implications are outlined regarding how firms can organize themselves for innovation, the human talent for innovation, the right culture for innovation, and open innovation. While some specific examples of possible future studies have been recommended, the researcher believes that the study provides numerous opportunities to further this line of enquiry.

Abstract:

The generation of heterogeneous big data sources with ever-increasing volumes, velocities and veracities over the last few years has inspired the data science and research community to address the challenge of extracting knowledge from big data. Such a wealth of generated data across the board can be intelligently exploited to advance our knowledge about our environment, public health, critical infrastructure and security. In recent years we have developed generic approaches to process such big data at multiple levels to advance decision support. These specifically concern data processing with semantic harmonisation, low-level fusion, analytics, and knowledge modelling with high-level fusion and reasoning. Such approaches will be introduced and presented in the context of the TRIDEC project results on critical oil and gas industry drilling operations and also the ongoing large eVacuate project on critical crowd behaviour detection in confined spaces.

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-08