872 results for heterogeneous data sources
Abstract:
Site 1103 was one of a transect of three sites drilled across the Antarctic Peninsula continental shelf during Leg 178. The aim of drilling on the shelf was to determine the age of the sedimentary sequences and to ground-truth previous interpretations of the depositional environment (i.e., topsets and foresets) of progradational seismostratigraphic sequences S1, S2, S3, and S4. The ultimate objective was to obtain a better understanding of the history of glacial advances and retreats on this west Antarctic margin. Drilling the topsets of the progradational wedge (0-247 m below seafloor [mbsf]), which consist of unsorted and unconsolidated materials of seismic Unit S1, was very unfavorable, resulting in very low (2.3%) core recovery. Recovery improved (34%) below 247 mbsf, corresponding to sediments of seismic Unit S3, which have a consolidated matrix. Logs were only obtained from the interval between 75 and 244 mbsf, and inconsistencies in the automatic analog picking of the signals received from the sonic log at the array and at the two other receivers prevented accurate shipboard time-depth conversions. This, in turn, limited the capacity for making seismic stratigraphic interpretations at this site and regionally. This study is an attempt to compile all available data sources, perform quality checks, and introduce nonstandard processing techniques for the logging data obtained, in order to arrive at a reliable and continuous depth-vs.-velocity profile. We defined 13 data categories using differential traveltime information. Polynomial exclusion techniques of various orders and low-pass filtering reduced the noise of the initial data pool and produced a definitive velocity-depth profile that is synchronous with the resistivity logging data. A comparison of the velocity profile produced with various other logs of Site 1103 further validates the presented data. All major logging units are expressed within the new velocity data.
A depth-migrated section with the new velocity data is presented together with the original time section and initial depth estimates published within the Leg 178 Initial Reports volume. The presented data confirms the location of the shelf unconformity at 222 ms two-way traveltime (TWT), or 243 mbsf, and allows its seismic identification as a strong negative and subsequent positive reflection.
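As a sketch of the time-depth conversion that such a velocity-depth profile enables (all depths and interval velocities below are invented for illustration, not the Site 1103 values), the cumulative two-way traveltime is obtained by summing 2·dz/v over the logged interval:

```python
import numpy as np

# Hypothetical depth/velocity pairs (mbsf, m/s); the real profile comes
# from the reprocessed sonic-log data described in the text.
depth = np.array([75.0, 120.0, 180.0, 243.0])          # metres below seafloor
velocity = np.array([1800.0, 2000.0, 2200.0, 2400.0])  # interval velocity, m/s

# Two-way traveltime accumulated interval by interval: t = 2 * dz / v
dz = np.diff(depth, prepend=0.0)
twt_ms = np.cumsum(2.0 * dz / velocity) * 1000.0       # milliseconds
```

With a profile like this, any reflector picked in two-way traveltime can be mapped to a depth in mbsf by interpolation, which is what allows the shelf unconformity to be placed at a depth in metres.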
Abstract:
The creation of Causal Loop Diagrams (CLDs) is a major phase in the System Dynamics (SD) life-cycle, since the created CLDs express dependencies and feedback in the system under study, as well as guide modellers in building meaningful simulation models. The creation of CLDs is still subject to the modeller's domain expertise (mental model) and her ability to abstract the system, because of the strong dependency on semantic knowledge. Since the beginning of SD, available system data sources (written and numerical models) have always been sparse, very limited, and imperfect, and thus of little benefit to the whole modelling process. However, in recent years we have seen an explosion in generated data, especially in all business-related domains that are analysed via Business Dynamics (BD). In this paper, we introduce a systematic, tool-supported CLD creation approach, which analyses and utilises available disparate data sources within the business domain. We demonstrate the application of our methodology on a given business use case and evaluate the resulting CLD. Finally, we propose directions for future research to further push the automation of CLD creation and increase confidence in the generated CLDs.
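At its core, a CLD is a signed directed graph, and a feedback loop's polarity is the product of its link signs; a minimal sketch (the variable names are illustrative, not from the paper's use case):

```python
# A CLD as a signed digraph: +1 = reinforcing link, -1 = balancing link
# (variable names are invented for illustration)
edges = {
    ("inventory", "orders"): -1,
    ("orders", "production"): +1,
    ("production", "inventory"): +1,
}

def loop_polarity(cycle):
    """Product of link signs around a cycle: +1 -> reinforcing (R), -1 -> balancing (B)."""
    sign = 1
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        sign *= edges[(a, b)]
    return "R" if sign > 0 else "B"

polarity = loop_polarity(["inventory", "orders", "production"])  # "B"
```

Automating CLD creation amounts to inferring such signed edges from the available data sources, after which loop detection and polarity classification are purely mechanical.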
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-08
Abstract:
We propose three research problems to explore the relations between trust and security in the setting of distributed computation. In the first problem, we study trust-based adversary detection in distributed consensus computation. The adversaries we consider behave arbitrarily, disobeying the consensus protocol. We propose a trust-based consensus algorithm with local and global trust evaluations. The algorithm can be abstracted using a two-layer structure, with the top layer running a trust-based consensus algorithm and the bottom layer as a subroutine executing a global trust update scheme. We utilize a set of pre-trusted nodes, headers, to propagate local trust opinions throughout the network. This two-layer framework is flexible in that it can be easily extended to incorporate more complicated decision rules and global trust schemes. The first problem assumes that normal nodes are homogeneous, i.e., it is guaranteed that a normal node always behaves as it is programmed. In the second and third problems, however, we assume that nodes are heterogeneous, i.e., given a task, the probability that a node generates a correct answer varies from node to node. The adversaries considered in these two problems are workers from the open crowd who either invest little effort in the tasks assigned to them or intentionally give wrong answers to questions. In the second part of the thesis, we consider a typical crowdsourcing task that aggregates input from multiple workers as a problem in information fusion. To cope with the issue of noisy and sometimes malicious input from workers, trust is used to model workers' expertise. In a multi-domain knowledge learning task, however, using scalar-valued trust to model a worker's performance is not sufficient to reflect the worker's trustworthiness in each of the domains.
To address this issue, we propose a probabilistic model to jointly infer multi-dimensional trust of workers, multi-domain properties of questions, and true labels of questions. Our model is very flexible and extensible to incorporate metadata associated with questions. To show that, we further propose two extended models, one of which handles input tasks with real-valued features and the other handles tasks with text features by incorporating topic models. Our models can effectively recover trust vectors of workers, which can be very useful for future task assignment that adapts to workers' trust. These results can be applied to the fusion of information from multiple data sources, such as sensors, human input, machine learning results, or a hybrid of them. In the second subproblem, we address crowdsourcing with adversaries under logical constraints. We observe that questions are often not independent in real-life applications. Instead, there are logical relations between them. Similarly, workers that provide answers are not independent of each other either. Answers given by workers with similar attributes tend to be correlated. Therefore, we propose a novel unified graphical model consisting of two layers. The top layer encodes domain knowledge, which allows users to express logical relations using first-order logic rules, and the bottom layer encodes a traditional crowdsourcing graphical model. Our model can be seen as a generalized probabilistic soft logic framework that encodes both logical relations and probabilistic dependencies. To solve the collective inference problem efficiently, we have devised a scalable joint inference algorithm based on the alternating direction method of multipliers. The third part of the thesis considers the problem of optimal assignment under budget constraints when workers are unreliable and sometimes malicious. In a real crowdsourcing market, each answer obtained from a worker incurs cost.
The cost is associated with both the level of trustworthiness of workers and the difficulty of tasks. Typically, access to expert-level (more trustworthy) workers is more expensive than to the average crowd, and completion of a challenging task is more costly than a click-away question. Here, we address the problem of optimal assignment of heterogeneous tasks to workers of varying trust levels under budget constraints. Specifically, we design a trust-aware task allocation algorithm that takes as inputs the estimated trust of workers and a pre-set budget, and outputs the optimal assignment of tasks to workers. We derive a bound on the total error probability that naturally relates budget, trustworthiness of crowds, and the costs of obtaining labels from crowds. A higher budget, more trustworthy crowds, and less costly jobs result in a lower theoretical bound. Our allocation scheme does not depend on the specific design of the trust evaluation component; therefore, it can be combined with generic trust evaluation algorithms.
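A minimal sketch of trust-weighted answer aggregation, the building block behind the information-fusion view described above (worker names and trust scores are invented; the thesis's actual model jointly infers multi-dimensional trust rather than using fixed scalars):

```python
from collections import defaultdict

def aggregate(answers, trust, default_trust=0.5):
    """Pick the label with the largest total trust mass behind it."""
    scores = defaultdict(float)
    for worker, label in answers.items():
        scores[label] += trust.get(worker, default_trust)
    return max(scores, key=scores.get)

answers = {"w1": "A", "w2": "B", "w3": "A"}
trust = {"w1": 0.9, "w2": 0.4, "w3": 0.7}
label = aggregate(answers, trust)  # "A" wins with weight 1.6 vs 0.4
```

A single highly trusted worker can outweigh several untrusted ones, which is exactly why trust estimation and aggregation have to be solved jointly when adversaries are present.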
Abstract:
Choosing a single similarity threshold for cutting dendrograms is not sufficient for performing hierarchical clustering analysis of heterogeneous data sets. In addition, alternative automated or semi-automated methods that cut dendrograms at multiple levels make assumptions about the data in hand. In an attempt to help the user find patterns in the data and resolve ambiguities in cluster assignments, we developed MLCut: a tool that provides visual support for exploring dendrograms of heterogeneous data sets at different levels of detail. The interactive exploration of the dendrogram is coordinated with a representation of the original data, shown as parallel coordinates. The tool supports three analysis steps. Firstly, a single-height similarity threshold can be applied using a dynamic slider to identify the main clusters. Secondly, a distinctiveness threshold can be applied using a second dynamic slider to identify “weak edges” that indicate heterogeneity within clusters. Thirdly, the user can drill down to further explore the dendrogram structure, always in relation to the original data, and cut the branches of the tree at multiple levels. Interactive drill-down is supported using mouse events such as hovering, pointing, and clicking on elements of the dendrogram. Two prototypes of this tool have been developed in collaboration with a group of biologists for analysing their own data sets. We found that enabling users to cut the tree at multiple levels, while viewing the effect in the original data, is a promising clustering method that could lead to scientific discoveries.
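The single-height cut applied by the first slider corresponds to the standard dendrogram cut at a fixed distance threshold, sketched here on synthetic data (MLCut's interactive multi-level cutting and parallel-coordinates view are not reproduced):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 10 points each
data = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
                  rng.normal(5.0, 0.3, (10, 2))])

Z = linkage(data, method="average")
# Cut every branch at the same height, like a single-threshold slider
labels = fcluster(Z, t=2.0, criterion="distance")
```

A heterogeneous data set is precisely the case where no single value of `t` separates all the structure, which motivates cutting different branches at different heights.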
Abstract:
The volume of data in libraries has grown enormously in recent years, as has the complexity of their information sources and formats, making management and access difficult, especially as support for decision making. Since good library management involves the integration of strategic indicators, the implementation of a Data Warehouse (DW) that adequately manages such a quantity of information, as well as its complex mix of data sources, becomes an interesting alternative to consider. This article describes the design and implementation of a decision support system (DSS) based on DW techniques for the library of the Universidad de Cuenca. To this end, the study uses the holistic methodology proposed by Siguenza-Guzman et al. (2014) for the comprehensive evaluation of libraries. This methodology evaluates the collection and the services, incorporating important elements of library management such as service performance, quality control, collection use, and user interaction. Based on this analysis, a DW architecture is proposed that integrates, processes, and stores the data. Finally, the stored data are analysed and visualised through online analytical processing (OLAP) tools. Initial implementation tests confirm the feasibility and effectiveness of the proposed approach, successfully integrating multiple heterogeneous data sources and formats, enabling library directors to generate customised reports, and even allowing the daily transactional processes to mature.
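The kind of OLAP roll-up such a library DW supports can be sketched with a toy star schema (table and column names are invented for illustration, not the Cuenca schema):

```python
import sqlite3

# Minimal star schema: a fact table of loan counts joined to a date
# dimension, aggregated the way an OLAP cube would roll up by month.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_loans (date_id INTEGER, item_id INTEGER, loans INTEGER);
INSERT INTO dim_date VALUES (1, 2017, 1), (2, 2017, 2);
INSERT INTO fact_loans VALUES (1, 10, 3), (1, 11, 2), (2, 10, 4);
""")
rows = con.execute("""
SELECT d.year, d.month, SUM(f.loans) AS total_loans
FROM fact_loans f JOIN dim_date d USING (date_id)
GROUP BY d.year, d.month
ORDER BY d.year, d.month
""").fetchall()
# rows -> [(2017, 1, 5), (2017, 2, 4)]
```

Separating facts from dimensions is what lets directors slice the same usage data by time, collection, or service without touching the transactional systems.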
Abstract:
Background Writing therapy to improve physical or mental health can take many forms. The most researched model of therapeutic writing (TW) is unfacilitated, individual expressive writing (written emotional disclosure). Facilitated writing activities are less widely researched. Data sources Databases including MEDLINE, EMBASE, PsycINFO, Linguistics and Language Behavior Abstracts, AMED, and CINAHL were searched from inception to March 2013. Review methods Four TW practitioners provided expert advice. Study procedures were conducted by one reviewer and checked by a second. Randomised controlled trials (RCTs) and non-randomised comparative studies were included. Quality was appraised using the Cochrane risk-of-bias tool. Unfacilitated and facilitated TW studies were analysed separately under ICD-10 chapter headings. Meta-analyses were performed where possible using RevMan 5.2. Costs were estimated from an NHS perspective and three cost-consequence case studies were prepared. Realist synthesis followed RAMESES guidelines. Objectives To review the clinical and cost-effectiveness of TW for people with long-term health conditions (LTCs) compared to no writing, or other controls, reporting any relevant clinical outcomes; and to conduct a realist synthesis to understand how TW might work, and for whom. Results From 14,658 unique citations, 284 full-text papers were reviewed and 64 studies (58 RCTs) were included in the final effectiveness reviews. Five studies examined facilitated TW; these were extremely heterogeneous, with unclear or high risk of bias, but suggested that facilitated TW interventions may be beneficial in individual LTCs. Unfacilitated expressive writing was examined in 59 studies of variable, or unreported, quality.
Overall there was very little or no evidence of any benefit reported in the following conditions (number of studies): HIV (six); breast cancer (eight); gynaecological and genitourinary cancers (five); mental health (five); asthma (four); psoriasis (three); chronic pain (four). In inflammatory arthropathies (six) there was a reduction in disease severity (n = 191, standardised mean difference (SMD) -0.61, 95% confidence interval (CI) -0.96 to -0.26) in the short term on meta-analysis of four studies. For all other LTCs there was either no, or sparse, data with no, or inconsistent, evidence of benefit. Meta-analyses conducted across all the LTCs provided no evidence that unfacilitated expressive writing had any effect on depression at short-term (n = 1,563, SMD -0.06, 95% CI -0.29 to 0.17, substantial heterogeneity) or long-term (n = 778, SMD -0.04, 95% CI -0.18 to 0.10, little heterogeneity) follow-up, or on anxiety, physiological, or biomarker-based outcomes. One study reported costs, none reported cost-effectiveness, and twelve reported resource use; meta-analysis suggested reduced medication use but no impact on health centre visits. Estimated costs of the intervention were low, but there was insufficient evidence to judge cost-effectiveness. Realist review findings suggested that facilitated TW is a complex intervention and that group interaction contributes to the perception of benefit. It was unclear from the available data who might benefit most from facilitated TW. Limitations Difficulties with developing realist review programme theory meant that the mechanisms operating during TW remain obscure. Conclusions Overall there is little evidence to support the effectiveness or cost-effectiveness of unfacilitated expressive writing interventions in people with LTCs. Further research focussed on facilitated TW in people with LTCs could be informative.
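The pooled effects quoted above come from inverse-variance meta-analysis; a fixed-effect version can be sketched as follows (the per-study SMDs and standard errors below are made up, not the review's data):

```python
import math

def pooled_smd(smds, ses):
    """Fixed-effect inverse-variance pooling of standardised mean differences."""
    weights = [1.0 / se ** 2 for se in ses]          # precision weights
    est = sum(w * d for w, d in zip(weights, smds)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return est, (est - 1.96 * se, est + 1.96 * se)   # estimate, 95% CI

# Illustrative per-study effects (negative SMD = reduced disease severity)
est, (lo, hi) = pooled_smd([-0.5, -0.7, -0.6, -0.6], [0.2, 0.25, 0.3, 0.22])
```

Precise studies (small standard errors) dominate the pooled estimate; a random-effects model, as used when heterogeneity is substantial, would additionally widen the weights by a between-study variance term.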
Abstract:
Survival models are widely applied in the engineering field to model time-to-event data, where censoring is a common issue. Whether parametric models are used or not, in the case of heterogeneous data they may not provide a good fit. The present study relies on survival data for critical pumps, where traditional parametric regression can be improved in order to obtain better approximations. Considering censored data, and using an empirical method to split the data into two subgroups so that separate models can be fitted, we mixed two distinct distributions following a mixture-model approach. We conclude that this is a good method for fitting data that does not follow a single parametric distribution and for obtaining reliable parameters. A constant cumulative hazard rate policy was also used to determine optimum inspection times from the fitted mixture model, which could be an advantage when compared with the current maintenance policies in deciding whether changes should be introduced.
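A two-component mixture of parametric survival distributions, the kind of model described above, can be sketched as follows (the Weibull components and all parameter values are illustrative assumptions, not the pump data):

```python
import math

def weibull_sf(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))

def mixture_sf(t, p, params1, params2):
    """Two-component mixture survival: S(t) = p*S1(t) + (1-p)*S2(t)."""
    return p * weibull_sf(t, *params1) + (1 - p) * weibull_sf(t, *params2)

# Hypothetical subgroups: an early-failure mode (shape < 1) mixed with
# a wear-out mode (shape > 1), weighted by the empirical split proportion
s = mixture_sf(1000.0, 0.3, (0.8, 500.0), (3.0, 5000.0))
```

Once fitted, inspection times under a constant-cumulative-hazard policy can be read off by solving H(t) = -ln S(t) for equal hazard increments.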
Abstract:
This work is part of a project promoted by Emilia-Romagna that aims at encouraging research activities in order to support the innovation strategies of the regional economic system through the exploitation of new data sources. To this end, a database containing administrative data is provided by the Municipality of Bologna. This is achieved by linking data from the Register Office of the Municipality with fiscal data coming from the tax returns submitted to the Revenue Agency and released by the Ministry of Economy and Finance for the period 2002-2017. The main purpose of the project is the analysis of the medium-term financial and distributional trends of the income of citizens residing in the Municipality of Bologna. Exploiting this innovative source of data allows us to analyse the dynamics of income at the municipal level, overcoming the lack of information in official survey-based statistics. We investigate these trends by building inequality indicators and by examining the persistence of in-work poverty. Our results represent an important informative element to improve the effectiveness and equity of welfare policies at the local level, and to guide the distribution of economic and social support and urban redevelopment interventions across different areas of the Municipality.
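One standard inequality indicator of the kind mentioned, the Gini coefficient, can be sketched as follows (the incomes are toy values, not the Bologna tax data):

```python
def gini(incomes):
    """Gini coefficient via the sorted-rank formula:
    G = sum_i (2i - n - 1) x_(i) / (n * sum x)."""
    xs = sorted(incomes)
    n = len(xs)
    num = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return num / (n * sum(xs))

g = gini([10.0, 10.0, 10.0, 10.0])  # perfect equality -> 0.0
g2 = gini([0.0, 0.0, 0.0, 40.0])    # all income to one person -> 0.75
```

Computing such indicators per year and per city district on the linked register-tax data is what reveals the medium-term distributional trends the project targets.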
Abstract:
With the CERN LHC program underway, there has been an acceleration of data growth in the High Energy Physics (HEP) field, and the usage of Machine Learning (ML) in HEP will be critical during the HL-LHC program, when the data produced will reach the exascale. ML techniques have been successfully used in many areas of HEP; nevertheless, the development of a ML project and its implementation for production use is a highly time-consuming task and requires specific skills. Complicating this scenario is the fact that HEP data is stored in the ROOT data format, which is mostly unknown outside of the HEP community. The work presented in this thesis is focused on the development of a ML as a Service (MLaaS) solution for HEP, aiming to provide a cloud service that allows HEP users to run ML pipelines via HTTP calls. These pipelines are executed by using the MLaaS4HEP framework, which allows reading data, processing data, and training ML models directly using ROOT files of arbitrary size from local or distributed data sources. Such a solution provides HEP users who are not ML experts with a tool that allows them to apply ML techniques in their analyses in a streamlined manner. Over the years the MLaaS4HEP framework has been developed, validated, and tested, and new features have been added. A first MLaaS solution was developed by automating the deployment of a platform equipped with the MLaaS4HEP framework. Then, a service with APIs was developed, so that a user, after being authenticated and authorized, can submit MLaaS4HEP workflows, producing trained ML models ready for the inference phase. A working prototype of this service is currently running on a virtual machine of INFN-Cloud and meets the requirements to be added to the INFN Cloud portfolio of services.
Abstract:
The pervasive availability of connected devices in every industrial and societal sector is pushing for an evolution of the well-established cloud computing model. The emerging paradigm of the cloud continuum embraces this decentralization trend and envisions virtualized computing resources physically located between traditional datacenters and data sources. By executing totally or partially closer to the network edge, applications can react more quickly to events, thus enabling advanced forms of automation and intelligence. However, these applications also induce new data-intensive workloads with low-latency constraints that require the adoption of specialized resources, such as high-performance communication options (e.g., RDMA, DPDK, XDP, etc.). Unfortunately, cloud providers still struggle to integrate these options into their infrastructures. That risks undermining the principle of generality that underlies the economy of scale of cloud computing, by forcing developers to tailor their code to low-level APIs, non-standard programming models, and static execution environments. This thesis proposes a novel system architecture to empower cloud platforms across the whole cloud continuum with Network Acceleration as a Service (NAaaS). To provide commodity yet efficient access to acceleration, this architecture defines a layer of agnostic high-performance I/O APIs, exposed to applications and clearly separated from the heterogeneous protocols, interfaces, and hardware devices that implement it. A novel system component embodies this decoupling by offering a set of agnostic OS features to applications: memory management for zero-copy transfers, asynchronous I/O processing, and efficient packet scheduling. This thesis also explores the design space of possible implementations of this architecture by proposing two reference middleware systems and by adopting them to support interactive use cases in the cloud continuum: a serverless platform and an Industry 4.0 scenario.
A detailed discussion and a thorough performance evaluation demonstrate that the proposed architecture is suitable to enable the easy-to-use, flexible integration of modern network acceleration into next-generation cloud platforms.
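One of the agnostic OS features mentioned, memory management for zero-copy transfers, can be illustrated in miniature (Python's `memoryview` stands in for the hardware-managed buffers that RDMA or DPDK paths would use; this is a conceptual sketch, not the thesis's middleware):

```python
# A memoryview exposes slices of a buffer without copying bytes,
# the same principle behind zero-copy I/O paths such as RDMA or DPDK.
buf = bytearray(b"header--payload-bytes")
view = memoryview(buf)

header, payload = view[:8], view[8:]
# No bytes were copied: a mutation of the underlying buffer is
# immediately visible through both views.
buf[0:2] = b"HE"
```

Keeping consumers on views of a single buffer, rather than handing each one a copy, is what removes per-packet memory traffic from the latency-critical path.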
Abstract:
The increasing number of extreme rainfall events, combined with high population density and the imperviousness of the land surface, makes urban areas particularly vulnerable to pluvial flooding. In order to design and manage cities able to deal with this issue, the reconstruction of weather phenomena is essential. Among the most interesting data sources showing great potential are the observational networks of private sensors managed by citizens (crowdsourcing). The number of these personal weather stations is consistently increasing, and their spatial distribution roughly follows population density. Precisely for this reason, they perfectly suit this detailed study on the modelling of pluvial flooding in urban environments. The uncertainty associated with these measurements of precipitation is still a matter of research. In order to characterise the accuracy and precision of the crowdsourced data, we carried out exploratory data analyses. A comparison between Netatmo hourly precipitation amounts and observations of the same quantity from weather stations managed by national weather services is presented. The crowdsourced stations have very good skill in rain detection but tend to underestimate the reference value. In detail, the accuracy and precision of crowdsourced data change as precipitation increases, with the spread improving towards the extreme values. Then, the ability of this kind of observation to improve the prediction of pluvial flooding is tested. To this aim, the simplified raster-based inundation model incorporated in the Saferplaces web platform is used for simulating pluvial flooding. Different precipitation fields have been produced and tested as input to the model. Two case studies are analysed over the most densely populated Norwegian city, Oslo. The crowdsourced weather station observations, bias-corrected (i.e. increased by 25%), showed very good skill in detecting flooded areas.
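The rain-detection skill and the 25% bias correction mentioned above can be sketched with standard contingency-table scores (the hourly series below are toy values, not the Netatmo data):

```python
def detection_skill(crowd, reference, threshold=0.1, bias_factor=1.25):
    """Hit rate (POD) and false-alarm ratio (FAR) for rain detection,
    after multiplying crowdsourced amounts by a bias-correction factor."""
    hits = misses = false_alarms = 0
    for c, r in zip(crowd, reference):
        c_corr = c * bias_factor
        if r >= threshold and c_corr >= threshold:
            hits += 1
        elif r >= threshold:
            misses += 1
        elif c_corr >= threshold:
            false_alarms += 1
    pod = hits / (hits + misses) if hits + misses else float("nan")
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else 0.0
    return pod, far

# Toy hourly precipitation series (mm): crowdsourced vs reference gauge
pod, far = detection_skill([0.0, 0.3, 1.2, 0.05], [0.0, 0.4, 1.5, 0.0])
```

Scores like POD and FAR make the trade-off explicit: a multiplicative correction raises detected amounts toward the reference without changing where rain was sensed.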
Abstract:
There are many natural events that can negatively affect the urban ecosystem, but weather-climate variations are certainly among the most significant. The history of settlements has been characterized by extreme events like earthquakes and floods, which repeat themselves at different times, causing extensive damage to the built heritage on a structural and urban scale. Changes in climate also alter various climatic subsystems, changing rainfall regimes and hydrological cycles and increasing the frequency and intensity of extreme precipitation events (heavy rainfall). From a hydrological-risk perspective, it is crucial to understand the future events that could occur, and their magnitude, in order to design safer infrastructures. Unfortunately, it is not easy to anticipate future scenarios, as the complexity of the climate is enormous. For this thesis, precipitation and discharge extremes were primarily used as data sources. It is important to underline that the two data sets are not independent: changes in the rainfall regime due to climate change could significantly affect overflows into receiving water bodies. It is imperative that we understand and model climate change effects on water structures to support the development of adaptation strategies. The main purpose of this thesis is to identify suitable water structures for a road located along the Tione River. Therefore, through a hydrological analysis of the area, we aim to guarantee the safety of the infrastructure over time. The observations made are intended to underline how models such as a stochastic one can improve the quality of an analysis for design purposes and influence design choices.
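The kind of extreme-value analysis such design work relies on can be sketched with a Gumbel fit to annual maxima (the discharge series and the method-of-moments fit below are illustrative, not the Tione River data):

```python
import math
import statistics

# Toy annual-maximum discharge series (m^3/s)
annual_max = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 70.0, 44.0, 58.0, 49.0]

# Gumbel fit by the method of moments
mean = statistics.mean(annual_max)
std = statistics.stdev(annual_max)
beta = std * math.sqrt(6) / math.pi   # scale parameter
mu = mean - 0.5772 * beta             # location (Euler-Mascheroni constant)

def return_level(T):
    """Discharge exceeded on average once every T years."""
    return mu - beta * math.log(-math.log(1.0 - 1.0 / T))

q100 = return_level(100.0)  # design discharge for a 100-year event
```

A design discharge chosen this way is exactly the quantity that sizes culverts and bridge openings; a stochastic rainfall-runoff model refines it by propagating uncertainty instead of using a single fitted curve.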
Abstract:
Studies in several countries have shown the occurrence of forest transition, when the increase in forest cover overcomes the loss by deforestation. In Brazil, although deforestation is still higher than afforestation, this relationship may be inverted in some regions. Recent assessments suggest a tendency of the state of São Paulo towards forest transition. Aiming to analyze the evidence of forest transition and facilitate the use of existing information, we review data on native vegetation cover variation in São Paulo from four data sources (Instituto Florestal, SOS Mata Atlântica/INPE, IBGE, and CATI/IEA). Our results indicate that discrepancies among these assessments may be accounted for by differences in methodologies and objectives. We highlight their common grounds and discuss possibilities to harmonize their information.
Abstract:
Universidade Estadual de Campinas. Faculdade de Educação Física