899 results for Data distribution
Abstract:
The problem of learning from imbalanced data is of critical importance in a large number of application domains and can be a bottleneck in the performance of conventional learning methods that assume a balanced data distribution. The class imbalance problem arises when one class massively outnumbers the other. If the imbalanced data is used directly, the disparity between majority and minority classes biases the learner and produces unreliable outcomes. There has been increasing interest in this research area and a number of algorithms have been developed. However, independent evaluation of the algorithms is limited. This paper evaluates the performance of five representative data sampling methods that deal with class imbalance problems, namely SMOTE, ADASYN, BorderlineSMOTE, SMOTETomek and RUSBoost. A comparative study is conducted and the performance of each method is critically analysed in terms of assessment metrics. © 2013 Springer-Verlag.
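As a rough illustration of the oversampling idea behind SMOTE-style methods, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. It is a minimal sketch under stated assumptions, not the evaluated implementations: the function name `smote_like_oversample`, the parameter `k`, and the toy data are all hypothetical.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        synthetic.append(tuple(b + gap * (t - b) for b, t in zip(base, target)))
    return synthetic

# Hypothetical two-dimensional minority-class samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.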
Abstract:
The International Molecular Exchange (IMEx) consortium is an international collaboration between major public interaction data providers to share literature-curation efforts and make a nonredundant set of protein interactions available in a single search interface on a common website (http://www.imexconsortium.org/). Common curation rules have been developed, and a central registry is used to manage the selection of articles to enter into the dataset. We discuss the advantages of such a service to the user, our quality-control measures and our data-distribution practices.
Abstract:
Background: This paper addresses the prediction of the free energy of binding of a drug candidate with the enzyme InhA associated with Mycobacterium tuberculosis. This problem arises in rational drug design, where interactions between drug candidates and target proteins are verified through molecular docking simulations. In this application, it is important not only to correctly predict the free energy of binding, but also to provide a comprehensible model that can be validated by a domain specialist. Decision-tree induction algorithms have been successfully used in drug-design applications, especially because decision trees are simple to understand, interpret, and validate. Several decision-tree induction algorithms are available for general use, but each one has a bias that makes it more suitable for a particular data distribution. In this article, we propose and investigate the automatic design of decision-tree induction algorithms tailored to particular drug-enzyme binding data sets. We investigate the performance of our new method for evaluating binding conformations of different drug candidates to InhA, and we analyze our findings with respect to decision tree accuracy, comprehensibility, and biological relevance. Results: The empirical analysis indicates that our method is capable of automatically generating decision-tree induction algorithms that significantly outperform the traditional C4.5 algorithm with respect to both accuracy and comprehensibility. In addition, we provide the biological interpretation of the rules generated by our approach, reinforcing the importance of comprehensible predictive models in this particular bioinformatics application. Conclusions: We conclude that automatically designing a decision-tree algorithm tailored to molecular docking data is a promising alternative for predicting the free energy of binding of a drug candidate with a flexible receptor.
Abstract:
The continuous advancement of wireless systems is enabling compelling new scenarios where mobile services can adapt to the current execution context, represented by the computational resources available at the local device, current physical location, people in physical proximity, and so forth. Such services, called context-aware, require the timely delivery of all relevant information describing the current context, which introduces several unsolved complexities, spanning from low-level context data transmission up to context data storage and replication in the mobile system. In addition, to ensure correct and scalable context provisioning, it is crucial to integrate and interoperate with different wireless technologies (WiFi, Bluetooth, etc.) and modes (infrastructure-based and ad-hoc), and to use decentralized solutions to store and replicate context data on mobile devices. These challenges call for novel middleware solutions, here called Context Data Distribution Infrastructures (CDDIs), capable of delivering relevant context data to mobile devices while hiding all the issues introduced by data distribution in heterogeneous and large-scale mobile settings. This dissertation thoroughly analyzes CDDIs for mobile systems, with the main goal of achieving a holistic approach to the design of this type of middleware solution. We discuss the main functions needed by context data distribution in large mobile systems, and we argue for precisely defined and strictly enforced quality-based contracts between context consumers and the CDDI, which are used to reconfigure the main middleware components at runtime. We present the design and the implementation of our proposals, both in simulation-based and in real-world scenarios, along with an extensive evaluation that confirms the technical soundness of the proposed CDDI solutions.
Finally, we consider three highly heterogeneous scenarios, namely disaster areas, smart campuses, and smart cities, to further demonstrate the broad technical validity of our analysis and solutions under different network deployments and quality constraints.
Abstract:
Distributed computation and storage have been widely used for processing big data sets. For many big data problems, with the size of data growing rapidly, the distribution of computing tasks and related data can greatly affect the performance of the computing system. In this paper, a distributed computing framework is presented for high-performance computing of All-to-All Comparison Problems. A data distribution strategy is embedded in the framework to reduce storage space and balance the computing load. Experiments are conducted to demonstrate the effectiveness of the developed approach. They show that about 88% of the ideal performance capacity is achieved on multiple machines using the approach presented in this paper.
Abstract:
In contrast to a single robotic agent, multi-robot systems are highly dependent on reliable communication. Robots have to synchronize tasks or share poses and sensor readings with other agents, especially in cooperative mapping tasks where local sensor readings are incorporated into a global map. The drawback of existing communication frameworks is that most are based on a central component which has to be constantly within reach. Additionally, they do not prevent data loss between robots if a failure occurs in the communication link. During a distributed mapping task, loss of data is critical because it will corrupt the global map. In this work, we propose a cloud-based publish/subscribe mechanism which enables reliable communication between agents during a cooperative mission, using the Data Distribution Service (DDS) as a transport layer. The usability of our approach is verified by several experiments that take into account complete temporary communication loss.
Abstract:
Distributed computing of all-to-all comparison (ATAC) problems in heterogeneous systems is increasingly important in various domains. Though Hadoop-based solutions are widely used, they are inefficient for the ATAC pattern, which is fundamentally different from the MapReduce pattern for which Hadoop is designed. They exhibit poor data locality and unbalanced allocation of comparison tasks, particularly in heterogeneous systems. This results in massive data movement at runtime and ineffective utilization of computing resources, significantly affecting the overall computing performance. To address these problems, a scalable and efficient data and task distribution strategy is presented in this paper for processing large-scale ATAC problems in heterogeneous systems. It not only saves storage space but also achieves load balancing and good data locality for all comparison tasks. Experiments on bioinformatics examples show that about 89% of the ideal performance capacity of the multiple machines has been achieved using the approach presented in this paper.
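To make the trade-off between load balance and data locality concrete, the following sketch assigns every pairwise comparison task to a worker greedily. It is a simple illustrative heuristic and not the strategy proposed in the paper: the function `distribute_pairs` and its tie-breaking rule are assumptions, and all workers are treated as equally fast.

```python
from itertools import combinations

def distribute_pairs(n_items, n_workers):
    """Greedily assign all pairwise comparison tasks to workers.

    Each task goes to the worker with the fewest tasks so far; ties are
    broken in favour of the worker that needs the fewest new data items.
    """
    tasks = [[] for _ in range(n_workers)]
    stored = [set() for _ in range(n_workers)]  # data items held by each worker
    for i, j in combinations(range(n_items), 2):
        w = min(
            range(n_workers),
            key=lambda w: (len(tasks[w]), len({i, j} - stored[w])),
        )
        tasks[w].append((i, j))
        stored[w].update((i, j))
    return tasks, stored

tasks, stored = distribute_pairs(n_items=6, n_workers=3)
for w, (t, s) in enumerate(zip(tasks, stored)):
    print(f"worker {w}: {len(t)} tasks, stores items {sorted(s)}")
```

Since each task only runs on a worker that already stores both of its items, no data moves at runtime; the cost is that some items are replicated on several workers, which is the storage overhead a more careful strategy would try to minimise.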
Abstract:
Hydrologic impacts of climate change are usually assessed by downscaling the General Circulation Model (GCM) output of large-scale climate variables to local-scale hydrologic variables. Such an assessment is characterized by uncertainty resulting from the ensembles of projections generated with multiple GCMs, which is known as intermodel or GCM uncertainty. Ensemble averaging with the assignment of weights to GCMs based on model evaluation is one of the methods to address such uncertainty and is used in the present study for regional-scale impact assessment. GCM outputs of large-scale climate variables are downscaled to subdivisional-scale monsoon rainfall. Weights are assigned to the GCMs on the basis of model performance and model convergence, which are evaluated with the Cumulative Distribution Functions (CDFs) generated from the downscaled GCM output (for both 20th Century [20C3M] and future scenarios) and observed data. The ensemble averaging approach, with the assignment of weights to GCMs, is characterized by the uncertainty caused by partial ignorance, which stems from the nonavailability of the outputs of some of the GCMs for a few scenarios (in the Intergovernmental Panel on Climate Change [IPCC] data distribution center for Assessment Report 4 [AR4]). This uncertainty is modeled with imprecise probability, i.e., the probability being represented as an interval gray number. Furthermore, the CDF generated with one GCM is entirely different from that with another, and therefore the use of multiple GCMs results in a band of CDFs. Representing this band of CDFs with a single-valued weighted mean CDF may be misleading. Such a band of CDFs can only be represented with an envelope that contains all the CDFs generated with a number of GCMs. An imprecise CDF represents such an envelope, which not only contains the CDFs generated with all the available GCMs but also, to an extent, accounts for the uncertainty resulting from the missing GCM output.
This concept of imprecise probability is also validated in the present study. The imprecise CDFs of monsoon rainfall are derived for three 30-year time slices, 2020s, 2050s and 2080s, with A1B, A2 and B1 scenarios. The model is demonstrated with the prediction of monsoon rainfall in Orissa meteorological subdivision, which shows a possible decreasing trend in the future.
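The idea of an envelope containing a band of CDFs can be sketched as follows. This is a minimal sketch under stated assumptions: it computes pointwise lower and upper bounds over empirical CDFs only, and it does not reproduce the study's weighting of GCMs or its treatment of missing GCM output. The rainfall values and the names `empirical_cdf` and `cdf_envelope` are hypothetical.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return F(x) = fraction of sample values less than or equal to x."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n

def cdf_envelope(samples, grid):
    """Pointwise lower and upper bounds over the CDFs of several samples."""
    cdfs = [empirical_cdf(s) for s in samples]
    lower = [min(F(x) for F in cdfs) for x in grid]
    upper = [max(F(x) for F in cdfs) for x in grid]
    return lower, upper

# Hypothetical downscaled rainfall values (mm) from three models,
# not data from the study.
model_outputs = [
    [800, 950, 1000, 1100, 1250],
    [700, 900, 1050, 1150, 1300],
    [850, 900, 980, 1020, 1200],
]
grid = [700, 900, 1000, 1100, 1300]
lower, upper = cdf_envelope(model_outputs, grid)
for x, lo, hi in zip(grid, lower, upper):
    print(f"P(rainfall <= {x}) lies in [{lo:.1f}, {hi:.1f}]")
```

Each grid point then carries an interval of probabilities rather than a single value, which is the sense in which a band of CDFs resists being summarised by one weighted mean curve.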
Abstract:
Several researchers have looked into various issues related to automatic parallelization of sequential programs for multicomputers, but there is a need for a coherent framework which encompasses all these issues. In this paper we present such a framework, which takes best advantage of the multicomputer architecture. We resort to tiling transformation for iteration space partitioning and propose a scheme of automatic data partitioning and dynamic data distribution. We have tried a simple implementation of our scheme on a transputer-based multicomputer [1] and the results are encouraging.
Abstract:
In many applications, the training data, from which one needs to learn a classifier, is corrupted with label noise. Many standard algorithms such as SVM perform poorly in the presence of label noise. In this paper we investigate the robustness of risk minimization to label noise. We prove a sufficient condition on a loss function for the risk minimization under that loss to be tolerant to uniform label noise. We show that the 0-1 loss, sigmoid loss, ramp loss and probit loss satisfy this condition though none of the standard convex loss functions satisfy it. We also prove that, by choosing a sufficiently large value of a parameter in the loss function, the sigmoid loss, ramp loss and probit loss can be made tolerant to nonuniform label noise also if we can assume the classes to be separable under noise-free data distribution. Through extensive empirical studies, we show that risk minimization under the 0-1 loss, the sigmoid loss and the ramp loss has much better robustness to label noise when compared to the SVM algorithm. (C) 2015 Elsevier B.V. All rights reserved.
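The abstract does not state the sufficient condition itself. As a loudly labelled assumption, the sketch below checks a symmetry property, namely that loss(z) + loss(-z) is constant in the margin z, which the sigmoid and ramp losses satisfy and the hinge loss (the convex loss used by SVM) does not. The loss parameterisations and the helper `symmetry_spread` are illustrative choices, not taken from the paper.

```python
import math

def sigmoid_loss(z, beta=2.0):
    # Smooth surrogate of the 0-1 loss on the margin z = y * f(x).
    return 1.0 / (1.0 + math.exp(beta * z))

def ramp_loss(z):
    # Hinge-like loss clipped to the interval [0, 1].
    return min(1.0, max(0.0, (1.0 - z) / 2.0))

def hinge_loss(z):
    # Standard convex SVM loss, unbounded for negative margins.
    return max(0.0, 1.0 - z)

def symmetry_spread(loss, margins):
    """Range of loss(z) + loss(-z) over the margins; 0 means constant."""
    sums = [loss(z) + loss(-z) for z in margins]
    return max(sums) - min(sums)

margins = [-3.0, -1.0, -0.5, 0.5, 1.0, 3.0]
spreads = {
    f.__name__: symmetry_spread(f, margins)
    for f in (sigmoid_loss, ramp_loss, hinge_loss)
}
for name, spread in spreads.items():
    print(f"{name}: spread of loss(z) + loss(-z) = {spread:.3g}")
```

A spread near zero for the bounded losses and a clearly non-zero spread for the hinge loss is consistent with the abstract's claim that none of the standard convex losses satisfy the condition, under the assumption that the condition is this symmetry.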
Abstract:
13 p.
Abstract:
Georeferenced information is increasingly required for planning and decision-making in different sectors of society. New ways of disseminating data, such as the Open Geospatial Consortium (OGC) web services, have made this information easier to access. Even with all the technological advances in the area of data distribution, there is still low availability of georeferenced data about the Amazon. The goal of the present work is the development of a spatial data infrastructure (SDI), that is, an environment for sharing and using georeferenced data based on web services, metadata and interfaces that allow the user easy access to these data. The present work discusses the OGC standards, the most relevant georeferenced data servers, the main web clients, and the revolution in the dissemination of georeferenced data that geobrowsers and web clients have offered to regular users. Data to be released for the case study come from the project Exploitation of Non-wooden Forest Products (PFNM), in progress at the National Institute of Research in the Amazon (INPA), as well as from inventories of NGOs and other government bodies. Besides contributing to the enhancement of PFNM, this project aims at encouraging the use of GIS in the state of Amazonas by offering technical support for the deployment of geographic databases and their sharing between agencies, optimizing the resources applied in this area through the use of free software and the integration of the scattered information currently available.
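Abstract:
To meet the demands of processing massive data, the industry has proposed a variety of solutions. Cloud computing is currently one of the most popular; it mainly uses inexpensive PCs to build very-large-scale server clusters for data storage and processing. As cloud computing technology develops, more and more applications will move to the cloud, and database systems are no exception. However, the ACID properties required by database systems can cause some operations, such as join queries, to perform poorly when data is stored in a distributed manner. To improve database system performance under distributed data storage, a query-oriented data distribution strategy (Selection Oriented Distribution, SOD) is proposed, in which the data distribution algorithm is determined by the database's query workload. The algorithm is suitable for cloud computing and can markedly improve the system's query performance.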
Abstract:
Paper presented at the Cloud Forward Conference 2015, October 6th-8th, Pisa
Abstract:
Air–sea dimethylsulfide (DMS) fluxes and bulk air–sea gradients were measured over the Southern Ocean in February–March 2012 during the Surface Ocean Aerosol Production (SOAP) study. The cruise encountered three distinct phytoplankton bloom regions, consisting of two blooms with moderate DMS levels, and a high biomass, dinoflagellate-dominated bloom with high seawater DMS levels (> 15 nM). Gas transfer coefficients were considerably scattered at wind speeds above 5 m/s. Bin averaging the data resulted in a linear relationship between wind speed and mean gas transfer velocity consistent with that previously observed. However, the wind-speed-binned gas transfer data distribution at all wind speeds is positively skewed. The flux and seawater DMS distributions were also positively skewed, which suggests that eddy covariance-derived gas transfer velocities are consistently influenced by additional, log-normal noise. A flux footprint analysis was conducted during a transect into the prevailing wind and through elevated DMS levels in the dinoflagellate bloom. Accounting for the temporal/spatial separation between flux and seawater concentration significantly reduces the scatter in computed transfer velocity. The SOAP gas transfer velocity data show no obvious modification of the gas transfer–wind speed relationship by biological activity or waves. This study highlights the challenges associated with eddy covariance gas transfer measurements in biologically active and heterogeneous bloom environments.
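The bin-averaging step and the skewness observation can be illustrated with a short sketch. The numbers below are hypothetical and are not SOAP measurements; the function `bin_average` and the 2 m/s bin width are assumptions chosen for illustration. Within each wind-speed bin, a mean that exceeds the median is a simple indicator of a positively skewed distribution.

```python
from collections import defaultdict
from statistics import mean, median

def bin_average(wind_speeds, transfer_velocities, bin_width=2.0):
    """Group transfer velocities into wind-speed bins and summarise each bin.

    Returns {(bin_start, bin_end): (mean, median)} in order of wind speed.
    """
    bins = defaultdict(list)
    for u, k in zip(wind_speeds, transfer_velocities):
        bins[int(u // bin_width)].append(k)
    return {
        (b * bin_width, (b + 1) * bin_width): (mean(v), median(v))
        for b, v in sorted(bins.items())
    }

# Hypothetical wind speeds (m/s) and gas transfer velocities (cm/h).
wind = [1.0, 1.5, 3.0, 3.5, 3.9, 6.2, 6.8, 7.5]
k_values = [2.0, 3.0, 5.0, 6.0, 13.0, 10.0, 12.0, 26.0]
result = bin_average(wind, k_values)
for (lo, hi), (k_mean, k_median) in result.items():
    print(f"{lo:.0f}-{hi:.0f} m/s: mean={k_mean:.1f}, median={k_median:.1f}")
```

In this toy data a single large value in each of the upper bins pulls the bin mean above the median, which is the pattern a positively skewed, noise-affected distribution would produce.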