116 results for heterogeneous data sources

in Deakin Research Online - Australia


Relevance: 100.00%

Abstract:

This paper presents a novel Bayesian formulation to exploit shared structures across multiple data sources, constructing foundations for effective mining and retrieval across disparate domains. We jointly analyze diverse data sources using a unifying piece of metadata (textual tags). We propose a method based on Bayesian Probabilistic Matrix Factorization (BPMF) which is able to explicitly model the partial knowledge common to the datasets using shared subspaces and the knowledge specific to each dataset using individual subspaces. For the proposed model, we derive an efficient algorithm for learning the joint factorization based on Gibbs sampling. The effectiveness of the model is demonstrated by social media retrieval tasks across single and multiple media. The proposed solution is applicable to a wider context, providing a formal framework suitable for exploiting individual as well as mutual knowledge present across heterogeneous data sources of many kinds.
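
The abstract above describes a Gibbs-sampled joint BPMF; purely as a loose, deterministic illustration of the shared-versus-individual subspace idea (all data, sizes and variable names below are made up, and SVD stands in for the paper's Bayesian inference), stacking two sources and splitting their tag-space directions can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for two tag matrices (items x tags). One latent tag
# direction is common to both sources; each source adds its own direction.
common, own1, own2 = np.eye(6)[:3]
X1 = 3.0 * rng.normal(size=(10, 1)) * common + rng.normal(size=(10, 1)) * own1
X2 = 3.0 * rng.normal(size=(10, 1)) * common + rng.normal(size=(10, 1)) * own2

# Shared subspace: leading right-singular vector of the stacked data.
_, _, Vt = np.linalg.svd(np.vstack([X1, X2]), full_matrices=False)
shared_basis = Vt[:1]

def individual_basis(X, shared):
    """Factor each source's residual after removing the shared component."""
    resid = X - (X @ shared.T) @ shared
    _, _, Vt_r = np.linalg.svd(resid, full_matrices=False)
    return Vt_r[:1]

I1 = individual_basis(X1, shared_basis)

def recon_err(X, basis):
    """Relative error of projecting X onto the row span of `basis`."""
    proj = (X @ basis.T) @ basis
    return np.linalg.norm(X - proj) / np.linalg.norm(X)

# The recovered shared direction aligns with the true common direction,
# and shared + individual subspaces explain X1 better than shared alone.
alignment = abs(float(shared_basis @ common))
err_joint = recon_err(X1, np.vstack([shared_basis, I1]))
err_shared = recon_err(X1, shared_basis)
print(alignment > 0.9, err_joint <= err_shared)
```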

Relevance: 100.00%

Abstract:

The hierarchical Dirichlet process (HDP) was originally designed for, and experimented with, a single data channel. In this paper we enhance its ability to model heterogeneous data by using a richer structure for the base measure: a product space. The enhanced model, called Product Space HDP (PS-HDP), can (1) simultaneously model heterogeneous data from multiple sources in a Bayesian nonparametric framework and (2) discover multilevel latent structures from data, yielding different types of topics/latent structures that can be explained jointly. We experimented with the MDC dataset, a large real-world dataset collected from mobile phones. Our goal was to discover identity–location–time (a.k.a. who-where-when) patterns at different levels (globally for all groups and locally for each group). We analyzed the activities and patterns learned by our model, and visualized, compared and contrasted them with the ground truth to demonstrate the merit of the proposed framework. We further quantitatively evaluated its performance using standard metrics, including F1-score, normalized mutual information (NMI), Rand index (RI), and purity. We also compared the performance of the PS-HDP model with those of popular existing clustering methods, including K-means, NNMF, GMM, DP-means, and AP. Lastly, we demonstrated the model's ability to learn activities with missing data, a common problem in pervasive and ubiquitous computing applications.
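
Purity, one of the metrics reported above, has a short standard definition: each discovered cluster votes for its majority ground-truth label, and purity is the fraction of points so covered. A minimal sketch (the toy assignments are invented):

```python
from collections import Counter

def purity(pred, truth):
    """Each predicted cluster votes for its majority ground-truth label;
    purity is the fraction of all points matching their cluster's vote."""
    clusters = {}
    for p, t in zip(pred, truth):
        clusters.setdefault(p, []).append(t)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(truth)

# Hypothetical who-where-when pattern assignments vs. ground truth.
pred  = [0, 0, 0, 1, 1, 2, 2, 2]
truth = ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'a']
print(purity(pred, truth))  # 0.75
```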

Relevance: 100.00%

Abstract:

Timely gathering of sufficient, correct information, which may be stored in databases, data files, or on the World Wide Web, is becoming a major problem for decision making in petroleum production. In this paper, the Gaia methodology and the Open Agent Architecture are employed to build a framework that addresses this problem. The framework consists of three levels: role model, agent type, and agent instance. A model with five roles is analyzed, four agent types are designed, and six agent instances are developed to construct the petroleum information services system. The experimental results show that all agents in the system can work cooperatively to organize and retrieve relevant petroleum information. The successful implementation of the framework shows that agent-based technology can significantly facilitate the construction of complex systems in distributed, heterogeneous data-resource environments.
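
A Gaia-style three-level decomposition, roles realized by agent types which are then instantiated per deployment, can be sketched as follows (the role and type names are illustrative, not those of the paper; only the counts, five roles, four types, six instances, come from the abstract):

```python
# Hypothetical Gaia-style mapping: roles -> agent types -> instances.
roles = ["UserInterface", "TaskPlanning", "SourceWrapping",
         "Matchmaking", "ResultAssembly"]

agent_types = {                      # each type realizes one or more roles
    "InterfaceAgent":  ["UserInterface"],
    "PlannerAgent":    ["TaskPlanning", "ResultAssembly"],
    "WrapperAgent":    ["SourceWrapping"],
    "MatchmakerAgent": ["Matchmaking"],
}

instances = {                        # types instantiated per deployment;
    "InterfaceAgent": 1,             # e.g. one wrapper per data source
    "PlannerAgent": 1,
    "WrapperAgent": 3,
    "MatchmakerAgent": 1,
}

# Sanity checks: every role is covered, and six instances are deployed.
covered = {r for rs in agent_types.values() for r in rs}
print(sorted(covered) == sorted(roles), sum(instances.values()))
```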

Relevance: 100.00%

Abstract:

Joint analysis of multiple data sources is becoming increasingly popular in transfer learning, multi-task learning and cross-domain data mining. One promising approach to modeling the data jointly is to learn the shared and individual factor subspaces. However, the performance of this approach depends on the subspace dimensionalities and the level of sharing, both of which need to be specified a priori. To this end, we propose a nonparametric joint factor analysis framework for modeling multiple related data sources. Our model uses the hierarchical beta process as a nonparametric prior to automatically infer the number of shared and individual factors. For posterior inference, we provide a Gibbs sampling scheme using auxiliary variables. The effectiveness of the proposed framework is validated through its application to two real-world problems: transfer learning in text and image retrieval.
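
A crude finite approximation to a hierarchical-beta-process prior over factor usage may help convey how shared and individual factors emerge; note this is sampling from a prior, not the paper's auxiliary-variable Gibbs posterior, and all hyperparameters and sizes are illustrative:

```python
import random
random.seed(0)

K, n_sources, c = 20, 2, 5.0

# Finite approximation to a hierarchical beta process: a global
# activation probability per factor, then per-source probabilities
# concentrated around it (concentration c; all values illustrative).
global_pi = [random.betavariate(1.0, 4.0) for _ in range(K)]
source_pi = [[random.betavariate(c * p + 1e-9, c * (1 - p) + 1e-9)
              for p in global_pi] for _ in range(n_sources)]

# Binary factor-usage indicators per source: a factor active in every
# source is "shared"; one active in a single source is "individual".
z = [[int(random.random() < p) for p in pis] for pis in source_pi]
shared = [k for k in range(K) if all(z[j][k] for j in range(n_sources))]
individual = [k for k in range(K)
              if sum(z[j][k] for j in range(n_sources)) == 1]
print(len(shared), len(individual))
```

The point of the nonparametric prior is that the counts printed above are inferred rather than fixed a priori.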

Relevance: 100.00%

Abstract:

Nonnegative matrix factorization based methods provide one of the simplest and most effective approaches to text mining. However, their applicability is mainly limited to analyzing a single data source. In this chapter, we propose a novel joint matrix factorization framework that analyzes multiple data sources jointly by exploiting their shared and individual structures. The proposed framework is flexible enough to handle any arbitrary sharing configuration encountered in real-world data. We derive an efficient algorithm for learning the factorization and show that its convergence is theoretically guaranteed. We demonstrate the utility and effectiveness of the proposed framework in two real-world applications: improving social media retrieval using auxiliary sources, and cross-social-media retrieval. Representing each social media source by its textual tags, we show for both applications that retrieval performance exceeds that of existing state-of-the-art techniques. The proposed solution provides a generic framework and is applicable to the wider data mining context wherever one needs to exploit both mutual and individual knowledge across multiple data sources.
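
One way to sketch a shared/individual joint NMF (this is a generic construction, not necessarily the chapter's exact updates; sizes and ranks below are invented) is to stack the sources and give the loading matrix a block-sparsity pattern. Standard Lee–Seung multiplicative updates preserve zeros, so the sharing configuration holds throughout training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nonnegative "tag count" matrices (hypothetical sizes), stacked.
X1, X2 = rng.random((8, 12)), rng.random((6, 12))
X = np.vstack([X1, X2])

k_shared, k1, k2 = 2, 1, 1
n1 = X1.shape[0]

# Block-structured W: both sources load on the shared factors, and each
# on its own individual factors. Multiplicative updates keep zeros at
# zero, so the sharing configuration is preserved throughout.
W = rng.random((X.shape[0], k_shared + k1 + k2))
W[:n1, k_shared + k1:] = 0.0          # source 1 never uses source 2's factors
W[n1:, k_shared:k_shared + k1] = 0.0  # and vice versa
H = rng.random((k_shared + k1 + k2, X.shape[1]))

eps = 1e-9
for _ in range(200):                  # standard Lee–Seung updates
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(err < 1.0, np.allclose(W[:n1, k_shared + k1:], 0.0))
```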

Relevance: 100.00%

Abstract:

Integrating information in the molecular biosciences involves more than the cross-referencing of sequences or structures. Experimental protocols, results of computational analyses, annotations and links to relevant literature form integral parts of this information, and impart meaning to sequence or structure. In this review, we examine some existing approaches to integrating information in the molecular biosciences. We consider not only technical issues concerning the integration of heterogeneous data sources and the corresponding semantic implications, but also the integration of analytical results. Within the broad range of strategies for integration of data and information, we distinguish between platforms and developments. We discuss two current platforms and six current developments, and identify what we believe to be their strengths and limitations. We identify key unsolved problems in integrating information in the molecular biosciences, and discuss possible strategies for addressing them including semantic integration using ontologies, XML as a data model, and graphical user interfaces as integrative environments.

Relevance: 100.00%

Abstract:

The exponential increase in data and computing power, and the availability of readily accessible analytical software, have allowed organisations around the world to leverage the benefits of integrating multiple heterogeneous data files for enterprise-level planning and decision making. Benefits of effective data integration to the health and medical research community include more trustworthy research, higher service quality, improved personnel efficiency, reduction of redundant tasks, facilitation of auditing, and more timely, relevant and specific information. Poor-quality processes elevate the risk of erroneous outcomes and erode confidence in the data and in the organisations using them. To date there is no documented set of standards for best-practice integration of heterogeneous data files for research purposes. The aim of this paper is therefore to describe a clear protocol for data file integration (Data Integration Protocol In Ten-steps; DIPIT) translatable to any field of research.
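
One integration step of the kind such a protocol covers, key harmonisation followed by a validated join, might look like this (the records, field names, and harmonisation rule are invented, not drawn from DIPIT itself):

```python
# Hypothetical two-source record linkage: harmonise the join keys,
# join on the overlap, and surface unmatched records for auditing.
clinical = {"P001": {"age": 63}, "p002": {"age": 57}, "P003": {"age": 71}}
lab      = {"P001": {"hb": 13.2}, "P002": {"hb": 11.8}, "P004": {"hb": 12.5}}

def harmonise(records):
    """Normalise join keys before integration (upper-case IDs here)."""
    return {key.upper(): row for key, row in records.items()}

clinical, lab = harmonise(clinical), harmonise(lab)

merged = {pid: {**clinical[pid], **lab[pid]}
          for pid in clinical.keys() & lab.keys()}
unmatched = (clinical.keys() | lab.keys()) - merged.keys()

print(sorted(merged))     # ['P001', 'P002']
print(sorted(unmatched))  # ['P003', 'P004']
```

Keeping the unmatched set explicit, rather than silently dropping records, is what supports the auditing benefit the abstract mentions.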

Relevance: 90.00%

Abstract:

Data collecting is necessary for organizations with very small databases, such as nuclear power plants and earthquake bureaus. Traditionally, data collecting means obtaining the necessary data from internal and external data sources and joining it all into a single homogeneous database. Because collected data may be untrustworthy, however, it can obscure genuinely useful patterns in the data. In this paper, breaking away from the traditional data-collecting mode that treats internal and external data equally, we argue that the first step in utilizing external data is to identify quality data in the data sources for the given mining tasks. Pre- and post-analysis techniques are therefore advocated for generating quality data.
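
A toy pre-analysis filter in the spirit of the abstract, scoring each external source by its agreement with trusted internal records before pooling, can be sketched as follows (the scoring rule, threshold, and data are our own illustration, not the paper's technique):

```python
# Trusted internal records: item id -> observed label.
internal = {"e1": 1, "e2": 0, "e3": 1, "e4": 1}
external_sources = {
    "src_a": {"e1": 1, "e2": 0, "e3": 1, "e5": 0},  # agrees on overlap
    "src_b": {"e1": 0, "e2": 1, "e4": 0, "e6": 1},  # contradicts internal
}

def agreement(source):
    """Fraction of overlapping items on which a source matches internal data."""
    overlap = source.keys() & internal.keys()
    if not overlap:
        return 0.0
    return sum(source[k] == internal[k] for k in overlap) / len(overlap)

# Keep only external sources whose overlap agreement clears a threshold.
trusted = [name for name, src in external_sources.items()
           if agreement(src) >= 0.8]
print(trusted)  # ['src_a']
```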

Relevance: 90.00%

Abstract:

Collecting, analyzing, and making molecular-biological annotation data accessible across different public data sources is still an ongoing effort. Integrating data from these sources can lead to valuable biological knowledge. Annotation data are numerous, only some are structured, and the number and contents of related sources are continuously increasing. In addition, each existing data source has its own storage structure and implementation, which limits the ways annotations can be combined. Here we propose a tool, called ANNODA, for integrating molecular-biological annotation data. Unlike past work on database interoperation in the bioinformatics community, its database design uses web links, which are very useful for interactive navigation while also supporting automated large-scale analysis tasks.
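
Link-based integration of the general kind described above can be sketched as URL-template expansion over an annotation record's cross-references (the templates, record, and accessions below are illustrative, not ANNODA's actual schema):

```python
# Hypothetical link-based integration: instead of importing every
# source, store URL templates and generate navigable cross-reference
# links per annotation record on demand.
LINK_TEMPLATES = {
    "uniprot": "https://www.uniprot.org/uniprotkb/{acc}",
    "pubmed":  "https://pubmed.ncbi.nlm.nih.gov/{acc}/",
}

record = {"gene": "TP53", "xrefs": {"uniprot": "P04637", "pubmed": "1905840"}}

def links(rec):
    """Expand each cross-reference into a resolvable web link."""
    return {db: LINK_TEMPLATES[db].format(acc=acc)
            for db, acc in rec["xrefs"].items()}

for db, url in sorted(links(record).items()):
    print(db, url)
```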

Relevance: 90.00%

Abstract:

This study investigated the various sources of information used by business organisations and their association with business performance indicators. The results showed that larger Australian companies predominantly used various internal and external secondary data sources of information, followed by formal primary marketing research. The study indicated an association between market share and the use of competitors, advertising agencies, and sales promotion agencies as sources of information, while overall financial performance was associated with the use of competitors and media/trade publications.

Relevance: 90.00%

Abstract:

Peer-to-peer content distribution networks (PCDNs) have recently attracted considerable attention and have huge potential for massive data-intensive applications on the Internet. One of the challenges in a PCDN is routing for data sources and data deliveries. In this paper, we study a type of network model formed by dynamic autonomy areas, structured source servers and proxy servers. Based on this network model, we propose a number of algorithms to address the routing and data-delivery issues. To cope with the highly dynamic nature of the autonomy areas, we establish dynamic tree-structure proliferation routing, proxy routing and resource-searching algorithms. Simulation results show that the performance of the proposed network model and algorithms is stable.
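
The paper's proliferation-tree algorithms are not reproduced here; purely as an illustration of the generic idea of resolving a request through a proxy tree toward a structured source server (all node and item names invented):

```python
# Toy proxy tree: child -> parent; the root is the source server.
tree = {
    "proxy_a": "source", "proxy_b": "source",
    "leaf_1": "proxy_a", "leaf_2": "proxy_a", "leaf_3": "proxy_b",
}
# Which nodes currently hold which content items.
holdings = {"proxy_b": {"video42"}, "source": {"video42", "video7"}}

def resolve(node, item):
    """Walk up toward the root until some node holds the item."""
    while node is not None:
        if item in holdings.get(node, set()):
            return node
        node = tree.get(node)
    return None

print(resolve("leaf_1", "video42"))  # source
print(resolve("leaf_3", "video42"))  # proxy_b (served closer to the edge)
```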

Relevance: 90.00%

Abstract:

Operating databases efficiently and reliably in agent-based heterogeneous data-source environments is becoming a major problem. In this paper, we contribute a framework and develop agent-oriented matchmakers, organized in a logical ring, to match task agents' requests with the middleware agents of databases. A middleware agent wraps a specific database and runs on the same server as the database management system. Matchmakers proliferate and self-cancel according to sensory input from their environment, and a ring-based coordination mechanism among them is designed. Two kinds of matchmakers, host and duplicate, are designed to improve the scalability and robustness of the agent-based system, and the middleware agents are adapted to the framework. We demonstrate the potential of the framework through a case study and present theoretical and empirical evidence that our approach is scalable and robust.
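
Ring-based matchmaking can be sketched as forwarding a request around a logical ring of matchmakers until one advertises a suitable middleware agent (all names and capabilities below are invented; the paper's proliferation and self-cancellation logic is omitted):

```python
# Logical ring of matchmakers: one host plus duplicates.
ring = ["mm_host", "mm_dup1", "mm_dup2"]
# Capability adverts: which middleware agents each matchmaker knows.
registry = {
    "mm_host": {"oracle_sales"},
    "mm_dup1": {"mysql_hr"},
    "mm_dup2": {"pg_inventory", "mysql_hr"},
}

def match(start, capability):
    """Forward a task agent's request at most once around the ring."""
    i = ring.index(start)
    for step in range(len(ring)):
        mm = ring[(i + step) % len(ring)]
        if capability in registry[mm]:
            return mm
    return None  # no matchmaker advertises the capability

print(match("mm_dup2", "mysql_hr"))    # mm_dup2 (answered locally)
print(match("mm_host", "mysql_hr"))    # mm_dup1 (forwarded once)
print(match("mm_host", "db_missing"))  # None
```

Duplicates let any matchmaker answer locally when it can, which is the scalability argument the abstract makes.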