962 results for multiple data sources


Relevance: 100.00%

Abstract:

Forensic analysis requires the acquisition and management of many different types of evidence, including individual disk drives, RAID sets, network packets, memory images, and extracted files. Often the same evidence is reviewed by several different tools or examiners in different locations. We propose a backwards-compatible redesign of the Advanced Forensic Format, an open, extensible file format for storing and sharing evidence, arbitrary case-related information, and analysis results among different tools. The new specification, termed AFF4, is designed to be simple to implement and is built upon the well-supported ZIP file format specification. Furthermore, the AFF4 implementation is backward compatible with existing AFF files.
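The idea of layering an evidence container on ZIP can be illustrated with a minimal sketch. This is not the AFF4 layout: the member names `evidence/image.bin` and `metadata.json`, the JSON metadata, and the SHA-256 check are all assumptions chosen for illustration, using only Python's standard `zipfile` module.

```python
import hashlib
import io
import json
import zipfile

# Hypothetical member names; the real AFF4 specification defines its own layout.
EVIDENCE_MEMBER = "evidence/image.bin"
METADATA_MEMBER = "metadata.json"


def write_container(evidence: bytes, case_info: dict) -> bytes:
    """Pack evidence bytes and case metadata into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(EVIDENCE_MEMBER, evidence)
        # Record a digest so a reader can detect altered evidence.
        metadata = dict(case_info, sha256=hashlib.sha256(evidence).hexdigest())
        archive.writestr(METADATA_MEMBER, json.dumps(metadata, indent=2))
    return buffer.getvalue()


def read_container(blob: bytes):
    """Return (evidence, metadata), raising ValueError on a digest mismatch."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        evidence = archive.read(EVIDENCE_MEMBER)
        metadata = json.loads(archive.read(METADATA_MEMBER))
    if hashlib.sha256(evidence).hexdigest() != metadata["sha256"]:
        raise ValueError("evidence does not match the recorded digest")
    return evidence, metadata


blob = write_container(b"\x00\x01raw disk bytes", {"case": "demo-001"})
evidence, metadata = read_container(blob)
```

Because the container is an ordinary ZIP archive, any standard ZIP tool can list its members, which is the practical benefit the abstract attributes to building on a widely supported format.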

Relevance: 100.00%

Abstract:

This paper presents a novel Bayesian formulation to exploit shared structures across multiple data sources, constructing foundations for effective mining and retrieval across disparate domains. We jointly analyze diverse data sources using a unifying piece of metadata (textual tags). We propose a method based on Bayesian Probabilistic Matrix Factorization (BPMF) which is able to explicitly model the partial knowledge common to the datasets using shared subspaces and the knowledge specific to each dataset using individual subspaces. For the proposed model, we derive an efficient algorithm for learning the joint factorization based on Gibbs sampling. The effectiveness of the model is demonstrated by social media retrieval tasks across single and multiple media. The proposed solution is applicable to a wider context, providing a formal framework suitable for exploiting individual as well as mutual knowledge present across heterogeneous data sources of many kinds.

Relevance: 100.00%

Abstract:

Joint analysis of multiple data sources is becoming increasingly popular in transfer learning, multi-task learning, and cross-domain data mining. One promising approach to modeling the data jointly is to learn shared and individual factor subspaces. However, the performance of this approach depends on the subspace dimensionalities, and the level of sharing must be specified a priori. To remove this requirement, we propose a nonparametric joint factor analysis framework for modeling multiple related data sources. Our model uses the hierarchical beta process as a nonparametric prior to automatically infer the number of shared and individual factors. For posterior inference, we provide a Gibbs sampling scheme using auxiliary variables. The effectiveness of the proposed framework is validated through its application to two real-world problems: transfer learning in text and image retrieval.

Relevance: 100.00%

Abstract:

This paper presents a poverty profile for Brazil, based on three different sources of household data for 1996. We use PPV consumption data to estimate poverty and indigence lines. “Contagem” data is used to allow for an unprecedented refinement of the country’s poverty map. Poverty measures and shares are also presented for a wide range of population subgroups, based on the PNAD 1996, with new adjustments for imputed rents and spatial differences in cost of living. Robustness of the profile is verified with respect to different poverty lines, spatial price deflators, and equivalence scales. Overall poverty incidence ranges from 23% with respect to an indigence line to 45% with respect to a more generous poverty line. More importantly, however, poverty is found to vary significantly across regions and city sizes, with rural areas, small and medium towns and the metropolitan peripheries of the North and Northeast regions being poorest.

Relevance: 100.00%

Abstract:

Nonnegative matrix factorization based methods provide one of the simplest and most effective approaches to text mining. However, their applicability is mainly limited to analyzing a single data source. In this chapter, we propose a novel joint matrix factorization framework that can jointly analyze multiple data sources by exploiting their shared and individual structures. The proposed framework is flexible enough to handle arbitrary sharing configurations encountered in real-world data. We derive an efficient algorithm for learning the factorization and show that its convergence is theoretically guaranteed. We demonstrate the utility and effectiveness of the proposed framework in two real-world applications: improving social media retrieval using auxiliary sources, and cross-social media retrieval. Representing each social media source by its textual tags, we show for both applications that retrieval performance exceeds that of existing state-of-the-art techniques. The proposed solution provides a generic framework and is applicable to a wider context in data mining, wherever one needs to exploit mutual and individual knowledge present across multiple data sources.
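The shared-plus-individual structure can be made concrete with a small sketch. It assumes two sources described over the same tag vocabulary, so each matrix X_i (tags by items) is approximated as [W | V_i] H_i, where W is the shared basis and V_i is specific to source i. The multiplicative updates below are the standard ones for a squared-error objective; they are an illustrative assumption and not necessarily the algorithm derived in the chapter. All data and function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def hstack(a, b):
    """Concatenate two matrices column-wise."""
    return [ra + rb for ra, rb in zip(a, b)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(target, numer, denom, eps=1e-9):
    """Element-wise multiplicative update: target * numer / denom."""
    return [[t * n / (d + eps) for t, n, d in zip(tr, nr, dr)]
            for tr, nr, dr in zip(target, numer, denom)]


def loss(xs, w, vs, hs):
    """Sum of squared reconstruction errors over all sources."""
    total = 0.0
    for x, v, h in zip(xs, vs, hs):
        approx = matmul(hstack(w, v), h)
        total += sum((xi - ai) ** 2
                     for xr, ar in zip(x, approx) for xi, ai in zip(xr, ar))
    return total


def joint_nmf(xs, k_shared, k_individual, iterations=200, seed=0):
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    n_tags = len(xs[0])
    w = rand(n_tags, k_shared)                        # shared basis
    vs = [rand(n_tags, k_individual) for _ in xs]     # per-source bases
    hs = [rand(k_shared + k_individual, len(x[0])) for x in xs]
    history = [loss(xs, w, vs, hs)]
    for _ in range(iterations):
        for i, x in enumerate(xs):                    # update coefficients H_i
            basis = hstack(w, vs[i])
            basis_t = transpose(basis)
            approx = matmul(basis, hs[i])
            hs[i] = scale(hs[i], matmul(basis_t, x), matmul(basis_t, approx))
        for i, x in enumerate(xs):                    # update individual V_i
            hv_t = transpose(hs[i][k_shared:])
            approx = matmul(hstack(w, vs[i]), hs[i])
            vs[i] = scale(vs[i], matmul(x, hv_t), matmul(approx, hv_t))
        numer = denom = None
        for i, x in enumerate(xs):                    # shared W pools all sources
            hw_t = transpose(hs[i][:k_shared])
            approx = matmul(hstack(w, vs[i]), hs[i])
            n_i, d_i = matmul(x, hw_t), matmul(approx, hw_t)
            numer = n_i if numer is None else add(numer, n_i)
            denom = d_i if denom is None else add(denom, d_i)
        w = scale(w, numer, denom)
        history.append(loss(xs, w, vs, hs))
    return w, vs, hs, history


# Hypothetical tag-by-item count matrices for two sources (4 tags, 3 items each).
source_a = [[3, 0, 1], [2, 0, 1], [0, 4, 0], [0, 1, 2]]
source_b = [[1, 2, 0], [1, 3, 0], [0, 0, 5], [2, 0, 1]]
w, vs, hs, history = joint_nmf([source_a, source_b], k_shared=1, k_individual=1)
```

The key design point is the update of W: its numerator and denominator sum contributions from every source, so the shared basis is fitted to all data at once, while each V_i sees only its own source.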

Relevance: 100.00%

Abstract:

A central tenet in the theory of reliability modelling is the quantification of the probability of asset failure. In general, reliability depends on asset age and the maintenance policy applied. Usually, failure and maintenance times are the primary inputs to reliability models. However, for many organisations, different aspects of these data are often recorded in different databases (e.g. work order notifications, event logs, condition monitoring data, and process control data). These recorded data cannot be interpreted individually, since they typically do not have all the information necessary to ascertain failure and preventive maintenance times. This paper presents a methodology for the extraction of failure and preventive maintenance times using commonly-available, real-world data sources. A text-mining approach is employed to extract keywords indicative of the source of the maintenance event. Using these keywords, a Naïve Bayes classifier is then applied to attribute each machine stoppage to one of two classes: failure or preventive. The accuracy of the algorithm is assessed and the classified failure time data are then presented. The applicability of the methodology is demonstrated on a maintenance data set from an Australian electricity company.
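The classification step can be sketched as a multinomial Naïve Bayes model with Laplace smoothing over extracted keywords. The keywords and training records below are hypothetical, and the paper's actual feature extraction and classifier settings are not given in the abstract, so this is only an illustration of the general technique.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(records):
    """Fit class and keyword counts from (keywords, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for keywords, label in records:
        class_counts[label] += 1
        word_counts[label].update(keywords)
        vocabulary.update(keywords)
    return class_counts, word_counts, vocabulary


def classify(model, keywords):
    """Return the label with the highest log-posterior for the keywords."""
    class_counts, word_counts, vocabulary = model
    total_records = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_records in class_counts.items():
        score = math.log(n_records / total_records)      # class prior
        total_words = sum(word_counts[label].values())
        for word in keywords:
            if word not in vocabulary:
                continue                                  # ignore unseen keywords
            # Laplace (add-one) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical keywords extracted from work-order text.
TRAINING = [
    (["breakdown", "tripped", "fault"], "failure"),
    (["fault", "burnt", "replace"], "failure"),
    (["scheduled", "inspection", "service"], "preventive"),
    (["routine", "service", "lubricate"], "preventive"),
]
model = train_naive_bayes(TRAINING)
stoppage_class = classify(model, ["tripped", "fault"])
```

Each machine stoppage is thus assigned to one of the two classes, failure or preventive, from which the corresponding event times can be collected for a reliability model.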

Relevance: 100.00%

Abstract:

The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data is used appropriately and effectively, knowledge discovery can be better achieved than is possible from only a single source. Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources, representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools SVM and KNN were used to successfully distinguish between several soil samples. The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to perform better than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from the K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database called PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface to allow biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.

Relevance: 100.00%

Abstract:

Nonnegative matrix factorization based methods provide one of the simplest and most effective approaches to text mining. However, their applicability is mainly limited to analyzing a single data source. In this paper, we propose a novel joint matrix factorization framework that can jointly analyze multiple data sources by exploiting their shared and individual structures. The proposed framework is flexible enough to handle arbitrary sharing configurations encountered in real-world data. We derive an efficient algorithm for learning the factorization and show that its convergence is theoretically guaranteed. We demonstrate the utility and effectiveness of the proposed framework in two real-world applications: improving social media retrieval using auxiliary sources, and cross-social media retrieval. Representing each social media source by its textual tags, we show for both applications that retrieval performance exceeds that of existing state-of-the-art techniques. The proposed solution provides a generic framework and is applicable to a wider context in data mining, wherever one needs to exploit mutual and individual knowledge present across multiple data sources.

Relevance: 100.00%

Abstract:

Effective management of invasive fishes depends on the availability of updated information about their distribution and spatial dispersion. Forensic analysis was performed using online and published data on the European catfish, Silurus glanis L., a recent invader in the Tagus catchment (Iberian Peninsula). Eighty records were obtained mainly from anglers’ fora and blogs, and more recently from www.youtube.com. Since the first record in 1998, S. glanis expanded its geographic range by 700 km of river network, occurring mainly in reservoirs and in high-order reaches. Human-mediated and natural dispersal events were identified, with the former occurring during the first years of invasion and involving movements of >50 km. Downstream dispersal directionality was predominant. The analysis of online data from anglers was found to provide useful information on the distribution and dispersal patterns of this non-native fish, and is potentially applicable as a preliminary, exploratory assessment tool for other non-native fishes.

Relevance: 100.00%

Abstract:

Background: Not all cancer patients receive state-of-the-art care, and providing regular feedback to clinicians might reduce this problem. The purpose of this study was to assess the utility of various data sources in providing feedback on the quality of cancer care. Methods: Published clinical practice guidelines were used to obtain a list of processes-of-care of interest to clinicians. These were assigned to one of four data categories according to their availability and the marginal cost of using them for feedback. Results: Only 8 (3%) of 243 processes-of-care could be measured using population-based registry or administrative inpatient data (lowest cost). A further 119 (49%) could be measured using a core clinical registry, which contains information on important prognostic factors (e.g., clinical stage, physiological reserve, hormone-receptor status). Another 88 (36%) required an expanded clinical registry or medical record review, mainly because they concerned long-term management of disease progression (recurrences and metastases). The remaining 28 (11.5%) required patient interview or audio-taping of consultations because they involved information sharing between clinician and patient. Conclusion: The advantages of population-based cancer registries and administrative inpatient data are wide coverage and low cost. The disadvantage is that they currently contain information on only a few processes-of-care. In most jurisdictions, clinical cancer registries, which can be used to report on many more processes-of-care, do not cover smaller hospitals. If we are to provide feedback about all patients, not just those in larger academic hospitals with the most developed data systems, then we need to develop sustainable population-based data systems that capture information on prognostic factors at the time of initial diagnosis and information on management of disease progression.
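The category counts in the Results can be checked with a few lines of arithmetic. The counts are taken from the abstract; the dictionary keys are shortened labels introduced here for illustration.

```python
# Processes-of-care per data category, as reported in the abstract.
counts = {
    "population registry / administrative data": 8,
    "core clinical registry": 119,
    "expanded registry / record review": 88,
    "patient interview / audio-taping": 28,
}

total = sum(counts.values())  # 243, matching the reported total
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} of {total} ({share}%)")
```

The four categories sum to 243, and the computed shares (3.3%, 49.0%, 36.2%, 11.5%) agree with the rounded percentages quoted in the abstract.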

Relevance: 100.00%

Abstract:

This report provides an evaluation of the current available evidence-base for identification and surveillance of product-related injuries in children in Queensland. While the focal population was children in Queensland, the identification of information needs and data sources for product safety surveillance has applicability nationally for all age groups. The report firstly summarises the data needs of product safety regulators regarding product-related injury in children, describing the current sources of information informing product safety policy and practice, and documenting the priority product surveillance areas affecting children which have been a focus over recent years in Queensland. Health data sources in Queensland which have the potential to inform product safety surveillance initiatives were evaluated in terms of their ability to address the information needs of product safety regulators. Patterns in product-related injuries in children were analysed using routinely available health data to identify areas for future intervention, and the patterns in product-related injuries in children identified in health data were compared to those identified by product safety regulators. Recommendations were made for information system improvements and improved access to and utilisation of health data for more proactive approaches to product safety surveillance in the future.

Relevance: 100.00%

Abstract:

Spreadsheet for Creative City Index 2012