970 results for DATASETS
Abstract:
Association rule mining has made many advances in the area of knowledge discovery. However, the quality of the discovered association rules is a big concern and has drawn increasing attention recently. One problem with the quality of the discovered association rules is the huge size of the extracted rule set. Often, a huge number of rules can be extracted from a dataset, but many of them are redundant with respect to other rules and thus useless in practice. Mining non-redundant rules is a promising approach to this problem. In this paper, we first propose a definition of redundancy; we then propose a concise representation, called the Reliable basis, for representing non-redundant association rules, covering both exact and approximate rules. An important contribution of this paper is that we propose to use the certainty factor as the criterion for measuring the strength of the discovered association rules. With this criterion, we can determine the boundary between redundancy and non-redundancy, ensuring that as many redundant rules as possible are eliminated without reducing the inference capacity of, or the belief in, the remaining non-redundant rules. We prove that redundancy elimination based on the proposed Reliable basis does not reduce the belief in the extracted rules. We also prove that all association rules can be deduced from the Reliable basis. The Reliable basis is therefore a lossless representation of association rules. Experimental results show that the proposed Reliable basis can significantly reduce the number of extracted rules.
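The certainty factor referenced above is a standard rule-strength measure; a minimal sketch of how it is commonly computed from support and confidence (toy data, illustrative only):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

def certainty_factor(antecedent, consequent, transactions):
    """CF(A -> B) in [-1, 1]: positive when A raises belief in B above
    B's baseline support, negative when it lowers it."""
    conf = confidence(antecedent, consequent, transactions)
    supp_b = support(consequent, transactions)
    if conf > supp_b:
        return (conf - supp_b) / (1 - supp_b)
    if conf < supp_b:
        return (conf - supp_b) / supp_b
    return 0.0

# Toy transactions (sets of items):
transactions = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"a", "c"}]
print(certainty_factor({"a"}, {"b"}, transactions))  # about -0.11
```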
Abstract:
Scientists need to transfer semantically similar queries across multiple heterogeneous linked datasets. These queries may require data from different locations and the results are not simple to combine due to differences between datasets. A query model was developed to make it simple to distribute queries across different datasets using RDF as the result format. The query model, based on the concept of publicly recognised namespaces for parts of each scientific dataset, was implemented with a configuration that includes a large number of current biological and chemical datasets. The configuration is flexible, providing the ability to transparently use both private and public datasets in any query. A prototype implementation of the model was used to resolve queries for the Bio2RDF website, including both Bio2RDF datasets and other datasets that do not follow the Bio2RDF URI conventions.
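A minimal sketch of the namespace-based distribution idea the abstract describes: map publicly recognised namespace prefixes to the endpoints (private or public) that can answer them, then fan the query out. All prefixes and URLs below are illustrative assumptions, not the actual Bio2RDF configuration:

```python
# Publicly recognised namespace prefixes mapped to the SPARQL endpoints
# (private or public) that hold data for that namespace. All URLs are
# illustrative placeholders, not the real Bio2RDF configuration.
ENDPOINTS = {
    "geneid":  ["http://example.org/private/ncbigene/sparql"],
    "uniprot": ["http://example.org/public/uniprot/sparql",
                "http://example.org/mirror/uniprot/sparql"],
}

def resolve(namespace):
    """Endpoints registered for one namespace (empty if unknown)."""
    return ENDPOINTS.get(namespace, [])

def distribute(query, namespaces):
    """Pair each relevant endpoint with the query it should answer; the
    per-endpoint RDF results can then be merged into a single graph."""
    return {ep: query for ns in namespaces for ep in resolve(ns)}

plan = distribute("CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }",
                  ["geneid", "uniprot"])
print(plan)
```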
Abstract:
In today's electronic world, vast amounts of knowledge are stored within many datasets and databases. Often the default format of this data means that the knowledge within is not immediately accessible, but rather has to be mined and extracted; this requires automated tools that are effective and efficient. Association rule mining is one approach to obtaining the knowledge stored within datasets/databases; it discovers frequent patterns and association rules between the items/attributes of a dataset, with varying levels of strength. However, this is also association rule mining's downside: the number of rules that can be found is usually very large. In order to use the association rules (and the knowledge within them) effectively, the number of rules needs to be kept manageable, so a method is needed to reduce the number of association rules without losing knowledge in the process. Hence the idea of non-redundant association rule mining was born. A second issue with association rule mining is determining which rules are interesting. The standard approach has been to use support and confidence, but these have their limitations. Approaches which use information about the dataset's structure to measure association rules are limited, but could yield useful association rules if tapped. Finally, while it is important to be able to extract interesting association rules from a dataset in a manageable number, it is equally important to be able to apply them in a practical way, where the knowledge they contain can be taken advantage of. Association rules show items/attributes that frequently appear together. Recommendation systems likewise look for patterns and items/attributes that frequently occur together in order to make a recommendation to a person. It should therefore be possible to bring the two together. In this thesis we look at these three issues and propose approaches to address them. For discovering non-redundant rules, we propose enhanced approaches to rule mining in multi-level datasets that allow hierarchically redundant association rules to be identified and removed without information loss. For discovering interesting association rules based on the dataset's structure, we propose three measures for use in multi-level datasets. Lastly, we propose and demonstrate an approach that allows association rules to be used practically and effectively in a recommender system, while at the same time improving the recommender system's performance. This becomes especially evident in the user cold-start problem; in fact, our proposal helps to solve this serious problem facing recommender systems.
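One plausible reading of hierarchical redundancy in multi-level datasets, sketched under the assumption that a more general (ancestor-level) rule is redundant when a more specific rule already covers it with at least equal confidence; the helper names and toy hierarchy are hypothetical, not the thesis's exact formulation:

```python
def is_ancestor_or_self(general, specific, parent_of):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = parent_of.get(specific)
    return False

def hierarchically_redundant(rule_a, rule_b, parent_of):
    """rule_a is redundant w.r.t. rule_b when its antecedent and consequent
    are ancestor-level versions of rule_b's and it is no stronger.
    Rules are (antecedent_item, consequent_item, confidence) triples."""
    (ant_a, con_a, conf_a), (ant_b, con_b, conf_b) = rule_a, rule_b
    return (rule_a != rule_b
            and is_ancestor_or_self(ant_a, ant_b, parent_of)
            and is_ancestor_or_self(con_a, con_b, parent_of)
            and conf_a <= conf_b)

# Toy hierarchy: milk is a kind of dairy, bread a kind of bakery item.
parent_of = {"milk": "dairy", "bread": "bakery"}
general_rule  = ("dairy", "bakery", 0.60)   # dairy -> bakery
specific_rule = ("milk",  "bread",  0.75)   # milk  -> bread
print(hierarchically_redundant(general_rule, specific_rule, parent_of))  # True
```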
Abstract:
Recent studies on automatic new topic identification in Web search engine user sessions have demonstrated that neural networks are successful at automatic new topic identification. However, most of this work applied its new topic identification algorithms to data logs from a single search engine. In this study, we investigate whether the application of neural networks to automatic new topic identification is more successful on some search engines than on others. Sample data logs from the Norwegian search engine FAST (currently owned by Overture) and from Excite are used in this study. Findings of this study suggest that query logs with more topic shifts tend to provide more successful results on shift-based performance measures, whereas logs with more topic continuations tend to provide better results on continuation-based performance measures.
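The shift-based and continuation-based performance measures contrasted here can be read as per-class precision and recall over consecutive query pairs; a minimal sketch under that assumption:

```python
def per_class_metrics(truth, predicted, positive):
    """Precision/recall treating `positive` ('shift' or 'continuation')
    as the class of interest, over consecutive query pairs."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, predicted))
    fp = sum(t != positive and p == positive for t, p in zip(truth, predicted))
    fn = sum(t == positive and p != positive for t, p in zip(truth, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

truth = ["shift", "continuation", "shift", "continuation"]
pred  = ["shift", "shift",        "shift", "continuation"]
print(per_class_metrics(truth, pred, "shift"))         # shift-based view
print(per_class_metrics(truth, pred, "continuation"))  # continuation-based view
```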
Abstract:
Humanitarian entrants remain invisible in existing population datasets, and this has significant implications for health care and health policy. We suggest adding 'year of arrival' to population datasets, enabling the combination of 'country of birth' and 'year of arrival' to be used as a proxy for refugee status.
Abstract:
Within the QUT Business School (QUTBS), researchers across economics, finance and accounting depend on data-driven research. They analyze historic and global financial data across a range of instruments to understand the relationships and effects between them as they respond to news and events in their region. Scholars and Higher Degree Research (HDR) students in turn seek out universities which offer these particular datasets to further their research. This involves downloading and manipulating large datasets, often with a focus on depth of detail, frequency and long-tail historical data. This is stock exchange data with potential commercial value, so the licence for access tends to be very expensive. This poster reports the following findings:
• The library has a part to play in freeing researchers from the burden of negotiating subscriptions, fundraising and managing the legal requirements around licensing and access.
• The role of the library is to communicate the nature and potential of these complex resources across the university, to disciplines as diverse as Mathematics, Health, Information Systems and Creative Industries.
• The service has demonstrated clear, concrete support for research by QUT Library and built relationships into faculty. It has made data available to all researchers and attracted new HDRs. The aim is to reach the threshold of research outputs to submit into FOR Code 1502 (Banking, Finance and Investment) for ERA 2015.
• It is difficult to identify which subset of a dataset will be obtained, given somewhat vague price tiers.
• The integrity of the data is variable, as it is limited by the way it is collected; this occasionally raises issues for researchers (Cook, Campbell, & Kelly, 2012).
• Improved library understanding of the content of our products and the nature of finance-based research is a necessary part of the service.
Abstract:
In this paper, we provide an overview of the Social Event Detection (SED) task that is part of the MediaEval Benchmark for Multimedia Evaluation 2013. This task requires participants to discover social events and organize the related media items in event-specific clusters within a collection of Web multimedia. Social events are events that are planned by people, attended by people, and for which the social multimedia are also captured by people. We describe the challenges, datasets, and the evaluation methodology.
Abstract:
This paper presents large, accurately calibrated and time-synchronised datasets, gathered outdoors in controlled environmental conditions, using an unmanned ground vehicle (UGV), equipped with a wide variety of sensors. It discusses how the data collection process was designed, the conditions in which these datasets have been gathered, and some possible outcomes of their exploitation, in particular for the evaluation of performance of sensors and perception algorithms for UGVs.
Abstract:
We present two unconditionally secure protocols for private set disjointness tests. In order to provide intuition for our protocols, we give a naive example that applies Sylvester matrices. Unfortunately, this simple construction is insecure, as it reveals information about the intersection cardinality; more specifically, it discloses its lower bound. By using Lagrange interpolation, we provide a protocol for the honest-but-curious case that reveals no additional information. Finally, we describe a protocol that is secure against malicious adversaries. In this protocol, a verification test is applied to detect misbehaving participants. Both protocols require O(1) rounds of communication. Our protocols are more efficient than previous protocols in terms of communication and computation overhead. Unlike previous protocols, whose security relies on computational assumptions, our protocols provide information-theoretic security. To our knowledge, our protocols are the first that have been designed without a generic secure function evaluation. More importantly, they are the most efficient protocols for private disjointness tests in the malicious adversary case.
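For intuition, the naive construction works roughly as follows: encode each set as a polynomial whose roots are its elements; the resultant of the two polynomials, the determinant of their Sylvester matrix, is nonzero exactly when the sets are disjoint. A minimal in-the-clear sketch of that idea using sympy (not the secure protocol itself):

```python
from sympy import prod, resultant, symbols

x = symbols("x")

def set_polynomial(elements):
    """Polynomial whose roots are exactly the set's elements."""
    return prod(x - e for e in elements)

def disjoint(set_a, set_b):
    """The resultant (determinant of the Sylvester matrix) is zero iff the
    two polynomials share a root, i.e. iff the sets intersect.
    Insecure: everything is computed in the clear."""
    return resultant(set_polynomial(set_a), set_polynomial(set_b), x) != 0

print(disjoint({1, 2, 3}, {4, 5}))  # True: no common element
print(disjoint({1, 2, 3}, {3, 4}))  # False: 3 is shared
```

The leak the abstract refers to arises because the Sylvester matrix reveals more than the single disjointness bit: the degree of the gcd of the two polynomials, and hence a lower bound on the intersection size, can be read off from it.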
Abstract:
At Eurocrypt'04, Freedman, Nissim and Pinkas introduced the fuzzy private matching problem, defined as follows: given two parties, each holding a set of vectors where each vector has T integer components, fuzzy private matching is to securely test whether each vector of one set matches any vector of the other set on at least t components, where t < T. In the conclusion of their paper, they asked whether it was possible to design a fuzzy private matching protocol without incurring a communication complexity with the factor $\binom{T}{t}$. We answer their question in the affirmative by presenting a protocol based on homomorphic encryption, combined with the novel notion of a share-hiding error-correcting secret sharing scheme, which we show how to implement with efficient decoding using interleaved Reed-Solomon codes. This scheme may be of independent interest. Our protocol is provably secure against passive adversaries, and has better efficiency than previous protocols for certain parameter values.
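Stripped of the cryptography, the matching predicate being computed is a simple threshold test on component-wise agreement; a plaintext reference sketch (the actual protocol computes this obliviously, without revealing the vectors):

```python
def fuzzy_match(u, v, t):
    """True if u and v (same length T) agree on at least t components."""
    assert len(u) == len(v)
    return sum(a == b for a, b in zip(u, v)) >= t

def fuzzy_matching_plaintext(set_a, set_b, t):
    """Reference semantics only: the vectors of set_a that match
    some vector of set_b on at least t components."""
    return [u for u in set_a if any(fuzzy_match(u, v, t) for v in set_b)]

A = [(1, 2, 3, 4), (9, 9, 9, 9)]
B = [(1, 2, 0, 4), (5, 6, 7, 8)]
print(fuzzy_matching_plaintext(A, B, t=3))  # [(1, 2, 3, 4)]
```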
Abstract:
Objective: Evaluate the effectiveness and robustness of Anonym, a tool for de-identifying free-text health records based on conditional random fields classifiers informed by linguistic and lexical features, as well as features extracted by pattern matching techniques. De-identification of personal health information in electronic health records is essential for the sharing and secondary usage of clinical data. De-identification tools that adapt to different sources of clinical data are attractive, as they would require minimal intervention to guarantee high effectiveness.
Methods and Materials: The effectiveness and robustness of Anonym are evaluated across multiple datasets, including the widely adopted Integrating Biology and the Bedside (i2b2) dataset, used for evaluation in a de-identification challenge. The datasets used here vary in the type of health records, the source of the data, and their quality, with one of the datasets containing optical character recognition errors.
Results: Anonym identifies and removes up to 96.6% of personal health identifiers (recall) with a precision of up to 98.2% on the i2b2 dataset, outperforming the best system proposed in the i2b2 challenge. The effectiveness of Anonym across datasets is found to depend on the amount of information available for training.
Conclusion: Findings show that Anonym performs comparably to the best approach from the 2006 i2b2 shared task. It is easy to retrain Anonym with new datasets; if retrained, the system is robust to variations in training size, data type and quality, given sufficient training data.
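A rough sketch of the kind of pattern-matching features that can inform such a conditional-random-fields tagger; the regular expressions below are illustrative stand-ins, not Anonym's actual feature set:

```python
import re

# Illustrative surface patterns for identifier-like tokens; not Anonym's
# actual feature set.
PATTERNS = {
    "DATE":  re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),
    "PHONE": re.compile(r"\d{3}[-.\s]\d{3}[-.\s]\d{4}"),
    "ID":    re.compile(r"\d{6,8}"),  # e.g. a medical-record-number shape
}

def pattern_features(token):
    """Binary features marking which identifier-like pattern a token matches;
    in a system like Anonym these would sit alongside linguistic and
    lexical features as inputs to the CRF."""
    feats = {name: rx.fullmatch(token) is not None
             for name, rx in PATTERNS.items()}
    feats["IS_CAPITALISED"] = token[:1].isupper()
    return feats

print(pattern_features("12/03/2006"))  # DATE fires, the others do not
```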
Abstract:
We present efficient protocols for private set disjointness tests. We start with an intuitive construction that applies Sylvester matrices. Unfortunately, this simple construction is insecure, as it reveals information about the cardinality of the intersection; more specifically, it discloses its lower bound. By using Lagrange interpolation, we provide a protocol for the honest-but-curious case that reveals no additional information. Finally, we describe a protocol that is secure against malicious adversaries. This protocol applies a verification test to detect misbehaving participants. Both protocols require O(1) rounds of communication. Our protocols are more efficient than previous protocols in terms of communication and computation overhead. Unlike previous protocols, whose security relies on computational assumptions, our protocols provide information-theoretic security. To our knowledge, our protocols are the first that have been designed without a generic secure function evaluation. More importantly, they are the most efficient protocols for private disjointness tests in the malicious adversary case.
Abstract:
In this paper we propose a novel scheme for carrying out speaker diarization in an iterative manner. We aim to show that the information obtained through the first pass of speaker diarization can be reused to refine and improve the original diarization results. We call this technique speaker rediarization and demonstrate the practical application of our rediarization algorithm using a large archive of two-speaker telephone conversation recordings. We use the NIST 2008 SRE summed telephone corpus for evaluating our speaker rediarization system. This corpus contains recurring speaker identities across independent recording sessions that need to be linked across the entire corpus. We show that our speaker rediarization scheme can take advantage of inter-session speaker information, linked in the initial diarization pass, to achieve a 30% relative improvement over the original diarization error rate (DER) after only two iterations of rediarization.
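A schematic of the rediarization loop as described: diarize, link recurring speakers across sessions, adapt models from the pooled linked data, and diarize again. Every callable here is a hypothetical placeholder standing in for a real diarization component:

```python
def rediarize(recordings, diarize, link_speakers, adapt_models, iterations=2):
    """Schematic speaker rediarization loop; all callables are hypothetical
    placeholders for real diarization components.
    1. diarize each recording independently;
    2. link recurring speaker identities across sessions;
    3. adapt speaker models from the pooled, linked data;
    4. re-run diarization with the adapted models."""
    models = None
    segmentations = [diarize(rec, models) for rec in recordings]
    for _ in range(iterations):
        links = link_speakers(segmentations)          # inter-session identities
        models = adapt_models(recordings, segmentations, links)
        segmentations = [diarize(rec, models) for rec in recordings]
    return segmentations
```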
Abstract:
Background: Australian national biomonitoring for persistent organic pollutants (POPs) relies upon age-specific pooled serum samples to characterize central tendencies of concentrations, but does not provide estimates of upper-bound concentrations. This analysis compares population variation across biomonitoring datasets from the US, Canada, Germany, Spain, and Belgium to identify and test patterns potentially useful for estimating population upper-bound reference values for the Australian population.
Methods: Arithmetic means and the ratio of the 95th percentile to the arithmetic mean (P95:mean) were assessed by survey for defined age subgroups for three polychlorinated biphenyls (PCBs 138, 153, and 180), hexachlorobenzene (HCB), p,p′-dichlorodiphenyldichloroethylene (DDE), 2,2′,4,4′-tetrabromodiphenyl ether (PBDE 47), perfluorooctanoic acid (PFOA) and perfluorooctane sulfonate (PFOS).
Results: Arithmetic mean concentrations of each analyte varied widely across surveys and age groups. However, P95:mean ratios differed to a limited extent, with no systematic variation across ages. The average P95:mean ratios were 2.2 for the three PCBs and HCB; 3.0 for DDE; and 2.0 and 2.3 for PFOA and PFOS, respectively. The P95:mean ratio for PBDE 47 was more variable among age groups, ranging from 2.7 to 4.8. The average P95:mean ratios accurately estimated age-group-specific P95s in the Flemish Environmental Health Survey II and were used to estimate the P95s for the Australian population by age group from the pooled biomonitoring data.
Conclusions: Similar population variation patterns for POPs were observed across multiple surveys, even when absolute concentrations differed widely. These patterns can be used to estimate population upper bounds when only pooled sampling data are available.
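The estimation step in the conclusion amounts to multiplying a pooled arithmetic mean by the survey-averaged P95:mean ratio for the analyte; a small sketch using the ratios reported above (the pooled mean value is hypothetical):

```python
# Survey-averaged P95:mean ratios reported in the abstract.
P95_TO_MEAN = {"PCB": 2.2, "HCB": 2.2, "DDE": 3.0, "PFOA": 2.0, "PFOS": 2.3}

def estimate_p95(analyte, pooled_mean):
    """Estimate an upper-bound (95th percentile) concentration from a
    pooled-sample arithmetic mean for the given analyte."""
    return P95_TO_MEAN[analyte] * pooled_mean

# Hypothetical pooled PFOS mean of 5.0 ng/mL:
print(estimate_p95("PFOS", 5.0))  # 11.5 ng/mL estimated P95
```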