175 resultados para Data compression (Computer science)


Relevância:

100.00% 100.00%

Publicador:

Resumo:

In the last few years we have observed a proliferation of approaches for clustering XML docu- ments and schemas based on their structure and content. The presence of such a huge amount of approaches is due to the different applications requiring the XML data to be clustered. These applications need data in the form of similar contents, tags, paths, structures and semantics. In this paper, we first outline the application contexts in which clustering is useful, then we survey approaches so far proposed relying on the abstract representation of data (instances or schema), on the identified similarity measure, and on the clustering algorithm. This presentation leads to draw a taxonomy in which the current approaches can be classified and compared. We aim at introducing an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering compo- nent. Finally, the paper moves into the description of future trends and research issues that still need to be faced.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A rule-based approach for classifying previously identified medical concepts in the clinical free text into an assertion category is presented. There are six different categories of assertions for the task: Present, Absent, Possible, Conditional, Hypothetical and Not associated with the patient. The assertion classification algorithms were largely based on extending the popular NegEx and Context algorithms. In addition, a health based clinical terminology called SNOMED CT and other publicly available dictionaries were used to classify assertions, which did not fit the NegEx/Context model. The data for this task includes discharge summaries from Partners HealthCare and from Beth Israel Deaconess Medical Centre, as well as discharge summaries and progress notes from University of Pittsburgh Medical Centre. The set consists of 349 discharge reports, each with pairs of ground truth concept and assertion files for system development, and 477 reports for evaluation. The system’s performance on the evaluation data set was 0.83, 0.83 and 0.83 for recall, precision and F1-measure, respectively. Although the rule-based system shows promise, further improvements can be made by incorporating machine learning approaches.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Projects funded by the Australian National Data Service(ANDS). The specific projects that were funded included: a) Greenhouse Gas Emissions Project (N2O) with Prof. Peter Grace from QUT’s Institute of Sustainable Resources. b) Q150 Project for the management of multimedia data collected at Festival events with Prof. Phil Graham from QUT’s Institute of Creative Industries. c) Bio-diversity environmental sensing with Prof. Paul Roe from the QUT Microsoft eResearch Centre. For the purposes of these projects the Eclipse Rich Client Platform (Eclipse RCP) was chosen as an appropriate software development framework within which to develop the respective software. This poster will present a brief overview of the requirements of the projects, an overview of the experiences of the project team in using Eclipse RCP, report on the advantages and disadvantages of using Eclipse and it’s perspective on Eclipse as an integrated tool for supporting future data management requirements.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

It is a big challenge to acquire correct user profiles for personalized text classification since users may be unsure in providing their interests. Traditional approaches to user profiling adopt machine learning (ML) to automatically discover classification knowledge from explicit user feedback in describing personal interests. However, the accuracy of ML-based methods cannot be significantly improved in many cases due to the term independence assumption and uncertainties associated with them. This paper presents a novel relevance feedback approach for personalized text classification. It basically applies data mining to discover knowledge from relevant and non-relevant text and constraints specific knowledge by reasoning rules to eliminate some conflicting information. We also developed a Dempster-Shafer (DS) approach as the means to utilise the specific knowledge to build high-quality data models for classification. The experimental results conducted on Reuters Corpus Volume 1 and TREC topics support that the proposed technique achieves encouraging performance in comparing with the state-of-the-art relevance feedback models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Existing secure software development principles tend to focus on coding vulnerabilities, such as buffer or integer overflows, that apply to individual program statements, or issues associated with the run-time environment, such as component isolation. Here we instead consider software security from the perspective of potential information flow through a program’s object-oriented module structure. In particular, we define a set of quantifiable "security metrics" which allow programmers to quickly and easily assess the overall security of a given source code program or object-oriented design. Although measuring quality attributes of object-oriented programs for properties such as maintainability and performance has been well-covered in the literature, metrics which measure the quality of information security have received little attention. Moreover, existing securityrelevant metrics assess a system either at a very high level, i.e., the whole system, or at a fine level of granularity, i.e., with respect to individual statements. These approaches make it hard and expensive to recognise a secure system from an early stage of development. Instead, our security metrics are based on well-established compositional properties of object-oriented programs (i.e., data encapsulation, cohesion, coupling, composition, extensibility, inheritance and design size), combined with data flow analysis principles that trace potential information flow between high- and low-security system variables. We first define a set of metrics to assess the security quality of a given object-oriented system based on its design artifacts, allowing defects to be detected at an early stage of development. We then extend these metrics to produce a second set applicable to object-oriented program source code. The resulting metrics make it easy to compare the relative security of functionallyequivalent system designs or source code programs so that, for instance, the security of two different revisions of the same system can be compared directly. This capability is further used to study the impact of specific refactoring rules on system security more generally, at both the design and code levels. By measuring the relative security of various programs refactored using different rules, we thus provide guidelines for the safe application of refactoring steps to security-critical programs. Finally, to make it easy and efficient to measure a system design or program’s security, we have also developed a stand-alone software tool which automatically analyses and measures the security of UML designs and Java program code. The tool’s capabilities are demonstrated by applying it to a number of security-critical system designs and Java programs. Notably, the validity of the metrics is demonstrated empirically through measurements that confirm our expectation that program security typically improves as bugs are fixed, but worsens as new functionality is added.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The encryption method is a well established technology for protecting sensitive data. However, once encrypted, the data can no longer be easily queried. The performance of the database depends on how to encrypt the sensitive data. In this paper we review the conventional encryption method which can be partially queried and propose the encryption method for numerical data which can be effectively queried. The proposed system includes the design of the service scenario, and metadata.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In the medical and healthcare arena, patients‟ data is not just their own personal history but also a valuable large dataset for finding solutions for diseases. While electronic medical records are becoming popular and are used in healthcare work places like hospitals, as well as insurance companies, and by major stakeholders such as physicians and their patients, the accessibility of such information should be dealt with in a way that preserves privacy and security. Thus, finding the best way to keep the data secure has become an important issue in the area of database security. Sensitive medical data should be encrypted in databases. There are many encryption/ decryption techniques and algorithms with regard to preserving privacy and security. Currently their performance is an important factor while the medical data is being managed in databases. Another important factor is that the stakeholders should decide more cost-effective ways to reduce the total cost of ownership. As an alternative, DAS (Data as Service) is a popular outsourcing model to satisfy the cost-effectiveness but it takes a consideration that the encryption/ decryption modules needs to be handled by trustworthy stakeholders. This research project is focusing on the query response times in a DAS model (AES-DAS) and analyses the comparison between the outsourcing model and the in-house model which incorporates Microsoft built-in encryption scheme in a SQL Server. This research project includes building a prototype of medical database schemas. There are 2 types of simulations to carry out the project. The first stage includes 6 databases in order to carry out simulations to measure the performance between plain-text, Microsoft built-in encryption and AES-DAS (Data as Service). Particularly, the AES-DAS incorporates implementations of symmetric key encryption such as AES (Advanced Encryption Standard) and a Bucket indexing processor using Bloom filter. The results are categorised such as character type, numeric type, range queries, range queries using Bucket Index and aggregate queries. The second stage takes the scalability test from 5K to 2560K records. The main result of these simulations is that particularly as an outsourcing model, AES-DAS using the Bucket index shows around 3.32 times faster than a normal AES-DAS under the 70 partitions and 10K record-sized databases. Retrieving Numeric typed data takes shorter time than Character typed data in AES-DAS. The aggregation query response time in AES-DAS is not as consistent as that in MS built-in encryption scheme. The scalability test shows that the DBMS reaches in a certain threshold; the query response time becomes rapidly slower. However, there is more to investigate in order to bring about other outcomes and to construct a secured EMR (Electronic Medical Record) more efficiently from these simulations.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Monitoring environmental health is becoming increasingly important as human activity and climate change place greater pressure on global biodiversity. Acoustic sensors provide the ability to collect data passively, objectively and continuously across large areas for extended periods. While these factors make acoustic sensors attractive as autonomous data collectors, there are significant issues associated with large-scale data manipulation and analysis. We present our current research into techniques for analysing large volumes of acoustic data efficiently. We provide an overview of a novel online acoustic environmental workbench and discuss a number of approaches to scaling analysis of acoustic data; online collaboration, manual, automatic and human-in-the loop analysis.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Most approaches to business process compliance are restricted to the analysis of the structure of processes. It has been argued that full regulatory compliance requires information on not only the structure of processes but also on what the tasks in a process do. To this end Governatori and Sadiq[2007] proposed to extend business processes with semantic annotations. We propose a methodology to automatically extract one kind of such annotations; in particular the annotations related to the data schema and templates linked to the various tasks in a business process.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

While undertaking the ANDS RDA Gold Standard Record Exemplars project, research data sharing was discussed with many QUT researchers. Our experiences provided rich insight into researcher attitudes towards their data and the sharing of such data. Generally, we found traditional altruistic motivations for research data sharing did not inspire researchers, but an explanation of the more achievement-oriented benefits were more compelling.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Server consolidation using virtualization technology has become an important technology to improve the energy efficiency of data centers. Virtual machine placement is the key in the server consolidation. In the past few years, many approaches to the virtual machine placement have been proposed. However, existing virtual machine placement approaches to the virtual machine placement problem consider the energy consumption by physical machines in a data center only, but do not consider the energy consumption in communication network in the data center. However, the energy consumption in the communication network in a data center is not trivial, and therefore should be considered in the virtual machine placement in order to make the data center more energy-efficient. In this paper, we propose a genetic algorithm for a new virtual machine placement problem that considers the energy consumption in both the servers and the communication network in the data center. Experimental results show that the genetic algorithm performs well when tackling test problems of different kinds, and scales up well when the problem size increases.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Key decisions at the collection, pre-processing, transformation, mining and interpretation phase of any knowledge discovery from database (KDD) process depend heavily on assumptions and theorectical perspectives relating to the type of task to be performed and characteristics of data sourced. In this article, we compare and contrast theoretical perspectives and assumptions taken in data mining exercises in the legal domain with those adopted in data mining in TCM and allopathic medicine. The juxtaposition results in insights for the application of KDD for Traditional Chinese Medicine.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Citizen Science projects are initiatives in which members of the general public participate in scientific research projects and perform or manage research-related tasks such as data collection and/or data annotation. Citizen Science is technologically possible and scientifically significant. However, as the gathered information is from the crowd, the data quality is always hard to manage. There are many ways to manage data quality, and reputation management is one of the common approaches. In recent year, many research teams have deployed many audio or image sensors in natural environment in order to monitor the status of animals or plants. The collected data will be analysed by ecologists. However, as the amount of collected data is exceedingly huge and the number of ecologists is very limited, it is impossible for scientists to manually analyse all these data. The functions of existing automated tools to process the data are still very limited and the results are still not very accurate. Therefore, researchers have turned to recruiting general citizens who are interested in helping scientific research to do the pre-processing tasks such as species tagging. Although research teams can save time and money by recruiting general citizens to volunteer their time and skills to help data analysis, the reliability of contributed data varies a lot. Therefore, this research aims to investigate techniques to enhance the reliability of data contributed by general citizens in scientific research projects especially for acoustic sensing projects. In particular, we aim to investigate how to use reputation management to enhance data reliability. Reputation systems have been used to solve the uncertainty and improve data quality in many marketing and E-Commerce domains. The commercial organizations which have chosen to embrace the reputation management and implement the technology have gained many benefits. Data quality issues are significant to the domain of Citizen Science due to the quantity and diversity of people and devices involved. However, research on reputation management in this area is relatively new. We therefore start our investigation by examining existing reputation systems in different domains. Then we design novel reputation management approaches for Citizen Science projects to categorise participants and data. We have investigated some critical elements which may influence data reliability in Citizen Science projects. These elements include personal information such as location and education and performance information such as the ability to recognise certain bird calls. The designed reputation framework is evaluated by a series of experiments involving many participants for collecting and interpreting data, in particular, environmental acoustic data. Our research in exploring the advantages of reputation management in Citizen Science (or crowdsourcing in general) will help increase awareness among organizations that are unacquainted with its potential benefits.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

It is acknowledged around the world that many university students struggle with learning to program (McCracken et al., 2001; McGettrick et al., 2005). In this paper, we describe how we have developed a research programme to systematically study and incrementally improve our teaching. We have adopted a research programme with three elements: (1) a theory that provides an organising framework for defining the type of phenomena and data of interest, (2) data on how the class as a whole performs on formative assessment tasks that are framed from within the organising framework, and (3) data from one-on-one think aloud sessions, to establish why students struggle with some of those in-class formative assessment tasks. We teach introductory computer programming, but this three-element structure of our research is applicable to many areas of engineering education research.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is also proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Experiments based on several real-world data collections demonstrate that WebPut outperforms existing approaches.