980 resultados para data linkage
Resumo:
In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, Webput utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is also proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Experiments based on several real-world data collections demonstrate that WebPut outperforms existing approaches.
Resumo:
This paper presents an input-orientated data envelopment analysis (DEA) framework which allows the measurement and decomposition of economic, environmental and ecological efficiency levels in agricultural production across different countries. Economic, environmental and ecological optimisations search for optimal input combinations that minimise total costs, total amount of nutrients, and total amount of cumulative exergy contained in inputs respectively. The application of the framework to an agricultural dataset of 30 OECD countries revealed that (i) there was significant scope to make their agricultural production systemsmore environmentally and ecologically sustainable; (ii) the improvement in the environmental and ecological sustainability could be achieved by being more technically efficient and, even more significantly, by changing the input combinations; (iii) the rankings of sustainability varied significantly across OECD countries within frontier-based environmental and ecological efficiency measures and between frontier-based measures and indicators.
Resumo:
The Clarence-Moreton Basin (CMB) covers approximately 26000 km2 and is the only sub-basin of the Great Artesian Basin (GAB) in which there is flow to both the south-west and the east, although flow to the south-west is predominant. In many parts of the basin, including catchments of the Bremer, Logan and upper Condamine Rivers in southeast Queensland, the Walloon Coal Measures are under exploration for Coal Seam Gas (CSG). In order to assess spatial variations in groundwater flow and hydrochemistry at a basin-wide scale, a 3D hydrogeological model of the Queensland section of the CMB has been developed using GoCAD modelling software. Prior to any large-scale CSG extraction, it is essential to understand the existing hydrochemical character of the different aquifers and to establish any potential linkage. To effectively use the large amount of water chemistry data existing for assessment of hydrochemical evolution within the different lithostratigraphic units, multivariate statistical techniques were employed.
Resumo:
The ability to forecast machinery health is vital to reducing maintenance costs, operation downtime and safety hazards. Recent advances in condition monitoring technologies have given rise to a number of prognostic models which attempt to forecast machinery health based on condition data such as vibration measurements. This paper demonstrates how the population characteristics and condition monitoring data (both complete and suspended) of historical items can be integrated for training an intelligent agent to predict asset health multiple steps ahead. The model consists of a feed-forward neural network whose training targets are asset survival probabilities estimated using a variation of the Kaplan–Meier estimator and a degradation-based failure probability density function estimator. The trained network is capable of estimating the future survival probabilities when a series of asset condition readings are inputted. The output survival probabilities collectively form an estimated survival curve. Pump data from a pulp and paper mill were used for model validation and comparison. The results indicate that the proposed model can predict more accurately as well as further ahead than similar models which neglect population characteristics and suspended data. This work presents a compelling concept for longer-range fault prognosis utilising available information more fully and accurately.
Resumo:
Our paper approaches Twitter through the lens of “platform politics” (Gillespie, 2010), focusing in particular on controversies around user data access, ownership, and control. We characterise different actors in the Twitter data ecosystem: private and institutional end users of Twitter, commercial data resellers such as Gnip and DataSift, data scientists, and finally Twitter, Inc. itself; and describe their conflicting interests. We furthermore study Twitter’s Terms of Service and application programming interface (API) as material instantiations of regulatory instruments used by the platform provider and argue for a more promotion of data rights and literacy to strengthen the position of end users.
Resumo:
The deployment of new emerging technologies, such as cooperative systems, allows the traffic community to foresee relevant improvements in terms of traffic safety and efficiency. Vehicles are able to communicate on the local traffic state in real time, which could result in an automatic and therefore better reaction to the mechanism of traffic jam formation. An upstream single hop radio broadcast network can improve the perception of each cooperative driver within radio range and hence the traffic stability. The impact of a cooperative law on traffic congestion appearance is investigated, analytically and through simulation. Ngsim field data is used to calibrate the Optimal Velocity with Relative Velocity (OVRV) car following model and the MOBIL lane-changing model is implemented. Assuming that congestion can be triggered either by a perturbation in the instability domain or by a critical lane changing behavior, the calibrated car following behavior is used to assess the impact of a microscopic cooperative law on abnormal lane changing behavior. The cooperative law helps reduce and delay traffic congestion as it increases traffic flow stability.
Resumo:
Background Accumulated biological research outcomes show that biological functions do not depend on individual genes, but on complex gene networks. Microarray data are widely used to cluster genes according to their expression levels across experimental conditions. However, functionally related genes generally do not show coherent expression across all conditions since any given cellular process is active only under a subset of conditions. Biclustering finds gene clusters that have similar expression levels across a subset of conditions. This paper proposes a seed-based algorithm that identifies coherent genes in an exhaustive, but efficient manner. Methods In order to find the biclusters in a gene expression dataset, we exhaustively select combinations of genes and conditions as seeds to create candidate bicluster tables. The tables have two columns: (a) a gene set, and (b) the conditions on which the gene set have dissimilar expression levels to the seed. First, the genes with less than the maximum number of dissimilar conditions are identified and a table of these genes is created. Second, the rows that have the same dissimilar conditions are grouped together. Third, the table is sorted in ascending order based on the number of dissimilar conditions. Finally, beginning with the first row of the table, a test is run repeatedly to determine whether the cardinality of the gene set in the row is greater than the minimum threshold number of genes in a bicluster. If so, a bicluster is outputted and the corresponding row is removed from the table. Repeating this process, all biclusters in the table are systematically identified until the table becomes empty. Conclusions This paper presents a novel biclustering algorithm for the identification of additive biclusters. Since it involves exhaustively testing combinations of genes and conditions, the additive biclusters can be found more readily.
Resumo:
miRDeep and its varieties are widely used to quantify known and novel micro RNA (miRNA) from small RNA sequencing (RNAseq). This article describes miRDeep*, our integrated miRNA identification tool, which is modeled off miRDeep, but the precision of detecting novel miRNAs is improved by introducing new strategies to identify precursor miRNAs. miRDeep* has a user-friendly graphic interface and accepts raw data in FastQ and Sequence Alignment Map (SAM) or the binary equivalent (BAM) format. Known and novel miRNA expression levels, as measured by the number of reads, are displayed in an interface, which shows each RNAseq read relative to the pre-miRNA hairpin. The secondary pre-miRNA structure and read locations for each predicted miRNA are shown and kept in a separate figure file. Moreover, the target genes of known and novel miRNAs are predicted using the TargetScan algorithm, and the targets are ranked according to the confidence score. miRDeep* is an integrated standalone application where sequence alignment, pre-miRNA secondary structure calculation and graphical display are purely Java coded. This application tool can be executed using a normal personal computer with 1.5 GB of memory. Further, we show that miRDeep* outperformed existing miRNA prediction tools using our LNCaP and other small RNAseq datasets. miRDeep* is freely available online at http://www.australianprostatecentre.org/research/software/mirdeep-star
Resumo:
The IEEE Subcommittee on the Application of Probability Methods (APM) published the IEEE Reliability Test System (RTS) [1] in 1979. This system provides a consistent and generally acceptable set of data that can be used both in generation capacity and in composite system reliability evaluation [2,3]. The test system provides a basis for the comparison of results obtained by different people using different methods. Prior to its publication, there was no general agreement on either the system or the data that should be used to demonstrate or test various techniques developed to conduct reliability studies. Development of reliability assessment techniques and programs are very dependent on the intent behind the development as the experience of one power utility with their system may be quite different from that of another utility. The development and the utilization of a reliability program are, therefore, greatly influenced by the experience of a utlity and the intent of the system manager, planner and designer conducting the reliability studies. The IEEE-RTS has proved to be extremely valuable in highlighting and comparing the capabilities (or incapabilities) of programs used in reliability studies, the differences in the perception of various power utilities and the differences in the solution techniques. The IEEE-RTS contains a reasonably large power network which can be difficult to use for initial studies in an educational environment.
Resumo:
The IEEE Reliability Test System (RTS) developed by the Application of Probability Method Subcommittee has been used to compare and test a wide range of generating capacity and composite system evaluation techniques and subsequent digital computer programs. A basic reliability test system is presented which has evolved from the reliability education and research programs conducted by the Power System Research Group at the University of Saskatchewan. The basic system data necessary for adequacy evaluation at the generation and composite generation and transmission system levels are presented together with the fundamental data required to conduct reliability-cost/reliability-worth evaluation
Resumo:
Speaker diarization determines instances of the same speaker within a recording. Extending this task to a collection of recordings for linking together segments spoken by a unique speaker requires speaker linking. In this paper we propose a speaker linking system using linkage clustering and state-of-the-art speaker recognition techniques. We evaluate our approach against two baseline linking systems using agglomerative cluster merging (AC) and agglomerative clustering with model retraining (ACR). We demonstrate that our linking method, using complete-linkage clustering, provides a relative improvement of 20% and 29% in attribution error rate (AER), over the AC and ACR systems, respectively.
Speaker attribution of multiple telephone conversations using a complete-linkage clustering approach
Resumo:
In this paper we propose and evaluate a speaker attribution system using a complete-linkage clustering method. Speaker attribution refers to the annotation of a collection of spoken audio based on speaker identities. This can be achieved using diarization and speaker linking. The main challenge associated with attribution is achieving computational efficiency when dealing with large audio archives. Traditional agglomerative clustering methods with model merging and retraining are not feasible for this purpose. This has motivated the use of linkage clustering methods without retraining. We first propose a diarization system using complete-linkage clustering and show that it outperforms traditional agglomerative and single-linkage clustering based diarization systems with a relative improvement of 40% and 68%, respectively. We then propose a complete-linkage speaker linking system to achieve attribution and demonstrate a 26% relative improvement in attribution error rate (AER) over the single-linkage speaker linking approach.
Resumo:
Buildings are key mediators between human activity and the environment around them, but details of energy usage and activity in buildings is often poorly communicated and understood. ECOS is an Eco-Visualization project that aims to contextualize the energy generation and consumption of a green building in a variety of different climates. The ECOS project is being developed for a large public interactive space installed in the new Science and Engineering Centre of the Queensland University of Technology that is dedicated to delivering interactive science education content to the public. This paper focuses on how design can develop ICT solutions from large data sets to create meaningful engagement with environmental data.