930 results for data availability
Abstract:
The use of hedonic models to estimate the effects of various factors on house prices is well established. This paper examines a number of international hedonic house price models that seek to quantify the effect of infrastructure charges on new house prices. This work is an important factor in the housing affordability debate: many governments in high-growth areas operate user-pays infrastructure charging policies in tandem with housing affordability objectives, with no empirical evidence on the impact of one on the other. This research finds little consistency between existing models and the data sets utilised. Model specification appears to depend on data availability rather than sound theoretical grounding, which may lead to a lack of external validity.
Abstract:
King, R. D. and Wise, P. H. and Clare, A. (2004) Confirmation of Data Mining Based Predictions of Protein Function. Bioinformatics 20(7), 1110-1118
Abstract:
Background. The assembly of the tree of life has seen significant progress in recent years but algae and protists have been largely overlooked in this effort. Many groups of algae and protists have ancient roots and it is unclear how much data will be required to resolve their phylogenetic relationships for incorporation in the tree of life. The red algae, a group of primary photosynthetic eukaryotes more than a billion years old, provide the earliest fossil evidence for eukaryotic multicellularity and sexual reproduction. Despite this evolutionary significance, their phylogenetic relationships are understudied. This study aims to infer a comprehensive red algal tree of life at the family level from a supermatrix containing data mined from GenBank. We aim to locate remaining regions of low support in the topology, evaluate their causes and estimate the amount of data required to resolve them. Results. Phylogenetic analysis of a supermatrix of 14 loci and 98 red algal families yielded the most complete red algal tree of life to date. Visualization of statistical support showed the presence of five poorly supported regions. Causes for low support were identified with statistics about the age of the region, data availability and node density, showing that poor support has different origins in different parts of the tree. Parametric simulation experiments yielded optimistic estimates of how much data will be needed to resolve the poorly supported regions (ca. 10^3 to ca. 10^4 nucleotides for the different regions). Nonparametric simulations gave a markedly more pessimistic image, some regions requiring more than 2.8 × 10^5 nucleotides or not achieving the desired level of support at all. The discrepancies between parametric and nonparametric simulations are discussed in light of our dataset and known attributes of both approaches. Conclusions. Our study takes the red algae one step closer to meaningful inclusion in the tree of life.
In addition to the recovery of stable relationships, the recognition of five regions in need of further study is a significant outcome of this work. Based on our analyses of current availability and future requirements of data, we make clear recommendations for forthcoming research.
Abstract:
Data caching is an important technique in mobile computing environments for improving data availability and access latencies, particularly because these environments are characterized by narrow-bandwidth wireless links and frequent disconnections. The cache replacement policy plays a vital role in improving performance in a cached mobile environment, since the amount of data stored in a client cache is small. In this paper we review some of the well-known cache replacement policies proposed for mobile data caches. We compare these policies after classifying them by the criteria used for evicting documents. In addition, this paper suggests some alternative techniques for cache replacement.
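As a point of reference for what a replacement policy does, the following is a minimal sketch of least-recently-used (LRU) eviction, a classic recency-based policy. The abstract does not say which policies the paper reviews, so LRU is an assumption chosen only for illustration, and the class name `LRUCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Illustrative LRU cache (hypothetical; not taken from the reviewed paper)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Keys are kept in recency order: oldest first, newest last.
        self._items = OrderedDict()

    def get(self, key, default=None):
        """Return the cached value and mark the key as most recently used."""
        if key not in self._items:
            return default
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Store a value, evicting the least recently used key if full."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Evict from the front, i.e. the least recently used entry.
            self._items.popitem(last=False)

    def keys(self):
        """Return keys from least to most recently used."""
        return list(self._items)
```

Recency is the only eviction criterion here; policies aimed at mobile clients may weigh other criteria as well, which this sketch does not attempt to model.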
Abstract:
In a world of almost permanent and rapidly increasing electronic data availability, techniques for filtering, compressing, and interpreting this data to transform it into valuable and easily comprehensible information are of utmost importance. One key topic in this area is the capability to deduce future system behavior from a given data input. This book brings together for the first time the complete theory of data-based neurofuzzy modelling and the linguistic attributes of fuzzy logic in a single cohesive mathematical framework. After introducing the basic theory of data-based modelling, new concepts including extended additive and multiplicative submodels are developed and their extensions to state estimation and data fusion are derived. All these algorithms are illustrated with benchmark and real-life examples to demonstrate their efficiency. Chris Harris and his group have carried out pioneering work which has tied together the fields of neural networks and linguistic rule-based algorithms. This book is aimed at researchers and scientists in time series modeling, empirical data modeling, knowledge discovery, data mining, and data fusion.
Abstract:
Pasture-based ruminant production systems are common in certain areas of the world, but energy evaluation in grazing cattle is performed with equations developed, in their majority, with sheep or cattle fed total mixed rations. The aim of the current study was to develop predictions of metabolisable energy (ME) concentrations in fresh-cut grass offered to non-pregnant non-lactating cows at maintenance energy level, which may be more suitable for grazing cattle. Data were collected from three digestibility trials performed over consecutive grazing seasons. In order to cover a range of commercial conditions and data availability in pasture-based systems, thirty-eight equations for the prediction of energy concentrations and ratios were developed. An internal validation was performed for all equations and also for existing predictions of grass ME. Prediction error for ME using nutrient digestibility was lowest when gross energy (GE) or organic matter digestibilities were used as sole predictors, while the addition of grass nutrient contents reduced the difference between predicted and actual values, and explained more variation. Addition of N, GE and diethyl ether extract (EE) contents improved accuracy when digestible organic matter in DM was the primary predictor. When digestible energy was the primary explanatory variable, prediction error was relatively low, but addition of water-soluble carbohydrates, EE and acid-detergent fibre contents of grass decreased prediction error. Equations developed in the current study showed lower prediction errors when compared with those of existing equations, and may thus allow for an improved prediction of ME in practice, which is critical for the sustainability of pasture-based systems.
Abstract:
Current commercially available Doppler lidars provide an economical and robust solution for measuring vertical and horizontal wind velocities, together with the ability to provide co- and cross-polarised backscatter profiles. The high temporal resolution of these instruments allows turbulent properties to be obtained from studying the variation in radial velocities. However, the instrument specifications mean that certain characteristics, especially the background noise behaviour, become a limiting factor for the instrument sensitivity in regions where the aerosol load is low. Turbulent calculations require an accurate estimate of the contribution from velocity uncertainty estimates, which are directly related to the signal-to-noise ratio. Any bias in the signal-to-noise ratio will propagate through as a bias in turbulent properties. In this paper we present a method to correct for artefacts in the background noise behaviour of commercially available Doppler lidars and reduce the signal-to-noise ratio threshold used to discriminate between noise, and cloud or aerosol signals. We show that, for Doppler lidars operating continuously at a number of locations in Finland, the data availability can be increased by as much as 50 % after performing this background correction and subsequent reduction in the threshold. The reduction in bias also greatly improves subsequent calculations of turbulent properties in weak signal regimes.
Abstract:
Motivation: Array CGH technologies enable the simultaneous measurement of DNA copy number for thousands of sites on a genome. We developed the circular binary segmentation (CBS) algorithm to divide the genome into regions of equal copy number (Olshen et al., 2004). The algorithm tests for change-points using a maximal t-statistic with a permutation reference distribution to obtain the corresponding p-value. The number of computations required for the maximal test statistic is O(N^2), where N is the number of markers. This makes the full permutation approach computationally prohibitive for the newer arrays that contain tens of thousands of markers and highlights the need for a faster algorithm. Results: We present a hybrid approach to obtain the p-value of the test statistic in linear time. We also introduce a rule for stopping early when there is strong evidence for the presence of a change. We show through simulations that the hybrid approach provides a substantial gain in speed with only a negligible loss in accuracy and that the stopping rule further increases speed. We also present the analysis of array CGH data from a breast cancer cell line to show the impact of the new approaches on the analysis of real data. Availability: An R (R Development Core Team, 2006) version of the CBS algorithm has been implemented in the "DNAcopy" package of the Bioconductor project (Gentleman et al., 2004). The proposed hybrid method for the p-value is available in version 1.2.1 or higher and the stopping rule for declaring a change early is available in version 1.5.1 or higher.
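To make the cost of the full permutation approach concrete, here is a rough sketch of a maximal t-type statistic over all contiguous arcs, with a plain Monte Carlo permutation p-value. It is not the DNAcopy implementation and not the hybrid method of the abstract: the pooled-variance form of the statistic and the names `max_arc_t` and `permutation_p_value` are assumptions made for illustration.

```python
import math
import random


def max_arc_t(values):
    """Return (largest |t|, (i, j)) over all arcs values[i:j] versus the rest.

    Every pair i < j is scanned, so the work grows as O(N^2) in the number
    of markers; prefix sums keep each individual arc O(1).
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)
    best_t, best_arc = 0.0, None
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i          # markers inside the arc
            m = n - k          # markers outside the arc
            if m == 0:
                continue       # the arc must leave something to compare with
            s_in = prefix[j] - prefix[i]
            s_out = prefix[n] - s_in
            q_in = prefix_sq[j] - prefix_sq[i]
            q_out = prefix_sq[n] - q_in
            # Pooled within-group sum of squares (assumed form of the statistic).
            ss = (q_in - s_in * s_in / k) + (q_out - s_out * s_out / m)
            pooled_var = ss / (n - 2)
            if pooled_var <= 1e-12:
                continue       # skip degenerate splits with no spread
            t = abs(s_in / k - s_out / m) / math.sqrt(pooled_var * (1.0 / k + 1.0 / m))
            if t > best_t:
                best_t, best_arc = t, (i, j)
    return best_t, best_arc


def permutation_p_value(values, n_perm=200, seed=0):
    """Estimate the p-value of the maximal statistic by shuffling the markers."""
    rng = random.Random(seed)
    observed, _ = max_arc_t(values)
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_arc_t(shuffled)[0] >= observed:
            exceed += 1
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_perm + 1)
```

Each permutation repeats the full quadratic scan, so the total cost is the number of permutations times N^2. That quickly becomes impractical at tens of thousands of markers, which is the bottleneck the abstract describes.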
Abstract:
The Weddell Gyre plays a crucial role in the regulation of climate by transferring heat into the deep ocean through deep and bottom water mass formation. However, our understanding of Weddell Gyre water mass properties is limited to regions of data availability, primarily along the Prime Meridian. The aim is to provide a dataset of the upper water column properties of the entire Weddell Gyre. Objective mapping was applied to Argo float data in order to produce spatially gridded, time composite maps of temperature and salinity for fixed pressure levels ranging from 50 to 2000 dbar, as well as temperature, salinity and pressure at the level of the sub-surface temperature maximum. While the data are currently too limited to incorporate time into the gridded structure, the data are extensive enough to produce maps of the entire region across three time composite periods (2002-2005, 2006-2009 and 2010-2013), which can be used to determine how representative conclusions drawn from data collected along general RV transect lines are on a gyre scale perspective. The time composite data sets are provided as netCDF files; one for each time period. Mapped fields of conservative temperature, absolute salinity and potential density are provided for 41 vertical pressure levels. The above variables as well as pressure are provided at the level of the sub-surface temperature maximum. Corresponding mapping errors are also included in the netCDF files. Further details are provided in the global attributes, such as the unit variables and structure of the corresponding data array (i.e. latitude x longitude x vertical pressure level). In addition, all files ending in "_potTpSal" provide mapped fields of potential temperature and practical salinity.
The Long-Term impact of Business Support? - Exploring the Role of Evaluation Timing using Micro Data
Abstract:
The original contribution of this work is threefold. Firstly, this thesis develops a critical perspective on current evaluation practice of business support, with focus on the timing of evaluation. The general time frame applied for business support policy evaluation is limited to one to two, seldom three years post intervention. This is despite calls for long-term impact studies by various authors, concerned about time lags before effects are fully realised. This desire for long-term evaluation opposes the requirements by policy-makers and funders, seeking quick results. Also, current ‘best practice’ frameworks do not refer to timing or its implications, and data availability affects the ability to undertake long-term evaluation. Secondly, this thesis provides methodological value for follow-up and similar studies by using data linking of scheme-beneficiary data with official performance datasets. Thus data availability problems are avoided through the use of secondary data. Thirdly, this thesis builds the evidence, through the application of a longitudinal impact study of small business support in England, covering seven years of post intervention data. This illustrates the variability of results for different evaluation periods, and the value in using multiple years of data for a robust understanding of support impact. For survival, impact of assistance is found to be immediate, but limited. Concerning growth, significant impact centres on a two to three year period post intervention for the linear selection and quantile regression models – positive for employment and turnover, negative for productivity. Attribution of impact may present a problem for subsequent periods. The results clearly support the argument for the use of longitudinal data and analysis, and a greater appreciation by evaluators of the factor time. This analysis recommends a time frame of four to five years post intervention for soft business support evaluation.
Abstract:
This research presents several components encompassing the scope of the objective of Data Partitioning and Replication Management in Distributed GIS Database. Modern Geographic Information Systems (GIS) databases are often large and complicated. Therefore data partitioning and replication management problems need to be addressed in the development of an efficient and scalable solution. Part of the research is to study the patterns of geographical raster data processing and to propose algorithms to improve the availability of such data. These algorithms and approaches target the granularity of geographic data objects as well as data partitioning in geographic databases, to achieve high data availability and Quality of Service (QoS) considering distributed data delivery and processing. To achieve this goal, a dynamic, real-time approach for mosaicking digital images of different temporal and spatial characteristics into tiles is proposed. This dynamic approach reuses digital images upon demand and generates mosaicked tiles only for the required region according to the user's requirements, such as resolution, temporal range, and target bands, to reduce redundancy in storage and to utilize available computing and storage resources more efficiently. Another part of the research pursued methods for efficiently acquiring GIS data from external heterogeneous databases and Web services, as well as end-user GIS data delivery enhancements, automation and 3D virtual reality presentation. Vast numbers of computing, network, and storage resources available on the Internet are idle or not fully utilized. The proposed "Crawling Distributed Operating System" (CDOS) approach employs such resources and creates benefits for the hosts that lend their CPU, network, and storage resources to be used in the GIS database context. The results of this dissertation demonstrate effective ways to develop a highly scalable GIS database.
The approach developed in this dissertation has resulted in the creation of the TerraFly GIS database, which is used by the US government, researchers, and the general public to facilitate Web access to remotely sensed imagery and GIS vector information.
Abstract:
In today’s big data world, data is being produced in massive volumes, at great velocity and from a variety of different sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are being increasingly used to derive value out of this big data. A large portion of this data is being stored and processed in the Cloud due to the several advantages it provides, such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments. In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully.
I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system that utilizes workload-aware data placement and replication to minimize the number of distributed transactions, and that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime. In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has been traditionally used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides the data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics. Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud.
The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading them onto distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
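To illustrate what a multi-hop neighborhood is in this setting, the sketch below extracts a k-hop ego network from an in-memory adjacency list using breadth-first search. It is not NSCALE's API or execution model; the function names `k_hop_neighborhood` and `induced_subgraph` and the dictionary-based graph representation are assumptions made for this example.

```python
from collections import deque


def k_hop_neighborhood(adjacency, source, k):
    """Return the set of vertices reachable from source in at most k hops."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == k:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[vertex] + 1
                queue.append(neighbor)
    return set(hops)


def induced_subgraph(adjacency, vertices):
    """Keep only the given vertices and the edges that run between them."""
    return {
        v: [w for w in adjacency.get(v, ()) if w in vertices]
        for v in vertices
    }
```

An analysis task such as ego network analysis would then run on the extracted subgraph as a whole, rather than through per-vertex programs that see only one vertex's state at a time.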
Abstract:
Due to the sensitive nature of patient data, the secondary use of electronic health records (EHR) is restricted in scientific research and product development. Such restrictions aim to preserve the privacy of the respective patients by limiting the availability and variety of sensitive patient data. Current limitations do not correspond with the actual needs of potential secondary users. In this thesis, the secondary use of Finnish and Swedish EHR data is explored for the purpose of enhancing the availability of such data for clinical research and product development. The EHR-related procedures and technologies involved are analysed to identify the issues limiting the secondary use of patient data. Successful secondary use of patient data increases the value of the data. To explore the identified circumstances, a case study of potential secondary users and use intentions regarding EHR data was carried out in Finland and Sweden. The data for the case study were collected using semi-structured interviews. In total, 14 Finnish and Swedish experts representing scientific research, health management, and business were interviewed. The motivation for the interviews was to evaluate the protection of EHR data used for secondary purposes. The efficiency of the implemented procedures and technologies was analysed in terms of data availability and privacy preservation. The results of the case study show that the factors affecting EHR availability fall into three categories: management of patient data, preservation of patients' privacy, and potential secondary users. Identified issues regarding data management included laborious and inconsistent data request procedures and the role and effect of external service providers. Based on the study findings, two approaches enabling the secondary use of EHR data are identified: data alteration and a protected processing environment.
Data alteration increases the availability of relevant EHR data but decreases the value of such data. The protected processing approach restricts the number of potential users and use intentions while providing more valuable data content.
Abstract:
The aim of the study is to identify the opportunities and challenges a local government public asset manager is most likely to deal with when adopting an appropriate Public Asset Management Framework, especially in developing countries. To achieve this aim, the study employs a case study in Indonesia, collecting data through interviews, document analysis and observations in South Sulawesi Province. The study concludes that there are significant opportunities and challenges that local governments in developing countries, especially Indonesia, may need to manage if they apply a public asset management framework appropriately. The opportunities are a more effective and efficient local government, an accountable and auditable local government organization, an increased local government portfolio, up-to-date information for local government decision makers, and improved quality of public services. The challenges are the lack of a clear legal and institutional framework to support asset management, the non-profit principle of public assets, cross-jurisdictional issues and applications in public asset management, the complexity of public organization objectives, and the availability of the data required for managing public property. The study covers only the conditions of developing countries, with Indonesia as an example, which cannot exactly represent the conditions of all local governments in the world. Further study to develop an asset management system applicable to all local governments in developing countries is urgently needed. The findings of this study will provide useful input for policy makers, scholars and asset management practitioners developing an asset management framework for more efficient and effective local governments.