951 results for Spatial Data Quality
Abstract:
The evaluation of geospatial data quality and trustworthiness presents a major challenge to geospatial data users when making a dataset selection decision. The research presented here therefore focused on defining and developing a GEO label – a decision support mechanism to assist data users in efficient and effective geospatial dataset selection on the basis of quality, trustworthiness and fitness for use. This thesis presents six phases of research and development conducted to: (1) identify the informational aspects upon which users rely when assessing geospatial dataset quality and trustworthiness; (2) elicit initial user views on the GEO label role in supporting dataset comparison and selection; (3) evaluate prototype label visualisations; (4) develop a Web service to support GEO label generation; (5) develop a prototype GEO label-based dataset discovery and intercomparison decision support tool; and (6) evaluate the prototype tool in a controlled human-subject study. The results of the studies revealed, and subsequently confirmed, eight geospatial data informational aspects that users considered important when evaluating geospatial dataset quality and trustworthiness, namely: producer information, producer comments, lineage information, compliance with standards, quantitative quality information, user feedback, expert reviews, and citation information. Following an iterative user-centred design (UCD) approach, it was established that the GEO label should visually summarise the availability of these key informational aspects and allow their interrogation. A Web service was developed to support generation of dynamic GEO label representations and was integrated into a number of real-world GIS applications. The service was also used in the development of the GEO LINC tool – a GEO label-based dataset discovery and intercomparison decision support tool. The results of the final evaluation study indicated that (a) the GEO label effectively communicates the availability of dataset quality and trustworthiness information and (b) GEO LINC successfully facilitates ‘at a glance’ dataset intercomparison and fitness for purpose-based dataset selection.
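The label design hinges on summarising the availability of the eight informational aspects listed above. Purely as an illustrative sketch (the aspect keys and the function below are hypothetical and are not the GEO label Web service API described in the thesis), availability could be derived from a dataset's metadata record like this:

```python
# Hypothetical illustration only -- these names and this function are not the
# actual GEO label Web service API.
ASPECTS = [
    "producer_information", "producer_comments", "lineage_information",
    "compliance_with_standards", "quantitative_quality_information",
    "user_feedback", "expert_reviews", "citation_information",
]

def summarise_availability(metadata_record: dict) -> dict:
    """Report, per informational aspect, whether the metadata record provides it."""
    return {aspect: bool(metadata_record.get(aspect)) for aspect in ASPECTS}

# A record offering three of the eight aspects; a label would show the rest as missing.
record = {
    "producer_information": "Producer X",
    "lineage_information": "Derived from survey Y",
    "compliance_with_standards": "ISO 19115",
}
print(summarise_availability(record))
```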
Abstract:
With the exponential growth in the usage of web-based map services, web GIS applications have become more and more popular. Spatial data indexing, search, analysis, visualization and the resource management of such services are becoming increasingly important to deliver user-desired Quality of Service. First, spatial indexing is typically time-consuming and is not available to end-users. To address this, we introduce TerraFly sksOpen, an open-source Online Indexing and Querying System for Big Geospatial Data. Integrated with the TerraFly Geospatial database [1-9], sksOpen is an efficient indexing and query engine for processing Top-k Spatial Boolean Queries. Further, we provide ergonomic visualization of query results on interactive maps to facilitate the user’s data analysis. Second, due to the highly complex and dynamic nature of GIS systems, it is quite challenging for end users to quickly understand and analyze spatial data, and to efficiently share their own data and analysis results with others. Built on the TerraFly Geospatial database, TerraFly GeoCloud is an extra layer running upon the TerraFly map that can efficiently support many different visualization functions and spatial data analysis models. Furthermore, users can create unique URLs to visualize and share the analysis results. TerraFly GeoCloud also provides the MapQL technology to customize map visualization using SQL-like statements [10]. Third, map systems often serve dynamic web workloads and involve multiple CPU- and I/O-intensive tiers, which makes it challenging to meet the response time targets of map requests while using resources efficiently. Virtualization facilitates the deployment of web map services and improves their resource utilization through encapsulation and consolidation. Autonomic resource management allows resources to be automatically provisioned to a map service and its internal tiers on demand. v-TerraFly is a set of techniques to predict the demand of map workloads online and optimize resource allocations, considering both response time and data freshness as the QoS target. The proposed v-TerraFly system is prototyped on TerraFly, a production web map service, and evaluated using real TerraFly workloads. The results show that v-TerraFly predicts workload demands 18.91% more accurately and allocates resources efficiently to meet the QoS target, improving QoS by 26.19% and saving 20.83% of resource usage compared to traditional peak-load-based resource allocation.
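The core query type served by sksOpen is the Top-k Spatial Boolean Query: return the k nearest objects that satisfy a Boolean keyword predicate. The sketch below illustrates only the semantics of such a query on a toy, hypothetical data model; it is a linear scan rather than sksOpen's index, and it is not the sksOpen or MapQL API:

```python
# Illustrative sketch of a top-k spatial Boolean query (hypothetical data model).
import heapq
import math

def top_k_spatial_boolean(objects, query_point, required_keywords, k):
    """Return the k objects nearest to query_point whose keyword sets
    contain all required keywords (the Boolean part of the query)."""
    qx, qy = query_point
    matches = (
        (math.hypot(o["x"] - qx, o["y"] - qy), o["name"])
        for o in objects
        if required_keywords <= o["keywords"]
    )
    return heapq.nsmallest(k, matches)

pois = [
    {"name": "Cafe A", "x": 1.0, "y": 2.0, "keywords": {"coffee", "wifi"}},
    {"name": "Cafe B", "x": 0.5, "y": 0.5, "keywords": {"coffee"}},
    {"name": "Hotel C", "x": 2.0, "y": 1.0, "keywords": {"hotel", "wifi"}},
]
print(top_k_spatial_boolean(pois, (0.0, 0.0), {"coffee", "wifi"}, k=2))
```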
Abstract:
Ensemble stream modeling and data-cleaning are sensor information processing systems that have different training and testing methods by which their goals are cross-validated. This research examines a mechanism which seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events to eliminate uncorrelated noise and to choose the most likely model without overfitting, thus obtaining higher model confidence. Higher quality streams can be realized by combining many short streams into an ensemble that has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction for events such as bush or natural forest fires, we take the burnt area (BA*), sensed ground truth obtained from logs, as our target variable. Even though this is an obvious model choice, the results are disappointing, for two reasons: first, the histogram of fire activity is highly skewed; second, the measured sensor parameters are highly correlated. Since using non-descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from the sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize the training error. We use the F-measure, which combines precision and recall, to determine the false alarm rate of fire events. The multi-target data-cleaning trees use the information purity of the target leaf nodes to learn higher order features. A sensitive variance measure such as the F-test is performed at each node's split to select the best attribute. The ensemble stream modeling approach proved to improve when complicated features were used with a simpler tree classifier. The ensemble framework for data-cleaning and the enhancements to quantify the quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of sensors led to the formation of streams for sensor-enabled applications, which further motivates the novelty of stream quality labeling and its importance in handling the vast amounts of real-time mobile streams generated today.
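For reference, the F-measure used here is the standard harmonic mean of precision and recall:

F_1 = \frac{2\,\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}, \qquad F_\beta = (1+\beta^2)\,\frac{\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\,\mathrm{precision}+\mathrm{recall}}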
Abstract:
Collecting ground truth data is an important step to be accomplished before performing a supervised classification. However, its quality depends on human, financial and time resources. It is therefore important to apply a validation process to assess the reliability of the acquired data. In this study, agricultural information was collected in the Brazilian Amazonian State of Mato Grosso in order to map crop expansion based on MODIS EVI temporal profiles. The field work was carried out through interviews for the years 2005-2006 and 2006-2007. This work presents a methodology to validate the training data quality and determine the optimal sample to be used according to the classifier employed. The technique is based on the detection of outlier pixels for each class and is carried out by computing the Mahalanobis distance for each pixel. The higher the distance, the further the pixel is from the class centre. Preliminary observations based on the coefficient of variation validate the efficiency of the technique in detecting outliers. Then, various subsamples are defined by applying different thresholds to exclude outlier pixels from the classification process. The classification results prove the robustness of the Maximum Likelihood and Spectral Angle Mapper classifiers: indeed, those classifiers were insensitive to outlier exclusion. On the contrary, the decision tree classifier showed better results when 7.5% of the pixels in the training data were deleted. The technique managed to detect outliers for all classes. In this study, few outliers were present in the training data, so the classification quality was not deeply affected by them.
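The outlier score used in this validation is the standard Mahalanobis distance of a pixel's feature vector x from its class centre, with class mean vector \mu and covariance matrix \Sigma:

D_M(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^{\top}\,\boldsymbol{\Sigma}^{-1}\,(\mathbf{x}-\boldsymbol{\mu})}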
Abstract:
This paper examines the spatial pattern of ill-defined causes of death across Brazilian regions, and its relationship with the evolution of the completeness of the death registry and changes in the mortality age profile. We make use of the Brazilian Health Informatics Department mortality database and population censuses from 1980 to 2010. We applied demographic methods to evaluate the quality of mortality data for 137 small areas and to correct for under-registration of death counts when necessary. The second part of the analysis uses linear regression models to investigate the relationship between, on the one hand, changes in death counts coverage and the age profile of mortality, and on the other, changes in the reporting of ill-defined causes of death. The completeness of death counts coverage increased from about 80% in 1980-1991 to over 95% in 2000-2010, while the percentage of ill-defined causes of death was reduced by about 53% in the country. The analysis suggests that the government's efforts to improve data quality are proving successful, and that they will allow for a better understanding of the dynamics of health and the mortality transition.
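The abstract does not give the exact model specification; schematically, the second-stage regressions relate changes across the 137 small areas in the form

\Delta\,\mathrm{IllDefined}_i = \beta_0 + \beta_1\,\Delta\,\mathrm{Coverage}_i + \beta_2\,\Delta\,\mathrm{AgeProfile}_i + \varepsilon_i,

where i indexes the small areas. This is a hedged reading of the description above, not the authors' published equation.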
Abstract:
Much information on the flavonoid content of Brazilian foods has already been obtained; however, this information is spread across scientific publications and non-published data. The objectives of this work were to compile and evaluate the quality of national flavonoid data according to the United States Department of Agriculture's Data Quality Evaluation System (USDA-DQES), with few modifications, for future dissemination in the TBCA-USP (Brazilian Food Composition Database). For the compilation, the most abundant compounds in the flavonoid subclasses were considered (flavonols, flavones, isoflavones, flavanones, flavan-3-ols, and anthocyanidins), and analysis of the compounds by HPLC was adopted as the criterion for data inclusion. The evaluation system considers five categories, and the maximum score assigned to each category is 20. A confidence code (CC) was attributed to each data point (A, B, C or D), indicating the quality and reliability of the information. Flavonoid data (773) present in 197 Brazilian foods were evaluated. The CC "C" (average) was attributed to 99% of the data and "B" (above average) to 1%. The main categories assigned low average scores were number of samples, sampling plan and analytical quality control (average scores of 2, 5 and 4, respectively). The analytical method category received an average score of 9. The category assigned the highest score was sample handling (average of 20). These results show that researchers need to be conscious of the importance of the number and plan of evaluated samples and of the complete description and documentation of all the processes of methodology execution and analytical quality control. (C) 2010 Elsevier Inc. All rights reserved.
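As a rough, purely illustrative arithmetic check (the USDA-DQES confidence-code thresholds themselves are not reproduced here), the reported average category scores sum to well under half of the 100-point maximum available from five categories of 20 points each, which is consistent with most data receiving the mid-range code:

```python
# Illustrative arithmetic only: sum of the reported average category scores
# out of a possible 100 (five categories, 20 points each).
average_category_scores = {
    "number_of_samples": 2,
    "sampling_plan": 5,
    "analytical_quality_control": 4,
    "analytical_method": 9,
    "sample_handling": 20,
}
print(sum(average_category_scores.values()))  # 40 out of 100
```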
Abstract:
The Brazilian Network of Food Data Systems (BRASILFOODS) has maintained the Brazilian Food Composition Database-USP (TBCA-USP) (http://www.fcf.usp.br/tabela) since 1998. Besides the constant compilation, analysis and update work on the database, the network tries to innovate through the introduction of food information that may contribute to decreasing the risk of non-communicable chronic diseases, such as the profile of carbohydrates and flavonoids in foods. In 2008, individually analyzed data on the carbohydrates of 112 foods, and 41 data points related to the glycemic response produced by foods widely consumed in the country, were included in the TBCA-USP. Data (773) on the different flavonoid subclasses of 197 Brazilian foods were compiled, and the quality of each data point was evaluated according to the USDA's data quality evaluation system. In 2007, BRASILFOODS/USP and INFOODS/FAO organized the 7th International Food Data Conference "Food Composition and Biodiversity". This conference was a unique opportunity for interaction between renowned researchers and participants from several countries, and it allowed the discussion of aspects that may improve the food composition area. During this period, the LATINFOODS Regional Technical Compilation Committee and BRASILFOODS disseminated to Latin America the Form and Manual for Data Compilation, version 2009, taught a Food Composition Data Compilation course and developed many activities related to data production and compilation. (C) 2010 Elsevier Inc. All rights reserved.
Abstract:
Soil erosion is a major environmental issue in Australia. It reduces land productivity and has off-site effects of decreased water quality. Broad-scale, spatially distributed soil erosion estimation is essential for prioritising erosion control programs and as a component of broader assessments of natural resource condition. This paper describes spatial modelling methods and results that predict sheetwash and rill erosion over the Australian continent using the revised universal soil loss equation (RUSLE) and spatial data layers for each of the contributing environmental factors. The RUSLE has been used before in this way, but here we advance the quality of estimation. We use time series of remote sensing imagery and daily rainfall to incorporate the effects of seasonally varying cover and rainfall intensity, and use new digital maps of soil and terrain properties. The results are compared with a compilation of Australian erosion plot data, revealing an acceptable consistency between predictions and observations. The modelling results show that: (1) the northern part of Australia has greater erosion potential than the south; (2) erosion potential differs significantly between summer and winter; (3) the average erosion rate is 4.1 t/ha/year over the continent and about 2.9 × 10^9 tonnes of soil is moved annually, which represents 3.9% of global soil erosion from 5% of the world's land area; and (4) the erosion rate has increased by 4 to 33 times on average for agricultural lands compared with most naturally vegetated lands.
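The RUSLE referred to here is the standard multiplicative soil loss model

A = R \cdot K \cdot L \cdot S \cdot C \cdot P,

where A is the mean annual soil loss, R the rainfall-runoff erosivity factor, K the soil erodibility factor, L and S the slope length and steepness factors, C the cover-management factor and P the support practice factor; the time series of imagery and daily rainfall described above feed the seasonally varying C and R terms.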
Abstract:
Spatial data is now used extensively in the Web environment, providing online customized maps and supporting map-based applications. The full potential of Web-based spatial applications, however, has yet to be achieved due to performance issues related to the large sizes and high complexity of spatial data. In this paper, we introduce a multiresolution approach to spatial data management and query processing such that the database server can choose spatial data at the right resolution level for different Web applications. One highly desirable property of the proposed approach is that the server-side processing cost and network traffic can be reduced when the level of resolution required by applications is low. Another advantage is that our approach pushes complex multiresolution structures and algorithms into the spatial database engine, so that the developer of spatial Web applications need not be concerned with such complexity. This paper explains the basic idea, technical feasibility and applications of multiresolution spatial databases.
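As a minimal sketch of the multiresolution idea (a hypothetical scheme, not the paper's data structures or API), the server can map a requested map scale to a precomputed generalisation level so that low-resolution requests transfer less data:

```python
# Hypothetical scheme: pick a stored generalisation level from the requested
# map scale denominator, so coarse requests move less geometry.
LEVELS = [          # (max map scale denominator, resolution level)
    (25_000, 0),    # full detail
    (250_000, 1),   # simplified geometry
    (2_500_000, 2), # coarse geometry
]

def resolution_level(scale_denominator: int) -> int:
    """Return the coarsest stored level that still satisfies the request."""
    for max_scale, level in LEVELS:
        if scale_denominator <= max_scale:
            return level
    return LEVELS[-1][1]

print(resolution_level(100_000))    # -> 1
print(resolution_level(5_000_000))  # -> 2
```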
Abstract:
With the proliferation of relational database programs for PCs and other platforms, many business end-users are creating, maintaining, and querying their own databases. More importantly, business end-users use the output of these queries as the basis for operational, tactical, and strategic decisions. Inaccurate data reduce the expected quality of these decisions. Implementing various input validation controls, including higher levels of normalisation, can reduce the number of data anomalies entering the databases. Even in well-maintained databases, however, data anomalies will still accumulate. To improve the quality of data, databases can be queried periodically to locate and correct anomalies. This paper reports the results of two experiments that investigated the effects of different data structures on business end-users' abilities to detect data anomalies in a relational database. The results demonstrate that both unnormalised data structures and higher levels of normalisation lower the effectiveness and efficiency of queries relative to first normal form. First normal form databases appear to provide the most effective and efficient data structure for business end-users formulating queries to detect data anomalies.
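As an illustration of the kind of anomaly-detection query the experiments examined (a hypothetical schema, not the paper's experimental materials), the following flags rows in a first normal form table where one customer ID is stored with conflicting names:

```python
# Hypothetical 1NF table; the query surfaces a functional-dependency violation
# (cust_id -> cust_name) as a data anomaly.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, cust_name TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, "Acme Ltd"), (2, 10, "Acme Limited"), (3, 20, "Bolt & Co")],
)
anomalies = con.execute(
    """SELECT cust_id, COUNT(DISTINCT cust_name) AS name_variants
       FROM orders GROUP BY cust_id
       HAVING COUNT(DISTINCT cust_name) > 1"""
).fetchall()
print(anomalies)  # [(10, 2)] -> cust_id 10 has two conflicting names
```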
Abstract:
Most Internet search engines are keyword-based. They are not efficient for queries where geographical location is important, such as finding hotels within an area or close to a place of interest. A natural interface for spatial searching is a map, which can be used not only to display the locations of search results but also to assist in forming search conditions. A map-based search engine requires a well-designed visual interface that is intuitive to use yet flexible and expressive enough to support various types of spatial queries as well as aspatial queries. Similar to hyperlinks for text and images in an HTML page, spatial objects in a map should support hyperlinks. Such an interface needs to be scalable with the size of the geographical regions and the number of websites it covers. Despite typically handling a very large amount of spatial data, a map-based search interface should meet the expectation of fast response times for interactive applications. In this paper we discuss general requirements and the design for a new map-based web search interface, focusing on integration with the WWW and the visual spatial query interface. A number of current and future research issues are discussed, and a prototype for the University of Queensland is presented. (C) 2001 Published by Elsevier Science Ltd.
Abstract:
This paper presents the SmartClean tool. The purpose of this tool is to detect and correct data quality problems (DQPs). Compared with existing tools, SmartClean has the following main advantage: the user does not need to specify the execution sequence of the data cleaning operations. To that end, an execution sequence was developed; the problems are manipulated (i.e., detected and corrected) following that sequence, which also supports the incremental execution of the operations. In this paper, the underlying architecture of the tool is presented and its components are described in detail. The validity of the tool and, consequently, of the architecture is demonstrated through the presentation of a case study. Although SmartClean has cleaning capabilities at the other levels as well, this paper only describes those related to the attribute value level.
Abstract:
The emergence of new business models, namely the establishment of partnerships between organizations, and the possibility for companies to add existing data from the web, especially the semantic web, to their own information, have highlighted some problems existing in databases, particularly those related to data quality. Poor data can result in a loss of competitiveness for the organizations holding these data, and may even lead to their disappearance, since many of their decision-making processes are based on these data. For this reason, data cleaning is essential. Current approaches to solving these problems are closely tied to database schemas and specific domains. For data cleaning to be usable across different repositories, computer systems need to understand these data, i.e., an associated semantics is needed. The solution presented in this paper includes the use of ontologies: (i) for the specification of data cleaning operations and (ii) as a way of solving the semantic heterogeneity problems of data stored in different sources. With data cleaning operations defined at a conceptual level, and with mappings between domain ontologies and an ontology derived from a database, the operations may be instantiated and proposed to the expert/specialist to be executed over that database, thus enabling their interoperability.
Abstract:
MSc Dissertation in Computer Engineering
Abstract:
A thesis submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Information Systems