48 resultados para Data processing Computer science
Resumo:
Spatial data has now been used extensively in the Web environment, providing online customized maps and supporting map-based applications. The full potential of Web-based spatial applications, however, has yet to be achieved due to performance issues related to the large sizes and high complexity of spatial data. In this paper, we introduce a multiresolution approach to spatial data management and query processing such that the database server can choose spatial data at the right resolution level for different Web applications. One highly desirable property of the proposed approach is that the server-side processing cost and network traffic can be reduced when the level of resolution required by applications are low. Another advantage is that our approach pushes complex multiresolution structures and algorithms into the spatial database engine. That is, the developer of spatial Web applications needs not to be concerned with such complexity. This paper explains the basic idea, technical feasibility and applications of multiresolution spatial databases.
Resumo:
Geospatial clustering must be designed in such a way that it takes into account the special features of geoinformation and the peculiar nature of geographical environments in order to successfully derive geospatially interesting global concentrations and localized excesses. This paper examines families of geospaital clustering recently proposed in the data mining community and identifies several features and issues especially important to geospatial clustering in data-rich environments.
Resumo:
Unauthorized accesses to digital contents are serious threats to international security and informatics. We propose an offline oblivious data distribution framework that preserves the sender's security and the receiver's privacy using tamper-proof smart cards. This framework provides persistent content protections from digital piracy and promises private content consumption.
Resumo:
With the proliferation of relational database programs for PC's and other platforms, many business end-users are creating, maintaining, and querying their own databases. More importantly, business end-users use the output of these queries as the basis for operational, tactical, and strategic decisions. Inaccurate data reduce the expected quality of these decisions. Implementing various input validation controls, including higher levels of normalisation, can reduce the number of data anomalies entering the databases. Even in well-maintained databases, however, data anomalies will still accumulate. To improve the quality of data, databases can be queried periodically to locate and correct anomalies. This paper reports the results of two experiments that investigated the effects of different data structures on business end-users' abilities to detect data anomalies in a relational database. The results demonstrate that both unnormalised and higher levels of normalisation lower the effectiveness and efficiency of queries relative to the first normal form. First normal form databases appear to provide the most effective and efficient data structure for business end-users formulating queries to detect data anomalies.
Resumo:
Map algebra is a data model and simple functional notation to study the distribution and patterns of spatial phenomena. It uses a uniform representation of space as discrete grids, which are organized into layers. This paper discusses extensions to map algebra to handle neighborhood operations with a new data type called a template. Templates provide general windowing operations on grids to enable spatial models for cellular automata, mathematical morphology, and local spatial statistics. A programming language for map algebra that incorporates templates and special processing constructs is described. The programming language is called MapScript. Example program scripts are presented to perform diverse and interesting neighborhood analysis for descriptive, model-based and processed-based analysis.
Resumo:
Most Internet search engines are keyword-based. They are not efficient for the queries where geographical location is important, such as finding hotels within an area or close to a place of interest. A natural interface for spatial searching is a map, which can be used not only to display locations of search results but also to assist forming search conditions. A map-based search engine requires a well-designed visual interface that is intuitive to use yet flexible and expressive enough to support various types of spatial queries as well as aspatial queries. Similar to hyperlinks for text and images in an HTML page, spatial objects in a map should support hyperlinks. Such an interface needs to be scalable with the size of the geographical regions and the number of websites it covers. In spite of handling typically a very large amount of spatial data, a map-based search interface should meet the expectation of fast response time for interactive applications. In this paper we discuss general requirements and the design for a new map-based web search interface, focusing on integration with the WWW and visual spatial query interface. A number of current and future research issues are discussed, and a prototype for the University of Queensland is presented. (C) 2001 Published by Elsevier Science Ltd.
Resumo:
Binning and truncation of data are common in data analysis and machine learning. This paper addresses the problem of fitting mixture densities to multivariate binned and truncated data. The EM approach proposed by McLachlan and Jones (Biometrics, 44: 2, 571-578, 1988) for the univariate case is generalized to multivariate measurements. The multivariate solution requires the evaluation of multidimensional integrals over each bin at each iteration of the EM procedure. Naive implementation of the procedure can lead to computationally inefficient results. To reduce the computational cost a number of straightforward numerical techniques are proposed. Results on simulated data indicate that the proposed methods can achieve significant computational gains with no loss in the accuracy of the final parameter estimates. Furthermore, experimental results suggest that with a sufficient number of bins and data points it is possible to estimate the true underlying density almost as well as if the data were not binned. The paper concludes with a brief description of an application of this approach to diagnosis of iron deficiency anemia, in the context of binned and truncated bivariate measurements of volume and hemoglobin concentration from an individual's red blood cells.
Resumo:
Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.
Resumo:
Observations of an insect's movement lead to theory on the insect's flight behaviour and the role of movement in the species' population dynamics. This theory leads to predictions of the way the population changes in time under different conditions. If a hypothesis on movement predicts a specific change in the population, then the hypothesis can be tested against observations of population change. Routine pest monitoring of agricultural crops provides a convenient source of data for studying movement into a region and among fields within a region. Examples of the use of statistical and computational methods for testing hypotheses with such data are presented. The types of questions that can be addressed with these methods and the limitations of pest monitoring data when used for this purpose are discussed. (C) 2002 Elsevier Science B.V. All rights reserved.
Resumo:
We focus on mixtures of factor analyzers from the perspective of a method for model-based density estimation from high-dimensional data, and hence for the clustering of such data. This approach enables a normal mixture model to be fitted to a sample of n data points of dimension p, where p is large relative to n. The number of free parameters is controlled through the dimension of the latent factor space. By working in this reduced space, it allows a model for each component-covariance matrix with complexity lying between that of the isotropic and full covariance structure models. We shall illustrate the use of mixtures of factor analyzers in a practical example that considers the clustering of cell lines on the basis of gene expressions from microarray experiments. (C) 2002 Elsevier Science B.V. All rights reserved.
Resumo:
In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.
Resumo:
Online geographic information systems provide the means to extract a subset of desired spatial information from a larger remote repository. Data retrieved representing real-world geographic phenomena are then manipulated to suit the specific needs of an end-user. Often this extraction requires the derivation of representations of objects specific to a particular resolution or scale from a single original stored version. Currently standard spatial data handling techniques cannot support the multi-resolution representation of such features in a database. In this paper a methodology to store and retrieve versions of spatial objects at, different resolutions with respect to scale using standard database primitives and SQL is presented. The technique involves heavy fragmentation of spatial features that allows dynamic simplification into scale-specific object representations customised to the display resolution of the end-user's device. Experimental results comparing the new approach to traditional R-Tree indexing and external object simplification reveal the former performs notably better for mobile and WWW applications where client-side resources are limited and retrieved data loads are kept relatively small.
Resumo:
The notion of compensation is widely used in advanced transaction models as means of recovery from a failure. Similar concepts are adopted for providing transaction-like behaviour for long business processes supported by workflows technology. In general, it is not trivial to design compensating tasks for tasks in the context of a workflow. Actually, a task in a workflow process does not have to be compensatable in the sense that the forcibility of reverse operations of the task is not always guaranteed by the application semantics. In addition, the isolation requirement on data resources may make a task difficult to compensate. In this paper, we first look into the requirements that a compensating task has to satisfy. Then we introduce a new concept called confirmation. With the help of confirmation, we are able to modify most non-compensatable tasks so that they become compensatable. This can substantially increase the availability of shared resources and greatly improve backward recovery for workflow applications in case of failures. To effectively incorporate confirmation and compensation into a workflow management environment, a three level bottom-up workflow design method is introduced. The implementation issues of this design are also discussed. (C) 2003 Elsevier Science Inc. All rights reserved.
Resumo:
The integration of geo-information from multiple sources and of diverse nature in developing mineral favourability indexes (MFIs) is a well-known problem in mineral exploration and mineral resource assessment. Fuzzy set theory provides a convenient framework to combine and analyse qualitative and quantitative data independently of their source or characteristics. A novel, data-driven formulation for calculating MFIs based on fuzzy analysis is developed in this paper. Different geo-variables are considered fuzzy sets and their appropriate membership functions are defined and modelled. A new weighted average-type aggregation operator is then introduced to generate a new fuzzy set representing mineral favourability. The membership grades of the new fuzzy set are considered as the MFI. The weights for the aggregation operation combine the individual membership functions of the geo-variables, and are derived using information from training areas and L, regression. The technique is demonstrated in a case study of skarn tin deposits and is used to integrate geological, geochemical and magnetic data. The study area covers a total of 22.5 km(2) and is divided into 349 cells, which include nine control cells. Nine geo-variables are considered in this study. Depending on the nature of the various geo-variables, four different types of membership functions are used to model the fuzzy membership of the geo-variables involved. (C) 2002 Elsevier Science Ltd. All rights reserved.
Resumo:
The data structure of an information system can significantly impact the ability of end users to efficiently and effectively retrieve the information they need. This research develops a methodology for evaluating, ex ante, the relative desirability of alternative data structures for end user queries. This research theorizes that the data structure that yields the lowest weighted average complexity for a representative sample of information requests is the most desirable data structure for end user queries. The theory was tested in an experiment that compared queries from two different relational database schemas. As theorized, end users querying the data structure associated with the less complex queries performed better Complexity was measured using three different Halstead metrics. Each of the three metrics provided excellent predictions of end user performance. This research supplies strong evidence that organizations can use complexity metrics to evaluate, ex ante, the desirability of alternate data structures. Organizations can use these evaluations to enhance the efficient and effective retrieval of information by creating data structures that minimize end user query complexity.