47 resultados para computation- and data-intensive applications
Resumo:
Networked information and communication technologies are rapidly advancing the capacities of governments to target and separately manage specific sub-populations, groups and individuals. Targeting uses data profiling to calculate the differential probabilities of outcomes associated with various personal characteristics. This knowledge is used to classify and sort people for differentiated levels of treatment. Targeting is often used to efficiently and effectively target government resources to the most disadvantaged. Although having many benefits, targeting raises several policy and ethical issues. This paper discusses these issues and the policy responses governments may take to maximise the benefits of targeting while ameliorating the negative aspects.
Resumo:
Systems biology is based on computational modelling and simulation of large networks of interacting components. Models may be intended to capture processes, mechanisms, components and interactions at different levels of fidelity. Input data are often large and geographically disperse, and may require the computation to be moved to the data, not vice versa. In addition, complex system-level problems require collaboration across institutions and disciplines. Grid computing can offer robust, scaleable solutions for distributed data, compute and expertise. We illustrate some of the range of computational and data requirements in systems biology with three case studies: one requiring large computation but small data (orthologue mapping in comparative genomics), a second involving complex terabyte data (the Visible Cell project) and a third that is both computationally and data-intensive (simulations at multiple temporal and spatial scales). Authentication, authorisation and audit systems are currently not well scalable and may present bottlenecks for distributed collaboration particularly where outcomes may be commercialised. Challenges remain in providing lightweight standards to facilitate the penetration of robust, scalable grid-type computing into diverse user communities to meet the evolving demands of systems biology.
Resumo:
This paper presents load profiles of electricity customers, using the knowledge discovery in databases (KDD) procedure, a data mining technique, to determine the load profiles for different types of customers. In this paper, the current load profiling methods are compared using data mining techniques, by analysing and evaluating these classification techniques. The objective of this study is to determine the best load profiling methods and data mining techniques to classify, detect and predict non-technical losses in the distribution sector, due to faulty metering and billing errors, as well as to gather knowledge on customer behaviour and preferences so as to gain a competitive advantage in the deregulated market. This paper focuses mainly on the comparative analysis of the classification techniques selected; a forthcoming paper will focus on the detection and prediction methods.
Resumo:
Data mining is the process to identify valid, implicit, previously unknown, potentially useful and understandable information from large databases. It is an important step in the process of knowledge discovery in databases, (Olaru & Wehenkel, 1999). In a data mining process, input data can be structured, seme-structured, or unstructured. Data can be in text, categorical or numerical values. One of the important characteristics of data mining is its ability to deal data with large volume, distributed, time variant, noisy, and high dimensionality. A large number of data mining algorithms have been developed for different applications. For example, association rules mining can be useful for market basket problems, clustering algorithms can be used to discover trends in unsupervised learning problems, classification algorithms can be applied in decision-making problems, and sequential and time series mining algorithms can be used in predicting events, fault detection, and other supervised learning problems (Vapnik, 1999). Classification is among the most important tasks in the data mining, particularly for data mining applications into engineering fields. Together with regression, classification is mainly for predictive modelling. So far, there have been a number of classification algorithms in practice. According to (Sebastiani, 2002), the main classification algorithms can be categorized as: decision tree and rule based approach such as C4.5 (Quinlan, 1996); probability methods such as Bayesian classifier (Lewis, 1998); on-line methods such as Winnow (Littlestone, 1988) and CVFDT (Hulten 2001), neural networks methods (Rumelhart, Hinton & Wiliams, 1986); example-based methods such as k-nearest neighbors (Duda & Hart, 1973), and SVM (Cortes & Vapnik, 1995). Other important techniques for classification tasks include Associative Classification (Liu et al, 1998) and Ensemble Classification (Tumer, 1996).
Resumo:
Binning and truncation of data are common in data analysis and machine learning. This paper addresses the problem of fitting mixture densities to multivariate binned and truncated data. The EM approach proposed by McLachlan and Jones (Biometrics, 44: 2, 571-578, 1988) for the univariate case is generalized to multivariate measurements. The multivariate solution requires the evaluation of multidimensional integrals over each bin at each iteration of the EM procedure. Naive implementation of the procedure can lead to computationally inefficient results. To reduce the computational cost a number of straightforward numerical techniques are proposed. Results on simulated data indicate that the proposed methods can achieve significant computational gains with no loss in the accuracy of the final parameter estimates. Furthermore, experimental results suggest that with a sufficient number of bins and data points it is possible to estimate the true underlying density almost as well as if the data were not binned. The paper concludes with a brief description of an application of this approach to diagnosis of iron deficiency anemia, in the context of binned and truncated bivariate measurements of volume and hemoglobin concentration from an individual's red blood cells.
Resumo:
The majority of the world's population now resides in urban environments and information on the internal composition and dynamics of these environments is essential to enable preservation of certain standards of living. Remotely sensed data, especially the global coverage of moderate spatial resolution satellites such as Landsat, Indian Resource Satellite and Systeme Pour I'Observation de la Terre (SPOT), offer a highly useful data source for mapping the composition of these cities and examining their changes over time. The utility and range of applications for remotely sensed data in urban environments could be improved with a more appropriate conceptual model relating urban environments to the sampling resolutions of imaging sensors and processing routines. Hence, the aim of this work was to take the Vegetation-Impervious surface-Soil (VIS) model of urban composition and match it with the most appropriate image processing methodology to deliver information on VIS composition for urban environments. Several approaches were evaluated for mapping the urban composition of Brisbane city (south-cast Queensland, Australia) using Landsat 5 Thematic Mapper data and 1:5000 aerial photographs. The methods evaluated were: image classification; interpretation of aerial photographs; and constrained linear mixture analysis. Over 900 reference sample points on four transects were extracted from the aerial photographs and used as a basis to check output of the classification and mixture analysis. Distinctive zonations of VIS related to urban composition were found in the per-pixel classification and aggregated air-photo interpretation; however, significant spectral confusion also resulted between classes. In contrast, the VIS fraction images produced from the mixture analysis enabled distinctive densities of commercial, industrial and residential zones within the city to be clearly defined, based on their relative amount of vegetation cover. The soil fraction image served as an index for areas being (re)developed. The logical match of a low (L)-resolution, spectral mixture analysis approach with the moderate spatial resolution image data, ensured the processing model matched the spectrally heterogeneous nature of the urban environments at the scale of Landsat Thematic Mapper data.
Resumo:
Electronic communications devices intended for government or military applications must be rigorously evaluated to ensure that they maintain data confidentiality. High-grade information security evaluations require a detailed analysis of the device's design, to determine how it achieves necessary security functions. In practice, such evaluations are labour-intensive and costly, so there is a strong incentive to find ways to make the process more efficient. In this paper we show how well-known concepts from graph theory can be applied to a device's design to optimise information security evaluations. In particular, we use end-to-end graph traversals to eliminate components that do not need to be evaluated at all, and minimal cutsets to identify the smallest group of components that needs to be evaluated in depth.
Resumo:
Reliable, comparable information about the main causes of disease and injury in populations, and how these are changing, is a critical input for debates about priorities in the health sector. Traditional sources of information about the descriptive epidemiology of diseases, injuries and risk factors are generally incomplete, fragmented and of uncertain reliability and comparability. Lack of a standardized measurement framework to permit comparisons across diseases and injuries, as well as risk factors, and failure to systematically evaluate data quality have impeded comparative analyses of the true public health importance of various conditions and risk factors. As a consequence the impact of major conditions and hazards on population health has been poorly appreciated, often leading to a lack of public health investment. Global disease and risk factor quantification improved dramatically in the early 1990s with the completion of the first Global Burden of Disease Study. For the first time, the comparative importance of over 100 diseases and injuries, and ten major risk factors, for global and regional health status could be assessed using a common metric (Disability-Adjusted Life Years) which simultaneously accounted for both premature mortality and the prevalence, duration and severity of the non-fatal consequences of disease and injury. As a consequence, mental health conditions and injuries, for which non-fatal outcomes are of particular significance, were identified as being among the leading causes of disease/injury burden worldwide, with clear implications for policy, particularly prevention. A major achievement of the Study was the complete global descriptive epidemiology, including incidence, prevalence and mortality, by age, sex and Region, of over 100 diseases and injuries. National applications, further methodological research and an increase in data availability have led to improved national, regional and global estimates for 2000, but substantial uncertainty around the disease burden caused by major conditions, including, HIV, remains. The rapid implementation of cost-effective data collection systems in developing countries is a key priority if global public policy to promote health is to be more effectively informed.