858 results for data pre-processing
Abstract:
Non-technical loss (NTL) identification and prediction are important tasks for many utilities. Data from the customer information system (CIS) can be used for NTL analysis; however, the original CIS data need to be pre-processed before any detailed NTL analysis can be carried out accurately and efficiently. In this paper, we propose a feature selection based method for CIS data pre-processing that extracts the most relevant information for further analysis such as clustering and classification. By removing irrelevant and redundant features, feature selection is an essential step in the data mining process: it finds an optimal subset of features that improves the quality of the results, yielding faster processing, higher accuracy and simpler models with fewer features. A detailed feature selection analysis is presented in the paper. Both time-domain and load-shape data are compared in terms of accuracy, consistency and the statistical dependencies between features.
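As an illustration only (the abstract is not accompanied by code), a filter-style feature selection of the kind described might be sketched as follows; the file name, column names, number of retained features and the mutual-information criterion are assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of filter-based feature selection for CIS records.
# Assumes a pre-processed CIS export with numeric feature columns plus an
# "ntl_label" outcome column; all names here are placeholders.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

df = pd.read_csv("cis.csv")                       # placeholder CIS export
X = df.drop(columns=["ntl_label"])                # candidate features (billing, load-shape stats, ...)
y = df["ntl_label"]                               # known NTL / non-NTL outcome

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print("Retained features:", list(X.columns[selector.get_support()]))
```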
Abstract:
Constructing biodiversity richness maps from Environmental Niche Models (ENMs) of thousands of species is time consuming. A separate species occurrence data pre-processing phase enables the experimenter to control test AUC score variance due to species dataset size. Besides removing duplicate occurrences and points with missing environmental data, we discuss the need for coordinate precision, wide dispersion, temporal and synonymity filters. After species data filtering, the final task of a pre-processing phase should be the automatic generation of species occurrence datasets which can then be directly 'plugged in' to the ENM. A software application capable of carrying out all these tasks would be a valuable time-saver, particularly for large-scale biodiversity studies.
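For illustration, the filtering steps listed above could be roughly sketched as below; every column name, threshold and the synonym table are hypothetical stand-ins rather than the filters actually implemented.

```python
# Hypothetical occurrence-filtering sketch for ENM input preparation.
import pandas as pd

synonym_map = {"Panthera onca onca": "Panthera onca"}        # stand-in synonym table

occ = pd.read_csv("occurrences.csv")                          # species, lat, lon, year, env_value, ...
occ = occ.drop_duplicates(subset=["species", "lat", "lon"])   # duplicate occurrences
occ = occ.dropna(subset=["env_value"])                        # points with missing environmental data
occ = occ[occ["coord_precision_m"] <= 1000]                   # coordinate-precision filter
occ = occ[occ["year"] >= 1970]                                # temporal filter
occ["species"] = occ["species"].replace(synonym_map)          # synonymity filter

# Automatic generation of one ready-to-use dataset per species for the ENM
for name, subset in occ.groupby("species"):
    subset.to_csv(f"enm_input_{name.replace(' ', '_')}.csv", index=False)
```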
Abstract:
Most of the existing algorithms for approximate Bayesian computation (ABC) assume that it is feasible to simulate pseudo-data from the model at each iteration. However, the computational cost of these simulations can be prohibitive for high-dimensional data. An important example is the Potts model, which is commonly used in image analysis. Images encountered in real-world applications can have millions of pixels, therefore scalability is a major concern. We apply ABC with a synthetic likelihood to the hidden Potts model with additive Gaussian noise. Using a pre-processing step, we fit a binding function to model the relationship between the model parameters and the synthetic likelihood parameters. Our numerical experiments demonstrate that the precomputed binding function dramatically improves the scalability of ABC, reducing the average runtime required for model fitting from 71 hours to only 7 minutes. We also illustrate the method by estimating the smoothing parameter for remotely sensed satellite imagery. Without precomputation, Bayesian inference is impractical for datasets of that scale.
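A schematic of the precomputation idea (not the paper's actual Potts simulator or binding function) might look like this: summary statistics are simulated once over a grid of parameter values, a smooth binding function is fitted, and the synthetic likelihood is then evaluated from the fit without further simulation.

```python
# Schematic sketch only: precomputing a binding function for ABC with a
# Gaussian synthetic likelihood. The simulator below is a stand-in.
import numpy as np

def simulate_summary(beta, n_sims=50):
    """Placeholder for simulating the model's sufficient statistic at parameter beta."""
    rng = np.random.default_rng(0)
    return rng.normal(loc=1000 * beta, scale=30, size=n_sims)

grid  = np.linspace(0.1, 1.2, 25)                              # precomputation grid
means = np.array([simulate_summary(b).mean() for b in grid])
sds   = np.array([simulate_summary(b).std(ddof=1) for b in grid])

# Binding function: smooth maps from beta to the synthetic-likelihood parameters
mu_fit    = np.polynomial.Polynomial.fit(grid, means, deg=3)
sigma_fit = np.polynomial.Polynomial.fit(grid, sds, deg=3)

def synthetic_loglik(beta, observed_stat):
    """Gaussian synthetic log-likelihood evaluated from the precomputed fits."""
    mu, sigma = mu_fit(beta), max(float(sigma_fit(beta)), 1e-6)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (observed_stat - mu) ** 2 / (2 * sigma**2)
```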
Abstract:
We describe a pre-processing correlation attack on an FPGA implementation of AES protected with a random clocking countermeasure, which exhibits complex variations in both the location and the amplitude of the power consumption patterns of the AES rounds. We demonstrate that the merged round patterns can be pre-processed to identify and extract the individual round amplitudes, enabling a successful power analysis attack. We show that the countermeasure's requirement to provide a varying execution time between processing rounds can be exploited to select a subset of data in which sufficient current decay has occurred, further improving the attack. In comparison with the countermeasure's estimated security of 3 million traces against an integration attack, we show that, through application of our proposed techniques, the countermeasure can be broken with as few as 13,000 traces.
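Purely as an illustration of the kind of trace pre-processing described (round-amplitude extraction and decay-based trace selection), a rough sketch follows; the peak spacing, thresholds and file name are hypothetical, and this is not the attack pipeline from the paper.

```python
# Hypothetical trace pre-processing sketch; all parameters are placeholders.
import numpy as np
from scipy.signal import find_peaks

def round_amplitudes(trace, min_distance=100):
    """Locate candidate round peaks in a merged power trace and return their amplitudes."""
    peaks, _ = find_peaks(trace, distance=min_distance)
    return trace[peaks]

def sufficiently_decayed(trace, min_rounds=10, decay_threshold=0.2, min_distance=100):
    """Keep a trace only if the signal decays enough between successive round peaks."""
    peaks, _ = find_peaks(trace, distance=min_distance)
    if len(peaks) < min_rounds:
        return False
    troughs = [trace[a:b].min() for a, b in zip(peaks[:-1], peaks[1:])]
    return np.median(troughs) < decay_threshold * np.median(trace[peaks])

traces = np.load("traces.npy")                      # placeholder: acquired power traces
selected = [t for t in traces if sufficiently_decayed(t)]
```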
Abstract:
Road networks are a national critical infrastructure. Road assets need to be monitored and maintained efficiently as their condition deteriorates over time. The condition of one such asset, the road pavement, plays a major role in road network maintenance programmes. Pavement condition depends upon many factors, such as pavement type, traffic and environmental conditions. This paper presents a data analytics case study for assessing the factors affecting the pavement deflection values measured by the traffic speed deflectometer (TSD) device. The analytics process includes the acquisition and integration of data from multiple sources, data pre-processing, mining useful information from the data and utilising the data mining outputs for knowledge deployment. Data mining techniques are able to show how TSD outputs vary across different roads, traffic and environmental conditions. The generated data mining models map the TSD outputs to a set of classes and define correction factors for each class.
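As a purely illustrative sketch of mapping TSD outputs to classes and deriving per-class correction factors (column names, class definitions and the reference condition are assumptions, not the study's model):

```python
# Hypothetical per-class correction factors for TSD deflection values.
import pandas as pd

tsd = pd.read_csv("tsd_measurements.csv")            # deflection, road_type, traffic, ... (placeholder)

# Assign each measurement to a class, here simply (road type, traffic band)
tsd["traffic_band"] = pd.cut(tsd["traffic"], bins=[0, 1e3, 1e4, 1e6],
                             labels=["low", "medium", "high"])
class_mean = tsd.groupby(["road_type", "traffic_band"])["deflection"].mean()

# Correction factor normalises each class mean to a chosen reference condition
reference = class_mean.loc[("motorway", "low")]      # hypothetical reference class
correction_factors = reference / class_mean
print(correction_factors)
```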
Abstract:
Increasingly large-scale applications are generating an unprecedented amount of data. However, the growing gap between computation and I/O capacity on high-end computing (HEC) machines creates a severe bottleneck for data analysis. Instead of moving data from its source to the output storage, in-situ analytics processes output data while the simulations are running. However, in-situ data analysis incurs much greater computing-resource contention with the simulations, and such contention severely degrades simulation performance on HEC machines. Since different data processing strategies have different impacts on performance and cost, there is a consequent need for flexibility in the location of data analytics. In this paper, we explore and analyze several potential data-analytics placement strategies along the I/O path. To find the best strategy for reducing data movement in a given situation, we propose a flexible data analytics (FlexAnalytics) framework. Based on this framework, a FlexAnalytics prototype system is developed for analytics placement. The FlexAnalytics system enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and visualization, as well as for large-scale data transfer. Two use cases – scientific data compression and remote visualization – have been applied in the study to verify the performance of FlexAnalytics. Experimental results demonstrate that the FlexAnalytics framework increases data transfer bandwidth and improves the application's end-to-end transfer performance.
Abstract:
This paper investigates the effect that text pre-processing approaches have on the estimation of the readability of web pages. Readability has been highlighted in previous work as an important aspect of web search result personalisation. The most widely used text readability measures rely on surface-level characteristics of text, such as the length of words and sentences. We demonstrate that different tools for extracting text from web pages lead to very different estimations of readability. This has an important implication for search engines, because search result personalisation strategies that consider users' reading ability may fail if incorrect text readability estimations are computed.
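For concreteness, the surface-level measures referred to here combine word and sentence lengths; the widely used Flesch Reading Ease score, for example, can be computed as in the sketch below (the syllable counter is a rough vowel-group approximation). This makes it clear why different text-extraction tools, yielding different token and sentence counts, produce different readability estimates.

```python
# Flesch Reading Ease from surface-level text statistics (approximate syllable count).
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# The same page extracted by two different tools can yield noticeably different scores
print(flesch_reading_ease("The cat sat on the mat. It was happy."))
```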
Abstract:
Silver pomfret (Pampus argenteus) was frozen in the fresh condition as well as after holding in ice for one, two and four days. Evaluation of changes in the quality of these samples during storage at -18°C showed that shelf-life decreased sharply if the pre-freezing iced storage exceeded one day. The shelf-lives of the one-day-iced, two-day-iced and four-day-iced frozen samples were 32, 20 and 16 weeks, respectively. No correlation was observed between the peroxide value and the organoleptic detection of rancid flavour. Levels of free fatty acids were higher in the samples frozen after one day of iced storage than in all the other samples.
Abstract:
Shade plots, simple visual representations of abundance matrices from multivariate species assemblage studies, are shown to be an effective aid in choosing an overall transformation (or other pre-treatment) of quantitative data for long-term use, striking an appropriate balance between dominant and less abundant taxa in ensuing resemblance-based multivariate analyses. Though the exposition is entirely general and applicable to all community studies, detailed illustrations of the comparative power and interpretative possibilities of shade plots are given for two estuarine assemblage studies in south-western Australia: (a) macrobenthos in the upper Swan Estuary over a two-year period covering a highly significant precipitation event for the Perth area; and (b) a wide-scale spatial study of the nearshore fish fauna of five divergent estuaries. The utility of transformations of intermediate severity is again demonstrated and, with greater novelty, the potential importance of a further mild transformation of all data after differential down-weighting (dispersion weighting) of spatially 'clumped' or 'schooled' species is shown. Among the new techniques utilized is a two-way form of the RELATE test, which demonstrates the linking of assemblage structure (fish) to continuous environmental variables (water quality), having removed a categorical factor (estuary differences). Re-orderings of the sample and species axes in the associated shade plots are seen to provide transparent explanations at the species level for such continuous multivariate patterns.
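A highly simplified numerical sketch of the pre-treatment sequence mentioned (dispersion weighting followed by a mild transformation) is given below on toy data; it is not the published procedure, and the group structure and counts are invented for illustration.

```python
# Toy sketch: dispersion weighting of an abundance matrix, then a mild transformation.
import numpy as np

abundance = np.array([[120,  0,  3],
                      [ 80,  1,  5],
                      [  0,  2,  4],
                      [ 10,  0,  6]], dtype=float)   # samples x species (invented counts)
groups = np.array([0, 0, 1, 1])                      # replicate groups (e.g. sites/times)

# Divide each species by its mean within-group index of dispersion (variance/mean),
# down-weighting spatially 'clumped' or 'schooled' species
weights = []
for j in range(abundance.shape[1]):
    indices = []
    for g in np.unique(groups):
        col = abundance[groups == g, j]
        if col.mean() > 0:
            indices.append(col.var(ddof=1) / col.mean())
    weights.append(max(np.mean(indices), 1.0) if indices else 1.0)

dispersion_weighted = abundance / np.array(weights)

# Further mild (square-root) transformation before resemblance-based analysis
transformed = np.sqrt(dispersion_weighted)
```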
Abstract:
As cryptographic implementations are increasingly subsumed as functional blocks within larger systems on chip, it becomes more difficult to identify the power consumption signatures of cryptographic operations amongst other unrelated processing activities. In addition, at higher clock frequencies the current decay between successive processing rounds is only partial, making it more difficult to apply existing pattern matching techniques in side-channel analysis. We show, however, through the use of a phase-sensitive detector, that power traces can be pre-processed to generate a filtered output which exhibits an enhanced round pattern, enabling the identification of locations on a device where encryption operations are occurring and also assisting with the re-alignment of power traces for side-channel attacks.
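As a rough illustration of phase-sensitive (lock-in style) pre-processing of a power trace, a sketch follows; the sampling rate, reference frequency and filter settings are placeholders, not those of the device analysed.

```python
# Illustrative lock-in style filtering of a power trace to emphasise the round pattern.
import numpy as np
from scipy.signal import butter, filtfilt

def phase_sensitive_filter(trace, fs, f_ref, cutoff=1e5, order=4):
    t = np.arange(len(trace)) / fs
    i_mix = trace * np.cos(2 * np.pi * f_ref * t)     # in-phase mixing with the reference
    q_mix = trace * np.sin(2 * np.pi * f_ref * t)     # quadrature mixing
    b, a = butter(order, cutoff, btype="low", fs=fs)  # low-pass removes the high-frequency terms
    return np.hypot(filtfilt(b, a, i_mix), filtfilt(b, a, q_mix))

# Example with placeholder values: filtered = phase_sensitive_filter(raw_trace, fs=1e9, f_ref=24e6)
```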
Abstract:
Pre-processing (PP) of the received symbol vector and channel matrices is an essential prerequisite operation for Sphere Decoder (SD)-based detection in Multiple-Input Multiple-Output (MIMO) wireless systems. PP is a highly complex operation, yet relative to the total SD workload it represents only a small fraction of the overall computational cost of detecting an OFDM MIMO frame in standards such as 802.11n. Despite this, real-time PP architectures are highly inefficient and dominate the resource cost of real-time SD architectures. This paper resolves this issue. By reorganising the ordering and QR decomposition sub-operations of PP, we describe a Field Programmable Gate Array (FPGA)-based PP architecture for the Fixed Complexity Sphere Decoder (FSD) applied to 4 × 4 802.11n MIMO which reduces resource cost by 50% compared with state-of-the-art solutions whilst maintaining real-time performance.
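To make the ordering and QR decomposition sub-operations concrete, a generic norm-based sorted QR decomposition (SQRD) is sketched below in floating point; it is not the paper's FPGA architecture, nor the exact FSD ordering rule, only an illustration of the kind of computation PP performs for each channel matrix.

```python
# Generic sorted QR decomposition (SQRD) sketch for a MIMO channel matrix.
import numpy as np

def sorted_qr(H):
    """Return Q, R and a column (detection) order for channel matrix H."""
    H = H.astype(complex).copy()
    n = H.shape[1]
    order = np.arange(n)
    Q = np.zeros_like(H)
    R = np.zeros((n, n), dtype=complex)
    for i in range(n):
        # Bring the remaining column with the smallest norm to position i
        # (a common SQRD heuristic; the FSD applies its own ordering rule)
        k = i + int(np.argmin(np.linalg.norm(H[:, i:], axis=0)))
        H[:, [i, k]] = H[:, [k, i]]
        R[:, [i, k]] = R[:, [k, i]]
        order[[i, k]] = order[[k, i]]
        # Modified Gram-Schmidt step
        R[i, i] = np.linalg.norm(H[:, i])
        Q[:, i] = H[:, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = Q[:, i].conj() @ H[:, j]
            H[:, j] -= R[i, j] * Q[:, i]
    return Q, R, order

H = (np.random.randn(4, 4) + 1j * np.random.randn(4, 4)) / np.sqrt(2)   # toy 4x4 channel
Q, R, order = sorted_qr(H)
```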
Abstract:
This paper presents an electricity medium voltage (MV) customer characterization framework supported by knowledge discovery in databases (KDD). The main idea is to identify typical load profiles (TLP) of MV consumers and to develop a rule set for the automatic classification of new consumers. To achieve our goal, a methodology is proposed consisting of several steps: data pre-processing; application of several clustering algorithms to segment the daily load profiles; selection of the best partition, corresponding to the best consumers' segmentation, based on the assessment of several clustering validity indices; and finally, construction of a classification model based on the resulting clusters. To validate the proposed framework, a case study which includes a real database of MV consumers is performed.
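A minimal sketch of the clustering / validity-index / classification steps might look as follows; the data file, profile layout and the specific choices of k-means, silhouette index and decision tree are assumptions for illustration, not necessarily those used in the framework.

```python
# Illustrative TLP pipeline: cluster daily load profiles, pick the best partition
# by a validity index, then train a classifier for new consumers.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier

profiles = pd.read_csv("daily_load_profiles.csv")     # placeholder: one normalised profile per consumer

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(profiles)
    score = silhouette_score(profiles, labels)        # one of several possible validity indices
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

# Rule-based model for automatically assigning new consumers to the typical load profiles
classifier = DecisionTreeClassifier(max_depth=4).fit(profiles, best_labels)
```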
Abstract:
Data mining is one of the most active research areas today, with a wide variety of applications in everyday life. It is concerned with finding interesting hidden patterns in large historical databases. For example, from a sales database one can discover a pattern such as "people who buy magazines tend to buy newspapers too"; from a sales point of view, the advantage is that these items can be placed together in the shop to increase sales. In this research work, data mining is applied to the domain of placement chance prediction, since taking a wise career decision is crucial for anybody. In India, technical manpower analysis is carried out by an organization named the National Technical Manpower Information System (NTMIS), established in 1983-84 by India's Ministry of Education & Culture. The NTMIS comprises a lead centre in the IAMR, New Delhi, and 21 nodal centres located in different parts of the country. The Kerala State Nodal Centre is located at Cochin University of Science and Technology. At the nodal centre, placement information is collected by sending postal questionnaires to graduated students on a regular basis. From this raw data available at the nodal centre, a historical database was prepared. Each record in this database includes entrance rank range, reservation, sector, sex and a particular engineering branch. For each such combination of attributes in the historical database of student records, the corresponding placement chance is computed and stored. From this data, various popular data mining models are built and tested. These models can be used to predict the most suitable branch for a new student with one of the above combinations of criteria. A detailed performance comparison of the various data mining models is also carried out. This research work proposes to use a combination of data mining models, namely a hybrid stacking ensemble, for better predictions. Strategies to predict the overall absorption rate for the various branches, as well as the time it takes for all the students of a particular branch to get placed, are also proposed. Finally, this research work puts forward a new data mining algorithm, namely C4.5*stat, for numeric data sets, which has been shown to have competitive accuracy on the standard UCI benchmark data sets. It also proposes an optimization strategy called parameter tuning to improve the standard C4.5 algorithm. In summary, this research work spans all four dimensions of a typical data mining research effort: application to a domain, development of classifier models, optimization, and ensemble methods.
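As a hedged sketch of what a hybrid stacking ensemble for placement-chance prediction could look like (the feature names, base learners and data file are illustrative, not the exact models developed in this work):

```python
# Illustrative stacking ensemble for predicting placement from student attributes.
import pandas as pd
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("placement_history.csv")             # rank range, reservation, sector, sex, branch, placed
y = df["placed"]
X = pd.get_dummies(df.drop(columns=["placed"]))       # one-hot encode the categorical attributes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=6)),
                ("nb", GaussianNB()),
                ("rf", RandomForestClassifier(n_estimators=200))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print("Hold-out accuracy:", stack.score(X_test, y_test))
```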