824 results for Pruning techniques


Relevance:

20.00% 20.00%

Publisher:

Abstract:

In a world where data is captured on a large scale, the major challenge for data mining algorithms is to scale up to large datasets. There are two main approaches to inducing classification rules: the divide and conquer approach, also known as the top down induction of decision trees, and the separate and conquer approach. A considerable amount of work has been done on scaling up the divide and conquer approach. However, very little work has been conducted on scaling up the separate and conquer approach. In this work we describe a parallel framework that allows the parallelisation of a certain family of separate and conquer algorithms, the Prism family. Parallelisation helps the Prism family of algorithms to harness additional computer resources in a network of computers in order to make the induction of classification rules scale better on large datasets. Our framework also incorporates a pre-pruning facility for parallel Prism algorithms.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

The Prism family of algorithms induces modular classification rules which, in contrast to decision tree induction algorithms, do not necessarily fit together into a decision tree structure. Classifiers induced by Prism algorithms achieve accuracy comparable with that of decision trees and in some cases even outperform them. Both kinds of algorithms tend to overfit on large and noisy datasets, and this has led to the development of pruning methods. Pruning methods use various metrics to truncate decision trees or to eliminate whole rules or single rule terms from a Prism rule set. For decision trees many pre-pruning and post-pruning methods exist; for Prism algorithms, however, only one pre-pruning method has been developed, J-pruning. Recent work with Prism algorithms examined J-pruning in the context of very large datasets and found that the current method does not use its full potential. This paper revisits the J-pruning method for the Prism family of algorithms, develops a new pruning method, Jmax-pruning, discusses it in theoretical terms and evaluates it empirically.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

The Prism family of algorithms induces modular classification rules, in contrast to the Top Down Induction of Decision Trees (TDIDT) approach, which induces classification rules in the intermediate form of a tree structure. Both approaches achieve a comparable classification accuracy. However, in some cases Prism outperforms TDIDT. For both approaches pre-pruning facilities have been developed in order to prevent the induced classifiers from overfitting on noisy datasets, by cutting rule terms or whole rules or by truncating decision trees according to certain metrics. Many pre-pruning mechanisms have been developed for the TDIDT approach, but for the Prism family the only existing pre-pruning facility is J-pruning. J-pruning works not only on Prism algorithms but also on TDIDT. Although it has been shown that J-pruning produces good results, this work points out that J-pruning does not use its full potential. The original J-pruning facility is examined and the use of a new pre-pruning facility, called Jmax-pruning, is proposed and evaluated empirically. A possible pre-pruning facility for TDIDT based on Jmax-pruning is also discussed.
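To make the idea of modular rule induction concrete, here is a minimal Python sketch of a basic separate and conquer loop in the spirit of Prism, without any pruning. It is an illustration under stated assumptions, not the authors' implementation: the function name `prism`, the data layout (a list of dictionaries) and the tie-breaking rule (prefer the term covering more instances of the target class) are all choices made for this example.

```python
def prism(instances, target):
    """Induce modular rules as (terms, class) pairs; no pruning is applied.

    instances: list of dicts mapping attribute names to categorical values.
    target: name of the class attribute (hypothetical layout for this sketch).
    """
    rules = []
    attributes = [a for a in instances[0] if a != target]
    for cls in sorted({row[target] for row in instances}):
        remaining = list(instances)  # the full training set is restored per class
        while any(row[target] == cls for row in remaining):
            covered, terms = remaining, {}
            # Specialise the rule until it covers only instances of cls
            # (or no unused attribute is left, e.g. on clashing data).
            while (any(r[target] != cls for r in covered)
                   and len(terms) < len(attributes)):
                best = None
                for attr in attributes:
                    if attr in terms:
                        continue
                    for value in sorted({r[attr] for r in covered}):
                        subset = [r for r in covered if r[attr] == value]
                        positives = sum(r[target] == cls for r in subset)
                        # Score by p(cls | attr = value), ties by coverage.
                        score = (positives / len(subset), positives)
                        if best is None or score > best[0]:
                            best = (score, attr, value)
                _, attr, value = best
                terms[attr] = value
                covered = [r for r in covered if r[attr] == value]
            rules.append((dict(terms), cls))
            # "Separate": drop the instances covered by the finished rule.
            covered_ids = {id(r) for r in covered}
            remaining = [r for r in remaining if id(r) not in covered_ids]
    return rules


data = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
rules = prism(data, "play")
```

A pre-pruning facility such as J-pruning would intervene inside the inner specialisation loop, stopping the addition of further rule terms according to its metric; that step is deliberately left out here.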

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Advances in hardware and software technology enable us to collect, store and distribute large quantities of data on a very large scale. Automatically discovering and extracting hidden knowledge in the form of patterns from these large data volumes is known as data mining. Data mining technology is not only a part of business intelligence, but is also used in many other application areas such as research, marketing and financial analytics. For example medical scientists can use patterns extracted from historic patient data in order to determine if a new patient is likely to respond positively to a particular treatment or not; marketing analysts can use extracted patterns from customer data for future advertisement campaigns; finance experts have an interest in patterns that forecast the development of certain stock market shares for investment recommendations. However, extracting knowledge in the form of patterns from massive data volumes imposes a number of computational challenges in terms of processing time, memory, bandwidth and power consumption. These challenges have led to the development of parallel and distributed data analysis approaches and the utilisation of Grid and Cloud computing. This chapter gives an overview of parallel and distributed computing approaches and how they can be used to scale up data mining to large datasets.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Keyphrases are added to documents to help identify the areas of interest they contain. However, in a significant proportion of papers author-selected keyphrases are not appropriate for the document they accompany: for instance, they can be classificatory rather than explanatory, or they are not updated when the focus of the paper changes. As such, automated methods for improving the use of keyphrases are needed, and various methods have been published. However, each method was evaluated using a different corpus, typically one relevant to the field of study of the method’s authors. This not only makes it difficult to incorporate the useful elements of algorithms in future work, but also makes comparing the results of each method inefficient and ineffective. This paper describes the work undertaken to compare five methods across a common baseline of corpora. The methods chosen were Term Frequency, Inverse Document Frequency, the C-Value, the NC-Value, and a Synonym-based approach. These methods were analysed to evaluate performance and quality of results, and to provide a future benchmark. It is shown that Term Frequency and Inverse Document Frequency were the best algorithms, with the Synonym approach following them. Following these findings, a study was undertaken into the value of using human evaluators to judge the outputs. The Synonym method was compared to the original author keyphrases of the Reuters’ News Corpus. The findings show that authors of Reuters’ news articles provide good keyphrases but that more often than not they do not provide any keyphrases.
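For readers unfamiliar with the first two methods named above, the following Python sketch computes Term Frequency and Inverse Document Frequency in their common textbook forms. The abstract does not give the exact variants the paper evaluated, so the formulas here (relative frequency within a document, and the natural logarithm of the number of documents divided by the document frequency) are assumptions, as are the function names and the toy corpus.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each term within a single tokenised document."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def inverse_document_frequency(corpus):
    """log(N / df) for each term, where df counts documents containing it."""
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(len(corpus) / df) for term, df in doc_freq.items()}


# Toy corpus, invented for this example.
corpus = [
    ["rule", "pruning", "rule", "induction"],
    ["cloud", "detection", "rule"],
    ["pruning", "tree"],
]
tf = term_frequency(corpus[0])
idf = inverse_document_frequency(corpus)
```

Terms that are frequent within a document score highly on Term Frequency, while terms that appear in few documents score highly on Inverse Document Frequency; candidate keyphrases can then be ranked by either score.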

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Khartoum, like many cities in the least developed countries (LDCs), still witnesses a huge influx of people. Accommodating the newcomers leads to encroachment on cultivated land and to the sprawling expansion of Greater Khartoum. The city expanded in diameter from 16.8 km in 1955 to 802.5 km in 1998. Most of this horizontal expansion was residential. In 2008 Khartoum accommodated 29% of the urban population of Sudan. Today Khartoum is considered one of 43 major cities in Africa that accommodate more than 1 million inhabitants. Most of the newcomers live on the outskirts of the city, e.g. in the Dar El-Salam and Mayo neighbourhoods. The majority of those newcomers built their houses, especially the walls, from mud, wood, straw and sacks. The selection of building materials usually depends on price, regardless of the environmental impact, quality, thermal performance and life of the material. Most of the time, this results in increased cost, with variable impacts on the environment during the life of the building. Therefore, consideration of the environmental, social and economic impacts is crucial in the selection of any building material. Decreasing such impacts could lead to more sustainable housing. The sustainability of the available wall building materials for low cost housing in Khartoum is compared using the life cycle assessment (LCA) technique. The purpose of this paper is to compare the most readily available local building materials for walls for the urban poor of Khartoum from a sustainability point of view, covering the manufacturing of the materials, their use and their disposal at the end of their life. Findings reveal that traditional red bricks cannot be considered a sustainable wall building material that will shape the future of low cost housing in Greater Khartoum. On the other hand, the results of the comparison draw attention to the wide range of soil techniques and to their potential as a promising sustainable wall material for urban low cost housing in Khartoum.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

In the past decade, the analysis of data has faced the challenge of dealing with very large and complex datasets and the real-time generation of data. Technologies to store and access these complex and large datasets are in place. However, robust and scalable analysis technologies are needed to extract meaningful information from these datasets. The research field of Information Visualization and Visual Data Analytics addresses this need. Information visualization and data mining are often used as complements to each other. Their common goal is the extraction of meaningful information from complex and possibly large data. However, whereas data mining focuses on the usage of silicon hardware, visualization techniques also aim to access the powerful image-processing capabilities of the human brain. This article highlights the research on data visualization and visual analytics techniques. Furthermore, we highlight existing visual analytics techniques, systems, and applications, including a perspective on the field from the chemical process industry.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Sea surface temperature (SST) can be estimated from day and night observations of the Spinning Enhanced Visible and Infra-Red Imager (SEVIRI) by optimal estimation (OE). We show that exploiting the 8.7 μm channel, in addition to the “traditional” wavelengths of 10.8 and 12.0 μm, improves OE SST retrieval statistics in validation. However, the main benefit is an improvement in the sensitivity of the SST estimate to variability in true SST. In a fair, single-pixel comparison, the 3-channel OE gives better results than the SST estimation technique presently operational within the Ocean and Sea Ice Satellite Application Facility. This operational technique is to use SST retrieval coefficients, followed by a bias-correction step informed by radiative transfer simulation. However, the operational technique has an additional “atmospheric correction smoothing”, which improves its noise performance, and hitherto had no analogue within the OE framework. Here, we propose an analogue to atmospheric correction smoothing, based on the expectation that atmospheric total column water vapour has a longer spatial correlation length scale than SST features. The approach extends the observations input to the OE to include the averaged brightness temperatures (BTs) of nearby clear-sky pixels, in addition to the BTs of the pixel for which SST is being retrieved. The retrieved quantities are then the single-pixel SST and the clear-sky total column water vapour averaged over the vicinity of the pixel. This reduces the noise in the retrieved SST significantly. The robust standard deviation of the new OE SST compared to matched drifting buoys becomes 0.39 K for all data. The smoothed OE gives SST sensitivity of 98% on average. This means that diurnal temperature variability and ocean frontal gradients are more faithfully estimated, and that the influence of the prior SST used is minimal (2%). This benefit is not available using traditional atmospheric correction smoothing.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Cities, which are now inhabited by a majority of the world's population, are not only an important source of global environmental and resource depletion problems, but can also act as important centres of technological innovation and social learning in the continuing quest for a low carbon future. Planning and managing large-scale transitions in cities to deal with these pressures require an understanding of urban retrofitting at city scale. In this context performative techniques (such as backcasting and roadmapping) can provide valuable tools for helping cities develop a strategic view of the future. However, it is also important to identify ‘disruptive’ and ‘sustaining’ technologies which may contribute to city-based sustainability transitions. This paper presents research findings from the EPSRC Retrofit 2050 project, and explores the relationship between technology roadmaps and transition theory literature, highlighting the research gaps at urban/city level. The paper develops a research methodology to describe the development of three guiding visions for city-regional retrofit futures, and identifies key sustaining and disruptive technologies at city scale within these visions using foresight (horizon scanning) techniques. The implications of the research for city-based transition studies and related methodologies are discussed.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Prism is a modular classification rule generation method based on the ‘separate and conquer’ approach, an alternative to rule induction using decision trees, which is also known as ‘divide and conquer’. Prism often achieves a level of classification accuracy similar to that of decision trees, but tends to produce a more compact, noise-tolerant set of classification rules. As with other classification rule generation methods, a principal problem arising with Prism is that of overfitting due to over-specialised rules. In addition, over-specialised rules increase the associated computational complexity. These problems can be solved by pruning methods. For the Prism method, two pruning algorithms have been introduced recently for reducing overfitting of classification rules: J-pruning and Jmax-pruning. Both algorithms are based on the J-measure, an information theoretic means of quantifying the theoretical information content of a rule. Jmax-pruning attempts to exploit the J-measure to its full potential, because J-pruning does not actually achieve this and may even lead to underfitting. A series of experiments has shown that Jmax-pruning may outperform J-pruning in reducing overfitting. However, Jmax-pruning is computationally relatively expensive and may also lead to underfitting. This paper reviews the Prism method and the two existing pruning algorithms above. It also proposes a novel pruning algorithm called Jmid-pruning. The latter is based on the J-measure and reduces overfitting to a similar level as the other two algorithms, but is better at avoiding underfitting and unnecessary computational effort. The authors conduct an experimental study on the performance of the Jmid-pruning algorithm in terms of classification accuracy and computational efficiency. The algorithm is also evaluated comparatively with the J-pruning and Jmax-pruning algorithms.
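The J-measure mentioned above can be illustrated with a short Python sketch. It follows the standard definition attributed to Smyth and Goodman: the probability that the rule's antecedent fires, multiplied by the cross-entropy between the class distribution given the antecedent and the prior class distribution. The abstract itself does not spell out the formula, so the parameter names and the example values below are assumptions made for illustration.

```python
import math


def j_measure(p_antecedent, p_class, p_class_given_antecedent):
    """J-measure (in bits) of a rule 'IF antecedent THEN class'.

    p_antecedent: probability that the rule's antecedent is satisfied.
    p_class: prior probability of the class (must lie strictly in 0..1).
    p_class_given_antecedent: probability of the class when the rule fires.
    """
    def term(posterior, prior):
        # By convention 0 * log(0) is treated as 0.
        if posterior == 0.0:
            return 0.0
        return posterior * math.log2(posterior / prior)

    cross_entropy = (term(p_class_given_antecedent, p_class)
                     + term(1.0 - p_class_given_antecedent, 1.0 - p_class))
    return p_antecedent * cross_entropy


# A rule that fires on half of the instances and is always correct,
# for a class with prior probability 0.5.
informative = j_measure(0.5, 0.5, 1.0)

# A rule whose antecedent leaves the class probability unchanged.
uninformative = j_measure(0.3, 0.4, 0.4)
```

A rule that does not change the class probability carries no information and scores zero, while a rule that is both frequently applicable and highly predictive scores higher. Pruning methods built on this measure compare its value as rule terms are added.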

Relevance:

20.00% 20.00%

Publisher:

Abstract:

We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations. The modified algorithm runs more than 50 times faster on the CELL’s Synergistic Processing Elements than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60% of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.
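The scheduling pattern described above, independent air columns handed to a pool of worker threads, can be sketched in Python as follows. This is only an illustration of the task-queue idea: the original work was written in C, the function `compute_column` and its arithmetic are hypothetical placeholders rather than a radiation calculation, and in CPython with the global interpreter lock threads do not speed up pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_column(column):
    """Placeholder for the per-column computation (hypothetical arithmetic)."""
    return sum(value * value for value in column)


def compute_all_columns(columns, workers=4):
    """Schedule each column as an independent task on a thread pool.

    The executor keeps an internal queue of pending tasks; idle worker
    threads pick up the next column as soon as they finish the previous one.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of the input columns.
        return list(pool.map(compute_column, columns))


results = compute_all_columns([[1, 2], [3], [0, 0]])
```

Because each column is computed independently, the work can be balanced dynamically across processors without the fixed partitioning of a domain decomposition.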

Relevance:

20.00% 20.00%

Publisher:

Abstract:

We present five new cloud detection algorithms over land based on dynamic threshold or Bayesian techniques, applicable to the Advanced Along Track Scanning Radiometer (AATSR) instrument, and compare these with the standard threshold-based SADIST cloud detection scheme. We use a manually classified dataset as a reference to assess algorithm performance and quantify the impact of each cloud detection scheme on land surface temperature (LST) retrieval. The use of probabilistic Bayesian cloud detection methods improves algorithm true skill scores by 8-9 % over SADIST (maximum score of 77.93 % compared to 69.27 %). We present an assessment of the impact of imperfect cloud masking, in relation to the reference cloud mask, on the retrieved AATSR LST, imposing a 2 K tolerance over a 3x3 pixel domain. We find an increase of 5-7 % in the observations falling within this tolerance when using Bayesian methods (maximum of 92.02 % compared to 85.69 %). We also demonstrate that the use of dynamic thresholds in the tests employed by SADIST can significantly improve performance, an approach applicable to cloud-test data to be provided by the Sea and Land Surface Temperature Radiometer (SLSTR), due to be launched on the Sentinel 3 mission (estimated 2014).
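The true skill score used above to compare cloud masks can be illustrated with a short Python sketch. It assumes the standard definition (hit rate minus false alarm rate, also known as the Hanssen-Kuipers discriminant); the abstract does not state the formula, and the counts in the example are invented for illustration, not taken from the paper.

```python
def true_skill_score(true_pos, false_neg, false_pos, true_neg):
    """Hit rate minus false alarm rate for a binary cloud mask.

    true_pos: cloudy pixels flagged as cloudy.
    false_neg: cloudy pixels flagged as clear.
    false_pos: clear pixels flagged as cloudy.
    true_neg: clear pixels flagged as clear.
    """
    hit_rate = true_pos / (true_pos + false_neg)
    false_alarm_rate = false_pos / (false_pos + true_neg)
    return hit_rate - false_alarm_rate


# Invented counts: 80 of 100 cloudy pixels detected, 10 of 100 clear
# pixels wrongly flagged as cloud.
score = true_skill_score(true_pos=80, false_neg=20, false_pos=10, true_neg=90)
```

The score ranges from -1 to 1, with 1 for a perfect mask and 0 for a mask with no skill, so it rewards detecting cloud without over-flagging clear pixels.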

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Older adult computer users often lose track of the mouse cursor and so resort to methods such as shaking the mouse or searching the entire screen to find the cursor again. Hence, this paper describes how a standard optical mouse was modified to include a touch sensor, activated by releasing and touching the mouse, which automatically centers the mouse cursor on the screen, potentially making it easier to find a ‘lost’ cursor. Six older adult computer users and six younger computer users were asked to compare the touch-sensitive mouse with cursor centering with two alternative techniques for locating the mouse cursor: manually shaking the mouse and using the Windows sonar facility. The time taken to click on a target after a distractor task was recorded, and results show that centering the mouse was the fastest to use, with a 35% improvement over shaking the mouse. Five out of six older participants ranked the touch-sensitive mouse with cursor centering as the easiest to use.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Automatic generation of classification rules has been an increasingly popular technique in commercial applications such as Big Data analytics, rule based expert systems and decision making systems. However, a principal problem that arises with most methods for generation of classification rules is the overfitting of training data. When Big Data is dealt with, this may result in the generation of a large number of complex rules. This may not only increase computational cost but also lower the accuracy in predicting further unseen instances. This has led to the necessity of developing pruning methods for the simplification of rules. In addition, classification rules are used further to make predictions after the completion of their generation. As far as efficiency is concerned, the first rule that fires should be found as quickly as possible when searching through a rule set. Thus a suitable structure is required to represent the rule set effectively. In this chapter, the authors introduce a unified framework for construction of rule based classification systems consisting of three operations on Big Data: rule generation, rule simplification and rule representation. The authors also review some existing methods and techniques used for each of the three operations and highlight their limitations. They introduce some novel methods and techniques that they have developed recently. These methods and techniques are also discussed in comparison to existing ones with respect to efficient processing of Big Data.