946 results for rough set theory


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Communication via email is one of the most popular services on the Internet. Email has brought great convenience to our daily work and life. However, unsolicited messages, or spam, flood our mailboxes, wasting bandwidth, time and money. To address this, this paper presents a rough set based model that classifies emails into three categories (spam, non-spam and suspicious) rather than the two classes (spam and non-spam) used in most current approaches. Compared with popular classification methods such as Naive Bayes, the proposed model reduces the rate at which non-spam messages are misclassified as spam.
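The three-category idea can be illustrated with the standard rough set regions. The following is a minimal sketch, not the model from the paper: it assumes emails are described by discrete attributes, and the attribute names (`has_link`, `all_caps`), the toy data and the function `three_way_regions` are hypothetical. Emails that are indiscernible on the chosen attributes form one class; a class that is entirely spam falls in the lower approximation of "spam", a class with no spam falls outside the upper approximation, and a mixed class forms the boundary, here read as "suspicious".

```python
from collections import defaultdict


def three_way_regions(objects, features, is_spam):
    """Split objects into (spam, suspicious, non_spam) sets of ids.

    objects:  dict mapping an id to a dict of attribute values (hypothetical data).
    features: attribute names used to decide indiscernibility.
    is_spam:  dict mapping an id to True/False (the known label).
    """
    # Group objects that share the same values on every chosen feature.
    classes = defaultdict(set)
    for obj_id, attrs in objects.items():
        key = tuple(attrs[f] for f in features)
        classes[key].add(obj_id)

    spam, suspicious, non_spam = set(), set(), set()
    for members in classes.values():
        n_spam = sum(1 for m in members if is_spam[m])
        if n_spam == len(members):
            spam |= members        # lower approximation of "spam"
        elif n_spam == 0:
            non_spam |= members    # outside the upper approximation of "spam"
        else:
            suspicious |= members  # boundary region: labels disagree
    return spam, suspicious, non_spam


# Toy example with made-up attributes and labels.
emails = {
    "e1": {"has_link": 1, "all_caps": 1},
    "e2": {"has_link": 1, "all_caps": 1},
    "e3": {"has_link": 1, "all_caps": 0},
    "e4": {"has_link": 1, "all_caps": 0},
    "e5": {"has_link": 0, "all_caps": 0},
}
labels = {"e1": True, "e2": True, "e3": True, "e4": False, "e5": False}

spam, suspicious, non_spam = three_way_regions(
    emails, ["has_link", "all_caps"], labels
)
```

In this toy data, `e3` and `e4` look identical but carry different labels, so both land in the suspicious set instead of being forced into one of the two classes.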

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper introduces a new technique in the investigation of object classification and illustrates the potential use of this technique for the analysis of a range of biological data, using avian morphometric data as an example. The nascent variable precision rough sets (VPRS) model is introduced and compared with the decision tree method ID3 (through a ‘leave n out’ approach), using the same dataset of morphometric measures of European barn swallows (Hirundo rustica) and assessing the accuracy of gender classification based on these measures. The results demonstrate that the VPRS model, allied with the use of a modern method of discretization of data, is comparable with the more traditional non-parametric ID3 decision tree method. We show that, particularly in small samples, the VPRS model can improve classification and to a lesser extent prediction aspects over ID3. Furthermore, through the ‘leave n out’ approach, some indication can be produced of the relative importance of the different morphometric measures used in this problem. In this case we suggest that VPRS has advantages over ID3, as it intelligently uses more of the morphometric data available for the data classification, whilst placing less emphasis on variables with low reliability. In biological terms, the results suggest that the gender of swallows can be determined with reasonable accuracy from morphometric data and highlight the most important variables in this process. We suggest that both analysis techniques are potentially useful for the analysis of a range of different types of biological datasets, and that VPRS in particular has potential for application to a range of biological circumstances.
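To make the difference between classical rough sets and VPRS concrete, the sketch below computes a β-positive region: a class of indiscernible objects is accepted when the largest share of one decision value in it is at least β, so β = 1 reproduces the classical positive region and smaller β tolerates some inconsistency. This is an illustration under assumptions, not the procedure used in the paper; the function name `beta_positive_region`, the attribute names and the discretized toy records are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def beta_positive_region(rows, condition_attrs, decision_attr, beta):
    """Return the indices of rows in the beta-positive region.

    A class of rows that agree on all condition attributes is included when
    its majority decision value covers a proportion >= beta of the class.
    beta is expected to lie in (0.5, 1].
    """
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in condition_attrs)].append(i)

    positive = set()
    for members in classes.values():
        counts = defaultdict(int)
        for i in members:
            counts[rows[i][decision_attr]] += 1
        majority_share = Fraction(max(counts.values()), len(members))
        if majority_share >= beta:
            positive.update(members)
    return positive


# Hypothetical, already-discretized morphometric records.
birds = [
    {"wing": "long", "tail": "long", "sex": "M"},
    {"wing": "long", "tail": "long", "sex": "M"},
    {"wing": "long", "tail": "long", "sex": "M"},
    {"wing": "long", "tail": "long", "sex": "F"},
    {"wing": "short", "tail": "short", "sex": "F"},
    {"wing": "short", "tail": "short", "sex": "F"},
    {"wing": "short", "tail": "long", "sex": "M"},
    {"wing": "short", "tail": "long", "sex": "F"},
]

strict = beta_positive_region(birds, ["wing", "tail"], "sex", Fraction(1))
relaxed = beta_positive_region(birds, ["wing", "tail"], "sex", Fraction(3, 4))
```

With β = 1 only the fully consistent class is accepted; with β = 3/4 the class with three males and one female is accepted as well, which is how VPRS can use more of a small, noisy sample.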

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper considers the problem of concept generalization in decision-making systems, taking into account features of real-world databases such as large size, incompleteness and inconsistency of the stored information. Methods of rough set theory (such as lower and upper approximations, positive regions and reducts) are used to solve this problem. A new discretization algorithm for continuous attributes is proposed. It substantially improves the overall performance of generalization algorithms and can be applied to processing real-valued attributes in large data tables. A search algorithm for significant attributes, combined with a discretization stage, is also developed. It avoids splitting the continuous domains of insignificant attributes into intervals.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Textual document sets have become an important and rapidly growing information source on the web. Text classification is one of the crucial technologies for information organisation and management, and it has attracted wide attention from researchers in different fields. This paper first introduces a number of feature selection methods, implementation algorithms and applications of text classification. However, the knowledge extracted by current data-mining techniques for text classification contains much noise, which introduces considerable uncertainty into the classification process, arising from both knowledge extraction and knowledge usage. More innovative techniques and methods are therefore needed to improve the performance of text classification. Further improving knowledge extraction and making effective use of the extracted knowledge is a critical and challenging step. A Rough Set decision-making approach is proposed to classify more precisely the textual documents that are difficult to separate with classic text classification methods. The purpose of this paper is to give an overview of existing text classification technologies; to demonstrate Rough Set concepts and the decision-making approach based on Rough Set theory for building a more reliable and effective text classification framework with higher precision; to set up an innovative evaluation metric named CEI, which is very effective for assessing the performance of similar research; and to propose a promising research direction for addressing the challenging problems in text classification, text mining and other related fields.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Spatial relations, reflecting the complex association between geographical phenomena and environments, are very important in the solution of geographical issues. Different spatial relations can be expressed by indicators which are useful for the analysis of geographical issues. Urbanization, an important geographical issue, is considered in this paper. The spatial relationship indicators concerning urbanization are expressed with a decision table. Thereafter, the spatial relationship indicator rules are extracted based on the application of rough set theory. The extraction process of spatial relationship indicator rules is illustrated with data from the urban and rural areas of Shenzhen and Hong Kong, located in the Pearl River Delta. Land use vector data of 1995 and 2000 are used. The extracted spatial relationship indicator rules of 1995 are used to identify the urban and rural areas in Zhongshan, Zhuhai and Macao. The identification accuracy is approximately 96.3%. Similar procedures are used to extract the spatial relationship indicator rules of 2000 for the urban and rural areas in Zhongshan, Zhuhai and Macao. An identification accuracy of about 83.6% is obtained.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

X. Wang, J. Yang, R. Jensen and X. Liu, 'Rough Set Feature Selection and Rule Induction for Prediction of Malignancy Degree in Brain Glioma,' Computer Methods and Programs in Biomedicine, vol. 83, no. 2, pp. 147-156, 2006.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Feature selection aims to determine a minimal feature subset from a problem domain while retaining a suitably high accuracy in representing the original features. Rough set theory (RST) has been used as such a tool with much success. RST enables the discovery of data dependencies and the reduction of the number of attributes contained in a dataset using the data alone, requiring no additional information. This chapter describes the fundamental ideas behind RST-based approaches and reviews related feature selection methods that build on these ideas. Extensions to the traditional rough set approach are discussed, including recent selection methods based on tolerance rough sets, variable precision rough sets and fuzzy-rough sets. Alternative search mechanisms are also highly important in rough set feature selection. The chapter includes the latest developments in this area, including RST strategies based on hill-climbing, genetic algorithms and ant colony optimization.
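The dependency measure and hill-climbing search mentioned above can be sketched as follows. This is a minimal greedy forward selection in the spirit of QuickReduct-style methods, written under assumptions: the helper names (`partition`, `dependency`, `greedy_reduct`) and the toy decision table are hypothetical, and a greedy search is not guaranteed to return a minimal reduct. The dependency degree is the fraction of objects whose class of indiscernible objects has a single decision value.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices by their values on attrs (indiscernibility classes)."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())


def dependency(rows, cond_attrs, dec_attr):
    """Dependency degree: size of the positive region divided by |U|."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, cond_attrs):
        decisions = {rows[i][dec_attr] for i in block}
        if len(decisions) == 1:  # block is consistent, so it is in POS
            positive += len(block)
    return positive / len(rows)


def greedy_reduct(rows, cond_attrs, dec_attr):
    """Add the attribute with the largest dependency gain until the
    dependency of the full attribute set is reached (hill-climbing)."""
    target = dependency(rows, cond_attrs, dec_attr)
    chosen = []
    while dependency(rows, chosen, dec_attr) < target:
        remaining = [a for a in cond_attrs if a not in chosen]
        best = max(
            remaining,
            key=lambda a: dependency(rows, chosen + [a], dec_attr),
        )
        chosen.append(best)
    return chosen


# Toy decision table: d is "yes" when a or b is 1; c carries no information.
table = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "yes"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
    {"a": 0, "b": 0, "c": 1, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
]

selected = greedy_reduct(table, ["a", "b", "c"], "d")
```

On this table the search keeps `a` and `b` and drops `c`, using only the data itself, which matches the point that RST needs no additional information such as thresholds or prior probabilities.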

Relevance:

100.00% 100.00%

Publisher:

Abstract:

R. Jensen, Q. Shen and A. Tuson, 'Finding Rough Set Reducts with SAT,' Proceedings of the 10th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, LNAI 3641, pp. 194-203, 2005.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Feature selection plays an important role in knowledge discovery and data mining. In traditional rough set theory, feature selection using reducts (minimal discerning sets of attributes) is an important area. Nevertheless, the original definition of a reduct is restrictive, so previous research proposed taking into account not only the horizontal reduction of information by feature selection, but also a vertical reduction that considers suitable subsets of the original set of objects. Following that work, a new approach to generating bireducts using a multi-objective genetic algorithm was proposed. Although genetic algorithms have been used to calculate reducts in some previous works, we did not find any work where genetic algorithms were adopted to calculate bireducts. Compared with earlier work in this area, the proposed method has less randomness in generating bireducts. The genetic algorithm estimated the quality of each bireduct by the values of two objective functions as evolution progressed, so that a set of bireducts with optimized values of these objectives was obtained. Different fitness evaluation methods and genetic operators, such as crossover and mutation, were applied and the prediction accuracies were compared. Five datasets were used to test the proposed method and two datasets were used to perform a comparison study. Statistical analysis using the one-way ANOVA test was performed to determine whether the differences between the results were significant. The experiment showed that the proposed method was able to reduce the number of bireducts needed to achieve good prediction accuracy. The influence of different genetic operators and fitness evaluation strategies on prediction accuracy was also analyzed. The prediction accuracies of the proposed method were shown to be comparable with the best results in the machine learning literature, and in some cases exceeded them.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

A few clustering techniques for categorical data exist to group objects having similar characteristics. Some are able to handle uncertainty in the clustering process while others have stability issues. However, the performance of these techniques is a concern because of low accuracy and high computational complexity. This paper proposes a new technique called maximum dependency attributes (MDA) for selecting a clustering attribute. The proposed approach is based on rough set theory and takes into account the dependency of attributes in the database. We analyze and compare the performance of the MDA technique with the bi-clustering, total roughness (TR) and min–min roughness (MMR) techniques on four test cases. The results establish the better performance of the proposed approach.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

An information filtering (IF) system monitors an incoming document stream to find the documents that match the information needs specified by the user profiles. Learning to use the user profiles effectively is one of the most challenging tasks when developing an IF system. With document selection criteria better defined based on the users’ needs, filtering large streams of information can be more efficient and effective. To learn the user profiles, term-based approaches have been widely used in the IF community because of their simplicity and directness. Term-based approaches are relatively well established. However, these approaches have problems when dealing with polysemy and synonymy, which often lead to an information overload problem. Recently, pattern-based approaches (or Pattern Taxonomy Models (PTM) [160]) have been proposed for IF by the data mining community. These approaches are better at capturing semantic information and have shown encouraging results for improving the effectiveness of the IF system. On the other hand, pattern discovery from large data streams is not computationally efficient. These approaches also have to deal with low-frequency pattern issues. The measures used by data mining techniques (for example, “support” and “confidence”) to learn the profile have turned out to be unsuitable for filtering. They can lead to a mismatch problem. This thesis uses rough set-based reasoning (term-based) and a pattern mining approach as a unified framework for information filtering to overcome the aforementioned problems. The system consists of two stages: topic filtering and pattern mining. The topic filtering stage is intended to minimize information overload by filtering out the most likely irrelevant information based on the user profiles. A novel user-profile learning method and a theoretical model of threshold setting have been developed using rough set decision theory.
The second stage (pattern mining) aims at solving the problem of information mismatch. This stage is precision-oriented. A new document-ranking function has been derived by exploiting the patterns in the pattern taxonomy. The most likely relevant documents are assigned higher scores by the ranking function. Because a relatively small number of documents is left after the first stage, the computational cost is markedly reduced; at the same time, pattern discovery yields more accurate results. The overall performance of the system was improved significantly. The new two-stage information filtering model has been evaluated by extensive experiments. Tests were based on well-known IR benchmarking processes, using the latest version of the Reuters dataset, namely the Reuters Corpus Volume 1 (RCV1). The performance of the new two-stage model was compared with both term-based and data mining-based IF models. The results demonstrate that the proposed information filtering system significantly outperforms the other IF systems, such as the traditional Rocchio IF model and state-of-the-art term-based models, including BM25, Support Vector Machines (SVM) and the Pattern Taxonomy Model (PTM).

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Outliers are objects that show abnormal behavior with respect to their context or that have unexpected values in some of their parameters. In decision-making processes, information quality is of the utmost importance. In specific applications, an outlying data element may represent an important deviation in a production process or a damaged sensor. Therefore, the ability to detect these elements could make the difference between making a correct and an incorrect decision. This task is complicated by the large sizes of typical databases. Due to their importance in search processes in large volumes of data, researchers pay special attention to the development of efficient outlier detection techniques. This article presents a computationally efficient algorithm for the detection of outliers in large volumes of information. This proposal is based on an extension of the mathematical framework upon which the basic theory of detection of outliers, founded on Rough Set Theory, has been constructed. From this starting point, current problems are analyzed; a detection method is proposed, along with a computational algorithm that allows the performance of outlier detection tasks with an almost-linear complexity. To illustrate its viability, the results of the application of the outlier-detection algorithm to the concrete example of a large database are presented.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Information overload and mismatch are two fundamental problems affecting the effectiveness of information filtering systems. Even though both term-based and pattern-based approaches have been proposed to address overload and mismatch, neither approach alone provides a satisfactory solution. This paper presents a novel two-stage information filtering model which combines the merits of term-based and pattern-based approaches to effectively filter large volumes of information. In particular, the first filtering stage is supported by a novel rough analysis model which efficiently removes a large number of irrelevant documents, thereby addressing the overload problem. The second filtering stage is empowered by a semantically rich pattern taxonomy mining model which effectively fetches incoming documents according to the specific information needs of a user, thereby addressing the mismatch problem. The experimental results based on the RCV1 corpus show that the proposed two-stage filtering model significantly outperforms both term-based and pattern-based information filtering models.