77 results for Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining


Relevance:

40.00%

Publisher:

Abstract:

Anti-malware software producers are continually challenged to identify and counter new malware as it is released into the wild. A dramatic increase in malware production in recent years has rendered the conventional method of manually determining a signature for each new malware sample untenable. This paper presents a scalable, automated approach for detecting and classifying malware by using pattern recognition algorithms and statistical methods at various stages of the malware analysis life cycle. Our framework combines the static features of function length and printable string information extracted from malware samples into a single test, which gives classification results better than those achieved by using either feature individually. In our testing, we fed feature information from close to 1,400 unpacked malware samples into a number of different classification algorithms. Using k-fold cross validation on the malware, which includes Trojans and viruses, along with 151 clean files, we achieve an overall classification accuracy of over 98%.
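
The combined-feature idea described above can be illustrated with a minimal sketch: two static feature sets are concatenated per sample and evaluated with k-fold cross-validation. The synthetic feature matrices, the random-forest classifier and the fold count below are assumptions for illustration, not the paper's actual features or algorithms.

```python
# Minimal sketch: fuse two static feature vectors per sample and evaluate a
# classifier with k-fold cross-validation (scikit-learn). The feature
# matrices are synthetic stand-ins for function-length and printable-string
# features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_samples = 1551                                 # ~1,400 malware + 151 clean, as in the abstract
func_len_feats = rng.random((n_samples, 16))     # hypothetical function-length histogram
string_feats = rng.random((n_samples, 32))       # hypothetical printable-string statistics
labels = rng.integers(0, 2, n_samples)           # 1 = malware, 0 = clean (synthetic)

# Feature fusion: concatenate the two static feature sets into a single test.
X = np.hstack([func_len_feats, string_feats])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, labels, cv=cv, scoring="accuracy")
print("mean k-fold accuracy:", scores.mean())
```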

Relevance:

40.00%

Publisher:

Abstract:

Analysis of theme development and evolution in the literature is a significant tool that helps scholars find and study frontier problems more efficiently. This paper designs and develops a visual mining system for theme development evolution analysis to deal with large volumes of literature information. Analysis of related themes based on sub-themes, together with a dynamic threshold strategy, is adopted to improve the accuracy of the system. Experimental results show that the theme correlations obtained from the system are accurate and achieve a better practical effect than those of our earlier work.
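
One ingredient mentioned above, relating themes through their sub-themes and keeping only pairs above a data-driven (dynamic) threshold, could look roughly like the following sketch. The sub-theme weight vectors, the cosine similarity and the mean-plus-one-standard-deviation threshold rule are illustrative assumptions, not the system's actual algorithm.

```python
# Sketch: correlate themes via sub-theme weight vectors and keep pairs
# whose similarity exceeds a dynamic (data-driven) threshold. Illustrative only.
import itertools
import numpy as np

# Hypothetical sub-theme weight vectors for four themes.
themes = {
    "deep learning": np.array([0.9, 0.1, 0.4, 0.0]),
    "neural networks": np.array([0.8, 0.2, 0.5, 0.1]),
    "bibliometrics": np.array([0.1, 0.9, 0.0, 0.6]),
    "citation analysis": np.array([0.0, 0.8, 0.1, 0.7]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pairs = {(a, b): cosine(themes[a], themes[b])
         for a, b in itertools.combinations(themes, 2)}

# Dynamic threshold: mean plus one standard deviation of all pairwise scores.
scores = np.array(list(pairs.values()))
threshold = scores.mean() + scores.std()

related = [(a, b, round(s, 3)) for (a, b), s in pairs.items() if s >= threshold]
print("related theme pairs:", related)
```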

Relevance:

40.00%

Publisher:

Abstract:

Purpose – This paper aims to investigate whether the accounting reform in China has improved the relevance of its accounting information. It investigates the association of earnings and book value of equity with share returns before and after the introduction of the Accounting System for Business Enterprises (ASBE) in 2001 for A- and A&B-share firms.

Design/methodology/approach – The paper employs the return regression model. The pre-ASBE period is designated as 1997 through to 2000, and the post-ASBE period is designated as 2002 through to 2004. All firms listed on the Chinese stock market during the investigation period constitute the sample.

Findings – It is found that accounting information better explains share returns for both A-share firms and A&B-share firms in the post-ASBE period. The paper also finds that the book value of equity for A&B-share firms is incrementally value relevant to that of A-share firms in the post-ASBE period.

Research limitations/implications – Further studies will contribute to understanding how governance mechanisms and liquidity influence the association between accounting information and share returns in the Chinese A-share market.

Practical implications – The findings provide empirical evidence regarding the relevance of accounting information in emerging markets.

Originality/value – The paper contributes to the extant value relevance literature by investigating time periods surrounding the issue of ASBE in 2001 in the Chinese stock market.
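
The return regression approach described in the design/methodology section above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a common value-relevance specification (returns regressed on deflated earnings and book value of equity, estimated separately for pre- and post-ASBE sub-periods); the paper's actual model, deflators and sample construction may differ.

```python
# Sketch of a value-relevance return regression: share returns on earnings
# and book value of equity, estimated for pre- and post-ASBE sub-periods.
# Synthetic data; illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
earnings = rng.normal(0.05, 0.02, n)        # earnings per share, deflated (assumed)
book_value = rng.normal(1.0, 0.3, n)        # book value per share, deflated (assumed)
period = rng.integers(0, 2, n)              # 0 = pre-ASBE, 1 = post-ASBE
returns = 0.6 * earnings + 0.05 * book_value + rng.normal(0, 0.1, n)

for label, mask in (("pre-ASBE", period == 0), ("post-ASBE", period == 1)):
    X = sm.add_constant(np.column_stack([earnings[mask], book_value[mask]]))
    fit = sm.OLS(returns[mask], X).fit()
    # Compare the explanatory power (R^2) of accounting information across periods.
    print(label, "R^2 =", round(fit.rsquared, 3))
```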

Relevance:

40.00%

Publisher:

Abstract:

The human brain processes information in both unimodal and multimodal fashion, with information progressively captured, accumulated, abstracted and seamlessly fused. The fusion of multimodal inputs allows a holistic understanding of a problem. The proliferation of technology has produced various sources of electronic data and continues to do so exponentially. Finding patterns in such multi-source and multimodal data can be compared to the multimodal and multidimensional information processing in the human brain. This brain functionality can therefore be taken as inspiration for a methodology that explores multimodal and multi-source electronic data and identifies multi-view patterns. In this paper, we first propose a brain-inspired conceptual model that allows exploration and identification of patterns at different levels of granularity, different types of hierarchies and different types of modalities. Second, we present a cluster-driven approach for implementing the proposed brain-inspired model; in particular, a Growing Self-Organising Map (GSOM) based cross-clustering approach is discussed. Furthermore, the acquisition of multi-view patterns with the cluster-driven implementation is demonstrated with experimental results.
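
The cluster-driven idea can be sketched roughly as follows: cluster each modality separately and then cross-tabulate the memberships to surface multi-view patterns. KMeans is used here only as a convenient stand-in for GSOM (which has no standard library implementation), and the feature matrices are synthetic assumptions.

```python
# Sketch of cluster-driven cross-clustering: cluster each modality on its own,
# then cross-tabulate memberships to find groups that agree across modalities.
# KMeans is a stand-in for the GSOM clustering used in the paper.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n = 300
text_view = rng.random((n, 20))      # hypothetical textual features
image_view = rng.random((n, 50))     # hypothetical visual features

labels_text = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(text_view)
labels_image = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(image_view)

# Contingency table: rows = text clusters, columns = image clusters.
# Large cells indicate items that group together in both modalities,
# i.e. candidate multi-view patterns.
table = np.zeros((4, 4), dtype=int)
for t, i in zip(labels_text, labels_image):
    table[t, i] += 1
print(table)
```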

Relevance:

40.00%

Publisher:

Abstract:

Class imbalance in textual data is one important factor that affects the reliability of text mining. For imbalanced textual data, conventional classifiers tend to have a strong performance bias, which results in a high accuracy rate on the majority class but a very low rate on the minority classes. An extreme strategy for imbalanced learning is to discard the majority instances and apply one-class classification to the minority class. However, this can easily cause another type of bias, which increases the accuracy rate on the minority class by sacrificing the majority class. This chapter investigates approaches that reduce these two types of performance bias and improve the reliability of discovered classification rules. Experimental results show that the inexact field learning method and parameter-optimised one-class classifiers achieve more balanced performance than the standard approaches.
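
The "parameter-optimised one-class classifier" side of the approach can be sketched as follows: a one-class SVM is trained on minority-class examples only, and its parameters are chosen against a balanced metric so that gains on the minority class do not simply sacrifice the majority class. The features, parameter grid and metric below are assumptions; the inexact field learning method discussed in the chapter is not reproduced here.

```python
# Sketch: tune a one-class SVM (fitted on the minority class only) against
# balanced accuracy on held-out data from both classes.
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_min = rng.normal(0.0, 1.0, (60, 10))       # minority-class documents (features)
X_maj = rng.normal(3.0, 1.0, (600, 10))      # majority-class documents
X_val = np.vstack([X_min[:20], X_maj[:200]]) # held-out validation set
y_val = np.array([1] * 20 + [-1] * 200)      # 1 = minority, -1 = majority

best = None
for nu in (0.05, 0.1, 0.2, 0.3):
    for gamma in ("scale", 0.01, 0.1):
        clf = OneClassSVM(nu=nu, gamma=gamma).fit(X_min[20:])
        score = balanced_accuracy_score(y_val, clf.predict(X_val))
        if best is None or score > best[0]:
            best = (round(score, 3), nu, gamma)
print("best balanced accuracy / nu / gamma:", best)
```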

Relevance:

40.00%

Publisher:

Abstract:

The web is a rich resource for information discovery, and web mining is consequently a hot topic. However, a reliable mining result depends on the reliability of the data set. Every second, the web generates a huge amount of data, such as web page requests and file transfers. These data reflect human behaviour in cyberspace and are therefore valuable for analysis in various disciplines, e.g. social science and network security. How to store the data is a challenge. A usual strategy is to save an abstract of the data, for example by using aggregation functions to preserve the features of the original data in much less space. A key problem, however, is that such information can be distorted by the presence of illegitimate traffic, e.g. botnet recruitment scanning, DDoS attack traffic, etc. An important consideration in web-related knowledge discovery is therefore the robustness of the aggregation method, which in turn may be affected by the reliability of the network traffic data. In this chapter, we first present methods based on aggregation functions, and then we employ information distances to filter out anomalous data as a preparation for web data mining.
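
A minimal sketch of the filtering step: compare each traffic interval's feature distribution with a reference distribution using an information distance, and flag intervals that are too far away before the data are aggregated for mining. Jensen-Shannon distance is used here as an example of an information distance; the reference distribution, threshold and traffic categories are illustrative assumptions, not the chapter's actual method.

```python
# Sketch: flag anomalous traffic intervals by their information distance
# from a reference distribution of request types.
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(4)
reference = np.array([0.5, 0.3, 0.15, 0.05])    # assumed baseline mix of request types

def interval_distribution(counts):
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

intervals = {
    "t0": [480, 310, 160, 50],      # normal-looking traffic
    "t1": [520, 290, 140, 50],
    "t2": [50, 40, 30, 880],        # e.g. a scanning or DDoS-like burst
}

for name, counts in intervals.items():
    d = jensenshannon(reference, interval_distribution(counts))
    flag = "anomalous" if d > 0.25 else "ok"    # threshold chosen for illustration
    print(name, round(float(d), 3), flag)
```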

Relevance:

40.00%

Publisher:

Abstract:

The central problem of automatic retrieval from unformatted text is that computational devices are not adequately trained to look for associated information. However, for complete understanding and information retrieval, a complete artificial intelligence would have to be built. This paper describes a method for achieving significant information retrieval by using a semantic search engine. The underlying semantic information is stored in a network of clarified words linked by logical connections. We employ simple scoring techniques on collections of paths in this network to establish a degree of relevance between a document and a clarified search criterion. This technique has been applied with success to test examples and can easily be scaled up to search large documents.
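
The path-scoring idea can be sketched roughly as follows: a document's terms are scored against a clarified (sense-tagged) query term by the lengths of the paths connecting them in a word network. The tiny graph, the sense-tagged node names and the inverse-path-length scoring rule are illustrative assumptions, not the paper's actual network or scoring technique.

```python
# Sketch: score a document against a clarified search criterion by path
# lengths in a small semantic network (networkx).
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("bank#finance", "loan"), ("loan", "interest"),
    ("bank#finance", "account"), ("account", "deposit"),
    ("bank#river", "shore"), ("shore", "erosion"),
])

query = "bank#finance"                   # a "clarified" (sense-tagged) search criterion
doc_terms = ["deposit", "interest", "erosion"]

def relevance(graph, query_node, terms):
    score = 0.0
    for t in terms:
        if graph.has_node(t) and nx.has_path(graph, query_node, t):
            # Shorter connecting paths contribute more to the relevance score.
            score += 1.0 / (1 + nx.shortest_path_length(graph, query_node, t))
    return score

print("document relevance:", round(relevance(G, query, doc_terms), 3))
```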

Relevance:

40.00%

Publisher:

Abstract:

Ranking is an important task for handling a large amount of content. Ideally, training data for supervised ranking would include a complete ranking of documents (or other objects such as images or videos) for a particular query. However, this is only possible for small sets of documents. In practice, one often resorts to document rating, in which each document in a subset is assigned a small number indicating its degree of relevance. This poses the general problem of modelling and learning rank data with ties. In this paper, we propose a probabilistic generative model that treats the process as permutations over partitions. This results in a super-exponential combinatorial state space with an unknown number of partitions and unknown ordering among them. We approach the problem from discrete choice theory, where subsets are chosen in a stagewise manner, significantly reducing the state space at each stage. Further, we show that with suitable parameterisation, we can still learn the models in linear time. We evaluate the proposed models in two application areas: (i) document ranking, with data from the recently held Yahoo! challenge, and (ii) collaborative filtering, with movie data. The results demonstrate that the models are competitive against well-known rivals.
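
The stagewise-choice view can be sketched with a simple surrogate likelihood: documents sharing a rating form a partition, and partitions are "chosen" stage by stage from whatever remains, each with probability proportional to the total worth of its members. This is a rough Plackett-Luce-style illustration under assumed worth parameters, not the paper's exact parameterisation.

```python
# Sketch: stagewise log-likelihood for a ranking with ties (partitions of
# documents ordered by rating level). Surrogate group-choice probability:
# sum of worths in the chosen group over sum of worths still remaining.
import numpy as np

def log_likelihood(worths, groups):
    """worths: positive score per document; groups: list of index lists,
    ordered from most to least relevant rating level."""
    remaining = set(range(len(worths)))
    ll = 0.0
    for group in groups:
        num = sum(worths[i] for i in group)
        den = sum(worths[i] for i in remaining)
        ll += np.log(num) - np.log(den)
        remaining -= set(group)
    return ll

worths = np.array([2.0, 1.5, 0.7, 0.5, 0.3])    # assumed per-document worths
groups = [[0, 1], [2], [3, 4]]                  # ratings: {d0,d1} > {d2} > {d3,d4}
print("stagewise log-likelihood:", round(log_likelihood(worths, groups), 3))
```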

Relevance:

40.00%

Publisher:

Abstract:

Speaker recognition is the process of automatically recognizing a speaker by analyzing the individual information contained in speech waves. In this paper, we discuss the development of an intelligent system for text-dependent speaker recognition. The system comprises two main modules: a wavelet-based signal-processing module for feature extraction from speech waves, and an artificial-neural-network-based classifier module to identify and categorize the speakers. Wavelets are used to de-noise and compress the speech signals; the wavelet family used is the Daubechies wavelets. After the necessary features are extracted from the speech waves, they are fed to a neural-network-based classifier to identify the speakers. We have implemented the Fuzzy ARTMAP (FAM) network in the classifier module to categorize the de-noised and compressed signals. The proposed intelligent learning system has been applied to a case study of the text-dependent speaker recognition problem.
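
The signal-processing module can be sketched with PyWavelets: decompose the signal with a Daubechies wavelet, shrink the detail coefficients to de-noise and compress it, and reconstruct. The synthetic signal, wavelet order and threshold value are assumptions, and the Fuzzy ARTMAP classifier stage is not reproduced here.

```python
# Sketch of the wavelet signal-processing stage only: Daubechies decomposition
# and soft-threshold de-noising with PyWavelets (pywt).
import numpy as np
import pywt

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 1024)
clean = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)
noisy = clean + 0.4 * rng.standard_normal(t.size)      # stand-in for a speech wave

# Decompose with a Daubechies wavelet, shrink detail coefficients, reconstruct.
coeffs = pywt.wavedec(noisy, "db4", level=5)
threshold = 0.3                                        # illustrative value
denoised_coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft")
                                 for c in coeffs[1:]]
denoised = pywt.waverec(denoised_coeffs, "db4")

# The shrunken coefficients (a compressed representation) would then be fed
# to the neural-network classifier for speaker identification.
features = np.concatenate(denoised_coeffs)
print("feature vector length:", features.shape[0])
```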

Relevance:

40.00%

Publisher:

Abstract:

This paper explores the idea of responsible systems design. To do so, it examines a case study: the accreditation of Information Systems (IS) courses by the professional body. A professionally accredited educational program, like any other non-trivial design product, represents the balancing of competing influences, ideas and stakeholders. The case is particularly relevant because recent, significant changes in the context of Australian IS education have made the task of designing educational systems in a responsible manner more complex. A general approach to addressing this complexity is articulated here as a design pattern to guide IS educational design. The pattern identifies the influences on design, the processes and products of design, and the feedback mechanism required to demonstrate that stakeholder requirements are satisfied. Design tensions and principles arising from the model are discussed, and future work is identified.

Relevance:

40.00%

Publisher:

Abstract:

One of the issues associated with pattern classification using data-based machine learning systems is the “curse of dimensionality”. In this paper, the circle-segments method is proposed as a feature selection method to identify important input features before the entire data set is provided for learning with machine learning systems. Specifically, four machine learning systems are deployed for classification, viz. the Multilayer Perceptron (MLP), Support Vector Machine (SVM), Fuzzy ARTMAP (FAM), and k-Nearest Neighbour (kNN). The integration of the circle-segments method with the machine learning systems has been applied to two case studies comprising one benchmark data set and one real data set. Overall, the results after feature selection using the circle-segments method demonstrate improvements in performance even with more than 50% of the input features eliminated from the original data sets.
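
The overall pipeline (select a reduced feature subset, then classify with several learners) can be sketched as follows. The circle-segments method is a visual technique with no standard library implementation, so mutual-information ranking is used here purely as a stand-in for the feature-selection step, and only three of the four classifiers are shown (Fuzzy ARTMAP is not available in scikit-learn); the data set is a generic benchmark, not the paper's.

```python
# Sketch: feature selection (stand-in for the circle-segments method) followed
# by classification with MLP, SVM and kNN, evaluated by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)      # generic benchmark data set

classifiers = {
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "SVM": SVC(),
    "kNN": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    # Keep fewer than half of the original 30 features, echoing the abstract.
    pipe = make_pipeline(StandardScaler(),
                         SelectKBest(mutual_info_classif, k=10),
                         clf)
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(name, round(score, 3))
```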

Relevance:

40.00%

Publisher:

Abstract:

In the past decade there has been massive growth of data on the internet, and many people rely on XML-based RSS feeds to receive updates from websites. In this paper, we propose a method for managing RSS feeds from various news websites. A web service is developed to deliver filtered news items from RSS feeds to a mobile client. Each news item is indexed, and the indexes are subsequently used to filter news items. Indexing is done in two steps: first, classical text categorization algorithms are used to assign a category to each news item; second, geoparsing is used to assign geolocation data to each news item. An Android application is developed to access filtered news items by consuming the proposed web service. A prototype is implemented using RapidMiner 5.0 as the data mining tool and SVM as the classification algorithm. Geoparsing and geocoding web services, and the Android API, are used to implement location-based access to news items. Experimental results show that the proposed approach is effective and saves a significant amount of information-overload processing time.
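
The two-step indexing idea can be sketched as follows: an SVM over TF-IDF features assigns a category to each item, a location is attached, and items are then filtered by category and location. The training snippets, labels, locations and filter criteria are assumptions for illustration; the actual prototype uses RapidMiner 5.0, external geoparsing and geocoding web services, and an Android client, none of which are reproduced here.

```python
# Sketch: categorise RSS items with a linear SVM over TF-IDF features,
# attach a (here hard-coded) location, then filter by category and location.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["stocks rally on earnings", "team wins the championship final",
               "parliament passes new budget", "striker signs record transfer"]
train_labels = ["business", "sport", "politics", "sport"]

categoriser = make_pipeline(TfidfVectorizer(), LinearSVC())
categoriser.fit(train_texts, train_labels)

# Incoming RSS items; geolocation would normally come from a geoparsing service.
items = [
    {"title": "local club reaches cup final", "location": "Melbourne"},
    {"title": "central bank raises interest rates", "location": "Sydney"},
]
for item in items:
    item["category"] = categoriser.predict([item["title"]])[0]

# Filter: only sport items for the (hypothetical) user's city.
wanted = [i for i in items if i["category"] == "sport" and i["location"] == "Melbourne"]
print(wanted)
```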