796 results for Data-Mining Techniques
Abstract:
Unstructured text data, such as emails, blogs, contracts, academic publications, organizational documents, transcribed interviews, and even tweets, are important sources of data in Information Systems research. Various forms of qualitative analysis of the content of these data exist and have revealed important insights. Yet, to date, these analyses have been hampered by limitations of human coding of large data sets, and by bias due to human interpretation. In this paper, we compare and combine two quantitative analysis techniques to demonstrate the capabilities of computational analysis for content analysis of unstructured text. Specifically, we seek to demonstrate how two quantitative analytic methods, viz., Latent Semantic Analysis and data mining, can aid researchers in revealing core content topic areas in large (or small) data sets, and in visualizing how these concepts evolve, migrate, converge or diverge over time. We exemplify the complementary application of these techniques through an examination of a 25-year sample of abstracts from selected journals in Information Systems, Management, and Accounting disciplines. Through this work, we explore the capabilities of two computational techniques, and show how these techniques can be used to gather insights from a large corpus of unstructured text.
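As a rough illustration of the kind of computational content analysis the abstract describes, the sketch below applies Latent Semantic Analysis (TF-IDF weighting followed by truncated SVD) to a tiny corpus of placeholder abstracts; it is a minimal sketch, not the authors' actual pipeline, and the example texts are invented.

```python
# Minimal LSA sketch (not the authors' exact pipeline): TF-IDF weighting
# followed by truncated SVD to expose latent topic dimensions in a corpus
# of abstracts. The corpus contents are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

abstracts = [
    "data mining techniques for large text corpora",
    "latent semantic analysis of journal abstracts",
    "qualitative coding of organizational documents",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(abstracts)          # term-document matrix

lsa = TruncatedSVD(n_components=2, random_state=0)
topics = lsa.fit_transform(X)               # documents in latent-topic space

terms = tfidf.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top = component.argsort()[::-1][:3]     # top-weighted terms per topic
    print(f"topic {i}:", [terms[j] for j in top])
```

Tracking how the top-weighted terms of each latent topic change across yearly slices of a corpus is one simple way to visualize how concepts migrate or converge over time.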
Abstract:
Rule extraction from neural networks has been investigated for two decades, and there have been significant applications. Despite this level of success, rule extraction methods for neural networks are generally not part of data mining tools, and a significant commercial breakthrough may still be some time away. This paper briefly reviews the state of the art and points to some of the obstacles, namely a lack of evaluation techniques in experiments and of larger benchmark data sets. A significant new development is the view that rule extraction from neural networks is an interactive process which actively involves the user. This view invites the application of assessment and evaluation techniques from information retrieval, which may lead to a range of new methods.
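The paper is a review, so the following is only a hedged sketch of one common "pedagogical" rule-extraction strategy it surveys: train a network, then fit an interpretable surrogate (here a decision tree) to the network's predictions and read rules off the surrogate. The dataset and model choices are illustrative, not from the paper.

```python
# Pedagogical rule-extraction sketch: a surrogate decision tree is fitted
# to the neural network's predictions; its branches serve as extracted
# rules, and fidelity measures agreement with the network.
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0).fit(X, y)

surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, net.predict(X))
print(export_text(surrogate))
print("fidelity:", (surrogate.predict(X) == net.predict(X)).mean())
```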
Abstract:
In the last few years we have observed a proliferation of approaches for clustering XML documents and schemas based on their structure and content. This proliferation is due to the different applications that require XML data to be clustered; these applications need data in the form of similar contents, tags, paths, structures, and semantics. In this paper, we first outline the application contexts in which clustering is useful, then we survey the approaches proposed so far according to the abstract representation of the data (instances or schema), the similarity measure identified, and the clustering algorithm used. This presentation leads to a taxonomy in which the current approaches can be classified and compared. We aim to introduce an integrated view that is useful when comparing XML data clustering approaches, when developing a new clustering algorithm, and when implementing an XML clustering component. Finally, the paper describes future trends and research issues that still need to be addressed.
Abstract:
With the growing number of XML documents on the Web, it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most existing research on clustering techniques focuses on only one feature of the XML documents, either their structure or their content, due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content is not suitable for real-life datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher-order model, namely a 3-order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose large-sized tensor models and to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large-scale datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
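To make the "implicit" combination idea concrete, the hedged sketch below puts both structural features (tag names, a crude stand-in for tag paths and frequent subtrees) and content terms of each XML document into one vector space and clusters them; the documents, feature choices, and cluster count are illustrative assumptions, not the thesis's actual models.

```python
# Implicit VSM sketch: structure (tags) and content (element text) of each
# XML document are merged into a single bag-of-features vector, then
# clustered with k-means.
from xml.etree import ElementTree as ET
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "<article><title>xml clustering</title><body>frequent subtrees</body></article>",
    "<article><title>tensor models</title><body>content and structure</body></article>",
    "<recipe><name>soup</name><steps>boil water</steps></recipe>",
]

def features(xml_text):
    root = ET.fromstring(xml_text)
    paths, words = [], []
    for elem in root.iter():
        paths.append(elem.tag)                  # structural feature: tag name
        if elem.text:
            words.extend(elem.text.split())     # content feature: element text
    return " ".join(paths + words)

X = TfidfVectorizer().fit_transform(features(d) for d in docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```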
Abstract:
In the era of Web 2.0, huge volumes of consumer reviews are posted to the Internet every day. Manual approaches to detecting and analyzing fake reviews (i.e., spam) are not practical due to the problem of information overload. However, the design and development of automated methods of detecting fake reviews is a challenging research problem. The main reason is that fake reviews are specifically composed to mislead readers, so they may appear the same as legitimate reviews (i.e., ham). As a result, discriminatory features that would enable individual reviews to be classified as spam or ham may not be available. Guided by the design science research methodology, the main contribution of this study is the design and instantiation of novel computational models for detecting fake reviews. In particular, a novel text mining model is developed and integrated into a semantic language model for the detection of untruthful reviews. The models are then evaluated based on a real-world dataset collected from amazon.com. The results of our experiments confirm that the proposed models outperform other well-known baseline models in detecting fake reviews. To the best of our knowledge, the work discussed in this article represents the first successful attempt to apply text mining methods and semantic language models to the detection of fake consumer reviews. A managerial implication of our research is that firms can apply our design artifacts to monitor online consumer reviews to develop effective marketing or product design strategies based on genuine consumer feedback posted to the Internet.
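For orientation only, here is a hedged baseline sketch of supervised text-mining for review classification; it is not the article's semantic language model, and the tiny labelled sample is purely illustrative.

```python
# Baseline fake-review classifier sketch: n-gram TF-IDF features plus a
# linear classifier separate labelled spam/ham reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Best product ever, buy now, amazing deal!!!",      # spam-like
    "Five stars, life changing, unbelievable!!!",       # spam-like
    "Battery lasts two days; the camera is average.",   # ham-like
    "Arrived late but works as described.",             # ham-like
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(reviews, labels)
print(model.predict(["Incredible!!! Best purchase of my life, buy it now"]))
```

As the abstract notes, the hard part is that discriminative surface features may be weak, which is why the article augments text mining with a semantic language model rather than relying on a lexical baseline like this one.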
Abstract:
Large margin learning approaches, such as support vector machines (SVM), have been successfully applied to numerous classification tasks, especially automatic facial expression recognition. The risk of such approaches, however, is their sensitivity to large margin losses due to the influence of noisy training examples and outliers, which is a common problem in the area of affective computing (i.e., manual coding at the frame level is tedious, so coarse labels are normally assigned). In this paper, we leverage the relaxation of the parallel-hyperplanes constraint and propose the use of modified correlation filters (MCF). The MCF is similar in spirit to SVMs and correlation filters, but with the key difference of optimizing only a single hyperplane. We demonstrate the superiority of MCF over current techniques in a battery of experiments.
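The following is only a heavily simplified analogue of the single-hyperplane idea, not the paper's exact MCF formulation: a single linear filter is learned by ridge regression so that its response is near +1 for positives and -1 for negatives, instead of enforcing SVM-style parallel margin hyperplanes. The synthetic data are illustrative.

```python
# Single-hyperplane (correlation-filter-style) sketch via ridge regression.
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(50, 10))
X_neg = rng.normal(loc=-1.0, size=(50, 10))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(50), -np.ones(50)])

lam = 0.1                                            # regularisation strength
w = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

pred = np.sign(X @ w)
print("training accuracy:", (pred == y).mean())
```

The regression-style loss is less dominated by a few extreme examples than a hard margin constraint, which is the intuition behind preferring such filters on noisily labelled affective data.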
Abstract:
This paper presents an extended granule mining based methodology to effectively describe the relationships between granules not only by traditional support and confidence, but by diversity and condition diversity as well. Diversity measures how diversely a granule is associated with other granules; it provides a novel kind of knowledge in databases. We also provide an algorithm to implement the proposed methodology. The experiments conducted to characterize a real network traffic data collection show that the proposed concepts and algorithm are promising.
Abstract:
Wavelet transforms (WTs) are a powerful tool for extracting localized variations in non-stationary signals, and their applications in traffic engineering have been introduced; however, these applications lack some important theoretical fundamentals. In particular, there is little guidance on selecting an appropriate WT across potential transport applications. The research described in this paper contributes uniquely to the literature by first describing a numerical experiment to demonstrate the shortcomings of commonly used data processing techniques in traffic engineering (i.e., averaging, moving averaging, second-order difference, oblique cumulative curve, and short-time Fourier transform). It then mathematically describes the WT's ability to detect singularities in traffic data. Next, selecting a suitable WT for a particular research topic in traffic engineering is discussed in detail by objectively and quantitatively comparing candidate wavelets' performances using a numerical experiment. Finally, based on several case studies using both loop detector data and vehicle trajectories, it is shown that selecting a suitable wavelet largely depends on the specific research topic, and that the Mexican hat wavelet generally gives a satisfactory performance in detecting singularities in traffic and vehicular data.
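As a hedged sketch of the core idea (not the paper's case studies), the continuous wavelet transform with the Mexican hat wavelet below highlights an abrupt change in a synthetic flow series that stands in for loop-detector data; the signal parameters are illustrative.

```python
# Mexican hat CWT sketch: large coefficient magnitudes concentrate around
# the singularity (a sudden flow breakdown) in the synthetic series.
import numpy as np
import pywt

t = np.arange(600)
flow = np.full(600, 1800.0)                                  # veh/h, free-flow plateau
flow[300:] = 900.0                                           # sudden breakdown at t = 300
flow += np.random.default_rng(0).normal(0, 50, 600)          # measurement noise

scales = np.arange(1, 64)
coefs, _ = pywt.cwt(flow, scales, "mexh")                    # Mexican hat wavelet

print("strongest response near index:", np.abs(coefs[30]).argmax())
```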
Abstract:
It is a big challenge to clearly identify the boundary between positive and negative streams. Several attempts have used negative feedback to address this challenge; however, two issues arise when using negative relevance feedback to improve the effectiveness of information filtering. The first is how to select constructive negative samples in order to reduce the space of negative documents. The second is how to decide which noisy extracted features should be updated based on the selected negative samples. This paper proposes a pattern mining based approach to select some offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms. In this way, multiple revising strategies can be used to update extracted features. An iterative learning algorithm is also proposed to implement this approach on RCV1, and substantial experiments show that the proposed approach achieves encouraging performance.
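A hedged sketch of the three-way term categorisation follows: terms seen only in positive documents are treated as positive specific, terms seen only in negative documents as negative specific, and terms seen in both as general. The documents and the simple set-based rule are illustrative assumptions, not the paper's pattern-mining formulation.

```python
# Term categorisation sketch: split extracted terms into positive specific,
# general, and negative specific using set membership.
positive_docs = ["stream mining patterns", "pattern mining for filtering"]
negative_docs = ["football match report", "pattern of play in football"]

def terms(docs):
    return {t for d in docs for t in d.split()}

pos, neg = terms(positive_docs), terms(negative_docs)
positive_specific = pos - neg
negative_specific = neg - pos
general = pos & neg

print("positive specific:", sorted(positive_specific))
print("general:", sorted(general))
print("negative specific:", sorted(negative_specific))
```

Each category can then receive its own revising strategy, e.g. boosting positive specific terms and penalising negative specific ones.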
Abstract:
Monitoring the natural environment is increasingly important as habitat degradation and climate change reduce the world's biodiversity. We have developed software tools and applications to assist ecologists with the collection and analysis of acoustic data at large spatial and temporal scales. One of our key objectives is automated animal call recognition, and our approach has three novel attributes. First, we work with raw environmental audio, contaminated by noise and artefacts and containing calls that vary greatly in volume depending on the animal's proximity to the microphone. Second, initial experimentation suggested that no single recognizer could deal with the enormous variety of calls. Therefore, we developed a toolbox of generic recognizers to extract invariant features for each call type. Third, many species are cryptic and offer little data with which to train a recognizer. Many popular machine learning methods require large volumes of training and validation data and considerable time and expertise to prepare. Consequently, we adopt bootstrap techniques that can be initiated with little data and refined subsequently. In this paper, we describe our recognition tools and present results for real ecological problems.
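For flavour only, here is a hedged sketch of one very simple "generic recognizer": band-limited energy detection on a spectrogram, finding a synthetic 2 kHz call embedded in noise. Real recognizers in the toolbox are more elaborate, and all signal parameters below are illustrative.

```python
# Band-energy call detector sketch on a synthetic recording.
import numpy as np
from scipy.signal import spectrogram

fs = 22050
t = np.arange(0, 5.0, 1 / fs)
audio = np.random.default_rng(0).normal(0, 0.2, t.size)      # background noise
call = (t > 2.0) & (t < 2.5)
audio[call] += 0.5 * np.sin(2 * np.pi * 2000 * t[call])      # 2 kHz "call"

f, times, Sxx = spectrogram(audio, fs=fs, nperseg=512)
band = (f > 1800) & (f < 2200)
energy = Sxx[band].sum(axis=0)                                # energy in the call band over time

threshold = 10 * np.median(energy)                            # simple adaptive threshold
detections = times[energy > threshold]
print("call detected around t =", detections.min(), "to", detections.max(), "s")
```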
Abstract:
Decision tables and decision rules play an important role in rough set based data analysis; they compress databases into granules and describe the associations between granules. Granule mining was also proposed to interpret decision rules in terms of association rules and a multi-tier structure. In this paper, we further extend granule mining to describe the relationships between granules not only by traditional support and confidence, but by diversity and condition diversity as well. Diversity measures how diversely a granule is associated with other granules; it provides a novel kind of knowledge in databases. Some experiments are conducted to test the proposed new concepts for describing the characteristics of a real network traffic data collection. The results show that the proposed concepts are promising.
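The sketch below illustrates the basic granule idea only loosely: rows of a decision table are compressed into condition granules and decision granules, support and confidence describe their associations, and the number of distinct decision granules linked to one condition granule serves as a simple count-based stand-in for "diversity". The toy traffic-like records and that stand-in measure are assumptions, not the paper's exact definitions.

```python
# Granule mining sketch: condition granules, their associations with
# decision granules, and support/confidence plus a simple diversity count.
from collections import Counter, defaultdict

# (protocol, service) are condition attributes; the last field is the decision.
records = [
    ("tcp", "http", "normal"), ("tcp", "http", "normal"),
    ("tcp", "ftp", "attack"), ("udp", "dns", "normal"),
    ("tcp", "http", "attack"),
]

condition_granules = Counter(r[:2] for r in records)          # granule -> support count
associations = defaultdict(Counter)                            # condition granule -> decisions
for *cond, decision in records:
    associations[tuple(cond)][decision] += 1

for granule, decisions in associations.items():
    support = condition_granules[granule] / len(records)
    for decision, n in decisions.items():
        confidence = n / condition_granules[granule]
        print(granule, "->", decision, f"support={support:.2f} confidence={confidence:.2f}")
    print(granule, "diversity (distinct decisions):", len(decisions))
```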
Abstract:
Having a reliable understanding about the behaviours, problems, and performance of existing processes is important in enabling a targeted process improvement initiative. Recently, there has been an increase in the application of innovative process mining techniques to facilitate evidence-based understanding about organizations' business processes. Nevertheless, the application of these techniques in the domain of finance in Australia is, at best, scarce. This paper details a 6-month case study on the application of process mining in one of the largest insurance companies in Australia. In particular, the challenges encountered, the lessons learned, and the results obtained from this case study are detailed. Through this case study, we not only validated existing 'lessons learned' from other similar case studies, but also added new insights that can be beneficial to other practitioners in applying process mining in their respective fields.
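For readers unfamiliar with process mining, the sketch below shows a typical starting point using the open-source pm4py library rather than the case study's actual toolchain; the "claims.xes" event log is a hypothetical export of case/activity/timestamp records.

```python
# Process discovery sketch with pm4py: mine a Petri net from an event log
# and check how well the log replays on it.
import pm4py

log = pm4py.read_xes("claims.xes")                      # hypothetical event log
net, im, fm = pm4py.discover_petri_net_inductive(log)   # Inductive Miner
fitness = pm4py.fitness_token_based_replay(log, net, im, fm)
print("replay fitness:", fitness)
```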
Abstract:
Background Accumulated biological research outcomes show that biological functions do not depend on individual genes, but on complex gene networks. Microarray data are widely used to cluster genes according to their expression levels across experimental conditions. However, functionally related genes generally do not show coherent expression across all conditions since any given cellular process is active only under a subset of conditions. Biclustering finds gene clusters that have similar expression levels across a subset of conditions. This paper proposes a seed-based algorithm that identifies coherent genes in an exhaustive but efficient manner. Methods In order to find the biclusters in a gene expression dataset, we exhaustively select combinations of genes and conditions as seeds to create candidate bicluster tables. The tables have two columns: (a) a gene set, and (b) the conditions on which the gene set has dissimilar expression levels to the seed. First, the genes with fewer than the maximum number of dissimilar conditions are identified and a table of these genes is created. Second, the rows that have the same dissimilar conditions are grouped together. Third, the table is sorted in ascending order based on the number of dissimilar conditions. Finally, beginning with the first row of the table, a test is run repeatedly to determine whether the cardinality of the gene set in the row is greater than the minimum threshold number of genes in a bicluster. If so, a bicluster is output and the corresponding row is removed from the table. Repeating this process, all biclusters in the table are systematically identified until the table becomes empty. Conclusions This paper presents a novel biclustering algorithm for the identification of additive biclusters. Since it involves exhaustively testing combinations of genes and conditions, the additive biclusters can be found more readily.
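The following is a simplified, hedged sketch of the seed-and-table idea for additive biclusters: for one chosen seed gene, mark for every other gene the conditions on which it does not track the seed up to a constant offset, group genes by identical dissimilar-condition sets, and report groups that are large enough. The thresholds and toy matrix are illustrative, and the paper's exhaustive seed enumeration over all gene/condition combinations is omitted.

```python
# Seed-based additive biclustering sketch: group genes by the set of
# conditions on which they deviate from the seed beyond a tolerance.
import numpy as np

X = np.array([
    [1.0, 2.0, 3.0, 9.0],    # gene 0 (seed)
    [2.0, 3.0, 4.0, 0.5],    # gene 1: seed + 1 on conditions 0-2
    [0.0, 1.0, 2.0, 7.7],    # gene 2: seed - 1 on conditions 0-2
    [5.0, 1.0, 8.0, 2.0],    # gene 3: unrelated
])
tol, min_genes, seed = 0.2, 2, 0

def dissimilar_conditions(gene):
    diff = X[gene] - X[seed]                  # per-condition offset from the seed
    offset = np.median(diff)                  # constant shift of an additive bicluster
    return frozenset(np.flatnonzero(np.abs(diff - offset) > tol))

table = {}                                     # dissimilar-condition set -> genes
for g in range(X.shape[0]):
    table.setdefault(dissimilar_conditions(g), []).append(g)

# Process groups in ascending order of dissimilar-condition count.
for bad_conditions, genes in sorted(table.items(), key=lambda kv: len(kv[0])):
    if len(genes) >= min_genes:
        cols = sorted(set(range(X.shape[1])) - bad_conditions)
        print("bicluster: genes", genes, "conditions", cols)
```

On this toy matrix the sketch reports genes 1 and 2 together with conditions 0-2, where both track the seed up to a constant shift.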