797 resultados para Data Mining


Relevância:

60.00% 60.00%

Publicador:

Resumo:

The design, development, and use of complex systems models raises a unique class of challenges and potential pitfalls, many of which are commonly recurring problems. Over time, researchers gain experience in this form of modeling, choosing algorithms, techniques, and frameworks that improve the quality, confidence level, and speed of development of their models. This increasing collective experience of complex systems modellers is a resource that should be captured. Fields such as software engineering and architecture have benefited from the development of generic solutions to recurring problems, called patterns. Using pattern development techniques from these fields, insights from communities such as learning and information processing, data mining, bioinformatics, and agent-based modeling can be identified and captured. Collections of such 'pattern languages' would allow knowledge gained through experience to be readily accessible to less-experienced practitioners and to other domains. This paper proposes a methodology for capturing the wisdom of computational modelers by introducing example visualization patterns, and a pattern classification system for analyzing the relationship between micro and macro behaviour in complex systems models. We anticipate that a new field of complex systems patterns will provide an invaluable resource for both practicing and future generations of modelers.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Spatial data mining recently emerges from a number of real applications, such as real-estate marketing, urban planning, weather forecasting, medical image analysis, road traffic accident analysis, etc. It demands for efficient solutions for many new, expensive, and complicated problems. In this paper, we investigate the problem of evaluating the top k distinguished “features” for a “cluster” based on weighted proximity relationships between the cluster and features. We measure proximity in an average fashion to address possible nonuniform data distribution in a cluster. Combining a standard multi-step paradigm with new lower and upper proximity bounds, we presented an efficient algorithm to solve the problem. The algorithm is implemented in several different modes. Our experiment results not only give a comparison among them but also illustrate the efficiency of the algorithm.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A major task of traditional temporal event sequence mining is to find all frequent event patterns from a long temporal sequence. In many real applications, however, events are often grouped into different types, and not all types are of equal importance. In this paper, we consider the problem of efficient mining of temporal event sequences which lead to an instance of a specific type of event. Temporal constraints are used to ensure sensibility of the mining results. We will first generalise and formalise the problem of event-oriented temporal sequence data mining. After discussing some unique issues in this new problem, we give a set of criteria, which are adapted from traditional data mining techniques, to measure the quality of patterns to be discovered. Finally we present an algorithm to discover potentially interesting patterns.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Pattern discovery in temporal event sequences is of great importance in many application domains, such as telecommunication network fault analysis. In reality, not every type of event has an accurate timestamp. Some of them, defined as inaccurate events may only have an interval as possible time of occurrence. The existence of inaccurate events may cause uncertainty in event ordering. The traditional support model cannot deal with this uncertainty, which would cause some interesting patterns to be missing. A new concept, precise support, is introduced to evaluate the probability of a pattern contained in a sequence. Based on this new metric, we define the uncertainty model and present an algorithm to discover interesting patterns in the sequence database that has one type of inaccurate event. In our model, the number of types of inaccurate events can be extended to k readily, however, at a cost of increasing computational complexity.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Esta dissertação visa apresentar o mapeamento do uso das teorias de sistemas de informações, usando técnicas de recuperação de informação e metodologias de mineração de dados e textos. As teorias abordadas foram Economia de Custos de Transações (Transactions Costs Economics TCE), Visão Baseada em Recursos da Firma (Resource-Based View-RBV) e Teoria Institucional (Institutional Theory-IT), sendo escolhidas por serem teorias de grande relevância para estudos de alocação de investimentos e implementação em sistemas de informação, tendo como base de dados o conteúdo textual (em inglês) do resumo e da revisão teórica dos artigos dos periódicos Information System Research (ISR), Management Information Systems Quarterly (MISQ) e Journal of Management Information Systems (JMIS) no período de 2000 a 2008. Os resultados advindos da técnica de mineração textual aliada à mineração de dados foram comparadas com a ferramenta de busca avançada EBSCO e demonstraram uma eficiência maior na identificação de conteúdo. Os artigos fundamentados nas três teorias representaram 10% do total de artigos dos três períodicos e o período mais profícuo de publicação foi o de 2001 e 2007.(AU)

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Purpose – Academic writing is often considered to be a weakness in contemporary students, while good reporting and writing skills are highly valued by graduate employers. A number of universities have introduced writing centres aimed at addressing this problem; however, the evaluation of such centres is usually qualitative. The paper seeks to consider the efficacy of a writing centre by looking at the impact of attendance on two “real world” quantitative outcomes – achievement and progression. Design/methodology/approach – Data mining was used to obtain records of 806 first-year students, of whom 45 had attended the writing centre and 761 had not. Findings – A highly significant association between writing centre attendance and achievement was found. Progression to year two was also significantly associated with writing centre attendance. Originality/value – Further, quantitative evaluation of writing centres is advocated using random allocation to a comparison condition to control for potential confounds such as motivation.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In the last two decades there have been substantial developments in the mathematical theory of inverse optimization problems, and their applications have expanded greatly. In parallel, time series analysis and forecasting have become increasingly important in various fields of research such as data mining, economics, business, engineering, medicine, politics, and many others. Despite the large uses of linear programming in forecasting models there is no a single application of inverse optimization reported in the forecasting literature when the time series data is available. Thus the goal of this paper is to introduce inverse optimization into forecasting field, and to provide a streamlined approach to time series analysis and forecasting using inverse linear programming. An application has been used to demonstrate the use of inverse forecasting developed in this study. © 2007 Elsevier Ltd. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Customer Base Analysis is perhaps the first stage of analysis in customer value, aiming to predict purchase frequency and customer lifecycle. An important part of the customer purchase frequency and its retention has to do with the service upgrade. Many models have tried to predict purchase frequency as well as upgrading. The comparison of these models seems important to provide academics with a picture of the current situation. The purpose of this research is to evaluate how models can predict service upgrade among a customer database of an online DVD rental company and suggest an alternative based on data mining techniques and data on historical transactions.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This thesis introduces a flexible visual data exploration framework which combines advanced projection algorithms from the machine learning domain with visual representation techniques developed in the information visualisation domain to help a user to explore and understand effectively large multi-dimensional datasets. The advantage of such a framework to other techniques currently available to the domain experts is that the user is directly involved in the data mining process and advanced machine learning algorithms are employed for better projection. A hierarchical visualisation model guided by a domain expert allows them to obtain an informed segmentation of the input space. Two other components of this thesis exploit properties of these principled probabilistic projection algorithms to develop a guided mixture of local experts algorithm which provides robust prediction and a model to estimate feature saliency simultaneously with the training of a projection algorithm.Local models are useful since a single global model cannot capture the full variability of a heterogeneous data space such as the chemical space. Probabilistic hierarchical visualisation techniques provide an effective soft segmentation of an input space by a visualisation hierarchy whose leaf nodes represent different regions of the input space. We use this soft segmentation to develop a guided mixture of local experts (GME) algorithm which is appropriate for the heterogeneous datasets found in chemoinformatics problems. Moreover, in this approach the domain experts are more involved in the model development process which is suitable for an intuition and domain knowledge driven task such as drug discovery. We also derive a generative topographic mapping (GTM) based data visualisation approach which estimates feature saliency simultaneously with the training of a visualisation model.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This thesis applies a hierarchical latent trait model system to a large quantity of data. The motivation for it was lack of viable approaches to analyse High Throughput Screening datasets which maybe include thousands of data points with high dimensions. High Throughput Screening (HTS) is an important tool in the pharmaceutical industry for discovering leads which can be optimised and further developed into candidate drugs. Since the development of new robotic technologies, the ability to test the activities of compounds has considerably increased in recent years. Traditional methods, looking at tables and graphical plots for analysing relationships between measured activities and the structure of compounds, have not been feasible when facing a large HTS dataset. Instead, data visualisation provides a method for analysing such large datasets, especially with high dimensions. So far, a few visualisation techniques for drug design have been developed, but most of them just cope with several properties of compounds at one time. We believe that a latent variable model (LTM) with a non-linear mapping from the latent space to the data space is a preferred choice for visualising a complex high-dimensional data set. As a type of latent variable model, the latent trait model can deal with either continuous data or discrete data, which makes it particularly useful in this domain. In addition, with the aid of differential geometry, we can imagine the distribution of data from magnification factor and curvature plots. Rather than obtaining the useful information just from a single plot, a hierarchical LTM arranges a set of LTMs and their corresponding plots in a tree structure. We model the whole data set with a LTM at the top level, which is broken down into clusters at deeper levels of t.he hierarchy. In this manner, the refined visualisation plots can be displayed in deeper levels and sub-clusters may be found. Hierarchy of LTMs is trained using expectation-maximisation (EM) algorithm to maximise its likelihood with respect to the data sample. Training proceeds interactively in a recursive fashion (top-down). The user subjectively identifies interesting regions on the visualisation plot that they would like to model in a greater detail. At each stage of hierarchical LTM construction, the EM algorithm alternates between the E- and M-step. Another problem that can occur when visualising a large data set is that there may be significant overlaps of data clusters. It is very difficult for the user to judge where centres of regions of interest should be put. We address this problem by employing the minimum message length technique, which can help the user to decide the optimal structure of the model. In this thesis we also demonstrate the applicability of the hierarchy of latent trait models in the field of document data mining.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper explores the use of the optimization procedures in SAS/OR software with application to the ordered weight averaging (OWA) operators of decision-making units (DMUs). OWA was originally introduced by Yager (IEEE Trans Syst Man Cybern 18(1):183-190, 1988) has gained much interest among researchers, hence many applications such as in the areas of decision making, expert systems, data mining, approximate reasoning, fuzzy system and control have been proposed. On the other hand, the SAS is powerful software and it is capable of running various optimization tools such as linear and non-linear programming with all type of constraints. To facilitate the use of OWA operator by SAS users, a code was implemented. The SAS macro developed in this paper selects the criteria and alternatives from a SAS dataset and calculates a set of OWA weights. An example is given to illustrate the features of SAS/OWA software. © Springer-Verlag 2009.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper proposes a novel framework of incorporating protein-protein interactions (PPI) ontology knowledge into PPI extraction from biomedical literature in order to address the emerging challenges of deep natural language understanding. It is built upon the existing work on relation extraction using the Hidden Vector State (HVS) model. The HVS model belongs to the category of statistical learning methods. It can be trained directly from un-annotated data in a constrained way whilst at the same time being able to capture the underlying named entity relationships. However, it is difficult to incorporate background knowledge or non-local information into the HVS model. This paper proposes to represent the HVS model as a conditionally trained undirected graphical model in which non-local features derived from PPI ontology through inference would be easily incorporated. The seamless fusion of ontology inference with statistical learning produces a new paradigm to information extraction.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Guest editorial Ali Emrouznejad is a Senior Lecturer at the Aston Business School in Birmingham, UK. His areas of research interest include performance measurement and management, efficiency and productivity analysis as well as data mining. He has published widely in various international journals. He is an Associate Editor of IMA Journal of Management Mathematics and Guest Editor to several special issues of journals including Journal of Operational Research Society, Annals of Operations Research, Journal of Medical Systems, and International Journal of Energy Management Sector. He is in the editorial board of several international journals and co-founder of Performance Improvement Management Software. William Ho is a Senior Lecturer at the Aston University Business School. Before joining Aston in 2005, he had worked as a Research Associate in the Department of Industrial and Systems Engineering at the Hong Kong Polytechnic University. His research interests include supply chain management, production and operations management, and operations research. He has published extensively in various international journals like Computers & Operations Research, Engineering Applications of Artificial Intelligence, European Journal of Operational Research, Expert Systems with Applications, International Journal of Production Economics, International Journal of Production Research, Supply Chain Management: An International Journal, and so on. His first authored book was published in 2006. He is an Editorial Board member of the International Journal of Advanced Manufacturing Technology and an Associate Editor of the OR Insight Journal. Currently, he is a Scholar of the Advanced Institute of Management Research. Uses of frontier efficiency methodologies and multi-criteria decision making for performance measurement in the energy sector This special issue aims to focus on holistic, applied research on performance measurement in energy sector management and for publication of relevant applied research to bridge the gap between industry and academia. After a rigorous refereeing process, seven papers were included in this special issue. The volume opens with five data envelopment analysis (DEA)-based papers. Wu et al. apply the DEA-based Malmquist index to evaluate the changes in relative efficiency and the total factor productivity of coal-fired electricity generation of 30 Chinese administrative regions from 1999 to 2007. Factors considered in the model include fuel consumption, labor, capital, sulphur dioxide emissions, and electricity generated. The authors reveal that the east provinces were relatively and technically more efficient, whereas the west provinces had the highest growth rate in the period studied. Ioannis E. Tsolas applies the DEA approach to assess the performance of Greek fossil fuel-fired power stations taking undesirable outputs into consideration, such as carbon dioxide and sulphur dioxide emissions. In addition, the bootstrapping approach is deployed to address the uncertainty surrounding DEA point estimates, and provide bias-corrected estimations and confidence intervals for the point estimates. The author revealed from the sample that the non-lignite-fired stations are on an average more efficient than the lignite-fired stations. Maethee Mekaroonreung and Andrew L. Johnson compare the relative performance of three DEA-based measures, which estimate production frontiers and evaluate the relative efficiency of 113 US petroleum refineries while considering undesirable outputs. Three inputs (capital, energy consumption, and crude oil consumption), two desirable outputs (gasoline and distillate generation), and an undesirable output (toxic release) are considered in the DEA models. The authors discover that refineries in the Rocky Mountain region performed the best, and about 60 percent of oil refineries in the sample could improve their efficiencies further. H. Omrani, A. Azadeh, S. F. Ghaderi, and S. Abdollahzadeh presented an integrated approach, combining DEA, corrected ordinary least squares (COLS), and principal component analysis (PCA) methods, to calculate the relative efficiency scores of 26 Iranian electricity distribution units from 2003 to 2006. Specifically, both DEA and COLS are used to check three internal consistency conditions, whereas PCA is used to verify and validate the final ranking results of either DEA (consistency) or DEA-COLS (non-consistency). Three inputs (network length, transformer capacity, and number of employees) and two outputs (number of customers and total electricity sales) are considered in the model. Virendra Ajodhia applied three DEA-based models to evaluate the relative performance of 20 electricity distribution firms from the UK and the Netherlands. The first model is a traditional DEA model for analyzing cost-only efficiency. The second model includes (inverse) quality by modelling total customer minutes lost as an input data. The third model is based on the idea of using total social costs, including the firm’s private costs and the interruption costs incurred by consumers, as an input. Both energy-delivered and number of consumers are treated as the outputs in the models. After five DEA papers, Stelios Grafakos, Alexandros Flamos, Vlasis Oikonomou, and D. Zevgolis presented a multiple criteria analysis weighting approach to evaluate the energy and climate policy. The proposed approach is akin to the analytic hierarchy process, which consists of pairwise comparisons, consistency verification, and criteria prioritization. In the approach, stakeholders and experts in the energy policy field are incorporated in the evaluation process by providing an interactive mean with verbal, numerical, and visual representation of their preferences. A total of 14 evaluation criteria were considered and classified into four objectives, such as climate change mitigation, energy effectiveness, socioeconomic, and competitiveness and technology. Finally, Borge Hess applied the stochastic frontier analysis approach to analyze the impact of various business strategies, including acquisition, holding structures, and joint ventures, on a firm’s efficiency within a sample of 47 natural gas transmission pipelines in the USA from 1996 to 2005. The author finds that there were no significant changes in the firm’s efficiency by an acquisition, and there is a weak evidence for efficiency improvements caused by the new shareholder. Besides, the author discovers that parent companies appear not to influence a subsidiary’s efficiency positively. In addition, the analysis shows a negative impact of a joint venture on technical efficiency of the pipeline company. To conclude, we are grateful to all the authors for their contribution, and all the reviewers for their constructive comments, which made this special issue possible. We hope that this issue would contribute significantly to performance improvement of the energy sector.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

MOTIVATION: G protein-coupled receptors (GPCRs) play an important role in many physiological systems by transducing an extracellular signal into an intracellular response. Over 50% of all marketed drugs are targeted towards a GPCR. There is considerable interest in developing an algorithm that could effectively predict the function of a GPCR from its primary sequence. Such an algorithm is useful not only in identifying novel GPCR sequences but in characterizing the interrelationships between known GPCRs. RESULTS: An alignment-free approach to GPCR classification has been developed using techniques drawn from data mining and proteochemometrics. A dataset of over 8000 sequences was constructed to train the algorithm. This represents one of the largest GPCR datasets currently available. A predictive algorithm was developed based upon the simplest reasonable numerical representation of the protein's physicochemical properties. A selective top-down approach was developed, which used a hierarchical classifier to assign sequences to subdivisions within the GPCR hierarchy. The predictive performance of the algorithm was assessed against several standard data mining classifiers and further validated against Support Vector Machine-based GPCR prediction servers. The selective top-down approach achieves significantly higher accuracy than standard data mining methods in almost all cases.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In order to address problems of information overload in digital imagery task domains we have developed an interactive approach to the capture and reuse of image context information. Our framework models different aspects of the relationship between images and domain tasks they support by monitoring the interactive manipulation and annotation of task-relevant imagery. The approach allows us to gauge a measure of a user's intentions as they complete goal-directed image tasks. As users analyze retrieved imagery their interactions are captured and an expert task context is dynamically constructed. This human expertise, proficiency, and knowledge can then be leveraged to support other users in carrying out similar domain tasks. We have applied our techniques to two multimedia retrieval applications for two different image domains, namely the geo-spatial and medical imagery domains. © Springer-Verlag Berlin Heidelberg 2007.