775 resultados para Data Mining, Rough Sets, Multi-Dimension, Association Rules, Constraint
Resumo:
Retrospective clinical data presents many challenges for data mining and machine learning. The transcription of patient records from paper charts and subsequent manipulation of data often results in high volumes of noise as well as a loss of other important information. In addition, such datasets often fail to represent expert medical knowledge and reasoning in any explicit manner. In this research we describe applying data mining methods to retrospective clinical data to build a prediction model for asthma exacerbation severity for pediatric patients in the emergency department. Difficulties in building such a model forced us to investigate alternative strategies for analyzing and processing retrospective data. This paper describes this process together with an approach to mining retrospective clinical data by incorporating formalized external expert knowledge (secondary knowledge sources) into the classification task. This knowledge is used to partition the data into a number of coherent sets, where each set is explicitly described in terms of the secondary knowledge source. Instances from each set are then classified in a manner appropriate for the characteristics of the particular set. We present our methodology and outline a set of experiential results that demonstrate some advantages and some limitations of our approach. © 2008 Springer-Verlag Berlin Heidelberg.
Resumo:
Guest editorial Ali Emrouznejad is a Senior Lecturer at the Aston Business School in Birmingham, UK. His areas of research interest include performance measurement and management, efficiency and productivity analysis as well as data mining. He has published widely in various international journals. He is an Associate Editor of IMA Journal of Management Mathematics and Guest Editor to several special issues of journals including Journal of Operational Research Society, Annals of Operations Research, Journal of Medical Systems, and International Journal of Energy Management Sector. He is in the editorial board of several international journals and co-founder of Performance Improvement Management Software. William Ho is a Senior Lecturer at the Aston University Business School. Before joining Aston in 2005, he had worked as a Research Associate in the Department of Industrial and Systems Engineering at the Hong Kong Polytechnic University. His research interests include supply chain management, production and operations management, and operations research. He has published extensively in various international journals like Computers & Operations Research, Engineering Applications of Artificial Intelligence, European Journal of Operational Research, Expert Systems with Applications, International Journal of Production Economics, International Journal of Production Research, Supply Chain Management: An International Journal, and so on. His first authored book was published in 2006. He is an Editorial Board member of the International Journal of Advanced Manufacturing Technology and an Associate Editor of the OR Insight Journal. Currently, he is a Scholar of the Advanced Institute of Management Research. Uses of frontier efficiency methodologies and multi-criteria decision making for performance measurement in the energy sector This special issue aims to focus on holistic, applied research on performance measurement in energy sector management and for publication of relevant applied research to bridge the gap between industry and academia. After a rigorous refereeing process, seven papers were included in this special issue. The volume opens with five data envelopment analysis (DEA)-based papers. Wu et al. apply the DEA-based Malmquist index to evaluate the changes in relative efficiency and the total factor productivity of coal-fired electricity generation of 30 Chinese administrative regions from 1999 to 2007. Factors considered in the model include fuel consumption, labor, capital, sulphur dioxide emissions, and electricity generated. The authors reveal that the east provinces were relatively and technically more efficient, whereas the west provinces had the highest growth rate in the period studied. Ioannis E. Tsolas applies the DEA approach to assess the performance of Greek fossil fuel-fired power stations taking undesirable outputs into consideration, such as carbon dioxide and sulphur dioxide emissions. In addition, the bootstrapping approach is deployed to address the uncertainty surrounding DEA point estimates, and provide bias-corrected estimations and confidence intervals for the point estimates. The author revealed from the sample that the non-lignite-fired stations are on an average more efficient than the lignite-fired stations. Maethee Mekaroonreung and Andrew L. Johnson compare the relative performance of three DEA-based measures, which estimate production frontiers and evaluate the relative efficiency of 113 US petroleum refineries while considering undesirable outputs. Three inputs (capital, energy consumption, and crude oil consumption), two desirable outputs (gasoline and distillate generation), and an undesirable output (toxic release) are considered in the DEA models. The authors discover that refineries in the Rocky Mountain region performed the best, and about 60 percent of oil refineries in the sample could improve their efficiencies further. H. Omrani, A. Azadeh, S. F. Ghaderi, and S. Abdollahzadeh presented an integrated approach, combining DEA, corrected ordinary least squares (COLS), and principal component analysis (PCA) methods, to calculate the relative efficiency scores of 26 Iranian electricity distribution units from 2003 to 2006. Specifically, both DEA and COLS are used to check three internal consistency conditions, whereas PCA is used to verify and validate the final ranking results of either DEA (consistency) or DEA-COLS (non-consistency). Three inputs (network length, transformer capacity, and number of employees) and two outputs (number of customers and total electricity sales) are considered in the model. Virendra Ajodhia applied three DEA-based models to evaluate the relative performance of 20 electricity distribution firms from the UK and the Netherlands. The first model is a traditional DEA model for analyzing cost-only efficiency. The second model includes (inverse) quality by modelling total customer minutes lost as an input data. The third model is based on the idea of using total social costs, including the firm’s private costs and the interruption costs incurred by consumers, as an input. Both energy-delivered and number of consumers are treated as the outputs in the models. After five DEA papers, Stelios Grafakos, Alexandros Flamos, Vlasis Oikonomou, and D. Zevgolis presented a multiple criteria analysis weighting approach to evaluate the energy and climate policy. The proposed approach is akin to the analytic hierarchy process, which consists of pairwise comparisons, consistency verification, and criteria prioritization. In the approach, stakeholders and experts in the energy policy field are incorporated in the evaluation process by providing an interactive mean with verbal, numerical, and visual representation of their preferences. A total of 14 evaluation criteria were considered and classified into four objectives, such as climate change mitigation, energy effectiveness, socioeconomic, and competitiveness and technology. Finally, Borge Hess applied the stochastic frontier analysis approach to analyze the impact of various business strategies, including acquisition, holding structures, and joint ventures, on a firm’s efficiency within a sample of 47 natural gas transmission pipelines in the USA from 1996 to 2005. The author finds that there were no significant changes in the firm’s efficiency by an acquisition, and there is a weak evidence for efficiency improvements caused by the new shareholder. Besides, the author discovers that parent companies appear not to influence a subsidiary’s efficiency positively. In addition, the analysis shows a negative impact of a joint venture on technical efficiency of the pipeline company. To conclude, we are grateful to all the authors for their contribution, and all the reviewers for their constructive comments, which made this special issue possible. We hope that this issue would contribute significantly to performance improvement of the energy sector.
Resumo:
In order to address problems of information overload in digital imagery task domains we have developed an interactive approach to the capture and reuse of image context information. Our framework models different aspects of the relationship between images and domain tasks they support by monitoring the interactive manipulation and annotation of task-relevant imagery. The approach allows us to gauge a measure of a user's intentions as they complete goal-directed image tasks. As users analyze retrieved imagery their interactions are captured and an expert task context is dynamically constructed. This human expertise, proficiency, and knowledge can then be leveraged to support other users in carrying out similar domain tasks. We have applied our techniques to two multimedia retrieval applications for two different image domains, namely the geo-spatial and medical imagery domains. © Springer-Verlag Berlin Heidelberg 2007.
Resumo:
Today, due to globalization of the world the size of data set is increasing, it is necessary to discover the knowledge. The discovery of knowledge can be typically in the form of association rules, classification rules, clustering, discovery of frequent episodes and deviation detection. Fast and accurate classifiers for large databases are an important task in data mining. There is growing evidence that integrating classification and association rules mining, classification approaches based on heuristic, greedy search like decision tree induction. Emerging associative classification algorithms have shown good promises on producing accurate classifiers. In this paper we focus on performance of associative classification and present a parallel model for classifier building. For classifier building some parallel-distributed algorithms have been proposed for decision tree induction but so far no such work has been reported for associative classification.
Resumo:
This paper considers the problem of concept generalization in decision-making systems where such features of real-world databases as large size, incompleteness and inconsistence of the stored information are taken into account. The methods of the rough set theory (like lower and upper approximations, positive regions and reducts) are used for the solving of this problem. The new discretization algorithm of the continuous attributes is proposed. It essentially increases an overall performance of generalization algorithms and can be applied to processing of real value attributes in large data tables. Also the search algorithm of the significant attributes combined with a stage of discretization is developed. It allows avoiding splitting of continuous domains of insignificant attributes into intervals.
Resumo:
The paper presents a case study of geo-monitoring a region consisting in the capturing and encoding of human expertise into a knowledge-based system. As soon as the maps have been processed, the data patterns are detected using knowledge-based agents for the harvest prognosis.
Resumo:
Formal grammars can used for describing complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and model the morphology of a variety of organisms. We believe that parallel grammars also can be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions which makes the promoter recognition a complex problem. We replace the problem of promoter recognition by induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from the drosophila and vertebrate promoter datasets using a genetic programming technique and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L- grammar rules are analyzed and compared with natural promoter sequences.
Resumo:
This paper surveys research in the field of data mining, which is related to discovering the dependencies between attributes in databases. We consider a number of approaches to finding the distribution intervals of association rules, to discovering branching dependencies between a given set of attributes and a given attribute in a database relation, to finding fractional dependencies between a given set of attributes and a given attribute in a database relation, and to collaborative filtering.
Resumo:
A rough set approach for attribute reduction is an important research subject in data mining and machine learning. However, most attribute reduction methods are performed on a complete decision system table. In this paper, we propose methods for attribute reduction in static incomplete decision systems and dynamic incomplete decision systems with dynamically-increasing and decreasing conditional attributes. Our methods use generalized discernibility matrix and function in tolerance-based rough sets.
Resumo:
An Automatic Vehicle Location (AVL) system is a computer-based vehicle tracking system that is capable of determining a vehicle's location in real time. As a major technology of the Advanced Public Transportation System (APTS), AVL systems have been widely deployed by transit agencies for purposes such as real-time operation monitoring, computer-aided dispatching, and arrival time prediction. AVL systems make a large amount of transit performance data available that are valuable for transit performance management and planning purposes. However, the difficulties of extracting useful information from the huge spatial-temporal database have hindered off-line applications of the AVL data. ^ In this study, a data mining process, including data integration, cluster analysis, and multiple regression, is proposed. The AVL-generated data are first integrated into a Geographic Information System (GIS) platform. The model-based cluster method is employed to investigate the spatial and temporal patterns of transit travel speeds, which may be easily translated into travel time. The transit speed variations along the route segments are identified. Transit service periods such as morning peak, mid-day, afternoon peak, and evening periods are determined based on analyses of transit travel speed variations for different times of day. The seasonal patterns of transit performance are investigated by using the analysis of variance (ANOVA). Travel speed models based on the clustered time-of-day intervals are developed using important factors identified as having significant effects on speed for different time-of-day periods. ^ It has been found that transit performance varied from different seasons and different time-of-day periods. The geographic location of a transit route segment also plays a role in the variation of the transit performance. The results of this research indicate that advanced data mining techniques have good potential in providing automated techniques of assisting transit agencies in service planning, scheduling, and operations control. ^
Resumo:
Ensemble Stream Modeling and Data-cleaning are sensor information processing systems have different training and testing methods by which their goals are cross-validated. This research examines a mechanism, which seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events to eliminate the noises that are uncorrelated, and choose the most likely model without over fitting thus obtaining higher model confidence. Higher quality streams can be realized by combining many short streams into an ensemble which has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction such as a bush or natural forest-fire event we make an assumption of the burnt area (BA*), sensed ground truth as our target variable obtained from logs. Even though this is an obvious model choice the results are disappointing. The reasons for this are two: One, the histogram of fire activity is highly skewed. Two, the measured sensor parameters are highly correlated. Since using non descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize training error. We use F-measure score, which combines precision and accuracy to determine the false alarm rate of fire events. The multi-target data-cleaning trees use information purity of the target leaf-nodes to learn higher order features. A sensitive variance measure such as ƒ-test is performed during each node's split to select the best attribute. Ensemble stream model approach proved to improve when using complicated features with a simpler tree classifier. The ensemble framework for data-cleaning and the enhancements to quantify quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of sensor led to the formation of streams for sensor-enabled applications. Which further motivates the novelty of stream quality labeling and its importance in solving vast amounts of real-time mobile streams generated today.
Resumo:
Many systems and applications are continuously producing events. These events are used to record the status of the system and trace the behaviors of the systems. By examining these events, system administrators can check the potential problems of these systems. If the temporal dynamics of the systems are further investigated, the underlying patterns can be discovered. The uncovered knowledge can be leveraged to predict the future system behaviors or to mitigate the potential risks of the systems. Moreover, the system administrators can utilize the temporal patterns to set up event management rules to make the system more intelligent. With the popularity of data mining techniques in recent years, these events grad- ually become more and more useful. Despite the recent advances of the data mining techniques, the application to system event mining is still in a rudimentary stage. Most of works are still focusing on episodes mining or frequent pattern discovering. These methods are unable to provide a brief yet comprehensible summary to reveal the valuable information from the high level perspective. Moreover, these methods provide little actionable knowledge to help the system administrators to better man- age the systems. To better make use of the recorded events, more practical techniques are required. From the perspective of data mining, three correlated directions are considered to be helpful for system management: (1) Provide concise yet comprehensive summaries about the running status of the systems; (2) Make the systems more intelligence and autonomous; (3) Effectively detect the abnormal behaviors of the systems. Due to the richness of the event logs, all these directions can be solved in the data-driven manner. And in this way, the robustness of the systems can be enhanced and the goal of autonomous management can be approached. This dissertation mainly focuses on the foregoing directions that leverage tem- poral mining techniques to facilitate system management. More specifically, three concrete topics will be discussed, including event, resource demand prediction, and streaming anomaly detection. Besides the theoretic contributions, the experimental evaluation will also be presented to demonstrate the effectiveness and efficacy of the corresponding solutions.
Resumo:
This article presents the response of the Centre for Copyright and New Business Models in the Creative Economy (CREATe) to the consultation on reform of the EU copyright regime. Reviews the format of the consultation, notes the common problems in reporting data in such a format, and reproduces the consultation questions to which CREATe responded, together with a summary of its conclusions on topics including: (1) terms of copyright protection; (2) libraries and archives; (3) persons with disabilities; (4) remuneration of authors; (5) user-generated content; (6) respect for rights; and (7) data mining.
Resumo:
Internet users consume online targeted advertising based on information collected about them and voluntarily share personal information in social networks. Sensor information and data from smart-phones is collected and used by applications, sometimes in unclear ways. As it happens today with smartphones, in the near future sensors will be shipped in all types of connected devices, enabling ubiquitous information gathering from the physical environment, enabling the vision of Ambient Intelligence. The value of gathered data, if not obvious, can be harnessed through data mining techniques and put to use by enabling personalized and tailored services as well as business intelligence practices, fueling the digital economy. However, the ever-expanding information gathering and use undermines the privacy conceptions of the past. Natural social practices of managing privacy in daily relations are overridden by socially-awkward communication tools, service providers struggle with security issues resulting in harmful data leaks, governments use mass surveillance techniques, the incentives of the digital economy threaten consumer privacy, and the advancement of consumergrade data-gathering technology enables new inter-personal abuses. A wide range of fields attempts to address technology-related privacy problems, however they vary immensely in terms of assumptions, scope and approach. Privacy of future use cases is typically handled vertically, instead of building upon previous work that can be re-contextualized, while current privacy problems are typically addressed per type in a more focused way. Because significant effort was required to make sense of the relations and structure of privacy-related work, this thesis attempts to transmit a structured view of it. It is multi-disciplinary - from cryptography to economics, including distributed systems and information theory - and addresses privacy issues of different natures. As existing work is framed and discussed, the contributions to the state-of-theart done in the scope of this thesis are presented. The contributions add to five distinct areas: 1) identity in distributed systems; 2) future context-aware services; 3) event-based context management; 4) low-latency information flow control; 5) high-dimensional dataset anonymity. Finally, having laid out such landscape of the privacy-preserving work, the current and future privacy challenges are discussed, considering not only technical but also socio-economic perspectives.