899 resultados para Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining


Relevância:

40.00% 40.00%

Publicador:

Resumo:

With the rapid growth of the Internet, computer attacks are increasing at a fast pace and can easily cause millions of dollar in damage to an organization. Detecting these attacks is an important issue of computer security. There are many types of attacks and they fall into four main categories, Denial of Service (DoS) attacks, Probe, User to Root (U2R) attacks, and Remote to Local (R2L) attacks. Within these categories, DoS and Probe attacks continuously show up with greater frequency in a short period of time when they attack systems. They are different from the normal traffic data and can be easily separated from normal activities. On the contrary, U2R and R2L attacks are embedded in the data portions of the packets and normally involve only a single connection. It becomes difficult to achieve satisfactory detection accuracy for detecting these two attacks. Therefore, we focus on studying the ambiguity problem between normal activities and U2R/R2L attacks. The goal is to build a detection system that can accurately and quickly detect these two attacks. In this dissertation, we design a two-phase intrusion detection approach. In the first phase, a correlation-based feature selection algorithm is proposed to advance the speed of detection. Features with poor prediction ability for the signatures of attacks and features inter-correlated with one or more other features are considered redundant. Such features are removed and only indispensable information about the original feature space remains. In the second phase, we develop an ensemble intrusion detection system to achieve accurate detection performance. The proposed method includes multiple feature selecting intrusion detectors and a data mining intrusion detector. The former ones consist of a set of detectors, and each of them uses a fuzzy clustering technique and belief theory to solve the ambiguity problem. The latter one applies data mining technique to automatically extract computer users’ normal behavior from training network traffic data. The final decision is a combination of the outputs of feature selecting and data mining detectors. The experimental results indicate that our ensemble approach not only significantly reduces the detection time but also effectively detect U2R and R2L attacks that contain degrees of ambiguous information.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

With advances in science and technology, computing and business intelligence (BI) systems are steadily becoming more complex with an increasing variety of heterogeneous software and hardware components. They are thus becoming progressively more difficult to monitor, manage and maintain. Traditional approaches to system management have largely relied on domain experts through a knowledge acquisition process that translates domain knowledge into operating rules and policies. It is widely acknowledged as a cumbersome, labor intensive, and error prone process, besides being difficult to keep up with the rapidly changing environments. In addition, many traditional business systems deliver primarily pre-defined historic metrics for a long-term strategic or mid-term tactical analysis, and lack the necessary flexibility to support evolving metrics or data collection for real-time operational analysis. There is thus a pressing need for automatic and efficient approaches to monitor and manage complex computing and BI systems. To realize the goal of autonomic management and enable self-management capabilities, we propose to mine system historical log data generated by computing and BI systems, and automatically extract actionable patterns from this data. This dissertation focuses on the development of different data mining techniques to extract actionable patterns from various types of log data in computing and BI systems. Four key problems—Log data categorization and event summarization, Leading indicator identification , Pattern prioritization by exploring the link structures , and Tensor model for three-way log data are studied. Case studies and comprehensive experiments on real application scenarios and datasets are conducted to show the effectiveness of our proposed approaches.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Graph-structured databases are widely prevalent, and the problem of effective search and retrieval from such graphs has been receiving much attention recently. For example, the Web can be naturally viewed as a graph. Likewise, a relational database can be viewed as a graph where tuples are modeled as vertices connected via foreign-key relationships. Keyword search querying has emerged as one of the most effective paradigms for information discovery, especially over HTML documents in the World Wide Web. One of the key advantages of keyword search querying is its simplicity—users do not have to learn a complex query language, and can issue queries without any prior knowledge about the structure of the underlying data. The purpose of this dissertation was to develop techniques for user-friendly, high quality and efficient searching of graph structured databases. Several ranked search methods on data graphs have been studied in the recent years. Given a top-k keyword search query on a graph and some ranking criteria, a keyword proximity search finds the top-k answers where each answer is a substructure of the graph containing all query keywords, which illustrates the relationship between the keyword present in the graph. We applied keyword proximity search on the web and the page graph of web documents to find top-k answers that satisfy user’s information need and increase user satisfaction. Another effective ranking mechanism applied on data graphs is the authority flow based ranking mechanism. Given a top- k keyword search query on a graph, an authority-flow based search finds the top-k answers where each answer is a node in the graph ranked according to its relevance and importance to the query. We developed techniques that improved the authority flow based search on data graphs by creating a framework to explain and reformulate them taking in to consideration user preferences and feedback. We also applied the proposed graph search techniques for Information Discovery over biological databases. Our algorithms were experimentally evaluated for performance and quality. The quality of our method was compared to current approaches by using user surveys.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

With the advent of peer to peer networks, and more importantly sensor networks, the desire to extract useful information from continuous and unbounded streams of data has become more prominent. For example, in tele-health applications, sensor based data streaming systems are used to continuously and accurately monitor Alzheimer's patients and their surrounding environment. Typically, the requirements of such applications necessitate the cleaning and filtering of continuous, corrupted and incomplete data streams gathered wirelessly in dynamically varying conditions. Yet, existing data stream cleaning and filtering schemes are incapable of capturing the dynamics of the environment while simultaneously suppressing the losses and corruption introduced by uncertain environmental, hardware, and network conditions. Consequently, existing data cleaning and filtering paradigms are being challenged. This dissertation develops novel schemes for cleaning data streams received from a wireless sensor network operating under non-linear and dynamically varying conditions. The study establishes a paradigm for validating spatio-temporal associations among data sources to enhance data cleaning. To simplify the complexity of the validation process, the developed solution maps the requirements of the application on a geometrical space and identifies the potential sensor nodes of interest. Additionally, this dissertation models a wireless sensor network data reduction system by ascertaining that segregating data adaptation and prediction processes will augment the data reduction rates. The schemes presented in this study are evaluated using simulation and information theory concepts. The results demonstrate that dynamic conditions of the environment are better managed when validation is used for data cleaning. They also show that when a fast convergent adaptation process is deployed, data reduction rates are significantly improved. Targeted applications of the developed methodology include machine health monitoring, tele-health, environment and habitat monitoring, intermodal transportation and homeland security.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

With the explosive growth of the volume and complexity of document data (e.g., news, blogs, web pages), it has become a necessity to semantically understand documents and deliver meaningful information to users. Areas dealing with these problems are crossing data mining, information retrieval, and machine learning. For example, document clustering and summarization are two fundamental techniques for understanding document data and have attracted much attention in recent years. Given a collection of documents, document clustering aims to partition them into different groups to provide efficient document browsing and navigation mechanisms. One unrevealed area in document clustering is that how to generate meaningful interpretation for the each document cluster resulted from the clustering process. Document summarization is another effective technique for document understanding, which generates a summary by selecting sentences that deliver the major or topic-relevant information in the original documents. How to improve the automatic summarization performance and apply it to newly emerging problems are two valuable research directions. To assist people to capture the semantics of documents effectively and efficiently, the dissertation focuses on developing effective data mining and machine learning algorithms and systems for (1) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretation, (2) improving document summarization performance and building document understanding systems to solve real-world applications, and (3) summarizing the differences and evolution of multiple document sources.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The Comprehensive Everglades Restoration Plan (CERP) attempts to restore hydrology in the Northern and Southern Estuaries of Florida. Reefs of the Eastern oyster Crassostrea virginica are a dominant feature of the estuaries along the Southwest Florida coast. Oysters are benthic, sessile, filter-feeding organisms that provide ecosystem services by filtering the water column and providing food, shelter and habitat for associated organisms. As such, the species is an excellent sentinel organism for examining the impacts of restoration on estuarine ecosystems. The implementation of CERP attempts to improve: the hydrology and spatial and structural characteristics of oyster reefs, the recruitment and survivorship of C. virginica, and the reef-associated communities of organisms. This project links biological responses and environmental conditions relative to hydrological changes as a means of assessing positive or negative trends in oyster responses and population trends. Using oyster responses, we have developed a communication tool (i.e., Stoplight Report Card) based on CERP performance measures that can distinguish between responses to restoration and natural patterns. The Stoplight Report Card system is a communication tool that uses Monitoring and Assessment Program (MAP) performance measures to grade an estuary's response to changes brought about by anthropogenic input or restoration activities. The Stoplight Report Card consists of both a suitability index score for each organism metric as well as a trend score (− decreasing trend, +/− no change in trend, and + increasing trend). Based on these two measures, a component score (e.g., living density) is calculated by averaging the suitability index score and the trend score. The final index score is obtained by taking the geometric score of each component, which is then translated into a stoplight color for success (green), caution (yellow), or failure (red). Based on the data available for oyster populations and the responses of oysters in the Caloosahatchee Estuary, the system is currently at stage “caution.” This communication tool instantly conveys the status of the indicator and the suitability, while trend curves provide information on progress towards reaching a target. Furthermore, the tool has the advantage of being able to be applied regionally, by species, and collectively, in concert with other species, system-wide.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Electronic database handling of buisness information has gradually gained its popularity in the hospitality industry. This article provides an overview on the fundamental concepts of a hotel database and investigates the feasibility of incorporating computer-assisted data mining techniques into hospitality database applications. The author also exposes some potential myths associated with data mining in hospitaltiy database applications.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Conceptual database design is an unusually difficult and error-prone task for novice designers. This study examined how two training approaches---rule-based and pattern-based---might improve performance on database design tasks. A rule-based approach prescribes a sequence of rules for modeling conceptual constructs, and the action to be taken at various stages while developing a conceptual model. A pattern-based approach presents data modeling structures that occur frequently in practice, and prescribes guidelines on how to recognize and use these structures. This study describes the conceptual framework, experimental design, and results of a laboratory experiment that employed novice designers to compare the effectiveness of the two training approaches (between-subjects) at three levels of task complexity (within subjects). Results indicate an interaction effect between treatment and task complexity. The rule-based approach was significantly better in the low-complexity and the high-complexity cases; there was no statistical difference in the medium-complexity case. Designer performance fell significantly as complexity increased. Overall, though the rule-based approach was not significantly superior to the pattern-based approach in all instances, it out-performed the pattern-based approach at two out of three complexity levels. The primary contributions of the study are (1) the operationalization of the complexity construct to a degree not addressed in previous studies; (2) the development of a pattern-based instructional approach to database design; and (3) the finding that the effectiveness of a particular training approach may depend on the complexity of the task.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

In outsourcing relationships with China, the Electronic Manufacturing (EM) and Information Technology Services (ITS) industry in Taiwan may possess such advantages as the continuing growth of its production value, complete manufacturing supply chain, low production cost and a large-scale Chinese market, and language and culture similarity compared to outsourcing to other countries. Nevertheless, the Council for Economic Planning and Development of Executive Yuan (CEPD) found that Taiwan's IT services outsourcing to China is subject to certain constraints and might not be as successful as the EM outsourcing (Aggarwal, 2003; CEPD, 2004a; CIER, 2003; Einhorn and Kriplani, 2003; Kumar and Zhu, 2006; Li and Gao, 2003; MIC, 2006). Some studies examined this issue, but failed to (1) provide statistical evidence about lower prevalence rates of IT services outsourcing, and (2) clearly explain the lower prevalence rates of IT services outsourcing by identifying similarities and differences between both types of outsourcing contexts. This research seeks to fill that gap and possibly provide potential strategic guidelines to ITS firms in Taiwan. This study adopts Transaction Cost Economics (TCE) as the theoretical basis. The basic premise is that different types of outsourcing activities may incur differing transaction costs and realize varying degrees of outsourcing success due to differential attributes of the transactions in the outsourcing process. Using primary data gathered from questionnaire surveys of ninety two firms, the results from exploratory analysis and binary logistic regression indicated that (1) when outsourcing to China, Taiwanese firms' ITS outsourcing tends to have higher level of asset specificity, uncertainty and technical skills relative to EM outsourcing, and these features indirectly reduce firms' outsourcing prevalence rates via their direct positive impacts on transaction costs; (2) Taiwanese firms' ITS outsourcing tends to have lower level of transaction structurability relative to EM outsourcing, and this feature indirectly increases firms' outsourcing prevalence rates via its direct negative impacts on transaction costs; (3) frequency does influence firms' transaction costs in ITS outsourcing positively, but does not bring impacts into their outsourcing prevalence rates, (4) relatedness does influence firms' transaction costs positively and prevalence rates negatively in ITS outsourcing, but its impacts on the prevalence rates are not caused by the mediation effects of transaction costs, and (5) firm size of outsourcing provider does not affect firms' transaction costs, but does affect their outsourcing prevalence rates in ITS outsourcing directly and positively. Using primary data gathered from face-to-face interviews of executives from seven firms, the results from inductive analysis indicated that (1) IT services outsourcing has lower prevalence rates than EM outsourcing, and (2) this result is mainly attributed to Taiwan's core competence in manufacturing and management and higher overall transaction costs of IT services outsourcing. Specifically, there is not much difference between both types of outsourcing context in the transaction characteristics of reputation and most aspects of overall comparison. Although there are some differences in the feature of firm size of the outsourcing provider, the difference doesn't cause apparent impacts on firms' overall transaction costs. The medium or above medium difference in the transaction characteristics of asset specificity, uncertainty, frequency, technical skills, transaction structurability, and relatedness has caused higher overall transaction costs for IT services outsourcing. This higher cost might cause lower prevalence rates for ITS outsourcing relative to EM outsourcing. Overall, the interview results are consistent with the statistical analyses and provide support to my expectation that in outsourcing to China, Taiwan's electronic manufacturing firms do have lower prevalence rates of IT services outsourcing relative to EM outsourcing due to higher transaction costs caused by certain attributes. To solve this problem, firms' management should aim at identifying alternative strategies and strive to reduce their overall transaction costs of IT services outsourcing by initiating appropriate strategies which fit their environment and needs.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Ensemble Stream Modeling and Data-cleaning are sensor information processing systems have different training and testing methods by which their goals are cross-validated. This research examines a mechanism, which seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events to eliminate the noises that are uncorrelated, and choose the most likely model without over fitting thus obtaining higher model confidence. Higher quality streams can be realized by combining many short streams into an ensemble which has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction such as a bush or natural forest-fire event we make an assumption of the burnt area (BA*), sensed ground truth as our target variable obtained from logs. Even though this is an obvious model choice the results are disappointing. The reasons for this are two: One, the histogram of fire activity is highly skewed. Two, the measured sensor parameters are highly correlated. Since using non descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize training error. We use F-measure score, which combines precision and accuracy to determine the false alarm rate of fire events. The multi-target data-cleaning trees use information purity of the target leaf-nodes to learn higher order features. A sensitive variance measure such as ƒ-test is performed during each node's split to select the best attribute. Ensemble stream model approach proved to improve when using complicated features with a simpler tree classifier. The ensemble framework for data-cleaning and the enhancements to quantify quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of sensor led to the formation of streams for sensor-enabled applications. Which further motivates the novelty of stream quality labeling and its importance in solving vast amounts of real-time mobile streams generated today.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The purpose of this study was to ascertain the perception of the level of satisfaction that international students have regarding the services and the relevance of the curriculum offered at Miami-Dade Community College. Trends and issues at universities and community colleges in providing services and an international curriculum for foreign students are outlined. Focus is on characteristics, personal and career needs as well as needs of national development for the students' countries. A sample of students from four developing nations was selected to qualitatively and quantitatively determine their level of satisfaction. The nations are the Bahamas, Colombia, Haiti and Pakistan. Students responses were recorded through group interviews, four personal interviews, an open ended questionnaire and a Likert scaled survey questionnaire. Matrix charts, mean calculations and one way analysis of variance were used to analyze data collected. Country of origin and major program of study were the variables used for statistical analysis. Information gathered through qualitative research presented a variety of perspectives and responses, both positive and negative. Students supplied specific examples of experiences and insights to help explain their various perceptions. Statistically, there were no significant differences between the variables of country of origin and major program of study regarding program services and relevance of the curriculum. Implications and recommendations for community college programs were outlined.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

With the rapid growth of the Internet, computer attacks are increasing at a fast pace and can easily cause millions of dollar in damage to an organization. Detecting these attacks is an important issue of computer security. There are many types of attacks and they fall into four main categories, Denial of Service (DoS) attacks, Probe, User to Root (U2R) attacks, and Remote to Local (R2L) attacks. Within these categories, DoS and Probe attacks continuously show up with greater frequency in a short period of time when they attack systems. They are different from the normal traffic data and can be easily separated from normal activities. On the contrary, U2R and R2L attacks are embedded in the data portions of the packets and normally involve only a single connection. It becomes difficult to achieve satisfactory detection accuracy for detecting these two attacks. Therefore, we focus on studying the ambiguity problem between normal activities and U2R/R2L attacks. The goal is to build a detection system that can accurately and quickly detect these two attacks. In this dissertation, we design a two-phase intrusion detection approach. In the first phase, a correlation-based feature selection algorithm is proposed to advance the speed of detection. Features with poor prediction ability for the signatures of attacks and features inter-correlated with one or more other features are considered redundant. Such features are removed and only indispensable information about the original feature space remains. In the second phase, we develop an ensemble intrusion detection system to achieve accurate detection performance. The proposed method includes multiple feature selecting intrusion detectors and a data mining intrusion detector. The former ones consist of a set of detectors, and each of them uses a fuzzy clustering technique and belief theory to solve the ambiguity problem. The latter one applies data mining technique to automatically extract computer users’ normal behavior from training network traffic data. The final decision is a combination of the outputs of feature selecting and data mining detectors. The experimental results indicate that our ensemble approach not only significantly reduces the detection time but also effectively detect U2R and R2L attacks that contain degrees of ambiguous information.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

With the advent of peer to peer networks, and more importantly sensor networks, the desire to extract useful information from continuous and unbounded streams of data has become more prominent. For example, in tele-health applications, sensor based data streaming systems are used to continuously and accurately monitor Alzheimer's patients and their surrounding environment. Typically, the requirements of such applications necessitate the cleaning and filtering of continuous, corrupted and incomplete data streams gathered wirelessly in dynamically varying conditions. Yet, existing data stream cleaning and filtering schemes are incapable of capturing the dynamics of the environment while simultaneously suppressing the losses and corruption introduced by uncertain environmental, hardware, and network conditions. Consequently, existing data cleaning and filtering paradigms are being challenged. This dissertation develops novel schemes for cleaning data streams received from a wireless sensor network operating under non-linear and dynamically varying conditions. The study establishes a paradigm for validating spatio-temporal associations among data sources to enhance data cleaning. To simplify the complexity of the validation process, the developed solution maps the requirements of the application on a geometrical space and identifies the potential sensor nodes of interest. Additionally, this dissertation models a wireless sensor network data reduction system by ascertaining that segregating data adaptation and prediction processes will augment the data reduction rates. The schemes presented in this study are evaluated using simulation and information theory concepts. The results demonstrate that dynamic conditions of the environment are better managed when validation is used for data cleaning. They also show that when a fast convergent adaptation process is deployed, data reduction rates are significantly improved. Targeted applications of the developed methodology include machine health monitoring, tele-health, environment and habitat monitoring, intermodal transportation and homeland security.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Today, databases have become an integral part of information systems. In the past two decades, we have seen different database systems being developed independently and used in different applications domains. Today's interconnected networks and advanced applications, such as data warehousing, data mining & knowledge discovery and intelligent data access to information on the Web, have created a need for integrated access to such heterogeneous, autonomous, distributed database systems. Heterogeneous/multidatabase research has focused on this issue resulting in many different approaches. However, a single, generally accepted methodology in academia or industry has not emerged providing ubiquitous intelligent data access from heterogeneous, autonomous, distributed information sources. This thesis describes a heterogeneous database system being developed at Highperformance Database Research Center (HPDRC). A major impediment to ubiquitous deployment of multidatabase technology is the difficulty in resolving semantic heterogeneity. That is, identifying related information sources for integration and querying purposes. Our approach considers the semantics of the meta-data constructs in resolving this issue. The major contributions of the thesis work include: (i.) providing a scalable, easy-to-implement architecture for developing a heterogeneous multidatabase system, utilizing Semantic Binary Object-oriented Data Model (Sem-ODM) and Semantic SQL query language to capture the semantics of the data sources being integrated and to provide an easy-to-use query facility; (ii.) a methodology for semantic heterogeneity resolution by investigating into the extents of the meta-data constructs of component schemas. This methodology is shown to be correct, complete and unambiguous; (iii.) a semi-automated technique for identifying semantic relations, which is the basis of semantic knowledge for integration and querying, using shared ontologies for context-mediation; (iv.) resolutions for schematic conflicts and a language for defining global views from a set of component Sem-ODM schemas; (v.) design of a knowledge base for storing and manipulating meta-data and knowledge acquired during the integration process. This knowledge base acts as the interface between integration and query processing modules; (vi.) techniques for Semantic SQL query processing and optimization based on semantic knowledge in a heterogeneous database environment; and (vii.) a framework for intelligent computing and communication on the Internet applying the concepts of our work.