19 resultados para text mining clusterizzazione clustering auto-organizzazione conoscenza MoK

em Digital Commons at Florida International University


Relevância:

100.00% 100.00%

Publicador:

Resumo:

In the last decade, large numbers of social media services have emerged and been widely used in people's daily life as important information sharing and acquisition tools. With a substantial amount of user-contributed text data on social media, it becomes a necessity to develop methods and tools for text analysis for this emerging data, in order to better utilize it to deliver meaningful information to users. Previous work on text analytics in last several decades is mainly focused on traditional types of text like emails, news and academic literatures, and several critical issues to text data on social media have not been well explored: 1) how to detect sentiment from text on social media; 2) how to make use of social media's real-time nature; 3) how to address information overload for flexible information needs. In this dissertation, we focus on these three problems. First, to detect sentiment of text on social media, we propose a non-negative matrix tri-factorization (tri-NMF) based dual active supervision method to minimize human labeling efforts for the new type of data. Second, to make use of social media's real-time nature, we propose approaches to detect events from text streams on social media. Third, to address information overload for flexible information needs, we propose two summarization framework, dominating set based summarization framework and learning-to-rank based summarization framework. The dominating set based summarization framework can be applied for different types of summarization problems, while the learning-to-rank based summarization framework helps utilize the existing training data to guild the new summarization tasks. In addition, we integrate these techneques in an application study of event summarization for sports games as an example of how to better utilize social media data.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

With the explosive growth of the volume and complexity of document data (e.g., news, blogs, web pages), it has become a necessity to semantically understand documents and deliver meaningful information to users. Areas dealing with these problems are crossing data mining, information retrieval, and machine learning. For example, document clustering and summarization are two fundamental techniques for understanding document data and have attracted much attention in recent years. Given a collection of documents, document clustering aims to partition them into different groups to provide efficient document browsing and navigation mechanisms. One unrevealed area in document clustering is that how to generate meaningful interpretation for the each document cluster resulted from the clustering process. Document summarization is another effective technique for document understanding, which generates a summary by selecting sentences that deliver the major or topic-relevant information in the original documents. How to improve the automatic summarization performance and apply it to newly emerging problems are two valuable research directions. To assist people to capture the semantics of documents effectively and efficiently, the dissertation focuses on developing effective data mining and machine learning algorithms and systems for (1) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretation, (2) improving document summarization performance and building document understanding systems to solve real-world applications, and (3) summarizing the differences and evolution of multiple document sources.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Modern IT infrastructures are constructed by large scale computing systems and administered by IT service providers. Manually maintaining such large computing systems is costly and inefficient. Service providers often seek automatic or semi-automatic methodologies of detecting and resolving system issues to improve their service quality and efficiency. This dissertation investigates several data-driven approaches for assisting service providers in achieving this goal. The detailed problems studied by these approaches can be categorized into the three aspects in the service workflow: 1) preprocessing raw textual system logs to structural events; 2) refining monitoring configurations for eliminating false positives and false negatives; 3) improving the efficiency of system diagnosis on detected alerts. Solving these problems usually requires a huge amount of domain knowledge about the particular computing systems. The approaches investigated by this dissertation are developed based on event mining algorithms, which are able to automatically derive part of that knowledge from the historical system logs, events and tickets. ^ In particular, two textual clustering algorithms are developed for converting raw textual logs into system events. For refining the monitoring configuration, a rule based alert prediction algorithm is proposed for eliminating false alerts (false positives) without losing any real alert and a textual classification method is applied to identify the missing alerts (false negatives) from manual incident tickets. For system diagnosis, this dissertation presents an efficient algorithm for discovering the temporal dependencies between system events with corresponding time lags, which can help the administrators to determine the redundancies of deployed monitoring situations and dependencies of system components. To improve the efficiency of incident ticket resolving, several KNN-based algorithms that recommend relevant historical tickets with resolutions for incoming tickets are investigated. Finally, this dissertation offers a novel algorithm for searching similar textual event segments over large system logs that assists administrators to locate similar system behaviors in the logs. Extensive empirical evaluation on system logs, events and tickets from real IT infrastructures demonstrates the effectiveness and efficiency of the proposed approaches.^

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

An Automatic Vehicle Location (AVL) system is a computer-based vehicle tracking system that is capable of determining a vehicle's location in real time. As a major technology of the Advanced Public Transportation System (APTS), AVL systems have been widely deployed by transit agencies for purposes such as real-time operation monitoring, computer-aided dispatching, and arrival time prediction. AVL systems make a large amount of transit performance data available that are valuable for transit performance management and planning purposes. However, the difficulties of extracting useful information from the huge spatial-temporal database have hindered off-line applications of the AVL data. ^ In this study, a data mining process, including data integration, cluster analysis, and multiple regression, is proposed. The AVL-generated data are first integrated into a Geographic Information System (GIS) platform. The model-based cluster method is employed to investigate the spatial and temporal patterns of transit travel speeds, which may be easily translated into travel time. The transit speed variations along the route segments are identified. Transit service periods such as morning peak, mid-day, afternoon peak, and evening periods are determined based on analyses of transit travel speed variations for different times of day. The seasonal patterns of transit performance are investigated by using the analysis of variance (ANOVA). Travel speed models based on the clustered time-of-day intervals are developed using important factors identified as having significant effects on speed for different time-of-day periods. ^ It has been found that transit performance varied from different seasons and different time-of-day periods. The geographic location of a transit route segment also plays a role in the variation of the transit performance. The results of this research indicate that advanced data mining techniques have good potential in providing automated techniques of assisting transit agencies in service planning, scheduling, and operations control. ^

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data are used appropriately and effectively, knowledge discovery can be better achieved than what is possible from only a single source. ^ Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources; representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. ^ The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools, SVM and KNN, were used to successfully distinguish between several soil samples. ^ The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to have a better performance than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. ^ The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. ^ The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database called PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface to allow biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. ^ The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.^

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The nation's freeway systems are becoming increasingly congested. A major contribution to traffic congestion on freeways is due to traffic incidents. Traffic incidents are non-recurring events such as accidents or stranded vehicles that cause a temporary roadway capacity reduction, and they can account for as much as 60 percent of all traffic congestion on freeways. One major freeway incident management strategy involves diverting traffic to avoid incident locations by relaying timely information through Intelligent Transportation Systems (ITS) devices such as dynamic message signs or real-time traveler information systems. The decision to divert traffic depends foremost on the expected duration of an incident, which is difficult to predict. In addition, the duration of an incident is affected by many contributing factors. Determining and understanding these factors can help the process of identifying and developing better strategies to reduce incident durations and alleviate traffic congestion. A number of research studies have attempted to develop models to predict incident durations, yet with limited success. ^ This dissertation research attempts to improve on this previous effort by applying data mining techniques to a comprehensive incident database maintained by the District 4 ITS Office of the Florida Department of Transportation (FDOT). Two categories of incident duration prediction models were developed: "offline" models designed for use in the performance evaluation of incident management programs, and "online" models for real-time prediction of incident duration to aid in the decision making of traffic diversion in the event of an ongoing incident. Multiple data mining analysis techniques were applied and evaluated in the research. The multiple linear regression analysis and decision tree based method were applied to develop the offline models, and the rule-based method and a tree algorithm called M5P were used to develop the online models. ^ The results show that the models in general can achieve high prediction accuracy within acceptable time intervals of the actual durations. The research also identifies some new contributing factors that have not been examined in past studies. As part of the research effort, software code was developed to implement the models in the existing software system of District 4 FDOT for actual applications. ^

Relevância:

30.00% 30.00%

Publicador:

Resumo:

With the rapid growth of the Internet, computer attacks are increasing at a fast pace and can easily cause millions of dollar in damage to an organization. Detecting these attacks is an important issue of computer security. There are many types of attacks and they fall into four main categories, Denial of Service (DoS) attacks, Probe, User to Root (U2R) attacks, and Remote to Local (R2L) attacks. Within these categories, DoS and Probe attacks continuously show up with greater frequency in a short period of time when they attack systems. They are different from the normal traffic data and can be easily separated from normal activities. On the contrary, U2R and R2L attacks are embedded in the data portions of the packets and normally involve only a single connection. It becomes difficult to achieve satisfactory detection accuracy for detecting these two attacks. Therefore, we focus on studying the ambiguity problem between normal activities and U2R/R2L attacks. The goal is to build a detection system that can accurately and quickly detect these two attacks. In this dissertation, we design a two-phase intrusion detection approach. In the first phase, a correlation-based feature selection algorithm is proposed to advance the speed of detection. Features with poor prediction ability for the signatures of attacks and features inter-correlated with one or more other features are considered redundant. Such features are removed and only indispensable information about the original feature space remains. In the second phase, we develop an ensemble intrusion detection system to achieve accurate detection performance. The proposed method includes multiple feature selecting intrusion detectors and a data mining intrusion detector. The former ones consist of a set of detectors, and each of them uses a fuzzy clustering technique and belief theory to solve the ambiguity problem. The latter one applies data mining technique to automatically extract computer users’ normal behavior from training network traffic data. The final decision is a combination of the outputs of feature selecting and data mining detectors. The experimental results indicate that our ensemble approach not only significantly reduces the detection time but also effectively detect U2R and R2L attacks that contain degrees of ambiguous information.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The rapid growth of the Internet and the advancements of the Web technologies have made it possible for users to have access to large amounts of on-line music data, including music acoustic signals, lyrics, style/mood labels, and user-assigned tags. The progress has made music listening more fun, but has raised an issue of how to organize this data, and more generally, how computer programs can assist users in their music experience. An important subject in computer-aided music listening is music retrieval, i.e., the issue of efficiently helping users in locating the music they are looking for. Traditionally, songs were organized in a hierarchical structure such as genre->artist->album->track, to facilitate the users’ navigation. However, the intentions of the users are often hard to be captured in such a simply organized structure. The users may want to listen to music of a particular mood, style or topic; and/or any songs similar to some given music samples. This motivated us to work on user-centric music retrieval system to improve users’ satisfaction with the system. The traditional music information retrieval research was mainly concerned with classification, clustering, identification, and similarity search of acoustic data of music by way of feature extraction algorithms and machine learning techniques. More recently the music information retrieval research has focused on utilizing other types of data, such as lyrics, user-access patterns, and user-defined tags, and on targeting non-genre categories for classification, such as mood labels and styles. This dissertation focused on investigating and developing effective data mining techniques for (1) organizing and annotating music data with styles, moods and user-assigned tags; (2) performing effective analysis of music data with features from diverse information sources; and (3) recommending music songs to the users utilizing both content features and user access patterns.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

With increasing competition and more demanding members, clubs need a tool to help them belter attract and retain members and predict their behavior. Data mining is such a tool. This article presents an overview of how data warehousing, data marting, and data mining can provide the foundation on which clubs can build strategies to outsmart competitors, build Ioyalty identify new members, and lower costs.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Electronic database handling of buisness information has gradually gained its popularity in the hospitality industry. This article provides an overview on the fundamental concepts of a hotel database and investigates the feasibility of incorporating computer-assisted data mining techniques into hospitality database applications. The author also exposes some potential myths associated with data mining in hospitaltiy database applications.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Driven by the political and economic forces of cross-strait, Taiwan has become one of the major source markets for Hong Kong tourism industry since 1987. The major purposes of this study were to investigate the following factors (1) The influential factors of travel motivation, (2) The clusters of travel motivations, (3) The marketing segmentation of clusters of Taiwanese tourists to visit Hong Kong. Through ten travel agents, self-report surveys were distributed to collect data from 366 Taiwanese travelers. Hence, four push factors and six pull factors were identified as travel motivations through the factor analysis. Combined with the cluster analysis; five new groups were founded. Finally, five clusters which process unique profiles (location difference, visiting frequency, travel satisfaction, and destination loyalty) were addressed. The suggestions of developing effective market strategies to attract Taiwanese tourists to Hong Kong were also provided.