16 resultados para Data mining and knowledge discovery

em Digital Commons at Florida International University


Relevância:

100.00% 100.00%

Publicador:

Resumo:

With the explosive growth of the volume and complexity of document data (e.g., news, blogs, web pages), it has become a necessity to semantically understand documents and deliver meaningful information to users. Areas dealing with these problems are crossing data mining, information retrieval, and machine learning. For example, document clustering and summarization are two fundamental techniques for understanding document data and have attracted much attention in recent years. Given a collection of documents, document clustering aims to partition them into different groups to provide efficient document browsing and navigation mechanisms. One unrevealed area in document clustering is that how to generate meaningful interpretation for the each document cluster resulted from the clustering process. Document summarization is another effective technique for document understanding, which generates a summary by selecting sentences that deliver the major or topic-relevant information in the original documents. How to improve the automatic summarization performance and apply it to newly emerging problems are two valuable research directions. To assist people to capture the semantics of documents effectively and efficiently, the dissertation focuses on developing effective data mining and machine learning algorithms and systems for (1) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretation, (2) improving document summarization performance and building document understanding systems to solve real-world applications, and (3) summarizing the differences and evolution of multiple document sources.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data are used appropriately and effectively, knowledge discovery can be better achieved than what is possible from only a single source. ^ Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources; representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. ^ The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools, SVM and KNN, were used to successfully distinguish between several soil samples. ^ The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to have a better performance than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. ^ The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. ^ The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database called PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface to allow biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. ^ The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

With the proliferation of multimedia data and ever-growing requests for multimedia applications, there is an increasing need for efficient and effective indexing, storage and retrieval of multimedia data, such as graphics, images, animation, video, audio and text. Due to the special characteristics of the multimedia data, the Multimedia Database management Systems (MMDBMSs) have emerged and attracted great research attention in recent years. Though much research effort has been devoted to this area, it is still far from maturity and there exist many open issues. In this dissertation, with the focus of addressing three of the essential challenges in developing the MMDBMS, namely, semantic gap, perception subjectivity and data organization, a systematic and integrated framework is proposed with video database and image database serving as the testbed. In particular, the framework addresses these challenges separately yet coherently from three main aspects of a MMDBMS: multimedia data representation, indexing and retrieval. In terms of multimedia data representation, the key to address the semantic gap issue is to intelligently and automatically model the mid-level representation and/or semi-semantic descriptors besides the extraction of the low-level media features. The data organization challenge is mainly addressed by the aspect of media indexing where various levels of indexing are required to support the diverse query requirements. In particular, the focus of this study is to facilitate the high-level video indexing by proposing a multimodal event mining framework associated with temporal knowledge discovery approaches. With respect to the perception subjectivity issue, advanced techniques are proposed to support users' interaction and to effectively model users' perception from the feedback at both the image-level and object-level.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The nation's freeway systems are becoming increasingly congested. A major contribution to traffic congestion on freeways is due to traffic incidents. Traffic incidents are non-recurring events such as accidents or stranded vehicles that cause a temporary roadway capacity reduction, and they can account for as much as 60 percent of all traffic congestion on freeways. One major freeway incident management strategy involves diverting traffic to avoid incident locations by relaying timely information through Intelligent Transportation Systems (ITS) devices such as dynamic message signs or real-time traveler information systems. The decision to divert traffic depends foremost on the expected duration of an incident, which is difficult to predict. In addition, the duration of an incident is affected by many contributing factors. Determining and understanding these factors can help the process of identifying and developing better strategies to reduce incident durations and alleviate traffic congestion. A number of research studies have attempted to develop models to predict incident durations, yet with limited success. ^ This dissertation research attempts to improve on this previous effort by applying data mining techniques to a comprehensive incident database maintained by the District 4 ITS Office of the Florida Department of Transportation (FDOT). Two categories of incident duration prediction models were developed: "offline" models designed for use in the performance evaluation of incident management programs, and "online" models for real-time prediction of incident duration to aid in the decision making of traffic diversion in the event of an ongoing incident. Multiple data mining analysis techniques were applied and evaluated in the research. The multiple linear regression analysis and decision tree based method were applied to develop the offline models, and the rule-based method and a tree algorithm called M5P were used to develop the online models. ^ The results show that the models in general can achieve high prediction accuracy within acceptable time intervals of the actual durations. The research also identifies some new contributing factors that have not been examined in past studies. As part of the research effort, software code was developed to implement the models in the existing software system of District 4 FDOT for actual applications. ^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

There is growing popularity in the use of composite indices and rankings for cross-organizational benchmarking. However, little attention has been paid to alternative methods and procedures for the computation of these indices and how the use of such methods may impact the resulting indices and rankings. This dissertation developed an approach for assessing composite indices and rankings based on the integration of a number of methods for aggregation, data transformation and attribute weighting involved in their computation. The integrated model developed is based on the simulation of composite indices using methods and procedures proposed in the area of multi-criteria decision making (MCDM) and knowledge discovery in databases (KDD). The approach developed in this dissertation was automated through an IT artifact that was designed, developed and evaluated based on the framework and guidelines of the design science paradigm of information systems research. This artifact dynamically generates multiple versions of indices and rankings by considering different methodological scenarios according to user specified parameters. The computerized implementation was done in Visual Basic for Excel 2007. Using different performance measures, the artifact produces a number of excel outputs for the comparison and assessment of the indices and rankings. In order to evaluate the efficacy of the artifact and its underlying approach, a full empirical analysis was conducted using the World Bank's Doing Business database for the year 2010, which includes ten sub-indices (each corresponding to different areas of the business environment and regulation) for 183 countries. The output results, which were obtained using 115 methodological scenarios for the assessment of this index and its ten sub-indices, indicated that the variability of the component indicators considered in each case influenced the sensitivity of the rankings to the methodological choices. Overall, the results of our multi-method assessment were consistent with the World Bank rankings except in cases where the indices involved cost indicators measured in per capita income which yielded more sensitive results. Low income level countries exhibited more sensitivity in their rankings and less agreement between the benchmark rankings and our multi-method based rankings than higher income country groups.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Electronic database handling of buisness information has gradually gained its popularity in the hospitality industry. This article provides an overview on the fundamental concepts of a hotel database and investigates the feasibility of incorporating computer-assisted data mining techniques into hospitality database applications. The author also exposes some potential myths associated with data mining in hospitaltiy database applications.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Because some Web users will be able to design a template to visualize information from scratch, while other users need to automatically visualize information by changing some parameters, providing different levels of customization of the information is a desirable goal. Our system allows the automatic generation of visualizations given the semantics of the data, and the static or pre-specified visualization by creating an interface language. We address information visualization taking into consideration the Web, where the presentation of the retrieved information is a challenge. ^ We provide a model to narrow the gap between the user's way of expressing queries and database manipulation languages (SQL) without changing the system itself thus improving the query specification process. We develop a Web interface model that is integrated with the HTML language to create a powerful language that facilitates the construction of Web-based database reports. ^ As opposed to other papers, this model offers a new way of exploring databases focusing on providing Web connectivity to databases with minimal or no result buffering, formatting, or extra programming. We describe how to easily connect the database to the Web. In addition, we offer an enhanced way on viewing and exploring the contents of a database, allowing users to customize their views depending on the contents and the structure of the data. Current database front-ends typically attempt to display the database objects in a flat view making it difficult for users to grasp the contents and the structure of their result. Our model narrows the gap between databases and the Web. ^ The overall objective of this research is to construct a model that accesses different databases easily across the net and generates SQL, forms, and reports across all platforms without requiring the developer to code a complex application. This increases the speed of development. In addition, using only the Web browsers, the end-user can retrieve data from databases remotely to make necessary modifications and manipulations of data using the Web formatted forms and reports, independent of the platform, without having to open different applications, or learn to use anything but their Web browser. We introduce a strategic method to generate and construct SQL queries, enabling inexperienced users that are not well exposed to the SQL world to build syntactically and semantically a valid SQL query and to understand the retrieved data. The generated SQL query will be validated against the database schema to ensure harmless and efficient SQL execution. (Abstract shortened by UMI.)^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This research presents several components encompassing the scope of the objective of Data Partitioning and Replication Management in Distributed GIS Database. Modern Geographic Information Systems (GIS) databases are often large and complicated. Therefore data partitioning and replication management problems need to be addresses in development of an efficient and scalable solution. ^ Part of the research is to study the patterns of geographical raster data processing and to propose the algorithms to improve availability of such data. These algorithms and approaches are targeting granularity of geographic data objects as well as data partitioning in geographic databases to achieve high data availability and Quality of Service(QoS) considering distributed data delivery and processing. To achieve this goal a dynamic, real-time approach for mosaicking digital images of different temporal and spatial characteristics into tiles is proposed. This dynamic approach reuses digital images upon demand and generates mosaicked tiles only for the required region according to user's requirements such as resolution, temporal range, and target bands to reduce redundancy in storage and to utilize available computing and storage resources more efficiently. ^ Another part of the research pursued methods for efficient acquiring of GIS data from external heterogeneous databases and Web services as well as end-user GIS data delivery enhancements, automation and 3D virtual reality presentation. ^ There are vast numbers of computing, network, and storage resources idling or not fully utilized available on the Internet. Proposed "Crawling Distributed Operating System "(CDOS) approach employs such resources and creates benefits for the hosts that lend their CPU, network, and storage resources to be used in GIS database context. ^ The results of this dissertation demonstrate effective ways to develop a highly scalable GIS database. The approach developed in this dissertation has resulted in creation of TerraFly GIS database that is used by US government, researchers, and general public to facilitate Web access to remotely-sensed imagery and GIS vector information. ^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The study examines the thought of Yanagita Kunio (1875–1962), an influential Japanese nationalist thinker and a founder of an academic discipline named minzokugaku. The purpose of the study is to bring into light an unredeemed potential of his intellectual and political project as a critique of the way in which modern politics and knowledge systematically suppresses global diversity. The study reads his texts against the backdrop of the modern understanding of space and time and its political and moral implications and traces the historical evolution of his thought that culminates in the establishment of minzokugaku. My reading of Yanagita’s texts draws on three interpretive hypotheses. First, his thought can be interpreted as a critical engagement with John Stuart Mill’s philosophy of history, as he turns Mill’s defense of diversity against Mill’s justification of enlightened despotism in non-Western societies. Second, to counter Mill’s individualistic notion of progressive agency, he turns to a Marxian notion of anthropological space, in which a laboring class makes history by continuously transforming nature, and rehabilitates the common people (jomin) as progressive agents. Third, in addition to the common people, Yanagita integrates wandering people as a countervailing force to the innate parochialism and conservatism of agrarian civilization. To excavate the unrecorded history of ordinary farmers and wandering people and promote the formation of national consciousness, his minzokugaku adopts travel as an alternative method for knowledge production and political education. In light of this interpretation, the aim of Yanagita’s intellectual and political project can be understood as defense and critique of the Enlightenment tradition. Intellectually, he attempts to navigate between spurious universalism and reactionary particularism by revaluing diversity as a necessary condition for universal knowledge and human progress. Politically, his minzokugaku aims at nation-building/globalization from below by tracing back the history of a migratory process cutting across the existing boundaries. His project is opposed to nation-building from above that aims to integrate the world population into international society at the expense of global diversity.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

long-term research on freshwater ecosystems provides insights that can be difficult to obtain from other approaches. Widespread monitoring of ecologically relevant water-quality parameters spanning decades can facilitate important tests of ecological principles. Unique long-term data sets and analytical tools are increasingly available, allowing for powerful and synthetic analyses across sites. long-term measurements or experiments in aquatic systems can catch rare events, changes in highly variable systems, time-lagged responses, cumulative effects of stressors, and biotic responses that encompass multiple generations. Data are available from formal networks, local to international agencies, private organizations, various institutions, and paleontological and historic records; brief literature surveys suggest much existing data are not synthesized. Ecological sciences will benefit from careful maintenance and analyses of existing long-term programs, and subsequent insights can aid in the design of effective future long-term experimental and observational efforts. long-term research on freshwaters is particularly important because of their value to humanity.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The purpose of the study was to investigate the physiological and psychological benefits provided by a self-selected health and wellness course on a racially and ethnically diverse student population. It was designed to determine if students from a 2-year Hispanic serving institution (HIS) from a large metropolitan area would enhance their capacity to perform physical activities, increase their knowledge of health topics and raise their exercise self-efficacy after completing a course that included educational and activity components for a period of 16 weeks. A total of 185 students voluntarily agreed to participate in the study. An experimental group was selected from six sections of a health and wellness course, and a comparison group from students in a student life skills course. All participants were given anthropometric tests of physical fitness, a knowledge test, and an exercise self-efficacy scale was given at the beginning and at the conclusion of the semester. An ANCOVA analyses with the pretest scores being the covariate and the dependent variable being the difference score, indicated a significant improvement of the experimental group in five of the seven anthropometric tests over the comparison group. In addition, the experimental group increased in two of the three sections of the exercise self-efficacy scale indicating greater confidence to participate in physical activities in spite of barriers over the comparison group. The experimental group also increased in knowledge of health related topics over the comparison group at the .05 significance level. Results indicated beneficial outcomes gained by students enrolled in a 16-week health and wellness course. The study has several implications for practitioners, faculty members, educational policy makers and researchers in terms of implementation of strategies to promote healthy behaviors in college students and, to encourage them to engage in regular physical activities throughout their college years.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper details the research methods an introductory qualitative research class used to both study an issue related to race and identity, and to familiarize themselves with data collection strategies. Throughout the paper the authors attempt to capture the challenges, disagreements, and consensus building that marked this unusual research endeavor.