9 resultados para Knowledge Discovery
em Digital Commons at Florida International University
Resumo:
The primary aim of this dissertation is to develop data mining tools for knowledge discovery in biomedical data when multiple (homogeneous or heterogeneous) sources of data are available. The central hypothesis is that, when information from multiple sources of data are used appropriately and effectively, knowledge discovery can be better achieved than what is possible from only a single source. ^ Recent advances in high-throughput technology have enabled biomedical researchers to generate large volumes of diverse types of data on a genome-wide scale. These data include DNA sequences, gene expression measurements, and much more; they provide the motivation for building analysis tools to elucidate the modular organization of the cell. The challenges include efficiently and accurately extracting information from the multiple data sources; representing the information effectively, developing analytical tools, and interpreting the results in the context of the domain. ^ The first part considers the application of feature-level integration to design classifiers that discriminate between soil types. The machine learning tools, SVM and KNN, were used to successfully distinguish between several soil samples. ^ The second part considers clustering using multiple heterogeneous data sources. The resulting Multi-Source Clustering (MSC) algorithm was shown to have a better performance than clustering methods that use only a single data source or a simple feature-level integration of heterogeneous data sources. ^ The third part proposes a new approach to effectively incorporate incomplete data into clustering analysis. Adapted from K-means algorithm, the Generalized Constrained Clustering (GCC) algorithm makes use of incomplete data in the form of constraints to perform exploratory analysis. Novel approaches for extracting constraints were proposed. For sufficiently large constraint sets, the GCC algorithm outperformed the MSC algorithm. ^ The last part considers the problem of providing a theme-specific environment for mining multi-source biomedical data. The database called PlasmoTFBM, focusing on gene regulation of Plasmodium falciparum, contains diverse information and has a simple interface to allow biologists to explore the data. It provided a framework for comparing different analytical tools for predicting regulatory elements and for designing useful data mining tools. ^ The conclusion is that the experiments reported in this dissertation strongly support the central hypothesis.^
Resumo:
With the proliferation of multimedia data and ever-growing requests for multimedia applications, there is an increasing need for efficient and effective indexing, storage and retrieval of multimedia data, such as graphics, images, animation, video, audio and text. Due to the special characteristics of the multimedia data, the Multimedia Database management Systems (MMDBMSs) have emerged and attracted great research attention in recent years. Though much research effort has been devoted to this area, it is still far from maturity and there exist many open issues. In this dissertation, with the focus of addressing three of the essential challenges in developing the MMDBMS, namely, semantic gap, perception subjectivity and data organization, a systematic and integrated framework is proposed with video database and image database serving as the testbed. In particular, the framework addresses these challenges separately yet coherently from three main aspects of a MMDBMS: multimedia data representation, indexing and retrieval. In terms of multimedia data representation, the key to address the semantic gap issue is to intelligently and automatically model the mid-level representation and/or semi-semantic descriptors besides the extraction of the low-level media features. The data organization challenge is mainly addressed by the aspect of media indexing where various levels of indexing are required to support the diverse query requirements. In particular, the focus of this study is to facilitate the high-level video indexing by proposing a multimodal event mining framework associated with temporal knowledge discovery approaches. With respect to the perception subjectivity issue, advanced techniques are proposed to support users' interaction and to effectively model users' perception from the feedback at both the image-level and object-level.
Resumo:
Today, databases have become an integral part of information systems. In the past two decades, we have seen different database systems being developed independently and used in different applications domains. Today's interconnected networks and advanced applications, such as data warehousing, data mining & knowledge discovery and intelligent data access to information on the Web, have created a need for integrated access to such heterogeneous, autonomous, distributed database systems. Heterogeneous/multidatabase research has focused on this issue resulting in many different approaches. However, a single, generally accepted methodology in academia or industry has not emerged providing ubiquitous intelligent data access from heterogeneous, autonomous, distributed information sources. ^ This thesis describes a heterogeneous database system being developed at High-performance Database Research Center (HPDRC). A major impediment to ubiquitous deployment of multidatabase technology is the difficulty in resolving semantic heterogeneity. That is, identifying related information sources for integration and querying purposes. Our approach considers the semantics of the meta-data constructs in resolving this issue. The major contributions of the thesis work include: (i) providing a scalable, easy-to-implement architecture for developing a heterogeneous multidatabase system, utilizing Semantic Binary Object-oriented Data Model (Sem-ODM) and Semantic SQL query language to capture the semantics of the data sources being integrated and to provide an easy-to-use query facility; (ii) a methodology for semantic heterogeneity resolution by investigating into the extents of the meta-data constructs of component schemas. This methodology is shown to be correct, complete and unambiguous; (iii) a semi-automated technique for identifying semantic relations, which is the basis of semantic knowledge for integration and querying, using shared ontologies for context-mediation; (iv) resolutions for schematic conflicts and a language for defining global views from a set of component Sem-ODM schemas; (v) design of a knowledge base for storing and manipulating meta-data and knowledge acquired during the integration process. This knowledge base acts as the interface between integration and query processing modules; (vi) techniques for Semantic SQL query processing and optimization based on semantic knowledge in a heterogeneous database environment; and (vii) a framework for intelligent computing and communication on the Internet applying the concepts of our work. ^
Resumo:
There is growing popularity in the use of composite indices and rankings for cross-organizational benchmarking. However, little attention has been paid to alternative methods and procedures for the computation of these indices and how the use of such methods may impact the resulting indices and rankings. This dissertation developed an approach for assessing composite indices and rankings based on the integration of a number of methods for aggregation, data transformation and attribute weighting involved in their computation. The integrated model developed is based on the simulation of composite indices using methods and procedures proposed in the area of multi-criteria decision making (MCDM) and knowledge discovery in databases (KDD). The approach developed in this dissertation was automated through an IT artifact that was designed, developed and evaluated based on the framework and guidelines of the design science paradigm of information systems research. This artifact dynamically generates multiple versions of indices and rankings by considering different methodological scenarios according to user specified parameters. The computerized implementation was done in Visual Basic for Excel 2007. Using different performance measures, the artifact produces a number of excel outputs for the comparison and assessment of the indices and rankings. In order to evaluate the efficacy of the artifact and its underlying approach, a full empirical analysis was conducted using the World Bank's Doing Business database for the year 2010, which includes ten sub-indices (each corresponding to different areas of the business environment and regulation) for 183 countries. The output results, which were obtained using 115 methodological scenarios for the assessment of this index and its ten sub-indices, indicated that the variability of the component indicators considered in each case influenced the sensitivity of the rankings to the methodological choices. Overall, the results of our multi-method assessment were consistent with the World Bank rankings except in cases where the indices involved cost indicators measured in per capita income which yielded more sensitive results. Low income level countries exhibited more sensitivity in their rankings and less agreement between the benchmark rankings and our multi-method based rankings than higher income country groups.
Resumo:
Today, databases have become an integral part of information systems. In the past two decades, we have seen different database systems being developed independently and used in different applications domains. Today's interconnected networks and advanced applications, such as data warehousing, data mining & knowledge discovery and intelligent data access to information on the Web, have created a need for integrated access to such heterogeneous, autonomous, distributed database systems. Heterogeneous/multidatabase research has focused on this issue resulting in many different approaches. However, a single, generally accepted methodology in academia or industry has not emerged providing ubiquitous intelligent data access from heterogeneous, autonomous, distributed information sources. This thesis describes a heterogeneous database system being developed at Highperformance Database Research Center (HPDRC). A major impediment to ubiquitous deployment of multidatabase technology is the difficulty in resolving semantic heterogeneity. That is, identifying related information sources for integration and querying purposes. Our approach considers the semantics of the meta-data constructs in resolving this issue. The major contributions of the thesis work include: (i.) providing a scalable, easy-to-implement architecture for developing a heterogeneous multidatabase system, utilizing Semantic Binary Object-oriented Data Model (Sem-ODM) and Semantic SQL query language to capture the semantics of the data sources being integrated and to provide an easy-to-use query facility; (ii.) a methodology for semantic heterogeneity resolution by investigating into the extents of the meta-data constructs of component schemas. This methodology is shown to be correct, complete and unambiguous; (iii.) a semi-automated technique for identifying semantic relations, which is the basis of semantic knowledge for integration and querying, using shared ontologies for context-mediation; (iv.) resolutions for schematic conflicts and a language for defining global views from a set of component Sem-ODM schemas; (v.) design of a knowledge base for storing and manipulating meta-data and knowledge acquired during the integration process. This knowledge base acts as the interface between integration and query processing modules; (vi.) techniques for Semantic SQL query processing and optimization based on semantic knowledge in a heterogeneous database environment; and (vii.) a framework for intelligent computing and communication on the Internet applying the concepts of our work.
Resumo:
Entrepreneurial opportunity recognition is an increasingly prevalent phenomenon. Of particular interest is the ability of promising technology based ventures to recognize and exploit opportunities. Recent research drawing on the Austrian economic theory emphasizes the importance of knowledge, particularly market knowledge, behind opportunity recognition. While insightful, this research has tended to overlook those interrelationships that exist between different types of knowledge (technology and market knowledge) as well as between a firm’s knowledge base and its entrepreneurial orientation. Additional shortfalls of prior research include the ambiguous definitions provided for entrepreneurial opportunities, oversight of opportunity exploitation with an extensive focus on opportunity recognition only, and the lack of quantitative, empirical evidence on entrepreneurial opportunity recognition. ^ In this dissertation, these research gaps are addressed by integrating Schumpeterian opportunity development view with a Kirznerian opportunity discovery theory as well as insights from literature on entrepreneurial orientation. A sample of 85 new biotechnology ventures from the United States, Finland, and Sweden was analyzed. While leaders in all 85 companies were interviewed for the research in 2003-2004, 42 firms provided data in 2007. Data was analyzed using regression analysis. ^ The results show the value and importance of early market knowledge and technology knowledge as well as an entrepreneurial company posture for subsequent opportunity recognition. The highest numbers of new opportunities are recognized in firms where high levels of market knowledge are combined with high levels of technology knowledge (measured with a number of patents). A firm’s entrepreneurial orientation also enhances its opportunity recognition. Furthermore, the results show that new ventures with more market knowledge are able to gather more equity investments, license out more technologies, and achieve higher sales than new ventures with lower levels of market knowledge. Overall, the findings of this dissertation help further our understanding of the sources of entrepreneurial opportunities, and should encourage further research in this area. ^
Resumo:
The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases, or use a combination of flat files and indexes. These two approaches result in a mismatch between the tree-structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed down the large-scale adoption of XML into actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that undermine the massive adoption of XML. ^ Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge to leverage semistructured data is to perform effective information discovery on such data. Previous works have addressed this problem in a generic (i.e. domain independent) way, but this process can be improved if knowledge about the specific domain is taken into consideration. ^ This dissertation had two general goals: The first goal was to devise novel techniques to efficiently store and process semistructured documents. This goal had two specific aims: We proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives. We developed a Double-Lazy Parser for semistructured documents which introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model’s parsing mechanism. ^ The second goal was to construct a user-friendly and efficient engine for performing Information Discovery over domain-specific semistructured documents. This goal also had two aims: We presented a framework that exploits the domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies. We also proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents. ^
Resumo:
Entrepreneurial opportunity recognition is an increasingly prevalent phenomenon. Of particular interest is the ability of promising technology based ventures to recognize and exploit opportunities. Recent research drawing on the Austrian economic theory emphasizes the importance of knowledge, particularly market knowledge, behind opportunity recognition. While insightful, this research has tended to overlook those interrelationships that exist between different types of knowledge (technology and market knowledge) as well as between a firm’s knowledge base and its entrepreneurial orientation. Additional shortfalls of prior research include the ambiguous definitions provided for entrepreneurial opportunities, oversight of opportunity exploitation with an extensive focus on opportunity recognition only, and the lack of quantitative, empirical evidence on entrepreneurial opportunity recognition. In this dissertation, these research gaps are addressed by integrating Schumpeterian opportunity development view with a Kirznerian opportunity discovery theory as well as insights from literature on entrepreneurial orientation. A sample of 85 new biotechnology ventures from the United States, Finland, and Sweden was analyzed. While leaders in all 85 companies were interviewed for the research in 2003-2004, 42 firms provided data in 2007. Data was analyzed using regression analysis. The results show the value and importance of early market knowledge and technology knowledge as well as an entrepreneurial company posture for subsequent opportunity recognition. The highest numbers of new opportunities are recognized in firms where high levels of market knowledge are combined with high levels of technology knowledge (measured with a number of patents). A firm’s entrepreneurial orientation also enhances its opportunity recognition. Furthermore, the results show that new ventures with more market knowledge are able to gather more equity investments, license out more technologies, and achieve higher sales than new ventures with lower levels of market knowledge. Overall, the findings of this dissertation help further our understanding of the sources of entrepreneurial opportunities, and should encourage further research in this area.
Resumo:
The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases, or use a combination of flat files and indexes. These two approaches result in a mismatch between the tree-structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed down the large-scale adoption of XML into actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that undermine the massive adoption of XML. Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge to leverage semistructured data is to perform effective information discovery on such data. Previous works have addressed this problem in a generic (i.e. domain independent) way, but this process can be improved if knowledge about the specific domain is taken into consideration. This dissertation had two general goals: The first goal was to devise novel techniques to efficiently store and process semistructured documents. This goal had two specific aims: We proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives. We developed a Double-Lazy Parser for semistructured documents which introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model's parsing mechanism. The second goal was to construct a user-friendly and efficient engine for performing Information Discovery over domain-specific semistructured documents. This goal also had two aims: We presented a framework that exploits the domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies. We also proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents.