899 resultados para Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining


Relevância:

50.00% 50.00%

Publicador:

Resumo:

Dimensionality reduction is a very important step in the data mining process. In this paper, we consider feature extraction for classification tasks as a technique to overcome problems occurring because of “the curse of dimensionality”. Three different eigenvector-based feature extraction approaches are discussed and three different kinds of applications with respect to classification tasks are considered. The summary of obtained results concerning the accuracy of classification schemes is presented with the conclusion about the search for the most appropriate feature extraction method. The problem how to discover knowledge needed to integrate the feature extraction and classification processes is stated. A decision support system to aid in the integration of the feature extraction and classification processes is proposed. The goals and requirements set for the decision support system and its basic structure are defined. The means of knowledge acquisition needed to build up the proposed system are considered.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

We develop a new autoregressive conditional process to capture both the changes and the persistency of the intraday seasonal (U-shape) pattern of volatility in essay 1. Unlike other procedures, this approach allows for the intraday volatility pattern to change over time without the filtering process injecting a spurious pattern of noise into the filtered series. We show that prior deterministic filtering procedures are special cases of the autoregressive conditional filtering process presented here. Lagrange multiplier tests prove that the stochastic seasonal variance component is statistically significant. Specification tests using the correlogram and cross-spectral analyses prove the reliability of the autoregressive conditional filtering process. In essay 2 we develop a new methodology to decompose return variance in order to examine the informativeness embedded in the return series. The variance is decomposed into the information arrival component and the noise factor component. This decomposition methodology differs from previous studies in that both the informational variance and the noise variance are time-varying. Furthermore, the covariance of the informational component and the noisy component is no longer restricted to be zero. The resultant measure of price informativeness is defined as the informational variance divided by the total variance of the returns. The noisy rational expectations model predicts that uninformed traders react to price changes more than informed traders, since uninformed traders cannot distinguish between price changes caused by information arrivals and price changes caused by noise. This hypothesis is tested in essay 3 using intraday data with the intraday seasonal volatility component removed, as based on the procedure in the first essay. The resultant seasonally adjusted variance series is decomposed into components caused by unexpected information arrivals and by noise in order to examine informativeness.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases, or use a combination of flat files and indexes. These two approaches result in a mismatch between the tree-structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed down the large-scale adoption of XML into actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that undermine the massive adoption of XML. ^ Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge to leverage semistructured data is to perform effective information discovery on such data. Previous works have addressed this problem in a generic (i.e. domain independent) way, but this process can be improved if knowledge about the specific domain is taken into consideration. ^ This dissertation had two general goals: The first goal was to devise novel techniques to efficiently store and process semistructured documents. This goal had two specific aims: We proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives. We developed a Double-Lazy Parser for semistructured documents which introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model’s parsing mechanism. ^ The second goal was to construct a user-friendly and efficient engine for performing Information Discovery over domain-specific semistructured documents. This goal also had two aims: We presented a framework that exploits the domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies. We also proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents. ^

Relevância:

50.00% 50.00%

Publicador:

Resumo:

The rapid growth of the Internet and the advancements of the Web technologies have made it possible for users to have access to large amounts of on-line music data, including music acoustic signals, lyrics, style/mood labels, and user-assigned tags. The progress has made music listening more fun, but has raised an issue of how to organize this data, and more generally, how computer programs can assist users in their music experience. An important subject in computer-aided music listening is music retrieval, i.e., the issue of efficiently helping users in locating the music they are looking for. Traditionally, songs were organized in a hierarchical structure such as genre->artist->album->track, to facilitate the users’ navigation. However, the intentions of the users are often hard to be captured in such a simply organized structure. The users may want to listen to music of a particular mood, style or topic; and/or any songs similar to some given music samples. This motivated us to work on user-centric music retrieval system to improve users’ satisfaction with the system. The traditional music information retrieval research was mainly concerned with classification, clustering, identification, and similarity search of acoustic data of music by way of feature extraction algorithms and machine learning techniques. More recently the music information retrieval research has focused on utilizing other types of data, such as lyrics, user-access patterns, and user-defined tags, and on targeting non-genre categories for classification, such as mood labels and styles. This dissertation focused on investigating and developing effective data mining techniques for (1) organizing and annotating music data with styles, moods and user-assigned tags; (2) performing effective analysis of music data with features from diverse information sources; and (3) recommending music songs to the users utilizing both content features and user access patterns.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Background As the use of electronic health records (EHRs) becomes more widespread, so does the need to search and provide effective information discovery within them. Querying by keyword has emerged as one of the most effective paradigms for searching. Most work in this area is based on traditional Information Retrieval (IR) techniques, where each document is compared individually against the query. We compare the effectiveness of two fundamentally different techniques for keyword search of EHRs. Methods We built two ranking systems. The traditional BM25 system exploits the EHRs' content without regard to association among entities within. The Clinical ObjectRank (CO) system exploits the entities' associations in EHRs using an authority-flow algorithm to discover the most relevant entities. BM25 and CO were deployed on an EHR dataset of the cardiovascular division of Miami Children's Hospital. Using sequences of keywords as queries, sensitivity and specificity were measured by two physicians for a set of 11 queries related to congenital cardiac disease. Results Our pilot evaluation showed that CO outperforms BM25 in terms of sensitivity (65% vs. 38%) by 71% on average, while maintaining the specificity (64% vs. 61%). The evaluation was done by two physicians. Conclusions Authority-flow techniques can greatly improve the detection of relevant information in EHRs and hence deserve further study.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Protecting confidential information from improper disclosure is a fundamental security goal. While encryption and access control are important tools for ensuring confidentiality, they cannot prevent an authorized system from leaking confidential information to its publicly observable outputs, whether inadvertently or maliciously. Hence, secure information flow aims to provide end-to-end control of information flow. Unfortunately, the traditionally-adopted policy of noninterference, which forbids all improper leakage, is often too restrictive. Theories of quantitative information flow address this issue by quantifying the amount of confidential information leaked by a system, with the goal of showing that it is intuitively "small" enough to be tolerated. Given such a theory, it is crucial to develop automated techniques for calculating the leakage in a system. ^ This dissertation is concerned with program analysis for calculating the maximum leakage, or capacity, of confidential information in the context of deterministic systems and under three proposed entropy measures of information leakage: Shannon entropy leakage, min-entropy leakage, and g-leakage. In this context, it turns out that calculating the maximum leakage of a program reduces to counting the number of possible outputs that it can produce. ^ The new approach introduced in this dissertation is to determine two-bit patterns, the relationships among pairs of bits in the output; for instance we might determine that two bits must be unequal. By counting the number of solutions to the two-bit patterns, we obtain an upper bound on the number of possible outputs. Hence, the maximum leakage can be bounded. We first describe a straightforward computation of the two-bit patterns using an automated prover. We then show a more efficient implementation that uses an implication graph to represent the two- bit patterns. It efficiently constructs the graph through the use of an automated prover, random executions, STP counterexamples, and deductive closure. The effectiveness of our techniques, both in terms of efficiency and accuracy, is shown through a number of case studies found in recent literature. ^

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Many systems and applications are continuously producing events. These events are used to record the status of the system and trace the behaviors of the systems. By examining these events, system administrators can check the potential problems of these systems. If the temporal dynamics of the systems are further investigated, the underlying patterns can be discovered. The uncovered knowledge can be leveraged to predict the future system behaviors or to mitigate the potential risks of the systems. Moreover, the system administrators can utilize the temporal patterns to set up event management rules to make the system more intelligent. With the popularity of data mining techniques in recent years, these events grad- ually become more and more useful. Despite the recent advances of the data mining techniques, the application to system event mining is still in a rudimentary stage. Most of works are still focusing on episodes mining or frequent pattern discovering. These methods are unable to provide a brief yet comprehensible summary to reveal the valuable information from the high level perspective. Moreover, these methods provide little actionable knowledge to help the system administrators to better man- age the systems. To better make use of the recorded events, more practical techniques are required. From the perspective of data mining, three correlated directions are considered to be helpful for system management: (1) Provide concise yet comprehensive summaries about the running status of the systems; (2) Make the systems more intelligence and autonomous; (3) Effectively detect the abnormal behaviors of the systems. Due to the richness of the event logs, all these directions can be solved in the data-driven manner. And in this way, the robustness of the systems can be enhanced and the goal of autonomous management can be approached. This dissertation mainly focuses on the foregoing directions that leverage tem- poral mining techniques to facilitate system management. More specifically, three concrete topics will be discussed, including event, resource demand prediction, and streaming anomaly detection. Besides the theoretic contributions, the experimental evaluation will also be presented to demonstrate the effectiveness and efficacy of the corresponding solutions.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases, or use a combination of flat files and indexes. These two approaches result in a mismatch between the tree-structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed down the large-scale adoption of XML into actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that undermine the massive adoption of XML. Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge to leverage semistructured data is to perform effective information discovery on such data. Previous works have addressed this problem in a generic (i.e. domain independent) way, but this process can be improved if knowledge about the specific domain is taken into consideration. This dissertation had two general goals: The first goal was to devise novel techniques to efficiently store and process semistructured documents. This goal had two specific aims: We proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives. We developed a Double-Lazy Parser for semistructured documents which introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model's parsing mechanism. The second goal was to construct a user-friendly and efficient engine for performing Information Discovery over domain-specific semistructured documents. This goal also had two aims: We presented a framework that exploits the domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies. We also proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

We develop a new autoregressive conditional process to capture both the changes and the persistency of the intraday seasonal (U-shape) pattern of volatility in essay 1. Unlike other procedures, this approach allows for the intraday volatility pattern to change over time without the filtering process injecting a spurious pattern of noise into the filtered series. We show that prior deterministic filtering procedures are special cases of the autoregressive conditional filtering process presented here. Lagrange multiplier tests prove that the stochastic seasonal variance component is statistically significant. Specification tests using the correlogram and cross-spectral analyses prove the reliability of the autoregressive conditional filtering process. In essay 2 we develop a new methodology to decompose return variance in order to examine the informativeness embedded in the return series. The variance is decomposed into the information arrival component and the noise factor component. This decomposition methodology differs from previous studies in that both the informational variance and the noise variance are time-varying. Furthermore, the covariance of the informational component and the noisy component is no longer restricted to be zero. The resultant measure of price informativeness is defined as the informational variance divided by the total variance of the returns. The noisy rational expectations model predicts that uninformed traders react to price changes more than informed traders, since uninformed traders cannot distinguish between price changes caused by information arrivals and price changes caused by noise. This hypothesis is tested in essay 3 using intraday data with the intraday seasonal volatility component removed, as based on the procedure in the first essay. The resultant seasonally adjusted variance series is decomposed into components caused by unexpected information arrivals and by noise in order to examine informativeness.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Peer reviewed

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Peer reviewed

Relevância:

50.00% 50.00%

Publicador:

Resumo:

Peer reviewed

Relevância:

50.00% 50.00%

Publicador:

Resumo:

The extractive industry is characterized by high levels of risk and uncertainty. These attributes create challenges when applying traditional accounting concepts (such as the revenue recognition and matching concepts) to the preparation of financial statements in the industry. The International Accounting Standards Board (2010) states that the objective of general purpose financial statements is to provide useful financial information to assist the capital allocation decisions of existing and potential providers of capital. The usefulness of information is defined as being relevant and faithfully represented so as to best aid in the investment decisions of capital providers. Value relevance research utilizes adaptations of the Ohlson (1995) to assess the attribute of value relevance which is one part of the attributes resulting in useful information. This study firstly examines the value relevance of the financial information disclosed in the financial reports of extractive firms. The findings reveal that the value relevance of information disclosed in the financial reports depends on the circumstances of the firm including sector, size and profitability. Traditional accounting concepts such as the matching concept can be ineffective when applied to small firms who are primarily engaged in nonproduction activities that involve significant levels of uncertainty such as exploration activities or the development of sites. Standard setting bodies such as the International Accounting Standards Board and the Financial Accounting Standards Board have addressed the financial reporting challenges in the extractive industry by allowing a significant amount of accounting flexibility in industryspecific accounting standards, particularly in relation to the accounting treatment of exploration and evaluation expenditure. Therefore, secondly this study examines whether the choice of exploration accounting policy has an effect on the value relevance of information disclosed in the financial reports. The findings show that, in general, the Successful Efforts method produces value relevant information in the financial reports of profitable extractive firms. However, specifically in the oil & gas sector, the Full Cost method produces value relevant asset disclosures if the firm is lossmaking. This indicates that investors in production and non-production orientated firms have different information needs and these needs cannot be simultaneously fulfilled by a single accounting policy. In the mining sector, a preference by large profitable mining companies towards a more conservative policy than either the Full Cost or Successful Efforts methods does not result in more value relevant information being disclosed in the financial reports. This finding supports the fact that the qualitative characteristic of prudence is a form of bias which has a downward effect on asset values. The third aspect of this study is an examination of the effect of corporate governance on the value relevance of disclosures made in the financial reports of extractive firms. The findings show that the key factor influencing the value relevance of financial information is the ability of the directors to select accounting policies which reflect the economic substance of the particular circumstances facing the firms in an effective way. Corporate governance is found to have an effect on value relevance, particularly in the oil & gas sector. However, there is no significant difference between the exploration accounting policy choices made by directors of firms with good systems of corporate governance and those with weak systems of corporate governance.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

The continuous flow of technological developments in communications and electronic industries has led to the growing expansion of the Internet of Things (IoT). By leveraging the capabilities of smart networked devices and integrating them into existing industrial, leisure and communication applications, the IoT is expected to positively impact both economy and society, reducing the gap between the physical and digital worlds. Therefore, several efforts have been dedicated to the development of networking solutions addressing the diversity of challenges associated with such a vision. In this context, the integration of Information Centric Networking (ICN) concepts into the core of IoT is a research area gaining momentum and involving both research and industry actors. The massive amount of heterogeneous devices, as well as the data they produce, is a significant challenge for a wide-scale adoption of the IoT. In this paper we propose a service discovery mechanism, based on Named Data Networking (NDN), that leverages the use of a semantic matching mechanism for achieving a flexible discovery process. The development of appropriate service discovery mechanisms enriched with semantic capabilities for understanding and processing context information is a key feature for turning raw data into useful knowledge and ensuring the interoperability among different devices and applications. We assessed the performance of our solution through the implementation and deployment of a proof-of-concept prototype. Obtained results illustrate the potential of integrating semantic and ICN mechanisms to enable a flexible service discovery in IoT scenarios.