162 results for Natural language processing systems
Abstract:
We present a text watermarking scheme that embeds a bitstream watermark Wi in a text document P while preserving the meaning, context, and flow of the document. The document is viewed as a set of paragraphs, each paragraph being a set of sentences. The sequence of paragraphs and sentences used to embed watermark bits is permuted using a secret key. Then, English language sentence transformations are used to modify sentence lengths, thus embedding watermarking bits in the Least Significant Bits (LSB) of the sentences’ cardinalities. The embedding and extracting algorithms are public, while the secrecy and security of the watermark depend on a secret key K. The probability of False Positives is extremely small, hence avoiding incidental occurrences of our watermark in random text documents. Majority voting provides security against text addition, deletion, and swapping attacks, further reducing the probability of False Positives. The scheme is secure against the general attacks on text watermarks such as reproduction (photocopying, FAX), reformatting, synonym substitution, text addition, text deletion, text swapping, paragraph shuffling and collusion attacks.
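The extraction side of such a scheme can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes that a sentence's "cardinality" is its word count, that paragraphs are separated by blank lines, and that the key-dependent permutation can be stood in for by a seeded shuffle over all sentences. The names `sentence_lengths` and `extract_bits` are hypothetical, and majority voting is left out.

```python
import random
import re


def sentence_lengths(paragraph: str) -> list[int]:
    """Word count of each sentence, using a deliberately naive splitter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    return [len(s.split()) for s in sentences]


def extract_bits(document: str, key: int, n_bits: int) -> list[int]:
    """Read n_bits watermark bits from the parity (LSB) of sentence lengths.

    Assumption: the secret key seeds a pseudo-random permutation of the
    sentence positions; the abstract does not say how the key is used.
    """
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    lengths = [n for p in paragraphs for n in sentence_lengths(p)]
    order = list(range(len(lengths)))
    random.Random(key).shuffle(order)  # key-dependent visiting order
    return [lengths[i] & 1 for i in order[:n_bits]]


# Toy document: sentence lengths are 3, 2, 4 and 1 words.
doc = "One two three. Four five.\n\nSix seven eight nine. Ten."
bits = extract_bits(doc, key=42, n_bits=4)
```

Embedding would run the same permutation and rewrite each visited sentence until its length parity matches the next watermark bit. Without the key an observer does not know which sentence carries which bit.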
Abstract:
Niklas Luhmann's theory of social systems has been widely influential in the German-speaking countries in the past few decades. However, despite its significance, particularly for organization studies, it is only very recently that Luhmann's work has attracted attention on the international stage as well. This Special Issue is in response to that. In this introductory paper, we provide a systematic overview of Luhmann's theory. Reading his work as a theory about distinction generating and processing systems, we especially highlight the following aspects: (i) Organizations are processes that come into being by permanently constructing and reconstructing themselves by means of using distinctions, which mark what is part of their realm and what not. (ii) Such an organizational process belongs to a social sphere sui generis possessing its own logic, which cannot be traced back to human actors or subjects. (iii) Organizations are a specific kind of social process characterized by a specific kind of distinction: decision, which makes up what is specifically organizational about organizations as social phenomena. We conclude by introducing the papers in this Special Issue. Copyright © 2006 SAGE.
Abstract:
Last December Natural Language Processing and academic literature searching came into the spotlight in online conversations. Katherine Howard took a look at where and how this rapidly evolving technology fits with core competencies.
Abstract:
Debates on gene patents have necessitated the analysis of patents that disclose and reference human sequences. In this study, we built an automated classifier that assigns sequences to one of nine predefined categories according to their functional roles in patent claims by applying natural language processing and supervised learning techniques. To improve its correctness, we experimented with various feature mappings, reaching a maximum accuracy of 79%.
Abstract:
Within online learning communities, receiving timely and meaningful insights into the quality of learning activities is an important part of an effective educational experience. Commonly adopted methods – such as the Community of Inquiry framework – rely on manual coding of online discussion transcripts, which is a costly and time-consuming process. There are several efforts underway to enable the automated classification of online discussion messages using supervised machine learning, which would enable the real-time analysis of interactions occurring within online learning communities. This paper investigates the importance of incorporating features that utilise the structure of online discussions for the classification of "cognitive presence" – the central dimension of the Community of Inquiry framework focusing on the quality of students' critical thinking within online learning communities. We implemented a Conditional Random Field classification solution, which incorporates structural features that may be useful in increasing classification performance over other implementations. Our approach leads to an improvement in classification accuracy of 5.8% over existing techniques when tested on the same dataset, with a precision and recall of 0.630 and 0.504 respectively.
Identifying relevant information for emergency services from twitter in response to natural disaster
Abstract:
This project proposes a framework that identifies high‐value disaster-based information from social media to facilitate key decision-making processes during natural disasters. At present it is very difficult to differentiate between information that has a high degree of disaster relevance and information that has a low degree of disaster relevance. By digitally harvesting and categorising social media conversation streams automatically, this framework identifies highly disaster-relevant information that can be used by emergency services for intelligence gathering and decision-making.
Abstract:
Recent advances in neural language models have contributed new methods for learning distributed vector representations of words (also called word embeddings). Two such methods are the continuous bag-of-words model and the skip-gram model. These methods have been shown to produce embeddings that capture higher-order relationships between words and that are highly effective in natural language processing tasks involving word similarity and word analogy. Despite these promising results, there has been little analysis of the use of these word embeddings for retrieval. Motivated by these observations, in this paper, we set out to determine how these word embeddings can be used within a retrieval model and what the benefit might be. To this end, we use neural word embeddings within the well-known translation language model for information retrieval. This language model captures implicit semantic relations between the words in queries and those in relevant documents, thus producing more accurate estimations of document relevance. The word embeddings used to estimate neural language models produce translations that differ from previous translation language model approaches; differences that deliver improvements in retrieval effectiveness. The models are robust to choices made in building word embeddings and, even more so, our results show that embeddings do not even need to be produced from the same corpus being used for retrieval.
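A translation language model scores a document d for a query q as the product, over query terms t, of the sum over document words w of p(t|w) · p(w|d). The Python sketch below shows one plausible way to plug embeddings into that formula. It is an illustration under assumptions, not the paper's exact estimator: p(t|w) is taken to be cosine similarity normalised over the vocabulary (negative similarities clipped to zero), p(w|d) is the unsmoothed maximum-likelihood estimate, and the three two-dimensional vectors are invented toy data.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translation_prob(t, w, emb):
    """p(t|w): similarity of t to w, normalised over the vocabulary (assumed form)."""
    sims = {v: max(cosine(emb[v], emb[w]), 0.0) for v in emb}
    return sims[t] / sum(sims.values())


def query_log_likelihood(query, doc, emb):
    """log p(q|d) = sum_t log sum_w p(t|w) * p(w|d), with no collection smoothing."""
    counts = Counter(doc)
    score = 0.0
    for t in query:
        p = sum(translation_prob(t, w, emb) * c / len(doc) for w, c in counts.items())
        score += math.log(max(p, 1e-12))  # floor avoids log(0); an assumption
    return score


# Invented toy embeddings: "car" and "automobile" point in similar directions.
emb = {"car": [1.0, 0.1], "automobile": [0.9, 0.2], "banana": [0.0, 1.0]}
on_topic = query_log_likelihood(["automobile"], ["car", "car"], emb)
off_topic = query_log_likelihood(["automobile"], ["banana", "banana"], emb)
```

The query term "automobile" never occurs in the first document, yet that document scores higher than the second because "car" translates to "automobile" with high probability. This is the kind of implicit semantic match the abstract describes.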
Abstract:
Objective: Death certificates provide an invaluable source for cancer mortality statistics; however, this value can only be realised if accurate, quantitative data can be extracted from certificates – an aim hampered by both the volume and variable nature of certificates written in natural language. This paper proposes an automatic classification system for identifying cancer-related causes of death from death certificates.
Methods: Detailed features, including terms, n-grams and SNOMED CT concepts, were extracted from a collection of 447,336 death certificates. These features were used to train Support Vector Machine classifiers (one classifier for each cancer type). The classifiers were deployed in a cascaded architecture: the first level identified the presence of cancer (i.e., binary cancer/no cancer) and the second level identified the type of cancer (according to the ICD-10 classification system). A held-out test set was used to evaluate the effectiveness of the classifiers according to precision, recall and F-measure. In addition, detailed feature analysis was performed to reveal the characteristics of a successful cancer classification model.
Results: The system was highly effective at identifying cancer as the underlying cause of death (F-measure 0.94). The system was also effective at determining the type of cancer for common cancers (F-measure 0.7). Rare cancers, for which there was little training data, were difficult to classify accurately (F-measure 0.12). Factors influencing performance were the amount of training data and certain ambiguous cancers (e.g., those in the stomach region). The feature analysis revealed that a combination of features was important for cancer type classification, with SNOMED CT concept and oncology-specific morphology features proving the most valuable.
Conclusion: The system proposed in this study provides automatic identification and characterisation of cancers from large collections of free-text death certificates. This allows organisations such as Cancer Registries to monitor and report on cancer mortality in a timely and accurate manner. In addition, the methods and findings are generally applicable beyond cancer classification and to other sources of medical text besides death certificates.
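The two-level cascade described in the Methods can be sketched in Python as follows. This only illustrates the control flow: the study trained Support Vector Machines on terms, n-grams and SNOMED CT concepts, whereas the two keyword functions here are hypothetical stand-ins so that the example runs without any training data. The function names and keyword lists are assumptions.

```python
from typing import Callable, Optional


def cascade(
    text: str,
    is_cancer: Callable[[str], bool],
    cancer_type: Callable[[str], str],
) -> Optional[str]:
    """Level 1 decides cancer / no cancer; level 2 runs only on positives."""
    if not is_cancer(text):
        return None  # no cancer: the type classifier is never consulted
    return cancer_type(text)


# Hypothetical stand-ins for the trained classifiers (keyword rules only).
def toy_is_cancer(text: str) -> bool:
    lowered = text.lower()
    return any(k in lowered for k in ("carcinoma", "cancer", "malignant"))


def toy_cancer_type(text: str) -> str:
    lowered = text.lower()
    if "lung" in lowered:
        return "C34"  # ICD-10: malignant neoplasm of bronchus and lung
    if "breast" in lowered:
        return "C50"  # ICD-10: malignant neoplasm of breast
    return "C80"      # ICD-10: malignant neoplasm without specification of site


label_a = cascade("Metastatic carcinoma of the lung", toy_is_cancer, toy_cancer_type)
label_b = cascade("Acute myocardial infarction", toy_is_cancer, toy_cancer_type)
```

One consequence of this design is that the second level sees only certificates already flagged as cancer, so an error at the first level cannot be corrected further down the cascade.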
Abstract:
In this paper, we present the results of an exploratory study that examined the problem of automating content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and 0.63 Cohen's kappa, which is significantly higher than values reported in previous studies. Besides the improvement in classification accuracy, the developed system is also less prone to overfitting as it uses only 205 classification features, which is around 100 times fewer features than similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence, which gives additional insights into the nature of the cognitive presence learning cycle. Overall, our results show great potential of the proposed approach, with the added benefit of providing further characterization of the cognitive presence coding scheme.
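Cohen's kappa, reported above alongside accuracy, corrects observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The Python sketch below computes it from two label sequences. The function name and the toy labels are assumptions for illustration, not the study's data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: human codes versus classifier output on four messages.
human = [0, 0, 1, 1]
model = [0, 1, 1, 1]
kappa = cohen_kappa(human, model)  # p_o = 0.75, p_e = 0.5
```

In this toy case raw accuracy is 0.75 but kappa is only 0.5. That gap is why kappa is usually reported next to accuracy when class frequencies are uneven.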
Abstract:
Multi-document summarization, which addresses the problem of information overload, has been widely used in various real-world applications. Most existing approaches adopt a term-based representation for documents, which limits the performance of multi-document summarization systems. In this paper, we propose a novel pattern-based topic model (PBTMSum) for the task of multi-document summarization. PBTMSum combines pattern mining techniques with LDA topic modelling to generate discriminative and semantically rich representations for topics and documents, so that the most representative and non-redundant sentences can be selected to form a succinct and informative summary. Extensive experiments are conducted on data from the Document Understanding Conference (DUC) 2007. The results demonstrate the effectiveness and efficiency of our proposed approach.