362 resultados para Natural Language


Relevância:

60.00% 60.00%

Publicador:

Resumo:

Product reviews are the foremost source of information for customers and manufacturers to help them make appropriate purchasing and production decisions. Natural language data is typically very sparse; the most common words are those that do not carry a lot of semantic content, and occurrences of any particular content-bearing word are rare, while co-occurrences of these words are rarer. Mining product aspects, along with corresponding opinions, is essential for Aspect-Based Opinion Mining (ABOM) as a result of the e-commerce revolution. Therefore, the need for automatic mining of reviews has reached a peak. In this work, we deal with ABOM as sequence labelling problem and propose a supervised extraction method to identify product aspects and corresponding opinions. We use Conditional Random Fields (CRFs) to solve the extraction problem and propose a feature function to enhance accuracy. The proposed method is evaluated using two different datasets. We also evaluate the effectiveness of feature function and the optimisation through multiple experiments.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Objective: To illustrate a new method for simplifying patient recruitment for advanced prostate cancer clinical trials using natural language processing techniques. Background: The identification of eligible participants for clinical trials is a critical factor to increase patient recruitment rates and an important issue for discovery of new treatment interventions. The current practice of identifying eligible participants is highly constrained due to manual processing of disparate sources of unstructured patient data. Informatics-based approaches can simplify the complex task of evaluating patient’s eligibility for clinical trials. We show that an ontology-based approach can address the challenge of matching patients to suitable clinical trials. Methods: The free-text descriptions of clinical trial criteria as well as patient data were analysed. A set of common inclusion and exclusion criteria was identified through consultations with expert clinical trial coordinators. A research prototype was developed using Unstructured Information Management Architecture (UIMA) that identified SNOMED CT concepts in the patient data and clinical trial description. The SNOMED CT concepts model the standard clinical terminology that can be used to represent and evaluate patient’s inclusion/exclusion criteria for the clinical trial. Results: Our experimental research prototype describes a semi-automated method for filtering patient records using common clinical trial criteria. Our method simplified the patient recruitment process. The discussion with clinical trial coordinators showed that the efficiency in patient recruitment process measured in terms of information processing time could be improved by 25%. Conclusion: An UIMA-based approach can resolve complexities in patient recruitment for advanced prostate cancer clinical trials.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Debates on gene patents have necessitated the analysis of patents that disclose and reference human sequences. In this study, we built an automated classifier that assigns sequences to one of nine predefined categories according to their functional roles in patent claims by applying natural language processing and supervised learning techniques. To improve its correctness, we experimented with various feature mappings, resulting in the maximal accuracy of 79%.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Within online learning communities, receiving timely and meaningful insights into the quality of learning activities is an important part of an effective educational experience. Commonly adopted methods – such as the Community of Inquiry framework – rely on manual coding of online discussion transcripts, which is a costly and time consuming process. There are several efforts underway to enable the automated classification of online discussion messages using supervised machine learning, which would enable the real-time analysis of interactions occurring within online learning communities. This paper investigates the importance of incorporating features that utilise the structure of on-line discussions for the classification of "cognitive presence" – the central dimension of the Community of Inquiry framework focusing on the quality of students' critical thinking within online learning communities. We implemented a Conditional Random Field classification solution, which incorporates structural features that may be useful in increasing classification performance over other implementations. Our approach leads to an improvement in classification accuracy of 5.8% over current existing techniques when tested on the same dataset, with a precision and recall of 0.630 and 0.504 respectively.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Recent advances in neural language models have contributed new methods for learning distributed vector representations of words (also called word embeddings). Two such methods are the continuous bag-of-words model and the skipgram model. These methods have been shown to produce embeddings that capture higher order relationships between words that are highly effective in natural language processing tasks involving the use of word similarity and word analogy. Despite these promising results, there has been little analysis of the use of these word embeddings for retrieval. Motivated by these observations, in this paper, we set out to determine how these word embeddings can be used within a retrieval model and what the benefit might be. To this aim, we use neural word embeddings within the well known translation language model for information retrieval. This language model captures implicit semantic relations between the words in queries and those in relevant documents, thus producing more accurate estimations of document relevance. The word embeddings used to estimate neural language models produce translations that differ from previous translation language model approaches; differences that deliver improvements in retrieval effectiveness. The models are robust to choices made in building word embeddings and, even more so, our results show that embeddings do not even need to be produced from the same corpus being used for retrieval.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Objective Death certificates provide an invaluable source for cancer mortality statistics; however, this value can only be realised if accurate, quantitative data can be extracted from certificates – an aim hampered by both the volume and variable nature of certificates written in natural language. This paper proposes an automatic classification system for identifying cancer related causes of death from death certificates. Methods Detailed features, including terms, n-grams and SNOMED CT concepts were extracted from a collection of 447,336 death certificates. These features were used to train Support Vector Machine classifiers (one classifier for each cancer type). The classifiers were deployed in a cascaded architecture: the first level identified the presence of cancer (i.e., binary cancer/nocancer) and the second level identified the type of cancer (according to the ICD-10 classification system). A held-out test set was used to evaluate the effectiveness of the classifiers according to precision, recall and F-measure. In addition, detailed feature analysis was performed to reveal the characteristics of a successful cancer classification model. Results The system was highly effective at identifying cancer as the underlying cause of death (F-measure 0.94). The system was also effective at determining the type of cancer for common cancers (F-measure 0.7). Rare cancers, for which there was little training data, were difficult to classify accurately (F-measure 0.12). Factors influencing performance were the amount of training data and certain ambiguous cancers (e.g., those in the stomach region). The feature analysis revealed a combination of features were important for cancer type classification, with SNOMED CT concept and oncology specific morphology features proving the most valuable. Conclusion The system proposed in this study provides automatic identification and characterisation of cancers from large collections of free-text death certificates. This allows organisations such as Cancer Registries to monitor and report on cancer mortality in a timely and accurate manner. In addition, the methods and findings are generally applicable beyond cancer classification and to other sources of medical text besides death certificates.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This article presents a method for checking the conformance between an event log capturing the actual execution of a business process, and a model capturing its expected or normative execution. Given a business process model and an event log, the method returns a set of statements in natural language describing the behavior allowed by the process model but not observed in the log and vice versa. The method relies on a unified representation of process models and event logs based on a well-known model of concurrency, namely event structures. Specifically, the problem of conformance checking is approached by folding the input event log into an event structure, unfolding the process model into another event structure, and comparing the two event structures via an error-correcting synchronized product. Each behavioral difference detected in the synchronized product is then verbalized as a natural language statement. An empirical evaluation shows that the proposed method scales up to real-life datasets while producing more concise and higher-level difference descriptions than state-of-the-art conformance checking methods.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In this paper, we present the results of an exploratory study that examined the problem of automating content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and 0.63 Cohen's kappa, which is significantly higher than values reported in the previous studies. Besides improvement in classification accuracy, the developed system is also less sensitive to overfitting as it uses only 205 classification features, which is around 100 times less features than in similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence that gives an additional insights into the nature of cognitive presence learning cycle. Overall, our results show great potential of the proposed approach, with an added benefit of providing further characterization of the cognitive presence coding scheme.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper presents 'vSpeak', the first initiative taken in Pakistan for ICT enabled conversion of dynamic Sign Urdu gestures into natural language sentences. To realize this, vSpeak has adopted a novel approach for feature extraction using edge detection and image compression which gives input to the Artificial Neural Network that recognizes the gesture. This technique caters for the blurred images as well. The training and testing is currently being performed on a dataset of 200 patterns of 20 words from Sign Urdu with target accuracy of 90% and above.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper presents an initiative taken in Pakistan for the rehabilitation of the deaf community, enabled by the use of technology. iPSL is a system that primarily aims at facilitating communication between the hearing and the deaf community in Pakistan. There is a twofold approach to achieve this. The first dimension is to implement a system that can translate signs made by deaf into natural language sentences. The second dimension is to implement tools that enable hearing people to understand and learn sign language by converting natural language sentences into sign language. This paper presents the progress made in the project so far in terms of design, implementation and evaluation. © ACM 2009.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Multi-document summarization addressing the problem of information overload has been widely utilized in the various real-world applications. Most of existing approaches adopt term-based representation for documents which limit the performance of multi-document summarization systems. In this paper, we proposed a novel pattern-based topic model (PBTMSum) for the task of the multi-document summarization. PBTMSum combining pattern mining techniques with LDA topic modelling could generate discriminative and semantic rich representations for topics and documents so that the most representative and non-redundant sentences can be selected to form a succinct and informative summary. Extensive experiments are conducted on the data of document understanding conference (DUC) 2007. The results prove the effectiveness and efficiency of our proposed approach.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The rapid increase in the number of text documents available on the Internet has created pressure to use effective cleaning techniques. Cleaning techniques are needed for converting these documents to structured documents. Text cleaning techniques are one of the key mechanisms in typical text mining application frameworks. In this paper, we explore the role of text cleaning in the 20 newsgroups dataset, and report on experimental results.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper presents an overview of the 6th ALTA shared task that ran in 2015. The task was to identify in English texts all the potential cognates from the perspective of the French language. In other words, identify all the words in the English text that would acceptably translate into a similar word in French. We present the motivations for the task, the description of the data and the results of the 4 participating teams. We discuss the results against a baseline and prior work.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Because aesthetics can have a profound effect upon the human relationship to the non-human environment the importance of aesthetics to ecologically sustainable designed landscapes has been acknowledged. However, in recognition that the physical forms of designed landscapes are an expression of the social values of the time, some design professionals have called for a new aesthetic ― one that reflects these current ecological concerns. To address this, some authors have suggested various theoretical design frameworks upon which such an aesthetic could be based. Within these frameworks there is an underlying theme that the patterns and processes of natural systems have the potential to form a new aesthetic for landscape design —an aesthetic based on fractal rather than Euclidean geometry. Perry, Reeves and Sim (2008) have shown that it is possible to differentiate between different landscape forms by fractal analysis. However, this research also shows that individual scenes from within very different landscape forms can possess the same fractal properties. Early data, revealed by transforming landscape images from the spatial to the frequency domain, using the fast Fourier transform, suggest that fractal patterning can have a significant effect within the landscape. In fact, it may be argued that any landscape design that includes living processes will include some design element whose ultimate form can only be expressed through the mathematics of fractal geometry. This paper will present ongoing research into the potential role of fractal geometry as a basis for a new form language – a language that may articulate an aesthetic for landscape design that echoes our ecological awakening.