9 resultados para 280205 Text Processing

em Aston University Research Archive


Relevância:

90.00% 90.00%

Publicador:

Resumo:

Ontology construction for any domain is a labour intensive and complex process. Any methodology that can reduce the cost and increase efficiency has the potential to make a major impact in the life sciences. This paper describes an experiment in ontology construction from text for the animal behaviour domain. Our objective was to see how much could be done in a simple and relatively rapid manner using a corpus of journal papers. We used a sequence of pre-existing text processing steps, and here describe the different choices made to clean the input, to derive a set of terms and to structure those terms in a number of hierarchies. We describe some of the challenges, especially that of focusing the ontology appropriately given a starting point of a heterogeneous corpus. Results - Using mainly automated techniques, we were able to construct an 18055 term ontology-like structure with 73% recall of animal behaviour terms, but a precision of only 26%. We were able to clean unwanted terms from the nascent ontology using lexico-syntactic patterns that tested the validity of term inclusion within the ontology. We used the same technique to test for subsumption relationships between the remaining terms to add structure to the initially broad and shallow structure we generated. All outputs are available at http://thirlmere.aston.ac.uk/~kiffer/animalbehaviour/ webcite. Conclusion - We present a systematic method for the initial steps of ontology or structured vocabulary construction for scientific domains that requires limited human effort and can make a contribution both to ontology learning and maintenance. The method is useful both for the exploration of a scientific domain and as a stepping stone towards formally rigourous ontologies. The filtering of recognised terms from a heterogeneous corpus to focus upon those that are the topic of the ontology is identified to be one of the main challenges for research in ontology learning.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Ontology construction for any domain is a labour intensive and complex process. Any methodology that can reduce the cost and increase efficiency has the potential to make a major impact in the life sciences. This paper describes an experiment in ontology construction from text for the Animal Behaviour domain. Our objective was to see how much could be done in a simple and rapid manner using a corpus of journal papers. We used a sequence of text processing steps, and describe the different choices made to clean the input, to derive a set of terms and to structure those terms in a hierarchy. We were able in a very short space of time to construct a 17000 term ontology with a high percentage of suitable terms. We describe some of the challenges, especially that of focusing the ontology appropriately given a starting point of a heterogeneous corpus.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The main argument of this paper is that Natural Language Processing (NLP) does, and will continue to, underlie the Semantic Web (SW), including its initial construction from unstructured sources like the World Wide Web (WWW), whether its advocates realise this or not. Chiefly, we argue, such NLP activity is the only way up to a defensible notion of meaning at conceptual levels (in the original SW diagram) based on lower level empirical computations over usage. Our aim is definitely not to claim logic-bad, NLP-good in any simple-minded way, but to argue that the SW will be a fascinating interaction of these two methodologies, again like the WWW (which has been basically a field for statistical NLP research) but with deeper content. Only NLP technologies (and chiefly information extraction) will be able to provide the requisite RDF knowledge stores for the SW from existing unstructured text databases in the WWW, and in the vast quantities needed. There is no alternative at this point, since a wholly or mostly hand-crafted SW is also unthinkable, as is a SW built from scratch and without reference to the WWW. We also assume that, whatever the limitations on current SW representational power we have drawn attention to here, the SW will continue to grow in a distributed manner so as to serve the needs of scientists, even if it is not perfect. The WWW has already shown how an imperfect artefact can become indispensable.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This thesis sets out to investigate the role of cohesion in the organisation and processing of three text types in English and Arabic. In other words, it attempts to shed some light on the descriptive and explanatory power of cohesion in different text typologies. To this effect, three text types, namely, literary fictional narrative, newspaper editorial and science were analysed to ascertain the intra- and inter-sentential trends in textual cohesion characteristic of each text type in each language. In addition, two small scale experiments which aimed at exploring the facilitatory effect of one cohesive device (i.e. lexical repetition) on the comprehension of three English text types by Arab learners were carried out. The first experiment examined this effect in an English science text; the second covered three English text types, i.e. fictional narrative, culturally-oriented and science. Some interesting and significant results have emerged from the textual analysis and the pilot studies. Most importantly, each text type tends to utilize the cohesive trends that are compatible with its readership, reader knowledge, reading style and pedagogical purpose. Whereas fictional narratives largely cohere through pronominal co-reference, editorials and science texts derive much cohesion from lexical repetition. As for cross-language differences English opts for economy in the use of cohesive devices, while Arabic largely coheres through the redundant effect created by the high frequency of most of those devices. Thus, cohesion is proved to be a variable rather than a homogeneous phenomenon which is dictated by text type among other factors. The results of the experiments suggest that lexical repetition does facilitate the comprehension of English texts by Arab learners. Fictional narratives are found to be easier to process and understand than expository texts. Consequently, cohesion can assist in the processing of text as it can in its creation.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Text classification is essential for narrowing down the number of documents relevant to a particular topic for further pursual, especially when searching through large biomedical databases. Protein-protein interactions are an example of such a topic with databases being devoted specifically to them. This paper proposed a semi-supervised learning algorithm via local learning with class priors (LL-CP) for biomedical text classification where unlabeled data points are classified in a vector space based on their proximity to labeled nodes. The algorithm has been evaluated on a corpus of biomedical documents to identify abstracts containing information about protein-protein interactions with promising results. Experimental results show that LL-CP outperforms the traditional semisupervised learning algorithms such as SVMand it also performs better than local learning without incorporating class priors.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

It is well established that speech, language and phonological skills are closely associated with literacy, and that children with a family risk of dyslexia (FRD) tend to show deficits in each of these areas in the preschool years. This paper examines what the relationships are between FRD and these skills, and whether deficits in speech, language and phonological processing fully account for the increased risk of dyslexia in children with FRD. One hundred and fifty-three 4-6-year-old children, 44 of whom had FRD, completed a battery of speech, language, phonology and literacy tasks. Word reading and spelling were retested 6 months later, and text reading accuracy and reading comprehension were tested 3 years later. The children with FRD were at increased risk of developing difficulties in reading accuracy, but not reading comprehension. Four groups were compared: good and poor readers with and without FRD. In most cases good readers outperformed poor readers regardless of family history, but there was an effect of family history on naming and nonword repetition regardless of literacy outcome, suggesting a role for speech production skills as an endophenotype of dyslexia. Phonological processing predicted spelling, while language predicted text reading accuracy and comprehension. FRD was a significant additional predictor of reading and spelling after controlling for speech production, language and phonological processing, suggesting that children with FRD show additional difficulties in literacy that cannot be fully explained in terms of their language and phonological skills. It is well established that speech, language and phonological skills are closely associated with literacy, and that children with a family risk of dyslexia (FRD) tend to show deficits in each of these areas in the preschool years. This paper examines what the relationships are between FRD and these skills, and whether deficits in speech, language and phonological processing fully account for the increased risk of dyslexia in children with FRD. One hundred and fifty-three 4-6-year-old children, 44 of whom had FRD, completed a battery of speech, language, phonology and literacy tasks. © 2014 John Wiley & Sons Ltd.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Most research in the area of emotion detection in written text focused on detecting explicit expressions of emotions in text. In this paper, we present a rule-based pipeline approach for detecting implicit emotions in written text without emotion-bearing words based on the OCC Model. We have evaluated our approach on three different datasets with five emotion categories. Our results show that the proposed approach outperforms the lexicon matching method consistently across all the three datasets by a large margin of 17–30% in F-measure and gives competitive performance compared to a supervised classifier. In particular, when dealing with formal text which follows grammatical rules strictly, our approach gives an average F-measure of 82.7% on “Happy”, “Angry-Disgust” and “Sad”, even outperforming the supervised baseline by nearly 17% in F-measure. Our preliminary results show the feasibility of the approach for the task of implicit emotion detection in written text.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

As one of the most popular deep learning models, convolution neural network (CNN) has achieved huge success in image information extraction. Traditionally CNN is trained by supervised learning method with labeled data and used as a classifier by adding a classification layer in the end. Its capability of extracting image features is largely limited due to the difficulty of setting up a large training dataset. In this paper, we propose a new unsupervised learning CNN model, which uses a so-called convolutional sparse auto-encoder (CSAE) algorithm pre-Train the CNN. Instead of using labeled natural images for CNN training, the CSAE algorithm can be used to train the CNN with unlabeled artificial images, which enables easy expansion of training data and unsupervised learning. The CSAE algorithm is especially designed for extracting complex features from specific objects such as Chinese characters. After the features of articficial images are extracted by the CSAE algorithm, the learned parameters are used to initialize the first CNN convolutional layer, and then the CNN model is fine-Trained by scene image patches with a linear classifier. The new CNN model is applied to Chinese scene text detection and is evaluated with a multilingual image dataset, which labels Chinese, English and numerals texts separately. More than 10% detection precision gain is observed over two CNN models.