962 results for Text authorship identification


Relevance:

100.00%

Publisher:

Abstract:

2010 Mathematics Subject Classification: 68T50, 62H30, 62J05.

Relevance:

100.00%

Publisher:

Abstract:

In this paper we explore the use of text-mining methods for the identification of the author of a text. We apply the support vector machine (SVM) to this problem: because it can cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors, and it detected the target author in 60–80% of the cases. In a second experiment, we ignored nouns, verbs and adjectives and replaced them by grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVMs on full word forms was remarkably robust even if the author wrote about different topics.

Relevance:

90.00%

Publisher:

Abstract:

This paper presents a salience-based technique for the annotation of directly quoted speech from fiction text. In particular, this paper determines to what extent a naïve (without the use of complex machine learning or knowledge-based techniques) scoring technique can be used for the identification of the speaker of speech quotes. The presented technique makes use of a scoring technique, similar to that commonly found in knowledge-poor anaphora resolution research, as well as a set of hand-coded rules for the final identification of the speaker of each quote in the text. Speaker identification is shown to be achieved using three tasks: the identification of a speech-verb associated with a quote with a recall of 94.41%; the identification of the actor associated with a quote with a recall of 88.22%; and the selection of a speaker with an accuracy of 79.40%.

Relevance:

90.00%

Publisher:

Abstract:

Although more than 100 genes associated with inherited retinal disease have been mapped to chromosomal locations, fewer than half of these genes have been cloned. This text includes identification and evaluation of candidate genes for three autosomal dominant forms of inherited retinal degeneration: atypical vitelliform macular dystrophy (VMD1), cone-rod dystrophy (CORD), and retinitis pigmentosa (RP).

VMD1 is a disorder characterized by complete penetrance but extremely variable expressivity, and includes macular or peripheral retinal lesions and peripapillary abnormalities. In 1984, linkage was reported between VMD1 and soluble glutamate-pyruvate transaminase (GPT); however, placement of GPT to 8q24 on linkage maps had been debated, and VMD1 did not show linkage to microsatellite markers in that region. This study excluded linkage between the loci by cloning GPT, identifying the nucleotide substitution associated with the GPT isozymes, and assaying VMD1 family samples with an RFLP designed to detect the substitution. In addition, linkage of VMD1 to the known dominant macular degeneration loci was excluded.

CORD is characterized by early onset of color-vision deficiency and decreased visual acuity. However, this retinal degeneration progresses to no light perception, severe macular lesion, and “bone-spicule” accumulations in the peripheral retina. In this study, the disorder in a large Texan family was mapped to the CORD2 locus of 19q13, and a mutation in the retina/pineal-specific cone-rod homeobox gene (CRX) was identified as the disease cause. In addition, mutations in CRX were associated with significantly different retinal disease phenotypes, including retinitis pigmentosa and Leber congenital amaurosis.

Many of the mutations leading to inherited retinal disorders have been identified in genes like CRX, which are expressed predominantly in the retina and pineal gland. Therefore, a combination of database analysis and laboratory investigation was used to identify 26 novel retina/pineal-specific expressed sequence tag (EST) clusters as candidate genes for inherited retinal disorders. Eight of these genes were mapped into the candidate regions of inherited retinal degeneration loci.

Two of the eight clusters mapped into the retinitis pigmentosa RP13 candidate region of 17p13, and were both determined to represent a single gene that is highly expressed in photoreceptors. This gene, the Ah receptor-interacting like protein-1 (AIPL1), was cloned, characterized, and screened for mutations in RP13 patient DNA samples.

Relevance:

80.00%

Publisher:

Abstract:

This essay investigates the concept of the author-illustrator by drawing on two influential essays – ‘Death of the Author’ by Roland Barthes and ‘What is an Author?’ by Michel Foucault. By engaging with the key points of debate that emerge from these positions, this essay argues that the notion of the author-illustrator is part of a wider discursive field that is embedded in a complex, commodified, multimedia public sphere where the author is paradoxically reinscribed and erased. This environment is changing the nature of the text, authorship, and reader-text interaction, but until now the concept of the author-illustrator has been largely absent from these discussions.

Relevance:

80.00%

Publisher:

Abstract:

One of the biggest challenges in speech synthesis is the production of contextually appropriate, natural-sounding synthetic voices. This means that a Text-To-Speech (TTS) system must be able to analyze a text beyond the sentence limits in order to select, or even modulate, the speaking style according to a broader context. Our current architecture is based on a two-step approach: text genre identification, followed by speaking style synthesis according to the detected discourse genre. For the final implementation, a set of four genres and their corresponding speaking styles were considered: broadcast news, live sport commentaries, interviews and political speeches. In the final TTS evaluation, the four speaking styles were transplanted to the neutral voices of other speakers not included in the training database. When the transplanted styles were compared to the neutral voices, transplantation was significantly preferred and the similarity to the target speaker was as high as 78%.

Relevance:

40.00%

Publisher:

Abstract:

This project is a step forward in the study of text mining where enhanced text representation with semantic information plays a significant role. It develops effective methods of entity-oriented retrieval, semantic relation identification and text clustering utilizing semantically annotated data. These methods are based on enriched text representation generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several state-of-the-art benchmarking methods on real-life datasets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles the clustering of documents with multiple feature spaces.

Relevance:

40.00%

Publisher:

Abstract:

Author identification is the problem of identifying, from a given set of authors, the author of an anonymous text or of a text whose authorship is in doubt. The works of different authors are strongly distinguished by quantifiable features of the text. This paper deals with attempts to identify the most likely author of a text in Malayalam from a list of authors. Malayalam is a Dravidian language with an agglutinative nature, and few successful tools have been developed to extract syntactic and semantic features of texts in this language. We have done a detailed study of the various stylometric features that can be used to form an author's profile and have found that the frequencies of word collocations can be used to clearly distinguish an author in a highly inflectional language such as Malayalam. In our work we try to extract the word-level and character-level features present in the text for characterizing the style of an author. Our first step was to create a profile for each of the candidate authors whose texts were available to us, first from word n-gram frequencies and then by using variable-length character n-gram frequencies. The profiles of the authors under consideration were then compared with the features extracted from the anonymous text to suggest the most likely author.
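The profile-comparison step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how profiles are compared, so the cosine similarity used here, the n-gram length range of 1 to 3, and all function names are assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from math import sqrt


def char_ngram_profile(text, n_min=1, n_max=3):
    """Relative frequencies of all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity of two sparse frequency profiles (assumed measure)."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def most_likely_author(known_texts, anonymous_text):
    """Return the candidate whose profile is closest to the anonymous text."""
    anon = char_ngram_profile(anonymous_text)
    scores = {
        author: cosine_similarity(char_ngram_profile(text), anon)
        for author, text in known_texts.items()
    }
    return max(scores, key=scores.get), scores
```

Because the profiles are built from raw Unicode characters, the same sketch applies to Malayalam text without a tokenizer; a word n-gram variant would split the text on whitespace first.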

Relevance:

40.00%

Publisher:

Abstract:

In this paper, we describe new results and improvements to a language identification (LID) system based on PPRLM previously introduced in [1] and [2]. In this case, we use as parallel phone recognizers the ones provided by the Brno University of Technology for the Czech, Hungarian, and Russian languages, and instead of using traditional n-gram language models we use a language model that is created using a ranking of the most frequent and discriminative n-grams. In this language model approach, the distance between the ranking for the input sentence and the ranking for each language is computed, based on the difference in relative positions for each n-gram. This approach is able to reliably model longer-span information than traditional language models, obtaining more reliable estimations. We also describe the modifications that we have been introducing over time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clustering technique for the ranking scores. Results show that this technique provides a 12.9% relative improvement over PPRLM. Finally, we also describe results where the traditional PPRLM and our ranking technique are combined.
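The ranking distance mentioned in this abstract can be sketched as follows. This is a simplified illustration that ranks n-grams by frequency only and uses the classic "out-of-place" sum of rank differences; the discriminative formulas, template sizes, and clustering from the abstract are not reproduced, and every name and default below is hypothetical.

```python
from collections import Counter


def ngram_ranking(tokens, n=2, top_k=100):
    """Map each of the top_k most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Sort by descending count; ties are broken by the n-gram itself for determinism.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place_distance(input_ranking, language_ranking, max_penalty=None):
    """Sum of rank differences; n-grams missing from the template get max_penalty."""
    if max_penalty is None:
        max_penalty = len(language_ranking)  # assumed default penalty
    return sum(
        abs(rank - language_ranking[gram]) if gram in language_ranking else max_penalty
        for gram, rank in input_ranking.items()
    )


def identify_language(input_tokens, language_rankings, n=2):
    """Pick the language whose ranking template is closest to the input ranking."""
    input_ranking = ngram_ranking(input_tokens, n=n)
    return min(
        language_rankings,
        key=lambda lang: out_of_place_distance(input_ranking, language_rankings[lang]),
    )
```

In a PPRLM-style setup the tokens would be phone labels emitted by a recognizer, with one set of ranking templates per recognizer; here any list of hashable, comparable tokens works.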

Relevance:

40.00%

Publisher:

Abstract:

National Highway Traffic Safety Administration, Washington, D.C.

Relevance:

40.00%

Publisher:

Abstract:

Current debate within forensic authorship analysis has tended to polarise those who argue that analysis methods should reflect a strong cognitive theory of idiolect and others who see less of a need to look behind the stylistic variation of the texts they are examining. This chapter examines theories of idiolect and asks how useful or necessary they are to the practice of forensic authorship analysis. Taking a specific text messaging case, the chapter demonstrates that methodologically rigorous, theoretically informed authorship analysis need not appeal to cognitive theories of idiolect in order to be valid. By considering text messaging forensics, lessons will be drawn which can contribute to wider debates on the role of theories of idiolect in forensic casework.

Relevance:

40.00%

Publisher:

Abstract:

This chapter introduces Native Language Identification (NLID) and considers the casework applications with regard to authorship analysis of online material. It presents findings from research identifying which linguistic features were the best indicators of native (L1) Persian speakers blogging in English, and analyses how these features cope at distinguishing between native influences from languages that are linguistically and culturally related. The first chapter section outlines the area of Native Language Identification, and demonstrates its potential for application through a discussion of relevant case history. The next section discusses a development of methodology for identifying influence from L1 Persian in an anonymous blog author, and presents findings. The third part discusses the application of these features to casework situations as well as how the features identified can form an easily applicable model and demonstrates the application of this to casework. The research presented in this chapter can be considered a case study for the wider potential application of NLID.

Relevance:

30.00%

Publisher:

Abstract:

The problem of determining the script and language of a document image has a number of important applications in the field of document analysis, such as indexing and sorting of large collections of such images, or as a precursor to optical character recognition (OCR). In this paper, we investigate the use of texture as a tool for determining the script of a document image, based on the observation that text has a distinct visual texture. An experimental evaluation of a number of commonly used texture features is conducted on a newly created script database, providing a qualitative measure of which features are most appropriate for this task. Strategies for improving classification results in situations with limited training data and multiple font types are also proposed.

Relevance:

30.00%

Publisher:

Abstract:

Objective: To quantify the extent to which alcohol-related injuries are adequately identified in hospitalisation data using ICD-10-AM codes indicative of alcohol involvement. Method: A random sample of 4373 injury-related hospital separations from 1 July 2002 to 30 June 2004 was obtained from a stratified random sample of 50 hospitals across 4 states in Australia. From this sample, cases were identified as involving alcohol if they contained an ICD-10-AM diagnosis or external cause code referring to alcohol, or if the text description extracted from the medical records mentioned alcohol involvement. Results: Overall, identification of alcohol involvement using ICD codes detected 38% of the alcohol-related sample, whilst almost 94% of alcohol-related cases were identified through a search of the text extracted from the medical records. The resultant estimate of alcohol involvement in injury-related hospitalisations in this sample was 10%. Emergency department records were the most likely to identify whether the injury was alcohol-related, with almost three-quarters of alcohol-related cases mentioning alcohol in the text abstracted from these records. Conclusions and Implications: The current best estimates of the frequency of hospital admissions where alcohol is involved prior to the injury underestimate the burden by around 62%. This is a substantial underestimate that has major implications for public policy, and highlights the need for further work on improving the quality and completeness of routine administrative data sources for identification of alcohol-related injuries.