996 results for Text authorship identification
Abstract:
2010 Mathematics Subject Classification: 68T50, 62H30, 62J05.
Abstract:
In this paper we explore the use of text-mining methods to identify the author of a text. We apply the support vector machine (SVM) to this problem. Because it can cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors, and it detected the target author in 60–80% of the cases. In a second experiment, we ignored nouns, verbs and adjectives and replaced them with grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVMs on full word forms was remarkably robust even if the author wrote about different topics.
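The idea of feeding an SVM the frequency vector of all words, without feature selection, can be sketched as follows. This is a minimal illustration, not the authors' setup: the abstract does not name the solver, kernel, or weighting scheme, so the linear SVM trained by full-batch subgradient descent, the function names, the hyperparameters, and the toy English texts are all assumptions.

```python
from collections import Counter

def word_frequencies(text):
    """Relative frequency of every word form in the text (no feature selection)."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def score(weights, bias, features):
    """Signed decision value: positive means 'target author'."""
    return sum(weights.get(w, 0.0) * v for w, v in features.items()) + bias

def train_linear_svm(samples, labels, epochs=200, lr=0.5, reg=0.01):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    labels are +1 (target author) or -1 (any other author).
    The solver and hyperparameters are illustrative assumptions.
    """
    weights, bias = {}, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = {}, 0.0
        for x, y in zip(samples, labels):
            if y * score(weights, bias, x) < 1:  # margin violated
                for w, v in x.items():
                    grad_w[w] = grad_w.get(w, 0.0) - y * v / n
                grad_b -= y / n
        for w in set(weights) | set(grad_w):
            old = weights.get(w, 0.0)
            weights[w] = old - lr * (reg * old + grad_w.get(w, 0.0))
        bias -= lr * grad_b
    return weights, bias

# Hypothetical toy corpus: +1 = target author, -1 = other authors.
corpus = [
    ("the sea the sea", 1),
    ("the sea and the ship", 1),
    ("a road a road", -1),
    ("a road and a car", -1),
]
samples = [word_frequencies(text) for text, _ in corpus]
labels = [label for _, label in corpus]
weights, bias = train_linear_svm(samples, labels)

target_score = score(weights, bias, word_frequencies("the ship and the sea"))
other_score = score(weights, bias, word_frequencies("a car and a road"))
print(target_score > 0, other_score < 0)
```

Sparse dictionaries stand in for the word-frequency vector, so the vocabulary can grow with the corpus without any pruning step.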
Abstract:
Although more than 100 genes associated with inherited retinal disease have been mapped to chromosomal locations, fewer than half of these genes have been cloned. This text includes identification and evaluation of candidate genes for three autosomal dominant forms of inherited retinal degeneration: atypical vitelliform macular dystrophy (VMD1), cone-rod dystrophy (CORD), and retinitis pigmentosa (RP). VMD1 is a disorder characterized by complete penetrance but extremely variable expressivity, and includes macular or peripheral retinal lesions and peripapillary abnormalities. In 1984, linkage was reported between VMD1 and soluble glutamate-pyruvate transaminase (GPT); however, placement of GPT at 8q24 on linkage maps had been debated, and VMD1 did not show linkage to microsatellite markers in that region. This study excluded linkage between the loci by cloning GPT, identifying the nucleotide substitution associated with the GPT isozymes, and assaying VMD1 family samples with an RFLP designed to detect the substitution. In addition, linkage of VMD1 to the known dominant macular degeneration loci was excluded. CORD is characterized by early onset of color-vision deficiency and decreased visual acuity; however, this retinal degeneration progresses to no light perception, severe macular lesions, and "bone-spicule" accumulations in the peripheral retina. In this study, the disorder in a large Texan family was mapped to the CORD2 locus of 19q13, and a mutation in the retina/pineal-specific cone-rod homeobox gene (CRX) was identified as the disease cause. In addition, mutations in CRX were associated with significantly different retinal disease phenotypes, including retinitis pigmentosa and Leber congenital amaurosis. Many of the mutations leading to inherited retinal disorders have been identified in genes like CRX, which are expressed predominantly in the retina and pineal gland. Therefore, a combination of database analysis and laboratory investigation was used to identify 26 novel retina/pineal-specific expressed sequence tag (EST) clusters as candidate genes for inherited retinal disorders. Eight of these genes were mapped into the candidate regions of inherited retinal degeneration loci. Two of the eight clusters mapped into the retinitis pigmentosa RP13 candidate region of 17p13, and both were determined to represent a single gene that is highly expressed in photoreceptors. This gene, the Ah receptor-interacting protein-like 1 (AIPL1), was cloned, characterized, and screened for mutations in RP13 patient DNA samples.
Abstract:
One of the biggest challenges in speech synthesis is the production of contextually appropriate, natural-sounding synthetic voices. This means that a text-to-speech (TTS) system must be able to analyze a text beyond the sentence limits in order to select, or even modulate, the speaking style according to a broader context. Our current architecture is based on a two-step approach: text genre identification, followed by speaking style synthesis according to the detected discourse genre. For the final implementation, a set of four genres and their corresponding speaking styles was considered: broadcast news, live sport commentaries, interviews and political speeches. In the final TTS evaluation, the four speaking styles were transplanted to the neutral voices of other speakers not included in the training database. When the transplanted styles were compared to the neutral voices, transplantation was significantly preferred and the similarity to the target speaker was as high as 78%.
Abstract:
Choosing among industrial development options and allocating research funds accordingly are becoming increasingly difficult because of rising R&D costs and pressure for shorter development periods. Forecasts of research progress are based on an analysis of publication activity in the field of interest and on the dynamics of its change. Moreover, the allocation of funds is hindered by exponential growth in the number of publications and patents. Thematic clusters are becoming harder to identify, and their evolution harder to follow. The existing approaches to structuring a research field and identifying its development are very limited. They do not identify the thematic clusters with adequate precision, and the identified trends are often ambiguous. Therefore, there is a clear need for methods and tools that are able to identify developing fields of research. The main objective of this thesis is to develop tools and methods that help identify promising research topics in the field of separation processes. Two structuring methods and three approaches for identifying development trends have been proposed. The proposed methods have been applied to the analysis of research on distillation and filtration. The results show that the developed methods are universal and could be used to study various fields of research. The identified thematic clusters and the forecasted trends of their development have been confirmed in almost all tested cases, which demonstrates the universality of the proposed methods. The results allow identification of fast-growing scientific fields as well as topics characterized by stagnant or diminishing research activity.
Abstract:
Author identification is the problem of identifying, from a given set of authors, the author of an anonymous text or of a text whose authorship is in doubt. The works of different authors are strongly distinguished by quantifiable features of the text. This paper deals with attempts to identify the most likely author of a text in Malayalam from a list of authors. Malayalam is a Dravidian language with an agglutinative nature, and few successful tools have been developed to extract syntactic and semantic features of texts in this language. We have done a detailed study of the various stylometric features that can be used to form an author's profile and have found that the frequencies of word collocations can be used to clearly distinguish an author in a highly inflectional language such as Malayalam. In our work we try to extract the word-level and character-level features present in the text to characterize the style of an author. Our first step was to create a profile for each of the candidate authors whose texts were available to us, first from word n-gram frequencies and then from variable-length character n-gram frequencies. The profiles thus formed for the set of authors under consideration were then compared with the features extracted from the anonymous text to suggest the most likely author.
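The profile-and-compare step described above can be sketched roughly as follows. The abstract does not say how profiles are compared, so the cosine similarity used here is an assumption, as are the function names, the n-gram lengths (2 to 3), and the English placeholder texts. The sketch counts n-grams over Unicode code points; for Malayalam script, grouping by grapheme cluster may be more appropriate.

```python
from collections import Counter
from math import sqrt

def char_ngram_profile(text, n_min=2, n_max=3):
    """Relative frequencies of variable-length character n-grams."""
    text = " ".join(text.lower().split())  # normalise whitespace
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity of two sparse profiles (assumed comparison measure)."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def most_likely_author(author_texts, anonymous_text):
    """Return the best-matching author and the per-author similarity scores."""
    anon = char_ngram_profile(anonymous_text)
    scores = {
        author: cosine_similarity(char_ngram_profile(text), anon)
        for author, text in author_texts.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical placeholder texts standing in for each candidate author's corpus.
author_texts = {
    "author_a": "the quiet river runs past the quiet town",
    "author_b": "markets rallied sharply as traders bought stocks",
}
best, scores = most_likely_author(author_texts, "the quiet town sleeps by the river")
print(best)
```

Word n-gram profiles could be built the same way by iterating over a list of tokens instead of a string of characters.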
Abstract:
In this paper, we describe new results and improvements to a language identification (LID) system based on PPRLM previously introduced in [1] and [2]. In this case, we use as parallel phone recognizers the ones provided by the Brno University of Technology for the Czech, Hungarian, and Russian languages, and instead of traditional n-gram language models we use a language model created from a ranking of the most frequent and discriminative n-grams. In this language model approach, the distance between the ranking for the input sentence and the ranking for each language is computed, based on the difference in relative positions for each n-gram. This approach can reliably model longer-span information than traditional language models, yielding more reliable estimates. We also describe the modifications that we have introduced over time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clustering technique for the ranking scores. Results show that this technique provides a 12.9% relative improvement over PPRLM. Finally, we also describe results where the traditional PPRLM and our ranking technique are combined.
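The ranking comparison can be illustrated with a small sketch. It assumes a single phone recognizer, ranks n-grams by raw frequency only (the paper also uses discriminative formulas that the abstract does not specify), and charges a fixed penalty equal to the template size for n-grams missing from a language's ranking. The function names, the penalty, and the toy phone labels are all hypothetical.

```python
from collections import Counter

def collapse_repeats(phones):
    """Suppress repeated consecutive phones, e.g. a a b -> a b."""
    out = []
    for p in phones:
        if not out or out[-1] != p:
            out.append(p)
    return out

def ngram_ranking(phones, n=2, size=100):
    """Map each of the `size` most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))
    ordered = sorted(counts, key=lambda g: (-counts[g], g))  # ties broken alphabetically
    return {gram: rank for rank, gram in enumerate(ordered[:size])}

def ranking_distance(input_rank, lang_rank, size=100):
    """Mean difference in rank position; n-grams missing from the template cost `size`."""
    total = 0
    for gram, rank in input_rank.items():
        total += abs(rank - lang_rank[gram]) if gram in lang_rank else size
    return total / max(len(input_rank), 1)

def identify(phones, templates, n=2, size=100):
    """Pick the language whose template ranking is closest to the input ranking."""
    ranking = ngram_ranking(collapse_repeats(phones), n, size)
    return min(templates, key=lambda lang: ranking_distance(ranking, templates[lang], size))

# Hypothetical toy phone sequences standing in for recognizer output per language.
templates = {
    "cz": ngram_ranking(["a", "b", "a", "b", "c"]),
    "hu": ngram_ranking(["x", "y", "x", "y", "z"]),
}
detected = identify(["a", "a", "b", "a", "b"], templates)
print(detected)
```

In a full PPRLM-style system, one such distance would be computed per phone recognizer and per language, and the scores would then be combined.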
Abstract:
National Highway Traffic Safety Administration, Washington, D.C.
Abstract:
Current debate within forensic authorship analysis has tended to polarise those who argue that analysis methods should reflect a strong cognitive theory of idiolect and others who see less of a need to look behind the stylistic variation of the texts they are examining. This chapter examines theories of idiolect and asks how useful or necessary they are to the practice of forensic authorship analysis. Taking a specific text messaging case, the chapter demonstrates that methodologically rigorous, theoretically informed authorship analysis need not appeal to cognitive theories of idiolect in order to be valid. By considering text messaging forensics, lessons will be drawn which can contribute to wider debates on the role of theories of idiolect in forensic casework.
Abstract:
This chapter introduces Native Language Identification (NLID) and considers its casework applications with regard to authorship analysis of online material. It presents findings from research identifying which linguistic features were the best indicators of native (L1) Persian speakers blogging in English, and analyses how well these features distinguish between native influences from languages that are linguistically and culturally related. The first section outlines the area of Native Language Identification and demonstrates its potential for application through a discussion of relevant case history. The next section discusses the development of a methodology for identifying influence from L1 Persian in an anonymous blog author, and presents findings. The third section discusses the application of these features to casework situations, shows how the identified features can form an easily applicable model, and demonstrates the application of this model to casework. The research presented in this chapter can be considered a case study for the wider potential application of NLID.
Abstract:
This documentary study analyzes the conceptions of alfabetização (initial literacy instruction), reading, and writing that underlie Provinha Brasil in the period 2008-2012, and the context in which this assessment program is produced. It draws on the Bakhtinian framework and on Gontijo's (2008, 2013) concept of alfabetização. Taking the Provinha as a discourse genre, it discusses the preceding links within the context in which this assessment is produced, the authorship of the program, and its main addressees. It finds that the Provinha was created in response to demands for literacy assessment from international organizations such as the World Bank and the United Nations Educational, Scientific and Cultural Organization (UNESCO). The assessment is prepared by the Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira (Inep), as the body that coordinates assessments in Brazil, in collaboration with researchers from universities and civil society organizations, in order to demonstrate scientific reliability combined with democratic participation in the production process. Its main addressees are managers in Secretariats of Education and teachers. The former are expected to join the assessment program and take administrative measures to put it into operation in their school networks. Teachers have the central role of following the guidance in the materials and reorganizing their practice to improve the children's performance on the test. The children, in turn, are disregarded as subjects with something to say, and a homogenizing discourse about their development is legitimized. Based on the tests administered and on the reference matrices and their axes, the study analyzes how the theoretical distinction between alfabetização and letramento (literacy as social practice) takes concrete form in the organization of the tests. Alfabetização, understood as appropriation of the writing system, is assessed in the first axis of the test mainly as the identification of smaller units of the language, such as letters, syllables, and phonemes. Reading skills, linked to letramento as conceived in the program's assumptions, are measured sometimes as the decoding of decontextualized words and sentences, and sometimes as the grasping of a predetermined meaning of the text. Writing is assessed only in 2008, through items that asked for the encoding of words and sentences dictated by the test administrator. In this way, Provinha Brasil contributes to stripping away the political and transformative potential of learning the mother tongue in Brazil.
Abstract:
The genus Heliconia has not been studied much, and the number of existing species in this genus is still uncertain. It is known that this number lies between 150 and 250 species. In Brazil, about 40 species are native and known by many different names. The objective of this paper was to characterize morphometrically the chromosomes of Heliconia bihai (L.) L. and to identify their active nucleolus organizer regions (NORs) by Ag-NOR banding. Root meristems were submitted to blocking treatment in an amiprophos-methyl (APM) solution and fixed in methanol-acetic acid solution for at least 24 hours. The meristems were washed in distilled water and submitted to enzymatic digestion with pectinase. The slides were prepared by dissociation of the root meristem, then air-dried and also dried on a hot plate at 50°C. Subsequently, some slides were stained with 5% Giemsa for karyotype construction and others were treated with a 50% silver nitrate (AgNO3) solution for Ag-NOR banding. The species H. bihai has 2n = 22 chromosomes, with 4 pairs of submetacentric chromosomes and 7 pairs of metacentric chromosomes, classified as medium to short (3.96 to 0.67 μm), with active NORs present in pairs 1 and 2 and interphase cells with 2 nucleoli. These are the features of a diploid species.