119 results for Corpus inscriptionum graecarum.
Abstract:
In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information", or NGMI, which is used to segment Chinese documents into n-character words or phrases, using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort required to prepare and maintain manually segmented Chinese text for training purposes and to maintain ever-expanding lexicons. Previously, mutual information was used to achieve automated segmentation into 2-character words; NGMI extends this approach to handle longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results.
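As a rough illustration of the mutual-information idea underlying this family of methods (not the authors' NGMI formulation, which extends to n-character units), a minimal Python sketch scoring candidate 2-character words by pointwise mutual information from corpus counts:

```python
# Illustrative sketch only: score character bigrams by pointwise mutual
# information estimated from corpus counts.
import math
from collections import Counter

def char_bigram_pmi(corpus_lines):
    """Return a dict mapping each character bigram to its PMI score."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus_lines:
        unigrams.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    pmi = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        pmi[bg] = math.log(p_xy / (p_x * p_y))
    return pmi
```

High-PMI bigrams are candidate words; extending the same statistic to longer n-grams is the step the NGMI approach addresses.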
Abstract:
The problem of impostor dataset selection for GMM-based speaker verification is addressed through the recently proposed data-driven background dataset refinement technique. The SVM-based refinement technique selects from a candidate impostor dataset those examples that are most frequently selected as support vectors when training a set of SVMs on a development corpus. This study demonstrates the versatility of dataset refinement in the task of selecting suitable impostor datasets for use in GMM-based speaker verification. The use of refined Z- and T-norm datasets provided performance gains of 15% in EER in the NIST 2006 SRE over the use of heuristically selected datasets. The refined datasets were shown to generalise well to the unseen data of the NIST 2008 SRE.
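A hypothetical sketch of the support-vector-frequency idea described above; the feature extraction, kernel, and number of retained examples are assumptions, not the paper's configuration:

```python
# Hypothetical sketch: rank candidate impostor examples by how often they are
# selected as support vectors across SVMs trained for development speakers.
import numpy as np
from collections import Counter
from sklearn.svm import SVC

def rank_impostors(dev_speaker_vecs, candidate_impostor_vecs, keep=500):
    """dev_speaker_vecs: list of (n_i x d) arrays, one per development speaker."""
    X_imp = np.asarray(candidate_impostor_vecs)
    counts = Counter()
    for target_vecs in dev_speaker_vecs:           # one SVM per development speaker
        X = np.vstack([target_vecs, X_imp])
        y = np.r_[np.ones(len(target_vecs)), -np.ones(len(X_imp))]
        svm = SVC(kernel="linear").fit(X, y)
        for idx in svm.support_:                   # indices into X
            if idx >= len(target_vecs):            # an impostor example became a support vector
                counts[idx - len(target_vecs)] += 1
    ranked = [i for i, _ in counts.most_common()]
    return ranked[:keep]                           # indices of the refined impostor set
```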
Abstract:
A data-driven background dataset refinement technique was recently proposed for SVM-based speaker verification. This method selects a refined SVM background dataset from a set of candidate impostor examples after individually ranking examples by their relevance. This paper extends this technique to the refinement of the T-norm dataset for SVM-based speaker verification. The independent refinement of the background and T-norm datasets provides a means of investigating the sensitivity of SVM-based speaker verification performance to the selection of each of these datasets. Using refined datasets provided improvements of 13% in min. DCF and 9% in EER over the full set of impostor examples on the 2006 SRE corpus with the majority of these gains due to refinement of the T-norm dataset. Similar trends were observed for the unseen data of the NIST 2008 SRE.
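For context, a minimal sketch of T-norm score normalisation itself, whose cohort of impostor models is what the refinement technique selects; the scoring function and model objects are placeholders:

```python
# Minimal sketch of T-norm: standardise a trial score against the scores the
# same test segment obtains from a cohort of impostor (T-norm) models.
import numpy as np

def t_norm(raw_score, test_features, tnorm_models, score_fn):
    """Return the T-normalised score for one verification trial."""
    cohort = np.array([score_fn(m, test_features) for m in tnorm_models])
    return (raw_score - cohort.mean()) / (cohort.std() + 1e-12)
```

Refining the T-norm dataset changes which models populate `tnorm_models`, and hence the normalisation statistics applied to every trial.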
Abstract:
This work presents an extended Joint Factor Analysis (JFA) model that includes explicit modelling of unwanted within-session variability. The goals of the proposed extended JFA model are to improve verification performance with short utterances by compensating for the effects of limited or imbalanced phonetic coverage, and to produce a flexible JFA model that is effective over a wide range of utterance lengths without adjusting model parameters, such as retraining the session subspaces. Experimental results on the 2006 NIST SRE corpus demonstrate the flexibility of the proposed model, which provides competitive results over a wide range of utterance lengths without retraining and also yields modest improvements over the current state of the art in a number of conditions.
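For reference, a sketch of the baseline JFA decomposition on which such extensions build; the exact form of the paper's additional within-session term is not reproduced here:

```latex
% Baseline JFA model of a speaker- and session-dependent GMM mean supervector
% (the extension described above adds an explicit within-session term):
\[
  \mathbf{M} = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{x} + \mathbf{D}\mathbf{z}
\]
% m  : UBM mean supervector
% Vy : speaker variability (eigenvoice subspace V, speaker factors y)
% Ux : session/channel variability (eigenchannel subspace U, session factors x)
% Dz : residual speaker-specific offset (diagonal D, residual factors z)
```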
Abstract:
This article explores two matrix methods to induce the “shades of meaning” (SoM) of a word. A matrix representation of a word is computed from a corpus of traces based on the given word. Non-negative Matrix Factorisation (NMF) and Singular Value Decomposition (SVD) are used to compute sets of vectors, each corresponding to a potential shade of meaning. The two methods were evaluated based on loss of conditional entropy with respect to two sets of manually tagged data. One set reflects concepts generally appearing in text, and the second set comprises words used for investigations into word sense disambiguation. Results show that NMF consistently outperforms SVD for inducing both SoM of general concepts and word senses. The problem of inducing the shades of meaning of a word is more subtle than that of word sense induction and hence relevant to the thematic analysis of opinion, where nuances of opinion can arise.
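An illustrative sketch (not the paper's exact pipeline or evaluation) of factorising a word's trace matrix with NMF and SVD so that each component can be read as a candidate shade of meaning:

```python
# Illustrative only: factorise a (traces x vocabulary) co-occurrence matrix
# built around a target word; each component row is a candidate shade.
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

def induce_shades(trace_matrix, n_shades=5):
    """trace_matrix: non-negative counts of context terms per trace."""
    X = np.asarray(trace_matrix, dtype=float)
    nmf_shades = NMF(n_components=n_shades, init="nndsvd",
                     max_iter=500).fit(X).components_
    svd_shades = TruncatedSVD(n_components=n_shades).fit(X).components_
    return nmf_shades, svd_shades
```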
Abstract:
In this paper we argue that the term “capitalism” is no longer useful for understanding the current system of political economic relations in which we live. Rather, we argue that the system can be more usefully characterised as neofeudal corporatism. Using examples drawn from a 300,000 word corpus of public utterances by three political leaders from the “coalition of the willing”— George W. Bush, Tony Blair, and John Howard—we show some defining characteristics of this relatively new system and how they are manifest in political language about the invasion of Iraq.
Abstract:
In this study, a nanofiber mesh made by co-electrospinning medical grade poly(epsilon-caprolactone) and collagen (mPCL/Col) was fabricated and studied. Its mechanical properties and characteristics were analyzed and compared to mPCL meshes. mPCL/Col meshes showed a reduction in strength but an increase in ductility when compared to PCL meshes. In vitro assays revealed that mPCL/Col supported the attachment and proliferation of smooth muscle cells on both sides of the mesh. In vivo studies in the corpus cavernosa of rabbits revealed that the mPCL/Col scaffold used in conjunction with autologous smooth muscle cells resulted in better integration with host tissue when compared to cell-free scaffolds. On a cellular level, preseeded scaffolds showed a minimized foreign body reaction.
Abstract:
So much has been made over the crisis in English literature as field, as corpus, and as canon in recent years, that some of it undoubtedly has spilled over into English education. This has been the case in predominantly English-speaking Anglo-American and Commonwealth nations, as well as in those postcolonial states where English remains the medium of instruction and lingua franca of economic and cultural elites. Yet to attribute the pressures for change in pedagogic practice to academic paradigm shift per se would prop up the shaky axiom that English education is forever caught in some kind of perverse evolutionary time-lag, parasitic of university literary studies. I, too, believe that English education has reached a crucial moment in its history, but that this moment is contingent upon the changing demographics, cultural knowledges, and practices of economic globalization.
Abstract:
Automatic Speech Recognition (ASR) has matured into a technology which is becoming more common in our everyday lives, and is emerging as a necessity to minimise driver distraction when operating in-car systems such as navigation and infotainment. In "noise-free" environments, word recognition performance of these systems has been shown to approach 100%; however, this performance degrades rapidly as the level of background noise is increased. Speech enhancement is a popular method for making ASR systems more robust. Single-channel spectral subtraction was originally designed to improve human speech intelligibility, and many attempts have been made to optimise this algorithm in terms of signal-based metrics such as maximised Signal-to-Noise Ratio (SNR) or minimised speech distortion. Such metrics assess enhancement performance for intelligibility, not speech recognition, making them sub-optimal for ASR applications. This research investigates two methods for closely coupling subtractive-type enhancement algorithms with ASR: (a) a computationally efficient Mel-filterbank noise subtraction technique based on likelihood-maximisation (LIMA), and (b) introducing phase spectrum information to enable spectral subtraction in the complex frequency domain. Likelihood-maximisation uses gradient descent to optimise parameters of the enhancement algorithm to best fit the acoustic speech model given a word sequence known a priori. Whilst this technique is shown to improve ASR word accuracy, it is also identified as particularly sensitive to non-noise mismatches between the training and testing data. Phase information has long been ignored in spectral subtraction as it is deemed to have little effect on human intelligibility. In this work it is shown that phase information is important in obtaining highly accurate estimates of the clean speech magnitudes which are typically used in ASR feature extraction. Phase Estimation via Delay Projection is proposed based on the stationarity of sinusoidal signals, and demonstrates the potential to produce improvements in ASR word accuracy across a wide range of SNRs. Throughout the dissertation, consideration is given to practical implementation in vehicular environments, which resulted in two novel contributions: a LIMA framework which takes advantage of the grounding procedure common to speech dialogue systems, and a resource-saving formulation of frequency-domain spectral subtraction for realisation in field-programmable gate array hardware. The techniques proposed in this dissertation were evaluated using the Australian English In-Car Speech Corpus, which was collected as part of this work. This database is the first of its kind within Australia and captures real in-car speech of 50 native Australian speakers in seven driving conditions common to Australian environments.
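For orientation, a textbook magnitude spectral-subtraction routine of the kind these enhancement methods build on; the LIMA parameter optimisation and the proposed phase estimation are not shown:

```python
# Textbook magnitude spectral subtraction over STFT frames, given only to
# illustrate the class of enhancement algorithms discussed above.
import numpy as np

def spectral_subtract(noisy_stft, noise_mag_estimate, alpha=2.0, beta=0.01):
    """noisy_stft: complex STFT (freq x frames); noise_mag_estimate: (freq,)."""
    mag, phase = np.abs(noisy_stft), np.angle(noisy_stft)
    clean_mag = mag - alpha * noise_mag_estimate[:, None]   # subtract scaled noise estimate
    clean_mag = np.maximum(clean_mag, beta * mag)           # apply a spectral floor
    return clean_mag * np.exp(1j * phase)                   # reuse the noisy phase
```

The thesis's second contribution targets exactly the last line: instead of reusing the noisy phase, it estimates phase so that subtraction can be performed in the complex domain.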
Abstract:
At the historical and conceptual confluences of modernity, technology, and the "human", the texts in our corpus critically negotiate and interrogate the material and symbolic possibilities of the prosthesis, its phenomenological and speculative aspects: on the subjectivist and conceptualist side with a philosophy of consciousness, with Merleau-Ponty; and on the other with the epistemologists of the body and historians of knowledge Canguilhem and Foucault. The promising trope of the prosthesis affects the discursive and non-discursive formations concerning the reconstruction of bodies, where technology becomes the correlate of identity. Technology becomes humanised through contact with the human and, by revealing a superior hybridity, absorbs the human in the same movement. This work in the sociology of science (Latour, 1989), or the anthropology of science (Hakken, 2001), or biocultural anthropology (Andrieu, 1993; Andrieu, 2006; Andrieu, 2007a) is offered as an example of the potential contribution that biological and cultural anthropology can make to reconstructive medicine, and that reconstructive medicine can make to the plasticity of the human; biological anthropology concerns us in the biological transformation of the human body through the tool of technology, both in its history of mechanical and plastic reconstruction and in its project of bionic augmentation. We establish an archaeological continuity, in Foucauldian terminology, between the two practices. We question assumptions about the relations between nature/culture and biology/social context, and we present a definitional approach to technology, the cornerstone of our theoretical work. The trope of technology, as an adaptive tool of culture in the service of nature, undergoes a semantic shift by placing itself in the service of a biology to be improved. One of the keys to our research on the augmentation of the functions and aesthetics of the human body lies in the very redefinition of these relations, and in the impact of the interpenetration of reality and imagination on the construction of the scientific object, in the transformation of the human body. In order to grasp what is at stake in the discourse on the "autoevolution" of bodies, evolutionary theories are addressed, although they are not our speciality. Within the framework of autoevolution and the bionic augmentation of the human, the cultural somation of the body is exercised through the use of biotechnologies, in epistemological rupture with Darwinian thought, even though the act of evolutionary hybridisation remains inscribed in a project of bionic/genetic maximisation of the human body. We explore the currents of cybernetic thought in their actions of biological transformation of the human body and the performativity of mutilations. Technology and techniques thus appear inseparable from science and from its social constructionism.
Abstract:
In this paper I present an analysis of the language used by the National Endowment for Democracy (NED) on its website (NED, 2008). The specific focus of the analysis is on the NED's high usage of the word “should” revealed in computer-assisted corpus analysis using Leximancer. Typically we use the word “should” as a term to propose specific courses of action for ourselves and others. It is a marker of obligation and “oughtness”. In other words, its systematic institutional use can be read as a statement of ethics, of how the NED thinks the world ought to behave. As the NED is an ostensibly democracy-promoting institution with a clear agenda of implementing American foreign policy, its ethics are worth understanding. Analysis reveals a pattern of grammatical metaphor in which “should” is often deployed counterintuitively, and sometimes ambiguously, as a truth-making tool rather than one for proposing action. The effect is to present NED's imperatives for action as matters of fact rather than ethical or obligatory claims.
Abstract:
Forensic imaging has been facing scalability challenges for some time. As disk capacity growth continues to outpace storage IO bandwidth, the demands placed on storage and time are ever increasing. Data reduction and de-duplication technologies are now commonplace in the Enterprise space, and are potentially applicable to forensic acquisition. Using the new AFF4 forensic file format, we employ a hash-based compression scheme to leverage an existing corpus of images, reducing both acquisition time and storage requirements. This paper additionally describes some of the recent evolution in the AFF4 file format that makes the efficient implementation of hash-based imaging a reality.
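A simplified illustration of hash-based imaging with de-duplication against an existing corpus of block hashes; this is a sketch of the general idea, not the AFF4 on-disk representation:

```python
# Simplified sketch: blocks whose hashes already exist in a corpus index are
# recorded by reference instead of being stored again.
import hashlib

def image_device(read_blocks, corpus_hashes, store_block, block_size=32768):
    """read_blocks(block_size) yields raw byte blocks; corpus_hashes is a set of known digests."""
    block_map = []                                 # per-block record: ("ref"|"new", digest)
    for block in read_blocks(block_size):
        digest = hashlib.sha256(block).hexdigest()
        if digest in corpus_hashes:
            block_map.append(("ref", digest))      # already in the corpus: store a reference only
        else:
            store_block(digest, block)             # new data: store the block and index its hash
            corpus_hashes.add(digest)
            block_map.append(("new", digest))
    return block_map
```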
Abstract:
This thesis introduces the problem of conceptual ambiguity, or Shades of Meaning (SoM), that can exist around a term or entity. As an example, consider Ronald Reagan, the former president of the USA: there are many aspects to him that are captured in text, such as the Russian missile deal and the Iran-contra deal. Simply finding documents with the word “Reagan” in them will return results that cover many different shades of meaning related to "Reagan". Instead it may be desirable to retrieve results around a specific shade of meaning of "Reagan", e.g., all documents relating to the Iran-contra scandal. This thesis investigates computational methods for identifying shades of meaning around a word or concept. This problem is related to word sense ambiguity, but is more subtle and based less on the particular syntactic structures associated with an instance of the term and more on the semantic contexts around it. A particularly noteworthy difference from typical word sense disambiguation is that shades of a concept are not known in advance; it is up to the algorithm itself to ascertain these subtleties. It is the key hypothesis of this thesis that reducing the number of dimensions in the representation of concepts is a key part of reducing sparseness, and thus also crucial in discovering their SoM within a given corpus.
Abstract:
Digital collections are growing exponentially in size as the information age takes a firm grip on all aspects of society. As a result Information Retrieval (IR) has become an increasingly important area of research. It promises to provide new and more effective ways for users to find information relevant to their search intentions. Document clustering is one of the many tools in the IR toolbox and is far from being perfected. It groups documents that share common features. This grouping allows a user to quickly identify relevant information. If these groups are misleading then valuable information can accidentally be ignored. Therefore, the study and analysis of the quality of document clustering is important. With more and more digital information available, the performance of these algorithms is also of interest. An algorithm with a time complexity of O(n²) can quickly become impractical when clustering a corpus containing millions of documents. Therefore, the investigation of algorithms and data structures to perform clustering in an efficient manner is vital to its success as an IR tool. Document classification is another tool frequently used in the IR field. It predicts categories of new documents based on an existing database of (document, category) pairs. Support Vector Machines (SVM) have been found to be effective when classifying text documents. As the algorithms for classification are both efficient and of high quality, the largest gains can be made from improvements to representation. Document representations are vital for both clustering and classification. Representations exploit the content and structure of documents. Dimensionality reduction can improve the effectiveness of existing representations in terms of quality and run-time performance. Research into these areas is another way to improve the efficiency and quality of clustering and classification results. Evaluating document clustering is a difficult task. Intrinsic measures of quality such as distortion only indicate how well an algorithm minimised a similarity function in a particular vector space. Intrinsic comparisons are inherently limited by the given representation and are not comparable between different representations. Extrinsic measures of quality compare a clustering solution to a “ground truth” solution. This allows comparison between different approaches. As the “ground truth” is created by humans it can suffer from the fact that not every human interprets a topic in the same manner. Whether a document belongs to a particular topic or not can be subjective.
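As a generic example of the clustering-and-evaluation setup discussed above (not the thesis's specific algorithms or representations), a short sketch that clusters TF-IDF document vectors with k-means and scores the result extrinsically against ground-truth labels:

```python
# Generic sketch: TF-IDF representation, k-means clustering, and an extrinsic
# comparison of the clustering against human-assigned labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def cluster_and_score(documents, ground_truth_labels, k):
    X = TfidfVectorizer(max_features=50000).fit_transform(documents)
    predicted = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    # Extrinsic measure: agreement between clusters and the "ground truth".
    return normalized_mutual_info_score(ground_truth_labels, predicted)
```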
Abstract:
Most information retrieval (IR) models treat the presence of a term within a document as an indication that the document is somehow "about" that term; they do not take into account when a term might be explicitly negated. Medical data, by its nature, contains a high frequency of negated terms - e.g. "review of systems showed no chest pain or shortness of breath". This paper presents a study of the effects of negation on information retrieval. We present a number of experiments to determine whether negation has a significant negative effect on IR performance and whether language models that take negation into account might improve performance. We use a collection of real medical records as our test corpus. Our findings are that negation has some effect on system performance, but this will likely be confined to domains such as medical data where negation is prevalent.
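A toy sketch of rule-based negation scoping in the spirit of NegEx-style windows, included only to illustrate how negated terms might be flagged before indexing; the cue list and window size are arbitrary assumptions, and the paper's language models are not reproduced here:

```python
# Toy negation scoping: mark terms that fall within a short window after a
# negation cue, so an indexer could treat them differently.
import re

NEGATION_CUES = {"no", "not", "without", "denies", "denied"}

def flag_negated_terms(text, window=4):
    """Return the set of terms within `window` tokens after a negation cue."""
    tokens = re.findall(r"[a-z]+", text.lower())
    negated = set()
    for i, tok in enumerate(tokens):
        if tok in NEGATION_CUES:
            negated.update(tokens[i + 1 : i + 1 + window])
    return negated

# e.g. flag_negated_terms("review of systems showed no chest pain or shortness of breath")
# -> {"chest", "pain", "or", "shortness"}
```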