906 results for "Text feature extraction"
Abstract:
Machine learning techniques for prediction, and rule extraction from artificial neural networks, are used. The hypothesis that market sentiment and IPO-specific attributes are equally responsible for first-day IPO returns in the US stock market is tested. The machine learning methods used are Bayesian classification, support vector machines, decision tree techniques, rule learners, and artificial neural networks. The outcomes of the research are predictions and rules associated with the first-day returns of technology IPOs. The hypothesis that the first-day returns of technology IPOs are equally determined by IPO-specific and market sentiment attributes is rejected. Instead, lower-yielding IPOs are determined by both IPO-specific and market sentiment attributes, while higher-yielding IPOs depend largely on IPO-specific attributes.
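Rule learners of the kind listed above can be illustrated with a minimal one-rule (OneR) learner: for each attribute, build a rule mapping each attribute value to its majority class, then keep the attribute whose rule makes the fewest training errors. The attribute names and rows below are hypothetical, invented purely for the sketch, not data from the study.

```python
from collections import Counter

def one_r(rows, labels):
    """OneR: pick the single attribute whose value->majority-class rule
    makes the fewest errors on the training data."""
    best = None
    for attr in rows[0]:
        # count class labels observed for each value of this attribute
        by_value = {}
        for row, y in zip(rows, labels):
            by_value.setdefault(row[attr], Counter())[y] += 1
        # the rule maps each value to its majority class
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row[attr]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best  # (attribute, value -> class rule, training errors)

# Hypothetical IPO rows: a market-sentiment attribute vs. an IPO-specific one.
rows = [
    {"sentiment": "bullish", "underwriter": "top"},
    {"sentiment": "bullish", "underwriter": "mid"},
    {"sentiment": "bearish", "underwriter": "top"},
    {"sentiment": "bearish", "underwriter": "mid"},
]
labels = ["high", "high", "high", "low"]
attr, rule, errors = one_r(rows, labels)
```

On this toy data the learner keeps the first attribute examined, since both single-attribute rules misclassify exactly one row.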
Abstract:
This work examines prosody modelling for the Standard Yorùbá (SY) language in the context of computer text-to-speech synthesis applications. The thesis of this research is that it is possible to develop a practical prosody model using appropriate computational tools and techniques that combine acoustic data with an encoding of the phonological and phonetic knowledge provided by experts. Our prosody model is conceptualised around a modular, holistic framework. The framework is implemented using the Relational Tree (R-Tree) technique (Ehrich and Foith, 1976). The R-Tree is a sophisticated data structure that provides a multi-dimensional description of a waveform. A Skeletal Tree (S-Tree) is first generated using algorithms based on the tone phonological rules of SY. Subsequent steps update the S-Tree by computing the numerical values of the prosody dimensions. To implement the intonation dimension, fuzzy control rules were developed based on data from native speakers of Yorùbá. The Classification And Regression Tree (CART) and Fuzzy Decision Tree (FDT) techniques were tested for modelling the duration dimension, and the FDT was selected for its better performance. An important feature of our R-Tree framework is its flexibility: it facilitates the independent implementation of the different dimensions of prosody, i.e. duration and intonation, using different techniques, and their subsequent integration. Our approach provides a flexible and extensible model that can also be used to implement, study, and explain the theory behind aspects of the phenomena observed in speech prosody.
Abstract:
The present thesis investigates mode-related aspects in biology lecture discourse and attempts to identify the position of this variety along the continuum from spontaneous spoken to planned written language. The corpus comprises nine lectures (43,000 words), in three sets of three, given by three lecturers at Aston University. The indeterminacy of the results obtained from investigating grammatical complexity, as measured by subordination, motivates taking the analysis beyond sentence level to mode-related aspects in the use of sentence-initial connectives, sub-topic shifting, and paraphrase. It is found that biology lecture discourse combines features typical of speech and writing at both sentence and discourse level: subordination is used more than co-ordination, but sentences of one degree of complexity are favoured; some sentence-initial connectives are found only in uses typical of spoken language, yet sub-topic shift signalling (generally introduced by a connective), typical of planned written language, is a major feature of the lectures; syntactic and lexical revision, repetition, and interrupted structures are found in the sub-topic shift signalling utterance and in paraphrase, but the text is also amenable to analysis into sentence-like units. On the other hand, it is also found that: (1) while there are some differences in the use of a given feature, inter-speaker variation is on the whole not significant; (2) mode-related aspects are often motivated by the didactic function of the variety; and (3) the structuring of the text follows a sequencing whose boundaries are marked by sub-topic shifting and the summary paraphrase. This study enables us to draw four theoretical conclusions: (1) mode-related aspects cannot be approached as a simple dichotomy, since a combination of aspects of both speech and writing is found in a given feature, and it is necessary to go to the level of textual features to identify mode-related aspects; (2) homogeneity is dominant in this sample of lectures, which suggests a high level of standardization in this variety; (3) the didactic function of the variety is manifested in some mode-related aspects; (4) the features studied play a role in the structuring of the text.
Abstract:
Working within the framework of the branch of linguistics known as discourse analysis, and more specifically within the current approach of genre analysis, this thesis presents an analysis of the English of economic forecasting. The language of economic forecasting is highly specialised and follows certain conventions of structure and style. This research project identifies these characteristics and explains them in terms of their communicative function. The work is based on a corpus of texts published in economic reports and surveys by major corporate bodies. These documents are targeted at an international expert readership familiar with this genre. The data is analysed at two broad levels: firstly, the macro-level of text structure, which is described in terms of schema theory, a currently influential model of analysis; and, secondly, the micro-level of authors' strategies for modulating the predictions which form the key move in the forecasting schema. The thesis aims to contribute to the newly developing field of genre analysis in a number of ways: firstly, by covering a hitherto neglected but intrinsically interesting and important genre (economic forecasting); secondly, by testing the applicability of existing models of analysis at the level of schematic structure and proposing a genre-specific model; thirdly, by offering insights into the nature of the modulation of propositions, which is often broadly classified as 'hedging' or 'modality' and has recently been described as 'an area for prolonged fieldwork'. This phenomenon is shown to be a key feature of this particular genre. It is suggested that this thesis, in addition to its contribution to the theory of genre analysis, provides a useful basis for work by teachers of English for Economics, an important area of English for Specific Purposes.
Abstract:
Web APIs have gained increasing popularity in recent Web service technology development, owing to the simplicity of their technology stack and the proliferation of mashups. However, efficiently discovering Web APIs and their documentation on the Web remains a challenging task, even with the best resources available. In this paper we cast the problem of detecting Web API documentation as a text classification problem: classifying a given Web page as Web API associated or not. We propose a supervised generative topic model called feature latent Dirichlet allocation (feaLDA), which offers a generic probabilistic framework for the automatic detection of Web APIs. feaLDA not only captures the correspondence between data and the associated class labels, but also provides a mechanism for incorporating side information, such as labelled features automatically learned from the data, that can effectively help improve classification performance. Extensive experiments on our Web API documentation dataset show that the feaLDA model outperforms three strong supervised baselines (naive Bayes, support vector machines, and the maximum entropy model) by over 3% in classification accuracy. In addition, feaLDA also gives superior performance when compared against other existing supervised topic models.
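For concreteness, the simplest of the baselines named above can be sketched as a tiny multinomial naive Bayes text classifier with Laplace smoothing over bag-of-words counts. The toy pages and labels below are invented for the sketch; this is not feaLDA or the paper's dataset.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Multinomial naive Bayes with Laplace smoothing; returns a predictor."""
    vocab = set()
    word_counts = defaultdict(Counter)   # class -> word frequency counts
    class_counts = Counter(labels)
    for doc, y in zip(docs, labels):
        words = doc.lower().split()
        vocab.update(words)
        word_counts[y].update(words)

    def predict(doc):
        words = doc.lower().split()
        best, best_lp = None, float("-inf")
        for y in class_counts:
            # log prior + sum of smoothed log likelihoods
            lp = math.log(class_counts[y] / len(docs))
            total = sum(word_counts[y].values())
            for w in words:
                lp += math.log((word_counts[y][w] + alpha) /
                               (total + alpha * len(vocab)))
            if lp > best_lp:
                best, best_lp = y, lp
        return best

    return predict

# Toy pages, hypothetical: does a page document a Web API?
docs = ["rest api endpoint json response",
        "get request returns json api key",
        "my holiday photos and travel blog",
        "recipe for chocolate cake"]
labels = ["api", "api", "other", "other"]
predict = train_nb(docs, labels)
```

A real baseline would of course use proper tokenization and a large labelled corpus; the point is only the shape of the generative classifier feaLDA is compared against.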
Abstract:
In large organizations, the resources needed to solve challenging problems are typically dispersed over systems within and beyond the organization, and across different media. However, knowledge environments still need extraction methods able to combine evidence for a fact from across different media. In many cases the whole is more than the sum of its parts: only when the different media are considered simultaneously can enough evidence be obtained to derive facts otherwise inaccessible to the knowledge worker via traditional methods that work on each medium separately. In this paper, we present a cross-media knowledge extraction framework specifically designed to handle large volumes of documents composed of three types of media (text, images, and raw data) and to exploit the evidence across the media. Our goal is to improve the quality and depth of automatically extracted knowledge.
Abstract:
In this paper, we present syllable-based duration modelling in the context of a prosody model for Standard Yorùbá (SY) text-to-speech (TTS) synthesis applications. Our prosody model is conceptualised around a modular holistic framework. This framework is implemented using the Relational Tree (R-Tree) techniques. An important feature of our R-Tree framework is its flexibility in that it facilitates the independent implementation of the different dimensions of prosody, i.e. duration, intonation, and intensity, using different techniques and their subsequent integration. We applied the Fuzzy Decision Tree (FDT) technique to model the duration dimension. In order to evaluate the effectiveness of FDT in duration modelling, we have also developed a Classification And Regression Tree (CART) based duration model using the same speech data. Each of these models was integrated into our R-Tree based prosody model. We performed both quantitative (i.e. Root Mean Square Error (RMSE) and Correlation (Corr)) and qualitative (i.e. intelligibility and naturalness) evaluations on the two duration models. The results show that CART models the training data more accurately than FDT. The FDT model, however, shows a better ability to extrapolate from the training data since it achieved a better accuracy for the test data set. Our qualitative evaluation results show that our FDT model produces synthesised speech that is perceived to be more natural than our CART model. In addition, we also observed that the expressiveness of FDT is much better than that of CART. That is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximation. We, therefore, conclude that the FDT approach is a practical approach for duration modelling in SY TTS applications. © 2006 Elsevier Ltd. All rights reserved.
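The quantitative evaluation above rests on RMSE and Pearson correlation between predicted and observed durations; both metrics fit in a few lines of plain Python. The syllable durations below (in milliseconds) are hypothetical, purely for illustration.

```python
import math

def rmse(pred, actual):
    """Root mean square error between predicted and observed durations."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

def corr(pred, actual):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(pred)
    mp, ma = sum(pred) / n, sum(actual) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(pred, actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    return cov / (sp * sa)

# Hypothetical syllable durations in milliseconds.
actual = [120.0, 95.0, 150.0, 110.0]
pred   = [118.0, 100.0, 140.0, 115.0]
```

A lower RMSE on held-out data and a correlation nearer 1 are what the abstract means by the FDT "extrapolating" better than CART on the test set.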
Abstract:
The principal feature of the ontology developed here for text processing is a broader representation of knowledge about the external world, achieved by introducing a three-level hierarchy. This allows the semantic interpretation of natural language texts to be improved.
Abstract:
This paper presents an algorithmic solution for the management of related text objects, integrating algorithms for their extraction from paper or electronic formats and for their storage and processing in a relational database. The developed algorithms for data extraction and data analysis make it possible to find specific features of, and relations between, the text objects in the database. The algorithmic solution is applied to data from the field of phytopharmacy in Bulgaria. It can serve as a tool and methodology for other subject areas with complex relationships between text objects.
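A minimal sketch of the storage-and-query side using Python's built-in sqlite3 module: one table for extracted text objects, one for typed relations between them. The table layout, object names, and the 'treats' relation are hypothetical stand-ins for the phytopharmacy data, not the paper's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Extracted text objects, with their source medium recorded.
cur.execute("""CREATE TABLE text_object (
                   id INTEGER PRIMARY KEY,
                   source TEXT,          -- e.g. 'paper' or 'electronic'
                   body TEXT)""")
# Typed relations between pairs of text objects.
cur.execute("""CREATE TABLE relation (
                   subject_id INTEGER REFERENCES text_object(id),
                   object_id  INTEGER REFERENCES text_object(id),
                   kind TEXT)""")
cur.execute("INSERT INTO text_object VALUES (1, 'paper', 'Product X')")
cur.execute("INSERT INTO text_object VALUES (2, 'paper', 'Disease Y')")
cur.execute("INSERT INTO relation VALUES (1, 2, 'treats')")

# Query: which objects does 'Product X' relate to, and how?
rows = cur.execute("""SELECT t2.body, r.kind
                      FROM relation r
                      JOIN text_object t1 ON t1.id = r.subject_id
                      JOIN text_object t2 ON t2.id = r.object_id
                      WHERE t1.body = 'Product X'""").fetchall()
```

Keeping relations in their own table is what lets the analysis step discover connections between objects that were extracted from different documents.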
Abstract:
The internal quantum efficiency (IQE) of a high-brightness blue LED has been evaluated from the external quantum efficiency measured as a function of current at room temperature. Processing the data with a novel evaluation procedure based on the ABC-model, we have separately determined the IQE of the LED structure and the light extraction efficiency (LEE) of the UX:3 chip.
Full text: Nowadays, understanding LED efficiency behavior at high currents is quite critical to finding ways to further improve III-nitride LED performance [1]. External quantum efficiency ηe (EQE) provides integral information on the recombination and photon emission processes in LEDs. At negligible carrier leakage from the active region, EQE is the product of the IQE ηi and the LEE ηext. Separate determination of IQE and LEE would be much more helpful, providing a correlation between these parameters and the specific epi-structure and chip design. In this paper, we extend the approach of [2,3] to the whole range of current/optical-power variation, providing an express tool for the separate evaluation of IQE and LEE. We studied an InGaN-based LED fabricated by Osram OS. The LED structure, grown by MOCVD on a sapphire substrate, was processed as a UX:3 chip and mounted into a Golden Dragon package without molding. EQE was measured with a Labsphere CDS-600 spectrometer. Plotting EQE versus output power P and finding the power Pm corresponding to the EQE maximum ηm enables comparing the measurements with the analytical relationships ηi = Q/(Q + p^(1/2) + p^(-1/2)), p = P/Pm, and Q = B/(AC)^(1/2), where A, B, and C are recombination constants [4]. As a result, the maximum IQE value, equal to Q/(Q+2), can be found from the ratio ηm/ηe plotted as a function of p^(1/2) + p^(-1/2) (see Fig. 1a), and the LEE then calculated as ηext = ηm(Q+2)/Q. Experimental EQE as a function of normalized optical power p is shown in Fig. 1b along with the analytical approximation based on the ABC-model.
The approximation fits the measurements perfectly over a range of optical power (or operating current) spanning eight orders of magnitude. In conclusion, a new express method for the separate evaluation of the IQE and LEE of III-nitride LEDs is suggested and applied to the characterization of a high-brightness blue LED. With this method, we obtained an LEE from the free chip surface to the air of 69.8% and an IQE of 85.7% at the maximum and 65.2% at the operating current of 350 mA.
[1] G. Verzellesi, D. Saguatti, M. Meneghini, F. Bertazzi, M. Goano, G. Meneghesso, and E. Zanoni, "Efficiency droop in InGaN/GaN blue light-emitting diodes: Physical mechanisms and remedies," J. Appl. Phys., vol. 114, no. 7, p. 071101, Aug. 2013. [2] C. van Opdorp and G. W. 't Hooft, "Method for determining effective nonradiative lifetime and leakage losses in double-heterostructure lasers," J. Appl. Phys., vol. 52, no. 6, pp. 3827-3839, Feb. 1981. [3] M. Meneghini, N. Trivellin, G. Meneghesso, E. Zanoni, U. Zehnder, and B. Hahn, "A combined electro-optical method for the determination of the recombination parameters in InGaN-based light-emitting diodes," J. Appl. Phys., vol. 106, no. 11, p. 114508, Dec. 2009. [4] Qi Dai, Qifeng Shan, Jing Wang, S. Chhajed, Jaehee Cho, E. F. Schubert, M. H. Crawford, D. D. Koleske, Min-Ho Kim, and Yongjo Park, "Carrier recombination mechanisms and efficiency droop in GaInN/GaN light-emitting diodes," Appl. Phys. Lett., vol. 97, no. 13, p. 133507, Sept. 2010. © 2014 IEEE.
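For concreteness, the ABC-model relations quoted in the abstract fit in a few lines; Q and the efficiencies below are illustrative numbers, not the fitted values from the measurement.

```python
import math

def iqe(p, Q):
    """Internal quantum efficiency: eta_i = Q / (Q + p**0.5 + p**-0.5),
    with p = P/Pm the optical power normalized to the EQE-maximum power."""
    return Q / (Q + math.sqrt(p) + 1.0 / math.sqrt(p))

def lee(eta_m, Q):
    """Light extraction efficiency recovered from the measured EQE maximum:
    eta_ext = eta_m * (Q + 2) / Q."""
    return eta_m * (Q + 2.0) / Q

Q = 12.0                # illustrative Q = B / sqrt(A*C)
eta_max = iqe(1.0, Q)   # IQE peaks at p = 1, where sqrt(p) + 1/sqrt(p) = 2
```

Since sqrt(p) + 1/sqrt(p) is minimized (value 2) at p = 1, the IQE maximum Q/(Q+2) occurs exactly at the EQE maximum, which is what lets the method read Q off the measured curve.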
Abstract:
2000 Mathematics Subject Classification: 62H30
Abstract:
As one of the most popular deep learning models, the convolutional neural network (CNN) has achieved huge success in image information extraction. Traditionally, a CNN is trained with labeled data by a supervised learning method and used as a classifier by adding a classification layer at the end. Its capability for extracting image features is largely limited by the difficulty of assembling a large training dataset. In this paper, we propose a new unsupervised learning CNN model, which uses a so-called convolutional sparse auto-encoder (CSAE) algorithm to pre-train the CNN. Instead of using labeled natural images for CNN training, the CSAE algorithm can be used to train the CNN with unlabeled artificial images, which enables easy expansion of the training data and unsupervised learning. The CSAE algorithm is especially designed for extracting complex features from specific objects such as Chinese characters. After the features of the artificial images are extracted by the CSAE algorithm, the learned parameters are used to initialize the first convolutional layer of the CNN, and the CNN model is then fine-tuned on scene image patches with a linear classifier. The new CNN model is applied to Chinese scene text detection and is evaluated on a multilingual image dataset in which Chinese, English, and numeral texts are labelled separately. A detection precision gain of more than 10% is observed over two comparison CNN models.
Abstract:
We present in this article an automated framework that extracts product adopter information from online reviews and incorporates the extracted information into feature-based matrix factorization for more effective product recommendation. Specifically, we propose a bootstrapping approach for the extraction of product adopters from review text, categorizing them into a number of demographic categories. The aggregated demographic information of many product adopters can be used to characterize both products and users in the form of distributions over the demographic categories. We further propose a graph-based method to iteratively update user- and product-related distributions more reliably in a heterogeneous user-product graph, and incorporate them as features into the matrix factorization approach for product recommendation. Our experimental results on a large dataset crawled from JINGDONG, the largest B2C e-commerce website in China, show that our proposed framework outperforms a number of competitive baselines for product recommendation.
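The aggregation step described above, turning many extracted adopter mentions into a per-product distribution over demographic categories, can be sketched in a few lines. The category names and extracted adopters below are hypothetical, not the paper's taxonomy or data.

```python
from collections import Counter

def adopter_distribution(adopters, categories):
    """Normalize extracted adopter mentions into a distribution over
    the fixed set of demographic categories (zeros for unseen ones)."""
    counts = Counter(a for a in adopters if a in categories)
    total = sum(counts.values())
    return {c: counts[c] / total if total else 0.0 for c in categories}

categories = ["child", "teen", "adult", "senior"]
# Adopters extracted (by some bootstrapping step) from one product's reviews:
adopters = ["adult", "adult", "teen", "adult", "senior"]
dist = adopter_distribution(adopters, categories)
```

A vector like this, one entry per category, is the kind of feature the framework feeds into the matrix factorization model for each product and user.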
Abstract:
Latin America, a region rich in both energy resources and native heritage, faces a rising politico-social confrontation that has been growing for over two decades. While resources like oil and gas are exploited to enhance the state’s economic growth, indigenous groups feel threatened because the operations related to this exploitation are infringing on their homelands. Furthermore, they believe that the potential resource wealth found in these environmentally-sensitive regions is provoking an “intrusion” in their ancestral territory of either government agencies or corporations allowed by governmental decree. Indigenous groups, which have achieved greater political voice over the past decade, are protesting against government violations. These protests have reached the media and received international attention, leading the discourse on topics such as civil and human rights violations. When this happens, the State finds itself “between a rock and a hard place”: In a debate between indigenous groups’ rights and economic sustainability.