30 resultados para Statistical Error
Resumo:
This paper compares statistical technique of paraphrase identification to semantic technique of paraphrase identification. The statistical techniques used for comparison are word set and word-order based methods where as the semantic technique used is the WordNet similarity matrix method described by Stevenson and Fernando in [3].
Resumo:
In this paper we describe the methodology and the structural design of a system that translates English into Malayalam using statistical models. A monolingual Malayalam corpus and a bilingual English/Malayalam corpus are the main resource in building this Statistical Machine Translator. Training strategy adopted has been enhanced by PoS tagging which helps to get rid of the insignificant alignments. Moreover, incorporating units like suffix separator and the stop word eliminator has proven to be effective in bringing about better training results. In the decoder, order conversion rules are applied to reduce the structural difference between the language pair. The quality of statistical outcome of the decoder is further improved by applying mending rules. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations and the results are verified with F measure, BLEU and WER evaluation metrics
Resumo:
This paper underlines a methodology for translating text from English into the Dravidian language, Malayalam using statistical models. By using a monolingual Malayalam corpus and a bilingual English/Malayalam corpus in the training phase, the machine automatically generates Malayalam translations of English sentences. This paper also discusses a technique to improve the alignment model by incorporating the parts of speech information into the bilingual corpus. Removing the insignificant alignments from the sentence pairs by this approach has ensured better training results. Pre-processing techniques like suffix separation from the Malayalam corpus and stop word elimination from the bilingual corpus also proved to be effective in training. Various handcrafted rules designed for the suffix separation process which can be used as a guideline in implementing suffix separation in Malayalam language are also presented in this paper. The structural difference between the English Malayalam pair is resolved in the decoder by applying the order conversion rules. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations and the results are verified with F measure, BLEU and WER evaluation metrics
Resumo:
A methodology for translating text from English into the Dravidian language, Malayalam using statistical models is discussed in this paper. The translator utilizes a monolingual Malayalam corpus and a bilingual English/Malayalam corpus in the training phase and generates automatically the Malayalam translation of an unseen English sentence. Various techniques to improve the alignment model by incorporating the morphological inputs into the bilingual corpus are discussed. Removing the insignificant alignments from the sentence pairs by this approach has ensured better training results. Pre-processing techniques like suffix separation from the Malayalam corpus and stop word elimination from the bilingual corpus also proved to be effective in producing better alignments. Difficulties in translation process that arise due to the structural difference between the English Malayalam pair is resolved in the decoding phase by applying the order conversion rules. The handcrafted rules designed for the suffix separation process which can be used as a guideline in implementing suffix separation in Malayalam language are also presented in this paper. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations and the results are verified with F measure, BLEU and WER evaluation metrics
Resumo:
Statistical Machine Translation (SMT) is one of the potential applications in the field of Natural Language Processing. The translation process in SMT is carried out by acquiring translation rules automatically from the parallel corpora. However, for many language pairs (e.g. Malayalam- English), they are available only in very limited quantities. Therefore, for these language pairs a huge portion of phrases encountered at run-time will be unknown. This paper focuses on methods for handling such out-of-vocabulary (OOV) words in Malayalam that cannot be translated to English using conventional phrase-based statistical machine translation systems. The OOV words in the source sentence are pre-processed to obtain the root word and its suffix. Different inflected forms of the OOV root are generated and a match is looked up for the word variants in the phrase translation table of the translation model. A Vocabulary filter is used to choose the best among the translations of these word variants by finding the unigram count. A match for the OOV suffix is also looked up in the phrase entries and the target translations are filtered out. Structuring of the filtered phrases is done and SMT translation model is extended by adding OOV with its new phrase translations. By the results of the manual evaluation done it is observed that amount of OOV words in the input has been reduced considerably
Resumo:
In Statistical Machine Translation from English to Malayalam, an unseen English sentence is translated into its equivalent Malayalam sentence using statistical models. A parallel corpus of English-Malayalam is used in the training phase. Word to word alignments has to be set among the sentence pairs of the source and target language before subjecting them for training. This paper deals with certain techniques which can be adopted for improving the alignment model of SMT. Methods to incorporate the parts of speech information into the bilingual corpus has resulted in eliminating many of the insignificant alignments. Also identifying the name entities and cognates present in the sentence pairs has proved to be advantageous while setting up the alignments. Presence of Malayalam words with predictable translations has also contributed in reducing the insignificant alignments. Moreover, reduction of the unwanted alignments has brought in better training results. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations and the results are verified with F measure, BLEU and WER evaluation metrics.
Resumo:
Coded OFDM is a transmission technique that is used in many practical communication systems. In a coded OFDM system, source data are coded, interleaved and multiplexed for transmission over many frequency sub-channels. In a conventional coded OFDM system, the transmission power of each subcarrier is the same regardless of the channel condition. However, some subcarrier can suffer deep fading with multi-paths and the power allocated to the faded subcarrier is likely to be wasted. In this paper, we compute the FER and BER bounds of a coded OFDM system given as convex functions for a given channel coder, inter-leaver and channel response. The power optimization is shown to be a convex optimization problem that can be solved numerically with great efficiency. With the proposed power optimization scheme, near-optimum power allocation for a given coded OFDM system and channel response to minimize FER or BER under a constant transmission power constraint is obtained
Resumo:
In Statistical Machine Translation from English to Malayalam, an unseen English sentence is translated into its equivalent Malayalam translation using statistical models like translation model, language model and a decoder. A parallel corpus of English-Malayalam is used in the training phase. Word to word alignments has to be set up among the sentence pairs of the source and target language before subjecting them for training. This paper is deals with the techniques which can be adopted for improving the alignment model of SMT. Incorporating the parts of speech information into the bilingual corpus has eliminated many of the insignificant alignments. Also identifying the name entities and cognates present in the sentence pairs has proved to be advantageous while setting up the alignments. Moreover, reduction of the unwanted alignments has brought in better training results. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations and the results are verified with F measure, BLEU and WER evaluation metrics
Resumo:
A potential fungal strain producing extracellular β-glucosidase enzyme was isolated from sea water and identified as ^ëéÉêJ Öáääìë=ëóÇçïáá BTMFS 55 by a molecular approach based on 28S rDNA sequence homology which showed 93% identity with already reported sequences of ^ëéÉêÖáääìë=ëóÇçïáá in the GenBank. A sequential optimization strategy was used to enhance the production of β-glucosidase under solid state fermentation (SSF) with wheat bran (WB) as the growth medium. The two-level Plackett-Burman (PB) design was implemented to screen medium components that influence β-glucosidase production and among the 11 variables, moisture content, inoculums, and peptone were identified as the most significant factors for β-glucosidase production. The enzyme was purified by (NH4)2SO4 precipitation followed by ion exchange chromatography on DEAE sepharose. The enzyme was a monomeric protein with a molecular weight of ~95 kDa as determined by SDS-PAGE. It was optimally active at pH 5.0 and 50°C. It showed high affinity towards éNPG and enzyme has a hã and sã~ñ of 0.67 mM and 83.3 U/mL, respectively. The enzyme was tolerant to glucose inhibition with a há of 17 mM. Low concentration of alcohols (10%), especially ethanol, could activate the enzyme. A considerable level of ethanol could produce from wheat bran and rice straw after 48 and 24 h, respectively, with the help of p~ÅÅÜ~êçãóÅÉë=ÅÉêÉîáëá~É in presence of cellulase and the purified β-glucosidase of ^ëéÉêÖáääìë=ëóÇçïáá BTMFS 55.
Resumo:
Adaptive filter is a primary method to filter Electrocardiogram (ECG), because it does not need the signal statistical characteristics. In this paper, an adaptive filtering technique for denoising the ECG based on Genetic Algorithm (GA) tuned Sign-Data Least Mean Square (SD-LMS) algorithm is proposed. This technique minimizes the mean-squared error between the primary input, which is a noisy ECG, and a reference input which can be either noise that is correlated in some way with the noise in the primary input or a signal that is correlated only with ECG in the primary input. Noise is used as the reference signal in this work. The algorithm was applied to the records from the MIT -BIH Arrhythmia database for removing the baseline wander and 60Hz power line interference. The proposed algorithm gave an average signal to noise ratio improvement of 10.75 dB for baseline wander and 24.26 dB for power line interference which is better than the previous reported works
Resumo:
Low grade and High grade Gliomas are tumors that originate in the glial cells. The main challenge in brain tumor diagnosis is whether a tumor is benign or malignant, primary or metastatic and low or high grade. Based on the patient's MRI, a radiologist could not differentiate whether it is a low grade Glioma or a high grade Glioma. Because both of these are almost visually similar, autopsy confirms the diagnosis of low grade with high-grade and infiltrative features. In this paper, textural description of Grade I and grade III Glioma are extracted using First order statistics and Gray Level Co-occurance Matrix Method (GLCM). Textural features are extracted from 16X16 sub image of the segmented Region of Interest(ROI) .In the proposed method, first order statistical features such as contrast, Intensity , Entropy, Kurtosis and spectral energy and GLCM features extracted were showed promising results. The ranges of these first order statistics and GLCM based features extracted are highly discriminant between grade I and Grade III. In this study which gives statistical textural information of grade I and grade III Glioma which is very useful for further classification and analysis and thus assisting Radiologist in greater extent.
Resumo:
The characterization and grading of glioma tumors, via image derived features, for diagnosis, prognosis, and treatment response has been an active research area in medical image computing. This paper presents a novel method for automatic detection and classification of glioma from conventional T2 weighted MR images. Automatic detection of the tumor was established using newly developed method called Adaptive Gray level Algebraic set Segmentation Algorithm (AGASA).Statistical Features were extracted from the detected tumor texture using first order statistics and gray level co-occurrence matrix (GLCM) based second order statistical methods. Statistical significance of the features was determined by t-test and its corresponding p-value. A decision system was developed for the grade detection of glioma using these selected features and its p-value. The detection performance of the decision system was validated using the receiver operating characteristic (ROC) curve. The diagnosis and grading of glioma using this non-invasive method can contribute promising results in medical image computing
Resumo:
Econometrics is a young science. It developed during the twentieth century in the mid-1930’s, primarily after the World War II. Econometrics is the unification of statistical analysis, economic theory and mathematics. The history of econometrics can be traced to the use of statistical and mathematics analysis in economics. The most prominent contributions during the initial period can be seen in the works of Tinbergen and Frisch, and also that of Haavelmo in the 1940's through the mid 1950's. Right from the rudimentary application of statistics to economic data, like the use of laws of error through the development of least squares by Legendre, Laplace, and Gauss, the discipline of econometrics has later on witnessed the applied works done by Edge worth and Mitchell. A very significant mile stone in its evolution has been the work of Tinbergen, Frisch, and Haavelmo in their development of multiple regression and correlation analysis. They used these techniques to test different economic theories using time series data. In spite of the fact that some predictions based on econometric methodology might have gone wrong, the sound scientific nature of the discipline cannot be ignored by anyone. This is reflected in the economic rationale underlying any econometric model, statistical and mathematical reasoning for the various inferences drawn etc. The relevance of econometrics as an academic discipline assumes high significance in the above context. Because of the inter-disciplinary nature of econometrics (which is a unification of Economics, Statistics and Mathematics), the subject can be taught at all these broad areas, not-withstanding the fact that most often Economics students alone are offered this subject as those of other disciplines might not have adequate Economics background to understand the subject. In fact, even for technical courses (like Engineering), business management courses (like MBA), professional accountancy courses etc. econometrics is quite relevant. More relevant is the case of research students of various social sciences, commerce and management. In the ongoing scenario of globalization and economic deregulation, there is the need to give added thrust to the academic discipline of econometrics in higher education, across various social science streams, commerce, management, professional accountancy etc. Accordingly, the analytical ability of the students can be sharpened and their ability to look into the socio-economic problems with a mathematical approach can be improved, and enabling them to derive scientific inferences and solutions to such problems. The utmost significance of hands-own practical training on the use of computer-based econometric packages, especially at the post-graduate and research levels need to be pointed out here. Mere learning of the econometric methodology or the underlying theories alone would not have much practical utility for the students in their future career, whether in academics, industry, or in practice This paper seeks to trace the historical development of econometrics and study the current status of econometrics as an academic discipline in higher education. Besides, the paper looks into the problems faced by the teachers in teaching econometrics, and those of students in learning the subject including effective application of the methodology in real life situations. Accordingly, the paper offers some meaningful suggestions for effective teaching of econometrics in higher education
Resumo:
The problem of using information available from one variable X to make inferenceabout another Y is classical in many physical and social sciences. In statistics this isoften done via regression analysis where mean response is used to model the data. Onestipulates the model Y = µ(X) +ɛ. Here µ(X) is the mean response at the predictor variable value X = x, and ɛ = Y - µ(X) is the error. In classical regression analysis, both (X; Y ) are observable and one then proceeds to make inference about the mean response function µ(X). In practice there are numerous examples where X is not available, but a variable Z is observed which provides an estimate of X. As an example, consider the herbicidestudy of Rudemo, et al. [3] in which a nominal measured amount Z of herbicide was applied to a plant but the actual amount absorbed by the plant X is unobservable. As another example, from Wang [5], an epidemiologist studies the severity of a lung disease, Y , among the residents in a city in relation to the amount of certain air pollutants. The amount of the air pollutants Z can be measured at certain observation stations in the city, but the actual exposure of the residents to the pollutants, X, is unobservable and may vary randomly from the Z-values. In both cases X = Z+error: This is the so called Berkson measurement error model.In more classical measurement error model one observes an unbiased estimator W of X and stipulates the relation W = X + error: An example of this model occurs when assessing effect of nutrition X on a disease. Measuring nutrition intake precisely within 24 hours is almost impossible. There are many similar examples in agricultural or medical studies, see e.g., Carroll, Ruppert and Stefanski [1] and Fuller [2], , among others. In this talk we shall address the question of fitting a parametric model to the re-gression function µ(X) in the Berkson measurement error model: Y = µ(X) + ɛ; X = Z + η; where η and ɛ are random errors with E(ɛ) = 0, X and η are d-dimensional, and Z is the observable d-dimensional r.v.
Resumo:
Study on variable stars is an important topic of modern astrophysics. After the invention of powerful telescopes and high resolving powered CCD’s, the variable star data is accumulating in the order of peta-bytes. The huge amount of data need lot of automated methods as well as human experts. This thesis is devoted to the data analysis on variable star’s astronomical time series data and hence belong to the inter-disciplinary topic, Astrostatistics. For an observer on earth, stars that have a change in apparent brightness over time are called variable stars. The variation in brightness may be regular (periodic), quasi periodic (semi-periodic) or irregular manner (aperiodic) and are caused by various reasons. In some cases, the variation is due to some internal thermo-nuclear processes, which are generally known as intrinsic vari- ables and in some other cases, it is due to some external processes, like eclipse or rotation, which are known as extrinsic variables. Intrinsic variables can be further grouped into pulsating variables, eruptive variables and flare stars. Extrinsic variables are grouped into eclipsing binary stars and chromospheri- cal stars. Pulsating variables can again classified into Cepheid, RR Lyrae, RV Tauri, Delta Scuti, Mira etc. The eruptive or cataclysmic variables are novae, supernovae, etc., which rarely occurs and are not periodic phenomena. Most of the other variations are periodic in nature. Variable stars can be observed through many ways such as photometry, spectrophotometry and spectroscopy. The sequence of photometric observa- xiv tions on variable stars produces time series data, which contains time, magni- tude and error. The plot between variable star’s apparent magnitude and time are known as light curve. If the time series data is folded on a period, the plot between apparent magnitude and phase is known as phased light curve. The unique shape of phased light curve is a characteristic of each type of variable star. One way to identify the type of variable star and to classify them is by visually looking at the phased light curve by an expert. For last several years, automated algorithms are used to classify a group of variable stars, with the help of computers. Research on variable stars can be divided into different stages like observa- tion, data reduction, data analysis, modeling and classification. The modeling on variable stars helps to determine the short-term and long-term behaviour and to construct theoretical models (for eg:- Wilson-Devinney model for eclips- ing binaries) and to derive stellar properties like mass, radius, luminosity, tem- perature, internal and external structure, chemical composition and evolution. The classification requires the determination of the basic parameters like pe- riod, amplitude and phase and also some other derived parameters. Out of these, period is the most important parameter since the wrong periods can lead to sparse light curves and misleading information. Time series analysis is a method of applying mathematical and statistical tests to data, to quantify the variation, understand the nature of time-varying phenomena, to gain physical understanding of the system and to predict future behavior of the system. Astronomical time series usually suffer from unevenly spaced time instants, varying error conditions and possibility of big gaps. This is due to daily varying daylight and the weather conditions for ground based observations and observations from space may suffer from the impact of cosmic ray particles. Many large scale astronomical surveys such as MACHO, OGLE, EROS, xv ROTSE, PLANET, Hipparcos, MISAO, NSVS, ASAS, Pan-STARRS, Ke- pler,ESA, Gaia, LSST, CRTS provide variable star’s time series data, even though their primary intention is not variable star observation. Center for Astrostatistics, Pennsylvania State University is established to help the astro- nomical community with the aid of statistical tools for harvesting and analysing archival data. Most of these surveys releases the data to the public for further analysis. There exist many period search algorithms through astronomical time se- ries analysis, which can be classified into parametric (assume some underlying distribution for data) and non-parametric (do not assume any statistical model like Gaussian etc.,) methods. Many of the parametric methods are based on variations of discrete Fourier transforms like Generalised Lomb-Scargle peri- odogram (GLSP) by Zechmeister(2009), Significant Spectrum (SigSpec) by Reegen(2007) etc. Non-parametric methods include Phase Dispersion Minimi- sation (PDM) by Stellingwerf(1978) and Cubic spline method by Akerlof(1994) etc. Even though most of the methods can be brought under automation, any of the method stated above could not fully recover the true periods. The wrong detection of period can be due to several reasons such as power leakage to other frequencies which is due to finite total interval, finite sampling interval and finite amount of data. Another problem is aliasing, which is due to the influence of regular sampling. Also spurious periods appear due to long gaps and power flow to harmonic frequencies is an inherent problem of Fourier methods. Hence obtaining the exact period of variable star from it’s time series data is still a difficult problem, in case of huge databases, when subjected to automation. As Matthew Templeton, AAVSO, states “Variable star data analysis is not always straightforward; large-scale, automated analysis design is non-trivial”. Derekas et al. 2007, Deb et.al. 2010 states “The processing of xvi huge amount of data in these databases is quite challenging, even when looking at seemingly small issues such as period determination and classification”. It will be beneficial for the variable star astronomical community, if basic parameters, such as period, amplitude and phase are obtained more accurately, when huge time series databases are subjected to automation. In the present thesis work, the theories of four popular period search methods are studied, the strength and weakness of these methods are evaluated by applying it on two survey databases and finally a modified form of cubic spline method is intro- duced to confirm the exact period of variable star. For the classification of new variable stars discovered and entering them in the “General Catalogue of Vari- able Stars” or other databases like “Variable Star Index“, the characteristics of the variability has to be quantified in term of variable star parameters.