957 results for Statistical approach
Abstract:
This paper discusses a methodology for translating text from English into the Dravidian language Malayalam using statistical models. The translator uses a monolingual Malayalam corpus and a bilingual English/Malayalam corpus in the training phase and automatically generates the Malayalam translation of an unseen English sentence. Various techniques to improve the alignment model by incorporating morphological inputs into the bilingual corpus are discussed. Removing insignificant alignments from the sentence pairs in this way gave better training results. Pre-processing techniques such as suffix separation in the Malayalam corpus and stop word elimination from the bilingual corpus also proved effective in producing better alignments. Difficulties in the translation process that arise from the structural difference between English and Malayalam are resolved in the decoding phase by applying order conversion rules. The handcrafted rules designed for the suffix separation process, which can serve as a guideline for implementing suffix separation in Malayalam, are also presented. Experiments conducted on a sample corpus generated reasonably good Malayalam translations, and the results were verified with the F measure, BLEU and WER evaluation metrics.
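As an illustration of one of the metrics named above, the following is a minimal sketch of word error rate (WER), computed as the word-level edit distance divided by the reference length. The function name and the whitespace tokenization are assumptions made for this sketch; the abstract does not say how the authors tokenized Malayalam text or computed their scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or exact match
        prev = curr
    return prev[-1] / len(ref)
```

A lower value means the hypothesis is closer to the reference; a value of 0 means the two word sequences are identical.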
Abstract:
A potential fungal strain producing extracellular β-glucosidase enzyme was isolated from sea water and identified as Aspergillus sydowii BTMFS 55 by a molecular approach based on 28S rDNA sequence homology, which showed 93% identity with already reported sequences of Aspergillus sydowii in GenBank. A sequential optimization strategy was used to enhance the production of β-glucosidase under solid state fermentation (SSF) with wheat bran (WB) as the growth medium. A two-level Plackett-Burman (PB) design was implemented to screen medium components that influence β-glucosidase production; among the 11 variables, moisture content, inoculum and peptone were identified as the most significant factors for β-glucosidase production. The enzyme was purified by (NH4)2SO4 precipitation followed by ion exchange chromatography on DEAE sepharose. The enzyme was a monomeric protein with a molecular weight of ~95 kDa as determined by SDS-PAGE. It was optimally active at pH 5.0 and 50°C. It showed high affinity towards pNPG, with a Km and Vmax of 0.67 mM and 83.3 U/mL, respectively. The enzyme was tolerant to glucose inhibition, with a Ki of 17 mM. Low concentrations of alcohols (10%), especially ethanol, could activate the enzyme. A considerable level of ethanol could be produced from wheat bran and rice straw after 48 and 24 h, respectively, with the help of Saccharomyces cerevisiae in the presence of cellulase and the purified β-glucosidase of Aspergillus sydowii BTMFS 55.
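To make the reported kinetic constants concrete, here is a minimal sketch that evaluates the standard Michaelis-Menten rate law with the Km and Vmax values quoted above. It assumes simple, uninhibited Michaelis-Menten kinetics; the abstract reports only the constants, so the function name and the choice of rate law are assumptions of this sketch.

```python
def michaelis_menten_rate(substrate_mM: float,
                          v_max: float = 83.3,   # U/mL, value quoted in the abstract
                          k_m: float = 0.67) -> float:  # mM, value quoted in the abstract
    """Reaction rate v = Vmax * [S] / (Km + [S]), assuming no inhibition."""
    if substrate_mM < 0:
        raise ValueError("substrate concentration must be non-negative")
    return v_max * substrate_mM / (k_m + substrate_mM)


# At [S] = Km the rate is half of Vmax by definition of Km.
half_max = michaelis_menten_rate(0.67)
```

Under this model the rate approaches Vmax only when the substrate concentration is much larger than Km.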
Abstract:
Post-transcriptional gene silencing by RNA interference is mediated by small interfering RNA (siRNA). This gene silencing mechanism can be exploited therapeutically against a wide variety of disease-associated targets, especially in AIDS, neurodegenerative diseases, cholesterol and cancer, in mice, with the hope of extending these approaches to treat humans. In the recent past, a significant amount of work has been undertaken to understand gene silencing mediated by exogenous siRNA. The design of efficient exogenous siRNA sequences is challenging because of many issues related to siRNA. When designing efficient siRNA, target mRNAs must be selected such that their corresponding siRNAs are likely to be efficient against that target and unlikely to accidentally silence other transcripts due to sequence similarity. Before performing gene silencing with siRNAs, it is therefore essential to analyze their off-target effects in addition to their inhibition efficiency against a particular target. Hence, designing exogenous siRNA with good knock-down efficiency and target specificity is an area of concern to be addressed. Some methods have already been developed that consider both the inhibition efficiency and the off-target possibility of siRNA against a gene. Of these methods, only a few have achieved good inhibition efficiency, specificity and sensitivity. The main focus of this thesis is to develop computational methods to optimize the efficiency of siRNA in terms of “inhibition capacity and off-target possibility” against target mRNAs with improved efficacy, which may be useful in the area of gene silencing and drug design for tumor development. This study aims to investigate the currently available siRNA prediction approaches and to devise a better computational approach to tackle the problem of siRNA efficacy in terms of inhibition capacity and off-target possibility.
The strengths and limitations of the available approaches are investigated and taken into consideration in devising an improved solution. The approaches proposed in this study thus extend some of the best-scoring previous state-of-the-art techniques by incorporating machine learning and statistical approaches, together with thermodynamic features such as whole stacking energy, to improve prediction accuracy, inhibition efficiency, sensitivity and specificity. Here, we propose one Support Vector Machine (SVM) model and two Artificial Neural Network (ANN) models for siRNA efficiency prediction. In the SVM model, the classification property is used to classify whether an siRNA is efficient or inefficient in silencing a target gene. The first ANN model, named siRNA Designer, is used for optimizing the inhibition efficiency of siRNA against target genes. The second ANN model, named Optimized siRNA Designer (OpsiD), produces efficient siRNAs with high inhibition efficiency to degrade target genes with improved sensitivity and specificity, and identifies the off-target knockdown possibility of siRNA against non-target genes. The models are trained and tested against a large data set of siRNA sequences. The validations are conducted using the Pearson correlation coefficient, the Matthews correlation coefficient, Receiver Operating Characteristic analysis, accuracy of prediction, sensitivity and specificity. The OpsiD approach is found to be capable of predicting the inhibition capacity of siRNA against a target mRNA with improved results over state-of-the-art techniques. We are also able to understand the influence of whole stacking energy on the efficiency of siRNA. The model is further improved by including the ability to identify the “off-target possibility” of predicted siRNA on non-target genes.
Thus the proposed model, OpsiD, can predict optimized siRNA by considering both “inhibition efficiency on target genes and off-target possibility on non-target genes”, with improved inhibition efficiency, specificity and sensitivity. Since we have taken efforts to optimize siRNA efficacy in terms of “inhibition efficiency and off-target possibility”, we hope that the risk of “off-target effects” during gene silencing in various bioinformatics fields can be overcome to a great extent. These findings may provide new insights into cancer diagnosis, prognosis and therapy by gene silencing. The approach may prove useful for designing exogenous siRNA for therapeutic applications and gene silencing techniques in different areas of bioinformatics.
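Several of the validation measures named in this abstract can be computed directly from a binary confusion matrix. The sketch below shows sensitivity, specificity, accuracy and the Matthews correlation coefficient using their standard definitions. The function name is an assumption, and the counts in the example are invented purely for illustration; they are not results from the thesis.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification measures from confusion-matrix counts."""
    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 when a measure is undefined because its denominator is zero.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # efficient siRNAs correctly found
        "specificity": ratio(tn, tn + fp),   # inefficient siRNAs correctly rejected
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Unlike accuracy, the Matthews correlation coefficient stays informative when the efficient and inefficient classes are of very different sizes, which is one common reason to report both.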
Abstract:
Consumers are becoming more concerned about food quality, especially regarding how, when and where foods are produced (Haglund et al., 1999; Kahl et al., 2004; Alföldi et al., 2006). In recent years there has therefore been growing interest in methods for food quality assessment, especially in picture-developing methods as a complement to traditional chemical analysis of single compounds (Kahl et al., 2006). Biocrystallization, one of the picture-developing methods, is based on the crystallographic phenomenon that when aqueous solutions of CuCl2 dihydrate are crystallized with added organic solutions, originating, e.g., from crop samples, biocrystallograms with reproducible crystal patterns are generated (Kleber & Steinike-Hartung, 1959). Its output is a crystal pattern on glass plates from which different variables (numbers) can be calculated by image analysis. However, a standardized evaluation method to quantify the morphological features of the biocrystallogram image is lacking. The main aims of this research are therefore (1) to optimize an existing statistical model in order to describe all the effects that contribute to the experiment; (2) to investigate the effect of image parameters, i.e., region of interest (ROI), color transformation and histogram matching, on the texture analysis of biocrystallogram images of samples from the project 020E170/F financed by the Federal Ministry of Food, Agriculture and Consumer Protection (BMELV), the samples being wheat and carrots from controlled field and farm trials; and (3) to relate the texture parameters with the strongest effect to the visual evaluation criteria developed by a group of researchers (University of Kassel, Germany; Louis Bolk Institute (LBI), Netherlands; and Biodynamic Research Association Denmark (BRAD), Denmark) in order to clarify the relation between the texture parameters and the visual characteristics of an image.
The refined statistical model was implemented as an lme model with repeated measurements via crossed effects, programmed in R (version 2.1.0). The validity of the F and P values was checked against the SAS program. The ANOVA gives the same F values in both programs, but the P values are larger in R because of its more conservative approach. The refined model yields more significant P values. The optimization of the image analysis deals with the following parameters: ROI (region of interest, the area around the geometrical center), color transformation (calculation of the one-dimensional gray level value from the three-dimensional color information of the scanned picture, which is necessary for the texture analysis) and histogram matching (normalization of the histogram of the picture to enhance contrast and to minimize errors from lighting conditions). The samples were wheat from the DOC trial with 4 field replicates for the years 2003 and 2005 and “market samples” (organic and conventional neighbors with the same variety) for 2004 and 2005, as well as carrots obtained from the University of Kassel (2 varieties, 2 nitrogen treatments) for the years 2004, 2005 and 2006 and “market samples” of carrot for the years 2004 and 2005. The criterion for the optimization was the repeatability of the differentiation of the samples over the different harvests (years). Different ROIs were found for different samples, reflecting the different pictures. The color transformation that differentiates most efficiently is based on gray scale, i.e., equal color transformation. The second dimension of the color transformation appeared only in some years, as an effect of color wavelength (hue) for carrots treated with different nitrate fertilizer levels. The best histogram matching is to the Gaussian distribution. The approach was to find a connection between the variables from the textural image analysis and the different visual criteria.
The relation between the texture parameters and the visual evaluation criteria was examined only for the carrot samples, mainly because these could be well differentiated by the texture analysis. It was possible to connect groups of variables of the texture analysis with groups of criteria from the visual evaluation. These selected variables were able to differentiate the samples but not to classify them according to treatment. In contrast, with visual criteria that describe the picture as a whole, classification is possible in 80% of the sample cases. This clearly shows the limits of the single-variable approach of the image analysis (texture analysis).
Abstract:
In the field of structural dynamics, computer-aided model validation techniques are now widespread. These use experimental modal data to correct a numerical model for further analyses. Nevertheless, the validated model represents only the dynamic behaviour of the tested structure. In reality, there are many factors that inevitably lead to varying results of modal tests: changing environmental conditions during a test, slightly different test setups, a test on a nominally identical but different structure (e.g. from series production), etc. For a stochastic simulation to be carried out, a number of assumptions must be made about the random variables used. Consequently, an inverse method is needed that makes it possible to identify a stochastic model from experimental modal data. This work describes the development of a parameter-based approach for identifying stochastic simulation models in the field of structural dynamics. The method developed is based on first-order sensitivities, with which parameter means and covariances of the numerical model can be determined from stochastic experimental modal data.
Abstract:
This paper describes a new statistical, model-based approach to building a contact state observer. The observer uses measurements of the contact force and position, and prior information about the task encoded in a graph, to determine the current location of the robot in the task configuration space. Each node represents what the measurements will look like in a small region of configuration space by storing a predictive, statistical measurement model. This approach assumes that the measurements are statistically block independent conditioned on knowledge of the model, which is a fairly good model of the actual process. Arcs in the graph represent possible transitions between models. Beam Viterbi search is used to match the measurement history against possible paths through the model graph in order to estimate the most likely path for the robot. The resulting approach provides a new decision process that can be used as an observer for event-driven manipulation programming. The decision procedure is significantly more robust than simple threshold decisions because the measurement history is used to make decisions. The approach can be used to enhance the capabilities of autonomous assembly machines and in quality control applications.
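The search step described above can be sketched as a beam-pruned Viterbi pass over a graph of measurement models. This is a toy illustration under stated assumptions, not the paper's observer: it uses a single scalar force measurement per step, Gaussian models with invented means, equally likely arcs, and sample-wise rather than block-wise independence. All names (`beam_viterbi`, `MEANS`, `ARCS`, the node labels) are hypothetical.

```python
def beam_viterbi(observations, log_likelihood, arcs, start_nodes, beam_width=2):
    """Return the most likely node path, keeping only the best `beam_width` paths per step."""
    # beam maps each surviving node to (log score, best path ending at that node)
    beam = {n: (log_likelihood(n, observations[0]), [n]) for n in start_nodes}
    for obs in observations[1:]:
        candidates = {}
        for node, (score, path) in beam.items():
            for nxt in arcs.get(node, ()):
                new_score = score + log_likelihood(nxt, obs)
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        if not candidates:
            raise ValueError("no allowed transition from the current beam")
        ranked = sorted(candidates.items(), key=lambda kv: kv[1][0], reverse=True)
        beam = dict(ranked[:beam_width])  # prune to the beam width
    return max(beam.values(), key=lambda entry: entry[0])[1]


# Hypothetical two-node task graph: free motion, then contact.
MEANS = {"free": 0.0, "contact": 5.0}   # assumed mean contact force per node
SIGMA = 1.0                             # assumed common standard deviation
ARCS = {"free": ["free", "contact"], "contact": ["contact"]}


def gaussian_log_likelihood(node, force):
    # Log density up to an additive constant shared by all nodes.
    return -0.5 * ((force - MEANS[node]) / SIGMA) ** 2


path = beam_viterbi([0.1, 0.3, 4.8, 5.2], gaussian_log_likelihood, ARCS, ["free"])
```

Because each decision is scored over the whole measurement history rather than a single sample, a brief force spike is less likely to flip the estimated state than it would be with a fixed threshold.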
Abstract:
This paper is a first draft of the principle of statistical modelling on coordinates. Several causes, which would take long to detail, have led to this situation close to the deadline for submitting papers to CODAWORK’03. The main one is the fast development of the approach over the last months, which has made previous drafts appear obsolete. The present paper contains the essential parts of the state of the art of this approach from my point of view. I would like to acknowledge many clarifying discussions with the group of people working in this field in Girona, Barcelona, Carrick Castle, Firenze, Berlin, Göttingen, and Freiberg. They have given many suggestions and ideas. Nevertheless, there might still be errors or unclear aspects, which are exclusively my fault. I hope this contribution serves as a basis for further discussions and new developments.
Abstract:
The Aitchison vector space structure for the simplex is generalized to a Hilbert space structure A2(P) for distributions and likelihoods on arbitrary spaces. Central notions of statistics, such as information or likelihood, can be identified in the algebraic structure of A2(P), together with their corresponding notions in compositional data analysis, such as Aitchison distance or the centered log-ratio transform. In this way, very elaborate aspects of mathematical statistics can be understood easily in the light of a simple vector space structure and of compositional data analysis. For example, combinations of statistical information, such as Bayesian updating or the combination of likelihood and robust M-estimation functions, are simple additions/perturbations in A2(Pprior). Weighting observations corresponds to a weighted addition of the corresponding evidence. Likelihood-based statistics for general exponential families turn out to have a particularly easy interpretation in terms of A2(P). Regular exponential families form finite-dimensional linear subspaces of A2(P), and they correspond to finite-dimensional subspaces formed by their posterior in the dual information space A2(Pprior). The Aitchison norm can be identified with mean Fisher information. The closing constant itself is identified with a generalization of the cumulant function and shown to be Kullback-Leibler's directed information. Fisher information is the local geometry of the manifold induced by the A2(P) derivative of the Kullback-Leibler information, and the space A2(P) can therefore be seen as the tangential geometry of statistical inference at the distribution P. The discussion of A2(P)-valued random variables, such as estimation functions or likelihoods, gives a further interpretation of Fisher information as the expected squared norm of evidence and a scale-free understanding of unbiased reasoning.
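The statement that Bayesian updating is a perturbation can be checked in the finite-dimensional case of the simplex, which is the setting the abstract generalizes from. The sketch below is restricted to discrete distributions with strictly positive parts; the function names and the example prior and likelihood are assumptions chosen for illustration, not taken from the paper.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def perturb(x, y):
    """Aitchison perturbation: closed component-wise product."""
    return closure([a * b for a, b in zip(x, y)])


def clr(x):
    """Centered log-ratio transform of a composition with positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between clr coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Hypothetical discrete prior over three hypotheses and a likelihood for one observation.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.5]
posterior = perturb(prior, likelihood)  # Bayes' rule: posterior is proportional to prior * likelihood
```

In clr coordinates the same update is a plain vector addition, `clr(posterior) = clr(prior) + clr(likelihood)`, which is the vector space view the abstract refers to.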
Abstract:
The preceding two editions of CoDaWork included talks on the possible consideration of densities as infinite compositions: Egozcue and Díaz-Barrero (2003) extended the Euclidean structure of the simplex to a Hilbert space structure of the set of densities within a bounded interval, and van den Boogaart (2005) generalized this to the set of densities bounded by an arbitrary reference density. From the many variations of the Hilbert structures available, we work with three cases. For bounded variables, a basis derived from Legendre polynomials is used. For variables with a lower bound, we standardize them with respect to an exponential distribution and express their densities as coordinates in a basis derived from Laguerre polynomials. Finally, for unbounded variables, a normal distribution is used as reference, and coordinates are obtained with respect to a Hermite-polynomial-based basis. To get the coordinates, several approaches can be considered. A numerical accuracy problem occurs if one estimates the coordinates directly by using discretized scalar products. Thus we propose to use a weighted linear regression approach, where all k-order polynomials are used as predictand variables and weights are proportional to the reference density. Finally, for the case of 2-order Hermite polynomials (normal reference) and 1-order Laguerre polynomials (exponential), one can also derive the coordinates from their relationships to the classical mean and variance. Apart from these theoretical issues, this contribution focuses on the application of this theory to two main problems in sedimentary geology: the comparison of several grain size distributions, and the comparison among different rocks of the empirical distribution of a property measured on a batch of individual grains from the same rock or sediment, such as their composition.
Abstract:
We investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In detail, we are given a set of labeled images of scenes (for example, coast, forest, city, river, etc.), and our objective is to classify a new image into one of these categories. Our approach consists of first discovering latent "topics" using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature here applied to a bag of visual words representation for each image, and subsequently, training a multiway classifier on the topic distribution vector for each image. We compare this approach to that of representing each image by a bag of visual words vector directly and training a multiway classifier on these vectors. To this end, we introduce a novel vocabulary using dense color SIFT descriptors and then investigate the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM). We achieve superior classification performance to recent publications that have used a bag of visual word representation, in all cases, using the authors' own data sets and testing protocols. We also investigate the gain in adding spatial information. We show applications to image retrieval with relevance feedback and to scene classification in videos.
Abstract:
This paper focuses on one of the methods for bandwidth allocation in an ATM network: the convolution approach. The convolution approach permits an accurate study of the system load in statistical terms by accumulated calculations, since probabilistic results of the bandwidth allocation can be obtained. Nevertheless, the convolution approach has a high cost in terms of calculation and storage requirements. This makes real-time calculations difficult, so many authors do not consider this approach. With the aim of reducing the cost, we propose to use the multinomial distribution function: the enhanced convolution approach (ECA). This permits direct computation of the associated probabilities of the instantaneous bandwidth requirements and makes a simple deconvolution process possible. The ECA is used in connection acceptance control, and some results are presented.
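The basic convolution approach mentioned above can be sketched as follows: each source's instantaneous bandwidth demand is a discrete distribution, and the distribution of the total demand is obtained by convolving them one after another. This shows only the plain convolution, not the multinomial-based ECA the paper proposes. The function names, the on/off source model and the numbers are assumptions made for illustration.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discrete demands (index = bandwidth units)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def aggregate_demand(sources):
    """Accumulate the total-demand distribution by convolving one source at a time."""
    total = [1.0]  # with no sources, the demand is 0 with probability 1
    for distribution in sources:
        total = convolve(total, distribution)
    return total


def congestion_probability(sources, capacity):
    """Probability that the instantaneous total demand exceeds the link capacity."""
    total = aggregate_demand(sources)
    return sum(prob for demand, prob in enumerate(total) if demand > capacity)


# Hypothetical on/off sources: each needs 2 units with probability 0.1, otherwise 0.
on_off_source = [0.9, 0.0, 0.1]
total_distribution = aggregate_demand([on_off_source] * 3)
blocking = congestion_probability([on_off_source] * 3, capacity=4)
```

The nested loops in `convolve` are what make the method costly as the number of connections grows, which is the calculation and storage burden the abstract refers to. A connection acceptance rule could, for instance, admit a new source only if the resulting congestion probability stays below a chosen target.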
Abstract:
Compositional data, also called multiplicative ipsative data, are common in survey research instruments in areas such as time use, budget expenditure and social networks. Compositional data are usually expressed as proportions of a total, whose sum can only be 1. Owing to their constrained nature, statistical analysis in general, and estimation of measurement quality with a confirmatory factor analysis model for multitrait-multimethod (MTMM) designs in particular, are challenging tasks. Compositional data are highly non-normal, as they range within the 0-1 interval. One component can only increase if some other(s) decrease, which results in spurious negative correlations among components which cannot be accounted for by the MTMM model parameters. In this article we show how researchers can use the correlated uniqueness model for MTMM designs in order to evaluate measurement quality of compositional indicators. We suggest using the additive log ratio transformation of the data, discuss several approaches to deal with zero components and explain how the interpretation of MTMM designs differs from the application to standard unconstrained data. We show an illustration of the method on data of social network composition expressed in percentages of partner, family, friends and other members, in which we conclude that the face-to-face collection mode is generally superior to the telephone mode, although primacy effects are higher in the face-to-face mode. Compositions of strong ties (such as partner) are measured with higher quality than those of weaker ties (such as other network members).
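The additive log ratio (alr) transformation suggested above maps a D-part composition to D-1 unconstrained real values by taking logs of ratios to one reference part. The sketch below uses the standard definition and its inverse. Using the last part as the reference, the function names and the example percentages are assumptions for illustration; zero components are rejected here rather than replaced, since the abstract only says that several zero-handling approaches are discussed.

```python
import math


def alr(composition, reference_index=-1):
    """Additive log-ratio transform: log of each part divided by the reference part."""
    if any(part <= 0 for part in composition):
        raise ValueError("alr requires strictly positive parts; handle zeros first")
    ref = reference_index % len(composition)
    return [math.log(part / composition[ref])
            for i, part in enumerate(composition) if i != ref]


def alr_inverse(coordinates):
    """Map alr coordinates back to proportions, with the reference part placed last."""
    expanded = [math.exp(c) for c in coordinates] + [1.0]
    total = sum(expanded)
    return [value / total for value in expanded]


# Hypothetical network composition: partner, family, friends, other members.
network = [0.4, 0.3, 0.2, 0.1]
coordinates = alr(network)
```

The transformed values are free to vary over the whole real line, so they no longer carry the sum-to-one constraint that induces spurious negative correlations among the raw proportions.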
Abstract:
The basic idea of vibration-based damage detection in Structural Health Monitoring (SHM) is that damage alters the stiffness, mass or energy dissipation properties of a system, which in turn alters its dynamic response. Within the context of pattern recognition, this thesis presents a hybrid reasoning methodology for assessing damage in structures, combining the use of a model of the structure and/or previous experiments with a knowledge-based reasoning scheme to assess whether damage is present, its severity and its location. The methodology involves elements related to vibration analysis, mathematics (wavelets, statistical process control), signal and/or pattern analysis and processing (case-based reasoning, self-organizing networks), smart structures and damage detection. The techniques are validated numerically and experimentally considering corrosion, mass loss, mass accumulation and impacts. The structures used in this work are a cantilever truss structure, an aluminium beam, two pipe sections and part of the wing of a commercial aircraft.
Abstract:
In this work we study the use of distributional semantics and machine learning to improve statistical machine translation. To that end, we propose a machine learning model based on logistic regression to dynamically model the translation probability of phrases. We show that the proposed model is a generalization of the usual translation probabilities of statistical machine translation, and we use it to incorporate both contextual and distributional semantic information through lexical features, word clusters and vector representations of words. In addition, we explore another approach to integrating distributional semantic knowledge into statistical machine translation: using bilingual word vector representations to model the similarity of phrase translations. Our experiments show the usefulness of the proposed models, obtaining promising results over a strong baseline system. Likewise, our work makes important contributions regarding bilingual mappings of vector representations and phrase similarity measures based on word vector representations, which have intrinsic value in the field of distributional semantics beyond machine translation.
Abstract:
A new technique is described for the analysis of cloud-resolving model simulations, which allows one to investigate the statistics of the lifecycles of cumulus clouds. Clouds are tracked from timestep to timestep within the model run. This allows for a very simple method of tracking, but one which is both comprehensive and robust. An approach for handling cloud splits and mergers is described which allows clouds with simple and complicated time histories to be compared within a single framework. This is found to be important for the analysis of an idealized simulation of radiative-convective equilibrium, in which the moist, buoyant updrafts (i.e., the convective cores) were tracked. Around half of all such cores were subject to splits and mergers during their lifecycles. For cores without any such events, the average lifetime is 30 min, but events can lengthen the typical lifetime considerably.
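One simple way to track objects between consecutive timesteps and to detect splits and mergers is to link objects that share grid cells. The sketch below illustrates that idea only. The abstract does not state the linking criterion the authors use, so the overlap rule, the function names and the toy grid cells are all assumptions.

```python
def link_cores(prev_cores, curr_cores):
    """Return (previous id, current id) pairs for cores sharing at least one grid cell."""
    return [(p, c)
            for p, p_cells in prev_cores.items()
            for c, c_cells in curr_cores.items()
            if p_cells & c_cells]


def find_events(links):
    """Identify splits (one core, several successors) and mergers (several predecessors)."""
    children, parents = {}, {}
    for p, c in links:
        children.setdefault(p, []).append(c)
        parents.setdefault(c, []).append(p)
    splits = {p: cs for p, cs in children.items() if len(cs) > 1}
    mergers = {c: ps for c, ps in parents.items() if len(ps) > 1}
    return splits, mergers


# Hypothetical cores, each given as a set of (row, column) grid cells.
step0 = {"A": {(0, 0), (0, 1), (0, 2)}, "B": {(5, 5)}, "C": {(5, 7)}}
step1 = {"a1": {(0, 0)}, "a2": {(0, 2)}, "bc": {(5, 5), (5, 6), (5, 7)}}

links = link_cores(step0, step1)
splits, mergers = find_events(links)
```

In this toy case core A splits into a1 and a2, while B and C merge into bc. Cores that appear in neither dictionary have simple one-to-one histories, which is the distinction the lifetime statistics above rely on.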