948 results for statistical methods
Abstract:
Longitudinal surveys are increasingly used to collect event history data on person-specific processes such as transitions between labour market states. Survey-based event history data pose a number of challenges for statistical analysis. These challenges include survey errors due to sampling, non-response, attrition and measurement. This study deals with non-response, attrition and measurement errors in event history data and the bias they cause in event history analysis. The study also discusses some choices faced by a researcher using longitudinal survey data for event history analysis and demonstrates their effects. These choices include whether a design-based or a model-based approach is taken, which subset of data to use and, if a design-based approach is taken, which weights to use. The study takes advantage of the possibility of using combined longitudinal survey-register data. The Finnish subset of the European Community Household Panel (FI ECHP) survey for waves 1–5 was linked at person level with longitudinal register data. Unemployment spells were used as the study variables of interest. Lastly, a simulation study was conducted to assess the statistical properties of the Inverse Probability of Censoring Weighting (IPCW) method in a survey data context. The study shows how combined longitudinal survey-register data can be used to analyse and compare the non-response and attrition processes, test the type of missingness mechanism and estimate the size of the bias due to non-response and attrition. In our empirical analysis, initial non-response turned out to be a more important source of bias than attrition. Reported unemployment spells were subject to seam effects, omissions and, to a lesser extent, overreporting. The use of proxy interviews tended to cause spell omissions. An often-ignored phenomenon, classification error in reported spell outcomes, was also found in the data. 
Neither the Missing At Random (MAR) assumption about non-response and attrition mechanisms nor the classical assumptions about measurement errors turned out to be valid. Measurement errors in both spell durations and spell outcomes were found to cause bias in estimates from event history models. Low measurement accuracy affected the estimates of the baseline hazard most. The design-based estimates based on data from respondents to all waves of interest and weighted by the last-wave weights displayed the largest bias. Using all the available data, including the spells of attriters until the time of attrition, helped to reduce attrition bias. Lastly, the simulation study showed that the IPCW correction to design weights reduces bias due to dependent censoring in design-based Kaplan-Meier and Cox proportional hazards model estimators. The study discusses implications of the results for survey organisations collecting event history data, researchers using surveys for event history analysis, and researchers who develop methods to correct for non-sampling biases in event history data.
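To make the idea of a weighted, design-based Kaplan-Meier estimator more concrete, the following is a minimal sketch and not the study's implementation. It assumes each spell carries one fixed weight, for example a design weight already multiplied by an inverse probability-of-censoring factor. Real IPCW weights are usually time-varying and come from a fitted censoring model, which this sketch does not estimate. The function name and the toy data are hypothetical.

```python
def weighted_kaplan_meier(times, events, weights):
    """Weighted Kaplan-Meier curve.

    times   -- spell durations
    events  -- 1 if the spell ended in the event, 0 if censored
    weights -- one fixed weight per spell (assumption: e.g. design weight
               times an inverse probability-of-censoring factor)
    Returns a list of (time, survival) pairs at each distinct event time.
    """
    data = sorted(zip(times, events, weights))
    at_risk = sum(w for _, _, w in data)  # weighted size of the risk set
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        weighted_events = 0.0
        leaving = 0.0  # weight of everyone leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            _, event, w = data[i]
            if event:
                weighted_events += w
            leaving += w
            i += 1
        if weighted_events > 0:
            survival *= 1.0 - weighted_events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy data: four spells, the second one censored.
print(weighted_kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1], [1.0, 1.0, 1.0, 1.0]))
```

With all weights equal to one, this reduces to the ordinary Kaplan-Meier estimator, which is a convenient sanity check before plugging in survey weights.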
Abstract:
In today's logistics environment, there is a tremendous need for accurate cost information and cost allocation. Companies searching for a proper solution often come across activity-based costing (ABC) or one of its variations, which uses cost drivers to allocate the costs of activities to cost objects. To allocate costs accurately and reliably, and thus to obtain the benefits of the costing system, selecting appropriate cost drivers is essential. The purpose of this study is to validate the transportation cost drivers of a Finnish wholesale company and ultimately to select the best possible driver alternatives for the company. The use of cost driver combinations as an alternative is also studied. The study is conducted as part of the case company's applied ABC project, using statistical research as the main research method, supported by a theoretical, literature-based method. The main research tools featured in the study are simple and multiple regression analyses, which, together with a practicality analysis based on the literature and observations, form the basis for the advanced methods. The results suggest that the most appropriate cost driver alternatives are delivery drops and internal delivery weight. Using cost driver combinations is not recommended, as they do not provide substantially better results while increasing measurement costs, complexity and the burden of use. The use of internal freight cost drivers is also questionable, as the results indicate a weakening trend in their cost allocation capabilities towards the end of the period. Therefore, more research on internal freight cost drivers should be conducted before they are taken into use.
Abstract:
Virtual environments and real-time simulators (VERS) are becoming increasingly important tools in the research and development (R&D) process of non-road mobile machinery (NRMM). Virtual prototyping techniques enable faster and more cost-efficient development of machines than the use of real-life prototypes. High energy efficiency has become an important topic in the world of NRMM because of environmental and economic demands. The objective of this thesis is to develop VERS-based methods for the research and development of NRMM. A process using VERS for assessing the effects of human operators on the life-cycle efficiency of NRMM was developed. Human-in-the-loop simulations were run using an underground mining loader to study the developed process. The simulations were run in the virtual environment of the Laboratory of Intelligent Machines of Lappeenranta University of Technology. A physically adequate real-time simulation model of NRMM was shown to be reliable and cost-effective in testing hardware components by means of hardware-in-the-loop (HIL) simulations. A control interface connecting an integrated electro-hydraulic energy converter (IEHEC) with a virtual simulation model of a log crane was developed. The IEHEC consists of a hydraulic pump-motor and an integrated electrical permanent magnet synchronous motor-generator. The results show that state-of-the-art real-time NRMM simulators are capable of resolving factors related to the energy consumption and productivity of NRMM. Significant variation between the test drivers was found. The results show that VERS can be used for assessing human effects on the life-cycle efficiency of NRMM. HIL simulation responses, compared with those achieved with a conventional simulation method, demonstrate the advantages and drawbacks of various possible interfaces between the simulator and the hardware part of the system under study. Novel ideas for arranging the interface are successfully tested and compared with the more traditional one. 
The proposed process for assessing the effects of operators on life-cycle efficiency will be applied to a wider group of operators in the future. The driving styles of the operators can be analysed statistically from a sufficiently large set of result data. The statistical analysis can find the most life-cycle-efficient driving style for the specific environment and machinery. The proposed control interface for HIL simulation needs to be studied further. The robustness and adaptation of the interface in different situations must be verified. Future work will also include studying the suitability of the IEHEC for different working machines using the proposed HIL simulation method.
Abstract:
DNA extraction is a critical step in the analysis of Genetically Modified Organisms based on real-time PCR. In this study, the CTAB and DNeasy methods provided DNA of good quality and quantity from the texturized soy protein, infant formula and soy milk samples. For the Certified Reference Material consisting of 5% Roundup Ready® soybean, neither method yielded DNA of good quality. However, the dilution test applied to the CTAB extracts showed no interference from inhibitory substances. The PCR efficiencies of lectin target amplification were not statistically different, and the coefficients of correlation (R²) demonstrated a high degree of correlation between the copy numbers and the threshold cycle (Ct) values. ANOVA showed an adequate fit of the regression and the absence of significant linear deviations. The efficiencies of the p35S amplification were not statistically different, and all R² values using DNeasy extracts were above 0.98 with no significant linear deviations. Two out of three R² values using CTAB extracts were lower than 0.98, corresponding to a lower degree of correlation, and the lack-of-fit test showed a significant linear deviation in one run. The comparative analysis of the Ct values for the p35S and lectin targets demonstrated no statistically significant differences between the analytical curves of each target.
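The efficiency and R² figures mentioned above are typically read off a standard curve that regresses Ct on the base-10 logarithm of the copy number. The sketch below illustrates that calculation using the common relation E = 10^(-1/slope) - 1, under which a slope near -3.32 corresponds to roughly 100% efficiency. Whether the study used exactly this formula is an assumption, and the function name and dilution series are hypothetical.

```python
import math


def standard_curve(copy_numbers, ct_values):
    """Fit Ct = intercept + slope * log10(copies) by ordinary least squares.

    Returns (slope, intercept, r_squared, efficiency), where efficiency
    uses the common relation E = 10 ** (-1 / slope) - 1 (an assumption
    about how the efficiencies in the abstract were obtained).
    """
    x = [math.log10(c) for c in copy_numbers]
    y = list(ct_values)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    efficiency = 10 ** (-1 / slope) - 1
    return slope, intercept, r_squared, efficiency


# Hypothetical ten-fold dilution series with ideal doubling per cycle.
copies = [10, 100, 1000, 10000]
cts = [40 - math.log2(c) for c in copies]
print(standard_curve(copies, cts))
```

Comparing the slopes (and hence efficiencies) of two such curves is one way to check whether two extraction methods or two targets behave alike, which is the kind of comparison the abstract reports.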
Abstract:
The recent rapid development of biotechnological approaches has enabled the production of large whole genome level biological data sets. In order to handle these data sets, reliable and efficient automated tools and methods for data processing and result interpretation are required. Bioinformatics, as the field of studying and processing biological data, tries to answer this need by combining methods and approaches across computer science, statistics, mathematics and engineering to study and process biological data. The need is also increasing for tools that can be used by the biological researchers themselves who may not have a strong statistical or computational background, which requires creating tools and pipelines with intuitive user interfaces, robust analysis workflows and strong emphasis on result reporting and visualization. Within this thesis, several data analysis tools and methods have been developed for analyzing high-throughput biological data sets. These approaches, covering several aspects of high-throughput data analysis, are specifically aimed for gene expression and genotyping data although in principle they are suitable for analyzing other data types as well. Coherent handling of the data across the various data analysis steps is highly important in order to ensure robust and reliable results. Thus, robust data analysis workflows are also described, putting the developed tools and methods into a wider context. The choice of the correct analysis method may also depend on the properties of the specific data set and therefore guidelines for choosing an optimal method are given. The data analysis tools, methods and workflows developed within this thesis have been applied to several research studies, of which two representative examples are included in the thesis. The first study focuses on spermatogenesis in murine testis and the second one examines cell lineage specification in mouse embryonic stem cells.
Abstract:
Impermeability of the seed coat to water is a common mechanism in Fabaceae seeds. Treatments to overcome hardseededness include scarification with sulphuric acid, scarification on an abrasive surface and soaking in water, among others. The objective of this study was to identify an effective method to overcome dormancy in Dinizia excelsa seeds. A pre-test (untreated seeds) and three experiments were carried out: immersion of seeds in sulphuric acid for 10, 20, 30, 40, 50 and 60 min (experiment 1); scarification on an abrasive surface at the distal end, near the micropyle and on the lateral tissue, and tegument clipping (1 mm) at the distal end, near the micropyle and on the lateral tissue (experiment 2); and scarification on an abrasive surface and immersion in water for 0, 12, 24 and 48 h (experiment 3). The experimental design was completely randomized, with four replications of 50 seeds for each treatment. Statistical analysis was carried out by ANOVA and regression analysis. Seedling emergence from untreated seeds started on the 8th day after sowing and reached 52.5% on the 1,709th day. In general, the treatments to overcome dormancy increased emergence. Emergence was highest for seeds treated with sulphuric acid for 20 and 30 min, at 93.6% and 86.6%, respectively. For seeds scarified on an abrasive surface, the highest emergence was recorded for scarification at the distal end, near the micropyle and on the lateral tissue, at 82.7%, 74.3% and 75.7%, respectively. Seeds scarified manually showed higher emergence when not immersed in water or when immersed for 12 or 24 h (75%, 73.6% and 65.6%, respectively). Immersion of seeds in sulphuric acid for 20 and 30 min and scarification of the distal end on an abrasive surface are effective in overcoming dormancy in D. excelsa.
Abstract:
Pairs trading is an algorithmic trading strategy based on the historical co-movement of two separate assets, in which trades are executed on the basis of the degree of relative mispricing. The purpose of this study is to explore a new, alternative copula-based method for pairs trading. The objective is to find out whether the copula method generates more trading opportunities and higher profits than the more traditional distance and cointegration methods applied extensively in previous empirical studies. The methods are compared by selecting the top five pairs from the stocks of large and medium-sized companies in the Finnish stock market. The research period covers the years 2006-2015. All the methods prove to be profitable, and the Finnish stock market proves suitable for pairs trading. However, the copula method does not generate more trading opportunities or higher profits than the other methods. It seems that the limitations of the more traditional methods are not too restrictive for this particular sample data.
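For orientation, the distance method mentioned above is usually described as ranking candidate pairs by the sum of squared differences between their normalized price paths over a formation period. The sketch below shows that selection step only. It is an illustration under that common definition, not the study's code, and the tickers, prices and function names are hypothetical.

```python
from itertools import combinations


def normalize(prices):
    """Rescale a price series so that it starts at 1."""
    return [p / prices[0] for p in prices]


def squared_distance(prices_a, prices_b):
    """Sum of squared differences between two normalized price paths."""
    a, b = normalize(prices_a), normalize(prices_b)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_pairs(price_series, k=5):
    """Return the k closest pairs as (distance, name_a, name_b) tuples."""
    scored = [
        (squared_distance(price_series[a], price_series[b]), a, b)
        for a, b in combinations(sorted(price_series), 2)
    ]
    return sorted(scored)[:k]


# Hypothetical formation-period prices for three made-up tickers.
series = {"A": [10, 11, 12], "B": [20, 22, 24], "C": [5, 4, 6]}
print(top_pairs(series, k=1))
```

Trading rules (for example, opening a position when the normalized spread widens beyond some threshold) would sit on top of this selection step and are not shown here.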
Abstract:
We study the problem of measuring the uncertainty of CGE (or RBC)-type model simulations associated with parameter uncertainty. We describe two approaches for building confidence sets on model endogenous variables. The first one uses a standard Wald-type statistic. The second approach assumes that a confidence set (sampling or Bayesian) is available for the free parameters, from which confidence sets are derived by a projection technique. The latter has two advantages: first, confidence set validity is not affected by model nonlinearities; second, we can easily build simultaneous confidence intervals for an unlimited number of variables. We study conditions under which these confidence sets take the form of intervals and show they can be implemented using standard methods for solving CGE models. We present an application to a CGE model of the Moroccan economy to study the effects of policy-induced increases of transfers from Moroccan expatriates.
Abstract:
It is well known that standard asymptotic theory is not valid or is extremely unreliable in models with identification problems or weak instruments [Dufour (1997, Econometrica), Staiger and Stock (1997, Econometrica), Wang and Zivot (1998, Econometrica), Stock and Wright (2000, Econometrica), Dufour and Jasiak (2001, International Economic Review)]. One possible way out consists here in using a variant of the Anderson-Rubin (1949, Ann. Math. Stat.) procedure. The latter, however, allows one to build exact tests and confidence sets only for the full vector of the coefficients of the endogenous explanatory variables in a structural equation, which in general does not allow for individual coefficients. This problem may in principle be overcome by using projection techniques [Dufour (1997, Econometrica), Dufour and Jasiak (2001, International Economic Review)]. AR-types are emphasized because they are robust to both weak instruments and instrument exclusion. However, these techniques can be implemented only by using costly numerical techniques. In this paper, we provide a complete analytic solution to the problem of building projection-based confidence sets from Anderson-Rubin-type confidence sets. The latter involves the geometric properties of “quadrics” and can be viewed as an extension of usual confidence intervals and ellipsoids. Only least squares techniques are required for building the confidence intervals. We also study by simulation how “conservative” projection-based confidence sets are. Finally, we illustrate the methods proposed by applying them to three different examples: the relationship between trade and growth in a cross-section of countries, returns to education, and a study of production functions in the U.S. economy.
Abstract:
Natural protein sequences are the net result of the interplay between mutation, natural selection and stochastic drift over evolutionary time. Probabilistic models of molecular evolution that account for these different factors have been substantially improved in recent years. In particular, models that explicitly incorporate protein structure and interdependencies between sites have been proposed, along with statistical tools for evaluating the performance of these models. However, despite significant advances in this direction, only highly simplified representations of protein structure have been used so far. In this context, the general subject of this thesis is the modelling of the three-dimensional structure of proteins, taking into account the practical limitations imposed by the use of phylogenetic methods that are very demanding in computing time. First, a general statistical method is presented for optimizing the parameters of a statistical potential (a pseudo-energy measuring sequence-structure compatibility). The functional form of the potential is then refined by increasing the level of detail in the structural description without increasing the computational cost. Several structural elements are explored: interactions between pairs of residues, solvent accessibility, main-chain conformation and flexibility. The potentials are then included in an evolutionary model, and their performance is evaluated in terms of statistical fit to real data and contrasted with standard evolutionary models. Finally, the resulting structurally constrained model is used to better understand the relationships between gene expression level and the selection and conservation of the corresponding protein sequence.
Abstract:
Background: Obesity in youth is now a public health problem on a global scale. To identify potential targets for population-level prevention strategies, the links between neighbourhood characteristics, youth obesity and lifestyle habits are increasingly being studied. However, research to date contains several inconsistencies. Aim: The general objective of this thesis is to study the contribution of different neighbourhood characteristics to youth obesity and the associated lifestyle habits. The specific objectives are to: 1) examine the associations between the presence of different food outlets in children's residential and school neighbourhoods and their dietary habits; 2) examine how exposure to certain characteristics of the residential neighbourhood determines obesity at the family level (in the child, the mother and the father), as well as individual obesity for each family member; 3) identify combinations of individual, family and residential neighbourhood risk factors that best predict youth obesity, and determine whether these risk factor profiles also predict a change in obesity after a two-year follow-up. Methods: Data come from the QUALITY study, a Quebec cohort of 630 youth aged 8-10 years at time 1 with a history of parental obesity. The neighbourhoods of 512 participants living in the Montreal metropolitan area were characterized using: 1) spatial data from the census and administrative databases, computed for road-network buffer zones centred on the place of residence and the school; and 2) observations conducted by raters in the residential neighbourhood. 
The neighbourhood measures studied relate to characteristics of the built, social and food environments. Obesity was estimated at times 1 and 2 using the body mass index (BMI) calculated from measured weight and height. Dietary habits were measured at time 1 using three dietary recalls. The analyses performed include, among others, generalized estimating equations, multilevel regressions and predictive analyses based on decision trees. Results: The results show associations with youth obesity and dietary habits for certain neighbourhood characteristics. In particular, the presence of convenience stores and fast-food restaurants in the residential and school neighbourhoods is associated with poorer dietary habits. Higher levels of road traffic, as well as low levels of prestige and urbanization in the residential neighbourhood, are associated with family obesity. Finally, the results show that living in an obesogenic neighbourhood, characterized by socioeconomic deprivation, fewer parks and more convenience stores, predicts youth obesity when combined with individual and family risk factors. Conclusion: This thesis contributes to the literature on neighbourhoods and youth obesity by considering both the potential influence of the residential and school neighbourhoods and the influence of the family environment, by using objective methods to characterize the neighbourhood, and by using novel statistical methods. The results further support the notion that obesity prevention efforts must target the multiple risk factors for youth obesity in the built, social and family environments of these youth.
Abstract:
One of the major concerns of scoliosis patients undergoing surgical treatment is the aesthetic aspect of the surgery outcome. It would be useful to predict the postoperative appearance of the patient trunk in the course of a surgery planning process in order to take into account the expectations of the patient. In this paper, we propose to use least squares support vector regression for the prediction of the postoperative trunk 3D shape after spine surgery for adolescent idiopathic scoliosis. Five dimensionality reduction techniques used in conjunction with the support vector machine are compared. The methods are evaluated in terms of their accuracy, based on the leave-one-out cross-validation performed on a database of 141 cases. The results indicate that the 3D shape predictions using a dimensionality reduction obtained by simultaneous decomposition of the predictors and response variables have the best accuracy.
Abstract:
Learning Disability (LD) is a general term that describes specific kinds of learning problems. It is a neurological condition that affects a child's brain and impairs the ability to carry out one or more specific tasks. Children with learning disabilities are neither slow nor intellectually disabled. The disorder can make it difficult for a child to learn as quickly, or in the same way, as a child who is not affected by a learning disability. An affected child can have normal or above-average intelligence. Such children may have difficulty paying attention, with reading or letter recognition, or with mathematics. This does not mean that children who have learning disabilities are less intelligent. In fact, many children who have learning disabilities are more intelligent than the average child. Learning disabilities vary from child to child: one child with LD may not have the same kind of learning problems as another child with LD. There is no cure for learning disabilities, and they are life-long. However, children with LD can be high achievers and can be taught ways to work around the learning disability. In this research work, data mining using machine learning techniques is used to analyze the symptoms of LD, establish interrelationships between them and evaluate the relative importance of these symptoms. To increase the diagnostic accuracy of learning disability prediction, a knowledge-based tool built on statistical machine learning or data mining techniques, with high accuracy according to the knowledge obtained from clinical information, is proposed. The basic idea of the developed knowledge-based tool is to increase the accuracy of learning disability assessment and reduce the time it takes. Different statistical machine learning techniques in data mining are used in the study. 
Identifying the important parameters of LD prediction using data mining techniques, identifying the hidden relationships between the symptoms of LD and estimating the relative significance of each symptom of LD are also among the objectives of this research work. The developed tool has many advantages over the traditional method of using checklists to determine learning disabilities. To improve the performance of various classifiers, we developed some preprocessing methods for the LD prediction system. A new system based on fuzzy and rough set models is also developed for LD prediction, and here too the importance of preprocessing is studied. A Graphical User Interface (GUI) is designed to provide an integrated knowledge-based tool for predicting LD as well as its degree. The designed tool stores the details of the children in the student database and retrieves their LD reports as and when required. The present study demonstrates the effectiveness of the tool developed on the basis of various machine learning techniques. It also identifies the important parameters of LD and accurately predicts learning disability in school-age children. This thesis makes several major contributions in technical, general and social areas. The results are found to be very beneficial to parents, teachers and institutions, who are able to diagnose a child's problem at an early stage and seek the proper treatment or counselling at the right time, so as to avoid academic and social losses.
Abstract:
Treating e-mail filtering as a binary text classification problem, researchers have applied several statistical learning algorithms to e-mail corpora with promising results. This paper examines the performance of a Naive Bayes classifier using different approaches to feature selection and tokenization on different e-mail corpora.
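As a rough illustration of the kind of classifier discussed, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing and whitespace tokenization. The paper's actual tokenizers, feature selection schemes and corpora are not specified here, so the class name, the tokenizer and the four toy messages are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Simplest possible tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(doc):
                if word in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log(
                        (count + 1) / (total_words + len(self.vocab))
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus.
docs = ["win money now", "cheap money offer",
        "meeting agenda attached", "lunch meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(docs, labels)
print(model.predict("cheap money"))
```

Swapping `tokenize` for a different tokenizer, or restricting `vocab` to a selected subset of features, is where the tokenization and feature selection choices compared in the paper would enter.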
Abstract:
This paper compares statistical techniques for paraphrase identification with a semantic technique. The statistical techniques used for comparison are word-set and word-order based methods, whereas the semantic technique is the WordNet similarity matrix method described by Stevenson and Fernando in [3].
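A word-set method can be sketched as an overlap score between the sets of words in two sentences. The example below uses the Jaccard coefficient as one plausible instance. The paper's exact word-set and word-order measures are not given here, so this choice and the function name are assumptions.

```python
def word_set_similarity(sentence_a, sentence_b):
    """Jaccard overlap between the word sets of two sentences.

    Assumption: lowercase whitespace tokens stand in for whatever
    tokenization the paper actually uses.
    """
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty sentences are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical sentence pair.
print(word_set_similarity("the cat sat", "the cat slept"))
```

A pair would then be labelled a paraphrase when the score exceeds a chosen threshold. Because a score like this ignores meaning and word order, it cannot recognize synonyms, which is the gap a WordNet-based measure is meant to address.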