979 results for Methods : Statistical
Abstract:
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantitative technology, can be continuous numbers or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Therefore, efficient statistical approaches are crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate vector with a Gaussian distribution. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are assumed to be represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. Those novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimation of shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data, and estimation of population structure from genotype data.
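To make the low-rank covariance statement concrete, the sketch below builds the covariance implied by a linear Gaussian factor model, namely the loadings times their transpose plus a diagonal noise term. The loadings and noise variances are made-up illustrative numbers, and the function name is hypothetical; nothing here is taken from the dissertation itself.

```python
def factor_covariance(loadings, noise_var):
    """Covariance implied by a factor model: Lambda * Lambda^T + diag(psi).

    loadings  : p x k list of lists (p observed variables, k latent factors)
    noise_var : length-p list of variable-specific noise variances
    """
    p = len(loadings)
    k = len(loadings[0])
    cov = [
        [sum(loadings[i][f] * loadings[j][f] for f in range(k)) for j in range(p)]
        for i in range(p)
    ]
    for i in range(p):
        cov[i][i] += noise_var[i]  # the diagonal part is the only full-rank piece
    return cov


# Hypothetical example: four observed variables driven by a single latent factor.
LOADINGS = [[0.9], [0.8], [0.5], [0.3]]
NOISE = [0.19, 0.36, 0.75, 0.91]
SIGMA = factor_covariance(LOADINGS, NOISE)

for row in SIGMA:
    print([round(v, 3) for v in row])
```

With one factor, every off-diagonal entry is the product of two loadings, so the 4 x 4 covariance is described by 4 loadings plus 4 noise variances rather than 10 free entries.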
Abstract:
A certain type of bacterial inclusion, known as a bacterial microcompartment, was recently identified and imaged through cryo-electron tomography. A 3D object reconstructed from single-axis, limited-angle tilt-series cryo-electron tomography contains missing regions; this is known as the missing wedge problem. Because of the missing regions in the reconstructed images, analyzing their 3D structures is challenging. Existing methods overcome this problem by aligning and averaging several similarly shaped objects. These schemes work well if the objects are symmetric and several objects with similar shapes and sizes are available. Since the bacterial inclusions studied here are not symmetric, are deformed, and show a wide range of shapes and sizes, the existing approaches are not appropriate. This research develops new statistical methods for analyzing geometric properties of these bacterial inclusions, such as volume, symmetry, aspect ratio, and polyhedral structure, in the presence of missing data. These methods work with deformed, non-symmetric objects of varied shape and do not require multiple objects to handle the missing wedge problem. The developed methods and contributions include: (a) an improved method for manual image segmentation, (b) a new approach to 'complete' the segmented and reconstructed incomplete 3D images, (c) a polyhedral structural distance model to predict the polyhedral shapes of these microstructures, (d) a new shape descriptor for polyhedral shapes, named the polyhedron profile statistic, and (e) classifiers based on the Bayes classifier, linear discriminant analysis and support vector machines for supervised classification of incomplete polyhedral shapes. Finally, the predicted 3D shapes for these bacterial microstructures belong to the Johnson solids family, and these shapes, along with their other geometric properties, are important for a better understanding of their chemical and biological characteristics.
Abstract:
Thesis (Master's)--University of Washington, 2016-08
Abstract:
This dissertation proposes statistical methods to formulate, estimate and apply complex transportation models. Two main problems are part of the analyses conducted and presented in this dissertation. The first method solves an econometric problem and is concerned with the joint estimation of models that contain both discrete and continuous decision variables. The use of ordered models along with a regression is proposed and their effectiveness is evaluated with respect to unordered models. Procedures to calculate and optimize the log-likelihood functions of both discrete-continuous approaches are derived, and difficulties associated with the estimation of unordered models are explained. Numerical approximation methods based on the Genz algorithm are implemented in order to solve the multidimensional integral associated with the unordered modeling structure. The problems deriving from the lack of smoothness of the probit model around the maximum of the log-likelihood function, which makes the optimization and the calculation of standard deviations very difficult, are carefully analyzed. A methodology to perform out-of-sample validation in the context of a joint model is proposed. Comprehensive numerical experiments have been conducted on both simulated and real data. In particular, the discrete-continuous models are estimated and applied to vehicle ownership and use models on data extracted from the 2009 National Household Travel Survey. The second part of this work offers a comprehensive statistical analysis of free-flow speed distribution; the method is applied to data collected on a sample of roads in Italy. A linear mixed model that includes speed quantiles in its predictors is estimated. Results show that there is no road effect in the analysis of free-flow speeds, which is particularly important for model transferability.
A very general framework to predict random effects with few observations and incomplete access to model covariates is formulated and applied to predict the distribution of free-flow speed quantiles. The speed distribution of most road sections is successfully predicted; jack-knife estimates are calculated and used to explain why some sections are poorly predicted. Finally, this work contributes to the literature in transportation modeling by proposing econometric model formulations for discrete-continuous variables, more efficient methods for the calculation of multivariate normal probabilities, and random effects models for free-flow speed estimation that take into account the survey design. All methods are rigorously validated on both real and simulated data.
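The multivariate normal probabilities mentioned above are integrals with no closed form in general. As a rough illustration of the kind of quantity involved, the sketch below estimates a bivariate normal orthant probability by plain Monte Carlo and checks it against the known closed form for two dimensions. This is not the Genz algorithm used in the dissertation; it is a simpler stand-in, and the function names and the correlation value are assumptions chosen for illustration.

```python
import math
import random


def orthant_probability_mc(rho, n_draws=100_000, seed=0):
    """Plain Monte Carlo estimate of P(X1 <= 0, X2 <= 0) for a standard
    bivariate normal with correlation rho (not the Genz algorithm)."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(n_draws):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x1 = z1
        x2 = rho * z1 + scale * z2  # Cholesky construction of a correlated pair
        if x1 <= 0.0 and x2 <= 0.0:
            hits += 1
    return hits / n_draws


def orthant_probability_exact(rho):
    """Closed form available only in the bivariate case: 1/4 + asin(rho) / (2*pi)."""
    return 0.25 + math.asin(rho) / (2.0 * math.pi)


if __name__ == "__main__":
    rho = 0.5  # hypothetical correlation
    print(orthant_probability_mc(rho), orthant_probability_exact(rho))
```

In higher dimensions no such closed form exists and naive sampling converges slowly, which is the usual motivation for more efficient integration schemes.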
Abstract:
Purpose: To develop an effective method for evaluating the quality of Cortex berberidis from different geographical origins. Methods: A simple, precise and accurate high performance liquid chromatography (HPLC) method was first developed for simultaneous quantification of four active alkaloids (magnoflorine, jatrorrhizine, palmatine, and berberine) in Cortex berberidis obtained from Qinghai, Tibet and Sichuan Provinces of China. Method validation was performed in terms of precision, repeatability, stability, accuracy, and linearity. In addition, partial least squares discriminant analysis (PLS-DA) and one-way analysis of variance (ANOVA) were applied to study the quality variations of Cortex berberidis from various geographical origins. Results: The proposed HPLC method showed good linearity, precision, repeatability, and accuracy. The four alkaloids were detected in all samples of Cortex berberidis. Among them, magnoflorine (36.46 - 87.30 mg/g) consistently showed the highest amounts in all the samples, followed by berberine (16.00 - 37.50 mg/g). The content varied in the range of 0.66 - 4.57 mg/g for palmatine and 1.53 - 16.26 mg/g for jatrorrhizine. The total content of the four alkaloids ranged from 67.62 to 114.79 mg/g. Moreover, the results obtained by PLS-DA and ANOVA showed that the magnoflorine level and the total content of these four alkaloids in Qinghai and Tibet samples were significantly higher (p < 0.01) than those in Sichuan samples. Conclusion: Quantification of multiple ingredients by HPLC combined with statistical methods provides an effective approach for achieving origin discrimination and quality evaluation of Cortex berberidis. The quality of Cortex berberidis closely correlates with the geographical origin of the samples, with samples from Qinghai and Tibet exhibiting superior quality to those from Sichuan.
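The one-way ANOVA step compares between-origin variation with within-origin variation. A minimal sketch of the F statistic follows; the three groups of values are invented placeholders, not measurements from the study, and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


# Hypothetical alkaloid contents (mg/g) for three origins, for illustration only.
ORIGIN_A = [80.1, 78.4, 83.0]
ORIGIN_B = [76.5, 81.2, 79.9]
ORIGIN_C = [41.3, 45.0, 39.8]
print(one_way_anova_f([ORIGIN_A, ORIGIN_B, ORIGIN_C]))
```

The resulting statistic would then be compared with an F distribution with (groups - 1, observations - groups) degrees of freedom to obtain a p-value.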
Abstract:
The enamel microabrasion technique consists of selectively abrading discolored areas or causing superficial structural changes in a selective way. In the microabrasion technique, abrasive products associated with acids are used, and the evaluation of enamel roughness after this treatment, as well as surface polishing, is necessary. This in vitro study evaluated enamel roughness after microabrasion followed by different polishing techniques. Roughness analyses were performed before microabrasion (L1), after microabrasion (L2), and after polishing (L3). Sixty bovine incisor teeth were selected and divided into two groups (n=30): G1 - phosphoric acid (37%) (Dentsply) and pumice; G2 - hydrochloric acid (6.6%) associated with silicon carbide (Opalustre - Ultradent). Thereafter, the groups were divided into three sub-groups (n=10), according to the polishing system: A - fine and superfine granulation aluminum oxide discs (SofLex 3M); B - diamond paste (FGM) associated with felt discs (FGM); C - silicone tips (Enhance - Dentsply). A PROC MIXED procedure was applied after exploratory data analysis, as well as the Tukey-Kramer test (5%). No statistical differences were found between groups G1 and G2. L2 differed statistically from L1 and showed greater roughness. Differences in post-polishing roughness arose for specific groups (1A, 2B, and 1C), which demonstrated less roughness in L3 and differed statistically from L2 in the polishing system. All products increased enamel roughness, and the effectiveness of the polishing systems was dependent upon the abrasive used.
Abstract:
Background: Head and neck squamous cell carcinoma (HNSCC) is one of the most common malignancies in humans. The average 5-year survival rate is one of the lowest among aggressive cancers, showing no significant improvement in recent years. When detected early, HNSCC has a good prognosis, but most patients present metastatic disease at the time of diagnosis, which significantly reduces survival rate. Despite extensive research, no molecular markers are currently available for diagnostic or prognostic purposes. Methods: Aiming to identify differentially-expressed genes involved in laryngeal squamous cell carcinoma (LSCC) development and progression, we generated individual Serial Analysis of Gene Expression (SAGE) libraries from a metastatic and non-metastatic larynx carcinoma, as well as from a normal larynx mucosa sample. Approximately 54,000 unique tags were sequenced in three libraries. Results: Statistical data analysis identified a subset of 1,216 differentially expressed tags between tumor and normal libraries, and 894 differentially expressed tags between metastatic and non-metastatic carcinomas. Three genes displaying differential regulation, one down-regulated (KRT31) and two up-regulated (BST2, MFAP2), as well as one with a non-significant differential expression pattern (GNA15) in our SAGE data were selected for real-time polymerase chain reaction (PCR) in a set of HNSCC samples. Consistent with our statistical analysis, quantitative PCR confirmed the upregulation of BST2 and MFAP2 and the downregulation of KRT31 when samples of HNSCC were compared to tumor-free surgical margins. As expected, GNA15 presented a non-significant differential expression pattern when tumor samples were compared to normal tissues. Conclusion: To the best of our knowledge, this is the first study reporting SAGE data in head and neck squamous cell tumors. Statistical analysis was effective in identifying differentially expressed genes reportedly involved in cancer development. 
The differential expression of a subset of genes was confirmed in additional larynx carcinoma samples and in carcinomas from a distinct head and neck subsite. This result suggests the existence of potential common biomarkers for prognosis and targeted-therapy development in this heterogeneous type of tumor.
Abstract:
This article considers alternative methods to calculate the fair premium rate of crop insurance contracts based on county yields. The premium rate was calculated using parametric and nonparametric approaches to estimate the conditional agricultural yield density. These methods were applied to a data set of county yields provided by the Brazilian Institute of Geography and Statistics (IBGE), for the period of 1990 through 2002, for soybean, corn and wheat, in the State of Paraná. In this article, we propose methodological alternatives for pricing crop insurance contracts that result in more accurate premium rates in a situation of limited data.
Abstract:
Stability-indicating high-performance liquid chromatographic (HPLC) and second-order derivative spectrophotometric (UVDS) analytical methods were validated and compared for the determination of simvastatin in tablets. The HPLC method was performed with isocratic elution using a C18 column and a mobile phase composed of methanol:acetonitrile:water (60:20:20, v/v/v) at a flow rate of 1.0 ml/min. Detection was made at 239 nm. In the UVDS method, methanol and water were used in the first dilution, and distilled water was used in consecutive dilutions and as background. The second-order derivative signal measurement was taken at 255 nm. Analytical curves showed correlation coefficients > 0.999 for both methods. The quantitation limits (QL) were 2.41 µg/ml for HPLC and 0.45 µg/ml for UVDS. Intra- and inter-day relative standard deviations were < 2.0 %. In the statistical analysis, the t- and F-test statistics did not exceed their critical values, demonstrating that there is no significant difference between the two methods at the 95 % confidence level.
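A method comparison of this kind typically computes an F ratio of the two variances and a pooled two-sample t statistic, then compares each with its tabulated critical value. The sketch below shows those two statistics under that assumption; the assay values are hypothetical placeholders (the abstract does not report raw data), and the function names are illustrative.

```python
import math
import statistics


def f_ratio(sample_a, sample_b):
    """Ratio of the larger to the smaller sample variance (always >= 1)."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    return max(var_a, var_b) / min(var_a, var_b)


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic assuming equal variances (pooled estimate)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    return mean_diff / math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))


# Hypothetical assay results (% of label claim), for illustration only.
HPLC_ASSAY = [99.8, 100.4, 99.5, 100.9, 100.1]
UVDS_ASSAY = [100.3, 99.6, 100.8, 99.9, 100.5]

print("F =", f_ratio(HPLC_ASSAY, UVDS_ASSAY))
print("t =", pooled_t_statistic(HPLC_ASSAY, UVDS_ASSAY))
```

If both statistics fall below the critical values for the chosen confidence level and degrees of freedom, the variances and means of the two methods are not distinguishable at that level.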
Abstract:
Hydrodynamic studies were conducted in a semi-cylindrical spouted bed column of diameter 150 mm, height 1000 mm, conical base included angle of 60 degrees and inlet orifice diameter 25 mm. Pressure transducers at several axial positions were used to obtain pressure fluctuation time series with 1.2 and 2.4 mm glass beads at U/U_ms from 0.3 to 1.6, and static bed depths from 150 to 600 mm. The conditions covered several flow regimes (fixed bed, incipient spouting, stable spouting, pulsating spouting, slugging, bubble spouting and fluidization). Images of the system dynamics were also acquired through the transparent walls with a digital camera. The data were analyzed via statistical, mutual information theory, spectral and Hurst's Rescaled Range methods to assess the potential of these methods to characterize the spouting quality. The results indicate that these methods have potential for monitoring spouted bed operation.
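Rescaled range analysis summarises a time series by the range of its cumulative mean-adjusted sums divided by its standard deviation, and estimates the Hurst exponent from how that ratio grows with window length. The sketch below is one common textbook formulation, applied to synthetic white noise as a stand-in for a pressure signal; the window sizes and function names are assumptions, not the settings used in the study.

```python
import math
import random


def rescaled_range(series):
    """R/S statistic: range of cumulative deviations from the mean over the std."""
    n = len(series)
    mean = sum(series) / n
    cumulative, running = [], 0.0
    for value in series:
        running += value - mean
        cumulative.append(running)
    spread = max(cumulative) - min(cumulative)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return spread / std if std > 0 else 0.0


def hurst_exponent(series, window_sizes=(8, 16, 32, 64, 128)):
    """Slope of log(mean R/S) against log(window size), by least squares."""
    xs, ys = [], []
    for w in window_sizes:
        chunks = [series[i:i + w] for i in range(0, len(series) - w + 1, w)]
        values = [v for v in (rescaled_range(c) for c in chunks) if v > 0]
        if values:
            xs.append(math.log(w))
            ys.append(math.log(sum(values) / len(values)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic white noise standing in for a pressure fluctuation series.
_rng = random.Random(1)
NOISE = [_rng.gauss(0.0, 1.0) for _ in range(1024)]
print(hurst_exponent(NOISE))
```

Uncorrelated noise gives an exponent near 0.5 (somewhat higher for short windows), while persistent signals give larger values, which is what makes the statistic a candidate regime indicator.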
Abstract:
The supervised pattern recognition methods K-nearest neighbors (KNN), stepwise discriminant analysis (SDA), and soft independent modelling of class analogy (SIMCA) were employed in this work to investigate the relationship between the molecular structure of 27 cannabinoid compounds and their analgesic activity. Previous analyses using two unsupervised pattern recognition methods (principal component analysis, PCA, and hierarchical cluster analysis, HCA) were performed, and five descriptors were selected as the most relevant for the analgesic activity of the compounds studied: R3 (charge density on the substituent at position C3), Q1 (charge on atom C1), A (surface area), log P (logarithm of the partition coefficient) and MR (molecular refractivity). The supervised pattern recognition methods (SDA, KNN, and SIMCA) were employed in order to construct a reliable model able to predict the analgesic activity of new cannabinoid compounds and to validate our previous study. The results obtained using the SDA, KNN, and SIMCA methods agree perfectly with our previous model. Comparing the SDA, KNN, and SIMCA results with the PCA and HCA ones, we found that all multivariate statistical methods classified the cannabinoid compounds studied into the same three groups: active, moderately active, and inactive.
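Of the three supervised methods, KNN is the simplest to illustrate: a compound is assigned the majority class among its k closest neighbours in descriptor space. The sketch below uses two invented descriptor values per compound and Euclidean distance; the data, the choice of k, and the function name are assumptions for illustration and do not reproduce the study's descriptors or results.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    training : list of (descriptor_tuple, label) pairs
    query    : descriptor tuple of the same length
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-descriptor training set (values are made up).
TRAINING = [
    ((0.0, 0.0), "inactive"),
    ((0.0, 1.0), "inactive"),
    ((5.0, 5.0), "active"),
    ((5.0, 6.0), "active"),
    ((6.0, 5.0), "active"),
]

print(knn_predict(TRAINING, (5.0, 5.5)))
print(knn_predict(TRAINING, (0.0, 0.2)))
```

In practice the descriptors would be scaled to comparable ranges first, since unscaled Euclidean distance lets the largest-valued descriptor dominate the vote.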
Abstract:
OBJECTIVE: To describe variation in all cause and selected cause-specific mortality rates across Australia. METHODS: Mortality and population data for 1997 were obtained from the Australian Bureau of Statistics. All cause and selected cause-specific mortality rates were calculated and directly standardised to the 1997 Australian population in 5-year age groups. Selected major causes of death included cancer, coronary artery disease, cerebrovascular disease, diabetes, accidents and suicide. Rates are reported by statistical division, and State and Territory. RESULTS: All cause age-standardised mortality was 6.98 per 1000 in 1997 and this varied 2-fold from a low in the statistical division of Pilbara, Western Australia (5.78, 95% confidence interval 5.06-6.56), to a high in Northern Territory-excluding Darwin (11.30, 10.67-11.98). Similar mortality variation (all p<0.0001) exists for cancer (1.01-2.23 per 1000) and coronary artery disease (0.99-2.23 per 1000), the two biggest killers. Larger variation (all p<0.0001) exists for cerebrovascular disease (0.7-11.8 per 10,000), diabetes (0.7-6.9 per 10,000), accidents (1.7-7.2 per 10,000) and suicide (0.6-3.8 per 10,000). Less marked variation was observed when analysed by State and Territory, but Northern Territory consistently has the highest age-standardised mortality rates. CONCLUSIONS: Analysed by statistical division, substantial mortality gradients exist across Australia, suggesting an inequitable distribution of the determinants of health. Further research is required to better understand this heterogeneity.
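Direct standardisation weights each age-specific death rate by that age group's share of a standard population, so regions with different age structures can be compared. A minimal sketch follows; the two age groups and all counts are invented for illustration, and the function name is hypothetical.

```python
def directly_standardised_rate(deaths, population, standard_population, per=1000):
    """Age-standardised rate: sum over age groups of
    (age-specific rate) * (standard population share), scaled to `per`."""
    total_standard = sum(standard_population)
    rate = 0.0
    for d, n, s in zip(deaths, population, standard_population):
        age_specific_rate = d / n
        weight = s / total_standard
        rate += age_specific_rate * weight
    return rate * per


# Hypothetical region with two age groups (younger, older).
DEATHS = [10, 50]
POPULATION = [1000, 1000]
STANDARD_POPULATION = [3000, 1000]

print(directly_standardised_rate(DEATHS, POPULATION, STANDARD_POPULATION))
```

Here the crude rate is 30 per 1000, but the standardised rate is 20 per 1000 because the standard population has proportionally fewer people in the older, higher-mortality group.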
Abstract:
Objective: The aim of this article is to propose an integrated framework for extracting and describing patterns of disorders from medical images using a combination of linear discriminant analysis and active contour models. Methods: A multivariate statistical methodology was first used to identify the most discriminating hyperplane separating two groups of images (from healthy controls and patients with schizophrenia) contained in the input data. After this, the present work makes explicit the differences found by the multivariate statistical method by subtracting the discriminant models of controls and patients, weighted by the pooled variance between the two groups. A variational level-set technique was used to segment clusters of these differences. We obtain a label for each anatomical change using the Talairach atlas. Results: In this work all the data were analysed simultaneously rather than assuming a priori regions of interest. As a consequence, by using active contour models, we were able to obtain regions of interest that emerged from the data. The results were evaluated using, as gold standard, well-known facts about the neuroanatomical changes related to schizophrenia. Most of the items in the gold standard were covered in our result set. Conclusions: We argue that such investigation provides a suitable framework for characterising the high complexity of magnetic resonance images in schizophrenia, as the results obtained indicate a high sensitivity rate with respect to the gold standard. (C) 2010 Elsevier B.V. All rights reserved.
Abstract:
The monitoring of infection control indicators including hospital-acquired infections is an established part of quality maintenance programmes in many health-care facilities. However, surveillance data use can be frustrated by the infrequent nature of many infections. Traditional methods of analysis often provide delayed identification of increasing infection occurrence, placing patients at preventable risk. The application of Shewhart, Cumulative Sum (CUSUM) and Exponentially Weighted Moving Average (EWMA) statistical process control charts to the monitoring of indicator infections allows continuous real-time assessment. The Shewhart chart will detect large changes, while CUSUM and EWMA methods are more suited to recognition of small to moderate sustained change. When used together, Shewhart and EWMA methods are ideal for monitoring bacteraemia and multiresistant organism rates. Shewhart and CUSUM charts are suitable for surgical infection surveillance.
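The three chart types differ in how much history each point carries: Shewhart looks at one observation at a time, CUSUM accumulates deviations above a reference value, and EWMA smooths with exponentially decaying weights. The sketch below shows the standard recursions under simple assumptions; the monthly counts, baseline mean, standard deviation, slack and smoothing constant are all hypothetical, and real infection surveillance would choose limits suited to count or rate data.

```python
def shewhart_flags(values, mean, sd, n_sigma=3.0):
    """Flag each observation lying more than n_sigma standard deviations from the mean."""
    return [abs(v - mean) > n_sigma * sd for v in values]


def cusum_upper(values, target, slack):
    """One-sided upper CUSUM: accumulate excess over (target + slack), floored at zero."""
    total, path = 0.0, []
    for v in values:
        total = max(0.0, total + (v - target - slack))
        path.append(total)
    return path


def ewma(values, smoothing, start):
    """Exponentially weighted moving average: z_t = smoothing * x_t + (1 - smoothing) * z_{t-1}."""
    z, path = start, []
    for v in values:
        z = smoothing * v + (1.0 - smoothing) * z
        path.append(z)
    return path


# Hypothetical monthly infection counts with a small sustained rise from month 7.
COUNTS = [2, 3, 2, 1, 3, 2, 4, 4, 5, 4, 5, 4]
BASELINE_MEAN, BASELINE_SD = 2.0, 1.0

print(shewhart_flags(COUNTS, BASELINE_MEAN, BASELINE_SD))
print(cusum_upper(COUNTS, BASELINE_MEAN, slack=0.5))
print(ewma(COUNTS, smoothing=0.2, start=BASELINE_MEAN))
```

In this made-up series no single month exceeds the 3-sigma Shewhart limit, yet the CUSUM and EWMA paths climb steadily after month 7, which matches the abstract's point that those charts suit small to moderate sustained change.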