931 resultados para large data sets


Relevância:

90.00% 90.00%

Publicador:

Resumo:

Models of codon evolution have attracted particular interest because of their unique capabilities to detect selection forces and their high fit when applied to sequence evolution. We described here a novel approach for modeling codon evolution, which is based on Kronecker product of matrices. The 61 × 61 codon substitution rate matrix is created using Kronecker product of three 4 × 4 nucleotide substitution matrices, the equilibrium frequency of codons, and the selection rate parameter. The entities of the nucleotide substitution matrices and selection rate are considered as parameters of the model, which are optimized by maximum likelihood. Our fully mechanistic model allows the instantaneous substitution matrix between codons to be fully estimated with only 19 parameters instead of 3,721, by using the biological interdependence existing between positions within codons. We illustrate the properties of our models using computer simulations and assessed its relevance by comparing the AICc measures of our model and other models of codon evolution on simulations and a large range of empirical data sets. We show that our model fits most biological data better compared with the current codon models. Furthermore, the parameters in our model can be interpreted in a similar way as the exchangeability rates found in empirical codon models.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The completion of the sequencing of the mouse genome promises to help predict human genes with greater accuracy. While current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), their specificity is often low, predicting a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions. The accuracy of SGP2 when used to predict genes by comparing the human and mouse genomes is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire genome predictions from ENSEMBL. Results indicate that SGP2 outperforms purely ab initio gene prediction methods. Results also indicate that SGP2 works about as well with 3x shotgun data as it does with fully assembled genomes. SGP2 provides a high enough specificity that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions on both the human and mouse by comparing the genomes of these two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The purpose of the Iowa TOPSpro Data Dictionary is to provide a statewide-standardized set of instructions and definitions for coding Tracking Of Programs And Students (TOPSpro) forms and effectively utilizing the TOPSpro software. This document is designed to serve as a companion document to the TOPS Technical Manual produced by the Comprehensive Adult Student Assessment System (CASAS). The data dictionary integrates information from various data systems to provide uniform data sets and definitions that meet local, state and federal reporting mandates. The sources for the data dictionary are: (1) the National Reporting System (NRS) Guidelines, (2) standard practices utilized in Iowa’s adult literacy program, (3) selected definitions from the Workforce Investment Act of 1998, (4) input from the state level Management Information System (MIS) personnel, and (5) selected definitions from other Iowa state agencies.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Models incorporating more realistic models of customer behavior, as customers choosing froman offer set, have recently become popular in assortment optimization and revenue management.The dynamic program for these models is intractable and approximated by a deterministiclinear program called the CDLP which has an exponential number of columns. However, whenthe segment consideration sets overlap, the CDLP is difficult to solve. Column generationhas been proposed but finding an entering column has been shown to be NP-hard. In thispaper we propose a new approach called SDCP to solving CDLP based on segments and theirconsideration sets. SDCP is a relaxation of CDLP and hence forms a looser upper bound onthe dynamic program but coincides with CDLP for the case of non-overlapping segments. Ifthe number of elements in a consideration set for a segment is not very large (SDCP) can beapplied to any discrete-choice model of consumer behavior. We tighten the SDCP bound by(i) simulations, called the randomized concave programming (RCP) method, and (ii) by addingcuts to a recent compact formulation of the problem for a latent multinomial-choice model ofdemand (SBLP+). This latter approach turns out to be very effective, essentially obtainingCDLP value, and excellent revenue performance in simulations, even for overlapping segments.By formulating the problem as a separation problem, we give insight into why CDLP is easyfor the MNL with non-overlapping considerations sets and why generalizations of MNL posedifficulties. We perform numerical simulations to determine the revenue performance of all themethods on reference data sets in the literature.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

BACKGROUND CONTEXT: Studies involving factor analysis (FA) of the items in the North American Spine Society (NASS) outcome assessment instrument have revealed inconsistent factor structures for the individual items. PURPOSE: This study examined whether the factor structure of the NASS varied in relation to the severity of the back/neck problem and differed from that originally recommended by the developers of the questionnaire, by analyzing data before and after surgery in a large series of patients undergoing lumbar or cervical disc arthroplasty. STUDY DESIGN/SETTING: Prospective multicenter observational case series. PATIENT SAMPLE: Three hundred ninety-one patients with low back pain and 553 patients with neck pain completed questionnaires preoperatively and again at 3 to 6 and 12 months follow-ups (FUs), in connection with the SWISSspine disc arthroplasty registry. OUTCOME MEASURES: North American Spine Society outcome assessment instrument. METHODS: First, an exploratory FA without a priori assumptions and subsequently a confirmatory FA were performed on the 17 items of the NASS-lumbar and 19 items of the NASS-cervical collected at each assessment time point. The item-loading invariance was tested in the German version of the questionnaire for baseline and FU. RESULTS: Both NASS-lumbar and NASS-cervical factor structures differed between baseline and postoperative data sets. The confirmatory analysis and item-loading invariance showed better fit for a three-factor (3F) structure for NASS-lumbar, containing items on "disability," "back pain," and "radiating pain, numbness, and weakness (leg/foot)" and for a 5F structure for NASS-cervical including disability, "neck pain," "radiating pain and numbness (arm/hand)," "weakness (arm/hand)," and "motor deficit (legs)." CONCLUSIONS: The best-fitting factor structure at both baseline and FU was selected for both the lumbar- and cervical-NASS questionnaires. It differed from that proposed by the originators of the NASS instruments. Although the NASS questionnaire represents a valid outcome measure for degenerative spine diseases, it is able to distinguish among all major symptom domains (factors) in patients undergoing lumbar and cervical disc arthroplasty; overall, the item structure could be improved. Any potential revision of the NASS should consider its factorial structure; factorial invariance over time should be aimed for, to allow for more precise interpretations of treatment success.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

In our analysis we try and recover the wage loss from unemploymentin Spain and see how it is affected by previous unemploymentexperience, unemployment duration, eligibility for unemploymentbenefits, and previous wages. We also study its variations acrossgroups. Our main conclusion is that while there is some evidencethat labour market rigidities tend to lower it, the wage loss ofdisplaced workers is remarkably high: more than 30%, that is,twice the equivalent figure for the US and France. Wages in Spainsuffer from a serious mismeasurement problems that we do our best tocontrol, so that our results are less robust than the ones thatwould be obtained with better data sets. However, they indicate a large level of wage flexibility in Spain.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

1. Aim - Concerns over how global change will influence species distributions, in conjunction with increased emphasis on understanding niche dynamics in evolutionary and community contexts, highlight the growing need for robust methods to quantify niche differences between or within taxa. We propose a statistical framework to describe and compare environmental niches from occurrence and spatial environmental data.¦2. Location - Europe, North America, South America¦3. Methods - The framework applies kernel smoothers to densities of species occurrence in gridded environmental space to calculate metrics of niche overlap and test hypotheses regarding niche conservatism. We use this framework and simulated species with predefined distributions and amounts of niche overlap to evaluate several ordination and species distribution modeling techniques for quantifying niche overlap. We illustrate the approach with data on two well-studied invasive species.¦4. Results - We show that niche overlap can be accurately detected with the framework when variables driving the distributions are known. The method is robust to known and previously undocumented biases related to the dependence of species occurrences on the frequency of environmental conditions that occur across geographic space. The use of a kernel smoother makes the process of moving from geographical space to multivariate environmental space independent of both sampling effort and arbitrary choice of resolution in environmental space. However, the use of ordination and species distribution model techniques for selecting, combining and weighting variables on which niche overlap is calculated provide contrasting results.¦5. Main conclusions - The framework meets the increasing need for robust methods to quantify niche differences. It is appropriate to study niche differences between species, subspecies or intraspecific lineages that differ in their geographical distributions. Alternatively, it can be used to measure the degree to which the environmental niche of a species or intraspecific lineage has changed over time.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

A biplot, which is the multivariate generalization of the two-variable scatterplot, can be used to visualize the results of many multivariate techniques, especially those that are based on the singular value decomposition. We consider data sets consisting of continuous-scale measurements, their fuzzy coding and the biplots that visualize them, using a fuzzy version of multiple correspondence analysis. Of special interest is the way quality of fit of the biplot is measured, since it is well-known that regular (i.e., crisp) multiple correspondence analysis seriously under-estimates this measure. We show how the results of fuzzy multiple correspondence analysis can be defuzzified to obtain estimated values of the original data, and prove that this implies an orthogonal decomposition of variance. This permits a measure of fit to be calculated in the familiar form of a percentage of explained variance, which is directly comparable to the corresponding fit measure used in principal component analysis of the original data. The approach is motivated initially by its application to a simulated data set, showing how the fuzzy approach can lead to diagnosing nonlinear relationships, and finally it is applied to a real set of meteorological data.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Self-reported home values are widely used as a measure of housing wealth by researchers employing a variety of data sets and studying a number of different individual and household level decisions. The accuracy of this measure is an open empirical question, and requires some type of market assessment of the values reported. In this research, we study the predictive power of self-reported housing wealth when estimating sales prices utilizing the Health and Retirement Study. We find that homeowners, on average, overestimate the value of their properties by between 5% and 10%. More importantly, we are the first to document a strong correlation between accuracy and the economic conditions at the time of the purchase of the property (measured by the prevalent interest rate, the growth of household income, and the growth of median housing prices). While most individuals overestimate the value of their properties, those who bought during more difficult economic times tend to be more accurate, and in some cases even underestimate the value of their house. These results establish a surprisingly strong, likely permanent, and in many cases long-lived, effect of the initial conditions surrounding the purchases of properties, on how individuals value them. This cyclicality of the overestimation of house prices can provide some explanations for the difficulties currently faced by many homeowners, who were expecting large appreciations in home value to rescue them in case of increases in interest rates which could jeopardize their ability to live up to their financial commitments.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The influence of the basis set size and the correlation energy in the static electrical properties of the CO molecule is assessed. In particular, we have studied both the nuclear relaxation and the vibrational contributions to the static molecular electrical properties, the vibrational Stark effect (VSE) and the vibrational intensity effect (VIE). From a mathematical point of view, when a static and uniform electric field is applied to a molecule, the energy of this system can be expressed in terms of a double power series with respect to the bond length and to the field strength. From the power series expansion of the potential energy, field-dependent expressions for the equilibrium geometry, for the potential energy and for the force constant are obtained. The nuclear relaxation and vibrational contributions to the molecular electrical properties are analyzed in terms of the derivatives of the electronic molecular properties. In general, the results presented show that accurate inclusion of the correlation energy and large basis sets are needed to calculate the molecular electrical properties and their derivatives with respect to either nuclear displacements or/and field strength. With respect to experimental data, the calculated power series coefficients are overestimated by the SCF, CISD, and QCISD methods. On the contrary, perturbation methods (MP2 and MP4) tend to underestimate them. In average and using the 6-311 + G(3df) basis set and for the CO molecule, the nuclear relaxation and the vibrational contributions to the molecular electrical properties amount to 11.7%, 3.3%, and 69.7% of the purely electronic μ, α, and β values, respectively

Relevância:

90.00% 90.00%

Publicador:

Resumo:

OBJECTIVE: As part of the WHO ICD-11 development initiative, the Topic Advisory Group on Quality and Safety explores meta-features of morbidity data sets, such as the optimal number of secondary diagnosis fields. DESIGN: The Health Care Quality Indicators Project of the Organization for Economic Co-Operation and Development collected Patient Safety Indicator (PSI) information from administrative hospital data of 19-20 countries in 2009 and 2011. We investigated whether three countries that expanded their data systems to include more secondary diagnosis fields showed increased PSI rates compared with six countries that did not. Furthermore, administrative hospital data from six of these countries and two American states, California (2011) and Florida (2010), were analysed for distributions of coded patient safety events across diagnosis fields. RESULTS: Among the participating countries, increasing the number of diagnosis fields was not associated with any overall increase in PSI rates. However, high proportions of PSI-related diagnoses appeared beyond the sixth secondary diagnosis field. The distribution of three PSI-related ICD codes was similar in California and Florida: 89-90% of central venous catheter infections and 97-99% of retained foreign bodies and accidental punctures or lacerations were captured within 15 secondary diagnosis fields. CONCLUSIONS: Six to nine secondary diagnosis fields are inadequate for comparing complication rates using hospital administrative data; at least 15 (and perhaps more with ICD-11) are recommended to fully characterize clinical outcomes. Increasing the number of fields should improve the international and intra-national comparability of data for epidemiologic and health services research, utilization analyses and quality of care assessment.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies. FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects. DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects. METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, is performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

ABSTRACT: BACKGROUND: Many studies have been published outlining the global effects of 17 beta-estradiol (E2) on gene expression in human epithelial breast cancer derived MCF-7 cells. These studies show large variation in results, reporting between ~100 and ~1500 genes regulated by E2, with poor overlap. RESULTS: We performed a meta-analysis of these expression studies, using the Rank product method to obtain a more accurate and stable list of the differentially expressed genes, and of pathways regulated by E2. We analyzed 9 time-series data sets, concentrating on response at 3-4 hrs (early) and at 24 hrs (late). We found >1000 statistically significant probe sets after correction for multiple testing at 3-4 hrs, and >2000 significant probe sets at 24 hrs. Differentially expressed genes were examined by pathway analysis. This revealed 15 early response pathways, mostly related to cell signaling and proliferation, and 20 late response pathways, mostly related to breast cancer, cell division, DNA repair and recombination. CONCLUSIONS: Our results show that meta-analysis identified more differentially expressed genes than the individual studies, and that these genes act together in networks. These results provide new insight into E2 regulated mechanisms, especially in the context of breast cancer.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper reports the results from a second characterisation of the 91500 zircon, including data from electron probe microanalysis, laser ablation inductively coupled plasma-mass spectrometry (LA-ICP-MS), secondary ion mass spectrometry (SIMS) and laser fluorination analyses. The focus of this initiative was to establish the suitability of this large single zircon crystal for calibrating in situ analyses of the rare earth elements and oxygen isotopes, as well as to provide working values for key geochemical systems. In addition to extensive testing of the chemical and structural homogeneity of this sample, the occurrence of banding in 91500 in both backscattered electron and cathodoluminescence images is described in detail. Blind intercomparison data reported by both LA-ICP-MS and SIMS laboratories indicate that only small systematic differences exist between the data sets provided by these two techniques. Furthermore, the use of NIST SRM 610 glass as the calibrant for SIMS analyses was found to introduce little or no systematic error into the results for zircon. Based on both laser fluorination and SIMS data, zircon 91500 seems to be very well suited for calibrating in situ oxygen isotopic analyses.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

SUMMARY: ExpressionView is an R package that provides an interactive graphical environment to explore transcription modules identified in gene expression data. A sophisticated ordering algorithm is used to present the modules with the expression in a visually appealing layout that provides an intuitive summary of the results. From this overview, the user can select individual modules and access biologically relevant metadata associated with them. AVAILABILITY: http://www.unil.ch/cbg/ExpressionView. Screenshots, tutorials and sample data sets can be found on the ExpressionView web site.