30 results for Multivariate data
Abstract:
The monitoring of multivariate systems that exhibit non-Gaussian behavior is addressed. Existing work advocates the use of independent component analysis (ICA) to extract the underlying non-Gaussian data structure. Since some of the source signals may be Gaussian, the use of principal component analysis (PCA) is proposed to capture the Gaussian and non-Gaussian source signals. A subsequent application of ICA then allows the extraction of non-Gaussian components from the retained principal components (PCs). A further contribution is the utilization of a support vector data description to determine a confidence limit for the non-Gaussian components. Finally, a statistical test is developed for determining how many non-Gaussian components are encapsulated within the retained PCs, and associated monitoring statistics are defined. The utility of the proposed scheme is demonstrated by a simulation example and by the analysis of recorded data from an industrial melter.
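The PCA-then-ICA pipeline described in this abstract can be sketched compactly. The following is a minimal illustration using scikit-learn, not the authors' implementation: the data, dimensions and component counts are invented, and OneClassSVM (which is closely related to SVDD) stands in for the support vector data description.

```python
# Minimal sketch of the PCA-then-ICA monitoring idea; all settings illustrative.
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # stand-in for process data (samples x variables)

# Step 1: PCA retains PCs that capture both Gaussian and non-Gaussian sources.
scores = PCA(n_components=5).fit_transform(X)

# Step 2: ICA extracts non-Gaussian components from the retained PCs.
ica = FastICA(n_components=3, random_state=0)
s = ica.fit_transform(scores)         # estimated non-Gaussian components

# Step 3: a one-class boundary around the non-Gaussian components acts as a
# data-driven confidence limit for flagging abnormal observations.
svdd = OneClassSVM(nu=0.01, gamma="scale").fit(s)
flags = svdd.predict(s)               # +1 inside the limit, -1 outside
print("fraction flagged:", (flags == -1).mean())
```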
Abstract:
Flutter prediction as currently practiced is usually deterministic, with a single structural model used to represent an aircraft. By using interval analysis to take into account structural variability, recent work has demonstrated that small changes in the structure can lead to very large changes in the altitude at which flutter occurs (Marques, Badcock, et al., J. Aircraft, 2010). In this follow-up work we examine the same phenomenon using probabilistic collocation (PC), an uncertainty quantification technique which can efficiently propagate multivariate stochastic input through a simulation code, in this case an eigenvalue-based fluid-structure stability code. The resulting analysis predicts the consequences of an uncertain structure on the incidence of flutter in probabilistic terms; this information could be useful in planning flight tests and assessing the risk of structural failure. The uncertainty in flutter altitude is confirmed to be substantial. Assuming that the structural uncertainty represents an epistemic uncertainty regarding the structure, it may be reduced with the availability of additional information, for example aeroelastic response data from a flight test. Such data are used to update the structural uncertainty using Bayes' theorem. The consequent flutter uncertainty is significantly reduced across the entire Mach number range.
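The prior-to-posterior step in this abstract can be illustrated schematically. In the Python sketch below, plain grid-based Bayes and Monte Carlo sampling stand in for probabilistic collocation, and a toy flutter_altitude function stands in for the eigenvalue-based stability code; every name, number and unit is an invented assumption.

```python
# Schematic sketch: updating structural uncertainty with a flight-test datum
# and propagating prior vs posterior through a toy flutter response.
import numpy as np

rng = np.random.default_rng(1)

def flutter_altitude(stiffness):
    # Hypothetical smooth response of flutter altitude to a structural parameter.
    return 10000.0 + 4000.0 * (stiffness - 1.0)

# Prior: epistemic uncertainty about a normalised stiffness parameter.
theta = np.linspace(0.8, 1.2, 401)
prior = np.exp(-0.5 * ((theta - 1.0) / 0.05) ** 2)
prior /= prior.sum()

# Flight-test datum: a noisy observation related to the structural parameter.
obs, sigma = 1.03, 0.01
likelihood = np.exp(-0.5 * ((obs - theta) / sigma) ** 2)

# Bayes' theorem on the grid.
posterior = prior * likelihood
posterior /= posterior.sum()

# Propagate both distributions: the posterior spread in altitude is narrower.
for name, w in [("prior", prior), ("posterior", posterior)]:
    samples = rng.choice(theta, size=5000, p=w)
    alt = flutter_altitude(samples)
    print(f"{name}: altitude mean {alt.mean():.0f}, sd {alt.std():.0f}")
```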
Abstract:
OBJECTIVES: The aim of this study was to examine the co-occurrence of obesity and sleep problems among employees and workplaces. METHODS: We obtained data from 39 873 men and women working in 3040 workplaces in 2000-2002 (the Finnish Public Sector Study). Individual- and workplace-level characteristics were considered as correlates of obesity and sleep problems, which were modelled simultaneously using a multivariate, multilevel approach. RESULTS: Of the participants, 11% were obese and 23% reported sleep problems. We found a correlation between obesity and sleep problems at both the individual (correlation coefficient 0.048, covariance 0.047, standard error (SE) 0.005) and workplace (correlation coefficient 0.619, covariance 0.068, SE 0.011) levels. The latter, but not the former, correlation remained after adjustment for individual- and workplace-level confounders, such as age, sex, socioeconomic status, shift work, alcohol consumption, job strain, and the proportion of temporary employees and manual workers at the workplace. CONCLUSIONS: Obese employees and those with sleep problems tend to cluster in the same workplaces, suggesting that, in addition to targeting individuals at risk, interventions to reduce obesity and sleep problems might benefit from identifying "risky" workplaces.
Abstract:
Animal communities are sensitive to environmental disturbance, and several multivariate methods have recently been developed to detect changes in community structure. The complex taxonomy of soil invertebrates constrains the use of the community level in monitoring environmental changes, since species identification requires expertise and time. However, recent literature data on marine communities indicate that little multivariate information is lost when species data are aggregated to higher taxonomic ranks. In the present paper, this hypothesis was tested on two oribatid mite (Oribatida, Acari) assemblages under two different kinds of disturbance: metal pollution and fires. Results indicate that data sets built at the genus and family ranks can detect the effects of disturbance with little loss of information. This is an encouraging result in view of the use of the community level as a preliminary tool for describing patterns in human-disturbed soil ecosystems.
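The aggregation test at the heart of this abstract can be illustrated informally: pool species counts into higher taxa and check whether the multivariate (dissimilarity) structure is preserved. The Python sketch below uses simulated counts and random family assignments; it is an illustration of the idea, not the paper's analysis.

```python
# Does family-level aggregation preserve the species-level multivariate pattern?
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
abund = rng.poisson(3.0, size=(20, 30))    # 20 samples x 30 species (simulated)
family = rng.integers(0, 8, size=30)       # each species assigned to one of 8 families

# Aggregate species counts to family totals.
agg = np.zeros((20, 8))
for f in range(8):
    agg[:, f] = abund[:, family == f].sum(axis=1)

# Compare the two dissimilarity structures (an informal Mantel-style check).
d_sp = pdist(abund, metric="braycurtis")
d_fam = pdist(agg, metric="braycurtis")
r, _ = pearsonr(d_sp, d_fam)
print(f"species- vs family-level distance correlation: r = {r:.2f}")
```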
Abstract:
The techniques of principal component analysis (PCA) and partial least squares (PLS) are introduced from the point of view of providing a multivariate statistical method for modelling process plants. The advantages and limitations of PCA and PLS are discussed from the perspective of the type of data and problems that might be encountered in this application area. These concepts are exemplified by two case studies, the first dealing with data from a continuous stirred tank reactor (CSTR) simulation and the second with a low-density polyethylene (LDPE) reactor simulation taken from the literature.
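The contrast between the two techniques named here (PCA as an unsupervised summary of the process variables, PLS as latent variables chosen to predict quality variables) is easy to show in code. The following is a minimal scikit-learn sketch on simulated data; the dimensions are illustrative and unrelated to the CSTR or LDPE case studies.

```python
# PCA vs PLS on simulated process data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 12))                                  # process measurements
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(200, 3))  # quality variables

# PCA: unsupervised; components explain variance in X alone.
pca = PCA(n_components=3).fit(X)
print("variance explained:", pca.explained_variance_ratio_.round(2))

# PLS: supervised; latent variables chosen to explain the quality variables Y.
pls = PLSRegression(n_components=3).fit(X, Y)
print("R^2 of PLS fit:", round(pls.score(X, Y), 2))
```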
Abstract:
Model selection between competing models is a key consideration in the discovery of prognostic multigene signatures. The use of appropriate statistical performance measures as well as verification of the biological significance of the signatures is imperative to maximise the chance of external validation of the generated signatures. Current approaches in time-to-event studies often use only a single measure of performance in model selection, such as logrank test p-values, or dichotomise the follow-up times at some phase of the study to facilitate signature discovery. In this study we improve the prognostic signature discovery process through the application of the multivariate partial Cox model combined with the concordance index, hazard ratio of predictions, independence from available clinical covariates, and biological enrichment as measures of signature performance. The proposed framework was applied to discover prognostic multigene signatures from early breast cancer data. The partial Cox model combined with the multiple performance measures was used both to guide the selection of the optimal panel of prognostic genes and to predict risk within cross-validation, without dichotomising the follow-up times at any stage. The signatures were successfully externally validated in independent breast cancer datasets, yielding a hazard ratio of 2.55 [1.44, 4.51] for the top-ranking signature.
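Scoring a candidate signature with a Cox model and the concordance index, as described here, can be sketched briefly. The Python example below uses the lifelines library on simulated data; a standard Cox proportional hazards fit stands in for the multivariate partial Cox model, and all variable names are invented.

```python
# Fit a Cox model to a candidate gene panel and score it with the c-index.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["gene1", "gene2", "gene3"])
risk = 0.8 * df["gene1"] - 0.5 * df["gene2"]
df["time"] = rng.exponential(np.exp(-risk))   # follow-up times, never dichotomised
df["event"] = rng.integers(0, 2, size=n)      # 1 = event observed, 0 = censored

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
# Higher predicted hazard should mean shorter survival, hence the negation.
c = concordance_index(df["time"], -cph.predict_partial_hazard(df), df["event"])
print(f"concordance index: {c:.2f}")
```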
Abstract:
Objective: To examine the differences in the interval between diagnosis and initiation of treatment among women with breast cancer in Northern Ireland.
Design: A cross-sectional observational study.
Setting: All breast cancer care patients in the Northern Ireland Cancer Registry in 2006.
Participants: All women diagnosed and treated for breast cancer in Northern Ireland in 2006.
Main outcome measure: The number of days between diagnosis and initiation of treatment for breast cancer.
Results: The mean (median) interval between diagnosis and initiation of treatment was 19 (15) days among public patients, compared with 14 (12) days among those whose care involved private providers. The differences between individual public providers were as marked as those between the public and private sectors, with means (medians) ranging between 14 (12) and 25 (22) days. Multivariate models revealed that the differences remained evident when a range of patient characteristics, including cancer stage, was controlled for.
Conclusions: A relatively small number of women received care privately in Northern Ireland, but they experienced shorter intervals between diagnosis and initiation of treatment than those who received care wholly in the public system. The variation among public providers was as great as that between the public and private providers. The impact of such differences on survival warrants investigation, particularly in light of the waiting-time targets introduced in Northern Ireland.
Abstract:
Motivation: To date, Gene Set Analysis (GSA) approaches primarily focus on identifying differentially expressed gene sets (pathways). Methods for identifying differentially coexpressed pathways also exist but are mostly based on aggregated pairwise correlations or other pairwise measures of coexpression. Instead, we propose Gene Sets Net Correlations Analysis (GSNCA), a multivariate differential coexpression test that accounts for the complete correlation structure between genes.
Results: In GSNCA, weight factors are assigned to genes in proportion to the genes' cross-correlations (intergene correlations). The problem of finding the weight vectors is formulated as an eigenvector problem with a unique solution. GSNCA tests the null hypothesis that, for a gene set, there is no difference in the weight vectors of the genes between two conditions. In simulation studies and analyses of experimental data, we demonstrate that GSNCA indeed captures changes in the structure of genes' cross-correlations rather than differences in the averaged pairwise correlations. Thus, GSNCA infers differences in coexpression networks while bypassing the method-dependent steps of network inference. As an additional result from GSNCA, we define hub genes as the genes with the largest weights and show that these genes frequently correspond to major and specific pathway regulators, as well as to the genes most affected by the biological difference between two conditions. In summary, GSNCA is a new approach for the analysis of differentially coexpressed pathways that also evaluates the importance of the genes in the pathways, thus providing unique information that may result in the generation of novel biological hypotheses.
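The eigenvector formulation described here admits a compact sketch: the weight vector of a gene set is the dominant eigenvector of the matrix of absolute intergene correlations, and the test statistic compares weight vectors between conditions. The Python code below is a simplified illustration of that idea on simulated data; normalisation and significance-testing details are reduced relative to the published method.

```python
# Simplified GSNCA-style weight computation and between-condition statistic.
import numpy as np

def gsnca_weights(expr):
    # expr: samples x genes expression matrix for one condition.
    r = np.abs(np.corrcoef(expr, rowvar=False))   # absolute intergene correlations
    np.fill_diagonal(r, 0.0)                      # exclude self-correlation
    vals, vecs = np.linalg.eigh(r)
    w = np.abs(vecs[:, -1])                       # eigenvector of the largest eigenvalue
    return w / w.mean()                           # scale so weights average to 1

rng = np.random.default_rng(5)
cond1 = rng.normal(size=(40, 10))
cond2 = rng.normal(size=(40, 10))
w1, w2 = gsnca_weights(cond1), gsnca_weights(cond2)

# Test statistic: total change in the genes' weights between conditions
# (significance would be assessed by permuting sample labels).
stat = np.abs(w1 - w2).sum()
print(f"GSNCA-style statistic: {stat:.2f}")
print("hub gene index (condition 1):", int(w1.argmax()))  # largest-weight gene
```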
Abstract:
We examined variability in hierarchical beta diversity across ecosystems, geographical gradients, and organism groups using multivariate spatial mixed modeling analysis of two independent data sets. The larger data set comprised reported ratios of regional species richness (RSR) to local species richness (LSR), and the second data set consisted of RSR:LSR ratios derived from nested species-area relationships. There was a negative, albeit relatively weak, relationship between beta diversity and latitude. We found only relatively subtle differences in beta diversity among the realms, yet beta diversity was lower in marine systems than in the terrestrial or freshwater realms. Beta diversity varied significantly with organisms' major characteristics, such as body mass, trophic position, and dispersal type, in the larger data set. Organisms that disperse via seeds had the highest beta diversity, and passively dispersed organisms showed the lowest beta diversity. Furthermore, autotrophs had lower beta diversity than organisms higher up the food web; omnivores and carnivores had consistently higher beta diversity. This is evidence that beta diversity is simultaneously controlled by extrinsic factors related to geography and environment, and by intrinsic factors related to organism characteristics.
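The RSR:LSR ratio used as the beta-diversity measure here is simple arithmetic, and a tiny worked example may help. The community matrix below is invented purely for illustration.

```python
# Worked example: beta diversity as regional richness over mean local richness.
import numpy as np

# Presence/absence of 12 species at 4 local sites within one region.
sites = np.array([
    [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
])

rsr = (sites.sum(axis=0) > 0).sum()   # species present anywhere in the region
lsr = sites.sum(axis=1).mean()        # mean richness of a single site
print(f"RSR = {rsr}, mean LSR = {lsr:.2f}, beta = RSR/LSR = {rsr / lsr:.2f}")
```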
Abstract:
Slow-release drugs must be manufactured to meet target specifications with respect to dissolution curve profiles. In this paper we consider the problem of identifying the drivers of dissolution curve variability of a drug from historical manufacturing data. Several data sources are considered: raw material parameters, coating data, loss on drying and pellet size statistics. The methodology employed is to develop predictive models using LASSO, a powerful machine learning algorithm for regression with high-dimensional datasets. LASSO provides sparse solutions, facilitating the identification of the most important causes of variability in the drug fabrication process. The proposed methodology is illustrated using manufacturing data for a slow-release drug.
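The sparsity property that this abstract relies on is easy to demonstrate. The scikit-learn sketch below fits a cross-validated LASSO to simulated data in which only two of forty features matter; the data and feature roles are invented, not the paper's.

```python
# LASSO with cross-validated regularisation recovers a sparse set of drivers.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 40))          # raw-material, coating, drying, pellet features
coef_true = np.zeros(40)
coef_true[[3, 17]] = [1.5, -2.0]        # only two features truly drive the response
y = X @ coef_true + 0.3 * rng.normal(size=120)   # dissolution summary (e.g. % released)

model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)  # most coefficients are exactly zero
print("selected feature indices:", selected)
```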
Abstract:
Biodegradable polymers, such as PLA (polylactide), come from renewable resources like corn starch and, if disposed of correctly, degrade and become harmless to the ecosystem, making them attractive alternatives to petroleum-based polymers. PLA in particular is used in a variety of applications including medical devices, food packaging and waste disposal packaging. However, the industry faces challenges in melt processing of PLA due to its poor thermal stability, which is influenced by processing temperatures and shearing.
Identification and control of suitable processing conditions is extremely challenging, usually relying on trial and error, and is often sensitive to batch-to-batch variations. Off-line assessment in a lab environment can result in high scrap rates, long lead times, and lengthy and expensive process development. Scrap rates are typically in the region of 25-30% for medical-grade PLA, which costs between €2000 and €5000/kg.
Additives are used to enhance material properties, such as mechanical properties, and may also have a therapeutic role in the case of bioresorbable medical devices; for example, the release of calcium from orthopaedic implants such as fixation screws promotes healing. Additives can also reduce costs, as less of the polymer resin is required.
This study investigates the scope for monitoring, modelling and optimising processing conditions for twin screw extrusion of PLA and PLA with calcium carbonate to achieve desired material properties. A DAQ system has been constructed to gather data from a bespoke measurement die, comprising melt temperature, pressure drop along the length of the die, and UV-Vis spectral data which is shown to correlate with filler dispersion. Trials were carried out under a range of processing conditions using a Design of Experiments approach, and samples were tested for mechanical properties, degradation rate and the release rate of calcium. Relationships between the recorded process data and the material characterisation results are explored.
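The final step, relating recorded process data to a measured material property, can be illustrated with a simple linear model. In the Python sketch below, the variables, run count and model form are assumptions for illustration only, not the study's actual analysis.

```python
# Illustrative regression of a material property on recorded process data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
# Columns: melt temperature, pressure drop, UV-Vis dispersion index (standardised).
process = rng.normal(size=(27, 3))                     # e.g. 27 DoE runs
tensile = 50 + process @ np.array([2.0, -1.0, 3.0]) + rng.normal(size=27)

fit = LinearRegression().fit(process, tensile)
print("effect estimates per process variable:", fit.coef_.round(2))
```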
Abstract:
Statistics are regularly used to make some form of comparison between items of trace evidence or to deploy the exclusionary principle (Morgan and Bull, 2007) in forensic investigations. Trace evidence is routinely the result of particle-size, chemical or modal analyses and as such constitutes compositional data. The issue is that compositional data, including percentages, parts per million, etc., carry only relative information. This may be problematic where a comparison of percentages and other constrained (closed) data is deemed a statistically valid and appropriate way to present trace evidence in a court of law. Notwithstanding an awareness of the constant sum problem since the seminal works of Pearson (1896) and Chayes (1960), and the introduction of log-ratio techniques (Aitchison, 1986; Pawlowsky-Glahn and Egozcue, 2001; Pawlowsky-Glahn and Buccianti, 2011; Tolosana-Delgado and van den Boogaart, 2013), the problem that a constant sum destroys the potential independence of the variances and covariances required for correlation and regression analysis and for empirical multivariate methods (principal component analysis, cluster analysis, discriminant analysis, canonical correlation) is all too often not acknowledged in the statistical treatment of trace evidence. Yet the need for a robust treatment of forensic trace evidence analyses is obvious. This research examines the issues and potential pitfalls for forensic investigators if the constant sum constraint is ignored in the analysis and presentation of forensic trace evidence. Forensic case studies involving particle-size and mineral analyses as trace evidence are used to demonstrate a compositional data approach using a centred log-ratio (clr) transformation and multivariate statistical analyses.
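The centred log-ratio transformation named here has a short, standard definition: subtract the log of the geometric mean from the log of each part. The Python sketch below shows it on an invented two-sample mineral composition; it illustrates the transformation itself, not the paper's case studies.

```python
# Centred log-ratio (clr): frees constant-sum data for standard multivariate methods.
import numpy as np

def clr(composition):
    # composition: strictly positive parts (e.g. mineral percentages) per sample row.
    log_x = np.log(composition)
    return log_x - log_x.mean(axis=1, keepdims=True)   # subtract log geometric mean

# Two trace-evidence samples as mineral percentages (rows sum to 100).
samples = np.array([[60.0, 25.0, 10.0, 5.0],
                    [55.0, 30.0, 10.0, 5.0]])
z = clr(samples)
print(z.round(3))               # clr scores are no longer constrained to a constant sum
print(z.sum(axis=1).round(6))   # each row of clr scores sums to ~0
```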
Abstract:
Importance: The natural history of patients with newly diagnosed high-risk nonmetastatic (M0) prostate cancer receiving hormone therapy (HT) either alone or with standard-of-care radiotherapy (RT) is not well documented. Furthermore, no clinical trial has assessed the role of RT in patients with node-positive (N+) M0 disease. The STAMPEDE Trial includes such individuals, allowing an exploratory multivariate analysis of the impact of radical RT.
Objective: To describe survival and the impact on failure-free survival of RT by nodal involvement in these patients.
Design, Setting, and Participants: Cohort study using data collected for patients allocated to the control arm (standard-of-care only) of the STAMPEDE Trial between October 5, 2005, and May 1, 2014. Outcomes are presented as hazard ratios (HRs) with 95% CIs derived from adjusted Cox models; survival estimates are reported at 2 and 5 years. Participants were high-risk, hormone-naive patients with newly diagnosed M0 prostate cancer starting long-term HT for the first time. Radiotherapy is encouraged in this group, but mandated for patients with node-negative (N0) M0 disease only since November 2011.
Exposures: Long-term HT either alone or with RT, as per local standard. Planned RT use was recorded at entry.
Main Outcomes and Measures: Failure-free survival (FFS) and overall survival.
Results: A total of 721 men with newly diagnosed M0 disease were included: median age at entry was 66 years (interquartile range [IQR], 61-72), and the median (IQR) prostate-specific antigen level was 43 (18-88) ng/mL. There were 40 deaths (31 owing to prostate cancer) over a median follow-up of 17 months. Two-year survival was 96% (95% CI, 93%-97%) and 2-year FFS was 77% (95% CI, 73%-81%). Median (IQR) FFS was 63 (26 to not reached) months. Failure-free survival was worse in patients with N+ disease (HR, 2.02 [95% CI, 1.46-2.81]) than in those with N0 disease. Failure-free survival outcomes favored planned use of RT for patients with both N0M0 (HR, 0.33 [95% CI, 0.18-0.61]) and N+M0 disease (HR, 0.48 [95% CI, 0.29-0.79]).
Conclusions and Relevance: Survival for men entering the cohort with high-risk M0 disease was higher than anticipated at study inception. These nonrandomized data were consistent with previous trials that support routine use of RT with HT in patients with N0M0 disease. Additionally, the data suggest that the benefits of RT extend to men with N+M0 disease.
Trial Registration: clinicaltrials.gov Identifier: NCT00268476; ISRCTN78818544.
Abstract:
Administrative systems such as health care registration are of increasing importance in providing information for statistical, research, and policy purposes. There is thus a pressing need to better understand the detailed relationship between population characteristics as recorded in such systems and in conventional censuses. This paper explores these issues using the unique Northern Ireland Longitudinal Study (NILS). It takes the 2001 Census enumeration as a benchmark and analyses the social, demographic and spatial patterns of mismatch with the health register at the individual level. Descriptive comparison is followed by multivariate and multilevel analyses, which show that approximately 25% of individuals are recorded at different addresses and that age, rurality, education, and housing type are all important factors. This level of mismatch appears to be maintained over time, as earlier migrants who update their address details are replaced by others who have not yet done so. In some cases, apparent mismatches seem likely to reflect complex multi-address living arrangements rather than data error.
Abstract:
This paper is part of a special issue of Applied Geochemistry focusing on reliable applications of compositional multivariate statistical methods. This study outlines the application of compositional data analysis (CoDa) to the calibration of geochemical data and the multivariate statistical modelling of geochemistry and grain-size data from a set of Holocene sedimentary cores from the Ganges-Brahmaputra (G-B) delta. Over the last two decades, understanding near-continuous records of sedimentary sequences has required the use of core-scanning X-ray fluorescence (XRF) spectrometry, for both terrestrial and marine sedimentary sequences. Initial XRF data are generally unusable in raw format, requiring data processing to remove instrument bias, as well as informed sequence interpretation. The applicability of conventional calibration equations to core-scanning XRF data is further limited by the constraints posed by unknown measurement geometry and specimen homogeneity, as well as by matrix effects. Log-ratio based calibration schemes have been developed and applied to clastic sedimentary sequences, focusing mainly on energy-dispersive XRF (ED-XRF) core-scanning. This study applied high-resolution core-scanning XRF to Holocene sedimentary sequences from the tide-dominated Indian Sundarbans (Ganges-Brahmaputra delta plain). The Log-Ratio Calibration Equation (LRCE) was applied to a sub-set of core-scan and conventional ED-XRF data to quantify elemental composition, providing a robust calibration scheme using reduced major axis regression of log-ratio transformed geochemical data. Through partial least squares (PLS) modelling of the geochemical and grain-size data, it is possible to derive robust proxy information for the Sundarbans depositional environment. The application of these techniques to Holocene sedimentary data offers an improved methodological framework for unravelling Holocene sedimentation patterns.
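The calibration idea described here (regressing log-ratios of conventionally measured concentrations on log-ratios of scanner intensities, with a reduced major axis fit) can be sketched in a few lines. The Python example below uses a single invented element ratio and simulated data; it illustrates the form of the scheme, not the paper's LRCE implementation.

```python
# Schematic log-ratio calibration of core-scanner intensities against
# conventional measurements, via reduced major axis (RMA) regression.
import numpy as np

rng = np.random.default_rng(8)
n = 50
true_lr = rng.normal(0.0, 0.5, n)                     # e.g. log(Fe/Ca) from ED-XRF
scan_lr = 0.8 * true_lr + 0.1 + 0.1 * rng.normal(size=n)  # scanner log-ratio, biased/noisy

# RMA regression: slope = sign(r) * sd(y) / sd(x).
r = np.corrcoef(scan_lr, true_lr)[0, 1]
slope = np.sign(r) * true_lr.std() / scan_lr.std()
intercept = true_lr.mean() - slope * scan_lr.mean()

calibrated = slope * scan_lr + intercept              # calibrated log-ratio downcore
print(f"RMA slope = {slope:.2f}, intercept = {intercept:.2f}")
```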