991 results for Transformed data
Abstract:
Factor analysis, a frequently used technique for multivariate data inspection, is also widely applied in compositional data analysis. The usual approach is to apply a centered logratio (clr) transformation to obtain the random vector y of dimension D. The factor model is then y = Λf + e (1), with the factors f of dimension k < D, the error term e, and the loadings matrix Λ. Under the usual model assumptions (see, e.g., Basilevsky, 1994), the factor analysis model (1) can be written as Cov(y) = ΛΛ^T + ψ (2), where ψ = Cov(e) has diagonal form. The diagonal elements of ψ as well as the loadings matrix Λ are estimated from an estimate of Cov(y). Let the observed clr-transformed data matrix Y contain realizations of the random vector y. Outliers or deviations from the idealized model assumptions of factor analysis can severely affect the parameter estimation. As a way out, robust estimation of the covariance matrix of Y leads to robust estimates of Λ and ψ in (2); see Pison et al. (2003). Well-known robust covariance estimators with good statistical properties, like the MCD or the S-estimators (see, e.g., Maronna et al., 2006), rely on a full-rank data matrix Y, which is not the case for clr-transformed data (see, e.g., Aitchison, 1986). The isometric logratio (ilr) transformation (Egozcue et al., 2003) solves this singularity problem: the data matrix Y is transformed to a matrix Z by using an orthonormal basis of lower dimension. From the ilr-transformed data, a robust covariance matrix C(Z) can be estimated. The result can be back-transformed to the clr space by C(Y) = V C(Z) V^T, where the matrix V with orthonormal columns comes from the relation between the clr and the ilr transformation. Now the parameters in model (2) can be estimated (Basilevsky, 1994), and the results have a direct interpretation since the links to the original variables are preserved. The above procedure is applied to data from geochemistry. Our special interest is in comparing the results with those of Reimann et al. (2002) for the Kola project data.
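As a concrete illustration of the pipeline sketched above (clr transform, ilr coordinates, robust MCD covariance, back-transformation, and principal-factor estimates of Λ and ψ), the following is a minimal Python sketch assuming numpy, scipy, and scikit-learn; the Dirichlet toy data and all variable names are illustrative, not the paper's code.

```python
import numpy as np
from scipy.linalg import helmert
from sklearn.covariance import MinCovDet

def clr(X):
    """Centered logratio transform of compositions (rows of X sum to 1)."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

D, k = 5, 2                                         # parts, factors (illustrative)
V = helmert(D).T                                    # D x (D-1), orthonormal columns
X = np.random.dirichlet(np.ones(D), size=200)       # toy compositional data

Y = clr(X)                                          # clr data: singular (rows sum to 0)
Z = Y @ V                                           # ilr coordinates: full rank
C_Z = MinCovDet(random_state=0).fit(Z).covariance_  # robust (MCD) covariance
C_Y = V @ C_Z @ V.T                                 # back-transform to clr space

# Principal-factor estimates of Cov(y) = Lambda Lambda^T + psi
w, U = np.linalg.eigh(C_Y)
top = np.argsort(w)[::-1][:k]
Lam = U[:, top] * np.sqrt(w[top])                   # loadings matrix
psi = np.diag(np.diag(C_Y - Lam @ Lam.T))           # diagonal uniquenesses
```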
Abstract:
Matheron's usual variogram estimator can result in unreliable variograms when data are strongly asymmetric or skewed. Asymmetry in a distribution can arise from a long tail of values in the underlying process or from outliers belonging to another population that contaminate the primary process. This paper, the first of two, examines the effects of underlying asymmetry on the variogram and on the accuracy of prediction; the second examines the effects arising from outliers. Standard geostatistical texts suggest ways of dealing with underlying asymmetry; however, these suggestions are based on informed intuition rather than detailed investigation. To determine whether the methods generally used to deal with underlying asymmetry are appropriate, the effects of different coefficients of skewness on the shape of the experimental variogram and on the model parameters were investigated. Simulated annealing was used to create normally distributed random fields of different sizes from variograms with different nugget:sill ratios. These data were then modified to give different degrees of asymmetry, and the experimental variogram was computed in each case. The effects of standard data transformations on the form of the variogram were also investigated. Cross-validation was used to assess quantitatively the performance of the different variogram models for kriging. The results showed that the shape of the variogram was affected by the degree of asymmetry, and that the effect increased as the size of the data set decreased. Transformations of the data were more effective in reducing the skewness coefficient in the larger data sets. Cross-validation confirmed that variogram models from transformed data were more suitable for kriging than those from the raw asymmetric data. The results of this study have implications for 'standard best practice' in dealing with asymmetry in data for geostatistical analyses. © 2007 Elsevier Ltd. All rights reserved.
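As a rough sketch of the kind of comparison described (the classical Matheron estimator computed on raw versus transformed skewed data), the following assumes numpy; the lognormal toy field, the lag binning, and all names are illustrative stand-ins for the simulated fields used in the paper.

```python
import numpy as np

def matheron_variogram(coords, values, lags, tol):
    """Classical (Matheron) estimator: for each lag bin, half the mean
    squared increment over all pairs whose separation falls in the bin."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sq = (values[:, None] - values[None, :]) ** 2
    i, j = np.triu_indices(len(values), k=1)       # each pair once
    dist, sqinc = d[i, j], sq[i, j]
    return np.array([sqinc[np.abs(dist - h) <= tol].mean() / 2.0
                     for h in lags])

rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(300, 2))        # toy sampling locations
z = np.exp(rng.normal(size=300))                   # strongly right-skewed values
lags = np.arange(5, 50, 5)
g_raw = matheron_variogram(coords, z, lags, tol=2.5)
g_log = matheron_variogram(coords, np.log(z), lags, tol=2.5)  # one standard transform
```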
New methods for quantification and analysis of quantitative real-time polymerase chain reaction data
Abstract:
Quantitative real-time polymerase chain reaction (qPCR) is a sensitive gene quantitation method that has been widely used in the biological and biomedical fields. The currently used methods for PCR data analysis, including the threshold cycle (CT) method and linear and non-linear model-fitting methods, all require subtracting background fluorescence. However, the removal of background fluorescence is usually inaccurate and can therefore distort results. Here, we propose a new method, the taking-difference linear regression method, to overcome this limitation. Briefly, for each pair of consecutive PCR cycles, we subtracted the fluorescence in the former cycle from that in the latter cycle, transforming the n-cycle raw data into n−1 cycles of data. Linear regression was then applied to the natural logarithm of the transformed data. Finally, amplification efficiencies and the initial DNA molecule numbers were calculated for each PCR run. To evaluate this new method, we compared it, in terms of accuracy and precision, with the original linear regression method under three background corrections: the mean of cycles 1-3, the mean of cycles 3-7, and the minimum. Three criteria, namely threshold identification, max R², and max slope, were employed to search for target data points. Considering that PCR data are time series data, we also applied linear mixed models. Collectively, when the threshold identification criterion was applied and when the linear mixed model was adopted, the taking-difference linear regression method was superior, giving an accurate estimate of the initial DNA amount and a reasonable estimate of the PCR amplification efficiencies. When the criteria of max R² and max slope were used, the original linear regression method gave an accurate estimate of the initial DNA amount. Overall, the taking-difference linear regression method avoids the error of subtracting an unknown background and is thus theoretically more accurate and reliable. The method is easy to perform, and the taking-difference strategy can be extended to all current methods for qPCR data analysis.
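The taking-difference idea lends itself to a compact sketch: under the exponential-phase model F_n = bg + F_0·E^n, the cycle-to-cycle difference cancels any constant background, so ln D_n = ln(F_0(E−1)) + n·ln E is linear in n. The following minimal Python sketch illustrates the strategy described above; it is not the authors' implementation, and the toy data are invented.

```python
import numpy as np

def taking_difference_fit(F):
    """F: raw fluorescence per cycle (exponential phase).
    Returns the per-cycle amplification factor E and initial signal F0."""
    D = np.diff(F)                       # differences cancel a constant background
    n = np.arange(len(D))
    slope, intercept = np.polyfit(n, np.log(D), 1)
    E = np.exp(slope)                    # amplification factor (2.0 = perfect doubling)
    F0 = np.exp(intercept) / (E - 1.0)   # since D_n = F0 * (E - 1) * E**n
    return E, F0

# Toy check: perfect doubling plus a constant background of 5 fluorescence units
cycles = np.arange(12)
F = 5.0 + 1e-3 * 2.0 ** cycles
E, F0 = taking_difference_fit(F)         # E ~ 2.0, F0 ~ 1e-3
```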
Abstract:
I developed a new model for estimating the annual production-to-biomass ratio P/B and production P of macrobenthic populations in marine and freshwater habitats. Self-learning artificial neural networks (ANNs) were used to model the relationships between P/B and twenty easy-to-measure abiotic and biotic parameters in 1252 data sets of population production. Based on log-transformed data, the final predictive model estimates log(P/B) with reasonable accuracy and precision (r² = 0.801; residual mean square RMS = 0.083). Body mass and water temperature contributed most to the explanatory power of the model. However, as with all least-squares models using nonlinearly transformed data, back-transformation to the natural scale introduces a bias into the model predictions, i.e., an underestimation of P/B (and P). When estimating the production of assemblages of populations by adding up population estimates, accuracy decreases but precision increases with the number of populations in the assemblage.
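The back-transformation bias mentioned above has a standard lognormal correction, sketched below under the assumption of base-10 logs and normally distributed residuals with variance equal to the reported RMS; the predicted value plugged in is illustrative.

```python
import math

# Naive vs bias-corrected back-transform of a log10-scale prediction.
# If log10(P/B) ~ N(mu, s2), then E[P/B] = 10**mu * exp((s*ln 10)**2 / 2),
# so the naive 10**pred underestimates the mean P/B.
def back_transform(pred_log10, rms):
    s2_ln = rms * math.log(10.0) ** 2    # residual variance on the ln scale
    naive = 10.0 ** pred_log10           # biased low
    corrected = naive * math.exp(s2_ln / 2.0)
    return naive, corrected

naive, corrected = back_transform(pred_log10=0.3, rms=0.083)
```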
Abstract:
This dissertation develops a new mathematical approach that overcomes the effect of a data-processing phenomenon known as "histogram binning", inherent to flow cytometry data. A real-time procedure is introduced to prove the effectiveness and fast implementation of this approach on real-world data. The histogram binning effect is a dilemma posed by two seemingly antagonistic developments: (1) flow cytometry data in its histogram form is extended in its dynamic range to improve its analysis and interpretation, and (2) the inevitable dynamic range extension introduces an unwelcome side effect, the binning effect, which skews the statistics of the data, undermining as a consequence the accuracy of the analysis and the eventual interpretation of the data. Researchers in the field contended with this dilemma for many years, resorting either to hardware approaches, which are rather costly and carry inherent calibration and noise effects, or to software techniques based on filtering the binning effect, which did not successfully preserve the statistical content of the original data. The mathematical approach introduced in this dissertation is so appealing that a patent application has been filed. The contribution of this dissertation is an incremental scientific innovation based on a mathematical framework that will allow researchers in the field of flow cytometry to improve the interpretation of data, knowing that its statistical meaning has been faithfully preserved for optimized analysis. Furthermore, with the same mathematical foundation, proof of the origin of this inherent artifact is provided. These results are unique in that new mathematical derivations are established to define and solve the critical problem of the binning effect faced at the experimental assessment level, providing a data platform that preserves its statistical content. In addition, a novel method for accumulating the log-transformed data was developed. This new method uses the properties of the transformation of statistical distributions to accumulate the output histogram in a non-integer and multi-channel fashion. Although the mathematics of this new mapping technique seems intricate, the concise nature of the derivations allows for an implementation procedure that lends itself to a real-time implementation using lookup tables, a task that is also introduced in this dissertation.
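One plausible reading of the non-integer, multi-channel accumulation described above can be sketched as follows: each linear input channel maps to a fractional position on the log axis, and its counts are split between the two neighbouring output bins via precomputed lookup tables, which keeps accumulation real-time friendly. All parameters and names are assumptions for illustration, not the dissertation's implementation.

```python
import numpy as np

def build_lut(n_in=1024, n_out=256, decades=4.0):
    """Precompute lookup tables mapping linear channels to fractional
    log-scale bin positions (lower bin index + weight to the upper bin)."""
    ch = np.arange(1, n_in + 1, dtype=float)                # avoid log(0)
    pos = np.log10(ch / ch[-1]) / decades * n_out + n_out   # fractional bin
    pos = np.clip(pos, 0, n_out - 1 - 1e-9)
    lo = np.floor(pos).astype(int)
    w = pos - lo                                            # upper-bin share
    return lo, w

def accumulate(linear_hist, lo, w, n_out=256):
    """Spread each input channel's counts over two adjacent log bins."""
    out = np.zeros(n_out)
    np.add.at(out, lo, linear_hist * (1.0 - w))             # lower-bin share
    np.add.at(out, lo + 1, linear_hist * w)                 # upper-bin share
    return out
```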
Abstract:
A significant proportion of the cost of software development is due to software testing and maintenance. This is in part the result of the inevitable imperfections due to human error, a lack of quality during the design and coding of software, and the increasing need to reduce faults to improve customer satisfaction in a competitive marketplace. Given the cost and importance of removing errors, improvements in fault detection and removal can be of significant benefit. The earlier in the development process faults can be found, the less it costs to correct them and the less likely other faults are to develop. This research aims to make the testing process more efficient and effective by identifying those software modules most likely to contain faults, allowing testing efforts to be carefully targeted. This is done with the use of machine learning algorithms, which use examples of fault-prone and not fault-prone modules to develop predictive models of quality. In order to learn the numerical mapping between module and classification, a module is represented in terms of software metrics. A difficulty in this sort of problem is sourcing software engineering data of adequate quality. In this work, data are obtained from two sources, the NASA Metrics Data Program and the open source Eclipse project. Feature selection is applied before learning, and a number of different feature selection methods are compared to find which work best. Two machine learning algorithms are applied to the data - Naive Bayes and the Support Vector Machine - and the predictive results are compared to those of previous efforts, being found superior on selected data sets and comparable on others. In addition, a new classification method is proposed, Rank Sum, in which a ranking abstraction is laid over bin densities for each class and a classification is determined from the sum of ranks over features. A novel extension of this method is also described, based on an observed polarising of points by class when rank sum is applied to training data to convert it into a 2D rank sum space. SVM is applied to this transformed data to produce models whose parameters can be set according to trade-off curves to obtain a particular performance trade-off.
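A hedged Python sketch of the Rank Sum idea as described (class-conditional bin densities per feature, ranks summed over features) follows; the bin count, tie handling, and scoring direction are assumptions, since the abstract does not spell them out.

```python
import numpy as np

class RankSum:
    """Sketch of a rank-sum classifier over per-feature bin densities."""
    def __init__(self, n_bins=10):
        self.n_bins = n_bins

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.edges_, self.dens_ = [], []
        for j in range(X.shape[1]):
            edges = np.histogram_bin_edges(X[:, j], bins=self.n_bins)
            dens = np.stack([np.histogram(X[y == c, j], bins=edges,
                                          density=True)[0]
                             for c in self.classes_])     # (classes, bins)
            self.edges_.append(edges)
            self.dens_.append(dens)
        return self

    def predict(self, X):
        out = []
        for x in X:
            score = np.zeros(len(self.classes_))
            for j, v in enumerate(x):
                b = np.clip(np.searchsorted(self.edges_[j], v) - 1,
                            0, self.n_bins - 1)           # bin of this value
                score += self.dens_[j][:, b].argsort().argsort()  # class ranks
            out.append(self.classes_[score.argmax()])     # best total rank
        return np.array(out)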
Abstract:
Purpose: Serum levels of the inflammatory markers YKL-40 and IL-6 are increased in many conditions, including cancers. We examined serum YKL-40 and IL-6 levels in patients with Hodgkin lymphoma (HL), a tumor with a strong immunologic reaction to relatively few tumor cells, especially in nodular sclerosis HL. Experimental Design: We analyzed Danish and Swedish patients with incident HL (N = 470) and population controls from Denmark (N = 245 for YKL-40; N = 348 for IL-6). Serum YKL-40 and IL-6 levels were determined by ELISA, and log-transformed data were analyzed by linear regression, adjusting for age and sex. Results: Serum levels of YKL-40 and IL-6 were increased in HL patients compared to controls (YKL-40: 3.6-fold; IL-6: 8.3-fold; both p < 0.0001). In samples from pre-treatment HL patients (N = 176), levels were higher with more advanced stage (p-trend = 0.0001 for YKL-40 and 0.013 for IL-6) and in those with B symptoms, but levels were similar in nodular sclerosis and mixed cellularity subtypes, by EBV status, and in younger (<45 years old) and older patients. Patients tested soon after treatment onset had significantly lower levels than pre-treatment patients, but even >6 months after treatment onset, serum YKL-40 and IL-6 levels remained significantly increased compared to controls. In patients who died (N = 12), pre-treatment levels of both YKL-40 and IL-6 were higher than in survivors, although not statistically significantly so. Conclusions: Serum YKL-40 and IL-6 levels were increased in untreated HL patients and in those with more advanced stages, but did not differ significantly by HL histology. Following treatment, serum levels were significantly lower.
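The analysis style described (linear regression on log-transformed serum levels, adjusted for age and sex, with the case/control coefficient back-transformed to a fold-change) can be sketched as follows, assuming pandas and statsmodels; the data frame, toy values, and column names are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented toy data standing in for the measured serum levels
df = pd.DataFrame({
    "ykl40": np.random.lognormal(3, 1, 200),     # serum level, e.g. ng/mL
    "case": np.repeat([1, 0], 100),              # HL patient vs control
    "age": np.random.randint(18, 80, 200),
    "sex": np.random.choice(["F", "M"], 200),
})

# Regression on the log scale, adjusted for age and sex
fit = smf.ols("np.log(ykl40) ~ case + age + C(sex)", data=df).fit()
fold_change = np.exp(fit.params["case"])         # back-transforms to a fold-change
```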
Abstract:
The data for this study were gathered between 1993 and 1996 on board commercial trawlers from Somalia, China and Yemen, and also from the research vessel Ibn Magid belonging to the Marine Science and Resources Research Centre, Aden, Republic of Yemen. Fish were identified using the FAO species identification literature. All fish were measured to the nearest mm (total length) and weighed to the nearest g. Sex was determined by dissection after the length and weight had been measured. The length-weight relationships were calculated using least-squares regression on log-transformed data, and the parameters of the relationship of the form W = aL^b are summarized. Maximum and minimum sizes of the fish sampled are also given. Common names and recent changes in nomenclature were taken from ICLARM's FishBase.
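The length-weight fit described reduces to an ordinary least-squares line on log-log axes, since W = aL^b implies log W = log a + b log L. A minimal numpy sketch with invented toy measurements:

```python
import numpy as np

# Toy measurements standing in for the survey data
L = np.array([112.0, 150.0, 198.0, 240.0, 305.0])   # total length, mm
W = np.array([14.1, 33.8, 78.0, 139.0, 287.0])      # weight, g

# Least-squares regression on log-transformed data: log W = log a + b log L
b, log_a = np.polyfit(np.log(L), np.log(W), 1)
a = np.exp(log_a)
W_hat = a * L ** b                                   # fitted weights, W = a * L**b
```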
Abstract:
Correlations between total length (TL), fork length (FL) and standard length (SL) of Rastrineobola argentea (Pellegrin, 1904) in the Winam Gulf of Lake Victoria indicate that FL = 0.92 TL - 0.74 and SL = 0.90 TL - 1.74. The length-weight relationship fitted to log-transformed data shows that the slopes of the regression lines were 3.06 to 3.22 for juveniles, 2.70 to 3.05 for males, and 3.24 to 3.71 for females. The slopes were significantly different between groups at α = 0.05. Fulton's condition factor (K) was highest in December (1.019-1.073) and March/April (1.015-1.030) but lowest in June (1.00-1.025) for all stations. The significant differences between groups demand the use of different growth models for juveniles, males, and females, especially for the von Bertalanffy growth equation, which uses the length-weight relationship. The observed cyclic variations in condition factor suggest two peak breeding seasons for this species in the Winam Gulf. The practical implications of these results for stock assessment using length-based fish stock assessment methods are briefly discussed.
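For reference, the quantities used above can be computed directly, assuming the conventional definition of Fulton's condition factor, K = 100 * W / L^3 (W in g, L in cm), together with the length conversions reported in the abstract; the example fish is invented.

```python
def fulton_k(weight_g, total_length_cm):
    """Conventional Fulton's condition factor (assumed definition)."""
    return 100.0 * weight_g / total_length_cm ** 3

# Length conversions reported above (same units as the source measurements)
def fork_length(tl):
    return 0.92 * tl - 0.74

def standard_length(tl):
    return 0.90 * tl - 1.74

K = fulton_k(weight_g=1.8, total_length_cm=5.6)   # illustrative fish
```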
Abstract:
In this study, aspects of the structural mechanics of the upper and lower limbs of the three Chinese species of Rhinopithecus were examined. Linear regression and reduced major axis (RMA) analyses of natural-log-transformed data were used to examine the dimensions of the limb bones and their relationships to body size and locomotion. The results of this study suggest that: (1) the allometric exponents of the long-bone lengths deviate from isometry, being moderately negative, while the shaft diameters (both sagittal and transverse) show significantly positive allometry; (2) the sagittal diameters of the tibia and ulna show extremely significant positive allometry - the relative enlargement of the sagittal, as opposed to transverse, diameters of these bones suggests that the distal segments of the fore- and hindlimbs of Rhinopithecus experience high bending stresses during locomotion; (3) observations of Rhinopithecus species in the field indicate that all species engage in energetic leaping during arboreal locomotion. The limbs experience rapid and dramatic decelerations upon completion of a leap. We suggest that these occasional decelerations produce high bending stresses in the distal limb segments and so account for the hypertrophy of the sagittal diameters of the ulna and tibia.
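Reduced major axis regression on log-transformed data, as used above, has a closed form: the RMA slope is sign(r) * sd(y)/sd(x), and on log-log axes a slope above the isometric expectation (1.0 for a length-on-length comparison) indicates positive allometry. A minimal numpy sketch with invented measurements:

```python
import numpy as np

def rma(x, y):
    """Reduced major axis fit on natural-log-transformed data."""
    x, y = np.log(x), np.log(y)
    r = np.corrcoef(x, y)[0, 1]
    slope = np.sign(r) * y.std(ddof=1) / x.std(ddof=1)  # RMA slope
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# Invented toy measurements (e.g. a body-size proxy vs a bone dimension)
body_size = np.array([5.1, 6.3, 7.8, 9.2, 11.0])
bone_diam = np.array([0.42, 0.55, 0.74, 0.93, 1.20])
slope, intercept = rma(body_size, bone_diam)   # slope > isometric value: positive allometry
```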
Abstract:
King, R. D. and Ouali, M. (2004). Poly-transformation. In: Proceedings of the 5th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2004), Springer LNCS 3177, pp. 99-107.
Abstract:
Political drivers such as the Kyoto protocol, the EU Energy Performance of Buildings Directive and the Energy End-use and Services Directive have been implemented in response to an identified need for a reduction in human-related CO2 emissions. Buildings account for a significant portion of global CO2 emissions, approximately 25-30%, and it is widely acknowledged by industry and research organisations that they operate inefficiently. In parallel, unsatisfactory indoor environmental conditions have proven to negatively impact occupant productivity. Legislative drivers and client education are seen as the key motivating factors for an improvement in the holistic environmental and energy performance of a building. A symbiotic relationship exists between building indoor environmental conditions and building energy consumption. However, traditional Building Management Systems and Energy Management Systems treat these separately. Conventional performance analysis compares building energy consumption with a previously recorded value or with the consumption of a similar building, and does not recognise the fact that all buildings are unique. What is required, therefore, is a new framework which incorporates performance comparison against a theoretical, building-specific ideal benchmark. Traditionally, Energy Managers, who work at the operational level of organisations with respect to building performance, do not have access to ideal performance benchmark information and as a result cannot optimally operate buildings. This thesis systematically defines Holistic Environmental and Energy Management and specifies the Scenario Modelling Technique, which in turn uses an ideal performance benchmark. The holistic technique uses quantified expressions of building performance and by doing so enables the profiled Energy Manager to visualise his actions and the downstream consequences of his actions in the context of overall building operation. The Ideal Building Framework facilitates the use of this technique by acting as a Building Life Cycle (BLC) data repository through which ideal building performance benchmarks are systematically structured and stored in parallel with actual performance data. The Ideal Building Framework utilises transformed data in the form of the Ideal Set of Performance Objectives and Metrics, which are capable of defining the performance of any building at any stage of the BLC. It is proposed that the union of Scenario Models for an individual building would result in a building-specific Combination of Performance Metrics, which would in turn be stored in the BLC data repository. The Ideal Data Set underpins the Ideal Set of Performance Objectives and Metrics and is the set of measurements required to monitor the performance of the Ideal Building. A Model View describes the unique building-specific data relevant to a particular project stakeholder. The energy management data and information exchange requirements that underlie a Model View implementation are detailed and incorporate traditional and proposed energy management. This thesis also specifies the Model View Methodology, which complements the Ideal Building Framework. The developed Model View and Rule Set methodology process utilises stakeholder-specific rule sets to define stakeholder-pertinent environmental and energy performance data. This generic process further enables each stakeholder to define the desired data resolution: for example, basic, intermediate or detailed.
The Model View methodology is applicable to all project stakeholders, each requiring its own customised rule set. Two rule sets are defined in detail: the Energy Manager rule set and the LEED Accreditor rule set. This measurement-generation process, accompanied by the defined Model View, would filter and expedite data access for all stakeholders involved in building performance. Information presentation is critical for effective use of the data provided by the Ideal Building Framework and the Energy Management View definition. The specifications for a customised Information Delivery Tool account for the established profile of Energy Managers and best-practice user interface design. Components of the developed tool could also be used by Facility Managers working at the tactical and strategic levels of organisations. Informed decision making is made possible through specified decision assistance processes which incorporate the Scenario Modelling and Benchmarking techniques, the Ideal Building Framework, the Energy Manager Model View, the Information Delivery Tool and the established profile of Energy Managers. The Model View and Rule Set Methodology is effectively demonstrated on an appropriate existing mixed-use 'green' building, the Environmental Research Institute at University College Cork, using the Energy Management and LEED rule sets. Informed decision making is also demonstrated using a prototype scenario for the demonstration building.
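As a hedged illustration of the Model View / Rule Set idea (a stakeholder-specific rule set filtering the performance-metric catalogue at a chosen resolution), the following Python sketch uses invented metric names, stakeholders and levels; it is one reading of the process described, not the thesis implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    category: str        # e.g. "energy", "environment" (invented taxonomy)
    level: int           # 1 = basic, 2 = intermediate, 3 = detailed

# Illustrative rule sets for the two stakeholders defined in the thesis
RULE_SETS = {
    "energy_manager": {"categories": {"energy"}, "max_level": 3},
    "leed_accreditor": {"categories": {"energy", "environment"}, "max_level": 2},
}

def model_view(metrics, stakeholder, resolution):
    """Filter the metric catalogue per the stakeholder's rule set and
    the resolution that stakeholder has chosen."""
    rules = RULE_SETS[stakeholder]
    level = min(resolution, rules["max_level"])
    return [m for m in metrics
            if m.category in rules["categories"] and m.level <= level]

catalogue = [Metric("kWh_electricity", "energy", 1),
             Metric("boiler_efficiency", "energy", 3),
             Metric("zone_CO2_ppm", "environment", 2)]
view = model_view(catalogue, "leed_accreditor", resolution=2)
```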