906 resultados para Exploratory statistical data analysis
Resumo:
Limited information is available regarding the methodology required to characterize hashish seizures for assessing the presence or the absence of a chemical link between two seizures. This casework report presents the methodology applied for assessing that two different police seizures were coming from the same block before this latter one was split. The chemical signature was extracted using GC-MS analysis and the implemented methodology consists in a study of intra- and inter-variability distributions based on the measurement of the chemical profiles similarity using a number of hashish seizures and the calculation of the Pearson correlation coefficient. Different statistical scenarios (i.e., a combination of data pretreatment techniques and selection of target compounds) were tested to find the most discriminating one. Seven compounds showing high discrimination capabilities were selected on which a specific statistical data pretreatment was applied. Based on the results, the statistical model built for comparing the hashish seizures leads to low error rates. Therefore, the implemented methodology is suitable for the chemical profiling of hashish seizures.
Resumo:
Factor analysis as frequent technique for multivariate data inspection is widely used also for compositional data analysis. The usual way is to use a centered logratio (clr)transformation to obtain the random vector y of dimension D. The factor model istheny = Λf + e (1)with the factors f of dimension k & D, the error term e, and the loadings matrix Λ.Using the usual model assumptions (see, e.g., Basilevsky, 1994), the factor analysismodel (1) can be written asCov(y) = ΛΛT + ψ (2)where ψ = Cov(e) has a diagonal form. The diagonal elements of ψ as well as theloadings matrix Λ are estimated from an estimation of Cov(y).Given observed clr transformed data Y as realizations of the random vectory. Outliers or deviations from the idealized model assumptions of factor analysiscan severely effect the parameter estimation. As a way out, robust estimation ofthe covariance matrix of Y will lead to robust estimates of Λ and ψ in (2), seePison et al. (2003). Well known robust covariance estimators with good statisticalproperties, like the MCD or the S-estimators (see, e.g. Maronna et al., 2006), relyon a full-rank data matrix Y which is not the case for clr transformed data (see,e.g., Aitchison, 1986).The isometric logratio (ilr) transformation (Egozcue et al., 2003) solves thissingularity problem. The data matrix Y is transformed to a matrix Z by usingan orthonormal basis of lower dimension. Using the ilr transformed data, a robustcovariance matrix C(Z) can be estimated. The result can be back-transformed tothe clr space byC(Y ) = V C(Z)V Twhere the matrix V with orthonormal columns comes from the relation betweenthe clr and the ilr transformation. Now the parameters in the model (2) can beestimated (Basilevsky, 1994) and the results have a direct interpretation since thelinks to the original variables are still preserved.The above procedure will be applied to data from geochemistry. Our specialinterest is on comparing the results with those of Reimann et al. (2002) for the Kolaproject data
Resumo:
Teaching and research are organised differently between subject domains: attempts to construct typologies of higher education institutions, however, often do not include quantitative indicators concerning subject mix which would allow systematic comparisons of large numbers of higher education institutions among different countries, as the availability of data for such indicators is limited. In this paper, we present an exploratory approach for the construction of such indicators. The database constructed in the AQUAMETH project, which includes also data disaggregated at the disciplinary level, is explored with the aim of understanding patterns of subject mix. For six European countries, an exploratory and descriptive analysis of staff composition divided in four large domains (medical sciences, engineering and technology, natural sciences and social sciences and humanities) is performed, which leads to a classification distinguishing between specialist and generalist institutions. Among the latter, a further distinction is made based on the presence or absence of a medical department. Preliminary exploration of this classification and its comparison with other indicators show the influence of long term dynamics on the subject mix of individual higher education institutions, but also underline disciplinary differences, for example regarding student to staff ratios, as well as national patterns, for example regarding the number of PhD degrees per 100 undergraduate students. Despite its many limitations, this exploratory approach allows defining a classification of higher education institutions that accounts for a large share of differences between the analysed higher education institutions.
Resumo:
The statistical analysis of compositional data should be treated using logratios of parts,which are difficult to use correctly in standard statistical packages. For this reason afreeware package, named CoDaPack was created. This software implements most of thebasic statistical methods suitable for compositional data.In this paper we describe the new version of the package that now is calledCoDaPack3D. It is developed in Visual Basic for applications (associated with Excel©),Visual Basic and Open GL, and it is oriented towards users with a minimum knowledgeof computers with the aim at being simple and easy to use.This new version includes new graphical output in 2D and 3D. These outputs could bezoomed and, in 3D, rotated. Also a customization menu is included and outputs couldbe saved in jpeg format. Also this new version includes an interactive help and alldialog windows have been improved in order to facilitate its use.To use CoDaPack one has to access Excel© and introduce the data in a standardspreadsheet. These should be organized as a matrix where Excel© rows correspond tothe observations and columns to the parts. The user executes macros that returnnumerical or graphical results. There are two kinds of numerical results: new variablesand descriptive statistics, and both appear on the same sheet. Graphical output appearsin independent windows. In the present version there are 8 menus, with a total of 38submenus which, after some dialogue, directly call the corresponding macro. Thedialogues ask the user to input variables and further parameters needed, as well aswhere to put these results. The web site http://ima.udg.es/CoDaPack contains thisfreeware package and only Microsoft Excel© under Microsoft Windows© is required torun the software.Kew words: Compositional data Analysis, Software
Resumo:
Hydrogeological research usually includes some statistical studies devised to elucidate mean background state, characterise relationships among different hydrochemical parameters, and show the influence of human activities. These goals are achieved either by means of a statistical approach or by mixing modelsbetween end-members. Compositional data analysis has proved to be effective with the first approach, but there is no commonly accepted solution to the end-member problem in a compositional framework.We present here a possible solution based on factor analysis of compositions illustrated with a case study.We find two factors on the compositional bi-plot fitting two non-centered orthogonal axes to the most representative variables. Each one of these axes defines a subcomposition, grouping those variables thatlay nearest to it. With each subcomposition a log-contrast is computed and rewritten as an equilibrium equation. These two factors can be interpreted as the isometric log-ratio coordinates (ilr) of three hiddencomponents, that can be plotted in a ternary diagram. These hidden components might be interpreted as end-members.We have analysed 14 molarities in 31 sampling stations all along the Llobregat River and its tributaries, with a monthly measure during two years. We have obtained a bi-plot with a 57% of explained totalvariance, from which we have extracted two factors: factor G, reflecting geological background enhanced by potash mining; and factor A, essentially controlled by urban and/or farming wastewater. Graphicalrepresentation of these two factors allows us to identify three extreme samples, corresponding to pristine waters, potash mining influence and urban sewage influence. To confirm this, we have available analysisof diffused and widespread point sources identified in the area: springs, potash mining lixiviates, sewage, and fertilisers. Each one of these sources shows a clear link with one of the extreme samples, exceptfertilisers due to the heterogeneity of their composition.This approach is a useful tool to distinguish end-members, and characterise them, an issue generally difficult to solve. It is worth note that the end-member composition cannot be fully estimated but only characterised through log-ratio relationships among components. Moreover, the influence of each endmember in a given sample must be evaluated in relative terms of the other samples. These limitations areintrinsic to the relative nature of compositional data
Resumo:
This article provides an in-depth study of long-term female unemployment in Catalonia.Long-term unemployment statistics reveal which social groups are most likely to experience difficulty re-entering the labour market. In this case, we found that women are mainly affected by this type of labour exclusion, in particular poorly qualified, working-class women who are aged over 45 and with family responsibilities.The article aims to explore how the overlapping of factors such as gender, age, social class, origin and the division of work based on gender are related to long-term female unemployment. Moreover, we were able to detect which conceptual tools provide us with the production/reproduction paradigm so as to be able to analyse the future of female unemployment. The methodology we used combines quantitative and qualitative approaches. On the one hand, the analysis of secondary statistical data focusing on Catalonia is useful in understanding the situation from a macro-social perspective. On the other hand, an exploratory discussion group enables us to investigate social imaginary practises among unemployed working class women aged over 45. This discussion group was held in Igualada -capital of the Anoia region - an area of Catalonia deeply affected by unemployment in the current economic crisis.
Resumo:
Modern methods of compositional data analysis are not well known in biomedical research.Moreover, there appear to be few mathematical and statistical researchersworking on compositional biomedical problems. Like the earth and environmental sciences,biomedicine has many problems in which the relevant scienti c information isencoded in the relative abundance of key species or categories. I introduce three problemsin cancer research in which analysis of compositions plays an important role. Theproblems involve 1) the classi cation of serum proteomic pro les for early detection oflung cancer, 2) inference of the relative amounts of di erent tissue types in a diagnostictumor biopsy, and 3) the subcellular localization of the BRCA1 protein, and it'srole in breast cancer patient prognosis. For each of these problems I outline a partialsolution. However, none of these problems is \solved". I attempt to identify areas inwhich additional statistical development is needed with the hope of encouraging morecompositional data analysts to become involved in biomedical research
Resumo:
Recently, kernel-based Machine Learning methods have gained great popularity in many data analysis and data mining fields: pattern recognition, biocomputing, speech and vision, engineering, remote sensing etc. The paper describes the use of kernel methods to approach the processing of large datasets from environmental monitoring networks. Several typical problems of the environmental sciences and their solutions provided by kernel-based methods are considered: classification of categorical data (soil type classification), mapping of environmental and pollution continuous information (pollution of soil by radionuclides), mapping with auxiliary information (climatic data from Aral Sea region). The promising developments, such as automatic emergency hot spot detection and monitoring network optimization are discussed as well.
Resumo:
We consider two fundamental properties in the analysis of two-way tables of positive data: the principle of distributional equivalence, one of the cornerstones of correspondence analysis of contingency tables, and the principle of subcompositional coherence, which forms the basis of compositional data analysis. For an analysis to be subcompositionally coherent, it suffices to analyse the ratios of the data values. The usual approach to dimension reduction in compositional data analysis is to perform principal component analysis on the logarithms of ratios, but this method does not obey the principle of distributional equivalence. We show that by introducing weights for the rows and columns, the method achieves this desirable property. This weighted log-ratio analysis is theoretically equivalent to spectral mapping , a multivariate method developed almost 30 years ago for displaying ratio-scale data from biological activity spectra. The close relationship between spectral mapping and correspondence analysis is also explained, as well as their connection with association modelling. The weighted log-ratio methodology is applied here to frequency data in linguistics and to chemical compositional data in archaeology.
Resumo:
Whether for investigative or intelligence aims, crime analysts often face up the necessity to analyse the spatiotemporal distribution of crimes or traces left by suspects. This article presents a visualisation methodology supporting recurrent practical analytical tasks such as the detection of crime series or the analysis of traces left by digital devices like mobile phone or GPS devices. The proposed approach has led to the development of a dedicated tool that has proven its effectiveness in real inquiries and intelligence practices. It supports a more fluent visual analysis of the collected data and may provide critical clues to support police operations as exemplified by the presented case studies.
Resumo:
Quantitative information from magnetic resonance imaging (MRI) may substantiate clinical findings and provide additional insight into the mechanism of clinical interventions in therapeutic stroke trials. The PERFORM study is exploring the efficacy of terutroban versus aspirin for secondary prevention in patients with a history of ischemic stroke. We report on the design of an exploratory longitudinal MRI follow-up study that was performed in a subgroup of the PERFORM trial. An international multi-centre longitudinal follow-up MRI study was designed for different MR systems employing safety and efficacy readouts: new T2 lesions, new DWI lesions, whole brain volume change, hippocampal volume change, changes in tissue microstructure as depicted by mean diffusivity and fractional anisotropy, vessel patency on MR angiography, and the presence of and development of new microbleeds. A total of 1,056 patients (men and women ≥ 55 years) were included. The data analysis included 3D reformation, image registration of different contrasts, tissue segmentation, and automated lesion detection. This large international multi-centre study demonstrates how new MRI readouts can be used to provide key information on the evolution of cerebral tissue lesions and within the macrovasculature after atherothrombotic stroke in a large sample of patients.
Resumo:
The Office of Special Investigations at Iowa Department of Transportation (DOT) collects FWD data on regular basis to evaluate pavement structural conditions. The primary objective of this study was to develop a fully-automated software system for rapid processing of the FWD data along with a user manual. The software system automatically reads the FWD raw data collected by the JILS-20 type FWD machine that Iowa DOT owns, processes and analyzes the collected data with the rapid prediction algorithms developed during the phase I study. This system smoothly integrates the FWD data analysis algorithms and the computer program being used to collect the pavement deflection data. This system can be used to assess pavement condition, estimate remaining pavement life, and eventually help assess pavement rehabilitation strategies by the Iowa DOT pavement management team. This report describes the developed software in detail and can also be used as a user-manual for conducting simulation studies and detailed analyses. *********************** Large File ***********************
Resumo:
BACKGROUND: PCR has the potential to detect and precisely quantify specific DNA sequences, but it is not yet often used as a fully quantitative method. A number of data collection and processing strategies have been described for the implementation of quantitative PCR. However, they can be experimentally cumbersome, their relative performances have not been evaluated systematically, and they often remain poorly validated statistically and/or experimentally. In this study, we evaluated the performance of known methods, and compared them with newly developed data processing strategies in terms of resolution, precision and robustness. RESULTS: Our results indicate that simple methods that do not rely on the estimation of the efficiency of the PCR amplification may provide reproducible and sensitive data, but that they do not quantify DNA with precision. Other evaluated methods based on sigmoidal or exponential curve fitting were generally of both poor resolution and precision. A statistical analysis of the parameters that influence efficiency indicated that it depends mostly on the selected amplicon and to a lesser extent on the particular biological sample analyzed. Thus, we devised various strategies based on individual or averaged efficiency values, which were used to assess the regulated expression of several genes in response to a growth factor. CONCLUSION: Overall, qPCR data analysis methods differ significantly in their performance, and this analysis identifies methods that provide DNA quantification estimates of high precision, robustness and reliability. These methods allow reliable estimations of relative expression ratio of two-fold or higher, and our analysis provides an estimation of the number of biological samples that have to be analyzed to achieve a given precision.
Resumo:
The present study deals with the analysis and mapping of Swiss franc interest rates. Interest rates depend on time and maturity, defining term structure of the interest rate curves (IRC). In the present study IRC are considered in a two-dimensional feature space - time and maturity. Exploratory data analysis includes a variety of tools widely used in econophysics and geostatistics. Geostatistical models and machine learning algorithms (multilayer perceptron and Support Vector Machines) were applied to produce interest rate maps. IR maps can be used for the visualisation and pattern perception purposes, to develop and to explore economical hypotheses, to produce dynamic asset-liability simulations and for financial risk assessments. The feasibility of an application of interest rates mapping approach for the IRC forecasting is considered as well. (C) 2008 Elsevier B.V. All rights reserved.
Resumo:
This article provides an in-depth study of long-term female unemployment in Catalonia.Long-term unemployment statistics reveal which social groups are most likely to experience difficulty re-entering the labour market. In this case, we found that women are mainly affected by this type of labour exclusion, in particular poorly qualified, working-class women who are aged over 45 and with family responsibilities.The article aims to explore how the overlapping of factors such as gender, age, social class, origin and the division of work based on gender are related to long-term female unemployment. Moreover, we were able to detect which conceptual tools provide us with the production/reproduction paradigm so as to be able to analyse the future of female unemployment. The methodology we used combines quantitative and qualitative approaches. On the one hand, the analysis of secondary statistical data focusing on Catalonia is useful in understanding the situation from a macro-social perspective. On the other hand, an exploratory discussion group enables us to investigate social imaginary practises among unemployed working class women aged over 45. This discussion group was held in Igualada -capital of the Anoia region - an area of Catalonia deeply affected by unemployment in the current economic crisis.