885 resultados para n-way data analysis
Resumo:
As a sequel to a paper that dealt with the analysis of two-way quantitative data in large germplasm collections, this paper presents analytical methods appropriate for two-way data matrices consisting of mixed data types, namely, ordered multicategory and quantitative data types. While various pattern analysis techniques have been identified as suitable for analysis of the mixed data types which occur in germplasm collections, the clustering and ordination methods used often can not deal explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions) with incomplete information. However, it is shown that the ordination technique of principal component analysis and the mixture maximum likelihood method of clustering can be employed to achieve such analyses. Germplasm evaluation data for 11436 accessions of groundnut (Arachis hypogaea L.) from the International Research Institute of the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the post-rainy season and five ordered multicategory descriptors were used. Pattern analysis results generally indicated that the accessions could be distinguished into four regions along the continuum of growth habit (or plant erectness). Interpretation of accession membership in these regions was found to be consistent with taxonomic information, such as subspecies. Each growth habit region contained accessions from three of the most common groundnut botanical varieties. This implies that within each of the habit types there is the full range of expression for the other descriptors used in the analysis. Using these types of insights, the patterns of variability in germplasm collections can provide scientists with valuable information for their plant improvement programs.
Resumo:
Citizen Science projects are initiatives in which members of the general public participate in scientific research projects and perform or manage research-related tasks such as data collection and/or data annotation. Citizen Science is technologically possible and scientifically significant. However, although research teams can save time and money by recruiting general citizens to volunteer their time and skills to help data analysis, the reliability of contributed data varies a lot. Data reliability issues are significant to the domain of Citizen Science due to the quantity and diversity of people and devices involved. Participants may submit low quality, misleading, inaccurate, or even malicious data. Therefore, finding a way to improve the data reliability has become an urgent demand. This study aims to investigate techniques to enhance the reliability of data contributed by general citizens in scientific research projects especially for acoustic sensing projects. In particular, we propose to design a reputation framework to enhance data reliability and also investigate some critical elements that should be aware of during developing and designing new reputation systems.
Resumo:
In Australia, as in some other western nations, governments impose accountability measures on educational institutions (Earl, 2005). One such accountability measure is the National Assessment Program - Literacy and Numeracy (NAPLAN) from which high-stakes assessment data is generated. In this article, a practical method of data analysis known as the Over Time Assessment Data Analysis (OTADA) is offered as an analytical process by which schools can monitor their current and over time performances. This analysis developed by the author, is currently used extensively in schools throughout Queensland. By Analysing in this way, teachers, and in particular principals, can obtain a quick and insightful performance overview. For those seeking to track the achievements and progress of year level cohorts, the OTADA should be considered.
Resumo:
ENGLISH: A two-stage sampling design is used to estimate the variances of the numbers of yellowfin in different age groups caught in the eastern Pacific Ocean. For purse seiners, the primary sampling unit (n) is a brine well containing fish from a month-area stratum; the number of fish lengths (m) measured from each well are the secondary units. The fish cannot be selected at random from the wells because of practical limitations. The effects of different sampling methods and other factors on the reliability and precision of statistics derived from the length-frequency data were therefore examined. Modifications are recommended where necessary. Lengths of fish measured during the unloading of six test wells revealed two forms of inherent size stratification: 1) short-term disruptions of existing pattern of sizes, and 2) transition zones between long-term trends in sizes. To some degree, all wells exhibited cyclic changes in mean size and variance during unloading. In half of the wells, it was observed that size selection by the unloaders induced a change in mean size. As a result of stratification, the sequence of sizes removed from all wells was non-random, regardless of whether a well contained fish from a single set or from more than one set. The number of modal sizes in a well was not related to the number of sets. In an additional well composed of fish from several sets, an experiment on vertical mixing indicated that a representative sample of the contents may be restricted to the bottom half of the well. The contents of the test wells were used to generate 25 simulated wells and to compare the results of three sampling methods applied to them. The methods were: (1) random sampling (also used as a standard), (2) protracted sampling, in which the selection process was extended over a large portion of a well, and (3) measuring fish consecutively during removal from the well. Repeated sampling by each method and different combinations indicated that, because the principal source of size variation occurred among primary units, increasing n was the most effective way to reduce the variance estimates of both the age-group sizes and the total number of fish in the landings. Protracted sampling largely circumvented the effects of size stratification, and its performance was essentially comparable to that of random sampling. Sampling by this method is recommended. Consecutive-fish sampling produced more biased estimates with greater variances. Analysis of the 1988 length-frequency samples indicated that, for age groups that appear most frequently in the catch, a minimum sampling frequency of one primary unit in six for each month-area stratum would reduce the coefficients of variation (CV) of their size estimates to approximately 10 percent or less. Additional stratification of samples by set type, rather than month-area alone, further reduced the CV's of scarce age groups, such as the recruits, and potentially improved their accuracy. The CV's of recruitment estimates for completely-fished cohorts during the 198184 period were in the vicinity of 3 to 8 percent. Recruitment estimates and their variances were also relatively insensitive to changes in the individual quarterly catches and variances, respectively, of which they were composed. SPANISH: Se usa un diseño de muestreo de dos etapas para estimar las varianzas de los números de aletas amari11as en distintos grupos de edad capturados en el Océano Pacifico oriental. Para barcos cerqueros, la unidad primaria de muestreo (n) es una bodega de salmuera que contenía peces de un estrato de mes-área; el numero de ta11as de peces (m) medidas de cada bodega es la unidad secundaria. Limitaciones de carácter practico impiden la selección aleatoria de peces de las bodegas. Por 10 tanto, fueron examinados los efectos de distintos métodos de muestreo y otros factores sobre la confiabilidad y precisión de las estadísticas derivadas de los datos de frecuencia de ta11a. Se recomiendan modificaciones donde sean necesarias. Las ta11as de peces medidas durante la descarga de seis bodegas de prueba revelaron dos formas de estratificación inherente por ta11a: 1) perturbaciones a corto plazo en la pauta de ta11as existente, y 2) zonas de transición entre las tendencias a largo plazo en las ta11as. En cierto grado, todas las bodegas mostraron cambios cíclicos en ta11a media y varianza durante la descarga. En la mitad de las bodegas, se observo que selección por ta11a por los descargadores indujo un cambio en la ta11a media. Como resultado de la estratificación, la secuencia de ta11as sacadas de todas las bodegas no fue aleatoria, sin considerar si una bodega contenía peces de un solo lance 0 de mas de uno. El numero de ta11as modales en una bodega no estaba relacionado al numero de lances. En una bodega adicional compuesta de peces de varios lances, un experimento de mezcla vertical indico que una muestra representativa del contenido podría estar limitada a la mitad inferior de la bodega. Se uso el contenido de las bodegas de prueba para generar 25 bodegas simuladas y comparar los resultados de tres métodos de muestreo aplicados a estas. Los métodos fueron: (1) muestreo aleatorio (usado también como norma), (2) muestreo extendido, en el cual el proceso de selección fue extendido sobre una porción grande de una bodega, y (3) medición consecutiva de peces durante la descarga de la bodega. EI muestreo repetido con cada método y distintas combinaciones de n y m indico que, puesto que la fuente principal de variación de ta11a ocurría entre las unidades primarias, aumentar n fue la manera mas eficaz de reducir las estimaciones de la varianza de las ta11as de los grupos de edad y el numero total de peces en los desembarcos. El muestreo extendido evito mayormente los efectos de la estratificación por ta11a, y su desempeño fue esencialmente comparable a aquel del muestreo aleatorio. Se recomienda muestrear con este método. El muestreo de peces consecutivos produjo estimaciones mas sesgadas con mayores varianzas. Un análisis de las muestras de frecuencia de ta11a de 1988 indico que, para los grupos de edad que aparecen con mayor frecuencia en la captura, una frecuencia de muestreo minima de una unidad primaria de cada seis para cada estrato de mes-área reduciría los coeficientes de variación (CV) de las estimaciones de ta11a correspondientes a aproximadamente 10% 0 menos. Una estratificación adicional de las muestras por tipo de lance, y no solamente mes-área, redujo aun mas los CV de los grupos de edad escasos, tales como los reclutas, y mejoró potencialmente su precisión. Los CV de las estimaciones del reclutamiento para las cohortes completamente pescadas durante 1981-1984 fueron alrededor de 3-8%. Las estimaciones del reclutamiento y sus varianzas fueron también relativamente insensibles a cambios en las capturas de trimestres individuales y las varianzas, respectivamente, de las cuales fueron derivadas. (PDF contains 70 pages)
Resumo:
In this article, we offer a new way of exploring relationships between three different dimensions of a business operation, namely the stage of business development, the methods of creativity and the major cultural values. Although separately, each of these has gained enormous attention from the management research community, evidenced by a large volume of research studies, there have been not many studies that attempt to describe the logic that connect these three important aspects of a business; let alone empirical evidences that support any significant relationships among these variables. The paper also provides a data set and an empirical investigation on that data set, using a categorical data analysis, to conclude that examinations of these possible relationships are meaningful and possible for seemingly unquantifiable information. The results also show that the most significant category among all creativity methods employed in Vietnamese enterprises is the “creative disciplines” rule in the “entrepreneurial phase,” while in general creative disciplines have played a critical role in explaining the structure of our data sample, for both stages of development in our consideration.
Resumo:
BACKGROUND: The inherent complexity of statistical methods and clinical phenomena compel researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them have a complete knowledge in their counterpart's field. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Though communication has a central role in interdisciplinary collaboration and since miscommunication can have a negative impact on research processes, to the best of our knowledge, no study has yet explored how data analysis specialists and clinical researchers communicate over time. METHODS/PRINCIPAL FINDINGS: We conducted qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of "what if" situations that helped clarify how the method or information from the other field would behave, if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal. CONCLUSION/SIGNIFICANCE: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Resumo:
Statistics are regularly used to make some form of comparison between trace evidence or deploy the exclusionary principle (Morgan and Bull, 2007) in forensic investigations. Trace evidence are routinely the results of particle size, chemical or modal analyses and as such constitute compositional data. The issue is that compositional data including percentages, parts per million etc. only carry relative information. This may be problematic where a comparison of percentages and other constraint/closed data is deemed a statistically valid and appropriate way to present trace evidence in a court of law. Notwithstanding an awareness of the existence of the constant sum problem since the seminal works of Pearson (1896) and Chayes (1960) and the introduction of the application of log-ratio techniques (Aitchison, 1986; Pawlowsky-Glahn and Egozcue, 2001; Pawlowsky-Glahn and Buccianti, 2011; Tolosana-Delgado and van den Boogaart, 2013) the problem that a constant sum destroys the potential independence of variances and covariances required for correlation regression analysis and empirical multivariate methods (principal component analysis, cluster analysis, discriminant analysis, canonical correlation) is all too often not acknowledged in the statistical treatment of trace evidence. Yet the need for a robust treatment of forensic trace evidence analyses is obvious. This research examines the issues and potential pitfalls for forensic investigators if the constant sum constraint is ignored in the analysis and presentation of forensic trace evidence. Forensic case studies involving particle size and mineral analyses as trace evidence are used to demonstrate the use of a compositional data approach using a centred log-ratio (clr) transformation and multivariate statistical analyses.
Resumo:
”compositions” is a new R-package for the analysis of compositional and positive data. It contains four classes corresponding to the four different types of compositional and positive geometry (including the Aitchison geometry). It provides means for computation, plotting and high-level multivariate statistical analysis in all four geometries. These geometries are treated in an fully analogous way, based on the principle of working in coordinates, and the object-oriented programming paradigm of R. In this way, called functions automatically select the most appropriate type of analysis as a function of the geometry. The graphical capabilities include ternary diagrams and tetrahedrons, various compositional plots (boxplots, barplots, piecharts) and extensive graphical tools for principal components. Afterwards, ortion and proportion lines, straight lines and ellipses in all geometries can be added to plots. The package is accompanied by a hands-on-introduction, documentation for every function, demos of the graphical capabilities and plenty of usage examples. It allows direct and parallel computation in all four vector spaces and provides the beginner with a copy-and-paste style of data analysis, while letting advanced users keep the functionality and customizability they demand of R, as well as all necessary tools to add own analysis routines. A complete example is included in the appendix
Resumo:
Many multivariate methods that are apparently distinct can be linked by introducing one or more parameters in their definition. Methods that can be linked in this way are correspondence analysis, unweighted or weighted logratio analysis (the latter also known as "spectral mapping"), nonsymmetric correspondence analysis, principal component analysis (with and without logarithmic transformation of the data) and multidimensional scaling. In this presentation I will show how several of these methods, which are frequently used in compositional data analysis, may be linked through parametrizations such as power transformations, linear transformations and convex linear combinations. Since the methods of interest here all lead to visual maps of data, a "movie" can be made where where the linking parameter is allowed to vary in small steps: the results are recalculated "frame by frame" and one can see the smooth change from one method to another. Several of these "movies" will be shown, giving a deeper insight into the similarities and differences between these methods
Resumo:
This work presents a novel approach in order to increase the recognition power of Multiscale Fractal Dimension (MFD) techniques, when applied to image classification. The proposal uses Functional Data Analysis (FDA) with the aim of enhancing the MFD technique precision achieving a more representative descriptors vector, capable of recognizing and characterizing more precisely objects in an image. FDA is applied to signatures extracted by using the Bouligand-Minkowsky MFD technique in the generation of a descriptors vector from them. For the evaluation of the obtained improvement, an experiment using two datasets of objects was carried out. A dataset was used of characters shapes (26 characters of the Latin alphabet) carrying different levels of controlled noise and a dataset of fish images contours. A comparison with the use of the well-known methods of Fourier and wavelets descriptors was performed with the aim of verifying the performance of FDA method. The descriptor vectors were submitted to Linear Discriminant Analysis (LDA) classification method and we compared the correctness rate in the classification process among the descriptors methods. The results demonstrate that FDA overcomes the literature methods (Fourier and wavelets) in the processing of information extracted from the MFD signature. In this way, the proposed method can be considered as an interesting choice for pattern recognition and image classification using fractal analysis.
Resumo:
Each plasma physics laboratory has a proprietary scheme to control and data acquisition system. Usually, it is different from one laboratory to another. It means that each laboratory has its own way to control the experiment and retrieving data from the database. Fusion research relies to a great extent on international collaboration and this private system makes it difficult to follow the work remotely. The TCABR data analysis and acquisition system has been upgraded to support a joint research programme using remote participation technologies. The choice of MDSplus (Model Driven System plus) is proved by the fact that it is widely utilized, and the scientists from different institutions may use the same system in different experiments in different tokamaks without the need to know how each system treats its acquisition system and data analysis. Another important point is the fact that the MDSplus has a library system that allows communication between different types of language (JAVA, Fortran, C, C++, Python) and programs such as MATLAB, IDL, OCTAVE. In the case of tokamak TCABR interfaces (object of this paper) between the system already in use and MDSplus were developed, instead of using the MDSplus at all stages, from the control, and data acquisition to the data analysis. This was done in the way to preserve a complex system already in operation and otherwise it would take a long time to migrate. This implementation also allows add new components using the MDSplus fully at all stages. (c) 2012 Elsevier B.V. All rights reserved.
Resumo:
Data visualization techniques are powerful in the handling and analysis of multivariate systems. One such technique known as parallel coordinates was used to support the diagnosis of an event, detected by a neural network-based monitoring system, in a boiler at a Brazilian Kraft pulp mill. Its attractiveness is the possibility of the visualization of several variables simultaneously. The diagnostic procedure was carried out step-by-step going through exploratory, explanatory, confirmatory, and communicative goals. This tool allowed the visualization of the boiler dynamics in an easier way, compared to commonly used univariate trend plots. In addition it facilitated analysis of other aspects, namely relationships among process variables, distinct modes of operation and discrepant data. The whole analysis revealed firstly that the period involving the detected event was associated with a transition between two distinct normal modes of operation, and secondly the presence of unusual changes in process variables at this time.
Resumo:
Supernovae are among the most energetic events occurring in the universe and are so far the only verified extrasolar source of neutrinos. As the explosion mechanism is still not well understood, recording a burst of neutrinos from such a stellar explosion would be an important benchmark for particle physics as well as for the core collapse models. The neutrino telescope IceCube is located at the Geographic South Pole and monitors the antarctic glacier for Cherenkov photons. Even though it was conceived for the detection of high energy neutrinos, it is capable of identifying a burst of low energy neutrinos ejected from a supernova in the Milky Way by exploiting the low photomultiplier noise in the antarctic ice and extracting a collective rate increase. A signal Monte Carlo specifically developed for water Cherenkov telescopes is presented. With its help, we will investigate how well IceCube can distinguish between core collapse models and oscillation scenarios. In the second part, nine years of data taken with the IceCube precursor AMANDA will be analyzed. Intensive data cleaning methods will be presented along with a background simulation. From the result, an upper limit on the expected occurrence of supernovae within the Milky Way will be determined.
Resumo:
Introduction. Food frequency questionnaires (FFQ) are used study the association between dietary intake and disease. An instructional video may potentially offer a low cost, practical method of dietary assessment training for participants thereby reducing recall bias in FFQs. There is little evidence in the literature of the effect of using instructional videos on FFQ-based intake. Objective. This analysis compared the reported energy and macronutrient intake of two groups that were randomized either to watch an instructional video before completing an FFQ or to view the same instructional video after completing the same FFQ. Methods. In the parent study, a diverse group of students, faculty and staff from Houston Community College were randomized to two groups, stratified by ethnicity, and completed an FFQ. The "video before" group watched an instructional video about completing the FFQ prior to answering the FFQ. The "video after" group watched the instructional video after completing the FFQ. The two groups were compared on mean daily energy (Kcal/day), fat (g/day), protein (g/day), carbohydrate (g/day) and fiber (g/day) intakes using descriptive statistics and one-way ANOVA. Demographic, height, and weight information was collected. Dietary intakes were adjusted for total energy intake before the comparative analysis. BMI and age were ruled out as potential confounders. Results. There were no significant differences between the two groups in mean daily dietary intakes of energy, total fat, protein, carbohydrates and fiber. However, a pattern of higher energy intake and lower fiber intake was reported in the group that viewed the instructional video before completing the FFQ compared to those who viewed the video after. Discussion. Analysis of the difference between reported intake of energy and macronutrients showed an overall pattern, albeit not statistically significant, of higher intake in the video before versus the video after group. Application of instructional videos for dietary assessment may require further research to address the validity of reported dietary intakes in those who are randomized to watch an instructional video before reporting diet compared to a control groups that does not view a video.^