952 resultados para Dynamic data set visualization
Resumo:
When continuous data are coded to categorical variables, two types of coding are possible: crisp coding in the form of indicator, or dummy, variables with values either 0 or 1; or fuzzy coding where each observation is transformed to a set of "degrees of membership" between 0 and 1, using co-called membership functions. It is well known that the correspondence analysis of crisp coded data, namely multiple correspondence analysis, yields principal inertias (eigenvalues) that considerably underestimate the quality of the solution in a low-dimensional space. Since the crisp data only code the categories to which each individual case belongs, an alternative measure of fit is simply to count how well these categories are predicted by the solution. Another approach is to consider multiple correspondence analysis equivalently as the analysis of the Burt matrix (i.e., the matrix of all two-way cross-tabulations of the categorical variables), and then perform a joint correspondence analysis to fit just the off-diagonal tables of the Burt matrix - the measure of fit is then computed as the quality of explaining these tables only. The correspondence analysis of fuzzy coded data, called "fuzzy multiple correspondence analysis", suffers from the same problem, albeit attenuated. Again, one can count how many correct predictions are made of the categories which have highest degree of membership. But here one can also defuzzify the results of the analysis to obtain estimated values of the original data, and then calculate a measure of fit in the familiar percentage form, thanks to the resultant orthogonal decomposition of variance. Furthermore, if one thinks of fuzzy multiple correspondence analysis as explaining the two-way associations between variables, a fuzzy Burt matrix can be computed and the same strategy as in the crisp case can be applied to analyse the off-diagonal part of this matrix. In this paper these alternative measures of fit are defined and applied to a data set of continuous meteorological variables, which are coded crisply and fuzzily into three categories. Measuring the fit is further discussed when the data set consists of a mixture of discrete and continuous variables.
Resumo:
Despite the advancement of phylogenetic methods to estimate speciation and extinction rates, their power can be limited under variable rates, in particular for clades with high extinction rates and small number of extant species. Fossil data can provide a powerful alternative source of information to investigate diversification processes. Here, we present PyRate, a computer program to estimate speciation and extinction rates and their temporal dynamics from fossil occurrence data. The rates are inferred in a Bayesian framework and are comparable to those estimated from phylogenetic trees. We describe how PyRate can be used to explore different models of diversification. In addition to the diversification rates, it provides estimates of the parameters of the preservation process (fossilization and sampling) and the times of speciation and extinction of each species in the data set. Moreover, we develop a new birth-death model to correlate the variation of speciation/extinction rates with changes of a continuous trait. Finally, we demonstrate the use of Bayes factors for model selection and show how the posterior estimates of a PyRate analysis can be used to generate calibration densities for Bayesian molecular clock analysis. PyRate is an open-source command-line Python program available at http://sourceforge.net/projects/pyrate/.
Resumo:
A biplot, which is the multivariate generalization of the two-variable scatterplot, can be used to visualize the results of many multivariate techniques, especially those that are based on the singular value decomposition. We consider data sets consisting of continuous-scale measurements, their fuzzy coding and the biplots that visualize them, using a fuzzy version of multiple correspondence analysis. Of special interest is the way quality of fit of the biplot is measured, since it is well-known that regular (i.e., crisp) multiple correspondence analysis seriously under-estimates this measure. We show how the results of fuzzy multiple correspondence analysis can be defuzzified to obtain estimated values of the original data, and prove that this implies an orthogonal decomposition of variance. This permits a measure of fit to be calculated in the familiar form of a percentage of explained variance, which is directly comparable to the corresponding fit measure used in principal component analysis of the original data. The approach is motivated initially by its application to a simulated data set, showing how the fuzzy approach can lead to diagnosing nonlinear relationships, and finally it is applied to a real set of meteorological data.
Resumo:
The singular value decomposition and its interpretation as alinear biplot has proved to be a powerful tool for analysing many formsof multivariate data. Here we adapt biplot methodology to the specifficcase of compositional data consisting of positive vectors each of whichis constrained to have unit sum. These relative variation biplots haveproperties relating to special features of compositional data: the studyof ratios, subcompositions and models of compositional relationships. Themethodology is demonstrated on a data set consisting of six-part colourcompositions in 22 abstract paintings, showing how the singular valuedecomposition can achieve an accurate biplot of the colour ratios and howpossible models interrelating the colours can be diagnosed.
Resumo:
Structural equation models (SEM) are commonly used to analyze the relationship between variables some of which may be latent, such as individual ``attitude'' to and ``behavior'' concerning specific issues. A number of difficulties arise when we want to compare a large number of groups, each with large sample size, and the manifest variables are distinctly non-normally distributed. Using an specific data set, we evaluate the appropriateness of the following alternative SEM approaches: multiple group versus MIMIC models, continuous versus ordinal variables estimation methods, and normal theory versus non-normal estimation methods. The approaches are applied to the ISSP-1993 Environmental data set, with the purpose of exploring variation in the mean level of variables of ``attitude'' to and ``behavior''concerning environmental issues and their mutual relationship across countries. Issues of both theoretical and practical relevance arise in the course of this application.
Resumo:
It is shown how correspondence analysis may be applied to a subset of response categories from a questionnaire survey, for example the subset of undecided responses or the subset of responses for a particular category. The idea is to maintain the original relative frequencies of the categories and not re-express them relative to totals within the subset, as would normally be done in a regular correspondence analysis of the subset. Furthermore, the masses and chi-square metric assigned to the data subset are the same as those in the correspondence analysis of the whole data set. This variant of the method, called Subset Correspondence Analysis, is illustrated on data from the ISSP survey on Family and Changing Gender Roles.
Resumo:
Persons with Down syndrome (DS) uniquely have an increased frequency of leukemias but a decreased total frequency of solid tumors. The distribution and frequency of specific types of brain tumors have never been studied in DS. We evaluated the frequency of primary neural cell embryonal tumors and gliomas in a large international data set. The observed number of children with DS having a medulloblastoma, central nervous system primitive neuroectodermal tumor (CNS-PNET) or glial tumor was compared to the expected number. Data were collected from cancer registries or brain tumor registries in 13 countries of Europe, America, Asia and Oceania. The number of DS children with each category of tumor was treated as a Poisson variable with mean equal to 0.000884 times the total number of registrations in that category. Among 8,043 neural cell embryonal tumors (6,882 medulloblastomas and 1,161 CNS-PNETs), only one patient with medulloblastoma had DS, while 7.11 children in total and 6.08 with medulloblastoma were expected to have DS. (p 0.016 and 0.0066 respectively). Among 13,797 children with glioma, 10 had DS, whereas 12.2 were expected. Children with DS appear to be specifically protected against primary neural cell embryonal tumors of the CNS, whereas gliomas occur at the same frequency as in the general population. A similar protection against neuroblastoma, the principal extracranial neural cell embryonal tumor, has been observed in children with DS. Additional genetic material on the supernumerary chromosome 21 may protect against embryonal neural cell tumor development.
Resumo:
In recent years there has been an explosive growth in the development of adaptive and data driven methods. One of the efficient and data-driven approaches is based on statistical learning theory (Vapnik 1998). The theory is based on Structural Risk Minimisation (SRM) principle and has a solid statistical background. When applying SRM we are trying not only to reduce training error ? to fit the available data with a model, but also to reduce the complexity of the model and to reduce generalisation error. Many nonlinear learning procedures recently developed in neural networks and statistics can be understood and interpreted in terms of the structural risk minimisation inductive principle. A recent methodology based on SRM is called Support Vector Machines (SVM). At present SLT is still under intensive development and SVM find new areas of application (www.kernel-machines.org). SVM develop robust and non linear data models with excellent generalisation abilities that is very important both for monitoring and forecasting. SVM are extremely good when input space is high dimensional and training data set i not big enough to develop corresponding nonlinear model. Moreover, SVM use only support vectors to derive decision boundaries. It opens a way to sampling optimization, estimation of noise in data, quantification of data redundancy etc. Presentation of SVM for spatially distributed data is given in (Kanevski and Maignan 2004).
Resumo:
The quality of semi-detailed (scale 1:100.000) soil maps and the utility of a taxonomically based legend were assessed by studying 33 apparently homogeneous fields with strongly weathered soils in two regions in São Paulo State: Araras and Assis. An independent data set of 395 auger sites was used to determine purity of soil mapping units and analysis of variance within and between mapping units and soil classification units. Twenty three soil profiles were studied in detail. The studied soil maps have a high purity for some legend criteria, such as B horizon type (> 90%) and soil texture class (> 80%). The purity for the "trophic character" (eutrophic, dystrophic, allic) was only 55% in Assis. It was 88% in Araras, where many soil units had been mapped as associations. In both regions, the base status of clay-textured soils was generally better than suggested by the maps. Analysis of variance showed that mapping was successful for "durable" soil characteristics such as clay content (> 80% of variance explained) and cation exchange capacity (≥ 50% of variance explained) of 0-20 and 60-80 cm layers. For soil characteristics that are easily modified by management, such as base saturation of the 0-20 cm layer, the maps had explained very little (< 15%) of the total variance in the study areas. Intermediate results were obtained for base saturation of the 60-80 cm layer (56% in Assis; 42% in Araras). Variance explained by taxonomic groupings that formed the basis for the legend of the soil maps was similar to, often even smaller than, variance explained by mapping units. The conclusion is that map boundaries have been very carefully located, but descriptions of mapping units could be improved. In future mappings, this could possibly be done at low cost by (a) bulk sampling to remove short range variation and enhance visualization of spatial patterns at distances > 100 m; (b) taking advantage of correlations between easily measured soil characteristics and chemical soil properties and, (c) unbending the link between legend criteria and a taxonomic system. The maps are well suited to obtain an impression of land suitability for high-input farming. Additional field work and data on former land use/management are necessary for the evaluation of chemical properties of surface horizons.
Resumo:
The paper deals with the development and application of the generic methodology for automatic processing (mapping and classification) of environmental data. General Regression Neural Network (GRNN) is considered in detail and is proposed as an efficient tool to solve the problem of spatial data mapping (regression). The Probabilistic Neural Network (PNN) is considered as an automatic tool for spatial classifications. The automatic tuning of isotropic and anisotropic GRNN/PNN models using cross-validation procedure is presented. Results are compared with the k-Nearest-Neighbours (k-NN) interpolation algorithm using independent validation data set. Real case studies are based on decision-oriented mapping and classification of radioactively contaminated territories.
Resumo:
For the last 2 decades, supertree reconstruction has been an active field of research and has seen the development of a large number of major algorithms. Because of the growing popularity of the supertree methods, it has become necessary to evaluate the performance of these algorithms to determine which are the best options (especially with regard to the supermatrix approach that is widely used). In this study, seven of the most commonly used supertree methods are investigated by using a large empirical data set (in terms of number of taxa and molecular markers) from the worldwide flowering plant family Sapindaceae. Supertree methods were evaluated using several criteria: similarity of the supertrees with the input trees, similarity between the supertrees and the total evidence tree, level of resolution of the supertree and computational time required by the algorithm. Additional analyses were also conducted on a reduced data set to test if the performance levels were affected by the heuristic searches rather than the algorithms themselves. Based on our results, two main groups of supertree methods were identified: on one hand, the matrix representation with parsimony (MRP), MinFlip, and MinCut methods performed well according to our criteria, whereas the average consensus, split fit, and most similar supertree methods showed a poorer performance or at least did not behave the same way as the total evidence tree. Results for the super distance matrix, that is, the most recent approach tested here, were promising with at least one derived method performing as well as MRP, MinFlip, and MinCut. The output of each method was only slightly improved when applied to the reduced data set, suggesting a correct behavior of the heuristic searches and a relatively low sensitivity of the algorithms to data set sizes and missing data. Results also showed that the MRP analyses could reach a high level of quality even when using a simple heuristic search strategy, with the exception of MRP with Purvis coding scheme and reversible parsimony. The future of supertrees lies in the implementation of a standardized heuristic search for all methods and the increase in computing power to handle large data sets. The latter would prove to be particularly useful for promising approaches such as the maximum quartet fit method that yet requires substantial computing power.
Resumo:
This paper presents a process of mining research & development abstract databases to profile current status and to project potential developments for target technologies, The process is called "technology opportunities analysis." This article steps through the process using a sample data set of abstracts from the INSPEC database on the topic o "knowledge discovery and data mining." The paper offers a set of specific indicators suitable for mining such databases to understand innovation prospects. In illustrating the uses of such indicators, it offers some insights into the status of knowledge discovery research*.
Resumo:
Predictive groundwater modeling requires accurate information about aquifer characteristics. Geophysical imaging is a powerful tool for delineating aquifer properties at an appropriate scale and resolution, but it suffers from problems of ambiguity. One way to overcome such limitations is to adopt a simultaneous multitechnique inversion strategy. We have developed a methodology for aquifer characterization based on structural joint inversion of multiple geophysical data sets followed by clustering to form zones and subsequent inversion for zonal parameters. Joint inversions based on cross-gradient structural constraints require less restrictive assumptions than, say, applying predefined petro-physical relationships and generally yield superior results. This approach has, for the first time, been applied to three geophysical data types in three dimensions. A classification scheme using maximum likelihood estimation is used to determine the parameters of a Gaussian mixture model that defines zonal geometries from joint-inversion tomograms. The resulting zones are used to estimate representative geophysical parameters of each zone, which are then used for field-scale petrophysical analysis. A synthetic study demonstrated how joint inversion of seismic and radar traveltimes and electrical resistance tomography (ERT) data greatly reduces misclassification of zones (down from 21.3% to 3.7%) and improves the accuracy of retrieved zonal parameters (from 1.8% to 0.3%) compared to individual inversions. We applied our scheme to a data set collected in northeastern Switzerland to delineate lithologic subunits within a gravel aquifer. The inversion models resolve three principal subhorizontal units along with some important 3D heterogeneity. Petro-physical analysis of the zonal parameters indicated approximately 30% variation in porosity within the gravel aquifer and an increasing fraction of finer sediments with depth.
Resumo:
PURPOSE: To compare 3 different flow targeted magnetization preparation strategies for coronary MR angiography (cMRA), which allow selective visualization of the vessel lumen. MATERIAL AND METHODS: The right coronary artery of 10 healthy subjects was investigated on a 1.5 Tesla MR system (Gyroscan ACS-NT, Philips Healthcare, Best, NL). A navigator-gated and ECG-triggered 3D radial steady-state free-precession (SSFP) cMRA sequence with 3 different magnetization preparation schemes was performed referred to as projection SSFP (selective labeling of the aorta, subtraction of 2 data sets), LoReIn SSFP (double-inversion preparation, selective labeling of the aorta, 1 data set), and inflow SSFP (inversion preparation, selective labeling of the coronary artery, 1 data set). Signal-to-noise ratio (SNR) of the coronary artery and aorta, contrast-to-noise ratio (CNR) between the coronary artery and epicardial fat, vessel length and vessel sharpness were analyzed. RESULTS: All cMRA sequences were successfully obtained in all subjects. Both projection SSFP and LoReIn SSFP allowed for selective visualization of the coronary arteries with excellent background suppression. Scan time was doubled in projection SSFP because of the need for subtraction of 2 data sets. In inflow SSFP, background suppression was limited to the tissue included in the inversion volume. Projection SSFP (SNR(coro): 25.6 +/- 12.1; SNR(ao): 26.1 +/- 16.8; CNR(coro-fat): 22.0 +/- 11.7) and inflow SSFP (SNR(coro): 27.9 +/- 5.4; SNR(ao): 37.4 +/- 9.2; CNR(coro-fat): 24.9 +/- 4.8) yielded significantly increased SNR and CNR compared with LoReIn SSFP (SNR(coro): 12.3 +/- 5.4; SNR(ao): 11.8 +/- 5.8; CNR(coro-fat): 9.8 +/- 5.5; P < 0.05 for both). Longest visible vessel length was found with projection SSFP (79.5 mm +/- 18.9; P < 0.05 vs. LoReIn) whereas vessel sharpness was best in inflow SSFP (68.2% +/- 4.5%; P < 0.05 vs. LoReIn). Consistently good image quality was achieved using inflow SSFP likely because of the simple planning procedure and short scanning time. CONCLUSION: Three flow targeted cMRA approaches are presented, which provide selective visualization of the coronary vessel lumen and in addition blood flow information without the need of contrast agent administration. Inflow SSFP yielded highest SNR, CNR and vessel sharpness and may prove useful as a fast and efficient approach for assessing proximal and mid vessel coronary blood flow, whereas requiring less planning skills than projection SSFP or LoReIn SSFP.
Resumo:
The main goal of CleanEx is to provide access to public gene expression data via unique gene names. A second objective is to represent heterogeneous expression data produced by different technologies in a way that facilitates joint analysis and cross-data set comparisons. A consistent and up-to-date gene nomenclature is achieved by associating each single experiment with a permanent target identifier consisting of a physical description of the targeted RNA population or the hybridization reagent used. These targets are then mapped at regular intervals to the growing and evolving catalogues of human genes and genes from model organisms. The completely automatic mapping procedure relies partly on external genome information resources such as UniGene and RefSeq. The central part of CleanEx is a weekly built gene index containing cross-references to all public expression data already incorporated into the system. In addition, the expression target database of CleanEx provides gene mapping and quality control information for various types of experimental resource, such as cDNA clones or Affymetrix probe sets. The web-based query interfaces offer access to individual entries via text string searches or quantitative expression criteria. CleanEx is accessible at: http://www.cleanex.isb-sib.ch/.