952 results for Data sets storage
Abstract:
The biplot has proved to be a powerful descriptive and analytical tool in many areas of applications of statistics. For compositional data the necessary theoretical adaptation has been provided, with illustrative applications, by Aitchison (1990) and Aitchison and Greenacre (2002). These papers were restricted to the interpretation of simple compositional data sets. In many situations the problem has to be described in some form of conditional modelling. For example, in a clinical trial where interest is in how patients’ steroid metabolite compositions may change as a result of different treatment regimes, interest is in relating the compositions after treatment to the compositions before treatment and the nature of the treatments applied. To study this through a biplot technique requires the development of some form of conditional compositional biplot. This is the purpose of this paper. We choose as a motivating application an analysis of the 1992 US Presidential Election, where interest may be in how the three-part composition, the percentage division among the three candidates (Bush, Clinton and Perot) of the presidential vote in each state, depends on the ethnic composition and on the urban-rural composition of the state. The methodology of conditional compositional biplots is first developed and a detailed interpretation of the 1992 US Presidential Election provided. We use a second application involving the conditional variability of tektite mineral compositions with respect to major oxide compositions to demonstrate some hazards of simplistic interpretation of biplots. Finally we conjecture on further possible applications of conditional compositional biplots.
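As a point of reference for the abstract above, the following is a minimal sketch of an ordinary (unconditional) compositional biplot: centred log-ratio (clr) transform followed by a singular value decomposition. It is not the conditional method developed in the paper. The function names and the vote shares are made up for illustration and are not actual 1992 results.

```python
import numpy as np

def clr(X):
    """Centred log-ratio transform of each row (rows are compositions)."""
    X = np.asarray(X, dtype=float)
    X = X / X.sum(axis=1, keepdims=True)       # closure: each row sums to 1
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def compositional_biplot(X, k=2):
    """Return k-dimensional case markers and part rays from the SVD of clr data."""
    Z = clr(X)
    Z = Z - Z.mean(axis=0)                     # centre each clr column
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    cases = U[:, :k] * s[:k]                   # one marker per observation
    rays = Vt[:k].T                            # one ray per compositional part
    return cases, rays

# Hypothetical three-part vote shares (illustrative numbers only).
votes = np.array([
    [0.37, 0.43, 0.20],
    [0.41, 0.40, 0.19],
    [0.35, 0.46, 0.19],
    [0.39, 0.37, 0.24],
])
cases, rays = compositional_biplot(votes)
```

For a three-part composition the clr data have rank at most two, so the two-dimensional biplot reproduces the centred clr matrix exactly; with more parts it is only a low-rank approximation.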
Abstract:
All of the imputation techniques usually applied for replacing values below the detection limit in compositional data sets have adverse effects on the variability. In this work we propose a modification of the EM algorithm that is applied using the additive log-ratio transformation. This new strategy is applied to a compositional data set and the results are compared with the usual imputation techniques.
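The additive log-ratio (alr) transformation mentioned above can be sketched as follows. This shows only the transform and its inverse, with the last part taken as the divisor by assumption; it does not reproduce the modified EM algorithm the abstract proposes, and the function names are illustrative.

```python
import numpy as np

def closure(x):
    """Rescale a composition (or rows of compositions) to sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def alr(x):
    """Additive log-ratio: log of each part divided by the last part."""
    x = closure(x)
    return np.log(x[..., :-1] / x[..., -1:])

def alr_inv(y):
    """Map alr coordinates back to a composition that sums to 1."""
    y = np.asarray(y, dtype=float)
    ones = np.ones(y.shape[:-1] + (1,))
    return closure(np.concatenate([np.exp(y), ones], axis=-1))

composition = np.array([0.2, 0.3, 0.5])
coords = alr(composition)        # two unconstrained real coordinates
recovered = alr_inv(coords)      # back on the simplex
```

Working in alr coordinates lets standard multivariate tools, such as an EM step under a normal model, operate on unconstrained real vectors before mapping back to compositions.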
Abstract:
The R package “compositions” is a tool for advanced compositional analysis. Its basic functionality has seen some conceptual improvement, containing now some facilities to work with and represent ilr bases built from balances, and an elaborated subsystem for dealing with several kinds of irregular data: (rounded or structural) zeroes, incomplete observations and outliers. The general approach to these irregularities is based on subcompositions: for an irregular datum, one can distinguish a “regular” subcomposition (where all parts are actually observed and the datum behaves typically) and a “problematic” subcomposition (with those unobserved, zero or rounded parts, or else where the datum shows an erratic or atypical behaviour). Systematic classification schemes are proposed for both outliers and missing values (including zeros) focusing on the nature of irregularities in the datum subcomposition(s). To compute statistics with values missing at random and structural zeros, a projection approach is implemented: a given datum contributes to the estimation of the desired parameters only on the subcomposition where it was observed. For data sets with values below the detection limit, two different approaches are provided: the well-known imputation technique, and also the projection approach. To compute statistics in the presence of outliers, robust statistics are adapted to the characteristics of compositional data, based on the minimum covariance determinant approach. The outlier classification is based on four different models of outlier occurrence and Monte-Carlo-based tests for their characterization. Furthermore the package provides special plots helping to understand the nature of outliers in the dataset. Keywords: coda-dendrogram, lost values, MAR, missing data, MCD estimator, robustness, rounded zeros
Abstract:
We investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In detail, we are given a set of labeled images of scenes (for example, coast, forest, city, river, etc.), and our objective is to classify a new image into one of these categories. Our approach consists of first discovering latent "topics" using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature here applied to a bag of visual words representation for each image, and subsequently, training a multiway classifier on the topic distribution vector for each image. We compare this approach to that of representing each image by a bag of visual words vector directly and training a multiway classifier on these vectors. To this end, we introduce a novel vocabulary using dense color SIFT descriptors and then investigate the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM). We achieve superior classification performance to recent publications that have used a bag of visual word representation, in all cases, using the authors' own data sets and testing protocols. We also investigate the gain in adding spatial information. We show applications to image retrieval with relevance feedback and to scene classification in videos.
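The topic-discovery step described above can be illustrated with a minimal pLSA fit by expectation-maximization on a count matrix (images by visual words). This is a generic textbook-style sketch, not the authors' implementation; the function name, the parameters, and the toy counts are assumptions made for illustration.

```python
import numpy as np

def _normalize(m, axis):
    """Scale entries so they sum to 1 along `axis`, guarding against zero sums."""
    return m / np.maximum(m.sum(axis=axis, keepdims=True), 1e-12)

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM. Returns P(z|d) of shape (docs, topics) and P(w|z) of shape (topics, words)."""
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = _normalize(rng.random((n_docs, n_topics)), axis=1)
    p_w_z = _normalize(rng.random((n_topics, n_words)), axis=1)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (docs, topics, words)
        resp = _normalize(joint, axis=1)
        # M-step: re-estimate both distributions from expected counts
        expected = counts[:, None, :] * resp
        p_w_z = _normalize(expected.sum(axis=0), axis=1)
        p_z_d = _normalize(expected.sum(axis=2), axis=1)
    return p_z_d, p_w_z

# Toy bag-of-visual-words counts: 4 images, 4 visual words (made-up numbers).
toy_counts = np.array([[5, 4, 0, 0], [6, 3, 0, 0], [0, 0, 4, 5], [0, 0, 3, 6]])
topic_vectors, word_dists = plsa(toy_counts, n_topics=2)
```

Each row of `topic_vectors` is the low-dimensional topic distribution that would be passed to a discriminative classifier such as k-nearest neighbor or an SVM in place of the raw visual-word histogram.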
Abstract:
It has been shown that the accuracy of mammographic abnormality detection methods is strongly dependent on the breast tissue characteristics, where a dense breast drastically reduces detection sensitivity. In addition, breast tissue density is widely accepted to be an important risk indicator for the development of breast cancer. Here, we describe the development of an automatic breast tissue classification methodology, which can be summarized in a number of distinct steps: 1) the segmentation of the breast area into fatty versus dense mammographic tissue; 2) the extraction of morphological and texture features from the segmented breast areas; and 3) the use of a Bayesian combination of a number of classifiers. The evaluation, based on a large number of cases from two different mammographic data sets, shows a strong correlation ( and 0.67 for the two data sets) between automatic and expert-based Breast Imaging Reporting and Data System mammographic density assessment
Abstract:
Introduction: Breast cancer is the leading cancer among women. It is also the leading cause of cancer death among Hispanic women and the second among women of other races, in addition to the large social and economic impact of this disease. This motivates local studies that broaden our knowledge and contribute to the Colombian literature a publication reflecting the factors associated with relapse in breast cancer. Methods: Retrospective analytical observational case-control study in which 267 clinical records of patients diagnosed with breast cancer were reviewed, classified by clinical stage and molecular expression of the tumor; the factors most strongly associated with relapse were analyzed. Results: The total population consisted of 267 women, of whom 58 relapsed, with a case-control ratio of 1:3. The groups were homogeneous in age, type of neoplasm, parity, and histology, from which we conclude that they are comparable. Mortality was 13.8% among patients with tumor relapse versus 0% among those without relapse. In addition, there was a relationship between the presence of the HER2 receptor and tumor relapse which, although not statistically significant (p = 0.112), is worth noting for its clinical significance. The presence of estrogen and progestogen receptors, for its part, was not a predictor of relapse. Surgery appeared as a protective factor (OAR: 0.046, p = 0.008). Finally, statistically significant associations with tumor relapse were found for age (p = 0.009), clinical stage at diagnosis (p < 0.001), and molecular classification of the tumor (p = 0.016).
Conclusions: Age, clinical stage at diagnosis, and molecular classification of the tumor were identified as factors associated with tumor relapse in breast cancer patients at an institution in Bogotá, Colombia, confirming the aggressiveness of triple-negative tumors. All findings are consistent with the world literature. This points to the need for public health strategies in our country that educate all age groups about screening in the young population now being affected and about early-stage detection of breast cancer, together with prioritized management and improvements in the care pathway that can positively affect the outcomes and quality of life of women with this disease. These results also encourage continued research into new technologies and drugs to combat the most molecularly aggressive tumors.
Abstract:
Most models that explore the relationship between inequality in income distribution and economic growth postulate a negative correlation between the two, generated through different mechanisms. Alongside the theoretical models, a substantial number of empirical studies have tried to assess this relationship. From this effort a broad consensus has emerged that validates the existence of this negative relationship. However, recent studies based on panel data have produced the opposite result, documenting a positive relationship between inequality and growth. The examination of the debate generated by these results, together with the empirical work carried out in this study, indicates that the estimates obtained in various studies may not be as robust as was believed. Consequently, it is suggested that country case studies may be a better way to explore this topic.
Abstract:
The growth of databases containing ever more difficult images and a larger number of categories is forcing the development of image representation techniques that are discriminative when working with multiple classes, and of algorithms that are efficient in learning and classification. This thesis explores the problem of classifying images according to the object they contain when a large number of categories is available. First, it investigates how a hybrid system formed by a generative model and a discriminative model can benefit the image classification task when the level of human annotation is minimal. For this task we introduce a new vocabulary using a dense representation of color-SIFT descriptors, and then investigate how the different parameters affect the final classification. Next, a method is proposed to incorporate spatial information into the hybrid system, showing that context information is of great help for image classification. We then introduce a new shape descriptor that represents the image by its local shape and its spatial shape, together with a kernel that incorporates this spatial information in pyramid form. Shape is represented by a compact vector, giving a descriptor well suited for use with kernel-based learning algorithms. The experiments show that this shape information gives results similar to (and sometimes better than) appearance-based descriptors. We also investigate how different features can be combined for image classification and show that the proposed shape descriptor together with an appearance descriptor substantially improves classification. Finally, an algorithm is described that detects regions of interest automatically during training and classification.
This provides a method for suppressing the image background and adds invariance to the position of objects within the images. It is shown that using shape and appearance over this region of interest with random forest classifiers improves classification and computational time. Our results are compared with results from the literature using the same databases as the authors, as well as the same learning and classification protocols. All the innovations introduced are seen to improve the final image classification.
Abstract:
The methodology focuses on the use of digital air photos to monitor changes in land cover and to study its dynamics and patterns over the last 50 years. The dissertation also takes into account the relationship between open-habitat patterns and dynamics and biodiversity persistence, increased fire risk, land ownership, and management. A Geographic Information System (GIS) is therefore a very useful mapping tool that enables the capture, storage, retrieval, manipulation, analysis, and modeling of geographic or spatial data. Finally, this research develops a heuristic model to create sites using suitability maps and a reserve design model to select the best sites in order to increase landscape heterogeneity at the lowest cost.
Abstract:
Conservation planning requires identifying pertinent habitat factors and locating geographic locations where land management may improve habitat conditions for high priority species. I derived habitat models and mapped predicted abundance for the Golden-winged Warbler (Vermivora chrysoptera), a species of high conservation concern, using bird counts, environmental variables, and hierarchical models applied at multiple spatial scales. My aim was to understand habitat associations at multiple spatial scales and create a predictive abundance map for purposes of conservation planning for the Golden-winged Warbler. My models indicated a substantial influence of landscape conditions, including strong positive associations with total forest composition within the landscape. However, many of the associations I observed were counter to reported associations at finer spatial extents; for instance, I found Golden-winged Warblers negatively associated with several measures of edge habitat. No single spatial scale dominated, indicating that this species is responding to factors at multiple spatial scales. I found Golden-winged Warbler abundance was negatively related with Blue-winged Warbler (Vermivora cyanoptera) abundance. I also observed a north-south spatial trend suggestive of a regional climate effect that was not previously noted for this species. The map of predicted abundance indicated a large area of concentrated abundance in west-central Wisconsin, with smaller areas of high abundance along the northern periphery of the Prairie Hardwood Transition. This map of predicted abundance compared favorably with independent evaluation data sets and can thus be used to inform regional planning efforts devoted to conserving this species.
Abstract:
Population models are essential components of large-scale conservation and management plans for the federally endangered Golden-cheeked Warbler (Setophaga chrysoparia; hereafter GCWA). However, existing models are based on vital rate estimates calculated using relatively small data sets that are now more than a decade old. We estimated more current, precise adult and juvenile apparent survival (Φ) probabilities and their associated variances for male GCWAs. In addition to providing estimates for use in population modeling, we tested hypotheses about spatial and temporal variation in Φ. We assessed whether a linear trend in Φ or a change in the overall mean Φ corresponded to an observed increase in GCWA abundance during 1992-2000 and if Φ varied among study plots. To accomplish these objectives, we analyzed long-term GCWA capture-resight data from 1992 through 2011, collected across seven study plots on the Fort Hood Military Reservation using a Cormack-Jolly-Seber model structure within program MARK. We also estimated Φ process and sampling variances using a variance-components approach. Our results did not provide evidence of site-specific variation in adult Φ on the installation. Because of a lack of data, we could not assess whether juvenile Φ varied spatially. We did not detect a strong temporal association between GCWA abundance and Φ. Mean estimates of Φ for adult and juvenile male GCWAs for all years analyzed were 0.47 with a process variance of 0.0120 and a sampling variance of 0.0113 and 0.28 with a process variance of 0.0076 and a sampling variance of 0.0149, respectively. Although juvenile Φ did not differ greatly from previous estimates, our adult Φ estimate suggests previous GCWA population models were overly optimistic with respect to adult survival. These updated Φ probabilities and their associated variances will be incorporated into new population models to assist with GCWA conservation decision making.
Abstract:
The longwave radiative cooling of the clear-sky atmosphere (QLWc) is a crucial component of the global hydrological cycle and is composed of the clear-sky outgoing longwave radiation to space (OLRc) and the net downward minus upward clear-sky longwave radiation to the surface (SNLc). Estimates of QLWc from reanalyses and observations are presented for the period 1979-2004. Compared to other reanalysis data sets, the European Centre for Medium-Range Weather Forecasts 40-year reanalysis (ERA40) produces the largest QLWc over the tropical oceans (217 W m⁻²), explained by the least negative SNLc. On the basis of comparisons with data derived from satellite measurements, ERA40 provides the most realistic QLWc climatology over the tropical oceans but exhibits a spurious interannual variability for column integrated water vapor (CWV) and SNLc. Interannual monthly anomalies of QLWc are broadly consistent between data sets with large increases during the warm El Niño events. Since relative humidity (RH) errors applying throughout the troposphere result in compensating effects on the cooling to space and to the surface, they exert only a marginal effect on QLWc. An observed increase in CWV with surface temperature of 3 kg m⁻² K⁻¹ over the tropical oceans is important in explaining a positive relationship between QLWc and surface temperature, in particular over ascending regimes; over tropical ocean descending regions this relationship ranges from 3.6 to 4.6 ± 0.4 W m⁻² K⁻¹ for the data sets considered, consistent with idealized sensitivity tests in which tropospheric warming is applied and RH is held constant and implying an increase in precipitation with warming.
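The decomposition stated in the first sentence of this abstract can be written compactly. The sign convention below is inferred from the wording (SNLc is "downward minus upward" and is described as negative), and the flux symbols on the right are notation assumed here, not taken from the paper:

```latex
Q_{LWc} = \mathrm{OLR}_c + \mathrm{SNL}_c,
\qquad
\mathrm{SNL}_c = LW^{\downarrow}_{c,\mathrm{sfc}} - LW^{\uparrow}_{c,\mathrm{sfc}} < 0
```

Under this convention a less negative SNLc gives a larger QLWc, which matches the abstract's explanation of the ERA40 value.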
Abstract:
We compare European Centre for Medium-Range Weather Forecasts 15-year reanalysis (ERA-15) moisture over the tropical oceans with satellite observations and the U.S. National Centers for Environmental Prediction (NCEP)/National Center for Atmospheric Research (NCAR) 40-year reanalysis. When systematic differences in moisture between the observational and reanalysis data sets are removed, the NCEP data show excellent agreement with the observations while the ERA-15 variability exhibits remarkable differences. By forcing agreement between ERA-15 column water vapor and the observations, where available, by scaling the entire moisture column accordingly, the height-dependent moisture variability remains unchanged for all but the 550–850 hPa layer, where the moisture variability reduces significantly. Thus the excess variation of column moisture in ERA-15 appears to originate in this layer. The moisture variability provided by ERA-15 is not deemed of sufficient quality for use in the validation of climate models.
Abstract:
The distribution and variability of water vapor and its links with radiative cooling and latent heating via precipitation are crucial to understanding feedbacks and processes operating within the climate system. Column-integrated water vapor (CWV) and additional variables from the European Centre for Medium-Range Weather Forecasts (ECMWF) 40-year reanalysis (ERA40) are utilized to quantify the spatial and temporal variability in tropical water vapor over the period 1979–2001. The moisture variability is partitioned between dynamical and thermodynamic influences and compared with variations in precipitation provided by the Climate Prediction Center Merged Analysis of Precipitation (CMAP) and the Global Precipitation Climatology Project (GPCP). The spatial distribution of CWV is strongly determined by thermodynamic constraints. Spatial variability in CWV is dominated by changes in the large-scale dynamics, in particular associated with the El Niño–Southern Oscillation (ENSO). Trends in CWV are also dominated by dynamics rather than thermodynamics over the period considered. However, increases in CWV associated with changes in temperature are significant over the equatorial east Pacific when analyzing interannual variability and over the north and northwest Pacific when analyzing trends. Significant positive trends in CWV tend to predominate over the oceans while negative trends in CWV are found over equatorial Africa and Brazil. Links between changes in CWV and vertical motion fields are identified over these regions and also the equatorial Atlantic. However, trends in precipitation are generally incoherent and show little association with the CWV trends. This may in part reflect the inadequacies of the precipitation data sets and reanalysis products when analyzing decadal variability. 
Though the dynamic component of CWV is a major factor in determining precipitation variability in the tropics, in some regions/seasons the thermodynamic component cancels its effect on precipitation variability.
Abstract:
For many networks in nature, science and technology, it is possible to order the nodes so that most links are short-range, connecting near-neighbours, and relatively few long-range links, or shortcuts, are present. Given a network as a set of observed links (interactions), the task of finding an ordering of the nodes that reveals such a range-dependent structure is closely related to some sparse matrix reordering problems arising in scientific computation. The spectral, or Fiedler vector, approach for sparse matrix reordering has successfully been applied to biological data sets, revealing useful structures and subpatterns. In this work we argue that a periodic analogue of the standard reordering task is also highly relevant. Here, rather than encouraging nonzeros only to lie close to the diagonal of a suitably ordered adjacency matrix, we also allow them to inhabit the off-diagonal corners. Indeed, for the classic small-world model of Watts & Strogatz (1998, Collective dynamics of ‘small-world’ networks. Nature, 393, 440–442) this type of periodic structure is inherent. We therefore devise and test a new spectral algorithm for periodic reordering. By generalizing the range-dependent random graph class of Grindrod (2002, Range-dependent random graphs and their application to modeling large small-world proteome datasets. Phys. Rev. E, 66, 066702-1–066702-7) to the periodic case, we can also construct a computable likelihood ratio that suggests whether a given network is inherently linear or periodic. Tests on synthetic data show that the new algorithm can detect periodic structure, even in the presence of noise. Further experiments on real biological data sets then show that some networks are better regarded as periodic than linear. Hence, we find both qualitative (reordered networks plots) and quantitative (likelihood ratios) evidence of periodicity in biological networks.
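The standard (non-periodic) spectral reordering that this abstract builds on can be sketched as follows: form the graph Laplacian, take the eigenvector of its second-smallest eigenvalue (the Fiedler vector), and sort the nodes by its entries. This is a minimal illustration assuming a connected, undirected graph; it is not the new periodic algorithm proposed in the paper, and the function name is illustrative.

```python
import numpy as np

def fiedler_order(A):
    """Return a node ordering obtained by sorting the Fiedler vector.

    A is a symmetric adjacency matrix of a connected, undirected graph.
    """
    A = np.asarray(A, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 1 is the eigenvector of the second-smallest eigenvalue.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler_vector = eigenvectors[:, 1]
    return np.argsort(fiedler_vector)

# A path graph whose node labels have been scrambled (hypothetical example).
hidden_path = [3, 0, 5, 1, 4, 2]
A = np.zeros((6, 6))
for a, b in zip(hidden_path[:-1], hidden_path[1:]):
    A[a, b] = A[b, a] = 1.0

order = fiedler_order(A)
reordered = A[np.ix_(order, order)]   # nonzeros now lie next to the diagonal
```

Applying the returned permutation to both rows and columns pushes nonzeros toward the diagonal. The periodic variant described in the abstract additionally allows nonzeros in the off-diagonal corners.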