59 results for VLE data sets
Abstract:
We survey a number of papers that have focused on the construction of cross-country data sets on average years of schooling. We discuss the construction of the different series, compare their profiles and construct indicators of their information content. The discussion focuses on a sample of OECD countries but we also provide some results for a large non-OECD sample.
Abstract:
We examine the timing of firms' operations in a formal model of labor demand. Merging a variety of data sets from Portugal covering 1995-2004, we describe temporal patterns of firms' demand for labor and estimate production functions and relative labor-demand equations. The results demonstrate the existence of substitution of employment across times of the day/week and show that legislated penalties for work at irregular hours induce firms to alter their operating schedules. The results suggest a role for such penalties in an unregulated labor market, such as the United States, in which unusually large fractions of work are performed at night and on weekends.
Abstract:
This report summarizes a final-year project for the Enginyeria Superior d'Informàtica (Computer Engineering) degree. It explains the main reasons that motivated the project, together with examples that illustrate the resulting application. In this case, the software aims to address the current need for Ground Truth data for text segmentation algorithms applied to complex color images. All of the processes are explained in the different chapters, starting from the problem definition, planning, requirements and design, and ending with an illustration of the program's results and the resulting Ground Truth data.
Abstract:
When dealing with sustainability we are concerned with the biophysical as well as the monetary aspects of economic and ecological interactions. This multidimensional approach requires that special attention is given to dimensional issues in relation to curve fitting practice in economics. Unfortunately, many empirical and theoretical studies in economics, as well as in ecological economics, apply dimensional numbers in exponential or logarithmic functions. We show that it is an analytical error to put a dimensional quantity x into exponential functions (a^x) and logarithmic functions (log_a x). Secondly, we investigate the conditions on data sets under which a particular logarithmic specification is superior to the usual regression specification. This analysis shows that the superiority of the logarithmic specification in terms of the least-squares norm depends heavily on the available data set. The last section deals with economists' "curve fitting fetishism". We propose that a distinction be made between curve fitting over past observations and the development of a theoretical or empirical law capable of maintaining its fitting power for any future observations. Finally we conclude this paper with several epistemological issues in relation to dimensions and curve fitting practice in economics.
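A short worked example (not part of the abstract) makes the dimensional objection concrete: both the exponential and the logarithm are only defined for pure numbers, as the series expansion and the value/unit decomposition show.

```latex
% Sketch of the dimensional argument (illustrative, not from the paper).
% The exponential is defined by its power series,
\[
  a^{x} = e^{x \ln a} = \sum_{k=0}^{\infty} \frac{(x \ln a)^{k}}{k!}
        = 1 + x \ln a + \tfrac{1}{2}(x \ln a)^{2} + \dots
\]
% so if x carried a unit u, the terms would have dimensions 1, u, u^2, ...
% and could not be summed. Likewise, writing a measurement as the product
% of a numerical value and a unit, x = \{x\}\,[u], gives
\[
  \log_{a} x = \log_{a} \{x\} + \log_{a} [u],
\]
% where \log_{a}[u] is undefined; only the ratio against a reference value
% x_{0} carrying the same unit, \log_{a}(x / x_{0}), is well defined.
```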
Abstract:
In this paper we describe an open learning object repository on Statistics based on DSpace which contains true learning objects, that is, exercises, equations, data sets, etc. This repository is part of a larger project intended to promote the use of learning object repositories as part of the learning process in virtual learning environments. This involves the creation of a new user interface that provides users with additional services such as resource rating, commenting and so on. Both aspects make traditional metadata schemes such as Dublin Core inadequate, as there are resources with no title or author, for instance, since those fields are not used by learners to browse and search for learning resources in the repository. Therefore, exporting OAI-PMH compliant records using OAI-DC is not possible, thus limiting the visibility of the learning objects in the repository outside the institution. We propose an architecture based on ontologies and the use of extended metadata records for both storing and refactoring such descriptions.
Abstract:
The availability of rich firm-level data sets has recently led researchers to uncover new evidence on the effects of trade liberalization. First, trade openness forces the least productive firms to exit the market. Secondly, it induces surviving firms to increase their innovation efforts and, thirdly, it increases the degree of product market competition. In this paper we propose a model aimed at providing a coherent interpretation of these findings. We introduce firm heterogeneity into an innovation-driven growth model, where incumbent firms operating in oligopolistic industries perform cost-reducing innovations. In this framework, trade liberalization leads to higher product market competition, lower markups and higher quantity produced. These changes in markups and quantities, in turn, promote innovation and productivity growth through a direct competition effect, based on the increase in the size of the market, and a selection effect, produced by the reallocation of resources towards more productive firms. Calibrated to match US aggregate and firm-level statistics, the model predicts that a 10 percent reduction in variable trade costs reduces markups by 1.15 percent and firm survival probabilities by 1 percent, and induces an increase in productivity growth of about 13 percent. More than 90 percent of the trade-induced growth increase can be attributed to the selection effect.
Abstract:
The biplot has proved to be a powerful descriptive and analytical tool in many areas of applications of statistics. For compositional data the necessary theoretical adaptation has been provided, with illustrative applications, by Aitchison (1990) and Aitchison and Greenacre (2002). These papers were restricted to the interpretation of simple compositional data sets. In many situations the problem has to be described in some form of conditional modelling. For example, in a clinical trial where interest is in how patients' steroid metabolite compositions may change as a result of different treatment regimes, interest is in relating the compositions after treatment to the compositions before treatment and the nature of the treatments applied. To study this through a biplot technique requires the development of some form of conditional compositional biplot. This is the purpose of this paper. We choose as a motivating application an analysis of the 1992 US Presidential Election, where interest may be in how the three-part composition, the percentage division among the three candidates - Bush, Clinton and Perot - of the presidential vote in each state, depends on the ethnic composition and on the urban-rural composition of the state. The methodology of conditional compositional biplots is first developed and a detailed interpretation of the 1992 US Presidential Election provided. We use a second application involving the conditional variability of tektite mineral compositions with respect to major oxide compositions to demonstrate some hazards of simplistic interpretation of biplots. Finally we conjecture on further possible applications of conditional compositional biplots.
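As background for readers unfamiliar with the ordinary (unconditional) compositional biplot that this paper extends, here is a minimal sketch under stated assumptions: synthetic three-part compositions stand in for the state-level vote shares, and the biplot is built from a centred log-ratio (clr) transform followed by a singular value decomposition. It does not implement the paper's conditional biplot.

```python
# Minimal sketch of an ordinary (unconditional) compositional biplot:
# clr-transform the compositions, then plot the first two principal axes.
# Synthetic Dirichlet data stands in for the three-part vote compositions.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.dirichlet(alpha=[8.0, 6.0, 2.0], size=50)   # 50 compositions, 3 parts
parts = ["Bush", "Clinton", "Perot"]

# Centred log-ratio transform: log(x) minus the row-wise mean of log(x).
L = np.log(X)
clr = L - L.mean(axis=1, keepdims=True)

# Column-centre and take the SVD; scores = observations, loadings = parts.
Z = clr - clr.mean(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U[:, :2] * s[:2]          # form biplot: observations carry the scale
loadings = Vt[:2, :].T             # one ray per compositional part

plt.scatter(scores[:, 0], scores[:, 1], s=12, color="grey")
for (dx, dy), name in zip(loadings, parts):
    plt.arrow(0, 0, dx, dy, head_width=0.02, color="black")
    plt.annotate(name, (dx, dy))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Sketch of a compositional (clr) biplot")
plt.show()
```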
Abstract:
We investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In detail, we are given a set of labeled images of scenes (for example, coast, forest, city, river, etc.), and our objective is to classify a new image into one of these categories. Our approach consists of first discovering latent "topics" using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature here applied to a bag of visual words representation for each image, and subsequently, training a multiway classifier on the topic distribution vector for each image. We compare this approach to that of representing each image by a bag of visual words vector directly and training a multiway classifier on these vectors. To this end, we introduce a novel vocabulary using dense color SIFT descriptors and then investigate the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM). We achieve superior classification performance to recent publications that have used a bag of visual words representation, in all cases, using the authors' own data sets and testing protocols. We also investigate the gain in adding spatial information. We show applications to image retrieval with relevance feedback and to scene classification in videos.
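The pipeline above can be sketched in a few lines of Python, with some loudly flagged assumptions: random vectors stand in for dense color SIFT descriptors, and scikit-learn's NMF is used as a stand-in for pLSA (the two are closely related but not identical); a linear SVM then classifies the per-image topic vectors.

```python
# Sketch of the bag-of-visual-words -> topics -> classifier pipeline.
# Random vectors stand in for dense color SIFT descriptors; NMF stands in
# for pLSA; real images and descriptors would be needed in practice.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_images, descr_per_image, descr_dim = 60, 200, 128
vocab_size, n_topics = 50, 10
labels = rng.integers(0, 4, size=n_images)            # 4 scene classes (dummy)

# 1) Visual vocabulary: cluster all descriptors with k-means.
descriptors = rng.normal(size=(n_images, descr_per_image, descr_dim))
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(descriptors.reshape(-1, descr_dim))

# 2) Bag-of-visual-words histogram per image.
bovw = np.zeros((n_images, vocab_size))
for i in range(n_images):
    words = kmeans.predict(descriptors[i])
    bovw[i] = np.bincount(words, minlength=vocab_size)

# 3) Topic vectors (NMF as a pLSA stand-in), then a multiway classifier.
topics = NMF(n_components=n_topics, max_iter=500, random_state=0).fit_transform(bovw)
clf = SVC(kernel="linear").fit(topics[:40], labels[:40])
print("held-out accuracy:", clf.score(topics[40:], labels[40:]))
```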
Abstract:
The R package "compositions" is a tool for advanced compositional analysis. Its basic functionality has seen some conceptual improvement, containing now some facilities to work with and represent ilr bases built from balances, and an elaborated subsystem for dealing with several kinds of irregular data: (rounded or structural) zeroes, incomplete observations and outliers. The general approach to these irregularities is based on subcompositions: for an irregular datum, one can distinguish a "regular" subcomposition (where all parts are actually observed and the datum behaves typically) and a "problematic" subcomposition (with those unobserved, zero or rounded parts, or else where the datum shows an erratic or atypical behaviour). Systematic classification schemes are proposed for both outliers and missing values (including zeros), focusing on the nature of irregularities in the datum subcomposition(s). To compute statistics with values missing at random and structural zeros, a projection approach is implemented: a given datum contributes to the estimation of the desired parameters only on the subcomposition where it was observed. For data sets with values below the detection limit, two different approaches are provided: the well-known imputation technique, and also the projection approach. To compute statistics in the presence of outliers, robust statistics are adapted to the characteristics of compositional data, based on the minimum covariance determinant approach. The outlier classification is based on four different models of outlier occurrence and Monte-Carlo-based tests for their characterization. Furthermore the package provides special plots helping to understand the nature of outliers in the dataset.
Keywords: coda-dendrogram, lost values, MAR, missing data, MCD estimator, robustness, rounded zeros
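The package itself is written in R; purely to illustrate the MCD-based robust outlier flagging it describes, the following Python sketch maps compositions to ilr coordinates via a Helmert basis and applies scikit-learn's MinCovDet. The synthetic data and the 97.5% chi-square cut-off are assumptions, not the package's defaults.

```python
# Sketch: robust (MCD-based) outlier detection for compositional data.
# Compositions are mapped to ilr coordinates via a Helmert basis, then
# robust Mahalanobis distances are compared with a chi-square cut-off.
import numpy as np
from scipy.linalg import helmert
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.dirichlet(alpha=[10.0, 5.0, 3.0, 2.0], size=100)   # 100 compositions, 4 parts
D = X.shape[1]

# clr transform, then project onto an orthonormal (Helmert) basis -> ilr.
clr = np.log(X) - np.log(X).mean(axis=1, keepdims=True)
ilr = clr @ helmert(D).T                                   # shape (100, D-1)

# Robust location/scatter by minimum covariance determinant.
mcd = MinCovDet(random_state=0).fit(ilr)
d2 = mcd.mahalanobis(ilr)                                  # squared robust distances
outliers = d2 > chi2.ppf(0.975, df=D - 1)                  # assumed 97.5% cut-off
print("flagged as outlying:", np.where(outliers)[0])
```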
Abstract:
A method to estimate an extreme quantile that requires no distributional assumptions is presented. The approach is based on transformed kernel estimation of the cumulative distribution function (cdf). The proposed method consists of a double transformation kernel estimation. We derive optimal bandwidth selection methods that have a direct expression for the smoothing parameter. The bandwidth can adapt to the given quantile level. The procedure is useful for large data sets and improves quantile estimation for heavy-tailed distributions compared to other methods. Implementation is straightforward and R programs are available.
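The paper's double-transformation estimator and its optimal bandwidth expressions are not reproduced here; the sketch below only illustrates the underlying idea of transformed kernel estimation of the cdf, using a single log transform, a rule-of-thumb bandwidth, and numerical inversion for a high quantile.

```python
# Sketch: transformed kernel estimation of an extreme quantile.
# A single log transform and a rule-of-thumb bandwidth are used here;
# the paper proposes a double transformation with optimal bandwidths.
import numpy as np
from scipy.stats import norm, lomax
from scipy.optimize import brentq

rng = np.random.default_rng(2)
y = lomax.rvs(c=2.5, size=500, random_state=rng)   # heavy-tailed sample
z = np.log(y)                                      # transform to lighter tails

# Kernel estimate of the cdf of the transformed data (Gaussian kernel).
h = 1.06 * z.std() * len(z) ** (-1 / 5)            # rule-of-thumb bandwidth

def F(t):
    return norm.cdf((t - z) / h).mean()

# Invert F(t) = p by root finding, then map back to the original scale.
p = 0.995
t_p = brentq(lambda t: F(t) - p, z.min() - 10, z.max() + 10)
print("estimated 99.5% quantile:", np.exp(t_p))
print("true 99.5% quantile     :", lomax.ppf(p, c=2.5))
```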
Abstract:
All of the imputation techniques usually applied for replacing values below the detection limit in compositional data sets have adverse effects on the variability. In this work we propose a modification of the EM algorithm that is applied using the additive log-ratio transformation. This new strategy is applied to a compositional data set and the results are compared with the usual imputation techniques.
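As a quick numerical illustration of the opening claim (this is not the proposed EM/additive-log-ratio procedure), the snippet below censors a lognormal sample at a detection limit, substitutes half the detection limit for the censored values, and compares the variability before and after.

```python
# Illustration only: substituting DL/2 for values below the detection
# limit (a common imputation) distorts the variability of the data.
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)   # "true" values
dl = np.quantile(x, 0.25)                           # detection limit: 25% censored

imputed = np.where(x < dl, dl / 2, x)               # usual substitution
print("sd of true values (log scale)   :", np.log(x).std())
print("sd after DL/2 subst. (log scale):", np.log(imputed).std())
```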
Abstract:
Many archaeobotanical investigations, taking a qualitative-descriptive approach, limit their field of study to analyses of the presence/absence and/or frequency of taxa based on counts within the plant assemblages. The data thus obtained turn out to be inconclusive and unreliable for reconstructing the palaeoenvironment, for determining diet and the economic practice carried out (gathering vs. agriculture), and wholly insufficient for identifying the historical changes that occurred in the productive processes. As far as Peru is concerned, the first studies of plant remains recovered from archaeological sites, mainly on the coast, already document the important role that plant species played in the life of pre-Hispanic communities. Despite the exceptional abundance and excellent preservation of this kind of (botanical) material at many archaeological sites in this region, thanks to the extreme climatic and environmental conditions above all of its arid coastal areas, the archaeobotanical studies carried out so far are very scarce, and the analytical limitations that most of them present reflect the little importance given to archaeobotanical research. In the present work we develop and apply a quantitative analytical methodology for the study of plant macroremains from a site on the southern coast of Peru. In doing so we aim to obtain representative and objective data on the assemblages analysed, whose processing leads to an exhaustive and correct interpretation of the information.
Abstract:
It has been shown that the accuracy of mammographic abnormality detection methods is strongly dependent on the breast tissue characteristics, where a dense breast drastically reduces detection sensitivity. In addition, breast tissue density is widely accepted to be an important risk indicator for the development of breast cancer. Here, we describe the development of an automatic breast tissue classification methodology, which can be summarized in a number of distinct steps: 1) the segmentation of the breast area into fatty versus dense mammographic tissue; 2) the extraction of morphological and texture features from the segmented breast areas; and 3) the use of a Bayesian combination of a number of classifiers. The evaluation, based on a large number of cases from two different mammographic data sets, shows a strong correlation ( and 0.67 for the two data sets) between automatic and expert-based Breast Imaging Reporting and Data System mammographic density assessment.
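Step 3), the combination of classifiers, can be approximated for illustration by averaging the classifiers' posterior class probabilities; the sketch below uses scikit-learn's soft-voting ensemble, with random feature vectors standing in for the morphological and texture features of step 2). This is a simple stand-in, not the paper's Bayesian combination.

```python
# Sketch of step 3: combine several classifiers by averaging their class
# posteriors (soft voting), a stand-in for the paper's Bayesian combination.
# Random features stand in for morphology/texture features.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
features = rng.normal(size=(200, 20))        # placeholder feature vectors
density = rng.integers(0, 4, size=200)       # 4 BIRADS density classes (dummy)

ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=7)),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",                           # average predicted probabilities
)
print("CV accuracy:", cross_val_score(ensemble, features, density, cv=5).mean())
```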
Abstract:
One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
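The construction of such a semiartificial test set can be sketched in a few lines of Python; the gene sequences, spacer lengths and GC content below are placeholders, not the values used in the paper.

```python
# Sketch: build a semiartificial test sequence by joining single-gene
# genomic sequences with randomly generated intergenic spacers.
# Gene sequences, spacer lengths and GC content are placeholders.
import random

random.seed(0)

def random_intergenic(length, gc=0.41):
    """Random intergenic DNA with a given GC content (human-like ~41%)."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(random.choices("ACGT", weights=weights, k=length))

def build_test_sequence(gene_seqs, spacer_range=(5_000, 20_000)):
    """Concatenate single-gene sequences separated by random spacers."""
    pieces, coords, pos = [], [], 0
    for seq in gene_seqs:
        spacer = random_intergenic(random.randint(*spacer_range))
        pieces += [spacer, seq]
        pos += len(spacer)
        coords.append((pos, pos + len(seq)))     # true gene coordinates
        pos += len(seq)
    return "".join(pieces), coords

genes = [random_intergenic(3_000) for _ in range(5)]   # stand-ins for real genes
bac_like, gene_coords = build_test_sequence(genes)
print(len(bac_like), gene_coords)
```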
Abstract:
The completion of the sequencing of the mouse genome promises to help predict human genes with greater accuracy. While current ab initio gene prediction programs are remarkably sensitive (i.e., they predict at least a fragment of most genes), their specificity is often low, predicting a large number of false-positive genes in the human genome. Sequence conservation at the protein level with the mouse genome can help eliminate some of those false positives. Here we describe SGP2, a gene prediction program that combines ab initio gene prediction with TBLASTX searches between two genome sequences to provide both sensitive and specific gene predictions. The accuracy of SGP2 when used to predict genes by comparing the human and mouse genomes is assessed on a number of data sets, including single-gene data sets, the highly curated human chromosome 22 predictions, and entire genome predictions from ENSEMBL. Results indicate that SGP2 outperforms purely ab initio gene prediction methods. Results also indicate that SGP2 works about as well with 3x shotgun data as it does with fully assembled genomes. SGP2 provides a high enough specificity that its predictions can be experimentally verified at a reasonable cost. SGP2 was used to generate a complete set of gene predictions for both human and mouse by comparing the genomes of the two species. Our results suggest that another few thousand human and mouse genes currently not in ENSEMBL are worth verifying experimentally.