Abstract:
Retrospective clinical datasets are often characterized by a relatively small sample size and a large amount of missing data. In this setting, a common way of handling the missingness is to discard from the analysis the patients with missing covariates, further reducing the sample size. Alternatively, if the mechanism that generated the missingness allows it, incomplete data can be imputed on the basis of the observed data, avoiding the reduction of the sample size and allowing complete-data methods to be applied afterwards. Moreover, methodologies for data imputation may depend on the particular purpose and may achieve better results by considering specific characteristics of the domain. We study the treatment of missing data in the context of survival tree analysis for the estimation of a prognostic patient stratification. Survival tree methods usually address this problem through surrogate splits, that is, splitting rules that use other variables to yield results similar to those of the original ones. Instead, our methodology models the dependencies among the clinical variables with a Bayesian network, which is then used to perform data imputation, thus allowing the survival tree to be applied to the completed dataset. The Bayesian network is learned directly from the incomplete data using a structural expectation–maximization (EM) procedure in which the maximization step is performed with an exact anytime method, so that the only source of approximation is the EM formulation itself. On both simulated and real data, our methodology usually outperformed several existing methods for data imputation, and the imputation so obtained improved the stratification estimated by the survival tree (especially compared with using surrogate splits).
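A minimal sketch of the imputation idea, assuming a toy two-node network X → Y with categorical variables and values of Y missing at random; the plain EM loop below is a simplification of the structural EM procedure described in the abstract (not the authors' implementation), and all data are synthetic:

```python
# Simplified EM-based parameter learning and imputation for a tiny
# Bayesian network X -> Y with some Y values missing (NaN).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: X fully observed, Y partially missing.
X = rng.integers(0, 2, size=200)
Y = (X ^ (rng.random(200) < 0.2)).astype(float)
Y[rng.random(200) < 0.3] = np.nan            # introduce missingness

p_y_given_x = np.full((2, 2), 0.5)           # initial CPT P(Y | X)

for _ in range(50):                          # EM iterations
    # E-step: expected counts, using observed Y where available
    counts = np.zeros((2, 2))
    for x, y in zip(X, Y):
        if np.isnan(y):
            counts[x] += p_y_given_x[x]      # soft (expected) count
        else:
            counts[x, int(y)] += 1           # hard count
    # M-step: re-estimate the CPT from the expected counts
    p_y_given_x = counts / counts.sum(axis=1, keepdims=True)

# Impute missing Y values by the most probable state given X
Y_imputed = np.where(np.isnan(Y), p_y_given_x[X].argmax(axis=1), Y)
print(p_y_given_x)
```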
Abstract:
This paper addresses the estimation of parameters of a Bayesian network from incomplete data. The task is usually tackled by running the Expectation-Maximization (EM) algorithm several times in order to obtain a high log-likelihood estimate. We argue that choosing the maximum log-likelihood estimate (as well as the maximum penalized log-likelihood and the maximum a posteriori estimate) has severe drawbacks, being affected by both overfitting and model uncertainty. Two ideas are discussed to overcome these issues: a maximum entropy approach and a Bayesian model averaging approach. Both ideas can easily be applied on top of EM, while the entropy idea can also be implemented in a more sophisticated way, through a dedicated non-linear solver. A vast set of experiments shows that these ideas produce significantly better estimates and inferences than the traditional and widely used maximum (penalized) log-likelihood and maximum a posteriori estimates. In particular, if EM is adopted as the optimization engine, the model averaging approach is the best performing one; its performance is matched by the entropy approach when implemented using the non-linear solver. The results suggest that the applicability of these ideas is immediate (they are easy to implement and to integrate in currently available inference engines) and that they constitute a better way to learn Bayesian network parameters.
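A hedged sketch of one simple way to realise the model-averaging idea on top of EM: run several random restarts and average the parameter estimates, weighting each run by its normalised likelihood (the paper's exact weighting scheme may differ). The em_fit helper is a hypothetical stand-in for any EM routine returning parameters and a log-likelihood:

```python
import numpy as np

def em_fit(data, seed):
    """Hypothetical EM run; replace with a real BN parameter learner."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.ones(3))        # toy categorical parameters
    counts = np.bincount(data, minlength=3)
    log_lik = float(counts @ np.log(theta))
    return theta, log_lik

data = np.random.default_rng(1).integers(0, 3, size=100)
runs = [em_fit(data, seed) for seed in range(10)]

log_liks = np.array([ll for _, ll in runs])
weights = np.exp(log_liks - log_liks.max())  # avoid underflow
weights /= weights.sum()

# Model-averaged estimate instead of keeping only the single best run
theta_avg = sum(w * theta for (theta, _), w in zip(runs, weights))
print(theta_avg)
```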
Abstract:
This paper considers inference from multinomial data and addresses the problem of choosing the strength of the Dirichlet prior under a mean-squared error criterion. We compare the Maximum Likelihood Estimator (MLE) and the most commonly used Bayesian estimators obtained by assuming a prior Dirichlet distribution with non-informative prior parameters, that is, with the parameters of the Dirichlet equal to each other and altogether summing to the so-called strength of the prior. Under this criterion, the MLE becomes preferable to the Bayesian estimators as the number of categories k of the multinomial increases, because the region in which the non-informative Bayesian estimators are dominant quickly shrinks as k grows. This can be avoided if the strength of the prior is not kept constant but is decreased with the number of categories. We argue that the strength should decrease at least k times faster than the usual estimators do.
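For concreteness, a small illustration of the two estimators being compared: the MLE and the Dirichlet-smoothed estimator with a symmetric non-informative prior of total strength s spread over k categories, i.e. theta_i = (n_i + s/k) / (N + s). The counts and the choice of s below are made up:

```python
import numpy as np

counts = np.array([7, 2, 1, 0, 0])           # hypothetical multinomial counts
N, k = counts.sum(), len(counts)
s = 1.0                                       # prior strength (illustrative)

theta_mle = counts / N                        # maximum likelihood estimate
theta_bayes = (counts + s / k) / (N + s)      # symmetric Dirichlet estimate

print(theta_mle)
print(theta_bayes)
```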
Abstract:
Automatically determining and assigning shared and meaningful text labels to data extracted from an e-Commerce web page is a challenging problem. An e-Commerce web page can display a list of data records, each of which can contain a combination of data items (e.g. product name and price) and explicit labels, which describe some of these data items. Recent advances in extraction techniques have made it much easier to precisely extract individual data items and labels from a web page; however, two problems remain open: 1. assigning an explicit label to a data item, and 2. determining labels for the remaining data items. Furthermore, improvements in the availability and coverage of vocabularies, especially in the context of e-Commerce web sites, mean that we now have access to a bank of relevant, meaningful and shared labels which can be assigned to extracted data items. However, there is a need for a technique which takes as input a set of extracted data items and automatically assigns to them the most relevant and meaningful labels from a shared vocabulary. We observe that the Information Extraction (IE) community has developed a great number of techniques which solve problems similar to our own. In this work-in-progress paper we propose to evaluate, both theoretically and experimentally, different IE techniques to ascertain which is most suitable for solving this problem.
Abstract:
In this paper, we propose a new learning approach to Web data annotation, in which a support vector machine-based multiclass classifier is trained to assign labels to data items. For data record extraction, a data section re-segmentation algorithm based on visual and content features is introduced to improve extraction performance. We have implemented the proposed approach and tested it with a large set of Web query result pages in different domains. Our experimental results show that the proposed approach is highly effective and efficient.
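A hedged sketch of the labelling step described above: a multiclass SVM (here scikit-learn's LinearSVC on character n-gram features) trained to assign labels such as "name" or "price" to extracted data items. The features and training examples are toy placeholders, not the paper's actual feature set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical extracted data items and their labels
train_items = ["Acer Aspire 5", "$499.99", "In stock", "Dell XPS 13", "$1,299.00"]
train_labels = ["name", "price", "availability", "name", "price"]

# Character n-gram features feeding a linear multiclass SVM
model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
                      LinearSVC())
model.fit(train_items, train_labels)

print(model.predict(["$89.95", "Lenovo ThinkPad T14"]))
```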
Outperformance in exchange-traded fund pricing deviations: Generalized control of data snooping bias
Abstract:
An investigation into exchange-traded fund (ETF) outperformance during the period 2008–2012 is undertaken utilizing a data set of 288 U.S. traded securities. ETFs are tested for net asset value (NAV) premium, underlying index and market benchmark outperformance, with Sharpe, Treynor, and Sortino ratios employed as risk-adjusted performance measures. A key contribution is the application of an innovative generalized stepdown procedure in controlling for data snooping bias. We find that a large proportion of optimized replication and debt asset class ETFs display risk-adjusted premiums, with energy and precious metals focused funds outperforming the S&P 500 market benchmark.
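A brief sketch of the three risk-adjusted performance measures named above (Sharpe, Treynor and Sortino ratios), computed from a series of periodic returns. The return series, risk-free rate and beta are illustrative values, not data from the study:

```python
import numpy as np

returns = np.array([0.012, -0.004, 0.021, 0.007, -0.015, 0.018])  # periodic ETF returns
rf = 0.001                                   # per-period risk-free rate (assumed)
beta = 1.1                                   # ETF beta vs. the market benchmark (assumed)

excess = returns - rf
downside = np.minimum(excess, 0.0)           # only negative excess returns

sharpe = excess.mean() / returns.std(ddof=1)             # total-risk adjusted
treynor = excess.mean() / beta                           # systematic-risk adjusted
sortino = excess.mean() / np.sqrt((downside ** 2).mean())  # downside-risk adjusted

print(sharpe, treynor, sortino)
```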
Abstract:
The predominant fear in capital markets is that of a price spike. Commodity markets differ in that there is a fear of both upward and downward jumps; this results in implied volatility curves displaying distinct shapes compared to those of equity markets. A novel functional data analysis (FDA) approach provides a framework to produce and interpret functional objects that characterise the underlying dynamics of oil future options. We use the FDA framework to examine implied volatility, jump risk, and pricing dynamics within crude oil markets. Examining a WTI crude oil sample for the 2007–2013 period, which includes the global financial crisis and the Arab Spring, we find strong evidence of converse jump dynamics during periods of demand- and supply-side weakness. This is used as a basis for an FDA-derived Merton (1976) jump-diffusion optimised delta-hedging strategy, which exhibits superior portfolio management results over traditional methods.
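A hedged sketch of a standard Merton (1976) jump-diffusion call price and delta of the kind that could drive such a delta-hedging strategy; all parameter values are illustrative, and the paper's FDA-derived inputs are not reproduced here:

```python
import math
from scipy.stats import norm

def bs_call(S, K, T, r, sigma):
    """Black-Scholes call price and delta."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm.cdf(d1) - K * math.exp(-r * T) * norm.cdf(d2), norm.cdf(d1)

def merton_call(S, K, T, r, sigma, lam, mu_j, delta_j, n_terms=50):
    """Merton jump-diffusion call price and delta via the series expansion."""
    kappa = math.exp(mu_j + 0.5 * delta_j**2) - 1.0   # mean relative jump size
    lam_p = lam * (1.0 + kappa)
    price, hedge = 0.0, 0.0
    for n in range(n_terms):
        weight = math.exp(-lam_p * T) * (lam_p * T) ** n / math.factorial(n)
        sigma_n = math.sqrt(sigma**2 + n * delta_j**2 / T)
        r_n = r - lam * kappa + n * math.log(1.0 + kappa) / T
        c, d = bs_call(S, K, T, r_n, sigma_n)
        price += weight * c
        hedge += weight * d                            # hedge ratio (delta)
    return price, hedge

# Example: price and hedge ratio for a WTI-style call under assumed jump parameters
print(merton_call(S=90.0, K=95.0, T=0.5, r=0.02,
                  sigma=0.30, lam=0.8, mu_j=-0.05, delta_j=0.15))
```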
Abstract:
1. Little consensus has been reached as to general features of spatial variation in beta diversity, a fundamental component of species diversity. This could reflect a genuine lack of simple gradients in beta diversity, or a lack of agreement as to just what constitutes beta diversity. Unfortunately, a large number of approaches have been applied to the investigation of variation in beta diversity, which potentially makes comparisons of the findings difficult.
2. We review 24 measures of beta diversity for presence/absence data (the most frequent form of data to which such measures are applied) that have been employed in the literature, express many of them for the first time in common terms, and compare some of their basic properties.
3. Four groups of measures are distinguished, with a fundamental distinction arising between 'broad sense' measures incorporating differences in composition attributable to species richness gradients, and 'narrow sense' measures that focus on compositional differences independent of such gradients. On a number of occasions on which the former have been employed in the literature, the latter may have been more appropriate, and there are many situations in which consideration of both kinds of measures would be valuable.
4. We particularly recommend (i) considering beta diversity measures in terms of matching/mismatching components (usually denoted a, b and c) and thereby identifying the contribution of different sources of variation in species composition, and (ii) the use of ternary plots to express the relationship between the values of these measures and of the components, and as a way of understanding patterns in beta diversity.
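A minimal computational sketch of the matching/mismatching components a, b and c for presence/absence data, together with two widely used measures built from them (Sørensen dissimilarity and Simpson's beta, following the usual literature conventions); the species lists are toy examples:

```python
def beta_components(site1, site2):
    """a = shared species, b = unique to site1, c = unique to site2."""
    s1, s2 = set(site1), set(site2)
    return len(s1 & s2), len(s1 - s2), len(s2 - s1)

def beta_sorensen(a, b, c):
    """'Broad sense' dissimilarity, sensitive to richness gradients."""
    return (b + c) / (2 * a + b + c)

def beta_simpson(a, b, c):
    """'Narrow sense' turnover, independent of richness gradients."""
    return min(b, c) / (a + min(b, c))

a, b, c = beta_components({"oak", "ash", "birch"}, {"oak", "pine"})
print(a, b, c, beta_sorensen(a, b, c), beta_simpson(a, b, c))
```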
Abstract:
In the coming decade installed offshore wind capacity is expected to expand rapidly. This will be both technically and economically challenging. Precise wind resource assessment is one of the more imminent challenges. It is more difficult to assess wind power offshore than onshore due to the paucity of representative wind speed data. Offshore site-specific data is less accessible and is far more costly to collect. However, offshore wind speed data collected from sources such as wave buoys, remote sensing from satellites, national weather ships, and coastal meteorological stations and met masts on barges and platforms may be extrapolated to assess offshore wind power. This study attempts to determine the usefulness of pre-existing offshore wind speed measurements in resource assessment, and presents the results of wind resource estimation in the Atlantic Ocean and in the Irish Sea using data from two offshore meteorological buoys.
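A hedged sketch of a basic wind-resource calculation of the kind used in such assessments: extrapolate buoy wind speeds to hub height with a power law, fit a Weibull distribution, and estimate mean wind power density. The measurement height, shear exponent and synthetic speeds below are all assumptions, not values from the study:

```python
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(0)
v_buoy = weibull_min.rvs(2.0, scale=9.0, size=5000, random_state=rng)  # toy buoy speeds (m/s)

# Extrapolate from an assumed measurement height (4 m) to hub height (100 m)
alpha = 0.11                                  # assumed offshore shear exponent
v_hub = v_buoy * (100.0 / 4.0) ** alpha

# Fit a Weibull distribution (location fixed at 0) to the hub-height speeds
k, _, c = weibull_min.fit(v_hub, floc=0.0)

# Mean wind power density in W/m^2: 0.5 * rho * E[v^3]
rho = 1.225                                   # air density (kg/m^3)
power_density = 0.5 * rho * np.mean(v_hub ** 3)
print(f"Weibull k={k:.2f}, c={c:.2f} m/s, power density={power_density:.0f} W/m^2")
```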
Abstract:
The efficiency of generation plants is an important measure for evaluating their operating performance. The objective of this paper is to evaluate electricity power generation by conducting an All-Island-Generator-Efficiency-Study (AIGES) for the Republic of Ireland and Northern Ireland utilising a Data Envelopment Analysis (DEA) approach. An operational performance efficiency index is defined and pursued for the year 2008. The economic activities of the electricity generation units/plants examined in this paper are characterized by numerous input and output indicators. Constant returns to scale (CRS) and variable returns to scale (VRS) DEA models are employed in the analysis. A slacks-based analysis also indicates the level of inefficiency for each variable examined. The findings from this study provide a general ranking and evaluation, but also facilitate various interesting efficiency comparisons between generators by fuel type.
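A hedged sketch of an input-oriented, constant-returns-to-scale (CCR) DEA envelopment model solved as a linear program, in the spirit of the CRS models mentioned above; the generator input/output data are invented for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# rows = DMUs (generators); toy inputs (fuel, capacity) and one output (MWh)
X = np.array([[120.0, 50.0], [200.0, 80.0], [150.0, 60.0], [300.0, 90.0]])
Y = np.array([[400.0], [500.0], [450.0], [600.0]])
n, m, s = X.shape[0], X.shape[1], Y.shape[1]

def ccr_efficiency(o):
    """Efficiency score theta for DMU o (decision variables: theta, lambda_1..n)."""
    c = np.r_[1.0, np.zeros(n)]                       # minimise theta
    # input constraints:  sum_j lambda_j * x_ij - theta * x_io <= 0
    A_in = np.hstack([-X[o].reshape(m, 1), X.T])
    # output constraints: -sum_j lambda_j * y_rj <= -y_ro
    A_out = np.hstack([np.zeros((s, 1)), -Y.T])
    res = linprog(c,
                  A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.r_[np.zeros(m), -Y[o]],
                  bounds=[(None, None)] + [(0, None)] * n)
    return res.fun

for o in range(n):
    print(f"DMU {o}: efficiency = {ccr_efficiency(o):.3f}")
```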