903 resultados para Data selection


Relevância:

60.00% 60.00%

Publicador:

Resumo:

The objective of this study was to estimate (co)variance components using random regression on B-spline functions to weight records obtained from birth to adulthood. A total of 82 064 weight records of 8145 females obtained from the data bank of the Nellore Breeding Program (PMGRN/Nellore Brazil) which started in 1987, were used. The models included direct additive and maternal genetic effects and animal and maternal permanent environmental effects as random. Contemporary group and dam age at calving (linear and quadratic effect) were included as fixed effects, and orthogonal Legendre polynomials of age (cubic regression) were considered as random covariate. The random effects were modeled using B-spline functions considering linear, quadratic and cubic polynomials for each individual segment. Residual variances were grouped in five age classes. Direct additive genetic and animal permanent environmental effects were modeled using up to seven knots (six segments). A single segment with two knots at the end points of the curve was used for the estimation of maternal genetic and maternal permanent environmental effects. A total of 15 models were studied, with the number of parameters ranging from 17 to 81. The models that used B-splines were compared with multi-trait analyses with nine weight traits and to a random regression model that used orthogonal Legendre polynomials. A model fitting quadratic B-splines, with four knots or three segments for direct additive genetic effect and animal permanent environmental effect and two knots for maternal additive genetic effect and maternal permanent environmental effect, was the most appropriate and parsimonious model to describe the covariance structure of the data. Selection for higher weight, such as at young ages, should be performed taking into account an increase in mature cow weight. Particularly, this is important in most of Nellore beef cattle production systems, where the cow herd is maintained on range conditions. There is limited modification of the growth curve of Nellore cattle with respect to the aim of selecting them for rapid growth at young ages while maintaining constant adult weight.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This thesis describes the developments of new models and toolkits for the orbit determination codes to support and improve the precise radio tracking experiments of the Cassini-Huygens mission, an interplanetary mission to study the Saturn system. The core of the orbit determination process is the comparison between observed observables and computed observables. Disturbances in either the observed or computed observables degrades the orbit determination process. Chapter 2 describes a detailed study of the numerical errors in the Doppler observables computed by NASA's ODP and MONTE, and ESA's AMFIN. A mathematical model of the numerical noise was developed and successfully validated analyzing against the Doppler observables computed by the ODP and MONTE, with typical relative errors smaller than 10%. The numerical noise proved to be, in general, an important source of noise in the orbit determination process and, in some conditions, it may becomes the dominant noise source. Three different approaches to reduce the numerical noise were proposed. Chapter 3 describes the development of the multiarc library, which allows to perform a multi-arc orbit determination with MONTE. The library was developed during the analysis of the Cassini radio science gravity experiments of the Saturn's satellite Rhea. Chapter 4 presents the estimation of the Rhea's gravity field obtained from a joint multi-arc analysis of Cassini R1 and R4 fly-bys, describing in details the spacecraft dynamical model used, the data selection and calibration procedure, and the analysis method followed. In particular, the approach of estimating the full unconstrained quadrupole gravity field was followed, obtaining a solution statistically not compatible with the condition of hydrostatic equilibrium. The solution proved to be stable and reliable. The normalized moment of inertia is in the range 0.37-0.4 indicating that Rhea's may be almost homogeneous, or at least characterized by a small degree of differentiation.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

With the advent of cloud computing model, distributed caches have become the cornerstone for building scalable applications. Popular systems like Facebook [1] or Twitter use Memcached [5], a highly scalable distributed object cache, to speed up applications by avoiding database accesses. Distributed object caches assign objects to cache instances based on a hashing function, and objects are not moved from a cache instance to another unless more instances are added to the cache and objects are redistributed. This may lead to situations where some cache instances are overloaded when some of the objects they store are frequently accessed, while other cache instances are less frequently used. In this paper we propose a multi-resource load balancing algorithm for distributed cache systems. The algorithm aims at balancing both CPU and Memory resources among cache instances by redistributing stored data. Considering the possible conflict of balancing multiple resources at the same time, we give CPU and Memory resources weighted priorities based on the runtime load distributions. A scarcer resource is given a higher weight than a less scarce resource when load balancing. The system imbalance degree is evaluated based on monitoring information, and the utility load of a node, a unit for resource consumption. Besides, since continuous rebalance of the system may affect the QoS of applications utilizing the cache system, our data selection policy ensures that each data migration minimizes the system imbalance degree and hence, the total reconfiguration cost can be minimized. An extensive simulation is conducted to compare our policy with other policies. Our policy shows a significant improvement in time efficiency and decrease in reconfiguration cost.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

En la presente Tesis se ha llevado a cabo el contraste y desarrollo de metodologías que permitan mejorar el cálculo de las avenidas de proyecto y extrema empleadas en el cálculo de la seguridad hidrológica de las presas. En primer lugar se ha abordado el tema del cálculo de las leyes de frecuencia de caudales máximos y su extrapolación a altos periodos de retorno. Esta cuestión es de gran relevancia, ya que la adopción de estándares de seguridad hidrológica para las presas cada vez más exigentes, implica la utilización de periodos de retorno de diseño muy elevados cuya estimación conlleva una gran incertidumbre. Es importante, en consecuencia incorporar al cálculo de los caudales de diseño todas la técnicas disponibles para reducir dicha incertidumbre. Asimismo, es importante hacer una buena selección del modelo estadístico (función de distribución y procedimiento de ajuste) de tal forma que se garantice tanto su capacidad para describir el comportamiento de la muestra, como para predecir de manera robusta los cuantiles de alto periodo de retorno. De esta forma, se han realizado estudios a escala nacional con el objetivo de determinar el esquema de regionalización que ofrece mejores resultados para las características hidrológicas de las cuencas españolas, respecto a los caudales máximos anuales, teniendo en cuenta el numero de datos disponibles. La metodología utilizada parte de la identificación de regiones homogéneas, cuyos límites se han determinado teniendo en cuenta las características fisiográficas y climáticas de las cuencas, y la variabilidad de sus estadísticos, comprobando posteriormente su homogeneidad. A continuación, se ha seleccionado el modelo estadístico de caudales máximos anuales con un mejor comportamiento en las distintas zonas de la España peninsular, tanto para describir los datos de la muestra como para extrapolar a los periodos de retorno más altos. El proceso de selección se ha basado, entre otras cosas, en la generación sintética de series de datos mediante simulaciones de Monte Carlo, y el análisis estadístico del conjunto de resultados obtenido a partir del ajuste de funciones de distribución a estas series bajo distintas hipótesis. Posteriormente, se ha abordado el tema de la relación caudal-volumen y la definición de los hidrogramas de diseño en base a la misma, cuestión que puede ser de gran importancia en el caso de presas con grandes volúmenes de embalse. Sin embargo, los procedimientos de cálculo hidrológico aplicados habitualmente no tienen en cuenta la dependencia estadística entre ambas variables. En esta Tesis se ha desarrollado un procedimiento para caracterizar dicha dependencia estadística de una manera sencilla y robusta, representando la función de distribución conjunta del caudal punta y el volumen en base a la función de distribución marginal del caudal punta y la función de distribución condicionada del volumen respecto al caudal. Esta última se determina mediante una función de distribución log-normal, aplicando un procedimiento de ajuste regional. Se propone su aplicación práctica a través de un procedimiento de cálculo probabilístico basado en la generación estocástica de un número elevado de hidrogramas. La aplicación a la seguridad hidrológica de las presas de este procedimiento requiere interpretar correctamente el concepto de periodo de retorno aplicado a variables hidrológicas bivariadas. Para ello, se realiza una propuesta de interpretación de dicho concepto. El periodo de retorno se entiende como el inverso de la probabilidad de superar un determinado nivel de embalse. Al relacionar este periodo de retorno con las variables hidrológicas, el hidrograma de diseño de la presa deja de ser un único hidrograma para convertirse en una familia de hidrogramas que generan un mismo nivel máximo en el embalse, representados mediante una curva en el plano caudal volumen. Esta familia de hidrogramas de diseño depende de la propia presa a diseñar, variando las curvas caudal-volumen en función, por ejemplo, del volumen de embalse o la longitud del aliviadero. El procedimiento propuesto se ilustra mediante su aplicación a dos casos de estudio. Finalmente, se ha abordado el tema del cálculo de las avenidas estacionales, cuestión fundamental a la hora de establecer la explotación de la presa, y que puede serlo también para estudiar la seguridad hidrológica de presas existentes. Sin embargo, el cálculo de estas avenidas es complejo y no está del todo claro hoy en día, y los procedimientos de cálculo habitualmente utilizados pueden presentar ciertos problemas. El cálculo en base al método estadístico de series parciales, o de máximos sobre un umbral, puede ser una alternativa válida que permite resolver esos problemas en aquellos casos en que la generación de las avenidas en las distintas estaciones se deba a un mismo tipo de evento. Se ha realizado un estudio con objeto de verificar si es adecuada en España la hipótesis de homogeneidad estadística de los datos de caudal de avenida correspondientes a distintas estaciones del año. Asimismo, se han analizado los periodos estacionales para los que es más apropiado realizar el estudio, cuestión de gran relevancia para garantizar que los resultados sean correctos, y se ha desarrollado un procedimiento sencillo para determinar el umbral de selección de los datos de tal manera que se garantice su independencia, una de las principales dificultades en la aplicación práctica de la técnica de las series parciales. Por otra parte, la aplicación practica de las leyes de frecuencia estacionales requiere interpretar correctamente el concepto de periodo de retorno para el caso estacional. Se propone un criterio para determinar los periodos de retorno estacionales de forma coherente con el periodo de retorno anual y con una distribución adecuada de la probabilidad entre las distintas estaciones. Por último, se expone un procedimiento para el cálculo de los caudales estacionales, ilustrándolo mediante su aplicación a un caso de estudio. The compare and develop of a methodology in order to improve the extreme flow estimation for dam hydrologic security has been developed. First, the work has been focused on the adjustment of maximum peak flows distribution functions from which to extrapolate values for high return periods. This has become a major issue as the adoption of stricter standards on dam hydrologic security involves estimation of high design return periods which entails great uncertainty. Accordingly, it is important to incorporate all available techniques for the estimation of design peak flows in order to reduce this uncertainty. Selection of the statistical model (distribution function and adjustment method) is also important since its ability to describe the sample and to make solid predictions for high return periods quantiles must be guaranteed. In order to provide practical application of previous methodologies, studies have been developed on a national scale with the aim of determining a regionalization scheme which features best results in terms of annual maximum peak flows for hydrologic characteristics of Spanish basins taking into account the length of available data. Applied methodology starts with the delimitation of regions taking into account basin’s physiographic and climatic characteristics and the variability of their statistical properties, and continues with their homogeneity testing. Then, a statistical model for maximum annual peak flows is selected with the best behaviour for the different regions in peninsular Spain in terms of describing sample data and making solid predictions for high return periods. This selection has been based, among others, on synthetic data series generation using Monte Carlo simulations and statistical analysis of results from distribution functions adjustment following different hypothesis. Secondly, the work has been focused on the analysis of the relationship between peak flow and volume and how to define design flood hydrographs based on this relationship which can be highly important for large volume reservoirs. However, commonly used hydrologic procedures do not take statistical dependence between these variables into account. A simple and sound method for statistical dependence characterization has been developed by the representation of a joint distribution function of maximum peak flow and volume which is based on marginal distribution function of peak flow and conditional distribution function of volume for a given peak flow. The last one is determined by a regional adjustment procedure of a log-normal distribution function. Practical application is proposed by a probabilistic estimation procedure based on stochastic generation of a large number of hydrographs. The use of this procedure for dam hydrologic security requires a proper interpretation of the return period concept applied to bivariate hydrologic data. A standard is proposed in which it is understood as the inverse of the probability of exceeding a determined reservoir level. When relating return period and hydrological variables the only design flood hydrograph changes into a family of hydrographs which generate the same maximum reservoir level and that are represented by a curve in the peak flow-volume two-dimensional space. This family of design flood hydrographs depends on the dam characteristics as for example reservoir volume or spillway length. Two study cases illustrate the application of the developed methodology. Finally, the work has been focused on the calculation of seasonal floods which are essential when determining the reservoir operation and which can be also fundamental in terms of analysing the hydrologic security of existing reservoirs. However, seasonal flood calculation is complex and nowadays it is not totally clear. Calculation procedures commonly used may present certain problems. Statistical partial duration series, or peaks over threshold method, can be an alternative approach for their calculation that allow to solve problems encountered when the same type of event is responsible of floods in different seasons. A study has been developed to verify the hypothesis of statistical homogeneity of peak flows for different seasons in Spain. Appropriate seasonal periods have been analyzed which is highly relevant to guarantee correct results. In addition, a simple procedure has been defined to determine data selection threshold on a way that ensures its independency which is one of the main difficulties in practical application of partial series. Moreover, practical application of seasonal frequency laws requires a correct interpretation of the concept of seasonal return period. A standard is proposed in order to determine seasonal return periods coherently with the annual return period and with an adequate seasonal probability distribution. Finally a methodology is proposed to calculate seasonal peak flows. A study case illustrates the application of the proposed methodology.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Physiological signals, which are controlled by the autonomic nervous system (ANS), could be used to detect the affective state of computer users and therefore find applications in medicine and engineering. The Pupil Diameter (PD) seems to provide a strong indication of the affective state, as found by previous research, but it has not been investigated fully yet. ^ In this study, new approaches based on monitoring and processing the PD signal for off-line and on-line affective assessment ("relaxation" vs. "stress") are proposed. Wavelet denoising and Kalman filtering methods are first used to remove abrupt changes in the raw Pupil Diameter (PD) signal. Then three features (PDmean, PDmax and PDWalsh) are extracted from the preprocessed PD signal for the affective state classification. In order to select more relevant and reliable physiological data for further analysis, two types of data selection methods are applied, which are based on the paired t-test and subject self-evaluation, respectively. In addition, five different kinds of the classifiers are implemented on the selected data, which achieve average accuracies up to 86.43% and 87.20%, respectively. Finally, the receiver operating characteristic (ROC) curve is utilized to investigate the discriminating potential of each individual feature by evaluation of the area under the ROC curve, which reaches values above 0.90. ^ For the on-line affective assessment, a hard threshold is implemented first in order to remove the eye blinks from the PD signal and then a moving average window is utilized to obtain the representative value PDr for every one-second time interval of PD. There are three main steps for the on-line affective assessment algorithm, which are preparation, feature-based decision voting and affective determination. The final results show that the accuracies are 72.30% and 73.55% for the data subsets, which were respectively chosen using two types of data selection methods (paired t-test and subject self-evaluation). ^ In order to further analyze the efficiency of affective recognition through the PD signal, the Galvanic Skin Response (GSR) was also monitored and processed. The highest affective assessment classification rate obtained from GSR processing is only 63.57% (based on the off-line processing algorithm). The overall results confirm that the PD signal should be considered as one of the most powerful physiological signals to involve in future automated real-time affective recognition systems, especially for detecting the "relaxation" vs. "stress" states.^

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Physiological signals, which are controlled by the autonomic nervous system (ANS), could be used to detect the affective state of computer users and therefore find applications in medicine and engineering. The Pupil Diameter (PD) seems to provide a strong indication of the affective state, as found by previous research, but it has not been investigated fully yet. In this study, new approaches based on monitoring and processing the PD signal for off-line and on-line affective assessment (“relaxation” vs. “stress”) are proposed. Wavelet denoising and Kalman filtering methods are first used to remove abrupt changes in the raw Pupil Diameter (PD) signal. Then three features (PDmean, PDmax and PDWalsh) are extracted from the preprocessed PD signal for the affective state classification. In order to select more relevant and reliable physiological data for further analysis, two types of data selection methods are applied, which are based on the paired t-test and subject self-evaluation, respectively. In addition, five different kinds of the classifiers are implemented on the selected data, which achieve average accuracies up to 86.43% and 87.20%, respectively. Finally, the receiver operating characteristic (ROC) curve is utilized to investigate the discriminating potential of each individual feature by evaluation of the area under the ROC curve, which reaches values above 0.90. For the on-line affective assessment, a hard threshold is implemented first in order to remove the eye blinks from the PD signal and then a moving average window is utilized to obtain the representative value PDr for every one-second time interval of PD. There are three main steps for the on-line affective assessment algorithm, which are preparation, feature-based decision voting and affective determination. The final results show that the accuracies are 72.30% and 73.55% for the data subsets, which were respectively chosen using two types of data selection methods (paired t-test and subject self-evaluation). In order to further analyze the efficiency of affective recognition through the PD signal, the Galvanic Skin Response (GSR) was also monitored and processed. The highest affective assessment classification rate obtained from GSR processing is only 63.57% (based on the off-line processing algorithm). The overall results confirm that the PD signal should be considered as one of the most powerful physiological signals to involve in future automated real-time affective recognition systems, especially for detecting the “relaxation” vs. “stress” states.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance because the rule is either tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called. 632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

We tested the effects of four data characteristics on the results of reserve selection algorithms. The data characteristics were nestedness of features (land types in this case), rarity of features, size variation of sites (potential reserves) and size of data sets (numbers of sites and features). We manipulated data sets to produce three levels, with replication, of each of these data characteristics while holding the other three characteristics constant. We then used an optimizing algorithm and three heuristic algorithms to select sites to solve several reservation problems. We measured efficiency as the number or total area of selected sites, indicating the relative cost of a reserve system. Higher nestedness increased the efficiency of all algorithms (reduced the total cost of new reserves). Higher rarity reduced the efficiency of all algorithms (increased the total cost of new reserves). More variation in site size increased the efficiency of all algorithms expressed in terms of total area of selected sites. We measured the suboptimality of heuristic algorithms as the percentage increase of their results over optimal (minimum possible) results. Suboptimality is a measure of the reliability of heuristics as indicative costing analyses. Higher rarity reduced the suboptimality of heuristics (increased their reliability) and there is some evidence that more size variation did the same for the total area of selected sites. We discuss the implications of these results for the use of reserve selection algorithms as indicative and real-world planning tools.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Medication data retrieved from Australian Repatriation Pharmaceutical Benefits Scheme (RPBS) claims for 44 veterans residing in nursing homes and Pharmaceutical Benefits Scheme (PBS) claims for 898 nursing home residents were compared with medication data from nursing home records to determine the optimal time interval for retrieving claims data and its validity. Optimal matching was achieved using 12 weeks of RPBS claims data, with 60% of medications in the RPBS claims located in nursing home administration records, and 78% of medications administered to nursing home residents identified in RPBS claims. In comparison, 48% of medications administered to nursing home residents could be found in 12 weeks of PBS data, and 56% of medications present in PBS claims could be matched with nursing home administration records. RPBS claims data was superior to PBS, due to the larger number of scheduled items available to veterans and the veteran's file number, which acts as a unique identifier. These findings should be taken into account when using prescription claims data for medication histories, prescriber feedback, drug utilisation, intervention or epidemiological studies. (C) 2001 Elsevier Science Inc. All rights reserved.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

27th Annual Conference of the European Cetacean Society. Setúbal, Portugal, 8-10 April 2013.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicate the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data, referred to official statistics, shows its usefulness.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Feature selection is a central problem in machine learning and pattern recognition. On large datasets (in terms of dimension and/or number of instances), using search-based or wrapper techniques can be cornputationally prohibitive. Moreover, many filter methods based on relevance/redundancy assessment also take a prohibitively long time on high-dimensional. datasets. In this paper, we propose efficient unsupervised and supervised feature selection/ranking filters for high-dimensional datasets. These methods use low-complexity relevance and redundancy criteria, applicable to supervised, semi-supervised, and unsupervised learning, being able to act as pre-processors for computationally intensive methods to focus their attention on smaller subsets of promising features. The experimental results, with up to 10(5) features, show the time efficiency of our methods, with lower generalization error than state-of-the-art techniques, while being dramatically simpler and faster.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

In machine learning and pattern recognition tasks, the use of feature discretization techniques may have several advantages. The discretized features may hold enough information for the learning task at hand, while ignoring minor fluctuations that are irrelevant or harmful for that task. The discretized features have more compact representations that may yield both better accuracy and lower training time, as compared to the use of the original features. However, in many cases, mainly with medium and high-dimensional data, the large number of features usually implies that there is some redundancy among them. Thus, we may further apply feature selection (FS) techniques on the discrete data, keeping the most relevant features, while discarding the irrelevant and redundant ones. In this paper, we propose relevance and redundancy criteria for supervised feature selection techniques on discrete data. These criteria are applied to the bin-class histograms of the discrete features. The experimental results, on public benchmark data, show that the proposed criteria can achieve better accuracy than widely used relevance and redundancy criteria, such as mutual information and the Fisher ratio.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Sex allocation data in eusocial Hymenoptera (ants, bees and wasps) provide an excellent opportunity to assess the effectiveness of kin selection, because queens and workers differ in their relatedness to females and males. The first studies on sex allocation in eusocial Hymenoptera compared population sex investment ratios across species. Female-biased investment in monogyne (= with single-queen colonies) populations of ants suggested that workers manipulate sex allocation according to their higher relatedness to females than males (relatedness asymmetry). However, several factors may confound these comparisons across species. First, variation in relatedness asymmetry is typically associated with major changes in breeding system and life history that may also affect sex allocation. Secondly, the relative cost of females and males is difficult to estimate across sexually dimorphic taxa, such as ants. Thirdly, each species in the comparison may not represent an independent data point, because of phylogenetic relationships among species. Recently, stronger evidence that workers control sex allocation has been provided by intraspecific studies of sex ratio variation across colonies. In several species of eusocial Hymenoptera, colonies with high relatedness asymmetry produced mostly females, in contrast to colonies with low relatedness asymmetry which produced mostly males. Additional signs of worker control were found by investigating proximate mechanisms of sex ratio manipulation in ants and wasps. However, worker control is not always effective, and further manipulative experiments will be needed to disentangle the multiple evolutionary factors and processes affecting sex allocation in eusocial Hymenoptera.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Background Multiple logistic regression is precluded from many practical applications in ecology that aim to predict the geographic distributions of species because it requires absence data, which are rarely available or are unreliable. In order to use multiple logistic regression, many studies have simulated "pseudo-absences" through a number of strategies, but it is unknown how the choice of strategy influences models and their geographic predictions of species. In this paper we evaluate the effect of several prevailing pseudo-absence strategies on the predictions of the geographic distribution of a virtual species whose "true" distribution and relationship to three environmental predictors was predefined. We evaluated the effect of using a) real absences b) pseudo-absences selected randomly from the background and c) two-step approaches: pseudo-absences selected from low suitability areas predicted by either Ecological Niche Factor Analysis: (ENFA) or BIOCLIM. We compared how the choice of pseudo-absence strategy affected model fit, predictive power, and information-theoretic model selection results. Results Models built with true absences had the best predictive power, best discriminatory power, and the "true" model (the one that contained the correct predictors) was supported by the data according to AIC, as expected. Models based on random pseudo-absences had among the lowest fit, but yielded the second highest AUC value (0.97), and the "true" model was also supported by the data. Models based on two-step approaches had intermediate fit, the lowest predictive power, and the "true" model was not supported by the data. Conclusion If ecologists wish to build parsimonious GLM models that will allow them to make robust predictions, a reasonable approach is to use a large number of randomly selected pseudo-absences, and perform model selection based on an information theoretic approach. However, the resulting models can be expected to have limited fit.