894 results for Data modelling
Abstract:
Species distribution modelling is central to both fundamental and applied research in biogeography. Despite widespread use of models, there are still important conceptual ambiguities as well as biotic and algorithmic uncertainties that need to be investigated in order to increase confidence in model results. We identify and discuss five areas of enquiry that are of high importance for species distribution modelling: (1) clarification of the niche concept; (2) improved designs for sampling data for building models; (3) improved parameterization; (4) improved model selection and predictor contribution; and (5) improved model evaluation. The challenges discussed in this essay do not preclude the need for developments of other areas of research in this field. However, they are critical for allowing the science of species distribution modelling to move forward.
Abstract:
Summary Ecotones are sensitive to change because they contain high numbers of species living at the margin of their environmental tolerance. This is equally true of tree-lines, which are determined by altitudinal or latitudinal temperature gradients. In the current context of climate change, they are expected to undergo modifications in position, tree biomass and possibly species composition. Altitudinal and latitudinal tree-lines differ mainly in the steepness of the underlying temperature gradient: distances are larger at latitudinal tree-lines, which could have an impact on the ability of tree species to migrate in response to climate change. Aside from temperature, tree-lines are also affected on a more local level by pressure from human activities. These are also changing as a consequence of modifications in our societies and may interact with the effects of climate change. Forest dynamics models are often used for climate change simulations because of their mechanistic processes. The spatially-explicit model TreeMig was used as a base to develop a model specifically tuned for the northern European and Alpine tree-line ecotones. For the latter, a module for land-use change processes was also added. The temperature response parameters for the species in the model were first calibrated by means of tree-ring data from various species and sites at both tree-lines. This improved the growth response function in the model, but also led to the conclusion that regeneration is probably more important than growth for controlling tree-line position and species' distributions. The second step was to implement the module for abandonment of agricultural land in the Alps, based on an existing spatial statistical model. The sensitivity of its most important variables was tested and the model's performance compared to other modelling approaches.
The probability that agricultural land would be abandoned was strongly influenced by the distance from the nearest forest and the slope, both of which are proxies for cultivation costs. When applied to a case study area, the resulting model, named TreeMig-LAb, gave the most realistic results. These were consistent with observed consequences of land abandonment such as the expansion of the existing forest and closing up of gaps. This new model was then applied in two case study areas, one in the Swiss Alps and one in Finnish Lapland, under a variety of climate change scenarios. These were based on forecasts of temperature change over the next century by the IPCC and the HadCM3 climate model (ΔT: +1.3, +3.5 and +5.6 °C) and included a post-change stabilisation period of 300 years. The results showed radical disruptions at both tree-lines. With the most conservative climate change scenario, species' distributions simply shifted, but it took several centuries to reach a new equilibrium. With the more extreme scenarios, some species disappeared from our study areas (e.g. Pinus cembra in the Alps) or dwindled to very low numbers, as they ran out of land into which they could migrate. The most striking result was the lag in the response of most species, independently of the climate change scenario or tree-line type considered. Finally, a statistical model of the effect of reindeer (Rangifer tarandus) browsing on the growth of Pinus sylvestris was developed, as a first step towards implementing human impacts at the boreal tree-line. The expected effect was an indirect one, as reindeer deplete the ground lichen cover, thought to protect the trees against adverse climate conditions. The model showed a small but significant effect of browsing, but as the link with the underlying climate variables was unclear and the model was not spatial, it was not usable as such.
Developing the TreeMig-LAb model made it possible to: a) establish a method for deriving species' parameters for the growth equation from tree-rings, b) highlight the importance of regeneration in determining tree-line position and species' distributions and c) improve the integration of social sciences into landscape modelling. Applying the model at the Alpine and northern European tree-lines under different climate change scenarios showed that with most forecasted levels of temperature increase, tree-lines would suffer major disruptions, with shifts in distributions and potential extinction of some tree-line species. However, these responses showed strong lags, so these effects would not become apparent for decades and could take centuries to stabilise.
Abstract:
The present research deals with an application of artificial neural networks to multitask learning from spatial environmental data. The real case study (sediment contamination of Lake Geneva) consists of 8 pollutants. There are different relationships between these variables, from linear correlations to strong nonlinear dependencies. The main idea is to construct subsets of pollutants which can be efficiently modelled together within the multitask framework. The proposed two-step approach is based on: 1) the criterion of nonlinear predictability of each variable "k", obtained by analysing all possible models composed of the rest of the variables, using a General Regression Neural Network (GRNN) as the model; 2) multitask learning of the best model using a multilayer perceptron, followed by spatial predictions. The results of the study are analysed using both machine learning and geostatistical tools.
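A GRNN is equivalent to Nadaraya-Watson kernel regression with a Gaussian kernel, so the predictability criterion in step 1 can be illustrated with a few lines of code. The sketch below is only an illustration under stated assumptions: the function names, the bandwidth `sigma`, the leave-one-out error used as a predictability score, and the toy data are all hypothetical and are not taken from the study.

```python
import math

def grnn_predict(train_x, train_y, query, sigma):
    """Nadaraya-Watson estimate: Gaussian-weighted average of training targets."""
    weights = []
    for x in train_x:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        # All kernels underflowed: fall back to the global mean.
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total

def loo_mse(xs, ys, sigma):
    """Leave-one-out mean squared error, used here as a (hypothetical)
    score of how well a target can be predicted from the inputs."""
    errors = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        pred = grnn_predict(rest_x, rest_y, xs[i], sigma)
        errors.append((pred - ys[i]) ** 2)
    return sum(errors) / len(errors)

# Toy data (made up): one input, one smooth target and one target
# that alternates in sign and cannot be predicted from the input.
xs = [(i / 10.0,) for i in range(21)]
smooth = [math.sin(2.0 * x[0]) for x in xs]
erratic = [(-1.0) ** i for i in range(21)]

print("smooth target :", loo_mse(xs, smooth, sigma=0.1))
print("erratic target:", loo_mse(xs, erratic, sigma=0.1))
```

In a setting like the one described, such a score would be computed for each pollutant against each candidate subset of the remaining pollutants, and low-error combinations would be candidates for joint modelling.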
Abstract:
Aims: To assess the potential distribution of an obligate seeder and active pyrophyte, Cistus salviifolius, a vulnerable species in the Swiss Red List; to derive scenarios by changing the fire return interval; and to discuss the results from a conservation perspective. A more general aim is to assess the impact of fire as a natural factor influencing the vegetation of the southern slopes of the Alps. Locations: Alps, southern Switzerland. Methods: Presence-absence data to fit the model were obtained from the most recent field mapping of C. salviifolius. The quantitative environmental predictors used in this study include topographic, climatic and disturbance (fire) predictors. Models were fitted by logistic regression and evaluated by jackknife and bootstrap approaches. Changes in fire regime were simulated by increasing the return interval of fire (simulating longer periods without fire). Two scenarios were considered: no fire in the past 15 years, or no fire in the past 35 years. Results: Rock cover, slope, topographic position, potential evapotranspiration and time elapsed since the last fire were selected in the final model. The Nagelkerke R² of the model for C. salviifolius was 0.57 and the jackknife area under the curve (AUC) was 0.89. The bootstrap evaluation revealed model robustness. Increasing the fire return interval by up to 15 or 35 years reduced the modelled C. salviifolius population by 30% and 40%, respectively. Main conclusions: Although fire plays a significant role, topography and rock cover appear to be the most important predictors, suggesting that the distribution of C. salviifolius in the southern Swiss Alps is closely related to the availability of supposedly competition-free sites, such as emerging bedrock, ridge locations or steep slopes. Fire is more likely to play a secondary role in allowing C. salviifolius to extend its occurrence temporarily, by increasing germination rates and reducing the competition from surrounding vegetation. To maintain a viable dormant seed bank for C. salviifolius, conservation managers should consider carrying out vegetation clearing and managing wildfire propagation to reduce competition and ensure sufficient recruitment for this species.
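The two evaluation statistics quoted above can be computed as follows. This is a minimal sketch of the standard definitions (Nagelkerke's rescaling of the Cox-Snell R², and the AUC as the probability that a presence is ranked above an absence); the function names and example values are illustrative assumptions and do not reproduce the study's data or its jackknife procedure.

```python
import math

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke R^2 from the log-likelihoods of the intercept-only
    model and the fitted model, for n observations."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_r2 = 1.0 - math.exp(2.0 * loglik_null / n)  # upper bound of Cox-Snell
    return cox_snell / max_r2

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of presences
    (label 1) and absences (label 0); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up presence/absence labels and predicted probabilities.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.2, 0.1]
print(auc(labels, scores))
```

A jackknife AUC would apply `auc` to predictions obtained by refitting the model with each observation left out in turn.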
Abstract:
Remote sensing using airborne imaging spectroscopy (AIS) is known to retrieve fundamental optical properties of ecosystems. However, the value of these properties for predicting plant species distribution remains unclear. Here, we assess whether such data can add value to topographic variables for predicting plant distributions in French and Swiss alpine grasslands. We fitted statistical models with high spectral and spatial resolution reflectance data and tested four optical indices sensitive to leaf chlorophyll content, leaf water content and leaf area index. We found moderate added-value of AIS data for predicting alpine plant species distribution. Contrary to expectations, differences between species distribution models (SDMs) were not linked to their local abundance or phylogenetic/functional similarity. Moreover, spectral signatures of species were found to be partly site-specific. We discuss current limits of AIS-based SDMs, highlighting issues of scale and informational content of AIS data.
Abstract:
This research describes the process followed to assemble a "Social Accounting Matrix" for Spain for the year 2000 (SAMSP00). As argued in the paper, this process attempts to reconcile ESA95 conventions with the requirements of applied general equilibrium modelling. In particular, problems related to the level of aggregation of net taxation data, and to the valuation system used for expressing the monetary value of input-output transactions, receive special attention. Since the adoption of ESA95 conventions, input-output transactions have preferably been valued at basic prices, which imposes additional difficulties on modellers interested in computing applied general equilibrium models. This paper addresses these difficulties by developing a procedure that allows SAM-builders to change the valuation system of input-output transactions conveniently. In addition, this procedure produces new data related to net taxation information.
Abstract:
The paper presents the Multiple Kernel Learning (MKL) approach as a modelling and data exploratory tool and applies it to the problem of wind speed mapping. Support Vector Regression (SVR) is used to predict spatial variations of the mean wind speed from terrain features (slopes, terrain curvature, directional derivatives) generated at different spatial scales. Multiple Kernel Learning is applied to learn kernels for individual features and thematic feature subsets, both in the context of feature selection and optimal parameters determination. An empirical study on real-life data confirms the usefulness of MKL as a tool that enhances the interpretability of data-driven models.
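The core idea of combining one kernel per feature can be sketched as follows. Note that this is not the MKL optimisation used in the paper: as a stand-in, the kernel weights are set with a simple centred kernel-target alignment heuristic, and all names, the `gamma` value and the toy data are hypothetical. Larger weights point to features whose kernel agrees better with the target, which is the kind of interpretability the abstract refers to.

```python
import math

def rbf_gram(column, gamma):
    """Gaussian (RBF) Gram matrix built from a single feature."""
    return [[math.exp(-gamma * (a - b) ** 2) for b in column] for a in column]

def center(gram):
    """Double-centre a symmetric Gram matrix."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def frobenius(a, b):
    """Frobenius inner product of two matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def alignment(gram, targets):
    """Centred alignment between a kernel and the target kernel y y^T."""
    kc = center(gram)
    yc = center([[s * t for t in targets] for s in targets])
    denom = math.sqrt(frobenius(kc, kc) * frobenius(yc, yc))
    return frobenius(kc, yc) / denom if denom > 0.0 else 0.0

def alignment_weights(features, targets, gamma=1.0):
    """Heuristic kernel weights: clipped alignments, normalised to sum to 1."""
    columns = list(zip(*features))
    scores = [max(alignment(rbf_gram(c, gamma), targets), 0.0) for c in columns]
    total = sum(scores)
    if total == 0.0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def combined_gram(features, weights, gamma=1.0):
    """Convex combination of the per-feature Gram matrices."""
    grams = [rbf_gram(c, gamma) for c in zip(*features)]
    n = len(features)
    return [[sum(w * g[i][j] for w, g in zip(weights, grams))
             for j in range(n)] for i in range(n)]

# Toy data (made up): feature 0 drives the target, feature 1 is scrambled.
features = [(i / 5.0, ((i * 7) % 10) / 5.0) for i in range(10)]
targets = [f[0] for f in features]
weights = alignment_weights(features, targets)
gram = combined_gram(features, weights)
print(weights)
```

The combined Gram matrix could then be passed to any kernel regression method that accepts a precomputed kernel.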
Abstract:
Depth-averaged velocities and unit discharges within a 30 km reach of one of the world's largest rivers, the Rio Parana, Argentina, were simulated using three hydrodynamic models with different process representations: a reduced complexity (RC) model that neglects most of the physics governing fluid flow, a two-dimensional model based on the shallow water equations, and a three-dimensional model based on the Reynolds-averaged Navier-Stokes equations. Flow characteristics simulated using all three models were compared with data obtained by acoustic Doppler current profiler surveys at four cross sections within the study reach. This analysis demonstrates that, surprisingly, the performance of the RC model is generally equal to, and in some instances better than, that of the physics-based models in terms of the statistical agreement between simulated and measured flow properties. In addition, in contrast to previous applications of RC models, the present study demonstrates that the RC model can successfully predict measured flow velocities. The strong performance of the RC model reflects, in part, the simplicity of the depth-averaged mean flow patterns within the study reach and the dominant role of channel-scale topographic features in controlling the flow dynamics. Moreover, the very low water surface slopes that typify large sand-bed rivers enable flow depths to be estimated reliably in the RC model using a simple fixed-lid planar water surface approximation. This approach overcomes a major problem encountered in the application of RC models in environments characterised by shallow flows and steep bed gradients. The RC model is four orders of magnitude faster than the physics-based models when performing steady-state hydrodynamic calculations. However, the iterative nature of the RC model calculations implies a reduction in computational efficiency relative to some other RC models.
A further implication of this is that, if used to simulate channel morphodynamics, the present RC model may offer only a marginal advantage in terms of computational efficiency over approaches based on the shallow water equations. These observations illustrate the trade-off between model realism and efficiency that is a key consideration in RC modelling. Moreover, this outcome highlights a need to rethink the use of RC morphodynamic models in fluvial geomorphology and to move away from existing grid-based approaches, such as the popular cellular automata (CA) models, that remain essentially reductionist in nature. In the case of the world's largest sand-bed rivers, this might be achieved by implementing the RC model outlined here as one element within a hierarchical modelling framework that would enable computationally efficient simulation of the morphodynamics of large rivers over millennial time scales. (C) 2012 Elsevier B.V. All rights reserved.
Abstract:
Automatic environmental monitoring networks supported by wireless communication technologies now provide large and ever-increasing volumes of data. The use of this information in natural hazard research is an important issue. Particularly useful for risk assessment and decision making are the spatial maps of hazard-related parameters produced from point observations and available auxiliary information. The purpose of this article is to present and explore the appropriate tools to process large amounts of available data and produce predictions at fine spatial scales. These are the algorithms of machine learning, which are aimed at non-parametric robust modelling of non-linear dependencies from empirical data. The computational efficiency of the data-driven methods allows prediction maps to be produced in real time, which makes them superior to physical models for operational use in risk assessment and mitigation. This situation arises in particular in the spatial prediction of climatic variables (topo-climatic mapping). In the complex topographies of mountainous regions, meteorological processes are highly influenced by the relief. The article shows how these relations, possibly regionalized and non-linear, can be modelled from data using the information from digital elevation models. The developed methodology is illustrated by the mapping of temperatures (including situations of Föhn and temperature inversion) from measurements taken by the Swiss meteorological monitoring network. The methods used in the study include data-driven feature selection, support vector algorithms and artificial neural networks.
Abstract:
1. The ecological niche is a fundamental biological concept. Modelling species' niches is central to numerous ecological applications, including predicting species invasions, identifying reservoirs for disease, nature reserve design and forecasting the effects of anthropogenic and natural climate change on species' ranges. 2. A computational analogue of Hutchinson's ecological niche concept (the multidimensional hyperspace of species' environmental requirements) is the support of the distribution of environments in which the species persist. Recently developed machine-learning algorithms can estimate the support of such high-dimensional distributions. We show how support vector machines can be used to map ecological niches using only observations of species presence to train distribution models for 106 species of woody plants and trees in a montane environment using up to nine environmental covariates. 3. We compared the accuracy of three methods that differ in their approaches to reducing model complexity. We tested models with independent observations of both species presence and species absence. We found that the simplest procedure, which uses all available variables and no pre-processing to reduce correlation, was best overall. Ecological niche models based on support vector machines are theoretically superior to models that rely on simulating pseudo-absence data and are comparable in empirical tests. 4. Synthesis and applications. Accurate species distribution models are crucial for effective environmental planning, management and conservation, and for unravelling the role of the environment in human health and welfare. Models based on distribution estimation rather than classification overcome theoretical and practical obstacles that pervade species distribution modelling. 
In particular, ecological niche models based on machine-learning algorithms for estimating the support of a statistical distribution provide a promising new approach to identifying species' potential distributions and to project changes in these distributions as a result of climate change, land use and landscape alteration.
Abstract:
Knowledge about spatial biodiversity patterns is a basic criterion for reserve network design. Although herbarium collections hold large quantities of information, the data are often scattered and cannot supply complete spatial coverage. Alternatively, herbarium data can be used to fit species distribution models and their predictions can be used to provide complete spatial coverage and derive species richness maps. Here, we build on previous effort to propose an improved compositionalist framework for using species distribution models to better inform conservation management. We illustrate the approach with models fitted with six different methods and combined using an ensemble approach for 408 plant species in a tropical and megadiverse country (Ecuador). As a complementary view to the traditional richness hotspots methodology, consisting of a simple stacking of species distribution maps, the compositionalist modelling approach used here combines separate predictions for different pools of species to identify areas of alternative suitability for conservation. Our results show that the compositionalist approach better captures the established protected areas than the traditional richness hotspots strategies and allows the identification of areas in Ecuador that would optimally complement the current protection network. Further studies should aim at refining the approach with more groups and additional species information.
Abstract:
1. Few examples of habitat-modelling studies of rare and endangered species exist in the literature, although from a conservation perspective predicting their distribution would prove particularly useful. Paucity of data and lack of valid absences are the probable reasons for this shortcoming. Analytic solutions to accommodate the lack of absence include the ecological niche factor analysis (ENFA) and the use of generalized linear models (GLM) with simulated pseudo-absences. 2. In this study we tested a new approach to generating pseudo-absences, based on a preliminary ENFA habitat suitability (HS) map, for the endangered species Eryngium alpinum. This method of generating pseudo-absences was compared with two others: (i) use of a GLM with pseudo-absences generated totally at random, and (ii) use of an ENFA only. 3. The influence of two different spatial resolutions (i.e. grain) was also assessed for tackling the dilemma of quality (grain) vs. quantity (number of occurrences). Each combination of the three above-mentioned methods with the two grains generated a distinct HS map. 4. Four evaluation measures were used for comparing these HS maps: total deviance explained, best kappa, Gini coefficient and minimal predicted area (MPA). The last is a new evaluation criterion proposed in this study. 5. Results showed that (i) GLM models using ENFA-weighted pseudo-absence provide better results, except for the MPA value, and that (ii) quality (spatial resolution and locational accuracy) of the data appears to be more important than quantity (number of occurrences). Furthermore, the proposed MPA value is suggested as a useful measure of model evaluation when used to complement classical statistical measures. 6. Synthesis and applications. We suggest that the use of ENFA-weighted pseudo-absence is a possible way to enhance the quality of GLM-based potential distribution maps and that data quality (i.e. spatial resolution) prevails over quantity (i.e. number of data). 
Increased accuracy of potential distribution maps could help to define better suitable areas for species protection and reintroduction.
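Of the four evaluation measures listed, "best kappa" is the easiest to illustrate: the continuous habitat suitability values are thresholded at several cut-offs and the highest Cohen's kappa is retained. The sketch below follows the standard definition of Cohen's kappa; the function names, thresholds and data are hypothetical, and the MPA criterion proposed in the study is not reproduced here.

```python
def cohen_kappa(observed, predicted):
    """Cohen's kappa for binary presence (1) / absence (0) data."""
    n = len(observed)
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    p_obs = (tp + tn) / n
    # Agreement expected by chance from the marginal totals.
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_exp == 1.0:
        return 0.0
    return (p_obs - p_exp) / (1.0 - p_exp)

def best_kappa(observed, suitability, thresholds):
    """Return (kappa, threshold) for the cut-off that maximises kappa."""
    candidates = []
    for t in thresholds:
        predicted = [1 if s >= t else 0 for s in suitability]
        candidates.append((cohen_kappa(observed, predicted), t))
    return max(candidates)

# Made-up presences/pseudo-absences and habitat suitability values.
observed = [1, 1, 1, 0, 0, 0]
suitability = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
print(best_kappa(observed, suitability, [0.2, 0.5, 0.8]))
```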
Abstract:
Sustainable resource use is one of the most important environmental issues of our times. It is closely related to discussions on the 'peaking' of various natural resources serving as energy sources, agricultural nutrients, or metals indispensable in high-technology applications. Although the peaking theory remains controversial, it is commonly recognized that a more sustainable use of resources would alleviate negative environmental impacts related to resource use. In this thesis, sustainable resource use is analysed from a practical standpoint, through several different case studies. Four of these case studies relate to resource metabolism in the Canton of Geneva in Switzerland: the aim was to model the evolution of chosen resource stocks and flows in the coming decades. The studied resources were copper (a bulk metal), phosphorus (a vital agricultural nutrient), and wood (a renewable resource). In addition, the case of lithium (a critical metal) was analysed briefly in a qualitative manner and in an electric mobility perspective. In addition to the Geneva case studies, this thesis includes a case study on the sustainability of space life support systems. Space life support systems are systems whose aim is to provide the crew of a spacecraft with the necessary metabolic consumables over the course of a mission. Sustainability was again analysed from a resource use perspective. In this case study, the functioning of two different types of life support systems, ARES and BIORAT, were evaluated and compared; these systems represent, respectively, physico-chemical and biological life support systems. Space life support systems could in fact be used as a kind of 'laboratory of sustainability' given that they represent closed and relatively simple systems compared to complex and open terrestrial systems such as the Canton of Geneva. 
The chosen analysis method used in the Geneva case studies was dynamic material flow analysis: dynamic material flow models were constructed for the resources copper, phosphorus, and wood. Besides a baseline scenario, various alternative scenarios (notably involving increased recycling) were also examined. In the case of space life support systems, the methodology of material flow analysis was also employed, but as the data available on the dynamic behaviour of the systems was insufficient, only static simulations could be performed. The results of the case studies in the Canton of Geneva show the following: were resource use to follow population growth, resource consumption would be multiplied by nearly 1.2 by 2030 and by 1.5 by 2080. A complete transition to electric mobility would be expected to only slightly (+5%) increase the copper consumption per capita while the lithium demand in cars would increase 350 fold. For example, phosphorus imports could be decreased by recycling sewage sludge or human urine; however, the health and environmental impacts of these options have yet to be studied. Increasing the wood production in the Canton would not significantly decrease the dependence on wood imports as the Canton's production represents only 5% of total consumption. In the comparison of space life support systems ARES and BIORAT, BIORAT outperforms ARES in resource use but not in energy use. However, as the systems are dimensioned very differently, it remains questionable whether they can be compared outright. In conclusion, the use of dynamic material flow analysis can provide useful information for policy makers and strategic decision-making; however, uncertainty in reference data greatly influences the precision of the results. Space life support systems constitute an extreme case of resource-using systems; nevertheless, it is not clear how their example could be of immediate use to terrestrial systems.
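The bookkeeping behind a dynamic material flow model can be sketched with a single stock. In this simplified illustration, material enters use each year, leaves after a fixed product lifetime, and a share of the outflow is recycled and offsets imports. The fixed lifetime, the function name and all numbers are assumptions made for the example and do not come from the thesis.

```python
def simulate_stock(inflows, lifetime, recycling_rate=0.0):
    """Inflow-driven stock model with a fixed lifetime (in years).

    inflows: material entering use each year.
    Returns one record per year with stock, outflow and net imports.
    Imports may become negative if recycled material exceeds demand.
    """
    stock = 0.0
    history = []
    for year, inflow in enumerate(inflows):
        # Material that entered `lifetime` years ago now leaves the stock.
        outflow = inflows[year - lifetime] if year >= lifetime else 0.0
        recycled = recycling_rate * outflow
        imports = inflow - recycled
        stock += inflow - outflow
        history.append({"year": year, "stock": stock,
                        "outflow": outflow, "imports": imports})
    return history

# Hypothetical scenario: constant demand of 10 units per year, 2-year lifetime.
baseline = simulate_stock([10.0] * 5, lifetime=2)
recycling = simulate_stock([10.0] * 5, lifetime=2, recycling_rate=0.5)
for base, rec in zip(baseline, recycling):
    print(base["year"], base["stock"], base["imports"], rec["imports"])
```

Comparing the baseline run with the recycling run mirrors the scenario comparison described in the abstract, with the stock unchanged and only the import requirement reduced.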
Abstract:
The paper presents some contemporary approaches to spatial environmental data analysis. The main topics concern the decision-oriented problems of environmental spatial data mining and modelling: valorization and representativity of data with the help of exploratory data analysis, spatial predictions, probabilistic and risk mapping, and the development and application of conditional stochastic simulation models. The innovative part of the paper presents an integrated/hybrid model: machine learning (ML) residuals sequential simulations (MLRSS). The models are based on multilayer perceptron and support vector regression ML algorithms used for modelling long-range spatial trends, combined with sequential simulations of the residuals. ML algorithms deliver non-linear solutions for spatially non-stationary problems, which are difficult for the geostatistical approach. Geostatistical tools (variography) are used to characterize the performance of ML algorithms, by analysing the quality and quantity of the spatially structured information extracted from data with ML algorithms. Sequential simulations provide efficient assessment of uncertainty and spatial variability. A case study from the Chernobyl fallout illustrates the performance of the proposed model. It is shown that probability mapping, provided by the combination of ML data-driven and geostatistical model-based approaches, can be efficiently used in the decision-making process. (C) 2003 Elsevier Ltd. All rights reserved.
Abstract:
Abstract One of the most important issues in molecular biology is to understand the regulatory mechanisms that control gene expression. Gene expression is often regulated by proteins called transcription factors, which bind to short (5 to 20 base pairs), degenerate segments of DNA. Experimental efforts towards understanding the sequence specificity of transcription factors are laborious and expensive, but can be substantially accelerated with the use of computational predictions. This thesis describes the use of algorithms and resources for transcription factor binding site analysis in addressing quantitative modelling, where probabilistic models are built to represent the binding properties of a transcription factor and can be used to find new functional binding sites in genomes. Initially, an open-access database (HTPSELEX) was created, holding high-quality binding sequences for two eukaryotic families of transcription factors, namely CTF/NF1 and LEF1/TCF. The binding sequences were elucidated using a recently described experimental procedure called HTP-SELEX, which allows the generation of a large number (> 1000) of binding sites using mass sequencing technology. For each HTP-SELEX experiment we also provide accurate primary experimental information about the protein material used, details of the wet lab protocol, an archive of sequencing trace files, and assembled clone sequences of binding sequences. The database also offers reasonably large SELEX libraries obtained with conventional low-throughput protocols. The database is available at http://wwwisrec.isb-sib.ch/htpselex/ and ftp://ftp.isrec.isb-sib.ch/pub/databases/htpselex. The Expectation-Maximisation (EM) algorithm is one of the frequently used methods to estimate probabilistic models that represent the sequence specificity of transcription factors.
We present computer simulations in order to estimate the precision of EM-estimated models as a function of data set parameters (such as the length of the initial sequences, the number of initial sequences, and the percentage of nonbinding sequences). We observed a remarkable robustness of the EM algorithm with regard to the length of the training sequences and the degree of contamination. The HTPSELEX database and the benchmarked results of the EM algorithm formed part of the foundation for the subsequent project, in which a statistical framework based on hidden Markov models was developed to represent the sequence specificity of the transcription factors CTF/NF1 and LEF1/TCF using the HTP-SELEX experiment data. The hidden Markov model framework is capable of both predicting and classifying CTF/NF1 and LEF1/TCF binding sites. A covariance analysis of the binding sites revealed non-independent base preferences at different nucleotide positions, providing insight into the binding mechanism. We next tested the LEF1/TCF model by computing binding scores for a set of LEF1/TCF binding sequences for which relative affinities were determined experimentally using non-linear regression. The predicted and experimentally determined binding affinities were well correlated.
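To illustrate how a probabilistic model of binding specificity can be used to find candidate sites, the sketch below builds a position weight matrix (PWM) from aligned sites and scans a sequence with it. A PWM is a simpler model than the hidden Markov models developed in the thesis and assumes independent positions, which the covariance analysis above suggests is only an approximation. The example sites, the sequence, the pseudocount and the uniform background are all made up for the illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1.0
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background)
                    for base in BASES})
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start index) of the highest-scoring window."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Hypothetical aligned binding sites and a sequence to scan.
sites = ["TTGGCA", "TTGGCT", "TTGGCA", "CTGGCA"]
pwm = build_pwm(sites)
sequence = "AAACCTTGGCAGTAC"
print(best_hit(pwm, sequence))
```

In practice the aligned sites would come from an estimation step such as EM rather than being given, and scores would be compared against a threshold calibrated on known binding and nonbinding sequences.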