930 results for data availability
Abstract:
This research aims to use the multivariate geochemical dataset, generated by the Tellus project, to investigate the appropriate use of transformation methods to maintain the integrity of geochemical data and inherent constrained behaviour in multivariate relationships. The widely used normal score transform is compared with the use of a stepwise conditional transform technique. The Tellus Project, managed by GSNI and funded by the Department of Enterprise Trade and Development and the EU’s Building Sustainable Prosperity Fund, involves the most comprehensive geological mapping project ever undertaken in Northern Ireland. Previous studies have demonstrated spatial variability in the Tellus data, but geostatistical analysis and interpretation of the datasets require the use of an appropriate methodology that reproduces the inherently complex multivariate relations. Previous investigation of the Tellus geochemical data has included the use of Gaussian-based techniques. However, earth science variables are rarely Gaussian, hence transformation of data is integral to the approach. The multivariate geochemical dataset generated by the Tellus project provides an opportunity to investigate the appropriate use of transformation methods, as required for Gaussian-based geostatistical analysis. In particular, the stepwise conditional transform is investigated and developed for the geochemical datasets obtained as part of the Tellus project. The transform is applied to four variables in a bivariate nested fashion due to the limited availability of data. Simulation of these transformed variables is then carried out, along with a corresponding back transformation to original units. Results show that the stepwise transform is successful in reproducing both univariate statistics and the complex bivariate relations exhibited by the data.
Greater fidelity to multivariate relationships will improve uncertainty models, which are required for consequent geological, environmental and economic inferences.
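To make the two transforms compared in this abstract concrete, here is a minimal Python sketch. It assumes a rank-based normal score transform with the plotting position (rank - 0.5) / n, and a stepwise conditional transform in which the second variable is normal-scored separately within equal-count classes of the first. The function names, the class count and the plotting position are illustrative assumptions, not details taken from the Tellus study.

```python
from statistics import NormalDist


def normal_score_transform(values):
    """Map values to standard normal scores by rank (ties share one score)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    std_normal = NormalDist()
    pos = 0
    while pos < n:
        # Find the run of tied values that starts at sorted position pos.
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Plotting position (average rank - 0.5) / n, always inside (0, 1).
        p = ((pos + end) / 2 + 0.5) / n
        for k in range(pos, end + 1):
            scores[order[k]] = std_normal.inv_cdf(p)
        pos = end + 1
    return scores


def stepwise_conditional_transform(primary, secondary, n_classes=3):
    """Normal-score the primary variable, then the secondary variable
    separately within equal-count classes of the primary scores.
    n_classes is a hypothetical tuning choice for this sketch."""
    n = len(primary)
    y1 = normal_score_transform(primary)
    order = sorted(range(n), key=lambda i: y1[i])
    y2 = [0.0] * n
    for c in range(n_classes):
        members = order[c * n // n_classes:(c + 1) * n // n_classes]
        class_scores = normal_score_transform([secondary[i] for i in members])
        for i, score in zip(members, class_scores):
            y2[i] = score
    return y1, y2
```

Because the second variable is transformed class by class, each class needs enough samples to define a distribution, which is one reason such a transform may have to be applied to few variables at a time when data are limited.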
Abstract:
Automatically determining and assigning shared and meaningful text labels to data extracted from an e-Commerce web page is a challenging problem. An e-Commerce web page can display a list of data records, each of which can contain a combination of data items (e.g. product name and price) and explicit labels, which describe some of these data items. Recent advances in extraction techniques have made it much easier to precisely extract individual data items and labels from a web page; however, two problems remain open: (1) assigning an explicit label to a data item, and (2) determining labels for the remaining data items. Furthermore, improvements in the availability and coverage of vocabularies, especially in the context of e-Commerce web sites, mean that we now have access to a bank of relevant, meaningful and shared labels which can be assigned to extracted data items. However, there is a need for a technique which will take as input a set of extracted data items and automatically assign to them the most relevant and meaningful labels from a shared vocabulary. We observe that the Information Extraction (IE) community has developed a great number of techniques which solve problems similar to our own. In this work-in-progress paper we set out our intention to evaluate, theoretically and experimentally, different IE techniques to ascertain which is most suitable for solving this problem.
Abstract:
The last decade has witnessed unprecedented growth in the availability of data with spatio-temporal characteristics. Given the scale and richness of such data, finding spatio-temporal patterns that behave significantly differently from their neighbors could be of interest in various application scenarios, such as weather modeling, analyzing the spread of disease outbreaks, and monitoring traffic congestion. In this paper, we propose an automated approach to exploring and discovering such anomalous patterns irrespective of the underlying domain from which the data is recovered. Our approach differs significantly from traditional methods of spatial outlier detection, and employs two phases: (i) discovering homogeneous regions, and (ii) evaluating these regions as anomalies based on their statistical difference from a generalized neighborhood. We evaluate the quality of our approach and distinguish it from existing techniques via an extensive experimental evaluation.
Abstract:
In dynamic spectrum access networks, cognitive radio terminals monitor their spectral environment in order to detect and opportunistically access unoccupied frequency channels. The overall performance of such networks depends on the spectrum occupancy or availability patterns. Accurate knowledge of channel availability enables optimum performance of such networks in terms of spectrum and energy efficiency. This work proposes a novel probabilistic channel availability model that can describe the channel availability in different polarizations for mobile cognitive radio terminals that are likely to change their orientation during operation. A Gaussian approximation is used to model the empirical occupancy data obtained through a measurement campaign in the cellular frequency bands within a realistic operational scenario.
Abstract:
Cognitive radio has been proposed as a means of improving the spectrum utilisation and increasing the spectrum efficiency of wireless systems. This can be achieved by allowing cognitive radio terminals to monitor their spectral environment and opportunistically access the unoccupied frequency channels. Due to the opportunistic nature of cognitive radio, the overall performance of such networks depends on the spectrum occupancy or availability patterns. Appropriate knowledge of channel availability can optimise the sensing performance in terms of spectrum and energy efficiency. This work proposes a statistical framework for channel availability in the polarization domain. A Gaussian (normal) approximation is used to model real-world occupancy data obtained through a measurement campaign in the cellular frequency bands within a realistic scenario.
Abstract:
We introduce a quality-controlled observational atmospheric, snow, and soil data set from Snoqualmie Pass, Washington, U.S.A., to enable testing of hydrometeorological and snow process representations within a rain-snow transitional climate where existing observations are sparse and limited. Continuous meteorological forcing data (including air temperature, total precipitation, wind speed, specific humidity, air pressure, and short- and longwave irradiance) are provided at hourly intervals for a 24-year historical period (water years 1989-2012) and at half-hourly intervals for a more recent period (water years 2013-2015), separated based on the availability of observations. Additional observations include 40 years of snow board new snow accumulation, multiple measurements of total snow depth, and manual snow pits, while more recent years include sub-daily surface temperature, snowpack drainage, soil moisture and temperature profiles, and eddy-covariance-derived turbulent heat flux. This data set is ideal for testing hypotheses about energy balance, soil and snow processes in the rain-snow transition zone. Plots of live data can be found here: http://depts.washington.edu/mtnhydr/cgi/plot.cgi
Abstract:
Controlled fires in forest areas are frequently used in most Mediterranean countries as a preventive technique to avoid severe wildfires in the summer season. In Portugal, this method of managing fuel mass availability is also used and has been shown to be beneficial, as annual statistical reports confirm that the decrease in wildfire occurrence is directly related to the practice of controlled fire. However, prescribed fire can have serious side effects on some forest soil properties. This work shows the changes that occurred in some forest soil properties after a prescribed fire. The experiments were carried out in soil cover over a natural site of andalusitic schist, in Gramelas, Caminha, Portugal, that had not been burned for four years. Composite soil samples were collected from five plots at three different layers (0-3 cm, 3-6 cm and 6-18 cm) during a three-year monitoring period after the prescribed burning. Principal Component Analysis was used to reach the presented conclusions.
Abstract:
Next-generation sequencing (NGS) technologies have become the standard for data generation in studies of population genomics, such as the 1000 Genomes Project (1000G). However, these techniques are known to be problematic when applied to highly polymorphic genomic regions, such as the human leukocyte antigen (HLA) genes. Because accurate genotype calls and allele frequency estimations are crucial to population genomics analyses, it is important to assess the reliability of NGS data. Here, we evaluate the reliability of genotype calls and allele frequency estimates of the single-nucleotide polymorphisms (SNPs) reported by 1000G (phase I) at five HLA genes (HLA-A, -B, -C, -DRB1, and -DQB1). We take advantage of the availability of HLA Sanger sequencing of 930 of the 1092 1000G samples and use this as a gold standard to benchmark the 1000G data. We document that 18.6% of SNP genotype calls in HLA genes are incorrect and that allele frequencies are estimated with an error greater than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias toward overestimation of reference allele frequency for the 1000G data, indicating that mapping bias is an important cause of error in frequency estimation in this dataset. We provide a list of sites that have poor allele frequency estimates and discuss the outcomes of including those sites in different kinds of analyses. Because the HLA region is the most polymorphic in the human genome, our results provide insights into the challenges of using NGS data at other genomic regions of high diversity.
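The two benchmark quantities reported in this abstract (the share of incorrect genotype calls and the error in allele frequency estimates) can be illustrated with a small Python sketch. It assumes genotypes at one SNP are coded as the number of reference alleles carried (0, 1 or 2), with samples listed in the same order in both data sets. The coding and the function names are assumptions for illustration, not the authors' pipeline.

```python
def genotype_discordance(calls, gold):
    """Fraction of genotype calls that differ from the gold-standard calls."""
    if len(calls) != len(gold) or not calls:
        raise ValueError("need two non-empty call lists of equal length")
    mismatches = sum(1 for c, g in zip(calls, gold) if c != g)
    return mismatches / len(calls)


def reference_allele_frequency(genotypes):
    """Reference allele frequency for diploid genotypes coded 0, 1 or 2."""
    return sum(genotypes) / (2 * len(genotypes))


def frequency_error(calls, gold):
    """Signed error of the estimated reference allele frequency.
    A positive value means the reference allele is overestimated."""
    return reference_allele_frequency(calls) - reference_allele_frequency(gold)
```

A SNP would count toward the "error greater than ±0.1" group when the absolute value returned by `frequency_error` exceeds 0.1, and a consistently positive sign across SNPs would correspond to the reference-allele bias the abstract describes.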
Abstract:
The prescription of opioid analgesics has risen sharply in North America over the past two decades. This increase has been accompanied by a rise in overdoses. The present study draws on administrative data collected from emergency department contacts to describe the epidemiology of opioid overdose in Ontario between 2002 and 2006 and to examine the role of regional variation in the availability of specialist care. The number of poisonings increased from 1250 (10.9 per 100,000) in FY2002 to 1816 (15.2 per 100,000) in FY2005. Local concentration of specialist physicians was significantly associated with the incidence of opioid overdose, inversely at most levels of availability, but positively at very high levels. Regional variation in incidence was also associated with demographics, median family income, and the rate of other drug poisonings. Policy options for limiting opioid-related harms are limited, but improvements in monitoring and clinical management may prove valuable.
Abstract:
The aim of this thesis is to extend bootstrap theory to panel data models. Panel data are obtained by observing several statistical units over several time periods. Their dual individual and temporal dimension makes it possible to control for unobservable heterogeneity across individuals and across time periods, and hence to carry out richer studies than time series or cross-sectional data allow. The advantage of the bootstrap is that it yields more accurate inference than classical asymptotic theory, or makes inference possible where it would otherwise be impossible because of a nuisance parameter. The method consists of drawing random samples that resemble the analysis sample as closely as possible. The statistical object of interest is estimated on each of these random samples, and the set of estimated values is used for inference. The literature contains some applications of the bootstrap to panel data without rigorous theoretical justification or under strong assumptions. This thesis proposes a bootstrap method better suited to panel data. The three chapters analyze its validity and its application. The first chapter posits a simple model with a single parameter and addresses the theoretical properties of the estimator of the mean. We show that the double resampling we propose, which takes both the individual and the temporal dimension into account, is valid for these models. Resampling in the individual dimension alone is not valid in the presence of temporal heterogeneity. Resampling in the temporal dimension alone is not valid in the presence of individual heterogeneity. The second chapter extends the first to the linear panel regression model.
Three types of regressors are considered: individual characteristics, temporal characteristics, and regressors that vary both over time and across individuals. Using a two-way error components model, the ordinary least squares estimator and the residual bootstrap method, we show that resampling in the individual dimension alone is valid for inference on the coefficients associated with regressors that vary only across individuals. Resampling in the temporal dimension is valid only for the subvector of parameters associated with regressors that vary only over time. Double resampling, for its part, is valid for inference on the whole parameter vector. The third chapter re-examines the difference-in-differences estimator exercise of Bertrand, Duflo and Mullainathan (2004). This estimator is commonly used in the literature to evaluate the impact of certain public policies. The empirical exercise uses panel data from the Current Population Survey on women's wages in the 50 states of the United States of America from 1979 to 1999. State-level pseudo-intervention variables are generated, and the tests are expected to conclude that these placebo policies have no effect on women's wages. Bertrand, Duflo and Mullainathan (2004) show that failing to account for heterogeneity and temporal dependence leads to substantial test size distortions when the impact of public policies is evaluated using panel data. One of the recommended solutions is to use the bootstrap. The double resampling method developed in this thesis corrects the test size problem and thus allows the impact of public policies to be evaluated correctly.
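The double resampling idea described in this abstract can be sketched in Python as follows. The sketch assumes a balanced panel stored as a list of rows, one row per individual and one column per period; it draws individuals and periods independently with replacement and forms a plain percentile interval for the overall mean. The function names, the percentile interval and the default settings are illustrative assumptions, not the thesis's exact procedure.

```python
import random


def double_resample(panel, rng):
    """Resample a balanced panel in both dimensions.
    panel[i][t] is the observation for individual i in period t."""
    n_ind, n_per = len(panel), len(panel[0])
    individuals = [rng.randrange(n_ind) for _ in range(n_ind)]
    periods = [rng.randrange(n_per) for _ in range(n_per)]
    return [[panel[i][t] for t in periods] for i in individuals]


def panel_mean(panel):
    """Overall mean of all observations in the panel."""
    total = sum(sum(row) for row in panel)
    return total / (len(panel) * len(panel[0]))


def bootstrap_mean_interval(panel, draws=2000, level=0.95, seed=0):
    """Percentile interval for the overall mean from double resampling."""
    rng = random.Random(seed)
    means = sorted(panel_mean(double_resample(panel, rng)) for _ in range(draws))
    lower = means[int((1 - level) / 2 * draws)]
    upper = means[int((1 + level) / 2 * draws) - 1]
    return lower, upper
```

Resampling in only one dimension corresponds to replacing either `individuals` or `periods` with the identity ordering, which is the case the abstract reports as invalid when heterogeneity is present in the other dimension.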
Abstract:
In the present study, the availability of satellite altimeter sea level data with good spatial and temporal resolution is exploited to describe and understand the circulation of the tropical Indian Ocean. The derived geostrophic circulations showed large variability at all scales. The seasonal cycle, described using a monthly climatology generated from 12 years of sea surface height (SSH) data from 1993 to 2004, revealed several new aspects of tropical Indian Ocean circulation. The interannual variability, presented in this study using monthly means of SSH data for 12 years, shows large year-to-year variability. The EOF analysis shows the influence of several periodic signals at annual and interannual scales, with the relative strengths of the signals also varying from year to year. Since one of the reasons for this kind of variability in circulation is the presence of planetary waves, this study discusses the influence of such waves on circulation by presenting two cases, one in the Arabian Sea and the other in the Bay of Bengal.
Abstract:
The average availability of a repairable system is the expected proportion of time that the system is operating in the interval [0, t]. The present article discusses the nonparametric estimation of the average availability when (i) the data on 'n' complete cycles of system operation are available, (ii) the data are subject to right censorship, and (iii) the process is observed up to a specified time 'T'. In each case, a nonparametric confidence interval for the average availability is also constructed. Simulations are conducted to assess the performance of the estimators.
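As a rough illustration of the quantities involved, the Python sketch below computes two simple empirical measures from a record of alternating operating and repair durations: the ratio of total up time to total time over complete cycles, and the fraction of [0, T] spent operating when the system is observed up to time T. It assumes the system starts in the operating state at time 0. These are plain plug-in quantities chosen for illustration; the article's own estimators and confidence intervals are not specified in the abstract and are not reproduced here.

```python
def cycle_availability(cycles):
    """Total up time divided by total time over complete (up, down) cycles."""
    up_total = sum(up for up, _ in cycles)
    down_total = sum(down for _, down in cycles)
    return up_total / (up_total + down_total)


def observed_uptime_fraction(cycles, horizon):
    """Fraction of [0, horizon] spent operating, for a system that starts
    operating at time 0 and alternates through the given (up, down) cycles."""
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    clock = 0.0
    up_time = 0.0
    for up, down in cycles:
        if clock >= horizon:
            break
        # Count only the part of this operating period that falls before the horizon.
        up_time += min(up, horizon - clock)
        clock += up + down
    if clock < horizon:
        raise ValueError("the record ends before the horizon")
    return up_time / horizon
```

Averaging `observed_uptime_fraction` over several independently observed systems would give an empirical counterpart of the expected proportion defined in the abstract, under the stated assumptions.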
Abstract:
Short summary: This study was undertaken to assess the diversity of plant resources utilized by the local population in south-western Madagascar, the social, ecological and biophysical conditions that drive their uses and availability, and possible alternative strategies for their sustainable use in the region. The study region, the ‘Mahafaly region’, located in south-western Madagascar, is one of the country’s most economically, educationally and climatically disadvantaged regions. With an arid steppe climate, agricultural production is limited by low water availability and low levels of soil nutrients and soil organic carbon. The region comprises the recently extended Tsimanampetsotsa National Park, with numerous sacred and community forests, which are threatened by slash-and-burn agriculture and overexploitation of forest resources. The present study analyzed the availability of wild yams and medicinal plants, and their importance for the livelihood of the local population in this region. An ethnobotanical survey was conducted recording the diversity, local knowledge and use of wild yams and medicinal plants utilized by the local communities in five villages in the Mahafaly region. A total of 250 households were randomly selected, followed by semi-structured interviews on the socio-economic characteristics of the households. The data allowed us to characterize sociocultural and socioeconomic factors that determine the local use of wild yams and medicinal plants, and to identify their role in the livelihoods of local people. Species-environment relationships and the current spatial distribution of the wild yams were investigated and predicted using ordination methods and a niche-based habitat modelling approach. Species response curves along edaphic gradients allowed us to understand the species' habitat requirements.
We thus investigated various alternative methods to enhance wild yam regeneration for local conservation and sustainable use in the Mahafaly region. Altogether, six species of wild yams and a total of 214 medicinal plant species from 68 families and 163 genera were identified in the study region. Results of the cluster and discriminant analysis indicated a clear pattern of resource use, yielding two groups of households characterized by differences in livestock numbers, off-farm activities, agricultural land and harvests. A generalized linear model highlighted that economic factors significantly affect the collection intensity of wild yams, while the use of medicinal plants depends to a higher degree on socio-cultural factors. The gradient analysis of the distribution of the wild yam species revealed a clear pattern for species habitats. Species models based on NPMR (Nonparametric Multiplicative Regression analysis) indicated the importance of vegetation structure, human interventions, and soil characteristics in determining wild yam species distribution. The prediction of the current availability of wild yam resources showed that abundant wild yam resources are scarce and face high harvest intensity. Experiments on yam cultivation revealed that germination of seeds was enhanced by using pre-germination treatments before planting, and that vegetative regeneration performed better with the upper part of the tubers (corms) than with the sets of tubers. In-situ regeneration was possible for the upper parts of the wild tubers, but the success depended significantly on the type of soil. The use of manure (10-20 t ha⁻¹) increased the yield of D. alata and D. alatipes by 40%. We thus suggest the promotion of other cultivated varieties of D. alata found in regions neighbouring the Mahafaly Plateau.
Abstract:
This paper presents a study of connection availability in GMPLS over optical transport networks (OTN), taking into account different network topologies. Two basic path protection schemes are considered and compared with the no-protection case. The selected topologies are heterogeneous in geographic coverage, network diameter, link lengths, and average node degree. Connection availability is also computed considering the reliability data of physical components and a well-known network availability model. Results show several correspondences between suitable path protection algorithms and several network topology characteristics.
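A common way to turn component reliability data into connection availability is the series-parallel calculation sketched below in Python. It assumes independent failures, a steady-state component availability of MTTF / (MTTF + MTTR), an unprotected path modelled as components in series, and a protected connection modelled as two fully disjoint paths in parallel. The abstract does not name its availability model, so this is an illustrative assumption, and the function names are hypothetical.

```python
from math import prod


def component_availability(mttf_hours, mttr_hours):
    """Steady-state availability from mean time to failure and to repair."""
    return mttf_hours / (mttf_hours + mttr_hours)


def path_availability(component_availabilities):
    """A path works only if every component on it works (series system)."""
    return prod(component_availabilities)


def protected_availability(working_path, backup_path):
    """A connection with a disjoint backup fails only if both paths fail."""
    return 1.0 - (1.0 - working_path) * (1.0 - backup_path)
```

Under these assumptions, longer paths with more components lower the series product, which is one way topology characteristics such as link length and network diameter can feed into the comparison between protected and unprotected connections.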
Abstract:
Speaker: Dr Kieron O'Hara Organiser: Time: 04/02/2015 11:00-11:45 Location: B32/3077 Abstract In order to reap the potential societal benefits of big and broad data, it is essential to share and link personal data. However, privacy and data protection considerations mean that, to be shared, personal data must be anonymised, so that the data subject cannot be identified from the data. Anonymisation is therefore a vital tool for data sharing, but deanonymisation, or reidentification, is always possible given sufficient auxiliary information (and as the amount of data grows, both in terms of creation, and in terms of availability in the public domain, the probability of finding such auxiliary information grows). This creates issues for the management of anonymisation, which are exacerbated not only by uncertainties about the future, but also by misunderstandings about the process(es) of anonymisation. This talk discusses these issues in relation to privacy, risk management and security, reports on recent theoretical tools created by the UKAN network of statistics professionals (on which the author is one of the leads), and asks how long anonymisation can remain a useful tool, and what might replace it.