832 results for databases and data mining
Abstract:
Title: Data-Driven Text Generation using Neural Networks
Speaker: Pavlos Vougiouklis, University of Southampton
Abstract: Recent work on neural networks shows their great potential for tackling a wide variety of Natural Language Processing (NLP) tasks. This talk will focus on the Natural Language Generation (NLG) problem and, more specifically, on the extent to which neural network language models can be employed for context-sensitive and data-driven text generation. In addition, a neural network architecture for response generation in social media will be discussed, along with the training methods that enable it to capture contextual information and participate effectively in public conversations.
Speaker Bio: Pavlos Vougiouklis obtained his 5-year Diploma in Electrical and Computer Engineering from the Aristotle University of Thessaloniki in 2013. He was awarded an MSc degree in Software Engineering from the University of Southampton in 2014. In 2015, he joined the Web and Internet Science (WAIS) research group of the University of Southampton, where he is currently working towards a PhD in the field of Neural Network Approaches for Natural Language Processing.

Title: Provenance is Complicated and Boring — Is there a solution?
Speaker: Darren Richardson, University of Southampton
Abstract: Paper trails, auditing, and accountability — arguably not the sexiest terms in computer science. But then you discover that you've possibly been eating horse-meat, and the importance of provenance becomes almost palpable. Having accepted that we should be creating provenance-enabled systems, the challenge of then communicating that provenance to casual users is not trivial: users should not need a detailed working knowledge of your system, and they certainly shouldn't be expected to understand the data model. So how, then, do you give users an insight into the provenance without having to build a bespoke system for each and every provenance installation?
Speaker Bio: Darren is a final-year Computer Science PhD student. He completed his undergraduate degree in Electronic Engineering at Southampton in 2012.
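To make the idea of a neural network language model concrete, here is a minimal Python sketch of recurrent next-token generation. The vocabulary, dimensions, and untrained random weights are illustrative assumptions; this is not the speaker's architecture.

```python
# A minimal sketch of neural next-token generation. NOT the talk's
# model: vocabulary, dimensions, and random weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<s>", "hello", "how", "are", "you", "today", "</s>"]
V, E, H = len(vocab), 8, 16          # vocab size, embedding dim, hidden dim

# Randomly initialised parameters; a real model would learn these from data.
W_e = rng.normal(0, 0.1, (V, E))     # token embeddings
W_h = rng.normal(0, 0.1, (E + H, H)) # recurrent weights
W_o = rng.normal(0, 0.1, (H, V))     # output projection

def step(token_id, h):
    """One RNN step: embed the token, update the hidden state,
    and return a probability distribution over the next token."""
    x = np.concatenate([W_e[token_id], h])
    h_new = np.tanh(x @ W_h)
    logits = h_new @ W_o
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), h_new

# Greedy generation from the start symbol.
h = np.zeros(H)
tok = vocab.index("<s>")
for _ in range(5):
    probs, h = step(tok, h)
    tok = int(probs.argmax())
    print(vocab[tok])
```

Conditioning such a model on conversational context, as in the talk, amounts to initialising or augmenting the hidden state with an encoding of the preceding utterances.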
Abstract:
Colombia regulates the coal mining sector in its legislation; however, these strategies are considered insufficient for the identification, prevention and control of occupational accidents and illness. During 2013 the fatality rate was 1.59. Statistics from 2004 show that pneumoconioses were the leading causes of disability of occupational origin. Objective: To categorize intervention activities for the promotion and prevention of occupational accidents and illness among coal mining workers. Methodology: A literature review on coal mining and health was carried out using the PUBMED, ScienceDirect, VHL and SINAB databases, covering literature published without year limits, in English, Spanish or Portuguese. The search used controlled-vocabulary terms (MeSH terms) and peer review of titles and abstracts. Publications were selected for full-text review under inclusion and exclusion criteria. The codes considered for this review were: a) country where the intervention took place, b) occupational health, c) accident prevention, d) promotion programmes, e) technologies, f) results obtained. Results: Of the 2,500 articles selected by the lead authors, the first 300 were reviewed; 32 address occupational health and coal mining, and 10 contain interventions considered relevant to this literature review. We present interventions that are statistically significant (p < 0.05) and that have demonstrated a positive impact in coal mining on the promotion and prevention of occupational accidents and illness. Conclusions: Four types of intervention were identified: 1) educational interventions, covering participatory training, training by means of "degraded image" techniques, self-management and feedback on the use of personal protective equipment (PPE), and application of the Extended Parallel Process Model; 2) preventive interventions, such as breathalyser testing before shifts, the presence of nursing staff in coal mines, and recognition of disease predictors to optimize primary prevention; 3) surveillance interventions, such as the methodology promoted by the European Statistics on Accidents at Work (ESAW) for investigating occupational accidents, and application of the World Health Organization (WHO) recommendations for detecting pneumoconiosis; and 4) technological interventions, consisting of task-level changes based on the results of software developed by the National Institute for Occupational Safety and Health (NIOSH). These interventions have proven effective in the promotion and prevention of occupational accidents and illness, and their application in Colombia is therefore recommended, subject to cost-effectiveness analysis.
Abstract:
This paper gives an overview of the project Changing Coastlines: data assimilation for morphodynamic prediction and predictability. This project is investigating whether data assimilation could be used to improve coastal morphodynamic modeling. The concept of data assimilation is described, and the benefits that data assimilation could bring to coastal morphodynamic modeling are discussed. Application of data assimilation in a simple 1D morphodynamic model is presented. This shows that data assimilation can be used to improve the current state of the model bathymetry, and to tune the model parameter. We now intend to implement these ideas in a 2D morphodynamic model, for two study sites. The logistics of this are considered, including model design and implementation, and data requirement issues. We envisage that this work could provide a means for maintaining up-to-date information on coastal bathymetry, without the need for costly survey campaigns. This would be useful for a range of coastal management issues, including coastal flood forecasting.
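For reference, the kind of correction at the heart of such an assimilation step can be written in its standard best-linear-unbiased-estimate form (a textbook formulation; the project's exact scheme may differ):

```latex
% Standard analysis update, as used in optimal-interpolation-style
% assimilation. x_b: background (model) state, y: observations,
% H: observation operator, B, R: background and observation error
% covariance matrices, x_a: analysis state.
\[
  \mathbf{x}_a = \mathbf{x}_b + \mathbf{K}\,(\mathbf{y} - \mathbf{H}\mathbf{x}_b),
  \qquad
  \mathbf{K} = \mathbf{B}\mathbf{H}^{\mathsf{T}}
  \left(\mathbf{H}\mathbf{B}\mathbf{H}^{\mathsf{T}} + \mathbf{R}\right)^{-1}
\]
```

Tuning a model parameter fits the same framework by augmenting the state vector with that parameter, so the update corrects the bathymetry and the parameter together.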
Abstract:
Soil data and reliable soil maps are imperative for environmental management, conservation and policy. Data from historical point surveys, e.g. experiment site data and farmers' fields, can serve this purpose. However, legacy soil information was not necessarily collected for spatial analysis and mapping, so the data may not have immediately useful geo-references. Methods are required to utilise these historical soil databases so that we can produce quantitative maps of soil properties, both to assess spatial and temporal trends and to assess where future sampling is required. This paper discusses two such databases: the Representative Soil Sampling Scheme, which has monitored the agricultural soil in England and Wales from 1969 to 2003 (between 400 and 900 bulked soil samples were taken annually from different agricultural fields); and the former State Chemistry Laboratory, Victoria, Australia, where between 1973 and 1994 approximately 80,000 soil samples were submitted for analysis by farmers. Previous statistical analyses have been performed using administrative regions (with sharp boundaries) for both databases, which are largely unrelated to natural features. For a more detailed spatial analysis that can be linked to climate and terrain attributes, gradual variation of these soil properties should be described. Geostatistical techniques such as ordinary kriging are suited to this. This paper describes the format of the databases and initial approaches as to how they can be used for digital soil mapping. For this paper we have selected soil pH to illustrate the analyses for both databases.
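As a concrete illustration of the geostatistical approach mentioned above, the following minimal Python sketch performs ordinary kriging of soil pH at one unsampled location. The sample locations, pH values, and semivariogram parameters are invented for illustration and are not fitted to either database.

```python
# Ordinary kriging of soil pH at an unsampled location, assuming an
# exponential semivariogram whose parameters (nugget, sill, range) are
# illustrative, not fitted to the databases described above.
import numpy as np

def semivariogram(h, nugget=0.05, sill=0.6, rng_=200.0):
    """Exponential semivariogram model."""
    return nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * h / rng_))

# Hypothetical sample locations (metres) and measured soil pH values.
xy = np.array([[0, 0], [100, 40], [60, 180], [220, 120]], dtype=float)
ph = np.array([5.8, 6.2, 6.9, 6.5])
target = np.array([120.0, 100.0])

n = len(ph)
# Kriging system: semivariogram matrix plus the unbiasedness constraint.
d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
A = np.ones((n + 1, n + 1))
A[:n, :n] = semivariogram(d)
np.fill_diagonal(A[:n, :n], 0.0)   # gamma(0) = 0 by convention
A[n, n] = 0.0
b = np.ones(n + 1)
b[:n] = semivariogram(np.linalg.norm(xy - target, axis=1))

w = np.linalg.solve(A, b)          # weights plus Lagrange multiplier
estimate = w[:n] @ ph
variance = w @ b                   # kriging variance
print(f"pH estimate: {estimate:.2f}, kriging variance: {variance:.3f}")
```

The kriging variance is what makes the technique useful for deciding where future sampling is required: locations with high variance are poorly constrained by the existing legacy data.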
Abstract:
Trace elements may present an environmental hazard in the vicinity of mining and smelting activities. However, the factors controlling trace element distribution in soils around ancient and modern mining and smelting areas are not always clear. Tharsis, Riotinto and Huelva are located in the Iberian Pyrite Belt in SW Spain. The Tharsis and Riotinto mines have been exploited since 2500 B.C., with intensive smelting taking place. Huelva, established in 1970 and using the Flash Furnace Outokumpu process, is currently one of the largest smelters in the world. Pyrite and chalcopyrite ore have been intensively smelted for Cu. However, unusually for smelters and mines of a similar size, elevated trace element concentrations in soils were found to be restricted to the immediate vicinity of the mines and smelters, occurring up to a maximum of 2 km from them at Tharsis, Riotinto and Huelva. Trace element partitioning (over 2/3 of trace elements found in the residual immobile fraction of soils at Tharsis) and examination of soil particles by SEM-EDX showed that trace elements were not adsorbed onto soil particles, but were included within the matrix of large trace-element-rich Fe silicate slag particles (approximately 1 mm in diameter, containing at least 1 wt.% As, Cu and Zn, and 2 wt.% Pb). The large size of the slag particles (1 mm diameter) was found to control the geographically restricted trace element distribution in soils at Tharsis, Riotinto and Huelva, since large heavy particles could not have been transported long distances. Distribution and partitioning indicated that environmental impacts from mining and smelting should remain minimal in the region.
Abstract:
Pressing global environmental problems highlight the need to develop tools to measure progress towards "sustainability." However, some argue that any such attempt inevitably reflects the views of those creating such tools and only produces highly contested notions of "reality." To explore this tension, we critically assess the Environmental Sustainability Index (ESI), a well-publicized product of the World Economic Forum that is designed to measure 'sustainability' by ranking nations on league tables based on extensive databases of environmental indicators. By recreating this index, and then using statistical tools (principal components analysis) to test relations between various components of the index, we challenge the ways in which countries are ranked in the ESI. Based on this analysis, we suggest (1) that the approach taken to aggregate, interpret and present the ESI creates a misleading impression that Western countries are more sustainable than the developing world; (2) that unaccounted methodological biases allowed the authors of the ESI to over-generalize the relative 'sustainability' of different countries; and (3) that this has resulted in simplistic conclusions on the relation between economic growth and environmental sustainability. This criticism should not be interpreted as a call for the abandonment of efforts to create standardized comparable data. Instead, this paper proposes that indicator selection and data collection should draw on a range of voices, including local stakeholders as well as international experts. We also propose that aggregating data into final league ranking tables is too prone to error and creates the illusion of absolute and categorical interpretations.
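To illustrate the statistical step, the following minimal Python sketch runs a principal components analysis over a synthetic country-by-indicator matrix; the data are random stand-ins, not the actual ESI indicators.

```python
# A minimal sketch of PCA used to probe relations between components of
# a composite index such as the ESI. The indicator values are synthetic
# stand-ins, not the actual ESI data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Rows: countries; columns: hypothetical indicator scores
# (e.g. air quality, water stress, biodiversity, environmental health).
X = rng.normal(size=(30, 4))

# Standardize so each indicator contributes comparably, then project.
Z = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(Z)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("loadings (components x indicators):")
print(pca.components_)
```

Inspecting the loadings shows which indicators dominate each component; if a single component driven by wealth-related indicators explains most of the variance, an aggregate ranking built on it largely reflects that component rather than "sustainability" as a whole.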
Abstract:
In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments such as computational grids.
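The receiver-initiated load balancing idea can be sketched on a single machine, with threads standing in for peers; the task encoding and expansion rule below are illustrative assumptions, not the paper's algorithm.

```python
# Receiver-initiated load balancing over an irregular search tree.
# Threads stand in for peers; payloads and the "expand" rule are
# illustrative, not the paper's algorithm.
import queue
import threading

NUM_WORKERS = 4
results = []
results_lock = threading.Lock()
local_queues = [queue.Queue() for _ in range(NUM_WORKERS)]

def expand(task):
    """Expand one search-tree node; returns child tasks (irregular fan-out)."""
    depth, value = task
    if depth >= 3:                      # leaf: record a "frequent pattern"
        with results_lock:
            results.append(value)
        return []
    return [(depth + 1, value * 2 + i) for i in range(depth + 1)]

def worker(wid):
    idle_polls = 0
    while idle_polls < 20:
        try:
            task = local_queues[wid].get_nowait()
            idle_polls = 0
        except queue.Empty:
            # Receiver-initiated: the *idle* worker asks a neighbour for work.
            donor = (wid + 1) % NUM_WORKERS
            try:
                task = local_queues[donor].get_nowait()
                idle_polls = 0
            except queue.Empty:
                idle_polls += 1
                continue
        for child in expand(task):
            local_queues[wid].put(child)

local_queues[0].put((0, 1))             # root of the search tree
threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(results)} leaves found")
```

Having the idle node initiate the transfer suits irregular search trees: busy nodes never pause to estimate or advertise workload, which is exactly the quantity that cannot be predicted reliably here.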
Progress on “Changing coastlines: data assimilation for morphodynamic prediction and predictability”
Abstract:
The task of assessing the likelihood and extent of coastal flooding is hampered by the lack of detailed information on near-shore bathymetry. This is required as an input for coastal inundation models, and in some cases the variability in the bathymetry can affect the prediction of those areas likely to be flooded in a storm. The constant monitoring and data collection that would be required to characterise the near-shore bathymetry over large coastal areas is impractical, leaving the option of running morphodynamic models to predict the likely bathymetry at any given time. However, if the models are inaccurate, the errors may be significant if incorrect bathymetry is used to predict possible flood risks. This project is assessing the use of data assimilation techniques to improve the predictions from a simple model, by rigorously incorporating observations of the bathymetry into the model to bring it closer to the actual situation. Currently we are concentrating on Morecambe Bay as a primary study site, as it has a highly dynamic inter-tidal zone, with changes in the course of channels in this zone affecting the likely locations of flooding from storms. We are working with SAR images, LiDAR, and swath bathymetry to give us observations over a 2.5-year period running from May 2003 to November 2005. We have a LiDAR image of the entire inter-tidal zone for November 2005 to use as validation data. We have implemented a 3D-Var data assimilation scheme to investigate the improvements in performance compared to the previous scheme, which was based on the optimal interpolation method. We are currently evaluating these different data assimilation techniques using 22 SAR observations. We will also include the LiDAR data and swath bathymetry to improve the observational coverage, and investigate the impact of different types of observation on the predictive ability of the model. We are also assessing the ability of the data assimilation scheme to recover the correct bathymetry after storm events, which can dramatically change the bathymetry in a short period of time.
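For reference, 3D-Var finds the analysis state by minimising the standard variational cost function (textbook form; the project's operational implementation may differ):

```latex
% 3D-Var cost function: x_b is the background bathymetry, y the
% observations (e.g. waterlines derived from SAR), H the observation
% operator, and B, R the background and observation error covariances.
\[
  J(\mathbf{x}) =
  \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathsf{T}}\mathbf{B}^{-1}(\mathbf{x}-\mathbf{x}_b)
  + \tfrac{1}{2}\,\bigl(\mathbf{y}-H(\mathbf{x})\bigr)^{\mathsf{T}}\mathbf{R}^{-1}\bigl(\mathbf{y}-H(\mathbf{x})\bigr)
\]
```

Unlike optimal interpolation, which applies a fixed-gain correction, 3D-Var minimises this cost function iteratively, which makes it easier to combine heterogeneous observation types such as SAR waterlines, LiDAR and swath bathymetry in a single analysis.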
Abstract:
Recently, two approaches have been introduced that distribute the molecular fragment mining problem. The first approach applies a master/worker topology; the second, a completely distributed peer-to-peer system, solves the scalability problem caused by the bottleneck at the master node. However, in many real-world scenarios the participating computing nodes cannot communicate directly due to administrative policies such as security restrictions. Thus, potential computing power is not accessible to accelerate the mining run. To overcome this shortcoming, this work introduces a hierarchical topology of computing resources, which distributes the management over several levels and adapts to the natural structure of those multi-domain architectures. The most important aspect is the load balancing scheme, which has been designed and optimized for the hierarchical structure. The approach allows dynamic aggregation of heterogeneous computing resources and is applied to wide area network scenarios.
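The shape of such a hierarchy can be sketched as follows; the node names, the two-level layout, and the "least-loaded child" dispatch rule are illustrative assumptions, not the paper's exact scheme.

```python
# A hierarchical topology for distributing mining tasks across
# administrative domains: managers forward work down the tree, so
# leaf nodes never need to talk across domain boundaries.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    tasks: list = field(default_factory=list)

    def load(self):
        """Total number of tasks in this subtree."""
        return len(self.tasks) + sum(c.load() for c in self.children)

    def submit(self, task):
        if not self.children:            # leaf: an actual compute node
            self.tasks.append(task)
        else:                            # manager: forward to least-loaded child
            min(self.children, key=Node.load).submit(task)

# Two administrative domains, each managed locally; only the managers
# need to communicate with the root, matching multi-domain policies.
root = Node("root", children=[
    Node("domain-A", children=[Node("a1"), Node("a2")]),
    Node("domain-B", children=[Node("b1"), Node("b2"), Node("b3")]),
])

for t in range(10):
    root.submit(f"fragment-subtree-{t}")

for dom in root.children:
    print(dom.name, [(c.name, len(c.tasks)) for c in dom.children])
```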
Abstract:
In real-world applications, sequential data mining and data exploration algorithms are often unsuitable for datasets of enormous size, high dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is a need to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well-known National Cancer Institute's HIV-screening dataset. We present experimental results on a small-scale computing environment.
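What makes a fragment "discriminative" can be shown in a few lines: it is frequent in active compounds but rare in inactive ones. In this sketch, molecules are represented as sets of precomputed fragment identifiers purely for illustration; a real miner would enumerate subgraphs itself, and the thresholds are arbitrary.

```python
# Discriminative fragment selection: frequent among actives, rare
# among inactives. Fragment sets and thresholds are illustrative.
active = [
    {"C-C", "C=O", "C-N"},
    {"C-C", "C=O", "ring6"},
    {"C-C", "C=O"},
]
inactive = [
    {"C-C", "ring6"},
    {"C-C", "C-N"},
]

def support(fragment, molecules):
    """Fraction of molecules containing the fragment."""
    return sum(fragment in m for m in molecules) / len(molecules)

candidates = set().union(*active, *inactive)
for frag in sorted(candidates):
    s_act, s_inact = support(frag, active), support(frag, inactive)
    if s_act >= 0.6 and s_inact <= 0.2:      # illustrative thresholds
        print(f"{frag}: active support {s_act:.2f}, "
              f"inactive support {s_inact:.2f} -> discriminative")
```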
Abstract:
This chapter describes the present status and future prospects for transgenic (genetically modified) crops. It concentrates on the most recent data obtained from patent databases and field trial applications, as well as the usual scientific literature. By these means, it is possible to obtain a useful perspective into future commercial products and international trends. The various research areas are subdivided on the basis of those associated with input (agronomic) traits and those concerned with output (e.g., food quality) characteristics. Among the former group are new methods of improving stress resistance, and among the latter are many examples of producing pharmaceutical compounds in plants.
Abstract:
Event-related functional magnetic resonance imaging (efMRI) has emerged as a powerful technique for detecting the brain's responses to presented stimuli. A primary goal in efMRI data analysis is to estimate the Hemodynamic Response Function (HRF) and to locate activated regions in the human brain when specific tasks are performed. This paper develops new methodologies that are important improvements not only to parametric but also to nonparametric estimation and hypothesis testing of the HRF. First, an effective and computationally fast scheme for estimating the error covariance matrix for efMRI is proposed. Second, methodologies for estimation and hypothesis testing of the HRF are developed. Simulations support the effectiveness of our proposed methods. When applied to an efMRI dataset from an emotional control study, our method reveals more meaningful findings than the popular methods offered by AFNI and FSL.
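To make the estimation problem concrete, here is a minimal sketch of nonparametric HRF estimation by least squares: the stimulus onsets define a design matrix, and the HRF samples are the regression coefficients. The onset times, TR, HRF length and simulated signal are illustrative assumptions; the paper's methods additionally handle correlated noise via the error covariance matrix.

```python
# Nonparametric HRF estimation by least-squares deconvolution: build a
# design matrix from stimulus onsets and solve y = X h + noise for h.
import numpy as np

rng = np.random.default_rng(2)

n_scans, hrf_len = 120, 12
stimulus = np.zeros(n_scans)
stimulus[[5, 25, 47, 68, 90]] = 1.0     # event onsets (in scans)

# "True" HRF used only to simulate data for the demonstration.
t = np.arange(hrf_len)
true_hrf = t**2 * np.exp(-t)            # crude gamma-like shape
true_hrf /= true_hrf.max()

# Convolution expressed as a design matrix X, so y = X h + noise.
X = np.column_stack([np.roll(stimulus, k) * (np.arange(n_scans) >= k)
                     for k in range(hrf_len)])
y = X @ true_hrf + rng.normal(0, 0.1, n_scans)

hrf_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated HRF peak at lag", int(hrf_hat.argmax()))
```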
Abstract:
This work analyzes the use of linear discriminant models, multi-layer perceptron neural networks and wavelet networks for corporate financial distress prediction. Although simple and easy to interpret, linear models require statistical assumptions that may be unrealistic. Neural networks are able to discriminate patterns that are not linearly separable, but the large number of parameters involved in a neural model often causes generalization problems. Wavelet networks are classification models that implement nonlinear discriminant surfaces as the superposition of dilated and translated versions of a single "mother wavelet" function. In this paper, an algorithm is proposed to select dilation and translation parameters that yield a wavelet network classifier with good parsimony characteristics. The models are compared in a case study involving failed and continuing British firms in the period 1997-2000. Problems associated with over-parameterized neural networks are illustrated and the Optimal Brain Damage pruning technique is employed to obtain a parsimonious neural model. The results, supported by a re-sampling study, show that both neural and wavelet networks may be a valid alternative to classical linear discriminant models.
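The central idea of the wavelet network can be shown directly: a discriminant surface built as a weighted superposition of dilated and translated copies of one mother wavelet. In this sketch the weights, dilations and translations are fixed by hand for illustration; the paper's contribution is precisely an algorithm for selecting them parsimoniously from data.

```python
# A wavelet network discriminant: a weighted sum of dilated and
# translated copies of a single "mother wavelet" (here the Mexican
# hat), thresholded at zero for classification.
import numpy as np

def mexican_hat(u):
    """Mother wavelet: second derivative of a Gaussian (up to sign/scale)."""
    return (1.0 - u**2) * np.exp(-u**2 / 2.0)

# Each unit: (weight, translation vector, dilation); inputs are 2-D.
units = [
    ( 1.5, np.array([0.0, 0.0]), 1.0),
    (-0.8, np.array([2.0, 1.0]), 0.5),
    ( 0.6, np.array([-1.0, 2.0]), 2.0),
]

def wavenet(x):
    """Nonlinear discriminant surface as a superposition of wavelets."""
    s = sum(w * mexican_hat(np.linalg.norm((x - t) / d)) for w, t, d in units)
    return 1 if s > 0 else 0            # class label from the sign

for point in [np.array([0.1, -0.2]), np.array([2.0, 1.1]), np.array([5.0, 5.0])]:
    print(point, "->", wavenet(point))
```

Because the mother wavelet is localised, each unit shapes the surface only near its translation centre, which is what gives the model its parsimony compared with a fully connected perceptron.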
Abstract:
Data on the potential health benefits of dietary flavanols and procyanidins, especially in the context of cardiovascular health, are considerable and continue to accumulate. Significant progress has been made in flavanol analytics and the creation of phytonutrient-content food databases, and novel data emanated from epidemiological investigations as well as dietary intervention studies. However, a comprehensive understanding of the pharmacological properties of flavanols and procyanidins, including their precise mechanisms of action in vivo, and a conclusive, consensus-based accreditation of a causal relationship between intake and health benefits in the context of primary and secondary cardiovascular disease prevention is still outstanding. Thus, the objective of this review is to identify and discuss key questions and gaps that will need to be addressed in order to conclusively demonstrate whether or not dietary flavanols and procyanidins have a role in preventing, delaying the onset of, or treating cardiovascular diseases, and thus improving human life expectancy and quality of life.
Abstract:
Advances in hardware and software in the past decade allow us to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged from these advances to cope with the real-time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data from continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, some applications require the data to be analysed in real time as soon as it is captured, for example when the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drift. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm: an algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule-based adaptive classifier for data streams, based on an evolving set of rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new rules and removing old ones. It differs from the more popular decision tree based classifiers in that it tends to leave data instances unclassified rather than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.
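The two behaviours highlighted above, abstaining rather than guessing and pruning rules whose accuracy decays under concept drift, can be sketched in a few lines of Python. The rule format, thresholds and drift simulation below are illustrative assumptions, not the published eRules algorithm.

```python
# An adaptive rule-based stream classifier in the spirit of eRules:
# rules abstain when none fires, and rules whose accuracy decays
# (e.g. after concept drift) are removed from the evolving set.
from dataclasses import dataclass

@dataclass
class Rule:
    feature: str
    value: str
    label: str
    hits: int = 0
    correct: int = 0

    def fires(self, x):
        return x.get(self.feature) == self.value

    def accuracy(self):
        return self.correct / self.hits if self.hits else 1.0

rules = [Rule("colour", "red", "stop"), Rule("colour", "green", "go")]

def classify(x):
    for r in rules:
        if r.fires(x):
            return r.label
    return None                          # abstain rather than guess

def update(x, y):
    """Score rules on the labelled instance and prune decayed ones."""
    global rules
    for r in rules:
        if r.fires(x):
            r.hits += 1
            r.correct += (r.label == y)
    rules = [r for r in rules if r.hits < 5 or r.accuracy() >= 0.6]

# Simulated concept drift: after time 10, "green" starts meaning "stop".
stream = [({"colour": "green"}, "go")] * 10 + [({"colour": "green"}, "stop")] * 10
for i, (x, y) in enumerate(stream):
    print(i, classify(x), "truth:", y)
    update(x, y)
print("surviving rules:", [(r.feature, r.value, r.label) for r in rules])
```

Running the sketch shows the "green -> go" rule being pruned a few instances after the drift, after which the classifier abstains on green instances until a new rule is induced, which is the safer behaviour the abstract describes.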