838 resultados para text and data mining
Resumo:
Surface temperature is a key aspect of weather and climate, but the term may refer to different quantities that play interconnected roles and are observed by different means. In a community-based activity in June 2012, the EarthTemp Network brought together 55 researchers from five continents to improve the interaction between scientific communities who focus on surface temperature in particular domains, to exploit the strengths of different observing systems and to better meet the needs of different communities. The workshop identified key needs for progress towards meeting scientific and societal requirements for surface temperature understanding and information, which are presented in this community paper. A "whole-Earth" perspective is required with more integrated, collaborative approaches to observing and understanding Earth's various surface temperatures. It is necessary to build understanding of the relationships between different surface temperatures, where presently inadequate, and undertake large-scale systematic intercomparisons. Datasets need to be easier to obtain and exploit for a wide constituency of users, with the differences and complementarities communicated in readily understood terms, and realistic and consistent uncertainty information provided. Steps were also recommended to curate and make available data that are presently inaccessible, develop new observing systems and build capacities to accelerate progress in the accuracy and usability of surface temperature datasets.
Resumo:
Expert systems have been increasingly popular for commercial importance. A rule based system is a special type of an expert system, which consists of a set of ‘if-then‘ rules and can be applied as a decision support system in many areas such as healthcare, transportation and security. Rule based systems can be constructed based on both expert knowledge and data. This paper aims to introduce the theory of rule based systems especially on categorization and construction of such systems from a conceptual point of view. This paper also introduces rule based systems for classification tasks in detail.
Resumo:
The DIAMET (DIAbatic influences on Mesoscale structures in ExTratropical storms) project aims to improve forecasts of high-impact weather in extratropical cyclones through field measurements, high-resolution numerical modeling, and improved design of ensemble forecasting and data assimilation systems. This article introduces DIAMET and presents some of the first results. Four field campaigns were conducted by the project, one of which, in late 2011, coincided with an exceptionally stormy period marked by an unusually strong, zonal North Atlantic jet stream and a succession of severe windstorms in northwest Europe. As a result, December 2011 had the highest monthly North Atlantic Oscillation index (2.52) of any December in the last 60 years. Detailed observations of several of these storms were gathered using the UK’s BAe146 research aircraft and extensive ground-based measurements. As an example of the results obtained during the campaign, observations are presented of cyclone Friedhelm on 8 December 2011, when surface winds with gusts exceeding 30 m s-1 crossed central Scotland, leading to widespread disruption to transportation and electricity supply. Friedhelm deepened 44 hPa in 24 hours and developed a pronounced bent-back front wrapping around the storm center. The strongest winds at 850 hPa and the surface occurred in the southern quadrant of the storm, and detailed measurements showed these to be most intense in clear air between bands of showers. High-resolution ensemble forecasts from the Met Office showed similar features, with the strongest winds aligned in linear swaths between the bands, suggesting that there is potential for improved skill in forecasts of damaging winds.
Resumo:
Twitter is both a micro-blogging service and a platform for public conversation. Direct conversation is facilitated in Twitter through the use of @’s (mentions) and replies. While the conversational element of Twitter is of particular interest to the marketing sector, relatively few data-mining studies have focused on this area. We analyse conversations associated with reciprocated mentions that take place in a data-set consisting of approximately 4 million tweets collected over a period of 28 days that contain at least one mention. We ignore tweet content and instead use the mention network structure and its dynamical properties to identify and characterise Twitter conversations between pairs of users and within larger groups. We consider conversational balance, meaning the fraction of content contributed by each party. The goal of this work is to draw out some of the mechanisms driving conversation in Twitter, with the potential aim of developing conversational models.
Resumo:
The Finnish Meteorological Institute, in collaboration with the University of Helsinki, has established a new ground-based remote-sensing network in Finland. The network consists of five topographically, ecologically and climatically different sites distributed from southern to northern Finland. The main goal of the network is to monitor air pollution and boundary layer properties in near real time, with a Doppler lidar and ceilometer at each site. In addition to these operational tasks, two sites are members of the Aerosols, Clouds and Trace gases Research InfraStructure Network (ACTRIS); a Ka band cloud radar at Sodankylä will provide cloud retrievals within CloudNet, and a multi-wavelength Raman lidar, PollyXT (POrtabLe Lidar sYstem eXTended), in Kuopio provides optical and microphysical aerosol properties through EARLINET (the European Aerosol Research Lidar Network). Three C-band weather radars are located in the Helsinki metropolitan area and are deployed for operational and research applications. We performed two inter-comparison campaigns to investigate the Doppler lidar performance, compare the backscatter signal and wind profiles, and to optimize the lidar sensitivity through adjusting the telescope focus length and data-integration time to ensure sufficient signal-to-noise ratio (SNR) in low-aerosol-content environments. In terms of statistical characterization, the wind-profile comparison showed good agreement between different lidars. Initially, there was a discrepancy in the SNR and attenuated backscatter coefficient profiles which arose from an incorrectly reported telescope focus setting from one instrument, together with the need to calibrate. After diagnosing the true telescope focus length, calculating a new attenuated backscatter coefficient profile with the new telescope function and taking into account calibration, the resulting attenuated backscatter profiles all showed good agreement with each other. It was thought that harsh Finnish winters could pose problems, but, due to the built-in heating systems, low ambient temperatures had no, or only a minor, impact on the lidar operation – including scanning-head motion. However, accumulation of snow and ice on the lens has been observed, which can lead to the formation of a water/ice layer thus attenuating the signal inconsistently. Thus, care must be taken to ensure continuous snow removal.
Resumo:
Classical regression methods take vectors as covariates and estimate the corresponding vectors of regression parameters. When addressing regression problems on covariates of more complex form such as multi-dimensional arrays (i.e. tensors), traditional computational models can be severely compromised by ultrahigh dimensionality as well as complex structure. By exploiting the special structure of tensor covariates, the tensor regression model provides a promising solution to reduce the model’s dimensionality to a manageable level, thus leading to efficient estimation. Most of the existing tensor-based methods independently estimate each individual regression problem based on tensor decomposition which allows the simultaneous projections of an input tensor to more than one direction along each mode. As a matter of fact, multi-dimensional data are collected under the same or very similar conditions, so that data share some common latent components but can also have their own independent parameters for each regression task. Therefore, it is beneficial to analyse regression parameters among all the regressions in a linked way. In this paper, we propose a tensor regression model based on Tucker Decomposition, which identifies not only the common components of parameters across all the regression tasks, but also independent factors contributing to each particular regression task simultaneously. Under this paradigm, the number of independent parameters along each mode is constrained by a sparsity-preserving regulariser. Linked multiway parameter analysis and sparsity modeling further reduce the total number of parameters, with lower memory cost than their tensor-based counterparts. The effectiveness of the new method is demonstrated on real data sets.
Resumo:
The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism also does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, novel theoretical analysis of the proposed technique and in-depth experimental study that show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. Expressiveness of decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user’s trust in the system.
Resumo:
4-Dimensional Variational Data Assimilation (4DVAR) assimilates observations through the minimisation of a least-squares objective function, which is constrained by the model flow. We refer to 4DVAR as strong-constraint 4DVAR (sc4DVAR) in this thesis as it assumes the model is perfect. Relaxing this assumption gives rise to weak-constraint 4DVAR (wc4DVAR), leading to a different minimisation problem with more degrees of freedom. We consider two wc4DVAR formulations in this thesis, the model error formulation and state estimation formulation. The 4DVAR objective function is traditionally solved using gradient-based iterative methods. The principle method used in Numerical Weather Prediction today is the Gauss-Newton approach. This method introduces a linearised `inner-loop' objective function, which upon convergence, updates the solution of the non-linear `outer-loop' objective function. This requires many evaluations of the objective function and its gradient, which emphasises the importance of the Hessian. The eigenvalues and eigenvectors of the Hessian provide insight into the degree of convexity of the objective function, while also indicating the difficulty one may encounter while iterative solving 4DVAR. The condition number of the Hessian is an appropriate measure for the sensitivity of the problem to input data. The condition number can also indicate the rate of convergence and solution accuracy of the minimisation algorithm. This thesis investigates the sensitivity of the solution process minimising both wc4DVAR objective functions to the internal assimilation parameters composing the problem. We gain insight into these sensitivities by bounding the condition number of the Hessians of both objective functions. We also precondition the model error objective function and show improved convergence. We show that both formulations' sensitivities are related to error variance balance, assimilation window length and correlation length-scales using the bounds. We further demonstrate this through numerical experiments on the condition number and data assimilation experiments using linear and non-linear chaotic toy models.
Resumo:
With the increase in e-commerce and the digitisation of design data and information,the construction sector has become reliant upon IT infrastructure and systems. The design and production process is more complex, more interconnected, and reliant upon greater information mobility, with seamless exchange of data and information in real time. Construction small and medium-sized enterprises (CSMEs), in particular,the speciality contractors, can effectively utilise cost-effective collaboration-enabling technologies, such as cloud computing, to help in the effective transfer of information and data to improve productivity. The system dynamics (SD) approach offers a perspective and tools to enable a better understanding of the dynamics of complex systems. This research focuses upon system dynamics methodology as a modelling and analysis tool in order to understand and identify the key drivers in the absorption of cloud computing for CSMEs. The aim of this paper is to determine how the use of system dynamics (SD) can improve the management of information flow through collaborative technologies leading to improved productivity. The data supporting the use of system dynamics was obtained through a pilot study consisting of questionnaires and interviews from five CSMEs in the UK house-building sector.
Resumo:
Accurate knowledge of the location and magnitude of ocean heat content (OHC) variability and change is essential for understanding the processes that govern decadal variations in surface temperature, quantifying changes in the planetary energy budget, and developing constraints on the transient climate response to external forcings. We present an overview of the temporal and spatial characteristics of OHC variability and change as represented by an ensemble of dynamical and statistical ocean reanalyses (ORAs). Spatial maps of the 0–300 m layer show large regions of the Pacific and Indian Oceans where the interannual variability of the ensemble mean exceeds ensemble spread, indicating that OHC variations are well-constrained by the available observations over the period 1993–2009. At deeper levels, the ORAs are less well-constrained by observations with the largest differences across the ensemble mostly associated with areas of high eddy kinetic energy, such as the Southern Ocean and boundary current regions. Spatial patterns of OHC change for the period 1997–2009 show good agreement in the upper 300 m and are characterized by a strong dipole pattern in the Pacific Ocean. There is less agreement in the patterns of change at deeper levels, potentially linked to differences in the representation of ocean dynamics, such as water mass formation processes. However, the Atlantic and Southern Oceans are regions in which many ORAs show widespread warming below 700 m over the period 1997–2009. Annual time series of global and hemispheric OHC change for 0–700 m show the largest spread for the data sparse Southern Hemisphere and a number of ORAs seem to be subject to large initialization ‘shock’ over the first few years. In agreement with previous studies, a number of ORAs exhibit enhanced ocean heat uptake below 300 and 700 m during the mid-1990s or early 2000s. The ORA ensemble mean (±1 standard deviation) of rolling 5-year trends in full-depth OHC shows a relatively steady heat uptake of approximately 0.9 ± 0.8 W m−2 (expressed relative to Earth’s surface area) between 1995 and 2002, which reduces to about 0.2 ± 0.6 W m−2 between 2004 and 2006, in qualitative agreement with recent analysis of Earth’s energy imbalance. There is a marked reduction in the ensemble spread of OHC trends below 300 m as the Argo profiling float observations become available in the early 2000s. In general, we suggest that ORAs should be treated with caution when employed to understand past ocean warming trends—especially when considering the deeper ocean where there is little in the way of observational constraints. The current work emphasizes the need to better observe the deep ocean, both for providing observational constraints for future ocean state estimation efforts and also to develop improved models and data assimilation methods.
Resumo:
One of the top ten most influential data mining algorithms, k-means, is known for being simple and scalable. However, it is sensitive to initialization of prototypes and requires that the number of clusters be specified in advance. This paper shows that evolutionary techniques conceived to guide the application of k-means can be more computationally efficient than systematic (i.e., repetitive) approaches that try to get around the above-mentioned drawbacks by repeatedly running the algorithm from different configurations for the number of clusters and initial positions of prototypes. To do so, a modified version of a (k-means based) fast evolutionary algorithm for clustering is employed. Theoretical complexity analyses for the systematic and evolutionary algorithms under interest are provided. Computational experiments and statistical analyses of the results are presented for artificial and text mining data sets. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
Multidimensional Visualization techniques are invaluable tools for analysis of structured and unstructured data with variable dimensionality. This paper introduces PEx-Image-Projection Explorer for Images-a tool aimed at supporting analysis of image collections. The tool supports a methodology that employs interactive visualizations to aid user-driven feature detection and classification tasks, thus offering improved analysis and exploration capabilities. The visual mappings employ similarity-based multidimensional projections and point placement to layout the data on a plane for visual exploration. In addition to its application to image databases, we also illustrate how the proposed approach can be successfully employed in simultaneous analysis of different data types, such as text and images, offering a common visual representation for data expressed in different modalities.
Resumo:
Point placement strategies aim at mapping data points represented in higher dimensions to bi-dimensional spaces and are frequently used to visualize relationships amongst data instances. They have been valuable tools for analysis and exploration of data sets of various kinds. Many conventional techniques, however, do not behave well when the number of dimensions is high, such as in the case of documents collections. Later approaches handle that shortcoming, but may cause too much clutter to allow flexible exploration to take place. In this work we present a novel hierarchical point placement technique that is capable of dealing with these problems. While good grouping and separation of data with high similarity is maintained without increasing computation cost, its hierarchical structure lends itself both to exploration in various levels of detail and to handling data in subsets, improving analysis capability and also allowing manipulation of larger data sets.
Resumo:
Most multidimensional projection techniques rely on distance (dissimilarity) information between data instances to embed high-dimensional data into a visual space. When data are endowed with Cartesian coordinates, an extra computational effort is necessary to compute the needed distances, making multidimensional projection prohibitive in applications dealing with interactivity and massive data. The novel multidimensional projection technique proposed in this work, called Part-Linear Multidimensional Projection (PLMP), has been tailored to handle multivariate data represented in Cartesian high-dimensional spaces, requiring only distance information between pairs of representative samples. This characteristic renders PLMP faster than previous methods when processing large data sets while still being competitive in terms of precision. Moreover, knowing the range of variation for data instances in the high-dimensional space, we can make PLMP a truly streaming data projection technique, a trait absent in previous methods.
Resumo:
The purpose of this paper is to study China’s income inequality under rapid economic growth.Does the relationship between economic growth and income inequality in China follow theKuznets hypothesis? What is the main cause and trend of China’s income inequality? We usedata which covers the period 1980-2005 to analyze the overall inequality, and data coveringthe period 1980-2002 to analyze the inequality inside rural and urban areas. The derivedresults doubt the validity of Kuznets hypothesis on explaining the relationship betweeneconomic growth and income inequality in China. Also we derive the trend of China’sincreased income inequality and find that the urban-rural income disparity is the main causeof China’s income inequality.