106 results for text and data mining


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploiting the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operation which hinders the scalability of the approach. This work studies a different parallel formulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature of the centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements
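The global reduction that the abstract identifies as the scalability bottleneck can be illustrated with a minimal sketch of the straightforward parallel formulation. This is not the authors' proposed method: the function names (local_partial_sums, global_reduce, kmeans_step) are hypothetical, and the "nodes" are simply lists of points processed in one Python process.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def local_partial_sums(points: Sequence[Point], centroids: Sequence[Point]):
    """Work done independently on one node: per-cluster coordinate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def global_reduce(partials):
    """Stand-in for the global reduction: element-wise sum over all nodes."""
    k, dim = len(partials[0][0]), len(partials[0][0][0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for node_sums, node_counts in partials:
        for j in range(k):
            counts[j] += node_counts[j]
            for d in range(dim):
                sums[j][d] += node_sums[j][d]
    return sums, counts


def kmeans_step(partitions: Sequence[Sequence[Point]], centroids: Sequence[Point]) -> List[Point]:
    """One iteration: local work on every node, then one global reduction."""
    partials = [local_partial_sums(part, centroids) for part in partitions]
    sums, counts = global_reduce(partials)
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else centroids[j]
        for j in range(len(centroids))
    ]


if __name__ == "__main__":
    nodes = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
    print(kmeans_step(nodes, [(1.0, 1.0), (9.0, 9.0)]))
```

Every node must take part in global_reduce at every iteration, which is the communication step the work above sets out to remove.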

Relevance:

100.00% 100.00%

Publisher:

Abstract:

It is well known that there is a dynamic relationship between cerebral blood flow (CBF) and cerebral blood volume (CBV). With increasing applications of functional MRI, where the blood oxygen-level-dependent signals are recorded, the understanding and accurate modeling of the hemodynamic relationship between CBF and CBV becomes increasingly important. This study presents an empirical and data-based modeling framework for model identification from CBF and CBV experimental data. It is shown that the relationship between the changes in CBF and CBV can be described using a parsimonious autoregressive with exogenous input model structure. It is observed that neither the ordinary least-squares (LS) method nor the classical total least-squares (TLS) method can produce accurate estimates from the original noisy CBF and CBV data. A regularized total least-squares (RTLS) method is thus introduced and extended to solve such an error-in-the-variables problem. Quantitative results show that the RTLS method works very well on the noisy CBF and CBV data. Finally, a combination of RTLS with a filtering method can lead to a parsimonious but very effective model that can characterize the relationship between the changes in CBF and CBV.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Artisanal miners have tended to be portrayed in the literature and media as people who work hard and play hard, not infrequently depicted as ‘rough diamonds’ likely to cross the boundaries of appropriate behaviour through pursuit of wealth and flamboyant living, often at the cost of local environmental damage. A popular alternative image is that of marginalised labourers, driven by poverty to toil in harsh conditions and pursuing mining livelihoods in the face of national governments and large-scale mining companies’ subversion of their land and mineral rights. Both views reflect partial realities, but are inclined to exaggerate the position of miners as mischief-making rogues or victims. Through documentation of the multi-faceted nature of Tanzanian artisanal miners’ work and home lives during the country’s on-going economic mineralisation, we endeavour to convey a balanced rendering of their aspirations, occupational identity and social ties. Our emphasis is on their working lives as artisans, how they organise themselves and contend with the risks of their occupation, including their engagement with government policy and large-scale mining interests.

Relevance:

100.00% 100.00%

Publisher:

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this paper we discuss the current state-of-the-art in estimating, evaluating, and selecting among non-linear forecasting models for economic and financial time series. We review theoretical and empirical issues, including predictive density, interval and point evaluation and model selection, loss functions, data-mining, and aggregation. In addition, we argue that although the evidence in favor of constructing forecasts using non-linear models is rather sparse, there is reason to be optimistic. However, much remains to be done. Finally, we outline a variety of topics for future research, and discuss a number of areas which have received considerable attention in the recent literature, but where many questions remain.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

A glance along the finance shelves at any bookshop reveals a large number of books that seek to show readers how to ‘make a million’ or ‘beat the market’ with allegedly highly profitable equity trading strategies. This paper investigates whether useful trading strategies can be derived from popular books of investment strategy, with What Works on Wall Street by James P. O'Shaughnessy used as an example. Specifically, we test whether this strategy would have produced a similarly spectacular performance in the UK context as was demonstrated by the author for the US market. As part of our investigation, we highlight a general methodology for determining whether the observed superior performance of a trading rule could be attributed in part or in entirety to data mining. Overall, we find that the O'Shaughnessy rule performs reasonably well in the UK equity market, yielding higher returns than the FTSE All-Share Index, but lower returns than an equally weighted benchmark

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Surface temperature is a key aspect of weather and climate, but the term may refer to different quantities that play interconnected roles and are observed by different means. In a community-based activity in June 2012, the EarthTemp Network brought together 55 researchers from five continents to improve the interaction between scientific communities who focus on surface temperature in particular domains, to exploit the strengths of different observing systems and to better meet the needs of different communities. The workshop identified key needs for progress towards meeting scientific and societal requirements for surface temperature understanding and information, which are presented in this community paper. A "whole-Earth" perspective is required with more integrated, collaborative approaches to observing and understanding Earth's various surface temperatures. It is necessary to build understanding of the relationships between different surface temperatures, where presently inadequate, and undertake large-scale systematic intercomparisons. Datasets need to be easier to obtain and exploit for a wide constituency of users, with the differences and complementarities communicated in readily understood terms, and realistic and consistent uncertainty information provided. Steps were also recommended to curate and make available data that are presently inaccessible, develop new observing systems and build capacities to accelerate progress in the accuracy and usability of surface temperature datasets.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Expert systems have become increasingly popular because of their commercial importance. A rule based system is a special type of expert system, which consists of a set of ‘if-then’ rules and can be applied as a decision support system in many areas such as healthcare, transportation and security. Rule based systems can be constructed based on both expert knowledge and data. This paper aims to introduce the theory of rule based systems, especially the categorization and construction of such systems, from a conceptual point of view. This paper also introduces rule based systems for classification tasks in detail.
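As a rough illustration of how a set of 'if-then' rules can act as a classifier, the sketch below applies an ordered rule list and returns the label of the first rule whose conditions all hold. The rule contents, attribute names and the classify function are hypothetical examples, not taken from the paper.

```python
from typing import Dict, List, Tuple

Rule = Tuple[Dict[str, str], str]  # (conditions that must all hold, class label)


def classify(record: Dict[str, str], rules: List[Rule], default: str) -> str:
    """Return the label of the first rule whose conditions all match the record."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default  # no rule fired


# Hypothetical rule set for a toy triage decision.
RULES: List[Rule] = [
    ({"fever": "yes", "cough": "yes"}, "refer"),
    ({"fever": "yes"}, "monitor"),
]

if __name__ == "__main__":
    print(classify({"fever": "yes", "cough": "no"}, RULES, default="discharge"))
```

Ordering the rules from most to least specific is one simple way to resolve conflicts when several rules match; other conflict-resolution strategies exist.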

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The DIAMET (DIAbatic influences on Mesoscale structures in ExTratropical storms) project aims to improve forecasts of high-impact weather in extratropical cyclones through field measurements, high-resolution numerical modeling, and improved design of ensemble forecasting and data assimilation systems. This article introduces DIAMET and presents some of the first results. Four field campaigns were conducted by the project, one of which, in late 2011, coincided with an exceptionally stormy period marked by an unusually strong, zonal North Atlantic jet stream and a succession of severe windstorms in northwest Europe. As a result, December 2011 had the highest monthly North Atlantic Oscillation index (2.52) of any December in the last 60 years. Detailed observations of several of these storms were gathered using the UK’s BAe146 research aircraft and extensive ground-based measurements. As an example of the results obtained during the campaign, observations are presented of cyclone Friedhelm on 8 December 2011, when surface winds with gusts exceeding 30 m s⁻¹ crossed central Scotland, leading to widespread disruption to transportation and electricity supply. Friedhelm deepened 44 hPa in 24 hours and developed a pronounced bent-back front wrapping around the storm center. The strongest winds at 850 hPa and the surface occurred in the southern quadrant of the storm, and detailed measurements showed these to be most intense in clear air between bands of showers. High-resolution ensemble forecasts from the Met Office showed similar features, with the strongest winds aligned in linear swaths between the bands, suggesting that there is potential for improved skill in forecasts of damaging winds.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Twitter is both a micro-blogging service and a platform for public conversation. Direct conversation is facilitated in Twitter through the use of @’s (mentions) and replies. While the conversational element of Twitter is of particular interest to the marketing sector, relatively few data-mining studies have focused on this area. We analyse conversations associated with reciprocated mentions that take place in a data-set consisting of approximately 4 million tweets collected over a period of 28 days that contain at least one mention. We ignore tweet content and instead use the mention network structure and its dynamical properties to identify and characterise Twitter conversations between pairs of users and within larger groups. We consider conversational balance, meaning the fraction of content contributed by each party. The goal of this work is to draw out some of the mechanisms driving conversation in Twitter, with the potential aim of developing conversational models.
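A minimal sketch of the conversational balance idea follows, under the assumption that "content contributed" is measured by the number of mention tweets each party sends to the other (the study ignores tweet text, but its exact measure is not given in the abstract). The function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def conversational_balance(
    mentions: Iterable[Tuple[str, str]], a: str, b: str
) -> Optional[Dict[str, float]]:
    """Fraction of the a<->b exchange sent by each party.

    mentions is an iterable of (sender, mentioned_user) pairs.
    Returns None when the two users never mention each other.
    """
    counts = Counter(
        sender for sender, target in mentions if {sender, target} == {a, b}
    )
    total = counts[a] + counts[b]
    if total == 0:
        return None
    return {a: counts[a] / total, b: counts[b] / total}


if __name__ == "__main__":
    sample = [("ann", "bob"), ("bob", "ann"), ("ann", "bob"), ("ann", "cat")]
    print(conversational_balance(sample, "ann", "bob"))
```

A value near 0.5 for both users indicates a balanced exchange, while a value near 1.0 for one user indicates a largely one-sided conversation.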

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The Finnish Meteorological Institute, in collaboration with the University of Helsinki, has established a new ground-based remote-sensing network in Finland. The network consists of five topographically, ecologically and climatically different sites distributed from southern to northern Finland. The main goal of the network is to monitor air pollution and boundary layer properties in near real time, with a Doppler lidar and ceilometer at each site. In addition to these operational tasks, two sites are members of the Aerosols, Clouds and Trace gases Research InfraStructure Network (ACTRIS); a Ka band cloud radar at Sodankylä will provide cloud retrievals within CloudNet, and a multi-wavelength Raman lidar, PollyXT (POrtabLe Lidar sYstem eXTended), in Kuopio provides optical and microphysical aerosol properties through EARLINET (the European Aerosol Research Lidar Network). Three C-band weather radars are located in the Helsinki metropolitan area and are deployed for operational and research applications. We performed two inter-comparison campaigns to investigate the Doppler lidar performance, compare the backscatter signal and wind profiles, and to optimize the lidar sensitivity through adjusting the telescope focus length and data-integration time to ensure sufficient signal-to-noise ratio (SNR) in low-aerosol-content environments. In terms of statistical characterization, the wind-profile comparison showed good agreement between different lidars. Initially, there was a discrepancy in the SNR and attenuated backscatter coefficient profiles which arose from an incorrectly reported telescope focus setting from one instrument, together with the need to calibrate. After diagnosing the true telescope focus length, calculating a new attenuated backscatter coefficient profile with the new telescope function and taking into account calibration, the resulting attenuated backscatter profiles all showed good agreement with each other. 
It was thought that harsh Finnish winters could pose problems, but, due to the built-in heating systems, low ambient temperatures had no, or only a minor, impact on the lidar operation – including scanning-head motion. However, accumulation of snow and ice on the lens has been observed, which can lead to the formation of a water/ice layer thus attenuating the signal inconsistently. Thus, care must be taken to ensure continuous snow removal.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Classical regression methods take vectors as covariates and estimate the corresponding vectors of regression parameters. When addressing regression problems on covariates of more complex form such as multi-dimensional arrays (i.e. tensors), traditional computational models can be severely compromised by ultrahigh dimensionality as well as complex structure. By exploiting the special structure of tensor covariates, the tensor regression model provides a promising solution to reduce the model’s dimensionality to a manageable level, thus leading to efficient estimation. Most of the existing tensor-based methods independently estimate each individual regression problem based on tensor decomposition which allows the simultaneous projections of an input tensor to more than one direction along each mode. As a matter of fact, multi-dimensional data are collected under the same or very similar conditions, so that data share some common latent components but can also have their own independent parameters for each regression task. Therefore, it is beneficial to analyse regression parameters among all the regressions in a linked way. In this paper, we propose a tensor regression model based on Tucker Decomposition, which identifies not only the common components of parameters across all the regression tasks, but also independent factors contributing to each particular regression task simultaneously. Under this paradigm, the number of independent parameters along each mode is constrained by a sparsity-preserving regulariser. Linked multiway parameter analysis and sparsity modeling further reduce the total number of parameters, with lower memory cost than their tensor-based counterparts. The effectiveness of the new method is demonstrated on real data sets.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism also does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism. Parallel Random Prism is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study, which show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user’s trust in the system.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

4-Dimensional Variational Data Assimilation (4DVAR) assimilates observations through the minimisation of a least-squares objective function, which is constrained by the model flow. We refer to 4DVAR as strong-constraint 4DVAR (sc4DVAR) in this thesis as it assumes the model is perfect. Relaxing this assumption gives rise to weak-constraint 4DVAR (wc4DVAR), leading to a different minimisation problem with more degrees of freedom. We consider two wc4DVAR formulations in this thesis, the model error formulation and state estimation formulation. The 4DVAR objective function is traditionally solved using gradient-based iterative methods. The principal method used in Numerical Weather Prediction today is the Gauss-Newton approach. This method introduces a linearised 'inner-loop' objective function, which upon convergence, updates the solution of the non-linear 'outer-loop' objective function. This requires many evaluations of the objective function and its gradient, which emphasises the importance of the Hessian. The eigenvalues and eigenvectors of the Hessian provide insight into the degree of convexity of the objective function, while also indicating the difficulty one may encounter while iteratively solving 4DVAR. The condition number of the Hessian is an appropriate measure for the sensitivity of the problem to input data. The condition number can also indicate the rate of convergence and solution accuracy of the minimisation algorithm. This thesis investigates the sensitivity of the solution process minimising both wc4DVAR objective functions to the internal assimilation parameters composing the problem. We gain insight into these sensitivities by bounding the condition number of the Hessians of both objective functions. We also precondition the model error objective function and show improved convergence. We show that both formulations' sensitivities are related to error variance balance, assimilation window length and correlation length-scales using the bounds. 
We further demonstrate this through numerical experiments on the condition number and data assimilation experiments using linear and non-linear chaotic toy models.
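To make the link between the Hessian condition number and error variance balance concrete, here is a toy sketch. It is not one of the thesis's wc4DVAR formulations or bounds: it assumes a two-variable quadratic objective with uncorrelated background errors of variance sigma_b2, where only the first variable is observed with error variance sigma_o2, so the Hessian is diag(1/sigma_b2 + 1/sigma_o2, 1/sigma_b2). All function names are hypothetical.

```python
import math
from typing import Tuple


def sym2x2_eigenvalues(a: float, b: float, c: float) -> Tuple[float, float]:
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def condition_number(a: float, b: float, c: float) -> float:
    """Ratio of largest to smallest eigenvalue for a positive-definite Hessian."""
    smallest, largest = sym2x2_eigenvalues(a, b, c)
    if smallest <= 0.0:
        raise ValueError("Hessian is not positive definite")
    return largest / smallest


def toy_hessian(sigma_b2: float, sigma_o2: float) -> Tuple[float, float, float]:
    """Entries (a, b, c) of the assumed toy Hessian described in the lead-in."""
    return (1.0 / sigma_b2 + 1.0 / sigma_o2, 0.0, 1.0 / sigma_b2)


if __name__ == "__main__":
    for sigma_o2 in (1.0, 0.25, 0.01):
        kappa = condition_number(*toy_hessian(1.0, sigma_o2))
        print(f"sigma_o2={sigma_o2}: condition number = {kappa:.1f}")
```

In this toy case the condition number equals 1 + sigma_b2 / sigma_o2, so it grows as the observations become more accurate relative to the background, which is one simple way a variance ratio can drive ill-conditioning.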

Relevance:

100.00% 100.00%

Publisher:

Abstract:

With the increase in e-commerce and the digitisation of design data and information, the construction sector has become reliant upon IT infrastructure and systems. The design and production process is more complex, more interconnected, and reliant upon greater information mobility, with seamless exchange of data and information in real time. Construction small and medium-sized enterprises (CSMEs), in particular, the speciality contractors, can effectively utilise cost-effective collaboration-enabling technologies, such as cloud computing, to help in the effective transfer of information and data to improve productivity. The system dynamics (SD) approach offers a perspective and tools to enable a better understanding of the dynamics of complex systems. This research focuses upon system dynamics methodology as a modelling and analysis tool in order to understand and identify the key drivers in the absorption of cloud computing for CSMEs. The aim of this paper is to determine how the use of system dynamics (SD) can improve the management of information flow through collaborative technologies leading to improved productivity. The data supporting the use of system dynamics was obtained through a pilot study consisting of questionnaires and interviews from five CSMEs in the UK house-building sector.