106 resultados para text and data mining
Resumo:
In any data mining applications, automated text and text and image retrieval of information is needed. This becomes essential with the growth of the Internet and digital libraries. Our approach is based on the latent semantic indexing (LSI) and the corresponding term-by-document matrix suggested by Berry and his co-authors. Instead of using deterministic methods to find the required number of first "k" singular triplets, we propose a stochastic approach. First, we use Monte Carlo method to sample and to build much smaller size term-by-document matrix (e.g. we build k x k matrix) from where we then find the first "k" triplets using standard deterministic methods. Second, we investigate how we can reduce the problem to finding the "k"-largest eigenvalues using parallel Monte Carlo methods. We apply these methods to the initial matrix and also to the reduced one. The algorithms are running on a cluster of workstations under MPI and results of the experiments arising in textual retrieval of Web documents as well as comparison of the stochastic methods proposed are presented. (C) 2003 IMACS. Published by Elsevier Science B.V. All rights reserved.
Resumo:
This work analyzes the use of linear discriminant models, multi-layer perceptron neural networks and wavelet networks for corporate financial distress prediction. Although simple and easy to interpret, linear models require statistical assumptions that may be unrealistic. Neural networks are able to discriminate patterns that are not linearly separable, but the large number of parameters involved in a neural model often causes generalization problems. Wavelet networks are classification models that implement nonlinear discriminant surfaces as the superposition of dilated and translated versions of a single "mother wavelet" function. In this paper, an algorithm is proposed to select dilation and translation parameters that yield a wavelet network classifier with good parsimony characteristics. The models are compared in a case study involving failed and continuing British firms in the period 1997-2000. Problems associated with over-parameterized neural networks are illustrated and the Optimal Brain Damage pruning technique is employed to obtain a parsimonious neural model. The results, supported by a re-sampling study, show that both neural and wavelet networks may be a valid alternative to classical linear discriminant models.
Resumo:
This review of recent developments starts with the publication of Harold van der Heijden's Study Database Edition IV, John Nunn's second trilogy on the endgame, and a range of endgame tables (EGTs) to the DTC, DTZ and DTZ50 metrics. It then summarises data-mining work by Eiko Bleicher and Guy Haworth in 2010. This used CQL and pgn2fen to find some 3,000 EGT-faulted studies in the database above, and the Type A (value-critical) and Type B-DTM (DTM-depth-critical) zugzwangs in the mainlines of those studies. The same technique was used to mine Chessbase's BIG DATABASE 2010 to identify Type A/B zugzwangs, and to identify the pattern of value-concession and DTM-depth concession in sub-7-man play.
Resumo:
Peak picking is an early key step in MS data analysis. We compare three commonly used approaches to peak picking and discuss their merits by means of statistical analysis. Methods investigated encompass signal-to-noise ratio, continuous wavelet transform, and a correlation-based approach using a Gaussian template. Functionality of the three methods is illustrated and discussed in a practical context using a mass spectral data set created with MALDI-TOF technology. Sensitivity and specificity are investigated using a manually defined reference set of peaks. As an additional criterion, the robustness of the three methods is assessed by a perturbation analysis and illustrated using ROC curves.
Resumo:
Depreciation is a key element of understanding the returns from and price of commercial real estate. Understanding its impact is important for asset allocation models and asset management decisions. It is a key input into well-constructed pricing models and its impact on indices of commercial real estate prices needs to be recognised. There have been a number of previous studies of the impact of depreciation on real estate, particularly in the UK. Law (2004) analysed all of these studies and found that the seemingly consistent results were an illusion as they all used a variety of measurement methods and data. In addition, none of these studies examined impact on total returns; they examined either rental value depreciation alone or rental and capital value depreciation. This study seeks to rectify this omission, adopting the best practice measurement framework set out by Law (2004). Using individual property data from the UK Investment Property Databank for the 10-year period between 1994 and 2003, rental and capital depreciation, capital expenditure rates, and total return series for the data sample and for a benchmark are calculated for 10 market segments. The results are complicated by the period of analysis which started in the aftermath of the major UK real estate recession of the early 1990s, but they give important insights into the impact of depreciation in different segments of the UK real estate investment market.
Resumo:
Van der Heijden’s ENDGAME STUDY DATABASE IV, HHDBIV, is the definitive collection of 76,132 chess studies. The zugzwang position or zug, one in which the side to move would prefer not to, is a frequent theme in the literature of chess studies. In this third data-mining of HHDBIV, we report on the occurrence of sub-7-man zugs there as discovered by the use of CQL and Nalimov endgame tables (EGTs). We also mine those Zugzwang Studies in which a zug more significantly appears in both its White-to-move (wtm) and Black-to-move (btm) forms. We provide some illustrative and extreme examples of zugzwangs in studies.
Resumo:
Integrated Arable Farming Systems are examined from the perspective of the farmer considering the use of such techniques, and data are presented which suggest that the uptake of the approach may expose the manager to a greater degree of risk. Observations are made about the possible uptake of such systems in the UK and the implications this may have for agricultural and environmental policy in general.
Resumo:
The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (EpidemicK-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed clusters distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
Resumo:
The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.
Resumo:
Advances in hardware and software in the past decade allow to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence from these advances in order to cope with the real time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data of continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, in some applications it is required to analyse the data in real time as soon as it is being captured. Such cases are for example if the data stream is infinite, fast changing, or simply too large in size to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm. An algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule based adaptive classifier for data streams, based on an evolving set of Rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new and removing old rules. It is different from the more popular decision tree based classifiers as it tends to leave data instances rather unclassified than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting which will also address the problem of changing feature domain values.
Resumo:
In the heart, inflammatory cytokines including interleukin (IL) 1β are implicated in regulating adaptive and maladaptive changes, whereas IL33 negatively regulates cardiomyocyte hypertrophy and promotes cardioprotection. These agonists signal through a common co-receptor but, in cardiomyocytes, IL1β more potently activates mitogen-activated protein kinases and NFκB, pathways that regulate gene expression. We compared the effects of external application of IL1β and IL33 on the cardiomyocyte transcriptome. Neonatal rat cardiomyocytes were exposed to IL1β or IL33 (0.5, 1 or 2h). Transcriptomic profiles were determined using Affymetrix rat genome 230 2.0 microarrays and data were validated by quantitative PCR. IL1β induced significant changes in more RNAs than IL33 and, generally, to a greater degree. It also had a significantly greater effect in downregulating mRNAs and in regulating mRNAs associated with selected pathways. IL33 had a greater effect on a small, select group of specific transcripts. Thus, differences in intensity of intracellular signals can deliver qualitatively different responses. Quantitatively different responses in production of receptor agonists and transcription factors may contribute to qualitative differences at later times resulting in different phenotypic cellular responses.
Resumo:
Cross-bred cow adoption is an important and potent policy variable precipitating subsistence household entry into emerging milk markets. This paper focuses on the problem of designing policies that encourage and sustain milkmarket expansion among a sample of subsistence households in the Ethiopian highlands. In this context it is desirable to measure households’ ‘proximity’ to market in terms of the level of deficiency of essential inputs. This problem is compounded by four factors. One is the existence of cross-bred cow numbers (count data) as an important, endogenous decision by the household; second is the lack of a multivariate generalization of the Poisson regression model; third is the censored nature of the milk sales data (sales from non-participating households are, essentially, censored at zero); and fourth is an important simultaneity that exists between the decision to adopt a cross-bred cow, the decision about how much milk to produce, the decision about how much milk to consume and the decision to market that milk which is produced but not consumed internally by the household. Routine application of Gibbs sampling and data augmentation overcome these problems in a relatively straightforward manner. We model the count data from two sites close to Addis Ababa in a latent, categorical-variable setting with known bin boundaries. The single-equation model is then extended to a multivariate system that accommodates the covariance between crossbred-cow adoption, milk-output, and milk-sales equations. The latent-variable procedure proves tractable in extension to the multivariate setting and provides important information for policy formation in emerging-market settings
Resumo:
An important feature of agribusiness promotion programs is their lagged impact on consumption. Efficient investment in advertising requires reliable estimates of these lagged responses and it is desirable from both applied and theoretical standpoints to have a flexible method for estimating them. This note derives an alternative Bayesian methodology for estimating lagged responses when investments occur intermittently within a time series. The method exploits a latent-variable extension of the natural-conjugate, normal-linear model, Gibbs sampling and data augmentation. It is applied to a monthly time series on Turkish pasta consumption (1993:5-1998:3) and three, nonconsecutive promotion campaigns (1996:3, 1997:3, 1997:10). The results suggest that responses were greatest to the second campaign, which allocated its entire budget to television media; that its impact peaked in the sixth month following expenditure; and that the rate of return (measured in metric tons additional consumption per thousand dollars expended) was around a factor of 20.
Resumo:
Platelets in the circulation are triggered by vascular damage to activate, aggregate and form a thrombus that prevents excessive blood loss. Platelet activation is stringently regulated by intracellular signalling cascades, which when activated inappropriately lead to myocardial infarction and stroke. Strategies to address platelet dysfunction have included proteomics approaches which have lead to the discovery of a number of novel regulatory proteins of potential therapeutic value. Global analysis of platelet proteomes may enhance the outcome of these studies by arranging this information in a contextual manner that recapitulates established signalling complexes and predicts novel regulatory processes. Platelet signalling networks have already begun to be exploited with interrogation of protein datasets using in silico methodologies that locate functionally feasible protein clusters for subsequent biochemical validation. Characterization of these biological systems through analysis of spatial and temporal organization of component proteins is developing alongside advances in the proteomics field. This focused review highlights advances in platelet proteomics data mining approaches that complement the emerging systems biology field. We have also highlighted nucleated cell types as key examples that can inform platelet research. Therapeutic translation of these modern approaches to understanding platelet regulatory mechanisms will enable the development of novel anti-thrombotic strategies.