100 results for Data Pre-processing
Abstract:
Aim: A nested case-control discovery study was undertaken to test whether information within the serum peptidome can improve on the utility of CA125 for early ovarian cancer detection. Materials and Methods: High-throughput matrix-assisted laser desorption ionisation mass spectrometry (MALDI-MS) was used to profile 295 serum samples from women pre-dating their ovarian cancer diagnosis and from 585 matched control samples. Classification rules incorporating CA125 and MS peak intensities were tested for discriminating ability. Results: Two peaks were found which, in combination with CA125, discriminated cases from controls up to 15 and 11 months before diagnosis, respectively, and earlier than using CA125 alone. One peak was identified as connective tissue-activating peptide III (CTAPIII), whilst the other was putatively identified as platelet factor 4 (PF4). ELISA data supported the down-regulation of PF4 in early cancer cases. Conclusion: Serum peptide information with CA125 improves lead time for early detection of ovarian cancer. The candidate markers are platelet-derived chemokines, suggesting a link between platelet function and tumour development.
Abstract:
We evaluate a number of real estate sentiment indices to ascertain current and forward-looking information content that may be useful for forecasting demand and supply activity. Our focus lies on sector-specific surveys targeting players on the supply side of both residential and non-residential real estate markets. Analyzing the dynamic relationships within a Vector Auto-Regression (VAR) framework, we test the efficacy of these indices by comparing them with other coincident indicators in predicting real estate returns. Overall, our analysis suggests that sentiment indicators convey important information which should be embedded in the modeling exercise to predict real estate market returns. Generally, sentiment indices show better information content than broad economic indicators. The goodness of fit of our models is higher for the residential market than for the non-residential real estate sector. The impulse responses, in general, conform to our theoretical expectations. Variance decompositions and out-of-sample predictions generally show the desired contribution and reasonable improvement, respectively, thus upholding our hypothesis. Quite remarkably, and consistent with theory, the predictability swings when we look through different phases of the cycle. This perhaps suggests that, for example during recessions, market players’ expectations may be a more accurate predictor of future performance, conceivably indicating a ‘negative’ information processing bias and thus conforming to the precautionary motive of consumer behaviour.
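As an illustration of the kind of VAR analysis described above, the sketch below fits a small vector autoregression on hypothetical return and sentiment series and extracts impulse responses, variance decompositions and an out-of-sample forecast. The variable names and data are assumptions, not the paper's indices.

```python
# Illustrative sketch (not the paper's specification): a two-variable VAR on
# hypothetical real estate returns and a supply-side sentiment index.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "returns": rng.normal(size=120),     # hypothetical quarterly returns
    "sentiment": rng.normal(size=120),   # hypothetical sentiment index
})

model = VAR(data)
results = model.fit(2)                   # fixed lag order of 2 for the illustration

irf = results.irf(8)                     # impulse responses over 8 periods
fevd = results.fevd(8)                   # forecast-error variance decomposition
forecast = results.forecast(data.values[-results.k_ar:], steps=4)  # out-of-sample prediction
```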
Abstract:
Optimal state estimation from given observations of a dynamical system by data assimilation is generally an ill-posed inverse problem. In order to solve the problem, a standard Tikhonov, or L2, regularization is used, based on certain statistical assumptions on the errors in the data. The regularization term constrains the estimate of the state to remain close to a prior estimate. In the presence of model error, this approach does not capture the initial state of the system accurately, as the initial state estimate is derived by minimizing the average error between the model predictions and the observations over a time window. Here we examine an alternative L1 regularization technique that has proved valuable in image processing. We show that for examples of flow with sharp fronts and shocks, the L1 regularization technique performs more accurately than standard L2 regularization.
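A minimal sketch of the contrast drawn above, assuming a toy observation operator and a state with a sharp front: it compares a quadratic (Tikhonov/L2) penalty with a smoothed L1 penalty on state gradients. It is not the paper's assimilation system.

```python
# Toy comparison of L2 (Tikhonov) and L1 regularisation for recovering a state
# with a sharp front from noisy, indirect observations. H and the signal are assumptions.
import numpy as np
from scipy.optimize import minimize

n = 40
x_true = np.where(np.arange(n) < n // 2, 1.0, 0.0)     # state with a sharp front
H = np.eye(n)[::2]                                     # observe every other point
rng = np.random.default_rng(1)
y = H @ x_true + 0.05 * rng.normal(size=H.shape[0])    # noisy observations

D = np.diff(np.eye(n), axis=0)                         # first-difference operator
lam = 0.1

def j_l2(x):   # Tikhonov: quadratic penalty on state gradients
    return np.sum((H @ x - y) ** 2) + lam * np.sum((D @ x) ** 2)

def j_l1(x):   # L1 penalty on state gradients (total-variation style), slightly smoothed
    return np.sum((H @ x - y) ** 2) + lam * np.sum(np.sqrt((D @ x) ** 2 + 1e-8))

x0 = np.full(n, 0.5)
x_l2 = minimize(j_l2, x0, method="L-BFGS-B").x
x_l1 = minimize(j_l1, x0, method="L-BFGS-B").x

# The L1 estimate typically preserves the discontinuity, while the L2 estimate smears it.
print(np.abs(x_l2 - x_true).max(), np.abs(x_l1 - x_true).max())
```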
Abstract:
The technique of constructing a transformation, or regrading, of a discrete data set such that the histogram of the transformed data matches a given reference histogram is commonly known as histogram modification. The technique is widely used for image enhancement and normalization. A method which has been previously derived for producing such a regrading is shown to be “best” in the sense that it minimizes the error between the cumulative histogram of the transformed data and that of the given reference function, over all single-valued, monotone, discrete transformations of the data. Techniques for smoothed regrading, which provide a means of balancing the error in matching a given reference histogram against the information lost with respect to a linear transformation are also examined. The smoothed regradings are shown to optimize certain cost functionals. Numerical algorithms for generating the smoothed regradings, which are simple and efficient to implement, are described, and practical applications to the processing of LANDSAT image data are discussed.
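A hedged sketch of the standard cumulative-histogram construction for such a regrading follows; the resulting lookup table is single-valued and monotone by construction. The 8-bit image and reference below are hypothetical, and this is not necessarily the authors' exact algorithm.

```python
# Histogram matching of discrete (8-bit) data to a reference distribution.
import numpy as np

def match_histogram(data, reference):
    """Map integer grey levels of `data` so its histogram approximates that of `reference`."""
    src_hist = np.bincount(data.ravel(), minlength=256).astype(float)
    ref_hist = np.bincount(reference.ravel(), minlength=256).astype(float)
    src_cdf = np.cumsum(src_hist) / src_hist.sum()
    ref_cdf = np.cumsum(ref_hist) / ref_hist.sum()
    # For each source level, pick the reference level with the closest cumulative value;
    # both CDFs are non-decreasing, so the lookup table is monotone.
    lut = np.searchsorted(ref_cdf, src_cdf).clip(0, 255)
    return lut[data]

rng = np.random.default_rng(0)
img = rng.integers(0, 128, size=(64, 64))   # hypothetical low-range scene
ref = rng.integers(0, 256, size=(64, 64))   # hypothetical reference scene
out = match_histogram(img, ref)
```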
Abstract:
Methods for producing nonuniform transformations, or regradings, of discrete data are discussed. The transformations are useful in image processing, principally for enhancement and normalization of scenes. Regradings which “equidistribute” the histogram of the data, that is, which transform it into a constant function, are determined. Techniques for smoothing the regrading, dependent upon a continuously variable parameter, are presented. Generalized methods for constructing regradings such that the histogram of the data is transformed into any prescribed function are also discussed. Numerical algorithms for implementing the procedures and applications to specific examples are described.
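For the special case of an "equidistributing" regrading (histogram equalisation), a minimal sketch follows; it omits the paper's parameterised smoothing of the regrading.

```python
# Histogram equalisation of discrete data via the cumulative distribution.
import numpy as np

def equalize(data, n_levels=256):
    hist = np.bincount(data.ravel(), minlength=n_levels).astype(float)
    cdf = np.cumsum(hist) / hist.sum()
    # Monotone lookup table that spreads the cumulative distribution uniformly.
    lut = np.round(cdf * (n_levels - 1)).astype(int)
    return lut[data]

rng = np.random.default_rng(0)
scene = rng.integers(20, 80, size=(64, 64))   # hypothetical low-contrast scene
flat = equalize(scene)                        # histogram becomes approximately constant
```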
Abstract:
Paleosols were exposed in sections through four abandoned pre-Hispanic agricultural terraces surrounding an infilled mire basin in the southern Peruvian Andes. The two paleosols beneath the Tocotoccasa terrace represent the original ‘natural’ solum and a later soil formed after construction of the agricultural terrace, probably during the early Middle Horizon cultural period (615–695 AD). The soil at the current surface developed subsequent to the building up and reconstruction of the terrace, possibly during the late Late Intermediate period (1200–1400 AD). Micromorphology revealed an unexpected abundance of clay coatings within the upper terrace paleosol and surface terrace soil, a phenomenon attributed to the migration and/or accumulation of neoformed clay produced from the weathering of very unstable volcanic clasts, perhaps fuelled by arid/humid climatic oscillations and/or seasonal input of irrigation waters. The paleosols at Tocotoccasa could not be correlated with any degree of confidence with those beneath the other three terraces due to differences in pedosedimentary properties and uncertainties over chronological controls. Thus, it seems likely that either the terraces were (re)constructed and utilised over different cultural periods or that there is significant variation in the extent of weathering of material used for reconstruction of the terraces. Unfortunately, it cannot be ascertained from the data available whether the terraces were abandoned for any significant period of time prior to reconstruction and, if so, whether this was a regional phenomenon related to climate, social, or economic changes.
Abstract:
This paper presents practical approaches to the problem of sample size re-estimation in the case of clinical trials with survival data when proportional hazards can be assumed. When data are readily available at the time of the review, on a full range of survival experiences across the recruited patients, it is shown that, as expected, performing a blinded re-estimation procedure is straightforward and can help to maintain the trial's pre-specified error rates. Two alternative methods for dealing with the situation where limited survival experiences are available at the time of the sample size review are then presented and compared. In this instance, extrapolation is required in order to undertake the sample size re-estimation. Worked examples, together with results from a simulation study, are described. It is concluded that, as in the standard case, use of either extrapolation approach successfully protects the trial error rates. Copyright © 2012 John Wiley & Sons, Ltd.
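For context, a standard Schoenfeld-type events calculation under proportional hazards is sketched below; it shows the quantity a blinded review would typically update (the pooled event probability), but it is not the paper's re-estimation or extrapolation procedure.

```python
# Standard log-rank/Schoenfeld sample size calculation under proportional hazards
# (an illustration of the setting, not the paper's blinded re-estimation method).
import math
from scipy.stats import norm

def required_events(hazard_ratio, alpha=0.05, power=0.8, allocation=0.5):
    """Events needed to detect `hazard_ratio` with a two-sided log-rank test."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return (z_a + z_b) ** 2 / (allocation * (1 - allocation) * math.log(hazard_ratio) ** 2)

def required_patients(events, pooled_event_probability):
    """Scale events by the overall (blinded) event probability observed at review."""
    return events / pooled_event_probability

d = required_events(hazard_ratio=0.7)                       # events needed
n = required_patients(d, pooled_event_probability=0.4)      # patients, given blinded event rate
print(round(d), round(n))
```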
Abstract:
When performing data fusion, one often measures where targets were and then wishes to deduce where targets currently are. There has been recent research on the processing of such out-of-sequence data. This research has culminated in the development of a number of algorithms for solving the associated tracking problem. This paper reviews these different approaches in a common Bayesian framework and proposes an architecture that orthogonalises the data association and out-of-sequence problems such that any combination of solutions to these two problems can be used together. The emphasis is not on advocating one approach over another on the basis of computational expense, but rather on understanding the relationships among the algorithms so that any approximations made are explicit. Results for a multi-sensor scenario involving out-of-sequence data association are used to illustrate the utility of this approach in a specific context.
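To make the problem concrete, the sketch below shows the naive baseline for out-of-sequence data: buffer the measurements and rerun a simple filter in time order whenever a delayed measurement arrives. The scalar random-walk model and noise levels are assumptions; the algorithms reviewed in the paper are designed precisely to avoid this full reprocessing.

```python
# Naive handling of an out-of-sequence measurement by reprocessing a buffer
# through a scalar (random-walk) Kalman filter. Q and R are assumed noise variances.
import numpy as np

Q, R = 0.1, 0.5

def kalman_run(measurements):
    """Filter (time, z) pairs in chronological order; returns final mean and variance."""
    x, p, t_prev = 0.0, 10.0, None
    for t, z in sorted(measurements):        # enforce time order
        if t_prev is not None:
            p += Q * (t - t_prev)            # predict under random-walk dynamics
        k = p / (p + R)                      # Kalman gain
        x, p = x + k * (z - x), (1 - k) * p  # measurement update
        t_prev = t
    return x, p

buffer = [(1.0, 0.9), (2.0, 1.1), (4.0, 1.3)]
x, p = kalman_run(buffer)

buffer.append((3.0, 1.2))     # an out-of-sequence measurement arrives late
x, p = kalman_run(buffer)     # baseline fix: reprocess the whole buffer in time order
```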
Abstract:
In data fusion systems, one often encounters measurements of past target locations and then wishes to deduce where the targets are currently located. Recent research on the processing of such out-of-sequence data has culminated in the development of a number of algorithms for solving the associated tracking problem. This paper reviews these different approaches in a common Bayesian framework and proposes an architecture that orthogonalises the data association and out-of-sequence problems such that any combination of solutions to these two problems can be used together. The emphasis is not on advocating one approach over another on the basis of computational expense, but rather on understanding the relationships between the algorithms so that any approximations made are explicit.
Abstract:
1. Nutrient concentrations (particularly N and P) determine the extent to which water bodies are or may become eutrophic. Direct determination of nutrient content on a wide scale is labour intensive but the main sources of N and P are well known. This paper describes and tests an export coefficient model for prediction of total N and total P from: (i) land use, stock headage and human population; (ii) the export rates of N and P from these sources; and (iii) the river discharge. Such a model might be used to forecast the effects of changes in land use in the future and to hindcast past water quality to establish comparative or baseline states for the monitoring of change. 2. The model has been calibrated against observed data for 1988 and validated against sets of observed data for a sequence of earlier years in ten British catchments varying from uplands through rolling, fertile lowlands to the flat topography of East Anglia. 3. The model predicted total N and total P concentrations with high precision (95% of the variance in observed data explained). It has been used in two forms: the first on a specific catchment basis; the second for a larger natural region which contains the catchment with the assumption that all catchments within that region will be similar. Both models gave similar results with little loss of precision in the latter case. This implies that it will be possible to describe the overall pattern of nutrient export in the UK with only a fraction of the effort needed to carry out the calculations for each individual water body. 4. Comparison between land use, stock headage, population numbers and nutrient export for the ten catchments in the pre-war year of 1931, and for 1970 and 1988, shows that there has been a substantial loss of rough grazing to fertilized temporary and permanent grasslands, an increase in the hectarage devoted to arable, consistent increases in the stocking of cattle and sheep and a marked movement of humans to these rural catchments. 5. All of these trends have increased the flows of nutrients with more than a doubling of both total N and total P loads during the period. On average in these rural catchments, stock wastes have been the greatest contributors to both N and P exports, with cultivation the next most important source of N and people of P. Ratios of N to P were high in 1931 and remain little changed so that, in these catchments, phosphorus continues to be the nutrient most likely to control algal crops in standing waters supplied by the rivers studied.
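A generic export coefficient calculation in the spirit of the model is sketched below: the total load is the sum of each source magnitude multiplied by a per-source export coefficient, and concentration follows from river discharge. All areas, headage figures and coefficients are placeholders, not the calibrated values from the study.

```python
# Generic export coefficient calculation (placeholder inputs, not the study's values).
AREAS_HA = {"arable": 12000, "temporary_grass": 8000, "rough_grazing": 5000}
STOCK_HEAD = {"cattle": 9000, "sheep": 30000}
POPULATION = 4000

# Hypothetical export coefficients: kg of total P per hectare, head or person per year.
P_EXPORT = {"arable": 0.4, "temporary_grass": 0.2, "rough_grazing": 0.02,
            "cattle": 0.3, "sheep": 0.05, "person": 0.4}

def total_load_kg(areas, stock, population, coeffs):
    load = sum(areas[k] * coeffs[k] for k in areas)        # land-use sources
    load += sum(stock[k] * coeffs[k] for k in stock)       # stock headage sources
    load += population * coeffs["person"]                  # human population source
    return load

load = total_load_kg(AREAS_HA, STOCK_HEAD, POPULATION, P_EXPORT)   # kg total P per year
discharge_m3 = 2.0e8                                               # annual river discharge (m^3)
concentration_mg_l = load * 1e6 / (discharge_m3 * 1e3)             # mg total P per litre
```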
Abstract:
Very high-resolution Synthetic Aperture Radar sensors represent an alternative to aerial photography for delineating floods in built-up environments where flood risk is highest. However, even with currently available SAR image resolutions of 3 m and higher, signal returns from man-made structures hamper the accurate mapping of flooded areas. Enhanced image processing algorithms and a better exploitation of image archives are required to facilitate the use of microwave remote sensing data for monitoring flood dynamics in urban areas. In this study, a hybrid methodology combining radiometric thresholding, region growing and change detection is introduced as an approach enabling automated, objective and reliable flood extent extraction from very high-resolution urban SAR images. The method is based on the calibration of a statistical distribution of “open water” backscatter values inferred from SAR images of floods. SAR images acquired during dry conditions enable the identification of areas i) that are not “visible” to the sensor (i.e. regions affected by ‘layover’ and ‘shadow’) and ii) that systematically behave as specular reflectors (e.g. smooth tarmac, permanent water bodies). Change detection with respect to a pre- or post-flood reference image thereby reduces over-detection of inundated areas. A case study of the July 2007 Severn River flood (UK) observed by the very high-resolution SAR sensor on board TerraSAR-X as well as airborne photography highlights advantages and limitations of the proposed method. We conclude that even though the fully automated SAR-based flood mapping technique overcomes some limitations of previous methods, further technological and methodological improvements are necessary for SAR-based flood detection in urban areas to match the flood mapping capability of high-quality aerial photography.
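The sketch below illustrates only the two named building blocks, radiometric thresholding and region growing, on a hypothetical backscatter array; the statistical calibration of the open-water distribution and the change detection against a dry-reference image are not reproduced.

```python
# Thresholding plus 4-connected region growing on a hypothetical SAR backscatter array.
import numpy as np
from collections import deque

def grow_flood(backscatter_db, seed_threshold, grow_threshold):
    """Seed on very dark (open-water-like) pixels, then grow into neighbours
    whose backscatter stays below a more tolerant threshold."""
    flooded = backscatter_db <= seed_threshold
    frontier = deque(zip(*np.nonzero(flooded)))
    rows, cols = backscatter_db.shape
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and not flooded[rr, cc] \
                    and backscatter_db[rr, cc] <= grow_threshold:
                flooded[rr, cc] = True
                frontier.append((rr, cc))
    return flooded

rng = np.random.default_rng(0)
sigma0 = rng.normal(-8.0, 3.0, size=(100, 100))    # hypothetical sigma-nought values (dB)
mask = grow_flood(sigma0, seed_threshold=-15.0, grow_threshold=-12.0)
```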
Abstract:
OBJECTIVES: The prediction of protein structure and the precise understanding of protein folding and unfolding processes remain among the greatest challenges in structural biology and bioinformatics. Computer simulations based on molecular dynamics (MD) are at the forefront of the effort to gain a deeper understanding of these complex processes. Currently, these MD simulations are usually on the order of tens of nanoseconds, generate a large amount of conformational data and are computationally expensive. More and more groups run such simulations and generate a myriad of data, which raises new challenges in managing and analyzing these data. Because of the vast range of proteins researchers want to study and simulate, the computational effort needed to generate data, the large data volumes involved, and the different types of analyses scientists need to perform, it is desirable to provide a public repository allowing researchers to pool and share protein unfolding data. METHODS: To adequately organize, manage, and analyze the data generated by unfolding simulation studies, we designed a data warehouse system that is embedded in a grid environment to facilitate the seamless sharing of available computer resources and thus enable many groups to share complex molecular dynamics simulations on a more regular basis. RESULTS: To gain insight into the conformational fluctuations and stability of the monomeric forms of the amyloidogenic protein transthyretin (TTR), molecular dynamics unfolding simulations of the monomer of human TTR have been conducted. Trajectory data and meta-data of the wild-type (WT) protein and the highly amyloidogenic variant L55P-TTR represent the test case for the data warehouse. CONCLUSIONS: Web and grid services, especially pre-defined data mining services that can run on or 'near' the data repository of the data warehouse, are likely to play a pivotal role in the analysis of molecular dynamics unfolding data.
Abstract:
In a world where data is captured on a large scale, the major challenge for data mining algorithms is to be able to scale up to large datasets. There are two main approaches to inducing classification rules: one is the divide and conquer approach, also known as the top down induction of decision trees; the other approach is called the separate and conquer approach. A considerable amount of work has been done on scaling up the divide and conquer approach. However, very little work has been conducted on scaling up the separate and conquer approach. In this work we describe a parallel framework that allows the parallelisation of a certain family of separate and conquer algorithms, the Prism family. Parallelisation helps the Prism family of algorithms to harvest additional computer resources in a network of computers in order to make the induction of classification rules scale better on large datasets. Our framework also incorporates a pre-pruning facility for parallel Prism algorithms.
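For readers unfamiliar with the family, a compact sketch of the sequential separate-and-conquer idea behind Prism (after Cendrowska) follows; it is not the parallel framework itself, and the toy dataset is hypothetical.

```python
# Separate-and-conquer rule induction in the Prism style: for a target class,
# grow a rule by repeatedly adding the attribute-value test with the highest
# precision for that class, then remove the covered examples and repeat.
def learn_rules_for_class(examples, target):
    """examples: list of (dict of attribute -> value, class_label) pairs."""
    rules, remaining = [], list(examples)
    while any(label == target for _, label in remaining):
        rule, covered = {}, remaining
        # Specialise the rule until it is pure or no attributes are left to add.
        while any(label != target for _, label in covered):
            candidates = {(a, v) for atts, _ in covered for a, v in atts.items()
                          if a not in rule}
            if not candidates:
                break
            def precision(term):
                a, v = term
                match = [(atts, l) for atts, l in covered if atts.get(a) == v]
                return sum(l == target for _, l in match) / len(match)
            best = max(candidates, key=precision)
            rule[best[0]] = best[1]
            covered = [(atts, l) for atts, l in covered if atts.get(best[0]) == best[1]]
        rules.append(dict(rule))
        # Separate: remove the examples covered by the new rule.
        remaining = [(atts, l) for atts, l in remaining
                     if not all(atts.get(a) == v for a, v in rule.items())]
    return rules

data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "sunny", "windy": "yes"}, "stay"),
        ({"outlook": "rain",  "windy": "no"}, "play")]
print(learn_rules_for_class(data, "play"))
```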
Abstract:
Inducing rules from very large datasets is one of the most challenging areas in data mining. Several approaches exist to scaling up classification rule induction to large datasets, namely data reduction and the parallelisation of classification rule induction algorithms. In the area of parallelisation of classification rule induction algorithms, most of the work has been concentrated on the Top Down Induction of Decision Trees (TDIDT), also known as the ‘divide and conquer’ approach. However, powerful alternative algorithms exist that induce modular rules. Most of these alternative algorithms follow the ‘separate and conquer’ approach of inducing rules, but very little work has been done to make the ‘separate and conquer’ approach scale better on large training data. This paper examines the potential of the recently developed blackboard-based J-PMCRI methodology for parallelising modular classification rule induction algorithms that follow the ‘separate and conquer’ approach. A concrete implementation of the methodology is evaluated empirically on very large datasets.
Abstract:
The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.