44 results for Data sets storage


Relevance:

90.00%

Publisher:

Abstract:

Chlamydiae are important pathogens of humans, birds and a wide range of animals. They are a unique group of bacteria, characterized by their developmental cycle. Chlamydiae have been difficult to study because of their obligate intracellular growth habit and the lack of a genetic transformation system. However, the past 5 years have seen the full genome sequencing of seven strains of Chlamydia and a rapid expansion of genomic, transcriptomic (RT-PCR, microarray) and proteomic analyses of these pathogens. The Chlamydia Interactive Database (CIDB) described here is the first database of its type to hold genomic, RT-PCR, microarray and proteomics data sets that researchers can cross-query for patterns in the data. Combining the data of many research groups into a single database and cross-querying it from different perspectives should enhance our understanding of the complex cell biology of these pathogens. The database is available at: http://www3.it.deakin.edu.au:8080/CIDB/.

Relevance:

90.00%

Publisher:

Abstract:

A retrospective assessment of exposure to benzene was carried out for a nested case-control study of lympho-haematopoietic cancers, including leukaemia, in the Australian petroleum industry. Each job or task in the industry was assigned a Base Estimate (BE) of exposure derived from task-based personal exposure assessments carried out by the company occupational hygienists. The BEs corresponded to the estimated arithmetic mean exposure to benzene for each job or task and were used in a deterministic algorithm to estimate the exposure of subjects in the study. Nearly all of the data sets underlying the BEs were found to contain some values below the limit of detection (LOD) of the sampling and analytical methods, and some were very heavily censored; up to 95% of the data were below the LOD in some data sets. It was therefore necessary to use a method of calculating the arithmetic mean exposures that took the censored data into account. Three different methods were employed in an attempt to select the most appropriate method for the particular data in the study. A common method is to replace the missing (censored) values with half the detection limit. This method has been recommended for data sets where much of the data are below the limit of detection or where the data are highly skewed, with a geometric standard deviation of 3 or more. Another method, replacing the censored data with the limit of detection divided by the square root of 2, has been recommended when relatively few data are below the detection limit or where the data are not highly skewed. The third method examined was Cohen's method, which involves mathematical extrapolation of the left-hand tail of the distribution, based on the distribution of the uncensored data, and calculation of the maximum likelihood estimate of the arithmetic mean. When these three methods were applied to the data in this study, the first two simple methods gave similar results in most cases. Cohen's method, on the other hand, gave results that were generally, but not always, higher than the simpler methods, and in some cases gave extremely high and even implausible estimates of the mean. It appears that if the data deviate substantially from a simple log-normal distribution, particularly if high outliers are present, Cohen's method produces erratic and unreliable estimates. After examining these results, and both the distributions and proportions of censored data, it was decided that the half-limit-of-detection method was the most suitable in this particular study.
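
As a rough illustration of the three estimators (not the study's own code), the sketch below computes the arithmetic mean of a synthetic, left-censored exposure data set using LOD/2 substitution, LOD/√2 substitution, and a Cohen-style maximum likelihood fit of a log-normal distribution; the LOD and all data values are hypothetical.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
lod = 0.5                                              # hypothetical LOD, ppm
true = rng.lognormal(mean=-1.0, sigma=1.2, size=200)   # synthetic exposures
detected = true[true >= lod]
n_censored = int((true < lod).sum())

# Method 1: replace censored values with LOD/2 (recommended for heavily
# censored or highly skewed data, GSD >= 3).
mean_half_lod = np.mean(np.r_[detected, np.full(n_censored, lod / 2)])

# Method 2: replace censored values with LOD/sqrt(2) (recommended when
# few values are censored and the skew is modest).
mean_lod_root2 = np.mean(np.r_[detected, np.full(n_censored, lod / np.sqrt(2))])

# Method 3: Cohen-style maximum likelihood for a left-censored log-normal;
# censored observations contribute the CDF at the LOD to the likelihood.
def neg_log_lik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    ll_det = stats.norm.logpdf(np.log(detected), mu, sigma) - np.log(detected)
    ll_cens = n_censored * stats.norm.logcdf(np.log(lod), mu, sigma)
    return -(ll_det.sum() + ll_cens)

mu_hat, log_sigma_hat = optimize.minimize(neg_log_lik, x0=[0.0, 0.0]).x
sigma_hat = np.exp(log_sigma_hat)
mean_mle = np.exp(mu_hat + sigma_hat**2 / 2)   # arithmetic mean of a log-normal

print(mean_half_lod, mean_lod_root2, mean_mle)
```

On data with high outliers relative to a log-normal, the MLE estimate is the one that can run away, consistent with the behaviour of Cohen's method reported above.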

Relevance:

90.00%

Publisher:

Abstract:

A major challenge facing freshwater ecologists and managers is the development of models that link stream ecological condition to catchment-scale effects, such as land use. Previous attempts to build such models have followed two general approaches. The bottom-up approach employs mechanistic models, which can quickly become too complex to be useful. The top-down approach employs empirical models derived from large data sets, and has often suffered from large amounts of unexplained variation in stream condition.

We believe that the lack of success of both modelling approaches may be at least partly explained by scientists considering too broad a range of catchment types. By stratifying large sets of catchments into groups of similar type prior to modelling, both types of models may therefore be improved. This paper describes preliminary work using a Bayesian classification software package, ‘Autoclass’ (Cheeseman and Stutz 1996), to create classes of catchments within the Murray Darling Basin based on physiographic data.

Autoclass uses a model-based classification method that employs finite mixture modelling and trades off model fit versus complexity, leading to a parsimonious solution. The software provides information on the posterior probability that the classification is ‘correct’ and also probabilities for alternative classifications. The importance of each attribute in defining the individual classes is calculated and presented, assisting description of the classes. Each case is ‘assigned’ to a class based on membership probability, but the probability of membership of other classes is also provided. This feature deals very well with cases that do not fit neatly into a larger class. Lastly, Autoclass requires the user to specify the measurement error of continuous variables.
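
Autoclass itself is a standalone package, but the core idea — fitting finite mixture models of increasing complexity, trading fit against parsimony, and reading off soft membership probabilities — can be sketched with a generic Gaussian mixture implementation. In the sketch below the attribute matrix is a random placeholder and BIC stands in for Autoclass's own fit-versus-complexity criterion.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))          # placeholder catchment x attribute matrix

# Fit mixtures of increasing complexity and keep the most parsimonious fit.
models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(2, 13)]
best = min(models, key=lambda m: m.bic(X))

labels = best.predict(X)               # hard assignment to the primary class
probs = best.predict_proba(X)          # soft membership across all classes

# Catchments whose primary class is held with low probability are the
# 'uncertain', boundary-straddling cases discussed later in the abstract.
uncertain = np.where(probs.max(axis=1) < 0.7)[0]
print(best.n_components, len(uncertain))
```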

Catchments were derived from the Australian digital elevation model. Physiographic data were derived from national spatial data sets. There was very little information on measurement errors for the spatial data, so a conservative error of 5% of the data range was adopted for all continuous attributes. The incorporation of uncertainty into spatial data sets remains a research challenge.

The results of the classification were very encouraging. The software found nine classes of catchments in the Murray Darling Basin. The classes grouped together geographically and followed altitude and latitude gradients, despite the fact that these variables were not included in the classification. Descriptions of the classes reveal very different physiographic environments, ranging from dry, flat catchments (i.e. lowlands) through to wet, hilly catchments (i.e. mountainous areas). Rainfall and slope were two important discriminators between classes. These two attributes in particular will affect the ways in which a stream interacts with its catchment, and can thus be expected to modify the effects of land-use change on ecological condition. Realistic models of the effects of land-use change on streams would therefore differ between the different types of catchments, as will sound management practices.

A small number of catchments were assigned to their primary class with relatively low probability. These catchments lie on the boundaries of groups of catchments, with the second most likely class being an adjacent group. The locations of these ‘uncertain’ catchments show that the Bayesian classification dealt well with cases that do not fit neatly into larger classes.

Although the results are intuitive, we cannot yet assess whether the classifications described in this paper would assist the modelling of catchment scale effects on stream ecological condition. It is most likely that catchment classification and modelling will be an iterative process, where the needs of the model are used to guide classification, and the results of classifications used to suggest further refinements to models.

Relevance:

90.00%

Publisher:

Abstract:

The high-throughput experimental data from the new gene microarray technology have spurred numerous efforts to find effective ways of processing microarray data to reveal real biological relationships among genes. This work proposes an innovative data pre-processing approach to identify noise in the data sets and to eliminate or reduce its impact on gene clustering. With the proposed algorithm, the pre-processed data sets make the clustering results stable across clustering algorithms with different similarity metrics, the important information of genes and features is kept, and the clustering quality is improved. A preliminary evaluation on real microarray data sets has shown the effectiveness of the proposed algorithm.
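
The paper's specific pre-processing algorithm is not reproduced here; the sketch below is a generic stand-in for the same workflow: flag likely-noise genes, remove them, and check that clustering becomes more stable across similarity metrics. The expression matrix, noise heuristic and cluster count are all illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
expr = rng.normal(size=(500, 20))        # placeholder gene x sample matrix

# Simple noise heuristic: genes with near-flat profiles carry little signal.
keep = expr.std(axis=1) > np.percentile(expr.std(axis=1), 25)
filtered = expr[keep]

def cluster(data, metric):
    model = AgglomerativeClustering(n_clusters=5, metric=metric, linkage="average")
    return model.fit_predict(data)

# Stability check: clusterings under different similarity metrics should
# agree more once the noisy genes are removed.
before = adjusted_rand_score(cluster(expr, "euclidean"), cluster(expr, "cosine"))
after = adjusted_rand_score(cluster(filtered, "euclidean"), cluster(filtered, "cosine"))
print(before, after)
```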

Relevance:

90.00%

Publisher:

Abstract:

One of the fundamental machine learning tasks is that of predictive classification. Given that organisations collect an ever-increasing amount of data, predictive classification methods must be able to handle large amounts of data effectively and efficiently. However, present requirements push existing algorithms to, and sometimes beyond, their limits, since many classification prediction algorithms were designed when currently common data set sizes were beyond imagination. This has led to a significant amount of research into ways of making classification learning algorithms more effective and efficient. Although substantial progress has been made, a number of key questions have not been answered. This dissertation investigates two of these key questions. The first is whether different types of algorithms to those currently employed are required when using large data sets. This is answered by analysing how the bias plus variance decomposition of predictive classification error changes as training set size is increased. Experiments find that larger training sets require different types of algorithms to those currently used. Some insight into the characteristics of suitable algorithms is provided, which may give some direction for the development of future classification prediction algorithms designed specifically for use with large data sets. The second question investigated is the role of sampling in machine learning with large data sets. Sampling has long been used to avoid the need to scale up algorithms to suit the size of the data set, by instead scaling down the size of the data set to suit the algorithm. However, the costs of performing sampling have not been widely explored. Two popular sampling methods are compared with learning from all available data in terms of predictive accuracy, model complexity, and execution time. The comparison shows that sub-sampling generally produces models with accuracy close to, and sometimes greater than, that obtainable from learning with all available data. This result suggests that it may be possible to develop algorithms that take advantage of the sub-sampling methodology to reduce the time required to infer a model while sacrificing little, if any, accuracy. Methods of improving effective and efficient learning via sampling are also investigated, and new sampling methodologies are proposed. These methodologies include using a varying proportion of instances to determine the next inference step, and using a statistical calculation at each inference step to determine a sufficient sample size. Experiments show that using a statistical calculation of sample size can not only substantially reduce execution time but can do so with only a small loss, and occasional gain, in accuracy. One of the common uses of sampling is in the construction of learning curves. Learning curves are often used to attempt to determine the optimal training size that will maximally reduce execution time while not being detrimental to accuracy. An analysis of the performance of methods for detecting convergence of learning curves is performed, focusing on methods that calculate the gradient of the tangent to the curve. Given that such methods can be susceptible to local accuracy plateaus, an investigation into the frequency of local plateaus is also performed.
It is shown that local accuracy plateaus are a common occurrence, and that ensuring a small loss of accuracy often results in greater computational cost than learning from all available data. These results cast doubt over the applicability of gradient-of-tangent methods for detecting convergence, and over the viability of learning curves for reducing execution time in general.
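
A minimal sketch of the progressive-sampling idea discussed above: grow the training sample geometrically and stop when the gradient of the tangent to the learning curve falls below a tolerance. The classifier, data set and tolerance are illustrative choices, not those of the dissertation, and the sketch inherits the local-plateau weakness the analysis identifies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

sizes, accs = [], []
n, tol = 500, 1e-6            # initial sample size; gradient tolerance
while n <= len(X_tr):
    model = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
    sizes.append(n)
    accs.append(model.score(X_te, y_te))
    if len(accs) >= 2:
        grad = (accs[-1] - accs[-2]) / (sizes[-1] - sizes[-2])
        # Caveat from the text: a small gradient may be a *local* accuracy
        # plateau rather than true convergence of the learning curve.
        if grad < tol:
            break
    n *= 2                    # geometric growth of the training sample

print(sizes[-1], accs[-1])
```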

Relevance:

90.00%

Publisher:

Abstract:

The Recursive Auto-Associative Memory (RAAM) has come to dominate connectionist investigations into representing compositional structure. Although an adequate model when dealing with limited data, the capacity of RAAM to scale up to real-world tasks has frequently been questioned. RAAM networks are difficult to train (due to the moving-target effect) and, as such, training times can be lengthy. Investigations into RAAM have produced many variants in an attempt to overcome such limitations. We outline how one such model, (S)RAAM, is able to quickly produce context-sensitive representations that may be used to aid a deterministic parsing process. By substituting (S)RAAM for the symbolic stack in an existing hybrid parser, we show that it is more than capable of encoding the real-world data sets employed. We conclude by suggesting that models such as (S)RAAM offer valuable insights into the features of connectionist compositional representations.

Relevance:

90.00%

Publisher:

Abstract:

This paper presents a novel Bayesian formulation to exploit shared structures across multiple data sources, constructing foundations for effective mining and retrieval across disparate domains. We jointly analyze diverse data sources using a unifying piece of metadata (textual tags). We propose a method based on Bayesian Probabilistic Matrix Factorization (BPMF) which is able to explicitly model the partial knowledge common to the datasets using shared subspaces and the knowledge specific to each dataset using individual subspaces. For the proposed model, we derive an efficient algorithm for learning the joint factorization based on Gibbs sampling. The effectiveness of the model is demonstrated by social media retrieval tasks across single and multiple media. The proposed solution is applicable to a wider context, providing a formal framework suitable for exploiting individual as well as mutual knowledge present across heterogeneous data sources of many kinds.
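
The full model is a Bayesian probabilistic matrix factorization learned by Gibbs sampling; the sketch below is only a simplified point-estimate analogue of the shared/individual subspace structure, fitting two hypothetical tag-count matrices by gradient descent with one factor common to both sources and one private to each.

```python
import numpy as np

rng = np.random.default_rng(3)
d, ks, kp = 100, 5, 3                      # tags, shared dims, private dims
X1 = rng.random((200, d))                  # placeholder source 1 (e.g. blogs)
X2 = rng.random((150, d))                  # placeholder source 2 (e.g. photos)

U1, U2 = rng.normal(0, .1, (200, ks)), rng.normal(0, .1, (150, ks))
A1, A2 = rng.normal(0, .1, (200, kp)), rng.normal(0, .1, (150, kp))
Vs = rng.normal(0, .1, (ks, d))            # subspace shared by both sources
V1, V2 = rng.normal(0, .1, (kp, d)), rng.normal(0, .1, (kp, d))

lr = 0.001
for _ in range(500):
    E1 = U1 @ Vs + A1 @ V1 - X1            # reconstruction residuals
    E2 = U2 @ Vs + A2 @ V2 - X2
    U1 -= lr * E1 @ Vs.T; A1 -= lr * E1 @ V1.T
    U2 -= lr * E2 @ Vs.T; A2 -= lr * E2 @ V2.T
    Vs -= lr * (U1.T @ E1 + U2.T @ E2)     # the shared factor sees both sources
    V1 -= lr * A1.T @ E1; V2 -= lr * A2.T @ E2

print(np.linalg.norm(E1), np.linalg.norm(E2))
```

Queries against the shared subspace Vs support retrieval across both media, while the individual subspaces V1 and V2 absorb source-specific structure.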

Relevance:

90.00%

Publisher:

Abstract:

In this paper, a hybrid intelligent system that integrates the Self-Organizing Map (SOM) neural network, the kernel-based Maximum Entropy learning Rule (kMER), and the Probabilistic Neural Network (PNN) for data visualization and classification is proposed. The rationale of this Probabilistic SOM-kMER model is explained, and its applicability is demonstrated using two benchmark data sets. The results are analyzed and compared with those from a number of existing methods. The implications of the proposed hybrid system as a useful and usable data visualization and classification tool are discussed.
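
The SOM and kMER stages (which position prototypes and tune kernel widths) are not reproduced here, but the PNN stage can be sketched compactly: it is a Parzen-window classifier that averages a Gaussian kernel response per class. The data set and kernel width below are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def pnn_predict(X_train, y_train, X_test, sigma=0.5):
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        # Pattern layer: Gaussian kernel response of every training pattern.
        k = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * sigma**2))
        # Summation layer: average kernel activation per class.
        scores = [k[y_train == c].mean() for c in classes]
        preds.append(classes[np.argmax(scores)])   # decision layer
    return np.array(preds)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
acc = (pnn_predict(X_tr, y_tr, X_te) == y_te).mean()
print(acc)
```

In the hybrid model, the SOM-kMER prototypes would replace the raw training patterns in the pattern layer, shrinking the network considerably.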

Relevance:

90.00%

Publisher:

Abstract:

Protein mass spectrometry (MS) pattern recognition has recently emerged as a new method for cancer diagnosis. Unfortunately, classification performance may degrade owing to the enormously high dimensionality of the data. This paper investigates the use of Random Projection (RP) in protein MS data dimensionality reduction. The effectiveness of RP is analyzed and compared against Principal Component Analysis (PCA) using three classification algorithms, namely Support Vector Machine, Feed-forward Neural Networks and K-Nearest Neighbour. Three real-world cancer data sets are employed to evaluate the performance of RP and PCA. Through the investigations, the RP method demonstrated classification performance better than, or at least comparable to, that of PCA, provided the dimensionality of the projection matrix is sufficiently large. This paper also explores the use of RP as a pre-processing step prior to PCA. The results show that, without sacrificing classification accuracy, performing RP prior to PCA significantly improves the computational time.
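
A sketch of the comparison on placeholder data (the paper's actual cancer data sets and classifier settings are not reproduced): random projection versus PCA feeding a k-NN classifier, followed by RP as a cheap pre-processing step before PCA.

```python
import time
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10_000))      # placeholder high-dimensional MS profiles
y = rng.integers(0, 2, size=200)        # placeholder cancer/control labels

for name, reducer in [("RP", GaussianRandomProjection(n_components=300, random_state=0)),
                      ("PCA", PCA(n_components=100))]:
    t0 = time.perf_counter()
    Z = reducer.fit_transform(X)
    acc = cross_val_score(KNeighborsClassifier(), Z, y, cv=5).mean()
    print(name, acc, time.perf_counter() - t0)

# RP first (cheap), then PCA on the much smaller projected matrix.
Z = PCA(n_components=100).fit_transform(
    GaussianRandomProjection(n_components=300, random_state=0).fit_transform(X))
print(Z.shape)
```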

Relevance:

90.00%

Publisher:

Abstract:

One of the issues associated with pattern classification using data-based machine learning systems is the “curse of dimensionality”. In this paper, the circle-segments method is proposed as a feature selection method to identify important input features before the entire data set is provided for learning with machine learning systems. Specifically, four machine learning systems are deployed for classification, viz. the Multilayer Perceptron (MLP), Support Vector Machine (SVM), Fuzzy ARTMAP (FAM), and k-Nearest Neighbour (kNN). The integration between the circle-segments method and the machine learning systems has been applied to two case studies, one comprising a benchmark data set and the other a real data set. Overall, the results after feature selection using the circle-segments method demonstrate improvements in performance, even with more than 50% of the input features eliminated from the original data sets.
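
The circle-segments method itself is a visual technique for judging feature importance, which does not lend itself to a few lines of code; in the sketch below a univariate statistical ranking stands in for it, dropping the weaker half of the features and comparing classifier accuracy before and after, in the spirit of the experiments described.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
scores, _ = f_classif(X, y)                     # stand-in feature ranking
top = np.argsort(scores)[-(X.shape[1] // 2):]   # keep the strongest 50%

for clf in (KNeighborsClassifier(), SVC()):
    full = cross_val_score(clf, X, y, cv=5).mean()
    sel = cross_val_score(clf, X[:, top], y, cv=5).mean()
    print(type(clf).__name__, full, sel)
```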

Relevance:

90.00%

Publisher:

Abstract:

Many techniques used to model ecosystems cannot be meaningfully applied to large-scale ecological problems due to data constraints. Disparate collection methods, data types and incomplete data sets, or limited theoretical understanding mean that a wide range of modelling techniques used to model physical processes or for problems specific to species or populations cannot be used at an ecosystem scale. In developing an ecological response model for the Coorong, a South Australian hypersaline estuary, we combined several flexible modelling approaches in a statistical framework to develop an approach we call ‘ecosystem states’. This model uses simulated hydrodynamic conditions as input to predict one of a suite of states per space and time, allowing prediction of likely ecological conditions under a variety of scenarios. Each ecosystem state has defined sets of biota and physico-chemical parameters. The existing model is limited in that its predictions have yet to be tested and, as yet, no spatial or temporal connectivity has been incorporated into simulated time series of ecosystem states. This approach can be used in a wide range of ecosystems, where enough data are available to model ecosystem states. We are in the process of applying the technique to a nearby lake system. This has been more difficult than for the Coorong as there is little overlap in the spatial and temporal coverage of biological data sets for that region. The approach is robust to low-quality biological data and missing environmental data, so should suit situations where community or management monitoring programs have occurred through time.
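
Schematically, the 'ecosystem states' approach amounts to a classifier that maps simulated hydrodynamic conditions at a given place and time to one of a small set of discrete states. The sketch below is only that schematic: the input variables, state labels and classifier are hypothetical, and the actual Coorong model is a statistical framework rather than a single off-the-shelf classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
# Hypothetical hydrodynamic inputs: e.g. salinity, depth, water level, flow.
X = rng.random((1000, 4))
states = rng.integers(0, 6, size=1000)   # placeholder observed state labels

model = RandomForestClassifier(random_state=0).fit(X, states)

# Predict the likely ecosystem state per space/time cell under a scenario.
scenario = rng.random((50, 4))           # simulated conditions for a scenario
predicted_states = model.predict(scenario)
print(predicted_states[:10])
```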

Relevance:

90.00%

Publisher:

Abstract:

Recent advances in telemetry technology have created a wealth of tracking data available for many animal species moving over spatial scales from tens of meters to tens of thousands of kilometers. Increasingly, such data sets are being used for quantitative movement analyses aimed at extracting fundamental biological signals such as optimal searching behavior and scale-dependent foraging decisions. We show here that the location error inherent in various tracking technologies reduces the ability to detect patterns of behavior within movements. Our analyses endeavored to set out a series of initial ground rules for ecologists to help ensure that sampling noise is not misinterpreted as a real biological signal. We simulated animal movement tracks using specialized random walks known as Lévy flights at three spatial scales of investigation: 100-km, 10-km, and 1-km maximum daily step lengths. The locations generated in the simulations were then blurred using known error distributions associated with commonly applied tracking methods: the Global Positioning System (GPS), Argos polar-orbiting satellites, and light-level geolocation. Deviations from the idealized Lévy flight pattern were assessed for each track after incrementing levels of location error were applied at each spatial scale, with additional assessments of the effect of error on scale-dependent movement patterns measured using fractal mean dimension and first-passage time (FPT) analyses. The accuracy of parameter estimation (Lévy μ, fractal mean D, and variance in FPT) declined precipitously at threshold errors relative to each spatial scale. At 100-km maximum daily step lengths, error standard deviations of ≥10 km seriously eroded the biological patterns evident in the simulated tracks, with analogous thresholds at the 10-km and 1-km scales (error SD ≥ 1.3 km and 0.07 km, respectively). Temporal subsampling of the simulated tracks maintained some elements of the biological signals depending on error level and spatial scale. Failure to account for large errors relative to the scale of movement can produce substantial biases in the interpretation of movement patterns. This study provides researchers with a framework for understanding the limitations of their data and identifies how temporal subsampling can help to reduce the influence of spatial error on their conclusions.
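
A condensed version of the simulation logic, with illustrative rather than published settings: generate a Lévy flight by inverse-transform sampling of power-law step lengths, blur the locations with Gaussian error of increasing standard deviation, and re-estimate the Lévy exponent μ by maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(6)
mu_true, x_min, n = 2.0, 1.0, 5_000

# Power-law step lengths p(x) ~ x^(-mu) via inverse-transform sampling.
steps = x_min * (1 - rng.random(n)) ** (-1 / (mu_true - 1))
angles = rng.uniform(0, 2 * np.pi, n)
track = np.cumsum(np.c_[steps * np.cos(angles), steps * np.sin(angles)], axis=0)

def mu_mle(lengths, x_min):
    # Clauset-style maximum likelihood estimate of the power-law exponent.
    lengths = lengths[lengths >= x_min]
    return 1 + len(lengths) / np.sum(np.log(lengths / x_min))

for error_sd in (0.0, 0.1, 1.0, 10.0):   # location error, same units as steps
    blurred = track + rng.normal(0, error_sd, track.shape)
    lengths = np.linalg.norm(np.diff(blurred, axis=0), axis=1)
    print(error_sd, mu_mle(lengths, x_min))
```

As the error standard deviation grows relative to the step-length scale, the estimated μ drifts away from the true value, which is the degradation of the biological signal the study quantifies.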

Relevance:

90.00%

Publisher:

Abstract:

There is currently no universally recommended and accepted method of data processing within the science of indirect calorimetry for either mixing chamber or breath-by-breath systems of expired gas analysis. Exercise physiologists were first surveyed to determine methods used to process oxygen consumption (V̇O2) data, and current attitudes to data processing within the science of indirect calorimetry. Breath-by-breath datasets obtained from indirect calorimetry during incremental exercise were then used to demonstrate the consequences of commonly used time, breath and digital filter post-acquisition data processing strategies. Assessment of the variability in breath-by-breath data was determined using multiple regression based on the independent variables ventilation (VE), and the expired gas fractions for oxygen and carbon dioxide, FEO2 and FECO2, respectively. Based on the results of explanation of variance of the breath-by-breath V̇O2 data, methods of processing to remove variability were proposed for time-averaged, breath-averaged and digital filter applications. Among exercise physiologists, the strategy used to remove the variability in sequential V̇O2 measurements varied widely, and consisted of time averages (30 sec [38%], 60 sec [18%], 20 sec [11%], 15 sec [8%]), a moving average of five to 11 breaths (10%), and the middle five of seven breaths (7%). Most respondents indicated that they used multiple criteria to establish maximum V̇O2 (V̇O2max), including: the attainment of age-predicted maximum heart rate (HRmax) [53%], respiratory exchange ratio (RER) >1.10 (49%) or RER >1.15 (27%), and a rating of perceived exertion (RPE) of >17, 18 or 19 (20%). The reasons stated for these strategies included their own beliefs (32%), what they were taught (26%), what they read in research articles (22%), tradition (13%) and the influence of their colleagues (7%). The combination of VE, FEO2 and FECO2 removed 96-98% of V̇O2 breath-by-breath variability in incremental and steady-state exercise V̇O2 data sets, respectively. Correction of residual error in V̇O2 datasets to 10% of the raw variability results from application of a 30-second time average, a 15-breath running average, or a 0.04 Hz low cut-off digital filter. Thus, we recommend that once these data processing strategies are used, the peak or maximal value becomes the highest processed datapoint. Exercise physiologists need to agree on, and continually refine through empirical research, a consistent process for analysing data from indirect calorimetry.
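
The three recommended strategies can be sketched as follows on synthetic breath-by-breath data (the survey responses and real datasets are not reproduced); note that the digital filter requires resampling the irregular breath series onto a uniform time base first.

```python
import numpy as np
from scipy import signal, interpolate

rng = np.random.default_rng(7)
t = np.cumsum(rng.uniform(1.5, 4.0, 600))             # breath times, s (irregular)
vo2 = 1.0 + 0.002 * t + rng.normal(0, 0.25, t.size)   # noisy incremental-test VO2, L/min

# 30-second time average.
bins = np.arange(t[0], t[-1], 30)
time_avg = [vo2[(t >= a) & (t < a + 30)].mean() for a in bins]

# 15-breath running average.
breath_avg = np.convolve(vo2, np.ones(15) / 15, mode="valid")

# 0.04 Hz low cut-off digital filter, applied after resampling to 1 Hz.
t_uniform = np.arange(t[0], t[-1], 1.0)
vo2_uniform = interpolate.interp1d(t, vo2)(t_uniform)
b, a = signal.butter(2, 0.04, btype="low", fs=1.0)
filtered = signal.filtfilt(b, a, vo2_uniform)

# Peak VO2 is then taken as the highest *processed* data point.
print(max(time_avg), breath_avg.max(), filtered.max())
```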

Relevance:

90.00%

Publisher:

Abstract:

Supervisory Control and Data Acquisition (SCADA) systems control and monitor industrial and critical infrastructure functions, such as electricity, gas, water, waste, railway, and traffic. Recent attacks on SCADA systems highlight the need for stronger SCADA security. Sharing SCADA traffic data has thus become a vital requirement in SCADA systems for analyzing security risks and developing appropriate security solutions. However, inappropriate sharing and usage of SCADA data could threaten the privacy of companies and prevent data from being shared at all. In this paper, we present a privacy-preserving, strategy-based permutation technique called the PPFSCADA framework, in which data privacy, statistical properties and data mining utility can be controlled at the same time. In particular, our proposed approach involves: (i) vertically partitioning the original data set to improve the performance of perturbation; (ii) developing a framework to deal with various types of network traffic data, including numerical, categorical and hierarchical attributes; (iii) grouping the partitioned sets into a number of clusters based on the proposed framework; and (iv) accomplishing the perturbation process by replacing each original attribute value with a new value (its cluster centroid). The effectiveness of the proposed PPFSCADA framework is shown through several experiments on simulated SCADA, intrusion detection and network traffic data sets. Through experimental analysis, we show that PPFSCADA effectively deals with multivariate traffic attributes, produces results comparable to those from the original data, substantially improves the performance of the five supervised approaches, and provides a high level of privacy protection.
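
A minimal sketch of the centroid-replacement step (iv) on hypothetical numerical traffic attributes; the vertical partitioning and the handling of categorical and hierarchical attributes from steps (i)-(iii) are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
traffic = rng.random((5_000, 6))       # placeholder numerical SCADA attributes

km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(traffic)
perturbed = km.cluster_centers_[km.labels_]   # each row -> its cluster centroid

# Coarse statistical structure survives the perturbation; exact readings do not.
print(np.allclose(traffic.mean(axis=0), perturbed.mean(axis=0), atol=0.02))
```

Replacing values with centroids keeps cluster-level statistics usable for mining while masking the original per-record measurements, which is the trade-off the framework tunes.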

Relevance:

90.00%

Publisher:

Abstract:

We address the problem of estimating the running quantile of a data stream when the memory available for storing observations is limited. We (i) highlight the limitations of approaches previously described in the literature that make them unsuitable for non-stationary streams, (ii) describe a novel principle for utilizing the available storage space, and (iii) introduce two novel algorithms that exploit the proposed principle. Experiments on three large real-world data sets demonstrate that the proposed methods vastly outperform the existing alternatives.
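
The paper's own algorithms are not described in enough detail here to reproduce; the sketch below shows a classic fixed-memory baseline instead — reservoir sampling plus the quantile of the reservoir — which exhibits exactly the non-stationarity weakness the paper targets, since a uniform reservoir never forgets old observations.

```python
import numpy as np

def running_quantile(stream, q, capacity=1000, seed=0):
    rng = np.random.default_rng(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if len(reservoir) < capacity:
            reservoir.append(x)
        else:
            j = rng.integers(0, i + 1)      # classic reservoir replacement rule
            if j < capacity:
                reservoir[j] = x
        yield np.quantile(reservoir, q)     # estimate from the bounded sample

rng = np.random.default_rng(9)
# A drifting (non-stationary) stream: the true quantile rises over time.
stream = rng.normal(0, 1, 20_000) + np.linspace(0, 5, 20_000)
*_, last = running_quantile(stream, 0.95)
print(last)   # lags the true current 95th percentile because of the drift
```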