960 results for data sets


Relevance:

70.00%

Publisher:

Abstract:

As we enter an era of ‘big data’, asset information is becoming a deliverable of complex projects. Prior research suggests digital technologies enable rapid, flexible forms of project organizing. This research analyses practices of managing change in Airbus, CERN and Crossrail, through desk-based review, interviews, visits and a cross-case workshop. These organizations deliver complex projects, rely on digital technologies to manage large data sets, and use configuration management, a systems engineering approach with mid-20th-century origins, to establish and maintain integrity. In them, configuration management has become more, rather than less, important. Asset information is structured, with change managed through digital systems, using relatively hierarchical, asynchronous and sequential processes. The paper contributes by uncovering limits to flexibility in complex projects where integrity is important. Challenges of managing change are discussed, considering the evolving nature of configuration management; the potential use of analytics on complex projects; and implications for research and practice.

Relevance:

70.00%

Publisher:

Abstract:

This paper reviews the literature concerning the practice of using Online Analytical Processing (OLAP) systems to recall information stored by Online Transactional Processing (OLTP) systems. Such a review provides a basis for discussing the need for information recalled through OLAP systems to maintain the contexts of the transactions captured by the respective OLTP system. The paper observes an industry trend in which OLTP systems process information into data that are then stored in databases without the business rules used to produce them. This necessitates a practice whereby sets of business rules are used to extract, cleanse, transform and load data from disparate OLTP systems into OLAP databases to support the requirements for complex reporting and analytics. These sets of business rules are usually not the same as the business rules used to capture the data in particular OLTP systems. The paper argues that differences between the business rules used to interpret the same data sets risk gaps in semantics between the information captured by OLTP systems and the information recalled through OLAP systems. Literature concerning the modelling of business transaction information as facts with context, as part of the modelling of information systems, was reviewed to identify design trends that contribute to the design quality of OLTP and OLAP systems. The paper then argues that the quality of OLTP and OLAP systems design depends critically on the capture of facts with associated context; the encoding of facts with context into data with business rules; the storage and sourcing of data with business rules; the decoding of data with business rules back into facts with context; and the recall of facts with associated context. The paper proposes UBIRQ, a design model to aid the co-design of data and business-rule storage for OLTP and OLAP purposes. The proposed design model provides the opportunity to implement and use multi-purpose databases and business-rule stores for OLTP and OLAP systems. Such implementations would enable OLTP systems to record and store data together with the executions of business rules, allowing both OLTP and OLAP systems to query data alongside the business rules used to capture them, thereby ensuring that information recalled via OLAP systems preserves the contexts of transactions as per the data captured by the respective OLTP system.
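
The abstract does not give UBIRQ's actual schema, so the following is only a minimal sketch, with hypothetical class and field names, of the general principle it argues for: storing each fact together with the business rule used to capture it, so that OLAP-style recall can restore the transaction's context.

# Illustrative sketch only: the abstract does not specify UBIRQ's schema, so the class
# and field names below are hypothetical. The idea shown is storing each transactional
# fact together with a reference to the business rule that produced it, so an
# analytical (OLAP-style) query can recover the original context.
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class BusinessRule:
    rule_id: str
    description: str           # human-readable statement of the rule
    expression: str            # e.g. a validation or derivation expression

@dataclass(frozen=True)
class Fact:
    fact_id: str
    values: Dict[str, float]   # the transactional measurements
    rule_id: str               # context: which rule captured/derived the values

class SharedStore:
    """A single store queried by both OLTP-style and OLAP-style consumers."""
    def __init__(self) -> None:
        self.rules: Dict[str, BusinessRule] = {}
        self.facts: List[Fact] = []

    def record(self, fact: Fact, rule: BusinessRule) -> None:
        self.rules[rule.rule_id] = rule
        self.facts.append(fact)

    def recall_with_context(self, fact_id: str):
        """OLAP-style recall that returns the fact plus the rule used to capture it."""
        fact = next(f for f in self.facts if f.fact_id == fact_id)
        return fact, self.rules[fact.rule_id]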

Relevance:

70.00%

Publisher:

Abstract:

The size and complexity of data sets generated within ecosystem-level programmes merit their capture, curation, storage and analysis, synthesis and visualisation using Big Data approaches. This review looks at previous attempts to organise and analyse such data through the International Biological Programme and draws on the mistakes made and the lessons learned for effective Big Data approaches to current Research Councils United Kingdom (RCUK) ecosystem-level programmes, using Biodiversity and Ecosystem Service Sustainability (BESS) and the Environmental Virtual Observatory Pilot (EVOp) as exemplars. The challenges raised by such data are identified and explored, and suggestions are made for the two major issues of extending analyses across different spatio-temporal scales and of effectively integrating quantitative and qualitative data.

Relevance:

70.00%

Publisher:

Abstract:

A quality assessment of the CFC-11 (CCl3F), CFC-12 (CCl2F2), HF, and SF6 products from limb-viewing satellite instruments is provided by means of a detailed intercomparison. The climatologies in the form of monthly zonal mean time series are obtained from HALOE, MIPAS, ACE-FTS, and HIRDLS within the time period 1991–2010. The intercomparisons focus on the mean biases of the monthly and annual zonal mean fields and aim to identify their vertical, latitudinal and temporal structure. The CFC evaluations (based on MIPAS, ACE-FTS and HIRDLS) reveal that the uncertainty in our knowledge of the atmospheric CFC-11 and CFC-12 mean state, as given by satellite data sets, is smallest in the tropics and mid-latitudes at altitudes below 50 and 20 hPa, respectively, with a 1σ multi-instrument spread of up to ±5 %. For HF, the situation is reversed. The two available data sets (HALOE and ACE-FTS) agree well above 100 hPa, with a spread in this region of ±5 to ±10 %, while at altitudes below 100 hPa the HF annual mean state is less well known, with a spread of ±30 % and larger. The atmospheric SF6 annual mean states derived from two satellite data sets (MIPAS and ACE-FTS) show only very small differences with a spread of less than ±5 % and often below ±2.5 %. While the overall agreement among the climatological data sets is very good for large parts of the upper troposphere and lower stratosphere (CFCs, SF6) or middle stratosphere (HF), individual discrepancies have been identified. Pronounced deviations between the instrument climatologies exist for particular atmospheric regions which differ from gas to gas. Notable features are differently shaped isopleths in the subtropics, deviations in the vertical gradients in the lower stratosphere and in the meridional gradients in the upper troposphere, and inconsistencies in the seasonal cycle. Additionally, long-term drifts between the instruments have been identified for the CFC-11 and CFC-12 time series. The evaluations as a whole provide guidance on which data sets are the most reliable for applications such as studies of atmospheric transport and variability, model–measurement comparisons and detection of long-term trends. The data sets will be publicly available from the SPARC Data Centre and through PANGAEA (doi:10.1594/PANGAEA.849223).
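
As a rough illustration of the kind of evaluation described, the sketch below (not the SPARC assessment code; the instrument names are reused but the data are synthetic placeholders) computes a multi-instrument mean and the 1σ spread, expressed in percent, from monthly zonal-mean climatologies on a common grid.

# Illustrative sketch: given per-instrument arrays of monthly zonal-mean mixing ratios
# on a common (month, pressure level, latitude) grid, compute the multi-instrument mean
# and the 1-sigma spread expressed as a percentage of that mean.
import numpy as np

def multi_instrument_spread(climatologies):
    """climatologies: dict of name -> array with shape (months, levels, lats)."""
    stack = np.stack(list(climatologies.values()))   # (instruments, months, levels, lats)
    mean = np.nanmean(stack, axis=0)
    spread = np.nanstd(stack, axis=0, ddof=1)        # 1-sigma spread across instruments
    return mean, 100.0 * spread / mean               # spread relative to the mean, in percent

rng = np.random.default_rng(0)
fake = {name: 250.0 + rng.normal(0, 5, size=(12, 30, 36))   # placeholder, ~250 ppt CFC-11-like
        for name in ("MIPAS", "ACE-FTS", "HIRDLS")}
mean, spread_pct = multi_instrument_spread(fake)
print(spread_pct.mean())   # average relative spread over the grid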

Relevance:

70.00%

Publisher:

Abstract:

Precipitation and temperature climate indices are calculated using the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis and validated against observational data from some stations over Brazil and other data sources. The spatial patterns of the climate indices trends are analyzed for the period 1961-1990 over South America. In addition, the correlation and linear regression coefficients for some specific stations were also obtained in order to compare with the reanalysis data. In general, the results suggest that NCEP/NCAR reanalysis can provide useful information about minimum temperature and consecutive dry days indices at individual grid cells in Brazil. However, some regional differences in the climate indices trends are observed when different data sets are compared. For instance, the NCEP/NCAR reanalysis shows a reversal signal for all rainfall annual indices and the cold night index over Argentina. Despite these differences, maps of the trends for most of the annual climate indices obtained from the NCEP/NCAR reanalysis and BRANT analysis are generally in good agreement with other available data sources and previous findings in the literature for large areas of southern South America. The pattern of trends for the precipitation annual indices over the 30 years analyzed indicates a change to wetter conditions over southern and southeastern parts of Brazil, Paraguay, Uruguay, central and northern Argentina, and parts of Chile and a decrease over southwestern South America. All over South America, the climate indices related to the minimum temperature (warm or cold nights) have clearly shown a warming tendency; however, no consistent changes in maximum temperature extremes (warm and cold days) have been observed. Therefore, one must be careful before suggesting any trends for warm or cold days.
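
A minimal sketch of the kind of index-plus-trend calculation described, assuming a simple consecutive-dry-days definition and synthetic daily data (neither taken from the study):

# Illustrative sketch: compute the annual "consecutive dry days" index from a daily
# precipitation series and fit a linear trend, as one might do for a single grid cell
# or station. The wet-day threshold and the synthetic data are placeholders.
import numpy as np

def consecutive_dry_days(precip_mm, wet_threshold=1.0):
    """Length of the longest run of days with precipitation below the wet-day threshold."""
    longest = run = 0
    for p in precip_mm:
        run = run + 1 if p < wet_threshold else 0
        longest = max(longest, run)
    return longest

rng = np.random.default_rng(1)
years = np.arange(1961, 1991)
annual_cdd = np.array([consecutive_dry_days(rng.exponential(3.0, size=365)) for _ in years])

slope, intercept = np.polyfit(years, annual_cdd, deg=1)   # linear trend in days per year
print(f"CDD trend: {slope:.3f} days/year")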

Relevance:

70.00%

Publisher:

Abstract:

Astronomy has evolved almost exclusively by the use of spectroscopic and imaging techniques, operated separately. With the development of modern technologies, it is possible to obtain data cubes in which one combines both techniques simultaneously, producing images with spectral resolution. To extract information from them can be quite complex, and hence the development of new methods of data analysis is desirable. We present a method of analysis of data cubes (data from single-field observations, containing two spatial and one spectral dimension) that uses Principal Component Analysis (PCA) to express the data in a form of reduced dimensionality, facilitating efficient information extraction from very large data sets. PCA transforms the system of correlated coordinates into a system of uncorrelated coordinates ordered by principal components of decreasing variance. The new coordinates are referred to as eigenvectors, and the projections of the data on to these coordinates produce images we will call tomograms. The association of the tomograms (images) to eigenvectors (spectra) is important for the interpretation of both. The eigenvectors are mutually orthogonal, and this information is fundamental for their handling and interpretation. When the data cube shows objects that present uncorrelated physical phenomena, the eigenvectors' orthogonality may be instrumental in separating and identifying them. By handling eigenvectors and tomograms, one can enhance features, extract noise, compress data, extract spectra, etc. We applied the method, for illustration purposes only, to the central region of the low ionization nuclear emission region (LINER) galaxy NGC 4736, and demonstrate that it has a type 1 active nucleus, not known before. Furthermore, we show that it is displaced from the centre of its stellar bulge.
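
The following sketch illustrates the general PCA-tomography procedure described above (it is not the authors' implementation): the cube is flattened into a pixels-by-wavelengths matrix, decomposed by SVD, and the projections are reshaped back into tomogram images.

# Illustrative sketch of PCA tomography: reshape a (ny, nx, n_lambda) data cube into a
# (pixels x wavelengths) matrix, run PCA via SVD, and reshape the projections back into
# "tomogram" images, one per eigenvector (eigenspectrum).
import numpy as np

def pca_tomography(cube):
    ny, nx, nl = cube.shape
    X = cube.reshape(ny * nx, nl)                # rows: spatial pixels, columns: wavelengths
    X = X - X.mean(axis=0)                       # remove the mean spectrum
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    eigenspectra = Vt                            # mutually orthogonal eigenvectors in wavelength space
    tomograms = (U * s).reshape(ny, nx, -1)      # projections of the data on to each eigenvector
    variance = s**2 / np.sum(s**2)               # fraction of variance per principal component
    return tomograms, eigenspectra, variance

cube = np.random.default_rng(2).normal(size=(40, 40, 200))   # placeholder cube
tomograms, eigenspectra, variance = pca_tomography(cube)
print(variance[:5])   # variance explained by the first five components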

Relevance:

70.00%

Publisher:

Abstract:

Phylogenetic analyses of representative species from the five genera of Winteraceae (Drimys, Pseudowintera, Takhtajania, Tasmannia, and Zygogynum s.l.) were performed using ITS nuclear sequences and a combined data set of ITS + psbA-trnH + rpS16 sequences (sampling of 30 and 15 species, respectively). Indel informativity using simple gap coding or gaps as a fifth character was examined in both data sets. Parsimony and Bayesian analyses support the monophyly of Drimys, Tasmannia, and Zygogynum s.l., but do not support the monophyly of Belliolum, Zygogynum s.s., and Bubbia. Within Drimys, the combined data set recovers two subclades. Divergence time estimates suggest that the splitting between Drimys and its sister clade (Pseudowintera + Zygogynum s.l.) occurred around the end of the Cretaceous; in contrast, the divergence between the two subclades within Drimys is more recent (15.5-18.5 MY) and coincides in time with the Andean uplift. Estimates suggest that the earliest divergences within Winteraceae could have predated the first events of Gondwana fragmentation. (C) 2009 Elsevier Inc. All rights reserved.

Relevance:

70.00%

Publisher:

Abstract:

This paper is concerned with the computational efficiency of fuzzy clustering algorithms when the data set to be clustered is described by a proximity matrix only (relational data) and the number of clusters must be automatically estimated from such data. A fuzzy variant of an evolutionary algorithm for relational clustering is derived and compared against two systematic (pseudo-exhaustive) approaches that can also be used to automatically estimate the number of fuzzy clusters in relational data. An extensive collection of experiments involving 18 artificial and two real data sets is reported and analyzed. (C) 2011 Elsevier B.V. All rights reserved.
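
As an illustration of the relational setting only (this is not the paper's evolutionary algorithm), the sketch below runs a basic fuzzy c-medoids loop directly on a dissimilarity matrix and estimates the number of clusters by a pseudo-exhaustive search over c, using the fuzzy partition coefficient as one simple validity index.

# Illustrative sketch: fuzzy c-medoids on a relational (dissimilarity) matrix, wrapped
# in a pseudo-exhaustive search over the number of clusters c.
import numpy as np

def fuzzy_c_medoids(D, c, m=2.0, iters=50, seed=0):
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=c, replace=False)
    for _ in range(iters):
        d = D[:, medoids] + 1e-12                       # dissimilarity of each object to each medoid
        u = 1.0 / (d ** (1.0 / (m - 1.0)))
        u = u / u.sum(axis=1, keepdims=True)            # fuzzy memberships, rows sum to 1
        # re-select each medoid as the object minimising the membership-weighted dissimilarity
        new = np.array([np.argmin((u[:, i] ** m) @ D) for i in range(c)])
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    return u, medoids

def partition_coefficient(u):
    return np.sum(u ** 2) / u.shape[0]   # closer to 1 means a crisper partition

def estimate_c(D, c_range=range(2, 8), seeds=(0, 1, 2)):
    """Pseudo-exhaustive search: keep the c with the highest partition coefficient."""
    def best_pc(c):
        return max(partition_coefficient(fuzzy_c_medoids(D, c, seed=s)[0]) for s in seeds)
    return max(c_range, key=best_pc)

# Toy usage: a dissimilarity matrix with two obvious groups.
rng = np.random.default_rng(4)
points = np.concatenate([rng.normal(0, 0.2, (15, 2)), rng.normal(3, 0.2, (15, 2))])
D = np.linalg.norm(points[:, None] - points[None, :], axis=2)
print(estimate_c(D))   # expected to pick 2 for this well-separated toy example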

Relevance:

70.00%

Publisher:

Abstract:

In this paper, we present an algorithm for cluster analysis that integrates aspects of cluster ensembles and multi-objective clustering. The algorithm is based on a Pareto-based multi-objective genetic algorithm, with a special crossover operator, which uses clustering validation measures as objective functions. The proposed algorithm can deal with data sets presenting different types of clusters, without the need for expertise in cluster analysis. Its result is a concise set of partitions representing alternative trade-offs among the objective functions. We compare the results obtained with our algorithm, in the context of gene expression data sets, to those achieved with Multi-Objective Clustering with automatic K-determination (MOCK), the algorithm most closely related to ours. (C) 2009 Elsevier B.V. All rights reserved.
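
The sketch below is not the proposed genetic algorithm; it only illustrates the Pareto idea with two commonly used clustering objectives (overall deviation and connectivity, both to be minimised, similar in spirit to MOCK's objectives) and the extraction of the non-dominated set of candidate partitions.

# Illustrative sketch: score candidate partitions with two validation measures and keep
# the non-dominated (Pareto) set.
import numpy as np

def compactness(X, labels):
    """Sum of distances from points to their cluster centroids (lower is better)."""
    return sum(np.linalg.norm(X[labels == k] - X[labels == k].mean(axis=0), axis=1).sum()
               for k in np.unique(labels))

def connectivity(X, labels, L=5):
    """Penalty when a point's L nearest neighbours fall in a different cluster (lower is better)."""
    order = np.argsort(np.linalg.norm(X[:, None] - X[None, :], axis=2), axis=1)[:, 1:L + 1]
    return sum(1.0 / (j + 1)
               for i, neigh in enumerate(order)
               for j, nn in enumerate(neigh) if labels[i] != labels[nn])

def pareto_front(scores):
    """Indices of non-dominated solutions in a (n_solutions, n_objectives) score array."""
    scores = np.asarray(scores)
    keep = []
    for i, s in enumerate(scores):
        dominated = np.any(np.all(scores <= s, axis=1) & np.any(scores < s, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Toy usage: score a handful of candidate partitions of random data and keep the front.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 2))
candidates = [rng.integers(0, k, size=60) for k in (2, 3, 4, 5)]
scores = [(compactness(X, lab), connectivity(X, lab)) for lab in candidates]
print(pareto_front(scores))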

Relevance:

70.00%

Publisher:

Abstract:

Clustering is a difficult problem, especially when we consider the task in the context of a data stream of categorical attributes. In this paper, we propose SCLOPE, a novel algorithm based on CLOPE's intuitive observation about cluster histograms. Unlike CLOPE, however, our algorithm is very fast and operates within the constraints of a data stream environment. In particular, we designed SCLOPE according to the recent CluStream framework. Our evaluation of SCLOPE shows very promising results. It consistently outperforms CLOPE in speed and scalability tests on our data sets while maintaining high cluster purity; it also supports cluster analysis that other algorithms in its class do not.
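
For context, the sketch below illustrates the CLOPE-style histogram "profit" criterion that SCLOPE builds on (it is not SCLOPE itself; the repulsion parameter r and the toy transactions are placeholders): each cluster keeps an item histogram, and a transaction joins the cluster whose profit gain is largest.

# Illustrative sketch of a CLOPE-style assignment pass over categorical transactions.
from collections import Counter

def delta_profit(cluster, transaction, r=2.0):
    """Change in S(C) * |C| / W(C)**r when the transaction is added to the cluster,
    where S is the total item count, W the number of distinct items, |C| the cluster size."""
    hist, size, n = cluster
    new_hist = hist + Counter(transaction)
    new_s = sum(new_hist.values())
    old = size * n / (len(hist) ** r) if hist else 0.0
    return new_s * (n + 1) / (len(new_hist) ** r) - old

def clope_assign(transactions, r=2.0):
    clusters, labels = [], []
    for t in transactions:
        gains = [delta_profit(c, t, r) for c in clusters]
        gains.append(delta_profit((Counter(), 0, 0), t, r))      # option: start a new cluster
        best = max(range(len(gains)), key=gains.__getitem__)
        if best == len(clusters):
            clusters.append((Counter(), 0, 0))
        hist, size, n = clusters[best]
        clusters[best] = (hist + Counter(t), size + sum(Counter(t).values()), n + 1)
        labels.append(best)
    return labels

print(clope_assign([["a", "b"], ["a", "b", "c"], ["x", "y"], ["x", "y", "z"]]))   # e.g. [0, 0, 1, 1]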

Relevance:

70.00%

Publisher:

Abstract:

Methods are presented for calculating minimum sample sizes necessary to obtain precise estimates of fungal spore dimensions. Using previously published spore-length data sets for Peronospora species, we demonstrate that 41–71 spores need to be measured to estimate the mean length with a reasonable level of statistical precision and resolution. This is further progressed with examples for calculating the minimum number of spore lengths to measure when matching an undetermined specimen to a known species. Although applied only to spore-length data, all described methods can be applied to any morphometric data that satisfy certain statistical assumptions.
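
A minimal sketch of this kind of sample-size calculation, assuming the standard t-based precision criterion (the formula is a textbook one and is not quoted from the paper; the example numbers are placeholders):

# Illustrative sketch: smallest n such that the confidence-interval half-width
# t_{n-1, 1-alpha/2} * s / sqrt(n) is no larger than the desired precision E,
# given a pilot estimate s of the standard deviation of spore length.
import math
from scipy import stats

def min_sample_size(s, E, confidence=0.95, max_iter=50):
    n = max(math.ceil((stats.norm.ppf(1 - (1 - confidence) / 2) * s / E) ** 2), 2)  # z-based start
    for _ in range(max_iter):
        t = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
        required = math.ceil((t * s / E) ** 2)
        if required <= n:
            return n
        n = required
    return n

# Example: pilot standard deviation of 2.5 micrometres, desired precision of 0.75 micrometres.
print(min_sample_size(s=2.5, E=0.75))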

Relevance:

70.00%

Publisher:

Abstract:

One common drawback of algorithms for learning linear causal models is that they cannot deal with incomplete data sets. This is unfortunate since many real problems involve missing data or even hidden variables. In this paper, based on multiple imputation, we propose a three-step process to learn linear causal models from an incomplete data set. Experimental results indicate that this algorithm is better than the single-imputation method (EM algorithm) and the simple list deletion method, and for lower missing rates this algorithm can even find models better than the results from the greedy learning algorithm MLGS working on a complete data set. In addition, the method is amenable to parallel or distributed processing, which is an important characteristic for data mining in large data sets.
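
The sketch below illustrates the multiple-imputation idea in general terms (it is not the paper's three-step algorithm or MLGS): the data set is completed several times, a linear model is fitted to each completed copy, and the coefficients are pooled by averaging.

# Illustrative sketch: multiple imputation followed by pooling of fitted linear coefficients.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

def multiply_imputed_coefficients(X, y, n_imputations=5):
    coefs = []
    for i in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        X_completed = imputer.fit_transform(X)
        coefs.append(LinearRegression().fit(X_completed, y).coef_)
    return np.mean(coefs, axis=0)          # pooled estimate across imputations

# Toy example with values missing completely at random.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.2] = np.nan
print(multiply_imputed_coefficients(X, y))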

Relevance:

70.00%

Publisher:

Abstract:

Chlamydiae are important pathogens of humans, birds and a wide range of animals. They are a unique group of bacteria, characterized by their developmental cycle. Chlamydiae have been difficult to study because of their obligate intracellular growth habit and the lack of a genetic transformation system. However, the past 5 years have seen the full genome sequencing of seven strains of Chlamydia and a rapid expansion of genomic, transcriptomic (RT-PCR, microarray) and proteomic analysis of these pathogens. The Chlamydia Interactive Database (CIDB) described here is the first database of its type that holds genomic, RT-PCR, microarray and proteomics data sets that can be cross-queried by researchers for patterns in the data. Combining the data of many research groups into a single database and cross-querying it from different perspectives should enhance our understanding of the complex cell biology of these pathogens. The database is available at: http://www3.it.deakin.edu.au:8080/CIDB/.

Relevance:

70.00%

Publisher:

Abstract:

A retrospective assessment of exposure to benzene was carried out for a nested case-control study of lympho-haematopoietic cancers, including leukaemia, in the Australian petroleum industry. Each job or task in the industry was assigned a Base Estimate (BE) of exposure derived from task-based personal exposure assessments carried out by the company occupational hygienists. The BEs corresponded to the estimated arithmetic mean exposure to benzene for each job or task and were used in a deterministic algorithm to estimate the exposure of subjects in the study. Nearly all of the data sets underlying the BEs were found to contain some values below the limit of detection (LOD) of the sampling and analytical methods, and some were very heavily censored; up to 95% of the data were below the LOD in some data sets. It was therefore necessary to use a method of calculating the arithmetic mean exposures that took the censored data into account. Three different methods were employed in an attempt to select the most appropriate method for the particular data in the study. A common method is to replace the missing (censored) values with half the detection limit. This method has been recommended for data sets where much of the data are below the limit of detection or where the data are highly skewed, with a geometric standard deviation of 3 or more. Another method, involving replacing the censored data with the limit of detection divided by the square root of 2, has been recommended when relatively few data are below the detection limit or where the data are not highly skewed. A third method that was examined is Cohen's method. This involves mathematical extrapolation of the left-hand tail of the distribution, based on the distribution of the uncensored data, and calculation of the maximum likelihood estimate of the arithmetic mean. When these three methods were applied to the data in this study, it was found that the first two simple methods gave similar results in most cases. Cohen's method, on the other hand, gave results that were generally, but not always, higher than the simpler methods and in some cases gave extremely high and even implausible estimates of the mean. It appears that if the data deviate substantially from a simple log-normal distribution, particularly if high outliers are present, then Cohen's method produces erratic and unreliable estimates. After examining these results, and both the distributions and proportions of censored data, it was decided that the half-limit-of-detection method was most suitable in this particular study.
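
The two simple substitution methods can be illustrated with a short sketch (Cohen's maximum-likelihood method is not reproduced here; the example values are placeholders, not data from the study):

# Illustrative sketch: arithmetic mean of a censored data set with non-detects replaced
# by LOD/2 or by LOD/sqrt(2).
import math
import numpy as np

def substituted_mean(detected, n_censored, lod, divisor):
    """Arithmetic mean with each value below the LOD replaced by lod / divisor."""
    values = np.concatenate([detected, np.full(n_censored, lod / divisor)])
    return values.mean()

detected = np.array([0.12, 0.35, 0.08, 0.51, 0.22])   # ppm benzene, example values only
n_censored, lod = 15, 0.05                            # a heavily censored data set
print("LOD/2     :", substituted_mean(detected, n_censored, lod, 2.0))
print("LOD/sqrt2 :", substituted_mean(detected, n_censored, lod, math.sqrt(2.0)))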

Relevance:

70.00%

Publisher:

Abstract:

Data grid is one of the domains in grid research that deals with the storage, replication, and management of large data sets in a distributed environment. All-data-to-all-sites replication schemes such as read-one-write-all and the tree grid structure (TGS) are popular techniques for the replication and management of data in this domain. However, these techniques have their weaknesses in terms of data storage capacity and data access times, because some number of sites must ‘agree’ in common to execute certain transactions. In this paper, we propose an all-data-to-some-sites scheme called the neighbor replication on triangular grid (NRTG) technique, in which only neighboring sites hold the replicated data, thus minimizing the storage capacity while providing high update availability. The technique also tolerates failures such as server failure, site failure, or even network partitioning, using remote procedure calls (RPC).
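
Because the abstract does not define the exact NRTG layout, the sketch below is only an assumption-laden illustration of the all-data-to-some-sites principle: a site's data are replicated to its immediate neighbors in a triangular lattice rather than to every site.

# Illustrative sketch only: the triangular-lattice indexing below is an assumption, not
# the NRTG definition. Row r of the lattice contains sites (r, 0)..(r, r).
def triangular_neighbors(row, col, n_rows):
    """Neighboring sites of site (row, col) in the assumed triangular lattice."""
    candidates = [
        (row, col - 1), (row, col + 1),          # same row
        (row - 1, col - 1), (row - 1, col),      # row above
        (row + 1, col), (row + 1, col + 1),      # row below
    ]
    return [(r, c) for r, c in candidates if 0 <= r < n_rows and 0 <= c <= r]

def replica_set(row, col, n_rows):
    """The primary site plus its neighbors form the replica set for that site's data."""
    return [(row, col)] + triangular_neighbors(row, col, n_rows)

print(replica_set(1, 0, n_rows=4))   # e.g. [(1, 0), (1, 1), (0, 0), (2, 0), (2, 1)]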