970 results for Datasets
Abstract:
Extracting frequent subtrees from tree-structured data has important applications in Web mining. In this paper, we introduce a novel canonical form for rooted labelled unordered trees called the balanced-optimal-search canonical form (BOCF), which handles the isomorphism problem efficiently. Using BOCF, we define a tree-structure-guided, scheme-based enumeration approach that systematically enumerates only the valid subtrees. Finally, we present the balanced optimal search tree miner (BOSTER) algorithm, based on BOCF and the proposed enumeration approach, for finding frequent induced subtrees from a database of labelled rooted unordered trees. Experiments on real datasets compare the efficiency of BOSTER against two state-of-the-art algorithms for mining induced unordered subtrees, HybridTreeMiner and UNI3. The results are encouraging.
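A minimal sketch of how a canonical form resolves isomorphism between rooted labelled unordered trees is shown below. It uses a generic recursive child-sorting encoding; this is not the BOCF construction itself, whose details are not given in the abstract.

```python
# Illustrative sketch only: a generic canonical string for rooted labelled
# unordered trees, obtained by recursively sorting child encodings.
# This is NOT the paper's BOCF; it merely shows how a canonical form makes
# isomorphic unordered trees compare equal.

def canonical_form(label, children):
    """children is a list of (label, children) pairs."""
    child_codes = sorted(canonical_form(l, c) for l, c in children)
    return "(" + label + "".join(child_codes) + ")"

# Two isomorphic unordered trees with children listed in different orders
t1 = ("A", [("B", []), ("C", [("D", [])])])
t2 = ("A", [("C", [("D", [])]), ("B", [])])

assert canonical_form(*t1) == canonical_form(*t2)
```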
Abstract:
This paper presents an algorithm for mining unordered embedded subtrees using the balanced-optimal-search canonical form (BOCF). A tree-structure-guided, scheme-based enumeration approach is defined using BOCF to systematically enumerate only the valid subtrees. Based on this canonical form and enumeration technique, the balanced optimal search embedded subtree mining algorithm (BEST) is introduced for mining embedded subtrees from a database of labelled rooted unordered trees. Extensive experiments on both synthetic and real datasets demonstrate the efficiency of BEST over two state-of-the-art algorithms for mining embedded unordered subtrees, SLEUTH and U3.
Abstract:
Recently, attempts to improve decision making in species management have focussed on uncertainties associated with modelling temporal fluctuations in populations. Reducing model uncertainty is challenging; while larger samples improve estimation of species trajectories and reduce statistical errors, they typically amplify variability in observed trajectories. In particular, traditional modelling approaches aimed at estimating population trajectories usually do not account well for the nonlinearities and uncertainties associated with the multi-scale observations characteristic of large spatio-temporal surveys. We present a Bayesian semi-parametric hierarchical model for simultaneously quantifying uncertainties associated with model structure and parameters, and scale-specific variability over time. We estimate uncertainty across a four-tiered spatial hierarchy of coral cover from the Great Barrier Reef. Coral variability is well described; however, our results show that, in the absence of additional model specifications, conclusions regarding coral trajectories become highly uncertain when considering multiple reefs, suggesting that management should focus more at the scale of individual reefs. The approach presented facilitates the description and estimation of population trajectories and associated uncertainties when variability cannot be attributed to specific causes and origins. We argue that our model can unlock the value contained in large-scale datasets, provide guidance for understanding sources of uncertainty, and support better-informed decision making.
Abstract:
A tag-based item recommendation method generates an ordered list of items likely to interest a particular user, using the user's past tagging behaviour. However, users' tagging behaviour varies across tagging systems. A key problem in generating quality recommendations is how to build user profiles that interpret user behaviour so it can be used effectively in recommendation models. Generally, recommendation methods are designed to work with specific types of user profiles and may not work well across different datasets. In this paper, we investigate several tagging data interpretation and representation schemes that can lead to building an effective user profile. We discuss the benefits each scheme brings to a recommendation method by highlighting the representative features of user tagging behaviour on a specific dataset. Empirical analysis shows that each interpretation scheme forms a distinct data representation, which ultimately affects the recommendation result. Results on various datasets show that an interpretation scheme should be selected based on the dominant usage in the tagging data (i.e. whether tags or items are more abundant). This usage characterises user tagging behaviour in the system. The results also demonstrate how the scheme is able to address the cold-start user problem.
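The sketch below illustrates, under assumed scheme names and weightings, how the same (user, item, tag) assignments can be interpreted as either a tag-centred or an item-centred user profile; it is not the paper's exact representation.

```python
from collections import Counter

# Hypothetical illustration of two interpretation schemes for tagging data:
# a tag-centred profile and an item-centred profile built from the same
# (user, item, tag) assignments. Names and weightings are assumptions.

assignments = [
    ("u1", "movie42", "sci-fi"),
    ("u1", "movie42", "space"),
    ("u1", "book7", "sci-fi"),
]

def tag_profile(user, data):
    # weight each tag by how often the user applied it
    return Counter(tag for u, _, tag in data if u == user)

def item_profile(user, data):
    # weight each item by how many tags the user attached to it
    return Counter(item for u, item, _ in data if u == user)

print(tag_profile("u1", assignments))   # Counter({'sci-fi': 2, 'space': 1})
print(item_profile("u1", assignments))  # Counter({'movie42': 2, 'book7': 1})
```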
Abstract:
Monitoring the environment with acoustic sensors is an effective method for understanding changes in ecosystems. Through extensive monitoring, large-scale, ecologically relevant datasets can be produced that can inform environmental policy. The collection of acoustic sensor data is a solved problem; the current challenge is the management and analysis of raw audio data to produce useful datasets for ecologists. This paper presents the applied research we use to analyze big acoustic datasets. Its core contribution is the presentation of practical, large-scale acoustic data analysis methodologies. We describe details of the data workflows we use to provide both citizen scientists and researchers with practical access to large volumes of ecoacoustic data. Finally, we propose a work-in-progress large-scale analysis architecture driven by a hybrid cloud-and-local, production-grade website.
Abstract:
Low voltage distribution networks feature a high degree of load unbalance, and the addition of rooftop photovoltaics is driving further unbalance in the network. Single-phase consumers are distributed across the phases, but even if the consumer distribution was well balanced when the network was constructed, changes will occur over time. Distribution transformer losses are increased by unbalanced loadings. Estimating transformer losses is a necessary part of the routine upgrading and replacement of transformers, and identifying the phase connections of households allows a precise estimation of the phase loadings and total transformer loss. This paper presents a new technique, with preliminary test results, for automatically identifying the phase of each customer by correlating voltage information from the utility's transformer system with voltage information from customer smart meters. The techniques are novel because they are based purely on a time series of electrical voltage measurements taken at the household and at the distribution transformer. Experimental results using a combination of electrical power and current from real smart meter datasets demonstrate the performance of our techniques.
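A minimal sketch of the correlation idea follows, assuming simulated minute-resolution voltages and Pearson correlation as the matching score; neither detail is specified in the abstract.

```python
import numpy as np

# Minimal sketch: match a customer's voltage time series against the three
# phase voltages measured at the distribution transformer and assign the
# phase with the highest Pearson correlation. All constants and the
# simulated data are illustrative only.

rng = np.random.default_rng(0)
T = 1440                                   # one day of minute-resolution samples (assumed)
t = np.arange(T)

# Transformer phase voltages: a shared daily profile plus independent
# per-phase fluctuations (rows correspond to phases A, B, C)
daily = 230 + 2.0 * np.sin(2 * np.pi * t / T)
phase_v = daily + rng.normal(0.0, 0.8, size=(3, T))

# A customer connected to phase B: their voltage tracks phase B plus local noise
customer_v = phase_v[1] + rng.normal(0.0, 0.2, size=T)

def identify_phase(customer, phases):
    """Return the index of the phase whose voltage series best matches the customer."""
    corrs = [np.corrcoef(customer, p)[0, 1] for p in phases]
    return int(np.argmax(corrs)), corrs

best, corrs = identify_phase(customer_v, phase_v)
print("assigned phase:", "ABC"[best], "correlations:", np.round(corrs, 3))
```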
Abstract:
A new technique is presented for automatically identifying the phase connection of domestic customers. Voltage information from a reference three-phase house is correlated with voltage information from other customers' electricity meters on the same network to determine the most probable phase connection. The technique is based purely on a time series of electrical voltage measurements taken by the household smart meters, and no additional equipment is required. The method is demonstrated using real smart meter datasets to correctly identify the phase connections of 75 consumers on a low voltage distribution feeder.
Abstract:
This paper presents an online, unsupervised training algorithm enabling vision-based place recognition across a wide range of changing environmental conditions such as those caused by weather, seasons, and day-night cycles. The technique applies principal component analysis to distinguish between aspects of a location’s appearance that are condition-dependent and those that are condition-invariant. Removing the dimensions associated with environmental conditions produces condition-invariant images that can be used by appearance-based place recognition methods. This approach has a unique benefit – it requires training images from only one type of environmental condition, unlike existing data-driven methods that require training images with labelled frame correspondences from two or more environmental conditions. The method is applied to two benchmark variable condition datasets. Performance is equivalent or superior to the current state of the art despite the lesser training requirements, and is demonstrated to generalise to previously unseen locations.
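A hedged sketch of the PCA step is given below, assuming images are flattened to vectors and that removing the top k principal components stands in for discarding the condition-dependent dimensions; the value of k and the data are illustrative.

```python
import numpy as np

# Hedged sketch: apply PCA to training images from a single environmental
# condition, treat the leading principal components as condition-dependent
# appearance, and remove them to obtain condition-invariant images.

def condition_invariant(images, k=2):
    """images: (n_images, n_pixels) array; remove the top-k principal components."""
    mean = images.mean(axis=0)
    X = images - mean
    # SVD of the centred data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    top = Vt[:k]                       # assumed condition-dependent directions
    X_inv = X - X @ top.T @ top        # project those directions out
    return X_inv + mean

rng = np.random.default_rng(1)
imgs = rng.random((10, 64 * 48))       # ten fake 64x48 images, flattened
invariant = condition_invariant(imgs, k=2)
print(invariant.shape)                 # (10, 3072)
```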
Abstract:
This thesis proposes three novel models which extend the statistical methodology for motor unit number estimation, a clinical neurology technique. Motor unit number estimation is important in the treatment of degenerative muscular diseases and, potentially, spinal injury. Additionally, a recent and untested statistic to enable statistical model choice is found to be a practical alternative for larger datasets. The existing methods for dose finding in dual-agent clinical trials are found to be suitable only for designs of modest dimensions. The model choice case-study is the first of its kind containing interesting results using so-called unit information prior distributions.
Abstract:
Most of the existing algorithms for approximate Bayesian computation (ABC) assume that it is feasible to simulate pseudo-data from the model at each iteration. However, the computational cost of these simulations can be prohibitive for high dimensional data. An important example is the Potts model, which is commonly used in image analysis. Images encountered in real world applications can have millions of pixels, therefore scalability is a major concern. We apply ABC with a synthetic likelihood to the hidden Potts model with additive Gaussian noise. Using a pre-processing step, we fit a binding function to model the relationship between the model parameters and the synthetic likelihood parameters. Our numerical experiments demonstrate that the precomputed binding function dramatically improves the scalability of ABC, reducing the average runtime required for model fitting from 71 hours to only 7 minutes. We also illustrate the method by estimating the smoothing parameter for remotely sensed satellite imagery. Without precomputation, Bayesian inference is impractical for datasets of that scale.
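An illustrative sketch of precomputing a binding function for a synthetic likelihood follows, with a cheap stand-in simulator in place of the hidden Potts model and polynomial regressions as the binding function; every specific choice below is an assumption, not the paper's implementation.

```python
import numpy as np

# Sketch of the pre-processing idea: run pilot simulations over a parameter
# grid, fit a "binding function" mapping the parameter to the mean and
# variance of the summary statistic, then reuse it in a Gaussian synthetic
# likelihood without further simulation.

rng = np.random.default_rng(2)

def simulate_summary(theta, n=200):
    # stand-in for an expensive Potts simulation plus its sufficient statistic
    return rng.normal(3.0 * theta + 0.5 * theta**2, 0.3 + 0.1 * theta, size=n).mean()

# Pilot simulations over a grid of parameter values
grid = np.linspace(0.1, 2.0, 40)
pilot = np.array([[simulate_summary(t) for _ in range(30)] for t in grid])

# Binding function: polynomial fits of summary mean and log-variance vs theta
mean_fit = np.polyfit(grid, pilot.mean(axis=1), deg=2)
logvar_fit = np.polyfit(grid, np.log(pilot.var(axis=1) + 1e-12), deg=2)

def synthetic_loglik(theta, observed_stat):
    mu = np.polyval(mean_fit, theta)
    var = np.exp(np.polyval(logvar_fit, theta))
    return -0.5 * (np.log(2 * np.pi * var) + (observed_stat - mu) ** 2 / var)

obs = simulate_summary(1.2)
thetas = np.linspace(0.1, 2.0, 200)
print("point estimate:", thetas[np.argmax([synthetic_loglik(t, obs) for t in thetas])])
```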
Abstract:
Clustering is an important technique for organising and categorising web-scale document collections. The main challenges in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets involved. More importantly, it is nearly impossible to generate labels for a general web document collection containing billions of documents and a vast taxonomy of topics; yet document clusters are most commonly evaluated by comparison to a ground-truth set of document labels. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. This solution is based on the assumption that Wikipedia covers such a wide range of diverse topics that it represents a small-scale web. We found that the web-scale document clustering and labeling process could be performed on one desktop computer in under a couple of days for a Wikipedia clustering solution containing about 1,000 clusters; solutions with finer-granularity clusters, such as 10,000 or 50,000, take longer to execute. These results were evaluated using a set of external data.
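A hedged sketch of the cluster-then-map idea, using TF-IDF and mini-batch k-means on a toy reference corpus standing in for Wikipedia; the corpus, cluster count, and vectoriser settings are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

# Sketch: cluster a small reference corpus once, then assign new web
# documents to the nearest existing cluster instead of re-clustering
# the entire collection.

reference_docs = [
    "coral reefs and marine ecosystems",
    "bayesian inference for hierarchical models",
    "frequent subtree mining in web data",
    "smart meter voltage measurements in distribution networks",
]
new_web_docs = [
    "estimating posterior distributions with hierarchical priors",
    "identifying customer phase from voltage time series",
]

vec = TfidfVectorizer(stop_words="english")
X_ref = vec.fit_transform(reference_docs)

km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=10)
km.fit(X_ref)

# Map unseen documents onto the existing clusters (no re-clustering required)
labels = km.predict(vec.transform(new_web_docs))
print(labels)
```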
Abstract:
This paper presents an overview of the strengths and limitations of existing and emerging geophysical tools for landform studies. The objectives are to discuss recent technical developments and to provide a review of relevant recent literature, with a focus on propagating field methods with terrestrial applications. For the methods in this category, including ground-penetrating radar (GPR), electrical resistivity (ER), seismics, and electromagnetic (EM) induction, the technical backgrounds are introduced, followed by a section on novel developments relevant to landform characterization. For several decades, GPR has been popular for characterization of the shallow subsurface and in particular sedimentary systems. Novel developments in GPR include the use of multi-offset systems to improve signal-to-noise ratios and data collection efficiency, amongst others, and the increased use of 3D data. Multi-electrode ER systems have become popular in recent years as they allow for relatively fast and detailed mapping. Novel developments include time-lapse monitoring of dynamic processes as well as the use of capacitively-coupled systems for fast, non-invasive surveys. EM induction methods are especially popular for fast mapping of spatial variation, but can also be used to obtain information on the vertical variation in subsurface electrical conductivity. In recent years several examples of the use of plane wave EM for characterization of landforms have been published. Seismic methods for landform characterization include seismic reflection and refraction techniques and the use of surface waves. A recent development is the use of passive sensing approaches. The use of multiple geophysical methods, which can benefit from the sensitivity to different subsurface parameters, is becoming more common. Strategies for coupled and joint inversion of complementary datasets will, once more widely available, benefit the geophysical study of landforms. Three case studies are presented on the use of electrical and GPR methods for characterization of landforms in the range of meters to 100s of meters in dimension. In a study of polygonal patterned ground in the Saginaw Lowlands, Michigan, USA, electrical resistivity tomography was used to characterize differences in subsurface texture and water content associated with polygon-swale topography. Also, a sand-filled thermokarst feature was identified using electrical resistivity data. The second example is on the use of constant spread traversing (CST) for characterization of large-scale glaciotectonic deformation in the Ludington Ridge, Michigan. Multiple CST surveys parallel to an ~60 m high cliff, where broad (~100 m) synclines and narrow clay-rich anticlines are visible, illustrated that at least one of the narrow structures extended inland. A third case study discusses internal structures of an eolian dune on a coastal spit in New Zealand. Both 35 and 200 MHz GPR data, which clearly identified a paleosol and internal sedimentary structures of the dune, were used to improve understanding of the development of the dune, which may shed light on paleo-wind directions.
Abstract:
Functional MRI studies commonly refer to activation patterns as being localized in specific Brodmann areas, referring to Brodmann’s divisions of the human cortex based on cytoarchitectonic boundaries [3]. Typically, Brodmann areas that match regions in the group averaged functional maps are estimated by eye, leading to inaccurate parcellations and significant error. To avoid this limitation, we developed a method using high-dimensional nonlinear registration to project the Brodmann areas onto individual 3D co-registered structural and functional MRI datasets, using an elastic deformation vector field in the cortical parameter space. Based on a sulcal pattern matching approach [11], an N=27 scan single subject atlas (the Colin Holmes atlas [15]) with associated Brodmann areas labeled on its surface, was deformed to match 3D cortical surface models generated from individual subjects’ structural MRIs (sMRIs). The deformed Brodmann areas were used to quantify and localize functional MRI (fMRI) BOLD activation during the performance of the Tower of London task [7].
Abstract:
This chapter analyses the copyright law framework needed to ensure open access to outputs of the Australian academic and research sector such as journal articles and theses. It overviews the new knowledge landscape, the principles of copyright law, the concept of open access to knowledge, the recently developed open content models of copyright licensing and the challenges faced in providing greater access to knowledge and research outputs.
Abstract:
This paper presents a new metric, which we call the lighting variance ratio, for quantifying descriptors in terms of their variance under illumination changes. In many applications it is desirable to have descriptors that are robust to changes in illumination, especially in outdoor environments. The lighting variance ratio is useful for comparing descriptors and determining whether a descriptor is sufficiently lighting invariant for a given environment. The metric is analysed across a number of datasets, cameras and descriptors. The results show that the upright SIFT descriptor is typically the most lighting-invariant descriptor.
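The abstract does not define the metric, so the sketch below encodes one plausible interpretation as an assumption: the ratio of a descriptor's variance across lighting conditions at the same point to its variance across different points, with lower values indicating greater lighting invariance.

```python
import numpy as np

# Assumed interpretation only, NOT the paper's definition: lighting variance
# (spread of a point's descriptor across lighting conditions) divided by
# scene variance (spread of per-point mean descriptors across the scene).

def lighting_variance_ratio(descriptors):
    """descriptors: (n_points, n_conditions, dim) array of descriptor vectors."""
    per_point_mean = descriptors.mean(axis=1, keepdims=True)
    within = ((descriptors - per_point_mean) ** 2).sum(axis=2).mean()    # lighting variance
    overall_mean = descriptors.mean(axis=(0, 1), keepdims=True)
    between = ((per_point_mean - overall_mean) ** 2).sum(axis=2).mean()  # scene variance
    return within / between

rng = np.random.default_rng(3)
descs = rng.normal(size=(50, 4, 128))    # 50 points, 4 lighting conditions, 128-D descriptors
print(round(lighting_variance_ratio(descs), 3))
```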