424 results for Blog datasets


Relevance: 10.00%

Abstract:

Clustering is an important technique in organising and categorising web scale documents. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate the labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped on to those clusters. This solution is based on the assumption that Wikipedia contains such a wide range of diverse topics that it represents a small scale web. We found that it was possible to perform the web scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer granularity clusters such as 10,000 or 50,000. These results were evaluated using a set of external data.
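The abstract does not say how documents are mapped onto the Wikipedia clusters. As a minimal sketch of one common way to do such a mapping, the Python below assigns a document to the cluster centroid with the highest cosine similarity. The term-frequency representation, the `centroids` dictionary and the function names are all assumptions made for illustration, not the paper's method.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn text into a sparse term-frequency vector (a Counter)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_to_cluster(document, centroids):
    """Return the label of the centroid most similar to the document."""
    vector = bag_of_words(document)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Hypothetical centroids standing in for clusters learned from a reference corpus.
centroids = {
    "geology": bag_of_words("rock sediment radar dune subsurface"),
    "genomics": bag_of_words("gene expression sequence microarray"),
}
label = map_to_cluster("radar survey of a coastal dune", centroids)
```

Because each document is compared only against the centroids, the cost per document grows with the number of clusters, which is consistent with finer-grained solutions taking longer to run.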

Relevance: 10.00%

Abstract:

This paper presents an overview of the strengths and limitations of existing and emerging geophysical tools for landform studies. The objectives are to discuss recent technical developments and to provide a review of relevant recent literature, with a focus on propagating field methods with terrestrial applications. For various methods in this category, including ground-penetrating radar (GPR), electrical resistivity (ER), seismics, and electromagnetic (EM) induction, the technical backgrounds are introduced, followed by a section on novel developments relevant to landform characterization. For several decades, GPR has been popular for characterization of the shallow subsurface and in particular sedimentary systems. Novel developments in GPR include the use of multi-offset systems to improve signal-to-noise ratios and data collection efficiency, amongst others, and the increased use of 3D data. Multi-electrode ER systems have become popular in recent years as they allow for relatively fast and detailed mapping. Novel developments include time-lapse monitoring of dynamic processes as well as the use of capacitively-coupled systems for fast, non-invasive surveys. EM induction methods are especially popular for fast mapping of spatial variation, but can also be used to obtain information on the vertical variation in subsurface electrical conductivity. In recent years several examples of the use of plane wave EM for characterization of landforms have been published. Seismic methods for landform characterization include seismic reflection and refraction techniques and the use of surface waves. A recent development is the use of passive sensing approaches. The use of multiple geophysical methods, which can benefit from the sensitivity to different subsurface parameters, is becoming more common. Strategies for coupled and joint inversion of complementary datasets will, once more widely available, benefit the geophysical study of landforms. Three case studies are presented on the use of electrical and GPR methods for characterization of landforms in the range of meters to 100s of meters in dimension. In a study of polygonal patterned ground in the Saginaw Lowlands, Michigan, USA, electrical resistivity tomography was used to characterize differences in subsurface texture and water content associated with polygon-swale topography. Also, a sand-filled thermokarst feature was identified using electrical resistivity data. The second example is on the use of constant spread traversing (CST) for characterization of large-scale glaciotectonic deformation in the Ludington Ridge, Michigan. Multiple CST surveys parallel to an ~60 m high cliff, where broad (~100 m) synclines and narrow clay-rich anticlines are visible, illustrated that at least one of the narrow structures extended inland. A third case study discusses internal structures of an eolian dune on a coastal spit in New Zealand. Both 35 and 200 MHz GPR data, which clearly identified a paleosol and internal sedimentary structures of the dune, were used to improve understanding of the development of the dune, which may shed light on paleo-wind directions.

Relevance: 10.00%

Abstract:

Functional MRI studies commonly refer to activation patterns as being localized in specific Brodmann areas, referring to Brodmann’s divisions of the human cortex based on cytoarchitectonic boundaries [3]. Typically, Brodmann areas that match regions in the group averaged functional maps are estimated by eye, leading to inaccurate parcellations and significant error. To avoid this limitation, we developed a method using high-dimensional nonlinear registration to project the Brodmann areas onto individual 3D co-registered structural and functional MRI datasets, using an elastic deformation vector field in the cortical parameter space. Based on a sulcal pattern matching approach [11], an N=27 scan single subject atlas (the Colin Holmes atlas [15]) with associated Brodmann areas labeled on its surface, was deformed to match 3D cortical surface models generated from individual subjects’ structural MRIs (sMRIs). The deformed Brodmann areas were used to quantify and localize functional MRI (fMRI) BOLD activation during the performance of the Tower of London task [7].

Relevance: 10.00%

Abstract:

This chapter analyses the copyright law framework needed to ensure open access to outputs of the Australian academic and research sector such as journal articles and theses. It overviews the new knowledge landscape, the principles of copyright law, the concept of open access to knowledge, the recently developed open content models of copyright licensing and the challenges faced in providing greater access to knowledge and research outputs.

Relevance: 10.00%

Abstract:

That’s what one researcher told us when we asked them about applying for NHMRC Project Grant funding. Others said that applying for funding had made them ill, lost them friends, ruined Christmas and caused arguments with friends and family. What makes applying for funding so bad? We’ve tried to summarise the problems with the system in the diagram above. This is based on our group’s four years of research into the funding process. Some of the arrows are based on evidence from our surveys (Survey 1, Survey 2); others are based on anecdote or experience and so may be wrong. Please let me know if I’ve missed an arrow or an issue.

Relevance: 10.00%

Abstract:

This paper presents a new metric, which we call the lighting variance ratio, for quantifying descriptors in terms of their variance to illumination changes. In many applications it is desirable to have descriptors that are robust to changes in illumination, especially in outdoor environments. The lighting variance ratio is useful for comparing descriptors and determining if a descriptor is lighting invariant enough for a given environment. The metric is analysed across a number of datasets, cameras and descriptors. The results show that the upright SIFT descriptor is typically the most lighting invariant descriptor.

Relevance: 10.00%

Abstract:

Local spatio-temporal features with a Bag-of-visual-words model are a popular approach to human action recognition. Bag-of-features methods face several challenges, such as extracting appropriate appearance and motion features from videos, converting the extracted features into a form appropriate for classification, and designing a suitable classification framework. In this paper we address the problem of efficiently representing the extracted features for classification to improve the overall performance. We introduce two generative supervised topic models, maximum entropy discrimination LDA (MedLDA) and class-specific simplex LDA (css-LDA), to encode the raw features in a form suitable for discriminative SVM-based classification. Unsupervised LDA models disconnect topic discovery from the classification task and hence yield poor results compared to the baseline Bag-of-words framework. Supervised LDA techniques, on the other hand, learn the topic structure by considering the class labels and improve the recognition accuracy significantly. MedLDA maximizes likelihood and within-class margins using max-margin techniques and yields a sparse, highly discriminative topic structure, while in css-LDA separate class-specific topics are learned instead of a common set of topics across the entire dataset. In our representation, topics are learned first and then each video is represented as a topic proportion vector, which is comparable to a histogram of topics. Finally, SVM classification is performed on the learned topic proportion vectors. We demonstrate the efficiency of these two representation techniques through experiments carried out on two popular datasets. Experimental results demonstrate significantly improved performance compared to the baseline Bag-of-features framework, which uses k-means to construct a histogram of words from the feature vectors.

Relevance: 10.00%

Abstract:

The commercialization of aerial image processing is highly dependent on platforms such as UAVs (Unmanned Aerial Vehicles). However, the lack of an automated UAV forced landing site detection system has been identified as one of the main impediments to allowing UAV flight over populated areas in civilian airspace. This article proposes a UAV forced landing site detection system that is based on machine learning approaches including the Gaussian Mixture Model and the Support Vector Machine. A range of learning parameters are analysed, including the number of Gaussian mixtures, support vector kernels including the linear, radial basis function (RBF) and polynomial (poly) kernels, and the order of the RBF and polynomial kernels. Moreover, a modified footprint operator is employed during feature extraction to better describe the geometric characteristics of the local area surrounding a pixel. The performance of the presented system is compared to a baseline UAV forced landing site detection system which uses edge features and an Artificial Neural Network (ANN) region type classifier. Experiments conducted on aerial image datasets captured over typical urban environments reveal that improved landing site detection can be achieved with an SVM classifier with an RBF kernel using a combination of colour and texture features. Compared to the baseline system, the proposed system provides significant improvement in terms of the chance to detect a safe landing area, and the performance is more stable than the baseline in the presence of changes to the UAV altitude.

Relevance: 10.00%

Abstract:

The Guardian recently published an article by Nakkiah Lui, a Gamillaroi and Torres Strait Islander woman and writer, titled “Why this year’s NAIDOC week will be my last”. In response, Dr Chelsea Bond, an Aboriginal (Munanjahli) and South Sea Islander Australian and a senior lecturer with the Aboriginal and Torres Strait Islander Studies Unit at the University of Queensland, explains why she will continue to celebrate NAIDOC Week – as an act of agency.

Relevance: 10.00%

Abstract:

Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on D2 statistics, under different evolutionary scenarios. We find that compared to a multiple sequence alignment approach, D2 methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences sharing low divergence, at greater computation speed. Our findings provide strong evidence for the scalability and the potential use of alignment-free methods in large-scale phylogenomics.
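To make the idea of a D2-type score concrete, here is a minimal Python sketch of the basic D2 statistic: the sum, over all words of length k, of the product of that word's counts in the two sequences. The abstract does not state which D2 variant, word length or distance transformation the authors used, so the function names and the choice of k below are illustrative assumptions.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer (word of length k) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def d2(seq_x, seq_y, k):
    """Basic D2 statistic: sum over k-mers of the product of their counts."""
    counts_x = kmer_counts(seq_x, k)
    counts_y = kmer_counts(seq_y, k)
    # A Counter returns 0 for words absent from seq_y, so only shared words contribute.
    return sum(n * counts_y[word] for word, n in counts_x.items())


# Toy sequences: they share "AC" (twice in each) and "CG" (once in each).
score = d2("ACGTAC", "ACGGAC", k=2)
```

The raw score is a similarity, not a distance. Building a distance matrix for phylogenetic inference would need a further normalisation and transformation step, which is not shown here.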

Relevance: 10.00%

Abstract:

Indigenous leader Pat Dodson – who revealed he has met Prime Minister Tony Abbott only once, and then in passing – said last week that removal of frontline services from Indigenous organisations working towards Closing the Gap in Indigenous health “would seem counter intuitive to any fair-minded Australian”. But that, he said in this Age OpEd, has been the result of the Federal Government’s much-awaited Indigenous Advancement Strategy...

Relevance: 10.00%

Abstract:

In recent years, considerable research efforts have been directed to micro-array technologies and their role in providing simultaneous information on expression profiles for thousands of genes. These data, when subjected to clustering and classification procedures, can assist in identifying patterns and providing insight on biological processes. To understand the properties of complex gene expression datasets, graphical representations can be used. Intuitively, the data can be represented in terms of a bipartite graph, with weighted edges corresponding to gene-sample node couples in the dataset. Biologically meaningful subgraphs can be sought, but performance can be influenced both by the search algorithm and by the graph-weighting scheme, and both merit rigorous investigation. In this paper, we focus on edge-weighting schemes for bipartite graphical representation of gene expression. Two novel methods are presented: the first is based on empirical evidence; the second on a geometric distribution. The schemes are compared for several real datasets, assessing efficiency of performance based on four essential properties: robustness to noise and missing values, discrimination, parameter influence on scheme efficiency and reusability. Recommendations and limitations are briefly discussed. Keywords: Edge-weighting; weighted graphs; gene expression; bi-clustering
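The bipartite representation described above can be sketched in a few lines of Python. Genes and samples form the two node sets, and each gene-sample pair becomes a weighted edge. The weight used here is simply the raw expression value, which is a placeholder assumption: the two weighting schemes proposed in the paper are not specified in the abstract and are not reproduced. The scoring function `subgraph_weight` is likewise hypothetical.

```python
def build_bipartite_graph(expression, genes, samples):
    """Return weighted edges {(gene, sample): weight} from an expression matrix.

    The weight is the raw expression value, an assumption made for
    illustration only.
    """
    edges = {}
    for i, gene in enumerate(genes):
        for j, sample in enumerate(samples):
            edges[(gene, sample)] = expression[i][j]
    return edges


def subgraph_weight(edges, gene_subset, sample_subset):
    """Mean edge weight of the subgraph induced by the chosen genes and samples."""
    weights = [edges[(g, s)] for g in gene_subset for s in sample_subset]
    return sum(weights) / len(weights)


# Toy expression matrix: rows are genes, columns are samples.
genes = ["g1", "g2", "g3"]
samples = ["s1", "s2"]
expression = [[2.0, 4.0], [1.0, 3.0], [0.0, 0.0]]
edges = build_bipartite_graph(expression, genes, samples)
mean_weight = subgraph_weight(edges, ["g1", "g2"], ["s1", "s2"])
```

A bicluster then corresponds to a subset of genes and a subset of samples whose induced subgraph scores well under whatever weighting scheme is chosen.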

Relevance: 10.00%

Abstract:

One of the main challenges in data analytics is that discovering structures and patterns in complex datasets is a computer-intensive task. Recent advances in high-performance computing provide part of the solution. Multicore systems are now more affordable and more accessible. In this paper, we investigate how this can be used to develop more advanced methods for data analytics. We focus on two specific areas: model-driven analysis and data mining using optimisation techniques.

Relevance: 10.00%

Abstract:

Most real-life data analysis problems are difficult to solve using exact methods, due to the size of the datasets and the nature of the underlying mechanisms of the system under investigation. As datasets grow even larger, finding the balance between the quality of the approximation and the computing time of the heuristic becomes non-trivial. One solution is to consider parallel methods, and to use the increased computational power to perform a deeper exploration of the solution space in a similar time. It is, however, difficult to estimate a priori whether parallelisation will provide the expected improvement. In this paper we consider a well-known method, genetic algorithms, and evaluate on two distinct problem types the behaviour of the classic and parallel implementations.
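For readers unfamiliar with the method, the following Python sketch shows a classic (sequential) generational genetic algorithm on a toy problem. The abstract does not describe the paper's problem types, operators or parameters, so everything here is an illustrative assumption: the `one_max` fitness function, tournament selection, one-point crossover, bit-flip mutation, elitism and all parameter values.

```python
import random


def one_max(bits):
    """Toy fitness: the number of 1s in the bit string."""
    return sum(bits)


def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.05, seed=0):
    """Classic generational GA with tournament selection and elitism."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        next_population = [best[:]]  # elitism: keep the best individual
        while len(next_population) < pop_size:
            # Tournament selection: best of three random individuals.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation applied independently to each position.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = genetic_algorithm(one_max)
```

In a parallel variant, fitness evaluations within a generation are a natural unit of work to distribute, since they are independent of one another. Whether that yields the expected speed-up is the question the paper investigates.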

Relevance: 10.00%

Abstract:

In providing simultaneous information on expression profiles for thousands of genes, microarray technologies have, in recent years, been widely used to investigate mechanisms of gene expression. Clustering and classification of such data can, indeed, highlight patterns and provide insight on biological processes. A common approach is to consider genes and samples of microarray datasets as nodes in a bipartite graph, where edges are weighted, e.g. based on the expression levels. In this paper, using a previously-evaluated weighting scheme, we focus on search algorithms and evaluate, in the context of biclustering, several variations of Genetic Algorithms. We also introduce a new heuristic, “Propagate”, which consists in recursively evaluating neighbour solutions with one more or one fewer active condition. The results obtained on three well-known datasets show that, for a given weighting scheme, optimal or near-optimal solutions can be identified.