957 resultados para Imbalanced datasets
Resumo:
Traditional nearest points methods use all the samples in an image set to construct a single convex or affine hull model for classification. However, strong artificial features and noisy data may be generated from combinations of training samples when significant intra-class variations and/or noise occur in the image set. Existing multi-model approaches extract local models by clustering each image set individually only once, with fixed clusters used for matching with various image sets. This may not be optimal for discrimination, as undesirable environmental conditions (eg. illumination and pose variations) may result in the two closest clusters representing different characteristics of an object (eg. frontal face being compared to non-frontal face). To address the above problem, we propose a novel approach to enhance nearest points based methods by integrating affine/convex hull classification with an adapted multi-model approach. We first extract multiple local convex hulls from a query image set via maximum margin clustering to diminish the artificial variations and constrain the noise in local convex hulls. We then propose adaptive reference clustering (ARC) to constrain the clustering of each gallery image set by forcing the clusters to have resemblance to the clusters in the query image set. By applying ARC, noisy clusters in the query set can be discarded. Experiments on Honda, MoBo and ETH-80 datasets show that the proposed method outperforms single model approaches and other recent techniques, such as Sparse Approximated Nearest Points, Mutual Subspace Method and Manifold Discriminant Analysis.
Resumo:
This paper describes a novel system for automatic classification of images obtained from Anti-Nuclear Antibody (ANA) pathology tests on Human Epithelial type 2 (HEp-2) cells using the Indirect Immunofluorescence (IIF) protocol. The IIF protocol on HEp-2 cells has been the hallmark method to identify the presence of ANAs, due to its high sensitivity and the large range of antigens that can be detected. However, it suffers from numerous shortcomings, such as being subjective as well as time and labour intensive. Computer Aided Diagnostic (CAD) systems have been developed to address these problems, which automatically classify a HEp-2 cell image into one of its known patterns (eg. speckled, homogeneous). Most of the existing CAD systems use handpicked features to represent a HEp-2 cell image, which may only work in limited scenarios. We propose a novel automatic cell image classification method termed Cell Pyramid Matching (CPM), which is comprised of regional histograms of visual words coupled with the Multiple Kernel Learning framework. We present a study of several variations of generating histograms and show the efficacy of the system on two publicly available datasets: the ICPR HEp-2 cell classification contest dataset and the SNPHEp-2 dataset.
Resumo:
Person re-identification is particularly challenging due to significant appearance changes across separate camera views. In order to re-identify people, a representative human signature should effectively handle differences in illumination, pose and camera parameters. While general appearance-based methods are modelled in Euclidean spaces, it has been argued that some applications in image and video analysis are better modelled via non-Euclidean manifold geometry. To this end, recent approaches represent images as covariance matrices, and interpret such matrices as points on Riemannian manifolds. As direct classification on such manifolds can be difficult, in this paper we propose to represent each manifold point as a vector of similarities to class representers, via a recently introduced form of Bregman matrix divergence known as the Stein divergence. This is followed by using a discriminative mapping of similarity vectors for final classification. The use of similarity vectors is in contrast to the traditional approach of embedding manifolds into tangent spaces, which can suffer from representing the manifold structure inaccurately. Comparative evaluations on benchmark ETHZ and iLIDS datasets for the person re-identification task show that the proposed approach obtains better performance than recent techniques such as Histogram Plus Epitome, Partial Least Squares, and Symmetry-Driven Accumulation of Local Features.
Resumo:
Existing multi-model approaches for image set classification extract local models by clustering each image set individually only once, with fixed clusters used for matching with other image sets. However, this may result in the two closest clusters to represent different characteristics of an object, due to different undesirable environmental conditions (such as variations in illumination and pose). To address this problem, we propose to constrain the clustering of each query image set by forcing the clusters to have resemblance to the clusters in the gallery image sets. We first define a Frobenius norm distance between subspaces over Grassmann manifolds based on reconstruction error. We then extract local linear subspaces from a gallery image set via sparse representation. For each local linear subspace, we adaptively construct the corresponding closest subspace from the samples of a probe image set by joint sparse representation. We show that by minimising the sparse representation reconstruction error, we approach the nearest point on a Grassmann manifold. Experiments on Honda, ETH-80 and Cambridge-Gesture datasets show that the proposed method consistently outperforms several other recent techniques, such as Affine Hull based Image Set Distance (AHISD), Sparse Approximated Nearest Points (SANP) and Manifold Discriminant Analysis (MDA).
Resumo:
This thesis investigates the fusion of 3D visual information with 2D image cues to provide 3D semantic maps of large-scale environments in which a robot traverses for robotic applications. A major theme of this thesis was to exploit the availability of 3D information acquired from robot sensors to improve upon 2D object classification alone. The proposed methods have been evaluated on several indoor and outdoor datasets collected from mobile robotic platforms including a quadcopter and ground vehicle covering several kilometres of urban roads.
Resumo:
A common problem with the use of tensor modeling in generating quality recommendations for large datasets is scalability. In this paper, we propose the Tensor-based Recommendation using Probabilistic Ranking method that generates the reconstructed tensor using block-striped parallel matrix multiplication and then probabilistically calculates the preferences of user to rank the recommended items. Empirical analysis on two real-world datasets shows that the proposed method is scalable for large tensor datasets and is able to outperform the benchmarking methods in terms of accuracy.
Resumo:
The use of Wireless Sensor Networks (WSNs) for vibration-based Structural Health Monitoring (SHM) has become a promising approach due to many advantages such as low cost, fast and flexible deployment. However, inherent technical issues such as data asynchronicity and data loss have prevented these distinct systems from being extensively used. Recently, several SHM-oriented WSNs have been proposed and believed to be able to overcome a large number of technical uncertainties. Nevertheless, there is limited research verifying the applicability of those WSNs with respect to demanding SHM applications like modal analysis and damage identification. Based on a brief review, this paper first reveals that Data Synchronization Error (DSE) is the most inherent factor amongst uncertainties of SHM-oriented WSNs. Effects of this factor are then investigated on outcomes and performance of the most robust Output-only Modal Analysis (OMA) techniques when merging data from multiple sensor setups. The two OMA families selected for this investigation are Frequency Domain Decomposition (FDD) and data-driven Stochastic Subspace Identification (SSI-data) due to the fact that they both have been widely applied in the past decade. Accelerations collected by a wired sensory system on a large-scale laboratory bridge model are initially used as benchmark data after being added with a certain level of noise to account for the higher presence of this factor in SHM-oriented WSNs. From this source, a large number of simulations have been made to generate multiple DSE-corrupted datasets to facilitate statistical analyses. The results of this study show the robustness of FDD and the precautions needed for SSI-data family when dealing with DSE at a relaxed level. Finally, the combination of preferred OMA techniques and the use of the channel projection for the time-domain OMA technique to cope with DSE are recommended.
Resumo:
Molecular biology is a scientific discipline which has changed fundamentally in character over the past decade to rely on large scale datasets – public and locally generated - and their computational analysis and annotation. Undergraduate education of biologists must increasingly couple this domain context with a data-driven computational scientific method. Yet modern programming and scripting languages and rich computational environments such as R and MATLAB present significant barriers to those with limited exposure to computer science, and may require substantial tutorial assistance over an extended period if progress is to be made. In this paper we report our experience of undergraduate bioinformatics education using the familiar, ubiquitous spreadsheet environment of Microsoft Excel. We describe a configurable extension called QUT.Bio.Excel, a custom ribbon, supporting a rich set of data sources, external tools and interactive processing within the spreadsheet, and a range of problems to demonstrate its utility and success in addressing the needs of students over their studies.
Resumo:
Over the past decade the mitochondrial (mt) genome has become the most widely used genomic resource available for systematic entomology. While the availability of other types of ‘–omics’ data – in particular transcriptomes – is increasing rapidly, mt genomes are still vastly cheaper to sequence and are far less demanding of high quality templates. Furthermore, almost all other ‘–omics’ approaches also sequence the mt genome, and so it can form a bridge between legacy and contemporary datasets. Mitochondrial genomes have now been sequenced for all insect orders, and in many instances representatives of each major lineage within orders (suborders, series or superfamilies depending on the group). They have also been applied to systematic questions at all taxonomic scales from resolving interordinal relationships (e.g. Cameron et al., 2009; Wan et al., 2012; Wang et al., 2012), through many intraordinal (e.g. Dowton et al., 2009; Timmermans et al., 2010; Zhao et al. 2013a) and family-level studies (e.g. Nelson et al., 2012; Zhao et al., 2013b) to population/biogeographic studies (e.g. Ma et al., 2012). Methodological issues around the use of mt genomes in insect phylogenetic analyses and the empirical results found to date have recently been reviewed by Cameron (2014), yet the technical aspects of sequencing and annotating mt genomes were not covered. Most papers which generate new mt genome report their methods in a simplified form which can be difficult to replicate without specific knowledge of the field. Published studies utilize a sufficiently wide range of approaches, usually without justification for the one chosen, that confusion about commonly used jargon such as ‘long PCR’ and ‘primer walking’ could be a serious barrier to entry. Furthermore, sequenced mt genomes have been annotated (gene locations defined) to wildly varying standards and improving data quality through consistent annotation procedures will benefit all downstream users of these datasets. The aims of this review are therefore to: 1. Describe in detail the various sequencing methods used on insect mt genomes; 2. Explore the strengths/weakness of different approaches; 3. Outline the procedures and software used for insect mt genome annotation, and; 4. Highlight quality control steps used for new annotations, and to improve the re-annotation of previously sequenced mt genomes used in systematic or comparative research.
Resumo:
This paper presents a novel place recognition algorithm inspired by the recent discovery of overlapping and multi-scale spatial maps in the rodent brain. We mimic this hierarchical framework by training arrays of Support Vector Machines to recognize places at multiple spatial scales. Place match hypotheses are then cross-validated across all spatial scales, a process which combines the spatial specificity of the finest spatial map with the consensus provided by broader mapping scales. Experiments on three real-world datasets including a large robotics benchmark demonstrate that mapping over multiple scales uniformly improves place recognition performance over a single scale approach without sacrificing localization accuracy. We present analysis that illustrates how matching over multiple scales leads to better place recognition performance and discuss several promising areas for future investigation.
Resumo:
This article presents the field applications and validations for the controlled Monte Carlo data generation scheme. This scheme was previously derived to assist the Mahalanobis squared distance–based damage identification method to cope with data-shortage problems which often cause inadequate data multinormality and unreliable identification outcome. To do so, real-vibration datasets from two actual civil engineering structures with such data (and identification) problems are selected as the test objects which are then shown to be in need of enhancement to consolidate their conditions. By utilizing the robust probability measures of the data condition indices in controlled Monte Carlo data generation and statistical sensitivity analysis of the Mahalanobis squared distance computational system, well-conditioned synthetic data generated by an optimal controlled Monte Carlo data generation configurations can be unbiasedly evaluated against those generated by other set-ups and against the original data. The analysis results reconfirm that controlled Monte Carlo data generation is able to overcome the shortage of observations, improve the data multinormality and enhance the reliability of the Mahalanobis squared distance–based damage identification method particularly with respect to false-positive errors. The results also highlight the dynamic structure of controlled Monte Carlo data generation that makes this scheme well adaptive to any type of input data with any (original) distributional condition.
Resumo:
A catchment-scale multivariate statistical analysis of hydrochemistry enabled assessment of interactions between alluvial groundwater and Cressbrook Creek, an intermittent drainage system in southeast Queensland, Australia. Hierarchical cluster analyses and principal component analysis were applied to time-series data to evaluate the hydrochemical evolution of groundwater during periods of extreme drought and severe flooding. A simple three-dimensional geological model was developed to conceptualise the catchment morphology and the stratigraphic framework of the alluvium. The alluvium forms a two-layer system with a basal coarse-grained layer overlain by a clay-rich low-permeability unit. In the upper and middle catchment, alluvial groundwater is chemically similar to streamwater, particularly near the creek (reflected by high HCO3/Cl and K/Na ratios and low salinities), indicating a high degree of connectivity. In the lower catchment, groundwater is more saline with lower HCO3/Cl and K/Na ratios, notably during dry periods. Groundwater salinity substantially decreased following severe flooding in 2011, notably in the lower catchment, confirming that flooding is an important mechanism for both recharge and maintaining groundwater quality. The integrated approach used in this study enabled effective interpretation of hydrological processes and can be applied to a variety of hydrological settings to synthesise and evaluate large hydrochemical datasets.
Resumo:
The fields of molecular biology and cell biology are being flooded with complex genomic and proteomic datasets of large dimensions. We now recognize that each molecule in the cell and tissue can no longer be viewed as an isolated entity. Instead, each molecule must be considered as one member of an interacting network. Consequently, there is an urgent need for mathematical models to understand the behavior of cell signaling networks in health and in disease.
Resumo:
The proliferation of news reports published in online websites and news information sharing among social media users necessitates effective techniques for analysing the image, text and video data related to news topics. This paper presents the first study to classify affective facial images on emerging news topics. The proposed system dynamically monitors and selects the current hot (of great interest) news topics with strong affective interestingness using textual keywords in news articles and social media discussions. Images from the selected hot topics are extracted and classified into three categorized emotions, positive, neutral and negative, based on facial expressions of subjects in the images. Performance evaluations on two facial image datasets collected from real-world resources demonstrate the applicability and effectiveness of the proposed system in affective classification of facial images in news reports. Facial expression shows high consistency with the affective textual content in news reports for positive emotion, while only low correlation has been observed for neutral and negative. The system can be directly used for applications, such as assisting editors in choosing photos with a proper affective semantic for a certain topic during news report preparation.
Resumo:
Marine sediments around volcanic islands contain an archive of volcaniclastic deposits, which can be used to reconstruct the volcanic history of an area. Such records hold many advantages over often incomplete terrestrial datasets. This includes the potential for precise and continuous dating of intervening sediment packages, which allow a correlatable and temporally-constrained stratigraphic framework to be constructed across multiple marine sediment cores. Here, we discuss a marine record of eruptive and mass-wasting events spanning ~250 ka offshore of Montserrat, using new data from IODP Expedition 340, as well as previously collected cores. By using a combination of high-resolution oxygen isotope stratigraphy, AMS radiocarbon dating, biostratigraphy of foraminifera and calcareous nannofossils and clast componentry, we identify five major events at Soufriere Hills volcano since 250 ka. Lateral correlation of these events across sediment cores collected offshore of the south and south west of Montserrat, have improved our understanding of the timing, extent and associations between events in this area. Correlations reveal that powerful and potentially erosive density-currents travelled at least 33 km offshore, and demonstrate that marine deposits, produced by eruption-fed and mass-wasting events on volcanic islands, are heterogeneous in their spatial distribution. Thus, multiple drilling/coring sites are needed to reconstruct the full chronostratigraphy of volcanic islands. This multidisciplinary study will be vital to interpreting the chaotic records of submarine landslides at other sites drilled during Expedition 340 and provides a framework that can be applied to the stratigraphic analysis of sediments surrounding other volcanic islands.