957 resultados para Imbalanced datasets
Resumo:
Collisions between pedestrians and vehicles continue to be a major problem throughout the world. Pedestrians trying to cross roads and railway tracks without any caution are often highly susceptible to collisions with vehicles and trains. Continuous financial, human and other losses have prompted transport related organizations to come up with various solutions addressing this issue. However, the quest for new and significant improvements in this area is still ongoing. This work addresses this issue by building a general framework using computer vision techniques to automatically monitor pedestrian movements in such high-risk areas to enable better analysis of activity, and the creation of future alerting strategies. As a result of rapid development in the electronics and semi-conductor industry there is extensive deployment of CCTV cameras in public places to capture video footage. This footage can then be used to analyse crowd activities in those particular places. This work seeks to identify the abnormal behaviour of individuals in video footage. In this work we propose using a Semi-2D Hidden Markov Model (HMM), Full-2D HMM and Spatial HMM to model the normal activities of people. The outliers of the model (i.e. those observations with insufficient likelihood) are identified as abnormal activities. Location features, flow features and optical flow textures are used as the features for the model. The proposed approaches are evaluated using the publicly available UCSD datasets, and we demonstrate improved performance using a Semi-2D Hidden Markov Model compared to other state of the art methods. Further we illustrate how our proposed methods can be applied to detect anomalous events at rail level crossings.
Resumo:
In this paper, we explore the effectiveness of patch-based gradient feature extraction methods when applied to appearance-based gait recognition. Extending existing popular feature extraction methods such as HOG and LDP, we propose a novel technique which we term the Histogram of Weighted Local Directions (HWLD). These 3 methods are applied to gait recognition using the GEI feature, with classification performed using SRC. Evaluations on the CASIA and OULP datasets show significant improvements using these patch-based methods over existing implementations, with the proposed method achieving the highest recognition rate for the respective datasets. In addition, the HWLD can easily be extended to 3D, which we demonstrate using the GEV feature on the DGD dataset, observing improvements in performance.
Resumo:
Automated crowd counting has become an active field of computer vision research in recent years. Existing approaches are scene-specific, as they are designed to operate in the single camera viewpoint that was used to train the system. Real world camera networks often span multiple viewpoints within a facility, including many regions of overlap. This paper proposes a novel scene invariant crowd counting algorithm that is designed to operate across multiple cameras. The approach uses camera calibration to normalise features between viewpoints and to compensate for regions of overlap. This compensation is performed by constructing an 'overlap map' which provides a measure of how much an object at one location is visible within other viewpoints. An investigation into the suitability of various feature types and regression models for scene invariant crowd counting is also conducted. The features investigated include object size, shape, edges and keypoints. The regression models evaluated include neural networks, K-nearest neighbours, linear and Gaussian process regresion. Our experiments demonstrate that accurate crowd counting was achieved across seven benchmark datasets, with optimal performance observed when all features were used and when Gaussian process regression was used. The combination of scene invariance and multi camera crowd counting is evaluated by training the system on footage obtained from the QUT camera network and testing it on three cameras from the PETS 2009 database. Highly accurate crowd counting was observed with a mean relative error of less than 10%. Our approach enables a pre-trained system to be deployed on a new environment without any additional training, bringing the field one step closer toward a 'plug and play' system.
Resumo:
Endometrial cancer is one of the most common female diseases in developed nations and is the most commonly diagnosed gynaecological cancer in Australia. The disease is commonly classified by histology: endometrioid or non-endometrioid endometrial cancer. While non-endometrioid endometrial cancers are accepted to be high-grade, aggressive cancers, endometrioid cancers (comprising 80% of all endometrial cancers diagnosed) generally carry a favourable patient prognosis. However, endometrioid endometrial cancer patients endure significant morbidity due to surgery and radiotherapy used for disease treatment, and patients with recurrent disease have a 5-year survival rate of less than 50%. Genetic analysis of women with endometrial cancer could uncover novel markers associated with disease risk and/or prognosis, which could then be used to identify women at high risk and for the use of specialised treatments. Proteases are widely accepted to play an important role in the development and progression of cancer. This PhD project hypothesised that SNPs from two protease gene families, the matrix metalloproteases (MMPs, including their tissue inhibitors, TIMPs) and the tissue kallikrein-related peptidases (KLKs) would be associated with endometrial cancer susceptibility and/or prognosis. In the first part of this study, optimisation of the genotyping techniques was performed. Results from previously published endometrial cancer genetic association studies were attempted to be validated in a large, multicentre replication set (maximum cases n = 2,888, controls n = 4,483, 3 studies). The rs11224561 progesterone receptor SNP (PGR, A/G) was observed to be associated with increased endometrial cancer risk (per A allele OR 1.31, 95% CI 1.12-1.53; p-trend = 0.001), a result which was initially reported among a Chinese sample set. Previously reported associations for the remaining 8 SNPs investigated for this section of the PhD study were not confirmed, thereby reinforcing the importance of validation of genetic association studies. To examine the effect of SNPs from the MMP and KLK families on endometrial cancer risk, we selected the most significantly associated MMP and KLK SNPs from genome-wide association study analysis (GWAS) to be genotyped in the GWAS replication set (cases n = 4,725, controls n = 9,803, 13 studies). The significance of the MMP24 rs932562 SNP was unchanged after incorporation of the stage 2 samples (Stage 1 per allele OR 1.18, p = 0.002; Combined Stage 1 and 2 OR 1.09, p = 0.002). The rs10426 SNP, located 3' to KLK10 was predicted by bioinformatic analysis to effect miRNA binding. This SNP was observed in the GWAS stage 1 result to exhibit a recessive effect on endometrial cancer risk, a result which was not validated in the stage 2 sample set (Stage 1 OR 1.44, p = 0.007; Combined Stage 1 and 2 OR 1.14, p = 0.08). Investigation of the regions imputed surrounding the MMP, TIMP and KLK genes did not reveal any significant targets for further analysis. Analysis of the case data from the endometrial cancer GWAS to identify genetic variation associated with cancer grade did not reveal SNPs from the MMP, TIMP or KLK genes to be statistically significant. However, the representation of SNPs from the MMP, TIMP and KLK families by the GWAS genotyping platform used in this PhD project was examined and observed to be very low, with the genetic variation of four genes (MMP23A, MMP23B, MMP28 and TIMP1) not captured at all by this technique. This suggests that comprehensive candidate gene association studies will be required to assess the role of SNPs from these genes with endometrial cancer risk and prognosis. Meta-analysis of gene expression microarray datasets curated as part of this PhD study identified a number of MMP, TIMP and KLK genes to display differential expression by endometrial cancer status (MMP2, MMP10, MMP11, MMP13, MMP19, MMP25 and KLK1) and histology (MMP2, MMP11, MMP12, MMP26, MMP28, TIMP2, TIMP3, KLK6, KLK7, KLK11 and KLK12). In light of these findings these genes should be prioritised for future targeted genetic association studies. Two SNPs located 43.5 Mb apart on chromosome 15 were observed from the GWAS analysis to be associated with increased endometrial cancer grade, results that were validated in silico in two independent datasets. One of these SNPs, rs8035725 is located in the 5' untranslated region of a MYC promoter binding protein DENND4A (Stage 1 OR 1.15, p = 9.85 x 10P -5 P, combined Stage 1 and in silico validation OR 1.13, p = 5.24 x 10P -6 P). This SNP has previously been reported to alter the expression of PTPLAD1, a gene involved in the synthesis of very long fatty acid chains and in the Rac1 signaling pathway. Meta-analysis of gene expression microarray data found PTPLAD1 to display increased expression in the aggressive non-endometrioid histology compared with endometrioid endometrial cancer, suggesting that the causal SNP underlying the observed genetic association may influence expression of this gene. Neither rs8035725 nor significant SNPs identified by imputation were predicted bioinformatically to affect transcription factor binding sites, indicating that further studies are required to assess their potential effect on other regulatory elements. The other grade- associated SNP, rs6606792, is located upstream of an inferred pseudogene, ELMO2P1 (Stage 1 OR 1.12, p = 5 x 10P -5 P; combined Stage 1 and in silico validation OR 1.09, p = 3.56 x 10P -5 P). Imputation of the ±1 Mb region surrounding this SNP revealed a cluster of significantly associated variants which are predicted to abolish various transcription factor binding sites, and would be expected to decrease gene expression. ELMO2P1 was not included on the microarray platforms collected for this PhD, and so its expression could not be investigated. However, the high sequence homology of ELMO2P1 with ELMO2, a gene important to cell motility, indicates that ELMO2 could be the parent gene for ELMO2P1 and as such, ELMO2P1 could function to regulate the expression of ELMO2. Increased expression of ELMO2 was seen to be associated with increasing endometrial cancer grade, as well as with aggressive endometrial cancer histological subtypes by microarray meta-analysis. Thus, it is hypothesised that SNPs in linkage disequilibrium with rs6606792 decrease the transcription of ELMO2P1, reducing the regulatory effect of ELMO2P1 on ELMO2 expression. Consequently, ELMO2 expression is increased, cell motility is enhanced leading to an aggressive endometrial cancer phenotype. In summary, these findings have identified several areas of research for further study. The results presented in this thesis provide evidence that a SNP in PGR is associated with risk of developing endometrial cancer. This PhD study also reports two independent loci on chromosome 15 to be associated with increased endometrial cancer grade, and furthermore, genes associated with these SNPs to be differentially expressed according in aggressive subtypes and/or by grade. The studies reported in this thesis support the need for comprehensive SNP association studies on prioritised MMP, TIMP and KLK genes in large sample sets. Until these studies are performed, the role of MMP, TIMP and KLK genetic variation remains unclear. Overall, this PhD study has contributed to the understanding of genetic variation involvement in endometrial cancer susceptibility and prognosis. Importantly, the genetic regions highlighted in this study could lead to the identification of novel gene targets to better understand the biology of endometrial cancer and also aid in the development of therapeutics directed at treating this disease.
Resumo:
Background Single nucleotide polymorphisms (SNPs) rs429358 (ε4) and rs7412 (ε2), both invoking changes in the amino-acid sequence of the apolipoprotein E (APOE) gene, have previously been tested for association with multiple sclerosis (MS) risk. However, none of these studies was sufficiently powered to detect modest effect sizes at acceptable type-I error rates. As both SNPs are only imperfectly captured on commonly used microarray genotyping platforms, their evaluation in the context of genome-wide association studies has been hindered until recently. Methods We genotyped 12 740 subjects hitherto not studied for their APOE status, imputed raw genotype data from 8739 subjects from five independent genome wide association studies datasets using the most recent high-resolution reference panels, and extracted genotype data for 8265 subjects from previous candidate gene assessments. Results Despite sufficient power to detect associations at genome-wide significance thresholds across a range of ORs, our analyses did not support a role of rs429358 or rs7412 on MS susceptibility. This included meta-analyses of the combined data across 13 913 MS cases and 15 831 controls (OR=0.95, p=0.259, and OR 1.07, p=0.0569, for rs429358 and rs7412, respectively). Conclusion Given the large sample size of our analyses, it is unlikely that the two APOE missense SNPs studied here exert any relevant effects on MS susceptibility.
Resumo:
Objectives This study introduces and assesses the precision of a standardized protocol for anthropometric measurement of the juvenile cranium using three-dimensional surface rendered models, for implementation in forensic investigation or paleodemographic research. Materials and methods A subset of multi-slice computed tomography (MSCT) DICOM datasets (n=10) of modern Australian subadults (birth—10 years) was accessed from the “Skeletal Biology and Forensic Anthropology Virtual Osteological Database” (n>1200), obtained from retrospective clinical scans taken at Brisbane children hospitals (2009–2013). The capabilities of Geomagic Design X™ form the basis of this study; introducing standardized protocols using triangle surface mesh models to (i) ascertain linear dimensions using reference plane networks and (ii) calculate the area of complex regions of interest on the cranium. Results The protocols described in this paper demonstrate high levels of repeatability between five observers of varying anatomical expertise and software experience. Intra- and inter-observer error was indiscernible with total technical error of measurement (TEM) values ≤0.56 mm, constituting <0.33% relative error (rTEM) for linear measurements; and a TEM value of ≤12.89 mm2, equating to <1.18% (rTEM) of the total area of the anterior fontanelle and contiguous sutures. Conclusions Exploiting the advances of MSCT in routine clinical assessment, this paper assesses the application of this virtual approach to acquire highly reproducible morphometric data in a non-invasive manner for human identification and population studies in growth and development. The protocols and precision testing presented are imperative for the advancement of “virtual anthropology” into routine Australian medico-legal death investigation.
Resumo:
A generalised gamma bidding model is presented, which incorporates many previous models. The log likelihood equations are provided. Using a new method of testing, variants of the model are fitted to some real data for construction contract auctions to find the best fitting models for groupings of bidders. The results are examined for simplifying assumptions, including all those in the main literature. These indicate no one model to be best for all datasets. However, some models do appear to perform significantly better than others and it is suggested that future research would benefit from a closer examination of these.
Resumo:
This thesis presents a novel approach to mobile robot navigation using visual information towards the goal of long-term autonomy. A novel concept of a continuous appearance-based trajectory is proposed in order to solve the limitations of previous robot navigation systems, and two new algorithms for mobile robots, CAT-SLAM and CAT-Graph, are presented and evaluated. These algorithms yield performance exceeding state-of-the-art methods on public benchmark datasets and large-scale real-world environments, and will help enable widespread use of mobile robots in everyday applications.
Resumo:
The huge amount of CCTV footage available makes it very burdensome to process these videos manually through human operators. This has made automated processing of video footage through computer vision technologies necessary. During the past several years, there has been a large effort to detect abnormal activities through computer vision techniques. Typically, the problem is formulated as a novelty detection task where the system is trained on normal data and is required to detect events which do not fit the learned ‘normal’ model. There is no precise and exact definition for an abnormal activity; it is dependent on the context of the scene. Hence there is a requirement for different feature sets to detect different kinds of abnormal activities. In this work we evaluate the performance of different state of the art features to detect the presence of the abnormal objects in the scene. These include optical flow vectors to detect motion related anomalies, textures of optical flow and image textures to detect the presence of abnormal objects. These extracted features in different combinations are modeled using different state of the art models such as Gaussian mixture model(GMM) and Semi- 2D Hidden Markov model(HMM) to analyse the performances. Further we apply perspective normalization to the extracted features to compensate for perspective distortion due to the distance between the camera and objects of consideration. The proposed approach is evaluated using the publicly available UCSD datasets and we demonstrate improved performance compared to other state of the art methods.
Resumo:
As the systematic investigation of Twitter as a communications platform continues, the question of developing reliable comparative metrics for the evaluation of public, communicative phenomena on Twitter becomes paramount. What is necessary here is the establishment of an accepted standard for the quantitative description of user activities on Twitter. This needs to be flexible enough in order to be applied to a wide range of communicative situations, such as the evaluation of individual users’ and groups of users’ Twitter communication strategies, the examination of communicative patterns within hashtags and other identifiable ad hoc publics on Twitter (Bruns & Burgess, 2011), and even the analysis of very large datasets of everyday interactions on the platform. By providing a framework for quantitative analysis on Twitter communication, researchers in different areas (e.g., communication studies, sociology, information systems) are enabled to adapt methodological approaches and to conduct analyses on their own. Besides general findings about communication structure on Twitter, large amounts of data might be used to better understand issues or events retrospectively, detect issues or events in an early stage, or even to predict certain real-world developments (e.g., election results; cf. Tumasjan, Sprenger, Sandner, & Welpe, 2010, for an early attempt to do so).
Resumo:
Association rule mining is one technique that is widely used when querying databases, especially those that are transactional, in order to obtain useful associations or correlations among sets of items. Much work has been done focusing on efficiency, effectiveness and redundancy. There has also been a focusing on the quality of rules from single level datasets with many interestingness measures proposed. However, with multi-level datasets now being common there is a lack of interestingness measures developed for multi-level and cross-level rules. Single level measures do not take into account the hierarchy found in a multi-level dataset. This leaves the Support-Confidence approach, which does not consider the hierarchy anyway and has other drawbacks, as one of the few measures available. In this chapter we propose two approaches which measure multi-level association rules to help evaluate their interestingness by considering the database’s underlying taxonomy. These measures of diversity and peculiarity can be used to help identify those rules from multi-level datasets that are potentially useful.
Resumo:
Efficient and effective feature detection and representation is an important consideration when processing videos, and a large number of applications such as motion analysis, 3D scene understanding, tracking etc. depend on this. Amongst several feature description methods, local features are becoming increasingly popular for representing videos because of their simplicity and efficiency. While they achieve state-of-the-art performance with low computational complexity, their performance is still too limited for real world applications. Furthermore, rapid increases in the uptake of mobile devices has increased the demand for algorithms that can run with reduced memory and computational requirements. In this paper we propose a semi binary based feature detectordescriptor based on the BRISK detector, which can detect and represent videos with significantly reduced computational requirements, while achieving comparable performance to the state of the art spatio-temporal feature descriptors. First, the BRISK feature detector is applied on a frame by frame basis to detect interest points, then the detected key points are compared against consecutive frames for significant motion. Key points with significant motion are encoded with the BRISK descriptor in the spatial domain and Motion Boundary Histogram in the temporal domain. This descriptor is not only lightweight but also has lower memory requirements because of the binary nature of the BRISK descriptor, allowing the possibility of applications using hand held devices.We evaluate the combination of detectordescriptor performance in the context of action classification with a standard, popular bag-of-features with SVM framework. Experiments are carried out on two popular datasets with varying complexity and we demonstrate comparable performance with other descriptors with reduced computational complexity.
Resumo:
This paper presents a new multi-scale place recognition system inspired by the recent discovery of overlapping, multi-scale spatial maps stored in the rodent brain. By training a set of Support Vector Machines to recognize places at varying levels of spatial specificity, we are able to validate spatially specific place recognition hypotheses against broader place recognition hypotheses without sacrificing localization accuracy. We evaluate the system in a range of experiments using cameras mounted on a motorbike and a human in two different environments. At 100% precision, the multiscale approach results in a 56% average improvement in recall rate across both datasets. We analyse the results and then discuss future work that may lead to improvements in both robotic mapping and our understanding of sensory processing and encoding in the mammalian brain.
Resumo:
Most recommender systems attempt to use collaborative filtering, content-based filtering or hybrid approach to recommend items to new users. Collaborative filtering recommends items to new users based on their similar neighbours, and content-based filtering approach tries to recommend items that are similar to new users' profiles. The fundamental issues include how to profile new users, and how to deal with the over-specialization in content-based recommender systems. Indeed, the terms used to describe items can be formed as a concept hierarchy. Therefore, we aim to describe user profiles or information needs by using concepts vectors. This paper presents a new method to acquire user information needs, which allows new users to describe their preferences on a concept hierarchy rather than rating items. It also develops a new ranking function to recommend items to new users based on their information needs. The proposed approach is evaluated on Amazon book datasets. The experimental results demonstrate that the proposed approach can largely improve the effectiveness of recommender systems.
Resumo:
Big Data presents many challenges related to volume, whether one is interested in studying past datasets or, even more problematically, attempting to work with live streams of data. The most obvious challenge, in a ‘noisy’ environment such as contemporary social media, is to collect the pertinent information; be that information for a specific study, tweets which can inform emergency services or other responders to an ongoing crisis, or give an advantage to those involved in prediction markets. Often, such a process is iterative, with keywords and hashtags changing with the passage of time, and both collection and analytic methodologies need to be continually adapted to respond to this changing information. While many of the data sets collected and analyzed are preformed, that is they are built around a particular keyword, hashtag, or set of authors, they still contain a large volume of information, much of which is unnecessary for the current purpose and/or potentially useful for future projects. Accordingly, this panel considers methods for separating and combining data to optimize big data research and report findings to stakeholders. The first paper considers possible coding mechanisms for incoming tweets during a crisis, taking a large stream of incoming tweets and selecting which of those need to be immediately placed in front of responders, for manual filtering and possible action. The paper suggests two solutions for this, content analysis and user profiling. In the former case, aspects of the tweet are assigned a score to assess its likely relationship to the topic at hand, and the urgency of the information, whilst the latter attempts to identify those users who are either serving as amplifiers of information or are known as an authoritative source. Through these techniques, the information contained in a large dataset could be filtered down to match the expected capacity of emergency responders, and knowledge as to the core keywords or hashtags relating to the current event is constantly refined for future data collection. The second paper is also concerned with identifying significant tweets, but in this case tweets relevant to particular prediction market; tennis betting. As increasing numbers of professional sports men and women create Twitter accounts to communicate with their fans, information is being shared regarding injuries, form and emotions which have the potential to impact on future results. As has already been demonstrated with leading US sports, such information is extremely valuable. Tennis, as with American Football (NFL) and Baseball (MLB) has paid subscription services which manually filter incoming news sources, including tweets, for information valuable to gamblers, gambling operators, and fantasy sports players. However, whilst such services are still niche operations, much of the value of information is lost by the time it reaches one of these services. The paper thus considers how information could be filtered from twitter user lists and hash tag or keyword monitoring, assessing the value of the source, information, and the prediction markets to which it may relate. The third paper examines methods for collecting Twitter data and following changes in an ongoing, dynamic social movement, such as the Occupy Wall Street movement. It involves the development of technical infrastructure to collect and make the tweets available for exploration and analysis. A strategy to respond to changes in the social movement is also required or the resulting tweets will only reflect the discussions and strategies the movement used at the time the keyword list is created — in a way, keyword creation is part strategy and part art. In this paper we describe strategies for the creation of a social media archive, specifically tweets related to the Occupy Wall Street movement, and methods for continuing to adapt data collection strategies as the movement’s presence in Twitter changes over time. We also discuss the opportunities and methods to extract data smaller slices of data from an archive of social media data to support a multitude of research projects in multiple fields of study. The common theme amongst these papers is that of constructing a data set, filtering it for a specific purpose, and then using the resulting information to aid in future data collection. The intention is that through the papers presented, and subsequent discussion, the panel will inform the wider research community not only on the objectives and limitations of data collection, live analytics, and filtering, but also on current and in-development methodologies that could be adopted by those working with such datasets, and how such approaches could be customized depending on the project stakeholders.