957 results for Imbalanced datasets
Abstract:
In recent years, Web 2.0 has provided considerable facilities for people to create, share and exchange information and ideas. As a result, user-generated content, such as reviews, has exploded. Such data provide a rich source for identifying the information associated with specific reviewed items. Opinion mining has been widely used to identify the significant features of items (e.g., cameras) based upon user reviews. Feature extraction is the most critical step in identifying useful information in texts. Most existing approaches only find individual features of a product, without revealing the structural relationships that usually exist between the features. In this paper, we propose an approach to extract features and feature relationships, represented as a tree structure called a feature taxonomy, based on frequent patterns and associations between patterns derived from user reviews. The generated feature taxonomy profiles the product at multiple levels and provides more detailed information about the product. Our experimental results on several popularly used review datasets show that the proposed approach is able to capture product features and relations effectively.
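As a toy illustration of the frequent-pattern idea described in the abstract above (not the paper's actual algorithm), one can count co-occurring feature terms across reviews and keep only pairs that meet a minimum support threshold; the reviews and feature names below are invented:

```python
from collections import Counter
from itertools import combinations

# Each review reduced to the set of candidate feature terms it mentions
reviews = [
    {"battery", "lens", "zoom"},
    {"battery", "lens"},
    {"battery", "screen"},
    {"lens", "zoom"},
]

min_support = 2  # a pattern must appear in at least this many reviews

# Count every unordered pair of features co-occurring in a review
pairs = Counter(frozenset(p) for r in reviews for p in combinations(sorted(r), 2))
frequent = {p: c for p, c in pairs.items() if c >= min_support}
print(frequent)
```

Pairs such as {battery, lens} survive the support filter while one-off co-occurrences like {battery, screen} are dropped; associations between such frequent patterns are what a feature taxonomy would then be built from.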
Abstract:
As of today, opinion mining has been widely used to identify the strengths and weaknesses of products (e.g., cameras) or services (e.g., services in medical clinics or hospitals) based upon people's feedback, such as user reviews. Feature extraction is a crucial step in opinion mining and is used to collect useful information from user reviews. Most existing approaches only find individual features of a product, without the structural relationships that usually exist between the features. In this paper, we propose an approach to extract features and feature relationships, represented as a tree structure called a feature hierarchy, based on frequent patterns and associations between patterns derived from user reviews. The generated feature hierarchy profiles the product at multiple levels and provides more detailed information about the product. Our experimental results on several popularly used review datasets show that the proposed feature extraction approach can identify more correct features than the baseline model. Even though the datasets used in the experiment concern cameras, our work can be applied to generate features for a service, such as the services in hospitals or clinics.
Abstract:
For a planetary rover to successfully traverse across unstructured terrain autonomously, one of the major challenges is to assess its local traversability such that it can plan a trajectory to achieve its mission goals efficiently while minimising risk to the vehicle itself. This paper aims to provide a comparative study on different approaches for representing the geometry of Martian terrain for the purpose of evaluating terrain traversability. An accurate representation of the geometric properties of the terrain is essential as it can directly affect the determination of traversability for a ground vehicle. We explore current state-of-the-art techniques for terrain estimation, in particular Gaussian Processes (GP) in various forms, and discuss the suitability of each technique in the context of an unstructured Martian terrain. Furthermore, we present the limitations of regression techniques in terms of spatial correlation and continuity assumptions, and the impact on traversability analysis of a planetary rover across unstructured terrain. The analysis was performed on datasets of the Mars Yard at the Powerhouse Museum in Sydney, obtained using the onboard RGB-D camera.
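As a minimal sketch of the kind of Gaussian Process regression surveyed in the abstract above (a standard RBF-kernel formulation assumed here, not any specific technique from the paper), one can estimate a smooth 1-D "elevation profile" from noisy samples; the data are synthetic:

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    """Squared-exponential kernel for 1-D inputs (length scale ell, signal scale sf)."""
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(2)
X = np.linspace(0.0, 5.0, 25)                    # sampled positions along a transect
y = np.sin(X) + 0.05 * rng.normal(size=X.size)   # noisy "elevation" readings

Xs = np.linspace(0.0, 5.0, 101)                  # query grid for the terrain estimate
K = rbf(X, X) + 0.05**2 * np.eye(X.size)         # kernel matrix with noise on the diagonal
mean = rbf(Xs, X) @ np.linalg.solve(K, y)        # GP posterior mean elevation
```

The smoothness of the recovered surface is governed entirely by the kernel's length scale, which is exactly the spatial-correlation and continuity assumption whose limitations the paper discusses for unstructured terrain.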
Abstract:
This document describes large, accurately calibrated and time-synchronised datasets, gathered in controlled environmental conditions, using an unmanned ground vehicle equipped with a wide variety of sensors. These sensors include: multiple laser scanners, a millimetre wave radar scanner, a colour camera and an infra-red camera. Full details of the sensors are given, as well as the calibration parameters needed to locate them with respect to each other and to the platform. This report also specifies the format and content of the data, and the conditions in which the data have been gathered. The data collection was made in two different situations of the vehicle: static and dynamic. The static tests consisted of sensing a fixed ’reference’ terrain, containing simple known objects, from a motionless vehicle. For the dynamic tests, data were acquired from a moving vehicle in various environments, mainly rural, including an open area, a semi-urban zone and a natural area with different types of vegetation. For both categories, data have been gathered in controlled environmental conditions, which included the presence of dust, smoke and rain. Most of the environments involved were static, except for a few specific datasets which involve the presence of a walking pedestrian. Finally, this document presents illustrations of the effects of adverse environmental conditions on sensor data, as a first step towards reliability and integrity in autonomous perceptual systems.
Abstract:
Malignant Pleural Mesothelioma (MPM) is an aggressive cancer that is often diagnosed at an advanced stage and is characterized by a long latency period (20-40 years between initial exposure and diagnosis) and prior exposure to asbestos. Currently, accurate diagnosis of MPM is difficult due to the lack of sensitive biomarkers, and despite minor improvements in treatment, median survival does not exceed 12 months. Accumulating evidence suggests that aberrant expression of long non-coding RNAs (lncRNAs) plays an important functional role in cancer biology. LncRNAs are a recently discovered class of non-protein-coding RNAs >200 nucleotides in length with a role in regulating transcription. Here we used NCode long noncoding microarrays to identify differentially expressed lncRNAs potentially involved in MPM pathogenesis. High-priority candidate lncRNAs were selected on the basis of statistical (P<0.05) and biological significance (>3-fold difference). Expression levels of 9 candidate lncRNAs were technically validated using RT-qPCR, and biologically validated in three independent test sets: (1) 57 archived MPM tissues obtained from extrapleural pneumonectomy patients, (2) 15 cryopreserved MPM and 3 benign pleura, and (3) an extended panel of 10 MPM cell lines. RT-qPCR analysis demonstrated consistent up-regulation of these lncRNAs in independent datasets. ROC curve analysis showed that two candidates were able to separate benign pleura and MPM with high sensitivity and specificity, and were associated with nodal metastases and survival following induction chemotherapy. These results suggest that lncRNAs have the potential to serve as biomarkers in MPM.
Abstract:
The global financial crisis (GFC) in 2008 rocked local, regional, and state economies throughout the world. Several intermediate outcomes of the GFC have been well documented in the literature, including loss of jobs and reduced income. Relatively little research has, however, examined the impacts of the GFC on individual-level travel behaviour change. To address this shortcoming, HABITAT panel data were employed to estimate a multinomial logit model examining the mode-switching behaviour between 2007 (pre-GFC) and 2009 (post-GFC) of a baby boomer cohort in Brisbane, Australia—a city within a developed country that has on many metrics been the least affected by the GFC. In addition, a Poisson regression model was estimated to model the number of trips made by individuals in 2007, 2008, and 2009. The South East Queensland Travel Survey datasets were used to develop this model. Four linear regression models were estimated to assess the effects of the GFC on the time allocated to travel during a day: one for each of three travel modes (public transport, active transport, and less environmentally friendly transport), and an overall travel-time model irrespective of mode. The results reveal that individuals who lost their job or whose income fell between 2007 and 2009 were more likely to switch to public transport. Individuals also made significantly fewer trips in 2008 and 2009 compared to 2007. Individuals spent significantly less time using less environmentally friendly transport but more time using public transport in 2009. Baby boomers switched to more environmentally friendly travel modes during the GFC.
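The Poisson trip-count model mentioned in the abstract above can be sketched generically (on synthetic data, not the HABITAT or South East Queensland surveys) as a Poisson regression with a log link fitted by Newton/IRLS iterations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)                      # a single illustrative covariate
X = np.column_stack([np.ones(n), x])        # design matrix with intercept
beta_true = np.array([0.5, 0.3])            # invented "true" coefficients
y = rng.poisson(np.exp(X @ beta_true))      # simulated trip counts

# Newton / IRLS for the Poisson GLM with log link: mu = exp(X beta)
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)
    XtWX = X.T @ (X * mu[:, None])          # X' diag(mu) X  (observed information)
    beta = beta + np.linalg.solve(XtWX, X.T @ (y - mu))
```

The fitted coefficients recover the simulated values closely; in the paper's setting the covariates would instead encode year indicators and socio-economic attributes.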
Abstract:
In order to increase the accuracy of patient positioning for complex radiotherapy treatments, various 3D imaging techniques have been developed. MegaVoltage Cone Beam CT (MVCBCT) can utilise existing hardware to implement a 3D imaging modality to aid patient positioning. MVCBCT has been investigated using an unmodified Elekta Precise linac and iView amorphous silicon electronic portal imaging device (EPID). Two methods of delivery and acquisition have been investigated for imaging an anthropomorphic head phantom and a quality assurance phantom. Phantom projections were successfully acquired and CT datasets reconstructed using both acquisition methods. Bone, tissue and air were clearly resolvable in both phantoms even with low-dose (22 MU) scans. The feasibility of MegaVoltage Cone Beam CT was investigated using a standard linac, amorphous silicon EPID and a combination of a free open-source reconstruction toolkit as well as custom in-house software written in Matlab. The resultant image quality has been assessed and presented. Although bone, tissue and air were resolvable in all scans, artifacts are present and scan doses are increased when compared with standard portal imaging. The feasibility of MVCBCT with an unmodified Elekta Precise linac and EPID has been considered, as well as the identification of possible areas for future development in artifact correction techniques to further improve image quality.
Abstract:
The K-means algorithm is one of the most popular clustering techniques. Nevertheless, its performance depends strongly on the initial cluster centers, and it can converge to local minima. This paper proposes a hybrid evolutionary-programming-based clustering algorithm, called PSO-SA, combining particle swarm optimization (PSO) and simulated annealing (SA). The basic idea is to search around the global solution with SA and to increase the information exchange among particles using a mutation operator to escape local optima. Three datasets, Iris, Wisconsin Breast Cancer, and Ripley's Glass, were considered to show the effectiveness of the proposed clustering algorithm in providing optimal clusters. The simulation results show that the PSO-SA clustering algorithm not only produces better solutions but also converges more quickly than the K-means, PSO, and SA algorithms.
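The sensitivity of K-means to its initial centers, which motivates hybrids such as PSO-SA, can be demonstrated with a minimal Lloyd's-algorithm implementation on toy 1-D data (this sketch is plain K-means, not the paper's PSO-SA algorithm):

```python
import numpy as np

def kmeans(X, centers, iters=50):
    """Plain Lloyd's algorithm; returns final centers and inertia (sum of squared distances)."""
    centers = centers.astype(float).copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(len(centers)):
            if np.any(labels == k):          # keep the old center if a cluster empties
                centers[k] = X[labels == k].mean(axis=0)
    inertia = float(((X - centers[labels]) ** 2).sum())
    return centers, inertia

# Three well-separated 1-D groups
X = np.array([0.0, 1.0, 10.0, 11.0, 20.0, 21.0]).reshape(-1, 1)

_, good = kmeans(X, np.array([[0.5], [10.5], [20.5]]))   # one center per group
_, bad = kmeans(X, np.array([[0.0], [0.5], [1.0]]))      # all centers start in one group
print(good, bad)  # 1.5 101.0 — the bad start converges to a far worse local minimum
```

The badly initialised run gets stuck in a local minimum with inertia nearly two orders of magnitude worse, which is exactly the failure mode a global-search wrapper like PSO-SA is designed to escape.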
Abstract:
An important aspect of decision support systems involves applying sophisticated and flexible statistical models to real datasets and communicating the results to decision makers in interpretable ways. An important class of problem is the modelling of incidence, such as fire or disease. Models of incidence known as point processes or Cox processes are particularly challenging as they are 'doubly stochastic', i.e. obtaining the probability mass function of incidents requires two integrals to be evaluated. Existing approaches either use simple models that obtain predictions from plug-in point estimates and do not distinguish between Cox processes and density estimation, but do use sophisticated 3D visualization for interpretation; or they employ sophisticated non-parametric Bayesian Cox process models but do not use visualization to render interpretable, complex spatio-temporal forecasts. The contribution here is to fill this gap by inferring predictive distributions of log-Gaussian Cox processes and rendering them using state-of-the-art 3D visualization techniques. This requires performing inference on an approximation of the model on a large-scale discretized grid and adapting an existing spatial-diurnal kernel to the log-Gaussian Cox process context.
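In standard notation (supplied here for orientation, not quoted from the paper), a log-Gaussian Cox process places a Gaussian process prior on the log-intensity:

```latex
f(\mathbf{s}, t) \sim \mathcal{GP}\big(m(\cdot),\, k(\cdot,\cdot)\big), \qquad
\lambda(\mathbf{s}, t) = \exp\!\big(f(\mathbf{s}, t)\big), \qquad
N(A) \mid \lambda \sim \mathrm{Poisson}\!\Big(\textstyle\int_A \lambda(\mathbf{s}, t)\, d\mathbf{s}\, dt\Big).
```

Evaluating $\Pr(N(A) = k)$ therefore requires integrating over both the region $A$ and the random intensity $\lambda$, which is the 'doubly stochastic' property and the pair of integrals mentioned in the abstract.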
Abstract:
Many emerging economies are holding out the patent system as an incentive to stimulate biotechnological innovation, with the ultimate premise that this will improve their economic and social growth. The patent system mandates full disclosure of the patented invention in exchange for a temporary exclusive patent right. Recently, however, patent offices have fallen short of complying with such a mandate, especially for genetic inventions. Most patent offices provide only static information about disclosed patent sequences, and some do not even keep track of the sequence listing data in their own databases. The successful partnership of QUT Library and Cambia exemplifies advocacy in Open Access, Open Innovation and User Participation. The library extends its services to various departments within the university, and builds and encourages research networks to complement the skills needed to make a contribution in the real world.
Abstract:
Enterprises, both public and private, have rapidly begun exploiting the benefits of enterprise resource planning (ERP) combined with business analytics and "open data sets", which are often outside the control of the enterprise, to gain further efficiencies, build new service operations and increase business activity. In many cases, these business activities are based around relevant software systems hosted in a "cloud computing" environment. "Garbage in, garbage out", or "GIGO", is a term long used to describe problems of unqualified dependency on information systems, dating from the 1960s. However, a more pertinent variation arose sometime later, namely "garbage in, gospel out", signifying that with large-scale information systems, such as ERP and usage of open datasets in a cloud environment, the ability to verify the authenticity of the data sets used may be almost impossible, resulting in dependence upon questionable results. Illicit data set "impersonation" becomes a reality. At the same time, the ability to audit such results may be an important requirement, particularly in the public sector. This paper discusses the need for enhancement of identity, reliability, authenticity and audit services, including naming and addressing services, in this emerging environment, and analyses some current technologies on offer which may be appropriate. However, severe limitations to addressing these requirements have been identified, and the paper proposes further research work in the area.
Abstract:
Enterprise resource planning (ERP) systems are rapidly being combined with "big data" analytics processes and publicly available "open data sets", which are usually outside the arena of the enterprise, to expand activity through better service to current clients as well as to identify new opportunities. Moreover, these activities are now largely based around relevant software systems hosted in a "cloud computing" environment. Meanwhile, the over-50-year-old phrase reflecting mistrust in computer systems, namely "garbage in, garbage out" or "GIGO", is used to describe problems of unqualified and unquestioning dependency on information systems. However, a more relevant GIGO interpretation arose sometime later, namely "garbage in, gospel out", signifying that with large-scale information systems based around ERP and open datasets as well as "big data" analytics, particularly in a cloud environment, the ability to verify the authenticity and integrity of the data sets used may be almost impossible. In turn, this may easily result in decision making based upon questionable, unverifiable results. Illicit "impersonation" of, and modifications to, legitimate data sets may become a reality, while at the same time the ability to audit any derived results of analysis may be an important requirement, particularly in the public sector. The pressing need for enhancement of identity, reliability, authenticity and audit services, including naming and addressing services, in this emerging environment is discussed in this paper. Some appropriate technologies currently being offered are also examined. However, severe limitations in addressing the problems identified are found, and the paper proposes further necessary research work for the area.
(Note: This paper is based on an earlier unpublished paper/presentation “Identity, Addressing, Authenticity and Audit Requirements for Trust in ERP, Analytics and Big/Open Data in a ‘Cloud’ Computing Environment: A Review and Proposal” presented to the Department of Accounting and IT, College of Management, National Chung Chen University, 20 November 2013.)
Abstract:
In this study, a machine learning technique called anomaly detection is employed for wind turbine bearing fault detection. The anomaly detection algorithm is used to recognize the presence of unusual and potentially faulty data in a dataset, and comprises two phases: a training phase and a testing phase. Two bearing datasets were used to validate the proposed technique: fault-seeded bearing data from a test rig at Case Western Reserve University, used to validate the accuracy of the anomaly detection method, and run-to-failure bearing data from the NSF I/UCR Center for Intelligent Maintenance Systems (IMS). The latter dataset was used to compare anomaly detection with SVM, a well-known previously applied method, in rapidly finding incipient faults.
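The two-phase structure described above can be sketched in its simplest form (a generic threshold detector on synthetic vibration features, not the paper's actual method or data): the training phase characterises healthy behaviour, and the testing phase flags departures from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training phase: a feature (e.g. RMS vibration amplitude) from a healthy bearing
healthy = rng.normal(1.0, 0.1, 500)
mu, sigma = healthy.mean(), healthy.std()

# Testing phase: flag samples that fall far outside the healthy distribution
def is_anomaly(x, k=4.0):
    """Flag x as anomalous if it lies more than k standard deviations from healthy."""
    return abs(x - mu) > k * sigma

print(is_anomaly(1.05), is_anomaly(2.0))  # False True
```

A real deployment would use richer features and a learned decision boundary, but the train-on-healthy / test-for-deviation split is the same.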
Abstract:
We present the PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with the Empirical Bernstein bound. It makes it possible to take advantage of small empirical variance and is especially useful in regression. We show that when the empirical variance is significantly smaller than the empirical loss, the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002), and otherwise it is comparable. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of the application of the PAC-Bayesian bounding technique to self-bounding functions. We provide an empirical comparison of the PAC-Bayes-Empirical-Bernstein inequality with the PAC-Bayes-kl inequality on a synthetic example and several UCI datasets.
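For orientation, the two ingredients named above can be written out in their textbook forms; the logarithmic factors and constants vary slightly across statements in the literature, so these are indicative rather than the paper's exact bound. The empirical Bernstein bound (Maurer and Pontil, 2009) exploits a small empirical variance $\hat{V}_n$:

```latex
L(h) \;\le\; \hat{L}(h) \;+\; \sqrt{\frac{2\,\hat{V}_n(h)\,\ln(2/\delta)}{n}} \;+\; \frac{7\,\ln(2/\delta)}{3(n-1)},
```

while the PAC-Bayes-kl inequality of Seeger (2002) controls $\mathrm{kl}\big(\mathbb{E}_\rho[\hat{L}] \,\big\|\, \mathbb{E}_\rho[L]\big) \le \big(\mathrm{KL}(\rho\|\pi) + \ln(2\sqrt{n}/\delta)\big)/n$ simultaneously for all posterior distributions $\rho$. The inequality presented in the abstract combines these two ingredients.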
The Arab Spring and its social media audiences: English and Arabic Twitter users and their networks
Abstract:
While accounts of the 2011 'Arab Spring' are likely to overstate the impact of Facebook and Twitter on these uprisings, it is nonetheless true that protests and unrest in countries from Tunisia to Syria generated a substantial amount of social media activity. On Twitter alone, several million tweets containing the hashtags #libya or #egypt were generated during 2011, both by directly affected citizens of these countries and by onlookers from further afield. What remains unclear, though, is the extent to which there was any direct interaction between these two groups (especially considering potential language barriers between them). Building on hashtag datasets gathered between January and November 2011, this paper compares patterns of Twitter usage during the popular revolution in Egypt and the civil war in Libya. Using custom-made tools for processing 'big data', we examine the volume of tweets sent by English-, Arabic-, and mixed-language Twitter users over time, and examine the networks of interaction (variously through @replying, retweeting, or both) between these groups as they developed and shifted over the course of these uprisings. Examining @reply and retweet traffic, we identify general patterns of information flow between the English- and Arabic-speaking sides of the Twittersphere, and highlight the roles played by users bridging both language spheres.