880 results for transformed data
at Queensland University of Technology - ePrints Archive
Abstract:
A significant proportion of the cost of software development is due to software testing and maintenance. This is in part the result of the inevitable imperfections due to human error, lack of quality during the design and coding of software, and the increasing need to reduce faults to improve customer satisfaction in a competitive marketplace. Given the cost and importance of removing errors, improvements in fault detection and removal can be of significant benefit. The earlier in the development process faults can be found, the less it costs to correct them and the less likely other faults are to develop. This research aims to make the testing process more efficient and effective by identifying those software modules most likely to contain faults, allowing testing efforts to be carefully targeted. This is done using machine learning algorithms that learn from examples of fault-prone and not fault-prone modules to develop predictive models of quality. In order to learn the numerical mapping between module and classification, a module is represented in terms of software metrics. A difficulty in this sort of problem is sourcing software engineering data of adequate quality. In this work, data is obtained from two sources, the NASA Metrics Data Program and the open source Eclipse project. Feature selection is applied before learning, and a number of different feature selection methods are compared to find which work best. Two machine learning algorithms, Naive Bayes and the Support Vector Machine, are applied to the data, and predictive results are compared to those of previous efforts and found to be superior on selected data sets and comparable on others. In addition, a new classification method is proposed, Rank Sum, in which a ranking abstraction is laid over bin densities for each class, and a classification is determined based on the sum of ranks over features. A novel extension of this method is also described, based on an observed polarising of points by class when Rank Sum is applied to training data to convert it into 2D rank sum space. SVM is applied to this transformed data to produce models whose parameters can be set according to trade-off curves to obtain a particular performance trade-off.
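To make the workflow above concrete, a minimal, hypothetical sketch of the kind of pipeline described follows: feature selection ahead of Naive Bayes and SVM classification of module-level metrics. It uses scikit-learn with synthetic stand-in data; it is not the code, metrics or datasets used in the thesis.

```python
# Sketch only: feature selection + Naive Bayes / SVM for fault-proneness prediction.
# The synthetic data stands in for module-level software metrics labelled
# fault prone / not fault prone; it is not the NASA MDP or Eclipse data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

for name, clf in [("Naive Bayes", GaussianNB()), ("SVM", SVC(kernel="rbf"))]:
    model = make_pipeline(StandardScaler(),
                          SelectKBest(mutual_info_classif, k=8),  # feature selection step
                          clf)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```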
Abstract:
Purpose: Serum levels of the inflammatory markers YKL-40 and IL-6 are increased in many conditions, including cancers. We examined serum YKL-40 and IL-6 levels in patients with Hodgkin lymphoma (HL), a tumor with a strong immunologic reaction to relatively few tumor cells, especially in nodular sclerosis HL. Experimental Design: We analyzed Danish and Swedish patients with incident HL (N = 470) and population controls from Denmark (N = 245 for YKL-40; N = 348 for IL-6). Serum YKL-40 and IL-6 levels were determined by ELISA, and log-transformed data were analyzed by linear regression, adjusting for age and sex. Results: Serum levels of YKL-40 and IL-6 were increased in HL patients compared to controls (YKL-40: 3.6-fold, IL-6: 8.3-fold; both p < 0.0001). In samples from pre-treatment HL patients (N = 176), levels were higher in patients with more advanced stage (p-trend = 0.0001 for YKL-40 and 0.013 for IL-6) and in those with B symptoms, but levels were similar in nodular sclerosis and mixed cellularity subtypes, by EBV status, and in younger (<45 years old) and older patients. Patients tested soon after treatment onset had significantly lower levels than pre-treatment patients, but even >6 months after treatment onset, serum YKL-40 and IL-6 levels remained significantly increased compared to controls. In patients who died (N = 12), pre-treatment levels of both YKL-40 and IL-6 were higher than in survivors, although not statistically significantly. Conclusions: Serum YKL-40 and IL-6 levels were increased in untreated HL patients and in those with more advanced stages but did not differ significantly by HL histology. Following treatment, serum levels were significantly lower.
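The statistical approach described (linear regression of log-transformed marker levels, adjusted for age and sex) can be sketched roughly as follows. The variable names and simulated values are illustrative assumptions, not the study data.

```python
# Illustrative sketch only: regressing a log-transformed marker level on
# case/control status, adjusted for age and sex, as described above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "case": rng.integers(0, 2, n),   # 1 = HL patient, 0 = control (simulated)
    "age": rng.normal(45, 15, n),
    "sex": rng.integers(0, 2, n),    # 0 = female, 1 = male (simulated)
})
# Simulated, right-skewed marker level with a case effect.
df["ykl40"] = np.exp(3 + 1.3 * df["case"] + 0.01 * df["age"] + rng.normal(0, 0.5, n))

# Log-transform the outcome and adjust for age and sex.
fit = smf.ols("np.log(ykl40) ~ case + age + C(sex)", data=df).fit()
print(fit.summary().tables[1])
# exp(coefficient on `case`) approximates the fold-change in cases vs controls.
```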
Abstract:
The high morbidity and mortality associated with atherosclerotic coronary vascular disease (CVD) and its complications are being lessened by increased knowledge of risk factors, effective preventative measures and proven therapeutic interventions. However, significant CVD morbidity remains, and sudden cardiac death continues to be a presenting feature for some subsequently diagnosed with CVD. Coronary vascular disease is also the leading cause of anaesthesia-related complications. Stress electrocardiography/exercise testing is predictive of 10-year risk of CVD events, and the cardiovascular variables used to score this test are monitored peri-operatively. Similar physiological time-series datasets are being subjected to data mining methods for the prediction of medical diagnoses and outcomes. This study aims to find predictors of CVD using anaesthesia time-series data and patient risk factor data. Several pre-processing and predictive data mining methods are applied to these data. Physiological time-series data related to anaesthetic procedures are subjected to pre-processing methods for removal of outliers, calculation of moving averages, and data summarisation and data abstraction. Feature selection methods of both wrapper and filter types are applied to derived physiological time-series variable sets alone and to the same variables combined with risk factor variables. The ability of these methods to identify subsets of highly correlated but non-redundant variables is assessed. The major dataset is derived from the entire anaesthesia population, and subsets of this population are considered to be at increased anaesthesia risk based on their need for more intensive monitoring (invasive haemodynamic monitoring and additional ECG leads). Because of the unbalanced class distribution in the data, majority class under-sampling and the Kappa statistic, together with misclassification rate and area under the ROC curve (AUC), are used for evaluation of models generated using different prediction algorithms. The performance of models derived from feature-reduced datasets reveals the filter method, Cfs subset evaluation, to be most consistently effective, although Consistency-derived subsets tended to give slightly increased accuracy but markedly increased complexity. The use of misclassification rate (MR) for model performance evaluation is influenced by class distribution. This could be eliminated by consideration of the AUC or Kappa statistic, as well as by evaluation of subsets with an under-sampled majority class. The noise and outlier removal pre-processing methods produced models with MR ranging from 10.69 to 12.62, with the lowest value being for data from which both outliers and noise were removed (MR 10.69). For the raw time-series dataset, MR is 12.34. Feature selection reduces MR to between 9.8 and 10.16, with time-segmented summary data (dataset F) having MR 9.8 and raw time-series summary data (dataset A) having MR 9.92. However, for all time-series-only datasets, the complexity is high. For most pre-processing methods, Cfs could identify a subset of correlated and non-redundant variables from the time-series-alone datasets, but models derived from these subsets are of one leaf only. MR values are consistent with the class distribution in the subset folds evaluated in the n-fold cross-validation method.
For models based on Cfs-selected time-series-derived and risk factor (RF) variables, the MR ranges from 8.83 to 10.36, with dataset RF_A (raw time-series data and RF) being 8.85 and dataset RF_F (time-segmented time-series variables and RF) being 9.09. The models based on counts of outliers and counts of data points outside the normal range (dataset RF_E), and on derived variables based on time series transformed using Symbolic Aggregate Approximation (SAX) with associated time-series pattern cluster membership (dataset RF_G), perform the least well, with MR of 10.25 and 10.36 respectively. For coronary vascular disease prediction, nearest neighbour (NNge) and the support vector machine based method, SMO, have the highest MR of 10.1 and 10.28, while logistic regression (LR) and the decision tree (DT) method, J48, have MR of 8.85 and 9.0 respectively. DT rules are the most comprehensible and clinically relevant. The predictive accuracy increase achieved by the addition of risk factor variables to time-series-variable-based models is significant. The addition of time-series-derived variables to models based on risk factor variables alone is associated with a trend towards improved performance. Data mining of feature-reduced anaesthesia time-series variables together with risk factor variables can produce compact and moderately accurate models able to predict coronary vascular disease. Decision tree analysis of time-series data combined with risk factor variables yields rules which are more accurate than models based on time-series data alone. The limited additional value provided by electrocardiographic variables when compared to the use of risk factors alone is similar to recent suggestions that exercise electrocardiography (exECG) under standardised conditions has limited additional diagnostic value over risk factor analysis and symptom pattern. The pre-processing used in this study had limited effect when time-series variables and risk factor variables are used as model input. In the absence of risk factor input, the use of time-series variables after outlier removal, and of time-series variables based on physiological variable values being outside the accepted normal range, is associated with some improvement in model performance.
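A schematic illustration of the evaluation strategy described above (majority-class under-sampling, with misclassification rate, Kappa and AUC reported) is given below. The thesis used Weka; this sketch uses scikit-learn with synthetic data purely to show the mechanics, not to reproduce the reported results.

```python
# Schematic sketch: under-sample the majority class, then report MR, kappa and AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the anaesthesia time-series + risk factor data.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9, 0.1],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Under-sample the majority class in the training set to balance the classes.
rng = np.random.default_rng(1)
maj = np.where(y_tr == 0)[0]
mino = np.where(y_tr == 1)[0]
keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
X_bal, y_bal = X_tr[keep], y_tr[keep]

for name, clf in [("Logistic regression", LogisticRegression(max_iter=1000)),
                  ("Decision tree", DecisionTreeClassifier(max_depth=4))]:
    clf.fit(X_bal, y_bal)
    pred = clf.predict(X_te)
    proba = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: MR = {100 * (1 - accuracy_score(y_te, pred)):.2f}%, "
          f"kappa = {cohen_kappa_score(y_te, pred):.3f}, "
          f"AUC = {roc_auc_score(y_te, proba):.3f}")
```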
Abstract:
Traffic simulation models tend to have their own data input and output formats. In an effort to standardise the input for traffic simulations, we introduce in this paper a set of data marts that aim to serve as a common interface between the necessary data, stored in dedicated databases, and the software packages that require the input in a certain format. The data marts are developed based on real-world objects (e.g. roads, traffic lights, controllers) rather than abstract models and hence contain all the necessary information, which can be transformed by the importing software package to its needs. The paper contains a full description of the data marts for network coding, simulation results, and scenario management, which have been discussed with industry partners to ensure sustainability.
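As a rough illustration of the idea of data marts organised around real-world objects that simulators import and map to their own formats, a hypothetical sketch follows. All class and field names are invented for illustration and are not the schema defined in the paper.

```python
# Hypothetical sketch of object-oriented data marts for traffic simulation input.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Road:
    road_id: str
    from_node: str
    to_node: str
    length_m: float
    lanes: int
    speed_limit_kmh: float

@dataclass
class TrafficLight:
    signal_id: str
    node_id: str
    controller_id: str
    phases: List[str] = field(default_factory=list)  # e.g. phase plan per approach

@dataclass
class NetworkDataMart:
    """Common interface: a simulator reads these objects and converts them to its own input format."""
    roads: List[Road] = field(default_factory=list)
    signals: List[TrafficLight] = field(default_factory=list)

mart = NetworkDataMart(
    roads=[Road("R1", "N1", "N2", 450.0, 2, 60.0)],
    signals=[TrafficLight("S1", "N2", "C1", phases=["NS-green", "EW-green"])],
)
print(len(mart.roads), "roads,", len(mart.signals), "signals")
```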
Abstract:
Electrochemical processes in mesoporous TiO2-Nafion thin films deposited on indium tin oxide (ITO) electrodes are inherently complex and affected by capacitance, Ohmic iR-drop, RC time-constant phenomena, and potential- and pH-dependent conductivity. In this study, large-amplitude sinusoidally modulated voltammetry (LASMV) is employed to provide access to almost purely Faradaic current data from the second harmonic components, as well as capacitance and potential-domain information from the fundamental harmonic, for mesoporous TiO2-Nafion film electrodes. The LASMV response has been investigated with and without an immobilized one-electron redox system, ferrocenylmethyltrimethylammonium+. The results clearly demonstrate that electron transfer associated with the immobilized ferrocene derivative follows two independent pathways: (i) electron hopping within the Nafion network and (ii) conduction through the TiO2 backbone. The pH effect on the voltammetric response for the TiO2 reduction pathway (ii) can be clearly identified in the second harmonic LASMV response, with the diffusion-controlled ferrocene response (i) acting as a pH-independent reference. Because of the minimal contribution from capacitance currents, the use of second harmonic data derived from LASMV measurements may lead to reference-free pH sensing with systems like that found for the ferrocene derivatives.
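The harmonic separation that LASMV exploits can be illustrated schematically: the measured current is decomposed by FFT, the band around the fundamental carries the capacitive and potential-domain information, and the band around the second harmonic is close to purely Faradaic. The following sketch uses a simulated toy signal and assumed frequencies, not instrument data or the authors' processing code.

```python
# Conceptual sketch: separating fundamental and second harmonic components
# of a sinusoidally modulated current signal by FFT band selection.
import numpy as np

fs, f0, T = 2000.0, 10.0, 2.0                  # sample rate, modulation frequency, duration (s)
t = np.arange(0, T, 1 / fs)
# Toy current: capacitance-dominated fundamental plus a weaker, Faradaic-like second harmonic.
i_t = (1.0 * np.sin(2 * np.pi * f0 * t)
       + 0.2 * np.sin(2 * np.pi * 2 * f0 * t)
       + 0.02 * np.random.default_rng(0).normal(size=t.size))

spectrum = np.fft.rfft(i_t)
freqs = np.fft.rfftfreq(t.size, 1 / fs)

def harmonic(n, width=2.0):
    """Keep only the band around n*f0 (in Hz) and transform back to the time domain."""
    band = np.where(np.abs(freqs - n * f0) < width, spectrum, 0)
    return np.fft.irfft(band, n=t.size)

fundamental = harmonic(1)   # capacitance / potential-domain information
second = harmonic(2)        # close to purely Faradaic, per the abstract
print(f"fundamental amplitude ~ {np.max(np.abs(fundamental)):.2f}, "
      f"second harmonic amplitude ~ {np.max(np.abs(second)):.2f}")
```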
Abstract:
Compositional data analysis usually deals with relative information between parts where the total (abundances, mass, amount, etc.) is unknown or uninformative. This article addresses the question of what to do when the total is known and is of interest. Tools used in this case are reviewed and analysed, in particular the relationship between the positive orthant of D-dimensional real space, the product space of the real line times the D-part simplex, and their Euclidean space structures. The first alternative corresponds to analysing the data by taking logarithms of each component, and the second to treating a log-transformed total jointly with a composition describing the distribution of the component amounts. Real data on total abundances of phytoplankton in an Australian river motivated the present study and are used for illustration.
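A small numeric sketch of the second alternative (a log-transformed total treated jointly with the composition of component amounts) may help; the amounts below are made up and are not the phytoplankton data.

```python
# Sketch: split a vector of amounts into (log total, composition), the second
# alternative discussed above, and represent the composition in clr coordinates.
import numpy as np

amounts = np.array([120.0, 30.0, 50.0])        # amounts of D = 3 parts (illustrative)
total = amounts.sum()
composition = amounts / total                   # point in the D-part simplex

log_total = np.log(total)
# Centred log-ratio (clr) of the composition, a standard Euclidean representation.
clr = np.log(composition) - np.log(composition).mean()

print("log total:", round(log_total, 3))
print("composition:", np.round(composition, 3))
print("clr coefficients:", np.round(clr, 3))
# The first alternative in the abstract would instead take np.log(amounts) componentwise.
```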
Abstract:
Recent data indicate that levels of overweight and obesity are increasing at an alarming rate throughout the world. At a population level (and commonly to assess individual health risk), the prevalence of overweight and obesity is calculated using cut-offs of the Body Mass Index (BMI), derived from height and weight. Similarly, the BMI is also used to classify individuals and to provide a notional indication of potential health risk. Because the BMI does not distinguish lean mass from fat mass, it is likely that epidemiologic surveys that rely on BMI as a measure of adiposity will overestimate the number of individuals in the overweight (and slightly obese) categories. This tendency to misclassify individuals may be more pronounced in athletic populations or groups in which the proportion of more active individuals is higher. The differential is most pronounced in sports where it is advantageous to have a high BMI (but not necessarily high fatness). To illustrate this point, we calculated the BMIs of international professional rugby players from the four teams involved in the semi-finals of the 2003 Rugby Union World Cup. According to the World Health Organisation (WHO) cut-offs for BMI, approximately 65% of the players were classified as overweight and approximately 25% as obese. These findings demonstrate that a high BMI is commonplace (and a potentially desirable attribute for sport performance) in professional rugby players. An unanswered question is what proportion of the wider population, classified as overweight (or obese) according to the BMI, is misclassified according to both fatness and health risk. It is evident that being overweight should not be an obstacle to a physically active lifestyle. Similarly, a reliance on BMI alone may misclassify a number of individuals who might otherwise have been automatically considered fat and/or unfit.
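For reference, the BMI calculation and the standard WHO cut-offs mentioned above can be illustrated with a short worked example; the height and weight are hypothetical, not measurements from the rugby players studied.

```python
# Worked example of the BMI calculation and WHO categories referred to above.
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index = weight (kg) / height (m) squared."""
    return weight_kg / height_m ** 2

def who_category(b: float) -> str:
    if b < 18.5:
        return "underweight"
    if b < 25:
        return "normal"
    if b < 30:
        return "overweight"
    return "obese"

value = bmi(102.0, 1.85)   # a hypothetical muscular 102 kg, 1.85 m athlete
print(f"BMI = {value:.1f} -> {who_category(value)}")
# Prints ~29.8 -> overweight, despite potentially low body fat: the misclassification issue raised above.
```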
Abstract:
In this paper, a singularly perturbed ordinary differential equation with non-smooth data is considered. The numerical method is generated by means of a Petrov-Galerkin finite element method with piecewise-exponential test functions and piecewise-linear trial functions. At the point of discontinuity of the coefficient, a special technique is used. The method is shown to be first-order accurate and uniformly convergent with respect to the singular perturbation parameter. Finally, numerical results are presented which are in agreement with the theoretical results.
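As context, a representative model problem of this type (a singularly perturbed two-point boundary-value problem whose coefficient or right-hand side has a jump at an interior point) can be written as follows; the exact problem, interface conditions and error norms used in the paper may differ.

```latex
% Representative model problem with non-smooth data (illustrative, not necessarily
% the exact formulation in the paper):
\[
  -\varepsilon u''(x) + a(x)\,u'(x) + b(x)\,u(x) = f(x), \qquad x \in (0,d) \cup (d,1),
  \qquad u(0) = u(1) = 0,
\]
% where 0 < \varepsilon \ll 1 and a (or f) has a jump discontinuity at x = d, with u and
% the flux \varepsilon u' required to be continuous across x = d. Parameter-uniform
% first-order convergence means an error bound of the form
\[
  \max_i \, |u(x_i) - u^N_i| \le C\,N^{-1}
\]
% (possibly up to a logarithmic factor), with C independent of \varepsilon and of the
% number of mesh intervals N.
```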