24 results for Classification approach
Abstract:
Support vector machine (SVM) is a powerful technique for data classification. Despite its sound theoretical foundations and high classification accuracy, the standard SVM is not suitable for classifying large data sets, because its training complexity depends heavily on the size of the data set. This paper presents a novel SVM classification approach for large data sets based on minimum enclosing ball clustering. After the training data are partitioned by the proposed clustering method, the cluster centers are used for a first SVM classification. The clusters whose centers are support vectors, together with clusters containing more than one class, are then used for a second SVM classification; at this stage most of the data have been removed. Several experimental results show that the proposed approach matches the classification accuracy of the classic SVM while training significantly faster than several other SVM classifiers.
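A minimal sketch of this two-stage scheme, using scikit-learn with KMeans standing in for the paper's minimum enclosing ball clustering; the cluster count and kernel settings are illustrative assumptions, not the paper's values:

```python
# Sketch: two-stage SVM training on cluster representatives, then on the
# members of "informative" clusters only. KMeans stands in for minimum
# enclosing ball clustering; n_clusters and the RBF kernel are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def two_stage_svm(X, y, n_clusters=50):
    # y: integer class labels (e.g. 0/1)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    centers = km.cluster_centers_
    # Label each center by the majority class of its cluster.
    center_y = np.array([np.bincount(y[km.labels_ == c]).argmax()
                         for c in range(n_clusters)])
    # Stage 1: train on the cluster centers only.
    svm1 = SVC(kernel='rbf').fit(centers, center_y)
    # Keep clusters whose center is a support vector, plus mixed-class clusters.
    sv_clusters = set(svm1.support_)
    mixed = {c for c in range(n_clusters)
             if len(np.unique(y[km.labels_ == c])) > 1}
    keep = np.isin(km.labels_, list(sv_clusters | mixed))
    # Stage 2: retrain on the retained (much smaller) subset of the data.
    return SVC(kernel='rbf').fit(X[keep], y[keep])
```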
Abstract:
Mobile malware has continued to grow at an alarming rate despite ongoing mitigation efforts. This is especially prevalent on Android, an open platform that is rapidly overtaking its competitors in the mobile smart-device market. Recently, a new generation of Android malware families has emerged with advanced evasion capabilities that make them much harder to detect with conventional methods. This paper proposes and investigates a parallel machine-learning-based classification approach for early detection of Android malware. Using real malware samples and benign applications, a composite classification model is developed from a parallel combination of heterogeneous classifiers. Empirical evaluation of the model under different combination schemes demonstrates its efficacy and its potential to improve detection accuracy. More importantly, by utilizing several classifiers with diverse characteristics, their strengths can be harnessed not only for enhanced Android malware detection but also for quicker white-box analysis via the more interpretable constituent classifiers.
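A minimal sketch of a parallel combination of heterogeneous classifiers, here via scikit-learn's soft-voting ensemble; the base learners and the voting scheme are assumptions rather than the paper's exact combination schemes:

```python
# Sketch: parallel combination of diverse classifiers for malware detection.
# The three base learners and soft voting are illustrative choices; the
# paper evaluates several combination schemes.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

def build_composite():
    # Diverse classifiers: a tree ensemble, a linear model, and a
    # probabilistic model over binary (e.g. permission/API) features.
    return VotingClassifier(
        estimators=[('rf', RandomForestClassifier(n_estimators=100)),
                    ('lr', LogisticRegression(max_iter=1000)),
                    ('nb', BernoulliNB())],
        voting='soft')  # average the predicted class probabilities

# Usage: clf = build_composite().fit(X_train, y_train); clf.predict(X_test)
```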
Abstract:
N-gram analysis investigates the structure of a program using bytes, characters or text strings. This research uses dynamic analysis to investigate malware detection with a classification approach based on N-gram analysis. A key issue with dynamic analysis is how long a program must run to ensure a correct classification. The motivation for this research is to find the optimum subset of operational codes (opcodes) that are the best indicators of malware, and to determine how long a program must be monitored to ensure an accurate support vector machine (SVM) classification of benign and malicious software. The experiments in this study represent programs as opcode density histograms obtained through dynamic analysis over different program run periods. An SVM is used as the classifier to determine how well different run lengths detect the presence of malicious software. The findings show that malware can be detected at different program run lengths using a small number of opcodes.
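A minimal sketch of the representation described: each monitored run becomes a normalised opcode density histogram that feeds an SVM. The opcode vocabulary and trace format are assumed for illustration:

```python
# Sketch: opcode density histograms from dynamic traces, classified by SVM.
# `traces` is assumed to be a list of opcode-name sequences captured over a
# fixed run period; the vocabulary below is an illustrative subset.
from collections import Counter
import numpy as np
from sklearn.svm import SVC

VOCAB = ['mov', 'push', 'pop', 'call', 'ret', 'jmp', 'add', 'xor']  # assumed

def density_histogram(trace):
    counts = Counter(op for op in trace if op in VOCAB)
    total = max(sum(counts.values()), 1)
    return np.array([counts[op] / total for op in VOCAB])

def train(traces, labels):  # labels: 1 = malware, 0 = benign
    X = np.vstack([density_histogram(t) for t in traces])
    return SVC(kernel='rbf').fit(X, labels)
```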
Abstract:
N-gram analysis investigates the structure of a program using bytes, characters or text strings. This research uses dynamic analysis to investigate malware detection with a classification approach based on N-gram analysis. The motivation for this research is to find a subset of N-gram features that makes a robust indicator of malware. The experiments in this paper represent programs as N-gram density histograms obtained through dynamic analysis. A support vector machine (SVM) is used as the classifier to determine how well N-grams detect the presence of malicious software. Preliminary findings show that N-gram sizes of N=3 and N=4 offer the best avenues for further analysis.
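A minimal sketch of building N-gram density histograms for N=3 and N=4; hashing the N-grams into a fixed number of bins is an implementation assumption here, not the paper's feature selection:

```python
# Sketch: N-gram density histogram over an opcode sequence. A stable hash
# (crc32) maps each N-gram to a bin so features are reproducible across
# runs; n_bins is an assumed fixed feature width.
import numpy as np
from zlib import crc32

def ngram_density(trace, n=3, n_bins=256):
    hist = np.zeros(n_bins)
    for i in range(len(trace) - n + 1):
        gram = ' '.join(trace[i:i + n]).encode()
        hist[crc32(gram) % n_bins] += 1
    total = hist.sum()
    return hist / total if total else hist

# N=3 and N=4 were the most promising sizes per the abstract's findings:
# features = np.concatenate([ngram_density(t, 3), ngram_density(t, 4)])
```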
Abstract:
During recent years, increasing knowledge of the genetic and physiological changes in polycythemia vera (PV) and of the different types of congenital erythrocytosis has led to fundamental changes in the recommended diagnostic approach to patients with erythrocytosis. Although widely accepted for adult patients, this approach may not be appropriate for children and adolescents affected by erythrocytosis. The "congenital erythrocytosis" working group established within the framework of the MPN&MPNr-EuroNet (COST action BM0902) addressed this question in a consensus finding process and developed a specific algorithm for the diagnosis of erythrocytosis in childhood and adolescence, which is presented here.
Abstract:
Classification methods with embedded feature selection are very appealing for the analysis of complex processes, since they allow root causes to be analysed even when the number of input variables is high. In this work, we investigate the performance of three classification techniques within a Monte Carlo strategy aimed at root cause analysis. We consider the naive Bayes classifier and the logistic regression model with two different implementations for controlling model complexity: a LASSO-like implementation with L1-norm regularization, and a fully Bayesian implementation of the logistic model, the so-called relevance vector machine. Several challenges can arise when estimating such models, mainly linked to the characteristics of the data: a large number of input variables, high correlation among subsets of variables, more variables than available data points, and unbalanced datasets. Using an ecological and a semiconductor manufacturing dataset, we show the advantages and drawbacks of each method, highlighting the superior classification accuracy of the relevance vector machine relative to the other classifiers. Moreover, we show how combining the proposed techniques with the Monte Carlo approach yields more robust insights into the problem under analysis under challenging modelling conditions.
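A minimal sketch of the Monte Carlo evaluation around such classifiers; scikit-learn provides naive Bayes and L1-regularized logistic regression, while the relevance vector machine would require a third-party implementation (noted as a placeholder). The resampling scheme and metric are assumptions:

```python
# Sketch: Monte Carlo evaluation of classifiers with embedded feature
# selection. The relevance vector machine slot is a placeholder; it is not
# part of scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

MODELS = {
    'naive_bayes': GaussianNB(),
    'l1_logistic': LogisticRegression(penalty='l1', solver='liblinear'),
    # 'rvm': ...  # relevance vector machine via a third-party package (assumed)
}

def monte_carlo(X, y, n_runs=100):
    scores = {name: [] for name in MODELS}
    for run in range(n_runs):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                              random_state=run)
        for name, model in MODELS.items():
            scores[name].append(
                accuracy_score(yte, model.fit(Xtr, ytr).predict(Xte)))
    # Mean and spread over random splits give the "robust insight".
    return {name: (np.mean(s), np.std(s)) for name, s in scores.items()}
```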
Abstract:
The grading of crushed aggregate is usually carried out by sieving. We describe a new image-based approach to the automatic grading of such materials. The operational problem addressed is one where the camera is located directly over a conveyor belt. Our approach characterizes the information content of each image, taking into account relative variation in the pixel data and resolution scale. In feature space, we find very good class separation using a multidimensional linear classifier. The innovations in this work are (i) introducing an effective image-based approach into this application area, and (ii) supervised classification using wavelet entropy-based features.
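A minimal sketch of wavelet entropy features for a greyscale image followed by a linear classifier; the wavelet family, decomposition depth and use of LDA are assumptions:

```python
# Sketch: wavelet entropy features per resolution scale, fed to a linear
# classifier. Daubechies wavelet, 3 levels, and LDA are assumed choices.
import numpy as np
import pywt  # PyWavelets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def wavelet_entropy_features(image, wavelet='db2', level=3):
    feats = []
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    for band in coeffs[1:]:            # detail bands, one tuple per scale
        for arr in band:               # horizontal, vertical, diagonal
            energy = arr ** 2
            p = energy / max(energy.sum(), 1e-12)  # normalise to a distribution
            feats.append(-np.sum(p * np.log2(p + 1e-12)))  # Shannon entropy
    return np.array(feats)

# Usage: X = np.vstack([wavelet_entropy_features(img) for img in images])
#        clf = LinearDiscriminantAnalysis().fit(X, grades)
```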
Abstract:
A study was performed to determine whether targeted metabolic profiling of cattle sera could be used to establish a predictive tool for identifying hormone misuse in cattle. Metabolites were assayed in heifers (n = 5) treated with nortestosterone decanoate (0.85 mg/kg body weight), untreated heifers (n = 5), steers (n = 5) treated with oestradiol benzoate (0.15 mg/kg body weight) and untreated steers (n = 5). Treatments were administered on days 0, 14, and 28 of a 42-day study period. Two support vector machines (SVMs) were trained, on heifer and steer data respectively, to identify hormone-treated animals. The performance of both SVM classifiers was evaluated by the sensitivity and specificity of treatment prediction. The SVM trained on steer data achieved 97.33% sensitivity and 93.85% specificity, while the one trained on heifer data achieved 94.67% sensitivity and 87.69% specificity. The SVM solutions were further exploited to determine the days on which classification was most reliable; for both heifers and steers, days 17-35 were the most selective. In summary, bioinformatics applied to targeted metabolic profiles generated from standard clinical chemistry analyses has yielded an accurate, inexpensive, high-throughput test for predicting steroid abuse in cattle.
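A minimal sketch of the reported evaluation: training an SVM and computing sensitivity and specificity from the confusion matrix; the kernel and data split are assumptions:

```python
# Sketch: sensitivity and specificity of a binary SVM classifier
# (1 = hormone-treated, 0 = untreated). The linear kernel is illustrative.
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(X_train, y_train, X_test, y_test):
    pred = SVC(kernel='linear').fit(X_train, y_train).predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate: treated found treated
    specificity = tn / (tn + fp)   # true negative rate: untreated found clean
    return sensitivity, specificity
```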
Abstract:
The identification and classification of network traffic and protocols is a vital step in many quality-of-service and security systems. Traffic classification strategies must evolve alongside the protocols using the Internet to overcome the use of ephemeral or masquerading port numbers and transport-layer encryption. This research expands the concept of using machine learning on the initial statistics of a flow of packets to determine its underlying protocol. Recognising the need for efficient training and retraining of a classifier and the requirement for fast classification, the authors investigate a new application of k-means clustering referred to as 'two-way' classification. The 'two-way' classification uniquely analyses a bidirectional flow as two unidirectional flows and is shown, through experiments on real network traffic, to improve classification accuracy by as much as 18% when measured against similar proposals. It achieves this accuracy while generating fewer clusters, so fewer comparisons are needed to classify a flow. 'Two-way' classification thus offers a new way to improve the accuracy and efficiency of machine-learning statistical classifiers while maintaining the fast training times associated with k-means.
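A minimal sketch of the 'two-way' idea: each bidirectional flow contributes two unidirectional feature vectors, k-means clusters are labelled by majority protocol, and the two directions vote at classification time. The feature choice, k and the tie-break rule are assumptions:

```python
# Sketch: 'two-way' k-means traffic classification. Each bidirectional flow
# yields two unidirectional feature vectors (e.g. initial packet sizes and
# inter-arrival statistics per direction). Protocol labels are assumed to
# be integer-encoded.
import numpy as np
from sklearn.cluster import KMeans

def train_two_way(fwd_feats, rev_feats, protocols, k=40):
    X = np.vstack([fwd_feats, rev_feats])        # two rows per flow
    y = np.concatenate([protocols, protocols])
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    # Label each cluster with the majority protocol of its members.
    cluster_label = {c: np.bincount(y[km.labels_ == c]).argmax()
                     for c in range(k)}
    return km, cluster_label

def classify(km, cluster_label, fwd, rev):
    pair = np.vstack([fwd, rev])
    cids = km.predict(pair)
    if cluster_label[cids[0]] == cluster_label[cids[1]]:
        return cluster_label[cids[0]]
    # Disagreement: trust the direction closer to its assigned centroid.
    d = np.linalg.norm(pair - km.cluster_centers_[cids], axis=1)
    return cluster_label[cids[int(np.argmin(d))]]
```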
Abstract:
Discrete Conditional Phase-type (DC-Ph) models consist of a process component (a survival distribution) preceded by a set of related conditional discrete variables. This paper introduces a DC-Ph model in which the conditional component is a classification tree. The approach is utilised for modelling health service capacities by better predicting service times, captured by Coxian phase-type distributions, interfaced with the results of a classification tree algorithm. To illustrate the approach, a case study from the healthcare delivery domain is given, namely maternity services. The classification analysis is shown to give good predictors for complications during childbirth. Based on the classification tree predictions, the duration of childbirth on the labour ward is then modelled as either a two- or three-phase Coxian distribution. The resulting DC-Ph model is used to calculate the number of patients and associated bed occupancies and patient turnover, and to model the consequences of changes to risk status.
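A minimal sketch of the DC-Ph structure: a classification tree selects a risk group, which in turn selects the Coxian phase-type parameters used to simulate ward duration. All rates and the grouping here are hypothetical values:

```python
# Sketch: discrete conditional phase-type model. A fitted decision tree
# selects a risk group; each group has its own Coxian duration model.
# lam = onward transition rates, mu = exit (absorption) rates; all values
# are hypothetical, not the paper's estimates.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

COXIAN = {  # risk group -> (lam, mu)
    0: ([1.2], [0.8, 0.5]),           # two-phase Coxian: low risk
    1: ([0.9, 0.6], [0.4, 0.3, 0.7])  # three-phase Coxian: high risk
}

def sample_coxian(lam, mu, rng):
    """Time to absorption: phase i is left at rate lam[i] + mu[i], and the
    exit is absorbing with probability mu[i] / (lam[i] + mu[i])."""
    t, i = 0.0, 0
    while True:
        rate = (lam[i] if i < len(lam) else 0.0) + mu[i]
        t += rng.exponential(1.0 / rate)
        if i >= len(lam) or rng.random() < mu[i] / rate:
            return t
        i += 1

def simulate_duration(tree: DecisionTreeClassifier, patient, rng):
    # `tree` is assumed already fitted on (features, risk_group).
    group = int(tree.predict([patient])[0])
    lam, mu = COXIAN[group]
    return sample_coxian(lam, mu, rng)
```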
Abstract:
Clinical and pathological heterogeneity of breast cancer hinders the selection of appropriate treatment for individual cases. Molecular profiling at the gene or protein level may elucidate the biological variance of tumors and provide a new classification system that correlates better with biological, clinical and prognostic parameters. We studied the immunohistochemical profile of a panel of seven important biomarkers using tumor tissue arrays. The tumor samples were then classified with a monothetic (binary variables) clustering algorithm. Two distinct groups of tumors are characterized by estrogen receptor (ER) status and tumor grade (p = 0.0026). Four biomarkers, c-erbB2, Cox-2, p53 and VEGF, were significantly overexpressed in tumors with the ER-negative (ER-) phenotype. Eight subsets of tumors were further identified according to the expression status of VEGF, c-erbB2 and p53. The malignant potential of the ER-/VEGF+ subgroup was associated with the strong correlations of Cox-2 and c-erbB2 with VEGF. Our results indicate that this molecular classification system, based on statistical analysis of immunohistochemical profiling, is a useful approach for tumor grouping. Some of these subgroups have a relative genetic homogeneity that may allow further study of specific genetically controlled metabolic pathways. This approach may hold great promise in rationalizing the application of different therapeutic strategies to different subgroups of breast tumors.
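A minimal sketch of monothetic divisive clustering on binary marker data: each split uses a single variable, chosen by its overall association with the remaining markers, and three levels of splitting yield up to eight subsets (mirroring the VEGF/c-erbB2/p53 subsets). The association score used here is a common choice, assumed rather than taken from the paper:

```python
# Sketch: monothetic (single-variable) divisive clustering of binary
# marker profiles X (rows = tumors, columns = 0/1 marker status).
import numpy as np

def best_split_variable(X):
    # Score each variable by its summed 2x2-table association |ad - bc|
    # with every other variable (an assumed association measure).
    n_vars = X.shape[1]
    scores = np.zeros(n_vars)
    for j in range(n_vars):
        for k in range(n_vars):
            if j != k:
                a = np.sum((X[:, j] == 1) & (X[:, k] == 1))
                b = np.sum((X[:, j] == 1) & (X[:, k] == 0))
                c = np.sum((X[:, j] == 0) & (X[:, k] == 1))
                d = np.sum((X[:, j] == 0) & (X[:, k] == 0))
                scores[j] += abs(a * d - b * c)
    return int(np.argmax(scores))

def monothetic_clusters(X, idx=None, depth=3):
    """Recursively split on one marker at a time; depth 3 gives up to 8 subsets."""
    if idx is None:
        idx = np.arange(len(X))
    if depth == 0 or len(idx) < 2:
        return [idx]
    j = best_split_variable(X[idx])
    pos, neg = idx[X[idx, j] == 1], idx[X[idx, j] == 0]
    if len(pos) == 0 or len(neg) == 0:
        return [idx]
    return (monothetic_clusters(X, pos, depth - 1)
            + monothetic_clusters(X, neg, depth - 1))
```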
Abstract:
The management of water resources in Ireland prior to the Water Framework Directive (WFD) focussed on surface water and groundwater as separate entities. A critical element in the successful implementation of the WFD is to improve our understanding of the interaction between the two and of the flow mechanisms by which groundwater discharges to surface waters. An improved understanding of the contribution of groundwater to surface water is required for the classification of groundwater body status and for the determination of groundwater quality thresholds. The results of the study will also have wider application to many areas of the WFD.

A subcommittee of the WFD Groundwater Working Group (GWWG) has been formed to develop a methodology for estimating the groundwater contribution to Irish rivers. The group selected a number of analytical techniques to quantify components of stream flow in an Irish context (Master Recession Curve, Unit Hydrograph, Flood Studies Report methodologies and hydrogeological analytical modelling). The components of stream flow that can be identified include deep groundwater, intermediate flow and overland flow. These analyses have been tested on seven pilot catchments covering a variety of hydrogeological settings and have been used to inform and constrain a mathematical model. The mathematical model used was the NAM (Nedbør-Afstrømnings-Model) rainfall-runoff model, a module of DHI's MIKE 11 modelling suite. The results from these pilot catchments have been used to develop a decision model, based on catchment descriptors from GIS datasets, for the selection of NAM parameters. The datasets used include the mapping of aquifers, vulnerability and subsoils, soils, the Digital Terrain Model, CORINE and lakes. The national coverage of the GIS datasets has allowed the extrapolation of the mathematical model to regional catchments across Ireland.
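The study's named techniques (Master Recession Curve, Unit Hydrograph, Flood Studies Report methodologies) are not reproduced here; as a generic stand-in for separating stream-flow components, the sketch below applies the standard Lyne-Hollick recursive digital filter to a daily flow series, with the filter parameter assumed:

```python
# Sketch: split a stream-flow series q into quick flow and baseflow using
# the Lyne-Hollick recursive digital filter. This is a generic textbook
# technique, not one of the study's named methods; alpha=0.925 is a
# commonly used value, assumed here.
import numpy as np

def lyne_hollick_baseflow(q, alpha=0.925):
    quick = np.zeros_like(q, dtype=float)
    for t in range(1, len(q)):
        quick[t] = alpha * quick[t - 1] + 0.5 * (1 + alpha) * (q[t] - q[t - 1])
        quick[t] = min(max(quick[t], 0.0), q[t])   # constrain to [0, q[t]]
    return q - quick   # baseflow = total flow minus quick flow
```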
Abstract:
Tunnel construction planning requires careful consideration of spoil management, which involves environmental, economic and legal requirements. In this paper a methodological approach is proposed that considers the interaction between technical and geological factors in determining the features of the resulting muck. This gives indications of the required treatments as well as the laboratory and field characterisation tests to be performed to assess muck recovery alternatives. While such reuse is an opportunity for excavations in good-quality homogeneous grounds (e.g. a granitic mass), it is critical for complex formations. The approach has so far been validated for three different geo-materials from a tunnel excavation carried out with a large-diameter Earth Pressure Balance Shield (EPB) through a complex geological succession. The physical parameters and technological features of the three materials were assessed, according to their valorisation potential, to define re-utilisation patterns. The methodology proved effective, and the laboratory tests carried out on the three materials allowed the suitability and treatment effectiveness of each muck recovery strategy to be defined.
Abstract:
In this paper a multiple-classifier machine learning methodology for Predictive Maintenance (PdM) is presented. PdM is a prominent strategy for dealing with maintenance issues, given the increasing need to minimize downtime and its associated costs. One challenge in PdM is generating so-called 'health factors', quantitative indicators of the status of a system with respect to a given maintenance issue, and determining their relationship to operating costs and failure risk. The proposed PdM methodology allows dynamic decision rules to be adopted for maintenance management and can be used with high-dimensional and censored data. This is achieved by training multiple classification modules with different prediction horizons, providing different trade-offs between the frequency of unexpected breaks and unexploited lifetime, and then employing this information in an operating-cost-based maintenance decision system to minimise expected costs. The effectiveness of the methodology is demonstrated on a simulated example and on a benchmark semiconductor manufacturing maintenance problem.
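A minimal sketch of the multiple-horizon idea: one classifier per prediction horizon, with maintenance triggered when the expected cost of an unexpected break outweighs the cost of a planned intervention. The horizons, costs and base learner are illustrative assumptions:

```python
# Sketch: multiple classifiers, one per prediction horizon, feeding a
# cost-based maintenance decision. Horizons, costs and the base learner
# are assumed values, not the paper's.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

HORIZONS = [5, 10, 20]           # "fails within h cycles" horizons (assumed)
C_BREAK, C_MAINT = 100.0, 10.0   # unexpected break vs. planned maintenance cost

def train_horizon_models(X, cycles_to_failure):
    models = {}
    for h in HORIZONS:
        y = (cycles_to_failure <= h).astype(int)   # one label set per horizon
        models[h] = RandomForestClassifier(n_estimators=100).fit(X, y)
    return models

def decide(models, x):
    # Compare the expected cost of running on at the shortest horizon with
    # the cost of maintaining now (assumes both classes seen in training).
    p_fail = models[min(HORIZONS)].predict_proba([x])[0, 1]
    return 'maintain' if p_fail * C_BREAK > C_MAINT else 'run on'
```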