920 results for kernel estimators
Abstract:
Support Vector Machines (SVMs) are hyperplane classifiers defined in a kernel-induced feature space. The data-size-dependent training time complexity of SVMs usually prohibits their use in applications involving more than a few thousand data points. In this paper we propose a novel kernel-based incremental data clustering approach and its use for scaling non-linear Support Vector Machines to handle large data sets. The clustering method introduced can find cluster abstractions of the training data in a kernel-induced feature space. These cluster abstractions are then used for selective-sampling-based training of Support Vector Machines to reduce the training time without compromising generalization performance. Experiments with real-world datasets show that this approach gives good generalization performance at reasonable computational expense.
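As an illustration of the selective-sampling idea only, a minimal Python sketch follows; the Nystroem feature map, k-means clustering, cluster count and "far from centroid" sampling rule are stand-ins chosen for illustration, not the paper's kernel-based incremental clustering.

# Sketch of selective sampling for SVM training: cluster the data in an
# (approximate) kernel feature space, keep a few points per cluster, and
# train the SVM on that reduced set. All components here are illustrative
# stand-ins for the paper's own clustering and selection scheme.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import SVC

def selective_sample_train(X, y, n_clusters=50, per_cluster=5):
    phi = Nystroem(kernel="rbf", n_components=200).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(phi)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # keep points far from the cluster centroid: these tend to lie
        # near class boundaries and are more likely support vectors
        d = np.linalg.norm(phi[idx] - phi[idx].mean(axis=0), axis=1)
        keep.extend(idx[np.argsort(d)[-per_cluster:]])
    return SVC(kernel="rbf").fit(X[keep], y[keep])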
Abstract:
This study examines the properties of Generalised Regression (GREG) estimators for domain class frequencies and proportions. The family of GREG estimators forms the class of design-based, model-assisted estimators. All GREG estimators utilise auxiliary information via modelling. The classic GREG estimator with a linear fixed-effects assisting model (GREG-lin) is one example. But when estimating class frequencies, the study variable is binary or polytomous, so logistic-type assisting models (e.g. logistic or probit models) should be preferred over the linear one. However, GREG estimators other than GREG-lin are rarely used, and knowledge about their properties is limited. This study examines the properties of L-GREG estimators, which are GREG estimators with fixed-effects logistic-type models. Three research questions are addressed. First, I study whether and when L-GREG estimators are more accurate than GREG-lin. Theoretical results and Monte Carlo experiments, which cover both equal and unequal probability sampling designs and a wide variety of model formulations, show that in standard situations the difference between L-GREG and GREG-lin is small. But in the case of a strong assisting model, two interesting situations arise: if the domain sample size is reasonably large, L-GREG is more accurate than GREG-lin, and if the domain sample size is very small, estimation of the assisting model parameters may be inaccurate, resulting in bias for L-GREG. Second, I study variance estimation for the L-GREG estimators. The standard variance estimator (S) for all GREG estimators resembles the Sen-Yates-Grundy variance estimator, but it is a double sum of prediction errors, not of the observed values of the study variable. Monte Carlo experiments show that S underestimates the variance of L-GREG especially if the domain sample size is small or the assisting model is strong. Third, since the standard variance estimator S often fails for the L-GREG estimators, I propose a new augmented variance estimator (A). The difference between S and the new estimator A is that the latter takes into account the difference between the sample-fit model and the census-fit model. In Monte Carlo experiments, the new estimator A outperformed the standard estimator S in terms of bias, root mean square error and coverage rate. The new estimator thus provides a good alternative to the standard one.
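For reference, the standard textbook form of a GREG estimator of a total and the Sen-Yates-Grundy-type variance estimator built from prediction errors (generic notation for a fixed-size design; not the thesis's exact formulation) can be written as

\[
\hat{t}_{y,\mathrm{GREG}} \;=\; \sum_{k \in U} \hat{y}_k \;+\; \sum_{k \in s} \frac{y_k - \hat{y}_k}{\pi_k},
\qquad e_k = y_k - \hat{y}_k,
\]
\[
\hat{V}_{\mathrm{SYG}}\big(\hat{t}_{y,\mathrm{GREG}}\big) \;=\;
-\tfrac{1}{2} \sum_{k \in s} \sum_{l \in s}
\frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}}
\left( \frac{e_k}{\pi_k} - \frac{e_l}{\pi_l} \right)^{2},
\]

where U is the population, s the sample, \pi_k and \pi_{kl} the first- and second-order inclusion probabilities, and \hat{y}_k the fitted values of the assisting model (linear for GREG-lin, logistic-type for L-GREG).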
Abstract:
Automatic identification of software faults has enormous practical significance. It requires characterizing program execution behavior and applying appropriate data mining techniques to the chosen representation. In this paper, we use the sequence of system calls to characterize program execution. The data mining tasks addressed are learning to map system call streams to fault labels and automatic identification of fault causes. Spectrum kernels and SVMs are used for the former, while latent semantic analysis is used for the latter. The techniques are demonstrated on the intrusion dataset containing system call traces. The results show that the kernel techniques are as accurate as the best available results but faster by orders of magnitude. We also show that latent semantic indexing is capable of revealing fault-specific features.
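A minimal sketch of the spectrum kernel idea on system call sequences follows; the value of k and the toy traces are arbitrary choices for illustration, not the paper's setup.

# Spectrum kernel: represent each system-call sequence by its counts of
# contiguous k-grams and take the dot product of the two count vectors.
from collections import Counter

def spectrum_kernel(seq_a, seq_b, k=3):
    grams_a = Counter(tuple(seq_a[i:i + k]) for i in range(len(seq_a) - k + 1))
    grams_b = Counter(tuple(seq_b[i:i + k]) for i in range(len(seq_b) - k + 1))
    # sum over k-grams common to both sequences
    return sum(grams_a[g] * grams_b[g] for g in grams_a if g in grams_b)

# toy usage with symbolic system call names
trace_1 = ["open", "read", "read", "write", "close"]
trace_2 = ["open", "read", "write", "close"]
print(spectrum_kernel(trace_1, trace_2, k=2))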
Abstract:
Microorganisms exist predominantly as sessile multispecies communities in natural habitats. Most bacterial species can form these matrix-enclosed microbial communities, called biofilms. Biofilms occur in a wide range of environments, on every surface with sufficient moisture and nutrients, including surfaces in industrial settings and engineered water systems. This unwanted biofilm formation on equipment surfaces is called biofouling. Biofouling can significantly decrease equipment performance and lifetime and cause contamination and impaired quality of the industrial product. In this thesis we studied bacterial adherence to abiotic surfaces using coupons of stainless steel, coated or not coated with fluoropolymer or diamond-like carbon (DLC). As model organisms we used bacterial isolates from paper machines (Meiothermus silvanus, Pseudoxanthomonas taiwanensis and Deinococcus geothermalis) as well as a well-characterised species isolated from medical implants (Staphylococcus epidermidis). We found that coating the steel surface with these materials reduced its tendency towards biofouling: fluoropolymer and DLC coatings repelled all four biofilm formers on steel. We found great differences between bacterial species in the surfaces they preferred to adhere to, as well as in ultrastructural details such as the number and thickness of the adhesion organelles they expressed; these details responded differently to the different surfaces the bacteria adhered to. We further found that biofilms of D. geothermalis formed on titanium dioxide coated coupons of glass, steel and titanium were effectively removed by photocatalytic action in response to irradiation at 360 nm. However, on non-coated glass or steel surfaces irradiation had no detectable effect on the amount of bacterial biomass. We showed that the adhesion organelles of bacteria on illuminated TiO2-coated coupons were completely destroyed, whereas on non-coated coupons they looked intact when observed under the microscope. Stainless steel is the most widely used material for industrial process equipment and surfaces. The results in this thesis showed that stainless steel is prone to biofouling by phylogenetically distant bacterial species and that coating the steel may offer a tool for reducing biofouling of industrial equipment. Photocatalysis, on the other hand, is a potential technique for removing biofilms from surfaces in locations where a high level of hygiene is required. Our study of natural biofilms on barley kernel surfaces showed that there, too, the microbes possessed adhesion organelles visible with the electron microscope both before and after steeping. After steeping in water, which is the first step in malting, the microbial community of dry barley kernels turned into a dense biofilm covered with slimy extracellular polymeric substance (EPS). We also presented evidence showing that certain strains of Lactobacillus plantarum and Wickerhamomyces anomalus, when used as starter cultures in the steeping water, could enter the barley kernel and colonise its tissues. Use of a starter culture made it possible to reduce the extensive production of EPS, which resulted in a faster filtration of the mash.
Abstract:
A key trait of Free and Open Source Software (FOSS) development is its distributed nature. Nevertheless, two project-level operations, the fork and the merge of program code, are among the least well understood events in the lifespan of a FOSS project. Some projects have explicitly adopted these operations as the primary means of concurrent development. In this study, we examine the effect of highly distributed software development, as found in the Linux kernel project, on the collection and modelling of software development data. We find that distributed development calls for sophisticated temporal modelling techniques in which several versions of the source code tree can exist at once. Attention must be turned towards the methods of quality assurance and peer review that projects employ to manage these parallel source trees. Our analysis indicates that two new metrics, fork rate and merge rate, could be useful for determining the role of distributed version control systems in FOSS projects. The study presents a preliminary data set consisting of version control and mailing list data.
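As a rough, purely illustrative take on such a metric (the abstract does not fix a definition; the git invocation, time window and per-day normalisation below are assumptions), a merge rate could be approximated by counting merge commits per unit time:

# Toy estimate of a "merge rate": merge commits per day over a chosen window.
# This is an illustrative definition, not the metric proposed in the study.
import subprocess

def merge_rate(repo_path, since="2023-01-01", until="2024-01-01", window_days=365):
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--merges", "--oneline",
         f"--since={since}", f"--until={until}"],
        capture_output=True, text=True, check=True).stdout
    n_merges = len([line for line in out.splitlines() if line.strip()])
    return n_merges / window_days  # window_days should match the since/until span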
Abstract:
According to certain arguments, computation is observer-relative either in the sense that many physical systems implement many computations (Hilary Putnam), or in the sense that almost all physical systems implement all computations (John Searle). If sound, these arguments have a potentially devastating consequence for the computational theory of mind: if arbitrary physical systems can be seen to implement arbitrary computations, the notion of computation seems to lose all explanatory power as far as brains and minds are concerned. David Chalmers and B. Jack Copeland have attempted to counter these relativist arguments by placing certain constraints on the definition of implementation. In this thesis, I examine their proposals and find both wanting in some respects. During the course of this examination, I give a formal definition of the class of combinatorial-state automata, upon which Chalmers's account of implementation is based. I show that this definition implies two theorems (one an observation due to Curtis Brown) concerning the computational power of combinatorial-state automata, theorems which speak against founding the theory of implementation upon this formalism. Toward the end of the thesis, I sketch a definition of the implementation of Turing machines in dynamical systems, and offer this as an alternative to Chalmers's and Copeland's accounts of implementation. I demonstrate that the definition does not imply Searle's claim for the universal implementation of computations. However, the definition may support claims that are weaker than Searle's, yet still troubling to the computationalist. There remains a kernel of relativity in implementation at any rate, since the interpretation of physical systems seems itself to be an observer-relative matter, to some degree at least. This observation helps clarify the role the notion of computation can play in cognitive science. Specifically, I argue that the notion should be conceived as an instrumental rather than a fundamental or foundational one.
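For orientation, a toy rendering of a combinatorial-state automaton in the spirit of Chalmers's formalism (a simplification for illustration, not the thesis's formal definition):

# Toy combinatorial-state automaton: state and input are vectors of
# substates; each component of the next state is determined by the whole
# current state vector plus the input vector. Simplified illustration only.
class CSA:
    def __init__(self, transition, output):
        self.transition = transition  # (state_vec, input_vec) -> state_vec
        self.output = output          # state_vec -> output_vec

    def run(self, state, inputs):
        for inp in inputs:
            state = self.transition(state, inp)
            yield self.output(state)

# a trivial two-component example: componentwise XOR with the input vector
toy = CSA(transition=lambda s, i: tuple(a ^ b for a, b in zip(s, i)),
          output=lambda s: s)
print(list(toy.run((0, 1), [(1, 1), (0, 1)])))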
Abstract:
Statistical learning algorithms provide a viable framework for geotechnical engineering modeling. This paper describes two statistical learning algorithms applied to site characterization modeling based on standard penetration test (SPT) data. More than 2700 field SPT values (N) have been collected from 766 boreholes spread over an area of 220 sq. km in Bangalore. The N values have been corrected (N_c) for different parameters such as overburden stress, size of borehole, type of sampler, length of connecting rod, etc. In the three-dimensional site characterization model, the function N_c = N_c(X, Y, Z), where X, Y and Z are the coordinates of a point corresponding to an N_c value, is to be approximated, so that the N_c value at any half-space point in Bangalore can be determined. The first algorithm uses the least-squares support vector machine (LSSVM), which is related to a ridge-regression type of support vector machine. The second algorithm uses the relevance vector machine (RVM), which combines the strengths of kernel-based methods and Bayesian theory to establish relationships between a set of input vectors and a desired output. The paper also presents a comparative study between the developed LSSVM and RVM models for site characterization. Copyright (C) 2009 John Wiley & Sons, Ltd.
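A minimal sketch of this kind of kernel-based spatial regression, using kernel ridge regression as a readily available stand-in for LSSVM (the synthetic data, RBF kernel and hyperparameters are assumptions, not the paper's fitted model):

# Fit a smooth function N_c = f(X, Y, Z) from scattered borehole data with
# an RBF kernel ridge regressor (closely related to LSSVM), then predict
# N_c at an arbitrary half-space point.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(500, 3))          # synthetic (X, Y, Z)
n_c = 20 + 10 * coords[:, 2] + rng.normal(0, 2, 500)    # synthetic corrected N values

model = KernelRidge(kernel="rbf", alpha=1.0, gamma=5.0).fit(coords, n_c)
print(model.predict(np.array([[0.5, 0.5, 0.3]])))       # N_c estimate at a query point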
First simultaneous measurement of the top quark mass in the lepton+jets and dilepton channels at CDF
Abstract:
We present a measurement of the mass of the top quark using data corresponding to an integrated luminosity of 1.9 fb^-1 of ppbar collisions collected at sqrt(s) = 1.96 TeV with the CDF II detector at Fermilab's Tevatron. This is the first measurement of the top quark mass using top-antitop pair candidate events in the lepton + jets and dilepton decay channels simultaneously. We reconstruct two observables in each channel and use a non-parametric kernel density estimation technique to derive two-dimensional probability density functions from simulated signal and background samples. The observables are the top quark mass and the invariant mass of two jets from the W decay in the lepton + jets channel, and the top quark mass and the scalar sum of transverse energy of the event in the dilepton channel. We perform a simultaneous fit for the top quark mass and the jet energy scale, which is constrained in situ by the hadronic W boson mass. Using 332 lepton + jets candidate events and 144 dilepton candidate events, we measure the top quark mass to be m_top = 171.9 +/- 1.7 (stat. + JES) +/- 1.1 (syst.) GeV/c^2 = 171.9 +/- 2.0 GeV/c^2.
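To illustrate only the generic technique of two-dimensional kernel density estimation (the simulated observables below are placeholders, not CDF data or CDF's implementation):

# Build a 2D probability density from simulated (m_top, m_jj)-like pairs
# with a Gaussian kernel density estimate, then evaluate it on a grid.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
m_top = rng.normal(172.0, 3.0, 2000)   # placeholder "reconstructed top mass"
m_jj = rng.normal(80.4, 8.0, 2000)     # placeholder "dijet (W) mass"

kde = gaussian_kde(np.vstack([m_top, m_jj]))
xx, yy = np.mgrid[160:185:100j, 60:100:100j]
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)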
Abstract:
Core Vector Machine (CVM) is suitable for efficient large-scale pattern classification. In this paper, a method is proposed for improving the performance of CVM with the Gaussian kernel function, irrespective of the ordering of patterns belonging to different classes within the data set. The method employs selective-sampling-based training of CVM using a novel kernel-based scalable hierarchical clustering algorithm. Empirical studies on synthetic and real-world data sets show that the proposed strategy performs well on large data sets.
Abstract:
This paper discusses a method for scaling SVMs with the Gaussian kernel function to handle large data sets by using a selective sampling strategy for the training set. It employs a scalable hierarchical clustering algorithm to construct cluster indexing structures of the training data in the kernel-induced feature space. These structures are then used for selective sampling of the training data, imparting scalability to the SVM training process. Empirical studies on real-world data sets show that the proposed strategy performs well on large data sets.
Abstract:
The near flow field of small aspect ratio elliptic turbulent free jets (issuing from a nozzle and an orifice) was studied experimentally using 2D PIV. Two-point velocity correlations in these jets revealed the extent and orientation of the large-scale structures in the major and minor planes. Spatial filtering of the instantaneous velocity field using a Gaussian convolution kernel shows that, while a single large vortex ring circumscribing the jet seems to be present at the exit of the nozzle, the orifice jet exhibited a number of smaller vortex ring pairs close to the jet exit. The smaller length scale observed in the case of the orifice jet is representative of the smaller azimuthal vortex rings that generate an axial vortex field as they are convected. This results in axis switching in the case of the orifice jet and may involve a mechanism different from the self-induction process observed in the case of the contoured nozzle jet flow.
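A minimal sketch of Gaussian-kernel spatial filtering of a 2D velocity field (array sizes and filter width are illustrative assumptions):

# Low-pass filter each velocity component of a 2D PIV field with a Gaussian
# convolution kernel to bring out the large-scale structures.
import numpy as np
from scipy.ndimage import gaussian_filter

u = np.random.rand(128, 128)   # placeholder streamwise velocity field
v = np.random.rand(128, 128)   # placeholder cross-stream velocity field

sigma = 3.0                    # kernel width in grid points (assumed)
u_large = gaussian_filter(u, sigma=sigma)
v_large = gaussian_filter(v, sigma=sigma)
u_small = u - u_large          # residual small-scale fluctuations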
Abstract:
To enhance the utilization of the wood, sawmills are forced to place more emphasis on planning in order to master the whole production chain from the forest to the end product. One significant obstacle to integrating the forest-sawmill-market production chain is the lack of appropriate information about forest stands. Since the wood procurement point of view has been almost totally disregarded in forest planning systems, there has been a great need to develop an easy and efficient pre-harvest measurement method allowing separate measurement of stands prior to harvesting. The main purpose of this study was to develop a measurement method for pine stands which forest managers could use in describing the properties of the standing trees for sawing production planning. Study materials were collected from ten Scots pine (Pinus sylvestris) stands located in North Häme and South Pohjanmaa, in southern Finland. The data comprise test sawing data on 314 pine stems, dbh and height measures of all trees, measures of the quality parameters of pine sawlog stems in all ten study stands, and the locations of all trees in six stands. The study was divided into four sub-studies, which deal with pine quality prediction, construction of diameter and dead branch height distributions, sampling designs, and applying height and crown height models. The final proposal for the pre-harvest measurement method is a synthesis of the individual sub-studies. Quality analysis resulted in choosing dbh, the distance from stump height to the first dead branch (dead branch height), crown height and tree height as the most appropriate quality characteristics of Scots pine. Dbh and dead branch height are measured from each pine sample tree, while height and crown height are derived from the dbh measures with the aid of mixed height and crown height models. Pine and spruce diameter distributions as well as the dead branch height distribution are most effectively predicted by the kernel function. Roughly 25 sample trees seem to be appropriate in pure pine stands. In mixed stands the number of sample trees needs to be increased in proportion to the share of pines in order to attain the same level of accuracy.
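As a generic illustration of kernel-based smoothing of a dbh distribution from sample-tree measurements (the Gaussian kernel, bandwidth and toy values are assumptions, not the study's fitted kernel function):

# Smooth a dbh distribution from a handful of sample trees with a Gaussian kernel.
import numpy as np

def kernel_density(x_grid, samples, bandwidth):
    # average of Gaussian bumps centred on the sampled dbh values
    diffs = (x_grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (
        len(samples) * bandwidth * np.sqrt(2 * np.pi))

dbh = np.array([18.5, 21.0, 24.3, 26.1, 27.8, 30.2, 33.5])  # toy sample (cm)
grid = np.linspace(10, 45, 200)
density = kernel_density(grid, dbh, bandwidth=2.5)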
Abstract:
Fujikawa's method of evaluating the supercurrent and the superconformal current anomalies, using the heat-kernel regularization scheme, is extended to theories with gauge invariance, in particular, to the off-shell N=1 supersymmetric Yang-Mills (SSYM) theory. The Jacobians of supersymmetry and superconformal transformations are finite. Although the gauge-fixing term is not supersymmetric and the regularization scheme is not manifestly supersymmetric, we find that the regularized Jacobians are gauge invariant and finite and they can be expressed in such a way that there is no one-loop supercurrent anomaly for the N=1 SSYM theory. The superconformal anomaly is nonzero and the anomaly agrees with a similar result obtained using other methods.
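For orientation, the general structure of a Fujikawa heat-kernel-regularized Jacobian, recalled here in its familiar chiral form (a schematic reminder only; the supersymmetric and superconformal Jacobians evaluated in the paper are analogous but more involved):

\[
\psi \to e^{\,i\alpha(x)\gamma_5}\,\psi
\quad\Longrightarrow\quad
J = \exp\!\left(-2i\int d^4x\,\alpha(x)\,\mathcal{A}(x)\right),
\]
\[
\mathcal{A}(x) \;=\; \lim_{M\to\infty} \operatorname{tr}\Big[\gamma_5\, e^{-(\gamma^\mu D_\mu)^2/M^2}\Big](x,x),
\]

where the heat kernel e^{-(\gamma^\mu D_\mu)^2/M^2} regularizes the otherwise ill-defined trace over the fermionic measure.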
Abstract:
Fujikawa's method of evaluating the anomalies is extended to the on-shell supersymmetric (SUSY) theories. The supercurrent and the superconformal current anomalies are evaluated for the Wess-Zumino model using the background-field formulation and heat-kernel regularization. We find that the regularized Jacobians for SUSY and superconformal transformations are finite. The results can be expressed in a form such that there is no supercurrent anomaly but a finite nonzero superconformal anomaly, in agreement with similar results obtained using other methods.