85 results for Binary Classification
Abstract:
The substitution of missing values, also called imputation, is an important data preparation task for many domains. Ideally, the substitution of missing values should not insert biases into the dataset. This aspect has been usually assessed by some measures of the prediction capability of imputation methods. Such measures assume the simulation of missing entries for some attributes whose values are actually known. These artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of imputed values in the ultimate modelling task (e.g. in classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task. Thus, alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used such a procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) in three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The achieved results illustrate a variety of situations that can take place in the data preparation practice.
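The contrast drawn in this abstract, between how well imputed values match the originals and how much they change the downstream classifier, can be sketched in a few lines. This is only an illustrative sketch, not the authors' procedure: the function names, the toy data, the 1-nearest-neighbour classifier with Hamming distance, and the "accuracy shift" used as a crude proxy for inserted bias are all assumptions made for this example.

```python
from collections import Counter

def majority_impute(rows, col):
    """Replace None in column `col` with the most frequent observed value."""
    observed = [r[col] for r in rows if r[col] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [r[:col] + (mode,) + r[col + 1:] if r[col] is None else r for r in rows]

def predict_1nn(train_x, train_y, x):
    """1-nearest-neighbour prediction under Hamming distance (ties: first match)."""
    dists = [sum(a != b for a, b in zip(row, x)) for row in train_x]
    return train_y[dists.index(min(dists))]

def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)

def imputation_report(train_x, train_y, test_x, test_y, col, masked_idx):
    """Hide known values, impute them, and compare both kinds of evaluation."""
    masked_set = set(masked_idx)
    masked = [r[:col] + (None,) + r[col + 1:] if i in masked_set else r
              for i, r in enumerate(train_x)]
    imputed = majority_impute(masked, col)
    # Prediction capability: how often the imputed value equals the hidden one.
    match = sum(imputed[i][col] == train_x[i][col] for i in masked_set) / len(masked_set)
    # Influence on the modelling task: accuracy with original vs. imputed data.
    acc_orig = accuracy(train_x, train_y, test_x, test_y)
    acc_imp = accuracy(imputed, train_y, test_x, test_y)
    return {"match_rate": match, "acc_original": acc_orig,
            "acc_imputed": acc_imp, "accuracy_shift": acc_imp - acc_orig}

# Hypothetical toy data: rows are tuples of categorical attribute values.
train_x = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
train_y = [0, 0, 1, 1]
test_x = [("a", "x"), ("b", "y")]
test_y = [0, 1]
report = imputation_report(train_x, train_y, test_x, test_y, col=0, masked_idx=[2])
```

On this toy data the single imputed value is wrong (match rate 0.0), yet the classifier's accuracy is unchanged, which is the kind of situation where the two evaluations disagree.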
Abstract:
Several popular Machine Learning techniques are originally designed for the solution of two-class problems. However, several classification problems have more than two classes. One approach to deal with multiclass problems using binary classifiers is to decompose the multiclass problem into multiple binary sub-problems disposed in a binary tree. This approach requires a binary partition of the classes for each node of the tree, which defines the tree structure. This paper presents two algorithms to determine the tree structure taking into account information collected from the used dataset. This approach allows the tree structure to be determined automatically for any multiclass dataset.
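A minimal sketch of the binary-tree decomposition described in this abstract follows. The abstract does not say how the two proposed algorithms partition the classes, so the rule used here (seed each side with the two classes whose centroids are farthest apart, then assign every other class to the nearer seed) is an assumption, as are the nearest-mean binary classifier at each node and all function names.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_tree(class_points):
    """class_points maps a class label to its training points.
    Returns a label (leaf) or a dict describing a binary node."""
    labels = sorted(class_points)
    if len(labels) == 1:
        return labels[0]
    cents = {c: centroid(class_points[c]) for c in labels}
    # Assumed partition rule: the two farthest class centroids seed the two sides.
    pairs = [(p, q) for i, p in enumerate(labels) for q in labels[i + 1:]]
    a, b = max(pairs, key=lambda pq: dist2(cents[pq[0]], cents[pq[1]]))
    left, right = [a], [b]
    for c in labels:
        if c not in (a, b):
            closer_to_a = dist2(cents[c], cents[a]) <= dist2(cents[c], cents[b])
            (left if closer_to_a else right).append(c)
    pool = lambda side: [p for c in side for p in class_points[c]]
    return {
        # Stand-in binary classifier: compare distances to the two pooled means.
        "left_mean": centroid(pool(left)),
        "right_mean": centroid(pool(right)),
        "left": build_tree({c: class_points[c] for c in left}),
        "right": build_tree({c: class_points[c] for c in right}),
    }

def predict(tree, x):
    """Descend the tree, taking one binary decision per node."""
    while isinstance(tree, dict):
        go_left = dist2(x, tree["left_mean"]) <= dist2(x, tree["right_mean"])
        tree = tree["left"] if go_left else tree["right"]
    return tree

# Hypothetical three-class toy data.
data = {0: [(0, 0), (0, 1)], 1: [(1, 0), (1, 1)], 2: [(10, 10), (10, 11)]}
tree = build_tree(data)
```

Any binary learner could replace the nearest-mean rule at each node; only the class partition determines the shape of the tree.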
Abstract:
Credit scoring modelling comprises one of the leading formal tools for supporting the granting of credit. Its core objective consists of the generation of a score by means of which potential clients can be listed in the order of the probability of default. A critical factor is whether a credit scoring model is accurate enough to correctly classify a client as a good or bad payer. In this context the concept of bootstrap aggregating (bagging) arises. The basic idea is to generate multiple classifiers by fitting models to several replicated datasets, obtaining their predicted values, and then combining them into a single predictive classification in order to improve the classification accuracy. In this paper we propose a new bagging-type variant procedure, which we call poly-bagging, consisting of combining predictors over a succession of resamplings. The study is motivated by credit scoring modelling. The proposed poly-bagging procedure was applied to several artificial datasets and to a real credit-granting dataset, with up to three successions of resamplings. We observed better classification accuracy for the two-bagged and the three-bagged models for all considered setups. These results strongly indicate that the poly-bagging approach may improve modelling performance measures, while keeping a flexible and straightforward bagging-type structure that is easy to implement. (C) 2011 Elsevier Ltd. All rights reserved.
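The bagging idea in this abstract, and one plausible reading of "a succession of resamplings", can be sketched as follows. This is not the authors' exact procedure: treating poly-bagging as bagging applied repeatedly to an already bagged learner is an assumption, and the 1-D threshold base learner, the function names, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def fit_stump(data):
    """Base learner for (x, y) pairs with y in {0, 1}: threshold halfway
    between the two class means of a single numeric feature."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    if not xs0 or not xs1:  # a resample may contain only one class
        only = 1 if xs1 else 0
        return lambda x: only
    m0, m1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    thr = (m0 + m1) / 2
    return lambda x: int((x > thr) == (m1 > m0))

def bagged(fit, n_models, rng):
    """Wrap `fit` so that it trains n_models copies on bootstrap resamples
    and predicts by majority vote."""
    def fit_bagged(data):
        models = [fit([rng.choice(data) for _ in data]) for _ in range(n_models)]
        def predict(x):
            return Counter(m(x) for m in models).most_common(1)[0][0]
        return predict
    return fit_bagged

def poly_bagged(fit, n_models, levels, rng):
    """Assumed reading of poly-bagging: apply the bagging wrapper `levels`
    times, so levels=2 bags a bagged learner ("two-bagged")."""
    for _ in range(levels):
        fit = bagged(fit, n_models, rng)
    return fit

# Hypothetical, well-separated toy data: class 0 below 50, class 1 from 50 up.
data = [(float(x), 0) for x in range(50)] + [(float(x), 1) for x in range(50, 100)]
two_bagged = poly_bagged(fit_stump, n_models=5, levels=2, rng=random.Random(0))(data)
```

An odd `n_models` avoids tied votes in the two-class case; the cost grows as `n_models ** levels` base fits, which is worth keeping in mind before going beyond two or three levels.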
Abstract:
In this paper, we study binary differential equations $a(x, y)\,dy^2 + 2b(x, y)\,dx\,dy + c(x, y)\,dx^2 = 0$, where $a$, $b$, and $c$ are real analytic functions. Following the geometric approach of Bruce and Tari in their work on multiplicity of implicit differential equations, we introduce a definition of the index for this class of equations that coincides with Hopf's classical definition for positive binary differential equations. Our results also apply to implicit differential equations $F(x, y, p) = 0$, where $F$ is an analytic function, $p = dy/dx$, $F_p = 0$, and $F_{pp} \neq 0$ at the singular point. For these equations, we relate the index of the equation at the singular point with the index of the gradient of $F$ and the index of the 1-form $\omega = dy - p\,dx$ defined on the singular surface $F = 0$.
Abstract:
The use of liposomes to encapsulate materials has received widespread attention for drug delivery, transfection, diagnostic reagents, and immunoadjuvants. Phospholipid polymers form a new class of biomaterials with many potential applications in medicine and research. Of interest are polymeric phospholipids containing a diacetylene moiety along their acyl chain, since these kinds of lipids can be polymerized by ultraviolet (UV) irradiation to form chains of covalently linked lipids in the bilayer. In particular, the diacetylenic phosphatidylcholine 1,2-bis(10,12-tricosadiynoyl)-sn-glycero-3-phosphocholine (DC8,9PC) can form intermolecular cross-links through the diacetylenic group to produce a conjugated polymer within the hydrocarbon region of the bilayer. As knowledge of liposome structures is fundamental to improving system design for new and better applications, this work focuses on the structural properties of polymerized DC8,9PC:1,2-dimyristoyl-sn-glycero-3-phosphocholine (DMPC) liposomes. Liposomes containing mixtures of DC8,9PC and DMPC, at different molar ratios, and exposed to different polymerization cycles, were studied through the analysis of the electron spin resonance (ESR) spectra of a spin label incorporated into the bilayer, and the calorimetric data obtained from differential scanning calorimetry (DSC) studies. Upon irradiation, if all lipids had been polymerized, no gel-fluid transition would be expected. However, even samples that went through 20 cycles of UV irradiation presented a DSC band, showing that around 80% of the DC8,9PC molecules were not polymerized. Both DSC and ESR indicated that the two different lipids scarcely mix at low temperatures, although a few molecules of DMPC are present in DC8,9PC-rich domains and vice versa. UV irradiation was found to affect the gel-fluid transition of both DMPC-rich and DC8,9PC-rich regions, indicating the presence of polymeric units of DC8,9PC in both areas. A model explaining the lipid rearrangement is proposed for this partially polymerized system.
Abstract:
Extending our previous work 'Fields on the Poincaré group and quantum description of orientable objects' (Gitman and Shelepin 2009 Eur. Phys. J. C 61 111-39), we consider here a classification of orientable relativistic quantum objects in 3 + 1 dimensions. In such a classification, one uses a maximal set of ten commuting operators (generators of left and right transformations) in the space of functions on the Poincaré group. In addition to the usual six quantum numbers related to external symmetries (given by left generators), there appear additional quantum numbers related to internal symmetries (given by right generators). Spectra of internal and external symmetry operators are interrelated, which, however, does not contradict the Coleman-Mandula no-go theorem. We believe that the proposed approach can be useful for the description of elementary spinning particles considered as orientable objects. In particular, it gives a group-theoretical interpretation of some facts of the existing phenomenological classification of spinning particles.
Abstract:
In this paper, we present a study on a deterministic partially self-avoiding walk (tourist walk), which provides a novel method for texture feature extraction. The method is able to explore an image on all scales simultaneously. Experiments were conducted using different dynamics concerning the tourist walk. A new strategy, based on histograms, to extract information from its joint probability distribution is presented. The promising results are discussed and compared to the best-known methods for texture description reported in the literature. (C) 2009 Elsevier Ltd. All rights reserved.
Abstract:
Shape provides some of the most relevant information about an object. This makes shape one of the most important visual attributes used to characterize objects. This paper introduces a novel approach for shape characterization, which combines modeling shape as a complex network with the analysis of its complexity in a dynamic evolution context. Descriptors computed through this approach prove efficient in shape characterization, incorporating many characteristics, such as scale and rotation invariance. Experiments using two different shape databases (an artificial shapes database and a leaf shape database) are presented in order to evaluate the method, and its results are compared to traditional shape analysis methods found in the literature. (C) 2009 Published by Elsevier B.V.
Abstract:
Unlike theoretical scale-free networks, most real networks present multi-scale behavior, with nodes structured in different types of functional groups and communities. While the majority of approaches for classification of nodes in a complex network have relied on local measurements of the topology/connectivity around each node, valuable information about node functionality can be obtained by concentric (or hierarchical) measurements. This paper extends previous methodologies based on concentric measurements by studying the possibility of using agglomerative clustering methods to obtain a set of functional groups of nodes, considering the nodes of a particular institutional collaboration network that includes various known communities (departments of the University of São Paulo). Among the interesting findings, we emphasize the scale-free nature of the network obtained, as well as the identification of different patterns of authorship emerging from different areas (e.g. human and exact sciences). Another interesting result concerns the relatively uniform distribution of hubs along concentric levels, in contrast to the non-uniform pattern found in theoretical scale-free networks such as the BA model. (C) 2008 Elsevier B.V. All rights reserved.
Abstract:
The latest version of CATH (class, architecture, topology, homology) (version 3.2), released in July 2008 (http://www.cathdb.info), contains 114 215 domains, 2178 homologous superfamilies and 1110 fold groups. We have assigned 20 330 new domains, 87 new homologous superfamilies and 26 new folds since CATH release version 3.1. A total of 28 064 new domains have been assigned since our NAR 2007 database publication (CATH version 3.0). The CATH website has been completely redesigned and includes more comprehensive documentation. We have revisited the CATH architecture level as part of the development of a 'Protein Chart' and present information on the population of each architecture. The CATHEDRAL structure comparison algorithm has been improved and used to characterize structural diversity in CATH superfamilies and structural overlaps between superfamilies. Although the majority of superfamilies in CATH are not structurally diverse and do not overlap significantly with other superfamilies, ~4% of superfamilies are very diverse and these are the superfamilies that are most highly populated in both the PDB and in the genomes. Information on the degree of structural diversity in each superfamily and structural overlaps between superfamilies can now be downloaded from the CATH website.
Abstract:
Bothropasin is a 48 kDa hemorrhagic PIII snake venom metalloprotease (SVMP) isolated from Bothrops jararaca, containing disintegrin/cysteine-rich adhesive domains. Here we present the crystal structure of bothropasin complexed with the inhibitor POL647. The catalytic domain consists of a scaffold of two subdomains organized similarly to those described for other SVMPs, including the zinc and calcium-binding sites. The free cysteine residue Cys189 is located within a hydrophobic core and it is not available for disulfide bonding or other interactions. There is no identifiable secondary structure for the disintegrin domain, but instead it is composed mostly of loops stabilized by seven disulfide bonds and by two calcium ions. The ECD region is in a loop and is structurally related to the RGD region of RGD disintegrins, which are derived from PII SVMPs. The ECD motif is stabilized by the Cys117-Cys310 disulfide bond (between the disintegrin and cysteine-rich domains) and by one calcium ion. The side chain of Glu276 of the ECD motif is exposed to solvent and free to make interactions. In bothropasin, the HVR (hyper-variable region) described for other PIII SVMPs in the cysteine-rich domain presents a well-conserved sequence with respect to several other PIII members from different species. We propose that this subset be referred to as PIII-HCR (highly conserved region) SVMPs. The differences in the disintegrin-like, cysteine-rich or disintegrin-like cysteine-rich domains may be involved in selecting target binding, which in turn could generate substrate diversity or specificity for the catalytic domain. (C) 2008 Elsevier Ltd. All rights reserved.
Abstract:
In this paper we present a novel approach for multispectral image contextual classification by combining iterative combinatorial optimization algorithms. The pixel-wise decision rule is defined using a Bayesian approach to combine two MRF models: a Gaussian Markov Random Field (GMRF) for the observations (likelihood) and a Potts model for the a priori knowledge, to regularize the solution in the presence of noisy data. Hence, the classification problem is stated according to a Maximum a Posteriori (MAP) framework. In order to approximate the MAP solution we apply several combinatorial optimization methods using multiple simultaneous initializations, making the solution less sensitive to the initial conditions and reducing both computational cost and time in comparison to Simulated Annealing, which is often infeasible in real image processing applications. Markov Random Field model parameters are estimated by the Maximum Pseudo-Likelihood (MPL) approach, avoiding manual adjustments in the choice of the regularization parameters. Asymptotic evaluations assess the accuracy of the proposed parameter estimation procedure. To test and evaluate the proposed classification method, we adopt metrics for quantitative performance assessment (Cohen's Kappa coefficient), allowing a robust and accurate statistical analysis. The obtained results clearly show that combining sub-optimal contextual algorithms significantly improves the classification performance, indicating the effectiveness of the proposed methodology. (C) 2010 Elsevier B.V. All rights reserved.
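For readers unfamiliar with the performance metric named in this abstract, Cohen's Kappa compares the observed agreement p_o between predicted and reference labels with the agreement p_e expected by chance from the label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic implementation of that standard formula; the function name and the toy labels are assumptions, and nothing here reproduces the paper's experiments.

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Cohen's Kappa for two equal-length label sequences."""
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels coincide.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / n ** 2
    if p_e == 1.0:  # both sequences use one identical label: agreement is trivial
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pixel labels: reference map vs. classifier output.
reference = [0, 0, 1, 1]
predicted = [0, 0, 1, 0]
kappa = cohens_kappa(reference, predicted)
```

Here three of four labels agree (p_o = 0.75) while chance agreement is p_e = 0.5, so kappa is 0.5; plain accuracy would report 0.75 and hide how much of that is attributable to chance.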
Abstract:
The design of translation invariant and locally defined binary image operators over large windows is made difficult by decreased statistical precision and increased training time. We present a complete framework for the application of stacked design, a recently proposed technique to create two-stage operators that circumvents that difficulty. We propose a novel algorithm, based on Information Theory, to find groups of pixels that should be used together to predict the output value. We employ this algorithm to automate the process of creating a set of first-level operators that are later combined in a global operator. We also propose a principled way to guide this combination, by using feature selection and model comparison. Experimental results show that the proposed framework leads to better results than single-stage design. (C) 2009 Elsevier B.V. All rights reserved.
Abstract:
We explicitly construct a stationary coupling attaining Ornstein's $\bar{d}$-distance between ordered pairs of binary chains of infinite order. Our main tool is a representation of the transition probabilities of the coupled bivariate chain of infinite order as a countable mixture of Markov transition probabilities of increasing order. Under suitable conditions on the loss of memory of the chains, this representation implies that the coupled chain can be represented as a concatenation of i.i.d. sequences of bivariate finite random strings of symbols. The perfect simulation algorithm is based on the fact that we can identify the first regeneration point to the left of the origin almost surely.
Abstract:
We review several asymmetrical links for binary regression models and present a unified approach for two skew-probit links proposed in the literature. Moreover, under the skew-probit link, conditions for the existence of the ML estimators and of the posterior distribution under improper priors are established. The framework proposed here considers two sets of latent variables, which are helpful for implementing the Bayesian MCMC approach. A simulation study of model comparison criteria is conducted and two applications are presented. Using different Bayesian criteria we show that, for these data sets, the skew-probit links are better than alternative links proposed in the literature.