971 results for K-Nearest Neighbors
Abstract:
Background: Allergy is a form of hypersensitivity to normally innocuous substances, such as dust, pollen, foods or drugs. Allergens are small antigens that commonly provoke an IgE antibody response. There are two types of bioinformatics-based allergen prediction. The first approach follows FAO/WHO Codex Alimentarius guidelines and searches for sequence similarity. The second approach is based on identifying conserved allergenicity-related linear motifs. Both approaches assume that allergenicity is a linearly coded property. In the present study, we applied ACC pre-processing to sets of known allergens, developing alignment-independent models for allergen recognition based on the main chemical properties of amino acid sequences. Results: A set of 684 food, 1,156 inhalant and 555 toxin allergens was collected from several databases. A set of non-allergens from the same species was selected to mirror the allergen set. The amino acids in the protein sequences were described by three z-descriptors (z1, z2 and z3) and converted into uniform vectors by auto- and cross-covariance (ACC) transformation. Each protein was represented as a vector of 45 variables. Five machine learning methods for classification were applied in the study to derive models for allergen prediction. The methods were: discriminant analysis by partial least squares (DA-PLS), logistic regression (LR), decision tree (DT), naïve Bayes (NB) and k nearest neighbours (kNN). The best performing model was derived by kNN at k = 3. It was optimized, cross-validated and implemented in a server named AllerTOP, freely accessible at http://www.pharmfac.net/allertop. AllerTOP also predicts the most probable route of exposure. AllerTOP outperforms other servers for allergen prediction, with 94% sensitivity. Conclusions: AllerTOP is the first alignment-free server for in silico prediction of allergens based on the main physicochemical properties of proteins. 
Significantly, as well as allergenicity, AllerTOP is able to predict the route of allergen exposure: food, inhalant or toxin. © 2013 Dimitrov et al.; licensee BioMed Central Ltd.
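The ACC step and the k = 3 classifier described in this abstract can be illustrated with a minimal sketch. It is not the AllerTOP implementation. The maximum lag of 5 is an assumption, chosen because 3 descriptors × 3 descriptors × 5 lags gives the 45 variables mentioned above, and the descriptor values in `TOY_Z` are placeholders, not published z-scale values. The names `acc_transform`, `knn_predict` and `TOY_Z` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def acc_transform(sequence: str,
                  descriptors: Dict[str, Sequence[float]],
                  max_lag: int = 5) -> List[float]:
    """Auto- and cross-covariance of per-residue descriptors.

    For every lag and every ordered descriptor pair (j, k) this averages
    z_j(i) * z_k(i + lag) over the sequence, so proteins of any length
    map to vectors of the same size.
    """
    rows = [descriptors[aa] for aa in sequence]
    n = len(rows)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_desc = len(rows[0])
    out: List[float] = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):
            for k in range(n_desc):
                total = sum(rows[i][j] * rows[i + lag][k]
                            for i in range(n - lag))
                out.append(total / (n - lag))
    return out


def knn_predict(query: Sequence[float],
                training: List[Tuple[Sequence[float], str]],
                k: int = 3) -> str:
    """Majority label among the k nearest training vectors (Euclidean)."""
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Placeholder descriptor values for three residues (NOT real z-scales).
TOY_Z = {"A": (0.1, -1.0, 0.3), "C": (0.8, 0.2, -0.5), "D": (-1.2, 0.4, 0.9)}
vector = acc_transform("ACDACDAC", TOY_Z)  # 45 values with the default lag
```

Because the output length depends only on the number of descriptors and the maximum lag, sequences of different lengths can be compared directly by the distance-based classifier.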
Abstract:
This thesis studies survival analysis techniques dealing with censoring to produce predictive tools that predict the risk of endovascular aortic aneurysm repair (EVAR) re-intervention. Censoring indicates that some patients do not continue follow up, so their outcome class is unknown. Methods dealing with censoring have drawbacks and cannot handle the high censoring of the two EVAR datasets collected. Therefore, this thesis presents a new solution to high censoring by modifying an approach that was incapable of differentiating between risk groups of aortic complications. Feature selection (FS) becomes complicated with censoring. Most survival FS methods depend on Cox's model; however, machine learning classifiers (MLC) are preferred. Few methods have adopted MLC to perform survival FS, and those cannot be used with high censoring. This thesis proposes two FS methods which use MLC to evaluate features. The two FS methods use the new solution to deal with censoring. They combine factor analysis with a greedy stepwise FS search which allows eliminated features to enter the FS process. The first FS method searches for the best neural network configuration and subset of features. The second approach combines support vector machines, neural networks, and K nearest neighbor classifiers using simple and weighted majority voting to construct a multiple classifier system (MCS) for improving the performance of individual classifiers. It presents a new hybrid FS process by using MCS as a wrapper method and merging it with the iterated feature ranking filter method to further reduce the features. The proposed techniques outperformed FS methods based on Cox's model, such as the Akaike and Bayesian information criteria and the least absolute shrinkage and selection operator, in terms of the log-rank test's p-values, sensitivity, and concordance. This shows that the proposed techniques are more powerful in correctly predicting the risk of re-intervention. 
Consequently, they enable doctors to set an appropriate future observation plan for each patient.
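The simple and weighted majority voting used to build the multiple classifier system can be sketched as follows. This is only an illustration: the abstract does not say how the weights are derived, so the weights here are hypothetical (for instance, each classifier's validation accuracy), and `majority_vote` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional, Sequence


def majority_vote(predictions: Sequence[Hashable],
                  weights: Optional[Sequence[float]] = None) -> Hashable:
    """Combine one prediction per classifier into a single label.

    With weights=None every classifier counts equally (simple voting);
    otherwise each vote is scaled by its classifier's weight. Ties go to
    the label that was seen first.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    scores: Dict[Hashable, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical outputs of an SVM, a neural network and a kNN classifier.
votes = ["high-risk", "low-risk", "low-risk"]
simple = majority_vote(votes)
weighted = majority_vote(votes, weights=[0.9, 0.3, 0.4])
```

With equal weights the two "low-risk" votes win, while the weighted call lets the single, more trusted classifier override them. This is the practical difference between the two voting schemes.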
Abstract:
Microarray technology provides a high-throughput technique to study gene expression. Microarrays can help us diagnose different types of cancers, understand biological processes, assess host responses to drugs and pathogens, find markers for specific diseases, and much more. Microarray experiments generate large amounts of data. Thus, effective data processing and analysis are critical for making reliable inferences from the data. The first part of the dissertation addresses the problem of finding an optimal set of genes (biomarkers) to classify a set of samples as diseased or normal. Three statistical gene selection methods (GS, GS-NR, and GS-PCA) were developed to identify a set of genes that best differentiate between samples. A comparative study on different classification tools was performed and the best combinations of gene selection and classifiers for multi-class cancer classification were identified. For most of the benchmarking cancer data sets, the gene selection method proposed in this dissertation, GS, outperformed other gene selection methods. The classifiers based on Random Forests, neural network ensembles, and K-nearest neighbor (KNN) showed consistently good performance. A striking commonality among these classifiers is that they all use a committee-based approach, suggesting that ensemble classification methods are superior. The same biological problem may be studied at different research labs and/or performed using different lab protocols or samples. In such situations, it is important to combine results from these efforts. The second part of the dissertation addresses the problem of pooling the results from different independent experiments to obtain improved results. Four statistical pooling techniques (Fisher inverse chi-square method, Logit method, Stouffer's Z transform method, and Liptak-Stouffer weighted Z-method) were investigated in this dissertation. 
These pooling techniques were applied to the problem of identifying cell cycle-regulated genes in two different yeast species. As a result, improved sets of cell cycle-regulated genes were identified. The last part of the dissertation explores the effectiveness of wavelet data transforms for the task of clustering. Discrete wavelet transforms, with an appropriate choice of wavelet bases, were shown to be effective in producing clusters that were biologically more meaningful.
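Three of the four pooling techniques named in this abstract have compact textbook forms, sketched below. The code uses the standard definitions of the Fisher, Stouffer and Liptak-Stouffer combinations and is not the dissertation's own code. It assumes independent one-sided p-values strictly between 0 and 1. The names `fisher_combine` and `stouffer_combine` are hypothetical, and the Logit method is omitted.

```python
import math
from statistics import NormalDist
from typing import Optional, Sequence


def fisher_combine(p_values: Sequence[float]) -> float:
    """Fisher's inverse chi-square method.

    X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees
    of freedom under the null. For even degrees of freedom the upper tail
    has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


def stouffer_combine(p_values: Sequence[float],
                     weights: Optional[Sequence[float]] = None) -> float:
    """Stouffer's Z transform; pass weights for the Liptak-Stouffer variant."""
    normal = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    z = sum(w * normal.inv_cdf(1.0 - p) for w, p in zip(weights, p_values))
    z /= math.sqrt(sum(w * w for w in weights))
    return 1.0 - normal.cdf(z)


# Two hypothetical experiments that each give weak evidence for one gene.
pooled_fisher = fisher_combine([0.05, 0.05])
pooled_stouffer = stouffer_combine([0.05, 0.05])
```

In both cases the pooled p-value is smaller than either input. This is how combining independent experiments can identify genes that no single experiment detects with confidence.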
Abstract:
Voice communication systems such as Voice-over IP (VoIP), Public Switched Telephone Networks, and Mobile Telephone Networks, are an integral means of human tele-interaction. These systems pose distinctive challenges due to their unique characteristics such as low volume, burstiness and stringent delay/loss requirements across heterogeneous underlying network technologies. Effective quality evaluation methodologies are important for system development and refinement, particularly by adopting user feedback based measurement. Presently, most of the evaluation models are system-centric (Quality of Service or QoS-based), which prompted us to explore a user-centric (Quality of Experience or QoE-based) approach as a step towards the human-centric paradigm of system design. We investigate an affect-based QoE evaluation framework which attempts to capture users' perception while they are engaged in voice communication. Our modular approach consists of feature extraction from multiple information sources including various affective cues and different classification procedures such as Support Vector Machines (SVM) and k-Nearest Neighbor (kNN). The experimental study is illustrated in depth with detailed analysis of results. The evidence collected indicates the potential feasibility of our approach for QoE evaluation and suggests the consideration of human affective attributes in modeling user experience.
Abstract:
Background and aims: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
Materials and methods: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: ‘semi-structured’ and ‘unstructured’. The second technique was developed using the open source language engineering framework GATE and aimed at the prediction of chunks of the report text containing information pertaining to the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested respectively on sets of 635 and 163 manually classified or annotated reports, from the Northern Ireland Cancer Registry.
Results: The best result of 99.4% accuracy (which included only one semi-structured report predicted as unstructured) was produced by the layout classifier with the k-nearest neighbour algorithm, using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports the performance ranged from 0.97 to 0.94 and from 0.92 to 0.83 in precision and recall, while for unstructured reports performance ranged from 0.91 to 0.64 and from 0.68 to 0.41 in precision and recall. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured.
Conclusions: These results show that it is possible and beneficial to predict the layout of reports and that the accuracy of prediction of which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.
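The layout classifier described in the results (binary term occurrence vectors with a stopword filter, classified by nearest neighbours) can be sketched in plain Python. This is not the RapidMiner process used in the study. The stopword list, the example report texts, the value of k and the function names are all hypothetical, and the pruning step is omitted.

```python
import re
from collections import Counter
from typing import FrozenSet, List, Tuple

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "is", "in", "was"})


def binary_terms(text: str) -> FrozenSet[str]:
    """Binary term occurrence: the set of distinct non-stopword terms."""
    return frozenset(t for t in re.findall(r"[a-z]+", text.lower())
                     if t not in STOPWORDS)


def classify_layout(text: str,
                    labelled: List[Tuple[str, str]],
                    k: int = 1) -> str:
    """Label a report by majority vote among its k nearest labelled reports.

    The distance is the size of the symmetric difference of the term sets,
    which equals the Hamming distance between the binary occurrence vectors.
    """
    query = binary_terms(text)
    ranked = sorted(labelled,
                    key=lambda item: len(query ^ binary_terms(item[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented example reports, for illustration only.
TRAINING = [
    ("Tumour size: 12 mm. Nodes positive: 2", "semi-structured"),
    ("The specimen shows an invasive tumour measuring twelve millimetres",
     "unstructured"),
]
label = classify_layout("Tumour size: 20 mm. Nodes positive: 0", TRAINING)
```

Binary occurrence ignores how often a term appears. That suits layout detection, where the presence of field labels such as "size" or "nodes" matters more than their frequency.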
Abstract:
A systematic diagrammatic expansion for Gutzwiller wavefunctions (DE-GWFs) proposed very recently is used for the description of the superconducting (SC) ground state in the two-dimensional square-lattice t-J model with the hopping electron amplitudes t (and t') between nearest (and next-nearest) neighbors. For the example of the SC state analysis we provide a detailed comparison of the method's results with those of other approaches. Namely, (i) the truncated DE-GWF method reproduces the variational Monte Carlo (VMC) results and (ii) in the lowest (zeroth) order of the expansion the method can reproduce the analytical results of the standard Gutzwiller approximation (GA), as well as of the recently proposed 'grand-canonical Gutzwiller approximation' (called either GCGA or SGA). We obtain important features of the SC state. First, the SC gap at the Fermi surface resembles a d(x2-y2) wave only for optimally and overdoped systems, being diminished in the antinodal regions for the underdoped case in a qualitative agreement with experiment. Corrections to the gap structure are shown to arise from the longer range of the real-space pairing. Second, the nodal Fermi velocity is almost constant as a function of doping and agrees semi-quantitatively with experimental results. Third, we compare the
Abstract:
A detailed non-equilibrium state diagram of shape-anisotropic particle fluids is constructed. The effects of particle shape are explored using Naive Mode Coupling Theory (NMCT), and a single particle Non-linear Langevin Equation (NLE) theory. The dynamical behavior of non-ergodic fluids is discussed. We employ a rotationally frozen approach to NMCT in order to determine a transition to center of mass (translational) localization. Both ideal and kinetic glass transitions are found to be highly shape dependent, and uniformly increase with particle dimensionality. The glass transition volume fraction of quasi 1- and 2-dimensional particles falls monotonically with the number of sites (aspect ratio), while 3-dimensional particles display a non-monotonic dependence of glassy vitrification on the number of sites. Introducing interparticle attractions results in a far more complex state diagram. The ideal non-ergodic boundary shows a glass-fluid-gel re-entrance previously predicted for spherical particle fluids. The non-ergodic region of the state diagram presents qualitatively different dynamics in different regimes. They are qualified by the different behaviors of the NLE dynamic free energy. The caging dominated, repulsive glass regime is characterized by long localization lengths and barrier locations, dictated by repulsive hard core interactions, while the bonding dominated gel region has short localization lengths (commensurate with the attraction range), and barrier locations. There exists a small region of the state diagram which is qualified by both glassy and gel localization lengths in the dynamic free energy. A much larger (high volume fraction, and high attraction strength) region of phase space is characterized by short gel-like localization lengths, and long barrier locations. The region is called the attractive glass and represents a 2-step relaxation process whereby a particle first breaks attractive physical bonds, and then escapes its topological cage. 
The dynamic fragility of fluids is highly particle shape dependent. It increases with particle dimensionality and falls with aspect ratio for quasi 1- and 2-dimensional particles. An ultralocal limit analysis of the NLE theory predicts universalities in the behavior of relaxation times, and elastic moduli. The equilibrium phase diagrams of chemically anisotropic Janus spheres and Janus rods are calculated employing a mean field Random Phase Approximation. The calculations for Janus rods are corroborated by the full liquid state Reference Interaction Site Model theory. The Janus particles consist of attractive and repulsive regions. Both rods and spheres display rich phase behavior. The phase diagrams of these systems display fluid, macrophase separated, attraction driven microphase separated, repulsion driven microphase separated and crystalline regimes. Macrophase separation is predicted in highly attractive low volume fraction systems. Attraction driven microphase separation is characterized by long length scale divergences, where the ordering length scale determines the microphase ordered structures. The ordering length scale of repulsion driven microphase separation is determined by the repulsive range. At the high volume fractions, particles forgo the enthalpic considerations of attractions and repulsions to satisfy hard core constraints and maximize vibrational entropy. This results in site length scale ordering in rods, and the sphere length scale ordering in Janus spheres, i.e., crystallization. A change in the Janus balance of both rods and spheres results in quantitative changes in spinodal temperatures and the position of phase boundaries. However, a change in the block sequence of Janus rods causes qualitative changes in the type of microphase ordered state, and induces prominent features (such as the Lifshitz point) in the phase diagrams of these systems. 
A detailed study of the number of nearest neighbors in Janus rod systems reflects a deep connection between this local measure of structure, and the structure factor which represents the most global measure of order.
Abstract:
Image (Video) retrieval is an interesting problem of retrieving images (videos) similar to the query. Images (Videos) are represented in an input (feature) space and similar images (videos) are obtained by finding nearest neighbors in the input representation space. Numerous input representations both in real valued and binary space have been proposed for conducting faster retrieval. In this thesis, we present techniques that obtain improved input representations for retrieval in both supervised and unsupervised settings for images and videos. Supervised retrieval is a well known problem of retrieving images of the same class as the query. We address the practical aspects of achieving faster retrieval with binary codes as input representations for the supervised setting in the first part, where binary codes are used as addresses into hash tables. In practice, using binary codes as addresses does not guarantee fast retrieval, as similar images are not mapped to the same binary code (address). We address this problem by presenting an efficient supervised hashing (binary encoding) method that aims to explicitly map all the images of the same class ideally to a unique binary code. We refer to the binary codes of the images as 'Semantic Binary Codes' and the unique code for all same class images as 'Class Binary Code'. We also propose a new class based Hamming metric that dramatically reduces the retrieval times for larger databases, where only the Hamming distance to the class binary codes is computed. We also propose a Deep semantic binary code model, by replacing the output layer of a popular convolutional Neural Network (AlexNet) with the class binary codes and show that the hashing functions learned in this way outperform the state of the art, and at the same time provide fast retrieval times. In the second part, we also address the problem of supervised retrieval by taking into account the relationship between classes. 
For a given query image, we want to retrieve images that preserve the relative order, i.e., we want to retrieve all same class images first and then the related class images before different class images. We learn such relationship aware binary codes by minimizing the similarity between inner product of the binary codes and the similarity between the classes. We calculate the similarity between classes using output embedding vectors, which are vector representations of classes. Our method deviates from the other supervised binary encoding schemes as it is the first to use output embeddings for learning hashing functions. We also introduce new performance metrics that take into account the related class retrieval results and show significant gains over the state of the art. High dimensional descriptors like Fisher Vectors or Vector of Locally Aggregated Descriptors have been shown to improve the performance of many computer vision applications including retrieval. In the third part, we discuss an unsupervised technique for compressing high dimensional vectors into high dimensional binary codes, to reduce storage complexity. In this approach, we deviate from adopting traditional hyperplane hashing functions and instead learn hyperspherical hashing functions. The proposed method overcomes the computational challenges of directly applying the spherical hashing algorithm that is intractable for compressing high dimensional vectors. A practical hierarchical model that utilizes divide and conquer techniques using the Random Select and Adjust (RSA) procedure to compress such high dimensional vectors is presented. We show that our proposed high dimensional binary codes outperform the binary codes obtained using traditional hyperplane methods for higher compression ratios. In the last part of the thesis, we propose a retrieval based solution to the zero-shot event classification problem, a setting where no training videos are available for the event. 
To do this, we learn a generic set of concept detectors and represent both videos and query events in the concept space. We then compute similarity between the query event and the video in the concept space and videos similar to the query event are classified as the videos belonging to the event. We show that we significantly boost the performance using concept features from other modalities.
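The class-based Hamming retrieval described in the first part of this thesis can be sketched under simplifying assumptions. The hashing network itself is not shown. The query's binary code is taken as given, binary codes are stored as Python integers, and the class codes, image identifiers and function names are hypothetical.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two codes stored as integers."""
    return bin(a ^ b).count("1")


def nearest_class(query_code: int, class_codes: Dict[str, int]) -> str:
    """Compare the query only with one code per class, not with every image."""
    return min(class_codes,
               key=lambda name: hamming(query_code, class_codes[name]))


def retrieve(query_code: int,
             class_codes: Dict[str, int],
             images_by_class: Dict[str, List[str]]) -> List[str]:
    """Return the stored images of the class whose code is closest."""
    return images_by_class.get(nearest_class(query_code, class_codes), [])


# Hypothetical 8-bit class binary codes and a tiny image index.
CLASS_CODES = {"cat": 0b11110000, "dog": 0b00001111}
INDEX = {"cat": ["cat_001.jpg", "cat_002.jpg"], "dog": ["dog_001.jpg"]}
results = retrieve(0b11100000, CLASS_CODES, INDEX)
```

The number of distance computations grows with the number of classes rather than the number of database images, which is consistent with the reduction in retrieval time the abstract reports for larger databases.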
Abstract:
Master's dissertation, Informatics Engineering, Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2014
Abstract:
The Lennard-Jones Devonshire (LJD) single particle theory for liquids is extended and applied to the anharmonic solid in a high temperature limit. The exact free energy for the crystal is expressed as a convergent series of terms involving larger and larger sets of contiguous particles called cell-clusters. The motions of all the particles within cell-clusters are correlated to each other and lead to non-trivial integrals of orders 3, 6, 9, ... 3N. For the first time the six dimensional integral has been calculated to high accuracy using a Lennard-Jones (6-12) pair interaction between nearest neighbours only for the f.c.c. lattice. The thermodynamic properties predicted by this model agree well with experimental results for solid xenon.
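The pair interaction used in this model can be made concrete with a short sketch. It evaluates only the standard Lennard-Jones (6-12) potential and the static lattice energy per particle of an f.c.c. crystal restricted to its 12 nearest neighbours. It does not attempt the cell-cluster integrals of the thesis. Reduced units (ε = σ = 1) and the function names are assumptions made for illustration.

```python
import math


def lennard_jones(r: float, epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """Lennard-Jones (6-12) pair energy: 4*eps*((sigma/r)**12 - (sigma/r)**6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def fcc_static_energy_per_particle(lattice_constant: float,
                                   epsilon: float = 1.0,
                                   sigma: float = 1.0) -> float:
    """Static energy per particle with nearest-neighbour interactions only.

    Each f.c.c. site has 12 nearest neighbours at distance a / sqrt(2).
    The factor 1/2 avoids counting every pair twice.
    """
    r_nn = lattice_constant / math.sqrt(2.0)
    return 0.5 * 12 * lennard_jones(r_nn, epsilon, sigma)


# Lattice constant that puts nearest neighbours at the pair-potential minimum.
r_min = 2.0 ** (1.0 / 6.0)
energy = fcc_static_energy_per_particle(r_min * math.sqrt(2.0))
```

With nearest neighbours sitting at the potential minimum, each of the six pairs per particle contributes -ε. This gives the zero-temperature reference energy around which the anharmonic thermal motion is then treated.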
Abstract:
A complete census of planetary systems around a volume-limited sample of solar-type stars (FGK dwarfs) in the Solar neighborhood (d ≤ 15 pc) with uniform sensitivity down to Earth-mass planets within their Habitable Zones out to several AUs would be a major milestone in extrasolar planet astrophysics. This fundamental goal can be achieved with a mission concept such as NEAT, the Nearby Earth Astrometric Telescope. NEAT is designed to carry out space-borne extremely-high-precision astrometric measurements at the 0.05 μas (1σ) accuracy level, sufficient to detect dynamical effects due to orbiting planets of mass even lower than Earth's around the nearest stars. Such a survey mission would provide the actual planetary masses and the full orbital geometry for all the components of the detected planetary systems down to the Earth-mass limit. The NEAT performance limits can be achieved by carrying out differential astrometry between the targets and a set of suitable reference stars in the field. The NEAT instrument design consists of an off-axis parabola single-mirror telescope (D = 1 m), a detector with a large field of view located 40 m away from the telescope and made of 8 small movable CCDs located around a fixed central CCD, and an interferometric calibration system monitoring dynamical Young's fringes originating from metrology fibers located at the primary mirror. The mission profile is driven by the fact that the two main modules of the payload, the telescope and the focal plane, must be located 40 m away, leading to the choice of a formation flying option as the reference mission, and of a deployable boom option as an alternative choice. The proposed mission architecture relies on the use of two satellites, of about 700 kg each, operating at L2 for 5 years, flying in formation and offering a capability of more than 20,000 reconfigurations. 
The two satellites will be launched in a stacked configuration using a Soyuz ST launch vehicle. The NEAT primary science program will encompass an astrometric survey of our 200 closest F-, G- and K-type stellar neighbors, with an average of 50 visits each distributed over the nominal mission duration. The main survey operation will use approximately 70% of the mission lifetime. The remaining 30% of NEAT observing time might be allocated, for example, to improve the characterization of the architecture of selected planetary systems around nearby targets of specific interest (low-mass stars, young stars, etc.) discovered by Gaia, ground-based high-precision radial-velocity surveys, and other programs. With its exquisite, surgical astrometric precision, NEAT holds the promise to provide the first thorough census for Earth-mass planets around stars in the immediate vicinity of our Sun.
Abstract:
Pancreatic β-cells are highly sensitive to suboptimal or excess nutrients, as occurs in protein-malnutrition and obesity. Taurine (Tau) improves insulin secretion in response to nutrients and depolarizing agents. Here, we assessed the expression and function of Cav and KATP channels in islets from malnourished mice fed on a high-fat diet (HFD) and supplemented with Tau. Weaned mice received a normal (C) or a low-protein diet (R) for 6 weeks. Half of each group were fed a HFD for 8 weeks without (CH, RH) or with 5% Tau since weaning (CHT, RHT). Isolated islets from R mice showed lower insulin release with glucose and depolarizing stimuli. In CH islets, insulin secretion was increased and this was associated with enhanced KATP inhibition and Cav activity. RH islets secreted less insulin at high K(+) concentration and showed enhanced KATP activity. Tau supplementation normalized K(+)-induced secretion and enhanced glucose-induced Ca(2+) influx in RHT islets. R islets presented lower Ca(2+) influx in response to tolbutamide, and higher protein content and activity of the Kir6.2 subunit of the KATP. Tau increased the protein content of the α1.2 subunit of the Cav channels and the SNARE proteins SNAP-25 and Synt-1 in CHT islets, whereas in RHT, Kir6.2 and Synt-1 proteins were increased. In conclusion, impaired islet function in R islets is related to higher content and activity of the KATP channels. Tau treatment enhanced RHT islet secretory capacity by improving the protein expression and inhibition of the KATP channels and enhancing Synt-1 islet content.