901 resultados para Clustering search algorithm


Relevância:

30.00% 30.00%

Publicador:

Resumo:

3 Summary 3. 1 English The pharmaceutical industry has been facing several challenges during the last years, and the optimization of their drug discovery pipeline is believed to be the only viable solution. High-throughput techniques do participate actively to this optimization, especially when complemented by computational approaches aiming at rationalizing the enormous amount of information that they can produce. In siiico techniques, such as virtual screening or rational drug design, are now routinely used to guide drug discovery. Both heavily rely on the prediction of the molecular interaction (docking) occurring between drug-like molecules and a therapeutically relevant target. Several softwares are available to this end, but despite the very promising picture drawn in most benchmarks, they still hold several hidden weaknesses. As pointed out in several recent reviews, the docking problem is far from being solved, and there is now a need for methods able to identify binding modes with a high accuracy, which is essential to reliably compute the binding free energy of the ligand. This quantity is directly linked to its affinity and can be related to its biological activity. Accurate docking algorithms are thus critical for both the discovery and the rational optimization of new drugs. In this thesis, a new docking software aiming at this goal is presented, EADock. It uses a hybrid evolutionary algorithm with two fitness functions, in combination with a sophisticated management of the diversity. EADock is interfaced with .the CHARMM package for energy calculations and coordinate handling. A validation was carried out on 37 crystallized protein-ligand complexes featuring 11 different proteins. The search space was defined as a sphere of 15 R around the center of mass of the ligand position in the crystal structure, and conversely to other benchmarks, our algorithms was fed with optimized ligand positions up to 10 A root mean square deviation 2MSD) from the crystal structure. This validation illustrates the efficiency of our sampling heuristic, as correct binding modes, defined by a RMSD to the crystal structure lower than 2 A, were identified and ranked first for 68% of the complexes. The success rate increases to 78% when considering the five best-ranked clusters, and 92% when all clusters present in the last generation are taken into account. Most failures in this benchmark could be explained by the presence of crystal contacts in the experimental structure. EADock has been used to understand molecular interactions involved in the regulation of the Na,K ATPase, and in the activation of the nuclear hormone peroxisome proliferatoractivated receptors a (PPARa). It also helped to understand the action of common pollutants (phthalates) on PPARy, and the impact of biotransformations of the anticancer drug Imatinib (Gleevec®) on its binding mode to the Bcr-Abl tyrosine kinase. Finally, a fragment-based rational drug design approach using EADock was developed, and led to the successful design of new peptidic ligands for the a5ß1 integrin, and for the human PPARa. In both cases, the designed peptides presented activities comparable to that of well-established ligands such as the anticancer drug Cilengitide and Wy14,643, respectively. 3.2 French Les récentes difficultés de l'industrie pharmaceutique ne semblent pouvoir se résoudre que par l'optimisation de leur processus de développement de médicaments. Cette dernière implique de plus en plus. de techniques dites "haut-débit", particulièrement efficaces lorsqu'elles sont couplées aux outils informatiques permettant de gérer la masse de données produite. Désormais, les approches in silico telles que le criblage virtuel ou la conception rationnelle de nouvelles molécules sont utilisées couramment. Toutes deux reposent sur la capacité à prédire les détails de l'interaction moléculaire entre une molécule ressemblant à un principe actif (PA) et une protéine cible ayant un intérêt thérapeutique. Les comparatifs de logiciels s'attaquant à cette prédiction sont flatteurs, mais plusieurs problèmes subsistent. La littérature récente tend à remettre en cause leur fiabilité, affirmant l'émergence .d'un besoin pour des approches plus précises du mode d'interaction. Cette précision est essentielle au calcul de l'énergie libre de liaison, qui est directement liée à l'affinité du PA potentiel pour la protéine cible, et indirectement liée à son activité biologique. Une prédiction précise est d'une importance toute particulière pour la découverte et l'optimisation de nouvelles molécules actives. Cette thèse présente un nouveau logiciel, EADock, mettant en avant une telle précision. Cet algorithme évolutionnaire hybride utilise deux pressions de sélections, combinées à une gestion de la diversité sophistiquée. EADock repose sur CHARMM pour les calculs d'énergie et la gestion des coordonnées atomiques. Sa validation a été effectuée sur 37 complexes protéine-ligand cristallisés, incluant 11 protéines différentes. L'espace de recherche a été étendu à une sphère de 151 de rayon autour du centre de masse du ligand cristallisé, et contrairement aux comparatifs habituels, l'algorithme est parti de solutions optimisées présentant un RMSD jusqu'à 10 R par rapport à la structure cristalline. Cette validation a permis de mettre en évidence l'efficacité de notre heuristique de recherche car des modes d'interactions présentant un RMSD inférieur à 2 R par rapport à la structure cristalline ont été classés premier pour 68% des complexes. Lorsque les cinq meilleures solutions sont prises en compte, le taux de succès grimpe à 78%, et 92% lorsque la totalité de la dernière génération est prise en compte. La plupart des erreurs de prédiction sont imputables à la présence de contacts cristallins. Depuis, EADock a été utilisé pour comprendre les mécanismes moléculaires impliqués dans la régulation de la Na,K ATPase et dans l'activation du peroxisome proliferatoractivated receptor a (PPARa). Il a également permis de décrire l'interaction de polluants couramment rencontrés sur PPARy, ainsi que l'influence de la métabolisation de l'Imatinib (PA anticancéreux) sur la fixation à la kinase Bcr-Abl. Une approche basée sur la prédiction des interactions de fragments moléculaires avec protéine cible est également proposée. Elle a permis la découverte de nouveaux ligands peptidiques de PPARa et de l'intégrine a5ß1. Dans les deux cas, l'activité de ces nouveaux peptides est comparable à celles de ligands bien établis, comme le Wy14,643 pour le premier, et le Cilengitide (PA anticancéreux) pour la seconde.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We consider the numerical treatment of the optical flow problem by evaluating the performance of the trust region method versus the line search method. To the best of our knowledge, the trust region method is studied here for the first time for variational optical flow computation. Four different optical flow models are used to test the performance of the proposed algorithm combining linear and nonlinear data terms with quadratic and TV regularization. We show that trust region often performs better than line search; especially in the presence of non-linearity and non-convexity in the model.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Zonal management in vineyards requires the prior delineation of stable yield zones within the parcel. Among the different methodologies used for zone delineation, cluster analysis of yield data from several years is one of the possibilities cited in scientific literature. However, there exist reasonable doubts concerning the cluster algorithm to be used and the number of zones that have to be delineated within a field. In this paper two different cluster algorithms have been compared (k-means and fuzzy c-means) using the grape yield data corresponding to three successive years (2002, 2003 and 2004), for a ‘Pinot Noir’ vineyard parcel. Final choice of the most recommendable algorithm has been linked to obtaining a stable pattern of spatial yield distribution and to allowing for the delineation of compact and average sized areas. The general recommendation is to use reclassified maps of two clusters or yield classes (low yield zone and high yield zone) and, consequently, the site-specific vineyard management should be based on the prior delineation of just two different zones or sub-parcels. The two tested algorithms are good options for this purpose. However, the fuzzy c-means algorithm allows for a better zoning of the parcel, forming more compact areas and with more equilibrated zonal differences over time.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This master’s thesis aims to study and represent from literature how evolutionary algorithms are used to solve different search and optimisation problems in the area of software engineering. Evolutionary algorithms are methods, which imitate the natural evolution process. An artificial evolution process evaluates fitness of each individual, which are solution candidates. The next population of candidate solutions is formed by using the good properties of the current population by applying different mutation and crossover operations. Different kinds of evolutionary algorithm applications related to software engineering were searched in the literature. Applications were classified and represented. Also the necessary basics about evolutionary algorithms were presented. It was concluded, that majority of evolutionary algorithm applications related to software engineering were about software design or testing. For example, there were applications about classifying software production data, project scheduling, static task scheduling related to parallel computing, allocating modules to subsystems, N-version programming, test data generation and generating an integration test order. Many applications were experimental testing rather than ready for real production use. There were also some Computer Aided Software Engineering tools based on evolutionary algorithms.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We consider the numerical treatment of the optical flow problem by evaluating the performance of the trust region method versus the line search method. To the best of our knowledge, the trust region method is studied here for the first time for variational optical flow computation. Four different optical flow models are used to test the performance of the proposed algorithm combining linear and nonlinear data terms with quadratic and TV regularization. We show that trust region often performs better than line search; especially in the presence of non-linearity and non-convexity in the model.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The analysis of rockfall characteristics and spatial distribution is fundamental to understand and model the main factors that predispose to failure. In our study we analysed LiDAR point clouds aiming to: (1) detect and characterise single rockfalls; (2) investigate their spatial distribution. To this end, different cluster algorithms were applied: 1a) Nearest Neighbour Clutter Removal (NNCR) in combination with the Expectation?Maximization (EM) in order to separate feature points from clutter; 1b) a density based algorithm (DBSCAN) was applied to isolate the single clusters (i.e. the rockfall events); 2) finally we computed the Ripley's K-function to investigate the global spatial pattern of the extracted rockfalls. The method allowed proper identification and characterization of more than 600 rockfalls occurred on a cliff located in Puigcercos (Catalonia, Spain) during a time span of six months. The spatial distribution of these events proved that rockfall were clustered distributed at a welldefined distance-range. Computations were carried out using R free software for statistical computing and graphics. The understanding of the spatial distribution of precursory rockfalls may shed light on the forecasting of future failures.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

One of the major problems in machine vision is the segmentation of images of natural scenes. This paper presents a new proposal for the image segmentation problem which has been based on the integration of edge and region information. The main contours of the scene are detected and used to guide the posterior region growing process. The algorithm places a number of seeds at both sides of a contour allowing stating a set of concurrent growing processes. A previous analysis of the seeds permits to adjust the homogeneity criterion to the regions's characteristics. A new homogeneity criterion based on clustering analysis and convex hull construction is proposed

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Image segmentation of natural scenes constitutes a major problem in machine vision. This paper presents a new proposal for the image segmentation problem which has been based on the integration of edge and region information. This approach begins by detecting the main contours of the scene which are later used to guide a concurrent set of growing processes. A previous analysis of the seed pixels permits adjustment of the homogeneity criterion to the region's characteristics during the growing process. Since the high variability of regions representing outdoor scenes makes the classical homogeneity criteria useless, a new homogeneity criterion based on clustering analysis and convex hull construction is proposed. Experimental results have proven the reliability of the proposed approach

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Speaker diarization is the process of sorting speeches according to the speaker. Diarization helps to search and retrieve what a certain speaker uttered in a meeting. Applications of diarization systemsextend to other domains than meetings, for example, lectures, telephone, television, and radio. Besides, diarization enhances the performance of several speech technologies such as speaker recognition, automatic transcription, and speaker tracking. Methodologies previously used in developing diarization systems are discussed. Prior results and techniques are studied and compared. Methods such as Hidden Markov Models and Gaussian Mixture Models that are used in speaker recognition and other speech technologies are also used in speaker diarization. The objective of this thesis is to develop a speaker diarization system in meeting domain. Experimental part of this work indicates that zero-crossing rate can be used effectively in breaking down the audio stream into segments, and adaptive Gaussian Models fit adequately short audio segments. Results show that 35 Gaussian Models and one second as average length of each segment are optimum values to build a diarization system for the tested data. Uniting the segments which are uttered by same speaker is done in a bottom-up clustering by a newapproach of categorizing the mixture weights.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Clustering soil and crop data can be used as a basis for the definition of management zones because the data are grouped into clusters based on the similar interaction of these variables. Therefore, the objective of this study was to identify management zones using fuzzy c-means clustering analysis based on the spatial and temporal variability of soil attributes and corn yield. The study site (18 by 250-m in size) was located in Jaboticabal, São Paulo/Brazil. Corn yield was measured in one hundred 4.5 by 10-m cells along four parallel transects (25 observations per transect) over five growing seasons between 2001 and 2010. Soil chemical and physical attributes were measured. SAS procedure MIXED was used to identify which variable(s) most influenced the spatial variability of corn yield over the five study years. Basis saturation (BS) was the variable that better related to corn yield, thus, semivariograms models were fitted for BS and corn yield and then, data values were krigged. Management Zone Analyst software was used to carry out the fuzzy c-means clustering algorithm. The optimum number of management zones can change over time, as well as the degree of agreement between the BS and corn yield management zone maps. Thus, it is very important take into account the temporal variability of crop yield and soil attributes to delineate management zones accurately.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Chronic hepatitis B (HBV) and C (HCV) virus infections are the most important factors associated with hepatocellular carcinoma (HCC), but tumor prognosis remains poor due to the lack of diagnostic biomarkers. In order to identify novel diagnostic markers and therapeutic targets, the gene expression profile associated with viral and non-viral HCC was assessed in 9 tumor samples by oligo-microarrays. The differentially expressed genes were examined using a z-score and KEGG pathway for the search of ontological biological processes. We selected a non-redundant set of 15 genes with the lowest P value for clustering samples into three groups using the non-supervised algorithm k-means. Fisher’s linear discriminant analysis was then applied in an exhaustive search of trios of genes that could be used to build classifiers for class distinction. Different transcriptional levels of genes were identified in HCC of different etiologies and from different HCC samples. When comparing HBV-HCC vs HCV-HCC, HBV-HCC/HCV-HCC vs non-viral (NV)-HCC, HBC-HCC vs NV-HCC, and HCV-HCC vs NV-HCC of the 58 non-redundant differentially expressed genes, only 6 genes (IKBKβ, CREBBP, WNT10B, PRDX6, ITGAV, and IFNAR1) were found to be associated with hepatic carcinogenesis. By combining trios, classifiers could be generated, which correctly classified 100% of the samples. This expression profiling may provide a useful tool for research into the pathophysiology of HCC. A detailed understanding of how these distinct genes are involved in molecular pathways is of fundamental importance to the development of effective HCC chemoprevention and treatment.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The goal of most clustering algorithms is to find the optimal number of clusters (i.e. fewest number of clusters). However, analysis of molecular conformations of biological macromolecules obtained from computer simulations may benefit from a larger array of clusters. The Self-Organizing Map (SOM) clustering method has the advantage of generating large numbers of clusters, but often gives ambiguous results. In this work, SOMs have been shown to be reproducible when the same conformational dataset is independently clustered multiple times (~100), with the help of the Cramérs V-index (C_v). The ability of C_v to determine which SOMs are reproduced is generalizable across different SOM source codes. The conformational ensembles produced from MD (molecular dynamics) and REMD (replica exchange molecular dynamics) simulations of the penta peptide Met-enkephalin (MET) and the 34 amino acid protein human Parathyroid Hormone (hPTH) were used to evaluate SOM reproducibility. The training length for the SOM has a huge impact on the reproducibility. Analysis of MET conformational data definitively determined that toroidal SOMs cluster data better than bordered maps due to the fact that toroidal maps do not have an edge effect. For the source code from MATLAB, it was determined that the learning rate function should be LINEAR with an initial learning rate factor of 0.05 and the SOM should be trained by a sequential algorithm. The trained SOMs can be used as a supervised classification for another dataset. The toroidal 10×10 hexagonal SOMs produced from the MATLAB program for hPTH conformational data produced three sets of reproducible clusters (27%, 15%, and 13% of 100 independent runs) which find similar partitionings to those of smaller 6×6 SOMs. The χ^2 values produced as part of the C_v calculation were used to locate clusters with identical conformational memberships on independently trained SOMs, even those with different dimensions. The χ^2 values could relate the different SOM partitionings to each other.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Computational Biology is the research are that contributes to the analysis of biological data through the development of algorithms which will address significant research problems.The data from molecular biology includes DNA,RNA ,Protein and Gene expression data.Gene Expression Data provides the expression level of genes under different conditions.Gene expression is the process of transcribing the DNA sequence of a gene into mRNA sequences which in turn are later translated into proteins.The number of copies of mRNA produced is called the expression level of a gene.Gene expression data is organized in the form of a matrix. Rows in the matrix represent genes and columns in the matrix represent experimental conditions.Experimental conditions can be different tissue types or time points.Entries in the gene expression matrix are real values.Through the analysis of gene expression data it is possible to determine the behavioral patterns of genes such as similarity of their behavior,nature of their interaction,their respective contribution to the same pathways and so on. Similar expression patterns are exhibited by the genes participating in the same biological process.These patterns have immense relevance and application in bioinformatics and clinical research.Theses patterns are used in the medical domain for aid in more accurate diagnosis,prognosis,treatment planning.drug discovery and protein network analysis.To identify various patterns from gene expression data,data mining techniques are essential.Clustering is an important data mining technique for the analysis of gene expression data.To overcome the problems associated with clustering,biclustering is introduced.Biclustering refers to simultaneous clustering of both rows and columns of a data matrix. Clustering is a global whereas biclustering is a local model.Discovering local expression patterns is essential for identfying many genetic pathways that are not apparent otherwise.It is therefore necessary to move beyond the clustering paradigm towards developing approaches which are capable of discovering local patterns in gene expression data.A biclusters is a submatrix of the gene expression data matrix.The rows and columns in the submatrix need not be contiguous as in the gene expression data matrix.Biclusters are not disjoint.Computation of biclusters is costly because one will have to consider all the combinations of columans and rows in order to find out all the biclusters.The search space for the biclustering problem is 2 m+n where m and n are the number of genes and conditions respectively.Usually m+n is more than 3000.The biclustering problem is NP-hard.Biclustering is a powerful analytical tool for the biologist.The research reported in this thesis addresses the problem of biclustering.Ten algorithms are developed for the identification of coherent biclusters from gene expression data.All these algorithms are making use of a measure called mean squared residue to search for biclusters.The objective here is to identify the biclusters of maximum size with the mean squared residue lower than a given threshold. All these algorithms begin the search from tightly coregulated submatrices called the seeds.These seeds are generated by K-Means clustering algorithm.The algorithms developed can be classified as constraint based,greedy and metaheuristic.Constarint based algorithms uses one or more of the various constaints namely the MSR threshold and the MSR difference threshold.The greedy approach makes a locally optimal choice at each stage with the objective of finding the global optimum.In metaheuristic approaches particle Swarm Optimization(PSO) and variants of Greedy Randomized Adaptive Search Procedure(GRASP) are used for the identification of biclusters.These algorithms are implemented on the Yeast and Lymphoma datasets.Biologically relevant and statistically significant biclusters are identified by all these algorithms which are validated by Gene Ontology database.All these algorithms are compared with some other biclustering algorithms.Algorithms developed in this work overcome some of the problems associated with the already existing algorithms.With the help of some of the algorithms which are developed in this work biclusters with very high row variance,which is higher than the row variance of any other algorithm using mean squared residue, are identified from both Yeast and Lymphoma data sets.Such biclusters which make significant change in the expression level are highly relevant biologically.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this paper, moving flock patterns are mined from spatio- temporal datasets by incorporating a clustering algorithm. A flock is defined as the set of data that move together for a certain continuous amount of time. Finding out moving flock patterns using clustering algorithms is a potential method to find out frequent patterns of movement in large trajectory datasets. In this approach, SPatial clusteRing algoRithm thrOugh sWarm intelligence (SPARROW) is the clustering algorithm used. The advantage of using SPARROW algorithm is that it can effectively discover clusters of widely varying sizes and shapes from large databases. Variations of the proposed method are addressed and also the experimental results show that the problem of scalability and duplicate pattern formation is addressed. This method also reduces the number of patterns produced

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Knowledge discovery in databases is the non-trivial process of identifying valid, novel potentially useful and ultimately understandable patterns from data. The term Data mining refers to the process which does the exploratory analysis on the data and builds some model on the data. To infer patterns from data, data mining involves different approaches like association rule mining, classification techniques or clustering techniques. Among the many data mining techniques, clustering plays a major role, since it helps to group the related data for assessing properties and drawing conclusions. Most of the clustering algorithms act on a dataset with uniform format, since the similarity or dissimilarity between the data points is a significant factor in finding out the clusters. If a dataset consists of mixed attributes, i.e. a combination of numerical and categorical variables, a preferred approach is to convert different formats into a uniform format. The research study explores the various techniques to convert the mixed data sets to a numerical equivalent, so as to make it equipped for applying the statistical and similar algorithms. The results of clustering mixed category data after conversion to numeric data type have been demonstrated using a crime data set. The thesis also proposes an extension to the well known algorithm for handling mixed data types, to deal with data sets having only categorical data. The proposed conversion has been validated on a data set corresponding to breast cancer. Moreover, another issue with the clustering process is the visualization of output. Different geometric techniques like scatter plot, or projection plots are available, but none of the techniques display the result projecting the whole database but rather demonstrate attribute-pair wise analysis