919 resultados para Model-based Categorical Sequence Clustering
Resumo:
In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. Most methods proposed for this goal are focused on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data, referred to official statistics, shows its usefulness.
Resumo:
Bioinformatics, in the last few decades, has played a fundamental role to give sense to the huge amount of data produced. Obtained the complete sequence of a genome, the major problem of knowing as much as possible of its coding regions, is crucial. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As it has been recently pointed out by the Critical Assessment of Function Annotations (CAFA), most accurate methods are those based on the transfer-by-homology approach and the most incisive contribution is given by cross-genome comparisons. In the present thesis it is described a non-hierarchical sequence clustering method for protein automatic large-scale annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 millions protein sequences characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three dimensional structure (when a template is available). This is possible by the way of cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications have been developed during my doctorate including the prediction of Magnesium binding sites in human proteins, the ABC transporters superfamily classification and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, as a web server for the functional and structural protein sequence annotation, BAR+ is freely available at http://bar.biocomp.unibo.it/bar2.0.
Resumo:
This paper proposes a novel demand response model using a fuzzy subtractive cluster approach. The model development provides support to domestic consumer decisions on controllable loads management, considering consumers' consumption needs and the appropriate load shape or rescheduling in order to achieve possible economic benefits. The model based on fuzzy subtractive clustering method considers clusters of domestic consumption covering an adequate consumption range. Analysis of different scenarios is presented considering available electric power and electric energy prices. Simulation results are presented and conclusions of the proposed demand response model are discussed. (C) 2016 Elsevier Ltd. All rights reserved.
Resumo:
This paper proposes a novel demand response model using a fuzzy subtractive cluster approach. The model development provides support to domestic consumer decisions on controllable loads management, considering consumers’ consumption needs and the appropriate load shape or rescheduling in order to achieve possible economic benefits. The model based on fuzzy subtractive clustering method considers clusters of domestic consumption covering an adequate consumption range. Analysis of different scenarios is presented considering available electric power and electric energy prices. Simulation results are presented and conclusions of the proposed demand response model are discussed.
Resumo:
This paper proposes a novel demand response model using a fuzzy subtractive cluster approach. The model development provides support to domestic consumer decisions on controllable loads management, considering consumers’ consumption needs and the appropriate load shape or rescheduling in order to achieve possible economic benefits. The model based on fuzzy subtractive clustering method considers clusters of domestic consumption covering an adequate consumption range. Analysis of different scenarios is presented considering available electric power and electric energy prices. Simulation results are presented and conclusions of the proposed demand response model are discussed.
Resumo:
Garlic is a spice and a medicinal plant; hence, there is an increasing interest in 'developing' new varieties with different culinary properties or with high content of nutraceutical compounds. Phenotypic traits and dominant molecular markers are predominantly used to evaluate the genetic diversity of garlic clones. However, 24 SSR markers (codominant) specific for garlic are available in the literature, fostering germplasm researches. In this study, we genotyped 130 garlic accessions from Brazil and abroad using 17 polymorphic SSR markers to assess the genetic diversity and structure. This is the first attempt to evaluate a large set of accessions maintained by Brazilian institutions. A high level of redundancy was detected in the collection (50 % of the accessions represented eight haplotypes). However, non-redundant accessions presented high genetic diversity. We detected on average five alleles per locus, Shannon index of 1.2, HO of 0.5, and HE of 0.6. A core collection was set with 17 accessions, covering 100 % of the alleles with minimum redundancy. Overall FST and D values indicate a strong genetic structure within accessions. Two major groups identified by both model-based (Bayesian approach) and hierarchical clustering (UPGMA dendrogram) techniques were coherent with the classification of accessions according to maturity time (growth cycle): early-late and midseason accessions. Assessing genetic diversity and structure of garlic collections is the first step towards an efficient management and conservation of accessions in genebanks, as well as to advance future genetic studies and improvement of garlic worldwide.
Resumo:
Motivated by a recently proposed biologically inspired face recognition approach, we investigated the relation between human behavior and a computational model based on Fourier-Bessel (FB) spatial patterns. We measured human recognition performance of FB filtered face images using an 8-alternative forced-choice method. Test stimuli were generated by converting the images from the spatial to the FB domain, filtering the resulting coefficients with a band-pass filter, and finally taking the inverse FB transformation of the filtered coefficients. The performance of the computational models was tested using a simulation of the psychophysical experiment. In the FB model, face images were first filtered by simulated V1- type neurons and later analyzed globally for their content of FB components. In general, there was a higher human contrast sensitivity to radially than to angularly filtered images, but both functions peaked at the 11.3-16 frequency interval. The FB-based model presented similar behavior with regard to peak position and relative sensitivity, but had a wider frequency band width and a narrower response range. The response pattern of two alternative models, based on local FB analysis and on raw luminance, strongly diverged from the human behavior patterns. These results suggest that human performance can be constrained by the type of information conveyed by polar patterns, and consequently that humans might use FB-like spatial patterns in face processing.
Resumo:
Gene clustering is a useful exploratory technique to group together genes with similar expression levels under distinct cell cycle phases or distinct conditions. It helps the biologist to identify potentially meaningful relationships between genes. In this study, we propose a clustering method based on multivariate normal mixture models, where the number of clusters is predicted via sequential hypothesis tests: at each step, the method considers a mixture model of m components (m = 2 in the first step) and tests if in fact it should be m - 1. If the hypothesis is rejected, m is increased and a new test is carried out. The method continues (increasing m) until the hypothesis is accepted. The theoretical core of the method is the full Bayesian significance test, an intuitive Bayesian approach, which needs no model complexity penalization nor positive probabilities for sharp hypotheses. Numerical experiments were based on a cDNA microarray dataset consisting of expression levels of 205 genes belonging to four functional categories, for 10 distinct strains of Saccharomyces cerevisiae. To analyze the method's sensitivity to data dimension, we performed principal components analysis on the original dataset and predicted the number of classes using 2 to 10 principal components. Compared to Mclust (model-based clustering), our method shows more consistent results.
Resumo:
Experimental data for E. coli debris size reduction during high-pressure homogenisation at 55 MPa are presented. A mathematical model based on grinding theory is developed to describe the data. The model is based on first-order breakage and compensation conditions. It does not require any assumption of a specified distribution for debris size and can be used given information on the initial size distribution of whole cells and the disruption efficiency during homogenisation. The number of homogeniser passes is incorporated into the model and used to describe the size reduction of non-induced stationary and induced E. coil cells during homogenisation. Regressing the results to the model equations gave an excellent fit to experimental data ( > 98.7% of variance explained for both fermentations), confirming the model's potential for predicting size reduction during high-pressure homogenisation. This study provides a means to optimise both homogenisation and disc-stack centrifugation conditions for recombinant product recovery. (C) 1997 Elsevier Science Ltd.
Resumo:
Nursing diagnoses associated with alterations of urinary elimination require different interventions, Nurses, who are not specialists, require support to diagnose and manage patients with disturbances of urine elimination. The aim of this study was to present a model based on fuzzy logic for differential diagnosis of alterations in urinary elimination, considering nursing diagnosis approved by the North American Nursing Diagnosis Association, 2001-2002. Fuzzy relations and the maximum-minimum composition approach were used to develop the system. The model performance was evaluated with 195 cases from the database of a previous study, resulting in 79.0% of total concordance and 19.5% of partial concordance, when compared with the panel of experts. Total discordance was observed in only three cases (1.5%). The agreement between model and experts was excellent (kappa = 0.98, P < .0001) or substantial (kappa = 0.69, P < .0001) when considering the overestimative accordance (accordance was considered when at least one diagnosis was equal) and the underestimative discordance (discordance was considered when at least one diagnosis was different), respectively. The model herein presented showed good performance and a simple theoretical structure, therefore demanding few computational resources.
Resumo:
Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data. In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the above referred criteria, is the speed of execution, which is especially relevant when dealing with large data sets.
Resumo:
Genetic diversity of contemporary domesticated species is shaped by both natural and human-driven processes. However, until now, little is known about how domestication has imprinted the variation of fruit tree species. In this study, we reconstruct the recent evolutionary history of the domesticated almond tree, Prunus dulcis, around the Mediterranean basin, using a combination of nuclear and chloroplast microsatellites [i.e. simple sequence repeat (SSRs)] to investigate patterns of genetic diversity. Whereas conservative chloroplast SSRs show a widespread haplotype and rare locally distributed variants, nuclear SSRs show a pattern of isolation by distance with clines of diversity from the East to the West of the Mediterranean basin, while Bayesian genetic clustering reveals a substantial longitudinal genetic structure. Both kinds of markers thus support a single domestication event, in the eastern side of the Mediterranean basin. In addition, model-based estimation of the timing of genetic divergence among those clusters is estimated sometime during the Holocene, a result that is compatible with human-mediated dispersal of almond tree out of its centre of origin. Still, the detection of region-specific alleles suggests that gene flow from relictual wild preglacial populations (in North Africa) or from wild counterparts (in the Near East) could account for a fraction of the diversity observed.
Resumo:
BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data poses new challenges; currently a fair proportion of the tags are routinely discarded due to an inability to match them to a reference sequence, thereby reducing the effective throughput of the technology. RESULTS: We propose a novel base calling algorithm using model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and number of usable tags as compared with Solexa's data processing pipeline by an average of 15%. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
Resumo:
Toxicokinetic modeling is a useful tool to describe or predict the behavior of a chemical agent in the human or animal organism. A general model based on four compartments was developed in a previous study in order to quantify the effect of human variability on a wide range of biological exposure indicators. The aim of this study was to adapt this existing general toxicokinetic model to three organic solvents, which were methyl ethyl ketone, 1-methoxy-2-propanol and 1,1,1,-trichloroethane, and to take into account sex differences. We assessed in a previous human volunteer study the impact of sex on different biomarkers of exposure corresponding to the three organic solvents mentioned above. Results from that study suggested that not only physiological differences between men and women but also differences due to sex hormones levels could influence the toxicokinetics of the solvents. In fact the use of hormonal contraceptive had an effect on the urinary levels of several biomarkers, suggesting that exogenous sex hormones could influence CYP2E1 enzyme activity. These experimental data were used to calibrate the toxicokinetic models developed in this study. Our results showed that it was possible to use an existing general toxicokinetic model for other compounds. In fact, most of the simulation results showed good agreement with the experimental data obtained for the studied solvents, with a percentage of model predictions that lies within the 95% confidence interval varying from 44.4 to 90%. Results pointed out that for same exposure conditions, men and women can show important differences in urinary levels of biological indicators of exposure. Moreover, when running the models by simulating industrial working conditions, these differences could even be more pronounced. In conclusion, a general and simple toxicokinetic model, adapted for three well known organic solvents, allowed us to show that metabolic parameters can have an important impact on the urinary levels of the corresponding biomarkers. These observations give evidence of an interindividual variablity, an aspect that should have its place in the approaches for setting limits of occupational exposure.
Resumo:
Identifying the geographic distribution of populations is a basic, yet crucial step in many fundamental and applied ecological projects, as it provides key information on which many subsequent analyses depend. However, this task is often costly and time consuming, especially where rare species are concerned and where most sampling designs generally prove inefficient. At the same time, rare species are those for which distribution data are most needed for their conservation to be effective. To enhance fieldwork sampling, model-based sampling (MBS) uses predictions from species distribution models: when looking for the species in areas of high habitat suitability, chances should be higher to find them. We thoroughly tested the efficiency of MBS by conducting an important survey in the Swiss Alps, assessing the detection rate of three rare and five common plant species. For each species, habitat suitability maps were produced following an ensemble modeling framework combining two spatial resolutions and two modeling techniques. We tested the efficiency of MBS and the accuracy of our models by sampling 240 sites in the field (30 sitesx8 species). Across all species, the MBS approach proved to be effective. In particular, the MBS design strictly led to the discovery of six sites of presence of one rare plant, increasing chances to find this species from 0 to 50%. For common species, MBS doubled the new population discovery rates as compared to random sampling. Habitat suitability maps coming from the combination of four individual modeling methods predicted well the species' distribution and more accurately than the individual models. As a conclusion, using MBS for fieldwork could efficiently help in increasing our knowledge of rare species distribution. More generally, we recommend using habitat suitability models to support conservation plans.