3 resultados para Gene Structure

em AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevância:

30.00% 30.00%

Publicador:

Resumo:

The main aim of this Ph.D. dissertation is the study of clustering dependent data by means of copula functions with particular emphasis on microarray data. Copula functions are a popular multivariate modeling tool in each field where the multivariate dependence is of great interest and their use in clustering has not been still investigated. The first part of this work contains the review of the literature of clustering methods, copula functions and microarray experiments. The attention focuses on the K–means (Hartigan, 1975; Hartigan and Wong, 1979), the hierarchical (Everitt, 1974) and the model–based (Fraley and Raftery, 1998, 1999, 2000, 2007) clustering techniques because their performance is compared. Then, the probabilistic interpretation of the Sklar’s theorem (Sklar’s, 1959), the estimation methods for copulas like the Inference for Margins (Joe and Xu, 1996) and the Archimedean and Elliptical copula families are presented. In the end, applications of clustering methods and copulas to the genetic and microarray experiments are highlighted. The second part contains the original contribution proposed. A simulation study is performed in order to evaluate the performance of the K–means and the hierarchical bottom–up clustering methods in identifying clusters according to the dependence structure of the data generating process. Different simulations are performed by varying different conditions (e.g., the kind of margins (distinct, overlapping and nested) and the value of the dependence parameter ) and the results are evaluated by means of different measures of performance. In light of the simulation results and of the limits of the two investigated clustering methods, a new clustering algorithm based on copula functions (‘CoClust’ in brief) is proposed. The basic idea, the iterative procedure of the CoClust and the description of the written R functions with their output are given. The CoClust algorithm is tested on simulated data (by varying the number of clusters, the copula models, the dependence parameter value and the degree of overlap of margins) and is compared with the performance of model–based clustering by using different measures of performance, like the percentage of well–identified number of clusters and the not rejection percentage of H0 on . It is shown that the CoClust algorithm allows to overcome all observed limits of the other investigated clustering techniques and is able to identify clusters according to the dependence structure of the data independently of the degree of overlap of margins and the strength of the dependence. The CoClust uses a criterion based on the maximized log–likelihood function of the copula and can virtually account for any possible dependence relationship between observations. Many peculiar characteristics are shown for the CoClust, e.g. its capability of identifying the true number of clusters and the fact that it does not require a starting classification. Finally, the CoClust algorithm is applied to the real microarray data of Hedenfalk et al. (2001) both to the gene expressions observed in three different cancer samples and to the columns (tumor samples) of the whole data matrix.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Eukaryotic ribosomal DNA constitutes a multi gene family organized in a cluster called nucleolar organizer region (NOR); this region is composed usually by hundreds to thousands of tandemly repeated units. Ribosomal genes, being repeated sequences, evolve following the typical pattern of concerted evolution. The autonomous retroelement R2 inserts in the ribosomal gene 28S, leading to defective 28S rDNA genes. R2 element, being a retrotransposon, performs its activity in the genome multiplying its copy number through a “copy and paste” mechanism called target primed reverse transcription. It consists in the retrotranscription of the element’s mRNA into DNA, then the DNA is integrated in the target site. Since the retrotranscription can be interrupted, but the integration will be carried out anyway, truncated copies of the element will also be present in the genome. The study of these truncated variants is a tool to examine the activity of the element. R2 phylogeny appears, in general, not consistent with that of its hosts, except some cases (e.g. Drosophila spp. and Reticulitermes spp.); moreover R2 is absent in some species (Fugu rubripes, human, mouse, etc.), while other species have more R2 lineages in their genome (the turtle Mauremys reevesii, the Japanese beetle Popilia japonica, etc). R2 elements here presented are isolated in 4 species of notostracan branchiopods and in two species of stick insects, whose reproductive strategies range from strict gonochorism to unisexuality. From sequencing data emerges that in Triops cancriformis (Spanish gonochoric population), in Lepidurus arcticus (two putatively unisexual populations from Iceland) and in Bacillus rossius (gonochoric population from Capalbio) the R2 elements are complete and encode functional proteins, reflecting the general features of this family of transposable elements. On the other hand, R2 from Italian and Austrian populations of T. cancriformis (respectively unisexual and hermaphroditic), Lepidurus lubbocki (two elements within the same Italian population, gonochoric but with unfunctional males) and Bacillus grandii grandii (gonochoric population from Ponte Manghisi) have sequences that encode incomplete or non-functional proteins in which it is possible to recognize only part of the characteristic domains. In Lepidurus couesii (Italian gonochoric populations) different elements were found as in L. lubbocki, and the sequencing is still in progress. Two hypothesis are given to explain the inconsistency of R2/host phylogeny: vertical inheritance of the element followed by extinction/diversification or horizontal transmission. My data support previous study that state the vertical transmission as the most likely explanation; nevertheless horizontal transfer events can’t be excluded. I also studied the element’s activity in Spanish populations of T. cancriformis, in L. lubbocki, in L. arcticus and in gonochoric and parthenogenetic populations of B. rossius. In gonochoric populations of T. cancriformis and B. rossius I found that each individual has its own private set of truncated variants. The situation is the opposite for the remaining hermaphroditic/parthenogenetic species and populations, all individuals sharing – in the so far analyzed samples - the majority of variants. This situation is very interesting, because it isn’t concordant with the Muller’s ratchet theory that hypothesizes the parthenogenetic populations being either devoided of transposable elements or TEs overloaded. My data suggest a possible epigenetic mechanism that can block the retrotransposon activity, and in this way deleterious mutations don’t accumulate.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

From the late 1980s, the automation of sequencing techniques and the computer spread gave rise to a flourishing number of new molecular structures and sequences and to proliferation of new databases in which to store them. Here are presented three computational approaches able to analyse the massive amount of publicly avalilable data in order to answer to important biological questions. The first strategy studies the incorrect assignment of the first AUG codon in a messenger RNA (mRNA), due to the incomplete determination of its 5' end sequence. An extension of the mRNA 5' coding region was identified in 477 in human loci, out of all human known mRNAs analysed, using an automated expressed sequence tag (EST)-based approach. Proof-of-concept confirmation was obtained by in vitro cloning and sequencing for GNB2L1, QARS and TDP2 and the consequences for the functional studies are discussed. The second approach analyses the codon bias, the phenomenon in which distinct synonymous codons are used with different frequencies, and, following integration with a gene expression profile, estimates the total number of codons present across all the expressed mRNAs (named here "codonome value") in a given biological condition. Systematic analyses across different pathological and normal human tissues and multiple species shows a surprisingly tight correlation between the codon bias and the codonome bias. The third approach is useful to studies the expression of human autism spectrum disorder (ASD) implicated genes. ASD implicated genes sharing microRNA response elements (MREs) for the same microRNA are co-expressed in brain samples from healthy and ASD affected individuals. The different expression of a recently identified long non coding RNA which have four MREs for the same microRNA could disrupt the equilibrium in this network, but further analyses and experiments are needed.