33 results for statistical classification
in Helda - Digital Repository of University of Helsinki
Abstract:
The Baltic Sea was studied with respect to selected organic contaminants and their ecotoxicology. The research consisted of analyses of total hydrocarbons, polycyclic aromatic hydrocarbons, bile metabolites, hepatic ethoxyresorufin-O-deethylase (EROD) activity, polychlorinated biphenyls (PCBs) and organochlorine pesticides (OCPs). The contaminants were measured from various matrices, such as seawater, sediment and biota. The methods of analysis were evaluated and refined to ensure comparability of the results. Polyaromatic hydrocarbons, originating from petroleum, are known to be among the most harmful substances to the marine environment. In Baltic subsurface water, seasonal dependence of the total hydrocarbon concentrations (THCs) was seen. Although concentrations of parent polycyclic aromatic hydrocarbons (PAHs) in sediment surface varied between 64 and 5161 ug kg-1 (dw), concentrations above 860 ug kg-1 (dw) were found in all the studied sub-basins of the Baltic Sea. Concentrations commonly considered to substantially increase the risk of liver disease and reproductive impairment in fish, as well as potential effects on growth (above 1000 ug kg-1 dw), were found in all the studied sub-basins of the Baltic Sea except Kattegat. Thus, considerable pollution in sediments was indicated. In bivalves, the sums of 12 PAHs varied on a wet weight basis between 44 and 298 ug kg-1 (ww). The predominant PAHs were high molecular weight and the PAH profiles of M. balthica differed from those found in sediment from the same area. The PAHs were both pyrolytic and petrogenic in origin, and a contribution from diesel engines was found, which indicates pollution of the Baltic Sea, most likely caused by the steadily increasing shipping in the area. The HPLC methods developed for hepatic EROD activity and bile metabolite measurements proved to be fast and suitable for the study of biological effects.
A mixed function oxygenase enzyme system in Baltic Sea perch collected from the Gulf of Finland was induced slightly: EROD activity in perch varied from 0.30 to 14 pmol min-1 mg-1 protein. This range can be considered to be comparable to background values. Recent PAH exposure was also indicated by enhanced levels (213 and 1149 ug kg-1) of the bile metabolite 1-hydroxypyrene. No correlation was indicated between hepatic EROD activity and concentration of 1-hydroxypyrene in bile. PCBs and OCPs were observed in Baltic Sea sediment, bivalves and herring. Sums of seven CBs in surface sediment (0-5 cm) ranged from 0.04 to 6.2 ug kg-1 (dw) and sums of three DDTs from 0.13 to 5.0 ug kg-1 (dw). The highest levels of contaminants were found in the most eastern area of the Gulf of Finland where the highest total carbon and nitrogen content was found and where the lowest percentage proportion of p,p'-DDT was found. The highest concentrations of CBs and the lowest concentration of DDTs were found in M. balthica from the Gulf of Finland. The highest levels of DDTs were found in M. balthica from the Hanö Bight, which is the outer part of the Bornholm Basin close to the Swedish mainland. In bivalves, the sums of seven CBs were 72-108 ug kg-1 (lw) and the sums of three DDTs were 66-139 ug kg-1 (lw). Results from temporal trend monitoring showed that during the period 1985-2002, the concentrations of seven CBs in two-year-old female Baltic herring clearly decreased, from 9-16 to 2-6 ug kg-1 (ww) in the northern Baltic Sea. At the same time, concentrations of three DDTs declined from 8-15 to 1-5 ug kg-1 (ww). The total concentration of the fat-soluble CBs and DDTs in Baltic herring muscle was shown to be age-dependent; the average concentrations in ten-year-old Baltic herring were three to five-fold higher than in two-year-old herring.
In Baltic herring and bivalves, as well as in surface sediments, CB138 and CB153 were predominant among CBs, whereas among DDTs p,p'-DDD predominated in sediment and p,p'-DDE in bivalves and Baltic herring muscle. Baltic Sea sediments are potential sources of contaminants that may become available for bioaccumulation. Based on ecotoxicological assessment criteria, cause for concern regarding CBs in sediments was indicated for the Gulf of Finland and the northern Baltic Proper, and for the northern Baltic Sea regarding CBs in Baltic herring more than two years old. Statistical classification of selected organic contaminants indicated high-level contamination for p,p'-DDT, p,p'-DDD, p,p'-DDE, total DDTs, HCB, CB118 and CB153 in muscle of Baltic herring in age groups two to ten years; in contrast, concentrations of a-HCH and g-HCH were found to be moderate. The concentrations of DDTs and CBs in bivalves are sufficient to cause biological effects, and demonstrate that long-term biological effects are still possible in the case of DDTs in the Hanö Bight.
Abstract:
Microarrays have a wide range of applications in the biomedical field. From the beginning, arrays have mostly been utilized in cancer research, including classification of tumors into different subgroups and identification of clinical associations. In the microarray format, a collection of small features, such as different oligonucleotides, is attached to a solid support. The advantage of microarray technology is the ability to simultaneously measure changes in the levels of multiple biomolecules. Because many diseases, including cancer, are complex, involving an interplay between various genes and environmental factors, the detection of only a single marker molecule is usually insufficient for determining disease status. Thus, a technique that simultaneously collects information on multiple molecules allows better insights into a complex disease. Since microarrays can be custom-manufactured or obtained from a number of commercial providers, understanding data quality and comparability between different platforms is important to extend the use of the technology to areas beyond basic research. When standardized, integrated array data could ultimately help to offer a complete profile of the disease, illuminating mechanisms and genes behind disorders as well as facilitating disease diagnostics. In the first part of this work, we aimed to elucidate the comparability of gene expression measurements from different oligonucleotide and cDNA microarray platforms. We compared three different gene expression microarrays; one was a commercial oligonucleotide microarray and the others were commercial and custom-made cDNA microarrays. The filtered gene expression data from the commercial platforms correlated better across experiments (r=0.78-0.86) than the expression data between the custom-made and either of the two commercial platforms (r=0.62-0.76).
Although the results from different platforms correlated reasonably well, combining and comparing the measurements was not straightforward. The clone errors on the custom-made array and annotation and technical differences between the platforms introduced variability in the data. In conclusion, the different gene expression microarray platforms provided results sufficiently concordant for the research setting, but the variability represents a challenge for developing diagnostic applications for the microarrays. In the second part of the work, we performed an integrated high-resolution microarray analysis of gene copy number and expression in 38 laryngeal and oral tongue squamous cell carcinoma cell lines and primary tumors. Our aim was to pinpoint genes for which expression was impacted by changes in copy number. The data revealed that especially amplifications had a clear impact on gene expression. Across the genome, 14-32% of genes in the highly amplified regions (copy number ratio >2.5) had associated overexpression. The impact of decreased copy number on gene underexpression was less clear. Using statistical analysis across the samples, we systematically identified hundreds of genes for which an increased copy number was associated with increased expression. For example, our data implied that FADD and PPFIA1 were frequently overexpressed at the 11q13 amplicon in HNSCC. The 11q13 amplicon, including known oncogenes such as CCND1 and CTTN, is well-characterized in different types of cancer, but the roles of FADD and PPFIA1 remain obscure. Taken together, the integrated microarray analysis revealed a number of known as well as novel target genes in altered regions in HNSCC. The identified genes provide a basis for functional validation and may eventually lead to the identification of novel candidates for targeted therapy in HNSCC.
Abstract:
In genetic epidemiology, population-based disease registries are commonly used to collect genotype or other risk factor information concerning affected subjects and their relatives. This work presents two new approaches for the statistical inference of ascertained data: conditional and full likelihood approaches for a disease with a variable age-at-onset phenotype, using familial data obtained from a population-based registry of incident cases. The aim is to obtain statistically reliable estimates of the general population parameters. The statistical analysis of familial data with variable age at onset becomes more complicated when some of the study subjects are non-susceptible, that is to say these subjects never get the disease. A statistical model for a variable age at onset with long-term survivors is proposed for studies of familial aggregation, using a latent variable approach, as well as for prospective genetic association studies with candidate genes. In addition, we explore the possibility of a genetic explanation of the observed increase in the incidence of Type 1 diabetes (T1D) in Finland in recent decades and the hypothesis of non-Mendelian transmission of T1D associated genes. Both classical and Bayesian statistical inference were used in the modelling and estimation. Despite the fact that this work contains five studies with different statistical models, they all concern data obtained from nationwide registries of T1D and genetics of T1D. In the analyses of T1D data, non-Mendelian transmission of T1D susceptibility alleles was not observed. In addition, non-Mendelian transmission of T1D susceptibility genes did not provide a plausible explanation for the increase in T1D incidence in Finland. Instead, the Human Leucocyte Antigen associations with T1D were confirmed in the population-based analysis, which combines T1D registry information, a reference sample of healthy subjects and birth cohort information of the Finnish population.
Finally, a substantial familial variation in the susceptibility to T1D nephropathy was observed. The presented studies show the benefits of sophisticated statistical modelling to explore risk factors for complex diseases.
Abstract:
Hereditary nonpolyposis colorectal cancer (HNPCC) is the most common known clearly hereditary cause of colorectal and endometrial cancer (CRC and EC). Dominantly inherited mutations in one of the known mismatch repair (MMR) genes predispose to HNPCC. Defective MMR leads to an accumulation of mutations especially in repeat tracts, presenting microsatellite instability. HNPCC is clinically a very heterogeneous disease. The age at onset varies and the target tissue may vary. In addition, families that fulfill the diagnostic criteria for HNPCC but fail to show any predisposing mutation in MMR genes exist. Our aim was to evaluate the genetic background of familial CRC and EC. We performed comprehensive molecular and DNA copy number analyses of CRCs fulfilling the diagnostic criteria for HNPCC. We studied the role of five pathways (MMR, Wnt, p53, CIN, PI3K/AKT) and divided the tumors into two groups, one with MMR gene germline mutations and the other without. We observed that MMR-proficient familial CRCs consist of two molecularly distinct groups that differ from MMR-deficient tumors. Group A shows paucity of common molecular and chromosomal alterations characteristic of colorectal carcinogenesis. Group B shows molecular features similar to classical microsatellite stable tumors with gross chromosomal alterations. Our finding of a unique tumor profile in group A suggests the involvement of novel predisposing genes and pathways in colorectal cancer cohorts not linked to MMR gene defects. We investigated the genetic background of familial ECs. Among 22 families with clustering of EC, two (9%) were due to MMR gene germline mutations. The remaining familial site-specific ECs are largely comparable with HNPCC associated ECs, the main difference between these groups being MMR proficiency vs. deficiency.
We studied the role of the PI3K/AKT pathway in familial ECs as well and observed that PIK3CA amplifications are characteristic of familial site-specific EC without MMR gene germline mutations. Most of the high-level amplifications occurred in tumors with stable microsatellites, suggesting that these tumors are more likely associated with chromosomal rather than microsatellite instability and MMR defect. The existence of site-specific endometrial carcinoma as a separate entity remains equivocal until predisposing genes are identified. It is possible that no single highly penetrant gene for this proposed syndrome exists; it may, for example, be due to a combination of multiple low penetrance genes. Despite advances in deciphering the molecular genetic background of HNPCC, it is poorly understood why certain organs are more susceptible than others to cancer development. We found that important determinants of the HNPCC tumor spectrum are, in addition to different predisposing germline mutations, organ specific target genes and different instability profiles, loss of heterozygosity at MLH1 locus, and MLH1 promoter methylation. This study provided more precise molecular classification of families with CRC and EC. Our observations on familial CRC and EC are likely to have broader significance that extends to sporadic CRC and EC as well.
Abstract:
Many of the genes predisposing to highly penetrant colorectal cancer (CRC) syndromes, including hereditary non-polyposis colorectal cancer (MLH1, MSH2, MSH6, PMS2), familial adenomatous polyposis (APC), Peutz-Jeghers syndrome (LKB1), juvenile polyposis (SMAD4, BMPR1A), MYH-associated polyposis (MYH), and Cowden syndrome (PTEN) have already been discovered. Identification of these genes has allowed a more precise classification of the hereditary CRC syndromes and provided a means for predictive genetic testing and surveillance. Some of the genes are also involved in sporadic cancer forms, and therefore the investigation of the rare CRC syndromes has been a breakthrough for general cancer research. Despite the accumulating knowledge on hereditary cancer syndromes, a significant number of familial CRCs remain molecularly unexplained after genetic testing, reflecting the possibility of other predisposing genes or existence of novel syndromes. Moreover, genetic variants conferring low-penetrance risk are still largely unknown. In this study, we examined the role of some new high- and low-penetrance alleles on CRC predisposition. We identified disease causing MYH mutations in a subset (9%) of patients with APC and AXIN2 mutation negative adenomatous polyposis. Due to differences in the pattern of inheritance and clinical manifestation, screening for mutations in MYH is beneficial in view of genetic counselling and surveillance. A novel functionally deficient MYH founder mutation A459D was identified in the Finnish population, and this finding had immediate clinical implications for genetic counselling of at risk families. Many patients with hamartomatous polyposis remain without molecular diagnosis due to atypical phenotypes. We therefore sought to classify 49 patients with unexplained hamartomatous or hyperplastic/mixed polyposis by extensive molecular analyses of PTEN, LKB1, BMPR1A, SMAD4, ENG, BRAF, MYH, and BHD along with revision of polyp histology. 
Mutations were identified in 11/49 (22%) of the patients. In 6 cases the molecular diagnosis was re-classified guiding surveillance and decisions for prophylactic surgery. Re-evaluation of polyp histology with subsequent more accurate selection of candidate gene analyses is beneficial and can be recommended for patients with unexplained polyposis. Furthermore, germline mutations in ENG underlying juvenile polyposis were described for the first time, characterizing a possible novel genetically defined form of hereditary CRC. Association analyses on two putative low-penetrance alleles, NOD2 3020insC and MDM2 SNP309 were performed in a population-based series of 1042 Finnish CRC patients and in cancer-free controls. In contrast to previous results, NOD2 3020insC did not associate with CRC or age at disease onset in the Finnish population. These data suggest that NOD2 3020insC alone might not be sufficient for CRC predisposition. MDM2 SNP309 was as common in the CRC cohort as in the healthy controls. Interesting trends, however, were observed, which after correction for multiple testing did not reach statistical significance. SNP309 was more common in female CRC patients and a trend towards an earlier age at disease onset was observed in women with SNP309. Subsequent studies have supported this observation and SNP309 could affect gender- or hormone-related tumorigenesis. Finally, a large-scale unbiased effort was designed to characterize the complete mutatome of CRC with microsatellite instability (MSI). Using an approach combining expression microarray and genome database searches, we were able to identify putative MSI target genes. Further characterization of one of the genes suggested that it might play a role also in microsatellite stable CRC and Peutz-Jeghers syndrome pathogenesis.
Abstract:
The aim of this work was to assess the structure and use of the conceptual model of occlusion in operational weather forecasting. First, a survey was made of the conceptual model of occlusion as introduced to operational forecasters in the Finnish Meteorological Institute (FMI). In the same context, an overview was made of the use of the conceptual model in modern operational weather forecasting, especially in connection with the widespread use of numerical forecasts. In order to evaluate the features of the occlusions in operational weather forecasting, all the occlusion processes occurring during the year 2003 over Europe and the North Atlantic area were investigated using the conceptual model of occlusion and the methods suggested in the FMI. The investigation yielded a classification of the occluded cyclones on the basis of the extent to which the conceptual model fitted the description of the observed thermal structure. The seasonal and geographical distribution of the classes was inspected. Some relevant cases belonging to different classes were collected and analyzed in detail: in this deeper investigation, tools and techniques that are not routinely used in operational weather forecasting were adopted. Both the statistical investigation of the occluded cyclones during the year 2003 and the case studies revealed that the traditional classification of the types of occlusion on the basis of the thermal structure does not take into account the greater variety of occlusion structures that can be observed. Moreover, the conceptual model of occlusion turned out to be often inadequate in describing well developed cyclones. A deep and constructive revision of the conceptual model of occlusion is therefore suggested in light of the results obtained in this work.
The revision should take into account both the progress being made in building a theoretical footing for the occlusion process and the recent tools and meteorological quantities that are now available.
Abstract:
A new rock mass classification scheme, the Host Rock Classification system (HRC-system), has been developed for evaluating the suitability of volumes of rock mass for the disposal of high-level nuclear waste in Precambrian crystalline bedrock. To support the development of the system, the requirements of host rock to be used for disposal have been studied in detail and the significance of the various rock mass properties has been examined. The HRC-system considers both the long-term safety of the repository and the constructability in the rock mass. The system is specific to the KBS-3V disposal concept and can be used only at sites that have been evaluated to be suitable at the site scale. By using the HRC-system, it is possible to identify potentially suitable volumes within the site at several different scales (repository, tunnel and canister scales). The selection of the classification parameters to be included in the HRC-system is based on an extensive study on the rock mass properties and their various influences on the long-term safety, the constructability and the layout and location of the repository. The parameters proposed for the classification at the repository scale include fracture zones, strength/stress ratio, hydraulic conductivity and the Groundwater Chemistry Index. The parameters proposed for the classification at the tunnel scale include hydraulic conductivity, Q´ and fracture zones, and the parameters proposed for the classification at the canister scale include hydraulic conductivity, Q´, fracture zones, fracture width (aperture + filling) and fracture trace length. The parameter values will be used to determine the suitability classes for the volumes of rock to be classified.
The HRC-system includes four suitability classes at the repository and tunnel scales and three suitability classes at the canister scale and the classification process is linked to several important decisions regarding the location and acceptability of many components of the repository at all three scales. The HRC-system is, thereby, one possible design tool that aids in locating the different repository components into volumes of host rock that are more suitable than others and that are considered to fulfil the fundamental requirements set for the repository host rock. The generic HRC-system, which is the main result of this work, is also adjusted to the site-specific properties of the Olkiluoto site in Finland and the classification procedure is demonstrated by a test classification using data from Olkiluoto. Keywords: host rock, classification, HRC-system, nuclear waste disposal, long-term safety, constructability, KBS-3V, crystalline bedrock, Olkiluoto
Abstract:
This thesis consists of an introduction, four research articles and an appendix. The thesis studies relations between two different approaches to the continuum limit of models of two dimensional statistical mechanics at criticality. The approach of conformal field theory (CFT) could be thought of as the algebraic classification of some basic objects in these models. It has been successfully used by physicists since the 1980s. The other approach, Schramm-Loewner evolutions (SLEs), is a recently introduced set of mathematical methods to study random curves or interfaces occurring in the continuum limit of the models. The first and second included articles argue on the basis of statistical mechanics what would be a plausible relation between SLEs and conformal field theory. The first article studies multiple SLEs, several random curves simultaneously in a domain. The proposed definition is compatible with a natural commutation requirement suggested by Dubédat. The curves of multiple SLE may form different topological configurations, "pure geometries". We conjecture a relation between the topological configurations and CFT concepts of conformal blocks and operator product expansions. Example applications of multiple SLEs include crossing probabilities for percolation and the Ising model. The second article studies SLE variants that represent models with boundary conditions implemented by primary fields. The best known of these, SLE(kappa, rho), is shown to be simple in terms of the Coulomb gas formalism of CFT. In the third article the space of local martingales for variants of SLE is shown to carry a representation of the Virasoro algebra. Finding this structure is guided by the relation of SLEs and CFTs in general, but the result is established in a straightforward fashion. This article, too, emphasizes multiple SLEs and proposes a possible way of treating pure geometries in terms of Coulomb gas.
The fourth article states results of applications of the Virasoro structure to the open questions of SLE reversibility and duality. Proofs of the stated results are provided in the appendix. The objective is an indirect computation of certain polynomial expected values. Provided that these expected values exist, in generic cases they are shown to possess the desired properties, thus giving support for both reversibility and duality.
Abstract:
Bacteria play an important role in many ecological systems. The molecular characterization of bacteria using either cultivation-dependent or cultivation-independent methods reveals the large scale of bacterial diversity in natural communities, and the vastness of subpopulations within a species or genus. Understanding how bacterial diversity varies across different environments and also within populations should provide insights into many important questions of bacterial evolution and population dynamics. This thesis presents novel statistical methods for analyzing bacterial diversity using widely employed molecular fingerprinting techniques. The first objective of this thesis was to develop Bayesian clustering models to identify bacterial population structures. Bacterial isolates were identified using multilocus sequence typing (MLST), and Bayesian clustering models were used to explore the evolutionary relationships among isolates. Our method involves the inference of genetic population structures via an unsupervised clustering framework where the dependence between loci is represented using graphical models. The population dynamics that generate such a population stratification were investigated using a stochastic model, in which homologous recombination between subpopulations can be quantified within a gene flow network. The second part of the thesis focuses on cluster analysis of community compositional data produced by two different cultivation-independent analyses: terminal restriction fragment length polymorphism (T-RFLP) analysis, and fatty acid methyl ester (FAME) analysis. The cluster analysis aims to group bacterial communities that are similar in composition, which is an important step for understanding the overall influences of environmental and ecological perturbations on bacterial diversity.
A common feature of T-RFLP and FAME data is zero-inflation, which indicates that the observation of a zero value is much more frequent than would be expected, for example, from a Poisson distribution in the discrete case, or a Gaussian distribution in the continuous case. We provided two strategies for modeling zero-inflation in the clustering framework, which were validated by both synthetic and empirical complex data sets. We show in the thesis that our model, which takes into account dependencies between loci in MLST data, can produce better clustering results than methods that assume independent loci. Furthermore, computer algorithms that are efficient in analyzing large scale data were adopted to meet the increasing computational need. Our method that detects homologous recombination in subpopulations may provide a theoretical criterion for defining bacterial species. The clustering of bacterial community data, including T-RFLP and FAME data, provides an initial effort for discovering the evolutionary dynamics that structure and maintain bacterial diversity in the natural environment.
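To make the notion of zero-inflation concrete, here is a minimal sketch that compares the observed fraction of zeros in a count sample with the fraction a Poisson distribution with the same mean would predict. The function name and the toy data are hypothetical and are not taken from the thesis; the thesis's own modeling strategies are not reproduced here.

```python
import math


def zero_fractions(counts):
    """Return (observed zero fraction, Poisson-expected zero fraction).

    The Poisson rate is set to the sample mean, so the expected
    probability of a zero is exp(-mean).
    """
    n = len(counts)
    if n == 0:
        raise ValueError("counts must be non-empty")
    mean = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-mean)
    return observed, expected


# Hypothetical toy sample: many zeros plus a few large counts.
counts = [0, 0, 0, 0, 0, 0, 5, 7, 0, 8]
observed, expected = zero_fractions(counts)
print(f"observed zeros: {observed:.2f}, Poisson-expected zeros: {expected:.4f}")
```

An observed fraction far above the expected one, as in this toy sample, is the pattern the abstract describes as zero-inflation.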
Abstract:
In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data with an application to mobile device positioning. In the second part of the Thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
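The link between Bayesian network classifiers and logistic regression can be illustrated in the simplest case, a naive Bayes classifier with a binary class $c$ and features $x_1, \dots, x_n$ assumed conditionally independent given $c$. This is a standard textbook identity offered as an illustration; the notation is an assumption, and the thesis's own result may be stated differently and more generally.

```latex
\log \frac{P(c = 1 \mid x)}{P(c = 0 \mid x)}
  = \log \frac{P(c = 1)}{P(c = 0)}
  + \sum_{i=1}^{n} \log \frac{P(x_i \mid c = 1)}{P(x_i \mid c = 0)}
```

The right-hand side is a constant plus a sum of per-feature terms, so the posterior $P(c = 1 \mid x)$ is a logistic function of an additive score, which is the functional form of a logistic regression model.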
Abstract:
In visual object detection and recognition, classifiers have two interesting characteristics: accuracy and speed. Accuracy depends on the complexity of the image features and classifier decision surfaces. Speed depends on the hardware and the computational effort required to use the features and decision surfaces. When attempts to increase accuracy lead to increases in complexity and effort, it is necessary to ask how much we are willing to pay for increased accuracy. For example, if increased computational effort implies quickly diminishing returns in accuracy, then those designing inexpensive surveillance applications cannot aim for maximum accuracy at any cost. It becomes necessary to find trade-offs between accuracy and effort. We study efficient classification of images depicting real-world objects and scenes. Classification is efficient when a classifier can be controlled so that the desired trade-off between accuracy and effort (speed) is achieved and unnecessary computations are avoided on a per input basis. A framework is proposed for understanding and modeling efficient classification of images. Classification is modeled as a tree-like process. In designing the framework, it is important to recognize what is essential and to avoid structures that are narrow in applicability. Earlier frameworks are lacking in this regard. The overall contribution is two-fold. First, the framework is presented, subjected to experiments, and shown to be satisfactory. Second, certain unconventional approaches are experimented with. This allows the separation of the essential from the conventional. To determine if the framework is satisfactory, three categories of questions are identified: trade-off optimization, classifier tree organization, and rules for delegation and confidence modeling. Questions and problems related to each category are addressed and empirical results are presented.
For example, related to trade-off optimization, we address the problem of computational bottlenecks that limit the range of trade-offs. We also ask if accuracy versus effort trade-offs can be controlled after training. For another example, regarding classifier tree organization, we first consider the task of organizing a tree in a problem-specific manner. We then ask if problem-specific organization is necessary.
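The idea of delegation governed by confidence can be sketched as a chain of classifiers (a degenerate tree) in which each stage either answers or passes the input on to a costlier stage. Everything in this sketch is a hypothetical illustration: the stage functions, the cost units, and the threshold rule are assumptions, not the framework proposed in the thesis.

```python
def classify_with_delegation(x, stages, threshold):
    """Run stages in order and stop at the first confident answer.

    stages: list of (predict, cost) pairs, where predict(x) returns
    (label, confidence in [0, 1]) and cost is an abstract effort unit.
    Returns (label, total effort spent on this input).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    effort = 0.0
    label = None
    for predict, cost in stages:
        label, confidence = predict(x)
        effort += cost
        if confidence >= threshold:
            break  # confident enough: do not delegate further
    return label, effort


def cheap_stage(x):
    # Hypothetical cheap feature: mean intensity; confidence grows with
    # the distance of the mean from the decision boundary 0.5.
    mean = sum(x) / len(x)
    return ("bright" if mean > 0.5 else "dark", abs(mean - 0.5) * 2)


def costly_stage(x):
    # Hypothetical expensive feature: peak intensity; always confident.
    return ("bright" if max(x) > 0.8 else "dark", 1.0)


stages = [(cheap_stage, 1.0), (costly_stage, 10.0)]
print(classify_with_delegation([0.9, 0.95, 1.0], stages, threshold=0.6))
print(classify_with_delegation([0.45, 0.55, 0.9], stages, threshold=0.6))
```

In this sketch the threshold is the only control knob: lowering it after training trades accuracy for effort without retraining any stage, which mirrors the question the abstract raises about controlling trade-offs after training.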