Digital Library

36 results for Set-Valued Mapping

in Helda - Digital Repository of University of Helsinki

Simulation and graph mining tools for improving gene mapping efficiency

Relevance:

30.00%

Abstract:

Gene mapping is a systematic search for genes that affect observable characteristics of an organism. In this thesis we offer computational tools to improve the efficiency of (disease) gene-mapping efforts. In the first part of the thesis we propose an efficient simulation procedure for generating realistic genetic data from isolated populations. Simulated data is useful for evaluating hypothesised gene-mapping study designs and computational analysis tools. As an example of such an evaluation, we demonstrate how a population-based study design can be a powerful alternative to traditional family-based designs in association-based gene-mapping projects. In the second part of the thesis we consider the prioritisation of a (typically large) set of putative disease-associated genes acquired from an initial gene-mapping analysis. Prioritisation is necessary to be able to focus on the most promising candidates. We show how to harness current biomedical knowledge for the prioritisation task by integrating various publicly available biological databases into a weighted biological graph. We then demonstrate how to find and evaluate connections between entities, such as genes and diseases, in this unified schema using graph mining techniques. Finally, in the last part of the thesis, we define the concept of a reliable subgraph and the corresponding subgraph extraction problem. Reliable subgraphs concisely describe strong and independent connections between two given vertices in a random graph, and hence they are especially useful for visualising such connections. We propose novel algorithms for extracting reliable subgraphs from large random graphs. The efficiency and scalability of the proposed graph mining methods are backed by extensive experiments on real data. While our application focus is on genetics, the concepts and algorithms can be applied to other domains as well. 
We demonstrate this generality by considering coauthor graphs in addition to biological graphs in the experiments.
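
The notion of a connection between two vertices in a random graph can be illustrated with a small sketch. It assumes that each edge exists independently with a given probability and estimates, by Monte Carlo sampling, how likely two vertices are to be connected. This is not the thesis's subgraph extraction algorithm; the function names and the toy graph are hypothetical.

```python
import random

def connected(edges, source, target):
    """Return True if target is reachable from source over undirected edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False

def estimate_reliability(prob_edges, source, target, samples=10000, seed=0):
    """Monte Carlo estimate of the probability that source and target are connected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # Sample one realisation of the random graph: keep each edge with its probability.
        realised = [(u, v) for (u, v, p) in prob_edges if rng.random() < p]
        if connected(realised, source, target):
            hits += 1
    return hits / samples

# Hypothetical toy graph: (vertex, vertex, edge probability).
toy_graph = [
    ("geneA", "protein1", 0.9),
    ("protein1", "diseaseX", 0.8),
    ("geneA", "pathway7", 0.5),
    ("pathway7", "diseaseX", 0.5),
]
reliability = estimate_reliability(toy_graph, "geneA", "diseaseX")
```

In this toy graph the two paths are independent, so the exact value is 1 - (1 - 0.72)(1 - 0.25) = 0.79, and the estimate should land close to it. A reliable subgraph, in the sense of the abstract, would be a small subgraph that preserves as much of this connection probability as possible.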

Fumarate Hydratase and Succinate Dehydrogenase in Neoplasia

Relevance:

20.00%

Abstract:

Germline mutations in fumarate hydratase (FH) cause hereditary leiomyomatosis and renal cell cancer (HLRCC). FH is a nuclear-encoded enzyme which functions in the Krebs tricarboxylic acid cycle, and homozygous mutations in FH lead to severe developmental defects. Both uterine and cutaneous leiomyomas are components of the HLRCC phenotype. Most of these tumours show loss of the wild-type allele, and the mutations also reduce FH enzyme activity, which indicates that FH is a tumour suppressor gene. The renal cell cancers associated with HLRCC are of the rare papillary type 2 histology. Other Krebs cycle genes implicated in neoplasia are three of the four genes encoding the subunits of succinate dehydrogenase (SDH); mutations in SDHB, SDHC, and SDHD predispose to paraganglioma and phaeochromocytoma. Although uterine leiomyomas (or fibroids) are very common, with estimates of the proportion of affected women ranging from 25% to 77%, not much is known about their genetic background. Cytogenetic studies have revealed that rearrangements involving chromosomes 6, 7, 12 and 14 are most commonly seen in fibroids. Deletions on the long arm of chromosome 7 have been reported in about 17 to 34% of leiomyomas, and the small commonly deleted region on 7q22 suggests that there might be an underlying tumour suppressor gene in that region. The purpose of this study was to investigate the genetic mechanisms behind the development of tumours associated with HLRCC, both renal cell cancer and uterine fibroids. Firstly, a database search at the Finnish cancer registry was conducted in order to identify new families with early-onset RCC and to test whether the family history was compatible with HLRCC. Secondly, sporadic uterine fibroids were tested for deletions on 7q in order to define the minimal deleted 7q-region, followed by mutation analysis of the candidate genes. 
Thirdly, oligonucleotide chips were utilised to study the global gene expression profiles of uterine fibroids in order to test whether 7q-deletions and FH mutations significantly affected fibroid biology. In the screen for early-onset RCC, 214 families were identified. Subsequently, the pedigrees were constructed and clinical data obtained. One of the index cases (RCC at the age of 28) had a mother who had been diagnosed with a heart tumour, which on further investigation turned out to be a paraganglioma. This led to an alternative hypothesis that SDH, instead of FH, could be involved. SDHA, SDHB, SDHC and SDHD were sequenced from these individuals; a germline SDHB R27X mutation was detected with loss of the wild-type allele in both tumours. These results suggest that germline mutations in the SDHB gene predispose to early-onset RCC, establishing a novel form of hereditary RCC. This has immediate clinical implications for the surveillance of patients suffering from early-onset RCC and phaeochromocytoma/paraganglioma. For the studies on sporadic uterine fibroids, a set of 166 fibroids from 51 individuals was collected. The 7q LOH mapping defined a commonly deleted region of about 3.2 megabases in 11 of the 166 tumours. The deletion was consistent with previously reported allelotyping studies of leiomyomas and therefore suggested the presence of a tumour suppressor gene in the deleted region. Furthermore, the high-resolution aCGH-chip analysis refined the deleted region to only 2.79 Mb. When combined with previous data, the commonly deleted region was only 2.3 Mb. The mutation screening of the known genes within the commonly deleted region did not reveal pathogenic mutations, however. The expression microarray analysis revealed that FH-deficient fibroids, both sporadic and familial, had a distinct gene expression profile, as they formed their own group in the unsupervised clustering. 
On the other hand, the presence or absence of 7q-deletions did not significantly alter the global gene expression pattern of fibroids, suggesting that these two groups do not have different biological backgrounds. Multiple differentially expressed genes were identified between FH wild-type and FH-mutant fibroids, and the most significant increase was seen in the expression of carbohydrate metabolism-related and hypoxia inducible factor (HIF) target genes.

Genetic Heterogeneity in Autism Spectrum Disorders in a Population Isolate

Relevance:

20.00%

Abstract:

Positional cloning has enabled hypothesis-free, genome-wide scans for genetic factors contributing to disorders or traits. Traditionally, linkage analysis has been used to identify regions of interest, followed by meticulous fine mapping and candidate gene screening using association methods and finally sequencing of regions of interest. More recently, genome-wide association analysis has enabled a more direct approach to identifying specific genetic variants explaining a part of the variance of the phenotype of interest. Autism spectrum disorders (ASDs) are a group of childhood-onset neuropsychiatric disorders with shared core symptoms but varying severity. Although a strong genetic component has been established in ASDs, genetic susceptibility factors have largely eluded characterization. Here, we have utilized modern molecular genetic methods combined with the advantages provided by the special population structure in Finland to identify genetic risk factors for ASDs. The results of this study show that numerous genetic risk factors exist for ASDs even within a population isolate. Stratification based on clinical phenotype gave encouraging results, as previously identified linkage to 3p14-p24 was replicated in an independent set of families with Asperger syndrome, but no other ASDs. Fine-mapping of the previously identified linkage peak for ASDs at 3q25-q27 revealed association between autism and a subunit of the 5-hydroxytryptamine receptor 3C (HTR3C). We also used dense, genome-wide single nucleotide polymorphism (SNP) data to characterize the population structure of Finns. We observed significant population substructure which correlates with the known history of multiple consecutive bottlenecks experienced by the Finnish population. We used this information to ascertain a genetically homogeneous subset of autism families to identify possible rare, enriched risk variants using genome-wide SNP data. 
No rare enriched genetic risk factors were identified in this dataset, although a subset of families could be genealogically linked to form two extended pedigrees. The lack of founder mutations in this isolated population suggests that the majority of genetic risk factors are rare, de novo mutations unique to individual nuclear families. The results of this study are consistent with others in the field. The underlying genetic architecture for this group of disorders appears highly heterogeneous, with common variants accounting for only a subset of genetic risk. The majority of identified risk factors have turned out to be exceedingly rare, and only explain a subset of the genetic risk in the general population in spite of their high penetrance within individual families. The results of this study, together with other results obtained in this field, indicate that family specific linkage, homozygosity mapping and resequencing efforts are needed to identify these rare genetic risk factors.

Burnt area mapping in insular Southeast Asia using medium resolution satellite imagery

Relevance:

20.00%

Abstract:

Burnt area mapping in humid tropical insular Southeast Asia using medium resolution (250-500 m) satellite imagery is characterized by persistent cloud cover, a wide range of land cover types, vast wetland areas and highly varying fire regimes. The objective of this study was to deepen understanding of three major aspects affecting the implementation and limits of medium resolution burnt area mapping in insular Southeast Asia: 1) fire-induced spectral changes, 2) the most suitable multitemporal compositing methods and 3) burn scar patterns and size distribution. The results revealed a high variation in fire-induced spectral changes depending on the pre-fire greenness of the burnt area. It was concluded that this variation needs to be taken into account in change-detection-based burnt area mapping algorithms in order to maximize the potential of medium resolution satellite data. The minimum near infrared (MODIS band 2, 0.86 μm) compositing method was found to be the most suitable for burnt area mapping using Moderate Resolution Imaging Spectroradiometer (MODIS) data. In general, medium resolution burnt area mapping was found to be usable in the wetlands of insular Southeast Asia, whereas in other areas its usability was seriously jeopardized by the small size of burn scars. The suitability of medium resolution data for burnt area mapping in wetlands is important because Southeast Asian wetlands have recently become a major point of interest in many fields of science, owing to annually occurring wildfires that not only degrade these unique ecosystems but also create a regional haze problem and release globally significant amounts of carbon into the atmosphere through burning peat. Finally, super-resolution MODIS images were tested, but they failed to improve the detection of small scars. Therefore, the super-resolution technique was not considered applicable to regional-level burnt area mapping in insular Southeast Asia.
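
The minimum near infrared compositing mentioned above can be sketched in a few lines. The sketch assumes a stack of co-registered NIR grids, one per acquisition date, with `None` marking cloud-masked pixels; the function name, data layout and reflectance values are hypothetical and not taken from the thesis.

```python
def min_nir_composite(nir_stack):
    """Pick, per pixel, the lowest valid NIR value and the date index it came from.

    nir_stack: list of 2-D grids (one per date); None marks a masked pixel.
    Returns (composite_values, chosen_date_index), both 2-D grids.
    """
    rows, cols = len(nir_stack[0]), len(nir_stack[0][0])
    values = [[None] * cols for _ in range(rows)]
    dates = [[None] * cols for _ in range(rows)]
    for t, grid in enumerate(nir_stack):
        for r in range(rows):
            for c in range(cols):
                v = grid[r][c]
                # Keep the observation if it is valid and lower than the current minimum.
                if v is not None and (values[r][c] is None or v < values[r][c]):
                    values[r][c] = v
                    dates[r][c] = t
    return values, dates

# Hypothetical reflectance values for three dates over a 2 x 2 pixel window.
stack = [
    [[0.30, 0.28], [None, 0.31]],
    [[0.12, 0.29], [0.27, None]],
    [[0.25, None], [0.26, 0.10]],
]
composite, chosen = min_nir_composite(stack)
```

Burnt surfaces generally reflect less near infrared light than healthy vegetation, so keeping the lowest value per pixel tends to preserve burnt observations. In practice the other bands would be taken from the same chosen date, and dark artefacts such as cloud shadows would need separate handling.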

Options for selecting dairy cattle for milk coagulation ability

Relevance:

20.00%

Abstract:

(Finnish title: Lypsylehmien maidon juoksettumiskyvyn jalostuskeinot.) This doctoral thesis examined how the cheesemaking quality of dairy cows' milk could be improved through selective breeding. The topic is important because an ever larger share of milk is used for cheesemaking. The study focused on milk coagulation ability, as it is one of the key factors affecting cheese yield. Milk coagulation ability varied considerably between cows, sires, herds, breeds and stages of lactation. Although there were large differences between herds in the coagulation ability of bulk tank milk, herd explained only a small part of the total variation in coagulation ability. Genetic differences between cows probably explain most of the differences observed in the coagulation ability of the herds' bulk tank milk. Good management and feeding nevertheless reduced, to some extent, the proportion of poorly coagulating bulk tank milk in herds. Holstein-Friesian cows had better milk coagulation ability than Ayrshire cows. Poor coagulation and non-coagulation were only a minor problem in Holstein-Friesians (10%), whereas a third of Ayrshire cows produced poorly coagulating or non-coagulating milk. Milk is said to be poorly coagulating when the curd is not firm enough to be cut half an hour after the addition of rennet. Milk classified as non-coagulating does not coagulate at all within half an hour and is therefore a very poor raw material for cheese dairies. About 40% of the differences between cows in milk coagulation ability were explained by genetic factors. Coagulation ability can thus be called a highly heritable trait. Three measurements per cow are quite sufficient for estimating the average coagulation ability of a cow's milk. At present, however, the problem with direct selection for coagulation ability is the lack of an automated measuring device suitable for large-scale use. 
For this reason, the thesis examined the possibilities of improving milk coagulation ability indirectly, through some other trait. Such a trait must be genetically associated with coagulation ability strongly enough for breeding through it to be possible. The traits studied were milk yield and udder health traits, which are already included in the total merit index of bulls, and milk protein content, casein content and milk pH, which are not included in the total merit index. The thesis also examined the possibilities for so-called marker-assisted selection by studying the inheritance of non-coagulation of milk and by mapping the chromosomal regions associated with it. Based on the results, breeding for udder health also improves milk coagulation ability to some extent and reduces non-coagulation in Ayrshire cows. In contrast, milk yield on the one hand and milk coagulation ability and non-coagulation on the other are genetically independent traits. The genetic association of milk protein and casein content with coagulation ability was likewise approximately zero. There was a fairly strong genetic association between milk pH and coagulation ability, so breeding for milk pH would also improve milk coagulation ability. It would probably not, however, reduce the number of cows producing non-coagulating milk. Because non-coagulation of milk is such a common problem in Finnish Ayrshire cows, the thesis investigated the background of the phenomenon in more detail. In all three data sets, about 10% of Ayrshire cows produced non-coagulating milk. During two years of monthly monitoring, some cows produced non-coagulating milk at almost every measurement. Non-coagulation of milk was associated with the stage of lactation, but none of the environmental factors could fully explain it. Instead, the indications that it is heritable grew stronger as the studies progressed. 
Finally, the research group succeeded in mapping the chromosomal regions causing non-coagulation to chromosomes 2 and 18, close to the DNA markers BMS1126 and BMS1355. According to the results, non-coagulation of milk is not associated with the casein genes that are central to the milk coagulation process. Instead, it is possible that the non-coagulation problem is caused by errors in the post-synthesis modification of the casein gene products. The matter requires thorough investigation, however. Based on the results of the thesis, culling animals that carry the non-coagulation gene from the breeding stock would be the most effective way to improve milk coagulation ability in the Finnish dairy cattle population.

From outcrops to dust : Mapping, testing, and quality assessment of aggregates

Relevance:

20.00%

Beasts on Fields. Human-Wildlife Conflicts in Nature-Culture Borderlands

Relevance:

20.00%

Abstract:

Human-wildlife conflicts are today an integral part of the rural development discourse. In this research, the main focus is on the spatial explanation, which is not a very common approach in the reviewed literature. My research hypothesis is based on the assumption that human-wildlife conflicts occur when a wild animal crosses a perceived borderline between nature and culture and enters the realm of the other. The borderline between nature and culture marks a perceived division of spatial content in our senses of place. The animal subject that crosses this border becomes a subject out of place, meaning that the animal is then located in a space where it should not be or where it does not belong according to tradition, custom, rules, law, public opinion, prevailing discourse or some other criteria set by human beings. The appearance of a wild animal in a domesticated space brings an uncontrolled subject into a space where humans have previously commanded total control of all other natural elements. A wild animal out of place may also threaten the biosecurity of the place in question. I carried out a case study in the Liwale district in south-eastern Tanzania to test my hypothesis during June and July 2002. I also collected documents and carried out interviews in Dar es Salaam in 2003. I studied the human-wildlife conflicts in six rural villages, where a total of 183 persons participated in the village meetings. My research methods included semi-structured interviews, participatory mapping, a questionnaire survey and Q-methodology. The rural communities in the Liwale district have a long history of co-existing with wildlife and they still have traditional knowledge of wildlife management and hunting. Wildlife conservation through the establishment of game reserves during the colonial era has escalated human-wildlife conflicts in the Liwale district. 
This study shows that the villagers perceive some wild animals differently in their images of the African countryside than the district and regional level civil servants do. From the small-scale subsistence farmers' point of view, wild animals continue to challenge the separation of the wild (the forests) and the domestic spaces (the cultivated fields) by moving across the perceived borders in search of food and shelter. As a result, the farmers may lose their crops, livestock or even their own lives in confrontations with wild animals. Human-wildlife conflicts in the Liwale district are manifold and cannot be explained simply on the basis of attitudes or perceived images of landscapes. However, the spatial explanation of these conflicts gives us a better understanding of why human-wildlife conflicts are so widely found across the world.

Lake sediment research as a part of lake management - case studies and implications from southern Finland

Relevance:

20.00%

Abstract:

In Finland one of the most important current issues in environmental management is the quality of surface waters. The increasing social importance of lakes and water systems has generated wide-ranging interest in lake restoration and management, especially concerning lakes suffering from eutrophication, but also from other environmental impacts. Most of the factors deteriorating the water quality in Finnish lakes are connected to human activities. Especially since the 1940s, intensified farming practices and the discharge of sewage from scattered settlements, cottages and industry have affected the lakes, which have simultaneously developed into recreational areas for a growing number of people. Therefore, this study focused on small lakes which are human-impacted, located close to settlement areas and of significant value to the local population. The aim of this thesis was to obtain information from lake sediment records for ongoing lake restoration activities and to show that a well-planned, properly focused lake sediment study is an essential part of the work related to the evaluation, target consideration and restoration of Finnish lakes. Altogether 11 lakes were studied. The study of Lake Kaljasjärvi was related to the gradual eutrophication of the lake. In lakes Ormajärvi, Suolijärvi, Lehee, Pyhäjärvi and Iso-Roine the main focus was on sediment mapping, as well as on the long-term changes in sedimentation, which were compared to Lake Pääjärvi. In Lake Hormajärvi the roles of different kinds of sedimentation environments in the eutrophication development of the lake's two basins were compared. Lake Orijärvi has not been eutrophied, but ore exploitation and the related acid mine drainage from the catchment area have influenced the lake drastically, and the changes caused by the metal load were investigated. 
The twin lakes Etujärvi and Takajärvi are slightly eutrophied, but also suffer problems associated with the erosion of the substantial peat accumulations covering the fringe areas of the lakes. These peat accumulations are related to Holocene water level changes, which were investigated. The methods used were chosen case-specifically for each lake. In general, acoustic soundings of the lakes, detailed description of the nature of the sediment and determinations of the physical properties of the sediment, such as water content, loss on ignition and magnetic susceptibility, were used, as was grain size analysis. A wide set of chemical analyses was also used. Diatom and chrysophycean cyst analyses were applied, and the diatom-inferred total phosphorus content was reconstructed. The results of these studies show that the ideal lake sediment study, as a part of a lake management project, should be two-phased. In the first phase, thorough mapping of sedimentation patterns should be carried out by soundings and adequate corings. The actual sampling, based on the preliminary results, must include at least one long core from the main sedimentation basin for determining the natural background state of the lake. The recent, artificially impacted development of the lake can then be determined by short-core and surface sediment studies. The sampling must again be focused on the basis of the sediment mapping, and it should represent all the different sedimentation environments and bottom dynamic zones, considering the inlets and outlets, as well as the effects of possible point loaders of the lake. In practice, the budget of lake management projects is usually limited and only the most essential work and analyses can be carried out. The set of chemical and biological analyses and dating methods must therefore be thoroughly considered and adapted to the specific management problem. 
The results also show that information obtained from a properly performed sediment study enhances the planning of the restoration, makes it possible to define the target of the remediation activities and improves the cost-efficiency of the project.

Composition operators on vector-valued BMOA and related function spaces

Relevance:

20.00%

Abstract:

A composition operator is a linear operator between spaces of analytic or harmonic functions on the unit disk, which precomposes a function with a fixed self-map of the disk. A fundamental problem is to relate properties of a composition operator to the function-theoretic properties of the self-map. In recent decades these operators have been very actively studied in connection with various function spaces. The study of composition operators lies at the intersection of two central fields of mathematical analysis: function theory and operator theory. This thesis consists of four research articles and an overview. In the first three articles the weak compactness of composition operators is studied on certain vector-valued function spaces. A vector-valued function takes its values in some complex Banach space. In the first and third articles sufficient conditions are given for a composition operator to be weakly compact on different versions of vector-valued BMOA spaces. In the second article characterizations are given for the weak compactness of a composition operator on harmonic Hardy spaces and spaces of Cauchy transforms, provided the functions take values in a reflexive Banach space. Composition operators are also considered on certain weak versions of the above function spaces. In addition, the relationship between different vector-valued function spaces is analyzed. In the fourth article weighted composition operators are studied on the scalar-valued BMOA space and its subspace VMOA. A weighted composition operator is obtained by first applying a composition operator and then a pointwise multiplier. A complete characterization is given for the boundedness and compactness of a weighted composition operator on BMOA and VMOA. Moreover, the essential norm of a weighted composition operator on VMOA is estimated. These results generalize many previously known results about composition operators and pointwise multipliers on these spaces.
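
In symbols, the two operators described in this abstract can be written as follows. The notation is an assumption made for illustration (the abstract itself fixes no symbols): $\mathbb{D}$ is the unit disk, $\varphi$ the self-map, $\psi$ the multiplier and $f$ a function in the space under study.

```latex
% Composition operator induced by a fixed self-map \varphi of the unit disk
C_\varphi f = f \circ \varphi, \qquad \varphi \colon \mathbb{D} \to \mathbb{D}.

% Linearity follows directly from the definition
C_\varphi(\alpha f + \beta g) = \alpha\, C_\varphi f + \beta\, C_\varphi g.

% Weighted composition operator: first compose with \varphi, then multiply pointwise by \psi
W_{\psi,\varphi} f = \psi \cdot (f \circ \varphi),
\qquad (W_{\psi,\varphi} f)(z) = \psi(z)\, f(\varphi(z)), \quad z \in \mathbb{D}.
```

Taking $\psi \equiv 1$ recovers the composition operator, and taking $\varphi(z) = z$ recovers the pointwise multiplier, which is why results on weighted composition operators cover both.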

Definability of the set of natural numbers in function algebras (Luonnollisten lukujen joukon määriteltävyys funktioalgebroissa)

Relevance:

20.00%

Abstract:

Let X be a topological space and K the real algebra of the reals, the complex numbers, the quaternions, or the octonions. The functions from X to K form an algebra T(X,K) with pointwise addition and multiplication. We study first-order definability of the constant function set N' corresponding to the set of the naturals in certain subalgebras of T(X,K). The vocabulary uses the symbols Constant, +, *, 0', and 1', where Constant denotes the predicate defining the constants, and 0' and 1' denote the constant functions with values 0 and 1 respectively. The most important result is the following. Let X be a topological space, K the real algebra of the reals, the complex numbers, the quaternions, or the octonions, and R a subalgebra of the algebra of all functions from X to K containing all constants. Then N' is definable in R, in the vocabulary above, if at least one of the following conditions is true. (1) The algebra R is a subalgebra of the algebra of all continuous functions containing a piecewise open mapping from X to K. (2) The space X is sigma-compact, and R is a subalgebra of the algebra of all continuous functions containing a function whose range contains a nonempty open set of K. (3) The algebra K is the set of reals or the complex numbers, and R contains a piecewise open mapping from X to K and does not contain an everywhere unbounded function. (4) The algebra R contains a piecewise open mapping from X to the set of the reals and a function whose range contains a nonempty open subset of K. Furthermore, R does not contain an everywhere unbounded function.

Bayesian QTL Mapping in Inbred and Outbred Experimental Designs

Relevance:

20.00%

Genetic mapping of complex traits: the case of Type 1 diabetes

Relevance:

20.00%

Vector-valued BMOA and composition operators

Relevance:

20.00%

Algorithms for Association-Based Gene Mapping

Relevance:

20.00%

Efficient search for statistically significant dependency rules in binary data

Relevance:

20.00%

Abstract:

Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem: searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example is genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine, i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which, in spite of their name and a similar appearance to dependency rules, do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence, without any incidental extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither statistical dependency nor statistical significance is a monotonic property, which means that the traditional pruning techniques do not work. 
As a solution, we first derive the mathematical basis for pruning the search space with any well-behaved statistical significance measure. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measure, such as Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm scales well, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or whether the data still contains better, but undiscovered, dependencies.
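
To make the notion of a statistically significant dependency rule concrete, the sketch below scores a single rule X->A on binary data with a one-sided Fisher's exact test. It only evaluates one given rule and does not reproduce the thesis's search or pruning algorithm; the function names and the toy data are hypothetical.

```python
from math import comb

def rule_counts(rows, antecedent, consequent):
    """Count n, n_X, n_A and n_XA for the rule X -> A.

    rows: list of sets, each holding the attributes that are true in that row.
    """
    n = len(rows)
    n_x = sum(1 for row in rows if antecedent <= row)
    n_a = sum(1 for row in rows if consequent in row)
    n_xa = sum(1 for row in rows if antecedent <= row and consequent in row)
    return n, n_x, n_a, n_xa

def fisher_p_positive(n, n_x, n_a, n_xa):
    """One-sided Fisher p-value: P(overlap >= n_xa) if X and A were independent."""
    upper = min(n_x, n_a)
    # Hypergeometric tail: ways to place k of the n_x rows among the n_a rows with A.
    tail = sum(comb(n_a, k) * comb(n - n_a, n_x - k) for k in range(n_xa, upper + 1))
    return tail / comb(n, n_x)

# Hypothetical toy data: each row lists the attributes that are present.
rows = [
    {"smoker", "gene1", "disease"},
    {"smoker", "gene1", "disease"},
    {"smoker", "gene1", "disease"},
    {"smoker"},
    {"gene1"},
    {"disease"},
    set(),
    set(),
]
counts = rule_counts(rows, {"smoker", "gene1"}, "disease")
p_value = fisher_p_positive(*counts)
```

Here the rule {smoker, gene1} -> disease covers 3 of 8 rows, all 3 with the disease, giving p = 4/56. A negative rule X->not A would use the lower tail of the same distribution. Since the number of candidate antecedents grows exponentially with the number of attributes, scoring every rule this way is infeasible at scale, which is the difficulty the abstract's pruning theory addresses.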
