93 resultados para federated search tool
Resumo:
XML documents are becoming more and more common in various environments. In particular, enterprise-scale document management is commonly centred around XML, and desktop applications as well as online document collections are soon to follow. The growing number of XML documents increases the importance of appropriate indexing methods and search tools in keeping the information accessible. Therefore, we focus on content that is stored in XML format as we develop such indexing methods. Because XML is used for different kinds of content ranging all the way from records of data fields to narrative full-texts, the methods for Information Retrieval are facing a new challenge in identifying which content is subject to data queries and which should be indexed for full-text search. In response to this challenge, we analyse the relation of character content and XML tags in XML documents in order to separate the full-text from data. As a result, we are able to both reduce the size of the index by 5-6\% and improve the retrieval precision as we select the XML fragments to be indexed. Besides being challenging, XML comes with many unexplored opportunities which are not paid much attention in the literature. For example, authors often tag the content they want to emphasise by using a typeface that stands out. The tagged content constitutes phrases that are descriptive of the content and useful for full-text search. They are simple to detect in XML documents, but also possible to confuse with other inline-level text. Nonetheless, the search results seem to improve when the detected phrases are given additional weight in the index. Similar improvements are reported when related content is associated with the indexed full-text including titles, captions, and references. Experimental results show that for certain types of document collections, at least, the proposed methods help us find the relevant answers. Even when we know nothing about the document structure but the XML syntax, we are able to take advantage of the XML structure when the content is indexed for full-text search.
Resumo:
Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data is available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example are genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence, without any occasional extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither the statistical dependency nor the statistical significance are monotonic properties, which means that the traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measures. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measures, like Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm is well-scalable, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or if the data still contains better, but undiscovered dependencies.
Resumo:
This thesis presents a highly sensitive genome wide search method for recessive mutations. The method is suitable for distantly related samples that are divided into phenotype positives and negatives. High throughput genotype arrays are used to identify and compare homozygous regions between the cohorts. The method is demonstrated by comparing colorectal cancer patients against unaffected references. The objective is to find homozygous regions and alleles that are more common in cancer patients. We have designed and implemented software tools to automate the data analysis from genotypes to lists of candidate genes and to their properties. The programs have been designed in respect to a pipeline architecture that allows their integration to other programs such as biological databases and copy number analysis tools. The integration of the tools is crucial as the genome wide analysis of the cohort differences produces many candidate regions not related to the studied phenotype. CohortComparator is a genotype comparison tool that detects homozygous regions and compares their loci and allele constitutions between two sets of samples. The data is visualised in chromosome specific graphs illustrating the homozygous regions and alleles of each sample. The genomic regions that may harbour recessive mutations are emphasised with different colours and a scoring scheme is given for these regions. The detection of homozygous regions, cohort comparisons and result annotations are all subjected to presumptions many of which have been parameterized in our programs. The effect of these parameters and the suitable scope of the methods have been evaluated. Samples with different resolutions can be balanced with the genotype estimates of their haplotypes and they can be used within the same study.
Resumo:
Current smartphones have a storage capacity of several gigabytes. More and more information is stored on mobile devices. To meet the challenge of information organization, we turn to desktop search. Users often possess multiple devices, and synchronize (subsets of) information between them. This makes file synchronization more important. This thesis presents Dessy, a desktop search and synchronization framework for mobile devices. Dessy uses desktop search techniques, such as indexing, query and index term stemming, and search relevance ranking. Dessy finds files by their content, metadata, and context information. For example, PDF files may be found by their author, subject, title, or text. EXIF data of JPEG files may be used in finding them. User–defined tags can be added to files to organize and retrieve them later. Retrieved files are ranked according to their relevance to the search query. The Dessy prototype uses the BM25 ranking function, used widely in information retrieval. Dessy provides an interface for locating files for both users and applications. Dessy is closely integrated with the Syxaw file synchronizer, which provides efficient file and metadata synchronization, optimizing network usage. Dessy supports synchronization of search results, individual files, and directory trees. It allows finding and synchronizing files that reside on remote computers, or the Internet. Dessy is designed to solve the problem of efficient mobile desktop search and synchronization, also supporting remote and Internet search. Remote searches may be carried out offline using a downloaded index, or while connected to the remote machine on a weak network. To secure user data, transmissions between the Dessy client and server are encrypted using symmetric encryption. Symmetric encryption keys are exchanged with RSA key exchange. Dessy emphasizes extensibility. Also the cryptography can be extended. Users may tag their files with context tags and control custom file metadata. Adding new indexed file types, metadata fields, ranking methods, and index types is easy. Finding files is done with virtual directories, which are views into the user’s files, browseable by regular file managers. On mobile devices, the Dessy GUI provides easy access to the search and synchronization system. This thesis includes results of Dessy synchronization and search experiments, including power usage measurements. Finally, Dessy has been designed with mobility and device constraints in mind. It requires only MIDP 2.0 Mobile Java with FileConnection support, and Java 1.5 on desktop machines.
Resumo:
During the past ten years, large-scale transcript analysis using microarrays has become a powerful tool to identify and predict functions for new genes. It allows simultaneous monitoring of the expression of thousands of genes and has become a routinely used tool in laboratories worldwide. Microarray analysis will, together with other functional genomics tools, take us closer to understanding the functions of all genes in genomes of living organisms. Flower development is a genetically regulated process which has mostly been studied in the traditional model species Arabidopsis thaliana, Antirrhinum majus and Petunia hybrida. The molecular mechanisms behind flower development in them are partly applicable in other plant systems. However, not all biological phenomena can be approached with just a few model systems. In order to understand and apply the knowledge to ecologically and economically important plants, other species also need to be studied. Sequencing of 17 000 ESTs from nine different cDNA libraries of the ornamental plant Gerbera hybrida made it possible to construct a cDNA microarray with 9000 probes. The probes of the microarray represent all different ESTs in the database. From the gerbera ESTs 20% were unique to gerbera while 373 were specific to the Asteraceae family of flowering plants. Gerbera has composite inflorescences with three different types of flowers that vary from each other morphologically. The marginal ray flowers are large, often pigmented and female, while the central disc flowers are smaller and more radially symmetrical perfect flowers. Intermediate trans flowers are similar to ray flowers but smaller in size. This feature together with the molecular tools applied to gerbera, make gerbera a unique system in comparison to the common model plants with only a single kind of flowers in their inflorescence. In the first part of this thesis, conditions for gerbera microarray analysis were optimised including experimental design, sample preparation and hybridization, as well as data analysis and verification. Moreover, in the first study, the flower and flower organ-specific genes were identified. After the reliability and reproducibility of the method were confirmed, the microarrays were utilized to investigate transcriptional differences between ray and disc flowers. This study revealed novel information about the morphological development as well as the transcriptional regulation of early stages of development in various flower types of gerbera. The most interesting finding was differential expression of MADS-box genes, suggesting the existence of flower type-specific regulatory complexes in the specification of different types of flowers. The gerbera microarray was further used to profile changes in expression during petal development. Gerbera ray flower petals are large, which makes them an ideal model to study organogenesis. Six different stages were compared and specifically analysed. Expression profiles of genes related to cell structure and growth implied that during stage two, cells divide, a process which is marked by expression of histones, cyclins and tubulins. Stage 4 was found to be a transition stage between cell division and expansion and by stage 6 cells had stopped division and instead underwent expansion. Interestingly, at the last analysed stage, stage 9, when cells did not grow any more, the highest number of upregulated genes was detected. The gerbera microarray is a fully-functioning tool for large-scale studies of flower development and correlation with real-time RT-PCR results show that it is also highly sensitive and reliable. Gene expression data presented here will be a source for gene expression mining or marker gene discovery in the future studies that will be performed in the Gerbera Laboratory. The publicly available data will also serve the plant research community world-wide.
Resumo:
The purpose of this study was to evaluate the use of sentinel node biopsy (SNB) in the axillary nodal staging in breast cancer. A special interest was in sentinel node (SN) visualization, intraoperative detection of SN metastases, the feasibility of SNB in patients with pure tubular carcinoma (PTC) and in those with ductal carcinoma in situ (DCIS) in core needle biopsy (CNB) and additionally in the detection of axillary recurrences after tumour negative SNB. Patients and methods. 1580 clinically stage T1-T2 node-negative breast cancer patients, who underwent lymphoscintigraphy (LS), SNB and breast surgery between June 2000 - 2004 at the Breast Surgery Unit. The CNB samples were obtained from women, who participated the biennial, population based mammography screening at the Mammography Screening Centre of Helsinki 2001 - 2004.In the follow- up, a cohort of 205 patients who avoided AC due to negative SNB findings were evaluated using ultrasonography one and three years after breast surgery. Results. The visualization rate of axillary SNs was not enhanced by adjusting radioisotope doses according to BMI. The sensitivity of the intraoperative diagnosis of SN metastases of invasive lobular carcinoma (ILC) was higher, 87%, with rapid, intraoperative immunohistochemistry (IHC) group compared to 66% without it. The prevalence of tumour positive SN findings was 27% in the 33 patients with breast tumours diagnosed as PTC. The median histological tumour size was similar in patients with or without axillary metastases. After the histopathological review, six out of 27 patients with true PTC had axillary metastases, with no significant change in the risk factors for axillary metastases. Of the 67 patients with DCIS in the preoperative percutaneous biopsy specimen , 30% had invasion in the surgical specimen. The strongest predictive factor for invasion was the visibility of the lesion in ultrasound. In the three year follow-up, axillary recurrence was found in only two (0.5%) of the total of 383 ultrasound examinations performed during the study, and only one of the 369 examinations revealed cancer. None of the ultrasound examinations were false positive, and no study participant was subjected to unnecessary surgery due to ultrasound monitoring. Conclusions. Adjusting the dose of the radioactive tracer according to patient BMI does not increase the visualization rate of SNs. The intraoperative diagnosis of SN metastases is enhanced by rapid IHC particularly in patients with ILC. SNB seems to be a feasible method for axillary staging of pure tubular carcinoma in patients with a low prevalence of axillary metatastases. SNB also appears to be a sensible method in patients undergoing mastectomy due to DCIS in CNB. It also seems useful in patients with lesions visible in breast US. During follow-up, routine monitoring of the ipsilateral axilla using US is not worthwhile among breast cancer patients who avoided AC due to negative SN findings.
Resumo:
By detecting leading protons produced in the Central Exclusive Diffractive process, p+p → p+X+p, one can measure the missing mass, and scan for possible new particle states such as the Higgs boson. This process augments - in a model independent way - the standard methods for new particle searches at the Large Hadron Collider (LHC) and will allow detailed analyses of the produced central system, such as the spin-parity properties of the Higgs boson. The exclusive central diffractive process makes possible precision studies of gluons at the LHC and complements the physics scenarios foreseen at the next e+e− linear collider. This thesis first presents the conclusions of the first systematic analysis of the expected precision measurement of the leading proton momentum and the accuracy of the reconstructed missing mass. In this initial analysis, the scattered protons are tracked along the LHC beam line and the uncertainties expected in beam transport and detection of the scattered leading protons are accounted for. The main focus of the thesis is in developing the necessary radiation hard precision detector technology for coping with the extremely demanding experimental environment of the LHC. This will be achieved by using a 3D silicon detector design, which in addition to the radiation hardness of up to 5×10^15 neutrons/cm2, offers properties such as a high signal-to- noise ratio, fast signal response to radiation and sensitivity close to the very edge of the detector. This work reports on the development of a novel semi-3D detector design that simplifies the 3D fabrication process, but conserves the necessary properties of the 3D detector design required in the LHC and in other imaging applications.
Resumo:
This three-phase design research describes the modelling processes for DC-circuit phenomena. The first phase presents an analysis of the development of the DC-circuit historical models in the context of constructing Volta s pile at the turn of the 18th century. The second phase involves the designing of a teaching experiment for comprehensive school third graders. Among other considerations, the design work utilises the results of the first phase and research literature of pupils mental models for DC-circuit phenomena. The third phase of the research was concerned with the realisation of the planned teaching experiment. The aim of this phase was to study the development of the external representations of DC-circuit phenomena in a small group of third graders. The aim of the study has been to search for new ways to guide pupils to learn DC-circuit phenomena while emphasing understanding at the qualitative level. Thus, electricity, which has been perceived as a difficult and abstract subject, could be learnt more comprehensively. Especially, the research of younger pupils learning of electricity concepts has not been of great interest at the international level, although DC-circuit phenomena are also taught in the lower classes of comprehensive schools. The results of this study are important, because there has tended to be more teaching of natural sciences in the lower classes of comprehensive schools, and attempts are being made to develop this trend in Finland. In the theoretical part of the research an Experimental-centred representation approach, which emphasises the role of experimentalism in the development of pupil s representations, is created. According to this approach learning at the qualitative level consists of empirical operations like experimenting, observations, perception, and prequantification of nature phenomena, and modelling operations like explaining and reasoning. Besides planning teaching, the new approach can be used as an analysis tool in describing both historical modelling and the development of pupils representations. In the first phase of the study, the research question was: How did the historical models of DC-circuit phenomena develop in Volta s time? The analysis uncovered three qualitative historical models associated with the historical concept formation process. The models include conceptions of the electric circuit as a scene in the DC-circuit phenomena, the comparative electric-current phenomenon as a cause of different observable effect phenomena, and the strength of the battery as a cause of the electric-current phenomenon. These models describe the concept formation process and its phases in Volta s time. The models are portrayed in the analysis using fragments of the models, where observation-based fragments and theoretical fragements are distinguished from each other. The results emphasise the significance of the qualitative concept formation and the meaning of language in the historical modelling of DC-circuit phenomena. For this reason these viewpoints are stressed in planning the teaching experiment in the second phase of the research. In addition, the design process utilised the experimentation behind the historical models of DC-circuit phenomena In the third phase of the study the research question is as follows: How will the small group s external representations of DC-circuit phenomena develop during the teaching experiment? The main question is divided into the following two sub questions: What kind of talk exists in the small group s learning? What kinds of external representations for DC-circuit phenomena exist in the small group discourse during the teaching experiment? The analysis revealed that the teaching experiment of the small group succeeded in its aim to activate talk in the small group. The designed connection cards proved especially successful in activating talk. The connection cards are cards that represent the components of the electric circuit. In the teaching experiment the pupils constructed different connections with the connection cards and discussed, what kinds of DC-circuit phenomena would take place in the corresponding real connections. The talk of the small group was analysed by comparing two situations, firstly, when the small group discussed using connections made with the connection cards and secondly with the same connections using real components. According to the results the talk of the small group included more higher-order thinking when using the connection cards than with similar real components. In order to answer the second sub question concerning the small group s external representations that appeared in the talk during the teaching experiment; student talk was visualised by the fragment maps which incorporate the electric circuit, the electric current and the source voltage. The fragment maps represent the gradual development of the external representations of DC-circuit phenomena in the small group during the teaching experiment. The results of the study challenge the results of previous research into the abstractness and difficulty of electricity concepts. According to this research, the external representations of DC-circuit phenomena clearly developed in the small group of third graders. Furthermore, the fragment maps uncover that although the theoretical explanations of DC-circuit phenomena, which have been obtained as results of typical mental model studies, remain undeveloped, learning at the qualitative level of understanding does take place.