3 resultados para degenerate test set
em AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Resumo:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has lead to enormous progress in the life science. Among the most important innovations is the microarray tecnology. It allows to quantify the expression for thousands of genes simultaneously by measurin the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousand, and a sample noise in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex disease such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. these samples are hybridized to microarrays in an effort to find a small number of genes which are strongly correlated with the group of individuals. Eventhough today methods to analyse the data are welle developed and close to reach a standard organization (through the effort of preposed International project like Microarray Gene Expression Data -MGED- Society [1]) it is not unfrequant to stumble in a clinician's question that do not have a compelling statistical method that could permit to answer it.The contribution of this dissertation in deciphering disease regards the development of new approaches aiming at handle open problems posed by clinicians in handle specific experimental designs. In Chapter 1 starting from a biological necessary introduction, we revise the microarray tecnologies and all the important steps that involve an experiment from the production of the array, to the quality controls ending with preprocessing steps that will be used into the data analysis in the rest of the dissertation. While in Chapter 2 a critical review of standard analysis methods are provided stressing most of problems that In Chapter 3 is introduced a method to adress the issue of unbalanced design of miacroarray experiments. In microarray experiments, experimental design is a crucial starting-point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples it should be collected between the two classes. However in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists in a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC) A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each single probe given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to the performance of SAM and LIMMA [3] over two simulated data sets via beta and exponential distribution. The results of all three algorithms over low- noise data sets seems acceptable However, on a real unbalanced two-channel data set reagardin Chronic Lymphocitic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering. We also report extra-assay validation in terms of differentially expressed genes Although standard algorithms perform well over low-noise simulated data sets, multi-SAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. In Chapter 4 a method to adress similarities evaluation in a three-class prblem by means of Relevance Vector Machine [4] is described. In fact, looking at microarray data in a prognostic and diagnostic clinical framework, not only differences could have a crucial role. In some cases similarities can give useful and, sometimes even more, important information. The goal, given three classes, could be to establish, with a certain level of confidence, if the third one is similar to the first or the second one. In this work we show that Relevance Vector Machine (RVM) [2] could be a possible solutions to the limitation of standard supervised classification. In fact, RVM offers many advantages compared, for example, with his well-known precursor (Support Vector Machine - SVM [3]). Among these advantages, the estimate of posterior probability of class membership represents a key feature to address the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on Tumor-Grade-three-class problem, so we have 67 samples of grade I (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, then evaluate the third class G2 as test-set to obtain the probability for samples of G2 to be member of class G1 or class G3. The analysis showed that breast cancer samples of grade II have a molecular profile more similar to breast cancer samples of grade I. Looking at the literature this result have been guessed, but no measure of significance was gived before.
Resumo:
Since the first underground nuclear explosion, carried out in 1958, the analysis of seismic signals generated by these sources has allowed seismologists to refine the travel times of seismic waves through the Earth and to verify the accuracy of the location algorithms (the ground truth for these sources was often known). Long international negotiates have been devoted to limit the proliferation and testing of nuclear weapons. In particular the Treaty for the comprehensive nuclear test ban (CTBT), was opened to signatures in 1996, though, even if it has been signed by 178 States, has not yet entered into force, The Treaty underlines the fundamental role of the seismological observations to verify its compliance, by detecting and locating seismic events, and identifying the nature of their sources. A precise definition of the hypocentral parameters represents the first step to discriminate whether a given seismic event is natural or not. In case that a specific event is retained suspicious by the majority of the State Parties, the Treaty contains provisions for conducting an on-site inspection (OSI) in the area surrounding the epicenter of the event, located through the International Monitoring System (IMS) of the CTBT Organization. An OSI is supposed to include the use of passive seismic techniques in the area of the suspected clandestine underground nuclear test. In fact, high quality seismological systems are thought to be capable to detect and locate very weak aftershocks triggered by underground nuclear explosions in the first days or weeks following the test. This PhD thesis deals with the development of two different seismic location techniques: the first one, known as the double difference joint hypocenter determination (DDJHD) technique, is aimed at locating closely spaced events at a global scale. The locations obtained by this method are characterized by a high relative accuracy, although the absolute location of the whole cluster remains uncertain. We eliminate this problem introducing a priori information: the known location of a selected event. The second technique concerns the reliable estimates of back azimuth and apparent velocity of seismic waves from local events of very low magnitude recorded by a trypartite array at a very local scale. For the two above-mentioned techniques, we have used the crosscorrelation technique among digital waveforms in order to minimize the errors linked with incorrect phase picking. The cross-correlation method relies on the similarity between waveforms of a pair of events at the same station, at the global scale, and on the similarity between waveforms of the same event at two different sensors of the try-partite array, at the local scale. After preliminary tests on the reliability of our location techniques based on simulations, we have applied both methodologies to real seismic events. The DDJHD technique has been applied to a seismic sequence occurred in the Turkey-Iran border region, using the data recorded by the IMS. At the beginning, the algorithm was applied to the differences among the original arrival times of the P phases, so the cross-correlation was not used. We have obtained that the relevant geometrical spreading, noticeable in the standard locations (namely the locations produced by the analysts of the International Data Center (IDC) of the CTBT Organization, assumed as our reference), has been considerably reduced by the application of our technique. This is what we expected, since the methodology has been applied to a sequence of events for which we can suppose a real closeness among the hypocenters, belonging to the same seismic structure. Our results point out the main advantage of this methodology: the systematic errors affecting the arrival times have been removed or at least reduced. The introduction of the cross-correlation has not brought evident improvements to our results: the two sets of locations (without and with the application of the cross-correlation technique) are very similar to each other. This can be commented saying that the use of the crosscorrelation has not substantially improved the precision of the manual pickings. Probably the pickings reported by the IDC are good enough to make the random picking error less important than the systematic error on travel times. As a further justification for the scarce quality of the results given by the cross-correlation, it should be remarked that the events included in our data set don’t have generally a good signal to noise ratio (SNR): the selected sequence is composed of weak events ( magnitude 4 or smaller) and the signals are strongly attenuated because of the large distance between the stations and the hypocentral area. In the local scale, in addition to the cross-correlation, we have performed a signal interpolation in order to improve the time resolution. The algorithm so developed has been applied to the data collected during an experiment carried out in Israel between 1998 and 1999. The results pointed out the following relevant conclusions: a) it is necessary to correlate waveform segments corresponding to the same seismic phases; b) it is not essential to select the exact first arrivals; and c) relevant information can be also obtained from the maximum amplitude wavelet of the waveforms (particularly in bad SNR conditions). Another remarkable point of our procedure is that its application doesn’t demand a long time to process the data, and therefore the user can immediately check the results. During a field survey, such feature will make possible a quasi real-time check allowing the immediate optimization of the array geometry, if so suggested by the results at an early stage.
Resumo:
The Time-Of-Flight (TOF) detector of ALICE is designed to identify charged particles produced in Pb--Pb collisions at the LHC to address the physics of strongly-interacting matter and the Quark-Gluon Plasma (QGP). The detector is based on the Multigap Resistive Plate Chamber (MRPC) technology which guarantees the excellent performance required for a large time-of-flight array. The construction and installation of the apparatus in the experimental site have been completed and the detector is presently fully operative. All the steps which led to the construction of the TOF detector were strictly followed by a set of quality assurance procedures to enable high and uniform performance and eventually the detector has been commissioned with cosmic rays. This work aims at giving a detailed overview of the ALICE TOF detector, also focusing on the tests performed during the construction phase. The first data-taking experience and the first results obtained with cosmic rays during the commissioning phase are presented as well and allow to confirm the readiness state of the TOF detector for LHC collisions.