955 results for Large datasets
Abstract:
Gene expression is one of the most critical factors influencing the phenotype of a cell. As a result of several technological advances, measuring gene expression levels has become one of the most common molecular biological measurements used to study the behaviour of cells. The scientific community has produced an enormous and constantly increasing collection of gene expression data from various human cells, from both healthy and pathological conditions. However, while each of these studies is informative and enlightening in its own context and research setup, diverging methods and terminologies make it very challenging to integrate existing gene expression data into a more comprehensive view of human transcriptome function. On the other hand, bioinformatic science advances only through data integration and synthesis. The aim of this study was to develop biological and mathematical methods to overcome these challenges, to construct an integrated database of the human transcriptome, and to demonstrate its usage. The methods developed in this study can be divided into two distinct parts. First, the biological and medical annotation of the existing gene expression measurements needed to be encoded using systematic vocabularies. There was no single existing biomedical ontology or vocabulary suitable for this purpose. Thus, a new annotation terminology was developed as a part of this work. The second part was to develop mathematical methods for correcting the noise and systematic differences/errors in the data caused by various array generations. Additionally, there was a need to develop suitable computational methods for sample collection and archiving, unique sample identification, database structures, data retrieval and visualization. Bioinformatic methods were developed to analyze gene expression levels and putative functional associations of human genes by using the integrated gene expression data.
A method to interpret individual gene expression profiles across all the healthy and pathological tissues of the reference database was also developed. As a result of this work, 9783 human gene expression samples measured by Affymetrix microarrays were integrated to form a unique human transcriptome resource, GeneSapiens. This makes it possible to analyse expression levels of 17330 genes across 175 types of healthy and pathological human tissues. Application of this resource to interpret individual gene expression measurements allowed identification of the tissue of origin with 92.0% accuracy among 44 healthy tissue types. A systematic analysis of transcriptional activity levels of 459 kinase genes was performed across 44 healthy and 55 pathological tissue types, and a genome-wide analysis of kinase gene co-expression networks was carried out. This analysis revealed biologically and medically interesting data on putative kinase gene functions in health and disease. Finally, we developed a method for alignment of gene expression profiles (AGEP) to analyse individual patient samples and pinpoint gene- and pathway-specific changes in the test sample in relation to the reference transcriptome database. We also showed how large-scale gene expression data resources can be used to quantitatively characterize changes in the transcriptomic program of differentiating stem cells. Taken together, these studies indicate the power of systematic bioinformatic analyses to infer biological and medical insights from existing published datasets as well as to facilitate the interpretation of new molecular profiling data from individual patients.
Abstract:
Design of speaker identification schemes for a small number of speakers (around 10) with a high degree of accuracy in a controlled environment is a practical proposition today. When the number of speakers is large (say 50–100), many of these schemes cannot be directly extended, as both recognition error and computation time increase monotonically with population size. The feature selection problem is also complex for such schemes. Though there were earlier attempts to rank-order features based on statistical distance measures, it has been observed only recently that the best two independent measurements are not the same as the best pair of measurements taken in combination for pattern classification. We propose here a systematic approach to the problem using the decision tree or hierarchical classifier with the following objectives: (1) Design of the optimal policy at each node of the tree given the tree structure, i.e., the tree skeleton and the features to be used at each node. (2) Determination of the optimal feature measurement and decision policy given only the tree skeleton. The applicability of optimization procedures such as dynamic programming in the design of such trees is studied. The experimental results deal with the design of a 50-speaker identification scheme based on this approach.
Abstract:
Proteases belonging to the M20 family are characterized by diverse substrate specificity and participate in several metabolic pathways. The Staphylococcus aureus metallopeptidase, Sapep, is a member of the aminoacylase-I/M20 protein family. This protein is a Mn2+-dependent dipeptidase. The crystal structure of this protein in the Mn2+-bound form and in the open, metal-free state suggests that large interdomain movements could potentially regulate the activity of this enzyme. We note that the extended inactive conformation is stabilized by a disulfide bond in the vicinity of the active site. Although these cysteines, Cys155 and Cys178, are not active site residues, the reduced form of this enzyme is substantially more active as a dipeptidase. These findings acquire further relevance given a recent observation that this enzyme is only active in methicillin-resistant S. aureus. The structural and biochemical features of this enzyme provide a template for the design of novel methicillin-resistant S. aureus-specific therapeutics.
Abstract:
In recent years, thanks to developments in information technology, large-dimensional datasets have become increasingly available. Researchers now have access to thousands of economic series, and the information contained in them can be used to create accurate forecasts and to test economic theories. To exploit this large amount of information, researchers and policymakers need an appropriate econometric model. Usual time series models, vector autoregression for example, cannot incorporate more than a few variables. There are two ways to solve this problem: use variable selection procedures, or gather the information contained in the series to create an index model. This thesis focuses on one of the most widespread index models, the dynamic factor model (the theory behind this model, based on previous literature, is the core of the first part of this study), and its use in forecasting Finnish macroeconomic indicators (which is the focus of the second part of the thesis). In particular, I forecast economic activity indicators (e.g. GDP) and price indicators (e.g. consumer price index) from 3 large Finnish datasets. The first dataset contains a large series of aggregated data obtained from the Statistics Finland database. The second dataset is composed of economic indicators from the Bank of Finland. The last dataset is formed by disaggregated data from Statistics Finland, which I call the micro dataset. The forecasts are computed following a two-step procedure: in the first step I estimate a set of common factors from the original dataset. The second step consists of formulating forecasting equations that include the factors extracted previously. The predictions are evaluated using the relative mean squared forecast error, where the benchmark model is a univariate autoregressive model. The results are dataset-dependent.
The forecasts based on factor models are very accurate for the first dataset (the Statistics Finland one), while they are considerably worse for the Bank of Finland dataset. The forecasts derived from the micro dataset are still good, but less accurate than the ones obtained in the first case. This work opens several directions for further research. The results obtained here can be replicated for longer datasets. The non-aggregated data can be represented in an even more disaggregated form (firm level). Finally, the use of the micro data, one of the major contributions of this thesis, can be useful in the imputation of missing values and the creation of flash estimates of macroeconomic indicators (nowcasting).
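The two-step procedure and the relative mean squared forecast error described in this abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: the factors are estimated by static principal components (a common simplification of the dynamic factor model), the forecast horizon is one period, the benchmark is an AR(1), and the data are synthetic. The function names `extract_factors`, `ols_forecasts` and `relative_msfe` are hypothetical. For brevity the factors are estimated on the full sample, whereas a real forecasting exercise would re-estimate them using only data available at each forecast origin.

```python
import numpy as np

def extract_factors(X, n_factors):
    """Step 1: principal-component estimate of common factors (T x n_factors)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each series
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_factors] * s[:n_factors]

def ols_forecasts(regressors, target, n_train):
    """Fit OLS with intercept on the first n_train rows, predict the rest."""
    A = np.column_stack([np.ones(len(target)), regressors])
    beta, *_ = np.linalg.lstsq(A[:n_train], target[:n_train], rcond=None)
    return A[n_train:] @ beta

def relative_msfe(y, X, n_factors=2, horizon=1, train_frac=0.7):
    """Step 2: factor-augmented forecast vs. AR(1) benchmark, as an MSFE ratio."""
    F = extract_factors(X, n_factors)[:-horizon]   # factors known at time t
    y_lag, y_next = y[:-horizon], y[horizon:]      # predict y[t+h] from time-t data
    n_train = int(train_frac * len(y_next))
    f_model = ols_forecasts(np.column_stack([F, y_lag]), y_next, n_train)
    f_bench = ols_forecasts(y_lag, y_next, n_train)
    actual = y_next[n_train:]
    return np.mean((actual - f_model) ** 2) / np.mean((actual - f_bench) ** 2)

# Synthetic panel (illustrative only): one persistent factor drives 30 series.
rng = np.random.default_rng(0)
T, N = 200, 30
f = np.zeros(T)
for t in range(1, T):
    f[t] = 0.8 * f[t - 1] + rng.normal()
X = np.outer(f, rng.normal(size=N)) + rng.normal(size=(T, N))
y = np.zeros(T)
y[1:] = 0.9 * f[:-1] + 0.3 * rng.normal(size=T - 1)

ratio = relative_msfe(y, X)
print(f"relative MSFE (factor model / AR benchmark): {ratio:.3f}")
```

A ratio below 1 means the factor-based forecast beats the autoregressive benchmark on the hold-out sample; a ratio above 1 means the benchmark wins.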
Abstract:
This study uses the European Centre for Medium-Range Weather Forecasts (ECMWF) model-generated high-resolution 10-day-long predictions for the Year of Tropical Convection (YOTC) 2008. Precipitation forecast skills of the model over the tropics are evaluated against the Tropical Rainfall Measuring Mission (TRMM) estimates. It has been shown that the model was able to capture the monthly to seasonal mean features of tropical convection reasonably well. Northward propagation of convective bands over the Bay of Bengal was also forecast realistically up to 5 days in advance, including the onset phase of the monsoon during the first half of June 2008. However, large errors exist in the daily datasets, especially for longer lead times over smaller domains. For shorter lead times (less than 4-5 days), forecast errors are much smaller over the oceans than over land. Moreover, the rate of increase of errors with lead time is rapid over the oceans and is confined to the regions where observed precipitation shows large day-to-day variability. It has been shown that this rapid growth of errors over the oceans is related to the spatial pattern of near-surface air temperature. This is probably due to the one-way air-sea interaction in the atmosphere-only model used for forecasting. While the prescribed surface temperature over the oceans remains realistic at shorter lead times, the pattern and hence the gradient of the surface temperature is not altered with changes in atmospheric parameters at longer lead times. It has also been shown that the ECMWF model had considerable difficulties in forecasting very low and very heavy intensities of precipitation over South Asia. The model has too few grid points with "zero" precipitation and with heavy (>40 mm day⁻¹) precipitation. On the other hand, drizzle-like precipitation is too frequent in the model compared to that in the TRMM datasets.
Further analysis shows that a major source of error in the ECMWF precipitation forecasts is the diurnal cycle over the South Asian monsoon region. The peak intensity of precipitation in the model forecasts over land (ocean) appears about 6 (9) h earlier than that in the observations. Moreover, the amplitude of the diurnal cycle is much higher in the model forecasts compared to that in the TRMM estimates. It has been seen that the phase error of the diurnal cycle increases with forecast lead time. The error in monthly mean 3-hourly precipitation forecasts is about 2-4 times the error in the daily mean datasets. Thus, effort should be directed at improving the phase and amplitude of the diurnal cycle of precipitation forecast by the model.
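One common way to quantify the phase and amplitude of a diurnal cycle is to fit the first (24 h) harmonic to the mean 3-hourly values. The abstract does not say which method the authors used, so the sketch below is only an assumed illustration. The function names `diurnal_harmonic` and `phase_lead_hours` are hypothetical, and the two input curves are synthetic, chosen so that the "model" peaks 6 h before the "observations".

```python
import numpy as np

OMEGA = 2 * np.pi / 24.0  # angular frequency of the 24 h harmonic (rad per hour)

def diurnal_harmonic(precip, hours):
    """Amplitude and peak hour of the first diurnal harmonic.

    Assumes `hours` are equally spaced and cover exactly one day.
    """
    p = np.asarray(precip, dtype=float)
    h = np.asarray(hours, dtype=float)
    a = 2.0 / p.size * np.sum(p * np.cos(OMEGA * h))
    b = 2.0 / p.size * np.sum(p * np.sin(OMEGA * h))
    amplitude = np.hypot(a, b)
    peak_hour = (np.arctan2(b, a) / OMEGA) % 24.0
    return amplitude, peak_hour

def phase_lead_hours(model_peak, obs_peak):
    """Hours by which the model peak precedes the observed peak (range -12..12)."""
    return (obs_peak - model_peak + 12.0) % 24.0 - 12.0

# Synthetic 3-hourly mean precipitation (mm/day), illustrative values only.
hours = np.arange(8) * 3.0
obs = 5.0 + 2.0 * np.cos(OMEGA * (hours - 15.0))    # observed peak at 15 h
model = 5.0 + 3.0 * np.cos(OMEGA * (hours - 9.0))   # model peak at 9 h, larger amplitude

obs_amp, obs_peak = diurnal_harmonic(obs, hours)
mod_amp, mod_peak = diurnal_harmonic(model, hours)
lead = phase_lead_hours(mod_peak, obs_peak)
print(f"amplitude ratio model/obs: {mod_amp / obs_amp:.2f}, model leads by {lead:.1f} h")
```

Repeating the calculation for each forecast lead time would show whether the phase error grows with lead time, as the abstract reports.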
Abstract:
We report a detailed investigation of resistance noise in single-layer graphene films on Si/SiO2 substrates obtained by chemical vapor deposition (CVD) on copper foils. We find the noise in these systems to be rather large; when expressed in the form of the phenomenological Hooge equation, it corresponds to a Hooge parameter as large as 0.1-0.5. We also find the variation in the noise magnitude with the gate voltage (or carrier density) and temperature to be surprisingly weak, which is unlike the behavior of noise in other forms of graphene, in particular those obtained by exfoliation. (C) 2010 American Institute of Physics. [doi:10.1063/1.3493655]
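For reference, the phenomenological Hooge relation is commonly written as below. The abstract does not give the exact form the authors used, so the symbols here are assumptions: $S_R(f)$ is the power spectral density of resistance fluctuations at frequency $f$, $R$ is the sample resistance, $\gamma_H$ is the Hooge parameter, and $N$ is the total number of charge carriers, taken here as the areal carrier density $n$ times the film area $A$.

```latex
\frac{S_R(f)}{R^{2}} = \frac{\gamma_H}{N f}, \qquad N = nA
\quad\Longrightarrow\quad
\gamma_H = \frac{f \, S_R(f)}{R^{2}} \, nA
```

In this form, a larger $\gamma_H$ at fixed carrier number and frequency means a larger normalized noise level.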
Abstract:
Large-area PVDF thin films have been prepared and characterized for quasi-static and high-frequency dynamic strain sensing applications. These films are prepared using the hot press method, and the piezoelectric phase (beta-phase) has been achieved by thermo-mechanical treatment and poling under a DC field. The fabricated films have been characterized for quasi-static strain sensing, and the linear strain-voltage relationship obtained is promising. In order to evaluate the ultrasonic sensing properties, a PZT wafer has been used to launch Lamb waves in a metal beam on which the PVDF film sensor is bonded at a distance. The voltage signals obtained from the PVDF films have been compared with those from another PZT wafer sensor, placed on the opposite surface of the beam, as a reference signal. Due to the higher stiffness and greater thickness of the PZT wafer sensors, certain resonance patterns significantly degrade the sensor sensitivity curves. In contrast, the present results show that the large-area PVDF sensors can be superior, with a signal amplitude comparable to that of PZT sensors and with no resonance-induced effect, which is due to the low mechanical impedance, smaller thickness and larger area of the PVDF film. Moreover, the developed PVDF sensors are able to capture both the A0 and S0 modes of the Lamb wave, whereas the PZT sensors capture only the A0 mode at the same scale of voltage output. This shows promise for using large-area PVDF films with various surface patterns on structures for distributed sensing and structural health monitoring under quasi-static, vibration and ultrasonic conditions. (C) 2010 Elsevier B.V. All rights reserved.
Abstract:
Many large mammals such as elephant, rhino and tiger often come into conflict with people by destroying agricultural crops and even killing people, thus providing a deterrent to conservation efforts. The males of these polygynous species have a greater variance in reproductive success than females, leading to selection pressures favouring a 'high risk-high gain' strategy for promoting reproductive success. This brings them into greater conflict with people. For instance, adult male elephants are far more prone than members of female-led family herds to raid agricultural crops and to kill people. In polygynous species, the removal of a certain proportion of 'surplus' adult males is not likely to affect the fertility and growth rate of the population. Hence, this could be a management tool which would effectively reduce animal-human conflict and at the same time maintain the viability of the population. However, selective removal of males would result in a skewed sex ratio. This would reduce the 'effective population size' (as opposed to the total population or census number), increase the rate of genetic drift and, in small populations, lead to inbreeding depression. Plans for managing destructive mammals through the culling of males will have to ensure that an appropriate minimum population size is maintained.
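The effect of a skewed sex ratio on effective population size can be made concrete with a small worked example. The abstract gives no formula, so this sketch assumes the standard textbook approximation for unequal numbers of breeding males and females, Ne = 4·Nm·Nf / (Nm + Nf), and the corresponding per-generation increase in inbreeding, ΔF = 1 / (2·Ne). The function names and the census numbers are hypothetical.

```python
def effective_population_size(n_males: int, n_females: int) -> float:
    """Effective size for unequal sex ratio: Ne = 4*Nm*Nf / (Nm + Nf)."""
    if n_males <= 0 or n_females <= 0:
        raise ValueError("both sexes must be present in the breeding population")
    return 4.0 * n_males * n_females / (n_males + n_females)

def inbreeding_rate(ne: float) -> float:
    """Expected per-generation increase in inbreeding coefficient, 1 / (2*Ne)."""
    return 1.0 / (2.0 * ne)

# Hypothetical breeding population of 100 adults, before and after male culling.
balanced = effective_population_size(50, 50)   # even sex ratio
skewed = effective_population_size(10, 90)     # same census size, few males

print(f"balanced: Ne = {balanced:.0f}, dF = {inbreeding_rate(balanced):.4f}")
print(f"skewed:   Ne = {skewed:.0f}, dF = {inbreeding_rate(skewed):.4f}")
```

Under this approximation, the same census number of 100 adults yields an effective size of 100 with an even sex ratio but only 36 with 10 males and 90 females, which is why the abstract distinguishes effective population size from census number.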
Abstract:
We report femtosecond time-resolved reflectivity measurements of coherent phonons in tellurium performed over a wide range of temperatures (3-296 K) and pump-laser intensities. A totally symmetric A1 coherent phonon at 3.6 THz responsible for the oscillations in the reflectivity data is observed to be strongly positively chirped (i.e., the phonon time period decreases at longer pump-probe delay times) with increasing photoexcited carrier density, more so at lower temperatures. We show that the temperature dependence of the coherent phonon frequency is anomalous (i.e., increasing with increasing temperature) at high photoexcited carrier density due to electron-phonon interaction. At the highest photoexcited carrier density of 1.4 × 10²¹ cm⁻³ and a sample temperature of 3 K, the lattice displacement of the coherent phonon mode is estimated to be as high as ~0.24 Å. Numerical simulations based on the coupled effects of optical absorption and carrier diffusion reveal that the diffusion of carriers dominates the nonoscillatory electronic part of the time-resolved reflectivity. Finally, using the pump-probe experiments at a low carrier density of 6 × 10¹⁸ cm⁻³, we separate the phonon anharmonicity to obtain the electron-phonon coupling contribution to the phonon frequency and linewidth.
Abstract:
The Bay of Bengal (BoB), a small oceanic region surrounded by landmasses with distinct natural and anthropogenic activities and under the influence of seasonally changing airmass types, is characterized by a rather complex and highly heterogeneous aerosol environment. Concurrent measurements of the physical, optical, and chemical (offline analysis) properties of BoB aerosols, made onboard extensive ship cruises and aircraft sorties during the Integrated Campaign for Aerosols, gases and Radiation Budget of March-April 2006, and satellite-retrieved aerosol optical depths and derived parameters, were synthesized following a synergistic approach to delineate the anthropogenic fraction of the composite aerosol parameters and its spatial variation. Quite interestingly and contrary to the general belief, our studies revealed that, despite the very high aerosol loading (in the marine atmospheric boundary layer as well as in the vertical column) over the northern BoB and a steep decreasing gradient toward the southern latitudes, the anthropogenic fraction showed a steady increase from north to south (where no obvious anthropogenic source regions exist). Consequently, the direct radiative forcing at the top of the atmosphere due to anthropogenic aerosols remained nearly constant over the entire BoB, with values in the range from -3.3 to -3.6 W m⁻². This interesting finding calls, beyond doubt, for a better understanding of the complex aerosol system over the BoB through more focused field campaigns.
Abstract:
Large eddy simulation (LES) is an emerging technique for obtaining an approximation to turbulent flow fields. It is an improvement over the widely prevalent practice of obtaining means of turbulent flows when the flow has large-scale, low-frequency unsteadiness. An introduction to the method, its general formulation, and the more common modelling for flows without reaction is given. Some attempts at extension to flows with combustion have been made. Examples from present work for flows with and without combustion are given. The final example, the LES of the combustor of a helicopter engine, illustrates the state of the art in application of the technique.