23 results for data complexity

Relevance:

40.00%

Publisher:

Abstract:

MOTIVATION: High-throughput sequencing technologies enable the genome-wide analysis of the impact of genetic variation on molecular phenotypes at unprecedented resolution. However, although powerful, these technologies can also introduce unexpected artifacts. RESULTS: We investigated the impact of library amplification bias on the identification of allele-specific (AS) molecular events from high-throughput sequencing data derived from chromatin immunoprecipitation assays (ChIP-seq). Putative AS DNA binding activity for RNA polymerase II was determined using ChIP-seq data derived from lymphoblastoid cell lines of two parent-daughter trios. We found that, at high sequencing depth, many significant AS binding sites suffered from an amplification bias, as evidenced by a larger number of clonal reads representing one of the two alleles. To alleviate this bias, we devised an amplification bias detection strategy, which filters out sites with low read complexity and sites featuring a significant excess of clonal reads. This method will be useful for AS analyses involving ChIP-seq and other functional sequencing assays.
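
As an illustration of the kind of filter the abstract describes, the sketch below flags a putative AS site when its reads pile up on too few distinct start positions (low read complexity) or when a single start position carries a significant clonal excess. This is a hedged sketch: the function name and thresholds are hypothetical, not the published ones.

```python
from collections import Counter
from scipy.stats import binomtest

def passes_bias_filter(read_starts, min_distinct=5, max_clonal_frac=0.8):
    """Return False when an AS site looks amplification-biased.

    read_starts: list of read start positions covering the site.
    Thresholds are illustrative, not those of the published method.
    """
    counts = Counter(read_starts)
    if len(counts) < min_distinct:              # low read complexity
        return False
    total = sum(counts.values())
    top = counts.most_common(1)[0][1]
    # Clonal excess: is the busiest start position carrying more reads
    # than expected if reads were spread evenly across start positions?
    test = binomtest(top, total, 1.0 / len(counts), alternative="greater")
    if test.pvalue < 0.01 or top / total > max_clonal_frac:
        return False
    return True
```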

Relevance:

40.00%

Publisher:

Abstract:

Maximum entropy modeling (Maxent) is a widely used algorithm for predicting species distributions across space and time. Properly assessing the uncertainty in such predictions is non-trivial and requires validation with independent datasets. Notably, model complexity (number of model parameters) remains a major concern in relation to overfitting and, hence, transferability of Maxent models. An emerging approach is to validate the cross-temporal transferability of model predictions using paleoecological data. In this study, we assess the effect of model complexity on the performance of Maxent projections across time using two European plant species (Alnus glutinosa (L.) Gaertn. and Corylus avellana L.) with an extensive late Quaternary fossil record in Spain as a case study. We fit 110 models with different levels of complexity under present time and tested model performance using AUC (area under the receiver operating characteristic curve) and AICc (corrected Akaike Information Criterion) through the standard procedure of randomly partitioning current occurrence data. We then compared these results to an independent validation by projecting the models to mid-Holocene (6000 years before present) climatic conditions in Spain to assess their ability to predict fossil pollen presence-absence and abundance. We find that calibrating Maxent models with default settings results in overly complex models. While model performance increased with model complexity when predicting current distributions, it was higher with intermediate complexity when predicting mid-Holocene distributions. Hence, models of intermediate complexity resulted in the best trade-off to predict species distributions across time. Reliable temporal model transferability is especially relevant for forecasting species distributions under future climate change. Consequently, species-specific model tuning should be used to find the best modeling settings to control for complexity, notably with paleoecological data to independently validate model projections. For cross-temporal projections of species distributions for which paleoecological data is not available, models of intermediate complexity should be selected.
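
For concreteness, here is a minimal sketch of the two evaluation criteria named above: AUC on a randomly held-out partition, and AICc as a complexity penalty. The function names are illustrative; the inputs (prediction scores, log-likelihood, parameter count) would come from the fitted Maxent models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def aicc(log_likelihood, k, n):
    """Corrected Akaike Information Criterion.
    k: number of model parameters; n: number of occurrence records."""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def evaluate_model(test_scores, test_labels, log_likelihood, k, n):
    """Return the two criteria used to rank candidate models:
    discrimination on withheld data (AUC) and AICc."""
    return {"auc": roc_auc_score(test_labels, test_scores),
            "aicc": aicc(log_likelihood, k, n)}
```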

Relevance:

30.00%

Publisher:

Abstract:

Neuroblastoma (NB) is a neural crest-derived childhood tumor characterized by a remarkable phenotypic diversity, ranging from spontaneous regression to fatal metastatic disease. Although the cancer stem cell (CSC) model provides a trail to characterize the cells responsible for tumor onset, the NB tumor-initiating cell (TIC) has not been identified. In this study, the relevance of the CSC model in NB was investigated by taking advantage of typical functional stem cell characteristics. A predictive association was established between self-renewal, as assessed by serial sphere formation, and clinical aggressiveness in primary tumors. Moreover, cell subsets gradually selected during serial sphere culture harbored increased in vivo tumorigenicity, only highlighted in an orthotopic microenvironment. A microarray time course analysis of serial sphere passages from metastatic cells allowed us to specifically "profile" the NB stem cell-like phenotype and to identify CD133, ABC transporter, and WNT and NOTCH genes as sphere markers. On the basis of combined sphere marker expression, at least two distinct tumorigenic cell subpopulations were identified, also shown to preexist in primary NB. However, sphere marker-mediated cell sorting of the parental tumor failed to recapitulate the TIC phenotype in the orthotopic model, highlighting the complexity of the CSC model. Our data support the NB stem-like cells as a dynamic and heterogeneous cell population strongly dependent on microenvironmental signals and add novel candidate genes as potential therapeutic targets in the control of high-risk NB.

Relevance:

30.00%

Publisher:

Abstract:

Species distribution models (SDMs) are widely used to explain and predict species ranges and environmental niches. They are most commonly constructed by inferring species' occurrence-environment relationships using statistical and machine-learning methods. The variety of methods that can be used to construct SDMs (e.g. generalized linear/additive models, tree-based models, maximum entropy, etc.), and the variety of ways that such models can be implemented, permits substantial flexibility in SDM complexity. Building models with an appropriate amount of complexity for the study objectives is critical for robust inference. We characterize complexity as the shape of the inferred occurrence-environment relationships and the number of parameters used to describe them, and search for insights into whether additional complexity is informative or superfluous. By building 'underfit' models, having insufficient flexibility to describe observed occurrence-environment relationships, we risk misunderstanding the factors shaping species distributions. By building 'overfit' models, with excessive flexibility, we risk inadvertently ascribing pattern to noise or building opaque models. However, model selection can be challenging, especially when comparing models constructed under different modeling approaches. Here we argue for a more pragmatic approach: researchers should constrain the complexity of their models based on the study objectives, attributes of the data, and an understanding of how these interact with the underlying biological processes. We discuss guidelines for balancing underfitting with overfitting and consequently how complexity affects decisions made during model building. Although some generalities are possible, our discussion reflects differences in opinions that favor simpler versus more complex models. We conclude that combining insights from both simple and complex SDM building approaches best advances our knowledge of current and future species ranges.
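
To make the underfit/overfit trade-off concrete, here is a hedged toy sketch (not from the paper): presence-absence data generated from a hump-shaped environmental response are fitted with logistic GLMs of increasing polynomial degree, and AIC is used to compare them. A degree-1 model cannot describe the hump (underfit), while a high-degree model starts fitting noise (overfit).

```python
import numpy as np
import statsmodels.api as sm

def fit_sdm(env, presence, degree):
    """Logistic GLM with polynomial terms of one environmental
    covariate up to `degree`; higher degree = more flexible model."""
    X = sm.add_constant(np.column_stack([env ** d for d in range(1, degree + 1)]))
    return sm.GLM(presence, X, family=sm.families.Binomial()).fit()

# Simulated occurrences from a hump-shaped (unimodal) niche.
rng = np.random.default_rng(0)
env = rng.uniform(-2, 2, 500)
p_true = 1 / (1 + np.exp(-(1.5 * env - env ** 2)))
presence = rng.binomial(1, p_true)

for degree in (1, 2, 5):           # too simple, about right, flexible
    print(degree, round(fit_sdm(env, presence, degree).aic, 1))
```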

Relevance:

30.00%

Publisher:

Abstract:

SUMMARY: Large sets of data, such as expression profiles from many samples, require analytic tools to reduce their complexity. The Iterative Signature Algorithm (ISA) is a biclustering algorithm. It was designed to decompose a large set of data into so-called 'modules'. In the context of gene expression data, these modules consist of subsets of genes that exhibit a coherent expression profile only over a subset of microarray experiments. Genes and arrays may be attributed to multiple modules, and the level of required coherence can be varied, resulting in different 'resolutions' of the modular mapping. In this short note, we introduce two Bioconductor software packages written in GNU R: the isa2 package includes an optimized implementation of the ISA, and the eisa package provides a convenient interface to run the ISA, visualize its output and put the biclusters into biological context. Potential users of these packages are all R and Bioconductor users dealing with tabular (e.g. gene expression) data. AVAILABILITY: http://www.unil.ch/cbg/ISA CONTACT: sven.bergmann@unil.ch
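
For readers unfamiliar with the ISA, the sketch below shows its core iteration in bare-bones form (in Python rather than R, for consistency with the other examples here). It is a simplified illustration of the algorithm's alternating score-and-threshold structure, not the optimized isa2 implementation; thresholds and the seeding scheme are placeholders.

```python
import numpy as np

def zscore(x):
    s = x.std()
    return (x - x.mean()) / s if s > 0 else np.zeros_like(x)

def isa_module(E, t_genes=2.0, t_arrays=2.0, n_iter=50, seed=0):
    """One ISA run on a standardized expression matrix E (genes x arrays):
    alternate between scoring arrays over the current gene set and
    scoring genes over the selected arrays, thresholding each time,
    until a fixed point (a 'module') is reached. The thresholds set
    the 'resolution' of the modular decomposition."""
    rng = np.random.default_rng(seed)
    genes = rng.random(E.shape[0]) < 0.05      # random seed gene set
    genes[rng.integers(E.shape[0])] = True     # ensure it is non-empty
    arrays = np.zeros(E.shape[1], dtype=bool)
    for _ in range(n_iter):
        arrays = zscore(E[genes].mean(axis=0)) > t_arrays
        if not arrays.any():
            break
        new_genes = zscore(E[:, arrays].mean(axis=1)) > t_genes
        if (new_genes == genes).all():          # converged
            break
        genes = new_genes
    return genes, arrays
```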

Relevance:

30.00%

Publisher:

Abstract:

The integration of geophysical data into the subsurface characterization problem has been shown in many cases to significantly improve hydrological knowledge by providing information at spatial scales and locations that are unattainable using conventional hydrological measurement techniques. The investigation of exactly how much benefit can be brought by geophysical data in terms of its effect on hydrological predictions, however, has received considerably less attention in the literature. Here, we examine the potential hydrological benefits brought by a recently introduced simulated annealing (SA) conditional stochastic simulation method designed for the assimilation of diverse hydrogeophysical data sets. We consider the specific case of integrating crosshole ground-penetrating radar (GPR) and borehole porosity log data to characterize the porosity distribution in saturated heterogeneous aquifers. In many cases, porosity is linked to hydraulic conductivity and thus to flow and transport behavior. To perform our evaluation, we first generate a number of synthetic porosity fields exhibiting varying degrees of spatial continuity and structural complexity. Next, we simulate the collection of crosshole GPR data between several boreholes in these fields, and the collection of porosity log data at the borehole locations. The inverted GPR data, together with the porosity logs, are then used to reconstruct the porosity field using the SA-based method, along with a number of other more elementary approaches. Assuming that the grid-cell-scale relationship between porosity and hydraulic conductivity is unique and known, the porosity realizations are then used in groundwater flow and contaminant transport simulations to assess the benefits and limitations of the different approaches.
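
A heavily simplified sketch of the simulated-annealing idea behind such conditional simulation follows. The published method assimilates GPR tomograms through a more elaborate objective function, so treat the statistic, the cooling schedule and all names below as placeholders rather than the authors' algorithm.

```python
import numpy as np

def sa_condition(field, target_stat, stat_fn, fixed_mask,
                 n_steps=100_000, t0=1.0, cooling=0.9999, seed=0):
    """Bare-bones SA loop: swap pairs of non-conditioned cells and accept
    swaps that move a chosen spatial statistic (e.g. an experimental
    variogram computed by `stat_fn`) toward its target. Cells in
    `fixed_mask` are tied to borehole porosity logs and never perturbed.
    Recomputing the statistic from scratch each step is wasteful but
    keeps the sketch short."""
    rng = np.random.default_rng(seed)
    free = np.flatnonzero(~fixed_mask.ravel())
    f = field.ravel().copy()
    cost = np.sum((stat_fn(f.reshape(field.shape)) - target_stat) ** 2)
    temp = t0
    for _ in range(n_steps):
        i, j = rng.choice(free, size=2, replace=False)
        f[i], f[j] = f[j], f[i]                        # propose a swap
        new_cost = np.sum((stat_fn(f.reshape(field.shape)) - target_stat) ** 2)
        if new_cost < cost or rng.random() < np.exp((cost - new_cost) / temp):
            cost = new_cost                            # accept
        else:
            f[i], f[j] = f[j], f[i]                    # reject: undo swap
        temp *= cooling
    return f.reshape(field.shape)
```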

Relevance:

30.00%

Publisher:

Abstract:

The aim of this study was to propose a methodology allowing a detailed characterization of body sit-to-stand/stand-to-sit postural transitions. Parameters characterizing the kinematics of the trunk movement during the sit-to-stand (Si-St) postural transition were calculated using a single inertial sensor system fixed on the trunk and a data logger. The dynamic complexity of these postural transitions was estimated by the fractal dimension of the acceleration-angular velocity plot. We concluded that this method provides a simple and accurate tool for monitoring frail elderly subjects and for objectively evaluating the efficacy of a rehabilitation program.
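
The abstract does not specify the fractal-dimension estimator. A standard choice for a 2-D phase plot like the acceleration-angular velocity curve is box counting; the sketch below is an assumption-laden illustration of that estimator, not necessarily the study's exact method.

```python
import numpy as np

def box_counting_dimension(accel, ang_vel, n_scales=8):
    """Estimate the box-counting dimension of the 2-D curve traced by
    (acceleration, angular velocity) samples: count occupied boxes at
    dyadic scales, then fit log(count) against log(1/size)."""
    x = np.asarray(accel, dtype=float)
    y = np.asarray(ang_vel, dtype=float)
    # Normalize the curve into the unit square.
    x = (x - x.min()) / (np.ptp(x) or 1.0)
    y = (y - y.min()) / (np.ptp(y) or 1.0)
    sizes, counts = [], []
    for k in range(1, n_scales + 1):
        n_boxes = 2 ** k
        bx = np.minimum((x * n_boxes).astype(int), n_boxes - 1)
        by = np.minimum((y * n_boxes).astype(int), n_boxes - 1)
        counts.append(len(set(zip(bx, by))))   # occupied boxes
        sizes.append(1.0 / n_boxes)
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope
```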

Relevance:

30.00%

Publisher:

Abstract:

Imaging mass spectrometry (IMS) represents an innovative tool in the cancer research pipeline, which is increasingly being used in clinical and pharmaceutical applications. The unique properties of the technique, especially the amount of data generated, make the handling of data from multiple IMS acquisitions challenging. This work presents a histology-driven IMS approach aiming to identify discriminant lipid signatures from the simultaneous mining of IMS data sets from multiple samples. The feasibility of the developed workflow is evaluated on a set of three human colorectal cancer liver metastasis (CRCLM) tissue sections. Lipid IMS on tissue sections was performed using MALDI-TOF/TOF MS in both negative and positive ionization modes after 1,5-diaminonaphthalene matrix deposition by sublimation. The results of the positive and negative acquisitions were combined during data mining to simplify the process and interrogate a larger lipidome in a single analysis. To reduce the complexity of the IMS data sets, a reduced data set was generated by randomly selecting a fixed number of spectra from each histologically defined region of interest, resulting in a 10-fold data reduction. Principal component analysis confirmed that the molecular selectivity of the regions of interest is maintained after data reduction. Partial least-squares and heat map analyses demonstrated a selective signature of the CRCLM, revealing lipids that are significantly up- and down-regulated in the tumor region. This comprehensive approach is thus of interest for defining disease signatures directly from IMS data sets by the use of combinatory data mining, opening novel routes of investigation for addressing the demands of the clinical setting.
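
The data-reduction step described above (random sub-sampling per region of interest, then a PCA sanity check) is easy to sketch. The names and the input layout below are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_ims(spectra_by_roi, n_per_roi=100, seed=0):
    """Randomly keep a fixed number of spectra per histologically
    defined ROI, then project with PCA to check that ROIs remain
    molecularly distinct after reduction. `spectra_by_roi` maps an
    ROI label to an (n_spectra x n_peaks) intensity matrix."""
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for roi, spectra in spectra_by_roi.items():
        idx = rng.choice(len(spectra), size=min(n_per_roi, len(spectra)),
                         replace=False)
        X.append(spectra[idx])
        labels += [roi] * len(idx)
    scores = PCA(n_components=2).fit_transform(np.vstack(X))
    return scores, labels
```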

Relevance:

30.00%

Publisher:

Abstract:

High-resolution tomographic imaging of the shallow subsurface is becoming increasingly important for a wide range of environmental, hydrological and engineering applications. Because of their superior resolution power, their sensitivity to pertinent petrophysical parameters, and their far-reaching complementarities, both seismic and georadar crosshole imaging are of particular importance. To date, corresponding approaches have largely relied on asymptotic, ray-based approaches, which only account for a very small part of the observed wavefields, inherently suffer from a limited resolution, and in complex environments may prove to be inadequate. These problems can potentially be alleviated through waveform inversion. We have developed an acoustic waveform inversion approach for crosshole seismic data whose kernel is based on a finite-difference time-domain (FDTD) solution of the 2-D acoustic wave equations. This algorithm is tested on and applied to synthetic data from seismic velocity models of increasing complexity and realism, and the results are compared to those obtained using state-of-the-art ray-based traveltime tomography. Regardless of the heterogeneity of the underlying models, the waveform inversion approach has the potential to reliably resolve both the geometry and the acoustic properties of features smaller than half a dominant wavelength. Our results do, however, also indicate that, within their inherent resolution limits, ray-based approaches provide an effective and efficient means to obtain satisfactory tomographic reconstructions of the seismic velocity structure in the presence of mild to moderate heterogeneity and in the absence of strong scattering. Conversely, the excess effort of waveform inversion provides the greatest benefits for the most heterogeneous, and arguably most realistic, environments where multiple scattering effects tend to be prevalent and ray-based methods lose most of their effectiveness.
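
For a flavor of the FDTD kernel such a scheme is built on, here is a minimal 2-D acoustic leapfrog stepper (constant density, periodic edges via np.roll, no absorbing boundaries, no inversion machinery). It is a sketch of the standard second-order scheme, not the authors' code.

```python
import numpy as np

def acoustic_fdtd(v, src, dt, dx, n_steps):
    """Leapfrog time stepping of p_tt = v^2 * laplacian(p) with a point
    source. v: velocity model (ny, nx); src: (iy, ix, wavelet).
    Stability requires roughly dt <= dx / (v.max() * sqrt(2))."""
    iy, ix, wavelet = src
    p_prev = np.zeros_like(v, dtype=float)
    p = np.zeros_like(v, dtype=float)
    c2 = (v * dt / dx) ** 2
    for it in range(n_steps):
        # 5-point Laplacian stencil (grid spacing absorbed into c2).
        lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
               np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
        p_next = 2.0 * p - p_prev + c2 * lap
        if it < len(wavelet):
            p_next[iy, ix] += wavelet[it] * dt ** 2   # inject source
        p_prev, p = p, p_next
    return p
```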

Relevance:

30.00%

Publisher:

Abstract:

In recent years there has been an explosive growth in the development of adaptive and data-driven methods. One of the efficient data-driven approaches is based on statistical learning theory (SLT) (Vapnik 1998). The theory is based on the Structural Risk Minimisation (SRM) principle and has a solid statistical background. When applying SRM, we try not only to reduce the training error, i.e. to fit the available data with a model, but also to reduce the complexity of the model and the generalisation error. Many nonlinear learning procedures recently developed in neural networks and statistics can be understood and interpreted in terms of the structural risk minimisation inductive principle. A recent methodology based on SRM is called Support Vector Machines (SVM). At present SLT is still under intensive development and SVM are finding new areas of application (www.kernel-machines.org). SVM develop robust and nonlinear data models with excellent generalisation abilities, which is very important for both monitoring and forecasting. SVM are extremely good when the input space is high-dimensional and the training data set is not big enough to develop a corresponding nonlinear model. Moreover, SVM use only support vectors to derive decision boundaries. This opens a way to sampling optimisation, estimation of noise in data, quantification of data redundancy, etc. A presentation of SVM for spatially distributed data is given in (Kanevski and Maignan 2004).
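
The property exploited above, that the decision boundary depends only on the support vectors, is easy to demonstrate with a standard library; the toy data set below is illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Nonlinear SVM on toy 2-D data with a circular class boundary: only a
# subset of the samples end up as support vectors, and only they define
# the decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

model = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("support vectors:", len(model.support_), "of", len(X), "samples")
```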

Relevance:

30.00%

Publisher:

Abstract:

In addition to genetic changes affecting the function of gene products, changes in gene expression have been suggested to underlie many or even most of the phenotypic differences among mammals. However, detailed gene expression comparisons were, until recently, restricted to closely related species, owing to technological limitations. Thus, we took advantage of the latest technologies (RNA-Seq) to generate extensive qualitative and quantitative transcriptome data for a unique collection of somatic and germline tissues from representatives of all major mammalian lineages (placental mammals, marsupials and monotremes) and birds, the evolutionary outgroup.

In the first major project of my thesis, we performed global comparative analyses of gene expression levels based on these data. Our analyses provided fundamental insights into the dynamics of transcriptome change during mammalian evolution (e.g., the rate of expression change across species, tissues and chromosomes) and allowed the exploration of the functional relevance and phenotypic implications of transcription changes at a genome-wide scale (e.g., we identified numerous potentially selectively driven expression switches).

In a second project of my thesis, also based on the unique transcriptome data generated in the context of the first project, we focused on the evolution of alternative splicing in mammals. Alternative splicing contributes to transcriptome complexity by generating several transcript isoforms from a single gene, which can thus perform various functions. To complete the global comparative analysis of gene expression changes, we explored patterns of alternative splicing evolution. This work uncovered several general and unexpected patterns of alternative splicing evolution (e.g., we found that alternative splicing evolves extremely rapidly), as well as a large number of conserved alternative isoforms that may be crucial for the functioning of mammalian organs.

Finally, the third project of my PhD consisted of analyzing in detail the unique functional and evolutionary properties of the testis by exploring the extent of its transcriptome complexity. This organ was previously shown to evolve rapidly at both the phenotypic and molecular level, apparently because of the specific pressures that act on it and are associated with its reproductive function. Moreover, my analyses of the amniote tissue transcriptome data described above revealed strikingly widespread transcriptional activity of both functional and nonfunctional genomic elements in the testis compared with the other organs. To elucidate the cellular source and mechanisms underlying this promiscuous transcription in the testis, we generated deep-coverage RNA-Seq data for all major testis cell types, as well as epigenetic data (DNA and histone methylation), using the mouse as a model system. The integration of this complete dataset revealed that meiotic and especially post-meiotic germ cells are the major contributors to the widespread functional and nonfunctional transcriptome complexity of the testis, and that this "promiscuous" spermatogenic transcription results, at least partially, from an overall transcriptionally permissive chromatin state. We hypothesize that this particular open state of the chromatin results from the extensive chromatin remodeling that occurs during spermatogenesis, which ultimately leads to the replacement of histones by protamines in the mature spermatozoa. Our results have important functional and evolutionary implications (e.g., regarding new gene birth and testicular gene expression evolution).

Generally, these three large-scale projects of my thesis provide complete and massive datasets that constitute valuable resources for further functional and evolutionary analyses of mammalian genomes.

Relevance:

30.00%

Publisher:

Abstract:

New stratigraphic data along a profile from the Helvetic Gotthard Massif to the remnants of the North Penninic Basin in eastern Ticino and Graubünden are presented. The stratigraphic record, together with existing geochemical and structural data, motivates a new interpretation of the fossil European distal margin. We introduce a new group of Triassic facies, the North Penninic Triassic (NPT), which is characterised by the Ladinian "dolomie bicolori". The NPT was located between the Briançonnais carbonate platform and the Helvetic lands. The observed horizontal transition, coupled with the stratigraphic superposition of a Helvetic Liassic on a Briançonnais Triassic in the Luzzone-Terri nappe, links, prior to the Jurassic rifting, the Briançonnais paleogeographic domain to the Helvetic margin, south of the Gotthard. Our observations suggest that the Jurassic rifting separated the Briançonnais domain from the Helvetic margin by complex and protracted extension. The syn-rift stratigraphic record in the Adula nappe and its surroundings suggests the presence of a diffuse rising area with only moderately subsiding basins above a thinned continental and proto-oceanic crust. Strong subsidence occurred in a second phase, following the protracted extension and the resulting delamination of the rising area. The stratigraphic coherency of the Adula's Mesozoic record calls into question the idea of a lithospheric mélange in the eclogitic Adula nappe, which is more likely a coherent Alpine tectonic unit. The structural and stratigraphic observations in the Piz Terri-Lunschania zone suggest the activity of syn-rift detachments. During the Alpine collision these faults were reactivated (and inverted) and played a major role in allowing the Adula subduction and the "Penninic Thrust" above it, and in creating the structural complexity of the Central Alps.

Relevance:

30.00%

Publisher:

Abstract:

Peptide toxins synthesized by venomous animals have been extensively studied in recent decades. To be useful to the scientific community, this knowledge has been stored, annotated and made easy to retrieve through several databases. The aim of this article is to present the type of information users can access in each database. ArachnoServer and ConoServer focus on spider toxins and cone snail toxins, respectively. UniProtKB, a generalist protein knowledgebase, has an animal toxin-dedicated annotation program that includes toxins from all venomous animals. Finally, the ATDB metadatabase compiles data and annotations from other databases and provides a toxin ontology.
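
As a hedged example of programmatic access to one of these resources, the snippet below queries UniProtKB's public REST search endpoint for reviewed entries carrying the 'Toxin' keyword (KW-0800). The endpoint and parameters reflect the current rest.uniprot.org API, which postdates this article, so treat them as an assumption rather than the interface the authors describe.

```python
import requests

# Query UniProtKB for reviewed (Swiss-Prot) entries annotated with the
# 'Toxin' keyword, returning a small tab-separated sample.
url = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "keyword:KW-0800 AND reviewed:true",
    "format": "tsv",
    "fields": "accession,protein_name,organism_name",
    "size": 10,
}
print(requests.get(url, params=params, timeout=30).text)
```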

Relevance:

30.00%

Publisher:

Abstract:

The aim of this study was to assess a population of patients with diabetes mellitus by means of the INTERMED, a classification system for case complexity integrating biological, psychosocial and health care-related aspects of disease. The main hypothesis was that the INTERMED would identify distinct clusters of patients with different degrees of case complexity and different clinical outcomes. Patients (n=61) referred to a tertiary reference care centre were evaluated with the INTERMED and followed for 9 months for HbA1c values and for 6 months for health care utilisation. Cluster analysis revealed two clusters: cluster 1 (62%), consisting of complex patients with high INTERMED scores, and cluster 2 (38%), consisting of less complex patients with lower INTERMED scores. Cluster 1 patients showed significantly higher HbA1c values and a tendency toward increased health care utilisation. Total INTERMED scores were significantly related to HbA1c and explained 21% of its variance. In conclusion, different clusters of patients with different degrees of case complexity were identified by the INTERMED, allowing the detection of highly complex patients at risk for poor diabetes control. The INTERMED therefore provides an objective basis for clinical and scientific progress in diabetes mellitus. Ongoing intervention studies will have to confirm these preliminary data and evaluate whether management strategies based on INTERMED profiles improve outcomes.
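
The abstract does not name the clustering algorithm used. As a hedged illustration of the analysis pattern (a two-cluster partition of INTERMED item scores followed by a per-cluster HbA1c comparison), here is a k-means sketch with hypothetical variable names; the study's actual procedure may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def intermed_clusters(scores, hba1c, n_clusters=2, seed=0):
    """Partition patients by their INTERMED item scores, then compare
    HbA1c across the resulting clusters.
    scores: (n_patients x n_items) INTERMED item matrix (illustrative)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(scores)
    for c in range(n_clusters):
        mask = labels == c
        print(f"cluster {c}: n={mask.sum()}, "
              f"mean HbA1c={hba1c[mask].mean():.2f}")
    return labels
```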