962 results for data complexity


Relevance: 30.00%

Abstract:

SUMMARY: Large sets of data, such as expression profiles from many samples, require analytic tools to reduce their complexity. The Iterative Signature Algorithm (ISA) is a biclustering algorithm designed to decompose a large data set into so-called 'modules'. In the context of gene expression data, these modules consist of subsets of genes that exhibit a coherent expression profile only over a subset of microarray experiments. Genes and arrays may be attributed to multiple modules, and the level of required coherence can be varied, resulting in different 'resolutions' of the modular mapping. In this short note, we introduce two BioConductor software packages written in GNU R: the isa2 package includes an optimized implementation of the ISA, and the eisa package provides a convenient interface to run the ISA, visualize its output and put the biclusters into biological context. Potential users of these packages are all R and BioConductor users dealing with tabular (e.g. gene expression) data. AVAILABILITY: http://www.unil.ch/cbg/ISA CONTACT: sven.bergmann@unil.ch
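
The core iteration behind the ISA can be sketched in a few lines. This is a minimal illustration only, not the implementation shipped in isa2/eisa (which are written in GNU R); the normalisation, random seeding and threshold choices below are simplified assumptions, and Python is used purely for illustration.

```python
import numpy as np

def isa_module(E, t_g=2.0, t_a=2.0, n_iter=50, seed=0):
    """One run of a minimal Iterative Signature Algorithm (ISA) sketch.

    Alternately score arrays over the current gene set and genes over the
    current array set, thresholding each time, until the module stabilises.
    Normalisation and thresholding details are simplified assumptions; the
    isa2 package implements the full, optimised procedure."""
    rng = np.random.default_rng(seed)
    # row- and column-standardised copies of the expression matrix
    Eg = (E - E.mean(1, keepdims=True)) / E.std(1, keepdims=True)
    Ea = (E - E.mean(0, keepdims=True)) / E.std(0, keepdims=True)

    genes = rng.random(E.shape[0]) < 0.05          # random seed gene set
    genes[rng.integers(E.shape[0])] = True         # guarantee a non-empty seed
    arrays = np.ones(E.shape[1], dtype=bool)
    for _ in range(n_iter):
        a_scores = Eg[genes].mean(0)               # array coherence over genes
        arrays = np.abs(a_scores) > t_a * a_scores.std()
        if not arrays.any():
            break                                  # module died out
        g_scores = Ea[:, arrays].mean(1)           # gene coherence over arrays
        new_genes = np.abs(g_scores) > t_g * g_scores.std()
        if not new_genes.any() or np.array_equal(new_genes, genes):
            break                                  # empty or converged
        genes = new_genes
    return genes, arrays                           # boolean masks of the module
```

Raising the thresholds t_g and t_a demands more coherence and yields smaller, tighter modules, which is the 'resolution' knob described above.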

Relevance: 30.00%

Abstract:

The integration of geophysical data into the subsurface characterization problem has been shown in many cases to significantly improve hydrological knowledge by providing information at spatial scales and locations that are unattainable using conventional hydrological measurement techniques. Exactly how much benefit geophysical data bring in terms of their effect on hydrological predictions, however, has received considerably less attention in the literature. Here, we examine the potential hydrological benefits of a recently introduced simulated annealing (SA) conditional stochastic simulation method designed for the assimilation of diverse hydrogeophysical data sets. We consider the specific case of integrating crosshole ground-penetrating radar (GPR) and borehole porosity log data to characterize the porosity distribution in saturated heterogeneous aquifers. In many cases, porosity is linked to hydraulic conductivity and thus to flow and transport behavior. To perform our evaluation, we first generate a number of synthetic porosity fields exhibiting varying degrees of spatial continuity and structural complexity. Next, we simulate the collection of crosshole GPR data between several boreholes in these fields, and the collection of porosity log data at the borehole locations. The inverted GPR data, together with the porosity logs, are then used to reconstruct the porosity field using the SA-based method, along with a number of other more elementary approaches. Assuming that the grid-cell-scale relationship between porosity and hydraulic conductivity is unique and known, the porosity realizations are then used in groundwater flow and contaminant transport simulations to assess the benefits and limitations of the different approaches.
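
The general idea of an SA-based conditional simulation can be illustrated with a toy sketch. The objective function, perturbation scheme and cooling schedule of the method evaluated in the study are not reproduced here; every numerical choice below is an assumption for illustration only.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sa_condition(porosity_gpr, borehole_cols, borehole_logs,
                 n_steps=20000, t0=1.0, cooling=0.9995, seed=0):
    """Toy simulated-annealing conditioning of a 2-D porosity field.

    porosity_gpr  : smooth porosity image derived from inverted GPR data
    borehole_cols : column indices of the boreholes
    borehole_logs : measured porosity profiles at those columns (hard data)

    The field is initialised by resampling the borehole histogram, borehole
    cells are fixed, and free cells are swapped so that the locally averaged
    field matches the GPR image. All tuning constants are illustrative."""
    rng = np.random.default_rng(seed)
    nz, nx = porosity_gpr.shape
    hard = np.zeros((nz, nx), dtype=bool)
    field = rng.choice(np.concatenate(borehole_logs), size=(nz, nx))
    for j, log_vals in zip(borehole_cols, borehole_logs):
        field[:, j] = log_vals          # honour hard data exactly
        hard[:, j] = True
    free = np.flatnonzero(~hard.ravel())

    def objective(f):
        # mismatch between the locally averaged field and the GPR image
        return np.mean((uniform_filter(f, size=5) - porosity_gpr) ** 2)

    obj, T = objective(field), t0
    for _ in range(n_steps):
        i, j = rng.choice(free, size=2, replace=False)
        flat = field.ravel()
        flat[i], flat[j] = flat[j], flat[i]          # propose a swap
        new_obj = objective(field)
        if new_obj < obj or rng.random() < np.exp((obj - new_obj) / T):
            obj = new_obj                            # accept the perturbation
        else:
            flat[i], flat[j] = flat[j], flat[i]      # reject, swap back
        T *= cooling
    return field
```

Because only free cells are perturbed, the borehole logs are reproduced exactly in every realization, which is the defining property of conditional simulation.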

Relevance: 30.00%

Abstract:

Clonally complex infections by Mycobacterium tuberculosis are increasingly accepted. Studies of their extent in epidemiological scenarios where the infective pressure is not high are scarce. Our study systematically searched for clonally complex infections (mixed infections by more than one strain and the simultaneous presence of clonal variants) by applying mycobacterial interspersed repetitive-unit (MIRU)-variable-number tandem-repeat (VNTR) analysis to M. tuberculosis isolates from two population-based samples of respiratory (703 cases) and respiratory-extrapulmonary (R+E) tuberculosis (TB) cases (71 cases) in a context of moderate TB incidence. Clonally complex infections were found in 11 (1.6%) of the respiratory TB cases and in 10 (14.1%) of those with R+E TB. Among the 21 cases with clonally complex TB, 9 were infected by 2 independent strains and the remaining 12 showed the simultaneous presence of 2 to 3 clonal variants. For the 10 R+E TB cases with clonally complex infections, compartmentalization (different compositions of strains/clonal variants in independent infected sites) was found in 9. All the strains/clonal variants were also genotyped by IS6110-based restriction fragment length polymorphism analysis, which split two MIRU-defined clonal variants, although in general it showed a lower discriminatory power for identifying the clonal heterogeneity revealed by MIRU-VNTR analysis. The comparative analysis of IS6110 insertion sites between coinfecting clonal variants showed differences in the genes coding for a cutinase, a PPE family protein, and two conserved hypothetical proteins. Diagnostic delay, previous TB, risk of overexposure, and the clustered/orphan status of the involved strains were analyzed to propose possible explanations for the cases with clonally complex infections. Our study characterizes in detail all the clonally complex infections by M. tuberculosis found in a systematic survey and shows that these phenomena can be found to a higher extent than expected, even in an unselected population-based sample lacking high infective pressure.

Relevance: 30.00%

Abstract:

The aim of this study was to propose a methodology allowing a detailed characterization of the sit-to-stand/stand-to-sit postural transition. Parameters characterizing the kinematics of the trunk movement during the sit-to-stand (Si-St) postural transition were calculated using a single inertial sensor system fixed on the trunk and a data logger. The dynamic complexity of these postural transitions was estimated by the fractal dimension of the acceleration-angular velocity plot. We concluded that this method provides a simple and accurate tool for monitoring frail elderly subjects and for objectively evaluating the efficacy of a rehabilitation program.
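
The complexity measure described here can be illustrated with a box-counting estimate of the fractal dimension of the acceleration versus angular-velocity trajectory. The study does not specify its exact estimator, so the box-counting variant and all parameter values below are assumptions.

```python
import numpy as np

def box_counting_dimension(acc, gyro, n_scales=10):
    """Estimate the fractal dimension of the acceleration vs. angular
    velocity plot of one Si-St transition by box counting.

    acc, gyro : 1-D arrays sampled over a single postural transition.
    Box counting is one common estimator; the original study's exact
    dimension estimator is not reproduced here."""
    # normalise both signals to [0, 1] so boxes are comparable in each axis
    pts = np.column_stack([acc, gyro]).astype(float)
    pts = (pts - pts.min(0)) / (pts.max(0) - pts.min(0))

    sizes = np.logspace(-2, -0.5, n_scales)        # box edge lengths
    counts = []
    for s in sizes:
        # count occupied boxes of edge length s covering the trajectory
        boxes = np.unique(np.floor(pts / s), axis=0)
        counts.append(len(boxes))
    # slope of log(count) vs log(1/size) estimates the fractal dimension
    slope, _ = np.polyfit(np.log(1.0 / sizes), np.log(counts), 1)
    return slope
```

A smooth, stereotyped transition fills few boxes at every scale and yields a dimension near 1, while a more erratic trajectory yields a higher value, which is why the dimension can serve as a complexity score for frail subjects.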

Relevance: 30.00%

Abstract:

Imaging mass spectrometry (IMS) represents an innovative tool in the cancer research pipeline, which is increasingly being used in clinical and pharmaceutical applications. The unique properties of the technique, especially the amount of data generated, make the handling of data from multiple IMS acquisitions challenging. This work presents a histology-driven IMS approach aiming to identify discriminant lipid signatures from the simultaneous mining of IMS data sets from multiple samples. The feasibility of the developed workflow is evaluated on a set of three human colorectal cancer liver metastasis (CRCLM) tissue sections. Lipid IMS on tissue sections was performed using MALDI-TOF/TOF MS in both negative and positive ionization modes after 1,5-diaminonaphthalene matrix deposition by sublimation. Results from the positive and negative acquisitions were combined during data mining to simplify the process and interrogate a larger lipidome in a single analysis. To reduce the complexity of the IMS data sets, a reduced data set was generated by randomly selecting a fixed number of spectra from a histologically defined region of interest, resulting in a 10-fold data reduction. Principal component analysis confirmed that the molecular selectivity of the regions of interest is maintained after data reduction. Partial least-squares and heat map analyses demonstrated a selective signature of the CRCLM, revealing lipids that are significantly up- and down-regulated in the tumor region. This comprehensive approach is thus of interest for defining disease signatures directly from IMS data sets by the use of combinatory data mining, opening novel routes of investigation for addressing the demands of the clinical setting.
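
The data-reduction step described above (random selection of a fixed number of spectra per histology-defined region of interest, followed by PCA to verify that molecular selectivity is preserved) can be sketched as follows; the function, variable names and parameter values are illustrative assumptions, not the published workflow.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_and_check(spectra, roi_labels, n_per_roi=100, n_components=2, seed=0):
    """Randomly keep a fixed number of spectra per histology-defined ROI,
    then run PCA to check that the ROIs stay separable after reduction.

    spectra    : (n_spectra, n_mz) intensity matrix pooled from the IMS data sets
    roi_labels : one ROI label per spectrum (e.g. 'tumor', 'liver')
    Parameter values and normalisation are illustrative only."""
    rng = np.random.default_rng(seed)
    roi_labels = np.asarray(roi_labels)
    keep = []
    for roi in np.unique(roi_labels):
        idx = np.flatnonzero(roi_labels == roi)
        keep.append(rng.choice(idx, size=min(n_per_roi, idx.size), replace=False))
    keep = np.concatenate(keep)

    X = spectra[keep].astype(float)
    X = X / X.sum(axis=1, keepdims=True)           # per-spectrum TIC normalisation
    scores = PCA(n_components=n_components).fit_transform(X)
    return keep, scores                            # kept indices and PC scores
```

If the PC score plot still separates the tumor and surrounding-tissue spectra after subsampling, the reduction has preserved the discriminant information needed for the subsequent PLS and heat map analyses.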

Relevance: 30.00%

Abstract:

One of the tantalising remaining problems in compositional data analysis lies in how to deal with data sets in which there are components which are essential zeros. By an essential zero we mean a component which is truly zero, not something recorded as zero simply because the experimental design or the measuring instrument has not been sufficiently sensitive to detect a trace of the part. Such essential zeros occur in many compositional situations, such as household budget patterns, time budgets, palaeontological zonation studies, and ecological abundance studies. Devices such as nonzero replacement and amalgamation are almost invariably ad hoc and unsuccessful in such situations. From consideration of such examples it seems sensible to build up a model in two stages, the first determining where the zeros will occur and the second how the unit available is distributed among the non-zero parts. In this paper we suggest two such models, an independent binomial conditional logistic normal model and a hierarchical dependent binomial conditional logistic normal model. The compositional data in such modelling consist of an incidence matrix and a conditional compositional matrix. Interesting statistical problems arise, such as the question of estimability of parameters, the nature of the computational process for the estimation of both the incidence and compositional parameters caused by the complexity of the subcompositional structure, the formation of meaningful hypotheses, and the devising of suitable testing methodology within a lattice of such essential zero-compositional hypotheses. The methodology is illustrated by application to both simulated and real compositional data.
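
A compact way to write the two-stage structure down is sketched below. The notation is ours, not necessarily the paper's: the first stage corresponds to the independent binomial variant, while the hierarchical dependent model instead couples the incidence indicators rather than treating them as independent.

```latex
% Stage 1 (incidence): which of the D parts of composition i are non-zero
Z_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad j = 1,\dots,D .

% Stage 2 (conditional composition): given the incidence pattern Z_i with
% d_i non-zero parts, the available unit is shared among those parts
% according to an additive logistic normal distribution
y_{ik} = \log\frac{x_{ik}}{x_{i d_i}}, \quad k = 1,\dots,d_i - 1,
\qquad y_i \mid Z_i \sim \mathcal{N}_{d_i - 1}\!\left(\mu_{Z_i},\, \Sigma_{Z_i}\right).
```

The observed data then split naturally into the incidence matrix of the $Z_{ij}$ and the conditional compositional matrix of the non-zero parts, which is exactly the decomposition referred to in the abstract.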

Relevance: 30.00%

Abstract:

High-resolution tomographic imaging of the shallow subsurface is becoming increasingly important for a wide range of environmental, hydrological and engineering applications. Because of their superior resolution power, their sensitivity to pertinent petrophysical parameters, and their far-reaching complementarity, both seismic and georadar crosshole imaging are of particular importance. To date, corresponding approaches have largely relied on asymptotic, ray-based techniques, which account for only a very small part of the observed wavefields, inherently suffer from limited resolution, and may prove inadequate in complex environments. These problems can potentially be alleviated through waveform inversion. We have developed an acoustic waveform inversion approach for crosshole seismic data whose kernel is based on a finite-difference time-domain (FDTD) solution of the 2-D acoustic wave equations. This algorithm is tested on and applied to synthetic data from seismic velocity models of increasing complexity and realism, and the results are compared to those obtained using state-of-the-art ray-based traveltime tomography. Regardless of the heterogeneity of the underlying models, the waveform inversion approach has the potential to reliably resolve both the geometry and the acoustic properties of features smaller than half a dominant wavelength. Our results do, however, also indicate that, within their inherent resolution limits, ray-based approaches provide an effective and efficient means of obtaining satisfactory tomographic reconstructions of the seismic velocity structure in the presence of mild to moderate heterogeneity and in the absence of strong scattering. Conversely, the extra effort of waveform inversion provides the greatest benefits in the most heterogeneous, and arguably most realistic, environments, where multiple scattering effects tend to be prevalent and ray-based methods lose most of their effectiveness.
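
The forward engine inside such a kernel is a discretisation of the 2-D acoustic wave equation, p_tt = v^2 (p_xx + p_zz) + s. A minimal second-order FDTD time loop might look as follows; boundary treatment, the source wavelet and the surrounding inversion loop are omitted, and all names and grid parameters are illustrative.

```python
import numpy as np

def fdtd_acoustic_2d(vel, src, src_pos, dt, dx, n_steps):
    """Minimal 2-D acoustic FDTD kernel: advance the pressure field p with a
    second-order-in-time, second-order-in-space scheme of
        p_tt = v^2 * (p_xx + p_zz) + s.

    vel     : (nz, nx) velocity model [m/s]
    src     : source wavelet samples, len(src) >= n_steps
    src_pos : (iz, ix) grid position of the source
    Boundaries are simply truncated here; a real kernel would add absorbing
    layers, and the waveform inversion loop around it is not shown."""
    nz, nx = vel.shape
    p_old = np.zeros((nz, nx))
    p = np.zeros((nz, nx))
    c2 = (vel * dt / dx) ** 2                      # Courant factor squared
    for it in range(n_steps):
        lap = np.zeros_like(p)
        # five-point Laplacian on interior nodes (units of dx**2 absorbed in c2)
        lap[1:-1, 1:-1] = (p[2:, 1:-1] + p[:-2, 1:-1] +
                           p[1:-1, 2:] + p[1:-1, :-2] - 4.0 * p[1:-1, 1:-1])
        p_new = 2.0 * p - p_old + c2 * lap
        p_new[src_pos] += src[it] * dt ** 2        # inject the source wavelet
        p_old, p = p, p_new
    return p
```

Waveform inversion wraps a loop around this forward solver: the simulated crosshole gathers are compared with the observed ones and the velocity model is updated until the full waveforms, not just the first-arrival traveltimes, are matched.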

Relevance: 30.00%

Abstract:

The increasing volume of data describing human disease processes and the growing complexity of understanding, managing, and sharing such data present a huge challenge for clinicians and medical researchers. This paper presents the @neurIST system, which provides an infrastructure for biomedical research while aiding clinical care, by bringing together heterogeneous data and complex processing and computing services. Although @neurIST targets the investigation and treatment of cerebral aneurysms, the system's architecture is generic enough that it could be adapted to the treatment of other diseases. Innovations in @neurIST include confining the patient data pertaining to aneurysms inside a single environment that offers clinicians the tools to analyze and interpret patient data and make use of knowledge-based guidance in planning their treatment. Medical researchers gain access to a critical mass of aneurysm-related data due to the system's ability to federate distributed information sources. A semantically mediated grid infrastructure ensures that both clinicians and researchers are able to seamlessly access and work on data distributed across multiple sites in a secure way, in addition to providing computing resources on demand for performing computationally intensive simulations for treatment planning and research.
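
Purely as an illustration of the federation idea — the record fields, interface and sites below are hypothetical and are not the @neurIST interfaces — a mediation layer fans a query out to each participating site and merges the results into one view for the researcher.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class AneurysmRecord:
    patient_id: str          # pseudonymised identifier (illustrative field)
    site: str                # contributing hospital or registry
    neck_width_mm: float     # example clinical measurement

class DataSource:
    """Interface a single site exposes to the federation layer (hypothetical)."""
    def query(self, criteria: dict) -> Iterable[AneurysmRecord]:
        raise NotImplementedError

def federated_query(sources: List[DataSource], criteria: dict) -> List[AneurysmRecord]:
    """Fan a query out to every registered site and merge the results.
    Real systems add semantic mediation (mapping local schemas onto a shared
    ontology), authentication and auditing; none of that is shown here."""
    results: List[AneurysmRecord] = []
    for source in sources:
        results.extend(source.query(criteria))
    return results
```

The value of such a layer is exactly the "critical mass" argument made above: no single site holds enough aneurysm cases for robust research, but a federated query over many sites can.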

Relevance: 30.00%

Abstract:

One of the disadvantages of old age is that there is more past than future: this, however, may be turned into an advantage if the wealth of experience and, hopefully, wisdom gained in the past can be reflected upon and throw some light on possible future trends. To an extent, then, this talk is necessarily personal, certainly nostalgic, but also self-critical and inquisitive about our understanding of the discipline of statistics. A number of almost philosophical themes will run through the talk: the search for appropriate modelling in relation to the real problem envisaged, emphasis on sensible balances between simplicity and complexity, the relative roles of theory and practice, the nature of communication of inferential ideas to the statistical layman, and the inter-related roles of teaching, consultation and research. A list of keywords might be: identification of the sample space and its mathematical structure, choices between transform and stay, the role of parametric modelling, the role of a sample space metric, the underused hypothesis lattice, and the nature of compositional change, particularly in relation to the modelling of processes. While the main theme will be relevance to compositional data analysis, we shall point to substantial implications for general multivariate analysis arising from experience of the development of compositional data analysis…

Relevance: 30.00%

Abstract:

We continue the development of a method for the selection of a bandwidth or a number of design parameters in density estimation. We provide explicit non-asymptotic density-free inequalities that relate the $L_1$ error of the selected estimate with that of the best possible estimate, and study in particular the connection between the richness of the class of density estimates and the performance bound. For example, our method allows one to pick the bandwidth and kernel order in the kernel estimate simultaneously and still assure that, for {\it all densities}, the $L_1$ error of the corresponding kernel estimate is not larger than about three times the error of the estimate with the optimal smoothing factor and kernel, plus a constant times $\sqrt{\log n/n}$, where $n$ is the sample size and the constant only depends on the complexity of the family of kernels used in the estimate. Further applications include multivariate kernel estimates, transformed kernel estimates, and variable kernel estimates.
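
In symbols, the guarantee described above reads roughly as follows; the notation is ours and paraphrases the stated bound rather than quoting the paper's exact statement.

```latex
\mathbb{E}\,\bigl\| f_{\hat h,\hat K} - f \bigr\|_{1}
  \;\le\; 3\,\min_{h,\,K}\, \mathbb{E}\,\bigl\| f_{h,K} - f \bigr\|_{1}
  \;+\; C\,\sqrt{\frac{\log n}{n}} \qquad \text{for all densities } f,
```

where $f_{h,K}$ denotes the kernel estimate with bandwidth $h$ and kernel $K$, $(\hat h,\hat K)$ is the data-driven selection, $n$ is the sample size, and $C$ depends only on the complexity (richness) of the family of kernels considered.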

Relevance: 30.00%

Abstract:

In recent years there has been explosive growth in the development of adaptive and data-driven methods. One efficient data-driven approach is based on statistical learning theory (SLT) (Vapnik 1998). The theory is based on the Structural Risk Minimisation (SRM) principle and has a solid statistical background. When applying SRM we try not only to reduce the training error, i.e. to fit the available data with a model, but also to reduce the complexity of the model and thereby the generalisation error. Many nonlinear learning procedures recently developed in neural networks and statistics can be understood and interpreted in terms of the structural risk minimisation inductive principle. A recent methodology based on SRM is Support Vector Machines (SVM). At present SLT is still under intensive development and SVM are finding new areas of application (www.kernel-machines.org). SVM develop robust and nonlinear data models with excellent generalisation abilities, which is very important both for monitoring and forecasting. SVM are extremely good when the input space is high dimensional and the training data set is not big enough to develop a corresponding nonlinear model. Moreover, SVM use only support vectors to derive decision boundaries. This opens a way to sampling optimization, estimation of noise in data, quantification of data redundancy, etc. A presentation of SVM for spatially distributed data is given in (Kanevski and Maignan 2004).
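
A minimal illustration of fitting an SVM-family model to spatially distributed data is given below (scikit-learn is used for convenience; the synthetic data set, kernel and hyper-parameter grid are placeholders). Tuning C, gamma and epsilon is where the SRM-style trade-off between fitting the data and limiting model complexity shows up in practice, and the support-vector count reflects the data-redundancy point made above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# illustrative spatial data set: coordinates (x, y) -> noisy measured value
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(200, 2))
values = (np.sin(4 * coords[:, 0]) * np.cos(3 * coords[:, 1])
          + 0.1 * rng.normal(size=200))

# RBF support vector regression; C and epsilon trade training error against
# model complexity, in the spirit of structural risk minimisation
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1, 10, 100], "epsilon": [0.01, 0.1]},
    cv=5,
)
search.fit(coords, values)
model = search.best_estimator_
print("support vectors:", model.support_vectors_.shape[0], "of", len(coords))
```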

Relevance: 30.00%

Abstract:

In addition to genetic changes affecting the function of gene products, changes in gene expression have been suggested to underlie many or even most of the phenotypic differences among mammals. However, detailed gene expression comparisons were, until recently, restricted to closely related species, owing to technological limitations. Thus, we took advantage of the latest technologies (RNA-Seq) to generate extensive qualitative and quantitative transcriptome data for a unique collection of somatic and germline tissues from representatives of all major mammalian lineages (placental mammals, marsupials and monotremes) and birds, the evolutionary outgroup. In the first major project of my thesis, we performed global comparative analyses of gene expression levels based on these data. Our analyses provided fundamental insights into the dynamics of transcriptome change during mammalian evolution (e.g., the rate of expression change across species, tissues and chromosomes) and allowed the exploration of the functional relevance and phenotypic implications of transcription changes at a genome-wide scale (e.g., we identified numerous potentially selectively driven expression switches). In a second project of my thesis, also based on the unique transcriptome data generated in the context of the first project, we focused on the evolution of alternative splicing in mammals. Alternative splicing contributes to transcriptome complexity by generating several transcript isoforms from a single gene, which can thus perform various functions. To complete the global comparative analysis of gene expression changes, we explored patterns of alternative splicing evolution. This work uncovered several general and unexpected patterns of alternative splicing evolution (e.g., we found that alternative splicing evolves extremely rapidly) as well as a large number of conserved alternative isoforms that may be crucial for the functioning of mammalian organs. Finally, the third and final project of my PhD consisted of analyzing in detail the unique functional and evolutionary properties of the testis by exploring the extent of its transcriptome complexity. This organ was previously shown to evolve rapidly at both the phenotypic and molecular level, apparently because of the specific pressures that act on it and are associated with its reproductive function. Moreover, my analyses of the amniote tissue transcriptome data described above revealed strikingly widespread transcriptional activity of both functional and nonfunctional genomic elements in the testis compared to the other organs. To elucidate the cellular source and mechanisms underlying this promiscuous transcription in the testis, we generated deep-coverage RNA-Seq data for all major testis cell types as well as epigenetic data (DNA and histone methylation), using the mouse as a model system. The integration of these complete datasets revealed that meiotic and especially post-meiotic germ cells are the major contributors to the widespread functional and nonfunctional transcriptome complexity of the testis, and that this "promiscuous" spermatogenic transcription results, at least partially, from an overall transcriptionally permissive chromatin state. We hypothesize that this particularly open state of the chromatin results from the extensive chromatin remodeling that occurs during spermatogenesis, which ultimately leads to the replacement of histones by protamines in the mature spermatozoa.
Our results have important functional and evolutionary implications (e.g., regarding new gene birth and testicular gene expression evolution). Generally, these three large-scale projects of my thesis provide complete and massive datasets that constitute valuable resources for further functional and evolutionary analyses of mammalian genomes.

Relevance: 30.00%

Abstract:

Digital information generates the possibility of a high degree of redundancy in the data available for fitting predictive models used for Digital Soil Mapping (DSM). Among these models, the Decision Tree (DT) technique has been increasingly applied because of its capacity to deal with large datasets. The purpose of this study was to evaluate the impact of the volume of data used to generate the DT models on the quality of the resulting soil maps. An area of 889.33 km² was chosen in the northern region of the State of Rio Grande do Sul. The soil-landscape relationship was obtained from field resurvey (reambulation) of the study area and the delineation of the units on the 1:50,000-scale topographic map. Six predictive covariates linked to the soil formation factors relief and organisms, together with data sets of 1, 3, 5, 10, 15, 20 and 25 % of the total data volume, were used to generate the predictive DT models in the data mining program Waikato Environment for Knowledge Analysis (WEKA). Sample densities below 5 % resulted in models with lower power to capture the complexity of the spatial distribution of the soil in the study area. The relation between the data volume to be handled and the predictive capacity of the models was best for samples between 5 and 15 %. For the models based on these sample densities, the collected field data indicated an accuracy of the predictive mapping close to 70 %.
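
The sample-density experiment can be sketched as follows; scikit-learn's CART tree stands in here for the WEKA tree learner used in the study, and the data names, split and densities are placeholder assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def accuracy_by_density(X, y, densities=(0.01, 0.03, 0.05, 0.10, 0.15, 0.20, 0.25),
                        seed=0):
    """Train a decision tree on increasing fractions of the grid cells and
    report held-out accuracy, mimicking the 1-25 % sample-density experiment.

    X : covariate matrix (one row per grid cell, e.g. six terrain covariates)
    y : soil map unit label per cell
    The WEKA tree learner is replaced by scikit-learn's CART implementation."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    rng = np.random.default_rng(seed)
    results = {}
    for d in densities:
        n = max(int(d * len(X_train)), 1)
        idx = rng.choice(len(X_train), size=n, replace=False)   # random subsample
        tree = DecisionTreeClassifier(random_state=seed).fit(X_train[idx], y_train[idx])
        results[d] = accuracy_score(y_test, tree.predict(X_test))
    return results
```

Plotting the returned accuracies against density reproduces the qualitative finding above: accuracy climbs steeply up to roughly 5 % and then flattens, so larger samples mostly add redundant data.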

Relevance: 30.00%

Abstract:

New stratigraphic data along a profile from the Helvetic Gotthard Massif to the remnants of the North Penninic Basin in eastern Ticino and Graubünden are presented. The stratigraphic record, together with existing geochemical and structural data, motivates a new interpretation of the fossil European distal margin. We introduce a new group of Triassic facies, the North-Penninic-Triassic (NPT), which is characterised by the Ladinian "dolomie bicolori". The NPT was located between the Briançonnais carbonate platform and the Helvetic lands. The observed horizontal transition, coupled with the stratigraphic superposition of Helvetic Liassic on Briançonnais Triassic in the Luzzone-Terri nappe, links the Briançonnais paleogeographic domain to the Helvetic margin, south of the Gotthard, prior to Jurassic rifting. Our observations suggest that the Jurassic rifting separated the Briançonnais domain from the Helvetic margin by complex and protracted extension. The syn-rift stratigraphic record in the Adula nappe and surroundings suggests the presence of a diffuse rising area with only moderately subsiding basins above a thinned continental and proto-oceanic crust. Strong subsidence occurred in a second phase following protracted extension and the resulting delamination of the rising area. The stratigraphic coherence of the Adula's Mesozoic record calls into question the idea of a lithospheric mélange in the eclogitic Adula nappe, which is more likely a coherent Alpine tectonic unit. The structural and stratigraphic observations in the Piz Terri-Lunschania zone suggest the activity of syn-rift detachments. During the Alpine collision these faults were reactivated (and inverted) and played a major role in allowing the Adula subduction and the "Penninic Thrust" above it, and in creating the structural complexity of the Central Alps.

Relevance: 30.00%

Abstract:

Peptide toxins synthesized by venomous animals have been extensively studied in recent decades. To be useful to the scientific community, this knowledge has been stored, annotated and made easy to retrieve through several databases. The aim of this article is to present what type of information users can access from each database. ArachnoServer and ConoServer focus on spider toxins and cone snail toxins, respectively. UniProtKB, a generalist protein knowledgebase, has an annotation program dedicated to animal toxins that includes toxins from all venomous animals. Finally, the ATDB metadatabase compiles data and annotations from other databases and provides a toxin ontology.