962 results for Clustering a large document collection


Relevance:

30.00% 30.00%

Publisher:

Abstract:

This thesis is a collection of works focused on the topic of Earthquake Early Warning, with special attention to large magnitude events. The topic is addressed from different points of view and the structure of the thesis reflects the variety of the aspects which have been analyzed. The first part is dedicated to the giant 2011 Tohoku-Oki earthquake. The main features of the rupture process are first discussed. The earthquake is then used as a case study to test the feasibility of Early Warning methodologies for very large events. Limitations of the standard approaches for large events arise in this chapter. The difficulties are related to the real-time magnitude estimate from the first few seconds of recorded signal. An evolutionary strategy for the real-time magnitude estimate is proposed and applied to the single Tohoku-Oki earthquake. In the second part of the thesis a larger number of earthquakes is analyzed, including small, moderate and large events. Starting from the measurement of two Early Warning parameters, the behavior of small and large earthquakes in the initial portion of recorded signals is investigated. The aim is to understand whether small and large earthquakes can be distinguished from the initial stage of their rupture process. A physical model and a plausible interpretation to justify the observations are proposed. The third part of the thesis is focused on practical, real-time approaches for the rapid identification of the potentially damaged zone during a seismic event. Two different approaches for the rapid prediction of the damage area are proposed and tested. The first one is a threshold-based method which uses traditional seismic data. Then an innovative approach using continuous GPS data is explored. Both strategies improve the prediction of large-scale effects of strong earthquakes.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Mycobacterium abscessus, Mycobacterium bolletii, and Mycobacterium massiliense (Mycobacterium abscessus sensu lato) are closely related species that currently are identified by the sequencing of the rpoB gene. However, recent studies show that rpoB sequencing alone is insufficient to discriminate between these species, and some authors have questioned their current taxonomic classification. We studied here a large collection of M. abscessus (sensu lato) strains by partial rpoB sequencing (752 bp) and multilocus sequence analysis (MLSA). The final MLSA scheme developed was based on the partial sequences of eight housekeeping genes: argH, cya, glpK, gnd, murC, pgm, pta, and purH. The strains studied included the three type strains (M. abscessus CIP 104536(T), M. massiliense CIP 108297(T), and M. bolletii CIP 108541(T)) and 120 isolates recovered between 1997 and 2007 in France, Germany, Switzerland, and Brazil. The rpoB phylogenetic tree confirmed the existence of three main clusters, each comprising the type strain of one species. However, divergence values between the M. massiliense and M. bolletii clusters all were below 3% and between the M. abscessus and M. massiliense clusters were from 2.66 to 3.59%. The tree produced using the concatenated MLSA gene sequences (4,071 bp) also showed three main clusters, each comprising the type strain of one species. The M. abscessus cluster had a bootstrap value of 100% and was mostly compact. Bootstrap values for the M. massiliense and M. bolletii branches were much lower (71 and 61%, respectively), with the M. massiliense cluster having a fuzzy aspect. Mean (range) divergence values were 2.17% (1.13 to 2.58%) between the M. abscessus and M. massiliense clusters, 2.37% (1.5 to 2.85%) between the M. abscessus and M. bolletii clusters, and 2.28% (0.86 to 2.68%) between the M. massiliense and M. bolletii clusters. 
Adding the rpoB sequence to the MLSA-concatenated sequence (total sequence, 4,823 bp) had little effect on the clustering of strains. We found 10/120 (8.3%) isolates for which the concatenated MLSA gene sequence and rpoB sequence were discordant (e.g., M. massiliense MLSA sequence and M. abscessus rpoB sequence), suggesting the intergroup lateral transfers of rpoB. In conclusion, our study strongly supports the recent proposal that M. abscessus, M. massiliense, and M. bolletii should constitute a single species. Our findings also indicate that there has been a horizontal transfer of rpoB sequences between these subgroups, precluding the use of rpoB sequencing alone for the accurate identification of the two proposed M. abscessus subspecies.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Metacommunity ecology focuses on the interaction between local communities and is inherently linked to dispersal as a result. Within this framework, communities are structured by a combination of in-site responses to the immediate environment (species sorting), stochasticity (patch dynamics), connections to other communities via distance between communities and dispersal (neutrality), and source-sink dynamics (mass effects; see Chapter 1 for a detailed description of metacommunity theory, the study site, and macroinvertebrate communities found). In Chapter 2 I examine the spatial scale of study and dispersal ability, as both can influence the degree to which communities interact. However, little is known about how these factors influence the importance of all metacommunity dynamics. I compared dispersal mode of immature aquatic insects and dispersal ability of winged adults across multiple spatial scales in a large river. The strongest drivers of river communities were patch dynamics, followed by species sorting, then neutrality. Active dispersers during aquatic life stages on average exhibited lower patch dynamics, higher species sorting, and significant mass effects compared to passive dispersers. Active and strong dispersers also had a scale-independent influence of neutrality, while neutrality was stronger at broader spatial scales for passive and weak dispersers. These results indicate that as dispersal ability increases, patch dynamics decrease, species sorting increases, and neutrality should decrease. The perceived influence of neutrality may also be dependent on spatial scale and dispersal ability. In Chapter 3 I describe how river benthic macroinvertebrate communities may influence tributary invertebrate communities via adult flight and tributaries may influence mainstem communities via immature drift. This relationship may also depend on relative mainstem and tributary size, as well as abiotic tributary influence on mainstem habitat. 
To investigate the interaction between a larger river and a tributary, I sampled mainstem benthic invertebrate communities and quantified habitat of a 7th order river (West Branch Susquehanna River) above and below a 5th order tributary confluence, as well as 0.95-3.2 km upstream in the tributary. Non-metric multidimensional scaling showed similar patterns of clustering between sampling locations for both habitat characteristics and invertebrate communities. In addition, mainstem river communities and habitat directly downstream of the tributary confluence cluster tightly together, intermediate between tributary and mid-channel river samples. In Bray-Curtis dissimilarity comparisons between tributary and mainstem river communities, the furthest upstream tributary communities were least similar to river communities. Middle tributary samples were also closest by Euclidean distance to the upstream mainstem riffle and exhibited higher similarity to mid-channel samples than the furthest downstream tributary communities. My results indicate that river and tributary benthic invertebrate communities may interact, likely resulting in direct and indirect mass effects of a tributary on the downstream mainstem community by invertebrate drift and habitat restructuring via material delivery from the tributary. I also showed likely direct effects of adult dispersal from the river and oviposition in proximal tributary locations where Euclidean, rather than river, distance may be more important in determining river-tributary interactions.
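The Bray-Curtis dissimilarity mentioned in this abstract has a simple standard form: the summed absolute differences in taxon abundance divided by the total abundance in both samples. The following is a minimal sketch of that standard definition, not the author's analysis code; the function name `bray_curtis` and the taxon counts are hypothetical.

```python
import numpy as np


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two abundance vectors.

    0 means identical composition; 1 means no shared taxa.
    """
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    total = a.sum() + b.sum()
    if total == 0:
        raise ValueError("both samples are empty")
    return np.abs(a - b).sum() / total


# Hypothetical counts of three taxa at a tributary site and a mainstem site.
tributary = [6, 7, 4]
mainstem = [10, 0, 6]
print(bray_curtis(tributary, mainstem))  # 13 / 33
```

Because the measure is bounded between 0 and 1, values from different site pairs can be compared directly.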

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Recombinant inbred lines (RILs) can serve as powerful tools for genetic mapping. Recently, members of the Complex Trait Consortium have proposed the development of a large panel of eight-way RILs in the mouse, derived from eight genetically diverse parental strains. Such a panel would be a valuable community resource. The use of such eight-way RILs will require a detailed understanding of the relationship between alleles at linked loci on an RI chromosome. We extend the work of Haldane and Waddington (1931) on two-way RILs and describe the map expansion, clustering of breakpoints, and other features of the genomes of multiple-strain RILs as a function of the level of crossover interference in meiosis. In this technical report, we present all of our results, in their gory detail. We don’t intend to include such details in the final publication, but want to present them here for those who might be interested.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Outcome-dependent, two-phase sampling designs can dramatically reduce the costs of observational studies by judicious selection of the most informative subjects for purposes of detailed covariate measurement. Here we derive asymptotic information bounds and the form of the efficient score and influence functions for the semiparametric regression models studied by Lawless, Kalbfleisch, and Wild (1999) under two-phase sampling designs. We show that the maximum likelihood estimators for both the parametric and nonparametric parts of the model are asymptotically normal and efficient. The efficient influence function for the parametric part agrees with the more general information bound calculations of Robins, Hsieh, and Newey (1995). By verifying the conditions of Murphy and Van der Vaart (2000) for a least favorable parametric submodel, we provide asymptotic justification for statistical inference based on profile likelihood.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

In epidemiological work, outcomes are frequently non-normal, sample sizes may be large, and effects are often small. To relate health outcomes to geographic risk factors, fast and powerful methods for fitting spatial models, particularly for non-normal data, are required. We focus on binary outcomes, with the risk surface a smooth function of space. We compare penalized likelihood models, including the penalized quasi-likelihood (PQL) approach, and Bayesian models based on fit, speed, and ease of implementation. A Bayesian model using a spectral basis representation of the spatial surface provides the best tradeoff of sensitivity and specificity in simulations, detecting real spatial features while limiting overfitting and being more efficient computationally than other Bayesian approaches. One of the contributions of this work is further development of this underused representation. The spectral basis model outperforms the penalized likelihood methods, which are prone to overfitting, but is slower to fit and not as easily implemented. Conclusions based on a real dataset of cancer cases in Taiwan are similar albeit less conclusive with respect to comparing the approaches. The success of the spectral basis with binary data and similar results with count data suggest that it may be generally useful in spatial models and more complicated hierarchical models.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Use of microarray technology often leads to high-dimensional and low-sample-size data settings. Over the past several years, a variety of novel approaches have been proposed for variable selection in this context. However, only a small number of these have been adapted for time-to-event data where censoring is present. Among standard variable selection methods shown both to have good predictive accuracy and to be computationally efficient is the elastic net penalization approach. In this paper, adaptation of the elastic net approach is presented for variable selection both under the Cox proportional hazards model and under an accelerated failure time (AFT) model. Assessment of the two methods is conducted through simulation studies and through analysis of microarray data obtained from a set of patients with diffuse large B-cell lymphoma where time to survival is of interest. The approaches are shown to match or exceed the predictive performance of a Cox-based and an AFT-based variable selection method. The methods are moreover shown to be much more computationally efficient than their respective Cox- and AFT-based counterparts.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The last two decades have seen intense scientific and regulatory interest in the health effects of particulate matter (PM). Influential epidemiological studies that characterize chronic exposure of individuals rely on monitoring data that are sparse in space and time, so they often assign the same exposure to participants in large geographic areas and across time. We estimate monthly PM during 1988-2002 in a large spatial domain for use in studying health effects in the Nurses' Health Study. We develop a conceptually simple spatio-temporal model that uses a rich set of covariates. The model is used to estimate concentrations of PM10 for the full time period and PM2.5 for a subset of the period. For the earlier part of the period, 1988-1998, few PM2.5 monitors were operating, so we develop a simple extension to the model that represents PM2.5 conditionally on PM10 model predictions. In the epidemiological analysis, model predictions of PM10 are more strongly associated with health effects than when using simpler approaches to estimate exposure. Our modeling approach supports the application in estimating both fine-scale and large-scale spatial heterogeneity and capturing space-time interaction through the use of monthly-varying spatial surfaces. At the same time, the model is computationally feasible, implementable with standard software, and readily understandable to the scientific audience. Despite simplifying assumptions, the model has good predictive performance and uncertainty characterization.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Simulation-based assessment is a popular and frequently necessary approach to evaluation of statistical procedures. Sometimes overlooked is the ability to take advantage of underlying mathematical relations and we focus on this aspect. We show how to take advantage of large-sample theory when conducting a simulation using the analysis of genomic data as a motivating example. The approach uses convergence results to provide an approximation to smaller-sample results, results that are available only by simulation. We consider evaluating and comparing a variety of ranking-based methods for identifying the most highly associated SNPs in a genome-wide association study, derive integral equation representations of the pre-posterior distribution of percentiles produced by three ranking methods, and provide examples comparing performance. These results are of interest in their own right and set the framework for a more extensive set of comparisons.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

AIMS: Multiple arrhythmia re-inductions were recently shown in His-Purkinje system (HPS) ventricular tachycardia (VT). We hypothesized that HPS VT was a frequent mechanism of repetitive or incessant VT and assessed diagnostic criteria to select patients likely to have HPS VT. METHODS AND RESULTS: Consecutive patients with clustering VT episodes (>3 sustained monomorphic VT within 2 weeks) were included in the analysis. HPS VT was considered plausible in patients with (i) impaired left ventricular function associated with dilated cardiomyopathy or valvular heart disease; or (ii) ECG during VT similar to sinus rhythm QRS or to bundle-branch block QRS. HPS VT was plausible in 12 of 48 patients and HPS VT was demonstrated in 6 of 12 patients (50%, or 13% of the whole study group). Median VT cycle length was 318 ms (250-550). Catheter ablation was successful in all six patients. CONCLUSION: His-Purkinje system VT is found in a significant number of patients with repetitive or incessant VT episodes, and in a large proportion of patients with predefined clinical or electrocardiographic characteristics. Since it is easily amenable to catheter ablation, our data support the screening of all patients with repetitive VT in this regard and an invasive approach in a selected group of patients.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

We present studies of the spatial clustering of inertial particles embedded in turbulent flow. A major part of the thesis is experimental, involving the technique of Phase Doppler Interferometry (PDI). The thesis also includes a significant amount of simulation studies and some theoretical considerations. We describe the details of PDI and explain why it is suitable for the study of particle clustering in turbulent flow with a strong mean velocity. We introduce the concept of the radial distribution function (RDF) as our chosen way of quantifying inertial particle clustering and present some original work on foundational and practical considerations related to it. These include methods of treating finite sampling size, interpretation of the magnitude of the RDF, and the possibility of isolating the RDF signature of inertial clustering from that of large-scale mixing. In experimental work, we used the PDI to observe clustering of water droplets in a turbulent wind tunnel. From that we present, in the form of a published paper, evidence of dynamical similarity (Stokes number similarity) of inertial particle clustering together with other results in qualitative agreement with available theoretical predictions and simulation results. We next show detailed quantitative comparisons of results from our experiments, direct numerical simulation (DNS), and theory. Very promising agreement was found for like-sized (mono-disperse) particles. Theory is found to be incorrect regarding clustering of different-sized particles and we propose an empirical correction based on the DNS and experimental results. Besides this, we also discovered a few interesting characteristics of inertial clustering. Firstly, through observations, we found an intriguing possibility for modeling the RDF arising from inertial clustering that has only one (sensitive) parameter. We also found that clustering becomes saturated at high Reynolds number.
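To make the radial distribution function concrete: g(r) is the number of particle pairs observed at separation r divided by the number expected if the particles were placed uniformly at random, so g(r) > 1 signals clustering at that scale. The sketch below is a generic estimator for particles in a periodic cubic box, which is an assumption made for illustration; the thesis works with PDI measurements and its estimator may differ. The function name `radial_distribution` and its parameters are hypothetical.

```python
import numpy as np


def radial_distribution(positions, box_length, bin_edges):
    """Estimate g(r) for points in a periodic cubic box.

    positions: (n, 3) array; bin_edges must not exceed box_length / 2.
    """
    pos = np.asarray(positions, dtype=float)
    edges = np.asarray(bin_edges, dtype=float)
    n = len(pos)

    # Pairwise separation vectors with the minimum-image convention.
    diff = pos[:, None, :] - pos[None, :, :]
    diff -= box_length * np.round(diff / box_length)
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Count each unordered pair once.
    pair_dist = dist[np.triu_indices(n, k=1)]
    counts, _ = np.histogram(pair_dist, bins=edges)

    # Expected pair count in each spherical shell for a uniform distribution.
    shell_volume = 4.0 / 3.0 * np.pi * (edges[1:] ** 3 - edges[:-1] ** 3)
    n_pairs = n * (n - 1) / 2
    expected = n_pairs * shell_volume / box_length ** 3
    return counts / expected


# Uniformly random points should give g(r) close to 1 in every bin.
rng = np.random.default_rng(0)
points = rng.random((500, 3))
print(radial_distribution(points, 1.0, [0.1, 0.2, 0.3, 0.4]))
```

With a finite sample, each bin carries counting noise that scales roughly as the inverse square root of the expected pair count, which is one reason finite sampling size needs careful treatment.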

Relevance:

30.00% 30.00%

Publisher:

Abstract:

BACKGROUND During the past 25 years, many pregnancy and birth cohorts have been established. Each cohort provides unique opportunities for examining associations of early-life exposures with child development and health. However, to fully exploit the large amount of available resources and to facilitate cross-cohort collaboration, it is necessary to have accessible information on each cohort and its individual characteristics. The aim of this work was to provide an overview of European pregnancy and birth cohorts registered in a freely accessible database located at http://www.birthcohorts.net. METHODS European pregnancy and birth cohorts initiated in 1980 or later with at least 300 mother-child pairs enrolled during pregnancy or at birth, and with postnatal data, were eligible for inclusion. Eligible cohorts were invited to provide information on the data and biological samples collected, as well as the timing of data collection. RESULTS In total, 70 cohorts were identified. Of these, 56 fulfilled the inclusion criteria encompassing a total of more than 500,000 live-born European children. The cohorts represented 19 countries with the majority of cohorts located in Northern and Western Europe. Some cohorts were general with multiple aims, whilst others focused on specific health or exposure-related research questions. CONCLUSION This work demonstrates a great potential for cross-cohort collaboration addressing important aspects of child health. The web site, http://www.birthcohorts.net, proved to be a useful tool for accessing information on European pregnancy and birth cohorts and their characteristics.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Fog is a potential source of water that could be exploited using the innovative technology of fog collection. Naturally, the potential of fog has proven its significance in cloud forests that are thriving from fog interception. Historically, the remains of artificial structures in different countries prove that fog has been collected as an alternative and/or supplementary water source. In the beginning of the 19th century, fog collection was investigated as a potential natural resource. After the mid-1980s, following success in Chile, fog-water collection commenced in a number of developing countries. Most of these countries are located in arid and semi-arid regions with topographic and climatic conditions that favour fog-water collection. This paper reviews the technology of fog collection with initial background information on natural fog collection and its historical development. It reviews the climatic and topographic features that dictate fog formation (mainly advection and orographic) and the innovative technology to collect it, focusing on the amount collected, the quality of fog water, and the impact of the technology on the livelihoods of beneficiary communities. By and large, the technology described is simple, cost-effective, and energy-free. However, fog-water collection has disadvantages in that it is seasonal, localised, and the technology needs continual maintenance. Based on the experience in several countries, the sustainability of the technology could be guaranteed if technical, economic, social, and management factors are addressed during its planning and implementation.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

We consider the problem of fitting a union of subspaces to a collection of data points drawn from one or more subspaces and corrupted by noise and/or gross errors. We pose this problem as a non-convex optimization problem, where the goal is to decompose the corrupted data matrix as the sum of a clean and self-expressive dictionary plus a matrix of noise and/or gross errors. By self-expressive we mean a dictionary whose atoms can be expressed as linear combinations of themselves with low-rank coefficients. In the case of noisy data, our key contribution is to show that this non-convex matrix decomposition problem can be solved in closed form from the SVD of the noisy data matrix. The solution involves a novel polynomial thresholding operator on the singular values of the data matrix, which requires minimal shrinkage. For one subspace, a particular case of our framework leads to classical PCA, which requires no shrinkage. For multiple subspaces, the low-rank coefficients obtained by our framework can be used to construct a data affinity matrix from which the clustering of the data according to the subspaces can be obtained by spectral clustering. In the case of data corrupted by gross errors, we solve the problem using an alternating minimization approach, which combines our polynomial thresholding operator with the more traditional shrinkage-thresholding operator. Experiments on motion segmentation and face clustering show that our framework performs on par with state-of-the-art techniques at a reduced computational cost.
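The pipeline described in this abstract (low-rank self-expressive coefficients from the SVD, then an affinity matrix, then spectral clustering) can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the abstract does not give the polynomial thresholding operator, so the sketch substitutes plain rank truncation, C = V_r V_r^T, and uses a small hand-written k-means on the spectral embedding. The names `low_rank_coefficients` and `spectral_labels` are hypothetical.

```python
import numpy as np


def low_rank_coefficients(data, rank):
    """Coefficients C = V_r V_r^T from the top right singular vectors.

    data holds one point per column. Hard truncation stands in for the
    paper's thresholding operator (an assumption of this sketch).
    """
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    v = vt[:rank].T
    return v @ v.T


def spectral_labels(affinity, n_clusters, n_iter=50):
    """Cluster with the normalized Laplacian and a small k-means."""
    degree = affinity.sum(axis=1)
    scale = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    laplacian = np.eye(len(affinity)) - scale[:, None] * affinity * scale[None, :]
    _, vectors = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    emb = vectors[:, :n_clusters]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)

    # Deterministic farthest-point initialization of the centers.
    centers = [emb[0]]
    for _ in range(1, n_clusters):
        d2 = np.min([((emb - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(d2)])
    centers = np.array(centers)

    for _ in range(n_iter):
        d2 = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels


# Noise-free toy data: 10 points in each of two 2-D subspaces of R^6.
rng = np.random.default_rng(0)
basis = np.eye(6)
data = np.hstack([
    basis[:, 0:2] @ rng.standard_normal((2, 10)),
    basis[:, 2:4] @ rng.standard_normal((2, 10)),
])
coeffs = low_rank_coefficients(data, rank=4)
labels = spectral_labels(np.abs(coeffs) + np.abs(coeffs.T), n_clusters=2)
print(labels)
```

For noise-free points drawn from independent subspaces, the truncated coefficient matrix is block diagonal, so the affinity graph splits into one connected component per subspace; noisy data is where a more careful treatment of the singular values matters.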