849 resultados para large sample distributions
Resumo:
La dernière décennie a connu un intérêt croissant pour les problèmes posés par les variables instrumentales faibles dans la littérature économétrique, c’est-à-dire les situations où les variables instrumentales sont faiblement corrélées avec la variable à instrumenter. En effet, il est bien connu que lorsque les instruments sont faibles, les distributions des statistiques de Student, de Wald, du ratio de vraisemblance et du multiplicateur de Lagrange ne sont plus standard et dépendent souvent de paramètres de nuisance. Plusieurs études empiriques portant notamment sur les modèles de rendements à l’éducation [Angrist et Krueger (1991, 1995), Angrist et al. (1999), Bound et al. (1995), Dufour et Taamouti (2007)] et d’évaluation des actifs financiers (C-CAPM) [Hansen et Singleton (1982,1983), Stock et Wright (2000)], où les variables instrumentales sont faiblement corrélées avec la variable à instrumenter, ont montré que l’utilisation de ces statistiques conduit souvent à des résultats peu fiables. Un remède à ce problème est l’utilisation de tests robustes à l’identification [Anderson et Rubin (1949), Moreira (2002), Kleibergen (2003), Dufour et Taamouti (2007)]. Cependant, il n’existe aucune littérature économétrique sur la qualité des procédures robustes à l’identification lorsque les instruments disponibles sont endogènes ou à la fois endogènes et faibles. Cela soulève la question de savoir ce qui arrive aux procédures d’inférence robustes à l’identification lorsque certaines variables instrumentales supposées exogènes ne le sont pas effectivement. Plus précisément, qu’arrive-t-il si une variable instrumentale invalide est ajoutée à un ensemble d’instruments valides? Ces procédures se comportent-elles différemment? Et si l’endogénéité des variables instrumentales pose des difficultés majeures à l’inférence statistique, peut-on proposer des procédures de tests qui sélectionnent les instruments lorsqu’ils sont à la fois forts et valides? Est-il possible de proposer les proédures de sélection d’instruments qui demeurent valides même en présence d’identification faible? Cette thèse se focalise sur les modèles structurels (modèles à équations simultanées) et apporte des réponses à ces questions à travers quatre essais. Le premier essai est publié dans Journal of Statistical Planning and Inference 138 (2008) 2649 – 2661. Dans cet essai, nous analysons les effets de l’endogénéité des instruments sur deux statistiques de test robustes à l’identification: la statistique d’Anderson et Rubin (AR, 1949) et la statistique de Kleibergen (K, 2003), avec ou sans instruments faibles. D’abord, lorsque le paramètre qui contrôle l’endogénéité des instruments est fixe (ne dépend pas de la taille de l’échantillon), nous montrons que toutes ces procédures sont en général convergentes contre la présence d’instruments invalides (c’est-à-dire détectent la présence d’instruments invalides) indépendamment de leur qualité (forts ou faibles). Nous décrivons aussi des cas où cette convergence peut ne pas tenir, mais la distribution asymptotique est modifiée d’une manière qui pourrait conduire à des distorsions de niveau même pour de grands échantillons. Ceci inclut, en particulier, les cas où l’estimateur des double moindres carrés demeure convergent, mais les tests sont asymptotiquement invalides. Ensuite, lorsque les instruments sont localement exogènes (c’est-à-dire le paramètre d’endogénéité converge vers zéro lorsque la taille de l’échantillon augmente), nous montrons que ces tests convergent vers des distributions chi-carré non centrées, que les instruments soient forts ou faibles. Nous caractérisons aussi les situations où le paramètre de non centralité est nul et la distribution asymptotique des statistiques demeure la même que dans le cas des instruments valides (malgré la présence des instruments invalides). Le deuxième essai étudie l’impact des instruments faibles sur les tests de spécification du type Durbin-Wu-Hausman (DWH) ainsi que le test de Revankar et Hartley (1973). Nous proposons une analyse en petit et grand échantillon de la distribution de ces tests sous l’hypothèse nulle (niveau) et l’alternative (puissance), incluant les cas où l’identification est déficiente ou faible (instruments faibles). Notre analyse en petit échantillon founit plusieurs perspectives ainsi que des extensions des précédentes procédures. En effet, la caractérisation de la distribution de ces statistiques en petit échantillon permet la construction des tests de Monte Carlo exacts pour l’exogénéité même avec les erreurs non Gaussiens. Nous montrons que ces tests sont typiquement robustes aux intruments faibles (le niveau est contrôlé). De plus, nous fournissons une caractérisation de la puissance des tests, qui exhibe clairement les facteurs qui déterminent la puissance. Nous montrons que les tests n’ont pas de puissance lorsque tous les instruments sont faibles [similaire à Guggenberger(2008)]. Cependant, la puissance existe tant qu’au moins un seul instruments est fort. La conclusion de Guggenberger (2008) concerne le cas où tous les instruments sont faibles (un cas d’intérêt mineur en pratique). Notre théorie asymptotique sous les hypothèses affaiblies confirme la théorie en échantillon fini. Par ailleurs, nous présentons une analyse de Monte Carlo indiquant que: (1) l’estimateur des moindres carrés ordinaires est plus efficace que celui des doubles moindres carrés lorsque les instruments sont faibles et l’endogenéité modérée [conclusion similaire à celle de Kiviet and Niemczyk (2007)]; (2) les estimateurs pré-test basés sur les tests d’exogenété ont une excellente performance par rapport aux doubles moindres carrés. Ceci suggère que la méthode des variables instrumentales ne devrait être appliquée que si l’on a la certitude d’avoir des instruments forts. Donc, les conclusions de Guggenberger (2008) sont mitigées et pourraient être trompeuses. Nous illustrons nos résultats théoriques à travers des expériences de simulation et deux applications empiriques: la relation entre le taux d’ouverture et la croissance économique et le problème bien connu du rendement à l’éducation. Le troisième essai étend le test d’exogénéité du type Wald proposé par Dufour (1987) aux cas où les erreurs de la régression ont une distribution non-normale. Nous proposons une nouvelle version du précédent test qui est valide même en présence d’erreurs non-Gaussiens. Contrairement aux procédures de test d’exogénéité usuelles (tests de Durbin-Wu-Hausman et de Rvankar- Hartley), le test de Wald permet de résoudre un problème courant dans les travaux empiriques qui consiste à tester l’exogénéité partielle d’un sous ensemble de variables. Nous proposons deux nouveaux estimateurs pré-test basés sur le test de Wald qui performent mieux (en terme d’erreur quadratique moyenne) que l’estimateur IV usuel lorsque les variables instrumentales sont faibles et l’endogénéité modérée. Nous montrons également que ce test peut servir de procédure de sélection de variables instrumentales. Nous illustrons les résultats théoriques par deux applications empiriques: le modèle bien connu d’équation du salaire [Angist et Krueger (1991, 1999)] et les rendements d’échelle [Nerlove (1963)]. Nos résultats suggèrent que l’éducation de la mère expliquerait le décrochage de son fils, que l’output est une variable endogène dans l’estimation du coût de la firme et que le prix du fuel en est un instrument valide pour l’output. Le quatrième essai résout deux problèmes très importants dans la littérature économétrique. D’abord, bien que le test de Wald initial ou étendu permette de construire les régions de confiance et de tester les restrictions linéaires sur les covariances, il suppose que les paramètres du modèle sont identifiés. Lorsque l’identification est faible (instruments faiblement corrélés avec la variable à instrumenter), ce test n’est en général plus valide. Cet essai développe une procédure d’inférence robuste à l’identification (instruments faibles) qui permet de construire des régions de confiance pour la matrices de covariances entre les erreurs de la régression et les variables explicatives (possiblement endogènes). Nous fournissons les expressions analytiques des régions de confiance et caractérisons les conditions nécessaires et suffisantes sous lesquelles ils sont bornés. La procédure proposée demeure valide même pour de petits échantillons et elle est aussi asymptotiquement robuste à l’hétéroscédasticité et l’autocorrélation des erreurs. Ensuite, les résultats sont utilisés pour développer les tests d’exogénéité partielle robustes à l’identification. Les simulations Monte Carlo indiquent que ces tests contrôlent le niveau et ont de la puissance même si les instruments sont faibles. Ceci nous permet de proposer une procédure valide de sélection de variables instrumentales même s’il y a un problème d’identification. La procédure de sélection des instruments est basée sur deux nouveaux estimateurs pré-test qui combinent l’estimateur IV usuel et les estimateurs IV partiels. Nos simulations montrent que: (1) tout comme l’estimateur des moindres carrés ordinaires, les estimateurs IV partiels sont plus efficaces que l’estimateur IV usuel lorsque les instruments sont faibles et l’endogénéité modérée; (2) les estimateurs pré-test ont globalement une excellente performance comparés à l’estimateur IV usuel. Nous illustrons nos résultats théoriques par deux applications empiriques: la relation entre le taux d’ouverture et la croissance économique et le modèle de rendements à l’éducation. Dans la première application, les études antérieures ont conclu que les instruments n’étaient pas trop faibles [Dufour et Taamouti (2007)] alors qu’ils le sont fortement dans la seconde [Bound (1995), Doko et Dufour (2009)]. Conformément à nos résultats théoriques, nous trouvons les régions de confiance non bornées pour la covariance dans le cas où les instruments sont assez faibles.
Resumo:
This paper presents a simple Bayesian approach to sample size determination in clinical trials. It is required that the trial should be large enough to ensure that the data collected will provide convincing evidence either that an experimental treatment is better than a control or that it fails to improve upon control by some clinically relevant difference. The method resembles standard frequentist formulations of the problem, and indeed in certain circumstances involving 'non-informative' prior information it leads to identical answers. In particular, unlike many Bayesian approaches to sample size determination, use is made of an alternative hypothesis that an experimental treatment is better than a control treatment by some specified magnitude. The approach is introduced in the context of testing whether a single stream of binary observations are consistent with a given success rate p(0). Next the case of comparing two independent streams of normally distributed responses is considered, first under the assumption that their common variance is known and then for unknown variance. Finally, the more general situation in which a large sample is to be collected and analysed according to the asymptotic properties of the score statistic is explored. Copyright (C) 2007 John Wiley & Sons, Ltd.
Resumo:
The translation of an ensemble of model runs into a probability distribution is a common task in model-based prediction. Common methods for such ensemble interpretations proceed as if verification and ensemble were draws from the same underlying distribution, an assumption not viable for most, if any, real world ensembles. An alternative is to consider an ensemble as merely a source of information rather than the possible scenarios of reality. This approach, which looks for maps between ensembles and probabilistic distributions, is investigated and extended. Common methods are revisited, and an improvement to standard kernel dressing, called ‘affine kernel dressing’ (AKD), is introduced. AKD assumes an affine mapping between ensemble and verification, typically not acting on individual ensemble members but on the entire ensemble as a whole, the parameters of this mapping are determined in parallel with the other dressing parameters, including a weight assigned to the unconditioned (climatological) distribution. These amendments to standard kernel dressing, albeit simple, can improve performance significantly and are shown to be appropriate for both overdispersive and underdispersive ensembles, unlike standard kernel dressing which exacerbates over dispersion. Studies are presented using operational numerical weather predictions for two locations and data from the Lorenz63 system, demonstrating both effectiveness given operational constraints and statistical significance given a large sample.
Resumo:
Before the advent of genome-wide association studies (GWASs), hundreds of candidate genes for obesity-susceptibility had been identified through a variety of approaches. We examined whether those obesity candidate genes are enriched for associations with body mass index (BMI) compared with non-candidate genes by using data from a large-scale GWAS. A thorough literature search identified 547 candidate genes for obesity-susceptibility based on evidence from animal studies, Mendelian syndromes, linkage studies, genetic association studies and expression studies. Genomic regions were defined to include the genes ±10 kb of flanking sequence around candidate and non-candidate genes. We used summary statistics publicly available from the discovery stage of the genome-wide meta-analysis for BMI performed by the genetic investigation of anthropometric traits consortium in 123 564 individuals. Hypergeometric, rank tail-strength and gene-set enrichment analysis tests were used to test for the enrichment of association in candidate compared with non-candidate genes. The hypergeometric test of enrichment was not significant at the 5% P-value quantile (P = 0.35), but was nominally significant at the 25% quantile (P = 0.015). The rank tail-strength and gene-set enrichment tests were nominally significant for the full set of genes and borderline significant for the subset without SNPs at P < 10(-7). Taken together, the observed evidence for enrichment suggests that the candidate gene approach retains some value. However, the degree of enrichment is small despite the extensive number of candidate genes and the large sample size. Studies that focus on candidate genes have only slightly increased chances of detecting associations, and are likely to miss many true effects in non-candidate genes, at least for obesity-related traits.
Resumo:
This paper gives a first step toward a methodology to quantify the influences of regulation on short-run earnings dynamics. It also provides evidence on the patterns of wage adjustment adopted during the recent high inflationary experience in Brazil.The large variety of official wage indexation rules adopted in Brazil during the recent years combined with the availability of monthly surveys on labor markets makes the Brazilian case a good laboratory to test how regulation affects earnings dynamics. In particular, the combination of large sample sizes with the possibility of following the same worker through short periods of time allows to estimate the cross-sectional distribution of longitudinal statistics based on observed earnings (e.g., monthly and annual rates of change).The empirical strategy adopted here is to compare the distributions of longitudinal statistics extracted from actual earnings data with simulations generated from minimum adjustment requirements imposed by the Brazilian Wage Law. The analysis provides statistics on how binding were wage regulation schemes. The visual analysis of the distribution of wage adjustments proves useful to highlight stylized facts that may guide future empirical work.
Resumo:
This study aimed to investigate the phenomenology of obsessive compulsive disorder (OCD), addressing specific questions about the nature of obsessions and compulsions, and to contribute to the World Health Organization's (WHO) revision of OCD diagnostic guidelines. Data from 1001 patients from the Brazilian Research Consortium on Obsessive Compulsive Spectrum Disorders were used. Patients were evaluated by trained clinicians using validated instruments, including the Dimensional Yale Brown Obsessive Compulsive Scale, the University of Sao Paulo Sensory Phenomena Scale, and the Brown Assessment of Beliefs Scale. The aims were to compare the types of sensory phenomena (SP, subjective experiences that precede or accompany compulsions) in OCD patients with and without tic disorders and to determine the frequency of mental compulsions, the co-occurrence of obsessions and compulsions, and the range of insight. SP were common in the whole sample, but patients with tic disorders were more likely to have physical sensations and urges only. Mental compulsions occurred in the majority of OCD patients. It was extremely rare for OCD patients to have obsessions without compulsions. A wide range of insight into OCD beliefs was observed, with a small subset presenting no insight. The data generated from this large sample will help practicing clinicians appreciate the full range of OCD symptoms and confirm prior studies in smaller samples the degree to which insight varies. These findings also support specific revisions to the WHO's diagnostic guidelines for OCD, such as describing sensory phenomena, mental compulsions and level of insight, so that the world-wide recognition of this disabling disorder is increased. (C) 2014 Elsevier Ltd. All rights reserved.
Resumo:
Thanks to the Chandra and XMM–Newton surveys, the hard X-ray sky is now probed down to a flux limit where the bulk of the X-ray background is almost completely resolved into discrete sources, at least in the 2–8 keV band. Extensive programs of multiwavelength follow-up observations showed that the large majority of hard X–ray selected sources are identified with Active Galactic Nuclei (AGN) spanning a broad range of redshifts, luminosities and optical properties. A sizable fraction of relatively luminous X-ray sources hosting an active, presumably obscured, nucleus would not have been easily recognized as such on the basis of optical observations because characterized by “peculiar” optical properties. In my PhD thesis, I will focus the attention on the nature of two classes of hard X-ray selected “elusive” sources: those characterized by high X-ray-to-optical flux ratios and red optical-to-near-infrared colors, a fraction of which associated with Type 2 quasars, and the X-ray bright optically normal galaxies, also known as XBONGs. In order to characterize the properties of these classes of elusive AGN, the datasets of several deep and large-area surveys have been fully exploited. The first class of “elusive” sources is characterized by X-ray-to-optical flux ratios (X/O) significantly higher than what is generally observed from unobscured quasars and Seyfert galaxies. The properties of well defined samples of high X/O sources detected at bright X–ray fluxes suggest that X/O selection is highly efficient in sampling high–redshift obscured quasars. At the limits of deep Chandra surveys (∼10−16 erg cm−2 s−1), high X/O sources are generally characterized by extremely faint optical magnitudes, hence their spectroscopic identification is hardly feasible even with the largest telescopes. In this framework, a detailed investigation of their X-ray properties may provide useful information on the nature of this important component of the X-ray source population. The X-ray data of the deepest X-ray observations ever performed, the Chandra deep fields, allows us to characterize the average X-ray properties of the high X/O population. The results of spectral analysis clearly indicate that the high X/O sources represent the most obscured component of the X–ray background. Their spectra are harder (G ∼ 1) than any other class of sources in the deep fields and also of the XRB spectrum (G ≈ 1.4). In order to better understand the AGN physics and evolution, a much better knowledge of the redshift, luminosity and spectral energy distributions (SEDs) of elusive AGN is of paramount importance. The recent COSMOS survey provides the necessary multiwavelength database to characterize the SEDs of a statistically robust sample of obscured sources. The combination of high X/O and red-colors offers a powerful tool to select obscured luminous objects at high redshift. A large sample of X-ray emitting extremely red objects (R−K >5) has been collected and their optical-infrared properties have been studied. In particular, using an appropriate SED fitting procedure, the nuclear and the host galaxy components have been deconvolved over a large range of wavelengths and ptical nuclear extinctions, black hole masses and Eddington ratios have been estimated. It is important to remark that the combination of hard X-ray selection and extreme red colors is highly efficient in picking up highly obscured, luminous sources at high redshift. Although the XBONGs do not present a new source population, the interest on the nature of these sources has gained a renewed attention after the discovery of several examples from recent Chandra and XMM–Newton surveys. Even though several possibilities were proposed in recent literature to explain why a relatively luminous (LX = 1042 − 1043erg s−1) hard X-ray source does not leave any significant signature of its presence in terms of optical emission lines, the very nature of XBONGs is still subject of debate. Good-quality photometric near-infrared data (ISAAC/VLT) of 4 low-redshift XBONGs from the HELLAS2XMMsurvey have been used to search for the presence of the putative nucleus, applying the surface-brightness decomposition technique. In two out of the four sources, the presence of a nuclear weak component hosted by a bright galaxy has been revealed. The results indicate that moderate amounts of gas and dust, covering a large solid angle (possibly 4p) at the nuclear source, may explain the lack of optical emission lines. A weak nucleus not able to produce suffcient UV photons may provide an alternative or additional explanation. On the basis of an admittedly small sample, we conclude that XBONGs constitute a mixed bag rather than a new source population. When the presence of a nucleus is revealed, it turns out to be mildly absorbed and hosted by a bright galaxy.
Resumo:
Suppose that we are interested in establishing simple, but reliable rules for predicting future t-year survivors via censored regression models. In this article, we present inference procedures for evaluating such binary classification rules based on various prediction precision measures quantified by the overall misclassification rate, sensitivity and specificity, and positive and negative predictive values. Specifically, under various working models we derive consistent estimators for the above measures via substitution and cross validation estimation procedures. Furthermore, we provide large sample approximations to the distributions of these nonsmooth estimators without assuming that the working model is correctly specified. Confidence intervals, for example, for the difference of the precision measures between two competing rules can then be constructed. All the proposals are illustrated with two real examples and their finite sample properties are evaluated via a simulation study.
Resumo:
In this paper, we study panel count data with informative observation times. We assume nonparametric and semiparametric proportional rate models for the underlying recurrent event process, where the form of the baseline rate function is left unspecified and a subject-specific frailty variable inflates or deflates the rate function multiplicatively. The proposed models allow the recurrent event processes and observation times to be correlated through their connections with the unobserved frailty; moreover, the distributions of both the frailty variable and observation times are considered as nuisance parameters. The baseline rate function and the regression parameters are estimated by maximizing a conditional likelihood function of observed event counts and solving estimation equations. Large sample properties of the proposed estimators are studied. Numerical studies demonstrate that the proposed estimation procedures perform well for moderate sample sizes. An application to a bladder tumor study is presented to illustrate the use of the proposed methods.
Resumo:
Simulation-based assessment is a popular and frequently necessary approach to evaluation of statistical procedures. Sometimes overlooked is the ability to take advantage of underlying mathematical relations and we focus on this aspect. We show how to take advantage of large-sample theory when conducting a simulation using the analysis of genomic data as a motivating example. The approach uses convergence results to provide an approximation to smaller-sample results, results that are available only by simulation. We consider evaluating and comparing a variety of ranking-based methods for identifying the most highly associated SNPs in a genome-wide association study, derive integral equation representations of the pre-posterior distribution of percentiles produced by three ranking methods, and provide examples comparing performance. These results are of interest in their own right and set the framework for a more extensive set of comparisons.
Resumo:
The current study tested two competing models of Attention-Deficit/Hyperactivity Disorder (AD/HD), the inhibition and state regulation theories, by conducting fine-grained analyses of the Stop-Signal Task and another putative measure of behavioral inhibition, the Gordon Continuous Performance Test (G-CPT), in a large sample of children and adolescents. The inhibition theory posits that performance on these tasks reflects increased difficulties for AD/HD participants to inhibit prepotent responses. The model predicts that putative stop-signal reaction time (SSRT) group differences on the Stop-Signal Task will be primarily related to AD/HD participants requiring more warning than control participants to inhibit to the stop-signal and emphasizes the relative importance of commission errors, particularly "impulsive" type commissions, over other error types on the G-CPT. The state regulation theory, on the other hand, proposes response variability due to difficulties maintaining an optimal state of arousal as the primary deficit in AD/HD. This model predicts that SSRT differences will be more attributable to slower and/or more variable reaction time (RT) in the AD/HD group, as opposed to reflecting inhibitory deficits. State regulation assumptions also emphasize the relative importance of omission errors and "slow processing" type commissions over other error types on the G-CPT. Overall, results of Stop-Signal Task analyses were more supportive of state regulation predictions and showed that greater response variability (i.e., SDRT) in the AD/HD group was not reducible to slow mean reaction time (MRT) and that response variability made a larger contribution to increased SSRT in the AD/HD group than inhibitory processes. Examined further, ex-Gaussian analyses of Stop-Signal Task go-trial RT distributions revealed that increased variability in the AD/HD group was not due solely to a few excessively long RTs in the tail of the AD/HD distribution (i.e., tau), but rather indicated the importance of response variability throughout AD/HD group performance on the Stop-Signal Task, as well as the notable sensitivity of ex-Gaussian analyses to variability in data screening procedures. Results of G-CPT analyses indicated some support for the inhibition model, although error type analyses failed to further differentiate the theories. Finally, inclusion of primary variables of interest in exploratory factor analysis with other neurocognitive predictors of AD/HD indicated response variability as a separable construct and further supported its role in Stop-Signal Task performance. Response variability did not, however, make a unique contribution to the prediction of AD/HD symptoms beyond measures of motor processing speed in multiple deficit regression analyses. Results have implications for the interpretation of the processes reflected in widely-used variables in the AD/HD literature, as well as for the theoretical understanding of AD/HD.
Resumo:
Thesis (Master's)--University of Washington, 2016-06
Resumo:
The three-parameter lognormal distribution is the extension of the two-parameter lognormal distribution to meet the need of the biological, sociological, and other fields. Numerous research papers have been published for the parameter estimation problems for the lognormal distributions. The inclusion of the location parameter brings in some technical difficulties for the parameter estimation problems, especially for the interval estimation. This paper proposes a method for constructing exact confidence intervals and exact upper confidence limits for the location parameter of the three-parameter lognormal distribution. The point estimation problem is discussed as well. The performance of the point estimator is compared with the maximum likelihood estimator, which is widely used in practice. Simulation result shows that the proposed method is less biased in estimating the location parameter. The large sample size case is discussed in the paper.
Resumo:
Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space), and the challenge arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. For sample space partitioning, I propose a MEdian Selection Subset AGgregation Estimator ({\em message}) algorithm for solving these issues. The algorithm applies feature selection in parallel for each subset using regularized regression or Bayesian variable selection method, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in sample size, and has theoretical guarantees. I provide extensive experiments to show excellent performance in feature selection, estimation, prediction, and computation time relative to usual competitors.
While sample space partitioning is useful in handling datasets with large sample size, feature space partitioning is more effective when the data dimension is high. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In the thesis, I propose a new embarrassingly parallel framework named {\em DECO} for distributed variable selection and parameter estimation. In {\em DECO}, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.
For datasets with both large sample sizes and high dimensionality, I propose a new "divided-and-conquer" framework {\em DEME} (DECO-message) by leveraging both the {\em DECO} and the {\em message} algorithm. The new framework first partitions the dataset in the sample space into row cubes using {\em message} and then partition the feature space of the cubes using {\em DECO}. This procedure is equivalent to partitioning the original data matrix into multiple small blocks, each with a feasible size that can be stored and fitted in a computer in parallel. The results are then synthezied via the {\em DECO} and {\em message} algorithm in a reverse order to produce the final output. The whole framework is extremely scalable.
Resumo:
Aims. The large and small-scale (pc) structure of the Galactic interstellar medium can be investigated by utilising spectra of early-type stellar probes of known distances in the same region of the sky. This paper determines the variation in line strength of Ca ii at 3933.661 Å as a function of probe separation for a large sample of stars, including a number of sightlines in the Magellanic Clouds.
Methods. FLAMES-GIRAFFE data taken with the Very Large Telescope towards early-type stars in 3 Galactic and 4 Magellanic open clusters in Ca ii are used to obtain the velocity, equivalent width, column density, and line width of interstellar Galactic calcium for a total of 657 stars, of which 443 are Magellanic Cloud sightlines. In each cluster there are between 43 and 111 stars observed. Additionally, FEROS and UVES Ca ii K and Na i D spectra of 21 Galactic and 154 Magellanic early-type stars are presented and combined with data from the literature to study the calcium column density - parallax relationship.
Results. For the four Magellanic clusters studied with FLAMES, the strength of the Galactic interstellar Ca ii K equivalent width on transverse scales from ∼0.05-9 pc is found to vary by factors of ∼1.8-3.0, corresponding to column density variations of ∼0.3-0.5 dex in the optically-thin approximation. Using FLAMES, FEROS, and UVES archive spectra, the minimum and maximum reduced equivalent widths for Milky Way gas are found to lie in the range ∼35-125 mÅ and ∼30-160 mÅ for Ca ii K and Na i D, respectively. The range is consistent with a previously published simple model of the interstellar medium consisting of spherical cloudlets of filling factor ∼0.3, although other geometries are not ruled out. Finally, the derived functional form for parallax (π) and Ca ii column density (NCaII) is found to be π(mas) = 1 / (2.39 × 10-13 × NCaII (cm-2) + 0.11). Our derived parallax is ∼25 per cent lower than predicted by Megier et al. (2009, A&A, 507, 833) at a distance of ∼100 pc and ∼15 percent lower at a distance of ∼200 pc, reflecting inhomogeneity in the Ca ii distribution in the different sightlines studied.