849 results for large sample distributions
Abstract:
Anthropogenic habitat alterations and water-management practices have imposed an artificial spatial scale onto the once contiguous freshwater marshes of the Florida Everglades. To gain insight into how these changes may affect biotic communities, we examined whether the abundance and community structure of large fishes (SL > 8 cm) in Everglades marshes varied more at regional or intraregional scales, and whether this variation was related to hydroperiod, water depth, floating mat volume, and vegetation density. From October 1997 to October 2002, we used an airboat electrofisher to sample large fishes at sites within three regions of the Everglades. Each of these regions is subject to a unique water-management schedule. Dry-down events (water depth < 10 cm) occurred at several sites during spring in 1999, 2000, 2001, and 2002. The 2001 dry-down event was the most severe and widespread. Abundance of several fishes decreased significantly through time, and the number of days post-dry-down covaried significantly with abundance for several species. Processes operating at the regional scale appear to play important roles in regulating large fishes. The most pronounced patterns in abundance and community structure occurred at the regional scale, and the effect size for region was greater than the effect size for sites nested within region for abundance of all species combined, all predators combined, and each of the seven most abundant species. Non-metric multi-dimensional scaling revealed distinct groupings of sites corresponding to the three regions. We also found significant variation in community structure through time that correlated with the number of days post-dry-down. Our results suggest that hydroperiod and water management at the regional scale influence large fish communities of Everglades marshes.
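The regional grouping revealed by the non-metric multidimensional scaling can be illustrated with a small sketch: a hypothetical site-by-species abundance matrix (all numbers invented) and the Bray-Curtis dissimilarities that typically serve as input to an NMDS ordination. Sites sharing a region should be less dissimilar to one another than to sites from other regions.

```python
import numpy as np
from scipy.spatial.distance import braycurtis

rng = np.random.default_rng(0)

# Hypothetical fish-abundance matrix: 6 sites x 5 species, two sites per
# region, with region-specific mean abundances (all numbers invented).
region_means = np.array([[20, 5, 1, 8, 2],
                         [5, 18, 6, 1, 9],
                         [2, 3, 15, 12, 4]], dtype=float)
sites = np.repeat(region_means, 2, axis=0) + rng.poisson(1.0, (6, 5))
region = np.repeat([0, 1, 2], 2)

# Pairwise Bray-Curtis dissimilarities: the kind of matrix an NMDS
# ordination would embed; same-region sites should be closer together.
within, between = [], []
for i in range(len(sites)):
    for j in range(i + 1, len(sites)):
        d = braycurtis(sites[i], sites[j])
        (within if region[i] == region[j] else between).append(d)

print(f"mean within-region dissimilarity:  {np.mean(within):.3f}")
print(f"mean between-region dissimilarity: {np.mean(between):.3f}")
```

With region means this distinct, the within-region dissimilarities come out well below the between-region ones, which is the pattern that produces separated clusters in an NMDS plot.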
Abstract:
The purpose of this longitudinal study was to explore the relationship between early oral reading fluency ability and reading comprehension achievement among an ethnically and racially diverse sample of young learners from low-income families attending elementary school within a large public school district in southeast Florida. Although many studies have addressed the relationship between oral reading fluency ability and reading comprehension achievement, most of the existing research failed either to disaggregate the data by demographic subgroup or to secure a sample large enough to adequately represent the diverse subgroups. The research questions that guided this study were: (a) To what extent does early oral reading fluency ability measured in first, second, or third grade correlate with reading comprehension achievement in third grade? (b) To what extent does the relationship between early oral reading fluency ability and reading comprehension achievement vary by demographic subgroup membership (i.e., gender, race/ethnicity, socioeconomic status) among a diverse sample of students? A predictive research design using archived secondary data was employed in this nonexperimental quantitative study of 1,663 third-grade students who attended a cohort of 25 Reading First funded schools. The data analyzed derived from the Dynamic Indicators of Basic Early Literacy Skills Oral Reading Fluency (DIBELS ORF) measure administered in first, second, and third grades and the Florida Comprehensive Assessment Test of the Sunshine State Standards (FCAT-SSS) Reading administered in third grade. Linear regression analyses between each of the oral reading fluency and reading comprehension measures produced significant positive correlations.
Hierarchical regression analyses supported the predictive potential of all three oral reading fluency ability measures toward reading comprehension achievement, with the first-grade oral reading fluency measure explaining the most variance in third-grade reading comprehension achievement. Male students produced significant overall differences in variance when compared to female students, as did the Other student subgroup (i.e., Asian, Multiracial, and Native American) when compared to Black, White, and Hispanic students. No significant differences in variance were found between students from low- and moderate-socioeconomic families. These findings add importantly to the literature on diverse young learners.
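The basic regression step described above can be sketched with synthetic data; the score scales, slope, and sample size below are assumptions for illustration, not values from the study.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)

# Synthetic data (assumed scales): first-grade oral reading fluency in
# words correct per minute, and a third-grade comprehension scale score
# built with a positive slope plus noise.
orf = rng.normal(60, 15, 200)
comprehension = 300 + 1.5 * orf + rng.normal(0, 20, 200)

# Simple linear regression of comprehension on fluency.
fit = linregress(orf, comprehension)
print(f"r = {fit.rvalue:.2f}, p = {fit.pvalue:.1e}")
```

A hierarchical analysis like the one in the study would then add demographic predictors in successive blocks and compare the change in explained variance at each step.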
Abstract:
This research was undertaken to explore dimensions of the risk construct, identify factors related to risk-taking in education, and study risk propensity among employees at a community college. Risk-taking propensity (RTP) was measured by the 12-item BCDQ, which consisted of personal and professional risk-related situations balanced for the money, reputation, and satisfaction dimensions of the risk construct. Scoring ranged from 1.00 (most cautious) to 6.00 (most risky). Surveys including the BCDQ and seven demographic questions relating to age, gender, professional status, length of service, academic discipline, highest degree, and campus location were sent to faculty, administrators, and academic department heads. A total of 325 surveys were returned, resulting in a 66.7% response rate. Subjects were relatively homogeneous for age, length of service, and highest degree. Subjects were also homogeneous for risk-taking propensity: no substantive differences in RTP scores were noted within and among demographic groups, with the possible exception of academic discipline. The mean RTP score was 3.77 for all subjects, 3.76 for faculty, 3.83 for administrators, and 3.64 for department heads. The relationship between propensity to take personal risks and propensity to take professional risks was tested by computing Pearson r correlation coefficients. The relationships for the total sample, faculty, and administrator groups were statistically significant, but of limited practical significance. Subjects were placed into risk categories by dividing the response scale into thirds. A 3 x 3 factorial ANOVA revealed no interaction effects between professional status and risk category with regard to RTP score. A discriminant analysis showed that a seven-factor model was not effective in predicting risk category. The homogeneity of the study sample and the effect of a risk-encouraging environment were discussed in the context of the community college.
Since very little data on risk-taking in education is available, risk propensity data from this study could serve as a basis for comparison in future research. Results could be used by institutions to plan professional development activities designed to increase risk-taking and encourage active acceptance of change.
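Two of the analyses mentioned above can be sketched with fabricated scores on the BCDQ's 1.00 to 6.00 scale: a Pearson r between personal and professional risk scores, and the division of the response scale into thirds to form risk categories. The data and the strength of the relationship are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Fabricated BCDQ-style scores on the 1.00-6.00 scale for 325 subjects;
# professional risk is only weakly coupled to personal risk here, so the
# correlation is significant but of limited practical significance.
personal = rng.uniform(1.0, 6.0, 325)
professional = 0.25 * personal + 0.75 * rng.uniform(1.0, 6.0, 325)

r, p = pearsonr(personal, professional)

# Divide the 1.00-6.00 response scale into thirds for risk categories:
# 0 = cautious, 1 = moderate, 2 = risky.
bins = [1.0 + 5.0 / 3.0, 1.0 + 2.0 * 5.0 / 3.0]   # cut points ~2.67, ~4.33
category = np.digitize(personal, bins)

print(f"r = {r:.2f}, p = {p:.1e}")
```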
Abstract:
This dissertation presents a study of the D(e,e′p)n reaction carried out at the Thomas Jefferson National Accelerator Facility (Jefferson Lab) for a set of fixed values of four-momentum transfer Q² = 2.1 and 0.8 (GeV/c)² and for missing momenta p_m ranging from p_m = 0.03 to p_m = 0.65 GeV/c. The analysis resulted in the determination of absolute D(e,e′p)n cross sections as a function of the recoiling neutron momentum and its scattering angle with respect to the momentum transfer q. The angular distribution was compared to various modern theoretical predictions that also included final state interactions. The data confirmed the theoretical prediction of a strong anisotropy of final state interaction contributions at Q² of 2.1 (GeV/c)², while at the lower Q² value the anisotropy was much less pronounced. At Q² of 0.8 (GeV/c)², theories show a large disagreement with the experimental results. The experimental momentum distribution of the bound proton inside the deuteron has been determined for the first time at a set of fixed neutron recoil angles. The momentum distribution is directly related to the ground-state wave function of the deuteron in momentum space. The high-momentum part of this wave function plays a crucial role in understanding the short-range part of the nucleon-nucleon force. At Q² = 2.1 (GeV/c)², the momentum distribution determined at small neutron recoil angles is much less affected by FSI than at a recoil angle of 75°. In contrast, at Q² = 0.8 (GeV/c)² there seems to be no region with reduced FSI for larger missing momenta. Besides the statistical errors, systematic errors of about 5–6% were included in the final results in order to account for normalization uncertainties and uncertainties in the determination of kinematic variables. The measurements were carried out using electron beam energies of 2.8 and 4.7 GeV with beam currents between 10 and 100 µA.
The scattered electrons and the ejected protons originated from a 15 cm long liquid deuterium target and were detected in coincidence with the two high-resolution spectrometers of Hall A at Jefferson Lab.
Abstract:
The presence of inhibitory substances in biological forensic samples has affected, and continues to affect, the quality of the data generated following DNA typing processes. Although the chemistries used during the procedures have been enhanced to mitigate the effects of these deleterious compounds, some challenges remain. Inhibitors can be components of the samples, of the substrate where samples were deposited, or of chemicals associated with the DNA purification step. Therefore, a thorough understanding of the extraction processes and their ability to handle the various types of inhibitory substances can help define the best analytical processing for any given sample. A series of experiments was conducted to establish the inhibition tolerance of quantification and amplification kits using common inhibitory substances, in order to determine whether current laboratory practices are optimal for identifying potential problems associated with inhibition. DART mass spectrometry was used to determine the amount of inhibitor carryover after sample purification, its correlation to the initial inhibitor input in the sample, and the overall effect on the results. Finally, a novel alternative for gathering investigative leads from samples that would otherwise be ineffective for DNA typing, due to large amounts of inhibitory substances and/or environmental degradation, was tested. This included generating data associated with microbial peak signatures to identify locations of clandestine human graves. Results demonstrate that the current methods for assessing inhibition are not necessarily accurate, as samples that appear inhibited in the quantification process can yield full DNA profiles, while those that do not indicate inhibition may suffer from lowered amplification efficiency or PCR artifacts.
The extraction methods tested were able to remove >90% of the inhibitors from all samples with the exception of phenol, which was present in variable amounts whenever the organic extraction approach was utilized. Although the results attained suggested that most inhibitors produce minimal effect on downstream applications, analysts should practice caution when selecting the best extraction method for particular samples, as casework DNA samples are often present in small quantities and can contain an overwhelming amount of inhibitory substances.
Abstract:
The terrigenous sediment proportion of the deep-sea sediments from off Northwest Africa has been studied in order to distinguish between the aeolian and the fluvial sediment supply. The present and fossil Saharan dust trajectories were recognized from the distribution patterns of the aeolian sediment. The following time slices have been investigated: Present, 6,000, 12,000, and 18,000 y. B. P. Furthermore, the quantity of dust deposited off the Saharan coast has been estimated. For this purpose, 80 surface sediment samples and 34 sediment cores have been analysed. The stratigraphy of the cores has been derived from oxygen isotope curves, 14C dating, foraminiferal transfer temperatures, and carbonate contents. Silt-sized biogenic opal generally accounts for less than 2% of the total insoluble sediment proportion. Only under productive upwelling waters and off river mouths does the opal proportion significantly exceed 2%. The modern terrigenous sediment from off the Saharan coast is generally characterized by intensely stained quartz grains. They indicate an origin from southern Saharan and Sahelian laterites, and a zonal aeolian transport at mid-tropospheric levels, between 1.5 and 5.5 km, by 'Harmattan' winds. The dust particles follow large outbreaks of Saharan air across the African coast between 15° and 21° N. Their trajectories are centered at about 18° N and continue further into a clockwise gyre situated south of the Canary Islands. This course is indicated by a sickle-shaped tongue of coarser grain sizes in the deep-sea sediment. Such loess-sized terrigenous particles only settle within a zone extending to 700 km offshore. Fine silt and clay-sized particles, with grain sizes smaller than 10–15 µm, drift still further west and can be traced to more than 4,000 km from their source areas. Additional terrigenous silt which is poor in stained quartz occurs only within a narrow zone off the western Sahara between 20° and 27° N.
It depicts the present dust supply by the trade winds close to the surface. The dust load originates from the northwestern Sahara, the Atlas Mountains, and coastal areas, which contain a particularly low amount of stained quartz. The distribution pattern of these pale quartz sediments reveals a SSW dispersal of dust consistent with the present trade wind direction from the NNE. In comparison to the sediments from off the Sahara and the deeper subtropical Atlantic, the sediments off river mouths, in particular off the Senegal river, are characterized by an additional input of fine-grained terrigenous particles (< 6 µm). This is due to fluvial suspension load. The fluvial discharge leads to a relative excess of fine-grained particles and is observed in a correlation diagram of the modal grain sizes of terrigenous silt with the proportion of fine fraction (< 6 µm). The aeolian sediment contribution by the Harmattan winds strongly decreased during the Climatic Optimum at 6,000 y. B. P. The dust discharge of the trade winds is hardly detectable in the deep-sea sediments. This probably indicates a weakened atmospheric circulation. In contrast, the fluvial sediment supply reached a maximum, and can be traced to beyond Cape Blanc. Thus, the Saharan climate was more humid at 6,000 y. B. P. A latitudinal shift of the Harmattan-driven dust outbreaks cannot be observed. Also during the Glacial, 18,000 y. B. P., Harmattan dust transport crossed the African coast at latitudes of 15°–20° N. Its sediment load increased strongly, and markedly coarser grains spread further into the Atlantic Ocean. An expanded zone of pale quartz sediments indicates an enhanced dust supply by the trade winds blowing from the NE. No synglacial fluvial sediment contribution can be recognized between 12° and 30° N. This indicates a dry glacial climate and a strengthened atmospheric circulation over the Sahelian and Saharan region. The climatic transition phase at 12,000 y. B. P., between the last Glacial and the Interglacial, which is comparable to the Allerød in Europe, is characterized by an intermediate supply of terrigenous particles. The Harmattan dust transport was weaker than during the Glacial. The northeasterly trade winds were still intensive. River supply reached a first postglacial maximum seaward of the Senegal river mouth. This indicates increasing humidity over the southern Sahara and a weaker atmospheric circulation compared to the Glacial. The accumulation rates of the terrigenous silt proportion (> 6 µm) decrease exponentially with increasing distance from the Saharan coast. Those of the terrigenous fine fraction (< 6 µm) follow the same trend and show almost similar gradients. Accordingly, the terrigenous fine fraction is also believed to result predominantly from aeolian transport. In the Atlantic deep-sea sediments, the annual terrigenous sediment accumulation has fluctuated from about 60 million tons p. a. during the Late Glacial (13,500–18,000 y. B. P., aeolian supply only), to about 33 million tons p. a. during the Holocene Climatic Optimum (6,000–9,000 y. B. P., mainly fluvial supply), when river supply reached a maximum, and to about 45 million tons p. a. during the last 4,000 years B. P. (fluvial supply only south of 18° N).
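The exponential decrease of accumulation rates with distance from the coast can be sketched as a curve fit. All rates and distances below are invented, and the 900 km e-folding length is an assumption for illustration, not a value from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)

# Invented accumulation rates of terrigenous silt (arbitrary units) at
# increasing distance (km) from the coast, generated from an exponential
# with an assumed 900 km e-folding length plus 5% multiplicative noise.
distance = np.array([100, 300, 500, 800, 1200, 2000, 3000], dtype=float)
rate = 12.0 * np.exp(-distance / 900.0) * rng.normal(1.0, 0.05, distance.size)

def decay(x, a, L):
    """Exponential decrease with amplitude a and e-folding length L."""
    return a * np.exp(-x / L)

(a_fit, L_fit), _ = curve_fit(decay, distance, rate, p0=(10.0, 1000.0))
print(f"fitted e-folding length: {L_fit:.0f} km")
```

Fitting the silt and fine fractions separately and comparing the fitted lengths is one way to check the claim that the two fractions show almost similar gradients.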
Abstract:
The discovery of giant stars in spectral regions G and K showing moderate to rapid rotation and single behavior, namely constant radial velocity, represents an important topic of study in stellar astrophysics. Such anomalous rotation clearly violates the theoretical predictions for the evolution of stellar rotation, since at evolved evolutionary stages single stars are expected to have low rotation owing to evolutionary expansion. This property is well established from the observational point of view, with different studies showing that for single giant stars of spectral types G and K rotation values are typically smaller than 5 km s⁻¹. This Thesis seeks to contribute to solving the paradigm described above by searching for single stars of spectral types G and K with anomalous rotation, typically moderate to rapid, in other luminosity classes. In this context, we analyzed a large stellar sample consisting of 2010 apparently single stars of luminosity classes IV, III, II, and Ib with spectral types G and K, with rotational velocity v sin i and radial velocity measurements obtained from observations made with CORAVEL spectrometers. As a first major result, we discovered the presence of anomalous rotators also among subgiant, bright giant, and supergiant stars, namely stars of luminosity classes IV, II, and Ib, in contrast to previous studies, which reported anomalous rotators only among the classical giants of luminosity class III. Such a finding is of great significance because it allows us to analyze the presence of anomalous rotation at different mass intervals, since the luminosity classes considered here cover a mass range between approximately 0.80 and 20 M⊙. In the present survey we discovered 1 subgiant, 9 giants, 2 bright giants, and 5 Ib supergiants, in spectral regions G and K, with values of v sin i ≥ 10 km s⁻¹ and single behavior.
This total of 17 stars corresponds to a frequency of 0.8% of G and K single evolved stars with anomalous rotation in the mentioned luminosity classes, listed in the Bright Star Catalog, which is complete to visual magnitude 6.3. Given these new findings, based on a stellar sample complete in visual magnitude, as that of the Bright Star Catalog, we conducted a comparative statistical analysis using the Kolmogorov–Smirnov test, from which we conclude that the distributions of rotational velocity, v sin i, for single evolved stars with anomalous rotation in luminosity classes III and II are similar to the distributions of v sin i for spectroscopic binary systems with evolved components of the same spectral type and luminosity class. This result indicates that the process of coalescence between stars of a binary system might be a possible mechanism to explain the abnormal rotation observed in these rotators, at least among the giants and bright giants, where the excess rotation would be associated with the transfer of angular momentum to the star resulting from the merger. Another important result of this Thesis concerns the behavior of the infrared emission in most of the stars with anomalous rotation studied here: 14 stars of the sample tend to have an IR excess compared with single stars with low rotation within their luminosity class. This property represents an additional link in the search for the physical mechanisms responsible for the observed abnormal rotation, since recent theoretical studies show that the accretion of objects of sub-stellar mass, such as brown dwarfs and giant planets, by the host star can significantly raise its rotation, while also producing a circumstellar dust disk. This last result seems to point in that direction, since dust disks formed during the stage of star formation are not expected to survive until the subgiant, giant, and Ib supergiant stages.
In summary, in this Thesis, besides the discovery of single G and K evolved stars of luminosity classes IV, II, and Ib with anomalously high rotation compared to what is predicted by stellar evolution theory, we also present the frequency of these abnormal rotators in a stellar sample complete to visual magnitude 6.3. We also present solid evidence that coalescence processes in stellar binary systems and the accretion of brown dwarfs or giant planets by the host stars can act as mechanisms responsible for the puzzling phenomenon of anomalous rotation in single evolved stars.
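The two-sample Kolmogorov–Smirnov comparison of v sin i distributions described above can be sketched with `scipy.stats.ks_2samp`. The sample size of 17 echoes the text, but the lognormal draws and all parameter values are purely illustrative; both samples are drawn from the same parent here, so the test should not reject similarity.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)

# Illustrative v sin i samples (km/s): 17 anomalous single rotators and
# 40 evolved spectroscopic-binary components, drawn from one lognormal
# parent so that the two distributions are genuinely similar.
single_vsini = rng.lognormal(mean=3.0, sigma=0.5, size=17)
binary_vsini = rng.lognormal(mean=3.0, sigma=0.5, size=40)

stat, p = ks_2samp(single_vsini, binary_vsini)
print(f"KS statistic = {stat:.2f}, p = {p:.2f}")
```

A large p-value here is consistent with (though it does not prove) the two samples sharing a parent distribution, which is the logic behind the similarity conclusion in the text.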
Abstract:
Marine spatial planning and ecological research call for high-resolution species distribution data. However, those data are still not available for most large marine vertebrates. The dynamic nature of oceanographic processes and the wide-ranging behavior of many marine vertebrates create further difficulties, as distribution data must incorporate both the spatial and temporal dimensions. Cetaceans play an essential role in structuring and maintaining marine ecosystems and face increasing threats from human activities. The Azores holds a high diversity of cetaceans, but the information about spatial and temporal patterns of distribution for this marine megafauna group in the region is still very limited. To tackle this issue, we created monthly predictive cetacean distribution maps for spring and summer months, using data collected by the Azores Fisheries Observer Programme between 2004 and 2009. We then combined the individual predictive maps to obtain species richness maps for the same period. Our results reflect a great heterogeneity in distribution among species and within species among different months. This heterogeneity reflects a contrasting influence of oceanographic processes on the distribution of cetacean species. However, some persistent areas of increased species richness could also be identified from our results. We argue that policies aimed at effectively protecting cetaceans and their habitats must include the principle of dynamic ocean management coupled with other area-based management such as marine spatial planning.
Abstract:
We explore the nature of Infrared Excess sources (IRX), which are proposed as candidates for luminous [L_X(2–10 keV) > 10^43 erg s^−1] Compton thick (N_H > 2 × 10^24 cm^−2) QSOs at z ≈ 2. Lower redshift, z ≈ 1, analogues of the distant IRX population are identified by first redshifting to z = 2 the spectral energy distributions (SEDs) of all sources with secure spectroscopic redshifts in the AEGIS (6488) and the GOODS-North (1784) surveys and then selecting those that qualify as IRX sources at that redshift. A total of 19 galaxies are selected. The mean redshift of the sample is z ≈ 1. We do not find strong evidence for Compton thick QSOs in the sample. For nine sources with X-ray counterparts, the X-ray spectra are consistent with Compton thin active galactic nuclei (AGN). Only three of them show tentative evidence for Compton thick obscuration. The SEDs of the X-ray undetected population are consistent with starburst activity. There is no evidence for a hot dust component at the mid-infrared associated with AGN-heated dust. If the X-ray undetected sources host AGN, an upper limit of L_X(2–10 keV) = 10^43 erg s^−1 is estimated for their intrinsic luminosity. We propose that a large fraction of the z ≈ 2 IRX population is not Compton thick quasi-stellar objects (QSOs) but low-luminosity [L_X(2–10 keV) < 10^43 erg s^−1], possibly Compton thin, AGN or dusty starbursts. It is shown that the decomposition of the AGN and starburst contributions to the mid-IR is essential for interpreting the nature of this population, as star formation may dominate this wavelength regime.
Abstract:
Aims. Long gamma-ray bursts (LGRBs) are associated with the deaths of massive stars and might therefore be a potentially powerful tool for tracing cosmic star formation. However, especially at low redshifts (z< 1.5) LGRBs seem to prefer particular types of environment. Our aim is to study the host galaxies of a complete sample of bright LGRBs to investigate the effect of the environment on GRB formation. Methods. We studied host galaxy spectra of the Swift/BAT6 complete sample of 14 z< 1 bright LGRBs. We used the detected nebular emission lines to measure the dust extinction, star formation rate (SFR), and nebular metallicity (Z) of the hosts and supplemented the data set with previously measured stellar masses M_*. The distributions of the obtained properties and their interrelations (e.g. mass-metallicity and SFR-M_* relations) are compared to samples of field star-forming galaxies. Results. We find that LGRB hosts at z< 1 have on average lower SFRs than if they were direct star formation tracers. By directly comparing metallicity distributions of LGRB hosts and star-forming galaxies, we find a good match between the two populations up to 12 +log (O/H)~8.4−8.5, after which the paucity of metal-rich LGRB hosts becomes apparent. The LGRB host galaxies of our complete sample are consistent with the mass-metallicity relation at similar mean redshift and stellar masses. The cutoff against high metallicities (and high masses) can explain the low SFR values of LGRB hosts. We find a hint of an increased incidence of starburst galaxies in the Swift/BAT6 z< 1 sample with respect to that of a field star-forming population. Given that the SFRs are low on average, the latter is ascribed to low stellar masses. 
Nevertheless, the limits on the completeness and metallicity availability of current surveys, coupled with the limited number of LGRB host galaxies, prevent us from investigating more quantitatively whether the starburst incidence is as expected after taking into account the high-metallicity aversion of LGRB host galaxies.
Abstract:
Current interest in measuring quality of life is generating interest in the construction of computerized adaptive tests (CATs) with Likert-type items. Calibration of an item bank for use in CAT requires collecting responses to a large number of candidate items. However, the number is usually too large to administer to each subject in the calibration sample. The concurrent anchor-item design solves this problem by splitting the items into separate subtests, with some common items across subtests; then administering each subtest to a different sample; and finally running estimation algorithms once on the aggregated data array, from which a substantial number of responses are then missing. Although the use of anchor-item designs is widespread, the consequences of several configuration decisions on the accuracy of parameter estimates have never been studied in the polytomous case. The present study addresses this question by simulation, comparing the outcomes of several alternatives on the configuration of the anchor-item design. The factors defining variants of the anchor-item design are (a) subtest size, (b) balance of common and unique items per subtest, (c) characteristics of the common items, and (d) criteria for the distribution of unique items across subtests. The results of this study indicate that maximizing accuracy in item parameter recovery requires subtests of the largest possible number of items and the smallest possible number of common items; the characteristics of the common items and the criterion for distribution of unique items do not affect accuracy.
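The concurrent anchor-item design described above (subtests that share a set of common items, each administered to a disjoint sample, then aggregated into one array with structural missingness) can be sketched as follows. The subtest sizes, number of common items, and sample sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed design: 3 subtests, each pairing 5 common (anchor) items with
# 10 unique items, each administered to a disjoint sample of 100 subjects.
n_common, n_unique, n_subtests, n_per = 5, 10, 3, 100
n_items = n_common + n_subtests * n_unique

# Aggregated data array; NaN marks responses that are missing by design
# (each sample never sees the other subtests' unique items).
data = np.full((n_subtests * n_per, n_items), np.nan)
for s in range(n_subtests):
    rows = slice(s * n_per, (s + 1) * n_per)
    cols = list(range(n_common)) + list(
        range(n_common + s * n_unique, n_common + (s + 1) * n_unique))
    # Simulated 5-point Likert responses for the items this sample saw.
    data[rows, cols] = rng.integers(1, 6, size=(n_per, len(cols)))

print("observed items per subject:", int(np.sum(~np.isnan(data[0]))))
```

Running a polytomous IRT calibration once on `data` (treating the NaNs as missing by design) is the "concurrent" estimation step the abstract refers to; the design factors studied correspond to varying `n_common`, `n_unique`, and how unique items are allocated.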
Abstract:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size, despite huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
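The reduced-rank tensor factorization referred to above can be illustrated with a toy example: a two-class latent structure model expresses a two-way probability tensor as a nonnegative rank-2 (PARAFAC-type) factorization. All numbers below are illustrative, not drawn from the thesis.

```python
import numpy as np

# Two-class latent structure model for a pair of categorical variables:
# p(y1, y2) = sum_h nu[h] * psi1[h, y1] * psi2[h, y2]
nu = np.array([0.6, 0.4])            # latent class weights
psi1 = np.array([[0.7, 0.2, 0.1],    # P(y1 = . | class h), 3 levels
                 [0.1, 0.3, 0.6]])
psi2 = np.array([[0.5, 0.5],         # P(y2 = . | class h), 2 levels
                 [0.9, 0.1]])

# Nonnegative rank-2 (PARAFAC-type) factorization of the joint pmf,
# written as a sum over latent classes via einsum
P = np.einsum('h,hi,hj->ij', nu, psi1, psi2)
# P is a valid 3 x 2 joint pmf: nonnegative entries summing to one
```

The rank of this factorization (here, 2) is the number of latent classes; log-linear models would instead constrain which interaction terms of log P are nonzero, and Chapter 2 relates these two notions of parsimony.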
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and we give a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations, and in other common population structure inference problems, is assessed in simulations and a real data application.
In contingency table analysis, sparse data are frequently encountered even for modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed-form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide a convergence rate and finite-sample bounds for the Kullback--Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically, in simulations and a real data application, that the approximation is highly accurate even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
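The flavor of a Gaussian approximation to a posterior can be sketched in a much simpler setting than log-linear models with Diaconis--Ylvisaker priors: below is a Laplace-style approximation for a single Poisson log-rate under a Gaussian prior, purely for illustration (the model and all constants are assumptions, not the Chapter 4 construction).

```python
import numpy as np

# Toy model: y_i ~ Poisson(exp(theta)), prior theta ~ N(0, prior_var).
# We approximate the posterior of theta by a Gaussian centered at the
# posterior mode with variance given by the inverse negative Hessian.
rng = np.random.default_rng(1)
y = rng.poisson(lam=3.0, size=200)          # simulated counts
n, S = y.size, y.sum()
prior_var = 10.0

# Newton iterations for the posterior mode of theta = log(rate):
# log-posterior (up to a constant) is S*theta - n*exp(theta) - theta^2/(2*prior_var)
theta = 0.0
for _ in range(50):
    grad = S - n * np.exp(theta) - theta / prior_var
    hess = -n * np.exp(theta) - 1.0 / prior_var
    theta -= grad / hess

mode, var = theta, -1.0 / hess              # Gaussian approximation N(mode, var)
# exp(mode) is close to the sample mean of y, as the flat-ish prior suggests
```

The Chapter 4 result goes further than this generic recipe: it identifies the *optimal* Gaussian approximation and bounds its Kullback--Leibler distance from the exact posterior.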
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that both the strength of tail dependence and its temporal structure are encoded in the waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, we propose a new definition of tail dependence as a function of the distribution of waiting times between threshold exceedances, and construct an inferential framework for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiological data.
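The basic summary statistic behind this paradigm, the waiting times between exceedances of a high threshold, is straightforward to compute. The toy sketch below (series, seed, and threshold level are all assumptions) shows the computation on a simulated stationary series.

```python
import numpy as np

# Simulate a toy stationary series and extract inter-exceedance waiting times
rng = np.random.default_rng(7)
x = rng.standard_normal(10_000)          # placeholder for an observed process
u = np.quantile(x, 0.99)                 # a high threshold (99th percentile)

exceed_times = np.flatnonzero(x > u)     # time indices of threshold exceedances
waits = np.diff(exceed_times)            # waiting times between exceedances

# For a series with no extremal clustering, the waits are approximately
# geometric with mean near 1 / P(X > u), i.e. about 100 here; clustering
# of extremes would show up as an excess of short waiting times.
```

In the Chapter 5 framework it is the distribution of `waits` (jointly across locations) that carries the information about strength and temporal structure of tail dependence.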
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel, yet comparatively little attention has been paid to convergence and estimation error in the resulting approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, to low-rank approximations of Gaussian processes, and to a novel approximating Markov chain for discrete mixture models.
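One of the approximation classes mentioned, transition kernels built on random subsets of the data, can be sketched as a subsampled random-walk Metropolis sampler for a toy Gaussian mean problem. The subset size and step size below are arbitrary illustrative choices; the subsampling noise perturbs the exact kernel, producing exactly the kind of approximating chain whose error/computation trade-off the framework addresses.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(2.0, 1.0, size=5_000)   # toy dataset, known unit variance

def subset_loglik(theta, batch):
    # Log-likelihood on a random subset, rescaled to estimate the full-data sum
    return data.size * np.mean(-0.5 * (batch - theta) ** 2)

def approx_mh(n_iter=4_000, batch_size=500, step=0.02):
    """Random-walk Metropolis for the mean with a subsampled transition kernel.

    The Monte Carlo noise in the accept/reject step makes this an
    approximation to the exact kernel, trading per-step cost for bias.
    """
    theta, chain = 0.0, []
    for _ in range(n_iter):
        batch = rng.choice(data, size=batch_size, replace=False)
        prop = theta + step * rng.standard_normal()
        # Evaluating both states on the same batch reduces the noise
        if np.log(rng.random()) < subset_loglik(prop, batch) - subset_loglik(theta, batch):
            theta = prop
        chain.append(theta)
    return np.array(chain)

chain = approx_mh()
# After burn-in the chain concentrates near the sample mean (about 2.0)
```

Increasing `batch_size` shrinks the kernel error at higher per-iteration cost; choosing that trade-off optimally for a given budget is the question the chapter formalizes.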
Data augmentation Gibbs samplers are arguably the most popular class of algorithms for approximately sampling from the posterior distribution of the parameters of generalized linear models. The truncated normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size, up to a logarithmic factor. In simulation studies, moderate sample sizes already produce high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly, whereas Hamiltonian Monte Carlo and a type of independence-chain Metropolis algorithm show good mixing on the same dataset.
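The slow-mixing phenomenon can be reproduced in a toy version of the truncated normal (Albert--Chib-type) data augmentation sampler for an intercept-only probit model with rare successes. The sample sizes, seed, and the simple batched rejection sampler below are illustrative choices, not the thesis's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rtrunc(mean, size, positive, rng):
    """size draws from N(mean, 1) truncated to z > 0 (or z < 0), by
    batched rejection -- inefficient deep in the tail, fine for a demo."""
    out, filled = np.empty(size), 0
    while filled < size:
        d = mean + rng.standard_normal(max(4 * size, 1000))
        d = d[d > 0] if positive else d[d < 0]
        take = min(d.size, size - filled)
        out[filled:filled + take] = d[:take]
        filled += take
    return out

# Intercept-only probit with rare events: 20 successes out of n = 2000.
# Gibbs: z_i | theta, y_i is truncated normal; theta | z is N(zbar, 1/n)
# under a flat prior on theta.
n, n1 = 2000, 20
theta, chain = 0.0, []
for _ in range(1000):
    z1 = rtrunc(theta, n1, positive=True, rng=rng)       # latent z for y = 1
    z0 = rtrunc(theta, n - n1, positive=False, rng=rng)  # latent z for y = 0
    zbar = (z1.sum() + z0.sum()) / n
    theta = zbar + rng.standard_normal() / np.sqrt(n)
    chain.append(theta)

c = np.array(chain[200:])                  # drop burn-in
lag1 = np.corrcoef(c[:-1], c[1:])[0, 1]
# theta settles near Phi^{-1}(0.01), about -2.33, but the lag-1
# autocorrelation is close to 1: the chain mixes very slowly
```

The mechanism is visible in the update: `zbar` tracks the current `theta` almost one-for-one when successes are rare, so successive draws of `theta` are nearly perfectly correlated, consistent with the spectral gap result above.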