9 results for Large Data
in CaltechTHESIS
Abstract:
The connections between convexity and submodularity are explored, for purposes of minimizing and learning submodular set functions.
First, we develop a novel method for minimizing a particular class of submodular functions, which can be expressed as a sum of concave functions composed with modular functions. The basic algorithm uses an accelerated first order method applied to a smoothed version of its convex extension. The smoothing algorithm is particularly novel as it allows us to treat general concave potentials without needing to construct a piecewise linear approximation as with graph-based techniques.
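As a rough illustration of the "convex extension" mentioned above, the following Python sketch evaluates the Lovász extension of a function of the form concave-of-modular. It assumes the extension in question is the Lovász extension, which the abstract does not state; the function names, the weights, and the square-root potential are all illustrative choices. The smoothing and accelerated first-order steps are not reproduced here.

```python
import math


def concave_of_modular(weights, phi=math.sqrt):
    """Build F(S) = phi(sum of weights[i] for i in S).

    F is submodular when phi is concave and the weights are non-negative.
    The weights and the choice phi = sqrt are illustrative assumptions.
    """
    def F(subset):
        return phi(sum(weights[i] for i in subset))
    return F


def lovasz_extension(F, x):
    """Evaluate the Lovasz extension of F at x (requires F(empty set) = 0).

    Sort coordinates in decreasing order, then weight each marginal gain
    F(S_k) - F(S_{k-1}) by the corresponding coordinate of x.
    """
    order = sorted(range(len(x)), key=lambda i: -x[i])
    value = 0.0
    chosen = []
    previous = F(chosen)
    for i in order:
        chosen.append(i)
        current = F(chosen)
        value += x[i] * (current - previous)
        previous = current
    return value


F = concave_of_modular([1.0, 2.0, 3.0])
# On a 0/1 indicator vector the extension agrees with the set function.
on_vertex = lovasz_extension(F, [1.0, 0.0, 1.0])
# At a fractional point it interpolates between set values.
inside = lovasz_extension(F, [0.5, 0.5, 0.5])
```

Because the extension is convex exactly when F is submodular, minimizing it over the unit cube is one standard route from a set-function problem to a convex one.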
Second, we derive the general conditions under which it is possible to find a minimizer of a submodular function via a convex problem. This provides a framework for developing submodular minimization algorithms. The framework is then used to develop several algorithms that can be run in a distributed fashion. This is particularly useful for applications where the submodular objective function consists of a sum of many terms, each term dependent on a small part of a large data set.
Lastly, we approach the problem of learning set functions from an unorthodox perspective: sparse reconstruction. We demonstrate an explicit connection between the problem of learning set functions from random evaluations and that of reconstructing sparse signals. Based on the observation that the Fourier transform for set functions satisfies exactly the conditions needed for sparse reconstruction algorithms to work, we examine some different function classes under which uniform reconstruction is possible.
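The notion of a sparse Fourier spectrum for a set function can be sketched in a few lines of Python. The sketch assumes the transform meant is the Walsh-Hadamard transform over subsets, with subsets encoded as bitmasks. The cardinality function is only an illustrative example, and no reconstruction algorithm is implemented.

```python
def set_function_fourier(f, n):
    """Fourier (Walsh-Hadamard) coefficients of a set function on n elements.

    Subsets are encoded as bitmasks 0 .. 2**n - 1. The coefficient for B is
    2**-n * sum over A of f(A) * (-1)**|A & B|.
    """
    size = 1 << n
    values = [f(a) for a in range(size)]
    coeffs = []
    for b in range(size):
        total = 0.0
        for a in range(size):
            sign = -1 if bin(a & b).count("1") % 2 else 1
            total += sign * values[a]
        coeffs.append(total / size)
    return coeffs


def cardinality(mask):
    """Example set function f(A) = |A| (an illustrative choice)."""
    return bin(mask).count("1")


coeffs = set_function_fourier(cardinality, 3)
# Indices (bitmasks) of the non-zero coefficients.
support = [b for b, c in enumerate(coeffs) if abs(c) > 1e-12]
```

For this example only the empty set and the three singletons carry non-zero coefficients, so 4 of the 8 coefficients describe the function completely.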
Abstract:
The anisotropy of 1.3 - 2.3 MeV protons in interplanetary space has been measured using the Caltech Electron/Isotope Spectrometer aboard IMP-7 for 317 6-hour periods from 72/273 to 74/2. Periods dominated by prompt solar particle events are not included. The convective and diffusive anisotropies are determined from the observed anisotropy using concurrent solar wind speed measurements and observed energy spectra. The diffusive flow of particles is found to be typically toward the sun, indicating a positive radial gradient in the particle density. This anisotropy is inconsistent with previously proposed sources of low-energy proton increases seen at 1 AU which involve continual solar acceleration.
The typical properties of this new component of low-energy cosmic rays have been determined for this period, which is near solar minimum. The particles have a median intensity of 0.06 protons/cm^(2)-sec-sr-MeV and a mean spectral index of -3.15. The amplitude of the diffusive anisotropy is approximately proportional to the solar wind speed. The rate at which particles are diffusing toward the sun is larger than the rate at which the solar wind is convecting the particles away from the sun. The 20 to 1 proton to alpha ratio typical of this new component has been reported by Mewaldt, et al. (1975b).
A propagation model with κ_(rr) assumed independent of radius and energy is used to show that the anisotropy could be due to increases similar to those found by McDonald, et al. (1975) at ~3 AU. The interplanetary Fermi-acceleration model proposed by Fisk (1976) to explain the increases seen near 3 AU is not consistent with the ~12 per cent diffusive anisotropy found.
The dependence of the diffusive anisotropy on various parameters is shown. A strong dependence of the direction of the diffusive anisotropy on the concurrently measured magnetic field direction is found, indicating a κ_⊥ less than κ_∥ to be typical for this large data set.
Abstract:
The evoked response, a signal present in the electro-encephalogram when specific sense modalities are stimulated with brief sensory inputs, has not yet revealed as much about brain function as it apparently promised when first recorded in the late 1940's. One of the problems has been to record the responses at a large number of points on the surface of the head; thus in order to achieve greater spatial resolution than previously attained, a 50-channel recording system was designed to monitor experiments with human visually evoked responses.
Conventional voltage versus time plots of the responses were found inadequate as a means of making qualitative studies of such a large data space. This problem was solved by creating a graphical display of the responses in the form of equipotential maps of the activity at successive instants during the complete response. In order to ascertain the necessary complexity of any models of the responses, factor analytic procedures were used to show that models characterized by only five or six independent parameters could adequately represent the variability in all recording channels.
One type of equivalent source for the responses which meets these specifications is the electrostatic dipole. Two different dipole models were studied: the dipole in a homogeneous sphere and the dipole in a sphere comprised of two spherical shells (of different conductivities) concentric with and enclosing a homogeneous sphere of a third conductivity. These models were used to determine nonlinear least squares fits of dipole parameters to a given potential distribution on the surface of a spherical approximation to the head. Numerous tests of the procedures were conducted with problems having known solutions. After these theoretical studies demonstrated the applicability of the technique, the models were used to determine inverse solutions for the evoked response potentials at various times throughout the responses. It was found that reliable estimates of the location and strength of cortical activity were obtained, and that the two models differed only slightly in their inverse solutions. These techniques enabled information flow in the brain, as indicated by locations and strengths of active sites, to be followed throughout the evoked response.
Abstract:
Forced vibration field tests and finite element studies have been conducted on Morrow Point (arch) Dam in order to investigate dynamic dam-water interaction and water compressibility. Design of the data acquisition system incorporates several special features to retrieve both amplitude and phase of the response in a low signal to noise environment. These features contributed to the success of the experimental program which, for the first time, produced field evidence of water compressibility; this effect seems to play a significant role only in the symmetric response of Morrow Point Dam in the frequency range examined. In the accompanying analysis, frequency response curves for measured accelerations and water pressures as well as their resonating shapes are compared to predictions from the current state-of-the-art finite element model for which water compressibility is both included and neglected. Calibration of the numerical model employs the antisymmetric response data since they are only slightly affected by water compressibility, and, after calibration, good agreement with the data is obtained whether or not water compressibility is included. In the effort to reproduce the symmetric response data, on which water compressibility has a significant influence, the calibrated model shows better correlation when water compressibility is included, but the agreement is still inadequate. Similar results occur using data obtained previously by others at a low water level. A successful isolation of the fundamental water resonance from the experimental data shows significantly different features from those of the numerical water model, indicating possible inaccuracy in the assumed geometry and/or boundary conditions for the reservoir. However, the investigation does suggest possible directions in which the numerical model can be improved.
Abstract:
In this thesis, a method to retrieve the source finiteness, depth of faulting, and the mechanisms of large earthquakes from long-period surface waves is developed and applied to several recent large events.
In Chapter 1, source finiteness parameters of eleven large earthquakes were determined from long-period Rayleigh waves recorded at IDA and GDSN stations. The basic data set is the seismic spectra of periods from 150 to 300 sec. Two simple models of source finiteness are studied. The first model is a point source with finite duration. In the determination of the duration or source-process times, we used Furumoto's phase method and a linear inversion method, in which we simultaneously inverted the spectra and determined the source-process time that minimizes the error in the inversion. These two methods yielded consistent results. The second model is the finite fault model. Source finiteness of large shallow earthquakes with rupture on a fault plane with a large aspect ratio was modeled with the source-finiteness function introduced by Ben-Menahem. The spectra were inverted to find the extent and direction of the rupture of the earthquake that minimize the error in the inversion. This method is applied to the 1977 Sumbawa, Indonesia, 1979 Colombia-Ecuador, 1983 Akita-Oki, Japan, 1985 Valparaiso, Chile, and 1985 Michoacan, Mexico earthquakes. The method yielded results consistent with the rupture extent inferred from the aftershock area of these earthquakes.
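The effect of a finite fault on the spectrum can be sketched with the commonly quoted unilateral-rupture finiteness factor, sin(X)/X · exp(-iX), with X = (ωL/2c)(c/v_r − cos θ). This form is an assumption about what the source-finiteness function looks like, since the abstract does not write it out. The function name and all numerical values below are hypothetical.

```python
import math


def finiteness_factor(period_s, length_km, rupture_speed, phase_speed, azimuth_rad):
    """Amplitude and phase delay of a unilateral-rupture finiteness factor.

    Uses the commonly quoted form sin(X)/X * exp(-iX) with
    X = (omega * L / (2 c)) * (c / v_r - cos(theta)),
    where theta is the station azimuth measured from the rupture direction.
    Speeds are in km/s; this is a sketch, not the thesis's implementation.
    """
    omega = 2.0 * math.pi / period_s
    x = (omega * length_km / (2.0 * phase_speed)) * (
        phase_speed / rupture_speed - math.cos(azimuth_rad)
    )
    amplitude = 1.0 if x == 0.0 else abs(math.sin(x) / x)
    return amplitude, x  # x is the phase delay in radians


# Hypothetical values: 200 s period, 100 km rupture, 2.5 km/s rupture speed,
# 4.0 km/s phase speed.
forward = finiteness_factor(200.0, 100.0, 2.5, 4.0, 0.0)
backward = finiteness_factor(200.0, 100.0, 2.5, 4.0, math.pi)
```

Stations in the rupture direction see less spectral reduction than stations behind it. This azimuthal contrast is what lets an inversion of the spectra constrain the extent and direction of rupture.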
In Chapter 2, the depths and source mechanisms of nine large shallow earthquakes were determined. We inverted the data set of complex source spectra for a moment tensor (linear) or a double couple (nonlinear). By solving a least-squares problem, we obtained the centroid depth or the extent of the distributed source for each earthquake. The depths and source mechanisms of large shallow earthquakes determined from long-period Rayleigh waves depend on the models of source finiteness, wave propagation, and the excitation. We tested various models of the source finiteness, Q, the group velocity, and the excitation in the determination of earthquake depths.
The depth estimates obtained using the Q model of Dziewonski and Steim (1982) and the excitation functions computed for the average ocean model of Regan and Anderson (1984) are considered most reasonable. Dziewonski and Steim's Q model represents a good global average of Q determined over a period range of the Rayleigh waves used in this study. Since most of the earthquakes studied here occurred in subduction zones, Regan and Anderson's average ocean model is considered most appropriate.
Our depth estimates are in general consistent with the Harvard CMT solutions. The centroid depths and their 90% confidence intervals (numbers in parentheses) determined by the Student's t test are: Colombia-Ecuador earthquake (12 December 1979), d = 11 km, (9, 24) km; Santa Cruz Is. earthquake (17 July 1980), d = 36 km, (18, 46) km; Samoa earthquake (1 September 1981), d = 15 km, (9, 26) km; Playa Azul, Mexico earthquake (25 October 1981), d = 41 km, (28, 49) km; El Salvador earthquake (19 June 1982), d = 49 km, (41, 55) km; New Ireland earthquake (18 March 1983), d = 75 km, (72, 79) km; Chagos Bank earthquake (30 November 1983), d = 31 km, (16, 41) km; Valparaiso, Chile earthquake (3 March 1985), d = 44 km, (15, 54) km; Michoacan, Mexico earthquake (19 September 1985), d = 24 km, (12, 34) km.
In Chapter 3, the vertical extents of faulting of the 1983 Akita-Oki, Japan, and 1977 Sumbawa, Indonesia, earthquakes are determined from fundamental and overtone Rayleigh waves. Using fundamental Rayleigh waves, the depths are determined from the moment tensor inversion and fault inversion. The observed overtone Rayleigh waves are compared to the synthetic overtone seismograms to estimate the depth of faulting of these earthquakes. The depths obtained from overtone Rayleigh waves are consistent with the depths determined from fundamental Rayleigh waves for the two earthquakes. Appendix B gives the observed seismograms of fundamental and overtone Rayleigh waves for eleven large earthquakes.
Abstract:
Studies in turbulence often focus on two flow conditions, both of which occur frequently in real-world flows and are sought-after for their value in advancing turbulence theory. These are the high Reynolds number regime and the effect of wall surface roughness. In this dissertation, a Large-Eddy Simulation (LES) recreates both conditions over a wide range of Reynolds numbers Re_τ = O(10^(2))-O(10^(8)) and accounts for roughness by locally modeling the statistical effects of near-wall anisotropic fine scales in a thin layer immediately above the rough surface. A subgrid, roughness-corrected wall model is introduced to dynamically transmit this modeled information from the wall to the outer LES, which uses a stretched-vortex subgrid-scale model operating in the bulk of the flow. Of primary interest is the Reynolds number and roughness dependence of these flows in terms of first and second order statistics. The LES is first applied to a fully turbulent uniformly-smooth/rough channel flow to capture the flow dynamics over smooth, transitionally rough and fully rough regimes. Results include a Moody-like diagram for the wall averaged friction factor, believed to be the first of its kind obtained from LES. Confirmation is found for experimentally observed logarithmic behavior in the normalized stream-wise turbulent intensities. Tight logarithmic collapse, scaled on the wall friction velocity, is found for smooth-wall flows when Re_τ ≥ O(10^(6)) and in fully rough cases. Since the wall model operates locally and dynamically, the framework is used to investigate non-uniform roughness distribution cases in a channel, where the flow adjustments to sudden surface changes are investigated. Recovery of mean quantities and turbulent statistics after transitions are discussed qualitatively and quantitatively at various roughness and Reynolds number levels.
The internal boundary layer, which is defined as the border between the flow affected by the new surface condition and the unaffected part, is computed, and a collapse of the profiles on a length scale containing the logarithm of friction Reynolds number is presented. Finally, we turn to the possibility of expanding the present framework to accommodate more general geometries. As a first step, the whole LES framework is modified for use in the curvilinear geometry of a fully-developed turbulent pipe flow, with implementation carried out in a spectral element solver capable of handling complex wall profiles. The friction factors have shown favorable agreement with the superpipe data, and the LES estimates of the Karman constant and additive constant of the log-law closely match values obtained from experiment.
Abstract:
In the measurement of the Higgs Boson decaying into two photons, the parametrization of an appropriate background model is essential for fitting the Higgs signal mass peak over a continuous background. This diphoton background modeling is crucial in the statistical process of calculating exclusion limits and the significance of observations in comparison to a background-only hypothesis. It is therefore ideal to obtain knowledge of the physical shape for the background mass distribution, as the use of an improper function can lead to biases in the observed limits. Using an Information-Theoretic (I-T) approach for valid inference, we apply the Akaike Information Criterion (AIC) as a measure of the separation of a fitting model from the data. We then implement a multi-model inference ranking method to build a fit-model that most closely represents the Standard Model background in 2013 diphoton data recorded by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC). Potential applications and extensions of this model-selection technique are discussed with reference to CMS detector performance measurements as well as in potential physics analyses at future detectors.
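The AIC-based ranking described above can be sketched as follows. The code uses the standard definitions AIC = 2k − 2 ln L and Akaike weights proportional to exp(−Δ/2). The candidate model names, log-likelihoods, and parameter counts are hypothetical placeholders, not values from the thesis or from CMS data.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def akaike_weights(aic_values):
    """Relative support for each model, normalised to sum to one."""
    best = min(aic_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]


# Hypothetical fits: (maximised log-likelihood, number of free parameters).
candidates = {
    "exponential": (-1204.1, 2),
    "power_law": (-1203.6, 2),
    "polynomial_4": (-1202.9, 5),
}
scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
weights = dict(zip(scores, akaike_weights(list(scores.values()))))
# Lower AIC means less estimated information loss, so sort ascending.
ranking = sorted(scores, key=scores.get)
```

In this made-up example the five-parameter model has the highest likelihood but is ranked last, because the penalty for extra parameters outweighs the gain in fit.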
Abstract:
In the first part of the thesis we explore three fundamental questions that arise naturally when we conceive a machine learning scenario where the training and test distributions can differ. Contrary to conventional wisdom, we show that in fact mismatched training and test distributions can yield better out-of-sample performance. This optimal performance can be obtained by training with the dual distribution. This optimal training distribution depends on the test distribution set by the problem, but not on the target function that we want to learn. We show how to obtain this distribution in both discrete and continuous input spaces, as well as how to approximate it in a practical scenario. Benefits of using this distribution are exemplified in both synthetic and real data sets.
In order to apply the dual distribution in the supervised learning scenario where the training data set is fixed, it is necessary to use weights to make the sample appear as if it came from the dual distribution. We explore the negative effect that weighting a sample can have. The theoretical decomposition of the use of weights regarding its effect on the out-of-sample error is easy to understand but not actionable in practice, as the quantities involved cannot be computed. Hence, we propose the Targeted Weighting algorithm that determines if, for a given set of weights, the out-of-sample performance will improve or not in a practical setting. This is necessary as the setting assumes there are no labeled points distributed according to the test distribution, only unlabeled samples.
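One way to see why weighting a sample can hurt is to look at the effective sample size after reweighting. The sketch below uses plain importance weights (desired probability over training probability) and the Kish effective sample size. It is not the Targeted Weighting algorithm, and the two discrete distributions are hypothetical.

```python
def importance_weights(train_probs, target_probs, samples):
    """Weight each sample by target probability / training probability."""
    return [target_probs[s] / train_probs[s] for s in samples]


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical discrete input space with two values.
train = {"a": 0.5, "b": 0.5}     # distribution the fixed sample came from
target = {"a": 0.9, "b": 0.1}    # distribution we want the sample to mimic
samples = ["a", "b"] * 50        # 100 training points

w = importance_weights(train, target, samples)
n_eff = effective_sample_size(w)
```

Here the 100 reweighted points carry the statistical weight of roughly 61 unweighted points. This loss is one standard way to quantify the variance cost of weighting, which has to be traded off against the benefit of matching the desired distribution.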
Finally, we propose a new class of matching algorithms that can be used to match the training set to a desired distribution, such as the dual distribution (or the test distribution). These algorithms can be applied to very large datasets, and we show how they lead to improved performance in a large real dataset such as the Netflix dataset. Their computational complexity is the main reason for their advantage over previous algorithms proposed in the covariate shift literature.
In the second part of the thesis we apply Machine Learning to the problem of behavior recognition. We develop a specific behavior classifier to study fly aggression, and we develop a system that allows analyzing behavior in videos of animals, with minimal supervision. The system, which we call CUBA (Caltech Unsupervised Behavior Analysis), allows detecting movemes, actions, and stories from time series describing the position of animals in videos. The method summarizes the data and provides biologists with a mathematical tool to test new hypotheses. Other benefits of CUBA include finding classifiers for specific behaviors without the need for annotation, as well as providing means to discriminate groups of animals, for example, according to their genetic line.
Abstract:
A large array has been used to investigate the P-wave velocity structure of the lower mantle. Linear array processing methods are reviewed and a method of nonlinear processing is presented. Phase velocities, travel times, and relative amplitudes of P waves have been measured with the large array at the Tonto Forest Seismological Observatory in Arizona for 125 earthquakes in the distance range of 30 to 100 degrees. Various models are assumed for the upper 771 km of the mantle and the Wiechert-Herglotz method is applied to the phase velocity data to obtain a velocity depth structure for the lower mantle. The phase velocity data indicate the presence of a second-order discontinuity at a depth of 840 km, another at 1150 km, and less pronounced discontinuities at 1320, 1700 and 1950 km. Phase velocities beyond 85 degrees are interpreted in terms of a triplication of the phase velocity curve, and this results in a zone of almost constant velocity between depths of 2670 and 2800 km. Because of the uncertainty in the upper mantle assumptions, a final model cannot be proposed, but it appears that the lower mantle is more complicated than the standard models and there is good evidence for second-order discontinuities below a depth of 1000 km. A tentative lower bound of 2881 km can be placed on the depth to the core. The importance of checking the calculated velocity structure against independently measured travel times is pointed out. Comparisons are also made with observed PcP times and the agreement is good. The method of using measured values of the rate of change of amplitude with distance shows promising results.