3 resultados para Large Data Sets

em CaltechTHESIS


Relevância:

100.00% 100.00%

Publicador:

Resumo:

In the first part of the thesis we explore three fundamental questions that arise naturally when we conceive a machine learning scenario where the training and test distributions can differ. Contrary to conventional wisdom, we show that in fact mismatched training and test distribution can yield better out-of-sample performance. This optimal performance can be obtained by training with the dual distribution. This optimal training distribution depends on the test distribution set by the problem, but not on the target function that we want to learn. We show how to obtain this distribution in both discrete and continuous input spaces, as well as how to approximate it in a practical scenario. Benefits of using this distribution are exemplified in both synthetic and real data sets.

In order to apply the dual distribution in the supervised learning scenario where the training data set is fixed, it is necessary to use weights to make the sample appear as if it came from the dual distribution. We explore the negative effect that weighting a sample can have. The theoretical decomposition of the use of weights regarding its effect on the out-of-sample error is easy to understand but not actionable in practice, as the quantities involved cannot be computed. Hence, we propose the Targeted Weighting algorithm that determines if, for a given set of weights, the out-of-sample performance will improve or not in a practical setting. This is necessary as the setting assumes there are no labeled points distributed according to the test distribution, only unlabeled samples.

Finally, we propose a new class of matching algorithms that can be used to match the training set to a desired distribution, such as the dual distribution (or the test distribution). These algorithms can be applied to very large datasets, and we show how they lead to improved performance in a large real dataset such as the Netflix dataset. Their computational complexity is the main reason for their advantage over previous algorithms proposed in the covariate shift literature.

In the second part of the thesis we apply Machine Learning to the problem of behavior recognition. We develop a specific behavior classifier to study fly aggression, and we develop a system that allows analyzing behavior in videos of animals, with minimal supervision. The system, which we call CUBA (Caltech Unsupervised Behavior Analysis), allows detecting movemes, actions, and stories from time series describing the position of animals in videos. The method summarizes the data, as well as it provides biologists with a mathematical tool to test new hypotheses. Other benefits of CUBA include finding classifiers for specific behaviors without the need for annotation, as well as providing means to discriminate groups of animals, for example, according to their genetic line.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Large quantities of teleseismic short-period seismograms recorded at SCARLET provide travel time, apparent velocity and waveform data for study of upper mantle compressional velocity structure. Relative array analysis of arrival times from distant (30° < Δ < 95°) earthquakes at all azimuths constrains lateral velocity variations beneath southern California. We compare dT/dΔ back azimuth and averaged arrival time estimates from the entire network for 154 events to the same parameters derived from small subsets of SCARLET. Patterns of mislocation vectors for over 100 overlapping subarrays delimit the spatial extent of an east-west striking, high-velocity anomaly beneath the Transverse Ranges. Thin lens analysis of the averaged arrival time differences, called 'net delay' data, requires the mean depth of the corresponding lens to be more than 100 km. Our results are consistent with the PKP-delay times of Hadley and Kanamori (1977), who first proposed the high-velocity feature, but we place the anomalous material at substantially greater depths than their 40-100 km estimate.

Detailed analysis of travel time, ray parameter and waveform data from 29 events occurring in the distance range 9° to 40° reveals the upper mantle structure beneath an oceanic ridge to depths of over 900 km. More than 1400 digital seismograms from earthquakes in Mexico and Central America yield 1753 travel times and 58 dT/dΔ measurements as well as high-quality, stable waveforms for investigation of the deep structure of the Gulf of California. The result of a travel time inversion with the tau method (Bessonova et al., 1976) is adjusted to fit the p(Δ) data, then further refined by incorporation of relative amplitude information through synthetic seismogram modeling. The application of a modified wave field continuation method (Clayton and McMechan, 1981) to the data with the final model confirms that GCA is consistent with the entire data set and also provides an estimate of the data resolution in velocity-depth space. We discover that the upper mantle under this spreading center has anomalously slow velocities to depths of 350 km, and place new constraints on the shape of the 660 km discontinuity.

Seismograms from 22 earthquakes along the northeast Pacific rim recorded in southern California form the data set for a comparative investigation of the upper mantle beneath the Cascade Ranges-Juan de Fuca region, an ocean-continent transit ion. These data consist of 853 seismograms (6° < Δ < 42°) which produce 1068 travel times and 40 ray parameter estimates. We use the spreading center model initially in synthetic seismogram modeling, and perturb GCA until the Cascade Ranges data are matched. Wave field continuation of both data sets with a common reference model confirms that real differences exist between the two suites of seismograms, implying lateral variation in the upper mantle. The ocean-continent transition model, CJF, features velocities from 200 and 350 km that are intermediate between GCA and T7 (Burdick and Helmberger, 1978), a model for the inland western United States. Models of continental shield regions (e.g., King and Calcagnile, 1976) have higher velocities in this depth range, but all four model types are similar below 400 km. This variation in rate of velocity increase with tectonic regime suggests an inverse relationship between velocity gradient and lithospheric age above 400 km depth.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Nearly all young stars are variable, with the variability traditionally divided into two classes: periodic variables and aperiodic or "irregular" variables. Periodic variables have been studied extensively, typically using periodograms, while aperiodic variables have received much less attention due to a lack of standard statistical tools. However, aperiodic variability can serve as a powerful probe of young star accretion physics and inner circumstellar disk structure. For my dissertation, I analyzed data from a large-scale, long-term survey of the nearby North America Nebula complex, using Palomar Transient Factory photometric time series collected on a nightly or every few night cadence over several years. This survey is the most thorough exploration of variability in a sample of thousands of young stars over time baselines of days to years, revealing a rich array of lightcurve shapes, amplitudes, and timescales.

I have constrained the timescale distribution of all young variables, periodic and aperiodic, on timescales from less than a day to ~100 days. I have shown that the distribution of timescales for aperiodic variables peaks at a few days, with relatively few (~15%) sources dominated by variability on tens of days or longer. My constraints on aperiodic timescale distributions are based on two new tools, magnitude- vs. time-difference (Δm-Δt) plots and peak-finding plots, for describing aperiodic lightcurves; this thesis provides simulations of their performance and presents recommendations on how to apply them to aperiodic signals in other time series data sets. In addition, I have measured the error introduced into colors or SEDs from combining photometry of variable sources taken at different epochs. These are the first quantitative results to be presented on the distributions in amplitude and time scale for young aperiodic variables, particularly those varying on timescales of weeks to months.