932 results for data complexity


Relevance:

30.00%

Publisher:

Abstract:

The impulse responses of the wireless channels between the N_t transmit and N_r receive antennas of a MIMO-OFDM system are group approximately sparse (ga-sparse), i.e., the channels have a small number of significant paths relative to the channel delay spread, and the time-lags of the significant paths coincide across transmit-receive antenna pairs. Often, wireless channels are also group approximately cluster-sparse (gac-sparse), i.e., every ga-sparse channel consists of clusters, where a few clusters have all strong components while most clusters have all weak components. In this paper, we cast the problem of estimating ga-sparse and gac-sparse block-fading and time-varying channels in the sparse Bayesian learning (SBL) framework and propose a bouquet of novel algorithms for pilot-based channel estimation, and for joint channel estimation and data detection, in MIMO-OFDM systems. The proposed algorithms can estimate the sparse wireless channels even when the measurement matrix is only partially known. Further, we employ a first-order autoregressive model of the temporal variation of the ga-sparse and gac-sparse channels and propose a recursive Kalman filtering and smoothing (KFS) technique for joint channel estimation, tracking, and data detection. We also propose novel low-complexity techniques, based on parallel implementation, for estimating gac-sparse channels. Monte Carlo simulations illustrate the benefit of exploiting the gac-sparse structure of the wireless channel in terms of mean square error (MSE) and coded bit error rate (BER) performance.
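
As an illustration only, here is a minimal Python sketch of the kind of first-order autoregressive Kalman tracking the abstract refers to, applied to a toy single-antenna channel; it is not the proposed SBL/KFS algorithm, and all dimensions, the AR coefficient, and the noise levels are invented for the example.

```python
# Minimal sketch (not the paper's SBL/KFS algorithm): tracking a channel whose
# taps follow a first-order autoregressive model, h_k = rho*h_{k-1} + w_k, from
# noisy pilot observations y_k = Phi @ h_k + n_k, with a standard Kalman filter.
# All dimensions, rho, and noise levels below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
L, M = 16, 8                     # channel taps, pilot observations per symbol
rho, q, r = 0.98, 1e-3, 1e-2     # AR coefficient, process noise, measurement noise

# Sparse "true" channel: a few strong taps, the rest zero (ga-sparse toy example)
h = np.zeros(L)
h[[1, 5, 11]] = rng.standard_normal(3)
Phi = rng.standard_normal((M, L)) / np.sqrt(M)   # known pilot measurement matrix

# Kalman filter state: mean and covariance of the channel estimate
h_hat, P = np.zeros(L), np.eye(L)
for k in range(200):
    h = rho * h + np.sqrt(q) * rng.standard_normal(L)      # channel evolves
    y = Phi @ h + np.sqrt(r) * rng.standard_normal(M)       # pilot observation
    # Predict
    h_hat, P = rho * h_hat, rho**2 * P + q * np.eye(L)
    # Update
    S = Phi @ P @ Phi.T + r * np.eye(M)
    K = P @ Phi.T @ np.linalg.inv(S)
    h_hat = h_hat + K @ (y - Phi @ h_hat)
    P = (np.eye(L) - K @ Phi) @ P

print("final MSE:", np.mean((h - h_hat) ** 2))
```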

Relevance:

30.00%

Publisher:

Abstract:

We consider near-optimal policies, for a single user transmitting on a wireless channel, that minimize the average queue length under an average power constraint. Power is consumed in the transmission of data only, and we consider the case in which the power used in transmission is a linear function of the data transmitted. The transmission channel may experience multipath fading. We later extend these results to the multiuser case and show that our policies can be used in a system with energy-harvesting sources at the transmitter. Next, we consider data users that require minimum rate guarantees, and finally a system with both data and real-time users. Our policies have low computational complexity, admit closed-form expressions for the mean delays, and require only the mean arrival rate, with no queue-length information.

Relevance:

30.00%

Publisher:

Abstract:

Decision trees require training samples in order to derive classification rules. If the training set is too small, important information may be missed and the resulting model may fail to capture the classification rules underlying the data; on the other hand, a large training set does not guarantee a good model. This paper analyses the relationship between decision trees and the scale of the training data. We use nine decision tree algorithms to evaluate experimentally the accuracy, complexity, and robustness of the resulting trees, and present the results.
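
As a hedged illustration of the kind of experiment described, the following Python sketch (using scikit-learn, a synthetic dataset, and a single tree learner rather than the nine studied) measures how test accuracy and tree depth vary with training-set size.

```python
# Illustrative sketch (not the paper's experiment): how the accuracy of a decision
# tree varies with training-set size, using scikit-learn on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for n_train in (50, 200, 1000, len(X_pool)):
    clf = DecisionTreeClassifier(random_state=0).fit(X_pool[:n_train], y_pool[:n_train])
    print(f"train size {n_train:5d}  "
          f"test accuracy {clf.score(X_test, y_test):.3f}  "
          f"tree depth {clf.get_depth()}")   # depth as a rough complexity proxy
```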

Relevance:

30.00%

Publisher:

Abstract:

Data has always been fundamental to many areas of research, but in recent years it has become central to more disciplines and interdisciplinary projects and has grown substantially in scale and complexity. There is increasing awareness of its strategic importance as a resource for addressing modern global challenges, and of the possibilities being unlocked by rapid technological advances and their application in research (NAS, 2009). The first Keeping Research Data Safe study, funded by JISC, made a major contribution to the understanding of long-term preservation costs for research data by developing a cost model and identifying cost variables for preserving research data in UK universities (Beagrie et al., 2008). The Keeping Research Data Safe 2 (KRDS2) project has built on this work.

Relevance:

30.00%

Publisher:

Abstract:

Hyper-spectral data allows the construction of more robust statistical models for sampling material properties than the standard tri-chromatic color representation. However, because of the large dimensionality and complexity of hyper-spectral data, the extraction of robust features (image descriptors) is not a trivial issue. Thus, to facilitate efficient feature extraction, decorrelation techniques are commonly applied to reduce the dimensionality of the hyper-spectral data with the aim of generating compact and highly discriminative image descriptors. Current methodologies for data decorrelation, such as principal component analysis (PCA), linear discriminant analysis (LDA), wavelet decomposition (WD), or band selection methods, require complex and subjective training procedures and, in addition, the compressed spectral information is not directly related to the physical (spectral) characteristics associated with the analyzed materials. The major objective of this article is to introduce and evaluate a new data decorrelation methodology using an approach that closely emulates human vision. The proposed data decorrelation scheme has been employed to optimally minimize the amount of redundant information contained in the highly correlated hyper-spectral bands and has been comprehensively evaluated in the context of non-ferrous material classification.
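
For context, the sketch below illustrates only the PCA baseline that the abstract cites, not the proposed human-vision-inspired scheme; the image dimensions and the random cube standing in for real hyper-spectral data are assumptions.

```python
# Sketch of a baseline decorrelation step mentioned in the abstract (PCA), not the
# proposed human-vision-inspired scheme. A hyper-spectral image is treated as a
# (pixels x bands) matrix and compressed to a few decorrelated components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
H, W, B = 64, 64, 128                     # illustrative image height, width, bands
cube = rng.random((H, W, B))              # stand-in for a real hyper-spectral cube

pixels = cube.reshape(-1, B)              # one row per pixel, one column per band
pca = PCA(n_components=8).fit(pixels)     # keep 8 decorrelated components
descriptors = pca.transform(pixels).reshape(H, W, -1)

print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("descriptor shape:", descriptors.shape)   # (64, 64, 8) compact per-pixel features
```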

Relevance:

30.00%

Publisher:

Abstract:

In the first part of the thesis we explore three fundamental questions that arise naturally when we conceive a machine learning scenario where the training and test distributions can differ. Contrary to conventional wisdom, we show that mismatched training and test distributions can in fact yield better out-of-sample performance. This optimal performance can be obtained by training with the dual distribution. This optimal training distribution depends on the test distribution set by the problem, but not on the target function that we want to learn. We show how to obtain this distribution in both discrete and continuous input spaces, as well as how to approximate it in a practical scenario. Benefits of using this distribution are exemplified in both synthetic and real data sets.

In order to apply the dual distribution in the supervised learning scenario where the training data set is fixed, it is necessary to use weights to make the sample appear as if it came from the dual distribution. We explore the negative effect that weighting a sample can have. The theoretical decomposition of the use of weights regarding its effect on the out-of-sample error is easy to understand but not actionable in practice, as the quantities involved cannot be computed. Hence, we propose the Targeted Weighting algorithm that determines if, for a given set of weights, the out-of-sample performance will improve or not in a practical setting. This is necessary as the setting assumes there are no labeled points distributed according to the test distribution, only unlabeled samples.
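
As a generic illustration of sample weighting under covariate shift (not the thesis's Targeted Weighting algorithm), the sketch below weights training points by an assumed density ratio p_test(x)/p_train(x) and compares weighted and unweighted classifiers; the Gaussian input distributions and the labelling rule are invented for the example.

```python
# Generic covariate-shift weighting sketch (not the Targeted Weighting algorithm):
# weight each training point by an estimate of p_test(x)/p_train(x) so the
# weighted sample mimics the test distribution. The densities here are known
# Gaussians purely for illustration.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, 500)                      # training inputs ~ N(0, 1)
y_train = (x_train + 0.3 * rng.standard_normal(500) > 0.5).astype(int)

# Test inputs come from a shifted distribution N(1, 1)
weights = norm.pdf(x_train, 1.0, 1.0) / norm.pdf(x_train, 0.0, 1.0)

unweighted = LogisticRegression().fit(x_train[:, None], y_train)
weighted = LogisticRegression().fit(x_train[:, None], y_train, sample_weight=weights)

x_test = rng.normal(1.0, 1.0, 2000)
y_test = (x_test + 0.3 * rng.standard_normal(2000) > 0.5).astype(int)
print("unweighted test accuracy:", unweighted.score(x_test[:, None], y_test))
print("weighted   test accuracy:", weighted.score(x_test[:, None], y_test))
```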

Finally, we propose a new class of matching algorithms that can be used to match the training set to a desired distribution, such as the dual distribution (or the test distribution). These algorithms can be applied to very large datasets, and we show how they lead to improved performance on a large real dataset such as the Netflix dataset. Their low computational complexity is the main reason for their advantage over previous algorithms proposed in the covariate shift literature.

In the second part of the thesis we apply Machine Learning to the problem of behavior recognition. We develop a specific behavior classifier to study fly aggression, and we develop a system that allows analyzing behavior in videos of animals with minimal supervision. The system, which we call CUBA (Caltech Unsupervised Behavior Analysis), detects movemes, actions, and stories from time series describing the position of animals in videos. The method summarizes the data and provides biologists with a mathematical tool for testing new hypotheses. Other benefits of CUBA include finding classifiers for specific behaviors without the need for annotation, as well as providing means to discriminate groups of animals, for example according to their genetic line.

Relevance:

30.00%

Publisher:

Abstract:

A generalized Bayesian population dynamics model was developed for the analysis of historical mark-recapture studies. The Bayesian approach builds upon existing maximum likelihood methods and is useful when substantial uncertainties exist in the data or little information is available about auxiliary parameters such as tag loss and reporting rates. Movement rates are obtained through Markov chain Monte Carlo (MCMC) simulation and are suitable for use as input in subsequent stock assessment analyses. The mark-recapture model was applied to English sole (Parophrys vetulus) off the west coast of the United States and Canada, and migration rates were estimated to be 2% per month to the north and 4% per month to the south. These posterior parameter distributions and the Bayesian framework for comparing hypotheses can guide fishery scientists in structuring the spatial and temporal complexity of future analyses of this kind. This approach could easily be generalized for application to other species and to more data-rich fishery analyses.
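
As a toy illustration of the MCMC idea only (not the paper's full population dynamics model), the sketch below runs a Metropolis sampler for a single monthly movement rate under an assumed binomial recovery model with made-up counts.

```python
# Toy sketch of the MCMC idea (not the paper's model): a Metropolis sampler for a
# single monthly movement rate, assuming recaptures in the "north" area follow a
# Binomial(n_tagged, rate) likelihood and the rate has a flat Beta(1, 1) prior.
# The data values are invented for illustration.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n_tagged, n_moved = 500, 12          # hypothetical tag releases and northward recoveries

def log_posterior(rate):
    if not 0.0 < rate < 1.0:
        return -np.inf               # outside the flat prior's support
    return binom.logpmf(n_moved, n_tagged, rate)

samples, rate = [], 0.05
for _ in range(20000):
    proposal = rate + rng.normal(0.0, 0.01)           # random-walk proposal
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(rate):
        rate = proposal
    samples.append(rate)

posterior = np.array(samples[5000:])                  # drop burn-in
print(f"posterior mean movement rate: {posterior.mean():.3f} per month")
print(f"95% credible interval: {np.percentile(posterior, [2.5, 97.5]).round(3)}")
```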

Relevance:

30.00%

Publisher:

Abstract:

Understanding the dynamics of the eukaryotic transcriptome is essential for studying the complexity of transcriptional regulation and its impact on phenotype. However, comprehensive studies of transcriptomes at single-base resolution are rare, even for model organisms, and lacking for rice. Here, we present the first transcriptome atlas for eight organs of cultivated rice. Using high-throughput paired-end RNA-seq, we unambiguously detected transcripts expressed at extremely low levels, as well as a substantial number of novel transcripts, exons, and untranslated regions. An analysis of alternative splicing in the rice transcriptome revealed that alternative cis-splicing occurred in approximately 33% of all rice genes, far more than previously reported. In addition, we identified 234 putative chimeric transcripts that appear to be produced by trans-splicing, indicating that transcript fusion events are more common than expected. In-depth analysis revealed a multitude of fusion transcripts that might be by-products of alternative splicing. Validation and structural analysis of the chimeric transcripts provided evidence that some of these transcripts are likely to be functional in the cell. Taken together, our data provide extensive evidence that transcriptional regulation in rice is vastly more complex than previously believed.

Relevance:

30.00%

Publisher:

Abstract:

This paper proposes a method for analysing the operational complexity in supply chains by using an entropic measure based on information theory. The proposed approach estimates the operational complexity at each stage of the supply chain and analyses the changes between stages. In this paper a stage is identified by the exchange of data and/or material. Through analysis the method identifies the stages where the operational complexity is both generated and propagated (exported, imported, generated or absorbed). Central to the method is the identification of a reference point within the supply chain. This is where the operational complexity is at a local minimum along the data transfer stages. Such a point can be thought of as a 'sink' for turbulence generated in the supply chain. Where it exists, it has the merit of stabilising the supply chain by attenuating uncertainty. However, the location of the reference point is also a matter of choice. If the preferred location is other than the current one, this is a trigger for management action. The analysis can help decide appropriate remedial action. More generally, the approach can assist logistics management by highlighting problem areas. An industrial application is presented to demonstrate the applicability of the method.
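
As an illustration of an entropy-based complexity measure in the spirit of the abstract (not the paper's exact formulation), the sketch below scores each stage by the Shannon entropy of an assumed distribution of delivery states; the stage names and probabilities are invented.

```python
# Minimal sketch of an entropy-based complexity measure: the operational
# complexity of a stage is taken as the Shannon entropy of the observed
# distribution of states (e.g. "on time" / "late" / "wrong quantity") at that
# stage. Stage names and probabilities below are illustrative assumptions.
import numpy as np

def shannon_entropy(probs):
    """Entropy in bits of a discrete state distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                       # ignore impossible states (0 * log 0 = 0)
    return float(-(p * np.log2(p)).sum())

stages = {
    "supplier -> factory":   [0.80, 0.15, 0.05],
    "factory -> warehouse":  [0.95, 0.04, 0.01],   # low entropy: a candidate 'sink'
    "warehouse -> customer": [0.60, 0.25, 0.15],
}

for name, dist in stages.items():
    print(f"{name:24s} complexity = {shannon_entropy(dist):.2f} bits")
```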

Relevance:

30.00%

Publisher:

Abstract:

Many current recognition systems use constrained search to locate objects in cluttered environments. Previous formal analysis has shown that the expected amount of search is quadratic in the number of model and data features if all the data is known to come from a single object, but is exponential when spurious data is included. If one can group the data into subsets likely to have come from a single object, then terminating the search once a "good enough" interpretation is found reduces the expected search to cubic. Without successful grouping, terminated search is still exponential. These results apply to finding instances of a known object in the data. In this paper, we turn to the problem of selecting models from a library, and examine the combinatorics of determining that a candidate object is not present in the data. We show that the expected search is again exponential, implying that naïve approaches to indexing are likely to carry an expensive overhead, since an exponential amount of work is needed to weed out each of the incorrect models. The analytic results are shown to be in agreement with empirical data for cluttered object recognition.

Relevance:

30.00%

Publisher:

Abstract:

The task in text retrieval is to find the subset of a collection of documents relevant to a user's information request, usually expressed as a set of words. Classically, documents and queries are represented as vectors of word counts. In its simplest form, relevance is defined to be the dot product between a document and a query vector--a measure of the number of common terms. A central difficulty in text retrieval is that the presence or absence of a word is not sufficient to determine relevance to a query. Linear dimensionality reduction has been proposed as a technique for extracting underlying structure from the document collection. In some domains (such as vision) dimensionality reduction reduces computational complexity; in text retrieval it is more often used to improve retrieval performance. We propose an alternative and novel technique that produces sparse representations constructed from sets of highly related words. Documents and queries are represented by their distance to these sets, and relevance is measured by the number of common clusters. This technique significantly improves retrieval performance, is efficient to compute, and shares properties with the optimal linear projection operator and the independent components of documents.
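
For reference, the classical baseline described at the start of the abstract can be sketched in a few lines of Python; the toy documents and query below are invented, and the cluster-based sparse representation the paper proposes is not reproduced here.

```python
# Sketch of the classical baseline: documents and a query as term-count vectors,
# with relevance scored by their dot product (a count of shared terms, weighted
# by frequency). Documents and query are toy examples.
from collections import Counter

docs = {
    "d1": "sparse representations of documents from highly related words",
    "d2": "linear dimensionality reduction for text retrieval",
    "d3": "cooking recipes with seasonal vegetables",
}
query = "dimensionality reduction for retrieval"

def relevance(doc_text, query_text):
    d, q = Counter(doc_text.split()), Counter(query_text.split())
    return sum(d[w] * q[w] for w in q)        # dot product over shared terms

for name, text in docs.items():
    print(name, relevance(text, query))       # d2 scores highest
```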

Relevance:

30.00%

Publisher:

Abstract:

This longitudinal study tracked third-level French (n=10) and Chinese (n=7) learners of English as a second language (L2) during an eight-month study abroad (SA) period at an Irish university. The investigation sought to determine whether there was a significant relationship between length of stay (LoS) abroad and gains in the learners' oral complexity, accuracy and fluency (CAF), what the relationship was between these three language constructs and whether the two learner groups would experience similar paths to development. Additionally, the study also investigated whether specific reported out-of-class contact with the L2 was implicated in oral CAF gains. Oral data were collected at three equidistant time points; at the beginning of SA (T1), midway through the SA sojourn (T2) and at the end (T3), allowing for a comparison of CAF gains arising during one semester abroad to those arising during a subsequent semester. Data were collected using Sociolinguistic Interviews (Labov, 1984) and adapted versions of the Language Contact Profile (Freed et al., 2004). Overall, the results point to LoS abroad as a highly influential variable in gains to be expected in oral CAF during SA. While one semester in the TL country was not enough to foster statistically significant improvement in any of the CAF measures employed, significant improvement was found during the second semester of SA. Significant differences were also revealed between the two learner groups. Finally, significant correlations, some positive, some negative, were found between gains in CAF and specific usage of the L2. All in all, the disaggregation of the group data clearly illustrates, in line with other recent enquiries (e.g. Wright and Cong, 2014) that each individual learner's path to CAF development was unique and highly individualised, thus providing strong evidence for the recent claim that SLA is "an individualized nonlinear endeavor" (Polat and Kim, 2014: 186).

Relevance:

30.00%

Publisher:

Abstract:

BACKGROUND: Historically, only partial assessments of data quality have been performed in clinical trials, for which the most common method of measuring database error rates has been to compare the case report form (CRF) to database entries and count discrepancies. Importantly, errors arising from medical record abstraction and transcription are rarely evaluated as part of such quality assessments. Electronic Data Capture (EDC) technology has had a further impact, as paper CRFs typically leveraged for quality measurement are not used in EDC processes. METHODS AND PRINCIPAL FINDINGS: The National Institute on Drug Abuse Treatment Clinical Trials Network has developed, implemented, and evaluated methodology for holistically assessing data quality on EDC trials. We characterize the average source-to-database error rate (14.3 errors per 10,000 fields) for the first year of use of the new evaluation method. This error rate was significantly lower than the average of published error rates for source-to-database audits, and was similar to CRF-to-database error rates reported in the published literature. We attribute this largely to an absence of medical record abstraction on the trials we examined, and to an outpatient setting characterized by less acute patient conditions. CONCLUSIONS: Historically, medical record abstraction is the most significant source of error by an order of magnitude, and should be measured and managed during the course of clinical trials. Source-to-database error rates are highly dependent on the amount of structured data collection in the clinical setting and on the complexity of the medical record, dependencies that should be considered when developing data quality benchmarks.
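
As an illustration of the arithmetic behind such a rate (not the network's audit procedure), the sketch below converts a hypothetical discrepancy count into errors per 10,000 fields with a simple normal-approximation confidence interval; the counts are invented and chosen only to land near the reported 14.3.

```python
# Illustrative calculation only: turning a count of field-level discrepancies into
# an error rate per 10,000 fields, with a normal-approximation confidence interval.
# The counts are hypothetical.
import math

fields_audited = 350_000
errors_found = 500                      # hypothetical source-to-database discrepancies

p = errors_found / fields_audited
rate_per_10k = 10_000 * p
half_width = 10_000 * 1.96 * math.sqrt(p * (1 - p) / fields_audited)

print(f"error rate: {rate_per_10k:.1f} per 10,000 fields "
      f"(95% CI {rate_per_10k - half_width:.1f} to {rate_per_10k + half_width:.1f})")
```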

Relevance:

30.00%

Publisher:

Abstract:

BACKGROUND: The inherent complexity of statistical methods and clinical phenomena compels researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them has complete knowledge of the counterpart's field. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Although communication has a central role in interdisciplinary collaboration and miscommunication can have a negative impact on research processes, to the best of our knowledge no study has yet explored how data analysis specialists and clinical researchers communicate over time. METHODS/PRINCIPAL FINDINGS: We conducted qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of "what if" situations that helped clarify how the method or information from the other field would behave if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal. CONCLUSION/SIGNIFICANCE: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.