49 results for correlation-based feature selection

in University of Queensland eSpace - Australia


Relevance:

100.00% 100.00%

Publisher:

Abstract:

Identifying and predicting non-technical losses (NTL) are important tasks for many utilities. Data from a customer information system (CIS) can be used for NTL analysis. However, to perform NTL analysis accurately and efficiently, the original CIS data need to be pre-processed before any detailed analysis can be carried out. In this paper, we propose a feature-selection-based method for CIS data pre-processing that extracts the most relevant information for further analysis such as clustering and classification. By removing irrelevant and redundant features, feature selection is an essential step in the data mining process: it finds an optimal subset of features that improves the quality of results through faster processing, higher accuracy and simpler results with fewer features. A detailed feature selection analysis is presented in the paper. Time-domain and load-shape data are compared on the basis of accuracy, consistency and statistical dependencies between features.
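The idea of removing irrelevant and redundant features can be illustrated with a small correlation-based filter. This is only a minimal sketch, not the method proposed in the paper: the function names, the two thresholds and the toy columns are all hypothetical, and Pearson correlation is assumed as the dependency measure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy / sqrt(sxx * syy)

def select_features(columns, target, min_relevance=0.3, max_redundancy=0.9):
    """Greedy filter (thresholds are illustrative assumptions).

    Features are visited from most to least correlated with the target.
    A feature is dropped as irrelevant if its correlation with the target
    is too weak, and as redundant if it is nearly collinear with a
    feature that has already been kept.
    """
    relevance = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    kept = []
    for name in sorted(columns, key=relevance.get, reverse=True):
        if relevance[name] < min_relevance:
            break  # every remaining feature is even less relevant
        if all(abs(pearson(columns[name], columns[k])) < max_redundancy for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: two collinear features and one unrelated feature.
columns = {
    "usage": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "usage_copy": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "noise": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
}
target = [0, 0, 0, 1, 1, 1]
selected = select_features(columns, target)
print(selected)
```

On this toy input, one of the two collinear columns survives and the unrelated column is dropped. In practice the thresholds would have to be tuned to the data at hand.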

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Feature selection is one of the important and frequently used techniques in data preprocessing. It can improve the efficiency and effectiveness of data mining by reducing the dimensionality of the feature space and removing irrelevant and redundant information. Feature selection can be viewed as a global optimization problem: finding a minimum set of M relevant features that describes the dataset as well as the original N attributes do. In this paper, we apply the adaptive partitioned random search strategy to our feature selection algorithm. Under this search strategy, a partition structure and an evaluation function are proposed for the feature selection problem. In theory, the algorithm guarantees a globally optimal solution and avoids complete randomness in the search direction. The good properties of our algorithm are shown through theoretical analysis.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

An investigation was conducted to evaluate the impact of experimental designs and spatial analyses (single-trial models) of the response to selection for grain yield in the northern grains region of Australia (Queensland and northern New South Wales). Two sets of multi-environment experiments were considered. One set, based on 33 trials conducted from 1994 to 1996, was used to represent the testing system of the wheat breeding program and is referred to as the multi-environment trial (MET). The second set, based on 47 trials conducted from 1986 to 1993, sampled a more diverse set of years and management regimes and was used to represent the target population of environments (TPE). There were 18 genotypes in common between the MET and TPE sets of trials. From indirect selection theory, the phenotypic correlation coefficient between the MET and TPE single-trial adjusted genotype means [r(p(MT))] was used to determine the effect of the single-trial model on the expected indirect response to selection for grain yield in the TPE based on selection in the MET. Five single-trial models were considered: randomised complete block (RCB), incomplete block (IB), spatial analysis (SS), spatial analysis with a measurement error (SSM) and a combination of spatial analysis and experimental design information to identify the preferred (PF) model. Bootstrap-resampling methodology was used to construct multiple MET data sets, ranging in size from 2 to 20 environments per MET sample. The size and environmental composition of the MET and the single-trial model influenced the r(p(MT)). On average, the PF model resulted in a higher r(p(MT)) than the IB, SS and SSM models, which were in turn superior to the RCB model for MET sizes based on fewer than ten environments. For METs based on ten or more environments, the r(p(MT)) was similar for all single-trial models.
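The bootstrap step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's procedure: the genotype values are invented, each trial is represented by plain genotype means rather than model-adjusted means, and the function name and parameters are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def bootstrap_correlation(met_trials, tpe_means, sample_size, n_boot=200, seed=0):
    """Average correlation between resampled MET genotype means and TPE means.

    met_trials: list of trials, each a list of genotype values in a fixed
                genotype order (hypothetical layout).
    tpe_means:  one mean per genotype over the TPE trials.
    Each bootstrap replicate draws `sample_size` trials with replacement.
    """
    rng = random.Random(seed)
    n_genotypes = len(tpe_means)
    correlations = []
    for _ in range(n_boot):
        sample = [rng.choice(met_trials) for _ in range(sample_size)]
        means = [sum(trial[g] for trial in sample) / sample_size
                 for g in range(n_genotypes)]
        correlations.append(pearson(means, tpe_means))
    return sum(correlations) / n_boot

# Invented values for four genotypes in three MET trials.
met_trials = [
    [1.2, 1.9, 3.1, 3.8],
    [0.8, 2.5, 2.6, 4.4],
    [1.5, 1.7, 3.4, 3.9],
]
tpe_means = [1.0, 2.0, 3.0, 4.0]
r_small = bootstrap_correlation(met_trials, tpe_means, sample_size=2)
print(round(r_small, 3))
```

Repeating the call for a range of `sample_size` values gives a rough picture of how the correlation changes with the number of environments per MET sample.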

Relevance:

100.00% 100.00%

Publisher:

Abstract:

There are many techniques for electricity market price forecasting. However, most of them are designed for expected price analysis rather than price spike forecasting. An effective method of predicting the occurrence of spikes has not yet been reported in the literature. In this paper, a data-mining-based approach is presented to give a reliable forecast of the occurrence of price spikes. Combined with the spike value prediction techniques developed by the same authors, the proposed approach aims to provide a comprehensive tool for price spike forecasting. Feature selection techniques are first described to identify the attributes relevant to the occurrence of spikes. A brief introduction to the classification techniques is given for completeness. Two algorithms, a support vector machine and a probability classifier, are chosen as the spike occurrence predictors and are discussed in detail. Realistic market data are used to test the proposed model, with promising results.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Document classification is a supervised machine learning process in which predefined category labels are assigned to documents based on a hypothesis derived from a training set of labelled documents. Documents cannot be directly interpreted by a computer system unless they have been modelled as a collection of computable features. Rogati and Yang [M. Rogati and Y. Yang, Resource selection for domain-specific cross-lingual IR, in SIGIR 2004: Proceedings of the 27th annual international conference on Research and Development in Information Retrieval, ACM Press, Sheffield, United Kingdom, pp. 154-161.] pointed out that the effectiveness of a document classification system may vary across domains. This implies that the quality of the document model contributes to the effectiveness of document classification. Conventionally, model evaluation is accomplished by comparing the effectiveness scores of classifiers on model candidates. However, this kind of evaluation may encounter either under-fitting or over-fitting problems, because the effectiveness scores are restricted by the learning capacities of the classifiers. We propose a model fitness evaluation method to determine whether a model is sufficient to distinguish positive and negative instances while still providing satisfactory effectiveness with a small feature subset. Our experiments demonstrate how the fitness of models is assessed. The results of our work contribute to research on feature selection, dimensionality reduction and document classification.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

An on-line priming experiment was used to investigate discourse-level processing in four matched groups of subjects: individuals with nonthalamic subcortical lesions (NSL) (n = 10), normal control subjects (n = 10), subjects with Parkinson's disease (PD) (n = 10), and subjects with cortical lesions (n = 10). Subjects listened to paragraphs that ended in lexical ambiguities, and then made speeded lexical decisions on visual letter strings that were: nonwords, matched control words, contextually appropriate associates of the lexical ambiguity, contextually inappropriate associates of the ambiguity, and inferences (representing information which could be drawn from the paragraphs but was not explicitly stated). Targets were presented at an interstimulus interval (ISI) of 0 or 1000 ms. NSL and PD subjects demonstrated priming for appropriate and inappropriate associates at the short ISI, similar to control subjects and cortical lesion subjects, but were unable to demonstrate selective priming of the appropriate associate and inference words at the long ISI. These results imply intact automatic lexical processing and a breakdown in discourse-based meaning selection and inference development via attentional/strategic mechanisms.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Ecological processes are central to the formation of new species when barriers to gene flow (reproductive isolation) evolve between populations as a result of ecologically-based divergent selection. Although laboratory and field studies provide evidence that 'ecological speciation' can occur, our understanding of the details of the process is incomplete. Here we review ecological speciation by considering its constituent components: an ecological source of divergent selection, a form of reproductive isolation, and a genetic mechanism linking the two. Sources of divergent selection include differences in environment or niche, certain forms of sexual selection, and the ecological interaction of populations. We explore the evidence for the contribution of each to ecological speciation. Forms of reproductive isolation are diverse and we discuss the likelihood that each may be involved in ecological speciation. Divergent selection on genes affecting ecological traits can be transmitted directly (via pleiotropy) or indirectly (via linkage disequilibrium) to genes causing reproductive isolation and we explore the consequences of both. Along with these components, we also discuss the geography and the genetic basis of ecological speciation. Throughout, we provide examples from nature, critically evaluate their quality, and highlight areas where more work is required.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Recently, private health insurance rates have declined in many countries. In places requiring community rating in their health insurance premiums, a major cause is age-based adverse selection. However, even in countries without community rating, a de facto type of partial community rating tends to occur. In this note, a modified version of Pauly et al.'s guaranteed renewability model (Pauly et al., 1995), which addresses the problem of age-based adverse selection, is presented. Their model is extended from three to 35 periods. Also, probabilities are allowed to increase by age for low-risk types using actual age-based probabilities. This extension of their work shows that the private health insurance contracts available stray far from optimal contracts that deal with age-based adverse selection. This suggests that government actions to affect private insurance options are warranted.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In the last few decades, private health insurance rates have declined in many countries. In countries and states with community rating, a major cause is adverse selection. In order to address age-based adverse selection, Australia has recently begun a novel approach which imposes stiff penalties for buying private insurance later in life, when expected costs are higher. In this paper, we analyze Australia's Lifetime Cover in the context of a modified version of the Rothschild-Stiglitz insurance model (Rothschild and Stiglitz, 1976). We allow empirically-based probabilities to increase by age for low-risk types. The model highlights the shortcomings of the Australian plan. Based on empirically-based probabilities of illness, we predict that Lifetime Cover will not arrest adverse selection. The model has many policy implications for government regulation encouraging long-term health coverage.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Although the aim of conservation planning is the persistence of biodiversity, current methods trade off ecological realism at a species level in favour of including multiple species and landscape features. For conservation planning to be relevant, the impact of landscape configuration on population processes and the viability of species needs to be considered. We present a novel method for selecting reserve systems that maximize persistence across multiple species, subject to a conservation budget. We use a spatially explicit metapopulation model to estimate extinction risk, a function of the ecology of the species and the amount, quality and configuration of habitat. We compare our new method with more traditional, area-based reserve selection methods, using a ten-species case study, and find that the expected loss of species is reduced 20-fold. Unlike previous methods, we avoid designating arbitrary weightings between reserve size and configuration; rather, our method is based on population processes and is grounded in ecological theory.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this paper we explore the use of text-mining methods for identifying the author of a text. We apply the support vector machine (SVM) to this problem: because it can cope with half a million inputs, it requires no feature selection and can process the frequency vector of all words of a text. We performed a number of experiments with texts from a German newspaper. With nearly perfect reliability the SVM was able to reject other authors, and it detected the target author in 60–80% of the cases. In a second experiment, we ignored nouns, verbs and adjectives and replaced them with grammatical tags and bigrams. This resulted in slightly reduced performance. Author detection with SVMs on full word forms was remarkably robust even if the author wrote about different topics.
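As an illustration of the input representation mentioned above, the following sketch builds a relative word-frequency vector over a fixed vocabulary. It assumes a simple letters-only tokenizer and made-up example sentences, and it does not train an SVM: the resulting vectors would be passed to a separate classifier library, which is outside this sketch.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters (assumed tokenizer)."""
    return re.findall(r"[a-zäöüß]+", text.lower())

def build_vocabulary(texts):
    """Sorted list of every distinct word form seen in the corpus."""
    return sorted({word for text in texts for word in tokenize(text)})

def frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty texts
    return [counts[word] / total for word in vocabulary]

# Made-up two-sentence corpus; all word forms are kept, none are filtered out.
corpus = ["Der Hund und der Mann", "Der Mann und die Frau"]
vocabulary = build_vocabulary(corpus)
vectors = [frequency_vector(text, vocabulary) for text in corpus]
print(vocabulary)
print(vectors[0])
```

Every word form gets its own dimension, so the vectors grow with the vocabulary of the corpus. That is the high-dimensional setting the abstract describes.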

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Conventionally, document classification research focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observations, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features used in a representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By looking at documents in this way, we can clearly identify the contributions made by different features toward document classification. Some experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.

Relevance:

40.00% 40.00%

Publisher:

Abstract:

A 12 week kayak training programme was evaluated in children who either had or did not have the anthropometric characteristics identified as being unique to senior elite sprint kayakers. Altogether, 234 male and female school children were screened to select 10 children with and 10 children without the identified key anthropometric characteristics. Before and after training, the children completed an all-out 2 min kayak ergometer simulation test; measures of oxygen consumption, plasma lactate and total work accomplished were recorded. In addition, a 500 m time trial was performed at weeks 3 and 12. The coaches were unaware which 20 children possessed those anthropometric characteristics deemed to favour development of kayak ability. All children improved in both the 2 min ergometer simulation test and 500 m time trial. However, boys who were selected according to favourable anthropometric characteristics showed greater improvement than those without such characteristics in the 2 min ergometer test only. In summary, in a small group of children selected according to anthropometric data unique to elite adult kayakers, 12 weeks of intensive kayak training did not influence the rate of improvement of on-water sprint kayak performance.

Relevance:

40.00% 40.00%

Publisher:

Abstract:

Background: Condition-dependence is a ubiquitous feature of animal life histories and has important implications for both natural and sexual selection. Mate choice, for instance, is typically based on condition-dependent signals. Theory predicts that one reason why condition-dependent signals may be special is that they allow females to scan for genes that confer high parasite resistance. Such explanations require a genetic link between immunocompetence and body condition, but existing evidence is limited to phenotypic associations. It remains unknown, therefore, whether females selecting males with good body condition simply obtain a healthy mate, or if they acquire genes for their offspring that confer high immunocompetence. Results: Here we use a cross-foster experimental design to partition the phenotypic covariance in indices of body condition and immunocompetence into genetic, maternal and environmental effects in a passerine bird, the zebra finch Taeniopygia guttata. We show that there is significant positive additive genetic covariance between an index of body condition and an index of cell-mediated immune response. In this case, genetic variance in the index of immune response explained 56% of the additive genetic variance in the index of body condition. Conclusion: Our results suggest that, in the context of sexual selection, females that assess males on the basis of condition-dependent signals may gain genes that confer high immunocompetence for their offspring. More generally, a genetic correlation between indices of body condition and immunocompetence supports the hypothesis that parasite resistance may be an important target of natural selection. Additional work is now required to test whether genetic covariance exists among other aspects of both condition and immunocompetence.

Relevance:

40.00% 40.00%

Publisher:

Abstract:

Motivation: This paper introduces the software EMMIX-GENE, which has been developed specifically for a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic, used in conjunction with a threshold on the size of a cluster, allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so mixtures of factor analyzers are used to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes can be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.