50 results for selection methods

in Deakin Research Online - Australia


Relevance:

70.00%

Publisher:

Abstract:

The Generalized Estimating Equations (GEE) method is one of the most commonly used statistical methods for the analysis of longitudinal data in epidemiological studies. When using this method, a working correlation structure must be specified for the repeated measures of each subject's outcome variable. However, statistical criteria for selecting the best correlation structure and the best subset of explanatory variables in GEE have become available only recently, because the GEE method was developed on the basis of quasi-likelihood theory. Maximum likelihood based model selection methods, such as the widely used Akaike Information Criterion (AIC), are not directly applicable to GEE. Pan (2001) proposed a selection method called QIC, which can be used to select the best correlation structure and the best subset of explanatory variables. Based on the QIC method, we developed a computing program to calculate the QIC value for a range of different distributions, link functions and correlation structures. This program was written in Stata software. In this article, we introduce this program and demonstrate, through several representative examples, how to use it to select the most parsimonious model in GEE analyses of longitudinal data.
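
For orientation, the QIC criterion mentioned above is commonly written as follows. This is a sketch of the usual textbook form of Pan's criterion, not a formula taken from the abstract; the symbols are assumptions introduced here. Q is the quasi-likelihood evaluated under the independence working correlation I at the coefficients estimated under working correlation R, \hat{\Omega}_I is the inverse of the model-based covariance estimate under independence, and \hat{V}_R is the robust (sandwich) covariance estimate under R. Smaller values are preferred.

```latex
\mathrm{QIC}(R) = -2\, Q\bigl(\hat{\beta}(R);\, I\bigr)
                + 2\, \operatorname{trace}\bigl(\hat{\Omega}_I\, \hat{V}_R\bigr)
```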

Relevance:

70.00%

Publisher:

Abstract:

The purpose of instance selection is to identify which instances (examples, patterns) in a large dataset should be selected as representatives of the entire dataset, without significant loss of information. When a machine learning method is applied to the reduced dataset, the accuracy of the model should not be significantly worse than if the same method were applied to the entire dataset. The reducibility of any dataset, and hence the success of instance selection methods, depends on the characteristics of the dataset as well as on the machine learning method. This paper adopts a meta-learning approach, via an empirical study of 112 classification datasets from the UCI Repository [1], to explore the relationship between data characteristics, machine learning methods, and the success of instance selection methods.

Relevance:

70.00%

Publisher:

Abstract:

In this paper, we investigate parameter selection for Eigenfaces. Our focus is on the eigenvector and threshold selection issues. We propose a systematic approach to selecting the eigenvectors based on the relative errors of the eigenvalues of the covariance matrix. In addition, we propose a method for selecting the classification threshold that utilizes information obtained from the training data set. Experimentation was conducted on two benchmark face databases, ORL and AMP, with results indicating that the proposed automatic eigenvector and threshold selection methods produce better recognition performance in terms of precision and recall rates. Furthermore, we show that the eigenvector selection method outperforms the energy and stretching dimension methods in terms of the number of eigenvectors selected and computation cost.
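
As a point of comparison, the "energy" baseline named in the abstract is usually understood as keeping the smallest number of leading eigenvectors whose eigenvalues account for a chosen fraction of the total. The sketch below illustrates that baseline only, not the authors' relative-error method, which the abstract does not specify. The function name `select_by_energy` and the default fraction are hypothetical.

```python
def select_by_energy(eigenvalues, energy=0.95):
    """Return the smallest k such that the k largest eigenvalues
    hold at least `energy` of the total eigenvalue sum.

    Hypothetical helper illustrating the generic energy criterion.
    """
    vals = sorted(eigenvalues, reverse=True)
    total = sum(vals)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, v in enumerate(vals, start=1):
        running += v
        if running / total >= energy:
            return k
    return len(vals)


# Illustrative eigenvalues (made up for this example).
eigs = [5.0, 3.0, 1.0, 0.5, 0.5]
print(select_by_energy(eigs, energy=0.85))
```

A lower `energy` keeps fewer eigenvectors and reduces computation, at the risk of discarding discriminative directions.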

Relevance:

70.00%

Publisher:

Abstract:

In this paper, we investigate the parameter selection issues for Eigenfaces. Our focus is on the eigenvector and threshold selection issues. We propose a systematic approach to selecting the eigenvectors based on the relative errors of the eigenvalues. In addition, we have designed a method for selecting the classification threshold that makes effective use of information obtained from the training database. Experimentation was conducted on the ORL and AMP face databases, with results indicating that the automatic eigenvector and threshold selection methods provide optimum recognition performance in terms of precision and recall rates. Furthermore, we show that the eigenvector selection method outperforms the energy and stretching dimension methods in terms of the number of eigenvectors selected and computation cost.

Relevance:

70.00%

Publisher:

Abstract:

Modern healthcare is being reshaped by the growth of Electronic Medical Records (EMR). Recently, these records have been shown to be of great value for building clinical prediction models. In EMR data, patients' diseases and hospital interventions are captured through a set of diagnosis and procedure codes. These codes are usually represented in a tree form (e.g. ICD-10 tree) and the codes within a tree branch may be highly correlated. These codes can be used as features to build a prediction model, and appropriate feature selection can inform a clinician about important risk factors for a disease. Traditional feature selection methods (e.g. Information Gain, T-test, etc.) consider each variable independently and usually end up with a long feature list. Recently, Lasso and related l1-penalty based feature selection methods have become popular due to their joint feature selection property. However, Lasso is known to select one of many correlated features at random. This hinders clinicians from arriving at a stable feature set, which is crucial for the clinical decision-making process. In this paper, we solve this problem by using a recently proposed Tree-Lasso model. Since the stability behavior of Tree-Lasso is not well understood, we study it and compare it with other feature selection methods. Using a synthetic and two real-world datasets (Cancer and Acute Myocardial Infarction), we show that Tree-Lasso based feature selection is significantly more stable than Lasso and comparable to other methods, e.g. Information Gain, ReliefF and T-test. We further show that, using different types of classifiers such as logistic regression, naive Bayes, support vector machines, decision trees and Random Forest, the classification performance of Tree-Lasso is comparable to Lasso and better than other methods.
Our result has implications in identifying stable risk factors for many healthcare problems and therefore can potentially assist clinical decision making for accurate medical prognosis.
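
One common way to quantify the stability of a feature selector is the average pairwise Jaccard similarity between the feature subsets it picks on different resamples of the data. The abstract does not say which stability measure the authors used, so the sketch below is only an illustration under that assumption. The function names and the example feature codes are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Average pairwise Jaccard similarity over all selected feature subsets.

    `subsets` holds one feature subset per resample (e.g. per bootstrap run).
    A value of 1.0 means the same features were chosen every time.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up selections from three resamples; the code labels are illustrative only.
runs = [{"I21", "E11"}, {"I21", "E11"}, {"I25"}]
print(selection_stability(runs))
```

A selector that swaps between correlated codes from run to run scores low on this measure even when its predictive accuracy is unchanged, which is the behaviour the abstract attributes to Lasso.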

Relevance:

70.00%

Publisher:

Abstract:

Feature selection is an important step in building predictive models for most real-world problems. One of the popular methods in feature selection is Lasso. However, it shows instability in selecting features when dealing with correlated features. In this work, we propose a new method that aims to increase the stability of Lasso by encouraging similarities between features based on their relatedness, which is captured via a feature covariance matrix. Besides modeling positive feature correlations, our method can also identify negative correlations between features. We propose a convex formulation for our model along with an alternating optimization algorithm that can learn the weights of the features as well as the relationship between them. Using both synthetic and real-world data, we show that the proposed method is more stable than Lasso and many state-of-the-art shrinkage and feature selection methods. Also, its predictive performance is comparable to other methods.

Relevance:

70.00%

Publisher:

Abstract:

This paper proposes a modification to the analytic hierarchy process (AHP) to select the most informative genes that serve as inputs to an interval type-2 fuzzy logic system (IT2FLS) for cancer classification. Unlike the conventional AHP, the modified AHP allows us to process quantitative factors that are ranking outcomes of individual gene selection methods including t-test, entropy, receiver operating characteristic curve, Wilcoxon test, and signal-to-noise ratio. The IT2FLS is introduced for the classification task because of its ability to handle nonlinear, noisy, and outlier data, which are common problems in cancer microarray gene expression profiles. An unsupervised learning strategy using fuzzy c-means clustering is employed to initialize the parameters of the IT2FLS. Other classifiers such as multilayer perceptron network, support vector machine, and fuzzy ARTMAP are also implemented for comparison. Experiments are carried out on three well-known microarray datasets: diffuse large B-cell lymphoma, leukemia cancer, and prostate. Rather than the traditional cross validation, a leave-one-out cross-validation strategy is applied for the experiments. Results demonstrate the performance dominance of the IT2FLS over the competing classifiers. More noticeably, the modified AHP improves the classification performance not only of the IT2FLS but of all other classifiers as well. Accordingly, the proposed combination of the modified AHP and IT2FLS is a powerful tool for cancer classification and can be implemented as a real clinical decision support system that is useful for medical practitioners.

Relevance:

60.00%

Publisher:

Abstract:

1. Studies of landscape change are seldom conducted at scales commensurate with the processes they purport to investigate. Landscape change is a landscape-level process, yet most studies focus on patches. Even when landscape context is considered, inference remains at the patch-level. The unit of investigation must be extended beyond individual patches to whole mosaics in order to advance understanding of faunal responses to landscape change.

2. In this study, we aggregated data from multiple sites per landscape such that both the response and explanatory variables characterized 'whole' landscapes, allowing for landscape-level inference about factors influencing species' incidence.

3. We used hierarchical partitioning and Bayesian variable selection methods to develop species-specific models that examined the influence of four categories of landscape properties – habitat extent, habitat configuration, landscape composition and geographical location – on the incidence of 58 species of woodland-dependent birds in 24 agricultural landscapes (each 100 km²) in south-eastern Australia.

4. There was strong evidence for a positive effect of habitat extent for 27 species. Thirty species were related to at least one of the four landscape composition variables, and geographical location was important for 19 species. Habitat configuration was influential for 13 species and where important, the impacts of fragmentation per se were detrimental.

5. Variation among species in the influential landscape variables indicates that different species respond to different sets of cues in land mosaics. Thus, although all species were grouped a priori as 'woodland-dependent', expectations based on general ecological characteristics may prove unreliable.

6. Synthesis and applications. These results underscore the value of moving beyond the fragmentation paradigm focused on the spatial pattern of habitat vs. non-habitat, to a greater appreciation of the composition and heterogeneity of land mosaics. Landscape-level inference will enable improved conservation outcomes by recognizing the influence of landscape properties on biota and devising strategies at this scale to complement patch-based management. We provide strong empirical evidence that biodiversity management in agricultural landscapes must focus on habitat extent. Complementary management of other landscape attributes, such as habitat aggregation and intensity of agricultural land-use, will also enhance the value of agricultural landscapes for woodland birds.

Relevance:

60.00%

Publisher:

Abstract:

Recently, much attention has been given to mass spectrometry (MS) technology based disease classification, diagnosis, and protein-based biomarker identification. As in microarray based investigation, proteomic data generated by this kind of high-throughput experiment often have a high feature-to-sample ratio. Moreover, biological information and patterns are compounded by data noise, redundancy and outliers. Thus, the development of algorithms and procedures for the analysis and interpretation of this kind of data is of paramount importance. In this paper, we propose a hybrid system for analyzing such high dimensional data. The proposed method uses a feature extraction and selection procedure based on the k-means clustering algorithm to bridge the filter selection and wrapper selection methods. The potentially informative mass/charge (m/z) markers selected by filters are subjected to the k-means clustering algorithm for correlation and redundancy reduction, and a multi-objective Genetic Algorithm selector is then employed to identify discriminative m/z markers among those generated by the k-means clustering algorithm. Experimental results obtained by using the proposed method indicate that it is suitable for m/z biomarker selection and MS based sample classification.

Relevance:

60.00%

Publisher:

Abstract:

Introduction: This study is based on the metaphor of the ‘rural pipeline’ into medical practice. The four stages of the rural pipeline are: (1) contact between rural secondary schools and the medical profession; (2) selection of rural students into medical programs; (3) rural exposure during medical training; and (4) measures to address retention of the rural medical workforce.
Methods: Using the rural pipeline template we conducted a literature review, analysed the selection methods of Australian graduate entry medical schools and interviewed 17 interns about their medical career aspirations.
Results: Literature review: The literature was reviewed to assess the effectiveness of selection practices to predict successful graduation and the impact of rural pipeline components on eventual rural practice. Undergraduate academic performance is the strongest predictor of medical course academic performance. The predictive power of interviews is modest. There are limited data on the predictive power of other measures of non-cognitive performance or the content of the undergraduate degree. Prior rural residence is the strongest predictor of choice of a rural career but extended rural exposure during medical training also has a significant impact. The most significant influencing factors are: professional support at national, state and local levels; career pathway opportunities; contentedness of the practitioner’s spouse in rural communities; preparedness to adopt a rural lifestyle; educational opportunities for children; and proximity to extended family and social circle. Analysis of selection methods: Staff involved in student selection into 9 Australian graduate entry medical schools were interviewed. Four themes were identified: (1) rurality as a factor in student selection; (2) rurality as a factor in student selection interviews; (3) rural representation on student selection interview panels; (4) rural experience during the medical course. Interns’ career intentions: Three themes were identified: (1) the efficacy of the rural pipeline; (2) community connectedness through the rural pipeline; (3) impediments to the effect of the rural pipeline, the most significant being a partner who was not committed to rural life.
Conclusion: Based on the literature review and interviews, 11 strategies are suggested to increase the number of graduates choosing a career in rural medicine, and one strategy for maintaining practitioners in rural health settings after graduation.

Relevance:

60.00%

Publisher:

Abstract:

In this paper we investigate the face recognition problem via the overlapping energy histogram of the DCT coefficients. In particular, we investigate some important issues relating to recognition performance, such as selecting the threshold and the number of bins. These selection methods utilise information obtained from the training dataset. Experimentation is conducted on the Yale face database and results indicate that the proposed parameter selection methods perform well in selecting the threshold and number of bins. Furthermore, we show that the proposed overlapping energy histogram approach significantly outperforms the Eigenfaces, 2DPCA and energy histogram approaches.

Relevance:

60.00%

Publisher:

Abstract:

In this note, we examine the size and power properties and the break date estimation accuracy of the Lee and Strazicich (LS, 2003) two-break endogenous unit root test, based on two different break date selection methods: minimising the test statistic and minimising the sum of squared residuals (SSR). Our results show that the performance of both Models A and C of the LS test is superior when one uses the minimising SSR procedure.
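
The idea of choosing a break date by minimising the SSR can be illustrated with a deliberately simplified grid search. The sketch below assumes a single level shift in an otherwise constant series and fits a separate mean on each side of every candidate date. It is not the LS two-break LM unit root regression, and the function names and `trim` parameter are hypothetical.

```python
def ssr_mean_shift(y, tb):
    """SSR of a model with one mean before index tb and another from tb onward."""
    def ssr(segment):
        mean = sum(segment) / len(segment)
        return sum((v - mean) ** 2 for v in segment)
    return ssr(y[:tb]) + ssr(y[tb:])


def min_ssr_break(y, trim=2):
    """Grid-search the break index that minimises the SSR.

    `trim` keeps at least that many observations on each side of the break.
    """
    candidates = range(trim, len(y) - trim + 1)
    if not candidates:
        raise ValueError("series too short for the chosen trim")
    return min(candidates, key=lambda tb: ssr_mean_shift(y, tb))


# Made-up series with a level shift starting at index 4.
series = [0.0, 0.1, -0.1, 0.0, 5.0, 5.1, 4.9, 5.0]
print(min_ssr_break(series))
```

The alternative rule compared in the abstract would instead compute the unit root test statistic at each candidate date and keep the date where that statistic is smallest.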

Relevance:

60.00%

Publisher:

Abstract:

In this paper we propose, develop, and test a new single-feature evaluator called Significant Proportion of Target Instances (SPTI) to handle direct-marketing data with the class imbalance problem. The SPTI feature evaluator demonstrates its stability and outstanding performance through empirical experiments in which the real-world customer data of an e-recruitment firm are used. This research demonstrates that feature selection using SPTI successfully improves the classifier’s performance in terms of two practical performance metrics. Additionally, we show that it outperforms other well-known feature selection methods and state-of-the-art remedies to the class-imbalance problem. Practically, the findings, when used with the classification model, will help telemarketers to better understand their customers.

Relevance:

60.00%

Publisher:

Abstract:

The Intelligent Water Drop (IWD) algorithm is a recent stochastic swarm-based method that is useful for solving combinatorial and function optimization problems. In this paper, we investigate the effectiveness of the selection method in the solution construction phase of the IWD algorithm. Instead of the fitness proportionate selection method in the original IWD algorithm, two ranking-based selection methods, namely linear ranking and exponential ranking, are proposed. Both ranking-based selection methods aim to solve the identified limitations of the fitness proportionate selection method as well as to enable the IWD algorithm to escape from local optima and ensure its search diversity. To evaluate the usefulness of the proposed ranking-based selection methods, a series of experiments pertaining to three combinatorial optimization problems, i.e., rough set feature subset selection, multiple knapsack and travelling salesman problems, is conducted. The results demonstrate that the exponential ranking selection method is able to preserve the search diversity, therefore improving the performance of the IWD algorithm. © 2014 Elsevier Ltd. All rights reserved.
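
The three selection schemes named in the abstract can be contrasted with a generic sketch. It uses standard textbook forms of fitness proportionate, linear ranking and exponential ranking selection over a list of candidate scores. It is not the IWD algorithm itself, and the function names, the `pressure` and `base` parameters, and the example scores are assumptions made here for illustration.

```python
import random


def proportionate_probs(fitness):
    """Selection probability proportional to raw (positive) fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]


def linear_ranking_probs(fitness, pressure=1.5):
    """Linear ranking: probability grows linearly with rank (rank 0 = worst).

    `pressure` lies in [1, 2]; the best candidate gets pressure / n.
    Ties are ranked in their original order.
    """
    n = len(fitness)
    if n == 1:
        return [1.0]
    order = sorted(range(n), key=lambda i: fitness[i])  # worst first
    probs = [0.0] * n
    for rank, i in enumerate(order):
        probs[i] = ((2 - pressure) + 2 * (pressure - 1) * rank / (n - 1)) / n
    return probs


def exponential_ranking_probs(fitness, base=0.8):
    """Exponential ranking: weight base**(n - 1 - rank), with 0 < base < 1."""
    n = len(fitness)
    order = sorted(range(n), key=lambda i: fitness[i])  # worst first
    weights = [0.0] * n
    for rank, i in enumerate(order):
        weights[i] = base ** (n - 1 - rank)
    total = sum(weights)
    return [w / total for w in weights]


def select(probs, rng=random):
    """Draw one candidate index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# One dominant candidate: proportionate selection almost always picks it,
# while the ranking schemes keep the weaker candidates in play.
scores = [1.0, 1000.0, 2.0]
for name, fn in [("proportionate", proportionate_probs),
                 ("linear ranking", linear_ranking_probs),
                 ("exponential ranking", exponential_ranking_probs)]:
    print(name, [round(p, 3) for p in fn(scores)])
```

Because ranking schemes depend only on the order of the scores, a single very large score cannot crowd out the alternatives, which is one way such schemes can help preserve search diversity.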