16 results for variable selection

in Deakin Research Online - Australia


Relevance: 70.00%

Abstract:

The support vector machine (SVM) is a popular method for classification, well known for finding the maximum-margin hyperplane. Combining SVM with an l1-norm penalty further enables it to perform feature selection and margin maximization simultaneously within a single framework. However, l1-norm SVM selects features unstably in the presence of correlated features. We propose a new method to increase the stability of l1-norm SVM by encouraging similarities between feature weights based on feature correlations, which are captured via a feature covariance matrix. Our proposed method can capture both positive and negative correlations between features. We formulate the model as a convex optimization problem and propose a solution based on alternating minimization. Using both synthetic and real-world datasets, we show that our model achieves better stability and classification accuracy compared to several state-of-the-art regularized classification methods.

Relevance: 60.00%

Abstract:

The lasso procedure is an estimator-shrinkage and variable selection method. This paper shows that there always exists an interval of tuning parameter values such that the corresponding mean squared prediction error for the lasso estimator is smaller than for the ordinary least squares estimator. For an estimator satisfying some condition such as unbiasedness, the paper defines the corresponding generalized lasso estimator. Its mean squared prediction error is shown to be smaller than that of the estimator for values of the tuning parameter in some interval. This implies that all unbiased estimators are not admissible. Simulation results for five models support the theoretical results.
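The shrinkage and selection behaviour mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's derivation: it assumes an orthonormal design and the objective (1/2)||y - Xb||^2 + lam * ||b||_1, in which case each lasso coefficient is the soft-thresholded ordinary least squares coefficient. The function names and example values are illustrative only.

```python
def soft_threshold(b_ols: float, lam: float) -> float:
    """Lasso estimate of one coefficient, assuming an orthonormal design.

    b_ols is the ordinary least squares estimate and lam >= 0 is the
    tuning parameter. Coefficients with |b_ols| <= lam are set to zero
    (selection); the others are pulled towards zero by lam (shrinkage).
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0


def lasso_orthonormal(ols_coefs, lam):
    """Apply soft-thresholding to every OLS coefficient."""
    return [soft_threshold(b, lam) for b in ols_coefs]


if __name__ == "__main__":
    # Illustrative OLS estimates; the middle one is dropped at lam = 1.0.
    print(lasso_orthonormal([3.0, -0.5, -2.0], 1.0))
```

With lam = 0 the function returns the OLS estimates unchanged, so the tuning parameter interpolates between ordinary least squares and an all-zero model.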

Relevance: 60.00%

Abstract:

1. Studies of landscape change are seldom conducted at scales commensurate with the processes they purport to investigate. Landscape change is a landscape-level process, yet most studies focus on patches. Even when landscape context is considered, inference remains at the patch-level. The unit of investigation must be extended beyond individual patches to whole mosaics in order to advance understanding of faunal responses to landscape change.

2. In this study, we aggregated data from multiple sites per landscape such that both the response and explanatory variables characterized 'whole' landscapes, allowing for landscape-level inference about factors influencing species' incidence.

3. We used hierarchical partitioning and Bayesian variable selection methods to develop species-specific models that examined the influence of four categories of landscape properties – habitat extent, habitat configuration, landscape composition and geographical location – on the incidence of 58 species of woodland-dependent birds in 24 agricultural landscapes (each 100 km²) in south-eastern Australia.

4. There was strong evidence for a positive effect of habitat extent for 27 species. Thirty species were related to at least one of the four landscape composition variables, and geographical location was important for 19 species. Habitat configuration was influential for 13 species and where important, the impacts of fragmentation per se were detrimental.

5. Variation among species in the influential landscape variables indicates that different species respond to different sets of cues in land mosaics. Thus, although all species were grouped a priori as 'woodland-dependent', expectations based on general ecological characteristics may prove unreliable.

6. Synthesis and applications. These results underscore the value of moving beyond the fragmentation paradigm focused on the spatial pattern of habitat vs. non-habitat, to a greater appreciation of the composition and heterogeneity of land mosaics. Landscape-level inference will enable improved conservation outcomes by recognizing the influence of landscape properties on biota and devising strategies at this scale to complement patch-based management. We provide strong empirical evidence that biodiversity management in agricultural landscapes must focus on habitat extent. Complementary management of other landscape attributes, such as habitat aggregation and intensity of agricultural land-use, will also enhance the value of agricultural landscapes for woodland birds.

Relevance: 60.00%

Abstract:

This paper applies the generalised linear model for modelling geographical variation to esophageal cancer incidence data in the Caspian region of Iran. The data have a complex and hierarchical structure that makes them suitable for hierarchical analysis using Bayesian techniques, but with care required to deal with problems arising from counts of events observed in small geographical areas when overdispersion and residual spatial autocorrelation are present. These considerations lead to nine regression models derived from using three probability distributions for count data: Poisson, generalised Poisson and negative binomial, and three different autocorrelation structures. We employ the framework of Bayesian variable selection and a Gibbs sampling based technique to identify significant cancer risk factors. The framework deals with situations where the number of possible models based on different combinations of candidate explanatory variables is large enough such that calculation of posterior probabilities for all models is difficult or infeasible. The evidence from applying the modelling methodology suggests that modelling strategies based on the use of generalised Poisson and negative binomial with spatial autocorrelation work well and provide a robust basis for inference.

Relevance: 60.00%

Abstract:

BACKGROUND: Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study.

METHODS: The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Study (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators.

RESULTS: After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016) and current smokers (p < 0.001).

CONCLUSION: The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.

Relevance: 60.00%

Abstract:

Stability in clinical prediction models is crucial for transferability between studies, yet has received little attention. The problem is paramount in high-dimensional data, which invites sparse models with feature selection capability. We introduce an effective method to stabilize a sparse Cox model of time-to-event data using statistical and semantic structures inherent in Electronic Medical Records (EMR). Model estimation is stabilized using three feature graphs built from (i) Jaccard similarity among features, (ii) aggregation of the Jaccard similarity graph and a recently introduced semantic EMR graph, and (iii) Jaccard similarity among features transferred from a related cohort. Our experiments are conducted on two real-world hospital datasets: a heart failure cohort and a diabetes cohort. On two stability measures – the Consistency index and signal-to-noise ratio (SNR) – the use of our proposed methods significantly increased feature stability when compared with the baselines.
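Two quantities named in this abstract can be sketched in a few lines. This is a hedged illustration, not the authors' implementation. It assumes that each feature is represented by the set of patient identifiers in which it occurs, and that "Consistency index" refers to Kuncheva's consistency index for two equal-sized feature subsets; both assumptions and all names below are ours.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B| between two sets.

    Here a and b are assumed to be the sets of patients in which two
    features occur; two empty sets are given similarity 0.0.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def consistency_index(a: set, b: set, n_features: int) -> float:
    """Kuncheva's consistency index for two feature subsets of equal size k.

    With r = |A & B| and d = n_features the index is
    (r * d - k^2) / (k * (d - k)); it equals 1.0 for identical subsets
    and is close to 0 for subsets that overlap only by chance.
    """
    k = len(a)
    if len(b) != k:
        raise ValueError("subsets must have the same size")
    if k == 0 or k >= n_features:
        raise ValueError("index is undefined for k = 0 or k >= n_features")
    r = len(a & b)
    return (r * n_features - k * k) / (k * (n_features - k))


if __name__ == "__main__":
    # Hypothetical patient sets for two features.
    print(jaccard({1, 2, 3}, {2, 3, 4}))
    # Hypothetical feature subsets selected in two runs, out of 10 features.
    print(consistency_index({1, 2}, {1, 2}, 10))
```

A feature graph in the sense of item (i) could then use `jaccard` values as edge weights between pairs of features, while `consistency_index` compares the subsets selected in two training runs.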

Relevance: 30.00%

Abstract:

Apostatic (frequency‐ or density‐dependent) selection, aposematic signals, and mate choice behavior generally require that the mean prey or potential mate density m value be high enough (above a threshold T) to result in sufficient encounter rates for the searcher to learn or retain the association between conspicuous signals and prey unprofitability, to forage apostatically, or to choose among mates. This assumes that all searchers experience m > T, which implicitly assumes an even dispersion of targets among searcher territories. Uneven dispersion generates new phenomena. If m < T, then only territories with local density x values that are greater than T favor experience‐based behavior, leading to spatially variable frequency‐ or density‐dependent selection intensity. As aggregation increases, the increase in percentage of targets in favorable territories (x > T) is greater than the increase in the percentage of territories that are favorable. The relationship is reversed when m > T. In both cases, because as few as 10% of the territories can contain 80% of the targets, only a few territory holders may account for most of the selection on most of the target population; accidents of experience in only a few searchers can have unexpectedly large effects on the target population. This also provides an explanation for high searcher behavior variation (personalities): individuals from favorable territories will behave differently in behavioral experiments than those from unfavorable territories, at least with respect to similar kinds of targets. These effects will generate spatial heterogeneity in natural and sexual selection in what are otherwise uniform environments.

Relevance: 30.00%

Abstract:

The Generalized Estimating Equations (GEE) method is one of the most commonly used statistical methods for the analysis of longitudinal data in epidemiological studies. The method requires a working correlation structure to be specified for the repeated measures of a subject's outcome variable. However, statistical criteria for selecting the best correlation structure and the best subset of explanatory variables in GEE have only recently become available, because the GEE method is based on quasi-likelihood theory. Maximum likelihood based model selection methods, such as the widely used Akaike Information Criterion (AIC), are not directly applicable to GEE. Pan (2001) proposed a selection method called QIC, which can be used to select the best correlation structure and the best subset of explanatory variables. Based on the QIC method, we developed a computing program to calculate the QIC value for a range of different distributions, link functions and correlation structures. This program was written in Stata software. In this article, we introduce this program and demonstrate how to use it to select the most parsimonious model in GEE analyses of longitudinal data through several representative examples.
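For orientation, the criterion is commonly written as below. This is a sketch of the usual statement of Pan's QIC rather than a formula taken from the program described in this abstract, and the notation is an assumption: R is the working correlation structure, Q is the quasi-likelihood evaluated under the independence working model I at the estimate obtained with R, \hat{\Omega}_I is the inverse of the model-based covariance estimate under independence, and \hat{V}_R is the robust (sandwich) covariance estimate under R.

```latex
\mathrm{QIC}(R) = -2\, Q\bigl(\hat{\beta}(R);\, I\bigr)
                + 2\, \operatorname{trace}\bigl(\hat{\Omega}_I \hat{V}_R\bigr)
```

As with AIC, smaller values are preferred when candidate correlation structures or subsets of explanatory variables are compared.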

Relevance: 30.00%

Abstract:

Blue whales Balaenoptera musculus aggregate to feed in a regional upwelling system during November–May between the Great Australian Bight (GAB) and Bass Strait. We analysed sightings from aerial surveys over 6 upwelling seasons (2001–02 to 2006–07) to assess within-season patterns of blue whale habitat selection, distribution, and relative abundance. Habitat variables were modelled using a general linear model (GLM) that ranked sea surface temperature (SST) and sea surface chlorophyll (SSC) of equal importance, followed by depth, distance to shore, SSC gradient, distance to shelf break, and SST gradient. Further discrimination by hierarchical partitioning indicated that SST accounted for 84.4% of variation in blue whale presence explained by the model, and that probability of sightings increased with increasing SST. The large study area was resolved into 3 zones showing diversity of habitat from the shallow narrow shelf and associated surface upwelling of the central zone, to the relatively deep upper slope waters, broad shelf and variable upwelling of the western zone, and the intermediate features of the eastern zone. Density kernel estimation showed a trend in distribution from the west during November–December, spreading south-eastward along the shelf throughout the central and eastern zones during January–April, with the central zone most consistently utilised. Encounter rates in central and eastern zones peaked in February, coinciding with peak upwelling intensity and primary productivity. Blue whales avoided inshore upwelling centres, selecting SST ~1°C cooler than remotely sensed ambient SST. Whales selected significantly higher SSC in the central and eastern zones than the western zone, where relative abundance was extremely variable. Most animals departed from the feeding ground by late April.

Relevance: 30.00%

Abstract:

Conspecific nesting density affects many aspects of breeding biology, as well as habitat selection decisions. However, the large variations in breeding density observed in many species are yet to be fully explained. Here, we investigated the settlement patterns in a colonial species with variable breeding density and where resource distribution could be manipulated. The zebra finch, Taeniopygia guttata, is a classic avian model in evolutionary biology, but we know surprisingly little about nest site selection strategies and nesting densities in this species, and in fact in nomadic species in general. Yet important determinants of habitat selection strategies, including temporal predictability and breeding synchrony, are likely to be different in nomadic species than in the non-nomadic species studied to date. Here, we manipulated the distribution of nesting sites (by providing nest boxes) and food patches (feeders) to test four non-exclusive habitat selection hypotheses that could lead to nest aggregation: 1) attraction to resources, 2) attraction to breeding conspecifics, 3) attraction to successful conspecifics, and 4) use of private information (i.e. own reproductive success on a site). We found that wild zebra finches used conspecific presence, and possibly reproductive success, to make decisions over where to locate their nests, but did not aggregate around water or food within the study areas. Moreover, there was a high degree of inter-individual variation in nesting density preference. We discuss the significance of our results for habitat selection strategy in nomadic species and with respect to the differential selection pressures that individuals breeding at different densities may experience.

Relevance: 30.00%

Abstract:

The HIV-1 gp120-gp41 complex, which mediates viral fusion and cellular entry, undergoes rapid evolution within its external glycan shield to enable escape from neutralizing antibody (NAb). Understanding how conserved protein determinants retain functionality in the context of such evolution is important for their evaluation and exploitation as potential drug and/or vaccine targets. In this study, we examined how the conserved gp120-gp41 association site, formed by the N- and C-terminal segments of gp120 and the disulfide-bonded region (DSR) of gp41, adapts to glycan changes that are linked to neutralization sensitivity. To this end, a DSR mutant virus (K601D) with defective gp120-association was sequentially passaged in peripheral blood mononuclear cells to select suppressor mutations. We reasoned that the locations of suppressors point to structural elements that are functionally linked to the gp120-gp41 association site. In culture 1, gp120 association and viral replication were restored by loss of the conserved glycan at Asn136 in V1 (T138N mutation) in conjunction with the L494I substitution in C5 within the association site. In culture 2, replication was restored with deletion of the N139INN sequence, which ablates the overlapping Asn141-Asn142-Ser-Ser potential N-linked glycosylation sequons in V1, in conjunction with D601N in the DSR. The 136 and 142 glycan mutations appeared to exert their suppressive effects by altering the dependence of gp120-gp41 interactions on the DSR residues Leu593, Trp596 and Lys601. The 136 and/or 142 glycan mutations increased the sensitivity of HIV-1 pseudovirions to the glycan-dependent NAbs 2G12 and PG16, and also to pooled IgG obtained from HIV-1-infected individuals. Thus adjacent V1 glycans allosterically modulate the distal gp120-gp41 association site. We propose that this represents a mechanism for functional adaptation of the gp120-gp41 association site to an evolving glycan shield in a setting of NAb selection.

Relevance: 30.00%

Abstract:

A changing climate is expected to have profound effects on many aspects of ectotherm biology. We report on a decade-long study of free-ranging sand lizards (Lacerta agilis), exposed to an increasing mean mating season temperature and with known operational sex ratios. We assessed year-to-year variation in sexual selection on body size and postcopulatory sperm competition and cryptic female choice. Higher temperature was not linked to strength of sexual selection on body mass, but operational sex ratio (more males) did increase the strength of sexual selection on body size. Elevated temperature increased mating rate and number of sires per clutch with positive effects on offspring fitness. In years when the “quality” of a female's partners was more variable (in standard errors of a male sexual ornament), clutches showed less multiple paternity. This agrees with prior laboratory trials in which females exercised stronger cryptic female choice when male quality varied more. An increased number of sires contributing to within-clutch paternity decreased the risk of having malformed offspring. Ultimately, such variation may contribute to highly dynamic and shifting selection mosaics in the wild, with potential implications for the evolutionary ecology of mating systems and population responses to rapidly changing environmental conditions.

Relevance: 30.00%

Abstract:

Aposematic signal variation is a paradox: predators are better at learning and retaining the association between conspicuousness and unprofitability when signal variation is low. Movement patterns and variable colour patterns are linked in non-aposematic species: striped patterns generate illusions of altered speed and direction when moving linearly, affecting predators' tracking ability; blotched patterns benefit instead from unpredictable pauses and random movement. We tested whether the extensive colour-pattern variation in an aposematic frog is linked to movement, and found that individuals moving directionally and faster have more elongated patterns than individuals moving randomly and slowly. This may help explain the paradox of polymorphic aposematism: variable warning signals may reduce protection, but predator defence might still be effective if specific behaviours are tuned to specific signals. The interacting effects of behavioural and morphological traits may be a key to the evolution of warning signals. © 2014 The Author(s) Published by the Royal Society. All rights reserved.

Relevance: 30.00%

Abstract:

Modern healthcare is being reshaped by growing Electronic Medical Records (EMR). Recently, these records have been shown to be of great value for building clinical prediction models. In EMR data, patients' diseases and hospital interventions are captured through a set of diagnosis and procedure codes. These codes are usually represented in a tree form (e.g. the ICD-10 tree), and the codes within a tree branch may be highly correlated. These codes can be used as features to build a prediction model, and appropriate feature selection can inform a clinician about important risk factors for a disease. Traditional feature selection methods (e.g. Information Gain, T-test, etc.) consider each variable independently and usually end up with a long feature list. Recently, Lasso and related l1-penalty based feature selection methods have become popular due to their joint feature selection property. However, Lasso is known to select one of many correlated features at random. This prevents clinicians from arriving at a stable feature set, which is crucial for the clinical decision-making process. In this paper, we solve this problem by using a recently proposed Tree-Lasso model. Since the stability behavior of Tree-Lasso is not well understood, we study it and compare it with that of other feature selection methods. Using a synthetic and two real-world datasets (Cancer and Acute Myocardial Infarction), we show that Tree-Lasso based feature selection is significantly more stable than Lasso and comparable to other methods, e.g. Information Gain, ReliefF and T-test. We further show that, using different types of classifiers such as logistic regression, naive Bayes, support vector machines, decision trees and Random Forest, the classification performance of Tree-Lasso is comparable to Lasso and better than other methods. Our result has implications for identifying stable risk factors for many healthcare problems and can therefore potentially assist clinical decision making for accurate medical prognosis.

Relevance: 30.00%

Abstract:

Multicomponent signals are made up of interacting elements that generate a functional signaling unit. The interactions between signal components and their effects on individual fitness are not well understood, and the effect of environment is even less so. It is usually assumed that color patterns appear the same in all light environments and that the effects of each color are additive. Using guppies, Poecilia reticulata, we investigated the effect of water color on the interactions between components of sexually selected male coloration. Through behavioral mate choice trials in four different water colors, we estimated the attractiveness of male color patterns, using multivariate fitness estimates and overall signal contrast. Our results show that females exhibit preferences that favor groups of colors rather than individual colors independently and that each environment favors different color combinations. We found that these effects are consistent with female guppies selecting entire color patterns on the basis of overall visual contrast. This suggests that both individuals and populations inhabiting different light environments will be subject to divergent, multivariate selection. Although the appearance of color patterns changes with light environment, achromatic components change little, suggesting that these could function in species recognition or other aspects of communication that must work across environments. Consequently, we predict different phylogenetic patterns between chromatic and achromatic signals within the same clades.