836 resultados para Regression imputation
Resumo:
In recent years, a plethora of approaches have been proposed to deal with the increasingly challenging task of multi-output regression. This paper provides a survey on state-of-the-art multi-output regression methods, that are categorized as problem transformation and algorithm adaptation methods. In addition, we present the mostly used performance evaluation measures, publicly available data sets for multi-output regression real-world problems, as well as open-source software frameworks.
Resumo:
When it comes to information sets in real life, often pieces of the whole set may not be available. This problem can find its origin in various reasons, describing therefore different patterns. In the literature, this problem is known as Missing Data. This issue can be fixed in various ways, from not taking into consideration incomplete observations, to guessing what those values originally were, or just ignoring the fact that some values are missing. The methods used to estimate missing data are called Imputation Methods. The work presented in this thesis has two main goals. The first one is to determine whether any kind of interactions exists between Missing Data, Imputation Methods and Supervised Classification algorithms, when they are applied together. For this first problem we consider a scenario in which the databases used are discrete, understanding discrete as that it is assumed that there is no relation between observations. These datasets underwent processes involving different combina- tions of the three components mentioned. The outcome showed that the missing data pattern strongly influences the outcome produced by a classifier. Also, in some of the cases, the complex imputation techniques investigated in the thesis were able to obtain better results than simple ones. The second goal of this work is to propose a new imputation strategy, but this time we constrain the specifications of the previous problem to a special kind of datasets, the multivariate Time Series. We designed new imputation techniques for this particular domain, and combined them with some of the contrasted strategies tested in the pre- vious chapter of this thesis. The time series also were subjected to processes involving missing data and imputation to finally propose an overall better imputation method. In the final chapter of this work, a real-world example is presented, describing a wa- ter quality prediction problem. The databases that characterized this problem had their own original latent values, which provides a real-world benchmark to test the algorithms developed in this thesis.
Resumo:
Spatio-temporal modelling is an area of increasing importance in which models and methods have often been developed to deal with specific applications. In this study, a spatio-temporal model was used to estimate daily rainfall data. Rainfall records from several weather stations, obtained from the Agritempo system for two climatic homogeneous zones, were used. Rainfall values obtained for two fixed dates (January 1 and May 1, 2012) using the spatio-temporal model were compared with the geostatisticals techniques of ordinary kriging and ordinary cokriging with altitude as auxiliary variable. The spatio-temporal model was more than 17% better at producing estimates of daily precipitation compared to kriging and cokriging in the first zone and more than 18% in the second zone. The spatio-temporal model proved to be a versatile technique, adapting to different seasons and dates.
Resumo:
Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands and threats to forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly and as a consequence spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting edge remotely sensed and geospatial datasets are essential to sustainable forest management. We evaluated a novel Random Forest based k Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of US Forest Service was integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake State. Targeting small-area application of the state-of-art remote sensing, LiDAR (light detection and ranging) data was integrated with the field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive standing volume map in a cost-effective way. The outputs of the RF-kNN imputation were compared with independent validation datasets and extant map products based on different sampling and modeling strategies. The RF-kNN modeling approach was found to be very effective, especially for large-area estimation, and produced results statistically equivalent to the field observations or the estimates derived from secondary data sources. The models are useful to resource managers for operational and strategic purposes.
Resumo:
Adaptability and invisibility are hallmarks of modern terrorism, and keeping pace with its dynamic nature presents a serious challenge for societies throughout the world. Innovations in computer science have incorporated applied mathematics to develop a wide array of predictive models to support the variety of approaches to counterterrorism. Predictive models are usually designed to forecast the location of attacks. Although this may protect individual structures or locations, it does not reduce the threat—it merely changes the target. While predictive models dedicated to events or social relationships receive much attention where the mathematical and social science communities intersect, models dedicated to terrorist locations such as safe-houses (rather than their targets or training sites) are rare and possibly nonexistent. At the time of this research, there were no publically available models designed to predict locations where violent extremists are likely to reside. This research uses France as a case study to present a complex systems model that incorporates multiple quantitative, qualitative and geospatial variables that differ in terms of scale, weight, and type. Though many of these variables are recognized by specialists in security studies, there remains controversy with respect to their relative importance, degree of interaction, and interdependence. Additionally, some of the variables proposed in this research are not generally recognized as drivers, yet they warrant examination based on their potential role within a complex system. This research tested multiple regression models and determined that geographically-weighted regression analysis produced the most accurate result to accommodate non-stationary coefficient behavior, demonstrating that geographic variables are critical to understanding and predicting the phenomenon of terrorism. This dissertation presents a flexible prototypical model that can be refined and applied to other regions to inform stakeholders such as policy-makers and law enforcement in their efforts to improve national security and enhance quality-of-life.
Resumo:
Neuroimaging research involves analyses of huge amounts of biological data that might or might not be related with cognition. This relationship is usually approached using univariate methods, and, therefore, correction methods are mandatory for reducing false positives. Nevertheless, the probability of false negatives is also increased. Multivariate frameworks have been proposed for helping to alleviate this balance. Here we apply multivariate distance matrix regression for the simultaneous analysis of biological and cognitive data, namely, structural connections among 82 brain regions and several latent factors estimating cognitive performance. We tested whether cognitive differences predict distances among individuals regarding their connectivity pattern. Beginning with 3,321 connections among regions, the 36 edges better predicted by the individuals' cognitive scores were selected. Cognitive scores were related to connectivity distances in both the full (3,321) and reduced (36) connectivity patterns. The selected edges connect regions distributed across the entire brain and the network defined by these edges supports high-order cognitive processes such as (a) (fluid) executive control, (b) (crystallized) recognition, learning, and language processing, and (c) visuospatial processing. This multivariate study suggests that one widespread, but limited number, of regions in the human brain, supports high-level cognitive ability differences. Hum Brain Mapp, 2016. © 2016 Wiley Periodicals, Inc.
Resumo:
This study focuses on multiple linear regression models relating six climate indices (temperature humidity THI, environmental stress ESI, equivalent temperature index ETI, heat load HLI, modified HLI (HLI new), and respiratory rate predictor RRP) with three main components of cow’s milk (yield, fat, and protein) for cows in Iran. The least absolute shrinkage selection operator (LASSO) and the Akaike information criterion (AIC) techniques are applied to select the best model for milk predictands with the smallest number of climate predictors. Uncertainty estimation is employed by applying bootstrapping through resampling. Cross validation is used to avoid over-fitting. Climatic parameters are calculated from the NASA-MERRA global atmospheric reanalysis. Milk data for the months from April to September, 2002 to 2010 are used. The best linear regression models are found in spring between milk yield as the predictand and THI, ESI, ETI, HLI, and RRP as predictors with p-value < 0.001 and R2 (0.50, 0.49) respectively. In summer, milk yield with independent variables of THI, ETI, and ESI show the highest relation (p-value < 0.001) with R2 (0.69). For fat and protein the results are only marginal. This method is suggested for the impact studies of climate variability/change on agriculture and food science fields when short-time series or data with large uncertainty are available.
Resumo:
This study analyzes the impact of individual characteristics as well as occupation and industry on male wage inequality in nine European countries. Unlike previous studies, we consider regression models for five inequality measures and employ the recentered influence function regression method proposed by Firpo et al. (2009) to test directly the influence of covariates on inequality. We conclude that there is heterogeneity in the effects of covariates on inequality across countries and throughout wage distribution. Heterogeneity among countries is more evident in education and experience whereas occupation and industry characteristics as well as holding a supervisory position reveal more similar effects. Our results are compatible with the skill biased technological change, rapid rise in the integration of trade and financial markets as well as explanations related to the increase of the remunerative package of top executives.
Resumo:
Logistic regression is a statistical tool widely used for predicting species’ potential distributions starting from presence/absence data and a set of independent variables. However, logistic regression equations compute probability values based not only on the values of the predictor variables but also on the relative proportion of presences and absences in the dataset, which does not adequately describe the environmental favourability for or against species presence. A few strategies have been used to circumvent this, but they usually imply an alteration of the original data or the discarding of potentially valuable information. We propose a way to obtain from logistic regression an environmental favourability function whose results are not affected by an uneven proportion of presences and absences. We tested the method on the distribution of virtual species in an imaginary territory. The favourability models yielded similar values regardless of the variation in the presence/absence ratio. We also illustrate with the example of the Pyrenean desman’s (Galemys pyrenaicus) distribution in Spain. The favourability model yielded more realistic potential distribution maps than the logistic regression model. Favourability values can be regarded as the degree of membership of the fuzzy set of sites whose environmental conditions are favourable to the species, which enables applying the rules of fuzzy logic to distribution modelling. They also allow for direct comparisons between models for species with different presence/absence ratios in the study area. This makes themmore useful to estimate the conservation value of areas, to design ecological corridors, or to select appropriate areas for species reintroductions.
Resumo:
Background There is evidence that certain mutations in the double-strand break repair pathway ataxia-telangiectasia mutated gene act in a dominant-negative manner to increase the risk of breast cancer. There are also some reports to suggest that the amino acid substitution variants T2119C Ser707Pro and C3161G Pro1054Arg may be associated with breast cancer risk. We investigate the breast cancer risk associated with these two nonconservative amino acid substitution variants using a large Australian population-based case–control study. Methods The polymorphisms were genotyped in more than 1300 cases and 600 controls using 5' exonuclease assays. Case–control analyses and genotype distributions were compared by logistic regression. Results The 2119C variant was rare, occurring at frequencies of 1.4 and 1.3% in cases and controls, respectively (P = 0.8). There was no difference in genotype distribution between cases and controls (P = 0.8), and the TC genotype was not associated with increased risk of breast cancer (adjusted odds ratio = 1.08, 95% confidence interval = 0.59–1.97, P = 0.8). Similarly, the 3161G variant was no more common in cases than in controls (2.9% versus 2.2%, P = 0.2), there was no difference in genotype distribution between cases and controls (P = 0.1), and the CG genotype was not associated with an increased risk of breast cancer (adjusted odds ratio = 1.30, 95% confidence interval = 0.85–1.98, P = 0.2). This lack of evidence for an association persisted within groups defined by the family history of breast cancer or by age. Conclusion The 2119C and 3161G amino acid substitution variants are not associated with moderate or high risks of breast cancer in Australian women.
Resumo:
OBJECTIVE: To evaluate the scored Patient-generated Subjective Global Assessment (PG-SGA) tool as an outcome measure in clinical nutrition practice and determine its association with quality of life (QoL). DESIGN: A prospective 4 week study assessing the nutritional status and QoL of ambulatory patients receiving radiation therapy to the head, neck, rectal or abdominal area. SETTING: Australian radiation oncology facilities. SUBJECTS: Sixty cancer patients aged 24-85 y. INTERVENTION: Scored PG-SGA questionnaire, subjective global assessment (SGA), QoL (EORTC QLQ-C30 version 3). RESULTS: According to SGA, 65.0% (39) of subjects were well-nourished, 28.3% (17) moderately or suspected of being malnourished and 6.7% (4) severely malnourished. PG-SGA score and global QoL were correlated (r=-0.66, P<0.001) at baseline. There was a decrease in nutritional status according to PG-SGA score (P<0.001) and SGA (P<0.001); and a decrease in global QoL (P<0.001) after 4 weeks of radiotherapy. There was a linear trend for change in PG-SGA score (P<0.001) and change in global QoL (P=0.003) between those patients who improved (5%) maintained (56.7%) or deteriorated (33.3%) in nutritional status according to SGA. There was a correlation between change in PG-SGA score and change in QoL after 4 weeks of radiotherapy (r=-0.55, P<0.001). Regression analysis determined that 26% of the variation of change in QoL was explained by change in PG-SGA (P=0.001). CONCLUSION: The scored PG-SGA is a nutrition assessment tool that identifies malnutrition in ambulatory oncology patients receiving radiotherapy and can be used to predict the magnitude of change in QoL.
Resumo:
Farmers' exposure to pesticides is high in developing countries. As a result many farmers suffer from ill-health, both short and long term. Deaths are not uncommon. This paper addresses this issue. Field survey data from Sri Lanka are used to estimate farmers' expenditure on defensive behavior (DE) and to determine factors that influence DE. The avertive behavior approach is used to estimate costs. Tobit regression analysis is used to determine factors that influence DE. Field survey data show that farmers' expenditures on DE are low. This is inversely related to high incidence of ill health among farmers using pesticides.
Resumo:
This study develops a life-cycle model where investors make investment decisions in a realistic environment. Model results show that personal illiquid projects (housing and children), fixed costs (once-off/per-period participation costs plus variable/fixed transaction costs) and endogenous risky human capital (with permanent, transitory and disastrous shocks) together are able to address both the non-participation puzzle and the age-effects puzzle. Empirical implications of the model are examined using Heckman’s two-step method with the latest five Surveys of Consumer Finance (SCF). Regression results show that liquidity, informational cost and human capital are indeed the major determinants of participation and asset allocation decisions at different stages of an investor’s life.
Resumo:
Despite the best intentions of service providers and organisations, service delivery is rarely error-free. While numerous studies have investigated specific cognitive, emotional or behavioural responses to service failure and recovery, these studies do not fully capture the complexity of the services encounter. Consequently, this research develops a more holistic understanding of how specific service recovery strategies affect the responses of customers by combining two existing models—Smith & Bolton’s (2002) model of emotional responses to service performance and Fullerton and Punj’s (1993) structural model of aberrant consumer behaviour—into a conceptual framework. Specific service recovery strategies are proposed to influence consumer cognition, emotion and behaviour. This research was conducted using a 2x2 between-subjects quasi-experimental design that was administered via written survey. The experimental design manipulated two levels of two specific service recovery strategies: compensation and apology. The effect of the four recovery strategies were investigated by collecting data from 18-25 year olds and were analysed using multivariate analysis of covariance and multiple regression analysis. The results suggest that different service recovery strategies are associated with varying scores of satisfaction, perceived distributive justice, positive emotions, negative emotions and negative functional behaviour, but not dysfunctional behaviour. These finding have significant implications for the theory and practice of managing service recovery.
Resumo:
There is increased recognition that determinants of health should be investigated in a life-course perspective. Retirement is a major transition in the life course and offers opportunities for changes in physical activity that may improve health in the aging population. The authors examined the effect of retirement on changes in physical activity in the GLOBE Study, a prospective cohort study known by the Dutch acronym for "Health and Living Conditions of the Population of Eindhoven and surroundings," 1991–2004. They followed respondents (n = 971) by postal questionnaire who were employed and aged 40–65 years in 1991 for 13 years, after which they were still employed (n = 287) or had retired (n = 684). Physical activity included 1) work-related transportation, 2) sports participation, and 3) nonsports leisure-time physical activity. Multinomial logistic regression analyses indicated that retirement was associated with a significantly higher odds for a decline in physical activity from work-related transportation (odds ratio (OR) = 3.03, 95% confidence interval (CI): 1.97, 4.65), adjusted for sex, age, marital status, chronic diseases, and education, compared with remaining employed. Retirement was not associated with an increase in sports participation (OR = 1.12, 95% CI: 0.71, 1.75) or nonsports leisure-time physical activity (OR = 0.80, 95% CI: 0.54, 1.19). In conclusion, retirement introduces a reduction in physical activity from work-related transportation that is not compensated for by an increase in sports participation or an increase in nonsports leisure-time physical activity.