923 results for Quantile regressions
Abstract:
Quantile computation has many applications including data mining and financial data analysis. It has been shown that an ε-approximate summary can be maintained so that, given a quantile query (φ, ε), the data item at rank ⌈φN⌉ may be approximately obtained within the rank error precision εN over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different φ and ε poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
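As a rough illustration of the clustering idea, the sketch below treats each query (φ, ε) as the interval [φ − ε, φ + ε] of acceptable normalized ranks and groups queries whose intervals share a common point, so that one answer can serve the whole group. The interval view, the greedy grouping, and the name `cluster_queries` are assumptions made for illustration; the abstract does not describe the paper's actual algorithms, and the sketch ignores the error of the underlying summary.

```python
from typing import List, Tuple

Query = Tuple[float, float]  # (phi, eps): target quantile and rank tolerance


def cluster_queries(queries: List[Query]) -> List[Tuple[float, List[Query]]]:
    """Greedily group queries whose rank intervals share a common point.

    Returns a list of (shared_phi, members). Sorting by right endpoint and
    reusing that endpoint as the shared point is the classical greedy rule
    for covering intervals on a line with the fewest points.
    """
    ordered = sorted(queries, key=lambda q: q[0] + q[1])
    clusters: List[Tuple[float, List[Query]]] = []
    for phi, eps in ordered:
        left = phi - eps
        if clusters and left <= clusters[-1][0]:
            # The current shared point already lies inside this interval.
            clusters[-1][1].append((phi, eps))
        else:
            # Open a new cluster at this interval's right endpoint (capped at 1).
            clusters.append((min(1.0, phi + eps), [(phi, eps)]))
    return clusters


if __name__ == "__main__":
    demo = [(0.50, 0.05), (0.52, 0.02), (0.90, 0.01), (0.55, 0.05)]
    for shared_phi, members in cluster_queries(demo):
        print(round(shared_phi, 4), members)
```

Under these assumptions, any answer that is acceptable at the shared point is acceptable for every query in the cluster, which is why the group can be processed as a single query.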
Abstract:
In many online applications, we need to maintain quantile statistics for a sliding window on a data stream. The sliding windows in natural form are defined as the most recent N data items. In this paper, we study the problem of estimating quantiles over other types of sliding windows. We present a uniform framework to process quantile queries for time-constrained and filter-based sliding windows. Our algorithm makes one pass over the data stream and maintains an ε-approximate summary. It uses O((1/ε²) log²(εN)) space, where N is the number of data items in the window. We extend this framework to further process generalized constrained sliding window queries and prove that our technique is applicable to flexible window settings. Our performance study indicates that the space required in practice is much less than the given theoretical bound and that the algorithm supports high-speed data streams.
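For contrast with the approximate summary, here is a minimal exact baseline for the natural count-based window (the most recent N items). It is not the paper's algorithm: it stores every item in the window, so it needs O(N) space, and the class name `SlidingWindowQuantile` is hypothetical.

```python
from bisect import bisect_left, insort
from collections import deque
from math import ceil
from typing import Deque, List


class SlidingWindowQuantile:
    """Exact quantiles over the most recent `window_size` items."""

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self._arrival: Deque[float] = deque()  # items in arrival order
        self._sorted: List[float] = []         # the same items, kept sorted

    def add(self, value: float) -> None:
        """Insert a new item and evict the oldest one if the window is full."""
        if len(self._arrival) == self.window_size:
            oldest = self._arrival.popleft()
            del self._sorted[bisect_left(self._sorted, oldest)]
        self._arrival.append(value)
        insort(self._sorted, value)

    def quantile(self, phi: float) -> float:
        """Return the item at rank ceil(phi * n) among the n items in the window."""
        if not self._sorted:
            raise ValueError("window is empty")
        if not 0.0 < phi <= 1.0:
            raise ValueError("phi must be in (0, 1]")
        rank = max(1, ceil(phi * len(self._sorted)))
        return self._sorted[rank - 1]
```

A summary-based method trades this exactness for a rank error of at most εN in exchange for far less memory.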
Abstract:
Direct quantile regression involves estimating a given quantile of a response variable as a function of input variables. We present a new framework for direct quantile regression where a Gaussian process model is learned by minimising the expected tilted loss function. The integration required in learning is not analytically tractable, so to speed up the learning we employ the Expectation Propagation algorithm. We describe how this work relates to other quantile regression methods and apply the method to both synthetic and real data sets. The method is shown to be competitive with state-of-the-art methods whilst allowing the full Gaussian process probabilistic framework to be leveraged.
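The tilted (also called pinball or check) loss is what ties a regression to a particular quantile. The sketch below only illustrates that loss on a constant predictor; it does not implement the Gaussian process model or Expectation Propagation, and the function names are chosen here for illustration.

```python
from typing import Sequence


def tilted_loss(residual: float, tau: float) -> float:
    """Tilted loss for residual = y - prediction at quantile level tau."""
    return tau * residual if residual >= 0 else (tau - 1.0) * residual


def empirical_risk(ys: Sequence[float], q: float, tau: float) -> float:
    """Average tilted loss when every observation is predicted by the constant q."""
    return sum(tilted_loss(y - q, tau) for y in ys) / len(ys)


def best_constant(ys: Sequence[float], tau: float) -> float:
    """Return the sample value that minimises the average tilted loss.

    The risk is piecewise linear and convex in q, so a minimiser can always
    be found among the observed values.
    """
    return min(ys, key=lambda q: empirical_risk(ys, q, tau))


if __name__ == "__main__":
    values = list(range(11))  # 0, 1, ..., 10
    print(best_constant(values, 0.5))  # the median of the sample
    print(best_constant(values, 0.9))  # a high quantile of the sample
```

Under-predictions are penalised with weight τ and over-predictions with weight 1 − τ, which is why the minimiser moves toward the upper tail as τ grows.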
Abstract:
Peer reviewed
Abstract:
Quantile regression (QR) was first introduced by Roger Koenker and Gilbert Bassett in 1978. It is robust to outliers, which strongly affect the least squares estimator in linear regression. Instead of modeling the mean of the response, QR provides an alternative way to model the relationship between quantiles of the response and covariates. Therefore, QR can be widely used to solve problems in econometrics, environmental sciences and health sciences. Sample size is an important factor in the planning stage of experimental designs and observational studies. In ordinary linear regression, sample size may be determined based on either precision analysis or power analysis with closed-form formulas. There are also methods that calculate sample size for QR based on precision analysis, such as that of C. Jennen-Steinmetz and S. Wellek (2005). A method to estimate sample size for QR based on power analysis was proposed by Shao and Wang (2009). In this paper, a new method is proposed to calculate sample size based on power analysis under a hypothesis test of covariate effects. Even though an error distribution assumption is not necessary for QR analysis itself, researchers have to make assumptions about the error distribution and covariate structure in the planning stage of a study to obtain a reasonable estimate of sample size. In this project, both parametric and nonparametric methods are provided to estimate the error distribution. Since the proposed method can be implemented in R, the user is able to choose either a parametric distribution or nonparametric kernel density estimation for the error distribution. The user also needs to specify the covariate structure and effect size to carry out the sample size and power calculation. The performance of the proposed method is further evaluated using numerical simulation. The results suggest that the sample sizes obtained from our method provide empirical powers that are close to the nominal power level, for example, 80%.
Abstract:
Bahadur representation and its applications have attracted a large number of publications and presentations on a wide variety of problems. Mixing dependency is weak enough to describe the dependent structure of random variables, including observations in time series and longitudinal studies. This note proves the Bahadur representation of sample quantiles for strongly mixing random variables (including ρ-mixing and φ-mixing) under very weak mixing coefficients. As an application, asymptotic normality is derived. These results greatly improve those recently reported in the literature.
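For reference, the classical form of the representation can be written as follows. The notation is the one commonly used for the independent, identically distributed case and is an assumption here; the note's own symbols and its remainder rate under mixing are not given in the abstract.

```latex
% xi_p        : the p-th quantile of the distribution function F
% \hat{\xi}_{p,n} : the sample p-th quantile from n observations
% F_n         : the empirical distribution function
% f           : the density of F, assumed positive at xi_p
% R_n         : a remainder term of smaller order than n^{-1/2}
\hat{\xi}_{p,n} \;=\; \xi_p \;+\; \frac{p - F_n(\xi_p)}{f(\xi_p)} \;+\; R_n
```

Because the middle term is an average of bounded random variables, a central limit theorem for that average carries over to the sample quantile, which is how asymptotic normality typically follows from the representation.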
Abstract:
Tourist accommodation expenditure is a widely investigated topic as it represents a major contribution to total tourist expenditure. The identification of the determinant factors is commonly based on supply-driven applications, while little research has been done on important travel characteristics. This paper proposes a demand-driven analysis of tourist accommodation price by focusing on data generated from room bookings. The investigation focuses on modeling the relationship between key travel characteristics and the price paid to book the accommodation. To accommodate the distributional characteristics of the expenditure variable, the analysis is based on the estimation of a quantile regression model. The findings support the econometric approach used and enable the elaboration of relevant managerial implications.
Abstract:
In this work, we explore and demonstrate the potential for modeling and classification using quantile-based distributions, which are random variables defined by their quantile function. In the first part we formalize a least squares estimation framework for the class of linear quantile functions, leading to unbiased and asymptotically normal estimators. Among the distributions with a linear quantile function, we focus on the flattened generalized logistic distribution (fgld), which offers a wide range of distributional shapes. A novel naïve-Bayes classifier is proposed that utilizes the fgld estimated via least squares, and through simulations and applications, we demonstrate its competitiveness against state-of-the-art alternatives. In the second part we consider the Bayesian estimation of quantile-based distributions. We introduce a factor model with independent latent variables, which are distributed according to the fgld. Similar to the independent factor analysis model, this approach accommodates flexible factor distributions while using fewer parameters. The model is presented within a Bayesian framework, an MCMC algorithm for its estimation is developed, and its effectiveness is illustrated with data coming from the European Social Survey. The third part focuses on depth functions, which extend the concept of quantiles to multivariate data by imposing a center-outward ordering in the multivariate space. We investigate the recently introduced integrated rank-weighted (IRW) depth function, which is based on the distribution of random spherical projections of the multivariate data. This depth function proves to be computationally efficient, and to increase its flexibility we propose different methods to explicitly model the projected univariate distributions. Its usefulness is shown in classification tasks: the maximum depth classifier based on the IRW depth is proven to be asymptotically optimal under certain conditions, and classifiers based on the IRW depth are shown to perform well in simulated and real data experiments.
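To make the projection idea concrete, the sketch below gives a Monte Carlo approximation of a depth of this kind: it averages, over random unit directions u, the quantity min(F_u, 1 − F_u), where F_u is the empirical distribution function of the data projected onto u, evaluated at the projection of the query point. This formula is the commonly cited definition of the IRW depth and is assumed here rather than taken from the abstract; the function names are hypothetical, and the explicit modeling of projected distributions mentioned above is not attempted.

```python
import math
import random
from typing import List, Sequence


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a direction uniformly on the unit sphere via normalised Gaussians."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def irw_depth(
    x: Sequence[float],
    data: Sequence[Sequence[float]],
    n_directions: int = 200,
    seed: int = 0,
) -> float:
    """Approximate the depth of x with respect to `data` by random projections."""
    rng = random.Random(seed)
    n = len(data)
    total = 0.0
    for _ in range(n_directions):
        u = random_unit_vector(len(x), rng)
        proj_x = dot(u, x)
        # Empirical distribution function of the projected data at proj_x.
        f_u = sum(1 for row in data if dot(u, row) <= proj_x) / n
        total += min(f_u, 1.0 - f_u)
    return total / n_directions
```

Points near the centre of the data cloud receive values close to 0.5, while points far outside it receive values close to 0, which gives the center-outward ordering that a maximum depth classifier relies on.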
Abstract:
There is great interindividual variability in the response to GH therapy. Ascertaining genetic factors can improve the accuracy of growth response predictions. Suppressor of cytokine signaling (SOCS)-2 is an intracellular negative regulator of GH receptor (GHR) signaling. The objective of the study was to assess the influence of a SOCS2 polymorphism (rs3782415) and its interactive effect with GHR exon 3 and -202 A/C IGFBP3 (rs2854744) polymorphisms on adult height of patients treated with recombinant human GH (rhGH). Genotypes were correlated with adult height data of 65 Turner syndrome (TS) and 47 GH deficiency (GHD) patients treated with rhGH, by multiple linear regressions. Generalized multifactor dimensionality reduction was used to evaluate gene-gene interactions. Baseline clinical data were indistinguishable among patients with different genotypes. Adult height SD scores of patients with at least one SOCS2 single-nucleotide polymorphism rs3782415-C were 0.7 higher than those homozygous for the T allele (P < .001). SOCS2 (P = .003), GHR-exon 3 (P= .016) and -202 A/C IGFBP3 (P = .013) polymorphisms, together with clinical factors accounted for 58% of the variability in adult height and 82% of the total height SD score gain. Patients harboring any two negative genotypes in these three different loci (homozygosity for SOCS2 T allele; the GHR exon 3 full-length allele and/or the -202C-IGFBP3 allele) were more likely to achieve an adult height at the lower quartile (odds ratio of 13.3; 95% confidence interval of 3.2-54.2, P = .0001). The SOCS2 polymorphism (rs3782415) has an influence on the adult height of children with TS and GHD after long-term rhGH therapy. Polymorphisms located in GHR, IGFBP3, and SOCS2 loci have an influence on the growth outcomes of TS and GHD patients treated with rhGH. The use of these genetic markers could identify among rhGH-treated patients those who are genetically predisposed to have less favorable outcomes.
Abstract:
The main objective of this work was to evaluate the linear regression between spectral response and soybean yield at a regional scale. The study monitored 36 municipalities in the western region of the state of Paraná using five Landsat 5/TM images from the 2004/05 season. The spectral response was converted into physical values, apparent and surface reflectances, by radiometric transformation and atmospheric correction, and both were used to calculate the NDVI and GVI vegetation indices. These indices were compared, by multiple and simple regression, with official government yield values (IBGE). Diagnostic procedures to identify influential values or collinearity were also applied to the data. The results showed that the mean surface reflectance value from all images was more strongly correlated with yield than the values from individual dates. Further, the multiple regressions using all dates and both vegetation indices gave better results than simple regression.
Abstract:
Universidade Estadual de Campinas. Faculdade de Educação Física
Abstract:
Universidade Estadual de Campinas. Faculdade de Educação Física
Abstract:
Universidade Estadual de Campinas. Faculdade de Educação Física
Abstract:
The dispersal and survival of the phlebotomines Nyssomyia intermedia and Nyssomyia neivai (both implicated as vectors of the cutaneous leishmaniasis agent) in an endemic area were investigated using a capture-mark-release technique in five experiments from August-December 2003 in the municipality of Iporanga, state of São Paulo, Brazil. A total of 1,749 males and 1,262 females of Ny. intermedia and 915 males and 411 females of Ny. neivai were marked and released during the five experiments. Recapture attempts were made using automatic light traps, aspiration in natural resting places and domestic animal shelters, and Shannon traps. A total of 153 specimens (3.48%) were recaptured: 2.59% (78/3,011) for Ny. intermedia and 5.35% (71/1,326) for Ny. neivai. Both species were recaptured up to 144 h post-release, with most of them recaptured within 48 h. The median dispersal distances for Ny. intermedia and Ny. neivai were 109 m and 100 m, respectively. The greatest dispersal range of Ny. intermedia was 180 m, while for Ny. neivai one female was recaptured in a pasture at 250 m and another in a pigsty at 520 m, showing a tendency to disperse to more open areas. The daily survival rates, calculated from regressions of the numbers of marked insects recaptured on the six successive days after release, were 0.746 for males and 0.575 for females of Ny. intermedia and 0.649 for both sexes of Ny. neivai. The size of the populations in the five months ranged from 8,332-725,085 for Ny. intermedia males, 2,193-104,490 for Ny. intermedia females, 1,687-350,122 for Ny. neivai males and 254-49,705 for Ny. neivai females.
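One common way to turn recapture counts into a daily survival rate is sketched below. It assumes that recaptures decline exponentially with time since release, fits a least squares line to the natural logarithm of the counts, and reports exp(slope) as the survival rate. The abstract does not state the regression form used in the study, so this log-linear model is an assumption, and the function name and the counts in the usage note are hypothetical.

```python
import math
from typing import Sequence


def daily_survival_rate(days: Sequence[float], recaptures: Sequence[float]) -> float:
    """Estimate daily survival as exp(slope) of ln(recaptures) regressed on days."""
    if len(days) != len(recaptures) or len(days) < 2:
        raise ValueError("need at least two (day, count) pairs of equal length")
    if any(count <= 0 for count in recaptures):
        raise ValueError("counts must be positive to take logarithms")

    logs = [math.log(count) for count in recaptures]
    mean_x = sum(days) / len(days)
    mean_y = sum(logs) / len(logs)
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0.0:
        raise ValueError("days must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))

    # Ordinary least squares slope, back-transformed to a per-day rate.
    return math.exp(sxy / sxx)
```

For example, made-up counts that fall by exactly 30% per day over six days would yield a rate of 0.7; real counts are noisy, and days with zero recaptures need separate handling that this sketch does not provide.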
Abstract:
OBJECTIVE: To analyze the consumption of fruits and vegetables (FV) among adolescents and to identify associated factors. METHODS: Population-based cross-sectional study with a representative sample of 812 adolescents of both sexes from São Paulo, SP, in 2003. Food consumption was measured with a 24-hour dietary recall. FV consumption was described in percentiles, and quantile regression models were used to investigate the association between FV intake and explanatory variables. RESULTS: Of the adolescents interviewed, 6.4% consumed the minimum recommendation of 400 g/day of FV and 22% consumed no FV at all. In the quantile regression models, adjusted for energy intake, age group and sex, per capita household income and the education level of the head of the household were positively associated with FV consumption, whereas smoking was negatively associated. Income was significantly associated with the lower percentiles of intake (p20 to p55), smoking with the intermediate percentiles (p45 to p75), and the education level of the head of the household with the upper percentiles of FV consumption (p70 to p95). CONCLUSIONS: FV consumption among adolescents in São Paulo was below the recommendations of the Ministry of Health and is influenced by per capita household income, the education level of the head of the household, and smoking.