957 results for Multivariate statistical methods
Abstract:
Many discussions have enlarged the literature in bibliometrics since Hirsch's proposal of the so-called h-index. Ranking papers according to their citations, this index characterizes a researcher solely by the largest number h of his or her papers that are each cited at least h times. A closed formula for the h-index distribution that can be applied to distinct databases is not yet known. In fact, obtaining such a distribution requires knowledge of the citation distribution of the authors and its specificities. Instead of dealing with randomly chosen researchers, here we address different groups based on distinct databases. The first group is composed of physicists and biologists, with data extracted from the Institute of Scientific Information (ISI). The second group is composed of computer scientists, with data extracted from the Google Scholar system. In this paper, we obtain a general formula for the h-index probability density function (pdf) for groups of authors by using generalized exponentials in the context of escort probability. Our analysis includes the use of several statistical methods to estimate the necessary parameters. In addition, an exhaustive comparison among the possible candidate distributions is used to describe the way the citations are distributed among authors. The h-index pdf should be used to classify groups of researchers from a quantitative point of view, which is of interest for eliminating obscure qualitative methods. (C) 2011 Elsevier B.V. All rights reserved.
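The h-index definition used in the abstract above can be illustrated with a minimal sketch. The function name `h_index` and the sample citation counts are assumptions made for illustration; the sketch computes only the index for one author and does not reproduce the paper's pdf formula.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for a single author (not data from the paper).
example_citations = [10, 8, 5, 4, 3]
example_h = h_index(example_citations)
```

Sorting first keeps the logic close to the verbal definition: the index is the last rank at which the citation count still reaches the rank.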
Abstract:
The realization that statistical physics methods can be applied to analyze written texts represented as complex networks has led to several developments in natural language processing, including automatic summarization and evaluation of machine translation. Most importantly, so far only a few metrics of complex networks have been used and therefore there is ample opportunity to enhance the statistics-based methods as new measures of network topology and dynamics are created. In this paper, we employ for the first time the metrics betweenness, vulnerability and diversity to analyze written texts in Brazilian Portuguese. Using strategies based on diversity metrics, a better performance in automatic summarization is achieved in comparison to previous work employing complex networks. With an optimized method the Rouge score (an automatic evaluation method used in summarization) was 0.5089, which is the best value ever achieved for an extractive summarizer with statistical methods based on complex networks for Brazilian Portuguese. Furthermore, the diversity metric can detect keywords with high precision, which is why we believe it is suitable to produce good summaries. It is also shown that incorporating linguistic knowledge through a syntactic parser does enhance the performance of the automatic summarizers, as expected, but the increase in the Rouge score is only minor. These results reinforce the suitability of complex network methods for improving automatic summarizers in particular, and treating text in general. (C) 2011 Elsevier B.V. All rights reserved.
Abstract:
Genes involved in host-pathogen interactions are often strongly affected by positive natural selection. The Duffy antigen, coded by the Duffy antigen receptor for chemokines (DARC) gene, serves as a receptor for Plasmodium vivax in humans and for Plasmodium knowlesi in some nonhuman primates. In the majority of sub-Saharan Africans, a nucleic acid variant in GATA-1 of the gene promoter is responsible for the nonexpression of the Duffy antigen on red blood cells and consequently resistance to invasion by P. vivax. The Duffy antigen also acts as a receptor for chemokines and is expressed in red blood cells and many other tissues of the body. Because of this dual role, we sequenced a 3,000-bp region encompassing the entire DARC gene as well as part of its 5' and 3' flanking regions in a phylogenetic sample of primates and used statistical methods to evaluate the nature of selection pressures acting on the gene during its evolution. We analyzed both coding and regulatory regions of the DARC gene. The regulatory analysis showed accelerated rates of substitution at several sites near known motifs. Our tests of positive selection in the coding region using maximum likelihood by branch sites and maximum likelihood by codon sites did not yield statistically significant evidence for the action of positive selection. However, the maximum likelihood test in which the gene was subdivided into different structural regions showed that the known binding region for P. vivax/P. knowlesi is under very different selective pressures than the remainder of the gene. In fact, most of the gene appears to be under strong purifying selection, but this is not evident in the binding region. We suggest that the binding region is under the influence of two opposing selective pressures, positive selection possibly exerted by the parasite and purifying selection exerted by chemokines.
Abstract:
Chaabene, H, Hachana, Y, Franchini, E, Mkaouer, B, Montassar, M, and Chamari, K. Reliability and construct validity of the karate-specific aerobic test. J Strength Cond Res 26(12): 3454-3460, 2012. The aim of this study was to examine the absolute and relative reliabilities and the external responsiveness of the karate-specific aerobic test (KSAT). This study comprised 43 male karatekas, 19 of them participated in the first study to establish test-retest reliability and 40, selected on the basis of their karate experience and level of practice, participated in the second study to identify external responsiveness of the KSAT. The latter group was divided into 2 categories: national-level group (Gn) and regional-level group (Gr). Analysis showed excellent test-retest reliability of time to exhaustion (TE), with intraclass correlation coefficient ICC(3,1) >0.90, standard error of measurement (SEM) <5% (3.2%), and mean difference (bias) ± the 95% limits of agreement of -9.5 ± 78.8 seconds. There was a significant difference between test and retest sessions in peak lactate concentration (Peak [La]) (9.12 ± 2.59 vs. 8.05 ± 2.67 mmol·L⁻¹; p < 0.05) but not in peak heart rate (HRpeak) or rating of perceived exertion (RPE) (196 ± 9 vs. 194 ± 9 b·min⁻¹ and 7.6 ± 0.93 vs. 7.8 ± 1.15, respectively; p > 0.05). National-level karate athletes (1,032 ± 101 seconds) were better than regional-level athletes (841 ± 134 seconds) on TE performance during the KSAT (p < 0.001). Thus, the KSAT provided good external responsiveness. The area under the receiver operating characteristic curve was >0.70 (0.86; 95% confidence interval: 0.72-0.95). A significant difference was detected in Peak [La] between the national-level (6.09 ± 1.78 mmol·L⁻¹) and regional-level (8.48 ± 2.63 mmol·L⁻¹) groups, but not in HRpeak (194 ± 8 vs. 195 ± 8 b·min⁻¹) or RPE (7.57 ± 1.15 vs. 7.42 ± 1.1). The results of this study indicate that the KSAT provides excellent absolute and relative reliabilities. The KSAT can effectively distinguish karate athletes of different competitive levels. Thus, the KSAT may be suitable for field assessment of aerobic fitness of karate practitioners.
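The "bias ± 95% limits of agreement" reported in the abstract above is commonly computed as the mean of the paired test-retest differences plus or minus 1.96 standard deviations of those differences. The sketch below assumes that convention and a retest-minus-test sign; the function name and the numbers are hypothetical and are not the study's data.

```python
import statistics


def limits_of_agreement(test, retest):
    """Return (bias, lower, upper) for paired measurements.

    Assumes the common convention bias +/- 1.96 * SD of the paired
    differences, with differences taken as retest minus test.
    """
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [b - a for a, b in zip(test, retest)]
    bias = statistics.mean(diffs)
    half_width = 1.96 * statistics.stdev(diffs)  # sample SD of the differences
    return bias, bias - half_width, bias + half_width


# Hypothetical time-to-exhaustion values in seconds (illustrative only).
session_1 = [900.0, 1010.0, 850.0, 1100.0, 960.0]
session_2 = [880.0, 1025.0, 870.0, 1060.0, 955.0]
bias, lower, upper = limits_of_agreement(session_1, session_2)
```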
Abstract:
BACKGROUND: It is widely accepted that red wines constitute one of the most important sources of dietary polyphenolic antioxidants. However, it is still not known how variables such as variety, vintage, country of origin, and retail price are associated with the antioxidant activity and sensory profile of South American red wines. In this regard, 80 samples produced in Brazil, Chile and Argentina were assessed in relation to their sensory properties, color and in vitro antioxidant activity, and the results were subjected to multivariate statistical techniques. RESULTS: Samples were grouped in clusters characterized by high, intermediate and low in vitro antioxidant activity, sensory properties and prices. Wines with high antioxidant activity were associated with high retail prices and a high overall perception of sensory quality. CONCLUSION: South American wines produced from Vitis vinifera such as Syrah, Malbec and Cabernet Sauvignon had higher in vitro antioxidant activity and also higher sensory quality than wines produced from Vitis labrusca. This result was independent of vintage (2002-2010), corroborating the idea that the same grape varietal, even when produced in different years, displays similar sensory characteristics and antioxidant activity. (C) 2011 Society of Chemical Industry
Abstract:
Background Statistical methods for estimating usual intake require at least two short-term dietary measurements in a subsample of the target population. However, the percentage of individuals with a second dietary measurement (replication rate) may influence the precision of estimates, such as percentiles and proportions of individuals below cut-offs of intake. Objective To investigate the precision of the usual food intake estimates using different replication rates and different sample sizes. Participants/setting Adolescents participating in the continuous National Health and Nutrition Examination Survey 2007-2008 (n=1,304) who completed two 24-hour recalls. Statistical analyses performed The National Cancer Institute method was used to estimate the usual intake of dark green vegetables in the original sample comprising 1,304 adolescents with a replication rate of 100%. A bootstrap with 100 replications was performed to estimate CIs for percentiles and proportions of individuals below cut-offs of intake. Using the same bootstrap replications, four groups of data sets were sampled with different replication rates (80%, 60%, 40%, and 20%). For each data set created, the National Cancer Institute method was performed and percentiles, CIs, and proportions of individuals below cut-offs were calculated. Precision estimates were checked by comparing each CI obtained from data sets with different replication rates with the CI obtained from the original data set. Further, we sampled 1,000, 750, 500, and 250 individuals from the original data set, and performed the same analytical procedures. Results Percentiles of intake and percentage of individuals below the cut-off points were similar throughout the replication rates and sample sizes, but the CI width increased as the replication rate decreased. Wider CIs were observed at replication rates of 40% and 20%. Conclusions The precision of the usual intake estimates decreased when low replication rates were used. However, even with different sample sizes, replication rates >40% may not lead to an important loss of precision. J Acad Nutr Diet. 2012;112:1015-1020.
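The bootstrap step described in the abstract above (100 resamples used to obtain CIs for percentiles) can be sketched generically. This is a plain percentile bootstrap on a single list of values, not the National Cancer Institute method, which models within-person variation from repeated recalls; the helper names, the seed, and the synthetic data are all assumptions.

```python
import random


def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def bootstrap_ci(values, stat, n_boot=100, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for stat(values), resampling with replacement."""
    rng = random.Random(seed)
    n = len(values)
    estimates = [stat(rng.choices(values, k=n)) for _ in range(n_boot)]
    return (percentile(estimates, 100 * alpha / 2),
            percentile(estimates, 100 * (1 - alpha / 2)))


# Synthetic right-skewed "intake" values (illustrative only, not survey data).
data_rng = random.Random(42)
intake = [data_rng.gammavariate(2.0, 15.0) for _ in range(250)]
ci_low, ci_high = bootstrap_ci(intake, lambda xs: percentile(xs, 50))
```

Rerunning `bootstrap_ci` on smaller subsamples of `intake` gives a rough feel for how the interval widens as less information is available, which is the kind of comparison the abstract reports.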
Abstract:
In this work, 50 ceramic fragments from the Lago Grande archaeological site and 30 from the Osvaldo archaeological site were compared to assess elemental similarities. The aim is to perform a preliminary comparison between the sites, which are located in the central Amazon, Brazil. The analytical technique employed to obtain the elemental composition of the ceramics was instrumental neutron activation analysis (INAA). The data set obtained was explored using the multivariate statistical techniques of cluster, principal component and discriminant analysis. The analyzed elements were Na, Lu, U, Yb, La, Th, Cr, Cs, Sc, Fe, Eu, Ce and Hf. The results showed the existence of at least two compositional groups for Lago Grande and Osvaldo. Each compositional group of the Osvaldo archaeological site matches one group of Lago Grande. Correlated with the archaeological background, the results suggest commercial or cultural exchange in the region, which is indicative of socio-cultural interactions between those sites.
Abstract:
Despite considerable research conducted on 'Tahiti' lime [Citrus latifolia (Yu Tanaka) Tanaka] in several countries, few long-term studies have focused on rootstock effects on fruit production and quality under non-irrigated conditions. As for many other fruit crops, rootstock studies for 'Tahiti' lime are often based on the evaluation of several horticultural responses simultaneously, instead of considering multivariate statistical approaches, which may provide more comprehensive information. Consequently, a trial was installed to evaluate the horticultural performance of non-irrigated 'Tahiti' lime trees budded onto the following 12 rootstocks: 'HRS 801' and 'HRS 827' hybrids; 'Rubidoux', 'FCAV' and 'Flying Dragon' trifoliates; 'Sun Chu Sha Kat' and 'Sunki' mandarins; 'Cravo Limeira' and 'Cravo FCAV' 'Rangpur' limes; 'Carrizo' citrange, 'Swingle' citrumelo, and 'Orlando' tangelo. The trial was installed in 2001, at an 8 m x 5 m spacing with no supplementary irrigation. Measurements of yield, fruit quality oriented to different consumer markets, canopy volume and tree tolerance to drought were performed. A multivariate cluster analysis identified both 'Rangpur' lime rootstocks as those inducing larger cumulative yield and a higher percentage of fruits for the domestic market, while conferring the highest drought tolerance on the trees. Despite their high susceptibility to drought stress under non-irrigated conditions, the 'Flying Dragon' and 'FCAV' trifoliate rootstocks performed outstandingly for 'Tahiti' lime, inducing higher yield efficiency, early bearing and a larger percentage of high-quality fruits for foreign markets, with smaller trees more suitable for high-density plantings. (c) 2012 Elsevier B.V. All rights reserved.
Abstract:
Polychaete assemblage structure was used to investigate taxonomic sufficiency in a heavily polluted tropical bay. Species abundance was aggregated into matrices of progressively higher taxa (genus, family, order) and analyzed using univariate and multivariate techniques. Polychaete distribution in Guanabara Bay (GB) was in accordance with a pollution gradient, probably governed by organic enrichment and the consequent effects of hypoxia and altered redox conditions, coupled with prevailing patterns of circulation. Within the sectors of GB, an increasing gradient in species richness and occurrence was observed, ranging from the azoic and impoverished stations in the inner sector to a community well structured in terms of species composition and abundance in the outer sector. Multivariate statistical analysis showed similar results when species were aggregated into genera and families, while greater differences occurred at the coarsest level of taxonomic identification (order). The literature on taxonomic sufficiency has demonstrated that faunal patterns at different taxonomic levels tend to become similar with increased pollution. In GB, an analysis carried out solely at the family level is adequate to describe the environmental gradient and can be considered a useful tool for quick environmental assessment. (C) 2011 Elsevier Ltd. All rights reserved.
Abstract:
Statistical methods have been widely employed to assess the capabilities of credit scoring classification models in order to reduce the risk of wrong decisions when granting credit facilities to clients. The predictive quality of a classification model can be evaluated based on measures such as sensitivity, specificity, predictive values, accuracy, correlation coefficients and information-theoretic measures, such as relative entropy and mutual information. In this paper we analyze the performance of a naive logistic regression model (Hosmer & Lemeshow, 1989) and a logistic regression with state-dependent sample selection model (Cramer, 2004) applied to simulated data. Also, as a case study, the methodology is illustrated on a data set extracted from a Brazilian bank portfolio. Our simulation results so far reveal no statistically significant difference in predictive capacity between the naive logistic regression models and the logistic regression with state-dependent sample selection models. However, there is a strong difference between the distributions of the estimated default probabilities from these two statistical modeling techniques, with the naive logistic regression models always underestimating such probabilities, particularly in the presence of balanced samples. (C) 2012 Elsevier Ltd. All rights reserved.
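Several of the evaluation measures listed in the abstract above follow directly from a 2x2 confusion matrix. The sketch below is illustrative only: the function name and the counts are hypothetical, and the use of the Matthews correlation coefficient as the "correlation coefficient" is an assumption, since the abstract does not say which coefficient the authors used.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Common binary-classification measures from confusion-matrix counts."""

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),  # assumed choice of coefficient
    }


# Hypothetical counts: 45 defaulters and 55 non-defaulters (not the paper's data).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```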
Abstract:
In this article, we propose a new Bayesian flexible cure rate survival model, which generalises the stochastic model of Klebanov et al. [Klebanov LB, Rachev ST and Yakovlev AY. A stochastic-model of radiation carcinogenesis - latent time distributions and their properties. Math Biosci 1993; 113: 51-75], and has much in common with the destructive model formulated by Rodrigues et al. [Rodrigues J, de Castro M, Balakrishnan N and Cancho VG. Destructive weighted Poisson cure rate models. Technical Report, Universidade Federal de Sao Carlos, Sao Carlos-SP. Brazil, 2009 (accepted in Lifetime Data Analysis)]. In our approach, the accumulated number of lesions or altered cells follows a compound weighted Poisson distribution. This model is more flexible than the promotion time cure model in terms of dispersion. Moreover, it possesses an interesting and realistic interpretation of the biological mechanism of the occurrence of the event of interest as it includes a destructive process of tumour cells after an initial treatment or the capacity of an individual exposed to irradiation to repair altered cells that results in cancer induction. In other words, what is recorded is only the damaged portion of the original number of altered cells not eliminated by the treatment or repaired by the repair system of an individual. Markov Chain Monte Carlo (MCMC) methods are then used to develop Bayesian inference for the proposed model. Also, some discussions on the model selection and an illustration with a cutaneous melanoma data set analysed by Rodrigues et al. [Rodrigues J, de Castro M, Balakrishnan N and Cancho VG. Destructive weighted Poisson cure rate models. Technical Report, Universidade Federal de Sao Carlos, Sao Carlos-SP. Brazil, 2009 (accepted in Lifetime Data Analysis)] are presented.
Abstract:
Abstract Background Several mathematical and statistical methods have been proposed in the last few years to analyze microarray data. Most of those methods involve complicated formulas and software implementations that require advanced computer programming skills. Researchers from other areas may experience difficulties when attempting to use those methods in their research. Here we present a user-friendly toolbox which allows large-scale gene expression analysis to be carried out by biomedical researchers with limited programming skills. Results Here, we introduce a user-friendly toolbox called GEDI (Gene Expression Data Interpreter), an extensible, open-source, and freely available tool that we believe will be useful to a wide range of laboratories, and to researchers with no background in Mathematics and Computer Science, allowing them to analyze their own data by applying both classical and advanced approaches developed and recently published by Fujita et al. Conclusion GEDI is an integrated user-friendly viewer that combines the state-of-the-art SVR, DVAR and SVAR algorithms previously developed by us. It facilitates the application of SVR, DVAR and SVAR beyond the mathematical formulas presented in the corresponding publications, and allows one to better understand the results by means of the available visualizations. Both running the statistical methods and visualizing the results are carried out within the graphical user interface, rendering these algorithms accessible to the broad community of researchers in Molecular Biology.
Abstract:
Abstract Background To understand the molecular mechanisms underlying important biological processes, a detailed description of the gene product networks involved is required. In order to define and understand such molecular networks, some statistical methods have been proposed in the literature to estimate gene regulatory networks from time-series microarray data. However, several problems still need to be overcome. Firstly, information flow needs to be inferred, in addition to the correlation between genes. Secondly, we usually try to identify large networks from a large number of genes (parameters) originating from a smaller number of microarray experiments (samples). Due to this situation, which is rather frequent in Bioinformatics, it is difficult to perform statistical tests using methods that model large gene-gene networks. In addition, most of the models are based on dimension reduction using clustering techniques; therefore, the resulting network is not a gene-gene network but a module-module network. Here, we present the Sparse Vector Autoregressive (SVAR) model as a solution to these problems. Results We have applied the Sparse Vector Autoregressive model to estimate gene regulatory networks based on gene expression profiles obtained from time-series microarray experiments. Through extensive simulations, by applying the SVAR method to artificial regulatory networks, we show that SVAR can infer true positive edges even under conditions in which the number of samples is smaller than the number of genes. Moreover, it is possible to control for false positives, a significant advantage when compared to other methods described in the literature, which are based on ranks or score functions. By applying SVAR to actual HeLa cell cycle gene expression data, we were able to identify well-known transcription factor targets. Conclusion The proposed SVAR method is able to model gene regulatory networks in frequent situations in which the number of samples is lower than the number of genes, making it possible to naturally infer partial Granger causalities without any a priori information. In addition, we present a statistical test to control the false discovery rate, which was not previously possible using other gene regulatory network models.
Abstract:
The objective of this thesis is to improve the understanding of which processes and mechanisms affect the distribution of polychlorinated biphenyls (PCBs) and organic carbon in coastal sediments. Because of the strong association of hydrophobic organic contaminants (HOCs) such as PCBs with organic matter in the aquatic environment, these two entities are naturally linked. The coastal environment is the most complex and dynamic part of the ocean when it comes to the cycling of both organic matter and HOCs. This environment is characterised by the largest fluxes and most diverse sources of both entities. A wide array of methods was used to study these processes throughout this thesis. At the field sites in the Stockholm archipelago of the Baltic proper, bottom sediments and settling particulate matter were retrieved using sediment coring devices and sediment traps from morphometrically and seismically well-characterized locations. In the laboratory, the samples were analysed for PCBs, stable carbon isotope ratios, carbon-nitrogen atom ratios, and standard sediment properties. From the fieldwork in the Stockholm archipelago and the subsequent laboratory work it was concluded that the inner Stockholm archipelago has a low (≈ 4%) trapping efficiency for freshwater-derived organic carbon (OC). The corollary is a large potential for long-range waterborne transport of OC and OC-associated nutrients and hydrophobic organic pollutants from urban Stockholm to more pristine offshore Baltic Sea ecosystems. Theoretical work was carried out using Geographical Information Systems (GIS) and statistical methods on a database of 4214 individual sediment samples, each with reported individual PCB congener concentrations. From this work it was concluded that continental shelf sediments are key global inventories and ultimate sinks of PCBs. Depending on the congener, 10-80% of the cumulative historical emissions to the environment are accounted for in continental shelf sediments. Further, it was concluded that the many infamous and highly contaminated surface sediments of urban harbours and estuaries of contaminated rivers cannot be of importance as a secondary source to sustain the concentrations observed in remote sediments. Of the global shelf PCB inventory, < 1% is in sediments near population centres while ≥ 90% is in remote areas (> 10 km from any dwellings). The remote sub-basin of the North Atlantic Ocean contains approximately half of the global shelf sediment inventory for most of the PCBs studied.
Abstract:
Master's in Oceanography