948 results for Cryptography Statistical methods
Abstract:
The realization that statistical physics methods can be applied to analyze written texts represented as complex networks has led to several developments in natural language processing, including automatic summarization and evaluation of machine translation. Most importantly, so far only a few metrics of complex networks have been used, and therefore there is ample opportunity to enhance the statistics-based methods as new measures of network topology and dynamics are created. In this paper, we employ for the first time the metrics betweenness, vulnerability and diversity to analyze written texts in Brazilian Portuguese. Using strategies based on diversity metrics, a better performance in automatic summarization is achieved in comparison to previous work employing complex networks. With an optimized method, the Rouge score (an automatic evaluation method used in summarization) was 0.5089, which is the best value ever achieved for an extractive summarizer with statistical methods based on complex networks for Brazilian Portuguese. Furthermore, the diversity metric can detect keywords with high precision, which is why we believe it is suitable to produce good summaries. It is also shown that incorporating linguistic knowledge through a syntactic parser does enhance the performance of the automatic summarizers, as expected, but the increase in the Rouge score is only minor. These results reinforce the suitability of complex network methods for improving automatic summarizers in particular, and for text processing in general.
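As an illustration of the general approach described above (and not the authors' actual pipeline), the sketch below ranks the sentences of a toy text by a complex-network metric. It assumes Python with networkx and a deliberately naive rule that links two sentences whenever they share a word; the function name and the overlap criterion are inventions of this example.

```python
# Minimal sketch: extractive summarization driven by a network centrality metric.
# Assumptions: networkx is available; sentences are linked when they share a word.
import itertools
import networkx as nx

def summarize(sentences, k=2):
    """Return the k sentences whose nodes score highest on betweenness centrality."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    tokens = [set(w.strip(".,") for w in s.lower().split()) for s in sentences]
    for i, j in itertools.combinations(range(len(sentences)), 2):
        if tokens[i] & tokens[j]:            # toy criterion: any shared word
            graph.add_edge(i, j)
    scores = nx.betweenness_centrality(graph)
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

if __name__ == "__main__":
    text = ["Complex networks can represent texts as graphs of words.",
            "Graphs of words capture the structure of a text.",
            "Betweenness highlights nodes that bridge different parts of the graphs.",
            "An unrelated sentence about cooking pasta."]
    print(summarize(text, k=2))
```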
Abstract:
Genes involved in host-pathogen interactions are often strongly affected by positive natural selection. The Duffy antigen, encoded by the Duffy antigen receptor for chemokines (DARC) gene, serves as a receptor for Plasmodium vivax in humans and for Plasmodium knowlesi in some nonhuman primates. In the majority of sub-Saharan Africans, a nucleotide variant in the GATA-1 binding site of the gene promoter is responsible for the nonexpression of the Duffy antigen on red blood cells and, consequently, resistance to invasion by P. vivax. The Duffy antigen also acts as a receptor for chemokines and is expressed in red blood cells and many other tissues of the body. Because of this dual role, we sequenced a 3,000-bp region encompassing the entire DARC gene as well as part of its 5' and 3' flanking regions in a phylogenetic sample of primates and used statistical methods to evaluate the nature of the selection pressures acting on the gene during its evolution. We analyzed both the coding and regulatory regions of the DARC gene. The regulatory analysis showed accelerated rates of substitution at several sites near known motifs. Our tests of positive selection in the coding region using maximum likelihood by branch sites and maximum likelihood by codon sites did not yield statistically significant evidence for the action of positive selection. However, the maximum likelihood test in which the gene was subdivided into different structural regions showed that the known binding region for P. vivax/P. knowlesi is under very different selective pressures than the remainder of the gene. In fact, most of the gene appears to be under strong purifying selection, but this is not evident in the binding region. We suggest that the binding region is under the influence of two opposing selective pressures: positive selection possibly exerted by the parasite and purifying selection exerted by chemokines.
Abstract:
Chaabene, H, Hachana, Y, Franchini, E, Mkaouer, B, Montassar, M, and Chamari, K. Reliability and construct validity of the karate-specific aerobic test. J Strength Cond Res 26(12): 3454-3460, 2012. The aim of this study was to examine the absolute and relative reliability and the external responsiveness of the karate-specific aerobic test (KSAT). This study comprised 43 male karatekas: 19 participated in the first study, to establish test-retest reliability, and 40, selected on the basis of their karate experience and level of practice, participated in the second study, to identify the external responsiveness of the KSAT. The latter group was divided into 2 categories: a national-level group (Gn) and a regional-level group (Gr). Analysis showed excellent test-retest reliability of time to exhaustion (TE), with an intraclass correlation coefficient ICC(3,1) >0.90, a standard error of measurement (SEM) <5% (3.2%), and a mean difference (bias) ± the 95% limits of agreement of -9.5 ± 78.8 seconds. There was a significant difference between the test and retest sessions in peak lactate concentration (Peak [La]) (9.12 ± 2.59 vs. 8.05 ± 2.67 mmol·L⁻¹; p < 0.05) but not in peak heart rate (HRpeak) or rating of perceived exertion (RPE) (196 ± 9 vs. 194 ± 9 b·min⁻¹ and 7.6 ± 0.93 vs. 7.8 ± 1.15; p > 0.05), respectively. National-level karate athletes (1,032 ± 101 seconds) were better than regional-level athletes (841 ± 134 seconds) in TE performance during the KSAT (p < 0.001). Thus, the KSAT provided good external responsiveness. The area under the receiver operating characteristic curve was >0.70 (0.86; 95% confidence interval: 0.72-0.95). A significant difference was detected in Peak [La] between the national-level (6.09 ± 1.78 mmol·L⁻¹) and regional-level (8.48 ± 2.63 mmol·L⁻¹) groups, but not in HRpeak (194 ± 8 vs. 195 ± 8 b·min⁻¹) or RPE (7.57 ± 1.15 vs. 7.42 ± 1.1), respectively. The results of this study indicate that the KSAT provides excellent absolute and relative reliability. The KSAT can effectively distinguish karate athletes of different competitive levels. Thus, the KSAT may be suitable for field assessment of the aerobic fitness of karate practitioners.
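For readers who want to reproduce the absolute-reliability statistics quoted above on their own data, the sketch below computes the bias, the 95% limits of agreement and one common estimator of the standard error of measurement (SD of the test-retest differences divided by √2). The numbers in the example are invented, not the study's data.

```python
# Sketch of absolute test-retest reliability statistics (bias, 95% limits of
# agreement, SEM).  The data below are hypothetical, not taken from the study.
import numpy as np

def absolute_reliability(test, retest):
    test, retest = np.asarray(test, float), np.asarray(retest, float)
    diff = retest - test
    bias = diff.mean()                          # mean difference
    half_width = 1.96 * diff.std(ddof=1)        # 95% limits-of-agreement half-width
    sem = diff.std(ddof=1) / np.sqrt(2)         # standard error of measurement
    sem_pct = 100 * sem / np.concatenate([test, retest]).mean()
    return bias, (bias - half_width, bias + half_width), sem, sem_pct

if __name__ == "__main__":
    t1 = [900, 1010, 870, 950, 1100, 820]       # time to exhaustion, session 1 (s)
    t2 = [880, 1005, 900, 940, 1120, 800]       # time to exhaustion, session 2 (s)
    print(absolute_reliability(t1, t2))
```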
Abstract:
The use of statistical methods to analyze large databases of text has been useful in unveiling patterns of human behavior and establishing historical links between cultures and languages. In this study, we identified literary movements by treating books published from 1590 to 1922 as complex networks, whose metrics were analyzed with multivariate techniques to generate six clusters of books. The latter correspond to time periods coinciding with relevant literary movements over the last five centuries. The most important factor contributing to the distinctions between different literary styles was the average shortest path length, in particular the asymmetry of its distribution. Furthermore, over time there has emerged a trend toward larger average shortest path lengths, which is correlated with increased syntactic complexity, and a more uniform use of words, reflected in a smaller power-law coefficient for the distribution of word frequency. Changes in literary style were also found to be driven by opposition to earlier writing styles, as revealed by the analysis performed with geometrical concepts. The approaches adopted here are generic and may be extended to analyze a number of features of languages and cultures.
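The two quantities highlighted above (the average shortest path length and the power-law coefficient of the word-frequency distribution) can be approximated for any text along the following lines. This is a rough sketch only: it uses a simple adjacent-word co-occurrence network and a crude log-log least-squares slope, which are stand-ins for whatever network construction and fitting procedure the study actually used.

```python
# Rough sketch: average shortest path length of an adjacent-word network and a
# crude power-law coefficient for the word-frequency distribution.
import collections
import numpy as np
import networkx as nx

def book_metrics(text):
    words = text.lower().split()
    g = nx.Graph()
    g.add_edges_from(zip(words, words[1:]))                 # link adjacent words
    giant = g.subgraph(max(nx.connected_components(g), key=len))
    aspl = nx.average_shortest_path_length(giant)
    freqs = sorted(collections.Counter(words).values(), reverse=True)
    ranks = np.arange(1, len(freqs) + 1)
    slope = np.polyfit(np.log(ranks), np.log(freqs), 1)[0]  # ~ -(power-law coefficient)
    return aspl, -slope

if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat near the cat"
    print(book_metrics(sample))
```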
Abstract:
Background Statistical methods for estimating usual intake require at least two short-term dietary measurements in a subsample of the target population. However, the percentage of individuals with a second dietary measurement (the replication rate) may influence the precision of estimates such as percentiles and proportions of individuals below cut-offs of intake. Objective To investigate the precision of usual food intake estimates using different replication rates and different sample sizes. Participants/setting Adolescents participating in the continuous National Health and Nutrition Examination Survey 2007-2008 (n=1,304) who completed two 24-hour recalls. Statistical analyses performed The National Cancer Institute method was used to estimate the usual intake of dark green vegetables in the original sample comprising 1,304 adolescents with a replication rate of 100%. A bootstrap with 100 replications was performed to estimate CIs for percentiles and for proportions of individuals below cut-offs of intake. Using the same bootstrap replications, four sets of data sets were sampled, one for each of the lower replication rates (80%, 60%, 40%, and 20%). For each data set created, the National Cancer Institute method was applied, and percentiles, CIs, and proportions of individuals below cut-offs were calculated. Precision was checked by comparing each CI obtained from the data sets with different replication rates with the CI obtained from the original data set. Further, we sampled 1,000, 750, 500, and 250 individuals from the original data set and performed the same analytical procedures. Results Percentiles of intake and percentages of individuals below the cut-off points were similar across the replication rates and sample sizes, but the CIs widened as the replication rate decreased. Wider CIs were observed at replication rates of 40% and 20%. Conclusions The precision of the usual intake estimates decreased when low replication rates were used. However, even with different sample sizes, replication rates >40% may not lead to an important loss of precision. J Acad Nutr Diet. 2012;112:1015-1020.
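The design question in this abstract (how the replication rate affects the width of bootstrap CIs) can be mimicked on simulated data as below. Note that this is only a caricature: the National Cancer Institute measurement-error method is replaced by a plain mean of the available recalls, and all intake values are simulated, so the code illustrates the bookkeeping of replication rates rather than the actual statistical method.

```python
# Toy experiment: how does the replication rate (share of subjects with a second
# 24-hour recall) affect the width of a bootstrap CI for an intake percentile?
# The NCI method is replaced here by a simple mean of the available recalls.
import numpy as np

rng = np.random.default_rng(0)
n = 1304
recall1 = rng.gamma(shape=1.2, scale=30.0, size=n)       # simulated intakes
recall2 = recall1 * rng.lognormal(0.0, 0.4, size=n)      # within-person variation

def ci_width(replication_rate, n_boot=100, pct=90):
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                       # bootstrap subjects
        has_second = rng.random(n) < replication_rate     # who keeps recall 2
        intake = np.where(has_second, (recall1[idx] + recall2[idx]) / 2, recall1[idx])
        estimates.append(np.percentile(intake, pct))
    lo, hi = np.percentile(estimates, [2.5, 97.5])
    return hi - lo

for rate in (1.0, 0.8, 0.6, 0.4, 0.2):
    print(f"replication rate {rate:.0%}: width of the 90th-percentile CI = {ci_width(rate):.2f}")
```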
Abstract:
Statistical methods have been widely employed to assess the capabilities of credit scoring classification models in order to reduce the risk of wrong decisions when granting credit facilities to clients. The predictive quality of a classification model can be evaluated based on measures such as sensitivity, specificity, predictive values, accuracy, correlation coefficients and information-theoretic measures, such as relative entropy and mutual information. In this paper we analyze the performance of a naive logistic regression model (Hosmer & Lemeshow, 1989) and a logistic regression with state-dependent sample selection model (Cramer, 2004) applied to simulated data. Also, as a case study, the methodology is illustrated on a data set extracted from a Brazilian bank portfolio. Our simulation results revealed that there is no statistically significant difference in terms of predictive capacity between the naive logistic regression models and the logistic regression with state-dependent sample selection models. However, there is a strong difference between the distributions of the estimated default probabilities produced by these two statistical modeling techniques, with the naive logistic regression models always underestimating such probabilities, particularly in the presence of balanced samples.
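As a reminder of the kind of predictive-quality measures mentioned above, the sketch below fits a plain ("naive") logistic regression to simulated credit data and reports sensitivity, specificity and accuracy on a held-out sample. The data-generating process and feature coefficients are assumptions of this example, and the state-dependent sample selection model of Cramer (2004) is not implemented here.

```python
# Sketch: evaluating a plain logistic regression credit-scoring model with
# sensitivity, specificity and accuracy on simulated, held-out data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 3))                                   # hypothetical client features
p_default = 1.0 / (1.0 + np.exp(-(-2.0 + X @ np.array([0.8, -0.5, 0.3]))))
y = rng.binomial(1, p_default)                                # 1 = client defaulted

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("accuracy:   ", (tp + tn) / len(y_te))
```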
Abstract:
In this article, we propose a new Bayesian flexible cure rate survival model, which generalises the stochastic model of Klebanov et al. [Klebanov LB, Rachev ST and Yakovlev AY. A stochastic model of radiation carcinogenesis: latent time distributions and their properties. Math Biosci 1993; 113: 51-75], and has much in common with the destructive model formulated by Rodrigues et al. [Rodrigues J, de Castro M, Balakrishnan N and Cancho VG. Destructive weighted Poisson cure rate models. Technical Report, Universidade Federal de Sao Carlos, Sao Carlos-SP, Brazil, 2009 (accepted in Lifetime Data Analysis)]. In our approach, the accumulated number of lesions or altered cells follows a compound weighted Poisson distribution. This model is more flexible than the promotion time cure model in terms of dispersion. Moreover, it possesses an interesting and realistic interpretation of the biological mechanism of the occurrence of the event of interest, as it includes a destructive process of tumour cells after an initial treatment or the capacity of an individual exposed to irradiation to repair altered cells that result in cancer induction. In other words, what is recorded is only the damaged portion of the original number of altered cells not eliminated by the treatment or repaired by the repair system of an individual. Markov Chain Monte Carlo (MCMC) methods are then used to develop Bayesian inference for the proposed model. Also, some discussions on model selection and an illustration with a cutaneous melanoma data set analysed by Rodrigues et al. [Rodrigues J, de Castro M, Balakrishnan N and Cancho VG. Destructive weighted Poisson cure rate models. Technical Report, Universidade Federal de Sao Carlos, Sao Carlos-SP, Brazil, 2009 (accepted in Lifetime Data Analysis)] are presented.
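For context, the promotion time cure model against which the dispersion of the proposed model is compared has the well-known population survival function below (this is standard background, not the new compound weighted Poisson formulation of the article): the number of initiated cells N is Poisson with mean θ, their latent activation times are i.i.d. with distribution function F, and the observed time is the minimum of the activation times.

```latex
% Standard promotion time cure model (background only):
% N ~ Poisson(theta), Z_1, ..., Z_N iid with cdf F, T = min(Z_1, ..., Z_N).
S_{\mathrm{pop}}(t) \;=\; P(T > t)
  \;=\; \sum_{n=0}^{\infty} \frac{e^{-\theta}\,\theta^{n}}{n!}\,\bigl[1 - F(t)\bigr]^{n}
  \;=\; \exp\{-\theta F(t)\},
\qquad
\lim_{t \to \infty} S_{\mathrm{pop}}(t) \;=\; e^{-\theta}
\quad \text{(cure fraction).}
```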
Abstract:
Background Several mathematical and statistical methods have been proposed in the last few years to analyze microarray data. Most of those methods involve complicated formulas and software implementations that require advanced computer programming skills. Researchers from other areas may experience difficulties when attempting to use those methods in their research. Here we present a user-friendly toolbox that allows large-scale gene expression analysis to be carried out by biomedical researchers with limited programming skills. Results We introduce a user-friendly toolbox called GEDI (Gene Expression Data Interpreter), an extensible, open-source, and freely available tool that we believe will be useful to a wide range of laboratories and to researchers with no background in mathematics and computer science, allowing them to analyze their own data by applying both classical and advanced approaches developed and recently published by Fujita et al. Conclusion GEDI is an integrated, user-friendly viewer that combines the state-of-the-art SVR, DVAR and SVAR algorithms previously developed by us. It facilitates the application of SVR, DVAR and SVAR beyond the mathematical formulas presented in the corresponding publications, and allows one to better understand the results by means of the available visualizations. Both running the statistical methods and visualizing the results are carried out within the graphical user interface, rendering these algorithms accessible to the broad community of researchers in molecular biology.
Abstract:
Background The genetic mechanisms underlying interindividual blood pressure variation reflect the complex interplay of both genetic and environmental variables. The current standard statistical methods for detecting genes involved in the regulation mechanisms of complex traits are based on univariate analysis. Few studies have focused on the search for and understanding of quantitative trait loci responsible for gene × environment interactions or on multiple-trait analysis. Composite interval mapping has been extended to multiple traits and may be an interesting approach to such a problem. Methods We used multiple-trait analysis for quantitative trait locus mapping of loci having different effects on systolic blood pressure with NaCl exposure. The animals studied were 188 rats, the progeny of an F2 intercross between a hypertensive and a normotensive strain, genotyped at 179 polymorphic markers across the rat genome. To accommodate the correlational structure of measurements taken in the same animals, we applied univariate and multivariate strategies for analyzing the data. Results We detected a new quantitative trait locus in a region close to marker R589 on chromosome 5 of the rat genome, not previously identified through serial analysis of individual traits. In addition, we were able to justify analytically the parametric restrictions, in terms of regression coefficients, responsible for the gain in precision with the adopted analytical approach. Conclusion Future work should focus on fine mapping and the identification of the causative variant responsible for this quantitative trait locus signal. The multivariable strategy might be valuable in the study of genetic determinants of interindividual variation in antihypertensive drug effectiveness.
Abstract:
Background To understand the molecular mechanisms underlying important biological processes, a detailed description of the networks of gene products involved is required. In order to define and understand such molecular networks, several statistical methods have been proposed in the literature to estimate gene regulatory networks from time-series microarray data. However, several problems still need to be overcome. Firstly, information flow needs to be inferred, in addition to the correlation between genes. Secondly, we usually try to identify large networks from a large number of genes (parameters) originating from a smaller number of microarray experiments (samples). Due to this situation, which is rather frequent in bioinformatics, it is difficult to perform statistical tests using methods that model large gene-gene networks. In addition, most of the models are based on dimension reduction using clustering techniques; therefore, the resulting network is not a gene-gene network but a module-module network. Here, we present the Sparse Vector Autoregressive (SVAR) model as a solution to these problems. Results We have applied the Sparse Vector Autoregressive model to estimate gene regulatory networks based on gene expression profiles obtained from time-series microarray experiments. Through extensive simulations, applying the SVAR method to artificial regulatory networks, we show that SVAR can infer true positive edges even under conditions in which the number of samples is smaller than the number of genes. Moreover, it is possible to control for false positives, a significant advantage when compared to other methods described in the literature, which are based on ranks or score functions. By applying SVAR to actual HeLa cell cycle gene expression data, we were able to identify well-known transcription factor targets. Conclusion The proposed SVAR method is able to model gene regulatory networks in frequent situations in which the number of samples is lower than the number of genes, making it possible to naturally infer partial Granger causalities without any a priori information. In addition, we present a statistical test to control the false discovery rate, which was not previously possible using other gene regulatory network models.
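The core idea of the SVAR approach (sparse estimation of a first-order vector autoregression, with nonzero coefficients read as directed, Granger-style edges) can be sketched as follows. This is not the authors' implementation: it uses a plain Lasso per target gene on simulated data, chooses the penalty arbitrarily, and omits the false-discovery-rate control that the paper provides.

```python
# Sketch: sparse VAR(1) network estimation with one Lasso regression per gene.
# Nonzero coefficients are interpreted as directed edges (gene j -> gene i).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
genes, timepoints = 20, 12                        # more genes than time samples
A_true = np.zeros((genes, genes))
A_true[0, 1] = 0.8                                # gene 1 -> gene 0
A_true[2, 0] = -0.6                               # gene 0 -> gene 2

X = np.zeros((timepoints, genes))
X[0] = rng.normal(size=genes)
for t in range(1, timepoints):
    X[t] = X[t - 1] @ A_true.T + rng.normal(size=genes)   # VAR(1) dynamics + noise

A_hat = np.zeros_like(A_true)
for g in range(genes):
    A_hat[g] = Lasso(alpha=0.1).fit(X[:-1], X[1:, g]).coef_

edges = [(j, i) for i in range(genes) for j in range(genes) if abs(A_hat[i, j]) > 1e-3]
print("recovered directed edges (source -> target):", edges)
```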
Abstract:
The objective of this thesis is to improve the understanding of which processes and mechanisms affect the distribution of polychlorinated biphenyls (PCBs) and organic carbon (OC) in coastal sediments. Because of the strong association of hydrophobic organic contaminants (HOCs) such as PCBs with organic matter in the aquatic environment, these two entities are naturally linked. The coastal environment is the most complex and dynamic part of the ocean when it comes to the cycling of both organic matter and HOCs. This environment is characterised by the largest fluxes and most diverse sources of both entities. A wide array of methods was used to study these processes throughout this thesis. At the field sites in the Stockholm archipelago of the Baltic proper, bottom sediments and settling particulate matter were retrieved with sediment coring devices and sediment traps from morphometrically and seismically well-characterized locations. In the laboratory, the samples were analysed for PCBs, stable carbon isotope ratios and carbon-nitrogen atom ratios, as well as standard sediment properties. From the fieldwork in the Stockholm archipelago and the subsequent laboratory work it was concluded that the inner Stockholm archipelago has a low (≈ 4%) trapping efficiency for freshwater-derived organic carbon. The corollary is a large potential for long-range waterborne transport of OC and OC-associated nutrients and hydrophobic organic pollutants from urban Stockholm to more pristine offshore Baltic Sea ecosystems. Theoretical work was carried out using Geographical Information Systems (GIS) and statistical methods on a database of 4214 individual sediment samples, each with reported individual PCB congener concentrations. From this work it was concluded that continental shelf sediments are key global inventories and ultimate sinks of PCBs: depending on the congener, 10-80% of the cumulative historical emissions to the environment are accounted for in continental shelf sediments. It was further concluded that the many infamous and highly contaminated surface sediments of urban harbours and estuaries of contaminated rivers cannot be important as a secondary source sustaining the concentrations observed in remote sediments. Of the global shelf PCB inventory, <1% is in sediments near population centres, while ≥90% is in remote areas (>10 km from any dwellings). The remote sub-basin of the North Atlantic Ocean contains approximately half of the global shelf sediment inventory for most of the PCBs studied.
Abstract:
Master's degree in Oceanography
Abstract:
It is well known that the deposition of gaseous pollutants and aerosols plays a major role in causing the deterioration of monuments and built cultural heritage in European cities. Despite the many studies dedicated to the environmental damage of cultural heritage, in the case of cement mortars, commonly used in 20th-century architecture, the deterioration due to the impact of airborne multi-pollutants, especially the formation of black crusts, is still not well explored, making this issue a challenging area of research. This work centers on the interactions between cement mortars and the environment, focusing on the diagnosis of the damage to modern built heritage caused by airborne multi-pollutants. For this purpose, three sites exposed to different urban areas in Europe were selected for sampling and subsequent laboratory analyses: Centennial Hall, Wroclaw (Poland); Chiesa dell'Autostrada del Sole, Florence (Italy); and Casa Galleria Vichi, Florence (Italy). The sampling sessions were performed taking into account the height above ground level and the protection from rain run-off (sheltered, partly sheltered and exposed areas). The complete characterization of the collected damage layers and underlying materials was performed using a range of analytical techniques: optical and scanning electron microscopy, X-ray diffractometry, differential and gravimetric thermal analysis, ion chromatography, flash combustion/gas chromatographic analysis, and inductively coupled plasma-optical emission spectrometry. The data were processed using statistical methods (i.e. principal component analysis), and an enrichment factor for cement mortars was calculated for the first time. The results obtained from the experimental activity performed on the damage layers indicate that gypsum, formed through the deposition of atmospheric sulphur compounds, is the main damage product at surfaces sheltered from rain run-off at Centennial Hall and Casa Galleria Vichi. By contrast, gypsum was not identified in the samples collected at Chiesa dell'Autostrada del Sole; this is connected to the restoration works, particularly surface cleaning, regularly performed for the maintenance of the building. Moreover, the results demonstrated a correlation between the location of a building and the composition of its damage layer: Centennial Hall is mainly subject to the impact of pollutants emitted by the nearby coal power stations, whilst Casa Galleria Vichi is principally affected by pollutants from vehicular exhaust in front of the building.
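The enrichment factor mentioned above is, in its generic form, the double ratio below; the choice of reference element and of the background composition used for the cement mortars is specific to the thesis and is not reproduced here.

```latex
% Generic enrichment factor for element x relative to a reference element
% (illustrative definition only; reference element and background are assumptions):
\mathrm{EF}_{x} \;=\;
\frac{\left( C_{x} / C_{\mathrm{ref}} \right)_{\text{damage layer}}}
     {\left( C_{x} / C_{\mathrm{ref}} \right)_{\text{unweathered substrate}}}
```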
Abstract:
The form and shape of craniofacial structures are primarily influenced by the inherent integration of very different functional systems and by external selective factors. The variability of skull morphology is an indicator of such influencing factors and therefore an ideal subject for comparative analyses of morphogenetic shape formation. To identify morphological-adaptive trends and patterns, hypotheses on morphological differentiation as well as on correlations between modular cranial compartments (facial, neurocranial, basicranial) were examined. In addition, virtual models were reconstructed from computed tomography (CT) scans to support the interpretation of the statistical findings. To quantify shape differences, up to 85 ectocranial measurement points (landmarks), i.e. three-dimensional coordinates, were acquired with a mechanical articulated-arm digitizer (MicroScribe-G2) on approximately 520 skulls from five extant genera of the superfamily Hominoidea (Hylobates, Pongo, Gorilla, Pan and Homo). Geometric nuisance factors (size, translation, rotation) were mathematically eliminated from the data set, and the remaining residuals, or 'shape variables', were subjected to various multivariate statistical procedures (factor, cluster, regression and correlation analyses as well as statistical tests). The methods applied preserve the geometric information of the specimens throughout all analytical steps and are summarized, as the current approach in morphometrics, under the term 'geometric morphometrics' (GMM). Specific data sets were generated for the different research questions. Various morphological trends and adaptive patterns could be identified from the generated data sets through the combination of statistical methods and computer-based reconstructions. It was also possible to reconstruct precisely which cranial structures within the sample interact with one another, represent unique variability, or are rather homogeneous in shape. The present findings indicate that the facial and neurocranial compartments correlate most strongly with each other, whereas the basicranium showed little dependence on changes of the facial skeleton or braincase. Moreover, in the non-human Hominoidea and across all analyses, the basicranium proves to be a conservative and evolutionarily persistent structure with the lowest potential for change. Juvenile individuals show a high affinity to one another and to forms with a small facial skeleton and a large braincase. While the cranium of modern humans is dominated primarily by encephalization and facial retraction (orthognathization) and thus exhibits a unique shape, the masticatory apparatus emerges as the decisive shape-forming compartment in the non-human forms. The combination of GMM with the interactive possibilities of computer-generated models proved to be a valid tool for addressing the questions raised. The interpretation of the findings must be regarded as highly complex owing to massive intercorrelations among the structures examined and the statistical-mathematical procedures involved. The study presents an innovative approach to modern morphometrics that could be expanded for future investigations in the field of craniofacial shape analysis. In this context, linking it with 'classical' and modern approaches (e.g. molecular biology) promises enhanced insights for future morphometric questions.
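The step described above, the mathematical removal of size, translation and rotation from landmark coordinates, corresponds to Procrustes superimposition. The sketch below shows the two-configuration (ordinary) case with SciPy on made-up coordinates; a full GMM analysis would instead run a generalized Procrustes analysis over all specimens, which scipy.spatial.procrustes does not provide.

```python
# Sketch: ordinary Procrustes superimposition of two landmark configurations.
# Coordinates are made up; size, position and orientation differences vanish.
import numpy as np
from scipy.spatial import procrustes

skull_a = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0,  0.0, 0.0],
                     [0.0,  0.0, 1.0]])
skull_b = 2.5 * skull_a @ rot_z_90.T + np.array([3.0, -1.0, 0.5])   # scaled, rotated, shifted

aligned_a, aligned_b, disparity = procrustes(skull_a, skull_b)
print("Procrustes disparity (~0 when shapes differ only in size/position/orientation):", disparity)
```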
Abstract:
Throughout the twentieth century, statistical methods increasingly became part of experimental research. In particular, statistics made quantification processes meaningful in the soft sciences, which had traditionally relied on activities such as collecting and describing diversity rather than timing variation. The thesis explores this change in relation to agriculture and biology, focusing on analysis of variance and experimental design, the statistical methods developed by the mathematician and geneticist Ronald Aylmer Fisher during the 1920s. The role that Fisher's methods acquired as tools of scientific research, side by side with the laboratory equipment and the field practices adopted by research workers, is here investigated bottom-up, beginning with the computing instruments and the information technologies that were the tools of the trade for statisticians. Four case studies show, from several perspectives, the interaction of statistics, computing and information technologies, giving on the one hand an overview of the main tools – mechanical calculators, statistical tables, punched and index cards, standardised forms, digital computers – adopted in the period, and on the other pointing out how these tools complemented each other and were instrumental in the development and dissemination of analysis of variance and experimental design. The period considered is the half-century from the early 1920s to the late 1960s, the institutions investigated are Rothamsted Experimental Station and the Galton Laboratory, and the statisticians examined are Ronald Fisher and Frank Yates.