39 results for Zipf





In Phys. Rev. Letters (73:2), Mantegna et al. conclude on the basis of Zipf rank frequency data that noncoding DNA sequence regions are more like natural languages than coding regions. We argue on the contrary that an empirical fit to Zipf's "law" cannot be used as a criterion for similarity to natural languages. Although DNA is presumably an "organized system of signs" in Mandelbrot's (1961) sense, observation of statistical features of the sort presented in the Mantegna et al. paper does not shed light on the similarity between DNA's "grammar" and natural language grammars, just as the observation of exact Zipf-like behavior cannot distinguish between the underlying processes of tossing an M-sided die or a finite-state branching process.





The Zipf curves of log frequency against log rank for a large English corpus (500 million word tokens, 689,000 word types) and a large Spanish corpus (16 million word tokens, 139,000 word types) are shown to have the usual slope close to –1 for ranks below 5,000, but at higher ranks they turn to give a slope close to –2. This is apparently mainly due to foreign words and place names. Other Zipf curves for the highly inflected Indo-European languages Irish and ancient Latin are also given. Because of the larger number of word types per lemma, they remain flatter than the English curve, maintaining a slope of –1 until turning points at ranks of about 30,000 for Irish and 10,000 for Latin. A formula that calculates the number of tokens given the number of types is derived in terms of the rank at the turning point: 5,000 for both English and Spanish, 30,000 for Irish and 10,000 for Latin.
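The two-slope behavior described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the function names (`rank_frequency`, `loglog_slope`, `synthetic_two_regime`) are hypothetical, the synthetic curve is an idealized assumption (exact slope –1 up to the turning rank, exact slope –2 beyond it), and the slope is estimated by ordinary least squares on the log-log points within a chosen rank window.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs, lo_rank, hi_rank):
    """Least-squares slope of log(frequency) against log(rank) for ranks lo_rank..hi_rank."""
    ranks = range(lo_rank, hi_rank + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(freqs[r - 1]) for r in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def synthetic_two_regime(n_types, turning_rank, top_freq):
    """Idealized curve (an assumption, not corpus data): f ~ 1/r up to the
    turning rank, then f ~ 1/r**2, continuous at the turning rank."""
    freqs = []
    for r in range(1, n_types + 1):
        if r <= turning_rank:
            freqs.append(top_freq / r)
        else:
            freqs.append(top_freq * turning_rank / r ** 2)
    return freqs


if __name__ == "__main__":
    freqs = synthetic_two_regime(n_types=20000, turning_rank=5000, top_freq=1e6)
    print("slope below turning point:", loglog_slope(freqs, 1, 5000))
    print("slope above turning point:", loglog_slope(freqs, 5001, 20000))
```

On real corpus data, `rank_frequency` would supply the frequency list, and the fitted slopes would depend on the rank windows chosen on either side of the suspected turning point.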





Graduate Program in Physics - IGCE





In this thesis, a corpus of important texts of Italian literature was studied using network theory. The topological measures typical of networks were computed on the literary texts, and their distributions and mean values were then studied to understand which of them can distinguish a real text from modified versions of it. It was also observed that all the texts exhibit two important statistical laws: Zipf's law and Heaps' law.
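Heaps' law concerns how the number of distinct words grows with the length of a text. As a minimal sketch (not the thesis's own code; `heaps_curve` is a hypothetical helper and whitespace tokenization is an assumption), the vocabulary-growth curve that Heaps' law is fitted to can be computed like this:

```python
def heaps_curve(tokens):
    """Return V(n) for n = 1..len(tokens): the number of distinct tokens
    seen among the first n tokens of the text."""
    seen = set()
    curve = []
    for token in tokens:
        seen.add(token)
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    # Toy input; a real study would tokenize a full literary text.
    text = "nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"
    print(heaps_curve(text.split()))
```

Heaps' law states that V(n) grows roughly as a power of n with exponent below 1; checking this would mean fitting a line to log V(n) against log n on a sufficiently long text.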





Tropical cyclones are a continuing threat to life and property. Willoughby (2012) found that a Pareto (power-law) cumulative distribution fitted to the most damaging 10% of US hurricane seasons fit their impacts well. Here, we find that damage follows a Pareto distribution because the assets at hazard follow a Zipf distribution, which can be thought of as a Pareto distribution with exponent 1. The Z-CAT model is an idealized hurricane catastrophe model that represents a coastline where populated places with Zipf-distributed assets are randomly scattered and damaged by virtual hurricanes with sizes and intensities generated through a Monte Carlo process. Results produce realistic Pareto exponents. The ability of the Z-CAT model to simulate different climate scenarios allowed testing of sensitivities to Maximum Potential Intensity, landfall rates and building structure vulnerability. The Z-CAT model results demonstrate that a statistically significant difference in damage is found only when changes in the parameters create a doubling of damage.
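The statement that a Zipf distribution can be read as a Pareto distribution with exponent 1 follows from a short, standard argument, sketched here. The symbols are introduced for illustration only and do not appear in the abstract: N is the number of places, x_r the assets of the place of rank r, C a scale constant, and α a generic tail exponent.

```latex
% Zipf's law in rank-size form: the r-th largest of N places has assets
x_r = \frac{C}{r}, \qquad r = 1, \dots, N .

% The number of places with assets at least x is the rank at which x_r = x:
\#\{\, i : x_i \ge x \,\} \;=\; r(x) \;=\; \frac{C}{x},
\qquad \frac{C}{N} \le x \le C .

% Dividing by N gives the empirical complementary cumulative distribution,
% which is a Pareto tail with exponent 1:
P(X \ge x) \;\approx\; \frac{r(x)}{N} \;=\; \frac{C}{N}\, x^{-1} .

% More generally, a rank-size law x_r = C\, r^{-1/\alpha} gives
P(X \ge x) \;\approx\; \frac{C^{\alpha}}{N}\, x^{-\alpha} .
```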





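This paper uses geographic information system (GIS) techniques, the theory and methods of landscape ecology, fractal theory, and statistical analysis to analyze the spatial distribution of the vegetation landscape in the Beijing area, and examines methods for analyzing landscape pattern and landscape diversity. The results show the following. (1) For almost all patch types, the distribution of patch size is not symmetric but right-skewed. Each of the four probability distributions considered (gamma, lognormal, Weibull, and (negative) exponential) describes only some of the patch types; the lognormal distribution fits the most patch types and the (negative) exponential distribution the fewest. (2) As patch area increases, the edge effect becomes smaller and patch shape becomes less compact. (3) Fractal analysis identifies two scale domains in the vegetation landscape of this area: one for patch areas smaller than (approximately) 2.7 km², the other for patch areas larger than (approximately) 2.7 km². Patch complexity differs greatly between the two domains: patches in the latter domain are clearly more complex than those in the former, and patch shape becomes more complex as patch area increases. (4) When the number of patches is used as the abundance measure, the patch type-abundance distribution of the landscape follows the (truncated) lognormal and (truncated) negative binomial distributions, but not the log-series or geometric distributions. When patch area is used as the abundance measure, the patch type-abundance distribution follows the lognormal, Weibull, and gamma distributions, but not the normal distribution. The patch type-abundance distribution of the landscape is therefore also not symmetric but right-skewed. Of the four dominance/diversity models, the "niche pre-emption" model and the Zipf-Mandelbrot model describe the patch type-abundance relationship of the landscape relatively well. (5) Sample size directly affects diversity measures; if the effect is small, the measure is relatively stable. Of the three richness indices, Ri is more stable than R2 and R3; of the five diversity indices, D and Di are the most stable and OD the least stable, so OD is an ideal index for monitoring landscape diversity; of the five evenness indices, Jgi is the most stable. According to the results of the three methods designed to compute the critical number of quadrats (that is, the number of quadrats at which a diversity measure becomes stable), the most stable measures listed above usually reach a stable state with only a few quadrats (that is, a total sampled area of several hundred km²). (6) The number of patch types increases with area. Based on four evaluation criteria, the hyperbola gives the best fit to the patch type-area relationship of the landscape. (7) When the sample is large (more than 30 quadrats for the first-order jackknife estimate; more than 60 quadrats for the second-order jackknife estimate), the jackknife gives good estimates of the number of patch types (NPT); when the sample is small (fewer than 30 quadrats), the empirical Bayes method proposed by Mingoti and Meeden gives better estimates of NPT than the jackknife and the bootstrap. Extrapolating the patch type-area curve can also give good estimates of NPT, but this method must be used with caution and cannot be extrapolated very far. (8) Contingency-table analysis shows significant associations between the patch types of the vegetation landscape and each of soil type, rock type, elevation, and aspect. Vegetation landscape diversity is also significantly positively correlated with both rock-type diversity and topographic diversity; that is, it increases as they increase. However, there is no significant linear or rank correlation between vegetation landscape diversity and soil-type diversity, possibly because the two classification systems do not match. Vegetation landscape diversity is significantly negatively correlated with total road density and with the density of class-2 roads, whereas its relationships with the densities of class-1 and class-3 roads are not significant. This indicates that the scale of the landscape sample unit (10 km × 10 km) corresponds to the scale at which class-2 roads exert their influence. Since road density reflects, to some extent, the intensity of human activity, at the 10 km × 10 km scale, the more intense the human activity, the lower the landscape diversity.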





The explosion of WWW traffic necessitates an accurate picture of WWW use, and in particular requires a good understanding of client requests for WWW documents. To address this need, we have collected traces of actual executions of NCSA Mosaic, reflecting over half a million user requests for WWW documents. In this paper we describe the methods we used to collect our traces, and the formats of the collected data. Next, we present a descriptive statistical summary of the traces we collected, which identifies a number of trends and reference patterns in WWW use. In particular, we show that many characteristics of WWW use can be modelled using power-law distributions, including the distribution of document sizes, the popularity of documents as a function of size, the distribution of user requests for documents, and the number of references to documents as a function of their overall rank in popularity (Zipf's law). Finally, we show how the power-law distributions derived from our traces can be used to guide system designers interested in caching WWW documents.





We present what we believe to be the first thorough characterization of live streaming media content delivered over the Internet. Our characterization of over five million requests spanning a 28-day period is done at three increasingly granular levels, corresponding to clients, sessions, and transfers. Our findings support two important conclusions. First, we show that the nature of interactions between users and objects is fundamentally different for live versus stored objects. Access to stored objects is user driven, whereas access to live objects is object driven. This reversal of active/passive roles of users and objects leads to interesting dualities. For instance, our analysis underscores a Zipf-like profile for user interest in a given object, which is to be contrasted to the classic Zipf-like popularity of objects for a given user. Also, our analysis reveals that transfer lengths are highly variable and that this variability is due to the stickiness of clients to a particular live object, as opposed to structural (size) properties of objects. Second, based on observations we make, we conjecture that the particular characteristics of live media access workloads are likely to be highly dependent on the nature of the live content being accessed. In our study, this dependence is clear from the strong temporal correlations we observed in the traces, which we attribute to the synchronizing impact of live content on access characteristics. Based on our analyses, we present a model for live media workload generation that incorporates many of our findings, and which we implement in GISMO [19].





Temporal locality of reference in Web request streams emerges from two distinct phenomena: the popularity of Web objects and the temporal correlation of requests. Capturing these two elements of temporal locality is important because it enables cache replacement policies to adjust how they capitalize on temporal locality based on the relative prevalence of these phenomena. In this paper, we show that temporal locality metrics proposed in the literature are unable to delineate between these two sources of temporal locality. In particular, we show that the commonly used distribution of reference interarrival times is predominantly determined by the power law governing the popularity of documents in a request stream. To capture (and more importantly quantify) both sources of temporal locality in a request stream, we propose a new and robust metric that enables accurate delineation between locality due to popularity and that due to temporal correlation. Using this metric, we characterize the locality of reference in a number of representative proxy cache traces. Our findings show that there are measurable differences between the degrees (and sources) of temporal locality across these traces, and that these differences are effectively captured using our proposed metric. We illustrate the significance of our findings by summarizing the performance of a novel Web cache replacement policy, called GreedyDual*, which exploits both long-term popularity and short-term temporal correlation in an adaptive fashion. Our trace-driven simulation experiments (which are detailed in an accompanying Technical Report) show the superior performance of GreedyDual* when compared to other Web cache replacement policies.





Power law distributions, also known as heavy-tailed distributions, model distinct real-life phenomena in the areas of biology, demography, computer science, economics, information theory, language, and astronomy, amongst others. In this paper, we review the literature with a view to applications of, and possible explanations for, power laws in real phenomena. We also unravel some controversies around power laws.





Power laws, also known as Pareto-like laws or Zipf-like laws, are commonly used to explain a variety of distinct real-world phenomena, often described merely by the signals they produce. In this paper, we study twelve cases, namely worldwide technological accidents, the annual revenue of America's largest private companies, the number of inhabitants in America's largest cities, the magnitude of earthquakes with minimum moment magnitude equal to 4, the total burned area in forest fires that occurred in Portugal, the net worth of the richest people in America, the frequency of occurrence of words in the novel Ulysses, by James Joyce, the total number of deaths in worldwide terrorist attacks, the number of linking root domains of the top internet domains, the number of linking root domains of the top internet pages, the total number of human victims of tornadoes that occurred in the U.S., and the number of inhabitants in the 60 most populated countries. The results demonstrate the emergence of statistical characteristics very close to power-law behavior. Furthermore, the parametric characterization reveals complex relationships present at a higher level of description.





The aim is to make available to teachers information on the actual vocabulary that EGB and BUP students possess, the vocabulary of their mother tongue. The sample comprised 3150 students (1890 in EGB and 1260 in BUP), 2520 from state schools and the rest from private schools, divided into blocks by age and by school in Oviedo, Gijón, Avilés, Mieres, Pola de Siero, Llanes and Navia. Through meetings and discussions, a set of rules was adopted for greater clarity and simplicity, including: 1. Each word is a separate entry, as are synonyms. Terms that appeared with many variants were treated as a single entry. Neologisms formed by compounding or borrowing were admitted. Words in Bable are treated as separate entries. Two types of survey were designed and administered at different times: 1. A free survey, in which students write all the terms that spontaneously come to mind. 2. A controlled survey, in which they write 20 terms that occur to them for a given centre of interest (animals, the countryside, the house and furniture, the city, food and drink, occupations, school and school supplies, means of transport, parts of the body, clothing). An alphabetical list was compiled of the 9782 words collected in the free survey, along with a list in decreasing order of frequency down to a frequency of 20 (1352 words). For the controlled survey, a list of all the words (16761) was compiled, and another in decreasing order of frequency down to 20 (2740). For each age, the absolute frequency and the distribution across the combined groups were recorded, together with their sums, giving the total frequency and distribution for each word. Volume I presents the methodology used and the free survey; Volume II presents the controlled survey, the lists for each centre of interest, and a summary list of all of them.
The authors consider it worthwhile to determine the frequency of the grammatical categories, to relate the resulting lists from the two surveys, and to apply Zipf's law to the results.