ACL 2020

A Three-Parameter Rank-Frequency Relation in Natural Languages

Chenchen Ding, Masao Utiyama, Eiichiro Sumita

Abstract

We show that the rank-frequency relation in textual data follows f ∝ r^{-α} (r + γ)^{-β}, where f is the token frequency and r is the rank by frequency, with (α, β, γ) as parameters. The formulation is derived from the empirical observation that d²(x + y)/dx² is a typical impulse function, where (x, y) = (log r, log f). The formulation reduces to the power law when β = 0 and to the Zipf-Mandelbrot law when α = 0. Based on an investigation of multilingual corpora, we illustrate that α is related to the analytic features of syntax and β + γ to those of morphology in natural languages.
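
As an illustration of the formula stated in the abstract, the following minimal Python sketch evaluates the relation in log space and checks its two special cases. The function names, the normalization constant log_c, and the parameter values are hypothetical choices made for this example and are not taken from the paper. The second function is obtained by differentiating the stated formula twice with respect to x = log r, assuming γ > 0; it yields a single bump centered at x = log γ with extreme value -β/4, which is consistent with the impulse-like shape described above.

```python
import math


def log_frequency(rank, alpha, beta, gamma, log_c=0.0):
    """Return log f for f = C * r^(-alpha) * (r + gamma)^(-beta).

    log_c is a hypothetical normalization constant (log C), not from the paper.
    """
    return log_c - alpha * math.log(rank) - beta * math.log(rank + gamma)


def second_derivative_x_plus_y(x, beta, gamma):
    """Analytic d^2(x + y)/dx^2 with x = log r and y = log f.

    Differentiating y = log_c - alpha*x - beta*log(exp(x) + gamma) twice gives
    -beta * gamma * exp(x) / (exp(x) + gamma)^2; the x term contributes zero.
    """
    r = math.exp(x)
    return -beta * gamma * r / (r + gamma) ** 2


# Illustrative parameter values only (assumptions, not fitted to any corpus).
ranks = [1, 10, 100, 1000]

# beta = 0: the relation reduces to the power law f ∝ r^(-alpha).
power_law = [log_frequency(r, alpha=1.0, beta=0.0, gamma=5.0) for r in ranks]

# alpha = 0: the relation reduces to the Zipf-Mandelbrot law f ∝ (r + gamma)^(-beta).
zipf_mandelbrot = [log_frequency(r, alpha=0.0, beta=1.0, gamma=5.0) for r in ranks]

# The bump in d^2(x + y)/dx^2 is centered at x = log(gamma).
peak = second_derivative_x_plus_y(math.log(5.0), beta=1.2, gamma=5.0)
```

In this sketch α does not appear in the second derivative at all, because the term -αx is linear in x; only β and γ determine the height and position of the bump.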