HAL Id: halshs-01483599
https://halshs.archives-ouvertes.fr/halshs-01483599
Preprint submitted on 6 Mar 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Frequency patterns of semantic change: Corpus-based evidence of a near-critical dynamics in language change
Q Feltgen, Benjamin Fagard, Jean-Pierre Nadal
To cite this version:
Q Feltgen, Benjamin Fagard, Jean-Pierre Nadal. Frequency patterns of semantic change: Corpus-based evidence of a near-critical dynamics in language change. 2017. ⟨halshs-01483599⟩
Frequency patterns of semantic change: Corpus-based evidence of a near-critical dynamics in language change
Q. Feltgen (1), B. Fagard (2) and J.P. Nadal (1,3)
(1) Laboratoire de Physique Statistique, École Normale Supérieure, PSL Research University; Université Paris Diderot, Sorbonne Paris-Cité; Sorbonne Universités, UPMC – Univ. Paris 06; CNRS; Paris, France.
(2) Laboratoire Langues, Textes, Traitements informatique, Cognition (LaTTiCe, UMR 8094 CNRS - ENS - Université Paris 3), École normale supérieure, Paris, France.
(3) École des Hautes Études en Sciences Sociales, PSL Research University, CNRS, Centre d’Analyse et de Mathématique Sociales, Paris, France.
It is generally believed that, when a linguistic item acquires a new meaning, its overall frequency of use in the language rises with time along an S-shaped growth curve. Yet this claim has only been supported by a limited number of case studies. In this paper, we provide the first corpus-based quantitative confirmation of the genericity of the S-curve in language change. Moreover, we uncover another generic pattern, a latency phase of variable duration preceding the S-growth, during which the frequency of use of the semantically expanding word remains low and more or less constant. We also propose a usage-based model of language change supported by cognitive considerations, which predicts that both phases, the latency and the fast S-growth, take place. The driving mechanism is a stochastic dynamics, a random walk in the space of frequency of use. The underlying deterministic dynamics highlights the role of a control parameter, the strength of the cognitive impetus governing the onset of change, which tunes the system in the vicinity of a saddle-node bifurcation. In the neighborhood of the critical point, the latency phase corresponds to the diffusion time over the critical region, and the S-growth to the fast convergence that follows. The durations of the two phases are computed as specific first-passage times of the random walk process, leading to distributions that fit well those extracted from our dataset. We argue that our results are not specific to the studied corpus, but apply to semantic change in general.
Language can be approached through three different, complementary perspectives. Ultimately, it exists in the mind of language users, so that it is a cognitive entity, rooted in a neuro-psychological basis. But language exists only because people interact with each other: It emerges as a convention among a community of speakers, and answers to their communicative needs. Thirdly, language can be seen as something in itself: An autonomous, emergent entity, obeying its own inner logic. Were it not for this third Dasein of language, it would be less obvious to speak of language change as such.
The social and cognitive nature of language informs and constrains this inner consistency. Zipf’s law, for instance, may be seen as resulting from a trade-off between the ease of producing the utterance, and the ease of processing it [1]. It relies thus both on the cognitive grounding of the language, and on its communicative nature. Those two external facets of language, cognitive and sociological, are similarly expected to channel the regularities of linguistic change. Modeling attempts (see [2] for an overview) have explored both how sociolinguistic factors can shape the process of this change [3, 4] and how this change arises through language learning by new generations of users [5, 6]. Some models also consider mutations of language itself, without providing further details on the social or cognitive mechanisms of change [7]. In this paper, we propose to view language change as initiated by language use, which is the repeated call to one’s linguistic resources in order to express oneself or to make sense of linguistic productions of others. This approach is in line with exemplar models [8] and related works, such as the Utterance Selection Model [9] or the model proposed by Victorri [10], which describes an out-of-equilibrium shaping of semantic structure through repeated events of communication.
Leaving aside sociolinguistic factors, we focus on a cognitive approach to linguistic change, more precisely to semantic expansion. Semantic expansion occurs when a new meaning is gained by a word or a construction (we will henceforth refer more vaguely to a linguistic ‘form’, so as to remain as general as possible). For instance, way, in the construction way too, has come to serve as an intensifier (e.g. ‘The only other newspaper in the history of Neopia is the Ugga Ugg Times, which, of course, is way too prehistoric to read.’ [11]). The fact that polysemy is pervasive in any language [12] suggests that semantic expansion is a common process of language change and happens constantly throughout the history of a language. Grammaticalization [13], a process by which forms acquire a (more) grammatical status, like the example of way too above, and other interesting phenomena of language change [14, 15] fall within the scope of semantic expansion.
Semantic change is known to be associated with an increase of frequency of the form whose meaning expands. This increase is indeed expected: As the form comes to carry more meanings, it is used in a broader range of contexts, hence more often. This implies that any instance of semantic change should have its empirical counterpart in the frequency rise of the use of the form. This rise is furthermore believed to follow an S-curve [16, 17], yet this claim, to our knowledge, has not been quantitatively grounded on more than a few chosen examples. Besides, it is not easily accounted for through theoretical modeling: In a sociolinguistic framework for instance, it requires either a very specific social structure, or the assumption that the new use is favored intrinsically [18]. Such a framework also suffers from what is known as the Threshold Problem, the fact that a novelty will fail to take over an entire community of speakers, because of the isolated status of an exceptional deviation [19].
In this paper, we provide a broad corpus-based investigation of the frequency patterns associated with a few hundred semantic expansions. It turns out that the S-curve pattern is corroborated, but must be completed by a preceding latency part, in which the frequency of the form does not significantly increase, even if the new meaning is already present in the language. To explain this surprising behavior, which seems to have escaped notice so far, we propose a usage-based model of the process of semantic expansion, implementing basic cognitive hypotheses regarding language use. By means of our model, we relate the micro-process of language use at the individual scale to the observed macro-phenomenon of a recurring frequency pattern occurring in semantic expansion.
I. QUANTIFICATION OF CHANGES IN A LARGE CORPUS
We worked on the French corpus Frantext [20], to our knowledge the only textual database allowing for a reliable study covering several centuries (see Material and Methods and Appendix A). We studied changes in frequency of use for 400 forms which have undergone one or several semantic expansions, over a time range from 1321 to the present day. We chose forms so as to focus on semantic expansions leading to a functional meaning, such as discursive, prepositional, or procedural meanings. Semantic expansions whose outcome remains in the lexical realm (such as the one undergone by sentence, whose meaning evolved from ‘verdict, judgment’ to ‘meaningful string of words’) have been left out. Functional meanings indeed present several advantages: They are often accompanied by a change of syntagmatic context, which allows the semantic expansion to be tracked more accurately (e.g. way in way too + adj.); they are also less sensitive to socio-cultural and historical influences; finally, they are less dependent on the specific content of a text, be it literary or academic.
The profiles of frequency of use extracted from the database are illustrated in Figure 1 for nine forms. We find that 286 cases display at least one sigmoidal increase of frequency in the course of their evolution, which makes up more than 70% of the total. We provide a small selection of the observed frequency patterns (Fig. 2a), whose associated logit transforms (Fig. 2b) follow a linear behavior, indicative of the sigmoidal nature of the growth (see Material and Methods). We thus find a robust statistical validation of the sigmoidal pattern, confirming the general claim made in the literature.
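As an illustration of why a linear logit transform signals sigmoidal growth, the following minimal sketch fits a straight line to the logit of a frequency series. It is not the procedure of Material and Methods: the function names, the use of a known saturation frequency `x_max` as normalization, and the synthetic data are all assumptions made for the example.

```python
import math

def logit(p):
    """Logit transform log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logit_linear_fit(freqs, x_max):
    """Least-squares line through the logit of freqs / x_max.

    freqs: frequencies over consecutive decades (growth part only).
    x_max: assumed saturation frequency (hypothetical normalization);
           it must be strictly larger than every entry of freqs.
    Returns (slope, intercept, r_squared), time being counted in decades.
    """
    ys = [logit(x / x_max) for x in freqs]
    ts = list(range(len(freqs)))
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    s_tt = sum((t - t_mean) ** 2 for t in ts)
    s_ty = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = s_ty / s_tt
    intercept = y_mean - slope * t_mean
    ss_res = sum((y - (slope * t + intercept)) ** 2 for t, y in zip(ts, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic data: an exact logistic curve with rate 0.8 and offset -4.
x_max = 3e-5
growth = [x_max / (1.0 + math.exp(-(0.8 * t - 4.0))) for t in range(10)]
slope, intercept, r2 = logit_linear_fit(growth, x_max)
```

On an exact logistic curve the fit recovers the rate and offset, and the coefficient of determination is close to 1; on corpus data, a high value over the growth window is what would support the sigmoid reading.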
Furthermore, we find two major phenomena besides this sigmoidal pattern. The first one is that, in most cases, the final plateau towards which the frequency is expected to stabilize after its sigmoidal rise is not to be found: The frequency immediately starts to decrease after having reached a maximum (Fig. 1). However, such a decrease process is not symmetrical with the increase, in contrast with other cases of fashion-driven evolution in language, e.g. the distribution of first names [21]. Though this decrease may be, in a handful of cases, imputable to the disappearance of a form (e.g. après ce, replaced in Modern French by après quoi), in most cases it is more likely to be the sign of a narrowing of its uses.
The second feature is that the fast growth is very often preceded by a long latency of up to several centuries, during which the new form is used, but with a comparatively low and rather stable frequency (Fig. 2a). One should note that the latency times may be underestimated: If the average frequency is very low during the latency part, the word may not show up at all in the corpus, especially in decades for which the available texts are sparse. The pattern of frequency increase is thus better conceived of as a latency followed by a growth, as exemplified by de toute façon (Fig. 3), best translated by anyway in English: the present meanings of these two terms are very close and, remarkably, despite quite different origins, the two have followed parallel paths of change.
To our knowledge, these two features, latency and absence of a stable plateau, have not been documented before, even though a number of specific cases of latency have been observed. For instance, it has been remarked in the case of just because that the fast increase is only one stage in the evolution [22]. In the following, we propose a model describing both the latency and the S-growth periods. We leave for future work the study of the decrease of frequency following the S-growth.
II. A COGNITIVE SCENARIO
To account for the specific frequency pattern evidenced by our data analysis, we propose a scenario focusing on cognitive aspects of language use, leaving all sociolinguistic effects backgrounded by adopting a representative-agent, mean-field-type approach. We limit ourselves to the case of a competition between two linguistic variants, given that most cases of semantic expansion can be understood as such, even if the two competing variants cannot always be explicitly identified. Initially, in some concept or context of use $C_1$, one of the two variants, henceforth denoted $Y$, is systematically chosen, so that it conventionally expresses this concept. The question we address is thus how a new variant, say $X$, can come to be used in this context and eventually evict the old variant $Y$.
FIG. 1. Frequency evolution over the whole time range (1321-2020) of nine different forms. Each blue bar shows the frequency associated with a decade. Frequencies have been multiplied by a factor of $10^5$ for easier reading.
A. Hypotheses
The main hypothesis we propose is that the new variant is almost never a brand new merging of phonemes whose meaning would pop out of nowhere. As Haspelmath highlights [23], a new variant is almost always a periphrastic construction, i.e., actual parts of language, put together in a new, meaningful way. Furthermore, such a construction, though it may be exapted to a new use, may have shown up from time to time in the course of the language's history, in an entirely compositional way; this is the case for par ailleurs, which incidentally appears as early as the 14th century in our corpus, but arises as a construction in its own right only during the first part of the 19th century. In other words, the use of a linguistic form $X$ in a context $C_1$ may be entirely new, but the form $X$ was most probably already there in another context of use $C_0$, or equivalently, with another meaning.
We make use of the well-grounded idea [24] that there exist links between concepts due to the intrinsic polysemy of language: There are no isolated meanings, as each concept is interwoven with many others, in a complicated tapestry. These links between concepts are asymmetrical, and they can express both universal mappings between concepts [25, 26] and cultural ones (e.g. entrenched metaphors [27]). As the conceptual texture of language is a complex network of living relations rather than a collection of isolated and self-sufficient monads, semantic change is expected to happen as the natural course of language evolution and to occur repeatedly throughout its history, so that at any point in time, there are always several parts of language which are undergoing changes. The simplest layout accounting for this network structure in a competitive situation then consists of two sites, such that one influences the other through a cognitive connection of some sort.
B. Model formalism
We now provide details on the modeling of a competition between two variants $X$ and $Y$ for a given context of use, or concept, $C_1$, also considering the effect exerted by the related context or concept $C_0$ on this evolution.
• Each concept $C_i$, $i = 0, 1$, is represented by a set of exemplars of the different linguistic forms. We denote by $N^{\mu}_i(t)$ the number, at time $t$, of encoded exemplars (or occurrences) of form $\mu \in \{X, Y\}$ in context $C_i$, in the memory of the representative agent.
FIG. 2. (a) A selection of frequency evolutions showing the latency period and the S-growth, separated by a red vertical line. (b) Logit transforms of the S-growth part of the preceding curves. Red dots correspond to data points and the green line to the linear fit of this set of points.
• The memory capacity of an individual being finite, the population of exemplars attached to each concept $C_i$ has a finite size $M_i$. For simplicity we assume that all memory sizes are equal ($M_0 = M_1 = M$). As we consider only two forms $X$ and $Y$, the relation $N^X_i(t) + N^Y_i(t) = M$ always holds for each $i$: We can focus on one of the two forms, here $X$, and drop the form superscript, with the understanding that all quantities refer to $X$.
• The absolute frequency $x_i(t)$ of form $X$ at time $t$ in context $C_i$ (the fraction of ‘balls’ of type $X$ in the bag attached to $C_i$) is thus given by the ratio $N_i(t)/M$. In the initial situation, $X$ and $Y$ are assumed to be the established conventions for expressing $C_0$ and $C_1$, respectively, so that we start with $N_0(t = 0) = M$ and $N_1(t = 0) = 0$.
FIG. 3. Overall evolution of the frequency of use of de toute façon (main panel), with focus on the S-shaped increase (right inner panel), whose logit transformation follows a linear fit (left inner panel). Preceding the S-growth, one observes a long period of very low frequency (up to 34 decades).
• Finally, $C_0$ exerts an influence on context $C_1$, but this influence is assumed to be unilateral. Consequently, the content of $C_0$ will not change in the course of the evolution and we can focus on $C_1$. An absence of explicit indication of context is thus to be understood as referring to $C_1$.
C. Dynamics
The dynamics of the system runs as follows. At each time $t$, one of the two linguistic forms is chosen to express concept $C_1$. The form $X$ is uttered with some probability $P(t)$, to be specified below, and $Y$ with probability $1 - P(t)$. In order to keep the memory size of the population of occurrences in $C_1$ constant, a past occurrence is randomly chosen (with a uniform distribution) and the new occurrence takes its place. This dynamics is then repeated a large number of times. Note that this model focuses on a speaker perspective (for alternative variants, see Appendix B).
We now make explicit how $P(t)$ depends on $x(t)$, the absolute frequency of $X$ in this context at time $t$. The simplest choice would be $P(t) = x(t)$. However, we want to take into account several facts, as explained below.
• As context $C_0$ exerts an influence on context $C_1$, denoting by $\gamma$ the strength of this influence, we assume the probability $P$ to depend instead on an effective frequency $f(t)$ (Fig. 4a),
$$ f(t) = \frac{N_1(t) + \gamma N_0(t)}{M + \gamma M} = \frac{x(t) + \gamma}{1 + \gamma} . \tag{1} $$
• We now specify the probability $P(f)$ to select $X$ at time $t$ as a function of $f = f(t)$. First, $P(f)$ must be nonlinear. Otherwise, the change occurs with certainty as soon as the effective frequency $f$ of the novelty is non-zero, that is, insofar as two meanings are related, the form expressing the former will also be recruited to express the latter. This change would also start too abruptly, whereas sudden, instantaneous takeovers are not known to happen in language change.
Second, one should preserve the symmetry between the two forms, that is, $P(f) = 1 - P(1 - f)$, as well as verify $P(0) = 0$ and $P(1) = 1$. Note that this symmetry is stated in terms of the effective frequency $f$ instead of the actual frequency $x$, as production in one context always accounts for the contents of neighboring ones.
For the numerical simulations, we made the following specific choice which satisfies these constraints:
$$ P(f) = \frac{1}{2} \left\{ 1 + \tanh\left( \beta \, \frac{f - (1 - f)}{\sqrt{f(1 - f)}} \right) \right\} , \tag{2} $$
where $\beta$ is a parameter governing the non-linearity of the curve. Expressing $f$ in terms of $x$ through Eq. (1), the probability to choose $X$ is thus a function $P_\gamma(x)$ of the current absolute frequency $x$:
$$ P_\gamma(x) = \frac{1}{2} \left\{ 1 + \tanh\left( \beta \, \frac{2x - 1 + \gamma}{\sqrt{(x + \gamma)(1 - x)}} \right) \right\} . \tag{3} $$
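The constraints stated above can be checked numerically. The sketch below is only illustrative: the function names are hypothetical, and the boundary values at $f = 0$ and $f = 1$ are set by hand to the limits of Eq. (2), an implementation choice made here to avoid a division by zero.

```python
import math

def effective_frequency(x, gamma):
    """Eq. (1): effective frequency f for absolute frequency x."""
    return (x + gamma) / (1.0 + gamma)

def p_select(f, beta):
    """Eq. (2): probability of producing X at effective frequency f."""
    if f <= 0.0:
        return 0.0  # limit of Eq. (2) as f -> 0
    if f >= 1.0:
        return 1.0  # limit of Eq. (2) as f -> 1
    arg = beta * (f - (1.0 - f)) / math.sqrt(f * (1.0 - f))
    return 0.5 * (1.0 + math.tanh(arg))

def p_gamma(x, gamma, beta):
    """Eq. (3): the same probability written in terms of x."""
    if x >= 1.0:
        return 1.0
    if x + gamma <= 0.0:
        return 0.0
    arg = beta * (2.0 * x - 1.0 + gamma) / math.sqrt((x + gamma) * (1.0 - x))
    return 0.5 * (1.0 + math.tanh(arg))

beta = 0.808
# Symmetry between the two forms: P(f) + P(1 - f) = 1.
symmetry_gap = abs(p_select(0.3, beta) + p_select(0.7, beta) - 1.0)
# Eq. (3) is Eq. (2) evaluated at the effective frequency of Eq. (1).
consistency_gap = abs(
    p_gamma(0.2, 0.1, beta) - p_select(effective_frequency(0.2, 0.1), beta)
)
```

Both gaps vanish up to rounding error, which confirms that Eq. (3) follows from substituting Eq. (1) into Eq. (2).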
D. Analysis: Bifurcation and latency time
The dynamics outlined above (Fig. 4b) is equivalent to a random walk on the segment $[0, 1]$ with a reflecting boundary at 0 and an absorbing one at 1, and with steps of size $1/M$. The probability of going forward at site $x$ is equal to $(1 - x) P_\gamma(x)$, and the probability of going backward to $x (1 - P_\gamma(x))$.
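The random walk just described can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the simulation code behind the reported results: the function names, the seed, the small memory size $M = 50$ and the value $\gamma = 1$ (chosen far above threshold so that the run ends quickly) are all assumptions of the example.

```python
import math
import random

def p_gamma(x, gamma, beta):
    """Eq. (3), with the boundary limits set by hand."""
    if x >= 1.0:
        return 1.0
    if x + gamma <= 0.0:
        return 0.0
    arg = beta * (2.0 * x - 1.0 + gamma) / math.sqrt((x + gamma) * (1.0 - x))
    return 0.5 * (1.0 + math.tanh(arg))

def simulate(M, gamma, beta, max_steps, seed=0):
    """Run the production/erasure process until X fills the memory.

    Returns the list of visited frequencies x = N / M, starting at 0.
    """
    rng = random.Random(seed)
    n = 0  # number of X occurrences stored for context C_1
    trajectory = [0.0]
    for _ in range(max_steps):
        if n == M:  # absorbing boundary at x = 1
            break
        x = n / M
        produced_is_x = rng.random() < p_gamma(x, gamma, beta)
        erased_is_x = rng.random() < x  # uniform choice among stored items
        n += int(produced_is_x) - int(erased_is_x)
        trajectory.append(n / M)
    return trajectory

traj = simulate(M=50, gamma=1.0, beta=0.808, max_steps=200_000, seed=1)
```

Producing $X$ while erasing $Y$ happens with probability $(1 - x) P_\gamma(x)$ and the reverse with probability $x (1 - P_\gamma(x))$, which matches the jump probabilities given above; at $x = 0$ no $X$ can be erased, so the lower boundary is reflecting without any special handling.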
For large $M$, a continuous, deterministic approximation of this random walk leads, after a rescaling of the time $Mt \to t$, to the first-order differential equation for $x(t)$:
$$ \dot{x} = P_\gamma(x) - x . \tag{4} $$
This dynamics admits either one or three fixed points (Fig. 5a), $x = 1$ always being one. Below a threshold value $\gamma_c$, which depends on the non-linearity parameter $\beta$, a saddle-node bifurcation occurs and two other fixed points appear. The system, starting from $x = 0$, is stuck at the smallest stable fixed point. The transmission time, i.e. the time required for the system to go from 0 to 1, is therefore infinite (Fig. 5b). Above the threshold value $\gamma_c$, only the fixed point $x = 1$ remains, so that the new variant eventually takes over the context for which it is competing. Our model thus describes how the strengthening of a cognitive link can trigger a semantic expansion process.
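The threshold can be located numerically by asking for which $\gamma$ the deterministic speed of Eq. (4) stops changing sign on $[0, 1)$. The sketch below does this by bisection on a grid; the function names, the grid resolution, the bracket $[10^{-3}, 1]$ and the tolerance are assumptions of the example, and the returned value is only a grid-based approximation of $\gamma_c$.

```python
import math

def p_gamma(x, gamma, beta):
    """Eq. (3), with the boundary limits set by hand."""
    if x >= 1.0:
        return 1.0
    if x + gamma <= 0.0:
        return 0.0
    arg = beta * (2.0 * x - 1.0 + gamma) / math.sqrt((x + gamma) * (1.0 - x))
    return 0.5 * (1.0 + math.tanh(arg))

def min_speed(gamma, beta, grid=2000):
    """Smallest value of P_gamma(x) - x over the grid x = k / grid, k < grid."""
    return min(p_gamma(k / grid, gamma, beta) - k / grid for k in range(grid))

def critical_gamma(beta, lo=1e-3, hi=1.0, tol=1e-9):
    """Bisection on gamma: below threshold the speed vanishes somewhere."""
    if not (min_speed(lo, beta) < 0.0 < min_speed(hi, beta)):
        raise ValueError("bracket does not straddle the threshold")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if min_speed(mid, beta) < 0.0:
            lo = mid  # speed changes sign: extra fixed points exist
        else:
            hi = mid  # speed positive everywhere: only x = 1 remains
    return 0.5 * (lo + hi)

gamma_c = critical_gamma(beta=0.808)
```

Bisection is legitimate here because $P_\gamma(x)$ increases with $\gamma$ at fixed $x$, so the minimum speed is a non-decreasing function of $\gamma$.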
Slightly above the transition, a stranglehold region appears where the speed almost vanishes. Accordingly, the time spent in this region diverges as the transition is approached. The frequency of the new variant will stick to low values for a long time, in a way similar to the latent behavior evidenced by our dataset. This latency time in the process of change can thus be understood as a near-critical slowing down of the underlying dynamics.
FIG. 4. (a) Difference between absolute frequency $x$ and effective frequency $f$ in context $C_1$. Absolute frequency $x$ is given by the ratio of $X$ occurrences encoded in $C_1$. Effective frequency $f$ also takes into account the $M$ occurrences contained in the influential context $C_0$, with a weight $\gamma$ standing for the strength of this influence. (b) Schematic view of the process. At each iteration, either $X$ or $Y$ is chosen to be produced and thus encoded in memory, with respective probability $P_\gamma(x)$ and $1 - P_\gamma(x)$; the produced occurrence is here represented in the purple capsule. Another occurrence, already encoded in the memory, is uniformly chosen to be erased (red circle) so as to keep the population size constant. Hence the number of $X$ occurrences, $N^X$, either increases by 1 if $X$ is produced and $Y$ erased, decreases by 1 if $Y$ is produced and $X$ erased, or remains constant if the erased occurrence is the same as the one produced.
Beyond this deterministic approximation, there is no longer a clear-cut transition (Fig. 5b) and the above explanation needs to be refined. The deterministic speed can be understood as a drift velocity of the Brownian motion on the $[0, 1]$ segment, so that in the region where the speed vanishes, the system does not move on average. In this region of vanishing drift, the frequency fluctuates over a small set of values and does not evolve significantly over time. Once it escapes this region, the drift velocity drives the process again, and the replacement process takes off. Latency time can thus be understood as a first-passage time out of a trapping region.
III. NUMERICAL RESULTS
A. Model simulations
We ran numerical simulations of the process described above (Fig. 4b), with the following choice of parameters: $\beta = 0.808$, $\delta = 0.0$ and $M = 5000$, where $\delta = (\gamma - \gamma_c)/\gamma_c$ is the distance to the threshold. The specific value of $\beta$ corresponds to a maximization of $x_c$, the frequency value at which the system gets stuck. It reflects the assumption that the linguistic system should allow for synonymic variation in the situation where no replacement takes place. We chose $\delta = 0.0$ in order for the system to be purely diffusive in the vicinity of $x_c$. The choice of $M$ is arbitrary.
FIG. 5. (a) Speed $\dot{x}$ of the deterministic process for each of the sites, for different values of $\beta$ and $\delta = (\gamma - \gamma_c)/\gamma_c$, the distance to threshold. Depending on the sign of $\delta$, there are either one or three fixed points. (b) Inverse transmission time (time required for the system to go from 0 to 1), for the deterministic process (blue dotted line), and for the averaged stochastic process (green line), as a function of the control parameter $\delta$. Deterministic transmission time diverges at the transition while averaged stochastic transmission time remains finite.
From the model simulations, data is extracted and analyzed in two parallel ways. On the one hand, simulations provide surrogate data: We can mimic the corpus data analysis and count how many tokens of the new variant are produced in a given timespan (set equal to $M$), to be compared with the total number of tokens produced in this timespan. We then extract ‘empirical’ latency and growth times (Fig. 6a), applying the same procedure as for the corpus data.
On the other hand, for each run we track the position of the walker, which is the frequency $x(t)$ achieved by the new variant at time $t$. This allows us to compute first-passage times. We then alternatively compute analytical latency and growth times (‘analytical’ to distinguish them from the former ‘empirical’ times) as follows. Latency time is here defined as the difference between the first-passage times at the exit and the entrance of a ‘trap’ region (see Appendix C for additional details). Analytical growth time is defined as the remaining time of the process once this exit has been reached. Their distributions over 10,000 runs of the process are fitted with an Inverse Gaussian distribution, which would be the expected distribution if the jump probabilities were homogeneous over the corresponding regions (an approximation better suited to the latency time than to the growth time).
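The ‘analytical’ times defined above can be read off a single trajectory once the trap boundaries are fixed. The sketch below is a minimal illustration: the function names are hypothetical, the trap boundaries are free parameters (their actual determination is the subject of Appendix C and is not reproduced here), and the toy trajectory is invented for the example.

```python
def first_passage_index(trajectory, level):
    """Index of the first entry of trajectory that is >= level, or None."""
    for step, x in enumerate(trajectory):
        if x >= level:
            return step
    return None

def latency_and_growth(trajectory, trap_entrance, trap_exit):
    """Latency and growth times, in elementary steps, for one run.

    Latency: first passage at trap_exit minus first passage at trap_entrance.
    Growth: number of steps remaining after the first passage at trap_exit.
    Returns None if either boundary is never reached.
    """
    t_in = first_passage_index(trajectory, trap_entrance)
    t_out = first_passage_index(trajectory, trap_exit)
    if t_in is None or t_out is None:
        return None
    latency = t_out - t_in
    growth = (len(trajectory) - 1) - t_out
    return latency, growth

# Toy trajectory ending at the absorbing state x = 1.
toy = [0.0, 0.1, 0.2, 0.1, 0.2, 0.3, 0.5, 0.8, 1.0]
result = latency_and_growth(toy, trap_entrance=0.1, trap_exit=0.3)
```

For this toy trajectory the entrance is first reached at step 1 and the exit at step 5, giving a latency of 4 steps and a growth of 3 steps; applied to many simulated runs, the same two numbers would populate the histograms to be fitted.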
Figure 6b shows the remarkable agreement between the ‘empirical’ and ‘analytical’ approaches, together with the quality of the fits with the Inverse Gaussian distribution. Crucially, those two macroscopic phenomena, latency and growth, are thus to be understood as being of the same nature, which explains why their statistical distributions must be of the same kind. Furthermore, the boundaries of the trap region leading to the best correspondence between first-passage times and empirically determined latency and growth times are meaningful, as they correspond to the region where the uncertainty on the transmission time significantly decreases (Fig. 6c).
B. Confrontation with corpus data
Our model predicts that both latency and growth times should be governed by the same kind of statistics, the Inverse Gaussian being a suitable approximation of those. The Inverse Gaussian distribution is governed by two parameters, its mean $\mu$ and a parameter $\lambda$ given by the ratio $\mu^3/\sigma^2$, $\sigma^2$ being the variance. We fit the empirical histograms with an Inverse Gaussian distribution whose parameters are given by the empirical mean and variance of the relevant quantities. We find a good agreement for both the latency and the growth times (Fig. 7).
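The moment-based fit described above amounts to two lines of arithmetic followed by the evaluation of the standard Inverse Gaussian density. The sketch below is illustrative only: the function names and the sample durations are invented, and the use of the population variance (rather than the unbiased sample variance) is an assumption, since the text does not specify which estimator was used.

```python
import math
import statistics

def inverse_gaussian_params(samples):
    """Moment estimates: mu is the mean, lambda = mu**3 / variance."""
    mu = statistics.mean(samples)
    var = statistics.pvariance(samples, mu)  # population variance (assumed)
    return mu, mu ** 3 / var

def inverse_gaussian_pdf(t, mu, lam):
    """Inverse Gaussian density with mean mu and shape parameter lam."""
    if t <= 0.0:
        return 0.0
    prefactor = math.sqrt(lam / (2.0 * math.pi * t ** 3))
    return prefactor * math.exp(-lam * (t - mu) ** 2 / (2.0 * mu ** 2 * t))

# Hypothetical durations in decades, invented for the example.
durations = [6, 8, 10, 12, 14]
mu, lam = inverse_gaussian_params(durations)

# Sanity check by numerical integration on a fine grid.
step = 0.01
grid = [k * step for k in range(1, 10001)]
total_mass = sum(inverse_gaussian_pdf(t, mu, lam) * step for t in grid)
mean_check = sum(t * inverse_gaussian_pdf(t, mu, lam) * step for t in grid)
```

The density vanishes as $t \to 0$, in contrast with an exponential law, so very short durations are predicted to be rare.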
Although there are short growth times in the frequency patterns of the forms we studied, below six decades they are not described by enough data points to assess reliably the specificity of the sigmoid fit. On the histogram there is therefore no data for these growth times. This issue is further discussed in Appendix D. However, the distribution must decrease when growth time approaches 0 (notably an exponential fit is to be ruled out); otherwise, instantaneous changes would be far too numerous, so that language would be completely unstable. The decrease predicted by the Inverse Gaussian is realistic in this respect.
The main quantitative features extracted from the dataset are thus correctly mirrored by the behavior of our model. We also confronted the model with the data on other quantities, such as the correlation between growth time and latency time. There again, the model proves to match quantitative aspects of semantic expansion processes appropriately (see Appendix E).
IV. DISCUSSION
Based on a corpus-based analysis of frequency of use, we have uncovered two robust stylized facts of semantic change: an S-curve of frequency growth, preceded by a latency period where the semantic change has already taken place while the frequency remains low. We have proposed a model predicting that these two features, albeit qualitatively quite different, are two aspects of one and the same phenomenon.
The hypotheses on which this model rests are well grounded in claims from Cognitive Linguistics: Language is resilient to change (non-linearity of the $P$ function); language users have cognitive limitations; the semantic territory is organized as a network whose neighboring sites influence each other asymmetrically. The overall agreement with empirical data tends to suggest that language change may indeed be cognitively driven by semantic bridges of different kinds between the concepts of the mind, and constrained by the mnemonic limitations of this very same mind. We note that our model may however be given a different, purely sociolinguistic, interpretation: this, together with the limits of such a viewpoint, is discussed in Appendix B 5.
According to our model, the onset of change depends on the strength of the conceptual link between the source context and the target context: If the link is strong enough, that is, above a given threshold, it serves as a channel so that a form can ‘invade’ the target context and then oust the previously established form. In a sense, the sole existence of this cognitive mapping is already a semantic expansion of some sort, yet not necessarily translated into linguistic use. Latency is specifically understood as resulting from a near-critical behavior: If the link is barely strong enough for the change to take off, then the channel becomes extremely tight and the invasion process slows down drastically. These narrow channels are likely to be found between lexical and grammatical meanings [28, 29]. This would explain why the latency-growth pattern is much more prominent in the processes of grammaticalization, positing latency as a phenomenological hint of this latter category.
FIG. 6. (a) Time evolution of the frequency of produced occurrences (output of a single run). Growth part and latency part are shown respectively in blue and red. The logit transform (with linear fit) of the growth is shown in the inset. (b) Distribution of latency time (top) and growth time (bottom) over 10k processes, extracted from an empirical approach (blue wide histogram) and a first-passage time one (magenta thin histogram), with their respective Inverse Gaussian fits (in red: Empirical approach; in green: First-passage time approach). (c) Uncertainty on the transmission time given the position of the walker. The entrance and the exit of the trap are shown, respectively, by green and magenta lines. The trap corresponds to the region where the uncertainty drops from a high value to a low value.
Finally, we argue that our results, though grounded on instances of semantic expansion in French, apply to semantic expansion in general. The time period covered is long enough (700 years) to exclude the possibility that our results are ascribable to a specific historical, sociological, or cultural context. The French language itself has evolved, so that Middle French and contemporary French could be considered as two different languages, yet our analysis applies to both indistinctly. Besides, the latency-growth pattern is to be found in other languages; for instance, Google Ngram queries for constructions such as way too, save for, no matter what, yield qualitative frequency profiles consistent with our claims. Our model also tends to confirm the genericity of this pattern, as it relies on cognitive mechanisms whose universality has been well evidenced [30].
FIG. 7. Inverse Gaussian fit of the latency times (left) and the growth times (right) extracted from corpus data. Parameters are computed from the mean and the variance of the data. Data points are shown by a blue histogram, the Inverse Gaussian fit being represented as red dots. The discrepancy observed for six decades is discussed in Appendix D.
V. MATERIALS AND METHODS
We worked on the Frantext corpus [20], which in 2016 contained, for the chosen time range, 4674 texts and 232 million words. More details are given in Appendix A. It would have been tempting to make use of the large database Google Ngram, yet it was not deemed appropriate for our study, as we explain in Appendix F.
We studied changes in frequency of use for nearly 400 instances of semantic expansion processes in French, over a time range from 1321 to the present day. See Appendix G for a complete list of the studied forms.
A. Extracting patterns from corpus data
a. Measuring frequencies. We divided our corpus into 70 decades. Then, for each form, we recorded the number of occurrences per decade, dividing this number by the total number of occurrences in the database for that decade. The resulting number is called here the frequency of the form for the decade, and is denoted $x_i$ for decade $i$. In order to smooth the obtained data, we replaced $x_i$ by a moving average, that is, for $i \geq i_0 + 4$, $i_0$ being the first decade of our corpus: $x_i \leftarrow \frac{1}{5} \sum_{k=i-4}^{i}$