Memory cannot be understood, either, without a mathematical approach. The fundamental given is the ratio between the amount of time in the lived life and the amount of time from that life that is stored in memory. No one has ever tried to calculate this ratio, and in fact there exists no technique for doing so; yet without much risk of error I could assume that the memory retains no more than a millionth, a hundred-millionth, in short an utterly infinitesimal bit of the lived life. That fact too is part of the essence of man. If someone could retain in his memory everything he had experienced, if he could at any time call up any fragment of his past, he would be nothing like human beings: neither his loves nor his friendships nor his angers nor his capacity to forgive or avenge would resemble ours.
— Milan Kundera, Ignorance (2000)
Human memory is fragile. We’ve probably already forgotten what we ate last Christmas Eve, let alone the complicated reaction equations we learned long ago in chemistry class. A couple together for years can struggle to remember the passion ignited at the beginning of their love story. Does all this mean we should blame the long and tedious passage of time for our poor memory? If so, how do we excuse the unfortunate man who is now in the supermarket and has once again forgotten to bring his family’s grocery list? His exam is to recall as many items as possible before driving back home with things missing. If he fails, a fight will surely await him at home. Unreliable memory, that cutely annoying defect of mankind, is responsible for so much forgetting and forgiveness in everyday trivia, and likely accounts for countless regressions and recurrences throughout history.
Scientists still know very little about our memory. There are different types (e.g., sensory, short-term, and long-term) and many aspects (e.g., formation, storage, and retrieval) of memory one can wonder about, but, to put it modestly, most of them remain in a thick fog of ignorance. A widely accepted postulate is that the learning process changes the weights of synaptic connections among neurons in the brain, and those modified connections may encode our memories (Kandel, 2001). At the molecular and cellular level, the biochemical mechanisms underlying synaptic plasticity are still being pieced together, a giant jigsaw puzzle with many pieces found but not nearly enough (Kern et al., 2015; Harward et al., 2016; Ford et al., 2019). At the level of neural systems, many elegant models and theories have been proposed (Hopfield, 1982; Seung, 1996; Goldman, 2009), but they are not easy to test against biological evidence (connectomics is bringing some hope). At the behavioral and cognitive level, human memory properties are extremely variable across experimental conditions (Waugh, 1967; Standing, 1973; Howard and Kahana, 1999), which makes quantitative descriptions hard to come by.
No wonder Kundera complained in his novel about our limited knowledge of memory, especially the absence of a good mathematical understanding. One ultimate challenge for the study of memory is to derive law-like principles comparable in precision and generality to Newton’s laws. Is this possible? I think today’s psychology and neuroscience lean toward an affirmative answer. At the very least, there are some old and recent scientific stories that I believe both a sympathetic novelist and a forgetful husband would be delighted to hear.
Kundera’s ratio
What Kundera argued to be important was the ratio between the number of items retained in memory (\(M\)) and the number of items presented (\(S\)). Here I stealthily convert his measure of continuous time into counts of discrete items for simplicity. He suspected that this ratio was infinitesimal for a lived life. Was he correct?
Intuitively, we can do a quick check based on our own experience. Childhood was the heyday of our memory. Children observe the world carefully and seem able to remember almost everything that happens around them. The pattern on the bedsheets, the names of toys, the story grandpa told after last Saturday’s dinner: none of these would be too difficult for a child. Kundera’s ratio is therefore conceivably high for children, not only because they are curious observers of the world but also because they have just started learning things, so the ratio’s denominator is small. As we grow up, however, we perceive a more complex world, and soon the total number of things presented (\(S\)) far surpasses the total number of things we remember (\(M\)). The ratio proposed by Kundera is thus unlikely to be a constant; it decreases rapidly as the lived life accumulates. Retained memory grows sublinearly with the number of things presented, so the ratio can indeed become small over a normal life span.
Kundera said that no one had ever tried to calculate this ratio and that no technique existed for doing so. In fact, some attempts to calculate this ratio were made in the 1960s by Nickerson (1965) and Shepard (1967) for short-term memory, and were extended to long-term memory by Nickerson (1968). Nickerson showed 612 pictures to subjects and then asked them to select the correct pictures in two-alternative recognition tests. The accuracy was quite high, at 98%. These results seem to contradict Kundera’s guess. However, sublinear retention growth is supported by an influential study (Standing, 1973). Kundera might have missed these interesting studies in 2000, when Google Scholar had not yet been invented.
Standing’s scaling law
In a 1973 study, Lionel Standing designed a delayed recognition task to estimate how many things can be retained in our long-term memory. Subjects in groups first observed different sets of stimuli with strict concentration in a single-trial learning phase. The stimuli were drawn from a population of 11,000 normal pictures, 1,200 vivid pictures, and 25,000 of the most common English words from the Merriam-Webster dictionary. Two days later, subjects took a recognition test: in each trial, two stimuli were presented side by side, one taken from the learning set and the other chosen uniformly at random from the rest of the whole population. Subjects were forced to write down an “L” or “R” to indicate which stimulus was more familiar to them. The estimated number of items in memory (\(M\)) was then calculated as
\[M = S(1-2E/T),\]where \(E\) is the number of errors, \(T\) is the total number of trials, and \(S\) is the size of the learning set. What is the rationale behind this calculation? If the shown correct stimulus is in a subject’s memory, he or she must answer correctly, while if the shown correct stimulus is not in a subject’s memory, he or she would randomly choose one, and the error rate, in this case, is 0.5. Therefore the total error rate should be
\[E/T = 0.5\times(1-M/S).\]Rearranging the above equation gives the estimate of \(M\).
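As a quick illustration, here is a minimal sketch of this estimator in Python; the numbers in the example are made up for illustration, not Standing’s actual data.

```python
def retained_items(S, errors, trials):
    """Estimate the number of items retained in memory, M = S * (1 - 2E/T),
    from a two-alternative forced-choice recognition test (chance-corrected)."""
    return S * (1 - 2 * errors / trials)

# Hypothetical example: a learning set of 1,000 pictures and a recognition
# test of 200 trials with 12 errors (6% error rate).
print(retained_items(S=1000, errors=12, trials=200))  # -> 880.0
```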
With the experimental data, Standing found that picture memory (◾ and ◯) was better than verbal memory (▼). When the size of the learning set was 1,000, the accuracy for word recognition was only 62%, but for vivid images it was 88%. He also found a power-law relationship between the number of items presented for learning (\(S\)) and the number of items retained in memory (\(M\)). As shown in the figure below, all three stimulus populations yielded straight lines when plotting \(\log M\) against \(\log S\), with Pearson correlations above 0.99.
Figure 1 in Standing (1973). Number of items retained in memory (\(M\)), as a function of number presented for learning (\(S\)). The diagonal dashed line indicates perfect memory. ◾ vivid pictures (\(\log M \approx 0.97\log S + 0.04\)); ◯ normal pictures (\(\log M\approx 0.93 \log S + 0.08\)); ▼ words (\(\log M \approx 0.92\log S-0.01\)). Note that the coordinates are log-log.
The straight lines in the log-log plot suggest a general mathematical form for retained memory,
\[M = a S^{\beta},\]where \(a >0, 0 <\beta \leq 1\). The exponent \(\beta\) indicates how fast the retained memory grows as the number of presented items increases; it is the slope of the corresponding line in the log-log plot (\(\log M = \beta \log S + \log a\)). When \(\beta=1\), the ratio \(M/S\) is constant. The \(\beta\) for normal picture memory is about 0.93 in Standing’s experiment. This implies that, for Kundera’s ratio to be less than a millionth, the learning set would have to contain ~\(10^{86}\) stimuli. That sounds improbable, as there are only about \(3.2 \times 10^9\) seconds in a 100-year life. If we use the same \(\beta\) and assume that 1) the time scale of memorization is on the order of seconds, and 2) successfully formed memories are never erased, then Kundera’s ratio for a 100-year life span should be about 23%. Our brain seems far better at retaining things in memory than Kundera’s wild estimate suggests.
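To see roughly where these numbers come from, here is a back-of-the-envelope check in Python. It assumes the fitted exponent \(\beta \approx 0.93\) for normal pictures and, for simplicity, a coefficient \(a \approx 1\), so its output differs slightly from the figures quoted above.

```python
import math

beta = 0.93   # exponent for normal pictures (Standing, 1973)
a = 1.0       # coefficient, taken as ~1 for a rough estimate

# Kundera's ratio is M/S = a * S**(beta - 1).
# 1) How large must S be for the ratio to drop below one millionth?
S_min = (1e-6 / a) ** (1 / (beta - 1))
print(f"S needed for ratio < 1e-6: ~1e{math.log10(S_min):.0f}")  # ~1e86

# 2) The ratio over a 100-year life, assuming ~1 item per second
#    and no erasure of formed memories.
S_life = 100 * 365.25 * 24 * 3600   # ~3.2e9 seconds
ratio = a * S_life ** (beta - 1)
print(f"Kundera's ratio over 100 years: {ratio:.0%}")  # ~22%, roughly the 23% quoted above
```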
Recognition and free recall
Standing’s scaling law sets a high expectation for the man who forgot to bring the grocery list to the supermarket. Suppose there were 20 items on the list. The expected number of items retained in his memory should be roughly \(20^{0.92}\approx16\). Why, then, was he so frustrated, unable to remember any of them? Can we say those memories had already been erased? Or were the memories still there, and he just couldn’t find a way to recall them all?
I suspect it’s the latter. If the man accidentally saw an item he had planned to buy, he would probably recognize it on the shelf without much thinking. It is like those embarrassing moments when we suddenly forget the name of a friend; as soon as someone mentions it, we shout out, “Aha! I knew her!” Recognizing something we ought to remember is essentially different from cue-less recall, and the latter is intuitively harder. In a recognition test, a hint is given to trigger the correct memory. In a free recall task, however, no external hint is given. One has to exhaustively search one’s memory palace to find the correct answer.
Figure 2 in Standing (1973). Number of items retained in memory or successfully recalled (\(M\)), as a function of the number presented for learning (\(S\)). □ Nickerson 1965 (pictures, recognition); △ Shepard 1967 (pictures, recognition); ◾ Shepard 1967 (words, recognition); ▼ Binet and Henri 1894 (words, recall); ◯ Murdock 1960 (words, recall). The coordinates are log-log.
Our impression that recognition performance exceeds recall is supported by experimental data. In the figure above, Standing summarized free recall results from Binet and Henri (1894) and Murdock (1960) (the two lines with smaller slopes) and compared them to the earlier recognition results. All data lie on straight lines in the log-log plot with Pearson correlations above 0.99, which implies the power law holds for both recognition and recall. The number of items successfully recalled (\(R\)) can be expressed as
\[R = c S^{\gamma},\]where \(c >0, 0 <\gamma \leq 1\). The exponent \(\gamma\) for recall is lower than the exponent \(\beta\) for the actual retained memory estimated by recognition tests, which implies that sometimes we cannot remember a thing not because we have forgotten it completely, but because we are bad at recalling what is retained in memory. Another important deduction is that the number of recalled items is also a power function of the number of items retained in memory,
\[R = dM^{\theta},\]where \(d = ca^{-\theta}\) and \(\theta = \gamma/\beta\). Even though we are good at storing things in memory, with \(\beta \approx 0.93\), if retrieval is less efficient, say \(\theta \approx 0.5\), the ratio between the number of things we can recall and the number of things we have experienced soon falls toward a millionth over a 100-year lifespan, close to Kundera’s guess.
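For completeness, the composed power law follows from a one-line substitution: inverting \(M = aS^{\beta}\) and plugging it into \(R = cS^{\gamma}\),
\[S = (M/a)^{1/\beta} \quad\Longrightarrow\quad R = c\,(M/a)^{\gamma/\beta} = \left(c\,a^{-\gamma/\beta}\right)M^{\gamma/\beta} = d\,M^{\theta}.\]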
Fundamental law of memory recall
There are classic theoretical models that fit the above empirical data on memory retrieval well. One older theory is called probabilistic search of associative memory (SAM) (Raaijmakers and Shiffrin, 1980; 1981), which assumes all presented stimuli are stored in memory buffers and models the recall process as a series of probabilistic sampling attempts of memory items until a certain number of failed attempts is reached. Another influential theory is the temporal context model (TCM) (Howard and Kahana, 2002; Polyn, Norman, & Kahana, 2009), which further considers a time-dependent context to reproduce the end-of-list recency and lag-recency effects of free recall (Murdock, 1962; Kahana, 1996). However, these models have many parameters that must be tuned for each size of the learning set (\(S\)) and are too complex to reveal the fundamental neural network mechanism of memory retrieval (Grossberg and Pearson, 2008). A series of recent works by Sandro Romani, Mikhail Katkov, Misha Tsodyks, and their collaborators (Romani et al., 2013; Recanatesi, 2015; Katkov et al., 2017; Naim et al., 2020) proposes a simple neural network model of memory recall that yields a scaling law similar to Standing (1973) without free parameters. According to their theory, there exists a fundamental relationship between the number of recalled items (\(R\)) and the number of items retained in memory (\(M\)), which can be expressed as
\[R \approx \sqrt{3\pi/2} \cdot M^{1/2}.\]This is an extremely simple formula derived from pure mathematical deduction, yet it fits experimental data magically well. In Naim et al. (2020), \(M\) was estimated from recognition tasks and \(R\) from free recall tasks similar to Standing (1973). The stimuli included a collection of 751 English words and a collection of 325 short sentences expressing common-knowledge facts (e.g., “Earth is round”). All the experimental data support the theoretical prediction.
Figure 2 in Naim et al. (2020). The average number of words (sentences) recalled as a function of the average number of words (sentences) retained in memory. Black line: theoretical prediction (\(R =\sqrt{3\pi M/2}\)). Green line: experimental results for a presentation rate of 1 word/sec. Yellow line: experimental results for a presentation rate of 1.5 words/sec. Blue line: experimental results for short sentences of common-knowledge facts.
The theory not only reproduces the scaling law empirically summarized by psychologists, but also tells us about the parameters of that law. Where does the strange constant \(d = \sqrt{3\pi/2}\approx 2.17\) come from? And how do we understand the exponent \(\theta = 0.5\)? Through the rest of this article, we will gain some mathematical insight.
Associative memory networks
The proposed theory is based on two principles (Katkov, Romani, & Tsodyks, 2017):
- [Encoding Principle] - An item is encoded in the brain by a specific group of neurons in a dedicated memory network. When an item is recalled, either spontaneously or when triggered by an external cue, this specific group of neurons is activated.
- [Associativity Principle] - In the absence of sensory cues, a retrieved item plays the role of an internal cue that triggers the retrieval of the next item.
The first principle, encoding, can be realized by the classic associative memory networks (Hopfield, 1982), where the items retained in memory are encoded by binary activity patterns of \(N\) neurons in a specific network, e.g.,
\[\boldsymbol \xi^{\mu} = \underbrace{1000101\cdots001}_{N \text{ neurons}},\]where 1 indicates an active neuron and 0 a silent neuron. Neurons in the network are interconnected in such a way that when only part of the neurons are activated by a noisy external cue, the network activity soon converges to the most similar stored memory. These stored memories are also called attractors of the network.
An illustration of memory retrieval from my slides. An image “T” encoded by binary neural activity is stored in an associative memory model. Each pixel here indicates a neuron. When the external cue, a corrupted “T”, is given, the network is able to restore the correct image. See my slides to quickly recap the classic idea of Hopfield nets. The slides also introduce recent advances (Krotov and Hopfield, 2016; Krotov and Hopfield, 2020) to increase the memory capacity of Hopfield nets from \(\mathcal O(N)\) to \(\mathcal O(e^N)\).
The original Hopfield nets can only reliably store uncorrelated random memories in which around half of the neurons are active. However, as neurophysiological data indicate, neural activity in the brain is usually very sparse. This means the fraction of active neurons in each memory, \(f\), should be very small. In this case, we need to modify the synaptic connectivity as follows (Tsodyks & Feigel’Man, 1988):
\[T_{ij} = \frac{1}{Nf(1-f)}\sum_{\mu=1}^{M}(\xi_i^\mu - f)(\xi_j^\mu - f),\]with \(T_{ii} = 0\), where \(T_{ij}\) denotes the connection from neuron \(j\) to neuron \(i\), \(M\) is the total number of items retained in memory, and \(N\) is the total number of neurons. When \(f \ll 1\), neurons \(i\) and \(j\) are strongly connected if they are both active in some memory. This means that if we only look at the strong connections in the associative memory network, the stored memories form \(M\) (potentially overlapping) cliques.
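Here is a minimal numpy sketch of this connectivity rule; the values of \(N\), \(M\), and \(f\) below are my own small choices for illustration, not taken from any of the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, f = 2000, 20, 0.05             # neurons, stored items, sparseness (illustrative)

# Sparse binary memory patterns: xi[mu, i] ~ Bernoulli(f), independently.
xi = (rng.random((M, N)) < f).astype(float)

# Covariance-rule connectivity (Tsodyks & Feigel'man, 1988), no self-connections.
T = (xi - f).T @ (xi - f) / (N * f * (1 - f))
np.fill_diagonal(T, 0.0)

# Sanity check: pairs of neurons that are co-active in at least one memory
# are much more strongly connected than pairs that never fire together.
coactive = (xi.T @ xi) > 0
np.fill_diagonal(coactive, False)
print(T[coactive].mean(), T[~coactive].mean())
```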
The neural activity dynamics \(\sigma_i(t)\in \{0, 1\}\) is given by (Romani et al., 2013)
\[\sigma_i(t+1) = \Theta\left(\sum_{j=1}^N T_{ij}\sigma_j(t) - \frac{J}{Nf}\sum_{j=1}^{N}\sigma_j(t) - \theta_{i}(t)\right),\]where \(\Theta(\cdot)\) is the Heaviside function, \(J>0\) is the strength of global feedback inhibition that controls the overall level of activity, and \(\theta_i(t) \in [-\kappa, \kappa]\) is the threshold for the neuron \(i\). Consider the energy function,
\[\begin{aligned}H(\boldsymbol \sigma) &:= -\frac{1}{2}\sum_{i,j} \left(T_{ij} - \frac{J}{Nf}\right) \sigma_i\sigma_j + \sum_{i}\theta_i\sigma_i \\ & =-\frac{1}{2Nf(1-f)}\sum_{\mu=1}^{M}\left(\sum_{i} \left(\xi_i^{\mu} - f\right)\sigma_i \right)^2 \\&~~~~~+\frac{J}{2Nf}\sum_{i,j}\sigma_i\sigma_j+ \sum_i\left(\theta_i + \frac{\sum_{\mu}(\xi^{\mu}_i -f)^2}{2Nf(1-f)}\right)\sigma_i,\end{aligned}\]where the first quadratic term contains the dot products between the stored memories \(\boldsymbol \xi^{\mu}-f\) and the current neural activity \(\boldsymbol \sigma\), while the second and third terms become large when \(\boldsymbol \sigma\) is dense.
\[\Delta H = - \left(\sum_{j=1}^{N}T_{ij}\sigma_j - \frac{J}{Nf} \sum_{j=1}^N\sigma_j- \theta_i\right)\Delta\sigma_i.\]\(\Delta H < 0\) if and only if \(\Delta \sigma_i\) has the same sign as \(\sum_{j=1}^{N}T_{ij}\sigma_j - \frac{J}{Nf} \sum_{j=1}^N\sigma_j- \theta_i\), which is exactly what the neural activity dynamics ensures. Therefore, the local minima of \(H(\boldsymbol \sigma)\) are steady activities. When \(\boldsymbol \sigma(t)\) is dense, the global inhibition is strong, which turns off many neurons and encourages sparseness. When \(M\) is below some capacity limit (so that the memories \(\boldsymbol \xi^{\mu\in\{1,\dots,M\}}\) are unlikely to “interfere” with each other), and for suitably small \(J\) and \(\theta_i\), we can roughly read off from the Hamiltonian that a steady activity is one of the stored memories, i.e., \(\sigma_i = \xi^{\mu}_i\), which minimizes the first quadratic term of \(H\) while satisfying the sparsity constraint. In other words, under suitable conditions on the global inhibition strength \(J\) and the thresholds \(\theta_i\), a memory can be stably retrieved from a similar external cue.
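To make the dynamics concrete, here is a self-contained toy simulation of pattern completion in this sparse network. The parameter values are my own choices for illustration, and the thresholds \(\theta_i\) are held constant at zero, so this is only a sketch of the attractor behavior, not a reproduction of the papers’ simulations.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, f = 2000, 20, 0.05        # illustrative sizes, not from the papers
J, theta = 0.3, 0.0             # weak global inhibition, constant thresholds

# Sparse random memories and covariance-rule connectivity (as above).
xi = (rng.random((M, N)) < f).astype(float)
T = (xi - f).T @ (xi - f) / (N * f * (1 - f))
np.fill_diagonal(T, 0.0)

def step(sigma):
    """One synchronous update: sigma_i <- Theta(sum_j T_ij sigma_j
    - (J/(N f)) sum_j sigma_j - theta_i)."""
    h = T @ sigma - (J / (N * f)) * sigma.sum() - theta
    return (h > 0).astype(float)

# Cue the network with a corrupted memory 0: drop 40% of its active neurons.
sigma = xi[0] * (rng.random(N) > 0.4)
for _ in range(20):
    sigma = step(sigma)

overlap = (sigma * xi[0]).sum() / xi[0].sum()
print(f"overlap with the stored memory after retrieval: {overlap:.2f}")  # close to 1
```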
Memory transition via oscillatory inhibition
When there is no external cue, how can anything be recalled in an associative memory network? Intuitively, if two memories are similar, we should be able to associate one with the other. This is the second principle, associativity. The idea proposed in Romani et al. (2013) is to make the steady activity alternate between stored memory representations and the intersections of pairs of memory representations, so that the mental state can smoothly travel from one attractor to another without any external cue.
The steady neural activity can be controlled by the strength \(J\) of the global feedback inhibition. From the previous section, we know that for small \(J\) the memory representations are stable. Increasing \(J\) turns off more neurons, which may drive the neural activity towards the intersection of two memory representations.
When \(N\) is large and memory representations are sparse, i.e., \(\xi_i^{\mu}\sim \text{Bernoulli}(f)\) independently for all \(i\) and \(\mu\), where \(f \ll 1\), by the law of large numbers we have
\[\sum_{i=1}^{N}\xi_i^{\mu}\approx Nf, \text{ and }\sum_{i=1}^{N}\xi_i^{\mu}\xi_i^{\nu} \approx Nf^2 ~(\text{if } \mu\neq \nu).\]The condition for stable memory representations can be obtained by setting the initial activity as a memory \(\boldsymbol \xi^{\mu}\). By applying the above statistics, we get self-consistent equations for all \(i\),
\[\xi^{\mu}_i = \Theta\left(\left(\xi^{\mu}_i -f\right)-J-\theta_i(t)\right).\]Since \(\theta_i \in [-\kappa, \kappa]\), solving the above equation gives \(J\in (\kappa - f, 1-\kappa-f)\).
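To spell out the intermediate step (a sketch under the same large-\(N\), independent sparse-pattern assumptions): substituting \(\boldsymbol\sigma = \boldsymbol\xi^{\mu}\) into the dynamics, the cross-talk from the other memories vanishes and the input to neuron \(i\) reduces to
\[\sum_{j} T_{ij}\xi_j^{\mu} - \frac{J}{Nf}\sum_{j}\xi_j^{\mu} \approx \frac{(\xi_i^{\mu}-f)\cdot Nf(1-f)}{Nf(1-f)} - \frac{J}{Nf}\cdot Nf = (\xi_i^{\mu}-f) - J.\]Requiring \(\xi^{\mu}_i = 1 \Rightarrow (1-f)-J-\theta_i > 0\) and \(\xi^{\mu}_i = 0 \Rightarrow -f-J-\theta_i < 0\) for every \(\theta_i \in [-\kappa,\kappa]\) then yields \(\kappa - f < J < 1-\kappa-f\).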
Similarly, to obtain the condition for stable overlaps of two memory representations, we initialize the neural activity with \(\boldsymbol \sigma(0) = \boldsymbol \xi^{\mu}\odot \boldsymbol \xi^{\nu}\), where “\(\odot\)” denotes the element-wise product. This gives the set of self-consistent equations
\[\xi^{\mu}_i\xi^{\nu}_i = \Theta\left(\left(\xi^{\mu}_i -f\right)f + \left(\xi^{\nu}_i -f\right)f-Jf-\theta_i(t)\right).\]Thus, when \(J\in (1 - 2f + \kappa/f, 2-2f-\kappa/f)\), the overlap of the two memories is stable.
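Spelling out the threshold cases in the same way: neurons active in both memories require \(2(1-f)f - Jf - \theta_i > 0\), neurons active in only one of them require \((1-f)f - f^2 - Jf - \theta_i < 0\) (neurons active in neither give a weaker constraint), and demanding both for every \(\theta_i \in [-\kappa, \kappa]\) gives
\[1 - 2f + \frac{\kappa}{f} \;<\; J \;<\; 2 - 2f - \frac{\kappa}{f}.\]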
Figure 1 in Romani et al. (2013). (A) Dynamics of a network storing 16 items with periodically oscillatory inhibition and threshold adaptation. (B) The matrix of overlaps between memory representations.
Simulations show that periodically oscillating \(J\) between these two regimes can successfully generate a sequence of recalled items from memory. To prevent the neural activity from falling back to the memory that was just recalled, the thresholds \(\theta_i(t)\) adapt in an activity-dependent way. When an item is recalled in simulations, it typically triggers the subsequent retrieval of the item with which it shares the largest overlap of memory representations. The recall process can thus be interpreted as a series of associations among similar memories, which provides an interesting mathematical insight into the limits of our ability to recall things.
Birthday paradox of the recall model
The recall process realized by the above neural network model with periodic modulation of inhibition can be further abstracted and simplified as a walk among the stored memories. The mental state travels from one memory to the memory with which it has the most active neurons in common. If that most similar memory has just been recalled, it moves instead to the one with the second largest overlap.
The relative size of the memory overlap, or similarity score of two different memories \(\mu\) and \(\nu\), is measured by the dot product of their binary neuronal representations normalized by the total number of neurons,
\[s^{\mu \nu} := \frac{1}{N}\sum_{i=1}^{N}\xi^{\mu}_i\xi^{\nu}_i.\]The similarity scores are symmetric for any two distinct memories, so that they can be written as an \(M\times M\) symmetric similarity matrix. For convenience, here we don’t consider self-similarity and set all diagonal elements \(s^{\mu\mu}:=0\). Although similarities are symmetric, it is interesting to note that the induced “most similar” relationship is not necessarily symmetric. This is analogous to the disheartening fact that you have only a 1/2 chance of being the best friend of your best friend, even though the friendship is mutual. This probability is justified by
\[{\tt Pr}\left[s^{\nu\mu} = \max_{\omega}s^{\nu\omega} \big | s^{\mu\nu} = \max_{\omega}s^{\mu\omega}\right] ={\tt Pr}\left[\max_{\omega\neq\mu}s^{\nu\omega} \leq \max_{\omega}s^{\mu\omega}\right]\approx \frac{1}{2},\]where \(s^{\mu\nu} = s^{\nu\mu}\) and the number of friends (\(M\)) is large. We also need to assume a large \(N\) so that the similarity scores \(s^{\mu\nu}\) can be treated as taking continuous values, which ensures that the chance of two pairs of items having the same similarity score is essentially zero.
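This 1/2 is easy to check numerically. Below is a small Monte Carlo sketch using random symmetric matrices of i.i.d. Gaussian scores, which is an idealization of the actual sparse-pattern similarities but captures the same symmetry argument.

```python
import numpy as np

rng = np.random.default_rng(0)
M, trials = 100, 5000
mutual = 0
for _ in range(trials):
    s = rng.standard_normal((M, M))
    s = (s + s.T) / 2                      # symmetric similarity scores
    np.fill_diagonal(s, -np.inf)           # ignore self-similarity
    mu = 0                                 # pick an arbitrary item
    nu = int(np.argmax(s[mu]))             # its "most similar" item
    mutual += int(np.argmax(s[nu]) == mu)  # is the relation reciprocated?
print(mutual / trials)                     # ~0.5 for large M
```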
The asymmetric “most similar” relationship gives rise to a directed pathway connecting different items in memory: one can transit from item \(\mu\) to item \(\arg\max_\omega s^{\mu\omega}\) and so visit multiple items in a chain. In the example similarity matrix below, the item most similar to item 13 is item 1, while the item most similar to item 1 is item 14. Hence, 13 → 1 → 14 can be a sequence of recalled items. However, as discussed above, at each move there is a 1/2 chance of going back to the previously recalled item and being trapped in a two-item cycle. The expected number of recalled items is then always three, no matter how many items are stored (since each further step succeeds with probability 1/2, the expected count is \(2 + 1/2 + 1/4 + \cdots = 3\); note that the randomness here comes from the similarity matrix, while the transitions themselves are deterministic for large \(N\)). To prevent this issue, we allow the recall process to move to the “second most similar” item when the most similar one was visited just one step before. This modification results in a much longer path involving many more items. The recall process terminates when it begins to cycle over already visited items.
Modified from Figure 1 in Naim et al. (2020). Left: a symmetric similarity matrix among 16 items retained in memory. Warmer colors indicate higher similarity scores. Black circles and red circles indicate the highest and the second highest similarity scores in each row, respectively. Right: the recall process as a deterministic walk among items stored in memory. Black arrows indicate moving to the most similar item, and red arrows indicate moving to the second most similar item when the memory of the current item was just triggered by its most similar one. The recall path starting from item 1: 1 → 14 → 10 → 7 → 6 → 12 → 16 → 10 → 14 → 1 → 13 → 5 → 11 → 12 → 16 … (cycle).
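Here is a toy implementation of this deterministic walk over a random similarity matrix (again i.i.d. Gaussian scores as a stand-in for overlaps of sparse patterns; items are 0-indexed, and the stopping rule follows the description above: the walk halts once a transition repeats, i.e., once it has entered a cycle).

```python
import numpy as np

def recall_walk(s, start=0):
    """Deterministic recall walk on a symmetric similarity matrix s:
    go to the most similar item, or to the second most similar one if
    the most similar item was recalled on the previous step; stop when
    a (current, next) transition repeats (the walk is then in a cycle)."""
    prev, cur = -1, start
    seen_transitions, recalled = set(), [cur]
    while True:
        order = np.argsort(s[cur])[::-1]           # items by decreasing similarity
        nxt = int(order[0]) if order[0] != prev else int(order[1])
        if (cur, nxt) in seen_transitions:
            return recalled                        # unique items, in recall order
        seen_transitions.add((cur, nxt))
        if nxt not in recalled:
            recalled.append(nxt)
        prev, cur = cur, nxt

rng = np.random.default_rng(0)
M = 16
s = rng.standard_normal((M, M)); s = (s + s.T) / 2
np.fill_diagonal(s, -np.inf)                       # ignore self-similarity
print(recall_walk(s))                              # a handful of unique items out of 16
```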
How many unique items can be retrieved in such a recall process? Each time we move back from a newly recalled item \(\nu\) to an already recalled item \(\mu\), the next item to be retrieved can be new only if \(\mu\) is the initial item and has never been reached via a transition from another item. This is because if \(\mu\) is not the initial item, then, denoting its first retrieval by \(u\) → \(\mu\) → \(w\), either \(s^{\mu w}\) or \(s^{\mu u}\) must be the maximum of the row \(s^{\mu ~\cdot}\). This implies there are only two cases for its subsequent retrieval:
- repeating the past transition between two already recalled items, which must result in a loop;
- traveling backward along already visited items until reaching a new item; otherwise, this also results in a loop.
Case 1 marks the end of the recall process, illustrated by the move 11 → 12 in the figure above. Case 2 is illustrated by the moves 16 → 10 → 14 → 1 → 13. Case 2 can happen multiple times in a recall process. The first time case 2 happens, it must end with a new sub-trajectory, since at the very least the walk can get back to the initial item and start a new route. When it happens a second or third time, it may instead enter a loop, since it is possible that all items along the backward path have already used their second-most-similar transition (e.g., if we change the transition 11 → 12 to 11 → 7 in the example pathway, it is case 2 but ends up in a loop). I think this situation still needs to be discussed in Naim et al. (2020). To reproduce their theoretical results, we assume that when \(M\) is large enough, the chance of ending in a loop via case 2 is negligible (further justification is needed, but it seems very likely).
Thus, to count the number of unique items recalled, we need to know the probability \(p_r\) of returning to an already recalled item \(\mu\) from a newly recalled item \(\nu\), and the probability \(p_{\ell}\) that the subsequent retrieval is case 1. These probabilities are difficult to analyze because the similarity scores are not independent. Romani et al. (2013) show that the correlations between similarity scores are negligible in the limit of very sparse memory representations (\(f\ll 1\)). However, I do not fully understand how independence is justified, or why their Gaussian approximation \(\tilde{s}^{\mu\nu} = zm_\mu m_\nu\), which preserves the original first- and second-order statistics, is valid. In principle, we can compare the joint probability with the product of the two marginal probabilities and show that, when \(Nf\) is held constant, they become asymptotically equivalent as \(N\) grows. Hence, at least in some sparse limit, the similarity scores in each row can be treated as asymptotically independent.
Under the independence assumption, the probability \(p_r\) is given by
\[\begin{aligned}p_r &= {\tt Pr}\left[s^{\mu\nu} = \max_{\omega\neq {\tt pred}(\nu)}s^{\nu\omega}\Big |s^{\mu\nu} < \max_\omega s^{\mu\omega}\right]\\ & \approx \left(\frac{1}{M}\cdot \frac{1}{2}\right) /\left(\frac{M-1}{M}\right) \approx \frac{1}{2M}. \end{aligned}\]And for a chain \(u\) → \(\mu\) → \(w\), since \(\mu\) → \(u\) cannot be the most similar transition in case 1, the probability \(p_{\ell}\) is given by
\[p_\ell = {\tt Pr}\left[\max_\omega s^{\mu\omega} > \max_{\omega}s^{u\omega}\Big |\max_\omega s^{\nu\omega} < \max_\omega s^{\mu\omega}\right] \approx \frac{2}{3}.\]Together, the probability that the process returns to some already recalled item \(\mu\) and terminates, after each newly retrieved item, is
\[p = p_rp_{\ell} \approx \frac{1}{3M}.\]The recall process is thereby similar to the famous “birthday paradox.” Suppose we have a group of \(M\) randomly chosen people with uniformly distributed birthdays. We ask them to enter a room one by one, until some pair of people in the room share the same birthday. Here the probability that two given people share a birthday is \(p\), and the expected number of people in the room at the end of the process equals the expected number of unique memories recalled.
The probability that the number of unique items recalled (\(R\)) is greater than or equal to \(k\) can be expressed by
\[\begin{aligned}{\tt Pr}[R \geq k] & = (1-p)(1-2p)\cdots(1-(k-2)p)\\ & \approx e^{-p\sum_{j=1}^{k-2}j} \approx e^{-pk^2/2}\end{aligned}\]for \(k \ll M\) and \(1\ll M\). When \(M\) is finite and \(k\) is large, this probability decays rapidly, so the expectation under this distribution is dominated by small \(k\), where the above approximation is valid. The expectation of \(R\) can then be asymptotically calculated as
\[\langle R\rangle = \sum_{k=1}^{\infty} k\cdot {\tt Pr}[R=k] = \sum_{k=1}^{\infty} {\tt Pr}[R \geq k] \approx \frac{1}{\sqrt{p}}\int_{0}^{\infty}e^{-pk^2/2} \mathrm d(\sqrt{p}k) = \frac{1}{2}\sqrt{2\pi/p}.\]Substituting \(p = 1/(3M)\) yields
\[\langle R \rangle \approx \sqrt{3\pi M /2}.\]Hence, both the exponent (\(\theta = 1/2\)) and the coefficient (\(d=\sqrt{3\pi/2}\)) of the obtained power law are consequences of a “birthday paradox” in the assumed recall process, where the mental state deterministically travels from one memory to its most similar or second most similar memory, and eventually ends in a cycle when it collides with an already visited memory.
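As a sanity check, one can simulate the abstract walk described above for various \(M\) and compare the mean number of unique items recalled with \(\sqrt{3\pi M/2}\). The sketch below again uses i.i.d. Gaussian scores as a proxy for the similarity matrix, so it tests the combinatorial argument rather than the full network model; the two columns should approach each other as \(M\) grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def n_recalled(M):
    """Number of unique items recalled by the deterministic walk on a
    random symmetric similarity matrix of size M (see the sketch above)."""
    s = rng.standard_normal((M, M)); s = (s + s.T) / 2
    np.fill_diagonal(s, -np.inf)
    prev, cur = -1, 0
    seen, recalled = set(), {0}
    while True:
        order = np.argsort(s[cur])[::-1]
        nxt = int(order[0]) if order[0] != prev else int(order[1])
        if (cur, nxt) in seen:
            return len(recalled)
        seen.add((cur, nxt)); recalled.add(nxt)
        prev, cur = cur, nxt

for M in (64, 256, 1024):
    sims = [n_recalled(M) for _ in range(300)]
    print(M, round(float(np.mean(sims)), 1), round(np.sqrt(3 * np.pi * M / 2), 1))
```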