# (Laje & Buonomano, 2013) Robust Timing in RNN

Implementation of the paper:

- Laje, R., & Buonomano, D. V. (2013). Robust timing and motor patterns by taming chaos in recurrent neural networks. *Nature Neuroscience*, 16(7), 925-933.

Thanks to the original implementation code: https://github.com/ReScience-Archives/Vitay-2016

```python
import brainpy as bp
import brainpy.math as bm
```


## Model Descriptions

### Recurrent network

The recurrent network is composed of $$N=800$$ neurons, receiving inputs from a variable number of input neurons $$N_i$$ and sparsely connected with each other. The firing rate $$r_i(t)$$ of each neuron is obtained by applying the tanh transfer function to an internal variable $$x_i(t)$$, which follows a first-order linear ordinary differential equation (ODE):

$\tau \cdot \frac{d x_i(t)}{dt} = - x_i(t) + \sum_{j=1}^{N_i} W_{ij}^{in} \cdot y_j(t) + \sum_{j=1}^{N} W_{ij}^{rec} \cdot r_j(t) + I^{noise}_i(t)$
$r_i(t) = \tanh(x_i(t))$

The weights of the input matrix $$W^{in}$$ are drawn from the normal distribution with mean 0 and variance 1, and multiply the rates of the input neurons $$y_j(t)$$. The variance of these weights does not depend on the number of inputs, as only one input neuron is activated at a time. The recurrent connectivity matrix $$W^{rec}$$ is sparse, with a connection probability $$pc = 0.1$$ (i.e. 64000 non-zero elements), and the existing weights are drawn from a normal distribution with mean 0 and variance $$g/\sqrt{pc \cdot N}$$, where $$g$$ is a scaling factor. It is a well-known result for sparse recurrent networks that the dynamics become chaotic for high values of $$g$$. $$I^{noise}_i(t)$$ is an additive noise term, drawn at each time step and for each neuron from a normal distribution with mean 0 and variance $$I_0$$. $$I_0$$ is chosen very small in the experiments reproduced here ($$I_0 = 0.001$$), but it is enough to reveal the chaotic behavior of the recurrent neurons (non-reproducibility between two trials).
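As an illustration, the network construction and one Euler step of the dynamics can be sketched with plain NumPy. The values of `dt`, `g` and `N_in` below are assumptions for this sketch, and the "variance" parameters are used directly as the scale of the normal distributions, mirroring the original implementation:

```python
import numpy as np

# Hypothetical parameter values, following the description above
N, N_in = 800, 2                         # recurrent and (assumed) input neurons
pc, g, tau, I0 = 0.1, 1.5, 10.0, 0.001   # g = 1.5 is an assumed scaling factor
dt = 1.0                                 # integration step in ms (assumed)
rng = np.random.default_rng(0)

# Input weights: standard normal, independent of the number of inputs
W_in = rng.normal(0.0, 1.0, (N, N_in))

# Sparse recurrent weights: connection probability pc, scale g / sqrt(pc * N)
mask = rng.random((N, N)) < pc
W_rec = rng.normal(0.0, g / np.sqrt(pc * N), (N, N)) * mask

x = rng.normal(0.0, 1.0, N)   # internal variable x_i
y = np.zeros(N_in)            # input rates y_j (silent here)

def step(x, y):
    """One Euler step of tau * dx/dt = -x + W_in @ y + W_rec @ tanh(x) + noise."""
    r = np.tanh(x)                            # firing rates r_i = tanh(x_i)
    noise = rng.normal(0.0, np.sqrt(I0), N)   # I0 interpreted as the noise variance
    dx = (-x + W_in @ y + W_rec @ r + noise) / tau
    return x + dt * dx

x = step(x, y)
```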

The read-out neurons simply sum the activity of the recurrent neurons using a matrix $$W^{out}$$:

$z_i(t) = \sum_{j=1}^{N} W_{ij}^{out} \cdot r_j(t)$

The read-out matrix is initialized randomly from the normal distribution with mean 0 and variance $$1/\sqrt{N}$$.

### Learning rule

The particularity of the reservoir network proposed by (Laje & Buonomano, Nature Neuroscience, 2013) is that both the recurrent weights $$W^{rec}$$ and the read-out weights $$W^{out}$$ are trained in a supervised manner. More precisely, in this implementation, only 60% of the recurrent neurons have plastic weights; the other 40% keep the same weights throughout the simulation.

Learning is done using the recursive least squares (RLS) algorithm (Haykin, 2002). It is a supervised error-driven learning rule, i.e. the weight changes depend on the error made by each neuron: the difference between the firing rate of a neuron $$r_i(t)$$ and a desired value $$R_i(t)$$.

$e_i(t) = r_i(t) - R_i(t)$

For the recurrent neurons, the desired value is the recorded rate of that neuron during an initial trial (to enforce the reproducibility of the trajectory). For the read-out neurons, it is a function which we want the network to reproduce (e.g. handwriting as in Fig. 2).

Contrary to the delta learning rule, which modifies weights proportionally to the error and to the direct input to a synapse ($$\Delta w_{ij} = - \eta \cdot e_i \cdot r_j$$), the RLS rule uses a running estimate of the inverse correlation matrix of the inputs to each neuron:

$\Delta w_{ij} = - e_i \sum_{k \in \mathcal{B}(i)} P^i_{jk} \cdot r_k$

Each neuron $$i$$ therefore stores a square matrix $$P^i$$, whose size depends on the number of weights arriving at that neuron. Read-out neurons receive synapses from all $$N$$ recurrent neurons, so their $$P$$ matrix is $$N*N$$. Recurrent units have a sparse random connectivity ($$pc = 0.1$$), so each recurrent neuron stores only an $$80*80$$ matrix on average. In the previous equation, $$\mathcal{B}(i)$$ denotes the set of existing weights onto neuron $$i$$.
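A minimal NumPy sketch of this per-neuron storage (the connectivity mask is an assumption mirroring the description of $$W^{rec}$$ above):

```python
import numpy as np

rng = np.random.default_rng(42)
N, pc = 800, 0.1
delta = 1.0   # scaling factor of the initial P matrices

# Random sparse connectivity: entry (i, j) is True if neuron j projects to neuron i
mask = rng.random((N, N)) < pc

# B(i): indices of the existing pre-synaptic weights of each recurrent neuron i
B = [np.flatnonzero(mask[i]) for i in range(N)]

# One inverse correlation matrix per neuron, sized by its inputs (~pc*N = 80)
P = [np.eye(len(b)) / delta for b in B]
```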

The inverse correlation matrix $$P$$ is updated at each time step with the following rule:

$\Delta P^i_{jk} = - \frac{\sum_{m \in \mathcal{B}(i)} \sum_{n \in \mathcal{B}(i)} P^i_{jm} \cdot r_m \cdot r_n \cdot P^i_{nk} }{ 1 + \sum_{m \in \mathcal{B}(i)} \sum_{n \in \mathcal{B}(i)} r_m \cdot P^i_{mn} \cdot r_n}$

Each matrix $$P^i$$ is initialized to the identity matrix scaled by a factor $$1/\delta$$, where $$\delta$$ is 1 in the current implementation and can be used to implicitly modify the learning rate (Sussillo & Abbott, Neuron, 2009).

### Matrix/Vector mode

The above algorithm can be written in matrix/vector form. For the recurrent units, each matrix $$P^i$$ has a different size ($$80*80$$ on average), so we still need to iterate over all post-synaptic neurons. If we denote by $$\mathbf{W}$$ the vector of weights arriving at a neuron (80 on average), by $$\mathbf{r}$$ the corresponding vector of firing rates (also 80), by $$e$$ the error of that neuron (a scalar) and by $$\mathbf{P}$$ the inverse correlation matrix (80*80), the update rules become:

$\Delta \mathbf{W} = - e \cdot \mathbf{P} \cdot \mathbf{r}$
$\Delta \mathbf{P} = - \frac{(\mathbf{P} \cdot \mathbf{r}) \cdot (\mathbf{P} \cdot \mathbf{r})^T}{1 + \mathbf{r}^T \cdot \mathbf{P} \cdot \mathbf{r}}$

In the original Matlab code, one notices that the weight update $$\Delta \mathbf{W}$$ is also normalized by the denominator of the update rule for $$\mathbf{P}$$:

$\Delta \mathbf{W} = - e \cdot \frac{\mathbf{P} \cdot \mathbf{r}}{1 + \mathbf{r}^T \cdot \mathbf{P} \cdot \mathbf{r}}$

Removing this normalization from the learning rule impairs learning completely, so we kept this variant of the RLS rule in our implementation.
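For a single neuron, this normalized RLS update can be sketched in plain NumPy (the rate vector and the error value below are arbitrary placeholders, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 80            # number of plastic synapses onto this neuron (pc * N on average)
delta = 1.0       # initial scaling of P (implicit learning rate)

W = rng.normal(0.0, 1.0 / np.sqrt(K), K)   # weights onto one neuron
P = np.eye(K) / delta                      # inverse correlation estimate

def rls_update(W, P, r, e):
    """One RLS step, with the normalization used in the original Matlab code."""
    Pr = P @ r                          # P * r
    c = 1.0 / (1.0 + r @ Pr)            # 1 / (1 + r^T P r)
    P = P - np.outer(Pr, Pr) * c        # Delta P
    W = W - e * Pr * c                  # normalized Delta W
    return W, P

r = np.tanh(rng.normal(0.0, 1.0, K))    # placeholder pre-synaptic rates
e = 0.3                                 # placeholder error e_i = r_i - R_i
W, P = rls_update(W, P, r, e)
```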

### Training procedure

The training procedure is split into different trials, which differ from one experiment to another (Fig. 1, Fig. 2 and Fig. 3). Each trial begins with a relaxation period of $$t_{offset} = 200$$ ms, followed by a brief input impulse of duration 50 ms and variable amplitude. This impulse brings all recurrent neurons into a deterministic state (due to the tanh transfer function, the rates saturate at either +1 or -1). It is followed by a training window of variable length (in the seconds range) and finally another relaxation period. In Fig. 1 and Fig. 2, an additional impulse (duration 10 ms, smaller amplitude) can be given a certain delay after the initial impulse, to test the ability of the network to recover its acquired trajectory after learning.

In all experiments, the first trial is used to acquire an innate trajectory for the recurrent neurons in the absence of noise ($$I_0$$ is set to 0). The firing rate of all recurrent neurons over the training window is simply recorded and stored in an array, without applying the learning rule. This innate trajectory is then used in the following trials as the target of the RLS learning rule for each recurrent neuron, this time in the presence of noise ($$I_0 = 0.001$$). The RLS learning rule itself is only applied to the recurrent neurons during the training window, not during the impulse or the relaxation periods. Such a learning trial is repeated 20 or 30 times depending on the experiment. Once the recurrent weights have converged and the recurrent neurons are able to reproduce the innate trajectory, the read-out weights are trained using a custom desired function as target (10 trials).
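The timing of one trial can be sketched as boolean masks over the simulation steps. The values of `dt`, the training-window length and the final relaxation period below are assumptions for illustration (the text only constrains the offset and impulse durations):

```python
import numpy as np

dt = 1.0            # integration step in ms (assumed)
t_offset = 200.0    # relaxation period before the impulse
d_input = 50.0      # duration of the initial input impulse
t_train = 2000.0    # training window, "in the seconds range" (assumed value)
t_relax = 200.0     # final relaxation period (assumed value)

T = t_offset + d_input + t_train + t_relax
times = np.arange(0.0, T, dt)

# When the input impulse is applied, and when the RLS rule is active
impulse_on = (times >= t_offset) & (times < t_offset + d_input)
learning_on = (times >= t_offset + d_input) & (times < t_offset + d_input + t_train)
```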