# Gated recurrent unit

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.[1] Their performance on polyphonic music modeling and speech signal modeling was found to be similar to that of long short-term memory (LSTM). However, GRUs have been shown to exhibit better performance on smaller datasets.[2]

They have fewer parameters than LSTM, as they lack an output gate.[3]
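The parameter difference can be made concrete with a small counting sketch (an illustration, not from the cited sources): for input size $d_x$ and hidden size $d_h$, each gate or candidate block contributes a $d_h \times d_x$ input matrix, a $d_h \times d_h$ recurrent matrix, and a $d_h$ bias; a GRU has three such blocks, an LSTM four.

```python
# Hedged sketch: per-layer parameter counts for a GRU vs. an LSTM,
# assuming input size d_x, hidden size d_h, and per-block parameters
# W (d_h x d_x), U (d_h x d_h), b (d_h). Function names are illustrative.
def gru_params(d_x, d_h):
    # 3 blocks: update gate z, reset gate r, candidate state
    return 3 * (d_h * d_x + d_h * d_h + d_h)

def lstm_params(d_x, d_h):
    # 4 blocks: input, forget, and output gates plus cell candidate
    return 4 * (d_h * d_x + d_h * d_h + d_h)

print(gru_params(10, 20))   # 3 * (200 + 400 + 20) = 1860
print(lstm_params(10, 20))  # 4 * (200 + 400 + 20) = 2480
```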

## Architecture

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.

The operator ${\displaystyle \circ }$ denotes the Hadamard product in the following.

### Fully gated unit

*Gated Recurrent Unit, fully gated version*

Initially, for ${\displaystyle t=0}$, the output vector is ${\displaystyle h_{0}=0}$.

$${\begin{aligned}z_{t}&=\sigma _{g}(W_{z}x_{t}+U_{z}h_{t-1}+b_{z})\\r_{t}&=\sigma _{g}(W_{r}x_{t}+U_{r}h_{t-1}+b_{r})\\h_{t}&=(1-z_{t})\circ h_{t-1}+z_{t}\circ \sigma _{h}(W_{h}x_{t}+U_{h}(r_{t}\circ h_{t-1})+b_{h})\end{aligned}}$$

Variables

• ${\displaystyle x_{t}}$: input vector
• ${\displaystyle h_{t}}$: output vector
• ${\displaystyle z_{t}}$: update gate vector
• ${\displaystyle r_{t}}$: reset gate vector
• ${\displaystyle W}$, ${\displaystyle U}$ and ${\displaystyle b}$: parameter matrices and vectors
• ${\displaystyle \sigma _{g}}$: activation function; the original is a sigmoid function
• ${\displaystyle \sigma _{h}}$: activation function; the original is the hyperbolic tangent

Alternative activation functions are possible, provided that ${\displaystyle \sigma _{g}(x)\in [0,1]}$.
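A single time step of the fully gated unit can be sketched directly from the equations above. This is a minimal NumPy illustration, not a reference implementation; parameter names mirror the symbols $W_{*}$, $U_{*}$, $b_{*}$, and shapes are assumed to be $x_t \in \mathbb{R}^{d_x}$, $h_{t-1} \in \mathbb{R}^{d_h}$.

```python
# Minimal sketch of one fully gated GRU step, assuming shapes:
# x_t: (d_x,), h_prev: (d_h,), W: (d_h, d_x), U: (d_h, d_h), b: (d_h,).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)            # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)            # reset gate r_t
    h_hat = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)  # candidate state
    return (1 - z) * h_prev + z * h_hat                 # new hidden state h_t
```

A sequence is processed by starting from $h_{0}=0$ and feeding each output back in as `h_prev` for the next step.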


Alternate forms can be created by changing how ${\displaystyle z_{t}}$ and ${\displaystyle r_{t}}$ are computed:[4]

• Type 1, each gate depends only on the previous hidden state and the bias.
$${\begin{aligned}z_{t}&=\sigma _{g}(U_{z}h_{t-1}+b_{z})\\r_{t}&=\sigma _{g}(U_{r}h_{t-1}+b_{r})\end{aligned}}$$
• Type 2, each gate depends only on the previous hidden state.
$${\begin{aligned}z_{t}&=\sigma _{g}(U_{z}h_{t-1})\\r_{t}&=\sigma _{g}(U_{r}h_{t-1})\end{aligned}}$$
• Type 3, each gate is computed using only the bias.
$${\begin{aligned}z_{t}&=\sigma _{g}(b_{z})\\r_{t}&=\sigma _{g}(b_{r})\end{aligned}}$$
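The three reduced gate forms can be sketched as small NumPy functions (an illustration; the names are assumptions, not from [4]). Note that a Type 3 gate depends only on its bias, so its value is constant across time steps.

```python
# Illustrative sketch of the three reduced gate computations;
# U: (d_h, d_h) matrix, h_prev and b: (d_h,) vectors.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_type1(U, h_prev, b):
    # Type 1: previous hidden state and bias
    return sigmoid(U @ h_prev + b)

def gate_type2(U, h_prev):
    # Type 2: previous hidden state only
    return sigmoid(U @ h_prev)

def gate_type3(b):
    # Type 3: bias only (constant over time)
    return sigmoid(b)
```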

### Minimal gated unit

The minimal gated unit is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:[5]

$${\begin{aligned}f_{t}&=\sigma _{g}(W_{f}x_{t}+U_{f}h_{t-1}+b_{f})\\h_{t}&=f_{t}\circ h_{t-1}+(1-f_{t})\circ \sigma _{h}(W_{h}x_{t}+U_{h}(f_{t}\circ h_{t-1})+b_{h})\end{aligned}}$$

Variables

• ${\displaystyle x_{t}}$: input vector
• ${\displaystyle h_{t}}$: output vector
• ${\displaystyle f_{t}}$: forget vector
• ${\displaystyle W}$, ${\displaystyle U}$ and ${\displaystyle b}$: parameter matrices and vectors
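Analogously to the fully gated unit, one minimal gated unit step can be sketched from the equations above (a NumPy illustration with assumed shapes, not a reference implementation). Note that only one gate block of parameters remains, which is where the parameter savings over the full GRU come from.

```python
# Minimal sketch of one minimal gated unit (MGU) step, assuming shapes:
# x_t: (d_x,), h_prev: (d_h,), W: (d_h, d_x), U: (d_h, d_h), b: (d_h,).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, Wf, Uf, bf, Wh, Uh, bh):
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)            # forget gate f_t
    h_hat = np.tanh(Wh @ x_t + Uh @ (f * h_prev) + bh)  # candidate state
    return f * h_prev + (1 - f) * h_hat                 # new hidden state h_t
```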