Differentiable neural computer

A differentiable neural computer being trained to store and recall dense binary numbers. Performance of a reference task during training shown. Upper left: the input (red) and target (blue), as 5-bit words and a 1 bit interrupt signal. Upper right: the model's output

A differentiable neural computer (DNC) is a recurrent artificial neural network architecture with an autoassociative memory. The model was published in 2016 by Alex Graves et al. of DeepMind.^[1]

Applications[edit]

So far, DNCs have only been demonstrated to handle relatively simple tasks, which could have been easily solved using conventional computer programming decades ago. But DNCs don't need to be programmed for each problem they are applied to, but can instead be trained. This attention span allows the user to feed complex data structures such as graphs sequentially, and recall them during later use. Furthermore, they can learn some aspects of symbolic reasoning and apply it to the use of working memory. Some experts see promise that they can be trained to perform complex, structured tasks^[1]^[2] and address big-data applications that require some sort of rational reasoning, such as generating video commentaries or semantic text analysis.^[3]^[4]

DNC can be trained to navigate a variety of rapid transit systems, and then what the DNC learns can be applied, for example, to get around on the London Underground. A neural network without memory would typically have to learn about each different transit system from scratch. On graph traversal and sequence-processing tasks with supervised learning, DNCs performed better than alternatives such as long short-term memory or a neural turing machine.^[5] With a reinforcement learning approach to a block puzzle problem inspired by SHRDLU, DNC was trained via curriculum learning, and learned to make a plan. It performed better than a traditional recurrent neural network.^[5]

Architecture[edit]

DNC system diagram.

DNC networks were introduced as an extension of the Neural Turing Machine (NTM), with the addition of memory attention mechanisms that control where the memory is stored, and temporal attention that records the order of events. This structure allows DNCs to be more robust and abstract than a NTM, and still perform tasks that have longer-term dependencies than some of its predecessors such as the LSTM network. The memory, which is simply a matrix, can be allocated dynamically and accessed indefinitely. The DNC is differentiable end-to-end (each subcomponent of the model is differentiable, therefore so is the whole model). This makes it possible to optimize them efficiently using gradient descent. It learns how to store and retrieve the information such that it satisfies the task execution.^[3]^[6]^[7]

The DNC model is similar to the Von Neumann architecture, and because of the resizability of memory, it is Turing complete.^[8] Differentiable Neural Computers were inspired by the mammalian hippocampus.^[5]

Traditional DNC[edit]

DNC, as originally published^[1]

Independent variables
$\mathbf {x} _{t}$	Input vector
$\mathbf {z} _{t}$	Target vector
Controller
${\boldsymbol {\chi }}_{t}=[\mathbf {x} _{t};\mathbf {r} _{t-1}^{1};\cdots ;\mathbf {r} _{t-1}^{R}]$	Controller input matrix

Deep (layered) LSTM	$\forall \;0\leq l\leq L$
$\mathbf {i} _{t}^{l}=\sigma (W_{i}^{l}[{\boldsymbol {\chi }}_{t};\mathbf {h} _{t-1}^{l};\mathbf {h} _{t}^{l-1}]+\mathbf {b} _{i}^{l})$	Input gate vector
$\mathbf {o} _{t}^{l}=\sigma (W_{o}^{l}[{\boldsymbol {\chi }}_{t};\mathbf {h} _{t-1}^{l};\mathbf {h} _{t}^{l-1}]+\mathbf {b} _{o}^{l})$	Output gate vector
$\mathbf {f} _{t}^{l}=\sigma (W_{f}^{l}[{\boldsymbol {\chi }}_{t};\mathbf {h} _{t-1}^{l};\mathbf {h} _{t}^{l-1}]+\mathbf {b} _{f}^{l})$	Forget gate vector
$\mathbf {s} _{t}^{l}=\mathbf {f} _{t}^{l}\mathbf {s} _{t-1}^{l}+\mathbf {i} _{t}^{l}\tanh(W_{s}^{l}[{\boldsymbol {\chi }}_{t};\mathbf {h} _{t-1}^{l};\mathbf {h} _{t}^{l-1}]+\mathbf {b} _{s}^{l})$	State gate vector, $s_{0}=0$
$\mathbf {h} _{t}^{l}=\mathbf {o} _{t}^{l}\tanh(\mathbf {s} _{t}^{l})$	Hidden gate vector, $h_{0}=0;h_{t}^{0}=0\;\forall \;t$

$\mathbf {y} _{t}=W_{y}[\mathbf {h} _{t}^{1};\cdots ;\mathbf {h} _{t}^{L}]+W_{r}[\mathbf {r} _{t}^{1};\cdots ;\mathbf {r} _{t}^{R}]$	DNC output vector
Read & Write heads
$\xi _{t}=W_{\xi }[h_{t}^{1};\cdots ;h_{t}^{L}]$	Interface parameters
$=[\mathbf {k} _{t}^{r,1};\cdots ;\mathbf {k} _{t}^{r,R};{\hat {\beta }}_{t}^{r,1};\cdots ;{\hat {\beta }}_{t}^{r,R};\mathbf {k} _{t}^{w};{\hat {\beta _{t}^{w}}};\mathbf {\hat {e}} _{t};\mathbf {v} _{t};{\hat {f_{t}^{1}}};\cdots ;{\hat {f_{t}^{R}}};{\hat {g}}_{t}^{a};{\hat {g}}_{t}^{w};{\hat {\boldsymbol {\pi }}}_{t}^{1};\cdots ;{\hat {\boldsymbol {\pi }}}_{t}^{R}]$

Read heads	$\forall \;1\leq i\leq R$
$\mathbf {k} _{t}^{r,i}$	Read keys
$\beta _{t}^{r,i}={\text{oneplus}}({\hat {\beta }}_{t}^{r,i})$	Read strengths
$f_{t}^{i}=\sigma ({\hat {f}}_{t}^{i})$	Free gates
${\boldsymbol {\pi }}_{t}^{i}={\text{softmax}}({\hat {\boldsymbol {\pi }}}_{t}^{i})$	Read modes, ${\boldsymbol {\pi }}_{t}^{i}\in \mathbb {R} ^{3}$

Write head
$\mathbf {k} _{t}^{w}$	Write key
$\beta _{t}^{w}={\hat {\beta }}_{t}^{w}$	Write strength
$\mathbf {e} _{t}=\sigma (\mathbf {\hat {e}} _{t})$	Erase vector
$\mathbf {v} _{t}$	Write vector
$g_{t}^{a}=\sigma ({\hat {g}}_{t}^{a})$	Allocation gate
$g_{t}^{w}=\sigma ({\hat {g}}_{t}^{w})$	Write gate
Memory
$M_{t}=M_{t-1}\circ (E-\mathbf {w} _{t}^{w}\mathbf {e} _{t}^{\intercal })+\mathbf {w} _{t}^{w}\mathbf {v} _{t}^{\intercal }$	Memory matrix, Matrix of ones $E\in \mathbb {R} ^{N\times W}$
$\mathbf {u} _{t}=(\mathbf {u} _{t-1}+\mathbf {w} _{t-1}^{w}-\mathbf {u} _{t-1}\circ \mathbf {w} _{t-1}^{w})\circ {\boldsymbol {\psi }}_{t}$	Usage vector
$\mathbf {p} _{t}=\left(1-\sum _{i}\mathbf {w} _{t}^{w}[i]\right)\mathbf {p} _{t-1}+\mathbf {w} _{t}^{w}$	Precedence weighting, $\mathbf {p} _{0}=\mathbf {0}$
$L_{t}=(\mathbf {1} -\mathbf {I} )\left[(1-\mathbf {w} _{t}^{w}[i]-\mathbf {w} _{t}^{j})L_{t-1}[i,j]+\mathbf {w} _{t}^{w}[i]\mathbf {p} _{t-1}^{j}\right]$	Temporal link matrix, $L_{0}=\mathbf {0}$
$\mathbf {w} _{t}^{w}=g_{t}^{w}[g_{t}^{a}\mathbf {a} _{t}+(1-g_{t}^{a})\mathbf {c} _{t}^{w}]$	Write weighting
$\mathbf {w} _{t}^{r,i}={\boldsymbol {\pi }}_{t}^{i}[1]\mathbf {b} _{t}^{i}+{\boldsymbol {\pi }}_{t}^{i}[2]c_{t}^{r,i}+{\boldsymbol {\pi }}_{t}^{i}[3]f_{t}^{i}$	Read weighting
$\mathbf {r} _{t}^{i}=M_{t}^{\intercal }\mathbf {w} _{t}^{r,i}$	Read vectors

${\mathcal {C}}(M,\mathbf {k} ,\beta )[i]={\frac {\exp\{{\mathcal {D}}(\mathbf {k} ,M[i,\cdot ])\beta \}}{\sum _{j}\exp\{{\mathcal {D}}(\mathbf {k} ,M[j,\cdot ])\beta \}}}$	Content-based addressing, Lookup key $\mathbf {k}$ , key strength $\beta$
$\phi _{t}$	Indices of $\mathbf {u} _{t}$ , sorted in ascending order of usage
$\mathbf {a} _{t}[\phi _{t}[j]]=(1-\mathbf {u} _{t}[\phi _{t}[j]])\prod _{i=1}^{j-1}\mathbf {u} _{t}[\phi _{t}[i]]$	Allocation weighting
$\mathbf {c} _{t}^{w}={\mathcal {C}}(M_{t-1},\mathbf {k} _{t}^{w},\beta _{t}^{w})$	Write content weighting
$\mathbf {c} _{t}^{r,i}={\mathcal {C}}(M_{t-1},\mathbf {k} _{t}^{r,i},\beta _{t}^{r,i})$	Read content weighting
$\mathbf {f} _{t}^{i}=L_{t}\mathbf {w} _{t-1}^{r,i}$	Forward weighting
$\mathbf {b} _{t}^{i}=L_{t}^{\intercal }\mathbf {w} _{t-1}^{r,i}$	Backward weighting
${\boldsymbol {\psi }}_{t}=\prod _{i=1}^{R}\left(\mathbf {1} -f_{t}^{i}\mathbf {w} _{t-1}^{r,i}\right)$	Memory retention vector
Definitions
$\mathbf {W} ,\mathbf {b}$	Weight matrix, bias vector
$\mathbf {0} ,\mathbf {1} ,\mathbf {I}$	Zeros matrix, ones matrix, identity matrix
$\circ$	Element-wise multiplication
${\mathcal {D}}(\mathbf {u} ,\mathbf {v} )={\frac {\mathbf {u} \cdot \mathbf {v} }{\\|\mathbf {u} \\|\\|\mathbf {v} \\|}}$	Cosine similarity
$\sigma (x)=1/(1+e^{-x})$	Sigmoid function
${\text{oneplus}}(x)=1+\log(1+e^{x})$	Oneplus function
${\text{softmax}}(\mathbf {x} )_{j}={\frac {e^{x_{j}}}{\sum _{k=1}^{K}e^{x_{k}}}}$ for j = 1, …, K.	Softmax function

Extensions[edit]

Refinements to the model have been published since the original paper's release. Sparse memory addressing results in a time and space complexity reduction of thousands of times. This can be achieved by using an approximate nearest neighbors algorithm, such as Locality-sensitive hashing, or a random k-d tree like the Fast Library for Approximate Nearest Neighbors from UBC.^[9] Adding Adaptive Computation Time (ACT) separates computation time from data time, which uses the fact that problem length and problem difficulty are not always the same.^[10] Training using synthetic gradients performs considerably better than Backpropagation through time (BPTT).^[11] Another training improvement in terms of robustness can be achieved with use of DNA normalization and a Bypass Dropout as regularization. ^[12]

References[edit]

^ ^a ^b ^c Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574.
^ "Differentiable neural computers | DeepMind". DeepMind. Retrieved 2016-10-19.
^ ^a ^b Burgess, Matt. "DeepMind's AI learned to ride the London Underground using human-like reason and memory". WIRED UK. Retrieved 2016-10-19.
^ Jaeger, Herbert (2016-10-12). "Artificial intelligence: Deep neural reasoning". Nature. 538 (7626): 467–468. Bibcode:2016Natur.538..467J. doi:10.1038/nature19477. ISSN 1476-4687. PMID 27732576.
^ ^a ^b ^c James, Mike. "DeepMind's Differentiable Neural Network Thinks Deeply". www.i-programmer.info. Retrieved 2016-10-20.
^ "DeepMind AI 'Learns' to Navigate London Tube". PCMAG. Retrieved 2016-10-19.
^ Mannes, John. "DeepMind's differentiable neural computer helps you navigate the subway with its memory". TechCrunch. Retrieved 2016-10-19.
^ "RNN Symposium 2016: Alex Graves - Differentiable Neural Computer".
^ Jack W Rae; Jonathan J Hunt; Harley, Tim; Danihelka, Ivo; Senior, Andrew; Wayne, Greg; Graves, Alex; Timothy P Lillicrap (2016). "Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes". arXiv:1610.09027 [cs.LG].
^ Graves, Alex (2016). "Adaptive Computation Time for Recurrent Neural Networks". arXiv:1603.08983 [cs.NE].
^ Jaderberg, Max; Wojciech Marian Czarnecki; Osindero, Simon; Vinyals, Oriol; Graves, Alex; Silver, David; Kavukcuoglu, Koray (2016). "Decoupled Neural Interfaces using Synthetic Gradients". arXiv:1608.05343 [cs.LG].
^ Franke, Jörg; Niehues, Jan; Waibel, Alex (2018). "Robust and Scalable Differentiable Neural Computer for Question Answering". arXiv:1807.02658 [cs.CL].

External links[edit]

A bit-by-bit guide to the equations governing differentiable neural computers

[DNCnature2016-1] Graves, Alex; Wayne, Greg; Reynolds, Malcolm; Harley, Tim; Danihelka, Ivo; Grabska-Barwińska, Agnieszka; Colmenarejo, Sergio Gómez; Grefenstette, Edward; Ramalho, Tiago (2016-10-12). "Hybrid computing using a neural network with dynamic external memory". Nature. 538 (7626): 471–476. Bibcode:2016Natur.538..471G. doi:10.1038/nature20101. ISSN 1476-4687. PMID 27732574.

[2] "Differentiable neural computers | DeepMind". DeepMind. Retrieved 2016-10-19.

[:0-3] Burgess, Matt. "DeepMind's AI learned to ride the London Underground using human-like reason and memory". WIRED UK. Retrieved 2016-10-19.

[4] Jaeger, Herbert (2016-10-12). "Artificial intelligence: Deep neural reasoning". Nature. 538 (7626): 467–468. Bibcode:2016Natur.538..467J. doi:10.1038/nature19477. ISSN 1476-4687. PMID 27732576.

[:1-5] James, Mike. "DeepMind's Differentiable Neural Network Thinks Deeply". www.i-programmer.info. Retrieved 2016-10-20.

[6] "DeepMind AI 'Learns' to Navigate London Tube". PCMAG. Retrieved 2016-10-19.

[7] Mannes, John. "DeepMind's differentiable neural computer helps you navigate the subway with its memory". TechCrunch. Retrieved 2016-10-19.

[8] "RNN Symposium 2016: Alex Graves - Differentiable Neural Computer".

[9] Jack W Rae; Jonathan J Hunt; Harley, Tim; Danihelka, Ivo; Senior, Andrew; Wayne, Greg; Graves, Alex; Timothy P Lillicrap (2016). "Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes". arXiv:1610.09027 [cs.LG].

[10] Graves, Alex (2016). "Adaptive Computation Time for Recurrent Neural Networks". arXiv:1603.08983 [cs.NE].

[11] Jaderberg, Max; Wojciech Marian Czarnecki; Osindero, Simon; Vinyals, Oriol; Graves, Alex; Silver, David; Kavukcuoglu, Koray (2016). "Decoupled Neural Interfaces using Synthetic Gradients". arXiv:1608.05343 [cs.LG].

[12] Franke, Jörg; Niehues, Jan; Waibel, Alex (2018). "Robust and Scalable Differentiable Neural Computer for Question Answering". arXiv:1807.02658 [cs.CL].

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

Differentiable neural computer

Contents

Applications[edit]

Architecture[edit]

Traditional DNC[edit]

Extensions[edit]

References[edit]

External links[edit]

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Interaction

Tools

Print/export

Languages