HAL Id: hal-02400746
https://hal.inria.fr/hal-02400746
Submitted on 9 Dec 2019
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Adjoint computation and Backpropagation
Guillaume Pallez
To cite this version:
Guillaume Pallez. Adjoint computation and Backpropagation. Meeting of the Royal Society – Numerical algorithms for high-performance computational science, Apr 2019, London, United Kingdom. hal-02400746
Adjoint computation and Backpropagation
Guillaume Pallez (Aupy)
Inria & University of Bordeaux
Who’s who
Automatic Differentiation
Paul Hovland (Argonne)
Navjot Kukreja (Imperial College)
Krishna Narayanan (Argonne)
Machine Learning (I)
Alexis Joly (Inria)
Machine Learning (II)
Alena Shilova (Inria)
Scheduling
Guillaume Pallez (Inria)
Olivier Beaumont (Inria)
Adjoint Computation and
Backpropagation
Ice-sheet model (I)
“In climate modelling, Ice-sheet models use numerical methods to simulate the
evolution, dynamics and thermodynamics of ice sheets.” (wikipedia)
Simpler Version:
proc Model Algorithm(u0, y)
begin
  Do stuff;
  for i = 0 to n do ui+1 = fi(ui);
  Do stuff;
  /* F(u0) = fn ◦ fn−1 ◦ … ◦ f0(u0) */
  Compute ∇F(u0)y;
end
Ice-sheet model (II)
A quick reminder about the gradient:
F(u0) = fn ◦ fn−1 ◦ … ◦ f1 ◦ f0(u0)
∇F(u0)y = Jf0(u0)T · ∇(fn ◦ … ◦ f1)(u1) · y
        = Jf0(u0)T · Jf1(u1)T · … · Jfn−1(un−1)T · Jfn(un)T · y
where JfT is the transposed Jacobian matrix of f, and ui+1 = fi(ui) = fi(fi−1 ◦ … ◦ f0(u0)).
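The transposed-Jacobian chain above can be checked numerically on a toy composition of linear maps (hypothetical stand-ins for the fi; with linear steps the Jacobians are just the matrices themselves):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain of linear maps f_i(u) = A_i @ u, so J f_i(u_i) = A_i.
n, d = 4, 3
As = [rng.standard_normal((d, d)) for _ in range(n + 1)]

u0 = rng.standard_normal(d)
y = rng.standard_normal(d)

# Forward sweep: u_{i+1} = f_i(u_i)
us = [u0]
for A in As:
    us.append(A @ us[-1])

# Reverse sweep: v <- J f_i(u_i)^T v, for i = n down to 0
v = y.copy()
for A in reversed(As):
    v = A.T @ v

# For linear maps, F(u0) = A_n ... A_0 u0, so grad F(u0) y = (A_n ... A_0)^T y.
M = np.eye(d)
for A in As:
    M = A @ M
assert np.allclose(v, M.T @ y)
```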
A better solution?
∇F(u0)y = Jf0(u0)T · Jf1(u1)T · … · Jfn−1(un−1)T · Jfn(un)T · y
proc Algo A(u0, y)
begin
  Do stuff;
  for i = 0 to n do ui+1 = fi(ui);
  Do stuff;
  Compute ∇F(u0)y;
end
→ What is the problem with Algo A?
proc Algo B(u0, y)
begin
  Do stuff;
  for i = 0 to n do
    ui+1 = fi(ui);
    Do stuff;
    vi+1 = vi · Jfi+1(ui+1)T;
  end
end
→ What is the problem with Algo B?
Algo B accumulates the product left to right, ∇F(u0)y = (…(Jf0T · Jf1T) · … · JfnT) · y: n MatMat ops.
Algo A applies the transposed Jacobians to the vector right to left, ∇F(u0)y = Jf0T · (… · (JfnT · y)): n MatVec ops.
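The trade-off between the two algorithms can be made concrete: accumulating the product as matrices (Algo B's v updates) costs n matrix–matrix products, while applying the transposed Jacobians to the vector right to left (Algo A) costs only n matrix–vector products. A sketch with random stand-in Jacobians, assuming square d×d maps:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 10
Js = [rng.standard_normal((d, d)) for _ in range(n + 1)]  # stand-ins for J f_i(u_i)
y = rng.standard_normal(d)

# Algo B style: accumulate V = J f_0^T J f_1^T ... as a matrix (n mat-mat products)
V = Js[0].T
for J in Js[1:]:
    V = V @ J.T
grad_matmat = V @ y

# Algo A style: apply transposed Jacobians to the vector, right to left (n mat-vec)
v = y.copy()
for J in reversed(Js):
    v = J.T @ v
grad_matvec = v

assert np.allclose(grad_matmat, grad_matvec)
# Rough flop counts: mat-mat ~ n * 2d^3, mat-vec ~ n * 2d^2 -> a factor d cheaper.
```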
Adjoint computation
Fi(xi) = xi+1, for i < l
F̄i(xi, x̄i+1) = x̄i, for i ≤ l
[Figure: the forward chain F0, F1, …, Fl−1 produces x1, …, xl; the backward chain F̄l−1, …, F̄1, F̄0 then consumes (xi, x̄i+1) and produces x̄l−1, …, x̄0.]
Relation to IA? (I)
GoogleNet graph:
Relation to IA? (II)
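The recurrence Fi(xi) = xi+1, F̄i(xi, x̄i+1) = x̄i can be sketched as a store-all sweep on a toy scalar chain (the step function below is hypothetical; any differentiable Fi works the same way):

```python
# Store-all adjoint sweep for the chain F_i(x_i) = x_{i+1},
# F̄_i(x_i, x̄_{i+1}) = x̄_i.  Toy scalar step x_{i+1} = x_i**2 + 1.
l = 5

def F(i, x):           # forward step
    return x * x + 1.0

def Fbar(i, x, xbar):  # adjoint step: multiply by dF_i/dx evaluated at x_i
    return 2.0 * x * xbar

# Forward phase: store every x_i ("store all" strategy)
xs = [0.5]
for i in range(l):
    xs.append(F(i, xs[i]))

# Backward phase: consume the stored x_i from the end
xbar = 1.0  # seed x̄_l
for i in reversed(range(l)):
    xbar = Fbar(i, xs[i], xbar)

# xbar now holds dx_l/dx_0 = prod_i 2 * x_i
expected = 1.0
for i in range(l):
    expected *= 2.0 * xs[i]
assert abs(xbar - expected) < 1e-6 * abs(expected)
```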
Derivatives in machine learning
“Backprop” and gradient descent are at the core of all recent advances.
Computer vision: NVIDIA DRIVE PX 2 segmentation; top-5 error rate for ImageNet (NVIDIA devblog); Faster R-CNN (Ren et al., 2015).
Speech recognition & synthesis: word error rates (Huang et al., 2014); Google Neural Machine Translation System (GNMT).
Model of execution
[Figure: forward chain F0 … F4 and backward chain F̄0 … F̄4, with the peak memory shown as the execution progresses.]
- Memory to store the output of computations (xi or x̄i). Initial state: contains x0.
- Cost to write: wm = 0; cost to read: rm = 0.
Example of execution
Strategy      | Time | Space
Store all     | $    | $$$
Store “none”  | $$$  | $
No reuse      | $$   | $$
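A toy accounting of the two extreme strategies from the table, with unit step costs and free memory traffic (wm = rm = 0); the exact space bookkeeping is a simplifying assumption for illustration:

```python
def store_all(l):
    # One forward sweep storing x_0 .. x_{l-1}, then one backward sweep.
    return 2 * l, l          # (time, space)

def store_none(l):
    # Keep only x_0; replay the prefix before each backward step.
    forwards = l + sum(range(l))  # initial sweep + recomputations
    return forwards + l, 2        # (time, space)

l = 10
t_all, s_all = store_all(l)
t_none, s_none = store_none(l)
# Store all: cheap in time, expensive in space; store "none": the opposite.
assert t_all < t_none and s_none < s_all
```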
Problem formulation
We want to minimize the makespan of the adjoint computation (AC) graph:
- AC graph: size l; step costs uf (forward), ub (backward).
- Memory: capacity cm, wm = rm = 0, Mini = {x0}.
- Storage level k: capacity ck, write cost wk, read cost rk, Sini = ∅.
[Figure: the chain F0, F1, …, Fl−1 followed by F̄0, F̄1, …, F̄l−1.]
Existing work
Question: how do we organize the reverse execution of intermediate steps?
What do we store, what do we recompute?
- Store all: memory expensive
- Recompute all: compute expensive
- Intermediate status?
Bounded memory
Griewank and Walther, 2000: Revolve(l, cm), an optimal algorithm with cm memory slots.
Storage hierarchy
A., Herrmann, Hovland, Robert, 2015: optimal algorithm for two levels of storage: cheap bounded memory and costly unbounded disks.
A., Herrmann, 2019: library of optimal schedules for any number of storage levels (https://gitlab.inria.fr/adjoint-computation).
Forward phase
[Figure: forward sweep F0 … Fl−1, with Revolve(m, cm) applied recursively.]
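The optimal time behind Revolve satisfies a simple split recurrence. The memoized sketch below counts only forward-step executions under a unit-cost model (an illustration of the recurrence, not Griewank and Walther's actual implementation; the base cases follow one common convention):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def t(l, s):
    """Minimal forward-step count to reverse a chain of length l with s
    checkpoint slots (unit-cost toy model)."""
    if l == 0:
        return 0
    if l == 1:
        return 1
    if s == 1:
        return l * (l + 1) // 2   # replay the prefix before every backward step
    # Store x_j and split: j forwards, reverse the tail with s-1 free slots,
    # then reverse the prefix of length j with all s slots again.
    return min(j + t(l - j, s - 1) + t(j, s) for j in range(1, l))

assert t(4, 1) == 10
assert t(3, 2) == 5
# More memory never hurts: the count is non-increasing in s.
assert t(20, 4) <= t(20, 3) <= t(20, 2)
```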
What directions for AI?
Then what? Are we done? Just let AD people and ML people talk together!
Cut the middle-(scheduling)-people!
What directions for AI?
While the core of the algorithms remains similar, the problems are different:
- Shallower graphs (O(100–1000) levels).
- Cost functions (time/memory) are not necessarily uniform.
- Graphs with more structure than chains.
- Multi-learners/hyperparameter tuning (independent graphs executed simultaneously), shared memory?
- Etc.
Dir. for AI: Graph structure
Remember this Google network:
Nice try, we are not going to
Dir. for AI: Graph structure II
Source: Rao et al., A Deep Siamese Neural Network (...), 2016
Source: Surís et al., Cross-Modal Embeddings for Video and Audio Retrieval, 2018
Dir. for AI: Graph structure II
Pitchfork graphs (aka join graphs):
Theorem (A., Beaumont, Herrmann, Shilova, 2019)
Given a bounded memory and a pitchfork with a bounded number of “teeth”, we can find in polynomial time the solution that backpropagates it in minimal time.
A Grasp of the proof? (I)
Three-phase algorithm:
1. Forward phase: traverse all branches; write some intermediate data.
2. Turn: backpropagate the handle of the pitchfork.
3. Backward phase: iteratively, read some checkpointed data from one of the branches and backpropagate a subset of the graph (possibly writing additional intermediate data).
A Grasp of the proof? (II)
It relies on key properties of the backward phase:
- Stability of execution
- Checkpoint persistence
which give us a multi-phase approach.
Lemma (Stability 1)
If Fi is “backpropagated”, then there are
Lemma (Checkpoint persistence)
If xi is stored, until Fi is “backpropagated”, there are no Fj for
Lemma (Stability 2)
If xi is read, then there are no Fj on
In this case, for a given forward phase, we get a multi-phase backward phase:
Forward phase
[Figure: F0 F1 F2 … Fl−1]
- Where do we schedule the checkpoints in the forward phase?
- In which order do we execute the
Is it worth it?
- From a scheduling perspective: yes! (new fun problems)
- From an adjoint perspective: yes! With a memory of size O(M):
  - Store All can execute a graph of size O(M) in time O(M);
  - Revolve can execute a graph of size O(e^M) in time O(M e^M)!
  - H-Revolve improves performance by an order of magnitude.
- From a machine-learning perspective: deeper networks!