HAL Id: tel-01526482
https://tel.archives-ouvertes.fr/tel-01526482
Submitted on 23 May 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Des comportements flexibles aux comportements habituels : méta-apprentissage neuro-inspiré pour la robotique autonome
(From flexible to habitual behaviors: neuro-inspired meta-learning for autonomous robotics)
Erwan Renaudo
To cite this version:
Erwan Renaudo. Des comportements flexibles aux comportements habituels : meta-apprentissage neuroinspiré pour la robotique autonome. Robotique [cs.RO]. Université Pierre et Marie Curie - Paris VI, 2016. Français. ⟨NNT : 2016PA066508⟩. ⟨tel-01526482⟩
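A Markov decision process is a tuple $\langle S, A, T, R \rangle$ with states $s \in S$ and actions $a \in A$:

- policy: $\pi : S \to A$ (deterministic) or $\pi : S \times A \to [0,1]$ (stochastic)
- transition function: $T : S \times A \times S \to [0,1]$
- reward function: $R : S \to \mathbb{R}$ or $R : S \times A \to \mathbb{R}$

A state can be described by a tuple of dimensions, $S_k = (D_0, \dots, D_n)$.

Discounted return, with $r_t$ the reward at time $t$ and $\gamma \in [0,1]$ the discount factor:

$$R_t = \sum_{i=0}^{\infty} \gamma^i \cdot r_{t+i+1}$$

With $\gamma = 0$, only the next reward counts: $R_t = r_{t+1}$.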
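As an illustration of the discounted return $R_t = \sum_i \gamma^i \cdot r_{t+i+1}$, here is a minimal Python sketch. The function name `discounted_return` and the finite reward list are assumptions made for the example; they do not come from the thesis.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], where rewards[i] stands for r_{t+i+1}.

    Hypothetical helper: the reward sequence is truncated to a finite list.
    """
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([0.0, 0.0, 1.0], 0.5))  # 0.25
print(discounted_return([1.0, 5.0, 5.0], 0.0))  # 1.0: only the next reward counts
```

The second call shows the $\gamma = 0$ case, where later rewards are ignored.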
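[Figure: example MDP with two states, $S_0$ ($r = 0$) and $S_1$ ($r = 1$), with transition probabilities $\alpha$ and $1 - \alpha$.]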
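Undiscounted return over a finite horizon $T$:

$$R_t = \sum_{i=0}^{T} r_{t+i+1}$$

State value function $V^{\pi}(s)$ and action value function $Q^{\pi}(s,a)$ of a policy $\pi$:

$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{i=0}^{\infty} \gamma^i \cdot r_{t+i+1} \,\Big|\, s_t = s\Big\}$$

$$Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{i=0}^{\infty} \gamma^i \cdot r_{t+i+1} \,\Big|\, s_t = s, a_t = a\Big\}$$

Bellman equation for $V^{\pi}$, with $a = \pi(s)$ and $s'$ the successor state:

$$V^{\pi}(s) = \sum_{s' \in S} T(s, \pi(s), s') \big(R(s, a, s') + \gamma V^{\pi}(s')\big)$$

A policy $\pi'$ is at least as good as $\pi$ when $\forall s \in S,\ V^{\pi'}(s) \geq V^{\pi}(s)$. The optimal value functions are:

$$\forall s \in S,\quad V^{*}(s) = \max_{\pi} V^{\pi}(s) = \max_{a} \sum_{s' \in S} T(s,a,s') \big(R(s,a,s') + \gamma V^{*}(s')\big)$$

$$\forall (s,a) \in S \times A,\quad Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) = \sum_{s' \in S} T(s,a,s') \big(R(s,a,s') + \gamma \max_{b} Q^{*}(s',b)\big)$$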
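Policy iteration (from an initial policy $\pi_0$ towards $\pi^{*}$ and $V^{*}$)

repeat
    (policy evaluation)
    repeat
        $\Delta \leftarrow 0$
        for each $s \in S$:
            $v \leftarrow V^{\pi}(s)$
            $V^{\pi}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \big[R(s, \pi(s), s') + \gamma V^{\pi}(s')\big]$
            $\Delta \leftarrow \max(\Delta, |v - V^{\pi}(s)|)$
    until $\Delta < \varepsilon$ (with $\varepsilon$ a small positive threshold)
    (policy improvement)
    stable $\leftarrow$ true
    for each $s \in S$:
        $b \leftarrow \pi(s)$
        $\pi(s) \leftarrow \arg\max_a \sum_{s'} T(s, a, s') \big[R(s, a, s') + \gamma V^{\pi}(s')\big]$
        if $b \neq \pi(s)$: stable $\leftarrow$ false
until stable
return $V^{\pi}$, $\pi$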
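The policy iteration loop (evaluation until $\Delta < \varepsilon$, then greedy improvement until the policy is stable) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: the dictionary layout `T[s][a] = [(probability, next_state, reward), ...]`, the two-state `mdp`, and all function names are hypothetical.

```python
def policy_iteration(T, gamma=0.9, eps=1e-8):
    """T[s][a] is a list of (probability, next_state, reward) triples (assumed layout)."""
    states = list(T)
    V = {s: 0.0 for s in states}
    pi = {s: next(iter(T[s])) for s in states}  # arbitrary initial policy pi_0

    def backup(s, a):
        # Expected one-step return of action a in state s under the current V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])

    while True:
        # Policy evaluation: sweep until the largest change drops below eps.
        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = backup(s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < eps:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            b = pi[s]
            pi[s] = max(T[s], key=lambda a: backup(s, a))
            if b != pi[s]:
                stable = False
        if stable:
            return V, pi


# Toy MDP (assumption): moving from s0 to s1 pays 1, everything else pays 0.
mdp = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 1.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)], "go": [(1.0, "s0", 0.0)]},
}
V, pi = policy_iteration(mdp)
print(pi)  # {'s0': 'go', 's1': 'go'}
```

In this toy problem the agent alternates between the two states to collect the reward repeatedly, so $V(s_0) = 1 / (1 - \gamma^2)$.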
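Value iteration

repeat
    $\Delta \leftarrow 0$
    for each $s \in S$:
        $v \leftarrow V(s)$
        $V(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \big[R(s, a, s') + \gamma V(s')\big]$
        $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \varepsilon$ (with $\varepsilon$ a small positive threshold)
for each $s \in S$:
    $\pi(s) \leftarrow \arg\max_a \sum_{s'} T(s, a, s') \big[R(s, a, s') + \gamma V(s')\big]$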
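Value iteration folds the maximisation over actions into each sweep and extracts the policy once at the end. The sketch below is a hedged illustration: the layout `T[s][a] = [(probability, next_state, reward), ...]` and the stochastic two-state example are assumptions, not taken from the thesis.

```python
def value_iteration(T, gamma=0.9, eps=1e-8):
    """T[s][a] is a list of (probability, next_state, reward) triples (assumed layout)."""
    V = {s: 0.0 for s in T}

    def backup(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a])

    while True:
        delta = 0.0
        for s in T:
            v = V[s]
            V[s] = max(backup(s, a) for a in T[s])  # Bellman optimality backup
            delta = max(delta, abs(v - V[s]))
        if delta < eps:
            break
    # Greedy policy with respect to the converged values.
    pi = {s: max(T[s], key=lambda a: backup(s, a)) for s in T}
    return V, pi


# Toy MDP (assumption): "go" from s0 succeeds with probability 0.8 and pays 1.
noisy_mdp = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)], "go": [(1.0, "s0", 0.0)]},
}
V_star, pi_star = value_iteration(noisy_mdp)
```

Compared with policy iteration, each sweep is cheaper to reason about but the policy is only available after convergence.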
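Temporal-difference learning estimates $V(s)$ or $Q(s,a)$ incrementally, with $s'$ the state reached from $s$ and $\alpha \in [0,1]$ the learning rate:

$$V_{k+1}(s) = V_k(s) + \alpha \cdot \delta_k$$

TD(0) update of $V$ under policy $\pi$:

$$V_{k+1}(s) = V_k(s) + \alpha \underbrace{\big[r(s) + \gamma V_k(s') - V_k(s)\big]}_{\delta_k}$$

On-policy update of $Q$ (SARSA), where $a'$ is the action taken in $s'$:

$$Q_{k+1}(s,a) = Q_k(s,a) + \alpha \big[r(s,a) + \gamma \underbrace{Q_k(s',a')}_{\equiv V^{\pi}(s')} - Q_k(s,a)\big]$$

Off-policy update of $Q$ (Q-learning):

$$Q_{k+1}(s,a) = Q_k(s,a) + \alpha \big[r(s,a) + \gamma \underbrace{\max_b Q_k(s',b)}_{\equiv V^{\pi}(s')} - Q_k(s,a)\big]$$

Each interaction with the environment yields a transition $(s, a, s', r(s,a))$.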
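The two action-value updates differ only in the bootstrap term: SARSA uses the action actually taken in the next state, Q-learning uses the best action there. The sketch below illustrates both; the function names, the `defaultdict` table and the numeric values are assumptions made for the example.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap on max_b Q(s', b). Returns the TD error."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    delta = target - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap on Q(s', a') for the action actually taken."""
    target = r + gamma * Q[(s_next, a_next)]
    delta = target - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta


Q = defaultdict(float)          # tabular Q, unseen pairs default to 0
Q[("s1", "left")] = 1.0
delta_q = q_learning_update(Q, "s0", "right", 0.5, "s1", ["left", "right"])
delta_s = sarsa_update(Q, "s0", "left", 0.5, "s1", "right")
```

Here the Q-learning error is 0.5 + 0.9 * 1.0 = 1.4 because it bootstraps on the best next action, while the SARSA error is 0.5 because the action taken in `s1` has value 0.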
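A model of the task can be estimated by counting, for each pair $(s,a)$, the number $C(s,a,s')$ of observed transitions to each successor state $s'$:

$$C(s,a) = \sum_{u \in S} C(s,a,u)$$

$$P(s' \mid s,a) = T(s,a,s') = \frac{C(s,a,s')}{C(s,a)}$$

$$R(s,a) = \frac{\sum_t r_t(s,a)}{C(s,a)}$$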
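The count-based estimates $T(s,a,s') = C(s,a,s') / C(s,a)$ and $R(s,a) = \sum_t r_t(s,a) / C(s,a)$ can be maintained online. The class below is a minimal sketch; `CountModel` and its method names are hypothetical, and returning 0 for unvisited pairs is an assumption of this example.

```python
from collections import defaultdict


class CountModel:
    """Maximum-likelihood model built from transition counts (illustrative)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # C(s, a, s')
        self.reward_sum = defaultdict(float)                 # sum of r_t(s, a)

    def observe(self, s, a, s_next, r):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def visits(self, s, a):
        # C(s, a) = sum over successors u of C(s, a, u)
        return sum(self.counts[(s, a)].values())

    def transition(self, s, a, s_next):
        n = self.visits(s, a)
        return self.counts[(s, a)][s_next] / n if n else 0.0

    def reward(self, s, a):
        n = self.visits(s, a)
        return self.reward_sum[(s, a)] / n if n else 0.0


model = CountModel()
for _ in range(3):
    model.observe("s0", "go", "s1", 1.0)
model.observe("s0", "go", "s0", 0.0)
print(model.transition("s0", "go", "s1"))  # 0.75
print(model.reward("s0", "go"))            # 0.75
```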
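for all $(s,a) \in S \times A(s)$: initialise $Q(s,a)$ and the model entry for $(s,a)$
repeat
    $s \leftarrow$ current state
    $a \leftarrow$ action selected from $(s, Q)$
    execute $a$; observe the new state $s'$ and the reward $r$
    $Q(s,a) \leftarrow Q(s,a) + \alpha \big[r + \gamma \max_b Q(s',b) - Q(s,a)\big]$
    model entry for $(s,a) \leftarrow s', r$
    repeat $N$ times:
        $s \leftarrow$ a previously observed state
        $a \leftarrow$ an action previously taken in $s$
        $s', r \leftarrow$ model entry for $(s,a)$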
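The loop above interleaves real experience with $N$ simulated updates drawn from a learned model, in the style of Dyna-Q. The sketch below is one possible reading, under several assumptions that are not in the source: the model stores only the last outcome of each $(s,a)$, simulated samples also update $Q$, episodes end in a terminal state, and the `chain_step` environment and all names are hypothetical.

```python
import random
from collections import defaultdict


def dyna_q(step, start, actions, episodes=50, n_planning=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (s, a) -> (r, s_next, done): last observed outcome

    def update(s, a, r, s_next, done):
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            # Epsilon-greedy choice with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            s_next, r, done = step(s, a)
            update(s, a, r, s_next, done)          # learn from real experience
            model[(s, a)] = (r, s_next, done)      # update the model
            for _ in range(n_planning):            # N simulated updates
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s_next
    return Q


def chain_step(s, a):
    """Toy chain 0..3 (assumption): reaching state 3 pays 1 and ends the episode."""
    s_next = min(max(s + a, 0), 3)
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done


Q = dyna_q(chain_step, start=0, actions=[-1, 1])
```

After training, moving right has the higher value in every non-terminal state of the chain.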
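Action selection in state $s$. With $a^{*}$ the greedy action and $X$ a random variable drawn uniformly in $[0,1]$, the $\varepsilon$-greedy rule is:

$$a = \begin{cases} \text{a random action} & \text{if } X \leq \varepsilon \\ \arg\max_a Q(s,a) & \text{otherwise} \end{cases}$$

The softmax rule draws the action from the probability distribution $P$:

$$P(a \mid s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{b \in A} \exp(Q(s,b)/\tau)}$$

where $\tau$ is the temperature: a high $\tau$ makes the actions nearly equiprobable, a low $\tau$ makes the selection nearly greedy.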
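The two selection rules, $\varepsilon$-greedy and softmax with temperature $\tau$, can be sketched as follows. The function names are hypothetical; subtracting the maximum before exponentiating is a standard numerical precaution added here, and the strict comparison differs from $X \leq \varepsilon$ only on an event of probability zero.

```python
import math
import random


def softmax_probabilities(q_values, tau):
    """P(a|s) proportional to exp(Q(s,a)/tau)."""
    m = max(q_values)  # shift by the maximum to avoid overflow
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


print(softmax_probabilities([1.0, 1.0], 1.0))  # [0.5, 0.5]
print(epsilon_greedy([0.0, 2.0, 1.0], 0.0))    # 1 (always greedy when epsilon is 0)
```

Lowering `tau` concentrates the softmax on the best action, for example `softmax_probabilities([1.0, 2.0], 0.01)` puts almost all the mass on the second action.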
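The decisions of $N$ experts can be combined into a preference $p_t(s_t, a[i])$ for each action $a[i]$, where $a^e_t$ is the action chosen by expert $e$ at time $t$ in state $s_t$ and $\pi^e_t$ is the policy of expert $e$:

- Majority voting:

$$p_t(s_t, a[i]) = \sum_{e=1}^{N} I(a[i], a^e_t)$$

  where $I(x,y)$ equals 1 when $x = y$ (here $a[i] = a^e_t$) and 0 otherwise.

- Rank voting, where each expert $e$ gives a weight $w^e_t(a[i])$ to each action according to its own ranking ($|A|$ for its preferred action, $|A| - 1$ for the next one, and so on):

$$p_t(s_t, a[i]) = \sum_{e=1}^{N} w^e_t(a[i])$$

- Boltzmann multiplication:

$$p_t(s_t, a[i]) = \prod_{e=1}^{N} \pi^e_t(s_t, a[i])$$

- Boltzmann addition:

$$p_t(s_t, a[i]) = \sum_{e=1}^{N} \pi^e_t(s_t, a[i])$$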
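Three of the combination rules (indicator sum over chosen actions, sum of expert policies, product of expert policies) can be sketched in a few lines. Actions are represented by their index and each expert policy by a list of probabilities; this representation and the function names are assumptions made for the example.

```python
from math import prod


def majority_voting(chosen_actions, n_actions):
    """p(a[i]) = number of experts whose chosen action is a[i]."""
    prefs = [0] * n_actions
    for a in chosen_actions:
        prefs[a] += 1
    return prefs


def boltzmann_addition(policies):
    """p(a[i]) = sum over experts of pi_e(s, a[i])."""
    return [sum(column) for column in zip(*policies)]


def boltzmann_multiplication(policies):
    """p(a[i]) = product over experts of pi_e(s, a[i])."""
    return [prod(column) for column in zip(*policies)]


experts = [[0.5, 0.5], [0.25, 0.75]]      # two experts, two actions
print(majority_voting([0, 2, 2], 3))      # [1, 0, 2]
print(boltzmann_addition(experts))        # [0.75, 1.25]
print(boltzmann_multiplication(experts))  # [0.125, 0.375]
```

The resulting preferences are not normalised; they would still have to be turned into a choice, for instance by taking the maximum or by a softmax.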
$$Q_{s,a}(q) = P(Q(s,a) = q)$$
$$A(s,a) = Q(s,a) - V(s)$$
a a s
C(s,a,a)=∑ s
a s a a
−C(s,a,a)< ˆRτ (s,a)← (s,{a,a}) (s,{a,a})← (s,a)
[Figure: network mapping the state $S$ (inputs $s_i$) through the weights $W$ to the outputs $Q(S, a_j)$.]
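Habitual expert. Each action $a_j$ has a weight vector $W_j = (w_{0j}, \dots, w_{Nj})$ and its value for the state $S_t$ is the corresponding output activity:

$$Q_t(S_t, a_j) = a^t_j = W^t_j \cdot (S_t, 1)$$

The weights of the executed action $a$ are updated from the reward $r_t$ and the previous state $S_{t-1}$, with learning rate $\alpha_{Hab}$ and discount factor $\gamma_{Hab}$:

$$\delta = r_t + \gamma_{Hab} \cdot \max_b \big(W^{t-1}_b \cdot S_t\big) - \big(W^{t-1}_a \cdot S_{t-1}\big)$$

$$W^t_a = W^{t-1}_a + \alpha_{Hab}\, \delta \Big/ \sum_n s_n$$

Goal-directed expert. For each observed transition $(S, a, S')$, the transition model $T(S,a,S')$ and the reward model of $(S,a)$ are updated with:

$$T_t(S,a,S') = T_{t-1}(S,a,S') + \alpha_{MB} \cdot \big(1 - T_{t-1}(S,a,S')\big)$$

$$R_t(S,a) = r_t$$

$$Q_t(s,a) = \max\Big(r_t(s,a),\ \gamma_{MB} \cdot \sum_{s'} T_{t-1}(s,a,s') \cdot \max_{a'} Q_t(s',a')\Big)$$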
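Entropy of $V_S$ over the set of states $\mathcal{S}$, with $P(S)$ the probability of state $S$, and its normalisation by the maximum entropy $H_{max}$:

$$H(V_S) = -\sum_{S \in \mathcal{S}} P(S) \cdot \log_2(P(S))$$

$$H_{max} = \log_2 |\mathcal{S}|$$

$$R_c = \frac{H(V_S)}{H_{max}}$$

$R_c$ is combined into $R_n$ with a weight $\omega$ that depends on the number of states:

$$R_n = (1 - \omega) + \omega \cdot R_c$$

$$\omega = \frac{1}{1 + e^{-\sigma (|\mathcal{S}| - S_0)}}$$

with $S_0 = 50$ and $\sigma = 0.25$.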
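The normalised entropy $R_c = H / H_{max}$ and its sigmoid-weighted combination $R_n = (1 - \omega) + \omega \cdot R_c$ can be computed as below. This is a sketch under assumptions: the state distribution is estimated from visit counts, the sigmoid form $\omega = 1 / (1 + e^{-\sigma(|S| - S_0)})$ is a reading of the damaged formula, and all function names are hypothetical.

```python
import math


def visit_entropy(counts):
    """Shannon entropy (bits) of the state distribution estimated from counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # 0 * log(0) is taken as 0
    return -sum(p * math.log2(p) for p in probs)


def normalized_entropy(counts):
    """R_c = H / H_max with H_max = log2(number of states); needs >= 2 states."""
    return visit_entropy(counts) / math.log2(len(counts))


def sigmoid_weight(n_states, s0=50, sigma=0.25):
    """Assumed form of omega: logistic function of the number of states."""
    return 1.0 / (1.0 + math.exp(-sigma * (n_states - s0)))


def combined_score(counts, s0=50, sigma=0.25):
    """R_n = (1 - omega) + omega * R_c."""
    w = sigmoid_weight(len(counts), s0, sigma)
    return (1.0 - w) + w * normalized_entropy(counts)


print(normalized_entropy([5, 5, 5, 5]))  # 1.0: uniform visits
print(sigmoid_weight(50))                # 0.5 at the midpoint S_0
```

With few states $\omega$ is close to 0 and $R_n$ stays near 1 whatever the entropy; with many states $R_n$ approaches $R_c$.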
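The perceptions $p_{bt}$ and $p_{bs}$ are stored in the memories $M_{bt}$ and $M_{bs}$, whose first cells are $m_{bt}(0)$ and $m_{bs}(0)$. At each time step $t$ (with $t_{max} = 0.1\,\mathrm{s}$), the memory contents are shifted by one cell and the new perception enters the first cell:

$$m^{t}_{bs,bt}(i) = m^{t-1}_{bs,bt}(i-1) \quad \forall i \in |M|$$

$$m^{t}_{bs,bt}(0) = p_{bs,bt}$$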
[Table residue: parameters $\alpha$, $\gamma$, $\tau$; reward values $R_t = 0$, $R_t = -0.03$ and $R_t = 0.97$; labels LC, DN, PA, $C_c$, $C_a$.]
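Signals available for the experts $E_{Hab}$ and $E_{GD}$:

- the filtered reward $\bar{r}_t$ at time $t$, with filter constant $\lambda$, and its variation $\Delta \bar{r}_t$:

$$\bar{r}_t = (1 - \lambda) \cdot \bar{r}_{t-1} + \lambda \cdot r_t$$

$$\Delta \bar{r}_t = \bar{r}_t - \bar{r}_{t-1}$$

- the entropy $H^{Hab,GD}_t$ of the action distribution of each expert $E$:

$$H^{E}_t(x) = -\sum_{i=0}^{|A|} P_i \cdot \log_2(P_i), \qquad P_i = p(a = a_i \mid s)$$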
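The filtered reward $\bar{r}_t = (1 - \lambda) \cdot \bar{r}_{t-1} + \lambda \cdot r_t$, its variation, and the entropy of an action distribution are cheap to compute online. The sketch below uses hypothetical function names and assumes an initial value $\bar{r}_0 = 0$, which the source does not state.

```python
import math


def filtered_rewards(rewards, lam, r_bar0=0.0):
    """Return a list of (r_bar_t, delta_r_bar_t) pairs for a reward sequence."""
    r_bar, out = r_bar0, []
    for r in rewards:
        new = (1.0 - lam) * r_bar + lam * r  # exponential low-pass filter
        out.append((new, new - r_bar))       # delta = r_bar_t - r_bar_{t-1}
        r_bar = new
    return out


def action_entropy(probs):
    """H = -sum_i P_i * log2(P_i), skipping zero-probability actions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(filtered_rewards([1.0, 1.0], 0.5))  # [(0.5, 0.5), (0.75, 0.25)]
print(action_entropy([0.5, 0.5]))         # 1.0
```

A small $\lambda$ gives a slow, smooth average; a large $\lambda$ tracks recent rewards closely. The entropy is maximal when an expert has no preference between actions.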
[Figure: mean planning time (s) as a function of time (decisions), for the RaZ and Cons. conditions. Panels: RC and SS.]

[Figure: distribution of the evaluated parameter sets, reward variance against cumulative reward. Panels: GD and Hab.]
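Each expert $E$ is given a value $V_E$ computed from its processing time ($T_{Hab}$, $T_{GD}$) and from its error signal ($\delta Q$ for the habitual expert, $\delta P$ for the goal-directed expert), filtered with $\alpha = 0.2$ and $\alpha = 0.02$:

$$V_{Hab} = -(\alpha_{Hab} \cdot \delta Q + \beta_{Hab} \cdot T_{Hab})$$

$$V_{GD} = -(\alpha_{GD} \cdot \delta P + \beta_{GD} \cdot T_{GD})$$

where $\alpha$ and $\beta$ weight the two terms.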
– 30 αGD=1 αHab= 12 β THab TGD , × ,