An Introduction to Transfer Learning and Domain Adaptation
Amaury Habrard
Laboratoire Hubert Curien, UMR CNRS 5516, Université de Saint-Etienne
EPAT 2014
Some resources
List of transfer learning papers:
http://www1.i2r.a-star.edu.sg/~jspan/conferenceTL.htm
List of available software:
http://www.cse.ust.hk/TL/index.html
Surveys:
Patel, Gopalan, Chellappa. Visual Domain Adaptation: An Overview of Recent Advances. Tech report, 2014.
Qi Li. Literature Survey: Domain Adaptation Algorithms for Natural Language Processing. Tech report, 2012.
Margolis. A Literature Review of Domain Adaptation with Unlabeled Data. Tech report, 2011.
Pan and Yang. A Survey on Transfer Learning. TKDE 2010.
Quiñonero-Candela, Sugiyama, Schwaighofer and Lawrence. Dataset Shift in Machine Learning. MIT Press.
Credits and acknowledgments
Documents used for this talk:
D. Xu, K. Saenko, I. Tsang. Tutorial on Domain Transfer Learning for Vision Applications, CVPR’12.
S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI’13.
S. Ben-David. Towards Theoretical Understanding of Domain Adaptation Learning, workshop LNIID at ECML’09.
F. Sha and B. Kingsbury. Domain Adaptation in Machine Learning and Speech Recognition, Tutorial - Interspeech 2012.
K. Grauman. Adaptation for Objects and Attributes, workshop VisDA at ICCV'13.
J. Blitzer and H. Daumé III. Domain Adaptation, Tutorial - ICML 2010.
A. Habrard, JP Peyrache, M. Sebban. Iterative Self-labeling Domain Adaptation for Linear Structured Image Classification, IJAIT 2013.
A. Habrard, JP Peyrache, M. Sebban. Boosting for unsupervised domain adaptation, ECML 2013
Acknowledgments: B. Fernando, P. Germain, E. Morvant, JP Peyrache, M. Sebban.
Transfer Learning
Definition [Pan, TL-IJCAI’13 tutorial]
Ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks
An example
• We have labeled images from a Web image corpus
• Is there a Person in unlabeled images from a Video corpus?
[Figure: labeled Web images ("Person" / "no Person") and an unlabeled video frame with the question "Is there a Person?"]
Outline
1 Introduction/Motivation
2 Reweighting/Instance based methods
3 Theoretical frameworks
4 Feature/projection based methods
5 Adjusting/Iterative methods
6 A quick word on model selection
Introduction
Settings
[Figure: are the training and test samples drawn from the same domain or from different domains?]
Domains are modeled as probability distributions over an instance space
Tasks are associated to a domain (classification, regression, clustering, ...)
Objective: from source to target
⇒ Improve a target predictive function in the target domain using knowledge from the source domain
A Taxonomy of Transfer Learning
“A survey on Transfer Learning” [Pan and Yang, TKDE 2010]
In this presentation
We will focus on domain adaptation
We will focus on classification tasks
⇒ How can we learn, using labeled data from a source distribution, a low-error classifier for another related target distribution?
⇒ "Hot topic" - tutorials at ICML 2010, CVPR 2012, Interspeech 2012; workshops at ICCV 2013, NIPS 2013, ECML 2014
⇒ Motivating examples
A toy problem: Inter-twinning moons
[Figure: target domains obtained by rotating the source two-moons distribution by (a) 10°, (b) 20°, (c) 30°, (d) 40°, (e) 50°, (f) 70°]
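As an aside (not in the original slides), a minimal sketch of how such a benchmark can be generated with scikit-learn: the source sample comes from make_moons and the target sample is the same two-moons distribution rotated by a chosen angle; the target labels are kept only for evaluation.

```python
import numpy as np
from sklearn.datasets import make_moons

def rotated_moons(n=300, angle_deg=30, noise=0.05, seed=0):
    """Source: two inter-twinning moons; target: the same moons rotated by angle_deg."""
    rng = np.random.RandomState(seed)
    Xs, ys = make_moons(n_samples=n, noise=noise, random_state=rng)
    Xt, yt = make_moons(n_samples=n, noise=noise, random_state=rng)
    theta = np.deg2rad(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Xt = Xt @ R.T          # rotate the target domain
    return Xs, ys, Xt, yt  # yt is used only to evaluate the adapted classifier

Xs, ys, Xt, yt = rotated_moons(angle_deg=30)
```

The larger the rotation angle, the harder the adaptation problem, which is why the figure sweeps from 10° to 70°.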
Intuition and motivation from a CV perspective
"Can we train classifiers with Flickr photos, as they have already been collected and annotated, and hope the classifiers still work well on mobile camera images?" [Gong et al., CVPR 2012]
"Object classifiers optimized on a benchmark dataset often exhibit significant degradation in recognition accuracy when evaluated on another one" [Gong et al., ICML 2013; Torralba et al., CVPR 2011; Perronnin et al., CVPR 2010]
"Hot topic" - Visual domain adaptation [Tutorial CVPR'12, ICCV'13]
Brief recap on computer vision issues [Slides from J. Sivic]
[Five slides of illustrations borrowed from J. Sivic's material; the images are not reproduced here]
Problems with data representations
[Xu, Saenko, Tsang, Domain Transfer Tutorial - CVPR'12]
Hard to predict what will change in the new domain
[Xu,Saenko,Tsang, Domain Transfer Tutorial - CVPR’12]
Natural Language Processing
Texts are represented by "words" (Bag of Words)
Part-of-Speech tagging: adapt a tagger learned from medical papers to newswire (Wall Street Journal) or newsgroups
Spam detection: adapt a classifier from one mailbox to another
Sentiment analysis: see the next slides
Domain Adaptation for sentiment analysis
Domain Adaptation for sentiment analysis - ex [Pan-IJCAI’13 tutorial]
Electronics:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never buy HP again.
Video games:
(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) It is so boring. I am extremely unhappy and will probably never buy UbiSoft again.
Source specific: compact, sharp, blurry.
Target specific: hooked, realistic, boring.
Domain independent: good, excited, nice, never buy, unhappy.
Other applications
Speech recognition [Tutorial at Interspeech’12]
Medicine, biology, time series, WiFi localization
Notations
X ⊆ R^d: input space; Y = {−1,+1}: output space
P_S: source domain, a distribution over X × Y; D_S: its marginal distribution over X
P_T: target domain, a different distribution over X × Y; D_T: its marginal distribution over X
H ⊆ Y^X: hypothesis class
Expected error of a hypothesis h : X → Y:
R_{P_S}(h) = E_{(x^s,y^s)∼P_S} I[h(x^s) ≠ y^s]   (source domain error)
R_{P_T}(h) = E_{(x^t,y^t)∼P_T} I[h(x^t) ≠ y^t]   (target domain error)
Domain Adaptation: find h ∈ H with R_{P_T}(h) small, from data drawn from D_T and P_S
Classical result in supervised learning
Empirical error
S = {(x_i^s, y_i^s)}_{i=1}^{m_s} ∼ (P_S)^{m_s}: a labeled sample drawn i.i.d. from P_S
Associated empirical error of a hypothesis h:
R_S(h) = (1/m_s) Σ_{i=1}^{m_s} I[h(x_i^s) ≠ y_i^s]
Classical PAC result (train and test data from the same distribution):
R_{P_S}(h) ≤ R_S(h) + O( complexity(H) / √(m_s) )
⇒ Occam's razor principle
What about R_{P_T}(h) if we have no or very few labeled target data? → try to make use of the source information
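As a minimal illustration (not part of the original slides), the empirical error above is just the average 0-1 loss of h over the labeled sample; a small NumPy sketch with a hypothetical threshold hypothesis:

```python
import numpy as np

def empirical_error(h, X, y):
    """R_S(h): average 0-1 loss of hypothesis h on a labeled sample (X, y)."""
    return np.mean(h(X) != y)

# Example: a linear threshold hypothesis on 1-d inputs, labels in {-1, +1}.
h = lambda X: np.where(X[:, 0] > 0.0, 1, -1)
X = np.array([[-1.0], [0.5], [2.0], [-0.2]])
y = np.array([-1, 1, -1, -1])
print(empirical_error(h, X, y))  # -> 0.25 (one mistake out of four)
```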
Domain Adaptation
Setting
Labeled source sample: S = {(x_i, y_i)}_{i=1}^{m_s} drawn i.i.d. from P_S
Unlabeled target sample: T = {x_j}_{j=1}^{m_t} drawn i.i.d. from D_T
Optional: a few labeled target examples
If h is learned from the source domain, how does it perform on the target domain?
Illustration of the settings
Classical supervised learning: [Diagram: a sample labeled i.i.d. according to P_S is used to learn a model]
Domain adaptation: [Diagram: a labeled sample drawn i.i.d. from P_S and an unlabeled sample drawn i.i.d. from D_T, coming from a different distribution P_T, are both used to learn a model]
A bit of vocabulary
Unsupervised Transfer Learning: no labels at all
Unsupervised DA: source labels available, no target labels
Semi-supervised DA: source labels, a few target labels and a lot of unlabeled target data
≠ Semi-supervised learning: no distribution shift, few labeled data and a lot of unlabeled data from the same domain
Some key points
Estimating the distribution shift
Deriving generalization guarantees: R_{P_T}(h) ≤ R_{P_S}(h) + ?
Characterizing when adaptation is possible
Defining algorithms
Underlying idea: try to bring the two distributions closer
Applying model selection principles
How to tune hyperparameters with no labeled information from the target?
3 main classes of algorithms
Reweighting/Instance-based methods
Correct a sample bias by reweighting the source labeled data: source instances close to target instances are more important.
Feature-based methods / new representation spaces
Find a common space where source and target are close (projection, new features, etc.).
Adjusting/Iterative methods
Modify the model by incorporating pseudo-labeled information.
[Figure, shown with each class: map of recent research — correcting sampling bias [Shimodaira '00; Huang et al., Bickel et al. '07; Sugiyama et al. '08; Sethy et al. '06, '09; This work], adjusting mismatched models [Evgeniou and Pontil '05; Duan et al. '09; Duan et al., Daumé III et al., Saenko et al. '10; Kulis et al., Chen et al. '11], inferring domain-invariant features [Blitzer et al. '06; Daumé III '07; Argyriou et al. '08; Pan et al. '09; Gopalan et al. '11; Chen et al. '12; Gong et al. '12; Muandet et al. '13]]
Reweighting/Instance based Methods
Context
Motivation
Domains share the same support (e.g. the same bag of words)
The distribution shift is caused by a sampling bias / shift between the marginals
Idea
Reweight or select instances to reduce the discrepancy between the source and target domains.
A first analysis
R_{P_T}(h) = E_{(x^t,y^t)∼P_T} I[h(x^t) ≠ y^t]
           = E_{(x^t,y^t)∼P_T} [ (P_S(x^t,y^t) / P_S(x^t,y^t)) I[h(x^t) ≠ y^t] ]
           = Σ_{(x^t,y^t)} P_T(x^t,y^t) (P_S(x^t,y^t) / P_S(x^t,y^t)) I[h(x^t) ≠ y^t]
           = E_{(x^t,y^t)∼P_S} [ (P_T(x^t,y^t) / P_S(x^t,y^t)) I[h(x^t) ≠ y^t] ]
Covariate shift [Shimodaira, '00]
⇒ Assume similar tasks, i.e. P_S(y|x) = P_T(y|x). Then:
R_{P_T}(h) = E_{(x^t,y^t)∼P_S} [ (D_T(x^t) P_T(y^t|x^t)) / (D_S(x^t) P_S(y^t|x^t)) I[h(x^t) ≠ y^t] ]
           = E_{(x^t,y^t)∼P_S} [ (D_T(x^t) / D_S(x^t)) I[h(x^t) ≠ y^t] ]
           = E_{x^t∼D_S} [ (D_T(x^t) / D_S(x^t)) E_{y^t∼P_S(y|x^t)} I[h(x^t) ≠ y^t] ]
⇒ a weighted error on the source domain, with weights ω(x^t) = D_T(x^t) / D_S(x^t)
Idea: reweight the labeled source data according to an estimate of ω(x^t):
E_{(x^t,y^t)∼P_S} [ ω(x^t) I[h(x^t) ≠ y^t] ]
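A minimal numerical sketch of this identity (not from the original slides), assuming, unrealistically, that the two marginals are known 1-d Gaussians and that P(y|x) is shared: the ω-weighted source error and the target error agree up to sampling noise.

```python
import numpy as np

rng = np.random.RandomState(0)

def gauss_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Known 1-d marginals: source N(0,1), target N(1,1); shared labeling P(y|x).
label = lambda x: np.where(x > 0.5, 1, -1)   # true labeling, identical in both domains
h     = lambda x: np.where(x > 0.0, 1, -1)   # a fixed hypothesis to evaluate

xs = rng.normal(0.0, 1.0, size=100000)       # sample from D_S
xt = rng.normal(1.0, 1.0, size=100000)       # sample from D_T

w = gauss_pdf(xs, 1.0) / gauss_pdf(xs, 0.0)  # omega(x) = D_T(x) / D_S(x)

target_err   = np.mean(h(xt) != label(xt))
weighted_err = np.mean(w * (h(xs) != label(xs)))
print(target_err, weighted_err)              # both close to ~0.15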
Illustration
No bias: D_S(x) = D_T(x) ⇒ ω(x) = 1
With bias: D_S(x) ≠ D_T(x) ⇒ ω(x) ≠ 1
Difficult case
No shared support: ∃x, D_S(x) = 0 and D_T(x) ≠ 0
Shared support: D_S(x) = 0 if and only if D_T(x) = 0
Intuition: the quality of the adaptation depends on the magnitude of the weights
How to deal with the sample selection bias?
Setting
A source sample S = {(x_i^s, y_i^s)}_{i=1}^{m_s} and a target sample T = {x_j^t}_{j=1}^{m_t}
Estimate new weights without using labels:
ω̂(x_i^s) = P̂r_T(x_i^s) / P̂r_S(x_i^s)
Learn a classifier by minimizing the ω̂-weighted empirical error:
Σ_{(x_i^s, y_i^s)∈S} ω̂(x_i^s) I[h(x_i^s) ≠ y_i^s]
(Other losses: margin-based hinge loss, least squares)
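In practice, once estimated weights ω̂ are available (from any of the estimators on the next slides), they can be plugged into a standard learner; a minimal scikit-learn sketch (not from the original slides), where the uniform `w_hat` is only a placeholder for the estimated weights:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Toy source sample (labeled) and target sample (unlabeled, shifted marginal).
Xs = rng.randn(300, 2)
ys = np.where(Xs[:, 0] + Xs[:, 1] > 0, 1, -1)
Xt = rng.randn(300, 2) + np.array([1.0, 0.5])

# Placeholder: uniform weights; replace with the estimated importance weights omega_hat.
w_hat = np.ones(len(Xs))

# Most scikit-learn estimators accept per-example weights via `sample_weight`,
# which implements the weighted empirical error above (up to the surrogate loss).
clf = LogisticRegression().fit(Xs, ys, sample_weight=w_hat)
yt_pred = clf.predict(Xt)
```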
Illustration
[Figure: five-step animated illustration on a toy 2-d sample; the images are not recoverable from the extracted text]
Some existing approaches (1/2)
Density estimators
Build density estimators for the source and target domains and estimate the ratio between them - Ex [Sugiyama et al., NIPS'07]:
ω̂(x) = Σ_{l=1}^{b} α_l ψ_l(x)
Learning: argmin_α KL(ω̂ D_S || D_T)
Learn the weights discriminatively [Bickel et al., ICML'07]
Assume D_T(x_i) / D_S(x_i) ∝ 1 / p(q=1 | x, θ)
Label source examples 1 and target examples 0, and train a classifier θ̂ to separate them (e.g. with logistic regression)
Compute the new weights ω̂(x_i^s) = 1 / p(q = 1 | x_i^s; θ̂)
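A minimal sketch of the discriminative estimator above (scikit-learn assumed, not from the original slides): label source points 1 and target points 0, fit a logistic regression to separate them, and take ω̂(x_i^s) = 1 / p(q=1 | x_i^s; θ̂); the clipping constant is an arbitrary choice to avoid division by zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def discriminative_weights(Xs, Xt, C=1.0, eps=1e-6):
    """Bickel-style weights: train a source-vs-target classifier and set
    omega_hat(x_s) = 1 / p(q=1 | x_s), with q=1 for source and q=0 for target."""
    X = np.vstack([Xs, Xt])
    q = np.concatenate([np.ones(len(Xs)), np.zeros(len(Xt))])
    theta_hat = LogisticRegression(C=C).fit(X, q)
    p_source = theta_hat.predict_proba(Xs)[:, 1]   # column 1 = class q=1 (classes are sorted)
    return 1.0 / np.clip(p_source, eps, None)

# Example with a shifted target marginal; w_hat can then be used as sample_weight.
rng = np.random.RandomState(0)
Xs = rng.randn(300, 2)
Xt = rng.randn(300, 2) + np.array([1.0, 0.0])
w_hat = discriminative_weights(Xs, Xt)
```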
Some existing approaches (2/2)
Kernel Mean Matching [Huang et al., NIPS'06]
Maximum Mean Discrepancy:
MMD(S,T) = || (1/m_s) Σ_{i=1}^{m_s} φ(x_i^s) − (1/m_t) Σ_{j=1}^{m_t} φ(x_j^t) ||_H
KMM reweights the source points so that the weighted source mean matches the target mean in the RKHS:
min_β || (1/m_s) Σ_{i=1}^{m_s} β(x_i^s) φ(x_i^s) − (1/m_t) Σ_{j=1}^{m_t} φ(x_j^t) ||_H
s.t. β(x_i^s) ∈ [0, B] and | (1/m_s) Σ_{i=1}^{m_s} β(x_i^s) − 1 | < ε
Equivalent quadratic program:
min_β (1/2) β^T K β − κ_{S,T}^T β   s.t. β_i ∈ [0, B] and | Σ_{i=1}^{m_s} β(x_i^s) − m_s | < m_s ε
Guarantees [Gretton et al., 2008] - under covariate shift assumptions:
| R_{P_T}(h) − R_S^ω(h) | < √( (O(1/δ) + O(max_x ω(x)^2)) / m_s ) + C
and
|| (1/m_s) Σ_{i=1}^{m_s} ω(x_i^s) φ(x_i^s) − (1/m_t) Σ_{j=1}^{m_t} φ(x_j^t) || ≤ O( (1/δ) √( ω_max^2 / m_s + 1 / m_t ) )
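A minimal sketch (not from the original slides) of the MMD quantity that KMM minimizes, with a Gaussian RBF kernel as a concrete choice of φ; solving the quadratic program above for the weights β is omitted here.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=1.0):
    """(Biased) estimate of the squared MMD between the two samples in the kernel's RKHS."""
    Kss = rbf_kernel(Xs, Xs, gamma)
    Ktt = rbf_kernel(Xt, Xt, gamma)
    Kst = rbf_kernel(Xs, Xt, gamma)
    return Kss.mean() + Ktt.mean() - 2 * Kst.mean()

rng = np.random.RandomState(0)
Xs = rng.randn(200, 2)
Xt = rng.randn(200, 2) + np.array([1.0, 0.0])
print(mmd2(Xs, Xt), mmd2(Xs, rng.randn(200, 2)))   # the shifted pair gives a larger MMD
```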
Bad news
DA is hard, even under covariate shift [Ben-David et al., ALT'12]
⇒ To learn a classifier, the number of examples needed depends on |H| (finite case) or grows exponentially with the dimension of X
The covariate shift assumption may fail: tasks are not similar in general, P_S(y|x) ≠ P_T(y|x)
We did not consider the hypothesis space.
Can we define a general theory for DA?
Theoretical frameworks for Domain Adaptation
A key point: estimating the distribution shift
First idea: total variation distance
d_{L1}(D_S, D_T) = sup_{B⊆X} |D_S(B) − D_T(B)|   (subset of points maximizing the divergence)
But:
Not computable in general, and thus not estimable from finite samples
Not related to the hypothesis class: it does not characterize the difficulty of the problem for H
The H∆H-divergence [Ben-David et al., NIPS'06; MLJ'10]
Definition
d_{H∆H}(D_S, D_T) = 2 sup_{(h,h')∈H^2} | R_{D_T}(h,h') − R_{D_S}(h,h') |
                  = 2 sup_{(h,h')∈H^2} | E_{x^t∼D_T} I[h(x^t) ≠ h'(x^t)] − E_{x^s∼D_S} I[h(x^s) ≠ h'(x^s)] |
[Figure: illustration with only two hypotheses h and h' in H]
Note: with a larger H, the distance will be high, since we can more easily find two hypotheses able to distinguish the two domains
Computable from samples
Consider two unlabeled samples S, T, each of size m, drawn from D_S and D_T:
d_{H∆H}(D_S, D_T) ≤ d_{H∆H}(S, T) + O( complexity(H) √(log(m)/m) )
complexity(H): VC dimension [Ben-David et al., '06; '10] or Rademacher complexity [Mansour et al., '09]
Empirical estimation:
d̂_{H∆H}(S, T) = 2 ( 1 − min_{h∈H} [ (1/m) Σ_{x: h(x)=−1} I[x ∈ S] + (1/m) Σ_{x: h(x)=+1} I[x ∈ T] ] )
⇒ Already seen: label source examples as +1 and target ones as −1, and try to learn a classifier in H minimizing the associated empirical error (the bracketed term above)
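A minimal sketch (not from the original slides) of how this empirical divergence is usually approximated in practice, sometimes called the proxy A-distance: train a domain classifier to separate source from target on half of the data, measure its errors on the held-out half, and plug them into the formula above. Logistic regression stands in for the minimization over H, so the value is only an approximation (and can be slightly negative when the domains overlap).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def empirical_hdh_divergence(Xs, Xt, C=1.0, seed=0):
    """Proxy estimate of d_{H Delta H}(S, T): train a linear domain classifier to
    separate source (+1) from target (-1) and plug its held-out errors into
    2 * (1 - (err_on_source + err_on_target))."""
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.ones(len(Xs)), -np.ones(len(Xt))])   # domain labels
    X_tr, X_te, d_tr, d_te = train_test_split(
        X, d, test_size=0.5, random_state=seed, stratify=d)
    h = LogisticRegression(C=C).fit(X_tr, d_tr)
    pred = h.predict(X_te)
    err_s = np.mean(pred[d_te == 1] != 1)     # source points misclassified
    err_t = np.mean(pred[d_te == -1] != -1)   # target points misclassified
    return 2.0 * (1.0 - (err_s + err_t))

rng = np.random.RandomState(0)
Xs = rng.randn(400, 2)
Xt_far = rng.randn(400, 2) + np.array([3.0, 0.0])
print(empirical_hdh_divergence(Xs, Xt_far))             # close to 2: domains easy to separate
print(empirical_hdh_divergence(Xs, rng.randn(400, 2)))  # close to 0: domains overlap
```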
Going to a generalization bound
Preliminaries
Disagreement between two hypotheses: R_{P_T}(h,h') = E_{(x,y)∼P_T} I[h(x) ≠ h'(x)] = E_{x∼D_T} I[h(x) ≠ h'(x)]
R_{P_T} (and R_{P_S}) fulfills the triangle inequality
|R_{P_T}(h,h') − R_{P_S}(h,h')| ≤ (1/2) d_{H∆H}(D_S, D_T), since d_{H∆H}(D_S, D_T) = 2 sup_{(h,h')∈H^2} | R_{D_T}(h,h') − R_{D_S}(h,h') |
h*_S = argmin_{h∈H} R_{P_S}(h): best hypothesis on the source
h*_T = argmin_{h∈H} R_{P_T}(h): best hypothesis on the target
Ideal joint hypothesis: h* = argmin_{h∈H} R_{P_S}(h) + R_{P_T}(h);  λ = R_{P_S}(h*) + R_{P_T}(h*)
A first bound
R_{P_T}(h) ≤ R_{P_T}(h*) + R_{P_T}(h, h*)
           ≤ R_{P_T}(h*) + R_{P_S}(h, h*) + R_{P_T}(h, h*) − R_{P_S}(h, h*)
           ≤ R_{P_T}(h*) + R_{P_S}(h, h*) + |R_{P_T}(h, h*) − R_{P_S}(h, h*)|
           ≤ R_{P_T}(h*) + R_{P_S}(h, h*) + (1/2) d_{H∆H}(D_S, D_T)
           ≤ R_{P_T}(h*) + R_{P_S}(h) + R_{P_S}(h*) + (1/2) d_{H∆H}(D_S, D_T)
           ≤ R_{P_S}(h) + (1/2) d_{H∆H}(D_S, D_T) + λ
           ≤ R_S(h) + (1/2) d_{H∆H}(S, T) + O( complexity(H) √(log(m)/m) ) + λ
Main theoretical bound
Theorem [Ben-David et al., MLJ'10; NIPS'06]
Let H be a symmetric hypothesis space. If D_S and D_T are respectively the marginal distributions of the source and target instances, then for all δ ∈ (0,1], with probability at least 1−δ:
∀h ∈ H,  R_{P_T}(h) ≤ R_{P_S}(h) + (1/2) d_{H∆H}(D_S, D_T) + λ
Formalizes a natural approach: bring the two distributions closer while ensuring a low error on the source domain.
Justifies many algorithms: reweighting methods, feature-based methods, adjusting/iterative methods.
Another analysis [Mansour et al., COLT'09]
R_{P_T}(h) ≤ R_{P_T}(h, h*_S) + R_{P_T}(h*_S, h*_T) + R_{P_T}(h*_T)
           = R_{P_T}(h, h*_S) + ν
           ≤ R_{P_S}(h, h*_S) + R_{P_T}(h, h*_S) − R_{P_S}(h, h*_S) + ν
           ≤ R_{P_S}(h, h*_S) + |R_{P_T}(h, h*_S) − R_{P_S}(h, h*_S)| + ν
           ≤ R_{P_S}(h, h*_S) + (1/2) d_{H∆H}(D_S, D_T) + ν
          (≤ R_{P_S}(h) + (1/2) d_{H∆H}(D_S, D_T) + R_{P_S}(h*_S) + ν,  if h*_S is not the true labeling function)
This analysis can lead to a smaller constant when adaptation is possible.
It leads to the same type of bound; only the constant changes.
Characterization of the possibility of domain adaptation
The constants characterize when adaptation is possible:
λ = R_{P_S}(h*) + R_{P_T}(h*), with h* = argmin_{h∈H} R_{P_S}(h) + R_{P_T}(h): there must exist an ideal joint hypothesis with small error
ν = R_{P_T}(h*_S, h*_T) + R_{P_T}(h*_T): there must exist a very good hypothesis on the target, and the best hypothesis on the source must be close to the best one on the target w.r.t. D_T
Other settings
Discrepancy [Mansour et al., COLT'09]
Instead of the 0-1 loss, more general loss functions ℓ (e.g. Lp losses):
disc_ℓ(D_S, D_T) = sup_{h,h'∈H} | R^ℓ_{D_S}(h,h') − R^ℓ_{D_T}(h,h') |
This discrepancy can be minimized and used as a reweighting method ([Mansour et al., COLT'09] - e.g. solvable in polynomial time for the L2 loss)
Using some target labeled data:
Weighting the empirical source and target risks [Ben-David et al., 2010]
Using a divergence taking target labels into account [Zhang et al., NIPS'12] (a divergence must take into account the marginals over X and Y; the λ constant accounts for Y)
Other settings
Averaged quantities [Germain et al., ICML'13]
Consider a probability distribution ρ (posterior) over H to learn, and the following risk: E_{h∼ρ} R_{P_T}(h)
Definition of an averaged distance: dis_ρ(D_S, D_T) = E_{h,h'∼ρ^2} [ R_{D_T}(h,h') − R_{D_S}(h,h') ]
Similar generalization bound:
E_{h∼ρ} R_{P_T}(h) ≤ E_{h∼ρ} R_{P_S}(h) + dis_ρ(D_S, D_T) + λ_ρ*
Estimation from samples controlled by PAC-Bayesian theory
The bound is tighter: no supremum
Feature/Projection based Approaches