
(1)

An Introduction to Transfer Learning and Domain Adaptation

Amaury Habrard

Laboratoire Hubert Curien, UMR CNRS 5516, Université de Saint-Étienne

EPAT 2014


(2)

Some resources

List of transfer learning papers:
http://www1.i2r.a-star.edu.sg/~jspan/conferenceTL.htm

List of available software:
http://www.cse.ust.hk/TL/index.html

Surveys:

Patel, Gopalan, Chellappa. Visual Domain Adaptation: An Overview of Recent Advances. Tech report, 2014.

Qi Li. Literature Survey: Domain Adaptation Algorithms for Natural Language Processing. Tech report, 2012.

Margolis. A Literature Review of Domain Adaptation with Unlabeled Data. Tech report, 2011.

Pan and Yang. A Survey on Transfer Learning. TKDE 2010.

J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer and N.D. Lawrence. Dataset Shift in Machine Learning. MIT Press.

(3)

Credits and acknowledgments

Documents used for this talk:

D. Xu, K. Saenko, I. Tsang. Tutorial on Domain Transfer Learning for Vision Applications, CVPR'12.

S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI'13.

S. Ben-David. Towards Theoretical Understanding of Domain Adaptation Learning, workshop LNIID at ECML'09.

F. Sha and B. Kingsbury. Domain Adaptation in Machine Learning and Speech Recognition, Tutorial - Interspeech 2012.

K. Grauman. Adaptation for Objects and Attributes, workshop VisDA at ICCV'13.

J. Blitzer and H. Daumé III. Domain Adaptation, Tutorial - ICML 2010.

A. Habrard, JP Peyrache, M. Sebban. Iterative Self-labeling Domain Adaptation for Linear Structured Image Classification, IJAIT 2013.

A. Habrard, JP Peyrache, M. Sebban. Boosting for Unsupervised Domain Adaptation, ECML 2013.

Acknowledgments: B. Fernando, P. Germain, E. Morvant, JP Peyrache, M. Sebban.

(4)

Transfer Learning

Definition [Pan, TL-IJCAI’13 tutorial]

Ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks

An example

• We have labeled images from a Web image corpus

• Is there a Person in unlabeled images from a Video corpus?

[Figure: labeled Web images ("Person" / "no Person") and an unlabeled video frame: "Is there a Person?"]

(5)

Outline

1 Introduction/Motivation

2 Reweighting/Instance based methods

3 Theoretical frameworks

4 Feature/projection based methods

5 Adjusting/Iterative methods

6 A quick word on model selection


(6)

Introduction


(7)

Settings

[Figure: two settings: test examples drawn from the same domain vs. from a different domain]

Domains are modeled as probability distributions over an instance space.

Tasks are associated with a domain (classification, regression, clustering, ...).

Objective: from source to target
⇒ Improve a target predictive function in the target domain using knowledge from the source domain.

(8)

A Taxonomy of Transfer Learning

“A survey on Transfer Learning” [Pan and Yang, TKDE 2010]


(9)

In this presentation

We will focus on domain adaptation, and on classification tasks.

⇒ How can we learn, using labeled data from a source distribution, a low-error classifier for another, related target distribution?

⇒ "Hot topic": tutorials at ICML 2010, CVPR 2012, Interspeech 2012; workshops at ICCV 2013, NIPS 2013, ECML 2014.

⇒ Motivating examples.

(10)

A toy problem: Inter-twinning moons

[Figure: six panels of the inter-twinning moons problem, (a) 10, (b) 20, (c) 30, (d) 40, (e) 50, (f) 70: increasing rotations of the target data]

(11)

Intuition and motivation from a CV perspective

"Can we train classifiers with Flickr photos, as they have already been collected and annotated, and hope the classifiers still work well on mobile camera images?" [Gong et al., CVPR 2012]

"Object classifiers optimized on a benchmark dataset often exhibit significant degradation in recognition accuracy when evaluated on another one" [Gong et al., ICML 2013; Torralba et al., CVPR 2011; Perronnin et al., CVPR 2010]

"Hot topic": visual domain adaptation [tutorial at CVPR'12, workshop at ICCV'13]

(12)

Brief recap on computer vision issues [Slides from J. Sivic]


(13)

Brief recap on computer vision issues (2) [Slides from J. Sivic]

(14)

Brief recap on computer vision issues (3) [Slides from J. Sivic]

(15)

Brief recap on computer vision issues (4) [Slides from J. Sivic]

(16)

Brief recap on computer vision issues (5) [Slides from J. Sivic]

(17)

Problems with data representations

[Xu, Saenko, Tsang. Domain Transfer Tutorial - CVPR'12]

(18)

Hard to predict what will change in the new domain

[Xu, Saenko, Tsang. Domain Transfer Tutorial - CVPR'12]

(19)

Natural Language Processing

Texts are represented by "words" (bag of words).

Part-of-speech tagging: adapt a tagger learned from medical papers to a newspaper (Wall Street Journal) or to newsgroups.

Spam detection: adapt a classifier from one mailbox to another.

Sentiment analysis:

(20)

Domain Adaptation for sentiment analysis


(21)

Domain Adaptation for sentiment analysis - ex [Pan-IJCAI’13 tutorial]

Electronics (reviews 1, 3, 5) vs. Video games (reviews 2, 4, 6)

(1) Compact; easy to operate; very good picture quality; looks sharp!

(2) A very good game! It is action packed and full of excitement. I am very much hooked on this game.

(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.

(4) Very realistic shooting action and good plots. We played this and were hooked.

(5) It is also quite blurry in very dark settings. I will never buy HP again.

(6) It is so boring. I am extremely unhappy and will probably never buy UbiSoft again.

Source specific: compact, sharp, blurry.

Target specific: hooked, realistic, boring.

Domain independent: good, excited, nice, never buy, unhappy.

(22)

Other applications

Speech recognition [Tutorial at Interspeech’12]

Medicine, biology, time series, Wi-Fi localization, ...

(23)

Notations

$X \subseteq \mathbb{R}^d$: input space; $Y = \{-1,+1\}$: output space.
$P_S$: source domain, a distribution over $X \times Y$; $D_S$: its marginal distribution over $X$.
$P_T$: target domain, a different distribution over $X \times Y$; $D_T$: its marginal distribution over $X$.
$H \subseteq Y^X$: hypothesis class.

Expected error of a hypothesis $h: X \to Y$:
$R_{P_S}(h) = \mathbb{E}_{(x^s,y^s)\sim P_S}\, \mathbb{I}\left[h(x^s) \neq y^s\right]$ (source domain error)
$R_{P_T}(h) = \mathbb{E}_{(x^t,y^t)\sim P_T}\, \mathbb{I}\left[h(x^t) \neq y^t\right]$ (target domain error)

Domain Adaptation: find $h \in H$ with $R_{P_T}(h)$ small, from data drawn from $D_T$ and $P_S$.
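To make the notation concrete, here is a minimal sketch (not part of the original slides) of the empirical counterparts of these risks; the data and the hypothesis h are illustrative assumptions:

```python
import numpy as np

def empirical_risk(h, X, y):
    """Empirical 0-1 risk: the fraction of points where h(x) != y."""
    return float(np.mean(h(X) != y))

# Illustrative data: a hypothetical 2-D problem with labels in {-1, +1}.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(100, 2))                 # source inputs ~ D_S
ys = np.where(Xs[:, 0] + Xs[:, 1] > 0, 1, -1)  # source labels

h = lambda X: np.where(X[:, 0] > 0, 1, -1)     # some hypothesis in H
print("R_S(h) ~", empirical_risk(h, Xs, ys))
```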

(24)

Classical result in supervised learning

Empirical error

$S = \{(x_i^s, y_i^s)\}_{i=1}^{m_s} \sim (P_S)^{m_s}$: a labeled sample drawn i.i.d. from $P_S$.

Associated empirical error of a hypothesis $h$:
$\hat{R}_S(h) = \frac{1}{m_s} \sum_{i=1}^{m_s} \mathbb{I}\left[h(x_i^s) \neq y_i^s\right]$

Classical PAC result (train and test data from the same distribution):
$R_{P_S}(h) \leq \hat{R}_S(h) + O\!\left(\frac{\mathrm{complexity}(H)}{\sqrt{m_s}}\right)$

⇒ Occam's razor principle.

What about $R_{P_T}$ if we have no or very few labeled target data? → try to make use of the source information.

(25)

Domain Adaptation

Setting

Labeled source sample: $S = \{(x_i^s, y_i^s)\}_{i=1}^{m_s}$, drawn i.i.d. from $P_S$.

Unlabeled target sample: $T = \{x_j^t\}_{j=1}^{m_t}$, drawn i.i.d. from $D_T$.

Optional: a few labeled target examples.

If $h$ is learned from the source domain, how does it perform on the target domain?

(27)

Illustration settings

Classical supervised learning: a sample, labeled and drawn i.i.d. from the distribution $P_S$, is fed to a learning algorithm that outputs a model.

Domain adaptation: a labeled sample drawn i.i.d. from $P_S$ is used together with an unlabeled sample drawn i.i.d. from a different target distribution ($P_T$, marginal $D_T$) to learn the model.

[Figure: the two learning pipelines side by side; original captions in French]

(28)

A bit of vocabulary

Unsupervised transfer learning: no labels at all.

Unsupervised DA: source labels, no target labels.

Semi-supervised DA: source labels, a few target labels and a lot of unlabeled target data.

≠ Semi-supervised learning: no distribution shift; few labeled data and a lot of unlabeled data from the same domain.

(29)

Some key points

Estimating the distribution shift.

Deriving generalization guarantees: $R_{P_T}(h) \leq\ ?\,R_{P_S}(h)\,?\, +\, ?$

Characterizing when adaptation is possible.

Defining algorithms. Underlying idea: try to bring the two distributions closer.

Applying model selection principles: how to tune hyperparameters with no labeled information from the target?

(33)

3 main classes of algorithms

Reweighting/instance-based methods: correct a sample bias by reweighting the source labeled data; source instances close to target instances are more important.
Correcting sampling bias: [Shimodaira '00], [Huang et al., Bickel et al. '07], [Sugiyama et al. '08], [Sethy et al. '06, '09].

Feature-based methods / new representation spaces: find a common space where source and target are close (projection, new features, etc.).
Inferring domain-invariant features: [Blitzer et al. '06], [Daumé III '07], [Argyriou et al. '08], [Pan et al. '09], [Gopalan et al. '11], [Chen et al. '12], [Gong et al. '12], [Muandet et al. '13].

Adjusting/iterative methods: modify the model by incorporating pseudo-labeled information.
Adjusting mismatched models: [Evgeniou and Pontil '05], [Duan et al. '09], [Duan et al., Daumé III et al., Saenko et al. '10], [Kulis et al., Chen et al. '11].

(34)

Reweighting/Instance based Methods


(35)

Context

Motivation

Domains share the same support (e.g. the same bag of words).

The distribution shift is caused by a sampling bias / a shift between the marginals.

Idea

Reweight or select instances to reduce the discrepancy between the source and target domains.

(36)

A first analysis

$R_{P_T}(h) = \mathbb{E}_{(x^t,y^t)\sim P_T}\, \mathbb{I}\left[h(x^t)\neq y^t\right]$

$= \mathbb{E}_{(x^t,y^t)\sim P_T}\, \frac{P_S(x^t,y^t)}{P_S(x^t,y^t)}\, \mathbb{I}\left[h(x^t)\neq y^t\right]$

$= \sum_{(x^t,y^t)} P_S(x^t,y^t)\, \frac{P_T(x^t,y^t)}{P_S(x^t,y^t)}\, \mathbb{I}\left[h(x^t)\neq y^t\right]$

$= \mathbb{E}_{(x^t,y^t)\sim P_S}\, \frac{P_T(x^t,y^t)}{P_S(x^t,y^t)}\, \mathbb{I}\left[h(x^t)\neq y^t\right]$

(40)

Covariate shift [Shimodaira,’00]

⇒ Assume similar tasks, PS(y|x) =PT(y|x), then:

= E

(xt,yt)∼PS

DT(xt)PT(yt|xt) DS(xt)PS(yt|xt)I

h(xt)6=yt

= E

(xt,yt)∼PS

DT(xt) DS(xt)I

h(xt)6=yt

= E

(xt)∼DS

DT(xt) DS(xt) E

yt∼PS(yt|xt) }I

h(xt)6=yt

⇒ weighted erroron the source domain: ω(xt) = DDT(xt)

S(xt)

Idea reweight labeled sourcedata according to an estimate of ω(xt):

E

(xt,yt)∼PS ω(xt)I

h(xt)6=yt

(LaHC) Domain Adaptation - EPAT’14 33 / 95
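As a minimal sketch of this reweighting idea (not from the slides): on synthetic data where both densities are known, the exact weights $\omega(x) = D_T(x)/D_S(x)$ can be computed and handed to any learner that accepts per-example weights; in practice $\omega$ must be estimated, as discussed later.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two domains with the same labeling rule P(y|x) but shifted marginals.
mu_s, mu_t = np.array([0.0, 0.0]), np.array([1.5, 1.0])
Xs = rng.multivariate_normal(mu_s, np.eye(2), size=500)   # source sample
ys = np.where(Xs.sum(axis=1) > 1.0, 1, -1)                # shared P(y|x)

# Exact importance weights w(x) = D_T(x) / D_S(x) (known here by design).
w = mvn.pdf(Xs, mean=mu_t, cov=np.eye(2)) / mvn.pdf(Xs, mean=mu_s, cov=np.eye(2))

# Weighted ERM: minimize the w-weighted source error.
clf = LogisticRegression().fit(Xs, ys, sample_weight=w)
```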

(41)

Illustration

No bias: $D_S(x) = D_T(x) \Rightarrow \omega(x) = 1$.

With bias: $D_S(x) \neq D_T(x) \Rightarrow \omega(x) \neq 1$.

(42)

Difficult case

No shared support: $\exists x,\ D_S(x) = 0$ and $D_T(x) \neq 0$.

Shared support: $D_S(x) = 0$ if and only if $D_T(x) = 0$.

Intuition: the quality of the adaptation depends on the magnitude of the weights.
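A small numeric sketch of this intuition (illustrative, not from the slides): for two Gaussian marginals, the weights stay moderate when the supports largely overlap and explode as the domains drift apart.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-2.0, 6.0, 9)                 # a few evaluation points
for shift in (0.5, 2.0, 4.0):                 # increasing domain shift
    # w(x) = D_T(x) / D_S(x) for N(shift, 1) over N(0, 1)
    w = norm.pdf(x, loc=shift) / norm.pdf(x, loc=0.0)
    print(f"shift={shift}: max weight = {w.max():.1e}")
# A moderate shift gives usable weights; a large shift makes a handful of
# source points dominate the weighted error, so estimates become unreliable.
```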

(43)

How to deal with the sample selection bias?

Setting

A source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^{m_s}$ and a target sample $T = \{x_j^t\}_{j=1}^{m_t}$.

1. Estimate new weights without using labels: $\hat\omega(x_i^s) = \frac{\hat{\Pr}_T(x_i^s)}{\hat{\Pr}_S(x_i^s)}$

2. Learn a classifier by minimizing the source error reweighted by $\hat\omega$:
$\sum_{(x_i^s, y_i^s)\in S} \hat\omega(x_i^s)\, \mathbb{I}\left[h(x_i^s)\neq y_i^s\right]$

(Other losses are possible: margin-based hinge loss, least squares.)

(44)

Illustration

[Figure: instance reweighting on a toy +/- example, revealed incrementally over five slides]

(49)

Some existing approaches (1/2)

Density estimators

Build density estimators for the source and target domains and estimate the ratio between them. Example [Sugiyama et al., NIPS'07]: model the weights as $\hat\omega(x) = \sum_{l=1}^{b} \alpha_l \psi_l(x)$ and learn $\operatorname{argmin}_\alpha \mathrm{KL}(\hat\omega D_S, D_T)$.

Learn the weights discriminatively [Bickel et al., ICML'07]

Assume $\frac{D_T(x_i)}{D_S(x_i)} \propto \frac{1}{p(q=1 \mid x_i, \theta)}$.

Label the source examples 1 and the target examples 0, and train a classifier $\hat\theta$ to separate them (e.g. with logistic regression).

Compute the new weights $\hat\omega(x_i^s) = \frac{1}{p(q=1 \mid x_i^s; \hat\theta)}$.
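A brief sketch of the discriminative approach above, under the stated assumption that the weights are proportional to $1/p(q=1|x)$; the function and variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def discriminative_weights(Xs, Xt, clip=50.0):
    """Weights for the source sample, up to a multiplicative constant."""
    X = np.vstack([Xs, Xt])
    q = np.concatenate([np.ones(len(Xs)), np.zeros(len(Xt))])  # 1=source, 0=target
    theta = LogisticRegression().fit(X, q)        # models p(q=1 | x; theta)
    p_source = theta.predict_proba(Xs)[:, 1]      # column 1 = class q=1
    w = 1.0 / np.maximum(p_source, 1e-12)
    return np.minimum(w, clip)                    # clip to tame extreme weights
```

The returned weights can then be plugged into the weighted source error of the previous slides, e.g. via sample_weight.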

(50)

Some existing approaches (2/2)

Kernel Mean Matching [Huang et al., NIPS'06]

Maximum Mean Discrepancy:
$\mathrm{MMD}(S,T) = \left\| \frac{1}{m_s}\sum_{i=1}^{m_s}\phi(x_i^s) - \frac{1}{m_t}\sum_{j=1}^{m_t}\phi(x_j^t) \right\|_{\mathcal H}$

KMM reweights the source sample so as to match the target mean in feature space:
$\min_\beta \left\| \frac{1}{m_s}\sum_{i=1}^{m_s}\beta(x_i^s)\,\phi(x_i^s) - \frac{1}{m_t}\sum_{j=1}^{m_t}\phi(x_j^t) \right\|_{\mathcal H}$
s.t. $\beta(x_i^s) \in [0,B]$ and $\left| \frac{1}{m_s}\sum_{i=1}^{m_s}\beta(x_i^s) - 1 \right| < \epsilon$

Equivalently, a quadratic program:
$\min_\beta \frac{1}{2}\beta^\top K \beta - \kappa^\top \beta$ s.t. $\beta_i \in [0,B]$ and $\left| \sum_{i=1}^{m_s}\beta(x_i^s) - m_s \right| < m_s\epsilon$,
with $K_{ij} = k(x_i^s, x_j^s)$ and $\kappa_i = \frac{m_s}{m_t}\sum_{j=1}^{m_t} k(x_i^s, x_j^t)$.

Guarantees [Gretton et al., 2008], under covariate shift assumptions:
$\left| R_{P_T}(h) - \hat R_{S,\omega}(h) \right| < \sqrt{\frac{O(1/\delta) + O(\max_x \omega(x)^2)}{m_s}} + C$, and
$\left\| \frac{1}{m_s}\sum_{i=1}^{m_s}\omega(x_i^s)\,\phi(x_i^s) - \frac{1}{m_t}\sum_{j=1}^{m_t}\phi(x_j^t) \right\|_{\mathcal H} \le O\!\left((1/\delta)\sqrt{\omega_{\max}^2/m_s + 1/m_t}\right)$
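A compact sketch of the KMM quadratic program above, using an RBF kernel and scipy's SLSQP solver for brevity (a dedicated QP solver such as cvxopt is the more usual choice); gamma, B and eps are illustrative hyperparameter choices:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(Xs, Xt, gamma=1.0, B=10.0, eps=0.1):
    ms, mt = len(Xs), len(Xt)
    K = rbf_kernel(Xs, Xs, gamma=gamma)                 # K_ij = k(x_i^s, x_j^s)
    kappa = (ms / mt) * rbf_kernel(Xs, Xt, gamma=gamma).sum(axis=1)

    objective = lambda b: 0.5 * b @ K @ b - kappa @ b   # the QP objective
    grad = lambda b: K @ b - kappa
    cons = (  # |sum_i beta_i - ms| <= ms * eps, split into two inequalities
        {"type": "ineq", "fun": lambda b: ms * (1 + eps) - b.sum()},
        {"type": "ineq", "fun": lambda b: b.sum() - ms * (1 - eps)},
    )
    res = minimize(objective, x0=np.ones(ms), jac=grad, method="SLSQP",
                   bounds=[(0.0, B)] * ms, constraints=cons)
    return res.x    # beta(x_i^s): one weight per source example
```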

(51)

Bad news

DA is hard, even under covariate shift [Ben-David et al., ALT'12]
⇒ the number of examples needed to learn a classifier depends on $|H|$ (finite case) or grows exponentially with the dimension of $X$.

The covariate shift assumption may fail: tasks are not similar in general, $P_S(y|x) \neq P_T(y|x)$.

We did not consider the hypothesis space.

Can we define a general theory for DA?

(52)

Theoretical frameworks for Domain Adaptation


(53)

A key point: estimating the distribution shift

First idea: the total variation distance
$d_{L_1}(D_S, D_T) = \sup_{B\subseteq X} |D_S(B) - D_T(B)|$
(the subset of points maximizing the divergence)

But:
- it is not computable in general, and thus not estimable from finite samples;
- it is not related to the hypothesis class: it does not characterize the difficulty of the problem for $H$.

(55)

The H∆H-divergence [Ben-David et al., NIPS'06; MLJ'10]

Definition

$d_{H\Delta H}(D_S, D_T) = 2 \sup_{(h,h')\in H^2} \left| R_{D_T}(h,h') - R_{D_S}(h,h') \right|$
$= 2 \sup_{(h,h')\in H^2} \left| \mathbb{E}_{x^t\sim D_T}\, \mathbb{I}\left[h(x^t)\neq h'(x^t)\right] - \mathbb{E}_{x^s\sim D_S}\, \mathbb{I}\left[h(x^s)\neq h'(x^s)\right] \right|$

[Illustration with only two hypotheses h and h' in H]

Note: with a larger H, the distance will be high, since we can easily find two hypotheses able to distinguish the two domains.

(57)

Computable from samples

Consider two samples S, T of size m drawn from $D_S$ and $D_T$:

$d_{H\Delta H}(D_S, D_T) \le \hat d_{H\Delta H}(S, T) + O\!\left(\mathrm{complexity}(H)\sqrt{\frac{\log m}{m}}\right)$

complexity(H): VC dimension [Ben-David et al. '06, '10], Rademacher complexity [Mansour et al. '09]

Empirical estimation:
$\hat d_{H\Delta H}(S,T) = 2\left(1 - \min_{h\in H}\left[\frac{1}{m}\sum_{x:\, h(x)=+1} \mathbb{I}[x\in S] + \frac{1}{m}\sum_{x:\, h(x)=-1} \mathbb{I}[x\in T]\right]\right)$

⇒ Already seen: label the source examples −1 and the target ones +1, and try to learn a classifier in H minimizing the associated empirical error.
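A sketch of this empirical estimate, often called the proxy A-distance: train a domain classifier to separate source from target inputs and turn its test error into a divergence value. The linear hypothesis class and the 50/50 split are illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def empirical_hdh_divergence(Xs, Xt):
    """2 * (1 - 2 * err) for a source-vs-target domain classifier."""
    X = np.vstack([Xs, Xt])
    d = np.concatenate([-np.ones(len(Xs)), np.ones(len(Xt))])  # -1=source, +1=target
    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.5,
                                              random_state=0, stratify=d)
    err = 1.0 - LinearSVC().fit(X_tr, d_tr).score(X_te, d_te)
    return 2.0 * (1.0 - 2.0 * err)   # ~0 if domains overlap, ~2 if separable
```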

(58)

Going to a generalization bound

Preliminaries

$R_{P_T}(h,h') = \mathbb{E}_{(x,y)\sim P_T}\, \mathbb{I}[h(x)\neq h'(x)] = \mathbb{E}_{x\sim D_T}\, \mathbb{I}[h(x)\neq h'(x)]$

$R_{P_T}$ (and $R_{P_S}$) fulfill the triangle inequality.

$|R_{P_T}(h,h') - R_{P_S}(h,h')| \le \frac{1}{2} d_{H\Delta H}(D_S, D_T)$, since $d_{H\Delta H}(D_S,D_T) = 2\sup_{(h,h')\in H^2} |R_{D_T}(h,h') - R_{D_S}(h,h')|$

$h_S^* = \operatorname{argmin}_{h\in H} R_{P_S}(h)$: best hypothesis on the source.
$h_T^* = \operatorname{argmin}_{h\in H} R_{P_T}(h)$: best hypothesis on the target.

Ideal joint hypothesis: $h^* = \operatorname{argmin}_{h\in H}\left[R_{P_S}(h) + R_{P_T}(h)\right]$; $\lambda = R_{P_S}(h^*) + R_{P_T}(h^*)$

(59)

A first bound

$R_{P_T}(h) \le R_{P_T}(h^*) + R_{P_T}(h, h^*)$
$\le R_{P_T}(h^*) + R_{P_S}(h,h^*) + R_{P_T}(h,h^*) - R_{P_S}(h,h^*)$
$\le R_{P_T}(h^*) + R_{P_S}(h,h^*) + \left|R_{P_T}(h,h^*) - R_{P_S}(h,h^*)\right|$
$\le R_{P_T}(h^*) + R_{P_S}(h,h^*) + \frac{1}{2} d_{H\Delta H}(D_S,D_T)$
$\le R_{P_T}(h^*) + R_{P_S}(h) + R_{P_S}(h^*) + \frac{1}{2} d_{H\Delta H}(D_S,D_T)$
$\le R_{P_S}(h) + \frac{1}{2} d_{H\Delta H}(D_S,D_T) + \lambda$
$\le \hat R_S(h) + \frac{1}{2} \hat d_{H\Delta H}(S,T) + O\!\left(\mathrm{complexity}(H)\sqrt{\frac{\log m}{m}}\right) + \lambda$

(The first and fifth lines use the triangle inequality; the last line plugs in the empirical estimates.)

(67)

Main theoretical bound

Theorem [Ben-David et al., MLJ'10; NIPS'06]

Let H be a symmetric hypothesis space. If $D_S$ and $D_T$ are respectively the marginal distributions of source and target instances, then for all $\delta \in (0,1]$, with probability at least $1-\delta$:
$\forall h\in H,\quad R_{P_T}(h) \le R_{P_S}(h) + \frac{1}{2} d_{H\Delta H}(D_S, D_T) + \lambda$

Formalizes a natural approach: bring the two distributions closer while ensuring a low error on the source domain.

Justifies many algorithms: reweighting methods, feature-based methods, adjusting/iterative methods.

(68)

Another analysis [Mansour et al., COLT'09]

$R_{P_T}(h) \le R_{P_T}(h, h_S^*) + R_{P_T}(h_S^*, h_T^*) + R_{P_T}(h_T^*)$
$= R_{P_T}(h, h_S^*) + \nu$
$\le R_{P_S}(h, h_S^*) + R_{P_T}(h,h_S^*) - R_{P_S}(h,h_S^*) + \nu$
$\le R_{P_S}(h, h_S^*) + \left|R_{P_T}(h,h_S^*) - R_{P_S}(h,h_S^*)\right| + \nu$
$\le R_{P_S}(h, h_S^*) + \frac{1}{2} d_{H\Delta H}(D_S, D_T) + \nu$
$\left(\le R_{P_S}(h) + \frac{1}{2} d_{H\Delta H}(D_S,D_T) + R_{P_S}(h_S^*) + \nu,\ \text{if } h_S^* \text{ is not the true labeling function}\right)$

This analysis can lead to a smaller bound when adaptation is possible; it yields the same type of bound, only the constants change.

(76)

Characterization of the possibility of domain adaptation

Constants that characterize when adaptation is possible:

$\lambda = R_{P_S}(h^*) + R_{P_T}(h^*)$, with $h^* = \operatorname{argmin}_{h\in H}\left[R_{P_S}(h) + R_{P_T}(h)\right]$: there must exist an ideal joint hypothesis with small error.

$\nu = R_{P_T}(h_S^*, h_T^*) + R_{P_T}(h_T^*)$: there must exist a very good hypothesis on the target, and the best hypothesis on the source must be close to the best one on the target w.r.t. $D_T$.

(77)

Other settings

Discrepancy [Mansour et al., COLT'09]

Instead of the 0-1 loss, use more general loss functions $\ell$ (e.g. $L_p$ norms):
$\mathrm{disc}_\ell(D_S, D_T) = \sup_{h,h'\in H} \left| R^\ell_{D_S}(h,h') - R^\ell_{D_T}(h,h') \right|$

This discrepancy can be minimized and used as a reweighting method ([Mansour et al., COLT'09]; polynomial-time for the $L_2$ norm, for example).

Using some target labeled data:
- weighting the empirical source and target risks [Ben-David et al., 2010];
- using a divergence taking target labels into account [Zhang et al., NIPS'12] (a divergence must take into account the marginals over $X$ and $Y$; the $\lambda$ constant accounts for $Y$).

(78)

Other settings

Averaged quantities [Germain et al., ICML'13]

Consider a probability distribution ρ (posterior) over H to learn, and the risk $\mathbb{E}_{h\sim\rho} R_{P_T}(h)$.

Definition of an averaged distance:
$\mathrm{dis}_\rho(D_S, D_T) = \mathbb{E}_{(h,h')\sim\rho^2}\left[R_{D_T}(h,h') - R_{D_S}(h,h')\right]$

Similar generalization bound:
$\mathbb{E}_{h\sim\rho} R_{P_T}(h) \le \mathbb{E}_{h\sim\rho} R_{P_S}(h) + \mathrm{dis}_\rho(D_S,D_T) + \lambda_\rho$

Estimation from samples is controlled by PAC-Bayesian theory; the bound is tighter, without a supremum.
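A sketch of estimating this averaged distance from unlabeled samples: draw pairs of hypotheses from the posterior ρ and average the difference of their disagreement on target versus source points. The finite hypothesis set and the Monte Carlo pairing are illustrative simplifications:

```python
import numpy as np

def averaged_dis(hypotheses, rho, Xs, Xt, n_pairs=2000, seed=0):
    """hypotheses: list of callables X -> {-1,+1}; rho: their probabilities."""
    rng = np.random.default_rng(seed)
    pairs = rng.choice(len(hypotheses), size=(n_pairs, 2), p=rho)
    diffs = []
    for i, j in pairs:
        hi, hj = hypotheses[i], hypotheses[j]
        dis_T = np.mean(hi(Xt) != hj(Xt))    # disagreement under D_T
        dis_S = np.mean(hi(Xs) != hj(Xs))    # disagreement under D_S
        diffs.append(dis_T - dis_S)
    return float(np.mean(diffs))
```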

(79)

Feature/Projection based Approaches

