Graph-based Dependency Parsing (Chu-Liu-Edmonds algorithm) Sam Thomson (with thanks to Swabha Swayamdipta)

(1)

Graph-based Dependency Parsing

(Chu-Liu-Edmonds algorithm)

Sam Thomson (with thanks to Swabha Swayamdipta)

University of Washington, CSE 490u

February 22, 2017

(2)

Outline

I Dependency trees

I Three main approaches to parsing

I Chu-Liu-Edmonds algorithm

I Arc scoring / Learning

(3)

Dependency Parsing - Output

(4)

Dependency Parsing

TurboParser output from

http://demo.ark.cs.cmu.edu/parse?sentence=I%20ate%20the%20fish%20with%20a%20fork.

(5)

Dependency Parsing - Output Structure

A parse is anarborescence (akadirected rooted tree):

I Directed [Labeled] Graph

I Acyclic

I Single Root

I Connected and Spanning: ∃ directed path from root to every other word

(6)

Projective / Non-projective

I Some parses are projective: edges don’t cross

I Most English sentences are projective, but non-projectivity is common in other languages (e.g. Czech, Hindi)

Non-projective sentence in English:

and Czech:

Examples fromNon-projective Dependency Parsing using Spanning Tree Algorithms McDonald et al., EMNLP ’05

(7)

Dependency Parsing - Approaches

(8)

Dependency Parsing Approaches

I Chart (Eisner, CKY)

I O(n³)

I Onlyproduces projective parses

I Shift-reduce

I O(n) (fast!), but inexact

I “Pseudo-projective” trick can capture some non-projectivity

I Graph-based (MST)

I O(n²) for arc-factored

I Can produce projectiveandnon-projective parses

(9)

Dependency Parsing Approaches

I O(n³)

I Shift-reduce

I Graph-based (MST)

(10)

Dependency Parsing Approaches

I O(n³)

I Shift-reduce

I Graph-based (MST)

(11)

Graph-based Dependency Parsing

(12)

Arc-Factored Model

Every possible labeled directed edgee between every pair of nodes gets a score,score(e).

G =hV,Ei=

(O(n²) edges)

Example fromNon-projective Dependency Parsing using Spanning Tree Algorithms McDonald et al., EMNLP ’05

(13)

Arc-Factored Model

Every possible labeled directed edgee between every pair of nodes gets a score,score(e).

G =hV,Ei=

(O(n²) edges)

(14)

Arc-Factored Model

Best parse is:

A^∗ = arg max

A⊆G s.t.Aan arborescence

X

e∈A

score(e)

etc. . .

The Chu-Liu-Edmonds algorithm finds this argmax.

(15)

Arc-Factored Model

Best parse is:

A^∗ = arg max

X

e∈A

score(e)

etc. . .

(16)

Arc-Factored Model

Best parse is:

A^∗ = arg max

X

e∈A

score(e)

etc. . .

(17)

Chu-Liu-Edmonds

Chu and Liu ’65, On the Shortest Arborescence of a Directed Graph, Science Sinica

Edmonds ’67, Optimum Branchings, JRNBS

(18)

Chu-Liu-Edmonds - Intuition

Every non-ROOT node needs exactly 1 incoming edge

In fact, every connected component that doesn’t containROOT needs exactly 1 incoming edge

I Greedily pick an incoming edge for each node.

I If this forms an arborescence, great!

I Otherwise, it will contain a cycleC.

I Arborescences can’t have cycles, so we can’t keep every edge in C. One edge inC must get kicked out.

I C also needs an incoming edge.

I Choosing an incoming edge forC determines which edge to kick out

(19)

Chu-Liu-Edmonds - Intuition

(20)

Chu-Liu-Edmonds - Intuition

(21)

Chu-Liu-Edmonds - Intuition

(22)

Chu-Liu-Edmonds - Intuition

(23)

Chu-Liu-Edmonds - Intuition

I Arborescences can’t have cycles, so we can’t keep every edge in C. One edge in C must get kicked out.

(24)

Chu-Liu-Edmonds - Intuition

(25)

Chu-Liu-Edmonds - Intuition

(26)

Chu-Liu-Edmonds - Recursive (Inefficient) Definition

defmaxArborescence(V , E ,ROOT):

””” returns best arborescence as a map from each node to its parent ”””

forvinV\ROOT:

bestInEdge[v]←arg maxe∈inEdges[v]e.score ifbestInEdgecontains a cycle C:

# build a new graph whereCis contracted into a single node v_C←newNode()

V⁰←V∪ {v_C} \C

E⁰← {adjust(e)fore∈E\C}

A←maxArborescence(V⁰, E⁰,ROOT)

return{e.originalfore∈A} ∪C\ {A[v_C].kicksOut}

# each node got a parent without creating any cycles returnbestInEdge

defadjust(e):

e⁰←copy(e) e⁰.original←e ife.dest∈C:

e⁰.dest←v_C

e⁰.kicksOut←bestInEdge[e.dest]

e⁰.score←e.score−e⁰.kicksOut.score elife.src∈C:

e⁰.src←v_C returne⁰

(27)

Chu-Liu-Edmonds

Consists of two stages:

I Contracting (everythingbeforethe recursive call)

I Expanding (everythingafterthe recursive call)

(28)

Chu-Liu-Edmonds - Preprocessing

I Remove every edge incoming toROOT

I This ensures thatROOTis in fact the root of any solution

I For every ordered pair of nodes,v_i,v_j, remove all but the highest-scoring edge from vi tovj

(29)

Chu-Liu-Edmonds - Contracting Stage

I For each non-ROOT nodev, set bestInEdge[v] to be its highest scoring incoming edge.

I If a cycleC is formed:

I contractthe nodes inC into a new nodevC

I edges outgoing from any node inC now get sourcevC I edges incoming to any node inC now get destinationvC I For each nodeu inC, and for each edgee incoming toufrom

outside ofC:

I sete.kicksOuttobestInEdge[u], and

I sete.scoreto bee.score−e.kicksOut.score.

I Repeat until every non-ROOT node has an incoming edge and no cycles are formed

(30)

An Example - Contracting Stage

V1

ROOT

V3 V2

a: 5 b: 1 c: 1

f : 5 d: 11

h: 9 e: 4

i: 8 g: 10

bestInEdge V1

V2 V3

kicksOut a

b c d e f g h i

(31)

An Example - Contracting Stage

V1

ROOT

V3 V2

a: 5 b: 1 c: 1

f : 5 d: 11

h: 9 e: 4

i: 8 g: 10

bestInEdge

V1 g

V2 V3

kicksOut a

b c d e f g h i

(32)

An Example - Contracting Stage

V1

ROOT

V3 V2

a: 5 b: 1 c: 1

f : 5 d: 11

h: 9 e: 4

i: 8 g: 10

bestInEdge

V1 g

V2 d

V3

kicksOut a

b c d e f g h i

(33)

An Example - Contracting Stage

V1

ROOT

V3 V2

a: 5−10 b: 1−11 c: 1

f : 5 d: 11

h: 9−10 e: 4

i: 8−11 g: 10

V4

bestInEdge

V1 g

V2 d

V3

kicksOut

a g

b d

c d e f g

h g

i d

(34)

An Example - Contracting Stage

V4

ROOT

V3

b:−10 c: 1

f: 5 a:−5

h:−1 e: 4 i:−3

bestInEdge

V1 g

V2 d

V3 V4

kicksOut

a g

b d

c d e f g

h g

i d

(35)

An Example - Contracting Stage

V4

ROOT

V3

b:−10 c: 1

f: 5 a:−5

h:−1 e: 4 i:−3

bestInEdge

V1 g

V2 d

V3 f

V4

kicksOut

a g

b d

c d e f g

h g

i d

(36)

An Example - Contracting Stage

V4

ROOT

V3

b:−10 c: 1

f: 5 a:−5

h:−1 e: 4 i:−3

bestInEdge

V1 g

V2 d

V3 f

V4 h

kicksOut

a g

b d

c d e f g

h g

i d

(37)

An Example - Contracting Stage

V4

ROOT

V3

b:−10− −1 c: 1−5

f: 5 a:−5− −1

h:−1 e: 4 i:−3

V5

bestInEdge

V1 g

V2 d

V3 f

V4 h

V5

kicksOut

a g, h

b d, h

c f

d e f g

h g

i d

(38)

An Example - Contracting Stage

V5 ROOT

b:−9 a:−4 c:−4

bestInEdge

V1 g

V2 d

V3 f

V4 h

V5

kicksOut

a g, h

b d, h

c f

d

e f

f g

h g

i d

(39)

An Example - Contracting Stage

V5 ROOT

b:−9 a:−4 c:−4

bestInEdge

V1 g

V2 d

V3 f

V4 h

V5 a

kicksOut

a g, h

b d, h

c f

d

e f

f g

h g

i d

(40)

Chu-Liu-Edmonds - Expanding Stage

After the contracting stage, every contracted node will have exactly onebestInEdge. This edge will kick out one edge inside the contracted node, breaking the cycle.

I Go through each bestInEdgee in the reverseorder that we added them

I lock down e, andremove every edge inkicksOut(e) from bestInEdge.

(41)

An Example - Expanding Stage

V5 ROOT

b:−9 a:−4 c:−4

bestInEdge

V1 g

V2 d

V3 f

V4 h

V5 a

kicksOut

a g, h

b d, h

c f

d

e f

f g

h g

i d

(42)

An Example - Expanding Stage

V5 ROOT

b:−9 a:−4 c:−4

bestInEdge

V1 a

g

V2 d

V3 f

V4 a

^h

V5 a

kicksOut

a g, h

b d, h

c f

d

e f

f g

h g

i d

(43)

An Example - Expanding Stage

V4

ROOT

V3

b:−10 c: 1

f: 5 a:−5

h:−1 e: 4 i:−3

bestInEdge

V1 a

^g

V2 d

V3 f

V4 a

^h

V5 a

kicksOut

a g, h

b d, h

c f

d

e f

f g

h g

i d

(44)

An Example - Expanding Stage

V4

ROOT

V3

b:−10 c: 1

f: 5 a:−5

h:−1 e: 4 i:−3

bestInEdge

V1 a

^g

V2 d

V3 f

V4 a

^h

V5 a

kicksOut

a g, h

b d, h

c f

d

e f

f g

h g

i d

(45)

An Example - Expanding Stage

V1

ROOT

V3 V2

a: 5 b: 1 c: 1

f : 5 d: 11

h: 9 e: 4

i: 8 g: 10

bestInEdge

V1 a

g

V2 d

V3 f

V4 a

h

V5 a

kicksOut

a g, h

b d, h

c f

d

e f

f g

h g

i d

(46)

An Example - Expanding Stage

V1

ROOT

V3 V2

a: 5 b: 1 c: 1

f : 5 d: 11

h: 9 e: 4

i: 8 g: 10

bestInEdge

V1 a

g

V2 d

V3 f

V4 a

h

V5 a

kicksOut

a g, h

b d, h

c f

d

e f

f g

h g

i d

(47)

Chu-Liu-Edmonds - Notes

I This is a greedy algorithm with a clever form of delayed back-tracking to recover from inconsistent decisions (cycles).

I CLE is exact: it always recovers the optimal arborescence.

(48)

Chu-Liu-Edmonds - Notes

I Efficient implementation:

Tarjan ’77, Finding Optimum Branchings, Networks

Not recursive. Uses aunion-find(a.k.a. disjoint-set) data structure to keep track of collapsed nodes.

I Even more efficient:

Gabow et al. ’86, Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and Directed Graphs, Combinatorica

Uses a Fibonacci heapto keep incoming edges sorted. Finds cycles by following bestInEdgeinstead of randomly visiting nodes.

Describes how to constrain ROOT to have only one outgoing edge

(49)

Chu-Liu-Edmonds - Notes

I Efficient (wrong)implementation:

Tarjan ’77, Finding Optimum Branchings*, Networks

*corrected in Camerini et al. ’79, A note on finding optimum branchings, Networks

Uses a Fibonacci heapto keep incoming edges sorted. Finds cycles by following bestInEdgeinstead of randomly visiting nodes.

(50)

Chu-Liu-Edmonds - Notes

I Efficient (wrong)implementation:

Tarjan ’77, Finding Optimum Branchings*, Networks

*corrected in Camerini et al. ’79, A note on finding optimum branchings, Networks

Uses a Fibonacci heapto keep incoming edges sorted.

Finds cycles by following bestInEdgeinstead of randomly visiting nodes.

(51)

Arc Scoring / Learning

(52)

Arc Scoring

Features

can look at source (head), destination (child), and arc label.

For example:

I number of words between head and child,

I sequence of POS tags between head and child,

I is head to the left or right of child?

I vector state of a recurrent neural net at head and child,

I vector embedding of label,

I etc.

(53)

Learning

Recall that when we have a parameterized model, and we have a decoder that can make predictions given that model. . .

we can use structured perceptron, or structured hinge loss:

L_θ(xi,yi) = max

y∈Y{score_θ(y) +cost(y,yi)} −scoreθ(yi)

(54)

Learning

Recall that when we have a parameterized model, and we have a decoder that can make predictions given that model. . .

we can use structured perceptron, or structured hinge loss:

L_θ(xi,yi) = max

y∈Y{score_θ(y) +cost(y,yi)} −scoreθ(yi)