Joint Arc-factored Parsing of Syntactic and Semantic Dependencies
Xavier Lluís and Xavier Carreras and Lluís Màrquez
TALP Research Center
Universitat Politècnica de Catalunya
Jordi Girona 1–3, 08034 Barcelona
{xlluis,carreras,lluism}@lsi.upc.edu
Abstract
In this paper we introduce a joint arc-factored model for syntactic and semantic dependency parsing. The semantic role labeler predicts the full syntactic paths that connect predicates with their arguments. This process is framed as a linear assignment task, which allows us to control some well-formedness constraints. For the syntactic part, we define a standard arc-factored dependency model that predicts the full syntactic tree. Finally, we employ dual decomposition techniques to produce consistent syntactic and predicate-argument structures while searching over a large space of syntactic configurations. In experiments on the CoNLL-2009 English benchmark we observe very competitive results.
1 Introduction
Semantic role labeling (SRL) is the task of identifying the arguments of lexical predicates in a sentence and labeling them with semantic roles (Gildea and Jurafsky, 2002; Màrquez et al., 2008). SRL is an important shallow semantic task in NLP since predicate-argument relations directly represent semantic properties of the type “who” did “what” to “whom”, “how”, and “why” for events expressed by predicates (typically verbs and nouns).
Predicate-argument relations are strongly related to the syntactic structure of the sentence: the majority of predicate arguments correspond to some syntactic constituent, and the syntactic structure that connects an argument with the predicate is a strong indicator of its semantic role. Actually, semantic roles represent an abstraction of the syntactic form of a predicative event. While syntactic functions of arguments change with the form of the event (e.g., active vs. passive forms), the semantic roles of arguments remain invariant to their syntactic realization.
Consequently, since the first works, SRL systems have assumed access to the syntactic structure of the sentence (Gildea and Jurafsky, 2002; Carreras and Màrquez, 2005). A simple approach is to obtain the parse trees as a pre-process to the SRL system, which allows the use of unrestricted features of the syntax. However, as in other pipeline approaches in NLP, it has been shown that the errors of the syntactic parser severely degrade the predictions of the SRL model (Gildea and Palmer, 2002). A common approach to alleviate this problem is to work with multiple alternative syntactic trees and let the SRL system optimize over any input tree or part of it (Toutanova et al., 2008; Punyakanok et al., 2008).
As a step further, more recent work has proposed parsing models that predict syntactic structure augmented with semantic predicate-argument relations (Surdeanu et al., 2008; Hajič et al., 2009; Johansson, 2009; Titov et al., 2009; Lluís et al., 2009), which is the focus of this paper. These joint models should favor the syntactic structure that is most consistent with the semantic predicate-argument structures of a sentence. In principle, these models can exploit syntactic and semantic features simultaneously, and could potentially improve the accuracy for both syntactic and semantic relations.
One difficulty in the design of joint syntactic-semantic parsing models is that there exist important structural divergences between the two layers.
[Figure 1: dependency graph for the sentence “Mary loves to play guitar .”, with syntactic arcs labeled SBJ, OPRD, IM, OBJ and P drawn above the sentence and semantic arcs labeled ARG0 and ARG1 for each of the two predicates drawn below.]
Figure 1: A sentence with syntactic dependencies (top) and semantic dependencies for the predicates “loves” and “play” (bottom). The thick arcs illustrate a structural divergence where the argument “Mary” is linked to “play” with a path involving three syntactic dependencies.
This is clearly seen in dependency-based representations of syntax and semantic roles (Surdeanu et al., 2008), such as in the example in Figure 1: the construct “loves to” causes the argument “Mary” to be syntactically distant from the predicate “play”. Linguistic phenomena such as auxiliary verbs, control and raising typically result in syntactic structures where semantic arguments are not among the direct dependents of their predicate; e.g., about 25% of arguments are distant in the English development set of the CoNLL-2009 shared task. Besides, standard models for dependency parsing crucially depend on arc factorizations of the dependency structure (McDonald et al., 2005; Nivre and Nilsson, 2005), otherwise their computational properties break. Hence, it is challenging to define efficient methods for syntactic and semantic dependency parsing that can exploit features of both layers simultaneously. In this paper we propose a method for this joint task.
In our method we define predicate-centric semantic models that, rather than predicting just the argument that realizes each semantic role, predict the full syntactic path that connects the predicate with the argument. We show how efficient predictions with these models can be made using assignment algorithms in bipartite graphs. Simultaneously, we use a standard arc-factored dependency model that predicts the full syntactic tree of the sentence. Finally, we employ dual decomposition techniques (Koo et al., 2010; Rush et al., 2010; Sontag et al., 2010) to find agreement between the full dependency tree and the partial syntactic trees linking each predicate with its arguments. In summary, the main contributions of this paper are:
• We frame SRL as a weighted assignment problem in a bipartite graph. Under this framework we can control assignment constraints between roles and arguments. Key to our method, we can efficiently search over a large space of syntactic realizations of semantic arguments.
• We solve joint inference of syntactic and semantic dependencies with a dual decomposition method, similar to that of Koo et al. (2010). Our system produces consistent syntactic and predicate-argument structures while searching over a large space of syntactic configurations.
In the experimental section we compare joint and pipeline models. The final results of our joint syntactic-semantic system are competitive with the state-of-the-art and improve over the best results published by a joint method on the CoNLL-2009 English dataset.
2 A Syntactic-Semantic Dependency Model
We first describe how we represent structures of syntactic and semantic dependencies like the one in Figure 1. Throughout the paper, we will assume a fixed input sentence $\mathbf{x}$ with $n$ tokens where lexical predicates are marked. We will also assume fixed sets of syntactic functions $\mathcal{R}_{\mathrm{syn}}$ and semantic roles $\mathcal{R}_{\mathrm{sem}}$. We will represent dependency structures using vectors of binary variables. A variable $y_{h,m,l}$ will indicate the presence of a syntactic dependency from head token $h$ to dependent token $m$ labeled with syntactic function $l$. Then, a syntactic tree will be denoted as a vector $\mathbf{y}$ of variables indexed by syntactic dependencies. Similarly, a variable $z_{p,a,r}$ will indicate the presence of a semantic dependency between predicate token $p$ and argument token $a$ labeled with semantic role $r$. We will represent a semantic role structure as a vector $\mathbf{z}$ indexed by semantic dependencies. Whenever we enumerate syntactic dependencies $\langle h, m, l \rangle$ we will assume that they are in the valid range for $\mathbf{x}$, i.e. $0 \le h \le n$, $1 \le m \le n$, $h \ne m$ and $l \in \mathcal{R}_{\mathrm{syn}}$, where $h = 0$ stands for a special root token. Similarly, for semantic dependencies $\langle p, a, r \rangle$ we will assume that $p$ points to a predicate of $\mathbf{x}$, $1 \le a \le n$ and $r \in \mathcal{R}_{\mathrm{sem}}$.
A joint model for syntactic and semantic dependency parsing could be defined as:

$$\operatorname*{argmax}_{\mathbf{y},\mathbf{z}} \; \mathrm{s\_syn}(\mathbf{x},\mathbf{y}) + \mathrm{s\_srl}(\mathbf{x},\mathbf{z},\mathbf{y}) . \tag{1}$$

In the equation, $\mathrm{s\_syn}(\mathbf{x},\mathbf{y})$ gives a score for the syntactic tree $\mathbf{y}$. In the literature, it is standard to use arc-factored models defined as

$$\mathrm{s\_syn}(\mathbf{x},\mathbf{y}) = \sum_{y_{h,m,l}=1} \mathrm{s\_syn}(\mathbf{x},h,m,l) , \tag{2}$$

where we overload $\mathrm{s\_syn}$ to be a function that computes scores for individual syntactic dependencies. In linear discriminative models one has $\mathrm{s\_syn}(\mathbf{x},h,m,l) = \mathbf{w}_{\mathrm{syn}} \cdot \mathbf{f}_{\mathrm{syn}}(\mathbf{x},h,m,l)$, where $\mathbf{f}_{\mathrm{syn}}$ is a feature vector for a syntactic dependency and $\mathbf{w}_{\mathrm{syn}}$ is a vector of parameters (McDonald et al., 2005). In Section 6 we describe how we trained score functions with discriminative methods.
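To make the arc-factored score of Eq. 2 concrete, the following is a minimal sketch in Python. It assumes binary indicator features stored as strings and a weight vector stored as a dictionary; the feature function `toy_features` and the weights are hypothetical and are not the features used in our experiments.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Dep = Tuple[int, int, str]  # (head h, modifier m, syntactic label l)


def dot(w: Dict[str, float], feats: Iterable[str]) -> float:
    """Sparse dot product w . f for binary indicator features."""
    return sum(w.get(name, 0.0) for name in feats)


def s_syn_tree(tree: Iterable[Dep],
               w_syn: Dict[str, float],
               f_syn: Callable[[int, int, str], List[str]]) -> float:
    """Eq. 2: the tree score is the sum of the scores of its active arcs."""
    return sum(dot(w_syn, f_syn(h, m, l)) for (h, m, l) in tree)


def toy_features(h: int, m: int, l: str) -> List[str]:
    """Hypothetical feature function: arc label and head-modifier distance."""
    return [f"label={l}", f"dist={abs(h - m)}"]


# Hypothetical weights and a three-arc tree (token 0 is the root).
w = {"label=SBJ": 1.5, "label=OBJ": 1.0, "dist=1": 0.5}
tree = {(0, 2, "ROOT"), (2, 1, "SBJ"), (2, 3, "OBJ")}
score = s_syn_tree(tree, w, toy_features)
```

Because the score decomposes over arcs, changing one arc only changes one term of the sum, which is the property that parsing algorithms for these models rely on.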
The other term in Eq. 1, $\mathrm{s\_srl}(\mathbf{x},\mathbf{z},\mathbf{y})$, gives a score for a semantic dependency structure $\mathbf{z}$ using features of the syntactic structure $\mathbf{y}$. Previous work has empirically shown the importance of exploiting syntactic features in the semantic component (Gildea and Jurafsky, 2002; Xue and Palmer, 2004; Punyakanok et al., 2008). However, without further assumptions, this property makes the optimization problem computationally hard. One simple approximation is to use a pipeline model: first compute the optimal syntactic tree $\mathbf{y}$, and then optimize for the best semantic structure $\mathbf{z}$ given $\mathbf{y}$. In the rest of the paper we describe a method that searches over syntactic and semantic dependency structures jointly.
We first note that for a fixed semantic dependency, the semantic component will typically restrict the syntactic features representing the dependency to a specific subtree of $\mathbf{y}$. For example, previous work has restricted such features to the syntactic path that links a predicate with an argument (Moschitti, 2004; Johansson, 2009), and in this paper we employ this restriction. Figure 1 gives an example of a subtree, where we highlight the syntactic path that connects the semantic dependency between “play” and “Mary” with role ARG0.
Formally, for a predicate $p$, argument $a$ and role $r$ we define a local syntactic subtree $\boldsymbol{\pi}^{p,a,r}$ represented as a vector: $\pi^{p,a,r}_{h,m,l}$ indicates if a dependency $\langle h, m, l \rangle$ is part of the syntactic path that links predicate $p$ with token $a$ and role $r$.¹ Given full syntactic and semantic structures $\mathbf{y}$ and $\mathbf{z}$ it is trivial to construct a vector $\boldsymbol{\pi}$ that concatenates vectors $\boldsymbol{\pi}^{p,a,r}$ for all $\langle p, a, r \rangle$ in $\mathbf{z}$. The semantic model becomes

$$\mathrm{s\_srl}(\mathbf{x},\mathbf{z},\boldsymbol{\pi}) = \sum_{z_{p,a,r}=1} \mathrm{s\_srl}(\mathbf{x},p,a,r,\boldsymbol{\pi}^{p,a,r}) , \tag{3}$$

where $\mathrm{s\_srl}$ computes a score for a semantic dependency $\langle p, a, r \rangle$ together with its syntactic path $\boldsymbol{\pi}^{p,a,r}$. As in the syntactic component, this function is typically defined as a linear function over a set of features of the semantic dependency and its path.
The inference problem of our joint model is:

$$\operatorname*{argmax}_{\mathbf{y},\mathbf{z},\boldsymbol{\pi}} \; \mathrm{s\_syn}(\mathbf{x},\mathbf{y}) + \mathrm{s\_srl}(\mathbf{x},\mathbf{z},\boldsymbol{\pi}) \tag{4}$$

subject to

cTree: $\mathbf{y}$ is a valid dependency tree
cRole: $\forall p, r: \; \sum_{a} z_{p,a,r} \le 1$
cArg: $\forall p, a: \; \sum_{r} z_{p,a,r} \le 1$
cPath: $\forall p, a, r$: if $z_{p,a,r} = 1$ then $\boldsymbol{\pi}^{p,a,r}$ is a path from $p$ to $a$, otherwise $\boldsymbol{\pi}^{p,a,r} = \mathbf{0}$
cSubtree: $\forall p, a, r$: $\boldsymbol{\pi}^{p,a,r}$ is a subtree of $\mathbf{y}$

Constraint cTree dictates that $\mathbf{y}$ is a valid dependency tree; see Martins et al. (2009) for a detailed specification. The next two sets of constraints concern the semantic structure only. cRole imposes that each semantic role is realized at most once.² Conversely, cArg dictates that an argument can realize at most one semantic role in a predicate. The final two sets of constraints model the syntactic-semantic interdependencies. cPath imposes that each $\boldsymbol{\pi}^{p,a,r}$ represents a syntactic path between $p$ and $a$ whenever there exists a semantic relation. Finally, cSubtree imposes that the paths in $\boldsymbol{\pi}$ are consistent with the full syntactic structure, i.e. they are subtrees.
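As an illustration of what the constraints require, the sketch below checks cRole, cArg and cSubtree on explicit sets of active dependencies. It is only a checker, not an inference procedure. The encoding of the Figure 1 sentence (tokens numbered 1 to 6, token 0 as root, and the `ROOT` label) is our own assumption for the example, and the function names are hypothetical.

```python
from collections import Counter


def check_semantic_constraints(z):
    """Return (cRole ok, cArg ok) for a set of active triples (p, a, r)."""
    per_role = Counter((p, r) for (p, a, r) in z)  # arguments filling each role
    per_arg = Counter((p, a) for (p, a, r) in z)   # roles given to each argument
    return (all(c <= 1 for c in per_role.values()),
            all(c <= 1 for c in per_arg.values()))


def check_subtree(y, paths):
    """cSubtree: every dependency used by some path must also be in the tree y."""
    tree = set(y)
    return all(set(path) <= tree for path in paths.values())


# Assumed encoding of "Mary loves to play guitar ." from Figure 1.
y = {(0, 2, "ROOT"), (2, 1, "SBJ"), (2, 3, "OPRD"),
     (3, 4, "IM"), (4, 5, "OBJ"), (2, 6, "P")}
z = {(4, 1, "ARG0"), (4, 5, "ARG1")}  # arguments of the predicate "play"
paths = {
    (4, 1, "ARG0"): [(3, 4, "IM"), (2, 3, "OPRD"), (2, 1, "SBJ")],
    (4, 5, "ARG1"): [(4, 5, "OBJ")],
}

role_ok, arg_ok = check_semantic_constraints(z)
subtree_ok = check_subtree(y, paths)
```

In this example all three checks succeed. Removing the OPRD arc from `y` would break cSubtree for the path of ARG0, which is exactly the kind of disagreement that the joint inference has to resolve.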
¹ In this paper we say that structures $\boldsymbol{\pi}^{p,a,r}$ are paths from predicates to arguments, but they could be more general subtrees. The condition to build a joint system is that these subtrees must be parseable in the way we describe in Section 3.1.
² In general a semantic role can be realized with more than one argument, though it is rare. It is not hard to modify our framework to allow for a maximum number of occurrences of a semantic role.
In Section 3 we define a process that optimizes the semantic structure ignoring constraint cSubtree. Then in Section 4 we describe a dual decomposition method that uses the first process repeatedly to solve the joint problem.
3 SRL as Assignment
In this section we frame the problem of finding semantic dependencies as a linear assignment task. The problem we optimize is:

$$\operatorname*{argmax}_{\mathbf{z},\boldsymbol{\pi}} \; \mathrm{s\_srl}(\mathbf{x},\mathbf{z},\boldsymbol{\pi}) \tag{5}$$

subject to cRole, cArg, cPath.

In this case we dropped the full syntactic structure $\mathbf{y}$ from the optimization in Eq. 4, as well as the corresponding constraints cTree and cSubtree. As a consequence, we note that the syntactic paths $\boldsymbol{\pi}$ are not tied to any consistency constraint other than each of the paths being a well-formed sequence of dependencies linking the predicate to the argument. In other words, the optimal solution in this case does not guarantee that the set of paths from a predicate to all of its arguments satisfies tree constraints. We first describe how these paths can be optimized locally. Then we show how to find a solution $\mathbf{z}$ satisfying cRole and cArg using an assignment algorithm.
3.1 Local Optimization of Syntactic Paths
Let $\hat{\mathbf{z}}$ and $\hat{\boldsymbol{\pi}}$ be the optimal values of Eq. 5. For any $\langle p, a, r \rangle$, let

$$\tilde{\boldsymbol{\pi}}^{p,a,r} = \operatorname*{argmax}_{\boldsymbol{\pi}^{p,a,r}} \; \mathrm{s\_srl}(\mathbf{x},p,a,r,\boldsymbol{\pi}^{p,a,r}) . \tag{6}$$

For any $\langle p, a, r \rangle$ such that $\hat{z}_{p,a,r} = 1$ it has to be that $\hat{\boldsymbol{\pi}}^{p,a,r} = \tilde{\boldsymbol{\pi}}^{p,a,r}$. If this were not true, replacing $\hat{\boldsymbol{\pi}}^{p,a,r}$ with $\tilde{\boldsymbol{\pi}}^{p,a,r}$ would improve the objective of Eq. 5 without violating the constraints, thus contradicting the hypothesis about optimality of $\hat{\boldsymbol{\pi}}$. Therefore, for each $\langle p, a, r \rangle$ we can optimize its best syntactic path locally as defined in Eq. 6.
In this paper, we will assume access to a list of likely syntactic paths for each predicate $p$ and argument candidate $a$, such that the optimization in Eq. 6 can be solved explicitly by looping over each path in the list. The main advantage of this method is that, since paths are precomputed, our model can make unrestricted use of syntactic path features.
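The explicit loop over precomputed paths can be sketched as follows. This is an illustrative Python fragment under stated assumptions: a path is a tuple of dependencies, and the scoring function `toy_score` with its per-label weights is a hypothetical stand-in for $\mathrm{s\_srl}(\mathbf{x},p,a,r,\boldsymbol{\pi}^{p,a,r})$, not the trained model.

```python
def best_path(candidate_paths, score_path):
    """Eq. 6 by enumeration: return (score, path) for the highest-scoring candidate."""
    if not candidate_paths:
        raise ValueError("no candidate paths for this predicate-argument pair")
    best = max(candidate_paths, key=score_path)
    return score_path(best), best


# Hypothetical per-label weights standing in for the semantic path score.
LABEL_WEIGHT = {"IM": 0.25, "OPRD": 0.25, "SBJ": 1.0, "OBJ": -0.5}


def toy_score(path):
    """Sum the (hypothetical) weights of the labels along the path."""
    return sum(LABEL_WEIGHT[label] for (_h, _m, label) in path)


# Two candidate paths between "play" (token 4) and "Mary" (token 1).
candidates = [
    ((3, 4, "IM"), (2, 3, "OPRD"), (2, 1, "SBJ")),  # three-arc path, as in Figure 1
    ((4, 1, "OBJ"),),                               # a direct arc, for contrast
]
score, path = best_path(candidates, toy_score)
```

Since the candidates are enumerated explicitly, `score_path` may inspect the whole path at once, which is what allows unrestricted path features.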
[Figure 2: bipartite graph with token nodes Mary(1), plays(2), guitar(3), NULL(4), NULL(5), NULL(6) on one side and role nodes ARG0(1), ARG1(2), ARG2(3), NULL(4), NULL(5), NULL(6) on the other; the labeled edges are $W_{1,1}$, $W_{4,2}$, $W_{2,3}$, $W_{3,4}$, $W_{5,5}$ and $W_{6,6}$.]
Figure 2: Illustration of the assignment graph for the sentence “Mary plays guitar”, where the predicate “plays” can have up to three roles: ARG0 (agent), ARG1 (theme) and ARG2 (benefactor). Nodes labeled NULL represent a null role or token. Highlighted edges are the correct assignment.
It is simple to employ a probabilistic syntactic dependency model to create the list of likely paths for each predicate-argument pair. In the experiments we explore this approach and show that with an average of 44 paths per predicate we can recover 86.2% of the correct paths.
We leave for future work the development of efficient methods to recover the most likely syntactic structure linking an argument with its predicate.
3.2 The Assignment Algorithm
Coming back to solving Eq. 5, it is easy to see that an optimal solution satisfying constraints cRole and cArg can be found with a linear assignment algorithm. The process we describe determines the predicate-argument relations separately for each predicate. Assume a bipartite graph of size $N$ with role nodes $r_1 \ldots r_N$ on one side and argument nodes $a_1 \ldots a_N$ on the other side. Assume also a matrix of non-negative scores $W_{i,j}$ corresponding to assigning argument $a_j$ to role $r_i$. A linear assignment algorithm finds a bijection $f : i \to j$ from roles to arguments that maximizes $\sum_{i=1}^{N} W_{i,f(i)}$. The Hungarian algorithm finds the exact solution to this problem in $O(N^3)$ time (Kuhn, 1955; Burkard et al., 2009).
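To fix ideas about the objective, the following sketch solves a small linear assignment problem by exhaustive search over all bijections. This is deliberately not the Hungarian algorithm: it runs in $O(N!)$ time and is only meant to show what is being maximized on a tiny, made-up matrix. The function name is hypothetical.

```python
from itertools import permutations


def solve_assignment_bruteforce(W):
    """Return (f, total) where f[i] is the column assigned to row i.

    Exhaustive stand-in for a Hungarian-type solver: it tries every
    bijection and keeps the one with the largest sum of W[i][f[i]].
    """
    rows = range(len(W))
    best = max(permutations(rows), key=lambda f: sum(W[i][f[i]] for i in rows))
    return list(best), sum(W[i][best[i]] for i in rows)


# Made-up 3 x 3 score matrix (rows are roles, columns are arguments).
W = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]
f, total = solve_assignment_bruteforce(W)
```

For realistic sizes one would use a polynomial-time solver instead; for example, SciPy provides `scipy.optimize.linear_sum_assignment`, which accepts `maximize=True`.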
All that is left is to construct a bipartite graph representing predicate roles and sentence tokens, such that some roles and tokens can be left unassigned, which is a common setting for assignment tasks. Algorithm 1 describes a procedure for constructing a weighted bipartite graph for SRL, and Figure 2 illustrates an example of a bipartite graph. We then run the Hungarian algorithm on the weighted graph and obtain a bijection $f : r_i \to a_j$, from which it is trivial to recover the optimal solution of Eq. 5.
Finally, we note that it is simple to allow for multiple instances of a semantic role by adding more role nodes in step 1; it would be straightforward to add penalties in step 3 for multiple instances of roles.

Algorithm 1 Construction of an Assignment Graph for Semantic Role Labeling
Let $p$ be a predicate with $k$ possible roles. Let $n$ be the number of argument candidates in the sentence. This algorithm creates a bipartite graph with $N = n + k$ vertices on each side.
1. Create role vertices $r_i$ for $i = 1 \ldots N$, where
   • for $1 \le i \le k$, $r_i$ is the $i$-th role,
   • for $1 \le i \le n$, $r_{k+i}$ is a special NULL role.
2. Create argument vertices $a_j$ for $j = 1 \ldots N$, where
   • for $1 \le j \le n$, $a_j$ is the $j$-th argument candidate,
   • for $1 \le j \le k$, $a_{n+j}$ is a special NULL argument.
3. Define a matrix of model scores $S \in \mathbb{R}^{(k+1) \times n}$:
   (a) Optimization of syntactic paths: for $1 \le i \le k$, $1 \le j \le n$,
       $S_{i,j} = \max_{\boldsymbol{\pi}^{p,a_j,r_i}} \mathrm{s\_srl}(\mathbf{x}, p, a_j, r_i, \boldsymbol{\pi}^{p,a_j,r_i})$
   (b) Scores of NULL assignments³: for $1 \le j \le n$, $S_{k+1,j} = 0$
4. Let $S_0 = \min_{i,j} S_{i,j}$, the minimum of any score in $S$. Define a matrix of non-negative scores $W \in \mathbb{R}^{N \times N}$ as follows:
   (a) for $1 \le i \le k$, $1 \le j \le n$: $W_{i,j} = S_{i,j} - S_0$
   (b) for $k < i \le N$, $1 \le j \le n$: $W_{i,j} = S_{k+1,j} - S_0$
   (c) for $1 \le i \le N$, $n < j \le N$: $W_{i,j} = 0$
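Steps 3 and 4 of Algorithm 1 can be sketched in Python as below. The sketch assumes that the best-path scores $S_{i,j}$ have already been computed and are passed in as a $k \times n$ list; the toy scores for “Mary plays guitar” are invented for illustration, and the exhaustive solver is a stand-in for the Hungarian algorithm that is only practical for very small $N$. All function names are hypothetical.

```python
from itertools import permutations


def build_assignment_matrix(role_scores):
    """Steps 3-4 of Algorithm 1: turn k x n best-path scores into an N x N matrix W.

    role_scores[i][j] is assumed to hold S_{i,j}, the best-path score of
    giving role i to argument candidate j. Requires k >= 1 and n >= 1.
    """
    k, n = len(role_scores), len(role_scores[0])
    size = n + k
    # Row k of S is the NULL-role row, fixed to 0 as in step 3(b).
    S = [list(row) for row in role_scores] + [[0.0] * n]
    s0 = min(min(row) for row in S)
    W = [[0.0] * size for _ in range(size)]  # step 4(c): NULL-argument columns stay 0
    for i in range(size):
        source = S[i] if i < k else S[k]     # steps 4(a) and 4(b)
        for j in range(n):
            W[i][j] = source[j] - s0
    return W


def solve_bruteforce(W):
    """Exhaustive stand-in for the Hungarian algorithm (tiny inputs only)."""
    idx = range(len(W))
    return max(permutations(idx), key=lambda f: sum(W[i][f[i]] for i in idx))


def decode(f, k, n):
    """Keep only pairs (role i, candidate j) where neither side is NULL."""
    return {(i, f[i]) for i in range(k) if f[i] < n}


# Invented scores: rows ARG0, ARG1, ARG2; columns Mary, plays, guitar.
S_toy = [[2.0, -1.0, 0.5],
         [0.5, -1.0, 1.5],
         [-0.5, -1.0, -0.25]]
W_toy = build_assignment_matrix(S_toy)
roles_to_args = decode(solve_bruteforce(W_toy), k=3, n=3)
```

With these scores the decoded assignment gives ARG0 to “Mary” and ARG1 to “guitar”, while ARG2 and the token “plays” are matched to NULL nodes. Every bijection pays the shift $-S_0$ exactly once per real token column, so the shift does not change which assignment is optimal.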
4 A Dual Decomposition Algorithm
We now present a dual decomposition method to optimize Eq. 4 that uses the assignment algorithm presented above as a subroutine. Our method is similar to that of Koo et al. (2010), in the sense that our joint optimization can be decomposed into two sub-problems that need to agree on the syntactic dependencies they predict. For a detailed description of dual decomposition methods applied to NLP see Sontag et al. (2010) and Rush et al. (2010).
³ In our model we fix the score of null assignments to 0. It is straightforward to compute a discriminative score instead.
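We note that in Eq. 4 the constraint cSubtree ties the syntactic and semantic structures, imposing that any path $\boldsymbol{\pi}^{p,a,r}$ that links a predicate $p$ with an argument $a$ must be a subtree of the full syntactic structure $\mathbf{y}$. Formally the set of constraints is:

$$y_{h,m,l} \ge \pi^{p,a,r}_{h,m,l} \quad \forall\, p, a, r, h, m, l .$$

These constraints can be compactly written as

$$c \cdot y_{h,m,l} \ge \sum_{p,a,r} \pi^{p,a,r}_{h,m,l} \quad \forall\, h, m, l ,$$

where $c$ is a constant equal to the number of distinct semantic dependencies $\langle p, a, r \rangle$. In addition, we can introduce a vector of non-negative slack variables $\boldsymbol{\xi}$ with a component $\xi_{h,m,l}$ for each syntactic dependency, turning the constraints into:

$$c \cdot y_{h,m,l} - \sum_{p,a,r} \pi^{p,a,r}_{h,m,l} - \xi_{h,m,l} = 0 \quad \forall\, h, m, l .$$

We can now rewrite Eq. 4 as:

$$\operatorname*{argmax}_{\mathbf{y},\mathbf{z},\boldsymbol{\pi},\boldsymbol{\xi} \ge 0} \; \mathrm{s\_syn}(\mathbf{x},\mathbf{y}) + \mathrm{s\_srl}(\mathbf{x},\mathbf{z},\boldsymbol{\pi}) \tag{7}$$

subject to cTree, cRole, cArg, cPath and

$$\forall\, h, m, l: \quad c \cdot y_{h,m,l} - \sum_{p,a,r} \pi^{p,a,r}_{h,m,l} - \xi_{h,m,l} = 0 .$$

As in Koo et al. (2010), we will relax subtree constraints by introducing a vector of Lagrange multipliers $\boldsymbol{\lambda}$ indexed by syntactic dependencies, i.e. each coordinate $\lambda_{h,m,l}$ is a Lagrange multiplier for the constraint associated with $\langle h, m, l \rangle$. The Lagrangian of the problem is:

$$L(\mathbf{y},\mathbf{z},\boldsymbol{\pi},\boldsymbol{\xi},\boldsymbol{\lambda}) = \mathrm{s\_syn}(\mathbf{x},\mathbf{y}) + \mathrm{s\_srl}(\mathbf{x},\mathbf{z},\boldsymbol{\pi}) + \boldsymbol{\lambda} \cdot \Big( c \cdot \mathbf{y} - \sum_{p,a,r} \boldsymbol{\pi}^{p,a,r} - \boldsymbol{\xi} \Big) \tag{8}$$

We can now formulate Eq. 7 as:

$$\max_{\substack{\mathbf{y},\mathbf{z},\boldsymbol{\pi},\boldsymbol{\xi} \ge 0 \\ \text{s.t. cTree, cRole, cArg, cPath} \\ c \cdot \mathbf{y} - \sum_{p,a,r} \boldsymbol{\pi}^{p,a,r} - \boldsymbol{\xi} = \mathbf{0}}} L(\mathbf{y},\mathbf{z},\boldsymbol{\pi},\boldsymbol{\xi},\boldsymbol{\lambda}) \tag{9}$$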
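This optimization problem has the property that its optimum value is the same as the optimum of Eq. 7 for any value of $\boldsymbol{\lambda}$. This is because whenever the constraints are satisfied, the terms in the Lagrangian involving $\boldsymbol{\lambda}$ are zero. If we remove the subtree constraints from Eq. 9 we obtain the dual objective:

$$D(\boldsymbol{\lambda}) = \max_{\substack{\mathbf{y},\mathbf{z},\boldsymbol{\pi},\boldsymbol{\xi} \ge 0 \\ \text{s.t. cTree, cRole, cArg, cPath}}} L(\mathbf{y},\mathbf{z},\boldsymbol{\pi},\boldsymbol{\xi},\boldsymbol{\lambda}) \tag{10}$$

$$= \max_{\mathbf{y} \text{ s.t. cTree}} \Big( \mathrm{s\_syn}(\mathbf{x},\mathbf{y}) + c \cdot \mathbf{y} \cdot \boldsymbol{\lambda} \Big) + \max_{\substack{\mathbf{z},\boldsymbol{\pi} \\ \text{s.t. cRole, cArg, cPath}}} \Big( \mathrm{s\_srl}(\mathbf{x},\mathbf{z},\boldsymbol{\pi}) - \boldsymbol{\lambda} \cdot \sum_{p,a,r} \boldsymbol{\pi}^{p,a,r} \Big) + \max_{\boldsymbol{\xi} \ge 0} \big( -\boldsymbol{\lambda} \cdot \boldsymbol{\xi} \big) \tag{11}$$

The dual objective is an upper bound to the optimal value of the primal objective of Eq. 7. Thus, we are interested in finding the minimum of the dual in order to tighten the upper bound. We will solve

$$\min_{\boldsymbol{\lambda}} D(\boldsymbol{\lambda}) \tag{12}$$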
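using a subgradient method. Algorithm 2 presents pseudo-code. The algorithm takes advantage of the decomposed form of the dual in Eq. 11, where we have rewritten the Lagrangian such that syntactic and semantic structures appear in separate terms. This allows subgradients to be computed efficiently. In particular, the subgradient of $D$ at a point $\boldsymbol{\lambda}$ is:

$$\Delta(\boldsymbol{\lambda}) = c \cdot \hat{\mathbf{y}} - \sum_{p,a,r} \hat{\boldsymbol{\pi}}^{p,a,r} - \hat{\boldsymbol{\xi}} \tag{13}$$

where

$$\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y} \text{ s.t. cTree}} \; \mathrm{s\_syn}(\mathbf{x},\mathbf{y}) + c \cdot \mathbf{y} \cdot \boldsymbol{\lambda} \tag{14}$$

$$\hat{\mathbf{z}}, \hat{\boldsymbol{\pi}} = \operatorname*{argmax}_{\substack{\mathbf{z},\boldsymbol{\pi} \\ \text{s.t. cRole, cArg, cPath}}} \; \mathrm{s\_srl}(\mathbf{x},\mathbf{z},\boldsymbol{\pi}) - \boldsymbol{\lambda} \cdot \sum_{p,a,r} \boldsymbol{\pi}^{p,a,r} \tag{15}$$

$$\hat{\boldsymbol{\xi}} = \operatorname*{argmax}_{\boldsymbol{\xi} \ge 0} \; -\boldsymbol{\lambda} \cdot \boldsymbol{\xi} \tag{16}$$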
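Whenever $\hat{\boldsymbol{\pi}}$ is consistent with $\hat{\mathbf{y}}$ the subgradient will be zero and the method will converge. When paths $\hat{\boldsymbol{\pi}}$ contain a dependency $\langle h, m, l \rangle$ that is inconsistent with $\hat{\mathbf{y}}$, the associated dual $\lambda_{h,m,l}$ will increase, hence lowering the score of all paths that use $\langle h, m, l \rangle$ at the next iteration; at the same time, the total score for that dependency will increase, favoring syntactic dependency structures alternative to $\hat{\mathbf{y}}$. As in previous work, in the algorithm a parameter $\alpha_t$ controls the size of subgradient steps at iteration $t$.

Algorithm 2 A dual-decomposition algorithm for syntactic-semantic dependency parsing
Input: $\mathbf{x}$, a sentence; $T$, number of iterations
Output: syntactic and semantic structures $\hat{\mathbf{y}}$ and $\hat{\mathbf{z}}$
Notation: we use cSem = cRole ∧ cArg ∧ cPath
$\boldsymbol{\lambda}^1 = \mathbf{0}$   # initialize dual variables
$c$ = number of distinct $\langle h, m, l \rangle$ in $\mathbf{x}$
for $t = 1 \ldots T$ do
    $\hat{\mathbf{y}} = \operatorname{argmax}_{\mathbf{y} \text{ s.t. cTree}} \; \mathrm{s\_syn}(\mathbf{x},\mathbf{y}) + c \cdot \boldsymbol{\lambda}^t \cdot \mathbf{y}$
    $\hat{\mathbf{z}}, \hat{\boldsymbol{\pi}} = \operatorname{argmax}_{\mathbf{z},\boldsymbol{\pi} \text{ s.t. cSem}} \; \mathrm{s\_srl}(\mathbf{x},\mathbf{z},\boldsymbol{\pi}) - \boldsymbol{\lambda}^t \cdot \sum_{p,a,r} \boldsymbol{\pi}^{p,a,r}$
    $\boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^t$   # dual variables for the next iteration
    Set $\alpha_t$, the step size of the current iteration
    for each $\langle h, m, l \rangle$ do
        $q = \sum_{p,a,r} \hat{\pi}^{p,a,r}_{h,m,l}$   # num. paths using $\langle h, m, l \rangle$
        if $q > 0$ and $\hat{y}_{h,m,l} = 0$ then
            $\lambda^{t+1}_{h,m,l} = \lambda^{t+1}_{h,m,l} + \alpha_t q$
    break if $\boldsymbol{\lambda}^{t+1} = \boldsymbol{\lambda}^t$   # convergence
return $\hat{\mathbf{y}}, \hat{\mathbf{z}}$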
The key point of the method is that solutions to Eq. 14 and 15 can be computed efficiently using separate processes. In particular, Eq. 14 corresponds to a standard dependency parsing problem, where for each dependency $\langle h, m, l \rangle$ we have an additional score term $c \cdot \lambda_{h,m,l}$. In our experiments we use the projective dependency parsing algorithm of Eisner (2000). To calculate Eq. 15 we use the assignment method described in Section 3, where it is straightforward to introduce additional score terms $-\lambda_{h,m,l}$ to every factor $\pi^{p,a,r}_{h,m,l}$. It can be shown that whenever the subgradient method converges, the solutions $\hat{\mathbf{y}}$ and $\hat{\mathbf{z}}$ are the optimal solutions to our original problem in Eq. 4 (see Koo et al. (2010) for a justification). In practice we run the subgradient method for a maximum number of iterations, and return the solutions of the last iteration if it does not converge.
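The control flow of Algorithm 2 can be sketched as follows. This is a schematic Python version under several assumptions: the two sub-problems are supplied as callables (`parse_fn` and `srl_fn`, both hypothetical names) that already add the dual terms to their scores, dual variables are stored sparsely in a dictionary, and the step size comes from a user-supplied `step_fn`. The toy sub-problems at the end choose among two hand-written candidates each and only serve to exercise the loop.

```python
from collections import Counter


def dual_decomposition(parse_fn, srl_fn, num_iters, step_fn):
    """Subgradient loop of Algorithm 2 with sparse dual variables.

    parse_fn(lam) -> set of dependencies (h, m, l), solving Eq. 14.
    srl_fn(lam)   -> (z_hat, paths_hat), solving Eq. 15, where paths_hat
                     maps each active (p, a, r) to its tuple of dependencies.
    """
    lam = {}  # missing entries are dual variables equal to 0
    y_hat, z_hat = None, None
    for t in range(1, num_iters + 1):
        y_hat = parse_fn(lam)
        z_hat, paths_hat = srl_fn(lam)
        # q[dep] = number of predicted paths that use dependency dep
        q = Counter(dep for path in paths_hat.values() for dep in path)
        new_lam = dict(lam)
        alpha = step_fn(t)
        for dep, count in q.items():
            if dep not in y_hat:  # used by a path but absent from the tree
                new_lam[dep] = new_lam.get(dep, 0.0) + alpha * count
        if new_lam == lam:        # no update: the two solutions agree
            break
        lam = new_lam
    return y_hat, z_hat, lam


# Toy sub-problems: the parser prefers arc A, the SRL model prefers a path using B.
A, B = (2, 1, "SBJ"), (2, 1, "OBJ")
TREES = [({A}, 2.0), ({B}, 0.5)]      # (tree, s_syn score)
PATHS = [((B,), 1.0), ((A,), 0.9)]    # (path, s_srl score)
SEM = (2, 1, "ARG0")
C = 1                                 # one semantic dependency in this toy


def toy_parse(lam):
    return max(TREES, key=lambda ts: ts[1] + C * sum(lam.get(d, 0.0) for d in ts[0]))[0]


def toy_srl(lam):
    path = max(PATHS, key=lambda ps: ps[1] - sum(lam.get(d, 0.0) for d in ps[0]))[0]
    return {SEM}, {SEM: path}


y_hat, z_hat, lam = dual_decomposition(toy_parse, toy_srl, 10, lambda t: 0.5)
```

In the toy run the first iteration disagrees on arc `B`, its dual variable is raised, and at the second iteration the SRL side switches to the path that uses `A`, so the loop stops with consistent structures. A constant step size is used only to keep the example short.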
5 Related Work
Recently, there have been a number of approaches to joint parsing of syntactic and semantic dependencies, partly because of the availability of treebanks in this format popularized by the CoNLL shared tasks (Surdeanu et al., 2008; Hajič et al., 2009).
As in our method, Johansson (2009) defined a model that exploits features of a semantic dependency together with the syntactic path connecting the predicate and the argument. That method uses an approximate parsing algorithm that employs $k$-best inference and beam search. Similarly, Lluís et al. (2009) defined a joint model that forces the predicate structure to be represented in the syntactic dependency tree, by enriching arcs with semantic information. The semantic component uses features of pre-computed syntactic structures that may diverge from the joint structure. In contrast, our joint parsing method is exact whenever the dual decomposition algorithm converges.
Titov et al. (2009) augmented a transition-based dependency parser with operations that produce synchronous derivations of syntactic and semantic structures. Instead of explicitly representing semantic dependencies together with a syntactic path, they induce latent representations of the interactions between syntactic and semantic layers.
In all the works mentioned, the model has no control over assignment constraints that disallow labeling multiple arguments with the same semantic role. Punyakanok et al. (2008) first introduced a system that explicitly controls these constraints, as well as other constraints that look at pairwise assignments, which we cannot model. They solve SRL using general-purpose Integer Linear Programming (ILP) methods. In a similar spirit, Riedel and McCallum (2011) presented a model for extracting structured events that controls interactions between predicate-argument assignments. They take into account pairwise assignments and solve the optimization problem with dual decomposition. More recently, Das et al. (2012) proposed a dual decomposition method that deals with several assignment constraints for predicate-argument relations. Their method is an alternative to general ILP methods. To our knowledge, our work is the first that frames SRL as a linear assignment task, for which simple and exact algorithms exist. We should note that these works model predicate-argument relations with assignment constraints, but none of them predicts the underlying syntactic structure.
Our dual decomposition method follows from that of Koo et al. (2010). In both cases two separate processes predict syntactic dependency structures, and the dual decomposition algorithm seeks agreement at the level of individual dependencies. One difference is that our semantic process predicts partial syntax (restricted to syntactic paths connecting predicates and arguments), while in their case each of the two processes predicts the full set of dependencies.
6 Experiments
We present experiments using our syntactic-semantic parser on the CoNLL-2009 Shared Task English benchmark (Hajič et al., 2009). It consists of the usual WSJ training/development/test sections mapped to dependency trees, augmented with semantic predicate-argument relations from PropBank (Palmer et al., 2005) and NomBank (Meyers et al., 2004) also represented as dependencies. It also contains a PropBanked portion of the Brown corpus as an out-of-domain test set.
Our goal was to evaluate the contributions of parsing algorithms in the following configurations:
Base Pipeline: Runs a syntactic parser and then runs an SRL parser constrained to paths of the best syntactic tree. In the SRL it only enforces constraint cArg, by simply classifying the candidate argument in each path into one of the possible semantic roles or as NULL.
Pipeline with Assignment: Runs the assignment algorithm for SRL, enforcing constraints cRole and cArg, but constrained to paths of the best syntactic tree.
Forest: Runs the assignment algorithm for SRL on a large set of precomputed syntactic paths, described below. This configuration corresponds to running Dual Decomposition for a single iteration, and is not guaranteed to predict consistent syntactic and semantic structures.
Dual Decomposition (DD): Runs dual decomposition using the assignment algorithm on the set of precomputed paths. Syntactic and semantic structures are consistent when it reaches convergence.
All four systems used the same type of discriminative scorers and features. Next we provide details about these systems. Then we present the results.
6.1 Implementation
Syntactic model. We used two discriminative arc-factored models for labeled dependency parsing: a first-order model, and a second-order model with grandchildren interactions, both reimplementations of the parsers by McDonald et al. (2005) and Carreras (2007) respectively. In both cases we used projective dependency parsing algorithms based on (Eisner, 2000).⁴ To learn the models, we used a log-linear loss function following Koo et al. (2007), which trains probabilistic discriminative parsers.
At test time, we used the probabilistic parsers to compute marginal probabilities $p(h, m, l \mid \mathbf{x})$, using inside-outside algorithms for first/second-order models. Hence, for either of the parsing models, we always obtain a table of first-order marginal scores, with one score per labeled dependency. Then we run first-order inference with these marginals to obtain the best tree. We found that the higher-order parser performed equally well on development using this method as using second-order inference to predict trees. Since we run the parser multiple times within Dual Decomposition, our strategy results in faster parsing times.
Precomputed Paths. Both Forest and Dual Decomposition run assignment on a set of precomputed paths, and here we explain how we build it. We first observed that 98.4% of the correct arguments in development data are either direct descendants of the predicate, direct descendants of an ancestor of the predicate, or an ancestor of the predicate.⁵ All methods we test are restricted to this syntactic scope. To generate a list of paths, we proceeded as follows:
• Calculate marginals of unlabeled dependencies using the first-order parser: $p(h, m \mid \mathbf{x}) = \sum_{l} p(h, m, l \mid \mathbf{x})$. Note that for each $m$, the probabilities $p(h, m \mid \mathbf{x})$ for all $h$ form a distribution (i.e. they sum to one). Then, for each $m$, keep the most-likely dependencies that cover at least 90% of the mass, and prune the rest.
• Starting from a predicate $p$, generate a path by taking any number of dependencies that ascend, and optionally adding one dependency that descends. We constrained paths to be projective, and to have a maximum number of 6 ascendant dependencies.
• Label each unlabeled edge $\langle h, m \rangle$ in the paths with $l = \operatorname{argmax}_{l} \, p(h, m, l \mid \mathbf{x})$.
⁴ Our method allows the use of non-projective dependency parsing methods seamlessly.
⁵ This is specific to CoNLL-2009 data for English. In general, for other languages the coverage of these rules may be lower. We leave this question to future work.
On development data, this procedure generated an average of 43.8 paths per predicate that cover 86.2% of the correct paths. In contrast, enumerating paths of the single-best tree covers 79.4% of correct paths for the first-order parser, and 82.2% for the second-order parser.6
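The path-generation procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary layout of the marginals (keys `(h, m, l)`), and the convention that token 0 is the artificial root are all hypothetical, and the projectivity constraint on paths is omitted for brevity.

```python
from collections import defaultdict

def unlabeled_marginals(labeled):
    """Sum p(h, m, l | x) over labels l to obtain p(h, m | x)."""
    unl = defaultdict(float)
    for (h, m, _label), p in labeled.items():
        unl[(h, m)] += p
    return dict(unl)

def prune_heads(unl, mass=0.9):
    """For each modifier m, keep the most likely heads covering at least `mass`."""
    by_mod = defaultdict(list)
    for (h, m), p in unl.items():
        by_mod[m].append((p, h))
    kept = {}
    for m, cands in by_mod.items():
        cands.sort(reverse=True)          # most probable heads first
        total, heads = 0.0, []
        for p, h in cands:
            heads.append(h)
            total += p
            if total >= mass:
                break
        kept[m] = heads
    return kept

def best_label(labeled, h, m):
    """Return argmax_l p(h, m, l | x) for the unlabeled edge (h, m)."""
    cands = [(p, l) for (hh, mm, l), p in labeled.items() if (hh, mm) == (h, m)]
    return max(cands)[1]

def enumerate_paths(pred, kept, max_up=6, root=0):
    """Paths from `pred`: up to `max_up` ascending edges, then at most one
    descending edge. Each path is a list of (head, modifier) edges.
    Assumption: the projectivity check is left out of this sketch."""
    children = defaultdict(list)
    for m, heads in kept.items():
        for h in heads:
            children[h].append(m)
    paths = []

    def extend(node, edges, visited):
        # Optionally close the path with a single descending edge.
        for c in children[node]:
            if c not in visited:
                paths.append(edges + [(node, c)])
        if len(edges) == max_up:          # `edges` holds ascending edges only
            return
        for h in kept.get(node, []):
            if h == root or h in visited:
                continue
            up = edges + [(h, node)]
            paths.append(up)              # ascending-only path: argument is an ancestor
            extend(h, up, visited | {h})

    extend(pred, [], {pred})
    return paths
```

Because several heads may survive pruning for each modifier, the enumeration explores a small forest rather than a single tree, which is consistent with the higher path coverage reported for this procedure compared with the single-best tree.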
SRL model. We used a discriminative model with features similar to those in the system of Johansson (2009). In addition, we included the following:
• Unigram/bigram/trigram path features. For all n-grams in the syntactic path, patterns of words and POS tags (e.g., mary+loves+to, mary+VB+to).
• Voice features. The predicate voice together with the word/POS of the argument (e.g., passive+mary).
• Path continuity. Count of non-consecutive tokens in a predicate-argument path.
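To make the first two feature families concrete, here is a small sketch of how such feature strings might be produced. The exact templates are not given in the text, so this assumes that every position of an n-gram contributes either its word form or its POS tag; the function names and the `+` separator (taken from the examples above) are illustrative only.

```python
from itertools import product

def path_ngram_features(words, tags, max_n=3):
    """Word/POS patterns for all n-grams (n <= max_n) along a syntactic path.

    `words` and `tags` are parallel lists for the tokens on the path.
    Assumption: each position is realized as either its word or its tag.
    """
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            for choice in product((0, 1), repeat=n):
                toks = [words[i + k] if c == 0 else tags[i + k]
                        for k, c in enumerate(choice)]
                feats.add("+".join(toks))
    return feats

def voice_feature(voice, arg_token):
    """Predicate voice conjoined with the word or POS tag of the argument."""
    return f"{voice}+{arg_token}"
```

For the path tokens mary/loves/to with tags NNP/VB/TO, this yields patterns such as mary+loves+to and mary+VB+to, matching the examples in the list.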
To train SRL models we used the averaged perceptron (Collins, 2002). For the base pipeline we trained standard SRL classifiers. For the remaining models we used the structured perceptron, running the assignment algorithm as the inference routine. In this case, we generated a large set of syntactic paths for training using the procedure described above, and we set the loss function to penalize mistakes in predicting the semantic role of arguments and their syntactic path.
Dual Decomposition. We added a parameter β weighting the syntactic and semantic components of the model as follows:
$(1 - \beta)\, s_{\mathrm{syn}}(x, y) + \beta\, s_{\mathrm{srl}}(x, z, \pi)$ .
As syntactic scores we used normalized marginal probabilities of dependencies, either from the first or the higher-order parser. The scores of all factors of the SRL model were normalized at every sentence to be between -1 and 1. The remaining details of the method were implemented following Koo et al. (2010), including the strategy for decreasing the step size α_t. We ran the algorithm for up to 500 iterations, with an initial step size of 0.001.
6 One can evaluate the maximum recall on correct arguments that can be obtained, irrespective of whether the syntactic path is correct: for the set of paths it is 98.3%, while for single-best trees it is 91.9% and 92.7% for first and second-order models.
Method     o  LAS    UAS    semp   semr   semF1  sempp
Pipeline   1  85.32  88.86  86.23  67.67  75.83  45.64
w. Assig.  1  85.32  88.86  84.08  71.82  77.47  51.17
Forest     -  -      -      80.67  73.60  76.97  51.33
Pipeline   2  87.77  90.96  87.07  68.65  76.77  47.07
w. Assig.  2  87.77  90.96  85.21  73.41  78.87  53.80
Table 1: Results on development for the baseline and assignment pipelines, running first and second-order syntactic parsers, and the Forest method. o indicates the order of syntactic inference.
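The β-weighted objective and the per-sentence score normalization used for Dual Decomposition can be illustrated with a short sketch. The text does not say how scores are mapped into [-1, 1], so dividing by the largest absolute value is an assumption here. The step-size schedule is likewise only our reading of the kind of strategy described by Koo et al. (2010), namely dividing the initial step by one plus the number of iterations at which the dual objective got worse; all function names are hypothetical.

```python
def normalize_scores(scores):
    """Scale a sentence's factor scores into [-1, 1].

    Assumption: normalization divides by the largest absolute score.
    """
    largest = max((abs(s) for s in scores.values()), default=0.0)
    if largest == 0.0:
        return dict(scores)
    return {key: s / largest for key, s in scores.items()}

def joint_score(s_syn, s_srl, beta):
    """Weighted objective (1 - beta) * s_syn(x, y) + beta * s_srl(x, z, pi)."""
    return (1.0 - beta) * s_syn + beta * s_srl

def step_size(alpha0, num_dual_increases):
    """Assumed schedule: shrink the step each time the dual objective increases."""
    return alpha0 / (1 + num_dual_increases)
```

With β close to 0 the syntactic marginals dominate the joint objective, and with β close to 1 the SRL component dominates, which is the trade-off explored when β is varied.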
6.2 Results
To evaluate syntactic dependencies we use unlabeled attachment score (UAS), i.e., the percentage of words with the correct head, and labeled attachment score (LAS), i.e., the percentage of words with the correct head and syntactic label. Semantic predicate-argument relations are evaluated with precision (semp), recall (semr) and F1 measure (semF1) at the level of labeled semantic dependencies. In addition, we measure the percentage of perfectly predicted predicate structures (sempp).7
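The metrics just defined can be written down compactly. The sketch below is illustrative only: the data layout (one `(head, label)` pair per word, and `(predicate, argument, role)` triples for semantic dependencies) and the function names are assumptions, and predicate senses are ignored as in the evaluation described here.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent; gold/pred hold one (head, label) per word."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas / n, 100.0 * las / n

def semantic_prf(gold, pred):
    """Precision, recall and F1 over sets of (predicate, argument, role) triples."""
    correct = len(gold & pred)
    p = 100.0 * correct / len(pred) if pred else 0.0
    r = 100.0 * correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def perfect_predicates(gold, pred):
    """Percentage of gold predicates whose full argument structure is predicted."""
    predicates = {t[0] for t in gold}
    if not predicates:
        return 0.0
    ok = sum({t for t in gold if t[0] == q} == {t for t in pred if t[0] == q}
             for q in predicates)
    return 100.0 * ok / len(predicates)
```

As a sanity check on the F1 definition, a precision of 86.23 and a recall of 67.67 give 2 · 86.23 · 67.67 / (86.23 + 67.67) ≈ 75.83, which agrees with the first-order Pipeline row of Table 1.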
Table 1 shows the results on the development set for our first three methods. We can see that the pipeline methods running assignment improve over the baseline pipelines in semantic F1 by about 2 points, due to the application of the cRole constraint.
The Forest method also shows an improvement in recall of semantic roles with respect to the pipeline methods. Presumably, the set of paths available in the Forest model allows it to recognize a higher number of arguments at the expense of lower precision. Regarding the percentage of perfect predicate-argument structures, there is a remarkable improvement in the systems that apply the full set of constraints using the assignment algorithm. We believe that the cRole constraint, which ensures no repeated roles for a given predicate, is a key factor in predicting the full set of arguments of a predicate.
7 Our evaluation metrics differ slightly from the official metric at CoNLL-2009. That metric considers predicate senses as special semantic dependencies and, thus, it includes them in the calculation of the evaluation metrics. In this paper, we are not addressing predicate sense disambiguation and, consequently, we ignore predicate senses when presenting evaluation results. When we report the performance of CoNLL systems, their scores will be noticeably lower than the scores reported at the shared task. This is because predicate disambiguation is a reasonably simple task with a very high baseline around 90%.
o  β    LAS    UAS    semp   semr   semF1  sempp  %conv
1  0.1  85.32  88.86  84.09  71.84  77.48  51.77  100
1  0.4  85.36  88.91  84.07  71.94  77.53  51.85  100
1  0.5  85.38  88.93  84.08  72.03  77.59  51.96  100
1  0.6  85.41  88.95  84.05  72.19  77.67  52.03  99.8
1  0.7  85.44  89.00  84.10  72.42  77.82  52.24  99.7
1  0.8  85.48  89.02  83.99  72.69  77.94  52.57  99.5
1  0.9  85.39  88.93  83.68  72.82  77.88  52.49  99.8
2  0.1  87.78  90.96  85.20  73.11  78.69  53.74  100
2  0.4  87.78  90.96  85.21  73.12  78.70  53.74  100
2  0.5  87.78  90.96  85.19  73.12  78.70  53.72  100
2  0.6  87.78  90.96  85.20  73.13  78.70  53.72  99.9
2  0.7  87.78  90.96  85.19  73.13  78.70  53.72  99.8
2  0.8  87.80  90.98  85.20  73.18  78.74  53.77  99.8
2  0.9  87.84  91.02  85.20  73.23  78.76  53.82  100
Table 2: Results of the dual decomposition method on development data, for different values of the β parameter. o is the order of the syntactic parser. %conv is the percentage of examples that converged.
The Forest configuration is the starting point to run the dual decomposition algorithm. We ran experiments for various values of the β parameter. Table 2 shows the results. We see that as we increase β, the SRL component has more relative weight, and the syntactic structure changes. The DD methods are always able to improve over the Forest methods, and converge in at least 99.5% of sentences.
Compared to the pipeline running assignment, DD improves semantic F1 for first-order inference, but not for higher-order inference, suggesting that 2nd order predictions of paths are quite accurate. We also observe slight benefits in syntactic accuracy.
Table 3 presents results of our system on the test sets, where we run Pipeline with Assignment and Dual Decomposition with our best configuration (β = 0.8/0.9 for 1st/2nd order syntax). For comparison, the table also reports the results of the best CoNLL-2009 joint system, Merlo09 (Gesmundo et al., 2009), which proved to be very competitive, ranking third in the closed challenge. We also include Lluís09 (Lluís et al., 2009), which is another joint syntactic-semantic system from CoNLL-2009.8 In the WSJ test DD obtains the best syntactic accuracies, while the Pipeline obtains the best semantic F1. The bottom part of Table 3 presents results on the out-of-domain Brown test corpus. In this case, DD obtains slightly better results than the rest, both in terms of syntactic accuracy and semantic F1.
Table 4 shows statistical significance tests for the syntactic LAS and semantic F1 scores of Table 3. We have applied the sign test (Wackerly et al., 2007) and approximate randomization tests (Yeh, 2000) to all pairs of system outputs. The differences between systems in the WSJ test can be considered significant in almost all cases at p = 0.05. In the Brown test set, results are more unstable and differences are not significant in general, probably because of the relatively small size of that test set.
Regarding running times, our implementation of the baseline pipeline with 2nd order inference parses the development set (1,334 sentences) in less than 7 minutes. Running assignment in the pipeline increases parsing time by ∼8% due to the overhead from the assignment algorithm. The Forest method, with an average of 61.3 paths per predicate, is ∼13% slower than the pipeline due to the exploration of the space of precomputed paths. Finally, Dual Decomposition with 2nd order inference converges in 36.6 iterations per sentence on average. The first iteration of DD has to perform roughly the same work as Forest, while subsequent iterations only need to re-parse the sentence with respect to the dual updates, which are extremely sparse. Our current implementation did not take advantage of the sparsity of updates, and overall, DD was on average 13 times slower than the pipeline running assignment and 15 times slower than the baseline pipeline.
8 Another system to compare to is the joint system by Johansson (2009). Unfortunately, a direct comparison is not possible because it is evaluated on the CoNLL-2008 datasets, which are slightly different. However, note that Merlo09 is an application of the system by Titov et al. (2009). In that paper the authors report results on the CoNLL-2008 datasets, and they are comparable to Johansson's.
WSJ             LAS    UAS    semp   semr   semF1  sempp
Lluís09         87.48  89.91  73.87  67.40  70.49  39.68
Merlo09         88.79  91.26  81.00  76.45  78.66  54.80
Pipe-Assig 1st  86.85  89.68  85.12  73.78  79.05  54.12
DD 1st          87.04  89.89  85.03  74.56  79.45  54.92
Pipe-Assig 2nd  89.19  91.62  86.11  75.16  80.26  55.96
DD 2nd          89.21  91.64  86.01  74.84  80.04  55.73
Brown           LAS    UAS    semp   semr   semF1  sempp
Lluís09         80.92  85.96  62.29  59.22  60.71  29.79
Merlo09         80.84  86.32  68.97  63.06  65.89  38.92
Pipe-Assig 1st  80.96  86.58  72.91  60.16  65.93  38.44
DD 1st          81.18  86.86  72.53  60.76  66.12  38.13
Pipe-Assig 2nd  82.56  87.98  73.94  61.63  67.23  38.99
DD 2nd          82.61  88.04  74.12  61.59  67.28  38.92
Table 3: Comparative results on the CoNLL-2009 English test sets, namely the WSJ test (top table) and the out-of-domain test from the Brown corpus (bottom table).
WSJ Brown
ME PA1 DD1 PA2 DD2 ME PA1 DD1 PA2 DD2
LL ◦•◦• • ◦•◦• ◦•◦•
ME ◦• ◦•◦•◦• • •
PA1 ◦•◦•◦• • ◦• ◦•
DD1 ◦•◦• ◦• ◦•
PA2 •
Table 4: Statistical tests of significance for LAS and semF1 differences between pairs of systems from Table 3. ◦/• = LAS difference is significant by the sign/approximate randomization tests at the 0.05 level. / = same meaning for semF1. The legend for systems is: LL: Lluís09, ME: Merlo09, PA1/2: Pipeline with Assignment, 1st/2nd order, DD1/2: Dual Decomposition, 1st/2nd order.
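The two significance tests behind Table 4 can be sketched as follows. This is a simplified illustration rather than the evaluation code used in the paper: it assumes each system is summarized by one additive score per sentence (for example, the number of correctly attached words), whereas corpus-level measures such as semF1 would require recomputing the metric on each shuffled pairing. Function names, the number of trials and the seed are arbitrary choices.

```python
import random
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided exact sign test on paired per-sentence scores (ties dropped)."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test.

    Randomly swaps the two systems' outputs sentence by sentence and counts
    how often the shuffled difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b)))
    hits = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)
```

A difference would be marked as significant at the 0.05 level when the returned p-value is below 0.05. With few sentences, as in the Brown test set, both tests have less power, which fits the observation that differences there are generally not significant.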
7 Conclusion
We have introduced efficient methods to parse syntactic dependency structures augmented with predicate-argument relations, with two key ideas. One is to predict the local syntactic structure that links a predicate with its arguments, and seek agreement with the full syntactic structure using dual decomposition techniques. The second is to control linear assignment constraints in the predicate-argument structure.
In experiments we observe large improvements resulting from the assignment constraints. As for the dual decomposition technique for joint parsing, it does improve over the pipelines when we use a first-order parser. This means that in this configuration the explicit semantic features help to find a solution that is better in both layers. To some extent, this empirically validates the research objective of joint models. However, when we move to second-order parsers the differences with respect to the pipeline are insignificant. It is to be expected that as syntactic parsers improve, the need for joint methods becomes less critical. It remains an open question whether large improvements can be achieved by integrating syntactic-semantic features. To study this question, it is necessary to have efficient parsing algorithms for joint dependency structures. This paper contributes a method that has optimality guarantees whenever it converges.
Our method can incorporate richer families of features. It is straightforward to incorporate better semantic representations of predicates and arguments than just plain words, e.g. by exploiting WordNet or distributional representations as in (Zapirain et al., 2013). Potentially, this could result in larger improvements in the performance of syntactic and semantic parsing.
It is also necessary to experiment with different languages, where the performance of syntactic parsers is lower than in English, and hence there is potential for improvement. Our treatment of local syntactic structure that links predicates with arguments, based on explicit enumeration of likely paths, was simplistic. Future work should explore methods that model the syntactic structure linking predicates with arguments: whenever this structure can be parsed efficiently, our dual decomposition algorithm can be employed to define an efficient joint system.
Acknowledgments
We thank the editor and the anonymous reviewers for their valuable feedback. This work was financed by the European Commission for the XLike project (FP7-288342); and by the Spanish Government for project OpenMT-2 (TIN2009-14675-C03-01), project Skater (TIN2012-38584-C06-01), and a Ramón y Cajal contract for Xavier Carreras (RYC-2008-02223).
References
Rainer Burkard, Mario Dell’Amico, and Silvano Martello. 2009. Assignment Problems. Society for Industrial and Applied Mathematics.
Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 152–164, Ann Arbor, Michigan, June.
Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 957–961, Prague, Czech Republic, June.
Michael Collins. 2002. Discriminative training methods for Hidden Markov Models: Theory and experiments with Perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1–8, July.
Dipanjan Das, André F. T. Martins, and Noah A. Smith. 2012. An exact dual decomposition algorithm for shallow semantic parsing with constraints. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pages 209–217, Stroudsburg, PA, USA.
Jason Eisner. 2000. Bilexical grammars and their cubic-time parsing algorithms. In Harry Bunt and Anton Nijholt, editors, Advances in Probabilistic and Other Parsing Technologies, pages 29–62. Kluwer Academic Publishers, October.
Andrea Gesmundo, James Henderson, Paola Merlo, and Ivan Titov. 2009. A latent variable model of synchronous syntactic-semantic parsing for multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 37–42, Boulder, Colorado, June.
Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288, September.
Daniel Gildea and Martha Palmer. 2002. The necessity of parsing for predicate argument recognition. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 239–246, Philadelphia, Pennsylvania, USA, July.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009): Shared Task, pages 1–18, Boulder, Colorado, USA, June.
Richard Johansson. 2009. Statistical bistratal dependency parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 561–569, Singapore, August.
Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 141–150, Prague, Czech Republic, June.
Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1288–1298, Cambridge, MA, October.