• Aucun résultat trouvé

CD at ENSEC 2016: Generating Characteristic and Diverse Entity Summaries

N/A
N/A
Protected

Academic year: 2022

Partager "CD at ENSEC 2016: Generating Characteristic and Diverse Entity Summaries"

Copied!
3
0
0

Texte intégral

(1)

CD at ENSEC 2016: Generating Characteristic and Diverse Entity Summaries

Danyun Xu, Liang Zheng, and Yuzhong Qu

National Key Laboratory for Novel Software Technology, Nanjing University, China {dyxu,zhengliang}@smail.nju.edu.cn,[email protected]

Abstract. We introduce our entity summarization approach called CD, which aims to select characteristic and diverse features into an entity summary. The characterizing ability of a feature is measured according to information theory. The information overlap between features considers ontological semantics of classes and properties, as well as string and numerical similarity. Finally, selecting characteristic and diverse features is formulated as a binary quadratic knapsack problem to solve.

Keywords: Entity summarization, self-information, reasoning, string similarity, numerical similarity.

1 Introduction

Our entity summarization approach, called CD, is adapted from [3]. The basic idea is to, given an entity description composed of a set of property-value pairs called features, select a size-limited subset of characteristic and diverse features as an entity summary. We formulate it as a binary quadratic knapsack problem (QKP) to solve. Specifically, the characterizing ability of a feature is measured according to information theory, and the information overlap between features considers ontological semantics of classes and properties, as well as string and numerical similarity.

2 Preliminaries

LetE,C,P, andLbe the sets of all entities, classes, properties, and literals in a dataset, respectively. The description of an entityeis a set of property-value pairs called features, denoted byd(e)⊆P×(E∪C∪L). In RDF data,d(e) is obtained from RDF triples in whicheis the subject or the object. Wheneis the subject of a triplet, the predicate (which is a property) and the object (which is an entity, a class, or a literal) oftcomprise a feature. Wheneis the object of a triplet, the inverse of the predicate and the subject oftcomprise a feature. The inverse of a propertypis a property automatically created by our approach and is distinguished fromp, though they share a common name; if a propertypi is a subproperty of a propertypj, we also define the inverse ofpi as a subproperty of the inverse of pj. Given an integer k, an entity summaryS of e is a subset ofd(e) subject to|S| ≤k.

(2)

2 D. Xu, L. Zheng, and Y. Qu

3 Approach

3.1 Characterizing Ability of a Feature

The characterizing ability of a featuref, denoted bych(f), is measured accord- ing to information theory. Specifically, we compute the normalized amount of self-information contained in the probabilistic event of observing f in an en- tity description in a dataset. A feature will have high characterizing ability if it belongs to a small number of entity descriptions:

ch(f) =

−log|{e∈E:f∈d(e)}|

|E|

log|E| , (1)

which is in the range of [0,1].

3.2 Information Overlap between Features

The information overlap between two featuresfiandfj, denoted byovlp(fi, fj), considers ontological semantics of classes and properties, as well as string and numerical similarity.

For a feature f, let prop(f) and val(f) return the property and the value off, respectively.

Firstly, we exploit ontological semantics of classes and properties. If both prop(fi) andprop(fj) arerdf:typeandval(fi) is a subclass ofval(fj) (or vice versa), we will defineovlp(fi, fj) = 1 because one of them can be inferred from the other and thus they share maximized overlapping information. Similarly, we will also defineovlp(fi, fj) = 1 ifval(fi) =val(fj) andprop(fi) is a subproperty ofprop(fj) (or vice versa).

In other cases, we calculate the string similarity between property names (isub) and the similarity between property values (sim):

ovlp(fi, fj) = max{isub(prop(fi), prop(fj)), sim(val(fi), val(fj)),0}, (2) which is in the range of [0,1]. Here,isub∈[−1,1] returns the ISub string sim- ilarity [2] between two property names; sim ∈ [−1,1] returns the similarity between two property values. To measure sim(val(fi), val(fj)), if both val(fi) andval(fj) are numerical data values, we calculate their similarity as follows.

1. Ifval(fi) =val(fj),sim(val(fi), val(fj)) = 1;

2. otherwise, ifval(fi)·val(fj)≤0,sim(val(fi), val(fj)) =−1;

3. otherwise,sim(val(fi), val(fj)) = max{|val(fmin{|val(fi)|,|val(fj)|}

i)|,|val(fj)|}.

In other cases, we treat val(fi) and val(fj) as strings; that is, for entities and classes, we take their names, and for literals, we take their string forms. Then we calculate their ISub string similarity assim.

(3)

CD: Generating Characteristic and Diverse Entity Summaries 3

3.3 Selecting Characteristic and Diverse Features

We aim to select up tokfeatures fromd(e) that maximize their total character- izing ability and minimize the total information overlap between them. To this end, we define the quality of an entity summaryS as

q(S) =γ· X

f∈d(e)

ch(f) +δ· X

fi,fj∈S

−ovlp(fi, fj), (3)

in whichγ, δ >0 are the weights of the two objectives to tune, to achieve different trade-offs.

Maximizingqcan be reformulated as an instance of QKP [1] as follows. We number the features ind(e) fromf1 tof|d(e)|. By introducing a series of binary variablesxi fori= 1· · · |d(e)|to indicate whetherfi is selected into the optimal summary, the problem is formulated as

maximize

|d(e)|

X

i=1

|d(e)|

X

j=i

pijxixj

subject to

|d(e)|

X

i=1

xi≤k

xi∈ {0,1} fori= 1· · · |d(e)|,

(4)

in whichpij is the “profit” achieved if bothfi andfj are selected:

pij =

(γ·ch(fi) ifi=j ,

δ·(−ovlp(fi, fj)) otherwise. (5) We solve QKP using a state-of-the-art heuristic algorithm [4].

Acknowledgments. This work was supported in part by the NSFC under Grant 61572247 and in part by the Fundamental Research Funds for the Central Universities.

References

1. Pisinger, D.: The Quadratic Knapsack Problem - A Survey. Discrete Appl. Math.

155(5), 623–648 (2007)

2. Stoilos, G., Stamou, G., Kollias, S.: A String Metric for Ontology Alignment. In:

Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol.

3729, pp. 624–637. Springer, Berlin Heidelberg (2005)

3. Xu, D., Cheng, G., Qu, Y.: Facilitating Human Intervention in Coreference Resolu- tion with Comparative Entity Summaries. In: Presutti, V., d’Amato, C., Gandon, F., d’Aquin, M., Staab, S., Tordai, A. (eds.) ESWC 2014. LNCS, vol. 8465, pp.

535–549. Springer International Publishing, Switzerland (2014)

4. Yang, Z., Wang, G., Chu, F.: An Effective GRASP and Tabu Search for the 0-1 Quadratic Knapsack Problem. Comput. Oper. Res. 40(5), 1176–1185 (2013)

Références

Documents relatifs

In this section, I described two classification tasks conducted over attributed networks. The first was about classifying Twitter users in order to detect real-world influence, based

Equivalent diameter Dielectric factor of snow/ice/ liquid ⫽ 0.176 for ice and snow and ⫽ 0.930 for liquid water, at 94 GHz Net longwave flux LWn Nice/snow/liquid

Based on the results of the analysis, it can be assumed that any type of data used for ontological modeling of problem areas of conflict can be described through a list of (1)

for tuples in relations which reflect their compositional structure and which can be used in conjunction with an associative memory to efficiently identify and retrieve

In it there is information about the software quality characteristics, subcharacteristics and measures.For development of this ontology the 8 software quality characteristics by

That is, each ideal summary consists of five triples selected from the description of an entity for general purpose (not task-specific summaries).. Participating systems in the

Keywords: (B)-conjecture, Brunn-Minkowski, Convex geometry, Convex measure, Entropy, Entropy power, Rényi entropy, Gaussian measure, Hamilton-Jacobi equation, Isoperimetry,

Generalizing the ideas underlying the construction of U for these two rate points, we derive new rate points on the boundary of the rate region, which are below the time-sharing