Cross-language Semantic Relations between English and Portuguese


Academic year: 2022


Relaciones Semánticas entre los Idiomas Inglés y Portugués

Anabela Barreiro L2F – INESC-ID

Rua Alves Redol no 9, 1000-029 Lisboa, Portugal

[email protected]

Hugo Gonçalo Oliveira CISUC, University of Coimbra, Pólo II

Pinhal de Marrocos 3030-290 Coimbra, Portugal

[email protected]

Resumen: Este artículo describe las relaciones semánticas conceptuales obtenidas de los recursos del sistema OpenLogos que fueron convertidos al formato NooJ. Estas relaciones están representadas simbólicamente en el léxico OpenLogos como un esquema taxonómico llamado abstracción semántico-sintáctica del lenguaje (SAL), que se utiliza para generar las relaciones jerárquicas de hiponimia e hiperonimia. El artículo también describe las relaciones acción-de, resultado-de, y sinonimia entre unidades multi-palabra y palabras sueltas, sobre todo donde existe una relación morfo-sintáctica y semántica entre las palabras de distintas categorías gramaticales. Las relaciones semánticas se generaron automáticamente a partir de la información lingüística asociada a cada entrada lexical en los diccionarios NooJ. Se desarrollaron gramáticas locales como mecanismo para leer esta información lingüística y generar las relaciones semánticas que se han utilizado en la producción de paráfrasis y en traducción automática. Los diccionarios y las gramáticas se pueden adaptar fácilmente a distintas lenguas y son útiles para diferentes tareas de procesamiento natural de la lengua, tanto monolingües como entre idiomas.

Palabras clave: relaciones semánticas, ontologías, diccionarios, gramáticas locales, relaciones entre idiomas

Abstract: This paper describes conceptual semantic relations obtained from OpenLogos resources converted into NooJ format. These relations were symbolically represented in the OpenLogos lexicon as a taxonomic scheme called semantico-syntactic abstraction language (SAL), used to generate hierarchical hyponymy and hypernymy relations. The paper also describes action-of, result-of, and synonymy relations between multiword units and single words, mostly where there is a morpho-syntactic and semantic relation between words of distinct parts-of-speech. The semantic relations were generated automatically, based on the linguistic information associated with each lexical entry in NooJ dictionaries. Local grammars were developed as a mechanism to read this linguistic information and generate the semantic relations, which have been used in paraphrasing and machine translation. Dictionaries and grammars can easily be adapted to distinct languages and are useful for various monolingual or cross-language natural language processing tasks.

Keywords: semantic relations, ontologies, dictionaries, local grammars, cross-language relations

1 Introduction

Lexical Semantics (Cruse, 1986) is the subfield of semantics that studies the words of a language and their meanings. It sees the lexicon as a finite list of lexical items (words or expressions) with a highly systematic structure that controls what words can mean. It can be seen as the bridge between a language and the knowledge expressed in that language (Sowa, 1999). The conceptual model of a language is structured around lexical items, their meaning (often referred to as sense) and the lexico-semantic relations held between the latter. To deal with the meaning of a language it is important to study these relations.

Anabela Barreiro was partially supported by the UPV, award 1931, under the program Research Visits for Renowned Scientists (PAID-02-11). Hugo Gonçalo Oliveira is supported by the FCT scholarship grant SFRH/BD/44955/2008, co-funded by FSE.

Semantic relations are crucial to understanding and structuring the meaning of natural language. They are vital to communication overall, and heavily used in technical and specialized domains, where the most important content of texts is conveyed through the semantic relations between the terms that represent the domain's concepts, rather than by the meaning of the words alone (e.g., the semantic relations between BRCA1/protein and RNF53/gene in the biomedical field). Additionally, semantic relations are important for applications in the semantic web, mapping ontologies, text categorization, natural language understanding, etc., and a requisite for paraphrasing and machine translation, where words and expressions often must be substituted by semantic equivalents, such as synonyms between support verb constructions and single verbs (make an operation = operate; say hello to = greet), or other types of semantic alternates.

The most studied lexico-semantic relations are: (1) synonymy, when different lexical items have the same meaning (e.g. car synonym-of automobile); (2) homonymy, when lexical items have the same orthographic form but different meanings (e.g. bank, financial institution vs. slope); (3) hyponymy, when a lexical item is a subclass or a specific kind of another (e.g. dog hyponym-of mammal); and (4) meronymy, when a lexical item is a part, piece or member of another (e.g. wheel part-of car).

This paper describes the first attempt to extract cross-language semantic relations between English and Portuguese from the lexical resources of the OpenLogos machine translation system described by Scott (2003) and Barreiro et al. (2011). In combination with the former resources, new resources were created, namely derivational rules and grammars to recognize and generate morpho-syntactically and semantically related words and multiword units. Semantic relations, obtained by means of local grammars developed within the NooJ linguistic environment (Silberztein, 2007), cover a larger number of items and can be extracted in a simple and easy way. This paper aims at showing how these resources combined can be used in cross-language tasks. Section 2 describes the state of the art in lexical semantics and automatic acquisition of distinct types of lexico-semantic relations. Section 3 presents the base linguistic resources used to obtain semantic relations. Section 4 describes the relations of synonymy, hyponymy, action-of, and result-of. Section 5 presents the method for the extraction of the semantic relations. It describes, in particular, the morpho-syntactic and semantic relations established in the dictionary, how the grammars read this linguistic information, and how they use it to generate semantic pairs. This latter section also shows how to expand from monolingual to cross-language relations with minimal change in the local grammars. Section 6 presents some preliminary results. Finally, Section 7 presents the conclusions and guidelines for future research work.

2 State of the Art

Dictionaries are probably the main source of lexico-semantic knowledge, as they are repositories of words, which include the description of several word senses. However, as definitions are written in natural language, dictionaries are not completely ready to be used as computational lexical resources.

Common representations of lexico-semantic knowledge, ready to be used in natural language processing tasks, include thesauri, taxonomies, as well as lexical ontologies or lexical knowledge bases. For example, the Roget Thesaurus (Roget, 1852) is one of the best-known and most complete thesauri available in a machine-readable format. Also, Princeton WordNet (Fellbaum, 1998) is a public domain lexical knowledge base, widely used in the natural language processing community. It is a handcrafted resource based on synsets, which are groups of synonymous words that may be seen as natural language concepts. Each synset has a gloss, which is similar to a dictionary definition, and several types of semantic relations between synsets are represented.

As the manual creation of lexical knowledge bases is typically an extensive and time-consuming task, there are several works where lexico-semantic relations are extracted automatically from text, and then used either to create new knowledge bases from scratch or to enrich existing knowledge bases. Due to their structure, dictionaries are an obvious target for the extraction of lexico-semantic relations (see, for example, (Chodorow, Byrd, and Heidorn, 1985) or (Richardson, Dolan, and Vanderwende, 1998)). Corpora and the Web have also been exploited in the automatic acquisition of several types of lexico-semantic relations, including hyponymy (Hearst, 1992), meronymy (Berland and Charniak, 1999), causal relations (Girju and Moldovan, 2002), as well as in the discovery of new concepts (Lin and Pantel, 2002).

For Portuguese, semantic relations have also been a subject of increasing research interest in recent years. Santos et al. (2010) provide a review of the existing Portuguese lexico-semantic resources. Briefly, there are two handcrafted wordnets for European Portuguese, namely WordNet.PT (Marrafa, 2002) and MWN.PT[1], and an electronic thesaurus for Brazilian Portuguese, TeP (Maziero et al., 2008). There have also been attempts at the automatic acquisition of semantic relations, including: hyponymy extraction from corpora (Freitas and Quental, 2007); the extraction of several relations from a dictionary and the creation of the lexical resource PAPEL (Gonçalo Oliveira, Santos, and Gomes, 2010); and Onto.PT (Gonçalo Oliveira and Gomes, 2010), an ongoing project on the automatic creation of a lexical ontology for Portuguese, where several textual resources (thesauri, dictionaries, encyclopedias) are being exploited in the automatic acquisition of lexico-semantic relations.

Still, to the best of our knowledge, no research has been published on the automatic generation of cross-language semantic relations by using a linguistic method to map syntactically and semantically related words. This method can be extended to the type of relations that set equivalence between a word and a multiword unit (e.g. take a look = look), a relative clause (that was corrected = corrected), complex compounds (bottle made of plastic = plastic bottle) or even a more complex construction, such as a possessive construction or a passive, by exploiting the morpho-syntactic and semantic relation pairs described in the dictionaries. The method has the advantage of being systematic and expandable, with unlimited room to grow and improve in keeping with natural language complexity, and of being applicable to distinct languages and across languages. This is the novel aspect of the work presented in this paper in relation to the state of the art.

[1] See http://mwnpt.di.fc.ul.pt/

3 Resources

In this section, we will describe the English and Portuguese resources used to achieve cross-language semantic relations.

Eng4NooJ and Port4NooJ (Barreiro, 2007) are sets of resources developed with the NooJ linguistic environment (Silberztein, 2007), aiming at the processing of the English and Portuguese languages. Both Eng4NooJ and Port4NooJ resources include lexica and grammars which are used for different tasks, including morphological and semantico-syntactic analysis, disambiguation, paraphrasing and translation. Both include a morphological system, contextual rules, different types of grammars (disambiguation, multiword units, etc.), and domain-specific dictionaries.

The Port4NooJ resources are publicly available[2] and, at the moment, are being used in tools such as Corpógrafo, a corpora tool (Maia and Sarmento, 2005; Sarmento et al., 2006; Maia and Matos, 2008), ParaMT, a paraphraser for machine translation (Barreiro, 2008a; Barreiro, 2008b), and eSPERTo[3], a paraphrasing system for text editing and revision, currently being integrated in a cyber-school pedagogical program. Port4NooJ resources have not been reviewed, but they were made available to the Portuguese natural language processing (NLP) community because of their novelty, which we hope will encourage further pioneering research, including extension to other languages and cross-language tasks. The semantic relations included in the Port4NooJ and Eng4NooJ resources resulted from the application of simple local grammars to the semantico-syntactic properties in the lexical entries and the use of derivational rules that link semantically related words of different parts-of-speech.

[2] Port4NooJ can be found at the NooJ website under Portuguese module (http://www.nooj4nlp.net) and its resources are also available at Linguateca since October 2008 (http://www.linguateca.pt/Repositorio/Port4NooJ/).

[3] eSPERTo stands, in Portuguese, for Sistema de Parafraseamento para Edição e Revisão de Texto. It is a derivative of ReEscreve, proposed by Barreiro (2008a), and also described in (Barreiro and Cabral, 2009). The English version of eSPERTo is called SPIDER, standing for a System of Paraphrasing In Document Editing and Revision (formerly ReWriter). SPIDER uses Eng4NooJ resources and is described in (Barreiro, 2011).

Eng4NooJ and Port4NooJ lexica were inherited from the OpenLogos system and enhanced with several new properties, which will be described in detail in Section 5.

The OpenLogos lexical entries are classified with more than 1,000 distinct categories, based on a taxonomy called SAL (Semantico-syntactic Abstraction Language)[4]. In the OpenLogos model, SAL is a meta-language that represents natural language, in effect, an ontology that represents things, ideas, relationships, dispositions, conditions, processes, etc., as well as the elements of grammar such as articles, prepositions, conjunctions, etc. In terms of natural language processing, the meta-language represents both syntax and semantics. SAL is an actual language, not a set of linguistic markers or primitives. This implies that natural language can be readily mapped to SAL. The granularity of the representational ontology is sufficient for translation purposes only, i.e., the ontology does not need to be especially fine-grained.

SAL elements are divided in a hierarchical scheme of supersets, sets and subsets, distributed over all parts-of-speech. SAL comprises 12 supersets for nouns: Concrete (CO), Mass (MA), Animate (AN), Place (PL), Information (IN), Abstract (AB), Process intransitive (PI), Process transitive (PT), Measure (ME), Time (TI), Aspective (AS), and Unknown (UN). For example, the concrete nouns superset consists of countable physical things, either man-made or natural, including parts of the human body. Concrete (count[5]) nouns contain both sets and subsets. The principal sets of concrete nouns are functional things and agentive things. Other sets are: natural things (COnat); impulses/lights (COlight); marks/blemishes (COblem); edibles non-mass (COednm); edibles/color (COedcol); classifiers (COclass); amorphous (COamorph); and atomistic (COatom). For example, the set of natural things (COnat) includes subsets such as: minute flora (COflora) (e.g. algae, spore); plants (COplant) (e.g. rose, weed); trees (COtree) (e.g. apple, willow); trees/wood (COtrwd) (e.g. oak, maple); and miscellaneous natural things (COmnat) (e.g. pebble, iceberg).

[4] The full description of the multiple SAL categories can be found at the Logos System Archives (http://logossystemarchives.homestead.com/) and all the resources (and descriptions) are downloadable from the OpenLogos website at DFKI (http://logos-os.dfki.de/).

[5] Concrete nouns are always count nouns and, unless in the plural, generally cannot occur without a preceding article or quantifier. For example: Computers are effective. *Computer is effective.
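The superset/set/subset scheme can be pictured as a small nested lookup. The following is only an illustrative sketch: the `SAL_NOUNS` structure and the `sal_path` function are hypothetical names introduced here, and the data covers just the Concrete/natural things codes quoted above, not the full SAL inventory.

```python
# Toy fragment of the SAL noun hierarchy (illustrative only).
# Layout: superset code -> (name, {set code -> (name, {subset code -> name})})
SAL_NOUNS = {
    "CO": ("Concrete", {
        "COnat": ("natural things", {
            "COflora": "minute flora",
            "COplant": "plants",
            "COtree": "trees",
            "COtrwd": "trees/wood",
            "COmnat": "miscellaneous natural things",
        }),
    }),
}


def sal_path(code, hierarchy=SAL_NOUNS):
    """Return the names from superset down to `code`, or None if unknown."""
    for superset_code, (superset_name, sets) in hierarchy.items():
        if code == superset_code:
            return [superset_name]
        for set_code, (set_name, subsets) in sets.items():
            if code == set_code:
                return [superset_name, set_name]
            if code in subsets:
                return [superset_name, set_name, subsets[code]]
    return None


print(sal_path("COtree"))
```

Walking from a subset code up to its superset in this way is what makes it possible to read hypernyms off a SAL code, under the assumption that each code has a single parent.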

The SAL meta-language is semantico-syntactic in nature, representing natural language at a second-order level of abstraction (common nouns are first-order abstractions). Syntax and semantics are seen as a continuum. This semantico-syntactic continuum is always taken into account when classifying each lexical entry within SAL. The classification was done through the years by trial and error. For example, when classifying elements into the functional (COfunc) or agentive (COagen) sets of the concrete noun superset, the following reasoning is taken into consideration: functional things tend to be passive, i.e. typically do not act of their own accord and generally require an agent to use them. Hence, they are more instrumental in nature. Agents typically do work in and of themselves. This distinction may sometimes seem arbitrary. For example, hinge is a fastener under functional things and clearly does work of itself, but is not coded as an agent. Airplane, on the other hand, obviously does require an agent and yet is coded under agentives as a vehicle. As a rule, agentives have a source of power or energy in themselves, while functionals do not. Parts of the human/animal body are also classified as concrete. Words like heart, brain, digestive tract, stomach, and organs in general are machines/systems under agentives. Words like teeth, fingernail, toes, lips, tendons, ligaments, bones, etc. belong to various subsets under functionals.

SAL categories, which contain domain-independent ontological (lexical-contextual) and semantico-syntactic relations (the same word form can be mapped to different concepts), are assigned to general language words or domain-specific terms. The general language dictionary contains many broadly classified lexical entries that could be considered to pertain to a more specific domain. For example, the lexical entries for HIV (immunology), manic-depressive disorder, bipolar disorder (mental health) and asthma (pulmonology) are all classified under the superset Abstract and subset State (also for conditions and relationships). This subset corresponds to abstract nouns that describe something about a thing or person that is not inherent to its nature (e.g. cancer, coma, circumstance, condition, disease, fatherhood, inequality, insolvency, loneliness, parity, poverty, status). Being more extrinsic, these states, conditions or relationships could conceivably change without altering the nature of the thing or person. This is not a strict rule but is indicative of the difference between this subset and the properties/qualities/nature subset.

dog IS HYPONYM OF animal             cão É HIPÓNIMO DE animal
dog IS HYPONYM OF mammal             cão É HIPÓNIMO DE mamífero
dog IS HYPONYM OF non-human being    cão É HIPÓNIMO DE ser não humano
dog IS HYPONYM OF vertebrate         cão É HIPÓNIMO DE ser vertebrado
dog IS HYPONYM OF animate being      cão É HIPÓNIMO DE ser vivo/animado

Table 2: Hyponymy relations for the noun dog - cão

The information noun superset is comprised of nouns that denote data, information, or knowledge, which might be considered more specific to certain domains. But this category also includes the medium on which the information is recorded, represented or communicated; i.e., spoken, written, dramatized, sung, etc. Table 1 presents a list of terms classified as Instructional/legal (INinst) under the information noun superset (IN).

4 Semantic Relations for English and Portuguese

Both in Eng4NooJ and Port4NooJ, each lexical entry is described with semantico-syntactic properties, which represent relations between words or expressions. These relations can be synonymy, hyponymy, action-of, result-of, process-of, made-of, property-of, member-of, among others. Table 2 illustrates several semantic relations for the concrete English and Portuguese nouns dog and cão, respectively. These relations were inferred from the SAL hierarchical categories.

ACTION
abolishment IS ACTION OF abolish     abolição É AÇÃO DE abolir
abuse IS ACTION OF abuse             abuso É AÇÃO DE abusar
happening IS ACTION OF happen        acontecimento É AÇÃO DE acontecer
agreement IS ACTION OF agree         acordo É AÇÃO DE acordar

RESULT
lit IS RESULT OF light               aceso É RESULTADO DE acender
stuffed IS RESULT OF stuff           embalsamado É RESULTADO DE embalsamar
rotten IS RESULT OF rotten           podre É RESULTADO DE apodrecer
interdicted IS RESULT OF interdict   interditado É RESULTADO DE interditar

Table 3: Action-of and result-of semantic relations

In addition to the taxonomical classification inherited from OpenLogos, which allowed the establishment of hyponymy relations, both Eng4NooJ and Port4NooJ resources include regular derivational, morpho-syntactic and semantic relations, such as synonymy, action-of, and result-of. The morpho-syntactic and semantic relations are established between words of different parts-of-speech, as for example, between an adjective and its derived adverb (e.g. quick > quickly - rápido > rapidamente), between a noun and an adjective (e.g. enthusiasm > enthusiastic - entusiasmo > entusiasmado), or between a noun and an adverb (e.g. imagination > imaginatively = with imagination - imaginação > imaginativamente = com imaginação).

Table 3 illustrates action-of and result-of semantic relations. Action-of relations are established between a noun and a verb, where the noun is a morphological derivation of the verb. Result-of relations are established between an adjective and a verb, where the adjective is morphologically derived from the verb.

5 Methodology for the Extraction of Semantic Relations

In order to obtain hyponymy relations from the OpenLogos properties in Port4NooJ and Eng4NooJ dictionaries, we created a local grammar that matches on the SAL code and presents, as an output, one or more words from the description of that specific SAL code. For the examples in Table 2, the NooJ local grammar recognizes the property [SAL=ANmamm], standing for Animate, Mammal, and retrieves, as its output, words that will be used as hypernyms of the words dog or cão, in English or Portuguese, respectively. These words are: animal, mammal, non-human being, vertebrate, animate being. If the description of the SAL category included more hypernyms, these could, of course, be easily added to the list of pairs of the semantic relation IS HYPONYM OF for dog/cão.

intimação,N+FLX=CANÇÃO+INinst+EN=summons
arrendamento,N+FLX=ANO+INinst+EN=lease
autorização,N+FLX=CANÇÃO+INinst+EN=fiat
autorização,N+FLX=CANÇÃO+INinst+EN=license
autorização,N+FLX=CANÇÃO+INinst+EN=permit
autorização,N+FLX=CANÇÃO+INinst+EN=warrant
cânone,N+FLX=ANO+INinst+EN=canon
cláusula,N+FLX=CASA+INinst+EN=clause
condição,N+FLX=CANÇÃO+INinst+EN=proviso
contrato,N+FLX=ANO+INinst+EN=contract
credo,N+FLX=ANO+INinst+EN=credo
declaração,N+FLX=CANÇÃO+INinst+EN=affidavit
decreto,N+FLX=ANO+INinst+EN=decree
diretiva,N+FLX=CASA+INinst+EN=guideline
estatuto,N+FLX=ANO+INinst+EN=bylaw
estatuto,N+FLX=ANO+INinst+EN=statute
garantia,N+FLX=CASA+INinst+EN=guarantee
garantia,N+FLX=CASA+INinst+EN=warranty
lei,N+FLX=CASA+INinst+EN=law
licença,N+FLX=CASA+INinst+EN=license
mandato,N+FLX=ANO+INinst+EN=mandate
moratória,N+FLX=CASA+INinst+EN=moratorium
norma,N+FLX=CASA+INinst+EN=norm
norma,N+FLX=CASA+INinst+EN=standard
ordem,N+FLX=MARGEM+INinst+EN=order
ordem,N+FLX=MARGEM+INinst+EN=ordinance
pacto,N+FLX=ANO+INinst+EN=pact
patente,N+FLX=CASA+INinst+EN=patent
renúncia,N+FLX=CASA+INinst+EN=waiver
testamento,N+FLX=ANO+INinst+EN=will
tratado,N+FLX=ANO+INinst+EN=treaty
veredicto,N+FLX=ANO+INinst+EN=veredict

Table 1: Sample of terms classified as Information + Instructional/legal (INinst)
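The lookup from a SAL code to hyponymy pairs can be sketched in a few lines of Python. This is not the NooJ local grammar itself, only an approximation of its effect: `SAL_HYPERNYMS`, `RELATION_LABEL` and `hyponymy_relations` are hypothetical names, and the hypernym lists are abridged from the dog/cão example.

```python
# Hypothetical table: SAL code -> hypernyms per language (abridged).
SAL_HYPERNYMS = {
    "ANmamm": {
        "EN": ["animal", "mammal", "animate being"],
        "PT": ["animal", "mamífero", "ser vivo/animado"],
    },
}

# Relation labels as they appear in the hyponymy table for dog/cão.
RELATION_LABEL = {"EN": "IS HYPONYM OF", "PT": "É HIPÓNIMO DE"}


def hyponymy_relations(word, sal_code, lang):
    """Build one relation string per hypernym attached to the SAL code."""
    hypernyms = SAL_HYPERNYMS.get(sal_code, {}).get(lang, [])
    return [f"{word} {RELATION_LABEL[lang]} {h}" for h in hypernyms]


for line in hyponymy_relations("dog", "ANmamm", "EN"):
    print(line)
for line in hyponymy_relations("cão", "ANmamm", "PT"):
    print(line)
```

Adding a hypernym to the description of a SAL code then extends the relations of every entry carrying that code, which is the property the text relies on.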

Table 4 shows distinct types of dictionary entries with implicit semantic information, namely the support verb construction that can be synonymous with a verb entry (impressionar = causar impressão, impress = make an impression; ficar azedo = azedar, turn sour = sour), the semantic relation between an adjective and a semantically related adverb (aesthetic – aesthetically), and the semantic relation between a noun and a semantically related adverb (skepticism – skeptically). These relations are established by means of grammar rules. We have focused on the most regular rules, which are the ones that allow transformation of part-of-speech through the process of derivation.

In the examples illustrated in Table 4, the DRV properties correspond to the derivational rule and inflectional paradigm. Accordingly, DRV=NDRV01:CANÇÃO is a dictionary property that calls the rule to derive (through the process of nominalization) the predicate noun impressão (impression) from the verb impressionar (impress) and assigns it the inflectional paradigm CANÇÃO (the noun impressão inflects in the same way as the noun canção; i.e., following the same process and using the same morphemes to form the plural, etc.); DRV=ADRV00:ALTO is a dictionary property that calls the rule to derive the predicate adjective azedo (sour) from the verb azedar (sour) and assigns it the inflectional paradigm ALTO (the adjective azedo inflects like the adjective alto). DRV=AVDRV03 is a dictionary property that calls the rule to derive the adverb aesthetically from the adjective aesthetic; and, finally, DRV=NAVDRV02 is a dictionary property that calls the rule to derive the adverb skeptically from the noun skepticism. The lexical entries for the verbs impressionar (impress), adaptar (adapt) and azedar (sour) have the property VSUP, that is, the description of the support verb that occurs with the predicate nouns impressão (impression) and adaptação (adaptation) and with the predicate adjective azedo (sour), which derive from the corresponding cited verbs. The combination of the description in the properties VSUP and DRV allows the semantic association between these verbs and their equivalent support verb constructions, namely fazer/causar impressão (make/cause impression), fazer adaptação (make adaptation), and ficar azedo (turn sour).
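How the VSUP and DRV properties combine can be sketched as follows. The parser below assumes only the comma-and-plus layout visible in the dictionary entries; `parse_entry` and `support_verb_constructions` are hypothetical helper names, and the derived predicate form is passed in by hand rather than produced by the derivational rule.

```python
def parse_entry(line):
    """Split 'lemma,POS+KEY=VALUE+FLAG...' into lemma, POS and properties.

    Repeated keys (e.g. two VSUP values) are collected in a list.
    """
    lemma, _, rest = line.partition(",")
    fields = rest.split("+")
    pos, props = fields[0], {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        props.setdefault(key, []).append(value if sep else True)
    return lemma, pos, props


def support_verb_constructions(line, derived_form):
    """Pair each VSUP value of a verb entry with its derived predicate."""
    lemma, _, props = parse_entry(line)
    return [(lemma, f"{vsup} {derived_form}") for vsup in props.get("VSUP", [])]


ENTRY = ("impressionar,V+FLX=FALAR+SAL=PVPCpleasetype+EN=impress"
         "+VSUP=fazer+VSUP=causar+DRV=NDRV01:CANÇÃO")

for verb, construction in support_verb_constructions(ENTRY, "impressão"):
    print(verb, "=", construction)
```

Keeping repeated keys as lists matters here, because a verb may license more than one support verb (fazer and causar for impressionar).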

impressionar,V+FLX=FALAR+SAL=PVPCpleasetype+EN=impress+VSUP=fazer+VSUP=causar+DRV=NDRV01:CANÇÃO
adaptar,V+FLX=FALAR+Aux=1+INOP57+Subset=132+EN=adapt+VSUP=fazer+DRV=NDRV00:CANÇÃO
azedar,V+FLX=LIMPAR+Aux=1+OBJTRundif98+Subset=740+EN=sour+VSUP=ficar+DRV=ADRV00:ALTO
aesthetic,A+FLX=NATURAL+SAL=AVstate+PT=estético+DRV=AVDRV03
skepticism,N+FLX=BOOK+SAL=ABcause+PT=cepticismo+DRV=NAVDRV02

Table 4: General language dictionary entries with implicit semantic relations

Table 5 shows the transformational rules to associate morpho-syntactically and semantically related words of different parts-of-speech, extracted individually from the Eng4NooJ and Port4NooJ rule databases. Rules are indexed according to different types of transformation. NDRV transforms verbs into nouns, ADRV transforms verbs into adjectives, and AVDRV transforms adjectives into adverbs. The rules of each type are numbered. For example, NDRV04 is rule number 04 that transforms a verb into a noun. The slash (/) after each ending introduces the part-of-speech of the derived word. The plus sign (+) introduces information about a specific noun or adjective. For example, Npred and Apred stand for predicate noun and predicate adjective, respectively. The capital letters between the less-than and the greater-than signs (<, >) correspond to commands. The command <B> means "backspace one character and add the string that follows the command, assigning it a new part-of-speech". The command <B2> means "delete the last two characters of the word from which the new word derives and add the string that follows the command", and so on. The strings that follow a command are the endings of the newly generated words (e.g. -ion for the noun acceleration, -tically for the adverb realistically, etc.). The command <E> means that no character needs to be deleted. The command <A> means "delete the acute accent in the word from which the new word derives".
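The command notation can be made concrete with a small interpreter. This is a sketch under stated assumptions, not NooJ's morphological engine: `apply_rule` and `strip_acute` are hypothetical names, only the commands described in the text (<B>, <Bn>, <E>, <A>) are handled, and <A> is taken to remove every acute accent in the base word.

```python
import re
import unicodedata

_COMMAND = re.compile(r"<([A-Z])(\d*)>")


def strip_acute(word):
    """Remove acute accents only (cedillas, tildes, etc. are kept)."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def apply_rule(base, rule):
    """Apply a rule such as '<B>ion/Npred'; return (derived word, tag)."""
    body, _, tag = rule.partition("/")
    body = body.replace(" ", "")
    word, pos = base, 0
    while True:
        match = _COMMAND.match(body, pos)
        if match is None:
            break
        name, count = match.groups()
        if name == "B":            # backspace one (or n) characters
            word = word[: -int(count or 1)]
        elif name == "A":          # drop the acute accent
            word = strip_acute(word)
        elif name != "E":          # <E>: nothing to delete
            raise ValueError(f"unsupported command <{name}{count}>")
        pos = match.end()
    return word + body[pos:], tag   # remaining text is the new ending


print(apply_rule("accelerate", "<B>ion/Npred"))
print(apply_rule("azedar", "<B2>o/A+Apred"))
print(apply_rule("rápido", "<A> <B>amente/ADV"))
```

The rules used in the usage lines are the ones listed for Eng4NooJ and Port4NooJ; treating whatever follows the last command as the new ending is a simplification that happens to cover all of them.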

Eng4NooJ and Port4NooJ grammars are the devices used to recognize words or expressions and generate new ones, paraphrase or translate them. For example, the grammar in Figure 1 is used to recognize adverbial compounds in Portuguese and transform them into equivalent single adverbs. This grammar transforms multiword adverbs such as de (um) modo rápido (in a fast/quick way) into single adverbs such as rapidamente (quickly). This type of transformation is allowed by operations like the one represented in the first path of the graph. The box calls a new graph to recognize the strings de (um) modo, de (uma) forma/maneira (in a (ADJ) way), which make up the multiword adverbial. The output $A ADV retrieves the adverb that is linked to the adjective $A. The adjective is transformed into the equivalent adverb by means of the derivational rules. The same grammar also recognizes multiword adverbs whose head is a noun, such as por acidente (by accident) or com entusiasmo (with enthusiasm), following the second and third paths.

Figure 1: Grammar to recognize multiword adverbials in Portuguese and transform them into equivalent single adverbs

The grammar in Figure 1 is monolingual, because there is no specification of the output for a different language. However, both Eng4NooJ and Port4NooJ resources contain Portuguese and English transfers for each lexical entry, i.e., they are in fact bilingual dictionaries. As a result, any grammar used to obtain monolingual transformations can be reused to generate bilingual (or multilingual) transformations. That is, the same grammar can be used to retrieve the output in English or in any other language (separately or together) as long as the words of that language are in the bilingual or multilingual dictionary and there are rules associated with the relevant dictionary properties. This means that the grammar can generate translations from one to many languages, i.e., it can be used to create cross-language semantic relations. For monolingual transformations, no output language is specified. For bilingual or cross-language transformations, the parameter for the specification of the output language needs to be added. The parameter $EN for English, $IT for Italian, $SP for Spanish, etc. specifies the retrieval of the output in one of these languages or in all of them simultaneously. Similarly, the grammar presented in Figure 2 can be used for cross-language semantic relations. This grammar matches on a support verb construction of the type [Predicate Noun Construction] (dar um abraço (a), give a hug (to)), represented in the figure by a box that calls a sub-graph, and paraphrases it into a single verb (abraçar, hug).
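The step from a monolingual to a cross-language output can be illustrated with a toy lookup. Everything in this sketch is an assumption made for illustration: the `LEXICON` layout, the `NPRED` key and the `paraphrase` function are hypothetical, only uninflected forms are matched, and the real mechanism is a NooJ graph with an output-language parameter rather than a Python function.

```python
# Hypothetical bilingual entry: a Portuguese verb, its support verbs,
# its predicate noun and its English transfer.
LEXICON = {
    "abraçar": {"VSUP": ["dar"], "NPRED": "abraço", "EN": "hug"},
}


def paraphrase(expression, lexicon, target=None):
    """Map a support verb construction to a single verb.

    With target=None the Portuguese verb is returned (monolingual);
    with target='EN' the English transfer of that verb is returned.
    """
    words = expression.split()
    if len(words) < 2:
        return None
    for verb, entry in lexicon.items():
        if words[0] in entry["VSUP"] and words[-1] == entry["NPRED"]:
            return entry[target] if target else verb
    return None


print(paraphrase("dar um abraço", LEXICON))               # monolingual
print(paraphrase("dar um abraço", LEXICON, target="EN"))  # cross-language
```

The matching logic is identical in both calls; only the property read from the entry changes, which mirrors the claim that a monolingual grammar needs no more than an output-language parameter to become cross-language.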


Eng4NooJ                              Port4NooJ
NDRV04 = <B>ion/Npred                 NDRV02 = <B>nça/N+Npred
  e.g. accelerate > acceleration        e.g. mudar > mudança
ADRV02 = <B>icable/ADJ                ADRV02 = <B2>o/A+Apred
  e.g. apply > applicable               e.g. azedar > azedo
AVDRV01 = <E>ly/ADV                   AVDRV00 = <B>zmente/ADV
  e.g. frequent > frequently            e.g. veloz > velozmente
AVDRV04 = <B>tically/ADV              AVDRV05 = <A> <B>amente/ADV
  e.g. realism > realistically          e.g. rápido > rapidamente

Table 5: Rules to transform morpho-syntactically and semantically related words of different parts-of-speech

Figure 2: Grammar to generate cross-language relations between Portuguese support verb constructions and equivalent English single verbs

Figure 3: Cross-language relations between Portuguese support verb constructions and equivalent English single verbs

Figure 3 illustrates the output of a grammar that generates cross-language semantic relations between Portuguese support verb constructions and English single verbs. At present, the semantic relations included in Eng4NooJ and Port4NooJ are mostly used to generate paraphrases and are integrated in the paraphrasing tools SPIDER and eSPERTo. However, cross-language relations such as those illustrated in Figure 3 can be used directly in machine translation and are fuelling the ParaMT bilingual paraphrasing tool. At the current stage of development, ParaMT translates mostly multiword units, performing well in the translation of Portuguese support verb constructions into English verbs, and vice-versa, the linguistic phenomena most researched when applying the current methodology.

Relation                 Quantity

Hyponymy                   14,963
Synonymy                   10,395
  between nouns             5,367
  between verbs                20
  between adjectives           34
  between adverbs           5,014
Action-of                   3,773
Result-of                     283

Table 6: Relations in Port4NooJ v.2.0

6 Preliminary Results

In theory, the exploitation of the lexicon in combination with SAL allows the establishment of numerous relations between words and expressions. For the current paper, we focused on only a few of those relations: the ones that cover a larger number of items and could be extracted in a simple way.

The result of the extraction for Portuguese (not yet reviewed) is publicly available (see footnote 6). Currently, Port4NooJ contains more than 30,000 morpho-syntactic relations between semantically related elements. Table 6 presents some preliminary results, which do not refer to paraphrasing capabilities, but simply to relations between lexical items. The total results for paraphrasing are significantly higher. Local grammars, applied to the information (properties) described in the dictionary, enable the recognition and analysis of expressions such as de (um) modo rápido, de (uma) forma/maneira rápida (in a fast/quick way), which could be considered relations between an adjective and an adverb but were not counted, and also of inflected forms such as dar uns passeios (go for some walks).
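As a rough illustration of what recognizing such adverbial expressions involves, the sketch below uses a Python regular expression as a stand-in for a local grammar. It is an assumption-laden simplification: the names `ADVERBIAL` and `adjective_in_adverbial` are hypothetical, no dictionary properties are consulted, and any word following modo, forma, or maneira is accepted as the adjective.

```python
import re

# Stand-in for a local grammar: "de (um) modo ADJ" or "de (uma) forma/maneira ADJ".
ADVERBIAL = re.compile(
    r"\bde\s+(?:(?:um\s+)?modo|(?:uma\s+)?(?:forma|maneira))\s+(\w+)"
)

def adjective_in_adverbial(text):
    """Return the word in the adjective slot, or None if no expression is found."""
    match = ADVERBIAL.search(text)
    return match.group(1) if match else None

print(adjective_in_adverbial("de um modo rápido"))
print(adjective_in_adverbial("de uma maneira rápida"))
```

A dictionary-backed grammar would additionally check that the captured word is an adjective and link it to the corresponding adverb.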

Port4NooJ contains approximately 600 derivational rules, most of them (587) transforming verbs into predicate nouns. Of these rules, 119 are productive, covering nominalizations, and 486 correspond to relations between verbs and autonomous predicate nouns. At this point in the research, the rules have been evaluated only superficially.

6 See http://www.linguateca.pt/Repositorio/Port4NooJ/relacoes semanticas explicitas/

7 Conclusions and Future Research

This paper presented semantic relations, namely domain-independent semantico-syntactic and ontological relations, suitable for paraphrasing and cross-language tasks, including machine translation. We have demonstrated that, given the appropriate linguistic resources, the generation of semantic relations can become very systematic. Any grammar that generates monolingual semantic relations can be reused to generate cross-language relations, and rules can be standardized and often reused across closely related languages.

Even though the methodology was applied to the OpenLogos resources, it is compatible with other lexical resources that contain semantic relations, and with languages other than English and Portuguese, the two studied in this research.

Future work will gather and combine openly available semantic resources, enhance the properties of the existing resources, and enlarge the coverage of linguistic phenomena.

