Body-Part Nouns and Whole-Part Relations in Portuguese

Ilia Markov (1, 2, 3), Nuno Mamede (2, 3), Jorge Baptista (1, 2, 3)

1 U. Algarve/CECL; 2 U. Lisboa/IST; 3 INESC-ID Lisboa/L2F

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
L2F - Spoken Language Systems Laboratory

PROPOR 2014 - Intl. Conference on Computational Processing of Portuguese, October 6-8, 2014, ICMC, São Carlos, SP, Brazil

In this paper, we target the extraction of whole-part relations involving human entities and body-part nouns in STRING, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese. A whole-part relation is a semantic relation that holds between two entities when one is perceived as a constituent part of the other, or as a member of a set.

Objectives

• Improve the automatic extraction of semantic relations between textual elements in an existing NLP system, STRING
• Part-whole relations (meronymy)
• Human body-part nouns (Nbp)

O Pedro partiu o braço ‘Pedro broke the arm’
WHOLE-PART(Pedro,braço)

Objectives (cont.)

• Development of a rule-based meronymy detection module for Human-Nbp relations
• Implementation in STRING (Mamede et al., 2012)

STRING: a hybrid, statistical and rule-based, Natural Language Processing (NLP) system for Portuguese

string.l2f.inesc-id.pt

Motivation

Semantic relations are a device for structuring texts: they contribute to the cohesion and coherence of a text.

Automatic extraction of semantic relations is useful for some NLP tasks:

• Anaphora Resolution
O Pedro lavou a cara ‘Pedro washed the face’
WHOLE-PART(Pedro,cara)
O Pedro lavou a sua cara ‘Pedro washed his face’
WHOLE-PART(sua,cara) & ANTECEDENT(?,sua)

Motivation (cont.)

• Semantic Role Labeling
O Pedro partiu um braço ‘Pedro broke an arm’
WHOLE-PART(Pedro,braço)
➢ Pedro is an experiencer.
O Pedro partiu o braço do João ‘Pedro broke João’s arm’
WHOLE-PART(João,braço)
➢ Pedro is an agent.
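The contrast between the two sentences can be read as a simple decision: if the whole in the WHOLE-PART relation is the subject itself, the subject is an experiencer; otherwise it is an agent. The Python sketch below only illustrates that reading; the function name and the pair representation are hypothetical and are not part of STRING.

```python
def subject_role(subject, whole_part):
    """Return a coarse semantic role for the subject of a verb such as 'partir'.

    whole_part is a (whole, part) pair, mirroring WHOLE-PART(whole, part).
    This is an illustrative heuristic, not the STRING implementation.
    """
    whole, _part = whole_part
    # The body part belongs to the subject: the subject undergoes the event.
    if whole == subject:
        return "experiencer"
    # The body part belongs to someone else: the subject acts on that person.
    return "agent"


print(subject_role("Pedro", ("Pedro", "braço")))  # O Pedro partiu um braço
print(subject_role("Pedro", ("João", "braço")))   # O Pedro partiu o braço do João
```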

Motivation (cont.)

• Opinion mining
É um bom hotel: o quarto era limpo, as camas eram feitas de lavado todos os dias, e os pequenos-almoços eram opíparos
‘It is a nice hotel: the room was clean, the beds (bed sheets) were changed every day, and the breakfasts were sumptuous’

Related Work

In NLP, various information extraction techniques have been developed in order to capture part-whole relations from texts:

• Hearst, 1992
Lexico-syntactic patterns to capture hyponymic (type-of) relations

• Girju et al., 2003, 2006
The method semi-automatically identifies patterns that encode part-whole relations and automatically learns the classification rules needed to extract part-whole relations from these patterns. The authors report an overall average precision of 80.95% and recall of 75.91%.

Related Work (cont.)

• Van Hage et al., 2006
A method for learning part-whole relations from vocabularies and text sources; the authors were able to acquire 503 part-whole pairs from the AGROVOC Thesaurus to learn 91 reliable part-whole patterns.

• Pantel and Pennacchiotti, 2006
The Espresso algorithm: takes as input a few seed instances of a particular relation and learns surface patterns to extract more instances. The algorithm obtains a precision of 80%.

Related Work (cont.)

• Lexical ontologies for Portuguese:
- WordNet.PT
- PAPEL
- Onto.PT

• Parsers of Portuguese:
- The PALAVRAS parser (Bick, 2000), using the Visual Interactive Syntax Learning (VISL) environment;
- LX Semantic Role Labeler (Branco & Costa, 2010).

Dependency Rule in STRING

O Pedro partiu o braço do João ‘Pedro broke João’s arm’

IF( MOD[POST](#2[UMB-Anatomical-human],#1[human]) &
    PREPD(#1,?[lemma:de]) &
    CDIR[POST](#3,#2) &
    ~WHOLE-PART(#1,#2)
)
    WHOLE-PART(#1,#2)

WHOLE-PART(João,braço)
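As a rough illustration of what the rule above checks, the following Python sketch applies the same four conditions to a list of dependency tuples. It is not XIP code and not the STRING implementation: the `Dep` type, the `features` and `lemmas` dictionaries, and the label `MOD_POST` (standing for `MOD[POST]`) are assumptions made for this example.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    name: str  # dependency label, e.g. "MOD_POST"
    head: str  # first argument of the dependency
    dep: str   # second argument of the dependency


def apply_rule(deps, features, lemmas):
    """Return the new WHOLE-PART dependencies licensed by the rule.

    deps: list of Dep; features: word -> set of feature names;
    lemmas: word -> lemma. All three structures are hypothetical.
    """
    existing = {(d.head, d.dep) for d in deps if d.name == "WHOLE-PART"}
    found = []
    for mod in deps:
        if mod.name != "MOD_POST":
            continue
        part, whole = mod.head, mod.dep  # #2 and #1 in the rule
        if "UMB-Anatomical-human" not in features.get(part, set()):
            continue
        if "human" not in features.get(whole, set()):
            continue
        # PREPD(#1, ?[lemma:de]): the whole is introduced by the preposition 'de'.
        has_de = any(d.name == "PREPD" and d.head == whole
                     and lemmas.get(d.dep) == "de" for d in deps)
        # CDIR[POST](#3, #2): the body-part noun is the direct object of some verb.
        is_dobj = any(d.name == "CDIR_POST" and d.dep == part for d in deps)
        # ~WHOLE-PART(#1, #2): do not emit the relation twice.
        if has_de and is_dobj and (whole, part) not in existing:
            found.append(Dep("WHOLE-PART", whole, part))
            existing.add((whole, part))
    return found


# O Pedro partiu o braço do João
deps = [
    Dep("SUBJ_PRE", "partiu", "Pedro"),
    Dep("CDIR_POST", "partiu", "braço"),
    Dep("MOD_POST", "braço", "João"),
    Dep("PREPD", "João", "de"),
]
features = {"braço": {"UMB-Anatomical-human"}, "João": {"human"}, "Pedro": {"human"}}
lemmas = {"de": "de"}
print(apply_rule(deps, features, lemmas))
```

Under these assumptions the sketch yields WHOLE-PART(João,braço) for the example sentence, and nothing if that relation is already present in the input.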

Fixed Phrases and Frozen Sentences involving Nbp

‣ 400 semi-automatically crafted rules, based on an available lexicon-grammar of European Portuguese idioms

Other phenomena

• DET=um and bilateral symmetry
O Pedro partiu um braço ‘Pedro broke an arm’

• Relations between 2 Nbp
A Ana pinta as unhas dos pés ‘Ana paints the nails of the feet’

• Part-of Nbp
O Pedro tocou com a ponta da língua no gelado ‘Pedro touched the ice cream with the tip of the tongue’

• “Hidden” Nbp with disease nouns
O Pedro tem uma gastrite (estômago) ‘Pedro has gastritis (stomach)’

Evaluation

• First fragment of the CETEMPúblico corpus (Rocha & Santos, 2000): 14.7 M tokens; 6.3 M simple words; and 300 K sentences.
• Using an Nbp lexicon (151 lemmas), 16,746 sentences with Nbp were extracted.
• A random stratified sample of 1,000 sentences with Nbp, keeping the proportion of their total frequency in the source corpus.
• Divided between 4 annotators: 4 subsets of 225 sentences each, plus a common set of 100 sentences to assess inter-annotator agreement.
‣ Labels: WHOLE-PART, FIXED, nothing
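The sampling and splitting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names, the `(lemma, sentence)` input format, the fixed seed, and the use of simple rounding for per-lemma quotas are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(sentences, size, seed=0):
    """Sample (lemma, sentence) pairs in proportion to each lemma's frequency.

    Per-lemma quotas are rounded, so the result may differ slightly
    from `size` on real data.
    """
    rng = random.Random(seed)
    by_lemma = defaultdict(list)
    for lemma, sent in sentences:
        by_lemma[lemma].append(sent)
    total = len(sentences)
    sample = []
    for lemma, sents in by_lemma.items():
        quota = round(size * len(sents) / total)
        picked = rng.sample(sents, min(quota, len(sents)))
        sample.extend((lemma, s) for s in picked)
    return sample


def split_for_annotators(items, n_annotators=4, common=100):
    """Set aside a common subset, then divide the rest evenly."""
    shared, rest = items[:common], items[common:]
    per_annotator = len(rest) // n_annotators
    subsets = [rest[i * per_annotator:(i + 1) * per_annotator]
               for i in range(n_annotators)]
    return shared, subsets


# With 1,000 sentences: 100 common + 4 x 225 individual sentences.
shared, subsets = split_for_annotators(list(range(1000)))
print(len(shared), [len(s) for s in subsets])
```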

Inter-annotator Agreement

[Table: Inter-annotator Agreement. Rows: Average Pairwise Percent Agreement; Fleiss’ Kappa; Average Pairwise Cohen’s Kappa]

http://dfreelon.org/utils/recalfront/recal3/
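For readers unfamiliar with the three agreement measures named on this slide, the sketch below computes them from scratch using their standard definitions. It is not the code of the online tool linked above; the function names and the input layout (one list of labels per sentence, one label per annotator) are assumptions, and the sketch does not guard against the degenerate case where expected agreement equals 1.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(a, b):
    """Share of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(items):
    """Fleiss' kappa; items[i] holds the labels all annotators gave item i."""
    n_items, n_raters = len(items), len(items[0])
    totals = Counter()
    p_bar = 0.0
    for labels in items:
        counts = Counter(labels)
        totals.update(counts)
        agree = sum(c * c for c in counts.values()) - n_raters
        p_bar += agree / (n_raters * (n_raters - 1))
    p_bar /= n_items
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)


def average_pairwise(items, metric):
    """Average a two-annotator metric over all annotator pairs."""
    raters = list(zip(*items))  # one tuple of labels per annotator
    pairs = list(combinations(raters, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Toy data: 3 sentences, 4 annotators, labels as used in the evaluation.
toy = [["WHOLE-PART"] * 4, ["FIXED"] * 4, ["nothing"] * 4]
print(average_pairwise(toy, percent_agreement))
print(fleiss_kappa(toy))
print(average_pairwise(toy, cohen_kappa))
```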

Results (1st evaluation)

[Table: System’s performance for Nbp]

Error Analysis: false positives

• Disambiguation of Nbp in context
- língua ‘tongue/language’
- língua portuguesa ‘Portuguese language’
- língua de Camões ‘language of Camões’

• New idioms have been encoded in the lexicon
- abrir o coração a ‘to open one’s heart to sb.’
- fazer face a ‘to face sth./to deal with’

• Nbp used figuratively
Além disso, a nova face desta Igreja chilena… ‘Moreover, the new face of this Chilean Church…’

Error Analysis: false negatives

• The whole and the part are not syntactically related and may be quite far away from each other:

O facto do corpo ter sido encontrado na cozinha, leva os bombeiros a suspeitar que a vítima, com graves problemas de saúde, tenha desmaiado e caído à lareira, o que poderá ter estado na origem do incêndio.
‘The fact that the body was found in the kitchen leads the firefighters to suspect that the victim, who had serious health problems, fainted and fell into the hearth, which may have been the origin of the fire.’
WHOLE-PART(vítima,corpo)

Error Analysis: false negatives (cont.)

• Some human nouns and all pronouns (including personal, relative and demonstrative) are not marked with the human feature (even if anaphora resolution performs well):

Segundo o responsável do hospital, o doente – que também sofreu graves ferimentos na cabeça – poderia ser ainda sujeito a uma segunda intervenção cirúrgica
‘According to the head of the hospital, the patient – who also suffered serious head injuries – could still be subjected to a second surgical intervention’
ANTECEDENT(doente,que)
WHOLE-PART(que,cabeça)

‣ Inheritance of features and the relative placement of the anaphora resolution (AR) and whole-part (WP) modules within the STRING architecture

Error Analysis: false negatives (cont.)

• The Nbp is a modifier of a noun or an adjective (and not of a verb):

Um mágico com um barrete (enfiado) na cabeça ‘A magician with a hat (stuck) on the head’
WHOLE-PART(mágico,cabeça)

Results (2nd evaluation)

[Table: System’s performance for Nbp; differences shown on the slide: +0.13, +0.11, +0.12]

Thank you!

Questions please!

echo "O Pedro penteou o cabelo do filho com os dedos" | xip/string.sh

TOP
    NP: ART O, NOUN Pedro
    VF: VERB penteou
    NP: ART o, NOUN cabelo
    PP: PREP de, ART o, NOUN filho
    PP: PREP com, ART os, NOUN dedos

MAIN(penteou)
MOD_POST(cabelo,filho)
MOD_POST(penteou,dedos)
SUBJ_PRE(penteou,Pedro)
CDIR_POST(penteou,cabelo)
WHOLE-PART(filho,cabelo)
WHOLE-PART(Pedro,dedos)
0>TOP{NP{O Pedro} VF{penteou} NP{o cabelo} PP{de o filho} PP{com os dedos}}

string.l2f.inesc-id.pt
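Downstream tools usually need the dependencies from this demo as structured data. The sketch below parses lines of the form `NAME(arg1,arg2)` with a regular expression and filters the WHOLE-PART relations. It assumes one dependency per line, which the extracted slide text does not confirm; the function name and the regular expression are hypothetical and not part of STRING.

```python
import re

# A dependency label (capitals, '_' or '-') followed by comma-separated arguments.
DEP_RE = re.compile(r"^([A-Z][A-Z_-]*)\(([^()]*)\)$")


def parse_dependencies(output):
    """Return (label, args) pairs for every line that looks like a dependency."""
    deps = []
    for line in output.splitlines():
        match = DEP_RE.match(line.strip())
        if match:
            label, args = match.groups()
            deps.append((label, tuple(a.strip() for a in args.split(","))))
    return deps


sample_output = """MAIN(penteou)
MOD_POST(cabelo,filho)
MOD_POST(penteou,dedos)
SUBJ_PRE(penteou,Pedro)
CDIR_POST(penteou,cabelo)
WHOLE-PART(filho,cabelo)
WHOLE-PART(Pedro,dedos)
0>TOP{NP{O Pedro} VF{penteou} NP{o cabelo} PP{de o filho} PP{com os dedos}}"""

deps = parse_dependencies(sample_output)
whole_parts = [args for label, args in deps if label == "WHOLE-PART"]
print(whole_parts)
```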

References

Berland, M. and Charniak, E. 1999. Finding parts in very large corpora. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 57–64. Morristown, NJ, USA. Association for Computational Linguistics.

Bick, E. 2000. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.

Branco, A. and Costa, F. 2010. A Deep Linguistic Processing Grammar for Portuguese. In Pardo et al. (eds.), Computational Processing of Portuguese, LNAI 6001, Springer, pages 86–89.

Girju, R., Badulescu, A. and Moldovan, D. 2006. Automatic discovery of part-whole relations. Computational Linguistics, 32(1):83–135.

Nascimento, M., Veloso, R., Marrafa, P., Pereira, L., Ribeiro, R. and Wittmann, L. 1998. LE-PAROLE: do Corpus à Modelização da Informação Lexical num Sistema-multifunção. Actas do XIII Encontro Nacional da Associação Portuguesa de Linguística, 2:115–134.

Mamede, N., Baptista, J., Diniz, C. and Cabarrão, V. 2012. STRING: An hybrid statistical and rule-based natural language processing chain for Portuguese. http://www.propor2012.org/demos/DemoSTRING.pdf

References (cont.)

Pantel, P. and Pennacchiotti, M. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of Conference on Computational Linguistics / Association for Computational Linguistics (COLING/ACL-06), pages 113–120. Sydney, Australia.

Rocha, P. and Santos, D. 2000. CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa. In Maria das Graças Volpe Nunes (ed.), V Encontro para o processamento computacional da língua portuguesa escrita e falada (PROPOR 2000) (São Paulo, Brasil, 19-22 de Novembro de 2000), São Paulo: ICMC/USP, pages 131–140.

Widlöcher, A. and Mathet, Y. 2012. The Glozz Platform: a Corpus Annotation and Mining Tool. In Proceedings of the 2012 Association for Computational Linguistics Symposium on Document Engineering, DocEng ’12, pages 171–180, Paris, France. Telecom ParisTech, Association for Computational Linguistics.

Winston, M., Chaffin, R. and Herrmann, D. 1987. A Taxonomy of Part-Whole Relations. Cognitive Science, 11:417–444.
