A computational grammar for Portuguese

Bruno Cuconato Claro

A computational grammar for Portuguese

Rio de Janeiro

2019

Bruno Cuconato Claro

A computational grammar for Portuguese

Dissertação submetida à Escola deMatemática Aplicada como requisitoparcial para a obtenção do grau de Mestreem Modelagem Matemática

Fundação Getulio Vargas

Escola de Matemática Aplicada

Mestrado em Modelagem Matemática

Ênfase em Modelagem e Análise da Informação

Supervisor: Alexandre Rademaker

Rio de Janeiro2019

Dados Internacionais de Catalogação na Publicação (CIP) Ficha catalográfica elaborada pelo Sistema de Bibliotecas/FGV

Claro, Bruno Cuconato A computational grammar for Portuguese / Bruno Cuconato Claro. – 2019. 112 f.

Dissertação (mestrado) -Fundação Getulio Vargas, Escola de Matemática Aplicada. Orientador: Alexandre Rademaker. Inclui bibliografia.

1. Linguística - Processamento de dados. 2. Teoria dos tipos. 3. Processamento da linguagem natural (Computação). I. Rademaker, Alexandre. II. Fundação Getulio Vargas. Escola de Matemática Aplicada. IV. Título.

CDD – 006.35

Elaborada por Maria do Socorro Almeida – CRB-7/4254

Acknowledgements

I thank my significant other for the love, patience, and help. You know well howmuch you helped me through this.

I thank my family for the love and support. The choices you made for me in thepast allowed me to choose this path now.

I thank my advisor Alexandre Rademaker for the many ideas, discussions, andsupport. I have learned a multitude of things under your guidance, only some of whichappear here.

I thank professor Flávio Coelho introducing me to the Unix tradition. It hasn’tbeen a day where I don’t use what I learned with/through the book you lent me.

I thank professor Paulo Carvalho for the help, advice, and teachings – your intro-duction to mathematics still echoes in everything I’ve done since then.

I thank Cirlei de Oliveira, Elisângela Santana, Conceição Lima, Cristiane Guimarães,and Mônica Souza for the help, support, and conversations.

I thank my friends and colleagues for the companionship and the conversations(which more than once led to interesting ideas, even if we were just talking nonsense).Specially I’d like to mention Laura Sant’Anna, Kátia Nishiyama, Henrique Muniz, Guil-herme Passos, Pedro Delfino, Harllos Arthur, Alessandra Cid, Alexandre Tessarollo, JoãoCarabetta, Fernanda Scovino.

Finally, I’d like to say I’d have given up on this dissertation’s idea if not for theextreme patience and kindness of a former stranger which is now a dear friend – thankyou, Inari Listenmaa.

“Look at the raw PGF”

AbstractIn this work we present a freely-available type-theoretical computational grammar forPortuguese, implemented in the Grammatical Framework (GF) multilingual formalism.Such a grammar can be used for both syntactic parsing and natural language generation.We first describe the formalism itself, discussing its logico-mathematical foundations;we then present our grammar. We evaluate our grammar’s productions with respect tosyntactical correctness, show possible applications, and discuss future work.

Keywords: Type theory. Natural Language Processing. Computational linguistics. NaturalLanguage Generation. Grammar engineering.

ResumoEste trabalho descreve a criação de uma gramática computacional do Português imple-mentada no formalismo Grammatical Framework. Nele apresentamos o formalismo e anossa gramática. Avaliamos nossa gramática com respeito à corretude sintática de suasproduções, demonstramos possíveis aplicações, e discutimos trabalhos futuros.

Palavras-chave: Teoria de tipos. Processamento de Linguagem Natural. Linguísticacomputacional. Geração de Linguagem Natural. Engenharia de gramáticas.

List of Figures

Figure 1 – Constituent analysis of a sentence . . . . . . . . . . . . . . . . . . . . . 24Figure 2 – A functor implementation of Haskell and Lisp list syntaxes. . . . . . . 39Figure 3 – The PGF grammar for Haskell lists. . . . . . . . . . . . . . . . . . . . . 46Figure 4 – A context-free aproximation for Haskell lists without an intermediary

category for empty lists . . . . . . . . . . . . . . . . . . . . . . . . . . 46Figure 5 – Parse for the string [x , x] using CFG approximation with on-the-fly

specialization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47Figure 6 – Parsing deduction rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 50Figure 7 – Parse deduction for the string [x , x] . . . . . . . . . . . . . . . . . . . 52Figure 8 – Failed parse deduction for the string [x , x]: unexpected token. . . . . 53Figure 9 – Failed parse deduction for the string [x , x]: unexpected token. . . . . 53Figure 10 – German predicate Prime . . . . . . . . . . . . . . . . . . . . . . . . . . 56Figure 11 – RGL module structure (condensed) . . . . . . . . . . . . . . . . . . . . 59Figure 12 – GF RGL category system . . . . . . . . . . . . . . . . . . . . . . . . . 60Figure 13 – Module structure of the Portuguese mini resource grammar . . . . . . . 63Figure 14 – Abstract syntax trees for John is from the city; the hole in the diamond-

shaped node can be filled by both UseComp and UseComp_estar. . . . . 79Figure 15 – Abstract syntax tree for what she did is important. . . . . . . . . . . . 80Figure 16 – Two analyses for the sentence John is not a doctor . . . . . . . . . . . 80Figure 17 – Abstract syntax tree for John saw no animals. . . . . . . . . . . . . . . 83Figure 18 – Screenshot of the DG demo app . . . . . . . . . . . . . . . . . . . . . . 92Figure 19 – Different kinds of trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 97Figure 20 – Converting decorated parse tree to a dependency tree . . . . . . . . . . 99Figure 21 – Dependency trees converted from GF trees . . . . . . . . . . . . . . . . 100

List of Tables

Table 1 – Inflection table for the word gramática . . . . . . . . . . . . . . . . . . . 69Table 2 – RGL and Romance tenses, and how they inflect verbs in a Portuguese

sentence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Listings

Listing 1.1 – Example GF code listing . . . . . . . . . . . . . . . . . . . . . . . . . . 27Listing 2.1 – Abstract grammar Foods. . . . . . . . . . . . . . . . . . . . . . . . . . 30Listing 2.2 – Concrete English grammar for Foods. . . . . . . . . . . . . . . . . . . 30Listing 2.3 – Abstract syntax for linked lists in GF. . . . . . . . . . . . . . . . . . . 32Listing 2.4 – GF shell session with abstract module only. . . . . . . . . . . . . . . . 33Listing 2.5 – Lisp list linearization types. . . . . . . . . . . . . . . . . . . . . . . . . 33Listing 2.6 – Lisp list linearization rules. . . . . . . . . . . . . . . . . . . . . . . . . 34Listing 2.7 – GF shell session with single concrete module. . . . . . . . . . . . . . . 34Listing 2.8 – Haskell List linearization types. . . . . . . . . . . . . . . . . . . . . . . 35Listing 2.9 – Haskell List linearization rules. . . . . . . . . . . . . . . . . . . . . . . 35Listing 2.10–Cons linearization rule using GF tables. . . . . . . . . . . . . . . . . . 36Listing 2.11–Translation between List concrete syntaxes. . . . . . . . . . . . . . . . 36Listing 2.12–Cons using a consWith oper . . . . . . . . . . . . . . . . . . . . . . . . 37Listing 2.13–List syntax interface module. . . . . . . . . . . . . . . . . . . . . . . . 38Listing 2.15–Haskell list syntax instantiation. . . . . . . . . . . . . . . . . . . . . . 39Listing 2.16–Haskell list functor instantiation. . . . . . . . . . . . . . . . . . . . . . 39Listing 2.17–Lisp list syntax instantiation. . . . . . . . . . . . . . . . . . . . . . . . 39Listing 2.18–Lisp list functor instantiation. . . . . . . . . . . . . . . . . . . . . . . . 39Listing 2.14–List functor module. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40Listing 3.1 – A hand-written rule. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56Listing 3.2 – Using the RGL constructors directly. . . . . . . . . . . . . . . . . . . . 56Listing 3.3 – Using the RGL API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56Listing 3.4 – mkCl overloaded function in the RGL API. . . . . . . . . . . . . . . . . 57Listing 3.5 – Prime predicate in Portuguese and English. . . . . . . . . . . . . . . . 57Listing 3.6 – prime_A in German, Portuguese, and English. . . . . . . . . . . . . . . 57Listing 3.7 – Portuguese noun constructors. . . . . . . . . . . . . . . . . . . . . . . 58Listing 3.8 – Concrete representation of pronouns and how they are built in the

Portuguese mini-resource. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64Listing 3.9 – Portuguese parameter definitions in the Portuguese mini-resource. . . . 64Listing 3.10–Concrete representation of nouns and noun phrases in the Portuguese

mini-resource. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65Listing 3.11–Linearization of the UsePron constructor in the Portuguese mini-resource. 65Listing 3.12–GF parameter representing verb forms in the Portuguese mini-resource. 66Listing 3.13–GF verbal concrete representations in the Portuguese mini-resource. . . 66Listing 3.14–GF verbal complementation in the Portuguese mini-resource. . . . . . 67Listing 3.15–GF clause building in the Portuguese mini-resource. . . . . . . . . . . 68

Listing 3.16–The Portuguese lexical constructor for noun gramática . . . . . . . . . 69Listing 3.17–Naive smart paradigm for verbs . . . . . . . . . . . . . . . . . . . . . . 70Listing 3.18–Portuguese verb smart paradigm . . . . . . . . . . . . . . . . . . . . . 71Listing 3.19–The GenRP constructor from Extend . . . . . . . . . . . . . . . . . . . 73Listing 3.20–The AdjAsCN constructor from Extend . . . . . . . . . . . . . . . . . . 74Listing 3.21–The AdvImp constructor from ParseExtend . . . . . . . . . . . . . . . 75Listing 3.22–A function mapping temporal order and anteriority to verb forms . . . 86Listing 4.1 – Dependency configurations for a few RGL functions . . . . . . . . . . . 96Listing 4.2 – Raw output of the GF to UD conversion for the sentence “há uma vaca

na floresta” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101Listing 4.3 – Partially corrected output of the GF to UD conversion for the sentence

“há uma vaca na floresta” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Contents

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231.2 Scope and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 251.3 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261.4 Typesetting Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 27

2 GRAMMATICAL FRAMEWORK . . . . . . . . . . . . . . . . . . . 292.1 A GF tutorial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.1.1 A simple GF grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.1.2 Refactoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.2 GF, mathematically . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402.2.2 From GF to PGF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442.2.3 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452.2.3.1 The example grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452.2.3.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462.2.3.3 Parsing as deduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472.2.3.3.1 Production items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

2.2.3.3.2 Active items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

2.2.3.3.3 Passive items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.2.3.3.4 Initial predict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.2.3.3.5 Predict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.2.3.3.6 Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

2.2.3.3.7 Complete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

2.2.3.3.8 Combine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

2.2.3.4 An example parse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512.2.4 Linearization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3 A COMPUTATIONAL GRAMMAR FOR PORTUGUESE . . . . . . 553.1 The GF resource grammar library . . . . . . . . . . . . . . . . . . . . 553.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553.1.2 Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563.1.3 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583.2 The Portuguese resource grammar . . . . . . . . . . . . . . . . . . . . 613.2.1 A Portuguese miniresource grammar . . . . . . . . . . . . . . . . . . . . . 623.2.2 Morphology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.2.3 Extra modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723.2.3.1 Morphological modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723.2.3.2 Grammar extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753.3.1 UD examples corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753.3.1.0.1 Copular verbs and adjectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.3.1.0.2 Copula type in complement of copula . . . . . . . . . . . . . . . . . . . . . . . 77

3.3.1.0.3 Incorrect trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

3.3.1.0.4 Romance clause inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

3.3.1.0.5 Romance clitic pronouns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

3.3.1.0.6 The no quantifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

3.3.2 Matrix MRS test suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . 823.3.2.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 843.3.2.1.1 Mismatched tenses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3.3.2.1.2 Whose as interrogative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

3.3.2.1.3 Compound nouns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

3.3.2.1.4 Adverb placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

3.3.2.1.5 Incorrect tense choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.3.2.1.6 Tag questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.3.2.1.7 Date and time units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.3.2.1.8 Incomplete sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

3.3.2.1.9 Iberic negative imperatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4 APPLICATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914.1 Health-domain application grammar . . . . . . . . . . . . . . . . . . . 914.2 Attempto Controlled English and GF . . . . . . . . . . . . . . . . . . 914.2.1 ACE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 924.2.2 ACE-in-GF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 934.2.3 Implementation and example usage . . . . . . . . . . . . . . . . . . . . . 944.3 GF to UD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 954.3.1 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 954.3.1.0.1 Abstract syntax trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.3.1.0.2 Parse trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.3.1.0.3 Dependency trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.3.1.0.4 Abstract dependency trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.3.2 The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 964.3.2.0.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.3.2.0.2 Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1035.1.1 Known issues with the Portuguese RG . . . . . . . . . . . . . . . . . . . . 1035.1.1.0.1 Preposition + personal pronoun contractions . . . . . . . . . . . . . . . . . . . . 103

5.1.1.0.2 Enclitic and mesoclitic pronoun contractions . . . . . . . . . . . . . . . . . . . . 104

5.1.1.0.3 Preposition + demonstrative pronouns contractions . . . . . . . . . . . . . . . . . 104

5.1.1.0.4 Reflexive two-place adjectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.1.2 Possible applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1055.1.2.0.1 WordNet gloss corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.1.2.0.2 Data as text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

23

1 Introduction

This thesis presents a computational grammar of Portuguese developed under theGrammatical Framework (GF) formalism. GF is a programming language designed forgrammar writing, whose aim is to diminish the cost of creating grammars: it features arobust type-system to catch errors during compile-time (versus having these errors showingup at run-time), an elegant functional programming approach to describing grammarsin a generative way [30], a standard library to share and encapsulate common aspectsof all natural language grammars [32], and a common, shared ecosystem including theGF compiler, two runtime systems, and a testing tool able to generate minimal andrepresentative test cases for grammatical constructions of choice [23].

Natural language processing tools such as grammars are often faced with a coverageversus correctness tradeoff: tools like Google Translate will produce output for any giveninput, but their results are often bad (although improving by the day); tools like GFgrammars, on the other hand, will not be able to produce output for any input, butwhen they do they are often correct, or at least easily fixed in most cases (we give severalexamples of errors that were corrected in section 3.3). Because of GF’s lack of coverage,its main goals are that of domain-specific natural language processing – where the lexiconand linguistic phenomena are more restricted – and that of natural language generation(NLG). Its ideal application in this respect is that of generating grammatically-correctnatural language from data such as ontologies or database entries.

Our GF grammar has the same characteristics of GF as a whole: it is unsuitablefor general purpose natural language processing, and thus aims to

1. generate grammatically-correct Portuguese;

2. provide an application-programmer interface capable of supporting domain-specificgrammars and applications.

1.1 Motivation

A grammar is a set of rules that govern a language. It tells us how to combineand compose sentences from its constituents. A computational grammar is an encodingof such rules in a way that allows a computer to analyse sentences to its constituents(an analysis which is often represented as a tree like the one in figure 1), or to generatesentences according to these rules (gloss 1 shows a randomly generated sentence fromour Portuguese grammar). The act of analysing a sentence string to produce analysis

24 Chapter 1. Introduction

Figure 1 – Constituent analysis of a sentence

trees is called parsing, while transforming an analysis tree into a sentence string is calledlinearization (or generation). Note that parsing may successfully analyse the sentencestring in more than one way or not at all, while the linearization of a well-formed tree willproduce at least one sentence tree.

(1) deixe a maior parte de distância ser quebrada, poucos taças de ninguém, distâncias,distâncias a garrafa de muitos copos de ninguém e copos de todo a maior partede algo falado diretos que não teria sido doente se tais que teria tido fome, quenão teriam sido assustados então que serão prontos fáceis a si tais que teria havidofrancês e se que não tinham estado errados ou que não teriam estado certos, tais quetudo não se chamaria nada, que não serão prontos então que não teriam sido, outais que alguém não se teria chamado todos ou tais que ninguém não terá sido ouou que terão sido prontos, que tinham estado certos ou que estão certos, tais queque todos não terão sido casados com nada porque todos teriam tido sede se tinhatornado mais velho que este esquerda de ninguém e tanto que têm sede eles mesmose tais que não estava certo quanto tais que teria feito tanto casado

Grammars are useful because the text analyses they produce – be them syntactic orsemantic – are a possible way of understanding texts automatically. Of course, grammarsare not the only way of producing such analyses (statistical methods of doing so are

1.2. Scope and Contributions 25

actually more popular), nor are they the only way of achieving this ultimate goal ofartificial intelligence. Grammars are useful for their introspectability (it is relatively simpleto see where mistakes stem from and how to fix them, compared to statistical methods) andfor their portability (they encode a lot of information in relatively little space). Grammarsdo have their problems in incompleteness (they can not parse any given text), cost (writinga grammar is a lot of work) and performance (parsing text can be an expensive operationtime- and space-wise). However, as Ranta puts it [33, section 1.3], even though grammars(as abstractions of natural language) leak, all we need to justify them is that they areuseful. And their usefulness comes from a very common-sense principle (the title of [39]):don’t guess if you know. This means that there is no need for us to try to determinestatistically (i. e., guess) the behaviour of regular, well-known patterns of language (likesubject-verb agreement, lemmatization, etc). Let us employ these methods to phenomenawhich are highly irregular (or which are simply out of our grammar’s scope) instead.

1.2 Scope and ContributionsIn this work we present a GF Portuguese grammar in the GF formalism whose

linearizations are tested against two GF treebanks. The intended minimal coverage of thisgrammar is that of all GF Resource Grammar Library constructors (see section 3.1 fordetails), only including its extensions where they might be needed to generate our testtreebanks of choice correctly. A list of the linguistics constructors covered by the RGL canbe found at its synopsis1.

Our foremost contribution is the freely-available GF Portuguese grammar (a previewof which appeared in [7]). While unsuitable for general domain parsing, this grammar wasable to generate grammatical sentences from the syntactic trees in our test treebanks inmore than 90% of the cases (see section 3.3 for details).

In the process of developing it, we have made contributions to the Romancegrammar it is based on, which in turn meant contributions for languages such as Spanishand Catalan. We have also made non-academic contributions to the GF community,providing documentation patches and maintenance work (we set up continuous integrationtesting for the GF repositories and provided documentation fixes, for example).

A complete Portuguese grammar involves not only syntactic coverage but alsolexical coverage. To be able to provide a GF dictionary we needed a Portuguese lexicon.Our work in developing MorphoBr [8], a full-form lexicon of Portuguese open class wordswas inspired by this need.

Because of our work [6, 12] involving the Portuguese WordNet, we were naturallyinclined to contribute to the gf-wordnet project. We have produced a Portuguese in-1 http://www.grammaticalframework.org/lib/doc/synopsis/index.html

http://www.grammaticalframework.org/lib/doc/synopsis/index.html

26 Chapter 1. Introduction

stantiation of this project, based on our GF Portuguese grammar and on the PortugueseWordNet [11]. Although the project (and its Portuguese version) is still ongoing, we werealso able to contribute corrections to the Portuguese WordNet.

Finally, we have also made three existing GF applications support the Portugueselanguage by virtue of our Portuguese grammar. Although most of them are more prototypesthan industrial-grade applications, they should give an idea of what GF is capable of.One of these applications is the GF to Universal Dependencies conversion developed byKolachina & Ranta [20]. This conversion project has created a GF treebank which is usedto test both the project itself and the GF grammar library. On top of adding support forPortuguese in this project, we have also revised its treebank, contributing tree correctionsand removing duplicate trees.

1.3 Structure

In chapter 2 we aim to acquaint the reader with GF. We first present the ideasbehind GF; then we provide an example-based tutorial to familiarize the reader withGF syntax and semantics. We then describe what GF functors are, and how they canbe used to generalize grammatical notions, promoting sharing between different concreteimplementations. We conclude with a more in-depth view of GF, where we follow Angelov [1]in defining GF mathematically, including how its parsing and linearization work formally.Understanding this last part is not required to understand the remainder of this thesis.

Chapter 3 is the backbone of this work. It presents the GF Resource GrammarLibrary (RGL), discussing the motivation behind its existence, its uses, and its structure.We then describe our Portuguese implementation of a resource grammar, which is based onthe RGL Romance functor. We fully describe a miniature version of a resource grammarfor Portuguese, and then discuss the main points of the full implementation, including mor-phology and extension modules. Finally, we evaluate our work under syntactically-correctlanguage generation respect by linearizing two treebanks from well-known computationallinguistics projects, analysing errors and discussing potential solutions.

Chapter 4 presents three pre-existing GF applications which have been extendedto use our Portuguese grammar implementation. The first application is a demonstrationof a health-domain translation chat app developed by a company run by the GF coredevelopment team. The second application is a mapping from the Attempto ControlledEnglish controlled natural language to GF, which allows one to use GF and the RGLto both parse and linearize ACE-like text in languages other than English itself. Thelast application is a converter of GF abstract syntax trees to Universal Dependencies(UD) trees [20], which allows GF to be used as rule-based dependency parser, or as abootstrapper of UD treebanks, among other uses. Our focus is in showing how simple it

1.4. Typesetting Conventions 27

is to add a new language to a GF multilingual app, given a resource grammar for thelanguage is available.

Finally, in chapter 5 we summarize our work and discuss future projects.

1.4 Typesetting ConventionsThroughout this thesis we discuss linguistic examples. Single examples are usually

typeset inline in italic font, as are linguistic terms. More elaborate examples are typesetas two or three pieces of text like in gloss 2. If there are just two text fragments, thefirst is an English version, and the second a Portuguese one – whether it is producedfrom our grammar or is simply an example should be clear from the context. When thereare three fragments, the first is the English version, and the following two fragments arePortuguese. It should be clear from the context whether the first Portuguese fragment isthe final grammar output – in which case the second fragment is either the ideal outputor another possible version – or whether the first Portuguese fragment was the originalgrammar output before a correction was made, which then produced the second Portuguesefragment.

(2) a. I always thought that there was something fundamentally wrong with the universe

b. eu sempre pensei que havia algo de fundamentalmente errado com o universo

When an example is ungrammatical, we prefix it with an asterisk (*); when theexample is grammatical but is the ideal one (where the sense of ‘ideal’ is given by thesurrounding context), we prefix it by an exclamation mark (!). Finally, when an exampleis semantically incorrect but otherwise grammatical, we prefix it by a hash sign (#).

We also typeset inline GF elements differently – including judgement names,modules, constructors, categories, and functions. GF code is shown as in listing 1.1.

−− this is a commentfunHello x = {s = "Hello" ++ x.s} ;

Listing 1.1 – Example GF code listing

29

2 Grammatical Framework

We begin this chapter with an overview of GF. We proceed to a short tutorial onGF, whose purpose is to make the reader familiar with GF syntax and semantics. Weconclude the chapter by giving a mathematical definition of an important subset of GF,and showing formally how parsing and linearization are done in it.

Grammatical Framework (GF) is a domain-specific programming language forgrammar writing. It is a functional programming language, with syntax inspired by theHaskell programming language [25]; it draws from intuitionistic type theory [26] for itstype system.

A GF program is called a grammar, and it defines parsing, generation and translationfrom the same declarative source.

GF’s forte lies at multilingual processing. It applies to natural languages thedistinction made for programming languages: that of an abstract and a concrete syntax.Separating them allows GF to specify a single abstract grammar for several concretelanguages. Translation between two natural languages, for instance, becomes parsing ofconcrete syntax to its abstract representation, and then further linearization to the targetlanguage.

Foods> parse -lang=Eng "this fish is fresh"Pred (This Fish) FreshFoods> parse -lang=Por "este peixe é fresco"Pred (This Fish) FreshFoods> linearize -all Pred (This Fish) Freshthis fish is fresheste peixe é fresco

The idea is that both sentences above carry the same (abstract) information, whichis represented in each language by the respective string. A GF grammar demands adescription of its abstract syntax and how this syntax translates into the concrete syntaxes,i.e., the strings. The abstract syntax encompasses both trees (such as [Pred (This Fish)Fresh]) and their construction rules (see listing 2.1 for an example of an abstract GFgrammar). The concrete syntaxes, on the other hand, must specify rules for the translationof trees into strings of the desired language. Although natural languages are the main focus,GF can and has been used to generate any kinds of strings, e.g. LATEX code. Listing 2.2shows an English concrete grammar for the abstract grammar of listing 2.1; both are a

30 Chapter 2. Grammatical Framework

version of the Foods grammar appearing in GF documentation and literature, for examplein [33, chapter 3].

abstract Foods = {catComment ; Item ; Kind ; Quality ;

funPred : Item → Quality → Comment ;This : Kind → Item ;Fish : Kind ;Fresh : Quality ;

} ;

Listing 2.1 – Abstract grammar Foods.

concrete FoodsEng of Foods = {lincatComment, Quality = {s : Str} ;Kind = {s : Number ⇒Str} ;Item = {s : Str ; n : Number} ;

linPred item quality ={s = item.s ++ copula ! item.n ++ quality . s} ;

This = det Sg "this " ;Fish = noun "fish" " fish " ;Fresh = adj " fresh" ;

paramNumber = Sg | Pl ;

operdet : Number →Str →{s : Number ⇒Str} → {s : Str ; n : Number} =

λn,det,noun → {s = det ++noun.s ! n ; n = n} ;noun : Str → Str → {s : Number ⇒Str} =

λman,men →{s = table {Sg ⇒man ; Pl ⇒men}} ;regNoun : Str → {s : Number ⇒Str} =

λcar → noun car (car + "s") ;adj : Str → {s : Str} =

λcold → {s = cold} ;copula : Number ⇒Str =

table {Sg ⇒ " is " ; Pl ⇒ "are"} ;} ;

Listing 2.2 – Concrete English grammar for Foods.

2.1 A GF tutorialIn this short tutorial we give an overview of GF as a programming language.

We refrained from using an example in the natural language domain, because any such

2.1. A GF tutorial 31

example would need to be much simplified for our current purpose of presenting GF syntaxand semantics. Therefore we opt for a complete example which has the added benefit ofbeing familiar to most programmers. The linguistics-inclined reader will have to wait untilchapter 3 for linguistically-motivated GF code.

By the end of this section we expect the reader to be able to read and understandGF with reasonable fluence, even if writing it is another matter. Despite this stated goal,we still offer information which might be useful to the reader who intends to learn GF,e. g., how to use the GF shell. To this kind of reader we recommend following the examplesin your computer, with GF installed.

The definitive resource to learn GF is the GF book [33]; the freely-available onlinereference1, from where we take many of the definitions in this section, is also useful to lookup syntactic or semantic details; finally, Inari Listenmaa’s blog post2 is a great resource toavoid common GF pitfalls.

2.1.1 A simple GF grammar

In this section we present a grammar for linked lists, capable of translating betweentwo different syntaxes for them. The Lisp syntax uses () for the empty list and representsa non-empty list as a series of elements separated by spaces, delimited by parentheseson both ends, like so: (0 1 1 2 3 5). The Haskell syntax uses [] for the empty list, whilenon-empty lists are elements separated by commas (with optional spaces), delimited bysquare brackets on both ends, as in [0,1,1,2,3,5] .

Linked lists are a data structure which is defined recursively. A list is either:

(1) the empty list (also called Nil); or

(2) a list with an element added at the front.

Because adding an element to the front of a list is how you construct all lists except forthe empty one, this process is called Cons. Given this definition, Cons X (Cons X (ConsX Nil)) is an abstract tree representing the three-element list which in Lisp syntax is(x x x).

A GF grammar is composed of an abstract syntax and any number of concretesyntaxes. While the abstract syntax declares categories and how they can be combinedas in a logical framework, the concrete syntaxes define the linearization of the createdabstract syntax trees to strings.

1 http://www.grammaticalframework.org/doc/gf-refman.html2 https://inariksit.github.io/gf/2018/08/28/gf-gotchas.html

http://www.grammaticalframework.org/doc/gf-refman.html

https://inariksit.github.io/gf/2018/08/28/gf-gotchas.html


Each GF module is composed by a header3 and a body, which is simply a setof judgements. The GF compiler demands that the programmer specify what form ofjudgement she is making by using a judgement keyword, so, e.g., we precede categorydeclarations with cat. We may specify a judgement keyword for each judgement, or(more conveniently) we may omit subsequent judgement keywords of the same form. Eachjudgement is separated by a semicolon ( ;).

Listing 2.3 shows an abstract syntax for linked lists in GF syntax. Note how itfollows the recursive definition of lists. We use the cat judgement to declare categories; ithas the form cat C.4 In this example, we define the List category that we are interestedin, and a dummy element category Elem. We also add the S (for sentence) category, whichis special in GF in that it is the default start category for a grammar. The start categoryis the category we start parsing from. The fun judgment declares how we can construct amember of a category from members of its argument categories; it has the form fun f : T,where T is a type built from basic types (i. e., declared categories and built-in categorieslike Str) using the type constructor →. We declare two zero-argument constructors (whichare constants) and the Cons constructor which is the same as in the recursive definition oflists. To be able to parse lists from the start category S, we add the LS constructor, whichlifts lists to the sentence category.

abstract List = {

catElem ; List ; S ;

funX : Elem ;Nil : List ;Cons : Elem → List → List ;LS : List → S ;

} ;

Listing 2.3 – Abstract syntax for linked lists in GF.

We can already import the file List.gf5 in the GF shell, and play with thegenerate_trees and generate_random commands. Note that all GF shell commandshave shorter aliases. To know more about GF shell commands, call help for a summary ofall available commands, or help [command], for help on the command [command].

3 Module headers specify what kind of module it is, imports, and other things; See the GF referenceat [33] or http://www.grammaticalframework.org/doc/gf-refman.html for more about headers.

4 This form is actually a simplification; for the dependently-typed subset of GF we use more elaboratecategory declarations.

5 Note that GF files must have the same name as the module they contain.

http://www.grammaticalframework.org/doc/gf-refman.html


>i List . gfList> gt −number=3Cons X (Cons X (Cons X (Cons X Nil)))Cons X (Cons X (Cons X Nil))Cons X (Cons X Nil)

Listing 2.4 – GF shell session with abstract module only.

In order to implement a concrete syntax for this abstract syntax, we must askourselves how we want to linearize each constructor defined in the abstract syntax. If wetake Lisp lists as an example, we want to have

Nil <=> ()Cons X (Cons X (Cons X Nil)) <=> (x x x)

In order to achieve this behaviour, we first have to define the linearization types of thecategories declared in the abstract module, using the lincat judgement keyword, as inlisting 2.5.

concrete ListLisp of List = {

lincatElem, S = {s : Str} ;List = {bp,b : Str} ;

Listing 2.5 – Lisp list linearization types.

lincat judgements are composed of category names, an equal sign (=), and a validlinearization type. A linearization type can be either a basic type, a parameter type (user-or library-defined), a table and/or records of those, but not a function type (we will explainthese shortly). We decide on the Elem and S types being a record type with one field of Strtype, and on the List type being a record with two string fields, bp to store the beginningparenthesis, and b to store the body of the list (see listing 2.5).6 If more than one categoryshare a linearization type, we can define them in a single judgement by separating thecategory identifiers by commas.7 We may also omit the linearization type of a category,which gives it the default {s : Str} linearization type.

Listing 2.6 contains the actual linearizations. The dummy X element is linearizedto "x", Nil is linearized to a record containing the strings for ( ), and Cons updates6 We could make Elem and S be simple Str types, however it is a GF convention to use record types for

their extensibility – it is much easier to add a field to an existing record than to change something intoa record and then add a field to it.

7 Something analogous works in abstract modules for the declaration of several constructors of the sametype.


(∗∗) its second argument (the list argument) with a new body field consisting of theconcatenation (++) of the string field of the element argument (x) and of the bodyfield of list argument (xs). The extension record operator ∗∗ is used to both extend andupdate records; it essentially performs the union of the records’ fields. When the tworecords’ field labels are not disjoint, the second record’s fields get precedence, essentiallyupdating the values of the first record’s fields. Our use of ∗∗ could have been writtenequivalently as Cons x xs = {bp = xs.bp ; b = x.s ++ xs.b}. We finally define LS to simplybe the concatenation of the beginning parenthesis with the body of the input list.

linX = {s = "x"} ;Nil = {bp = "("; b = ")" } ;Cons x xs = xs ∗∗ {b = x.s ++ xs .b} ;

LS xs = {s = xs.bp ++ xs.b} ;

Listing 2.6 – Lisp list linearization rules.

>i ListLisp . gfList> parse −lang=Lisp "( x x x x )"LS (Cons X (Cons X (Cons X (Cons X Nil))))

Listing 2.7 – GF shell session with single concrete module.

Now let’s define another concrete syntax, this time for Haskell lists. We want tohave:

Nil <=> []Cons X (Cons X (Cons X Nil)) <=> [x,x,x]

There is an issue here with the commas. We only want to insert a comma whenthere is an existing element in the input list (i. e., the input list is not the empty list). Ifwe try to define Cons as something similar to the Lisp definition in listing 2.6, but witha comma in-between the concatenation of the elements (as in x. s ++ " , " ++ xs.b), wewill get the wrong linearization when the input list is empty. To solve this problem, wegive lists a more complex linearization type, as can be seen in listing 2.8.8 For this, weintroduce the param judgement, which is used to declare a new parameter type, togetherwith its constructors. A parameter type is either introduced by a param judgement or is a8 We could have solved this problem by creating a separate category of empty lists, but this has two

downsides: it makes the Lisp concrete more complicated than it has to be, and it does not follow preciselythe recursive definition of linked lists. An additional downside is that it prevents the introduction ofuseful GF concepts to the reader.


record type whose fields are all parameters types. Parameter types must be finite (we maynot have a parameter constructor take an infinite type such as a string as an argument)and thus non-recursive.

paramBoolean = T | F ;

lincatElem, S = {s: Str} ;List = {s : Str ; null : Boolean} ;

Listing 2.8 – Haskell List linearization types.

We define a boolean parameter whose values might be T and F (for true and false),and make it the type of a field in the linearization type of lists. This will allow us to givedifferent linearizations to different values of the null field. The reader may have noticedthat the bp field of Lisp lists was always the same; to remove this redundancy in theHaskell concrete we remove the bp field, and rename the b field to s. We use the LS ruleto add the beginning square brackets of the lists. The linearization rules of Haskell linkedlists can be seen in listing 2.9.

linX = {s = "x"} ;Nil = {s = "]" ;null = T} ;

Cons x xs = case xs.null of {T ⇒ {s = x.s ++ xs . s ; null = F} ;_ ⇒ xs ∗∗ {s = x.s ++ " , " ++ xs. s}} ;

LS xs = {s = "[" ++ xs. s} ;

Listing 2.9 – Haskell List linearization rules.

Cons’s linearization rule introduces GF case expressions, which allows us topattern-match on parameter values. GF case expressions are simply syntactic sugarfor table selections. GF tables are finite functions of type P ⇒ T , where P is a pa-rameter type and T is any type. Tables can be written with the table keyword like so:table { V1 ⇒ t1 ; ... ; Vn ⇒ tn }, where the V’s are parameters and the t’s are expres-sions of the output type of the table. Tables are finite functions because parameter typesare themselves finite, so it is possible to enumerate all input-ouput pairs. You can apply atable to an expression that evaluates to an instance of its input parameter type using thetable selection operator !.


Because case is just syntactic sugar for tables, the linearization of Cons in listing 2.9is equivalent to the one in listing 2.10 (note the use of the record selection operator). Theunderscore (_) is called a wildcard pattern and matches any value.

Cons x xs = table {True ⇒xs ∗∗ {s = x.s ++ xs . s ; null = F} ;_ ⇒ xs ∗∗ {s = x.s ++ " , " ++ xs. s}} ! xs . null

Listing 2.10 – Cons linearization rule using GF tables.

In order to linearize Cons, then, we check if its list argument is the empty list bychecking its null field; if it is, we update its s field with the concatenation of the elementstring field and its current s field (which is just the closing square bracket). Its null field isalso updated, to the value of F, indicating that the list is not empty anymore. If the listargument is not an empty list, we simply update its body field to be the concatenation ofthe string field of the element argument, a comma, and its previous body field. The otherlinearization rules are what one would expect.

Now that we have two concrete syntaxes, we can translate between them by parsingone concrete to the abstract syntax and then linearizing the abstract tree to the targetconcrete syntax:

>i ListLisp . gf ListHask. gfList> p −tr −lang=Lisp "( x x x )" | l −lang=HaskCons X (Cons X (Cons X Nil))[ x , x , x ]

Listing 2.11 – Translation between List concrete syntaxes.

The -tr option of a GF shell command traces its output (i.e., prints it), while thepipe command (|) uses the output of the previous command as the input to the next one.

2.1.2 Refactoring

If we analyse the two concrete grammars we developed in section 2.1.1, we cansee that they share many definitions, like that of the lincat of Elem or that of the LSconstructor. As a functional programming language, GF has the means to avoid mostcode boilerplate and repetition. Two of the most prominent ways of doing so is by usingfunctors and the oper judgement. In this section we refactor our implementation of theList grammar using both of these. Note that the gains of using these GF constructs insuch a small grammar are negligible, but that they are very useful in larger grammars.


An oper is akin to a function definition in most programming languages. It is ofthe form oper h : T = t, where T is a type and t is a term of type T. The type can beomitted if the compiler is able to infer it, and type and definition can also be given bytwo separate oper judgements. Note that in many cases the term definition t of an operwill be an anonymous function (or lambda abstraction) of the form λx → t, which is aone-argument function whose application is computed by substituting its argument for xin t. Listing 2.12 shows the declaration and definition of an oper consWith that can beused to avoid the repetition in the two branches of the case expression in the previousdefinition of the Cons constructor, as is done in listing 2.14. It is also possible to define anoper using a shorthand like the one used by lin judgements.

operconsWith : Str → Elem → List → List= λsep,x,xs → xs ∗∗ {s = x.s ++ sep ++ xs. s ; null = False} ;

−−−− could also be defined as two separate judgements:−− consWith : Str →Elem → List → List ;−− consWith sep x xs = xs ∗∗{s = x.s ++ sep ++ xs. s ; null = False} ;

Listing 2.12 – Cons using a consWith oper

In larger GF grammmars it is customary to define parameters and functions in aresource module, which can then be imported by several concrete modules for use.

The other element used in our refactoring of the List grammar are functors. GFfunctor modules are inspired by the parametrized modules found in functional programminglanguages like ML.9 A GF functor module is any module that opens (imports) an interfacemodule. An interface module is a module that only declares oper judgements, possiblydefining them.

The header syntax for a functor module is like below:

incomplete concrete ListI of List = open ListSyntax in

The incomplete keyword highlights the fact that a functor without its instantiation isnot a valid (concrete) grammar. The ListSyntax module is the interface module that isopened by the functor.

What allows wider use of functors is the possibility of using abstract syntax modulesas interface modules, and concrete syntax modules as their instances, performing thefollowing mapping:

cat C ↔ oper C : Type

9 If you know the functional programming language Haskell, be warned: its functors are a differentmatter, not concerning modules at all.


fun f : A ↔ oper f : A

lincat C = T ↔ oper C : Type = T

lin f = t ↔ oper f : A = t

To refactor our list grammar into a functor, we try to generalize our two originalimplementations, trying to abstract what they do not have in common to the interfacemodule. In the case of lists, we know that Haskell and Lisp syntax differ only in whatkind of separator they use and what kind of boundary characters they use, so we endup with the ListI functor module in listing 2.14 and the ListSyntax interface modulein listing 2.13. In order to build ListI, we took the most general implementation out ofthe two (that of Haskell), and changed a few strings to parametrized opers, which aredeclared in ListSyntax. To complete our implementation, we only need to instantiate bothmodules for each concrete we want, which we do in figure 2. Because the type signaturesof the opers are already given in the interface module, we do not need to give them in itsinstantiations.

We can think of the List functor module as a function at the module-level, with atype signature like:ListI : instance of ListSyntax → concrete of List

Functor instantiation then would resemble function application.

interface ListSyntax = {operelemSep : Str ;leftBound : Str ;rightBound : Str ;

} ;

Listing 2.13 – List syntax interface module.

2.2 GF, mathematicallyGF grammars can be compiled to a binary format called portable grammar format

(PGF) [2]. This format is ideal for a formal description of GF since it abstracts awaysyntactic details of GF as a programming language. In this section we present the mathe-matical definition of this format, discuss on a high-level how GF code is translated into it,and show how parsing and linearization is done by a PGF interpreter. The main sourcefor this section is Krasimir Angelov’s PhD thesis [1], which we follow closely in text andnotation; we have contributed nothing to his work.

2.2. GF, mathematically 39

instance ListSyntaxHask of ListSyntax = {

operelemSep = "," ;leftBound = "[" ;rightBound = "]" ;

} ;

Listing 2.15 – Haskell list syntax instantiation.

concrete ListFHask of List = ListIwith (ListSyntax=ListSyntaxHask) ∗∗{

} ;

Listing 2.16 – Haskell list functor instantiation.

instance ListSyntaxLisp of ListSyntax = {

operelemSep = "" ; −− no element separator but for spacingleftBound = "(" ;rightBound = ")" ;

} ;

Listing 2.17 – Lisp list syntax instantiation.

concrete ListFLisp of List = ListIwith (ListSyntax=ListSyntaxLisp) ∗∗ {

} ;

Listing 2.18 – Lisp list functor instantiation.

Figure 2 – A functor implementation of Haskell and Lisp list syntaxes.


incomplete concrete ListI of List = open ListSyntax, Prelude in {

lincatElem, S = {s : Str} ;List = {s : Str ;−− Bool is actually a pre−defined parameter type, so we don’t−− need to redefine itnull : Bool} ;

linX = {s = "x"} ;Nil = {s = rightBound ;null = True} ;

Cons x xs = case xs.null of {True ⇒consWith "" x xs ;_ ⇒ consWith elemSep x xs} ;−− we can use lambda abstractions for lin definitions too!LS = λxs →{s = leftBound ++xs. s} ;

operconsWith : Str → Elem → List → List= λsep,x,xs → xs ∗∗ {s = x.s ++ sep ++ xs. s ; null = False} ;

−−−− could also be defined as two separate judgements:−− consWith : Str →Elem → List → List ;−− consWith sep x xs = xs ∗∗{s = x.s ++ sep ++ xs. s ; null = False} ;

} ;

Listing 2.14 – List functor module.

We only present a subset of GF in that we don’t discuss dependently-typed GFcode.

2.2.1 Definitions

Definition 1. A PGF grammar G is a pair of an abstract grammar A and a finite set ofconcrete syntaxes C1, . . . , Cn:

G = 〈A, {C1, . . . , Cn}〉

Definition 2. An abstract syntax is a triple of a set of abstract categories,10 a set of10 By ‘category’ we mean GF categories, not category in the Category Theory sense.


abstract constructors with their type signatures, and a start category:

A = 〈NA, FA, S〉

• NA is a finite set of abstract categories.

• FA is a finite set of abstract constructors. Every element in the set is of the formf : τ where f is a constructor symbol and τ is its type. The type is either a categoryC ∈ NA or a constructor type τ1 → τ2 where τ1 and τ2 are also types.

• S ∈ NA is the start category.

Definition 3. A concrete syntax C is a parallel multiple context-free grammar (PMCFG)complemented by a mapping from its categories and constructors to the abstract syntax:

C = 〈G,ψN , ψF , d〉

• G is a PMCFG: an extension of a context-free grammar (CFG) where a syntacticcategory is defined not as a set of symbols but as a set of tuples of strings. We applyconstructors over tuples of input categories to obtain a tuple in the result category.A formal definition of PMCFG is given in definition 4, but a PMCFG is mainlycomposed of a set of production rules which define how to construct a given categoryby applying some constructor. An example using a constructor f and categories A,B, and C is the production rule:

A→ f [B,C]

• ψN is a mapping from the concrete categories in G to the set of abstract categoriesNA.

• ψF is a mapping from the concrete constructors inG to the set of abstract constructorsFA. A concrete constructor fC has the same arity of its corresponding abstractconstructor:

a(fC) = a(ψF (fC))

where a is a mathematical function which takes a GF constructor and returns itsarity.

The notation for the definitions of concrete funtions is tailored to simplify thededuction rules we will write later; an example is:

f := (〈0; 0〉b, 〈1; 0〉〈0; 1〉)


where f is the constructor name. The notation 〈d; r〉 stands for the constituentnumber r of argument d, so f creates a tuple of two strings where the first one(〈0; 0〉b) is built from the first constituent of the first argument and by adding theterminal b at the end. The second one (〈1; 0〉〈0; 1〉) concatenates the first constituentof the second argument with the second constituent of the first argument of the firstargument.

• d assigns a positive integer d(A), called dimension, to every abstract category A ∈ NA.A given category may have different dimensions in the different concrete syntaxes.

Definition 4. A parallel multiple context-free grammar (PMCFG) [36, 37] is a5-tuple:

G = 〈NC, T, F C, P, L〉

• F C is a finite set of concrete categories. For every A ∈ NC, the equation d(A) =d(ψN (A)) defines the dimension for every concrete category as equal to the dimensionin the current concrete syntax of the corresponding abstract category.

• T is a finite set of terminal symbols.

• F C is a finite set of concrete constructors. For every f ∈ F C, the dimensions r(f)(the number of constituents in the output of f) and di(f) (the dimension of thei-th argument category) (with 1 ≤ i ≤ a(f)) are given. For every positive integerd, (T ∗)d denotes the set of all d-tuples of strings over T . So if T = {a, b}, then(T ∗)3 = {(a, a, a), (a, a, b), (a, b, a), (b, a, a), (b, b, a), (b, a, b), (a, b, b), (b, b, b)}. Eachconstructor f ∈ F C is a total mapping from (T ∗)d1(f) × (T ∗)d2(f) × · · · × (T ∗)da(f)(f)

to (T ∗)r(f), and is defined as

f := (α1, α2, . . . , αr(f))

Here αi is a sequence of terminals and 〈k; l〉 pairs, where 0 ≤ k ≤ a(f) is calledargument index and 0 ≤ l ≤ dk(f) is called constituent index. We also use thenotation rhs(f, l) to refer to constituent αl of the constructor f .

• P is a finite set of productions of the form:

A→ f [A1, A2, . . . , Aa(f)]

where A ∈ NC is called result category, A1, A2, . . . , Aa(f) ∈ NC are called argumentcategories and f ∈ F C is a constructor symbol. For the production to be well formed,the conditions di(f) = d(Ai) (with 1 ≤ i ≤ a(f)) and r(f) = d(A) must hold.


• A default linearization function L of a category C is a function which produces anobject of the linearization type of C when applied to a string. L ⊂ NC × F C is a setwhich defines the default linearization functions for those concrete categories thathave default linearizations. If the pair (A, f) is in L then f is a default linearizationfunction for A. We will also use the abbreviation:

lindef(A) = {f | (A, f) ∈ L}

to denote the set of all default linearization functions for A. For every f ∈ lindef(A)it must hold that r(f) = d(A), a(f) = 1, and d1(f) = 1.

The abstract syntax defines a grammar of typed lambda terms; similarly, theconcrete syntaxes allow us to construct concrete syntax trees:

Definition 5. (f t1 . . . ta(f)) is a concrete tree of category A if ti is a concrete tree ofcategory Bi and there is a production:

A→ f [B1 . . . Ba(f)]

The abstract notation for to say that t is a tree of category A is t : A. Whena(f) = 0, then the tree does not have children and the node is called a leaf.

A concrete syntax tree can be bottom-up linearized to a tuple of strings in thefollowing way: leaves (trees with no arguments) are already tuple of strings; to linearize atree with one or more arguments, one linearizes the arguments first, and them combinesthe linearization into a tuple of strings.

To define a linearization procedure, we employ a helper constructor K whichproduces a string from a sequence of tuples of strings (the linearized arguments) and asequence αi of terminals and 〈k; l〉 pairs. The output string is produced by the substitutionof each 〈k; l〉 with the string for constituent l from argument k:

K −→σ (β1〈k1; l1〉β2〈k2; l2〉 . . . βn) = β1σk1l1β2σk2l2 . . . βn

where βi ∈ T ∗ and −→σ is the sequence of linearized arguments. With K, we can define L,the actual linearization constructor:

L (f t1t2 . . . ta(f)) = (x1, x2, . . . xr(f)) (2.1)

wherexi = K (L(t1),L(t2) . . .L(ta(f))) αi

andf := (α1, α2, . . . α(f)) ∈ F C


2.2.2 From GF to PGF

In this section we discuss on a high level how GF code is transformed into PGF.As shown in section 2.2.1, the PGF format is very simple – its production rules simplygenerate tuples of strings from other tuples of strings. So it is natural to ask how the GFcompiler can encode the relatively rich language of GF in this simpler system.

The first thing to notice here is that although GF enjoys pattern-matching andfunction definitions (opers), these are compiled away when GF is transformed into canonicalGF [24,30], which is a variant of GF without syntactic sugar and where partial evaluationhas been applied. To partially evaluate GF code means to inline all functions definitions,and evaluate all expressions as far as possible, until they depend on runtime variables(as constructors do). This is possible because GF functions may not be recursive norco-recursive. We can also compile away pattern-matching because GF parameter types(on which we pattern match) are finite, and can thus be enumerated. Canonical GF thenonly has category and constructor declarations and definitions, containing only argumentvariables, strings, records, tables, and parameters, plus the concatenation, record projectionand table selection operators. Thus we only have to worry about converting these to PGF.Records and tables can be straightforwardly flattened to tuples of strings. However it isnot so clear how can we encode record fields whose values are parameters, nor how theappropriate elements of tables are selected using parameters.

Let us take the Haskell concrete of the List grammar from section 2.1.1 as an examplegrammar being converted to PGF. The Elem category is straightforwardly representedas a one-element string tuple. For instance, X is just "x"; if we had other Elems, they’dbe simple strings too. The way to represent a record where one field is a parameter (likein the case of the List category) is to enumerate all possible values of the parameter(which is always possible since parameters must be finite), and create a new concretecategory for each such value. Therefore what was just one category in GF (List) becomestwo concrete categories in the PGF representation: Listempty and Listnonempty.11 Bothconcrete categories are then represented as one-element tuple of strings; the value of theirnull fields is encoded by the concrete category they belong to. For example, Nil is thenrepresented as just 〈")"〉.

Now that we know that categories can get split in the GF to PGF conversion, it iseasier to see how the linearization definition of the Cons constructor in the Haskell concreteof the List grammar (see listing 2.9) is handled. It contains a case expression whose valuedepends on the two-valued Bool parameter from the null field of its input argument ofList. But there is no List category anymore, so we split the Cons constructor into twoconcrete constructors: one which takes a Listempty and another that takes a Listnonempty.There is no case expression in any of them (and there could not be, nor is there a need to11 These names are just for mnemonics; they are actually integers in the actual PGF format.


be). For example, the Consnonempty concrete constructor takes a non-empty List as input,and produces a non-empty List; we know that non-empty Lists are just a one-stringtuple; using the notation from definition 3, we have:

Consnonempty := (〈0; 0〉“,"〈1; 0〉)

That is, we produce the first constituent (corresponding to the s field) of theone-element tuple by concatenating the first constituent of the first argument (which is anElem), a comma, and the first constituent of the second argument (which is a Listnonempty).

We can obtain a human-readable PGF from any GF grammar by using theprint_grammar command of the GF shell. Its output is very close in notation fromthe one presented here.

In summary, although PGF preserves abstract categories and constructors definedin GF, it replaces linearization categories and rules by concrete categories and constructors.In the general case there is more than one concrete category and constructor to eachlinearization category and rule (and thus to each abstract category and constructor). Weuse ψN and ψF from definition 3 to preserve the mapping from concrete categories toabstract categories and from concrete constructors to abstract constructor, respectively.

2.2.3 Parsing

In this section we show how GF parsing from a PGF representation works using arunning example. We first explain the idea behind the parse algorithm, show how we canrepresent it as a series of deduction steps, and explain the deductions rules we use. Wethen present the example grammar, and show a complete parse. More details and proofsof soundness (i. e., it is impossible to parse a string not accepted by the grammar) andcompleteness (i. e., it is possible to parse every string produced by the grammar) of theparsing algorithm can be found in Krasimir Angelov’s thesis [1].

2.2.3.1 The example grammar

Our running example is still that of Haskell lists from section 2.1.1. Figure 3 showsthe PGF version of the Haskell list grammar from section 2.1.1.

There are two construction rules for the top-level category S (whose dimension isone), one which takes one argument of category N (for empty lists, formerly Listempty), andanother which takes one argument of category L (for non-empty lists, formerly List).Bothrules use the same constructor ls, which concatenates the beginning square bracket withits first argument’s first constituent.12 As discussed in section 2.2.2, the Cons becomes12 The actual PGF has the ls constructor accept a single category of L*, and adds two coercions from

empty lists and non-empty lists to L*.


S → ls[N ]S → ls[L]L→ ce[E,N ]L→ co[E,L]N → nl[]E → e[]

ls := (“[”〈0, 0〉)ce := (〈0, 0〉〈1, 0〉)co := (〈0, 0〉“, ”〈1, 0〉)nl := (“]”)e := (“x”)

Figure 3 – The PGF grammar for Haskell lists.

S → “[”LL→ “]” | “x”L | “x”“, ”L

Figure 4 – A context-free aproximation for Haskell lists without an intermediary categoryfor empty lists

two concrete constructors: one that takes empty lists (ce), and the other which takesnon-empty lists (co); the former simply concatenates the first constituent of the firstargument (the element) to the first constituent of the second (the list) argument; the latterincludes a comma between these two constituents. The element category has only onenullary constructor e, which is a single tuple; the PGF-only empty list category also has asingle nullary constructor nl.

The syntax tree for the string [x, x] is (ls (ce e (co e nl))), while that of theempty list is simply (ls nl).

2.2.3.2 The algorithm

GF’s parsing algorithm is a generalization of a context-free parsing algorithm.Figure 4 shows a context-free approximation of the grammar Haskell lists that does notuse an intermediate category for the empty list.13

To restrict this grammar so that it only produces syntactically correct Haskelllists, we can create new production rules and categories at every rule application. This13 It is possible to define the grammar exactly using a CFG by adding a category for empty lists; if we

did so, however, we would not be able to show how GF parses context-sensitive grammars.


S → “[” LL→ “]” | “x” L | “x, ” L

1 S2 “[” L3 “[” “x, ”L1 L1 → “x” | “x, ” L24 “[” “x, ” “x” L2 L2 → “]”5 “[” “x, ” “x” “]”

Figure 5 – Parse for the string [x , x] using CFG approximation with on-the-fly special-ization.

on-the-fly specialization of the parser prevents it from accepting strings which are not inour example grammar.

Figure 5 shows how parsing the list string [x, x] would work. We begin the parsefrom the start category S, which has only one branch, so we parse the beginning squarebracket, and soon find the L category, which might be formed in three ways; we continuewith the only one matching our input in step 3. Now, given that we have parsed anelement-with-a-comma, we want to have only two valid continuations: we can either parsethe another element-with-a-comma, or we can parse a normal element; this is enforced bythe on-the-fly creation of a new category L1 – a specialization of L that does not havethe ending square bracket branch. We continue with parsing the normal element branch(because it matches our input), which means there is no following element – the list mustend now or the parse will fail – so we want to have only one valid continuation where weparse the ending square bracket – this is reflected by the new specialized rule L2. Noticethat if we had a malformed input which had another element at this point, the parsewould fail; if we were using our pure CFG approximation, the former case would still beparsed. We successfully parse an ending square bracket because it matches our input, andhave now fully consumed the input with a successful parse.

A parser for this grammar would look like parser for a context-free language, exceptfor the creation of new categories and rules on-the-fly.

2.2.3.3 Parsing as deduction

We present the parsing algorithm as a deduction, following Shieber et al. [38] Eachrule application derives a set of items:

X1 . . . Xn

Y〈side conditions on X1 . . . Xn〉

where the premises Xi are items, and the derivation Y is also an item. We take the inputstring to be a sequence w1 . . . wn of tokens.

We first explain the kind of items derived by the parser. After presenting the formalnotation, we show concrete examples to help make the notation clearer. If these examples


are not enough, their use in section 2.2.3.4 should be enough to explain their significance.We finally present and explain the parsing deduction rules; their use in section 2.2.3.4should help clarify their semantics.

The parsing deduction system generates active, passive, and production items.

2.2.3.3.1 Production items

In Shieber et al. deduction systems, the grammars are constant. Because in thecase of GF parsing the grammars are extended at runtime (as intuited by section 2.2.3.2),the set of productions must be a part of the deduction set, and the productions from theoriginal grammar are considered axioms included in the initial deduction set.

2.2.3.3.2 Active items

Active items represent the parsing state at a given point of the deduction:

[kjA→ f [−→B ]; l : α • β], j ≤ k

This notation encodes a constructor f with the following production:

A→ f [−→B ]

f := (γ1, . . . , γl−1, αβ, . . . , γr(f))

such that the tree (f t1 . . . ta(f)) will produce the substring wj+1 . . . wk as prefix in con-stituent l for any sequence of arguments ti : Bi. The sequence α is the part that producedthe substring:

K (L(t1),L(t2) . . .L(ta(f))) α = wj+1 . . . wk

and β is the part that is not processed yet.

Take the following example active item from a parsing using the Haskell listgrammar at figure 3 (page 46):

[11L→ ce[E,N ]; 0 : • 〈0, 0〉〈1, 0〉]

It denotes the ongoing parse of a non-empty list (category L), starting at index 1 of theinput (the lower 1) and currently at index 1 of the input (it has not successfully parsedanything yet). Note that all indices start from zero. This non-empty list is being builtfrom the ce constructor, which takes an element (category E) and an empty list (categoryN). It is currently parsing the first (thus 0) and only constituent of the result category; itspoint is just before two argument-constituent pairs corresponding to the first constituentof the first argument and to the first constituent of the second argument. The componentsafter the constituent index correspond exactly to the ones at the ce rule at figure 3.


2.2.3.3.3 Passive items

Passive items are written as:

[kjA; l;N ], j ≤ k

and are proof of at least one production:

A→ f [−→B ]

f := (γ1, γ2, . . . , γr(f))

and a tree (f t1 . . . ta(f)) : A such that the constituent with index l in the linearizationof the tree is equal to wj+1 . . . wk. Contrary to the active items in the passive the wholeconstituent is matched:

K(L(t1),L(t2) . . .L(ta(f))) γl = wj+1 . . . wk

Every time there is a completion of an active item, a passive item is derived along witha new category N which accumulates all productions for A that produce the wj+1 . . . wk

substring from constituent l. All trees of category N must produce wj+1 . . . wk in theconstituent l.

Observe the following passive item from a parse using the grammar at figure 3(page 46):

[21E; 0;C0]

It witnesses the parsing of the first (and only) constituent of a member of category E.It spans one token of the input, starting at index 1. It has also produced a specializedcategory C0 with the production rule C0 → e[], where e is the zero-argument constructorfrom figure 3. Its corresponding tree e : E is such that the constituent with index 0 in thelinearization of this tree (“x”) is equal to token of the index 1 of the input.

The deduction rules of the parsing system are shown in figure 6.

2.2.3.3.4 Initial predict

Derive an item spanning the 0–0 range for each production whose result categoryis mapped to the start category in the abstract syntax.

2.2.3.3.5 Predict

Given an active item with a dot before an argument-constituent pair 〈d; r〉 and amatching production rule, derive an active item where the dot is in the beginning of theconstituent r of the constructor function.


Initial Predict

A→ f [−→B ]ψN(A) = S, S the start category in A, α = rhs(f, 1)

[00A→ [−→B ]; 1 : •α]

Predict

Bd → g[−→C ] [kjA→ [−→B ]; l : α • 〈d; r〉β]γ = rhs(g, r)

[kkBd → g[−→C ]; r : •γ]

Scan

[kjA→ f [−→B ]; l : α • sβ]s = wk+1

[k+1j A→ f [−→B ]; l : αs • β]

Complete

[kjA→ f [−→B ]; l : α•]N = (A, l, j, k)

N → f [−→B ] [kjA; l;N ]

Combine

[ujA→ f [−→B ]; l : α • 〈d; r〉β] [kuBd; r;N ]

[kjA→ f [−→B {d := N}]; l : α〈d; r〉 • β]

Figure 6 – Parsing deduction rules

2.2.3.3.6 Scan

Given an active item with a dot before a terminal s and that s matches the currentelement wk of the input string, derive a new active item where the dot is moved to thenext position.

2.2.3.3.7 Complete

Derive a passive item from an active item whose dot is at the end. The resultingcategory N in the passive item is a fresh category. Also derive a new production for Nwhich has the same constructor and arguments as the category in the active item.

2.2.3.3.8 Combine

Derive an active item from matching active and passive items. For the purposes ofCombine, an active item matches a passive one if its dot is before the parsing of the r-th


constituent of the d-th argument of its constructor function, and the passive item derivesjust such a constituent, with matching indices.

The item in the premise of Complete was at some point predicted by a Predictof some other item. The Combine rule then replaces the occurence of the original categoryA (in the premise of Predict) with its specialization N .

A parsing is sucessful if we have derived a passive item in the start category of thegrammar which spans the whole length of the input text.

2.2.3.4 An example parse

In this section we show the derivation of the parse of the string [x , x] using theHaskell list PGF grammar in figure 3. To save space, figure 7 shows the derivation omittingsteps that lead to an unsuccessful parse; figure 8 and figure 9 show two branches of thecomplete derivation that led to an unsuccessful parse, but other such branches exist. TheGF parser has no way of knowing which branches will lead to successful parses withouttrying them, of course, so it will explore all branches – unless told to stop after a certainnumber of successful parses.

Instead of the usual deduction tree, we present the steps as a table where the firstcolumn is an integer identifier for the items derived in that step, the second column showsthe derived items themselves, and the last column is the name of the rule applied in thatstep, along with the identifiers of the items the rule was applied to. That is, step 6 fromfigure 7 shows the item derived from the application of the Predict rule to the itemsderived in steps 2 and 5. Note that the integer identifiers have no meaning and follow noparticular order.

The first few lines of the derivation are the productions from the grammar in figure 3.Steps 0 and 1 tell us that the start category may be formed using the ls constructortaking either an empty list or a non-empty list. Figure 8 shows that following the emptylist branch leads us to a successful parse of the opening square bracket token, but then toa dead-end where we have the closing square bracket at point but it does not match thenext token in the input stream. Because this branch leads us to a parse failure, we continueat step 9 by applying the Initial Predict rule, which yields us an item with point (•)before two components. The first component (the opening square bracket) matches thenext token of our input string, so we apply the Scan rule and advance point by one token.

We are trying to parse a sentence built from a non-empty list; recall the Predictrule from figure 6: we may predict only the category matching the argument index of theargument-constituent pair in front of the point; this category is that of non-empty listsand we have two production rules for those. The one at step 2 would lead us to try toparse [x], because it would force the parse of an empty list following the first element. So


0 S → ls[N ]1 S → ls[L]2 L→ ce[E,N ]3 L→ co[E,L]4 N → nl[]5 E → e[]9 [00S → ls[L]; 0 : • “[” 〈0, 0〉] Initial Predict 110 [10S → ls[L]; 0 : “[” • 〈0, 0〉] Scan 917 [11L→ co[E,L]; 0 : • 〈0, 0〉 “, ” 〈1, 0〉] Predict 3 1018 [11E → e[]; 0 : • “x”] Predict 5 1719 [21E → e[]; 0 : “x” •] Scan 1820 C1 → e[] [21E; 0;C1] Complete 1921 [21L→ co[C1, L]; 0 : 〈0, 0〉 • “, ” 〈1, 0〉] Combine 17 2022 [31L→ co[C1, L]; 0 : 〈0, 0〉 “, ” • 〈1, 0〉] Scan 2128 [33L→ ce[E,N ]; 0 : • 〈0, 0〉〈1, 0〉] Predict 2 2229 [33E → e[]; 0 : • “x”] Predict 5 2830 [43E → e[]; 0 : “x” •] Scan 2931 C3 → e[] [43E; 0;C3] Complete 3032 [43L→ ce[C3, N ]; 0 : 〈0, 0〉 • 〈1, 0〉] Combine 31 2833 [44N → nl[]; 0 : • “]”] Predict 4 3234 [54N → nl[]; 0 : “]” •] Scan 3335 C4 → nl[] [54N ; 0;C4] Complete 3436 [53L→ ce[C3, C4]; 0 : 〈0, 0〉〈1, 0〉 •] Combine 35 3237 C5 → ce[C3, C4] [53L; 0;C5] Complete 3638 [51L→ co[C1, C5]; 0 : 〈0, 0〉 “, ” 〈1, 0〉 •] Combine 37 2239 C6 → co[C1, C5] [51L; 0;C6] Complete 3840 [50S → ls[C6]; 0 : “[” 〈0, 0〉 bullet] Combine 39 1041 C7 → ls[C6] [50S; 0;C7] Complete 40

Figure 7 – Parse deduction for the string [x , x]

we continue by applying Predict to the rule at step 3.

We are now trying to parse a non-empty list built with the co constructor. We havealready parsed one token, and are now to parse the first constituent of the first argumentof the co constructor, which is an element. The Predict rule tells us that we are to parsean element (E), which amounts to parsing a “x” token; because it matches our next tokenin the input stream, we Scan it. We have successfully parsed an element, so we apply theComplete rule to produce a passive item witnessing this.

At step 21, we use this passive item to advance on our parse of the non-empty listfrom step 17. Point is now in front of a comma token, which matches the next token in theinput stream, so we may Scan it. Point then comes to the first constituent of the secondargument of the list constructor co; we are trying to parse a non-empty list. We againPredict two branches to follow, corresponding to the two constructors of non-empty lists.If we follow the co constructor, we will be trying to parse a list with at least 3 elements,since we have already parsed one element, and co takes an element and a non-empty list,


6 [00S → ls[N ]; 0 : • “[” 〈0, 0〉] Initial Predict 07 [10S → ls[N ]; 0 : “[” • 〈0, 0〉] Scan 68 [11N → nl[]; 0 : • “]”] Predict 4 7

Figure 8 – Failed parse deduction for the string [x , x]: unexpected token.

23 [33L→ co[E,L]; 0 : • 〈0, 0〉 “, ” 〈1, 0〉] Predict 3 2224 [33E → e[]; 0 : • “x”] Predict 5 2325 [43E → e[]; 0 : “x” •] Scan 2426 C2 → e[] [43E; 0;C2] Complete 2527 [43L→ co[C2, L]; 0 : 〈0, 0〉 • “, ” 〈1, 0〉] Combine 26 23

Figure 9 – Failed parse deduction for the string [x , x]: unexpected token.

which itself has at least one element. Figure 9 follows this branch up until a parse failuredue to an unexpected token.

Therefore we continue at step 28 with the ce constructor. The element at point is thefirst constituent of the first argument, which is an element. We apply the Predict, Scanand Complete rules in succession to successfully parse another element, and Combineour resulting passive item at step 32 with our prior active item trying to parse a non-emptylist built by the ce constructor, advancing point to the next argument-constituent pair.

The next argument-constituent pair corresponds to an empty list, which can onlybe constructed by one concrete constructor (this application of the Predict rule has nosibling branch). We successfully parse the empty list because the closing square brackettoken matches the next token in our input stream. This allows us to Complete theparsing of the empty list, which Combined with step 32 Completes the parsing of thenon-empty list component of the the first non-empty list we were trying to parse. This listis also completed by this step at 39, which finishes parsing our input sentence. Our finalpassive item at step 41 witnesses the parsing of the first (and only) constituent of an itemin the sentence category spanning the whole 5-token input.

2.2.4 Linearization

Linearization in GF is the process that maps an abstract syntax tree to a tupleof strings (usually one string at the topmost level). We have already discussed howconcrete syntax trees are linearized to tuples of strings in section 2.2.1, definition 5 andequation (2.1); here we show how we can map abstract syntax trees to concrete syntaxtrees, so that we can also linearize the former. For simplicity, we do not handle the casewhere the abstract trees are incomplete, but the interested reader can refer (again) toKrasimir Angelov’s thesis [1].

Before we describe the transformation of an abstract syntax tree to a concretesyntax tree, we defineP

fA,−→B ∈ P , the subset of productions whose argument categories are


−→B and whose concrete constructors map to the same abstract constructor FA:

PfA,−→B = {A→ fC[−→B ] | A→ fC[−→B ] ∈ P ; ψF (fC) = fA}

We can write a transformation of an abstract syntax tree tA to a concrete syntaxtree tC of category B as tA ⇓ tC : B, and describe it as a deduction rule:

tA1 ⇓ tC1 : B1 . . . tAn ⇓ tCn : Bn

A→ fC[−→B ] ∈ P, n = a(fA) = a(fC)fA tA1 . . . t

An ⇓ fC tC1 . . . tCn : A

To transform a tree where a constructor is applied to more than zero arguments, we firsttransform the arguments. The vector −→B from the linearization of the arguments and fA isthe abstract constructor of the input tree.

To complete the linearization after having computed tA ⇓ tC : B, we just calculateL(tC) using equation (2.1).

55

3 A computational grammar for Portuguese

In this section we describe the GF resource grammar library [32] [33, chapters5,9,10] and our implementation of a Portuguese resource grammar [7]. We also evaluateour grammar with respect to syntactical correctness.

3.1 The GF resource grammar library

The GF resource grammar library (RGL) is a collection of natural languagegrammars (each one individually called a resource grammar, or RG) that intend to providean usable and syntactically-correct interface to the languages they describe, so that theycan be reused by other, domain-specific grammars [32, section 7.1]. Their usability comesin part from the fact that the grammars share a common framework – they use the samecategories and provide the same constructors. As of January 2019, the RGL comprisesgrammars for more than 40 languages in different levels of completeness.

3.1.1 Motivation

When writing domain-specific grammars naively, one often needs to reimplementlinguistic constructions that show up in other grammars. This is a consequence of thefact that even small fragments of natural language may contain non-trivial linguisticphenomena that are difficult to capture properly. Handling these at each grammar instanceleads to code repetition and unmaintainability. In the GF book [33, section 1.7], it is shownhow a simple sentence such as the number 2 is prime in a multilingual application mayneed an implementation of mood variations, number agreement, word order, and differentnoun cases. This kind of general syntactic knowledge is better encapsulated in such a waythat it can be reused, and in GF this encapsulation is in the form of a resource grammar.As GF supports grammar composition (as in defining a grammar in terms of anothergrammar) out of the box, domain-specific grammar writers can benefit from a standardlibrary that encloses specialized linguistic knowledge. This library can then be employedby someone with domain-specific knowledge but less linguistic knowledge, promoting adivision of labour of sorts.

The resemblance of a RG to software libraries has not been missed [31]. A RG canbe seen as a software library, and its use has the same advantages as the use of softwarelibraries: besides the aforementioned avoidance of code repetition and the increased ease ofwriting (needing less specialized authors means more people can write grammars/programs),the library centralizes development and bug reporting in one place, saving resources and

56 Chapter 3. A computational grammar for Portuguese

lin Prime x = \\ord,mod ⇒let

ist = case 〈mod,x.n〉 of {〈Ind, Sg〉 ⇒ " ist " ;〈Ind, Pl〉 ⇒ "sind" ;〈Conj,Sg〉 ⇒ " sei " ;〈Conj,Pl〉 ⇒ "seien"}

in case ord of {Main ⇒ x. s ! Nom ++ist ++ "unteilbar" ;Sub ⇒ x. s ! Nom ++"unteilbar" ++ ist ;Inv ⇒ ist ++ x.s ! Nom ++"unteilbar"}

Listing 3.1 – A hand-written rule.

lin Prime x = UseCl(TTAnt TPres ASimul) PPos(PredVP x (UseComp (CompAP (PositA unteilbar_A))))

Listing 3.2 – Using the RGL constructors directly.

lin Prime x = mkS (mkCl x unteilbar_A)

Listing 3.3 – Using the RGL API.

Figure 10 – German predicate Prime

gathering improvements faster than if each user had to develop their own library.

3.1.2 Usage

Again similarly to a software library, the GF RGL makes available an applicationprogramming interface (API). The syntactic part of the API is language-independent, sothat it is simpler to write multilingual grammars because a user does not need to learnan API for each language. Figure 10 adapted from [33, section 1.7] shows three differentsapproaches to defining the rule for the predicate Prime (as in the number 2 is prime) inGerman.

As shown in listing 3.3, the RGL API is mainly composed of functions of the formmkX, where X is result category of the application. These are overloaded functions, so thatthe user does not need to memorise too many function names – if they want to create aclause, they can browse the RGL API1 and see 31 ways of composing one, some of whichcan be seen in listing 3.4.

1 http://www.grammaticalframework.org/lib/doc/synopsis/index.html


3.1. The GF resource grammar library 57

mkCl : NP →V → Cl ; −− she sleepsmkCl : NP →V2 → NP →Cl ; −− she loves himmkCl : NP →V3 → NP →NP →Cl ; −− she sends it to himmkCl : NP →VV → VP →Cl ; −− she wants to sleepmkCl : NP →A → Cl ; −− she is oldmkCl : NP →A2 → NP →Cl ; −− she is married to himmkCl : NP →Adv → Cl ; −− she is heremkCl : SC →VP → Cl −− that she sleeps is goodmkCl : N → Cl ; −− there is a housemkCl : V → Cl ; −− it rains

Listing 3.4 – mkCl overloaded function in the RGL API.

These clause-building functions handle things like agreement, negation, questionforming, and tenses – so that the application grammar writer does not have to implementthem herself.

Because the API is multilingual, the rule shown in listing 3.3 for German is almostthe same for Portuguese and English – see listing 3.5.

lin Prime x = mkS (mkCl x primo_A)lin Prime x = mkS (mkCl x prime_A)

Listing 3.5 – Prime predicate in Portuguese and English.

Note that only content words need to be changed. There is an API for contentwords too, but it is language-dependent. Despite this, the morphological APIs are stillvery uniform in that the function names are mostly of the same mkX form. These lexicalparadigms can be browsed online at the RGL synopsis,2 and are defined and documentedin each language’s Paradigms module. The prime adjective can then be defined in German,Portuguese, and English as in listing 3.6.

lin unteilbar_A = mkA "unteilbar"lin primo_A = mkA "primo"lin prime_A = mkA "prime"

Listing 3.6 – prime_A in German, Portuguese, and English.

These lexical constructors are also overloaded; the ones that take less forms asinput are deemed smart, and try to guess the remaining forms from the ones it has beenprovided. An extreme example is the Finnish noun constructor mkN: it has a definition thatreceives only one form, and produces 30+ forms, with 87% accuracy against the KOTUSlexicon gold standard [13]. For the remaining 13%, Finnish mkN also has a “worst-case”2 http://www.grammaticalframework.org/lib/doc/synopsis/index.html



definition that takes 10 forms, and three other definitions taking 2, 3, and 4 argumentseach.

As an example of the uniformity of lexical constructors, we note that Portugueseand English also have noun constructors called mkN; they simply differ on which argumentsthey take. While Finnish had five different constructors, all of them taking differentnumbers of input strings, Portuguese has four: the main smart one that guesses genderand plural form from the singular form, the worst-case one that takes singular and pluralforms and a gender, and two intermediate ones that use the smart one but force either thegender or the plural form (their type signatures are shown in listing 3.7).

mkN : (revolução : Str) → N ; −− predictable nounsmkN : (alemão, alemães : Str) → N ; −− force noun plural , guess gendermkN : (mapa : Str) → Gender → N ; −− force gender, guess pluralmkN : (mão,mãos : Str) → Gender → N ; −− worst−case

Listing 3.7 – Portuguese noun constructors.

3.1.3 Structure

In this section we give an overview of the structure of RGL, describing in a high-level its main modules. The dependency tree in figure 11 (taken from the GF book [33])shows the overall structure of the RGL. The modules with solid contours are visible to theend-user, while those with dashed contours are internal; rectangular modules are pairsof abstract and concrete grammar, while ellipses are resource or instance modules, anddiamond-shaped modules are interfaces. If a module has a name in brackets, it is derivedmechanically from other modules.

The Cat and Common modules define the backbone of the RGL category system,which can be seen in figure 12 (from the RGL synopsis). Categories in rectangular boxesare open lexical categories. The categories declared in Common usually share the sameconcrete implementation.3 For the categories declared in Cat no attempt is made at astandard implementation.

The syntactic API we describe in section 3.1.2 is declared by Syntax (and itschildren), while the morphological API is declared by Paradigms, so that accessing themboth from an application grammar is done by opening the language modules that implementthem, i. e., SyntaxX and ParadigmsX, where X is the language of choice. The definitions inSyntaxX are not explicitly defined, but derived using a functor (explained in section 2.1.2)from other RGL constructors. This means that all languages that implement the necessaryconstructors get the Syntax API for free. Listings 3.2 and 3.3 give an idea of how this is3 Text is always a string, but Adv for adverbs which is usually just a string needs more structure in the

case of languages like Chinese or Sindhi.

3.1. The GF resource grammar library 59

Figure 11 – RGL module structure (condensed)

done: knowing that both definition trees are equivalent, one can presume that mkS : Cl → Scan be defined as the anonymous function λ cl → UseCl (TTAnt TPres ASimul) PPos cl, evenif one does not know exactly what each constructor is. Indeed this is how the functordefines this overloaded instance of the mkS constructor, and for that it only needs thedefinition of UseCl (as TTAnt, TPres, ASimul, and PPos are all defined in Common for allRGL languages).

Although not part of the user-facing API, the module MorphoX is customary, andis used to implement ParadigmsX for each language. Not being part of the API, its use byother grammars is strongly discouraged, as it is not committed to backwards-compatibilityas user-facing modules are.

Another module which is not part of the user-facing API but it is customary is ResX.It is where parameters and functions useful for several modules are placed, so that theycan be reused across the RG. ResEng for instance defines the Case and VForm (for verbforms) parameters (among others), and implements the toAgr (that builds an agreementparameter from a number, a person, and a gender) and predV (which builds a verb phrasefrom a verb) functions (among others).


Figure 12 – GF RGL category system

Other modules that are useful to end-users are IrregX and DictX. While theLexiconX module contains a small lexicon of 300-odd words, IrregX and DictX offerlarger lexicons bootstrapped from morphological resources. Trustworthy lexicons aredesirable because application programmers of multilingual grammars do not need to worryabout getting word inflections correct using the appropriate paradigms when they are notnative speakers (or speakers at all) of the language they are implementing. These modulesare not available for all languages, however.

The modules ExtraX and ExtendX (not pictured in figure 11) are extensions of theRGL grammatical core, offering more categories and constructors. The latter is intendedto be an improved version of the former, but is still under development.4 A significantdifference between them is that ExtraX is language-specific: there are common constructorsdefined in the abstract Extra, but every language implements its own constructors, whichmakes it impossible to translate between them directly, since they don’t share a commonabstract syntax.5 Unlike ExtraX, ExtendX has a common abstract syntax; because of this itmust define constructors that may not be relevant to language X; as an example, the EnglishRG ends up defining youFem_Pron, even though this is mostly6 relevant for languages

4 https://groups.google.com/d/msg/gf-dev/YWajYB5CcEg/Q6MJXExvDgAJ5 Users may still open ExtraX to help define constructions in an application grammar without sacrificing

interlingual translation.6 I say mostly because although another version of you_Pron introduces redundancy it also allows one to

https://groups.google.com/d/msg/gf-dev/YWajYB5CcEg/Q6MJXExvDgAJ

3.2. The Portuguese resource grammar 61

where the second person pronouns are gendered. This disadvantage to ExtendX is mitigatedby a functor that it instantiates, which includes default linearizations for constructors suchas youFem_Pron, which are only relevant to a subset of the RGL languages.

The abstract module Grammar defines the syntactic core of the RGL. It is composedof modules with evocative names such as Noun (which mostly declares constructors whoseresults are noun phrases and common nouns) and Question (which declares constructorsrelating to interrogative constructions such as question clauses), plus Idiom for idiomaticconstructs (such as progressive verb phrases like estar dormindo) and Structural forstructural words like alguém.

To work on or test a resource grammar in the GF shell, the LangX or AllX modulesmay be imported. These are top-level modules that also contain the LexiconX module, andcan thus be used for testing. (If GrammarX is imported directly, there will be no contentwords and most trees will not be able to be generated.) If using just one language, bothLangX or AllX will do (and the latter might even be preferred because of its additionalimports), but for more languages LangX is the correct choice, as the AllX do not share anabstract syntax by virtue of their import of ExtraX.7

3.2 The Portuguese resource grammar

In this section we present our implementation of a Portuguese resource grammar.We begin by presenting a miniature version of it, explaining its main constructs andthereby giving a high-level view of the syntax of Portuguese. We continue by showingour work on Portuguese morphology and explaining how morphology is encoded in a GFgrammar and how it is exposed to the end-user of the grammar. We then give a briefoverview of a few modules of the Portuguese grammar which are not part of the coreRGL modules, and are thus not provided by the Romance functor. Finally, we performtwo evaluation experiments using GF treebanks,8 and discuss the grammatical correctnessof our Portuguese linearizations. We must note that our implementation mostly followsBrazilian Portuguese, although we sometimes try to offer compatibility with EuropeanPortuguese too.

Our implementation of a Portuguese RG was an instantiation of the Romancefunctor. The Romance functor generalizes the grammar for the following languages: French,Italian, Catalan, Spanish, and Portuguese. Although technically a Romance language,Romanian has a separate implementation because of considerable differences from theother languages in the functor (see [15]).

parse French “tu est intellingente” and get a feminine abstract syntax tree instead of a gender-neutralone.

7 This might change when Extend replaces Extra in All.8 By GF treebank we simply mean a treebank where the trees are GF abstract syntax trees.


The functor implements most of the syntax of Romance languages in a common way,the difference between languages being encapsulated in the DiffRomance interface module(see section 2.1.2), to be instantiated by each language. The Romance functor delegates toits instantiations the language’s morphology, its idiomatic constructs, structural words,and its lexicon. The syntactic parts of our Portuguese implementation are thus restrictedto the DiffPor, StructuralPor, and IdiomPor modules, plus the extra syntactic modulesdiscussed in section 3.2.3.2 (page 73).

The Romance functor follows the RGL module structure presented in section 3.1.3,adding a CommonRomance and a ResRomance module. Parameters and operations that canbe reused across Romance languages are defined in the former; these include the Genderand AForm (adjective form) parameters, and the numForms (for building a table fromnumbers to strings) and Noun (defining a noun type) functions. The ResRomance moduleis a normal resource module, except that it is an interface that gets extended by eachRomance instantiation (refer back to section 2.1.2 (page 36) for the definition of theseterms).

3.2.1 A Portuguese miniresource grammar

The GF book [33] recommends starting a new RG from a miniature resourcegrammar. A miniature RG is separate grammar which intends to be a simplified versionof a full RG, comprising less categories and constructors than those available from theRGL. A mini-resource grammar is useful as both a training exercise in preparation to theimplementation of a full resource grammar and as a better didactic instrument, which ishow we use it here.

We eventually produced three versions of a miniresource grammar.9 The latterone was used as an example grammar in the GF summer school 2018, and is the onewe will present here. Our presentation will focus on syntax, since we will describe themorphological implementation of Portuguese word classes for the full resource.

The structure of the Portuguese mini resource can be seen in figure 13. Rectangularboxes are abstract modules, solid ellipses are concrete modules, and dashed ellipses areresource modules.

The mini-resource grammar is a simplified version of the RGL that tries to followits conventions, so its modules should be familiar from section 3.1. The MiniGrammarlumps together category definitions and grammatical constructors, whereas in the RGLit is an extension of several modules that implement these separately. It uses the samecategory and constructor names as the RGL, removing some of them for simplicity. In the9 The versions are available at following links: https://github.com/GrammaticalFramework/gf-contrib/

tree/master/mini, https://github.com/GrammaticalFramework/gf-contrib/tree/master/mini/newmini,and https://github.com/GrammaticalFramework/gf-summerschool-2018/tree/master/resource.

https://github.com/GrammaticalFramework/gf-contrib/tree/master/mini

https://github.com/GrammaticalFramework/gf-contrib/tree/master/mini

https://github.com/GrammaticalFramework/gf-contrib/tree/master/mini/newmini

https://github.com/GrammaticalFramework/gf-summerschool-2018/tree/master/resource


Figure 13 – Module structure of the Portuguese mini resource grammar

mini-resource, for example, polarity is represented in much the same way, while the tensesystem is greatly simplified to have only a simultaneous and an anterior tense (comparedto 8 RGL tenses). Instead of the hundreds of RGL constructors, the mini-resource hasonly 49 constructors (excluding the lexical ones) and 22 categories. It is a significantfragment of the full RGL, though: it includes imperatives (não ande com ela) and questionsentences (vocês andam com o cachorro ?) and has noun and verb phrases, pronouns andconjunctions.

We begin by describing the linearization of the mini-resource categories for Por-tuguese, then explain how some of the main constructor linearizations are implemented.

Listing 3.8 shows how pronouns are created and represented, along with an exampleof a pronoun definition. We define a Pron type for pronouns, which shorthand for a recordwith two fields: the main s field is table from grammatical Case to a string, while thea field represents the agreement of the pronoun. Case, Agreement, Gender, Number, andPerson are all parameters defined in listing 3.9.

Listing 3.10 shows how nouns are represented in the mini resource grammar forPortuguese. A noun is simply a record with two fields: one that stores the noun’s inherentgender, another that stores a table from the noun’s variable number to strings. (Commonnouns, although not shown, get the same representation). Noun phrases are slightly more


−− PronoperPron : Type = {s : Case ⇒ Str ; a : Agreement} ;

mkPron : (_,_,_ : Str) → Gender → Number →Person →Pron ;mkPron eu me mim g n p = {s = table {Nom ⇒eu ; Acc ⇒me ; Dat ⇒ mim} ;a = Agr g n p} ;

Listing 3.8 – Concrete representation of pronouns and how they are built in the Portuguesemini-resource.

paramGender = Masc −− aluno| Fem ; −− aluna

Number = Sg −− alunos| Pl ; −− alunas

Case = Nom −− as subject, e.g. ele| Acc −− as object, e.g. o| Dat ; −− as prepositional object , e.g ., lhe

Person = Per1 −− eu, nós| Per2 −− tu, você| Per3 ; −− ele,ela , eles , elas

Agreement = Agr Gender Number Person ;

Listing 3.9 – Portuguese parameter definitions in the Portuguese mini-resource.

involved, although they also have two fields: one stores their inherent agreement (which isparameter representing gender, number and person), and another storing a table from acase (either nominative, accusative, or dative) to another record, which has two fields, oneof which is a string (the object), and the other (the clitic) is a record of a string and aboolean value (which encodes whether the optional clitic is actually present or not).

To make the representation of noun phrases clearer, we show in listing 3.11 examplesof constructors that create NPs. It includes the UsePron constructor which lifts pronounsto noun phrases. This code includes a syntactic novelty: the double slash (\\) shorthandfor one-branch tables. Because the table in the s field of the DetCN constructor wouldhave the same definition modulo the case parameter it receives, it can be abbreviated in afashion similar to lambda abstractions: \\p ⇒ t is equivalent to table {p ⇒ t}, where p iseither a variable or a wildcard (_).

As its name suggests, DetCN builds a NP from a determiner and a common noun.


operNoun : Type = {s : Number ⇒Str ; g : Gender} ;

NP : Type = {s : Case ⇒ { clit : Clit ; obj : Str} ;a : Agreement} ;

Listing 3.10 – Concrete representation of nouns and noun phrases in the Portuguesemini-resource.

linDetCN det cn = {s = \\c ⇒ {

clit = emptyClit ;obj = det.s ! cn.g ! c ++ cn.s ! det.n} ;

a = Agr cn.g det.n Per3 ;} ;

UsePN pn = {s = \\_ ⇒{ clit = emptyClit ; obj = pn.s } ;a = Agr pn.g Sg Per3} ;

UsePron p = {s = table {Nom ⇒{

clit = emptyClit ;obj = p.s ! Nom} ;

Acc ⇒ {clit = {s = p.s ! Acc ; hasClit = True} ;obj = []} ;

Dat ⇒ {clit = emptyClit ;obj = p.s ! Dat}

} ;a = p.a} ;

Listing 3.11 – Linearization of the UsePron constructor in the Portuguese mini-resource.


Its agreement field inherits from both, taking the gender from the CN and the numberfrom the determiner, while its string field has no clitic and its object is the appending ofthe selection of the determiner by the noun’s gender and the noun phrase’s case, and theCN’s string selection by the determiner’s number.10

Proper nouns are simply nouns which do not vary in number; their lifting to NPsignores case and puts the PN’s string field as object, and their agreement features arestandard, except for the gender.

Regardless of case, the NP inherits the pronoun’s agreement features, and thendifferent values of clitic and object are chosen depending on case. (The emptyClit constantis just a record of an empty string with a false value on the field that indicates if the cliticis present).

Verbs are represented as simple tables from verb forms to strings. Because the tensesystem is simplified, verb forms are much less complicated than in the full Portuguese RG:there are only the infinitive, present, past, and imperative forms (see listing 3.12).

VForm = VInf −− amar| VPres Number Person −− amo,amam| VPast Number Person −− amei, amou| VImp ImpNumPer ; −− ame

Listing 3.12 – GF parameter representing verb forms in the Portuguese mini-resource.

On the other hand, verb phrases are more involved. They have four fields: onestores the main verb, another an optional clitic, another the clitic agreement features, andthe final one is a table from agreement features to a string, represent the complement (seelisting 3.13).

Verb : Type = {s : VForm ⇒Str } ;

VP = {verb : Verb ;clit : Clit ;clitAgr : ClitAgr ;compl : Agreement ⇒Str ;} ;

Listing 3.13 – GF verbal concrete representations in the Portuguese mini-resource.

A simple verb phrase (VP) can be built from a verb by adding the verb to theappropriate field, and filling in empty versions of the other fields, as is done by the10 Although we did not show the linearization of the determiner category, it is not difficult to see what it

is – try it!


UseV constructor. A more interesting version of VP formation is given by the ComplV2constructor, which takes a verb that expects a complement and a noun phrase. A verb thatexpects a complement (simply called a V2) is like a verb, but has an additional field forcase. The main verb of the resulting VP is of course the input V2, while the VP’s clitic andits agreement are inherited from the NP’s clitic (if it exists). Finally, the VP’s complementis the NP’s object field. The code in listing 3.14 has two syntactic novelties: the use oflet ... in for intermediary definitions, and GF tuples, which are enclosed by 〈〉 angledbrackets.

linUseV v = {verb = v ;clit = emptyClit ;clitAgr = CAgrNo ;compl = \\_ ⇒[]} ;

ComplV2 v2 np = let nps = np.s ! v2.c in {verb = {s = v2.s} ;clit = nps. clit ;clitAgr = case 〈nps. clit . hasClit ,v2.c〉 of {〈True,Acc〉 ⇒CAgr np.a ;_ ⇒ CAgrNo} ;

compl = \\_ ⇒nps.obj} ;

Listing 3.14 – GF verbal complementation in the Portuguese mini-resource.

The mini-resource includes other constructors for building both VPs and NPs (seesection 3.2.1), whose names and types are sufficient to tell what they do.

−− VP formationfun AdvVP : VP →Adv →VP ;fun ComplV2 : V2 →NP →VP ;fun UseAP : AP →VP ;fun UseAdv : Adv →VP ;fun UseNP : NP →VP ;fun UseV : V →VP ;−− NP formationfun DetCN : Det →CN →NP ;fun MassNP : CN →NP ;fun UsePN : PN →NP ;fun UsePron : Pron →NP ;

We now describe how simple NP-VP clauses are built by the PredVP constructor (seelisting 3.15). The clause (and question clause) representation is rather simple: it is a table


from two boolean values to a string; one boolean value encodes the clause’s polarity (i. e.,if it’s positive or negative), and the other encodes the tense (a simple choice between pastand present). The clause resulting from PredVP is the concatenation of subject, a possible“não” string if the clause selected is negative, the clitic, the verb (its form depends ontense), and the (optional) object. The subject is the nominative case of the NP, while theobject is the VP’s complement selected by the NP’s agreement. The clitic is inherited fromthe VP, and the verb form (modulo tense) is chosen using the NP’s features (number andperson) by the agrV function. Building a sentence from a clause is pretty straighforward,it is just a matter of selecting the appropriate tense and polarity.

PredVP np vp = {s = \\isPos, isPres ⇒ subj ++ neg isPos ++ clit ++ verb ! isPres ++ obj} where { −− is syntactic sugar for the let ... in expressionsubj = (np.s ! Nom).obj ;obj = vp.compl ! np.a ;clit = vp. clit . s ;verb = agrV vp.verb np.a

} ;

Listing 3.15 – GF clause building in the Portuguese mini-resource.

3.2.2 Morphology

As discussed in section 3.1.3, API access to a resource grammar is implemented inthe ParadigmsX module, (usually) with support modules named ResX and MorphoX. TheAPI (with documentation) for Portuguese morphological paradigms can be found at RGLsynopsis.11 In this section we discuss its implementation and presentation to the user.

Inflection is the process which modifies a word according to a grammatical featuresuch as case, agreement or tense. For example, in Portuguese nouns are inflected bynumber, verbs by tense, and adjectives by both gender and number, while adverbs areinvariant – i. e., have no inflection. An inflection paradigm (or a paradigm for short) is aregular inflection pattern of a language. As an example, the most common paradigm forthe pluralization of Portuguese nouns is the one which adds a s suffix to the noun; howeveranother paradigm is the one which applies to some words ending in ão and pluralizesthem by changing this suffix to ões. An inflection table for a word maps grammaticalfeatures to the correct inflections of the word given those features. Listing 3.16 features aGF inflection table on number in the string field of the lexical constructor for the wordgramática, while table 1 shows a regular inflection for the same word one might find in anonline dictionary.

11 http://www.grammaticalframework.org/lib/doc/synopsis/index.html



grammar_N = {s = table {Sg ⇒ "gramática" ;Pl ⇒ "gramáticas"} ;

g = feminine} ; −− same as grammar_N = mkN "gramática" ;

Listing 3.16 – The Portuguese lexical constructor for noun gramática

FormSingular gramáticaPlural gramáticas

Table 1 – Inflection table for the word gramática

Getting inflection tables right is a large part of making a grammar produce correctoutput. Because word inflections in a language are repetitive, it is GF good practice toencapsulate these in morphological functions called paradigms, and make them availableto the end-user of the grammar. The implementation of morphological paradigms in GFfollows a pattern: a worst-case function taking all possible forms is devised, and then“smarter” functions (which take less input forms and try to guess the same output) aredefined using the worst-case function. For readability, it is customary to define singleparadigms too, i. e., paradigms that do no guessing despite taking fewer forms, and followthe inflection patterns specified by their names. An example is the peneirar_Beschfunction defined in the BeschPor12 module: it expects a string corresponding to a verbinfinitive form, and creates the full inflection table (63 forms) for a verb, assuming theinflection of the input verb matches that of the peneirar verb.

Listing 3.17 shows a naive example implementation of a smart paradigm forPortuguese verbs: it takes the infinitive form of the desired verb, and tries to match itsending to one of the endings characteristic of the three biggest Portuguese conjugationgroups; if, say, the verb form ends in -er, it applies the comer paradigm to the infinitiveform, resulting in an inflection table that follows that of this verb. In case the infinitiveform does not match any of the cases, a fall-through case simply builds a verb with aconstant inflection table using the worst-case paradigm.

GF always demands a value for each parameter of an inflection table, so the GFgrammarian has two main choices when a word does not have a form for a given inflection:they can either use the special nonExist token, or they generate a ‘theoretical’ form, i. e.,one that follows a regular inflection pattern. Because the former may cause runtime errors,

12 Besch comes from the famous Bescherelle French books, which lists verb inflection tables for Frenchand other languages.


the latter is often preferred.

smartVerb : Str → Verb = λinf → case inf of {−− 1st conj group, ending in −aram +"ar" ⇒ verbAmar inf−− 2nd conj group, ending in −ercom +"er" ⇒ verbComer inf−−3rd conj group, ending in −irpart + " ir " ⇒ verbPartir inf−− only pôr, supor and derivatives_ ⇒ verbPor inf} ;

Listing 3.17 – Naive smart paradigm for verbs

The actual implementation of a smart paradigm for Portuguese verbs is moreinvolved, but follows the same pattern. In the BeschPor module we define several verbparadigms, following the ones found in Portuguese verb conjugation tables [16,35]. Thenwe define a smart paradigm that takes the infinitive form of a verb and tries to guess whichparadigm to apply to it, as can be seen in listing 3.18. In the case of Portuguese verbs, itis impractical for the end-user to use the worst-case paradigm, since that would requireinputting more than 60 verb forms. Because of this, we do not even make it available tothe end-user API – but it is nevertheless useful to define the other paradigms in terms ofit. It is customary for a resource grammar API to offer intermediate paradigms, that takeless forms than the worst-case paradigm, but that take more than the smarter paradigm;in the case of Portuguese verbs, we found no heuristic that would result in a practicalparadigm (i. e., one with not too many forms that performed significantly better thanregV). Therefore we encourage end-users to simply lookup the correct paradigm for a verbin case the smart paradigm guesses it wrongly, or just import the verb (if available) fromthe IrregPor module (see section 3.2.3.1 at page 72).

For Portuguese nouns, we define paradigms for nouns that follow several inflectionpatterns (like the ones of the nouns vinho-vinhos, areia-areia, alemão-alemães, falcão-falcões, cidadão-cidadãos, nuvem-nuvens, rapaz-rapazes, canal-canais, réptil-répteis, etc),and then create a smart paradigm that tries to guess the correct paradigm that yields thenoun’s plural form and its gender from its singular form. Intermediate paradigms use thesmart paradigm under the hood, simply forcing it to choose a certain gender, or a certainplural form.

The adjective smart paradigm actually uses the smart paradigm for nouns underthe hood; its only work is guessing the feminine singular form from the masculine singularform. After that both forms are passed separately to the smart paradigm for nouns, whichguesses their plural forms, and these are combined into the full adjective.


regV : Str → V ;regV s = case s of {chamar +"−se" ⇒ reflV (regV’ chamar) ;_ ⇒ regV’ s} ;

regV’ : Str → V ;regV’ v =

letxr = Predef.dp 2 v ; −− −arz = Predef.dp 1 (Predef. tk 2 v) ; −− i in −iarparadigm = case xr of {" ir " ⇒ case z of {"g" ⇒ redigir_Besch ;"a" ⇒ sair_Besch ;"u" ⇒ distribuir_Besch ;_ ⇒ garantir_Besch} ;

"er" ⇒ case z of {"c" ⇒ aquecer_Besch ;"g" ⇒ proteger_Besch ;"o" ⇒ moer_Besch ;_ ⇒ vender_Besch} ;

"ar" ⇒ case z of {"c" ⇒ ficar_Besch ;"ç" ⇒ começar_Besch ;"e" ⇒ recear_Besch ;"g" ⇒ chegar_Besch ;" i " ⇒ anunciar_Besch ;" j " ⇒ viajar_Besch ;"o" ⇒ perdoar_Besch ;"u" ⇒ suar_Besch ;_ ⇒ comprar_Besch} ;

"or" | "ôr" ⇒ pôr_Besch ;_ ⇒ comprar_Besch −− fallback}

in lin V (verboV (paradigm v)) ;

Listing 3.18 – Portuguese verb smart paradigm


3.2.3 Extra modules

In this section we briefly describe our implementation of modules which are notpart of the core RGL modules.

3.2.3.1 Morphological modules

Although not part of the core RGL because they are language-specific, big lexicalmodules are included in the RGs of the most complete languages. These modules helpapplication programmers by allowing them to just import the correct lexical entry insteadof having to decide which paradigm to use. This is even more important in the recurringcase where the application programmer is not a native speaker (or a speaker at all) ofthe language they are implementing, in which case they would have to enlist a languagespeaker’s help in order to check their choices. If a dictionary module is available, theapplication programmer only needs to now the correct entry to import, and can restassured that its inflections forms are all correct.13

Unwritten conventions of the RGL allow for two big lexical modules: DictX andIrregX. While the latter contains only irregular verbs, the former usually contains verbs,nouns, adjectives, and adverbs.

To help define morphological resources for our Portuguese RG, we have participatedin the creation of the MorphoBr resource [8], a large-scale full-form lexicon of Portuguesethat consolidates existing lexical resources for Portuguese, corrects some of their gaps andmistakes, and adds new entries, most of them in the field of diminutives. This work hasbeen fundamental to the definition of the DictPor module, providing us the data for thecorrect inflection of adjectives and nouns. On the other hand, we have defined the verbs inIrregPor by applying the appropriate paradigms from BeschPor.

We have discarded MorphoBr data on verbs because it includes no verb valencyinformation and thus adds little to the IrregPor and BeschPor modules. Adverb informa-tion was also discarded, since adverbs in the Portuguese RG are records of a single stringfield, and by knowing their lemmas they be can safely constructed – there is no need toinclude them in the dictionary module.

The IrregPor and DictPor modules are used in at least one application: the in-development wide-coverage grammar gf-wordnet project,14 which also uses wordnet datafrom the original Priceton wordnet [27] and its Portuguese version, OpenWordNet-PT [11].

13 The importance of large monolingual dictionaries has been highlighted by Aarne Ranta in a recentemail to the GF list (https://groups.google.com/d/msg/gf-dev/YWajYB5CcEg/Q6MJXExvDgAJ),which finally pushed me to include one for Portuguese.

14 https://github.com/grammaticalframework/gf-wordnet

https://groups.google.com/d/msg/gf-dev/YWajYB5CcEg/Q6MJXExvDgAJ

https://github.com/grammaticalframework/gf-wordnet


3.2.3.2 Grammar extensions

The RGL grammatical core is expressive, even though it is limited. Its statedgoal [32] is not syntatic coverage, but semantic coverage: “The user must be able to findsome way to express any content she wants to express”.

Despite this non-goal, it is sometimes useful to extend the RGL grammatical back-bone to add coverage; section 3.1.3 (page 58) explains how the ExtraX and ExtendX modulesfulfill this role. The Portuguese RG has implementations of both modules. ExtraPor ispartially inherited from the Romance functor, while ExtendPor was started from scratch –Portuguese is the second Romance language to have such a module (since Extend is a newmodule, started in late 2017), and it has the most complete Romance implementation atthe moment. When a Spanish Extend module was implemented, we instead decided torefactor all existing Romance Extend modules into a functor, and create Extend modulesfor the Romance languages that did not have it.

(3) a. (has a car been through here?) a green one has

b. (um carro passou por aqui?) um verde passou

(4) a. the man whose car has died sleeps

b. o homem cujo carro morreu dorme

The Extend module mostly has syntactic extensions like the GenRP constructor in list-ing 3.19, which adds a relative pronoun corresponding to Portuguese cujo, or syntacticcoercions like the AdjAsCN constructor in listing 3.20, which coerces an adjectival phraseto a common noun, allowing the grammar to recognize sentences like the ones in glosses 3and 4.

GenRP nu cn = {s = \\_b,_aagr,_c ⇒cujo ! g ! n ++ num ++cn.s ! n ;a = aagr g n ;hasAgr = True} where {cujo = genNumForms "cujo" "cuja" "cujos" "cujas" ;g = cn.g ;n = nu.n ;num = if_then_Str nu.isNum (nu.s ! g) []

} ;

Listing 3.19 – The GenRP constructor from Extend


AdjAsCN ap = {s =\\n ⇒ap.s ! (genNum2Aform Masc n) ;g = Masc} ;

Listing 3.20 – The AdjAsCN constructor from Extend

Another extension that has been partially implemented for Portuguese is theParseExtend module of the gf-wordnet project. When complete, gf-wordnet is to bea wide-coverage translation grammar based on the GF RGL and WordNet. It needsgrammatical extensions to the RGL so that it can attain its goal of syntactic coverage.Even though the project is ongoing, we have contributed a dictionary module based onboth OWN-PT and MorphoBr, and the grammar extension module.

(5) a. he is intelligent enough to love the dog

b. ele é inteligente o suficiente para amar o cão

(6) a. I wanted not to live

b. eu queria não viver

(7) a. I wanted to live

b. eu queria viver

(8) a. lower your hands slowly

b. abaixe suas mãos lentamente

(9) a. slowly lower your hands

b. lentamente abaixe suas mãos

The ParseExtend module declares several new constructors. Many of them add linguisticconstructions like the special adverbial modification of adjectives made by enough – seegloss 5, and notice that in English most adverbs modifying adjectives will come before theadjective, as in he is completely stupid or in she is very smart.

Other functions generalize RGL constructors, like the ParseExtend version ofComplVV, which takes a verb with a verb phrase complement and a verb phrase like the

3.3. Evaluation 75

one from the RGL, but additionally takes an anteriority and a polarity, allowing us togenerate the sentence in gloss 6 (whereas the RGL version only permits gloss 7). Anotherexample is that of the AdvImp constructor: it allows us to add an adverb modifying animperative in front of the verb, like in gloss 9, whereas the most idiomatic way in Englishwould be to put it after the verb (as in the RGL-supported gloss 8). However, if the goalis to have syntactic coverage we must be able to parse such constructions, even if they arerelatively rare.

−− AdvImp : Adv →Imp →Imp ;AdvImp adv imp = {s = \\pol,impform,g ⇒ adv.s ++ imp.s ! pol ! impform ! g} ;

Listing 3.21 – The AdvImp constructor from ParseExtend

3.3 EvaluationIn this section we discuss two evaluations of the Portuguese resource grammar.

In the first evaluation experiment we use a previously available GF treebank to assessthe grammatical correctness of the Portuguese linearizations produced by our grammar.In the second experiment we derive a GF treebank from an existing parallel treebank.We not only discuss the grammatical correctness of Portuguese linearizations, but alsotheir appropriateness as translations of the English versions. In the later experiment weadditionally test and discuss the RGL coverage of the syntactic phenomena present in theoriginal treebank.

3.3.1 UD examples corpus

The Universal Dependencies (UD) project is a collection of multilingual treebanksand a set of common guidelines to annotate them [9]. UD is heavily influenced by theStanford dependencies project [10], which focused on English syntax. The latest UD release(2.3) counts 129 treebanks spanning over 76 languages.

As a byproduct of their work on converting GF trees to UD trees [20, section 6.1] (ofwhich we will more about in section 4.3), Ranta & Kolachina constructed a GF treebankfrom sentences found in the English-language documentation of the UD project. Thesentences were parsed and manually disambiguated (only the most appropriate analysistree of a sentence was chosen, even if it had more available analyses). This treebank becamea part of the testing treebanks of the RGL,15 and we used it to test the linearizationsprovided by the Portuguese RG.15 https://github.com/GrammaticalFramework/gf-rgl/blob/master/treebanks/ud-rgl-trees.txt

https://github.com/GrammaticalFramework/gf-rgl/blob/master/treebanks/ud-rgl-trees.txt


Before linearizing the UD examples treebank, we revised its trees (there wereat least three wrong trees) and removed some duplicated entries. A native BrazilianPortuguese speaker with no knowledge of GF judged that out of the 63 trees, 8 producedungrammatical output. If we disconsider trees that are wrong (we discuss why we considersome trees wrong below) and use a grammar extension, this number gets reduced to 3trees, one of which actually produces unidiomatic Brazilian Portuguese, but perfectlyacceptable European Portuguese (see gloss 20).

The numbers above refer to the latest version of the grammar; some of the issueswe describe below have been fixed. We also discuss whether these issues are due to ourimplementation, the underlying Romance functor, the RGL syntactic coverage, or thetreebank itself.

3.3.1.0.1 Copular verbs and adjectives

(10) a. John is handsome

b. João é bonito

c. João está bonito

(11) a. he was ready when I saw him

b. # ele era pronto quando eu o via

c. ele estava pronto quando eu o via

Copulas are words that join a sentence’s subject to its complement, as does is in the firstsentence of gloss 10. In many languages copulas are verbs, but some languages do not evenhave copular words.

While the English RG only has one copular verb (the verb to be), the PortugueseRG currently declares three copular verbs – the verbs ser, estar, and ficar. In many caseswhere an English sentence will use the to be copula, the appropriate Portuguese version ofthe sentence might use any of the Portuguese copulas (as is the case of gloss 10), eventhough there are cases (like gloss 11) where there is a clear choice.

The difference between these cases is most of the time semantic and not syntactic,so there is little the resource grammarian can do to produce the appropriate tree for everycase. One thing they can do, however, is to allow the user to specify which copula theywant to use. We have thus modified the linearization type of adjectives in the Romancefunctor to do exactly this.

The concrete representation of adjectives in the Romance functor was

3.3. Evaluation 77

A = {s : Degree −− e.g., comparative⇒ AForm −− e.g., Masculine Singular⇒ Str ;

isPre : Bool −− positioned before or after noun?} ;

encoding the fact that degree and agreement forms are variable but an adjective’s position-ing is inherent, e. g., the use of espertas (and not, say, esperto) in elas são espertas dependson something (in this case, the sentence’s subject), while the fact that we sometimes sayo bom velhinho instead of o velhinho bom as we usually do for other adjectives (e. g., ovelhinho teimoso) is inherent to the adjective, not depending syntactically on anythingon the sentence. Our modification of the Romance functor consisted in adding anotherinherent parameter defining which kind of copula an adjective uses, as in

A = {s : Degree ⇒ AForm ⇒Str ; isPre : Bool ; copTyp : CopulaType} ;

Of course, the same adjective might use different copular verbs in different sentences, butnothing prevents the user from declaring two adjectives which differ only in their choice ofcopular verb. This promotes flexibility at the cost of increasing ambiguity in translationsfrom languages that do have such a distinction – although it also allows more possibletranslations, as in having both ele é arrumado and ele está arrumado be translations ofhe is tidy.

This implementation decision is in line with GF’s preference for language generation,as it makes it easier for an application programmer to choose what kind of copular verbsthey want to use, specially when coupled with a new overloaded instance of the mkAfunction:

mkA : A →CopulaType →A ; −− force copula type

3.3.1.0.2 Copula type in complement of copula

(12) a. John is from the city

b. # João está da cidade

c. João é da cidade

The solution to the copula selection problem we have previously discussed is restricted toadjectival phrases being used as complements to the copula. Gloss 12 shows an examplewhere the default copula choice (in this case the estar verb) is the wrong one, yieldingwrong Portuguese linearizations. This is due to the definition of the CompAdv constructor,


which builds a complement to a copula from an adjective, which simply selects the estarcopula regardless of its input adverb.16

A possible solution to this problem would be akin to that of the copula selectionin adjectival phrase complements: we simply add a parameter to adverbs saying whichcopula they take. In this case we abstain from following this path, however. Adverbs areless frequently complements of copulas than adjectives, and we know of no semantic norsyntactic motivation for them to take one copula or another.

We still would like the end-user to be able to choose which copula they want tohave, so we offer a general solution to all copula complements: we have taken a coercionconstructor that forces the use of the estar copula in copula complements from a language-specific module of the Spanish RG, and added it to the RGL-wide Extend module. Thissolution introduces ambiguity in languages which do not have a ser copula: if before thischange we would get one analysis tree for a sentence with a copula, now we will get twotrees, one whose linearization will correspond to the estar copula in the languages thathave it, and another which corresponds to the usual copula in the languages that have it(see figure 14).

3.3.1.0.3 Incorrect trees

(13) a. what she did is important

b. * que ela fazia é importante

Gloss 13 would seem to show a wrong Portuguese linearization. This is not the case becausethe analysis tree is incorrect – it just happens to produce coherent English text.

Figure 15 shows the abstract syntax tree for the sentences at gloss 13. We cansee that tree corresponds to an incorrect syntactic interpretation where the main clauseis composed of an embedded question sentence corresponding to what she did, but thisconstituent is not a question in this case, so the linearization is off. We were not able toproduce the correct analysis of the sentence in gloss 13 using the RGL or its extensions.

(14) a. John is not a doctor

b. * João é nem um médico

A similar situation is displayed by gloss 14. The English linearization seems OK,with the Portuguese one being wrong. If we analyze the original analysis tree of thetreebank though, we find that it is incorrect, and only produces a proper English sentence16 Note that the GF RGL treats prepositional phrases like from the city as adverbs, hence our use of the

term here.

3.3. Evaluation 79

Figure 14 – Abstract syntax trees for John is from the city; the hole in the diamond-shapednode can be filled by both UseComp and UseComp_estar.

by chance. Figure 16 shows both the incorrect analysis (a.) and the correct analysis (b.)for the sentence from gloss 14. (a.) takes not to be a predeterminer (as in not a single dogbarked), not a negation, which is why the Portuguese linearization produced is incorrect.Unlike the case of gloss 13, for this sentence we were able to produce a correct abstractsyntax tree, which gives us the correct linearizations in gloss 15.

(15) a. John isn’t a doctor

b. João não é um médico

3.3.1.0.4 Romance clause inversion

Romance clause inversion was implemented with French in mind, and thereforewas performing an inversion which is not idiomatic in Portuguese, as in gloss 16.


Figure 15 – Abstract syntax tree for what she did is important.

(a)

(b)

Figure 16 – Two analyses for the sentence John is not a doctor

3.3. Evaluation 81

(16) a. which book do you like

b. de qual livro você gosta?

c. ! de qual livro gosta você ?

Our first solution was simply to not invert clauses in any questions, but that is not correcteither, for it get us unnatural-sounding sentences (even if they are grammatical), as ingloss 17.

(17) a. what is love ?

b. que é o amor ?

c. ! que o amor é ?

Portuguese inverts word order in question clauses when a question starts with an interrog-ative word (like onde or que) and the verb is copular. We therefore edited the Romancefunctor to allow finer-grained choice of when inversion would be performed, and changedthe Portuguese (and Spanish) grammars to invert copular questions but not to performthe inversion in other kinds of questions. This allows us to have the intented linearizations.

If the subject of a copular question is a pronoun or a proper name and it startswith an interrogative, the inversion is optional; however the Portuguese RG will alwaysperform it for predictability. Predictability is important for natural language generationbecause it avoids surprising the user, plus it makes for a consistent (and thus easier touse) API.

Another instance of word inversion in Portuguese questions is in very formal writings,see in gloss 18 an example taken from [41]. The RGL does not implement formality in ageneral way: it is not the case that every abstract syntax tree has a formal/informal version.What the RGL does have is constructors that express formality in a few grammaticalconstructions: there is a constructor for formal and informal imperatives, for example.

(18) a. Why did the president refuse to sign the decree?

b. Por que se recusou o presidente a assinar o decreto?

3.3.1.0.5 Romance clitic pronouns

(19) a. I wanted to eat it

b. * eu queria comer o


The default implementation of clitic pronouns in the Romance functor fails forPortuguese, as can be seen in gloss 19. We have not yet changed it to correctly implementPortuguese enclitic and mesoclitic pronouns because we do not see a way of doing itwithout sacrificing the grammar’s performance.

3.3.1.0.6 The no quantifier

(20) a. John saw no animals

b. * João não via nenhum animais

c. João não via nenhuns animais

The correct Portuguese linearization in gloss 20 is not idiomatic in BrazilianPortuguese (even though it is in European Portuguese). Despite this consideration, it isnevertheless a good linearization for the corresponding syntax tree, which is displayed infigure 17. Both the noun (animal_N) and its quantifier (no_Quant) are variable in number,and must share the same number when it is determined; this means that if we were toforce an invariable no_Quant by defining its plural form to be nenhum, we would still havea plural animal_N, yielding the incorrect Portuguese linearization at gloss 20. This wasin fact the linearization we originally had, before realizing our mistake of trying to forceno_Quant to be invariable.

(21) a. John saw no animal

b. João não via nenhum animal

The idiomatic Brazilian Portuguese linearization is still available, however mapped to adifferent English linearization, as shown by gloss 21.

3.3.2 Matrix MRS test suite

The development of a grammar requires a way of testing its results. In the initialphase of grammar writing, testing the grammar in the controlled environment that is acorpus is ideal, because one can test basic grammatical constructs without worrying toomuch about robustness. As the RGL is a multilingual grammar, a multilingual corpus isbest suited because it would allow us to not only test the translations produced by ourgrammar, but also to test the coverage of grammatical phenomena in the RGL (usinganother language’s grammar, in our case that of English).

The Matrix MRS corpus17 was created to be “a short but representative set ofsentences”, and is used as a test suite for the English Resource Grammar (ERG) under17 http://moin.delph-in.net/MatrixMrsTestSuite

http://moin.delph-in.net/MatrixMrsTestSuite

3.3. Evaluation 83

Figure 17 – Abstract syntax tree for John saw no animals.

the Head-drive Phrase-Structure Grammar (HPSG) formalism. The fact that this corpusis successfully used by an industrial-strength grammar and is limited in both vocabularyand size made it an ideal choice for an initial experiment. Composed of 107 sentences, thecorpus is available in English and Portuguese versions, besides several other languages.

In order to the test the coverage of the RGL, we employed its English grammar toparse the Matrix sentences. Out of the 107 sentences, 77 were parsed with only minor lexicaladditions. Using an extension module of the RGL allowed us to parse an additional 21sentences, and two small grammatical constructs were added to parse another 2 sentences.The remaining 6 sentences demand more significant grammatical additions in other to beparsed; we discuss some of them in section 3.3.2.1.

The procedure we followed amounted to preprocessing input sentences (downcasingsentence-leading letters and adding space between tokens), then parsing them with theEnglish GF resource grammar (our additions to it are collected in the MatrixEng module18).This yielded us GF abstract syntax trees, most of the time several of them for each sentence.

We then manually removed spuriously ambiguous trees using a primitive (tool-less)form of discriminant-based disambiguation, inspired by the approach in [29]. Take thesentence Some bark: it has readings where bark is taken as noun and readings where barkis taken as verb; by deciding whether bark is a noun or a verb in a given sentence, we mayremove all trees where it is taken to something else. We can generalize this approach fromlexical units to subtrees, which give us even more discrimination power. Even in highlyambiguous sentences producing thousands of trees, one can find the plausible tree(s) in18 https://github.com/odanoburu/gf-matrix

https://github.com/odanoburu/gf-matrix


relatively few disambiguation steps.

Finally, we picked the disambiguated trees and linearized them in Portugueseusing the Portuguese RG. The Portuguese linearizations were then compared to theircorresponding sentences in the test suite, and analyzed with respect to grammaticalcorrectness. We do not test the translated sentences for equivalence because translationequivalence is not a goal of the RGL [32]. A native speaker of Brazilian Portuguese withno knowledge of GF judged that out of the 101 trees we produced, 11 of them did not havea grammatically correct linearization. Our own judgement is that the wrong linearizationsnumber only 7.

3.3.2.1 Discussion

3.3.2.1.1 Mismatched tenses

Gloss 22 shows a unidiomatic translation of the English sentence. Although bothPortuguese translations are syntactically correct, the former uses the auxiliary verb inthe present tense and the main verb in the participle, while the latter simply uses thePortuguese pretérito perfeito tense. Even though the former linearization is not a bug(since it is syntactically correct), we do consider the fact that it was impossible to obtainthe latter linearization from the core RGL (i. e., without using grammar extension modules)a bug.

(22) a. the dog has barked

b. o cachorro tem ladrado

c. o cachorro ladrou

To understand why that was not possible, we must first understand how the RGLtense system works. A RGL clause like the one in gloss 22 is built using the PredVPconstructor, which takes a noun phrase and a verb phrase and builds a clause. Clauses haveno fixed tense; it is when we transform them into sentences using the UseCl constructorthat we specify what tense we want to have.

The RGL does not enforce a common tense model, but it does offer a commonAPI for all languages. This means that the UseCl constructor will only take as argumentsthese 8 tenses, although language-specific constructors taking more tenses may exist ina particular RG. The RGL tense API is a combination of anteriority (simultaneous oranterior) and temporal order (past, present, future, or conditional) for a total of 8 differenttenses [33, section 5.17].

The Romance functor’s tense API adds a second form of past temporal order(corresponding to Portuguese pretérito perfeito) to the RGL’s, for a total of 10 tenses.

3.3. Evaluation 85

Anteriority Temporal Order Romance ExampleASimul TPres RPres o cachorro ladra

TPast RPast o cachorro ladrava– RPasse o cachorro ladrou

TFut RFut o cachorro ladraráTCond RCond o cachorro ladraria

AAnter TPres RPres o cachorro tem ladradoTPast RPast o cachorro tinha ladrado

– RPasse o cachorro teve ladradoTFut RFut o cachorro terá ladradoTCond RCond o cachorro teria ladrado

Table 2 – RGL and Romance tenses, and how they inflect verbs in a Portuguese sentence

Table 2 shows how RGL tenses are mapped to Romance tenses and how those were mappedto actual verb forms by giving the example in gloss 22 with varying tenses.19

When mapping RGL tenses to Romance tenses, the Romance functor made thechoice of using the auxiliary verb in the present and the participle form of the main verbin the case of an RGL present anterior tense, as in French le chien a aboyé – which isperfectly idiomatic French – and in Portuguese o cachorro tem ladrado (see table 2). Itwas due to this choice that we could not get o cachorro ladrou as the translation of thedog has barked, since RPasse – the temporal order producing Portuguese pretérito perfeito– is a Romance-only construct. This does not simply affect RGL translation: applicationprogrammers were also forced to use RGL extensions to achieve o cachorro ladrou.

Because we considered that only offering such a crucial aspect of the language inan extension was bad user interface, we decided to make the Romance tense mappingto verb forms language-dependent. We did this by including a parameter function in theDiffRomance module with a standard implementation that is overridden by Portuguese,as in listing 3.22.

We are then able to have o cachorro ladrou and le chien a aboyé as translations ofthe dog has barked; the Portuguese pretérito perfeito is available in general as RGL presentanterior tense (the AAnter x TPres combination in table 2).

(23) a. Abrams handed Browne the cigarette

b. ! Atlas passava o cigarro a Bobi

c. Atlas passou o cigarro a Bobi

19 Note that the form corresponding to combination AAnter x RPasse is ungrammatical; however GFdemands a value for each parameter combination and this was the choice that made the most sense, sothat we had forms corresponding to the auxiliary verb in all possible tenses as we do for the main verb.


operTAtoVF : RTense → Anteriority→ (VF ⇒ Str) → (VF ⇒ Str)→ Number →Person →Mood →Str → Str ∗ Str ;

TAtoVF t a verb vaux n p m part = case 〈t,a〉 of {〈RPast,Simul〉 ⇒ 〈verb ! VFin (VImperf m) n p, [] 〉 ;〈RPast,Anter〉 ⇒ 〈vaux ! VFin (VImperf m) n p, part〉 ;〈RFut,Simul〉 ⇒ 〈verb ! VFin VFut n p, [] 〉 ;〈RFut,Anter〉 ⇒ 〈vaux ! VFin VFut n p, part〉 ;〈RCond,Simul〉 ⇒ 〈verb ! VFin VCondit n p, [] 〉 ;〈RCond,Anter〉 ⇒ 〈vaux ! VFin VCondit n p, part〉 ;〈RPasse,Simul〉 ⇒ 〈verb ! VFin VPasse n p, [] 〉 ;〈RPasse,Anter〉 ⇒ 〈vaux ! VFin VPasse n p, part〉 ;−− default Romance: 〈RPres,Anter〉 ⇒ 〈vaux ! VFin (VPres m) n p, part〉 ;〈RPres,Anter〉 ⇒ 〈verb ! VFin VPasse n p, [] 〉 ;〈RPres,Simul〉 ⇒ 〈verb ! VFin (VPres m) n p, [] 〉} ;

Listing 3.22 – A function mapping temporal order and anteriority to verb forms

(24) a. Abrams knew that it rained

b. Atlas sabia que chovia

Gloss 23 highlights another problem with tenses: the Portuguese RG assigns thePortuguese pretérito imperfeito to the RGL tense which the English RG assigns the Englishsimple past, which ends up creating unidiomatic sentences. However it is not always thecase that the pretérito imperfeito is a unidiomatic translation of the English simple past,as gloss 24 shows. There is no general solution to this translation problem at the RGLlevel (which is syntactic), and is an example of why the RGL does not have translationequivalence as a goal.

3.3.2.1.2 Whose as interrogative

(25) a. whose dog barked ?

b. * quem cachorro ladrava ?

Gloss 25 shows a trickier example than it seems. Google Translate (as of 2018-10-02)will translate it wrongly in Portuguese, Spanish, and French.20 Appropriate Portuguesetranslations such as de quem é o cachorro que ladrou? or o cachorro de quem ladrou?are difficult to represent in GF because they are not parallel to other questions whichshare the same underlying abstract tree. Gloss 26 shows a sentence whose abstract syntax20 As cujo cachorro latiu?, de quien perro ladró?, and quel chien a aboyé?, respectively. Interestingly, it

makes a different kind of mistake for each language.

3.3. Evaluation 87

tree is almost the same as the one gloss 25; in English their structure is almost the same,but the correct Portuguese translations have different structure (one introduces a copula,the other inverts the constituent order). To have the QuestVP constructor which buildsboth sentences behave in such different ways, we would need to add a new parameter tothe linearization type of interrogative pronouns and build the resulting question clauseaccording to it. That would be a valid solution if linguistic theory supported such aclassification of interrogative pronons. We are not aware that it does, so we argue that abetter solution is to add a constructor for whose-sentences in the Construction moduleof the RGL, which already contains constructors for other kinds of questions that areexceptional, like for example how old are you? (whose appropriate Portuguese translationhas again a different structure).

(26) a. which dog barked ?

b. qual cachorro ladrava ?

3.3.2.1.3 Compound nouns

(27) a. the garden dog barked

b. o cachorro do jardim ladrava

Gloss 27 needed the definition of the CompoundN construction rule (that createscompound nouns) from the Extend grammar extension to produce a non-empty Portugueselinearization. After defining it, we were able to perform the linearization correctly.

3.3.2.1.4 Adverb placement

(28) a. the dog probably barked

b. # o cachorro ladrava provavelmente

c. o cachorro provavelmente ladrava

(29) a. the dog barked softly

b. o cachorro ladrava suavemente

c. o cachorro suavemente ladrava


GF has two adverb categories: AdV for adverbs that are directly attached to theverb (like always), and Adv for regular adverbs. The Romance functor currently insertsthe Adv in a VP correctly, but fails to do so for AdV. For instance, for an Adv like softly,both Portuguese linearizations in gloss 29 work, although the first is more idiomatic. Wehave produced a patch for the Romance functor in order to fix this.

3.3.2.1.5 Incorrect tense choices

(30) a. Abrams intended Browne to bark

b. * Atlas pretendia que Bobi ladrar

c. Atlas pretendia que Bobi ladrasse

The verb intend in gloss 30 takes two complements, one of which is a verb. As wecan see, the current Portuguese linearization employs the indicative mode, whereas theideal linearization would use the verb complement in the subjunctive. There is currentlyno way of specifying which mode the complement verb should use in the RGL (most likelybecause it is not necessary for most of its languages), hence the wrong linearization.

3.3.2.1.6 Tag questions

The RGL does not support tag questions, therefore we could not parse the sentencethe dog barked , didn’t it. This means that the sentence is not recognized by the EnglishRG, and so it has no GF abstract syntax tree.

3.3.2.1.7 Date and time units

(31) a. it took Abrams ten minutes to arrive

b. levava dez minutos para Atlas chegar

To parse gloss 31 we needed to add a new verb category (V3V, for verbs taking twonoun phrase complements and a verb phrase complement), plus a construction rule tocreate noun phrases from time units.

(32) a. June third arrived

b. o três de junho chegava

(33) a. Abrams arrived at three twenty

3.3. Evaluation 89

b. # Atlas chegava às vinte de três

c. Atlas chegava às três e vinte

GF has construction rules for dates, which allow us to correctly parse gloss 32.Without these rules, we could have parsed June third as a compound, which in this casewould yield the same Portuguese linearization. Consider gloss 33, however. Its secondsentence is the Portuguese linearization of the tree containing June third as a compound,while the third sentence is the linearization of the tree analyzing it using a specialconstruction rule for time hours. We generally do not want to define special constructionrules if we can avoid it, but in this case they are permissible since linguistic constructionsfor expressing time are indeed special cases in most languages.

3.3.2.1.8 Incomplete sentences

(34) Abrams could

(35) Browne tried to

Glosses 34 and 35 are incomplete sentences. Due to the way GF is structured, itcould parse a sentence missing a NP complement, or a VP missing a complement, but nota sentence missing a verb complement. That is the case because there are intermediatecategories for the former cases, but not for sentences missing a verb complement. A way tosolve this problem is to create coercions; if we had a coercion from a verb that takes a verbcomplement (like could) to a verb that takes no complements, we would be able to parsegloss 34 successfully. The use of coercions introduces ambiguity, however, so coercions areusually declared in the grammar extension modules and not in the core RGL.

3.3.2.1.9 Iberic negative imperatives

(36) a. don’t bark

b. * não ladra !

c. não ladres !

Gloss 36 corresponds to an abstract syntax tree specifying a formal imperative in thesingular. Because the imperative is in the negative, its conjugation differs from that in thepositive imperative, which would be ladra !. It should follow the conjugation of the presentsubjunctive tense instead. Portuguese shares this feature with Spanish and Catalan, whoseoriginal implementations also produced wrong linearizations where the imperative behaved


as if in the afirmative. Another problem with Iberic negative imperatives (although it didnot show up in our evaluations) is showcased by gloss 37 – reflexive pronouns were beingplaced after the verb (não vista-se agora*), but this has also been fixed.

(37) a. don’t love yourselves

b. * não amem-se

c. não se amem

91

4 Applications

In this section we present three multilingual applications of the GF resourcegrammar library, which now include Portuguese because of our implementation. The intentof this section is to show how easy it is to extend an existing multilingual application witha new language, given a GF implementation of a resource grammar.

4.1 Health-domain application grammarDigital Grammars (DG1) is a company founded by the GF core team who uses

GF as a production tool.2 One of DG’s products is a healthcare-domain translationchat app, a demo of which is available from https://www.digitalgrammars.com/demo/chat/ (a screenshot of the app demonstration is in figure 18). The app features a smallcontrolled natural language (CNL) that talks about general health questions, meant tobe a demonstration of a much larger industrial app meant for healthcare providers inmulticultural societies. (We employ the term CNL here meaning a subset of English, butthe definition of CNLs are far from straightforward [22].)

Among the languages the app translates is Portuguese, supported by a concretegrammar written by us during the GF summer school 2018.3 A first version of thisgrammar was supported by the miniresource grammar described in section 3.2.1. Analternative, latter version is supported by the Portuguese RG implementation part of theRGL, developed as the instantiation of a functor. Both versions are under 150 lines of GFcode (the functor version being even shorter) because they are supported by an underlyingPortuguese grammar which does the heavy lifting.

4.2 Attempto Controlled English and GFAttempto Controlled English (ACE4) is a controlled natural language (CNL)

comprising an unambiguous subset of the English natural language which attempts toprovide precision in textual descriptions while maintaining their naturalness, allowinguntrained readers to understand them because of their similarity to English.

ACE-in-GF is a project aiming to implement ACE’s syntax using GF, so thatACE can become multilingual. The current coverage is almost complete for the OWL-1 https://www.digitalgrammars.com/2 There are a few other companies who do so, textual (https://textual.ai) among them.3 Available from https://github.com/GrammaticalFramework/gf-summerschool-2018/tree/master/

application4 For more information see http://attempto.ifi.uzh.ch/

https://www.digitalgrammars.com/demo/chat/

https://www.digitalgrammars.com/demo/chat/

https://www.digitalgrammars.com/

https://textual.ai

https://github.com/GrammaticalFramework/gf-summerschool-2018/tree/master/application

https://github.com/GrammaticalFramework/gf-summerschool-2018/tree/master/application

http://attempto.ifi.uzh.ch/

92 Chapter 4. Applications

Figure 18 – Screenshot of the DG demo app

compatible5 subset of ACE. In this subsection we describe the ACE-in-GF project, showhow we ported it to Portuguese with support from the Portuguese RG, and discuss a fewof its capabilities.

4.2.1 ACE

ACE is a CNL because although it looks like the natural language English, it is infact a formal language, i. e., a language completely described by a grammar. Despite theill-definition of the concept of a CNL [22], in the case of ACE it means a subset of English– with a restricted grammar and a domain-specific vocabulary – that can be unambiguouslytranslated to a notational variant of first-order logic [18], that of discourse representationstructures. For an introduction to discourse representation theory, see Blackburn & Bos [4];for a concrete specification of discourse representation structures in ACE, see [17].

To be be able to unequivocally translate ACE text to first-order logic – somethingthat can not be done for English – two main techniques are employed: removing dispensableambiguous constructions from the language (favoring the use of unambiguous ones instead)and interpreting the remaining ambiguous constructions using deterministic rules. Toreflect their intent in a text, a user might need to rephrase the input. This is aided bytwo tools: a predictive editor aids the user in staying within the grammar’s and lexicon’sdomain, and the ACE parser will produce paraphrases from an accepted input text tohelp the user ascertain if the ACE interpretation of the text is the one they meant. In the

5 https://www.w3.org/TR/owl2-overview/

https://www.w3.org/TR/owl2-overview/

4.2. Attempto Controlled English and GF 93

following example taken from the ACE documentation,6 the first sentence is the input,and the following sentences are the paraphrases produced by the ACE parser.

(38) a. A customer inserts a card that is valid and opens an account.

b. There is a customer X1. There is a card X2. The card X2 is valid. The customerX1 inserts the card X2. The customer X1 opens an account.

If the user intended the interpretation where the card opens the account, they mustrephrase their input to A customer inserts a card that is valid and that opens an account,which will produce the appropriate paraphrase.

ACE construction rules summarizing its grammar can be found in the onlinedocumentation,7 as can the interpretation rules followed by the ACE parser.8 A morecomplete description of the ACE grammar is also available online.9

4.2.2 ACE-in-GF

ACE-in-GF10 is a project intending to implement the syntax of ACE in GF, allowingit to cover more languages, more easily. The project does not reimplement the ACE parser,it simply translates between the other languages and ACE, which means that the ACEparser is still necessary in most use cases.

Implementing ACE in GF lends it not only multilinguality, but also support forGF technologies like lookahead writing prediction (which shows possible completions tothe sentence being written), embeddable grammars, and conversion to speech-recognitiongrammar formats [5].

Lookahead editing, for example, is very important for CNLs in general, ACE amongthem. Because CNLs are similar to their host languages they are very easy for readers tounderstand, but are also very difficult to write, since the user has to learn their rules tobe able to differentiate them from their host languages. Lookahead editing helps fix thisproblem by showing possible completions on an unfinished sentence.

In the particular case of ACE, though, GF lookahead editing is not sufficient outof the box, because GF has no built-in notion of anaphoras, which are a core part ofACE [21]. Kuhn [21] exemplifies the problem with the partial sentence every man protectsa house from every enemy and does not destroy.... Possible anaphoric completions to thissentence would refer to both man and house, but not to enemy, because enemy would6 http://attempto.ifi.uzh.ch/site/docs/ace_nutshell.html7 http://attempto.ifi.uzh.ch/site/docs/ace_constructionrules.html8 http://attempto.ifi.uzh.ch/site/docs/ace_interpretationrules.html9 http://attempto.ifi.uzh.ch/site/docs/syntax_report.html10 https://github.com/Attempto/ACE-in-GF

http://attempto.ifi.uzh.ch/site/docs/ace_nutshell.html

http://attempto.ifi.uzh.ch/site/docs/ace_constructionrules.html

http://attempto.ifi.uzh.ch/site/docs/ace_interpretationrules.html

http://attempto.ifi.uzh.ch/site/docs/syntax_report.html

https://github.com/Attempto/ACE-in-GF


be out-of-scope since in ACE verb phrases close at their end all scopes that are openedwithin them. In a GF lookahead editor, however, all these anaphoric completions wouldbe available, since there is no way to enforce an anaphoric constraint in GF [21].11 This isnot to say that it is impossible to implement such a lookahead editor with GF – indeed,one can be implemented in a general-purpose programming language with bindings toGF. But it does mean that there is no way of encoding declaratively the anaphora-relatedgrammatical rules of ACE in GF. It is because of this that Kuhn implemented a newgrammar notation capable of encoding such rules [21]. For more on the expressivity of GF,refer to [24].

4.2.3 Implementation and example usage

The Portuguese concrete grammar of ACE-in-GF is built on top of the functorserving all languages of the project, with support from the Portuguese RG. A full evaluationof the ACE-in-GF project is available from [5], which excludes our later implementation.We have not repeated this evaluation including Portuguese, since our intent here is toshow how easily one can add another language to a GF multilingual application given aresource grammar, not to ascertain how well an instance of such an application performs.

To exemplify how ACE-in-GF could be used, let us take a real-world example of aCNL whose use has been discontinued because of problems which ACE-in-GF might solve.Catterpillar Fundamental English (CFE) was developed by the Caterpillar company tofacilitate communication with its non-English speaking staff [40] (as cited in [22]).12 It wasmainly used in service manuals that had to be read by employees whose native languagesnumbered more than 50. Instead of having the manuals translated – which was costly andmight introduce errors – Caterpillar has its employees complete a 30-lesson course on CFEthat made them capable of understanding the language.

CFE ended up discontinued due to Caterpillar’s inability at the time of enforcingthe language’s rules [19] (as cited in [22]), which often made the language degenerateinto full English. This kind of enforcement is nowadays simply solved by a syntax-driveninteractive editor providing lookahead editing and syntax-checking, but was not availableat the time of CFE’s introduction. Some years later Caterpillar introduced another CNLwhose aims were to have its rules enforceable and to diminish translation costs – as opposedto eliminating them.

This later language has a similar use case to that of ACE-in-GF. A technical writerwould use ACE’s or GF’s interactive editor to write a text in the CNL. ACE-in-GF would

11 We suspect that these constraints could be enforced using the dependently-typed fragment of GF,however this implementation choice would incur into performance and maintainability costs.

12 We do not claim that Caterpillar could or should use ACE-in-GF; CFE dates from 1971 and a lot haschanged since then.

4.3. GF to UD 95

then produce translations with reasonably low cost and quality. Gloss 39 shows an exampleof an ACE sentence translated to Portuguese taken from the ACE-in-GF documentation.

(39) a. it is false that X is read by nothing but computers that Y doesn’t see

b. é falso que X é lido só por computadores que Y não vê

4.3 GF to UDWe have given an overview of the Universal Dependencies (UD) project in sec-

tion 3.3.1. In this section we first present the kinds of trees we will discuss, and then showthe algorithm developed by Kolachina & Ranta [20] that allows the transformation of GFRGL abstract syntax trees into UD trees. Further work from Kolachina & Ranta on theconversion of UD trees to GF trees is also available [34], but we will not discuss it here.

Converting GF trees to UD trees has a few use-cases; some of the ones mentionedin [20] include:

1. it allows GF to be used as a rule-based dependency parser;

2. it enables bootstrapping of UD treebanks from GF treebanks (this is speciallyinteresting for languages that have little to no UD data);

3. it becomes a way of checking manually annotated UD trees (at least for those thatcan be generated by GF).

This conversion is enhanced by a small language-specific13 configuration, which iswhat we created for Portuguese to allow the conversion of syntax trees to Portuguese UDdependency trees. This and the Portuguese RG itself are our only contributions to thissection.

Finally, we discuss aspects of the conversion, focusing on how the results deviatefrom the UD standard, following [20].

4.3.1 Trees

In this section we briefly explain what are the different trees involved in the GF toUD conversion. Figure 19 shows examples of these trees.

4.3.1.0.1 Abstract syntax trees

Are trees whose nodes and leaves are abstract syntax functions (see figure 19a).13 Actually it is concrete grammar-specific, since the conversion could work for any GF grammar, not

only the RGL.


4.3.1.0.2 Parse trees

Are trees whose nodes are categories and the leaves are words (strings) (seefigures 19b and 19c).

4.3.1.0.3 Dependency trees

Are trees whose nodes are words and whose edges have dependency labels; wordorder is significant (see figures 19d and 19e).

4.3.1.0.4 Abstract dependency trees

Are trees whose nodes are constant functions and whose edges have dependencylabels; word order is insignificant (see figure 19f).

4.3.2 The algorithm

We can transform an abstract syntax tree into a dependency tree by picking one ofits argument as the head, and giving the other arguments appropriate labels. For example,given an abstract syntax tree created by a function such as

Pred : V → Subj → Obj → S ;

we can derive a dependency tree where the first argument is the head, the second is adependent with label Subj, and the third is another dependent with label Obj. To havemore control over the conversion process, we can specify a dependency configuration,which specifies which of the arguments is the head, and what the other arguments mustbe labelled. In the GF to UD conversion, this is done by a text file such as the one inlisting 4.1.

PredVP nsubj head −− NP →VP →ClQuestIAdv advmod head −− IAdv →Cl →QClComplSlash head dobj −− VPSlash →NP →VPSlashVV head acl −− VV →VPSlash →VPSlash

Listing 4.1 – Dependency configurations for a few RGL functions

This handles one level of the tree; Given an abstract syntax tree T and a wordsequence S, we can derive a dependency tree by generalizing this idea to a recursivealgorithm:

1. For each word w in S, find the function fw forming its smallest spanning subtree inT .

4.3. GF to UD 97

(a)

(b) (c)

(d)(e)

(f)

Figure 19 – Different kinds of trees


2. Link each word w in S with either

a) the head argument of fw, if w is not the head;

b) or the head of the whole fw, if w is itself the head

(The smallest spanning subtree of a word is the subtree whose top node is thefunction whose linearization generates that word.)

There is an alternative way of deriving dependency trees from abstract syntax trees,which has benefit of being simpler to follow. To perform this method, we need to producea parse tree, and decorate it with the labels obtained from the dependency configurations.

To obtain a parse tree from an abstract syntax tree T :

1. Linearize it to a word sequence T ;

2. Link each word in S to its smallest spanning subtree in T ;

3. Replace each function in the nodes of T by its value category.

After labelling the parse tree with the abstract syntax functions and dependencylabels, we can assign a head and a label to each word by the following procedure:

1. Follow the edges up from the word until a labels is reached: this is the label of theword ;

2. From the dominating node, follow the path of unlabelled edges down to anotherword: this is the head of the word ;

3. If no label is encountered on the way upwards, the word itself is the head of thesentence.

4.3.2.0.1 Example

Figure 20 shows the partial conversion of the sentence o gato nos vê to a dependencytree using the default dependency configuration for GF to UD. The final dependency treewould be the one in figure 19e.

4.3.2.0.2 Corpus

After we revised the UD examples treebank described in section 3.3.1, we haveperformed its conversion to a UD treebank, which is available at https://github.com/odanoburu/gfss2018-presentation/blob/master/por-ud.conllu.

https://github.com/odanoburu/gfss2018-presentation/blob/master/por-ud.conllu

https://github.com/odanoburu/gfss2018-presentation/blob/master/por-ud.conllu

4.3. GF to UD 99

(a)

(b)

(c)

o gatoDET NOUN

Quant-0 N-0

� �?

det

(d)

(e)

o gato veDET NOUN VERB

Quant-0 N-0 V2-3

� �?

det

# ?

nsubj

Figure 20 – Converting decorated parse tree to a dependency tree


(a)(b)

Figure 21 – Dependency trees converted from GF trees

4.3.3 Discussion

Comparing the GF and the UD [28] approaches to the idea of an universal grammaris one of the goals of Kolachina & Ranta [20]. During their work, they found a lot ofcommonalities, but also points where the two approaches differ. These points raise biggerissues about how linguistic knowledge should be encoded, but also cause problems in theGF to UD conversion.

A first case of mismatch between the UD and the GF approaches is in the handlingof syncategorematic words (see [20, section 2.4]). Syncategorematic words are words withno abstract syntax category attached to them, a very common example of them beingcopulas. While the UD approach is that all words in a sentence must have labels connectingthem to a head word, GF chooses not consider copulas as deserving a category of theirown, because languages like Russian do not have them in the present tense, for example.The UD approach leads to less homogeneity between trees across languages, while the GFapproach – even though it is more general – fails to specify exactly what is the relationbetween the copulas and their heads when converted to dependency trees.

In their conversion to UD trees, all syncategorematic words in GF get a defautlabel of dep, which is the unspecified label defined by the UD project.14 Although thisproblem in the conversion could be solved by changing the GF RGL to have a categoryfor copulas, Kolachina & Ranta decided to create a language-specific configuration for theconversion process that allows one to specify how words appearing in a certain functionare to be labelled, and what words should be their heads. It is this configuration I havecontributed for Portuguese, reducing the number of unspecified dependency relations inthe GF UD treebank from 115 to just 3. Figure 21 shows a dependency tree before andafter the configuration file was specified.

Not all UD-GF discrepancies can be solved by this language-dependent configura-tion, though. Take the case of passive constructions: in UD their subject must have thelabel nsubj:pass, however as in GF they are constructed at verb-phrase level, they are seenby the clause-making functions like PredVP as any other VP, and thus have their headslabelled as nsubj. This conversion problem is solved in [20] by another type of configuration,14 See https://universaldependencies.org/u/dep/dep.html

https://universaldependencies.org/u/dep/dep.html

4.3. GF to UD 101

that of non-local abstract rules. We refrain from reproducing their explanation here sincethey have not been published at the public GF repository yet.

Two issues not discussed in [20] are lemmatization (the conversion output includesthe GF constructor names as lemmas instead of the actual lemmas) and UD meta-tokens(i. e., multi-word tokens and empty nodes), which do not seem to be handled by the currentconversion tool – it only ever outputs UD word tokens. Listing 4.2 shows the raw ouputof the GF to UD conversion, while listing 4.3 shows a version where the lemmas and thetokenization were corrected to conform to UD standards. Because of this and the otherissues we mentioned, GF can be successfully used as a bootstraper of UD treebanks, butits output needs revision to conform to the UD standards. We must note that much of thisrevision need can be automated, so we believe that the GF to UD conversion is practical.

# text = há uma vaca na floresta1 há ExistNPAdv AUX Cl Cl−0 3 aux _ _2 uma IndefArt DET Quant Quant−7 3 det _ _3 vaca cow_N NOUN N N−0 0 root _ _4 na DefArt DET Quant Quant−10 5 det _ _5 floresta forest_N NOUN N N−0 3 nmod _ _

Listing 4.2 – Raw output of the GF to UD conversion for the sentence “há uma vaca nafloresta”

# text = há uma vaca na floresta1 há haver AUX Cl Cl−0 3 aux _ _2 uma um DET Quant Quant−7 3 det _ _3 vaca vaca NOUN N N−0 0 root _ _4−5 na _ _ _ _ _ _ _ _4 em em DET Quant Quant−10 6 det _ _5 a o DET Quant Quant−10 6 det _ _6 floresta floresta NOUN N N−0 3 nmod _ _

Listing 4.3 – Partially corrected output of the GF to UD conversion for the sentence “háuma vaca na floresta”

103

5 Conclusion

In this dissertation we have presented a freely-available computational grammar ofPortuguese under the GF grammar formalism. As the GF resource grammar library, itsfocus is in generating syntactically-correct natural language – even if it can be used forrestricted-domain parsing – and in providing a library of grammatical constructions thatcan be reused by other grammars (dubbed application grammars).

Our evaluations have shown that the resource grammar is capable of producinggrammatical Portuguese across a rich range of syntactic phenomena (producing gram-matical sentences in over 90% of the trees from our test treebanks, see section 3.3), eventhough a few issues still remain to be worked on. We have also shown extensions of existingapplication grammars to Portuguese, exemplifying how easy it is to create new grammarson top of our work.

5.1 Future WorkWe may classify future work related to the Portuguese RG into two kinds: that

related to its own improvement and that related its applications.

Going further than a resource grammar under the RGL, there is also future workon creating a Portuguese- or Romance-only GF grammar with bigger coverage of syntacticphenomena. This work could then be used to define the Portuguese RGL grammar (toavoid work duplication) and might be better suited to open-domain parsing – in additionto providing more flexible but monolingual NLG. This idea could of course apply to anylanguage, and it depends on improvements being made to the GF ecosystem to supportcompilation of even bigger grammars and the disambiguation of their parse trees.

5.1.1 Known issues with the Portuguese RG

In this section we describe a few outstanding issues with our Portuguese implemen-tation of a resource grammar.

5.1.1.0.1 Preposition + personal pronoun contractions

(40) a. he is here with me

b. * ele está aqui com mim

c. ele está aqui comigo

104 Chapter 5. Conclusion

We have not implemented contractions such as those between prepositions com, em, deand personal pronouns, as can be seen in gloss 40.

(41) a. this is his dog

b. este é seu cachorro

c. este é o cachorro dele

However this does not always means that grammatical mistakes occur, for we might phrasethings differently, as in gloss 41.

5.1.1.0.2 Enclitic and mesoclitic pronoun contractions

(42) a. to love her isn’t easy

b. * amar a não é fácil

c. amá-la não é fácil

There is currently no support for enclitic pronoun contractions, as shown by gloss 42. Weare unsure about whether there should ever be support for mesoclitic pronouns since theyare archaic and very rarely used in written or spoken Portuguese.

5.1.1.0.3 Preposition + demonstrative pronouns contractions

(43) a. I don’t like this dog

b. * eu não gosto de este cachorro

c. eu não gosto deste cachorro

As exemplified by gloss 43, we have not implemented contractions between prepositionssuch as a, de, em and demonstrative pronouns.

5.1.1.0.4 Reflexive two-place adjectives

(44) a. the number 2 is divisible by itself

b. o número 2 é divisível por si

5.1. Future Work 105

The RGL contains a constructor for reflexive two-place adjectives, whose use is exemplifiedby gloss 44. Its Romance implementation hardcodes a reference to the third-person (causingthe error in gloss 45), since its representation of adjectival phrases is not variable in theperson parameter. We debate whether this implementation should change, since changingthe representation of adjectival phrases for only one constructor has little advantage andincurs a non-insignificant performance cost.

(45) a. I am divisible by myself

b. * eu sou divisível por si

5.1.2 Possible applications

In this section we describe ideas for applications using GF and our Portuguese RG.

5.1.2.0.1 WordNet gloss corpus

The gf-wordnet project uses WordNet senses to improve GF machine transla-tion. However it is also the case that GF machine translation can improve WordNet. Ifgf-wordnet makes GF robust enough to parse all WordNet glosses, it can be used totranslate them too. This work could be coupled with the WordNet gloss corpus annotationproject,1 which would guarantee correct translations at the lexical level. This joint workcould also be used to establish that WordNet is closed under the definition property, i. e.,that each word-sense pair used to define a sense is itself defined in WordNet. If it is thecase that such a pair does not have this property, then either a word has to be added to asense or a new sense should be added to WordNet, or both.

The translation process should also help to catch bugs in other language’s word-nets and in the glosses disambiguation. While working on the Portuguese modules forgf-wordnet we found several cases of word translations that seemed wrong, and theywere caused by errors in OpenWordNet-PT, the Portuguese WordNet. We believe it is alsothe case that a bad translation might point out a bug in the disambiguation; take gloss 8(page 74) as an example; if we wrongly disambiguate hands as a rotating pointer on theface of a timepiece instead of the body part, the error is not immediately to someoneseeing the data; however if we present the gloss together with its Portuguese translationas in gloss 46, then it is obvious that something is wrong.

(46) a. lower your hands slowly

b. abaixe seus ponteiros lentamente1 http://wordnetcode.princeton.edu/glosstag.shtml

http://wordnetcode.princeton.edu/glosstag.shtml

106 Chapter 5. Conclusion

5.1.2.0.2 Data as text

Angelov & Enache describe the encoding of the Suggested Upper-Merged Ontology(SUMO) in GF [3], affording it consistence-checking through GF’s logical framework. Inaddition to consistence-checking, GF generated improved natural language descriptions ofSUMO’s concepts; gloss 47 (taken from [14]) shows first the concept description offered bySUMO’s own NLG tool, and second a GF-generated description of the same concept.

(47) a. for all unique list ?LIST holds for all ?NUMBER1, ?NUMBER2 holds if “helement of ?LIST” is equal to “h element of ?LIST”, then ?NUMBER1 is equalto ?NUMBER2

b. for every unique list LIST, every positive integer NUMBER2 and every positiveinteger NUMBER1 we have that if the element with number NUMBER1 in LISTis equal to the element with number NUMBER2 in LIST then NUMBER1 isequal to NUMBER2

The existence of a Portuguese RG allows the extension of this work to include Portuguese,but more interestingly it allows syntactically-correct and reasonably idiomatic naturallanguage generation from ontologies, and even more generally from any kind of datathat might need natural language descriptions to make it understandable to non-experts.Another use for natural language descriptions of ontologies or data is help domain expertsfind errors in them without needing to learn a formal language or a data model.

For Portuguese NLG in particular we believe there is a use-case in Brazilian officialjournals (or Diário Oficial), whose current format only now started to provide machine-readable data of unrealiable quality despite the fact that no person can keep up with itslong contents. Changing the Diário Oficial to be machine-readable first, with GF NLGproviding a human interface to the data would be easier and more correct than the currentsolution of trying to extract structured information from its natural language texts.

107

Bibliography

[1] Krasimir Angelov. The mechanics of Grammatical Framework. PhD thesis, ChalmersUniversity of Technology and University of Gothenburg, 2011.

[2] Krasimir Angelov, Björn Bringert, and Aarne Ranta. PGF: A portable run-timeformat for type-theoretical grammars. Journal of Logic, Language and Information,19(2):201–228, Apr 2010.

[3] Krasimir Angelov and Ramona Enache. Typeful ontologies with direct multilingualverbalization. In Michael Rosner and Norbert E. Fuchs, editors, Controlled NaturalLanguage, pages 1–20, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[4] Patrick Blackburn and Johan Bos. Representation and inference for natural language-volume ii: Working with discourse representation structures. Department of Compu-tational Linguistics, University of Saarland, Germany, 1999.

[5] John J Camilleri, Norbert E Fuchs, and Kaarel Kaljurand. Deliverable d11. 1. acegrammar library. Technical report, Technical report, MOLTO project, June 2012.http://www. molto-project. eu . . . , 2012.

[6] Alessandra Cid, Alexandre Rademaker, Bruno Cuconato, and Valeria de Paiva. Linguis-tic legal concept extraction in portuguese. In Monica Palmirani, editor, Legal Knowl-edge and Information Systems, volume 313 of Frontiers in Artificial Intelligence and Ap-plications. 2018. The 31th International Conference on Legal Knowledge and Informa-tion Systems (JURIX 2018). Expanded version at https://arxiv.org/abs/1810.09379.

[7] Bruno Cuconato and Alexandre Rademaker. A computational grammar for Portuguese.In Valeria de Paiva, Rodrigo Wilkens, and Fernando Batista, editors, PROPOR 2018demonstration session. September 2018.

[8] Leonel Figueiredo de Alencar, Bruno Cuconato, and Alexandre Rademaker. Mor-phobr: An open source large-coverage full-form lexicon for morphological analysis ofportuguese. Texto Livre: Linguagem e Tecnologia, 11(3):1–25, 2018.

[9] Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, FilipGinter, Joakim Nivre, and Christopher D Manning. Universal stanford dependencies:A cross-linguistic typology. In LREC, volume 14, pages 4585–4592, 2014.

[10] Marie-Catherine De Marneffe and Christopher D Manning. Stanford typed dependen-cies manual. Technical report, Technical report, Stanford University, 2008.

108 Bibliography

[11] Valeria de Paiva, Alexandre Rademaker, and Gerard de Melo. Openwordnet-pt: Anopen Brazilian Wordnet for reasoning. In Proceedings of COLING 2012: DemonstrationPapers, pages 353–360, Mumbai, India, December 2012. The COLING 2012 OrganizingCommittee. Published also as Techreport http://hdl.handle.net/10438/10274.

[12] Pedro Delfino, Bruno Cuconato, Guilherme Paulino Passos, Gerson Zaverucha, andAlexandre Rademaker. Using OpenWordnet-PT for question answering on legaldomain. In Global Wordnet Conference 2018, Singapore, January 2018.

[13] Grégoire Détrez and Aarne Ranta. Smart paradigms and the predictability andcomplexity of inflectional morphology. In Proceedings of the 13th Conference of theEuropean Chapter of the Association for Computational Linguistics, EACL ’12, pages645–653, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

[14] Ramona Enache. Reasoning and language generation in the sumo ontology. Master’sthesis, Chalmers University of Technology and University of Gothenburg, 2010.

[15] Ramona Enache, Aarne Ranta, and Krasimir Angelov. An open-source computationalgrammar for Romanian. In International Conference on Intelligent Text Processingand Computational Linguistics, pages 163–174. Springer, 2010.

[16] N Anido Freire. Les verbes portugais: formes et emplois.

[17] Norbert E Fuchs, Kaarel Kaljurand, and Tobias Kuhn. Discourse representationstructures for ace 6.7. Technical report, Department of Infomatics & Institute ofComputational Linguistics, University of Zurich, 2013.

[18] Norbert E Fuchs, Uta Schwertel, and Rolf Schwitter. Attempto controlled english—notjust another logic specification language. In International Workshop on Logic Pro-gramming Synthesis and Transformation, pages 1–20. Springer, 1998.

[19] Christine Kamprath, Eric Adolphson, Teruko Mitamura, and Eric Nyberg. Controlledlanguage for multilingual document production: Experience with caterpillar technicalenglish. In Proceedings of the Second International Workshop on Controlled LanguageApplications, volume 146, 1998.

[20] Prasanth Kolachina and Aarne Ranta. From abstract syntax to universal dependencies.LiLT (Linguistic Issues in Language Technology), 13, 2016.

[21] Tobias Kuhn. Codeco: A grammar notation for controlled natural language inpredictive editors. arXiv preprint arXiv:1103.5676, 2011.

[22] Tobias Kuhn. A survey and classification of controlled natural languages. Computa-tional Linguistics, 40(1):121–170, 2014.

Bibliography 109

[23] Inari Listenmaa and Koen Claessen. Automatic test suite generation for pmcfg.Proceedings of the fifth workshop on Natural Language and Computer Science, 2018.

[24] Peter Ljunglöf. Expressivity and Complexity of the Grammatical Framework. PhDthesis, Göteborg University & Chalmers University of Technology, 2004.

[25] Simon Marlow et al. Haskell 2010 language report. Available onlinehttps://www.haskell.org/onlinereport/haskell2010/(October 2017), 2010.

[26] Per Martin-Löf and Giovanni Sambin. Intuitionistic type theory, volume 9. BibliopolisNapoli, 1984.

[27] George Miller. WordNet: An electronic lexical database. MIT press, 1998.

[28] Joakim Nivre. Towards a universal grammar for natural language processing. InAlexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing,pages 3–16, Cham, 2015. Springer International Publishing.

[29] Stephan Oepen and Jan Tore Lønning. Discriminant-based mrs banking. In Proceedingsof the 5th International Conference on Language Resources and Evaluation, pages1250–1255, 2006.

[30] Aarne Ranta. Grammatical framework: a type-theoretical grammar formalism. Journalof Functional Programming, 14(2):145–189, 2004.

[31] Aarne Ranta. From Semantics to Computer Science. Essays in honour of Gilles Kahn,chapter Grammars as software libraries. Cambridge University Press, 2009.

[32] Aarne Ranta. The GF resource grammar library. Linguistic Issues in LanguageTechnology, 2(2):1–63, 2009.

[33] Aarne Ranta. Grammatical Framework: Programming with Multilingual Grammars.CSLI Publications, Stanford, 2011. ISBN-10: 1-57586-626-9 (Paper), 1-57586-627-7(Cloth).

[34] Aarne Ranta and Prasanth Kolachina. From universal dependencies to abstractsyntax. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies(UDW 2017), pages 107–116, 2017.

[35] Maria Aparecida Florence Cerquera Ryan. Conjugação dos verbos em português:prático e eficiente. Ática, 1999.

[36] Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. On multiplecontext-free grammars. Theoretical Computer Science, 88(2):191–229, 1991.

110 Bibliography

[37] Hiroyuki Seki, Ryuichi Nakanishi, Yuichi Kaji, Sachiko Ando, and Tadao Kasami. Par-allel multiple context-free grammars, finite-state translation systems, and polynomial-time recognizable subclasses of lexical-functional grammars. In Proceedings of the31st annual meeting on Association for Computational Linguistics, pages 130–139.Association for Computational Linguistics, 1993.

[38] Stuart M Shieber, Yves Schabes, and Fernando CN Pereira. Principles and imple-mentation of deductive parsing. The Journal of logic programming, 24(1-2):3–36,1995.

[39] Pasi Tapanainen and Atro Voutilainen. Tagging accurately: Don’t guess if you know.In Proceedings of the Fourth Conference on Applied Natural Language Processing,ANLC ’94, pages 47–52, Stroudsburg, PA, USA, 1994. Association for ComputationalLinguistics.

[40] Charles A Verbeke. Caterpillar fundamental english. Training and DevelopmentJournal, 27(2):36–40, 1973.

[41] John Whitlam. Modern Brazilian Portuguese Grammar. Routledge, 2010.

Documents

A computational grammar for Portuguese