47
Tópicos Especiais em Computação: Introdução ao Processamento da Linguagem Natural - Profa. Dra. Aline Villavicencio Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition - Daniel Jurafsky & James H. Martin. Chapter 12 - FORMAL GRAMMARS OF ENGLISH por Luciano Sclovsky - setembro de 2015 1

Formal Grammars of English

Embed Size (px)

Citation preview

Tópicos Especiais em Computação: Introdução aoProcessamento da Linguagem Natural - Profa. Dra. Aline Villavicencio

Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition - Daniel Jurafsky & James H. Martin.

Chapter 12 - FORMAL GRAMMARS OF ENGLISH

por Luciano Sclovsky - setembro de 2015

1

2

Introduction

Study of grammar dates back to Panini's study of Sanskrit

The grammar is wrong!

Syntax (greek) - setting out together or arrangement

ATIS domain example (Air Traffic Information Systems)

3

Main ideas in this chapter

Constituent (such as noun phrase)

Grammatical relations (such as subject and object from traditional grammar)

Subcategorization and dependency relations - kinds of relations between

words and phrases

Context Free Grammars are the backbone of formal models

grammar checking, semantic interpretation, dialogue understanding, machine translation

4

Constituency

Noun phrase: sequence of words around at least one noun

Can appear in similar syntactic environments (such as before a verb)

They

Harry the horse

a high-class spot such as Mandy's

Preposed

On september 17th, I'd like to fly from Atlanta to Denver

Postposed

I'd like to fly from Atlanta to Denver on september 17th,

5

Context-Free Grammars (CFG)

The most commonly used mathematical system

AKA: Phrase-Structure Grammar

Equivalent to: Backus-Naur Form (BNF)

Dates back to 1900 (psy Wilhelm Wundt), formalized by Chomsky (1956)

is a set of rules or productions

expresses the way symbols can be grouped and ordered together

includes a lexicon of words

can be hierarchically embedded

example

6

Context-Free Grammars (CFG)

terminals: correspond to words in the language

non-terminals: expresses generalization

production:

left: non-terminal

right: ordered list of terminals and non-terminals

the production associated to a single word is associated with its lexical category

A CFG can be thought as

a device for generating sentenced

a device for assigning structure to a given sentence

7

Context-Free Grammars (CFG)

NP immediately dominates Det and Nom

NP dominates all nodes

Each CFG has a start symbol: often S (sentence node)

Example:

sentence can consist of a noun phrase followed by VERB PHRASE by a verb phrase

S → NP VP I prefer a morning flight

VP → Verb NP prefer a morning flight

VP → Verb NP PP leave Boston in the morning

VP → Verb PP leaving on Thursday

PP → Preposition NP from Los Angeles (locations, times, dates) 8

9

Formal definition of a CFG

A CFG is defined by a "4-tuple":

Convention:

10

Formal definition of a CFG

Parsing: mapping a string of words into a parse tree

Direct derivation: product of a single production

Derivation: product of n productions

11

Some Grammar rules for English: Sentence Level Constructions

Four important sentence structures:

declarative:

subject NP + VP I prefer a morning flight

imperative

S → VP (and no subject) Show the lowest fare

yes-no question

S → Aux NP VP (aux verb) Do any of these flights have stops

wh-question 12

Some Grammar rules for English: Sentence Level Constructions

wh-question

mainly wh-subject-question: identical to declarative, with the wh-word

S → Wh-NP VP What airlines fly from Burbank to Denver?

wh-non-subject-question: sentence includes another subject

S → Wh-NP Aux NP VP What flights do you have from Burbank to Tacoma Washington?

wh-non-subject-question is an example of a long-distance dependencysemantic interpretation: parse that flights is the argument of have

13

English: Clauses and Sentences

Clause: forms a complete thought

a clause is the smallest grammatical unit that can express a complete proposition (Wikipedia)

Sentence: may form larger structures

S may be in the right side of a production

14

English: the Noun Phrase

Most frequent types: pronouns, proper-nouns and the NP construction

NP → Det Nominal

Determiner: NP may begin with simple lexical determiners

a stop these flights some flights

may have more complex structures (recursive)

Det → NP ′s Denver's mayor's mother's canceled flight

Nominal: follows the determiner and contains pre- and post-head noun

modifiers

Nominal → Noun15

English: the Noun Phrase

Determiner: before the Head Noun

cardinal numbers: two friends

ordinal numbers: the first one the last flight

quantifiers: many fares much luggage

adjective phrase: the least expensive fare

NP → (Det) (Card) (Ord) (Quant) (AP) Nominal

16

English: the Noun Phrase

Determiner: after the Head Noun (postmodifiers)

prepositional phrases: all flights from Cleveland

common in ATIS corpus to mark origin and destination

Nominal → Nominal PP

non-finite clauses any flights arriving after eleven a.m.

gerundive (-ing, -ed) any of those leaving on Thursday

Nominal → Nominal GerundVP

infinitive the last flight to arrive in Boston

relative clauses a flight that serves breakfast

begins with a relative pronoun (mostly: that and who)

17

English: the Noun Phrase

predeterminer: word classes that modify and appear before the NP

all the flights

18

19

English: the Agreement

Agreement: in 3rd person singular the final verb form is different

the flight does the flights do

Occurs also when the verb has a noun acting as a subject

What flight leave in the morning?

One simple solution is to have different productions, but it doubles the size of

the grammar!

S → 3sgAux 3sgNP VP

S → Non3sgAux Non3sgNP VP

Rule proliferation would happen also in the noun case: nominative (I, she, he,

they) and accusative (me, her, him, them) pronouns

Other languages also have gender agreement

20

English: Verb Phrase and Subcategorization

Verb Phrase may contain NPs and PPs and combinations of both

VP → Verb disappear

VP → Verb NP prefer a morning flight

VP → Verb NP PP leave Boston in the morning

VP → Verb PP leaving on Thursday

Sentential complement: an entire embedded sentence, and others

VP → Verb S You said there were two flights that were the cheapest

A VP may include another VPI want to fly from Milwaukee to Orlando

Not every verb is compatible with every verb phrase

21

English: Verb Phrase and Subcategorization

transitive verb: takes a direct object I found a flight

intransitive verb: I

disappeared

modern grammars sub categorizes verbs into more than 100 subcategories

complements of the verbVerb-with-NP-complement → find | leave | repeat | . . .

Verb-with-S-complement → think | believe | say | . . .

Verb-with-Inf-VP-complement → want | try | need | . . .

VP → Verb-with-no-complement disappear

VP → Verb-with-NP-comp NP prefer a morning flight

VP → Verb-with-S-comp S said there

were two flights22

English: Auxiliaries

auxiliary or helping verbs: places a constraint on the form of the following

verb, and must combine in a particular ordermodal verbs: can, could, may, might, must, will, would, shall, should

perfect auxiliary: have

progressive auxiliary: be

passive auxiliary: be

A sentence can have multiple auxiliary verbs, but they must occur in a particular order: modal <

perfect < progressive < passive

modal perfect could have been a contender

modal passive will be married

perfect progressive have been feasting

modal perfect passive might have been prevented

23

English: Auxiliaries

verb group: includes the main verb and its auxiliaries

active sentence: subject is the semantic event described by the verbI prevented a catastrophe

passive sentence: different than other auxiliaries: subject is the undergoer of

the eventa catastrophe was prevented

24

English: Coordination

Conjunctions: conjoins major phrase types

Coordinate noun phrase

NP → NP and NP Please repeat the flights and the costs

VP → VP and VP What flights do you have leaving Denver and arriving in

San Francisco

S → S and S

Metarule: X → X and X

25

Treebanks

A treebank is a corpus in which each sentence is syntactically annotated with

a parse tree

The Penn Treebank Project

26

27

long distance dependencies or syntactic movement are marked

28

Using a treebank as a grammar

In Penn Treebank there are 4,500 rules for expanding VP in every possible

arrangement

VP → VBD PP

VP → VBD PP PP

VP → VBD PP PP PP

VP → VBD PP PP PP PP

VP → VB ADVP PP

...

VP → VBP PP PP PP PP PP ADVP PP

Thousands of NP rules

[DT The] [JJ state-owned] [JJ industrial] [VBG holding] [NN company] [NNP Instituto] [NNP

Nacional] [FW de] [NNP Industria]

NP → DT JJ JJ VBG NN NNP NNP FW NNP

17,500 different rules29

Searching Treebanks

TGrep2

Example: a node which either is the word bush or else ends in the string tree can be expressed

as:

/tree$/|bush

30

31

Heads

Head: the word in the phrase which is grammatically the most important

Dates back to 1914 (Bloomfield)

Central to Head-Driven Phrase Structure Grammar

32

Heads finding

Heads are passed up the parse tree

Finding heads is simple for textbooks examples

Modern linguistic theories of syntax generally include a component that define

heads

33

Head finding rule example

If the last word is tagged POS, return last-word.

Else search from right to left for the first child which is an NN, NNP, NNPS, NX, POS,

or JJR.

Else search from left to right for the first child which is an NP.

Else search from right to left for the first child which is a $, ADJP, or PRN.

Else search from right to left for the first child which is a CD.

Else search from right to left for the first child which is a JJ, JJS, RB or QP.

Else return the last word

34

Grammar equivalence and normal form

A formal language is defined as a possibly infinite set of strings of word

Two grammars are equivalent if they generate the same set of strings

Two CFGs may generate the same language

weak equivalence - do not assign the same phrase structure

strong equivalence - assign the same phrase structure

Normal form: each of productions take the same form

Chomsky Normal Form (CNF)

right-side: two non-terminal symbols, or one terminal symbol

Have binary branching

35

Finite-State and Context-Free Grammars

Adequate models of grammar need to be able to represent the complexities of

a natural language

Why can't we use finite-state methods?

mathematical reason: certains structures present in English make them non regular languages

subjective reason: non expressive

A finite state machine does not have center-embedded recursions

The luggage arrived.

The luggage that the passengers checked arrived.

The luggage that the passengers that the storm delayed checked arrived.

For many practical purposes matching semantic and syntactic rules aren't

necessary, fine-state rules are sufficient36

Dependency Grammars

the syntactic structure of a sentence is described purely in terms of words and

binary semantic or syntactic relations between these words

since 1959 (Tesnière), trace back to ancient Greek and Indian linguistic

traditions

handle languages with relative free word order, such as Czech

37

Similarity between dependency graphs and head structures

38

Categorial grammar

combinatory categorial grammar (CCG)

two types of arguments: functors and arguments, that can be combined

For example, a determiner can be thought of as a function that applies to an N on its right to

produce an NP

39

40

Spoken language syntax

Utterance is the unit for spoken language (fala)

Fragments are common: pauses, non-verbal sounds, partial words

The subject of a sentence is often a pronoun

Disfluencies

words uh and um

word repetitions

restarts

word fragments: wha-, betwee-

Disfluencies are very common: 37% of sentences are disfluent

41

Spoken language syntax

Repair, reparandum, interruption point

Edit terms: you know, I mean, uh, um

Find the interruption point, delete the reparandum, replace with repair

42

Treebanks of Spoken Language

Augmented notation with

disfluencies and other

phenomena

43

Grammars and Human processing

Do people use context-free grammars in their mental processing of language?

It has proved very difficult to find clear-cut evidence that they do.

Constituent: perceptual unitconstituents are often semantic units as well as syntactic units

It is necessary to find evidence for a constituent which is not a semantic unit

ditransitive alternation: change two arguments of a verbThe wealthy widow gave [NP the church] [NP her Mercedes].

The wealthy widow gave [NP her Mercedes] [PP to the church].

44

Summary

In many languages, groups of consecutive words act as a group or a constituent, which can be

modeled by context-free grammars (also known as phrase-structure grammars).

A context-free grammar consists of a set of rules or productions, expressed over a set of non-

terminal symbols and a set of terminal symbols. Formally, a particular context-free language is the

set of strings which can be derived from a particular context-free grammar.

A generative grammar is a traditional name in linguistics for a formal language which is used to

model the grammar of a natural language.

There are many sentence-level grammatical constructions in English; declarative, imperative, yes-no-

question, and wh-question are four very common types, which can be modeled with context-free

rules.

An English noun phrase can have determiners, numbers, quantifiers, and adjective phrases preceding

the head noun, which can be followed by a number of postmodifiers; gerundive VPs, infinitives

VPs, and past participial VPs are common possibilities.

45

Summary

Subjects in English agree with the main verb in person and number.

Verbs can be subcategorized by the types of complements they expect. Simple subcategories are

transitive and intransitive; most grammars include many more categories than these.

The correlate of sentences in spoken language are generally called utterances. Utterances may be

disfluent, containing filled pauses like um and uh, restarts, and repairs.

Treebanks of parsed sentences exist for many genres of English and for many languages. Treebanks

can be searched using tree-search tools.

Any context-free grammar can be converted to Chomsky normal form, in which the right-hand-side

of each rule has either two non-terminals or a single terminal.

Context-free grammars are more powerful than finite-state automata, but it is nonetheless possible

to approximate a context-free grammar with a FSA.

There is some evidence that constituency plays a role in the human processing of language.

46

Thanks!

[email protected]

@cafedosky

47