Upload
luciano-sclovsky
View
305
Download
0
Embed Size (px)
Citation preview
Tópicos Especiais em Computação: Introdução aoProcessamento da Linguagem Natural - Profa. Dra. Aline Villavicencio
Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition - Daniel Jurafsky & James H. Martin.
Chapter 12 - FORMAL GRAMMARS OF ENGLISH
por Luciano Sclovsky - setembro de 2015
1
Introduction
Study of grammar dates back to Panini's study of Sanskrit
The grammar is wrong!
Syntax (greek) - setting out together or arrangement
ATIS domain example (Air Traffic Information Systems)
3
Main ideas in this chapter
Constituent (such as noun phrase)
Grammatical relations (such as subject and object from traditional grammar)
Subcategorization and dependency relations - kinds of relations between
words and phrases
Context Free Grammars are the backbone of formal models
grammar checking, semantic interpretation, dialogue understanding, machine translation
4
Constituency
Noun phrase: sequence of words around at least one noun
Can appear in similar syntactic environments (such as before a verb)
They
Harry the horse
a high-class spot such as Mandy's
Preposed
On september 17th, I'd like to fly from Atlanta to Denver
Postposed
I'd like to fly from Atlanta to Denver on september 17th,
5
Context-Free Grammars (CFG)
The most commonly used mathematical system
AKA: Phrase-Structure Grammar
Equivalent to: Backus-Naur Form (BNF)
Dates back to 1900 (psy Wilhelm Wundt), formalized by Chomsky (1956)
is a set of rules or productions
expresses the way symbols can be grouped and ordered together
includes a lexicon of words
can be hierarchically embedded
example
6
Context-Free Grammars (CFG)
terminals: correspond to words in the language
non-terminals: expresses generalization
production:
left: non-terminal
right: ordered list of terminals and non-terminals
the production associated to a single word is associated with its lexical category
A CFG can be thought as
a device for generating sentenced
a device for assigning structure to a given sentence
7
Context-Free Grammars (CFG)
NP immediately dominates Det and Nom
NP dominates all nodes
Each CFG has a start symbol: often S (sentence node)
Example:
sentence can consist of a noun phrase followed by VERB PHRASE by a verb phrase
S → NP VP I prefer a morning flight
VP → Verb NP prefer a morning flight
VP → Verb NP PP leave Boston in the morning
VP → Verb PP leaving on Thursday
PP → Preposition NP from Los Angeles (locations, times, dates) 8
Formal definition of a CFG
Parsing: mapping a string of words into a parse tree
Direct derivation: product of a single production
Derivation: product of n productions
11
Some Grammar rules for English: Sentence Level Constructions
Four important sentence structures:
declarative:
subject NP + VP I prefer a morning flight
imperative
S → VP (and no subject) Show the lowest fare
yes-no question
S → Aux NP VP (aux verb) Do any of these flights have stops
wh-question 12
Some Grammar rules for English: Sentence Level Constructions
wh-question
mainly wh-subject-question: identical to declarative, with the wh-word
S → Wh-NP VP What airlines fly from Burbank to Denver?
wh-non-subject-question: sentence includes another subject
S → Wh-NP Aux NP VP What flights do you have from Burbank to Tacoma Washington?
wh-non-subject-question is an example of a long-distance dependencysemantic interpretation: parse that flights is the argument of have
13
English: Clauses and Sentences
Clause: forms a complete thought
a clause is the smallest grammatical unit that can express a complete proposition (Wikipedia)
Sentence: may form larger structures
S may be in the right side of a production
14
English: the Noun Phrase
Most frequent types: pronouns, proper-nouns and the NP construction
NP → Det Nominal
Determiner: NP may begin with simple lexical determiners
a stop these flights some flights
may have more complex structures (recursive)
Det → NP ′s Denver's mayor's mother's canceled flight
Nominal: follows the determiner and contains pre- and post-head noun
modifiers
Nominal → Noun15
English: the Noun Phrase
Determiner: before the Head Noun
cardinal numbers: two friends
ordinal numbers: the first one the last flight
quantifiers: many fares much luggage
adjective phrase: the least expensive fare
NP → (Det) (Card) (Ord) (Quant) (AP) Nominal
16
English: the Noun Phrase
Determiner: after the Head Noun (postmodifiers)
prepositional phrases: all flights from Cleveland
common in ATIS corpus to mark origin and destination
Nominal → Nominal PP
non-finite clauses any flights arriving after eleven a.m.
gerundive (-ing, -ed) any of those leaving on Thursday
Nominal → Nominal GerundVP
infinitive the last flight to arrive in Boston
relative clauses a flight that serves breakfast
begins with a relative pronoun (mostly: that and who)
17
English: the Noun Phrase
predeterminer: word classes that modify and appear before the NP
all the flights
18
English: the Agreement
Agreement: in 3rd person singular the final verb form is different
the flight does the flights do
Occurs also when the verb has a noun acting as a subject
What flight leave in the morning?
One simple solution is to have different productions, but it doubles the size of
the grammar!
S → 3sgAux 3sgNP VP
S → Non3sgAux Non3sgNP VP
Rule proliferation would happen also in the noun case: nominative (I, she, he,
they) and accusative (me, her, him, them) pronouns
Other languages also have gender agreement
20
English: Verb Phrase and Subcategorization
Verb Phrase may contain NPs and PPs and combinations of both
VP → Verb disappear
VP → Verb NP prefer a morning flight
VP → Verb NP PP leave Boston in the morning
VP → Verb PP leaving on Thursday
Sentential complement: an entire embedded sentence, and others
VP → Verb S You said there were two flights that were the cheapest
A VP may include another VPI want to fly from Milwaukee to Orlando
Not every verb is compatible with every verb phrase
21
English: Verb Phrase and Subcategorization
transitive verb: takes a direct object I found a flight
intransitive verb: I
disappeared
modern grammars sub categorizes verbs into more than 100 subcategories
complements of the verbVerb-with-NP-complement → find | leave | repeat | . . .
Verb-with-S-complement → think | believe | say | . . .
Verb-with-Inf-VP-complement → want | try | need | . . .
VP → Verb-with-no-complement disappear
VP → Verb-with-NP-comp NP prefer a morning flight
VP → Verb-with-S-comp S said there
were two flights22
English: Auxiliaries
auxiliary or helping verbs: places a constraint on the form of the following
verb, and must combine in a particular ordermodal verbs: can, could, may, might, must, will, would, shall, should
perfect auxiliary: have
progressive auxiliary: be
passive auxiliary: be
A sentence can have multiple auxiliary verbs, but they must occur in a particular order: modal <
perfect < progressive < passive
modal perfect could have been a contender
modal passive will be married
perfect progressive have been feasting
modal perfect passive might have been prevented
23
English: Auxiliaries
verb group: includes the main verb and its auxiliaries
active sentence: subject is the semantic event described by the verbI prevented a catastrophe
passive sentence: different than other auxiliaries: subject is the undergoer of
the eventa catastrophe was prevented
24
English: Coordination
Conjunctions: conjoins major phrase types
Coordinate noun phrase
NP → NP and NP Please repeat the flights and the costs
VP → VP and VP What flights do you have leaving Denver and arriving in
San Francisco
S → S and S
Metarule: X → X and X
25
Treebanks
A treebank is a corpus in which each sentence is syntactically annotated with
a parse tree
The Penn Treebank Project
26
Using a treebank as a grammar
In Penn Treebank there are 4,500 rules for expanding VP in every possible
arrangement
VP → VBD PP
VP → VBD PP PP
VP → VBD PP PP PP
VP → VBD PP PP PP PP
VP → VB ADVP PP
...
VP → VBP PP PP PP PP PP ADVP PP
Thousands of NP rules
[DT The] [JJ state-owned] [JJ industrial] [VBG holding] [NN company] [NNP Instituto] [NNP
Nacional] [FW de] [NNP Industria]
NP → DT JJ JJ VBG NN NNP NNP FW NNP
17,500 different rules29
Searching Treebanks
TGrep2
Example: a node which either is the word bush or else ends in the string tree can be expressed
as:
/tree$/|bush
30
Heads
Head: the word in the phrase which is grammatically the most important
Dates back to 1914 (Bloomfield)
Central to Head-Driven Phrase Structure Grammar
32
Heads finding
Heads are passed up the parse tree
Finding heads is simple for textbooks examples
Modern linguistic theories of syntax generally include a component that define
heads
33
Head finding rule example
If the last word is tagged POS, return last-word.
Else search from right to left for the first child which is an NN, NNP, NNPS, NX, POS,
or JJR.
Else search from left to right for the first child which is an NP.
Else search from right to left for the first child which is a $, ADJP, or PRN.
Else search from right to left for the first child which is a CD.
Else search from right to left for the first child which is a JJ, JJS, RB or QP.
Else return the last word
34
Grammar equivalence and normal form
A formal language is defined as a possibly infinite set of strings of word
Two grammars are equivalent if they generate the same set of strings
Two CFGs may generate the same language
weak equivalence - do not assign the same phrase structure
strong equivalence - assign the same phrase structure
Normal form: each of productions take the same form
Chomsky Normal Form (CNF)
right-side: two non-terminal symbols, or one terminal symbol
Have binary branching
35
Finite-State and Context-Free Grammars
Adequate models of grammar need to be able to represent the complexities of
a natural language
Why can't we use finite-state methods?
mathematical reason: certains structures present in English make them non regular languages
subjective reason: non expressive
A finite state machine does not have center-embedded recursions
The luggage arrived.
The luggage that the passengers checked arrived.
The luggage that the passengers that the storm delayed checked arrived.
For many practical purposes matching semantic and syntactic rules aren't
necessary, fine-state rules are sufficient36
Dependency Grammars
the syntactic structure of a sentence is described purely in terms of words and
binary semantic or syntactic relations between these words
since 1959 (Tesnière), trace back to ancient Greek and Indian linguistic
traditions
handle languages with relative free word order, such as Czech
37
Categorial grammar
combinatory categorial grammar (CCG)
two types of arguments: functors and arguments, that can be combined
For example, a determiner can be thought of as a function that applies to an N on its right to
produce an NP
39
Spoken language syntax
Utterance is the unit for spoken language (fala)
Fragments are common: pauses, non-verbal sounds, partial words
The subject of a sentence is often a pronoun
Disfluencies
words uh and um
word repetitions
restarts
word fragments: wha-, betwee-
Disfluencies are very common: 37% of sentences are disfluent
41
Spoken language syntax
Repair, reparandum, interruption point
Edit terms: you know, I mean, uh, um
Find the interruption point, delete the reparandum, replace with repair
42
Grammars and Human processing
Do people use context-free grammars in their mental processing of language?
It has proved very difficult to find clear-cut evidence that they do.
Constituent: perceptual unitconstituents are often semantic units as well as syntactic units
It is necessary to find evidence for a constituent which is not a semantic unit
ditransitive alternation: change two arguments of a verbThe wealthy widow gave [NP the church] [NP her Mercedes].
The wealthy widow gave [NP her Mercedes] [PP to the church].
44
Summary
In many languages, groups of consecutive words act as a group or a constituent, which can be
modeled by context-free grammars (also known as phrase-structure grammars).
A context-free grammar consists of a set of rules or productions, expressed over a set of non-
terminal symbols and a set of terminal symbols. Formally, a particular context-free language is the
set of strings which can be derived from a particular context-free grammar.
A generative grammar is a traditional name in linguistics for a formal language which is used to
model the grammar of a natural language.
There are many sentence-level grammatical constructions in English; declarative, imperative, yes-no-
question, and wh-question are four very common types, which can be modeled with context-free
rules.
An English noun phrase can have determiners, numbers, quantifiers, and adjective phrases preceding
the head noun, which can be followed by a number of postmodifiers; gerundive VPs, infinitives
VPs, and past participial VPs are common possibilities.
45
Summary
Subjects in English agree with the main verb in person and number.
Verbs can be subcategorized by the types of complements they expect. Simple subcategories are
transitive and intransitive; most grammars include many more categories than these.
The correlate of sentences in spoken language are generally called utterances. Utterances may be
disfluent, containing filled pauses like um and uh, restarts, and repairs.
Treebanks of parsed sentences exist for many genres of English and for many languages. Treebanks
can be searched using tree-search tools.
Any context-free grammar can be converted to Chomsky normal form, in which the right-hand-side
of each rule has either two non-terminals or a single terminal.
Context-free grammars are more powerful than finite-state automata, but it is nonetheless possible
to approximate a context-free grammar with a FSA.
There is some evidence that constituency plays a role in the human processing of language.
46