PROBLEM SET #2 LISP: LANGUAGE AND LITERATURE May 22, 1984
————————————————————————————————————————————
Problem Set #2: The Grammatical Analysis of Expressions
Distributed: May 22, 1984
Solutions due: June 4, 1984
Filed as: [phylum]<3-lisp>course>problem-sets>ps-2a,b.bravo
User.cm: [phylum]<BrianSmith>system>user.classic
Last edited: May 22, 1984 1:07 PM
————————————————————————————————————————————
2-1. Introduction
This problem set focuses on a cluster of issues having to do with the general analysis and interpretation of expressions in ways that correspond to their syntactic or grammatical structure. This includes but is not limited to issues of parsing; we will also explore translation and semantic interpretation routines, for example. Two themes will be of particular importance:
1. The distinction between information implicitly encoded in procedures and computational processes, and information explicitly represented in structures that are manipulated by those procedures and processes.
2. The relationships among procedures that analyse, translate, internalize, and (semantically) interpret grammatical expressions. Although these tasks differ in crucial ways, programs that perform them are often extremely similar in structure. We will exploit similarities and point out differences.
In order to uncover issues that may be of general interest to CSLI researchers, at various points we press the analysis to more depth than might be expected in a first quarter course in programming. Rather than working out half-solutions to the whole problem set, in other words, you should work carefully through as much as you can, stopping when it gets too difficult or you run out of time. A full appreciation of a few of the questions will be much more useful than a partial understanding of all of them.
The kinds of complexity we are interested in do not, in general, have to do with complexities in languages or grammars. For pedagogical reasons, therefore, we consider grammars no more complex than half a dozen context-free rules. The organizational techniques we employ, however, could be generalized for more sophisticated or subtle languages. (When you are done, you may want to think about how you would extend your procedures to cope with subtleties that interest you.)
As mentioned in the introduction to problem set #1, we recommend reading the whole problem set through carefully before starting to work. Also, we will assume familiarity with all of the programming techniques introduced in that problem set.
An assortment of procedure definitions that will be needed in the course of doing this problem set can be loaded by typing:
(load "{turing}<3-lisp.problem-sets>ps-2.3-lisp")
2-2. Some Background Vocabulary
In class, we have consistently distinguished between notations — elements of a consensual medium of interchange between agents — and internal structures, which are internal to a computational process and play a causal role in engendering behaviour. The former are the stuff of standard linguistic analysis: expressions, sentences, subjects of type/token distinctions, etc.; the latter, we have suggested, are the appropriate vehicles of information in terms of which to analyse processes.
Strictly speaking, communication between agents could be carried on in almost any medium at all — air waves, motions, arrangements of sticks on the ground, or whatever. However, we will restrict ourselves to text strings: ordered sequences of characters, of the sort that could be written on a page, or entered into an EMACS buffer on the Dandelion.
In the cases we will consider, expressions will be built up compositionally from a stock of lexical types, in the usual fashion.
There are many things one could say about expressions: what properties they have on their own; what they mean or signify; what their translations would be in another language; etc. For the purposes of this problem set, we will talk about these various aspects of a linguistic expression in terms of the following diagram:
[Figure ps-2-figure-01: a diagram relating the expressions e1 and e2, their meanings d1 and d2, and the information i(e1) and i(e2) about them.]
e1 is assumed to be an expression in some language L1; e2 another expression in language L2. d1 is something like the meaning or significance of e1; similarly d2. In each case, i(e1) is meant to be information about e1; whether that information is encoded in yet another language — i.e., whether i(e1) is itself a linguistic representation, or something more abstract — we won’t say. Very roughly, however, the assumption is that e2 "means" approximately the same thing (in L2) as e1 means (in L1), and that e1 is about d1.
We will use various names for procedures that take us from one place in this diagram to another. More specifically, we will call the relationship between e1 and e2 one of translation; that between anything x and i(x) one of analysis; and that between e and its corresponding d one of interpretation.
In some limiting cases, these procedures are trivial. If language L1 is the same as L2, then the simplest translation procedure is the one that computes the identity function (i.e., e1 = e2 in all cases). In general, however, we expect there to be work associated with computing these relationships. One of the distinguishing characteristics of languages is the fact that these relationships are typically defined compositionally.
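For concreteness, such a trivial translator could be notated in 3-LISP as follows (IDENTITY-TRANSLATE is our own, hypothetical, name; nothing in the problem-set file depends on it):
(define IDENTITY-TRANSLATE
  (lambda [expression]
    expression))          ; in the limiting case, every expression translates as itself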
An obvious question, given a particular expression e, is to ask whether it is a well-formed expression of a language L. A procedure that computes this function — that is, that takes an expression and yields a truth-value indicating whether it is in the language — is called a recognizer. Since whether an expression is in a language is information about it (albeit of a rather simple kind), a recognizer is a particular kind of analysis procedure, going from e to i(e).
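As a minimal illustration (again our own, hypothetical, example), here is a recognizer for the trivial language whose only well-formed expression is ‘yes’:
(define YES-RECOGNIZE
  (lambda [string]
    (= string "yes")))    ; $true just in case the expression is exactly ‘yes’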
More interesting information about an expression e might be an indication of its structure, phrased in terms of some theory of the language Lt. A particularly simple kind of theory, about which a great deal is known, is called a grammar; a description of the structure of an expression in terms of such a theory we will call a parse or derivation, and a procedure that maps expressions onto parses or derivations we will call a parser. Like recognizers, parsers are kinds of analysers.
Rather different are translators, which map expressions in one language onto expressions in another. Taking the 3-LISP structural field as a kind of language, we can view the routines that translate 3-LISP notation into internal structure — the internalization procedures — as a kind of translator. If it were discovered that people’s minds are made of mentalese, then the human linguistic facility would presumably be a species of internalization procedure.
Also, the 3-LISP processor performs an odd sort of translation, of 3-LISP structures onto other 3-LISP structures (although this is of course only one of a number of things that it does). In this case the difference between e1 and e2 lies not in their being in different languages, but in other formal relationships between them — specifically, the latter being a normal-form co-designator.
Finally, functions that map expressions onto their real significance — meaning, designation, denotation, or whatever that significance is taken to be — we call interpretation functions. Since semantics is the study of the relationships between expressions e and their meanings or significance, semantical accounts are primarily about e’s and d’s, and thus are explicit representations of i(e), i(d), and the continuum including both of them (we said earlier we wouldn’t say whether the i(e)’s were linguistically represented in general, but in this case they clearly are). The interpretation functions of model theory are of approximately this type. It is more common in semantic analysis to deal with functions than with procedures, because you can’t normally compute the interpretation in any active sense.
If the stuff of theories is accounts or expressions, those accounts must themselves be expressed in some language, which we will call Lt. Thus, the sentence
The French word ‘chat’ is used to refer to cats.
is a sentence of English that describes the relationship between an expression e ("chat") in L1 (French) and its meaning d; the sentence itself is in Lt (here, English), and presents information i(e) and i(d).
The fact that theoretical accounts are in language Lt is sometimes misinterpreted as implying that semantic relationships are simply a particular kind of language-language relationship. But this is wrong: the relationship itself, between e and d, is between language and something, where that something can be anything at all (in the example just given, for example, d is cats, fur and all). Similarly, if we agree that "Stan" refers to Mr. Rosenschein and that "Stanley" refers to Mr. Peters, then we are agreeing on a relationship between first names (e’s) and people (d’s); the fact that we have described this relationship in English doesn’t make it a linguistic phenomenon itself.
When we write programs that notate procedures to compute various of these relationships, we will have to keep all of these languages straight. The point of this problem set is to examine techniques that serve in particular cases, and techniques that are useful across the board, and in general to develop a facility for talking about all these various relationships. As was mentioned in the introduction, we will not need to deal with very complex languages or grammars in order to illuminate these issues.
2-3. A Lexical Analyser
It is common to separate the grammatical analysis of expressions in a language into two parts: lexical analysis and constituent analysis. A lexical analyser is essentially a pre-processor that carves the expression — which, given our assumption that expressions are strings of characters, will include spaces, carriage returns, etc. — into a series of tokens, each of which plays a specific role in the grammatical analysis of the expression as a whole. Though the distinction isn’t precise, the basic idea is that the lexical analyser will group adjacent letters into words, treat all breaks as equivalent, etc., and in general ignore details of the expression that are irrelevant to the grammatical structure.
Throughout this problem set, we will rely on a particular lexical analyser that maps strings of characters onto sequences of tokens according to the grammar given on page 7 of the 3-LISP manual. For example, it maps the string
‘(+ 1 test)’
onto the five-element (not three!) sequence that we might write in our theoretic language as:
<‘(’, ‘+’, ‘1’, ‘test’, ‘)’>
Since comments don’t matter for 3-LISP processing, this particular lexical analyser happens to ignore comments and white-space; thus the whole expression:
‘(define FOO ; this is a rather inefficient
(* 234 ; way to make FOO designate the number zero.
;;;
;;; ooops -- the phone rang; I’ll be back in a minute
;;;
(- 234 234)))’
(which contains 92 characters) would be mapped onto the sequence of 13 tokens:
<‘(’, ‘define’, ‘FOO’, ‘(’, ‘*’, ‘234’, ‘(’, ‘-’, ‘234’, ‘234’, ‘)’, ‘)’, ‘)’>
Although designed for 3-LISP, the lexical analyser can be (and will be) used for other languages as well. For example, it would map
‘Randy, regrettably, retreated.’
onto the six-element sequence:
<‘Randy’, ‘,’, ‘regrettably’, ‘,’, ‘retreated’, ‘.’>
We have described the lexical analyser using our extended-English theoretic language; there are two ways in which we can use an implementation of it written in 3-LISP (the code is in {turing}<3-lisp.problem-sets>ps-2.3-lisp, which you should have loaded; you shouldn’t have to look at it — the descriptions here are sufficient for all of the problems we will discuss). First, there is a 3-LISP procedure called LEXICAL-ANALYSIS, which takes a string as its argument and returns the corresponding sequence of tokens. For example:
1> (lexical-analysis "(+ 2 foo)")
1= ["(" "+" "2" "foo" ")"]
1> (lexical-analysis " [ $true ; this is a comment
]")
1= ["[" "$" "true" "]"]
In addition, there are two predicates, NUMERAL-EXPRESSION and IDENTIFIER-EXPRESSION, which are useful for checking the category of a particular token. More specifically, the following is one complete set of procedures with which one can use the lexical analyser:
(LEXICAL-ANALYSIS string)—as described above, designates the sequence of tokens comprising the string designated by string.
(NUMERAL-EXPRESSION token)—true just in case token designates a string of digits, except that it may start with a single ‘+’ or ‘-’.
(IDENTIFIER-EXPRESSION token)—true just in case token designates a sequence of alphanumeric characters, not all of which are digits.
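By way of illustration (the following calls are our own examples, constructed from the descriptions just given, writing the boolean results as $true and $false):
1> (numeral-expression "-234")
1= $true
1> (numeral-expression "foo3")
1= $false
1> (identifier-expression "foo3")
1= $true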
There is also, however, a way of using the lexical analyser in the OBJECT style of programming that we have recently explored in class. Specifically, (LEXAN-OBJECT string) will create an object (call it lexan) that supports the following messages:
(CURRENT-TOKEN lexan)—Returns (a designator of) the token currently under analysis (i.e., the one made current by the last use of GET-TOKEN).
(GET-TOKEN lexan)—Makes the next token in the string be current, and returns (a designator of) it. This is the normal way in which tokens will be accessed.
(NEXT-TOKEN lexan)—Returns (a designator of) the next token — the one that the next GET-TOKEN will make current. This is extremely useful for dispatching on a token type without actually "reading it in", as GET-TOKEN would do.
In each case, the atom EOT, for end-of-tokens (which is not a string, and is therefore distinguishable from any other token), is taken as the token past the end of the string; GET-TOKEN and NEXT-TOKEN will simply keep returning ’EOT after all other tokens have been used up.
For example, we would have:
1> (set x (lexan-object "(+ 1 2)"))
1= ...
1> (get-token x)
1= "("
1> (current-token x)
1= "("
1> (next-token x)
1= "+"
1> (get-token x)
1= "+"
1> [(get-token x) (get-token x) (get-token x)]
1= ["1" "2" ")"]
1> (get-token x)
1= ’eot
1> (get-token x)
1= ’eot
We will use both the functional and the object-oriented ways of using the lexical analyser in subsequent examples.
2-4. Two Simple Languages
In this problem set we will use two languages: ARITH and 3-NOTE. ARITH, which supports extremely simple prefix-form arithmetic expressions, is specified by the following context-free grammar (CFG):
expression ← numeral-expression
expression ← complex
complex ← operator "(" expression "," expression ")"
operator ← "+"
operator ← "-"
operator ← "*"
operator ← "/"
[In computer science, BNF ("Backus-Naur Form") grammars are more commonly used than full context-free grammars; the following BNF is essentially equivalent to the previous CFG:
expression ::= numeral-expression | complex
complex ::= operator "(" expression "," expression ")"
operator ::= "+" | "*" | "-" | "/" ].
For example, the following are legitimate expressions of ARITH:
‘234’
‘+(0,0)’
‘*(100 , /(20 , 10) )’
whereas the following are ill-formed:
‘(+ 234 1000)’
‘+(1, 2, 3)’
‘*’
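To see how the grammar generates the well-formed examples, here is one derivation of ‘+(0,0)’ (a worked example of our own; each step rewrites one non-terminal):
expression ⇒ complex
           ⇒ operator "(" expression "," expression ")"
           ⇒ "+" "(" expression "," expression ")"
           ⇒ "+" "(" numeral-expression "," expression ")"
           ⇒ "+" "(" numeral-expression "," numeral-expression ")"
which, with the numeral-expression ‘0’ in both argument positions, yields ‘+(0,0)’. No such derivation exists for ‘(+ 234 1000)’, which is why it is ill-formed.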
3-NOTE, on the other hand, is intended to be a very minimal subset of standard 3-LISP notation. Its CFG is the following (extended to use Kleene’s ‘*’ for zero-or-more repetition, and ‘+’ for one-or-more repetition):
expression ← rail-expression
expression ← pair-expression
expression ← identifier-expression
expression ← numeral-expression
rail-expression ← "[" expression* "]"
pair-expression ← "(" expression+ ")"
For example, the following are legitimate expressions of 3-NOTE:
‘234’
‘(+ 2 3)’
‘(lambda [n]
   (if (= n 0)
       (length [1])
       (* n (factorial (- n 1)))))’
whereas the following are ill-formed:
‘ (+ 2 3]’—mismatched parentheses
‘ ()’—missing expression
‘ 1 2’—more than one expression
‘ ’foo’—since handles aren’t supported
‘↑args’—since up and down arrows aren’t supported
‘(f . a)’—since dots aren’t supported
‘(string-first "This is a test")’—since string quotes aren’t supported.
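Note, incidentally, that the lexical analyser itself enforces none of these constraints; it will happily tokenize an ill-formed expression (our own illustration):
1> (lexical-analysis "(+ 2 3]")
1= ["(" "+" "2" "3" "]"]
Well-formedness is the business of the recognizers developed in the next section.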
2-5. A "Doubly Implicit" ARITH Recognizer in 3-LISP
Given all this machinery, we can delve into examples.
We first look at a series of programs that recognize expressions in ARITH. The following code, which is described here procedure-by-procedure, can be loaded by typing:
(load "{turing}<3-lisp.problem-sets>arith-recognizer.3-lisp")
ARITH-RECOGNIZE is the "top-level" procedure, meaning that it is the one users are intended to call directly.
(define ARITH-RECOGNIZE
  (lambda [string]
    (if (null (dynamic-catch
                (arith-recognize-expression (lexical-analysis string))))
        "Accepted"
        "Rejected")))
It takes a string as its single argument, and returns "Accepted" or "Rejected", to indicate whether or not that string is a well-formed ARITH expression. It uses a strange control operator, DYNAMIC-CATCH, to field recognition errors; its operation will be explained presently, but for the moment it should be ignored. Otherwise, ARITH-RECOGNIZE merely calls ARITH-RECOGNIZE-EXPRESSION with the sequence of tokens into which the lexical analyser converts its argument. If ARITH-RECOGNIZE-EXPRESSION returns a null sequence, the recognition was a success; if it returns a sequence with one or more elements in it, then the recognition has failed. The real work is done by three recognition routines, defined by mutual recursion (ARITH-RECOGNIZE-EXPRESSION, ARITH-RECOGNIZE-ARGS, and ARITH-RECOGNIZE-TOKEN).
;;; ARITH-RECOGNIZE-EXPRESSION Takes a sequence of tokens; returns the
;;; ========================== tail of the sequence after removing tokens
;;; corresponding to exactly one full ARITH
;;; expression. Calls ARITH-ERROR-IN-RECOGNITION
;;; if a syntax error is encountered.
(define ARITH-RECOGNIZE-EXPRESSION
  (lambda [tokens]
    (cond [(member (first tokens) ["+" "*" "-" "/"])
           (arith-recognize-args (rest tokens))]
          [(numeral-expression (first tokens))
           (rest tokens)]
          [$true (arith-error-in-recognition)])))
(define ARITH-RECOGNIZE-ARGS
  (lambda [tail]
    (letseq [[tail (arith-recognize-token "(" tail)]
             [tail (arith-recognize-expression tail)]
             [tail (arith-recognize-token "," tail)]
             [tail (arith-recognize-expression tail)]
             [tail (arith-recognize-token ")" tail)]]
      tail)))
(define ARITH-RECOGNIZE-TOKEN
  (lambda [token tail]
    (if (and (not (null tail))
             (= token (first tail)))
        (rest tail)
        (arith-error-in-recognition))))
The basic strategy of the program is simple: each routine is given as an argument the sequence of tokens remaining in the original expression. Its task is to return that sequence except with an initial segment removed. For example,
(ARITH-RECOGNIZE-TOKEN "(" ["(" "10" "," "20" ")"])
would return
["10" "," "20" ")"]
having stripped off the leading open parenthesis. Similarly,
(ARITH-RECOGNIZE-EXPRESSION ["+" "(" "10" "," "20" ")" "," "12345" ")"])
would return
["," "12345" ")"]
having stripped off ["+" "(" "10" "," "20" ")"], since that is the token sequence for one full expression.
Note the call to NUMERAL-EXPRESSION in ARITH-RECOGNIZE-EXPRESSION: it handles the base case of the grammar, in which an expression is simply a numeral-expression.
There are two kinds of errors that must be treated: too many tokens (for example, the expression ‘+(234, 234))’), and an inappropriate token (for example, the expression ‘+(234 ( 234)’). The former is caught because ARITH-RECOGNIZE checks the finally returned sequence of tokens to make sure it has been exhausted; if any tokens remain, not all of them were used up by the top-level expression, and the string should therefore be rejected. The latter causes the procedure ARITH-ERROR-IN-RECOGNITION to be invoked, which interacts with DYNAMIC-CATCH.
(define ARITH-ERROR-IN-RECOGNITION
  (lambda []
    (dynamic-throw ["something"])))
DYNAMIC-CATCH works as follows: if any code within the scope of the call of the DYNAMIC-CATCH, or any code that that code calls (and so on recursively), calls the procedure DYNAMIC-THROW, then control will be immediately returned to the point of the call to DYNAMIC-CATCH, discarding the whole computation that was being built up in between; the value passed to DYNAMIC-THROW becomes the value of the DYNAMIC-CATCH. Therefore, all that ARITH-ERROR-IN-RECOGNITION needs to do is "throw" a sequence with at least one token in it; since that sequence will automatically be sent all the way back to the call in ARITH-RECOGNIZE, it too will be trapped by the NULL check, and cause the string to be rejected. Note that this means that strings are rejected at the first moment that an illegal or inappropriate token is encountered; no further processing is done.
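As a minimal illustration of DYNAMIC-CATCH and DYNAMIC-THROW on their own (CATCH-DEMO is a hypothetical procedure of our own devising, not part of the problem-set code):
(define CATCH-DEMO
  (lambda []
    (+ 1000 (dynamic-catch (* 2 (dynamic-throw 5))))))
A call to (CATCH-DEMO) should return 1005: the throw discards the pending multiplication by 2 and sends 5 straight back as the value of the DYNAMIC-CATCH, after which the addition proceeds normally.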
ARITH-RECOGNIZE will support the following sorts of behaviour:
1> (arith-recognize "234")
1= "Accepted"
1> (arith-recognize "+(2,3)")
1= "Accepted"
1> (arith-recognize "(+ 2 3)")
1= "Rejected"
1> (arith-recognize "*(10,/(20,30))")
1= "Accepted"
————————————
Another version of this recognizer, this time using the object-oriented approach to the lexical analyser, is called A2-RECOGNIZE in the following listing. Because the lexical-analysis object (consistently bound to the variable s) contains within it the state of the input string, these routines do not need to pass around "tails" of the input expression. A2-RECOGNIZE-EXPRESSION, A2-RECOGNIZE-ARGS, and A2-RECOGNIZE-TOKEN (which correspond to the "ARITH" versions above) simply return the boolean constant $TRUE if they recognize the input expression. In the top-level routine A2-RECOGNIZE, the only way in which ANSWER will be $FALSE is if A2-ERROR-IN-RECOGNITION has been called, which throws $FALSE immediately upon encountering an inappropriate token.
Given the descriptions above of GET-TOKEN and NEXT-TOKEN, you should be able to read and understand this listing without too much difficulty.
;;; A2-RECOGNIZE Main recognition routine. Returns a message
;;; ============ identifying whether or not the STRING encodes a single
;;; well-formed ARITH expression.
(define A2-RECOGNIZE
  (lambda [string]
    (letseq [[s (lexan-object string)]
             [answer (dynamic-catch (a2-recognize-expression s))]]
      (if (and answer (= (next-token s) ’eot))
          "Accepted"
          "Rejected"))))
;;; Ancillary recognition functions:
;;;
;;; The following procedures call A2-ERROR-IN-RECOGNITION immediately upon
;;; finding a grammatical error of any sort; it does not return (since
;;; the procedures below are not designed to do anything sensible about
;;; continuing from errors), but instead does a DYNAMIC-THROW (q.v.).
;;; A2-RECOGNIZE-EXPRESSION Attempts to recognize an arbitrary ARITH
;;; ======================= expression.
(define A2-RECOGNIZE-EXPRESSION
  (lambda [s]
    (cond [(member (next-token s) ["+" "*" "-" "/"])
           (a2-recognize-args s)]
          [(numeral-expression (get-token s)) $true]
          [$true (a2-error-in-recognition)])))
;;; A2-RECOGNIZE-ARGS Attempts to recognize an ARITH args expression
;;; =================
(define A2-RECOGNIZE-ARGS
  (lambda [s]
    (block                               ; BLOCK sequences the sub-recognitions in order
      (get-token s)                      ; skip the operator
      (a2-recognize-token "(" s)
      (a2-recognize-expression s)
      (a2-recognize-token "," s)
      (a2-recognize-expression s)
      (a2-recognize-token ")" s))))
;;; A2-RECOGNIZE-TOKEN Ensures that the next token in s is token
;;; ==================
(define A2-RECOGNIZE-TOKEN
  (lambda [token s]
    (if (not (= token (get-token s)))
        (a2-error-in-recognition)
        $true)))
;;; A2-ERROR-IN-RECOGNITION Signals an error by sending back $false.
;;; =======================
(define A2-ERROR-IN-RECOGNITION
  (lambda []
    (dynamic-throw $false)))
It would support the same sort of behaviour as ARITH-RECOGNIZE.
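For example, assuming behaviour exactly parallel to the earlier transcripts:
1> (a2-recognize "*(10,/(20,30))")
1= "Accepted"
1> (a2-recognize "+(2,3))")
1= "Rejected"
(The second string fails the ’EOT check in A2-RECOGNIZE, since the extra ‘)’ remains after a complete expression has been recognized.)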