As poor as a king without a poet. Irish saying
General notes about formal language grammars and nom.
In the Medieval period in Europe the four pillars of education were Rhetoric, Logic, Theology and Grammar. Today this seems a strange classification of knowledge and study, but Logic and Grammar are the basis of computer science.
I should state here that I am not a great formal-language theorist or high-level mathematician who can give insights into grammar theory. This page is just a set of heuristic notes about things I have discovered while parsing languages with the ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 system.
You can use the begins-with and ends-with modifiers to achieve token lookahead in nom. The scripts /eg/maths.parse.pss and /eg/maths.tolatex.pss have good examples of using lookahead.
The document /doc/howto/nom.lookahead.html has detailed information about how to write lookahead into a nom script.
There is also the peep register, which provides a single-character lookahead and which is not used directly by the script writer. This register is used by the while and whilenot commands to read the input stream only while the next character matches, or does not match, the given criteria.
Associativity is about how we parse, for example, “a + b + c”. Do we parse this as “(a + b) + c” (left-associative) or as “a + (b + c)” (right-associative)?
With nom we need to build this associativity into the grammar. See the arithmetic parser example for an example of associativity (with the + and - signs, which associate right).
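As a toy illustration of the two choices (a Python sketch, not part of nom; the function names are mine), here are the two ways of grouping the tokens of “a + b + c”:

```python
# Grouping "a + b + c" left- or right-associatively.
tokens = ["a", "+", "b", "+", "c"]

def parse_left(toks):
    """Left-associative: fold each operator into the tree so far -> ((a + b) + c)."""
    tree = toks[0]
    i = 1
    while i < len(toks):
        tree = (tree, toks[i], toks[i + 1])  # combine with what we have already
        i += 2
    return tree

def parse_right(toks):
    """Right-associative: recurse on the rest of the input first -> (a + (b + c))."""
    if len(toks) == 1:
        return toks[0]
    return (toks[0], toks[1], parse_right(toks[2:]))

print(parse_left(tokens))   # (('a', '+', 'b'), '+', 'c')
print(parse_right(tokens))  # ('a', '+', ('b', '+', 'c'))
```

In nom the same choice shows up in which parse-token sequences a block matches and reduces first.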
Precedence is about how we parse, for example, “a + b * c”. We say that certain operators have precedence over others. Do we parse this as “(a + b) * c” (giving '+' precedence) or as “a + (b * c)” (giving '*' precedence)?
Again, this has to be factored into the nom grammar. It's not that difficult.
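The usual way to express precedence in a hand-written parser is the precedence-climbing idea, sketched here in Python (again a toy, not nom; the PREC table is an assumption for the example):

```python
# '*' binds tighter than '+': higher number means higher precedence.
PREC = {"+": 1, "*": 2}

def parse_expr(toks, min_prec=1):
    """Precedence climbing: only consume operators at or above min_prec."""
    tree = toks.pop(0)                      # first operand
    while toks and PREC[toks[0]] >= min_prec:
        op = toks.pop(0)
        # parse the right side at a higher level, so '*' groups before '+'
        rhs = parse_expr(toks, PREC[op] + 1)
        tree = (tree, op, rhs)
    return tree

print(parse_expr(["a", "+", "b", "*", "c"]))  # ('a', '+', ('b', '*', 'c'))
```

In a shift-reduce system like nom, the same effect comes from reducing the higher-precedence token sequences before the lower-precedence ones.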
A parser generator is basically a way to avoid having to write a recursive-descent parser (and compiler) or some other style of hand-coded compiler. Lex, Yacc, Bison, ANTLR and many others are examples of parser (and lexer) generators, and they seem to be widely used for simple languages. An old example would be the EQN language for formatting mathematical expressions.
The Kleene star is a grammar construct which basically means “zero or more of something”. It exists in regular expressions:
s/ab*c//
This means: 'a', followed by zero or more 'b's, followed by 'c'. A similar repetition construct is used in EBNF grammars.
The rule below means that a 'statement' is a 'keyword', followed by zero or more 'parameters', followed by a semicolon.
statement = keyword parameter* ';' ;
How can we translate this into ℕ𝕠𝕞? We cannot create a single block which represents the rule above, but by using several blocks we can create a nom fragment which represents it.
The code below represents a complete “lexer” and “recogniser” for statements such as the following. Notice that white-space is completely ignored, except within parameters.
say 'good'; fizz; bang 'crash' ;
whirr 'ok' 'okay'; grow'big' 'small' 'tiny' ;
bird
'tweet' 'fly '
'nest' ;
The script that follows is a recogniser, because it doesn't actually compile/transpile/transform the input text; it just determines whether the input text consists of valid “statements”. Also notice that the single EBNF rule with a Kleene star is broken up into 3 ℕ𝕠𝕞 blocks (statements between braces).
statement = keyword ';' ;
statement = keyword parameter ';' ;
statement = keyword parameterset ';' ;
parameterset = parameter parameter ;
parameterset = parameterset parameter ;
Although there are 5 EBNF rules above, ℕ𝕠𝕞 only requires 3 blocks, because it can concatenate tests with OR logic (using the ',' comma character).
# "lexing"
read;
# literal token, the character is also the parse-token
';' { add "*"; push; }
[:space:] { while [:space:]; clear; }
[:alpha:] { while [:alpha:]; put; clear; add "keyword*"; push; }
"'" { until "'"; put; clear; add "parameter*"; push; }
!"" {
put; clear;
add "?? bad char ["; get; add "]\n";
print; quit;
}
# The nom version of the ebnf rule
# statement = keyword parameter* ';' ;
parse>
# show the stack reductions for debugging
# add "line "; lines; add " char "; chars; add ": "; print; clear;
# unstack; print; stack; add "\n"; print; clear;
pop;
pop;
"keyword*;*"
{ clear; add "statement*"; push; .reparse }
"parameter*parameter*","paramset*parameter*"
{ clear; add "paramset*"; push; .reparse }
pop;
"keyword*parameter*;*","keyword*paramset*;*" {
clear; add "statement!\n"; print;
clip; clip; add "*";
push; .reparse
}
push;push;push;
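For contrast, and because this particular statement grammar happens to be regular, here is the same recogniser sketched as a single Python regular expression (illustrative only, not part of the nom system; the names are mine):

```python
import re

# The whole "statement" language as one pattern, repeated:
# a keyword, zero or more quoted parameters, then a semicolon.
STATEMENTS = re.compile(r"(\s*[A-Za-z]+(\s*'[^']*')*\s*;)*\s*")

def recognise(text):
    """Return 'yes' if the text consists only of valid statements."""
    return "yes" if STATEMENTS.fullmatch(text) else "no"

print(recognise("say 'good'; fizz; bang 'crash' ;"))  # yes
print(recognise("say good;"))                         # no (unquoted parameter)
```

The regular expression stops being an option as soon as the grammar stops being regular (for example, if statements could nest); that is where the nom stack earns its keep.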
Now take a simple regular expression and attempt to translate it into nom:
ab*c
This needs to be recognised with a much longer nom script. The script below will no doubt seem ridiculously verbose, but the strength of ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 is not in parsing regular languages.
read;
"a" { put; clear; add "A*"; push; }
"b" { put; clear; add "B*"; push; }
"c" { put; clear; add "C*"; push; }
!"" {
clear; add "no"; print; quit;
}
parse>
pop;pop;
"B*B*" {
clear; get; ++; get; --; put;
clear; add "B*"; push;
}
"A*C*" {
clear; get; ++; get; --; put;
clear; add "pattern*"; push; .reparse
}
pop;
"A*B*C*" {
clear; get; ++; get; ++; get; --; --; put;
clear; add "pattern*"; push; .reparse
}
(eof) {
!"pattern*" {
clear; add "no"; print; quit;
}
"pattern*" {
clear; add "yes"; print; quit;
}
}
push;push;push;
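By way of comparison, this Python sketch (not part of nom) makes exactly the same yes/no decision as the recogniser above, in a couple of lines:

```python
import re

def recognise(text):
    # "yes" when the whole input is 'a', zero or more 'b's, then 'c'
    return "yes" if re.fullmatch(r"ab*c", text) else "no"

print(recognise("abbc"))  # yes
print(recognise("abca"))  # no
```

This is the sense in which nom is at its weakest on regular languages: the machinery of stack, tape and workspace only pays off once the language is no longer regular.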
Nom can only match or recognise a fixed number of parse-tokens at any point in the script. This is because it needs to pop the parse-tokens off the stack before it can match parse-token sequences. This is why EBNF rules of the form
a = b+ c ;
a = c b* ;
need to be broken up into several ℕ𝕠𝕞 blocks, because b+ and b* represent a variable number of parse-tokens.
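For example, the rule “a = b+ c ;” can be rewritten without the repetition operator, in the same style as the 'parameterset' rules earlier (the name 'bset' is just an illustrative choice):
bset = b ;
bset = bset b ;
a = bset c ;
Each of these plain BNF rules matches a fixed number of tokens, so each maps directly onto a ℕ𝕠𝕞 match test.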
LL grammars (these, I believe, are the grammars that recursive-descent parsers can handle).
Although grammar theorists talk about all sorts of different types of parsing, in practice the type of parser written by jobbing programmers seems to be recursive descent. There is a very surprising statement on Robert Nystrom's blog that all compilers for modern languages in common use are recursive descent. Mr. Nystrom is involved in designing the Dart language and has written a great book about parsing and compiling, so I am pretty convinced he knows what he is talking about. Recursive descent is also the type of parser/compiler that seems to be taught in university comp-sci courses. Niklaus Wirth, of Pascal fame, was a strong proponent of this type of parser/compiler.
I don't wish to criticise recursive-descent parsers, because they obviously work, but it does seem odd to me that this is still the main way to recognise and translate formal languages.
ℕ𝕠𝕞 does not do recursive-descent parsing: in fact I wrote nom because I wanted a way to understand language grammars that was more “natural” for my way of thinking. The ℙ𝕖𝕡 & ℕ𝕠𝕞 system does shift-reduce parsing (www.google.com/search?q=shift+reduce+parsing), but it does it purely as a (unix-style) text-stream filter.
The ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 lexing/parsing/compiling process is as follows:
In the lexical-analysis phase of the nom script, ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 reads the input stream (more or less) one Unicode character at a time, and creates parse-tokens and their “attributes” in a text buffer. The parse-tokens are pushed onto a stack, and the “attributes” (the partially compiled/translated text) are put into the tape. Then, during the parsing or “shift-reduce” phase of the script, the parse-tokens are popped off the stack back into the workspace buffer. If a particular parse-token sequence is matched or recognised, then the workspace is cleared, and the token attributes are got from the tape, compiled, and put back onto the tape. Then the workspace (string) buffer is cleared and the new parse-token is created in the workspace with the add command. Finally, the new parse-token (or tokens) is pushed onto the stack.
This entire process is text-oriented, meaning that everything is a “string”, including the parse-tokens; the parse-tokens and their attributes are manipulated in the same virtual machine with the same commands.
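The pop/match/push cycle described above can be sketched in a few lines of Python (a toy illustration, not the nom implementation; the rule table and token names are invented for the example):

```python
# A toy shift-reduce loop: tokens are shifted onto a stack; whenever the
# top of the stack matches a rule's right-hand side, those tokens are
# popped and replaced (reduced) by the rule's left-hand side.
RULES = {
    ("article", "noun"): "nounphrase",
    ("nounphrase", "verb"): "sentence",
}

def parse(tokens):
    stack = []
    for tok in tokens:
        stack.append(tok)                  # shift
        reduced = True
        while reduced:                     # keep reducing while anything matches
            reduced = False
            for rhs, lhs in RULES.items():
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]  # pop the matched parse-tokens
                    stack.append(lhs)      # push the new parse-token
                    reduced = True
    return stack

print(parse(["article", "noun", "verb"]))  # ['sentence']
```

In nom the stack, the match tests and the reductions are spelled out with the pop, quoted-test and push commands, and the token attributes travel alongside on the tape.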
Grammar rules are often written in a BNF style like this:
E := E + E
E := T
My knowledge of formal-language grammar theory is quite limited; I am more interested in practical techniques. But there is a reasonably close correlation between BNF-type grammar rules and ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 script construction.
The right-hand-side of a (E)BNF grammar rule is represented by the quoted text before a brace block, and the left-hand-side correlates to the new token pushed onto the stack.
"article*noun*" { clear; add "nounphrase*"; push; }
Terminology: “productions” are the same as “rules”; grammars have “terminals” and “non-terminals”, and tokens correspond to terminals. BNF stands for Backus-Naur Form.
There is a close relationship between virtual machines, automata, grammars and formal languages.