As poor as a king without a poet. Irish saying
General notes about formal language grammars and nom.
In the Medieval period in Europe the four pillars of education were Rhetoric, Logic, Theology and Grammar. Today this seems a strange classification of knowledge and study, but Logic and Grammar are the basis of computer science.
I should state here that I am not a great formal-language theorist or high-level mathematician who can give insights into grammar theory. This page is just a set of heuristic notes about things I have discovered while parsing languages with the ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 system.
You can use the begins-with and ends-with modifiers to achieve token lookahead in nom. The scripts /eg/maths.parse.pss and /eg/maths.tolatex.pss have good examples of using lookahead.
The document /doc/howto/nom.lookahead.html has detailed information about how to write lookahead into a nom script.
There is also the peep register, which provides a single-character lookahead and which is not used directly by the script writer. This register is used by the while and whilenot commands to read the input stream only while the next character matches, or does not match, the given criteria.
Associativity is about how we parse, for example, “a + b + c”. Do we parse this as “(a + b) + c” (left-associative) or as “a + (b + c)” (right-associative)?
With nom we need to build this associativity into the grammar. See the arithmetic parser example for an example of associativity (with the + and - signs, which associate right).
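As a toy illustration of the two choices (a Python sketch, not part of nom; the function names are mine), here are the two ways of grouping the tokens of “a + b + c”:

```python
# Grouping "a + b + c" left- or right-associatively.
tokens = ["a", "+", "b", "+", "c"]

def parse_left(toks):
    """Left-associative: fold each operator into the tree so far -> ((a + b) + c)."""
    tree = toks[0]
    i = 1
    while i < len(toks):
        tree = (tree, toks[i], toks[i + 1])  # combine with what we have already
        i += 2
    return tree

def parse_right(toks):
    """Right-associative: recurse on the rest of the input first -> (a + (b + c))."""
    if len(toks) == 1:
        return toks[0]
    return (toks[0], toks[1], parse_right(toks[2:]))

print(parse_left(tokens))   # (('a', '+', 'b'), '+', 'c')
print(parse_right(tokens))  # ('a', '+', ('b', '+', 'c'))
```

In nom the same choice shows up in which parse-token sequences a block matches and reduces first.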
Precedence is about how we parse, for example, “a + b * c”. We say that certain operators have precedence over others. Do we parse this as “(a + b) * c” (giving '+' precedence) or as “a + (b * c)” (giving '*' precedence)?
Again, this has to be factored into the nom grammar. It's not that difficult.
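The usual way to express precedence in a hand-written parser is the precedence-climbing idea, sketched here in Python (again a toy, not nom; the PREC table is an assumption for the example):

```python
# '*' binds tighter than '+': higher number means higher precedence.
PREC = {"+": 1, "*": 2}

def parse_expr(toks, min_prec=1):
    """Precedence climbing: only consume operators at or above min_prec."""
    tree = toks.pop(0)                      # first operand
    while toks and PREC[toks[0]] >= min_prec:
        op = toks.pop(0)
        # parse the right side at a higher level, so '*' groups before '+'
        rhs = parse_expr(toks, PREC[op] + 1)
        tree = (tree, op, rhs)
    return tree

print(parse_expr(["a", "+", "b", "*", "c"]))  # ('a', '+', ('b', '*', 'c'))
```

In a shift-reduce system like nom, the same effect comes from reducing the higher-precedence token sequences before the lower-precedence ones.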
A parser generator is basically a way to avoid having to write a recursive-descent parser (and compiler) or some other style of hand-coded compiler. Lex, Yacc, Bison, ANTLR and many others are examples of parser (and lexer) generators, and they seem to be widely used for simple languages. An old example would be the EQN language for formatting mathematical expressions.
The Kleene star is a grammar construct which basically means “zero or more of something”. It exists in regular expressions:
s/ab*c//
This means: 'a', followed by zero or more 'b's, followed by 'c'. A similar repetition construct is used in EBNF grammars.
The rule below means that a 'statement' is a 'keyword', followed by zero or more 'parameters', followed by a semicolon.
statement = keyword parameter* ';' ;
How can we translate this into ℕ𝕠𝕞? We cannot create a single block which represents the rule above, but by using several blocks we can create a nom fragment which represents it.
The code below represents a complete “lexer” and “recogniser” for statements such as the following. Notice that white-space is completely ignored, except within parameters.
say 'good'; fizz; bang 'crash' ;
whirr 'ok' 'okay'; grow'big' 'small' 'tiny' ;
bird
'tweet' 'fly '
'nest' ;
The script that follows is a recogniser, because it doesn't actually compile/transpile/transform the input text; it just determines whether the input text consists of valid “statements”. Also notice that the single EBNF rule with a Kleene star is broken up into 3 ℕ𝕠𝕞 blocks (statements between braces).
statement = keyword ';' ;
statement = keyword parameter ';' ;
statement = keyword parameterset ';' ;
parameterset = parameter parameter ;
parameterset = parameterset parameter ;
Although there are 5 EBNF rules above, ℕ𝕠𝕞 only requires 3 blocks, because it can concatenate tests with OR logic (using the ',' comma character).
# "lexing"
read;
# literal token, the character is also the parse-token
';' { add "*"; push; }
[:space:] { while [:space:]; clear; }
[:alpha:] { while [:alpha:]; put; clear; add "keyword*"; push; }
"'" { until "'"; put; clear; add "parameter*"; push; }
!"" {
put; clear;
add "?? bad char ["; get; add "]\n";
print; quit;
}
# The nom version of the ebnf rule
# statement = keyword parameter* ';' ;
parse>
# show the stack reductions for debugging
# add "line "; lines; add " char "; chars; add ": "; print; clear;
# unstack; print; stack; add "\n"; print; clear;
pop;
pop;
"keyword*;*"
{ clear; add "statement*"; push; .reparse }
"parameter*parameter*","paramset*parameter*"
{ clear; add "paramset*"; push; .reparse }
pop;
"keyword*parameter*;*","keyword*paramset*;*" {
clear; add "statement!\n"; print;
clip; clip; add "*";
push; .reparse
}
push;push;push;
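For contrast, and because this particular statement grammar happens to be regular, here is the same recogniser sketched as a single Python regular expression (illustrative only, not part of the nom system; the names are mine):

```python
import re

# The whole "statement" language as one pattern, repeated:
# a keyword, zero or more quoted parameters, then a semicolon.
STATEMENTS = re.compile(r"(\s*[A-Za-z]+(\s*'[^']*')*\s*;)*\s*")

def recognise(text):
    """Return 'yes' if the text consists only of valid statements."""
    return "yes" if STATEMENTS.fullmatch(text) else "no"

print(recognise("say 'good'; fizz; bang 'crash' ;"))  # yes
print(recognise("say good;"))                         # no (unquoted parameter)
```

The regular expression stops being an option as soon as the grammar stops being regular (for example, if statements could nest); that is where the nom stack earns its keep.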
Now take a simple regular expression and attempt to translate it into nom:
ab*c
This needs to be recognised with a much longer nom script. The script below will no doubt seem ridiculously verbose, but the strength of ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 is not in parsing regular languages.
read;
"a" { put; clear; add "A*"; push; }
"b" { put; clear; add "B*"; push; }
"c" { put; clear; add "C*"; push; }
!"" {
clear; add "no"; print; quit;
}
parse>
pop;pop;
"B*B*" {
clear; get; ++; get; --; put;
clear; add "B*"; push;
}
"A*C*" {
clear; get; ++; get; --; put;
clear; add "pattern*"; push; .reparse
}
pop;
"A*B*C*" {
clear; get; ++; get; ++; get; --; --; put;
clear; add "pattern*"; push; .reparse
}
(eof) {
!"pattern*" {
clear; add "no"; print; quit;
}
"pattern*" {
clear; add "yes"; print; quit;
}
}
push;push;push;
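By way of comparison, this Python sketch (not part of nom) makes exactly the same yes/no decision as the recogniser above, in a couple of lines:

```python
import re

def recognise(text):
    # "yes" when the whole input is 'a', zero or more 'b's, then 'c'
    return "yes" if re.fullmatch(r"ab*c", text) else "no"

print(recognise("abbc"))  # yes
print(recognise("abca"))  # no
```

This is the sense in which nom is at its weakest on regular languages: the machinery of stack, tape and workspace only pays off once the language is no longer regular.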
Nom can only match or recognise a fixed number of parse-tokens at any point in the script. This is because it needs to pop the parse-tokens off the stack before it can match parse-token sequences. This is why EBNF rules of the form
a = b+ c ;
a = c b* ;
need to be broken up into several ℕ𝕠𝕞 blocks, because b+ and b* represent a variable number of parse-tokens.
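For example, the rule “a = b+ c ;” can be rewritten without the repetition operator, in the same style as the 'parameterset' rules earlier (the name 'bset' is just an illustrative choice):
bset = b ;
bset = bset b ;
a = bset c ;
Each of these plain BNF rules matches a fixed number of tokens, so each maps directly onto a ℕ𝕠𝕞 match test.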
LL grammars (these, I believe, are the grammars that recursive-descent parsers can handle).
Although grammar theorists talk about all sorts of different types of parsing, in practice the type of parser written by jobbing programmers seems to be recursive descent. There is a very surprising statement on Robert Nystrom's blog that all compilers for modern languages in common use are recursive descent. Mr. Nystrom is involved in designing the Dart language and has written a great book about parsing and compiling, so I am pretty convinced he knows what he is talking about. Recursive descent is also the type of parser/compiler that seems to be taught in university comp-sci courses. Niklaus Wirth, of Pascal fame, was a strong proponent of this type of parser/compiler.
I don't wish to criticise recursive-descent parsers, because they obviously work, but it does seem odd to me that this is still the main way to recognise and translate formal languages.
ℕ𝕠𝕞 does not do recursive-descent parsing: in fact I wrote nom because I wanted a way to understand language grammars that was more “natural” for my way of thinking. The ℙ𝕖𝕡 & ℕ𝕠𝕞 system does shift-reduce parsing (www.google.com/search?q=shift+reduce+parsing), but it does it purely as a (unix-style) text-stream filter.
The ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 lexing/parsing/compiling process is as follows:
In the lexical-analysis phase of the nom script, ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 reads the input stream (more or less) one Unicode character at a time, and creates parse-tokens and their “attributes” in a text buffer. The parse-tokens are pushed onto a stack, and the “attributes” (the partially compiled/translated text) are put into the tape. Then, during the parsing or “shift-reduce” phase of the script, the parse-tokens are popped off the stack back into the workspace buffer. If a particular parse-token sequence is matched or recognised, then the workspace is cleared, and the token attributes are got from the tape, compiled, and put back onto the tape. Then the workspace (string) buffer is cleared and the new parse-token is created in the workspace with the add command. Finally, the new parse-token (or tokens) is pushed onto the stack.
This entire process is text-oriented, meaning that everything is a “string”, including the parse-tokens; the parse-tokens and their attributes are manipulated in the same virtual machine with the same commands.
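The pop/match/push cycle described above can be sketched in a few lines of Python (a toy illustration, not the nom implementation; the rule table and token names are invented for the example):

```python
# A toy shift-reduce loop: tokens are shifted onto a stack; whenever the
# top of the stack matches a rule's right-hand side, those tokens are
# popped and replaced (reduced) by the rule's left-hand side.
RULES = {
    ("article", "noun"): "nounphrase",
    ("nounphrase", "verb"): "sentence",
}

def parse(tokens):
    stack = []
    for tok in tokens:
        stack.append(tok)                  # shift
        reduced = True
        while reduced:                     # keep reducing while anything matches
            reduced = False
            for rhs, lhs in RULES.items():
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]  # pop the matched parse-tokens
                    stack.append(lhs)      # push the new parse-token
                    reduced = True
    return stack

print(parse(["article", "noun", "verb"]))  # ['sentence']
```

In nom the stack, the match tests and the reductions are spelled out with the pop, quoted-test and push commands, and the token attributes travel alongside on the tape.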
Grammar rules are often written in a BNF style like this:
E := E + E
E := T
My knowledge of formal-language grammar theory is quite limited; I am more interested in practical techniques. But there is a reasonably close correlation between BNF-type grammar rules and ℙ𝕖𝕡 🙵 ℕ𝕠𝕞 script construction.
The right-hand-side of a (E)BNF grammar rule is represented by the quoted text before a brace block, and the left-hand-side correlates to the new token pushed onto the stack.
"article*noun*" { clear; add "nounphrase*"; push; }
Terminology: “productions” are the same as “rules”; grammars have “terminals” and “non-terminals”, and tokens correspond to terminals. BNF stands for Backus-Naur Form.
There is a close relationship between virtual machines, automata, grammars and formal languages.