In this chapter we give a short overview of further aspects and possibilities offered by Styx, such as error recovery, persistence, Unicode, preprocessing, early reduction and pretty printing.
The following example illustrates how to construct an abstract grammar for a language with a dangling "else".
; Dangling Else Grammar -- not obvious
;
; Note that a grammar with dangling "else" can be properly normalized
; in Styx with regular means, though the construct is not obvious at
; first glance.
;
; The abstract grammar comes out normalized as we want it:
;
;   Stm = stmt;
;         ifte(Cond, Stm, Stm);
;         loop(Cond, Stm);
;         seri(Stm*);
;
; We have normalized "if (C) Stm" to "if (C) Stm else ;".
; Additionally, the empty statement (";") is normalized to
; the empty compound ("{}").
;
; Two problems to solve here:
;
;   1 - (difficult) - dangling else grammar
;   2 - (easy     ) - normalization
;
; In the grammar below, we have simplified all non-compound statements to
; "S" and the empty statement to "noop".
;
; Note that we have added the conditions after the "if" and "while" tokens
; only to make the grammar appear a little more C-like, though they do not
; contribute to the problem.
;
; The obvious consequence of the dangling "else", which is nevertheless not
; easy to see, is that instead of the intuitively expected two rules (one
; with and one without "else") a third rule has to be introduced for an "if"
; that always demands an else part. This occurs in the then-branch of each
; if-clause having an "else".
;
; Note that the "while" and "for" clauses have to be doubled in "Stm" and
; "Stm1", too. This is because a while-clause in the then-part of an if-clause
; with an else-part demands any if-clause contained in it to have an
; else-part, too.
;
; To illustrate the interesting cases:
;
;   if (C)
;     while (C)
;       if (C)
;         S;
;       else
;         S;
;   else
;     S;
;
; or:
;
;   if (C)
;     while (C)
;       if (C)
;         S;
;
; but one cannot write:
;
;   if (C)
;     while (C)
;       if (C)
;         S;
;   else
;     S;
;
; as this would mean:
;
;   if (C)
;     while (C)
;       if (C)
;         S;
;       else
;         S;
;
; We have to use parentheses to express the former compound:
;
;   if (C)
;     while (C)
;     {
;       if (C)
;         S;
;     }
;   else
;     S;
;
; The same of course holds for the above example with the while-clauses omitted.
;
; The unobvious consequence of the dangling "else" is that the compound clauses
; between "if" and "else" have a different grammar than the top level clauses,
; as the inner ones demand an "if" to have an "else" in any case, while the
; outermost do not. For this reason it is unlikely to get such a grammar right
; on the first attempt.
;
; Note that all these complications are an intrinsic problem of dangling-else
; grammars and are not caused by Styx, which uses pure BNF. One might contemplate
; whether breaking shift-reduce or reduce-reduce conflicts, as allowed e.g.
; within Yacc, would make such a situation any clearer.
;
; Grammars with a dangling "else" and loop-clauses like C have, as another
; disadvantage, the typographical problem that one needs to insert parentheses
; as soon as an extra statement is added to the bodies. This, together with the
; binding problems, might be a reason why some prefer to always write
; parentheses in conjunction with these statements.
Language ife

Regular Grammar

  ; Character Set

  ign B = ' \n'

  tok T = "if" | "while" | "(" | "C" | ")" | "S" | "else" | ";" | "{" | "}"

Context Free Grammar

  start Source :pgm : Stm

  let Stm      :ign1: XIf
               :ifte: "if" "(" Cond ")" Stm1 "else" Stm
               :loop: "while" "(" Cond ")" Stm
               :ign0: Stm2

  let XIf      :ifte: "if" "(" Cond ")" Stm StmNoop   ; normalize: if (C) Stm --> if (C) Stm else {}

  let Stm1     :ifte: "if" "(" Cond ")" Stm1 "else" Stm1
               :loop: "while" "(" Cond ")" Stm1
               :ign0: Stm2

  let Stm2     :stmt: "S" ";"
               :seri: "{" Stms "}"
               :ign1: StmNoop ";"                     ; normalize ";" -> {}

  let Stms     :nil :
               :cons: Stm Stms

  let StmNoop  :seri: StmSkip

  let StmSkip  :nil :

  let Cond     :cond: "C"
Normally, a parser stops the parse process in the case of a syntax error; this is the default behavior. There exist several error recovery methods which allow a parser to continue the parse process after a syntax error and thus to behave in a more user-friendly way. The Styx error recovery mechanism differs from the one provided by yacc-compatible parsers.
The yacc error recovery mechanism is based on special error productions of the form Nonterminal --> error (Token|Nonterminal) ..., which the user explicitly adds to the grammar specification. They are treated like normal productions. In the case of an error the parser pops elements from its stack until it reaches a top state whose corresponding element set contains an error production. Next the parser shifts a fictitious error token onto the stack. If the error production has the form Nonterminal --> error, the parser performs an "error" reduction and ignores the following input symbols until the normal parse process can continue. Otherwise the parser consumes as many input symbols as needed until the error production can be reduced, and then continues with the normal parse process.
The Styx parser uses a variant of the panic-mode error recovery mechanism; there is no need for special user-defined error productions. This method tries to isolate the part of the sentence which contains the syntax error. The parser looks for a state in its stack for which a goto-action to a single nonterminal exists and removes the states on top of it. The following input symbols are skipped up to the first one that can follow the above-mentioned nonterminal. Now the parser performs an "error" reduction, pushes the resulting state of the goto-action onto its stack and continues with the normal parse process.
The Styx variant of this method uses as resumption points only those nonterminals which are declared as such in the corresponding grammar definition.
Among other uses, persistence is a proper means to split a compilation process into two parts: a front end which parses the source, performs the static semantics and stores the derivation tree, and a back end which later reads the stored tree and continues the processing.
Styx supports this approach with the binary image library, which contains a set of functions to store data types in a machine-independent, compressed and encrypted form.
Currently the following data types are supported.

| Type     | Bytes | C-Type                 |
+----------+-------+------------------------+------------------------
| Byte     |   1   | unsigned char          |
| Word     |   2   | unsigned short int     | Intrinsic C-data types
| Long     |   4   | signed long int        |
| ULong    |   4   | unsigned long int      |
| Int64    |   8   | signed long long int   | if supported
| UInt64   |   8   | unsigned long long int | if supported
+----------+-------+------------------------+------------------------
| String   |       | (char *)               | Strings
| Binary   |       | c_bstring              | binary Strings
| Symbol   |       | symbol                 | Symbols
| Function |       | (? (*)())              | Functions
| Abstract |       | (?)                    | "Objects"
| StdCPtr  |       | (?*)                   | References
For each data type the library provides a pair of put and get functions.
A Byte is treated as an "atomic" data type and stored as is, whereas a Word is stored in the order low byte, high byte. Analogously, a Long as well as a ULong is split into a low Word and a high Word, and so on.
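To make this byte ordering concrete, the following sketch shows how a Word could be decomposed. The names putWord and putByte are chosen for illustration only and are not necessarily the actual library functions.

/* Hypothetical sketch of the byte order described above; putWord and
 * putByte are illustrative names, not necessarily the library's own.  */
void putByte(unsigned char b);                /* assumed primitive          */

void putWord(unsigned short w)
{
  putByte((unsigned char)( w       & 0xff));  /* low byte is stored first   */
  putByte((unsigned char)((w >> 8) & 0xff));  /* followed by the high byte  */
}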
The same representation applies to a String, a Binary and a Symbol: first the length, i.e. the number of bytes, is stored and then the data bytes in their respective order. For technical reasons a Function must be defined in a global table and is represented by a symbolic name that serves as the key to the function table entry.
In the case of a generic data type (e.g. List(Alpha)) a put-function typically looks like this:
void putList(List(Alpha) v, void putAlpha(Alpha v))
{
  putInt(List_length(v));

  for (; !List_null(v); v = List_rest(v))
    putAlpha(List_first(Alpha,v));
}
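For comparison, a matching get-function might look like the following sketch. Here getInt, List_nil and List_cons are assumed counterparts and constructors; the actual binary image and list libraries may name them differently.

/* Hypothetical counterpart to putList; getInt, List_nil and List_cons are
 * assumptions about the binary image and list libraries.                   */
List(Alpha) getList(Alpha getAlpha(void))
{
  List(Alpha) v = List_nil();
  int         n = getInt();          /* number of stored elements           */

  while (n-- > 0)                    /* read the elements back one by one   */
    v = List_cons(getAlpha(), v);    /* note: prepending reverses the list  */

  return v;                          /* a real implementation would restore
                                        the original element order          */
}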
In the case of a heterogeneous parameter type ("Object") the user has to save the corresponding get-function together with the value.
References to multiply or cyclically referenced structures (except symbols and functions) cannot simply be expanded if the representation is to remain unique. A Reference is therefore treated in the following way: on its first occurrence the referenced structure value is stored; every further occurrence stores only a reference number pointing to that structure.
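The following sketch illustrates this bookkeeping. Apart from putInt, every name here is a hypothetical placeholder that only serves to make the rule explicit; it is not the Styx API.

/* Illustration of the reference rule described above. Everything here,
 * except putInt, is a hypothetical placeholder and not the Styx API.    */
#define MAX_REFS 1024

static void *seen[MAX_REFS];          /* structures stored so far          */
static int   seenCount;

static int refNumber(void *p)         /* -1 on first occurrence, else id   */
{
  int i;
  for (i = 0; i < seenCount; i++)
    if (seen[i] == p)
      return i;
  seen[seenCount++] = p;              /* remember it for later occurrences */
  return -1;
}

void putRef(void *p, void putValue(void *p))
{
  int n = refNumber(p);

  if (n < 0) {
    putInt(0);                        /* tag: full value follows           */
    putValue(p);                      /* first occurrence: store the value */
  } else {
    putInt(1);                        /* tag: reference number follows     */
    putInt(n);                        /* later occurrence: store the id    */
  }
}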
Each binary image starts with some header information which contains, for example, the version of the image.
The compression applied is a variant of the Lempel-Ziv-Welch (LZW) method.
The encryption applied is a variant of the so-called linear congruential methods.
This section is about Example 03. For convenience, we first include the related README here, literally:
[README]

Example 03

This example is a quick variation of the interpreter in Example 02.
It demonstrates persistence as a feature of Styx.

The only modification with regard to Example 02 is to split the [pl0.c]
program apart into two parts:

1) A "compiler" [pl0c.c], which parses the source, does the static
   semantics and stores the derivation tree into a file.

2) A "run time system" [pl0r.c], which reads and executes the
   so-produced binary image.

"compile" [testpl0.pl0] by 'pl0c testpl0', yielding [testpl0].

If you browse the file, you find it starting with something like
"#!/p/ping/pl0r". You may want to adjust this path issued in [pl0c.c]
to the location of the pl0r binary and do a 'chmod +x testpl0' for a
real executable. Otherwise run it using 'pl0r testpl0'.

One may argue that this is not a "real" compiler, which should create
pseudo code, at least. This is true, but writing an interpreter for
pseudo code that is significantly faster than this example is not as
trivial as one might think. Perhaps we will continue later with an
example of a proper to-pseudo-code compiler and a nice little machine,
but this may never be necessary, since there is something as strong as
Styx itself on top of it, which may soon be ready for prime time.
The Styx scanner & parser generator is able to deal with Unicode-based language definitions and scan streams. Since version 1.5 each of the released programs supports Unicode.
First you have to design the proper grammar. Styx itself doesn't accept Unicode specifications; instead, you define Unicode tokens and keywords with the help of the long form of the hexadecimal literal notation. The generated scan tables are in any case based on wide characters.
Scan streams which correspond to such a language definition must be Unicode-based, too. They can be created with the function 'Stream_Itr_new' of the scan stream interface.
The scanner converts the scanned Unicode tokens and keywords into equivalent multibyte character (UTF-8) strings and then into symbols. This is the final token and keyword representation within the preprocessing facility (see the next section), the parser and the derivation tree. Note that in this case diagnostic functions like 'PT_error' (see the parse term construction interface) expect UTF-8 based message parameters.
The scan stream interface provides a hook for user-defined, i.e. language-specific, preprocessing. Preprocessing is activated by specifying a proper handler (see the function 'Stream_premac_set').
Once activated, the specified handler is called each time after the scanner has separated a token and before the token is passed on, for example, to the parser. Depending on the preprocessing result, the scanner behaves as follows:
When the scanner rescans the preprocessing result of a token, a new scan stream is created and pushed onto an internal stack. On EOF at the main scan stream, the scanner passes the EOF token to the calling (parser) function. On EOF at the current scan stream on top of the internal stack, the scanner pops the stream and continues with the next one on top of the stack.
The tracking of token locations within substreams is performed relative to the rescanned token from the upper stream. Its location is part of the substream identifier.
The Styx preprocessor provides modularization, macro expansion and conditional compilation with the help of the above-mentioned macro tokens. The evaluation of the preprocessing macros takes place at scan time, so the parser does not take any notice of it.
By following the steps below you can use this preprocessing facility for your own language.
The Styx parser supports early reduction, a facility that allows you to parse parts of a source.
If you, for example, design a schema language like the DDL part of SQL with multiple start symbols for the database and table definition sections, you can apply early reduction to retrieve each table definition as a separate derivation tree, even if they are combined in one source file.
A second example refers to the calculator from Example 01 and demonstrates the partial parsing of the following expression list in the file 'explist.calc'.
1+2
3-1
6*6
9/3
When applying the command 'pim_test calc -early explist.calc' you will receive the following result:
Derivation Tree from Source : explist.calc

[calc.Start_Command (1,1) [Command.exp (1,1) [Exp.add (1,1) [Exp.ign0 (1,1) [Exp1.ign0 (1,1) [Exp2.int (1,1) [Int (1,1) "1"]]]] [Keyword (1,2) "+"] [Exp1.ign0 (1,3) [Exp2.int (1,3) [Int (1,3) "2"]]]]]]

Derivation Tree from Source : explist.calc

[calc.Start_Command (2,1) [Command.exp (2,1) [Exp.sub (2,1) [Exp.ign0 (2,1) [Exp1.ign0 (2,1) [Exp2.int (2,1) [Int (2,1) "3"]]]] [Keyword (2,2) "-"] [Exp1.ign0 (2,3) [Exp2.int (2,3) [Int (2,3) "1"]]]]]]

Derivation Tree from Source : explist.calc

[calc.Start_Command (3,1) [Command.exp (3,1) [Exp.ign0 (3,1) [Exp1.mlt (3,1) [Exp1.ign0 (3,1) [Exp2.int (3,1) [Int (3,1) "6"]]] [Keyword (3,2) "*"] [Exp2.int (3,3) [Int (3,3) "6"]]]]]]

Derivation Tree from Source : explist.calc

[calc.Start_Command (4,1) [Command.exp (4,1) [Exp.ign0 (4,1) [Exp1.div (4,1) [Exp1.ign0 (4,1) [Exp2.int (4,1) [Int (4,1) "9"]]] [Keyword (4,2) "/"] [Exp2.int (4,3) [Int (4,3) "3"]]]]]]
This ability enables, for example, a background service to parse client requests "on-the-fly".
The function 'Stream_string' of the scan stream interface lets you define strings as scan streams, while the function 'Stream_line' is suitable for parsing from special files like pipes.
Look at Examples 01 and 04 for examples of parsing from strings and from special files.
There isn't much to say here besides that it's possible. The following code fragment demonstrates how to do it.
First load the scanner, either with the function 'Scn_get_<language>' from the generated source file '<language>_lim.c' or from the corresponding binary image '<language>.lim' with the function 'Scn_get' of the scanner primitives interface. After that create a scan stream using, for example, the function 'Stream_file' of the scan stream interface.
// ...
#include "scn_base.h"
#include "scn_io.h"
// ...
Scn_T      Scn;        // Scanner
Scn_Stream cStream;    // Scan stream
int        i;
// ...
// scanner & scan stream creation (see above)
// ...
// define EOF, error and token id's
Stream_defEofId(cStream,-1);
Stream_defErrId(cStream, 0);

for (i = 1; i < Scn_tokens(Scn); i++)
{
  string TokenName = Scn_tokid(Scn,i);

  Stream_defTokId( cStream, TokenName, (short)i );

  FreeMem(TokenName);
}
// ...
// scan loop
for ( Stream_next(cStream); Stream_ctid(cStream) >= 0; Stream_next(cStream) )
{
  string FileName   = symbolToString(Stream_cfil(cStream)),
         TokenName  = Scn_tokid(Scn,Stream_ctid(cStream)),
         TokenValue = symbolToString(Stream_csym(cStream));
  long   Line       = Stream_clin(cStream),
         Column     = Stream_ccol(cStream);

  // ...
}
// ...
// scanner & scan stream disposal
Stream_close(cStream);
Stream_free(cStream);
Scn_free(Scn);
Finally, look at the source file [lim_test.c] of the scanner test program for an application.
Basically, it's possible. The (low-level) parser interface is flexibly configurable via appropriate user-defined handlers for the retrieval of the next token, for the shift and reduce operations and for the reporting of syntax errors.
The user-defined function 'get next token', which represents the interface between scanner and parser, must return -1 in the case of EOF, -2 or less in the case of an error or an unknown token, and otherwise the correct token index from the parse table.
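As a rough sketch, such a handler could look like the following. The external scanner functions and the handler's exact signature are assumptions for illustration; only the return convention (-1, -2 or less, token index) is taken from the text above.

/* Hypothetical 'get next token' handler wrapping an external scanner.
 * ext_eof, ext_scan and map_to_parse_table stand for your own scanner;
 * the real handler signature expected by the parser interface may differ. */
extern int ext_eof(void);             /* placeholder: end of input?        */
extern int ext_scan(void);            /* placeholder: read next raw token  */
extern int map_to_parse_table(int t); /* placeholder: map to table index   */

int get_next_token(void)
{
  int tok;

  if (ext_eof())
    return -1;                        /* EOF                               */

  tok = ext_scan();
  if (tok < 0)
    return -2;                        /* error or unknown token            */

  return map_to_parse_table(tok);     /* correct index from the parse table */
}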
From version 1.5 on it's possible to combine an externally defined scanner not only with the Styx parser but also with the term generation facility. All you have to do is provide the corresponding external scanner interface. Use the function 'PT_init_extscn' of the term generation interface to initialize parsing and term construction.
If you plan to use the term generation facility with an external scanner and parser, which is possible from version 1.5 on, you must provide both the external scanner interface and the external parser interface before calling the function 'PT_init_ext' of the term generation interface to initialize parsing and term construction.
As already described in the previous chapters, the derivation tree with the complete source information is automatically constructed during the parse process with the help of the term generation interface. The Styx compiler produces a C interface '<language>_int.c' to the abstract syntax tree, which in most cases will be the preferred access method, along with the generic language support.
In order to perform meta-operations on arbitrary derivation trees corresponding to different languages, you'll need dynamic access to the concrete syntax tree. The Styx framework comes with generic methods for these purposes. The term interface provides the basic operations needed to construct and access a derivation tree. Besides functions for the retrieval of node information, it contains a "depth first" and a "breadth first" tree iterator, useful for iterating over concrete syntax trees.
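To recall the two iteration orders, here is a generic sketch over a simple node type. It does not use the actual term interface, whose node type and accessor functions differ; it only illustrates the traversal strategies mentioned above.

/* Generic illustration of depth-first and breadth-first iteration.
 * The node type and its fields are hypothetical; the Styx term
 * interface provides its own functions for this purpose.            */
typedef struct Node {
  const char   *label;               /* e.g. token or nonterminal name    */
  struct Node **children;            /* NULL-terminated array of children */
} Node;

void depthFirst(Node *n, void visit(Node *))
{
  Node **c;

  visit(n);                          /* visit the node itself first ...   */
  for (c = n->children; *c; c++)     /* ... then descend into each child  */
    depthFirst(*c, visit);
}

void breadthFirst(Node *root, void visit(Node *))
{
  Node *queue[1024];                 /* fixed-size queue keeps it short   */
  int   head = 0, tail = 0;
  Node **c;

  queue[tail++] = root;
  while (head < tail)
  {
    Node *n = queue[head++];

    visit(n);                        /* nodes are visited level by level  */
    for (c = n->children; *c; c++)
      queue[tail++] = *c;
  }
}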
Styx supports the implementation of generic, language independent services. Examples are the Styx compiler itself as well as the scanner and parser test programs.
The scanner primitives interface provides several functions to load a scan table from a binary image. Once loaded, you can create scan streams on it and perform scan operations. Analogously, a parse table can be loaded from a binary image (see the parse table load & unload interface) and then used, for example, to initialize parsing and term construction.
The Styx compiler is an example of such an application. In version 1.5 a new interface, the Styx translation library, was added which provides high-level access to this functionality.
When a grammar specification evolves, there will be a need for the automatic conversion of "older" source files. The pretty printing facility allows a user to translate source code from one grammar specification to another and then to print the translated source code. The print function can even be useful without a previous transformation, for example when a source tree was constructed dynamically and consequently lacks any positional information.
This feature is "ready to apply", but the applied layout mechanism in particular has some disadvantages and will be revised in a later version (see also the next chapter).
The transformation facility was designed with regard to the above-mentioned application; it does not (yet) support tree transformations in a general sense. A tree-to-tree transformation is performed according to the underlying source and target grammar specifications, which must satisfy the following restrictions. Comments are handled, too, which is a must with regard to source-to-source transformations.
                              specification of CFG 2
source tree based on CFG 1 -------------------------------------------------> target tree based on CFG 2

    abstraction(CFG 1)   = abstraction(CFG 2)
    regexp(token(CFG 1)) = regexp(token(CFG 2))   [*]
[*] From version 1.5 on user-defined hooks can be specified in order to handle different token representations. This is useful, for example, to cope with different comments.
The indentation of the symbols and productions in a grammar definition is taken as a set of layout hints. Take, for example, the start production of the Styx command line specification language, below. Its format advises the layouter to start the arguments, options and environment sections on new lines and to output the following definitions with an indent. The format of the list production below is suitable if each element of a list should start on a new line, too.
start [err] Source
        :root: "Program" Fid1 Dol Doc
               "Arguments"
                  Dfns OptDoc
               "Options"
                  Dfns OptDoc
               "Environment"
                  Dfns OptDoc

let [err] Dfns          ; Definitions
        :nil :
        :cons :
                  Dfn
                  Dfns
The layout option in the Styx grammar allows the user to force an extra new line (option "!") or to overrule the indentation-based layout (option "?"). The latter tells the layouter that the whole grammar phrase should be printed on one line if possible, which is useful in the case of expressions.
Another topic is the separation of the tokens. A space is taken as the default separation character. If this is not suitable, the user must explicitly specify which separation rule should be applied for two tokens: a separation can be prevented, or the tokens can be forced to be separated by a newline or according to the current indentation, the latter being the default rule in the case of comments. Obviously, these separation rules also influence the layout.
Printing is done in two steps. First, the appropriate positions of the grammar phrases are determined by applying the above-mentioned layout rules; after that, the actual printing takes place.
Basically, Styx is available for C/C++ development platforms. Future releases will come with small runtime libraries for a wider range of popular programming languages like C# and Java. The source distribution of version 1.7.5 contains a pure STL-based C++ runtime library.
The C++ library contains some (template) classes for the construction of scanners and parsers. If you plan to use them with a Styx grammar, you first have to specify the grammar and export the scanner and parse tables with the 'styx' program. The C++ runtime scanner and parse table classes provide methods to import the exported tables.
Note that there are some restrictions regarding the Styx grammar specification. For now, the following features are not supported by the C++ runtime system:
Further, there is no generated C++ interface to the abstract syntax tree. You have to use the generic C++ parse tree interface instead, which provides the user with the necessary (token and nonterminal) class information. Besides that, depending on the task you want to perform, you can construct parse trees with full source information or with reduced abstractions.
Similar to the basic Styx system, the C++ runtime system provides a scanner test program 'StyxScannerTest' and a parser test program 'StyxParseTreeTest'. They are suitable for testing exported Styx grammars and, secondly, they are good examples of how to use the C++ library. For more information, please have a look at the online source documentation.
The source distribution of version 1.7.6 comes with a C# runtime library. For now, the C# library contains the module 'StyxScanner.cs' for the construction of scanners. In order to use it with a Styx grammar, you first have to specify the grammar and export the scanner with the 'styx' program. The C# runtime scanner classes provide methods to import the exported scanner table.
Similar to the basic Styx system the C# runtime system provides a scanner test program 'StyxScannerTest', suitable to test exported Styx grammars and demonstrate the usage of the C# scanner module. For more information, please have a look at the online source documentation.