In this chapter we give a short overview of further aspects and possibilities offered by Styx, such as error recovery, persistence, Unicode, preprocessing, early reduction and pretty printing.
The following example illustrates how to construct an abstract grammar for a language with a dangling "else".
; Dangling Else Grammar -- not obvious
;
; Note that a grammar with dangling "else" can be properly normalized
; in Styx with regular means, though the construct is not obvious at
; first glance.
;
; The abstract grammar comes out normalized as we want it:
;
;   Stm = stmt;
;         ifte(Cond, Stm, Stm);
;         loop(Cond, Stm);
;         seri(Stm*);
;
; We have normalized "if (C) Stm" to "if (C) Stm else ;".
; Additionally, the empty statement (";") is normalized to
; the empty compound ("{}").
;
; Two problems to solve here:
;
;   1 - (difficult) - dangling else grammar
;   2 - (easy     ) - normalization
;
; In the grammar below, we have simplified all non-compound statements to
; "S" and the empty statement to "noop".
;
; Note that we have added the conditions after the "if" and "while" tokens
; only to make the grammar appear a little more C-like, though they do not
; contribute to the problem.
;
; The obvious consequence of the dangling "else", which is nevertheless not
; easy to see, is that instead of the intuitively expected two rules (one
; with and one without "else") a third rule has to be introduced for an "if"
; that always demands an else part. This occurs in the then-branch of each
; if-clause having an "else".
;
; Note that the "while" and "for" clauses have to be doubled in "Stm" and
; "Stm1", too. This is because a while-clause in the then-part of an if-clause
; with an else-part demands any if-clause contained in it to have an
; else-part, too.
;
; To illustrate the interesting cases:
;
;   if (C)
;     while (C)
;       if (C)
;         S;
;       else
;         S;
;   else
;     S;
;
; or:
;
;   if (C)
;     while (C)
;       if (C)
;         S;
;
; but one cannot write:
;
;   if (C)
;     while (C)
;       if (C)
;         S;
;   else
;     S;
;
; as this would mean:
;
;   if (C)
;     while (C)
;       if (C)
;         S;
;       else
;         S;
;
; We have to use parentheses to express the former compound:
;
;   if (C)
;     while (C)
;     {
;       if (C)
;         S;
;     }
;   else
;     S;
;
; The same of course holds for the above example with the while-clauses omitted.
;
; The unobvious consequence of the dangling "else" is that the compound clauses
; between "if" and "else" have a different grammar than the top level clauses,
; as the inner ones demand an "if" to have an "else" in any case, while the
; outermost do not. For this reason it is unlikely to get such a grammar right
; on the first attempt.
;
; Note that all these complications are an intrinsic problem of dangling-else
; grammars and are not caused by Styx, which uses pure BNF. One might contemplate
; whether breaking shift-reduce or reduce-reduce conflicts, as allowed e.g.
; within Yacc, would make such a situation any clearer.
;
; Grammars with a dangling "else" and loop-clauses like C have, as another
; disadvantage, the typographical problem that one needs to insert parentheses
; as soon as an extra statement is added to the bodies. This, together with the
; binding problems, might be a reason why some prefer to always write
; parentheses in conjunction with these statements.
Language ife

Regular Grammar

  ; Character Set

  ign B = ' \n'

  tok T = "if" | "while" | "(" | "C" | ")" | "S" | "else" | ";" | "{" | "}"

Context Free Grammar

  start Source :pgm : Stm

  let Stm      :ign1: XIf
               :ifte: "if" "(" Cond ")" Stm1 "else" Stm
               :loop: "while" "(" Cond ")" Stm
               :ign0: Stm2

  let XIf      :ifte: "if" "(" Cond ")" Stm StmNoop   ; normalize: if (C) Stm --> if (C) Stm else {}

  let Stm1     :ifte: "if" "(" Cond ")" Stm1 "else" Stm1
               :loop: "while" "(" Cond ")" Stm1
               :ign0: Stm2

  let Stm2     :stmt: "S" ";"
               :seri: "{" Stms "}"
               :ign1: StmNoop ";"                     ; normalize ";" -> {}

  let Stms     :nil :
               :cons: Stm Stms

  let StmNoop  :seri: StmSkip

  let StmSkip  :nil :

  let Cond     :cond: "C"
Normally, a parser stops the parse process in the case of a syntax error; this is the default behavior. There exist several error recovery methods which allow a parser to continue the parse process after a syntax error and thus to behave in a more user-friendly way. The Styx error recovery mechanism differs from the one provided by yacc-compatible parsers.
The yacc error recovery mechanism is based on special error productions of the form Nonterminal --> error (Token|Nonterminal) ..., which the user explicitly adds to the grammar specification. They are treated like normal productions. In the case of an error the parser pops elements from its stack until it reaches a top state whose corresponding element set contains an error production. Next the parser shifts a fictitious error token onto the stack. If the error production has the form Nonterminal --> error, the parser performs an "error" reduction and ignores the following input symbols until the normal parse process can continue. Otherwise the parser consumes as many input symbols as needed until the error production can be reduced, and then continues with the normal parse process.
The Styx parser uses a variant of the panic-mode error recovery mechanism; there is no need for special user-defined error productions. This method tries to isolate the part of the sentence which contains the syntax error. The parser looks for a state in its stack for which a goto-action to a single nonterminal exists and removes the states on top of it. The following input symbols are skipped up to the first one that can follow the above-mentioned nonterminal. Now the parser performs an "error" reduction, pushes the resulting state of the goto-action onto its stack and continues with the normal parse process.
The Styx variant of this method uses as resumption points only those nonterminals which are declared as such in the corresponding grammar definition.
Among other uses, persistence is a proper means to split a compilation process into two parts: a front end which parses the source, performs the static semantics and stores the derivation tree, and a back end which later reads the stored tree and continues the processing.
Styx supports this approach with the binary image library, which contains a set of functions to store data types in a machine-independent, compressed and encrypted form.
Currently the following data types are supported.

| Type     | Bytes | C-Type                 |
+----------+-------+------------------------+------------------------
| Byte     |   1   | unsigned char          |
| Word     |   2   | unsigned short int     | Intrinsic C-data types
| Long     |   4   | signed long int        |
| ULong    |   4   | unsigned long int      |
| Int64    |   8   | signed long long int   | if supported
| UInt64   |   8   | unsigned long long int | if supported
+----------+-------+------------------------+------------------------
| String   |       | (char *)               | Strings
| Binary   |       | c_bstring              | binary Strings
| Symbol   |       | symbol                 | Symbols
| Function |       | (? (*)())              | Functions
| Abstract |       | (?)                    | "Objects"
| StdCPtr  |       | (?*)                   | References
For each data type the library provides a pair of put and get functions.
A Byte is treated as an "atomic" data type and stored as is, whereas a Word is stored in the order low byte, high byte. Analogously, a Long as well as a ULong is split into a low Word and a high Word, and so on.
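To make this byte ordering concrete, the following sketch shows how a Word could be decomposed. The names putWord and putByte are chosen for illustration only and are not necessarily the actual library functions.

/* Hypothetical sketch of the byte order described above; putWord and
 * putByte are illustrative names, not necessarily the library's own.  */
void putByte(unsigned char b);                /* assumed primitive          */

void putWord(unsigned short w)
{
  putByte((unsigned char)( w       & 0xff));  /* low byte is stored first   */
  putByte((unsigned char)((w >> 8) & 0xff));  /* followed by the high byte  */
}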
The same representation applies to a String, a Binary and a Symbol: first the length, i.e. the number of bytes, is stored and then the data bytes in their respective order. For technical reasons a Function must be defined in a global table and is represented by a symbolic name that serves as the key to the function table entry.
In the case of a generic data type (e.g. List(Alpha)) a put-function typically looks like this:
void putList(List(Alpha) v, void putAlpha(Alpha v))
{
  putInt(List_length(v));

  for (; !List_null(v); v = List_rest(v))
    putAlpha(List_first(Alpha,v));
}
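For comparison, a matching get-function might look like the following sketch. Here getInt, List_nil and List_cons are assumed counterparts and constructors; the actual binary image and list libraries may name them differently.

/* Hypothetical counterpart to putList; getInt, List_nil and List_cons are
 * assumptions about the binary image and list libraries.                   */
List(Alpha) getList(Alpha getAlpha(void))
{
  List(Alpha) v = List_nil();
  int         n = getInt();          /* number of stored elements           */

  while (n-- > 0)                    /* read the elements back one by one   */
    v = List_cons(getAlpha(), v);    /* note: prepending reverses the list  */

  return v;                          /* a real implementation would restore
                                        the original element order          */
}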
In the case of a heterogeneous parameter type ("Object") the user has to save the corresponding get-function together with the value.
References to multiply or cyclically referenced structures (except symbols and functions) cannot simply be expanded if the representation is to remain unique. A Reference is therefore treated in the following way: on its first occurrence the referenced structure value is stored; every further occurrence stores only a reference number pointing to that structure.
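The following sketch illustrates this bookkeeping. Apart from putInt, every name here is a hypothetical placeholder that only serves to make the rule explicit; it is not the Styx API.

/* Illustration of the reference rule described above. Everything here,
 * except putInt, is a hypothetical placeholder and not the Styx API.    */
#define MAX_REFS 1024

static void *seen[MAX_REFS];          /* structures stored so far          */
static int   seenCount;

static int refNumber(void *p)         /* -1 on first occurrence, else id   */
{
  int i;
  for (i = 0; i < seenCount; i++)
    if (seen[i] == p)
      return i;
  seen[seenCount++] = p;              /* remember it for later occurrences */
  return -1;
}

void putRef(void *p, void putValue(void *p))
{
  int n = refNumber(p);

  if (n < 0) {
    putInt(0);                        /* tag: full value follows           */
    putValue(p);                      /* first occurrence: store the value */
  } else {
    putInt(1);                        /* tag: reference number follows     */
    putInt(n);                        /* later occurrence: store the id    */
  }
}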
Each binary image starts with some header information which contains, for example, the version of the image.
The compression applied is a variant of the Lempel-Ziv-Welch (LZW) method.
The encryption applied is a variant of the so-called linear congruential methods.
This section is about Example 03. For convenience, we first include the related README here, literally:
[README]

Example 03

This example is a quick variation of the interpreter in Example 02.
It demonstrates persistence as a feature of Styx.

The only modification with regard to Example 02 is to split the [pl0.c]
program apart into two parts:

1) A "compiler" [pl0c.c], which parses the source, does the static
   semantics and stores the derivation tree into a file.

2) A "run time system" [pl0r.c], which reads and executes the
   so-produced binary image.

"compile" [testpl0.pl0] by 'pl0c testpl0', yielding [testpl0].

If you browse the file, you find it starting with something like
"#!/p/ping/pl0r". You may want to adjust this path issued in [pl0c.c]
to the location of the pl0r binary and do a 'chmod +x testpl0' for a
real executable. Otherwise run it using 'pl0r testpl0'.

One may argue that this is not a "real" compiler, which should create
pseudo code, at least. This is true, but writing an interpreter for
pseudo code that is significantly faster than this example is not as
trivial as one might think. Perhaps we will continue later with an
example of a proper to-pseudo-code compiler and a nice little machine,
but this may never be necessary, since there is something as strong as
Styx itself on top of it, which may soon be ready for prime time.
The Styx scanner & parser generator is able to deal with Unicode-based language definitions and scan streams. Since version 1.5 each of the released programs supports Unicode.
First you have to design the proper grammar. Styx itself doesn't accept Unicode specifications; instead, you define Unicode tokens and keywords with the help of the long form of the hexadecimal literal notation. The generated scan tables are in any case based on wide characters.
Scan streams which correspond to such a language definition must be Unicode-based, too. They can be created with the function 'Stream_Itr_new' of the scan stream interface.
The scanner converts the scanned Unicode tokens and keywords into equivalent multibyte character (UTF-8) strings and then into symbols. This is the final token and keyword representation within the preprocessing facility (see the next section), the parser and the derivation tree. Note that in this case diagnostic functions like 'PT_error' (see the parse term construction interface) expect UTF-8 based message parameters.
The scan stream interface provides a hook for user-defined, i.e. language-specific, preprocessing. Preprocessing is activated by specifying a proper handler (see the function 'Stream_premac_set').
Once activated, the specified handler is called each time after the scanner has separated a token and before the token is passed on, for example, to the parser. Depending on the preprocessing result, the scanner behaves as follows:
When the scanner rescans the preprocessing result of a token, a new scan stream is created and pushed onto an internal stack. On EOF at the main scan stream, the scanner passes the EOF token to the calling (parser) function. On EOF at the current scan stream on top of the internal stack, the scanner pops the stream and continues with the next one on top of the stack.
The tracking of token locations within substreams is performed relative to the rescanned token from the upper stream. Its location is part of the substream identifier.
The Styx preprocessor provides modularization, macro expansion and conditional compilation with the help of the above-mentioned macro tokens. The evaluation of the preprocessing macros takes place at scan time, so the parser does not take any notice of it.
By following the steps below you can use this preprocessing facility for your own language.
The Styx parser supports early reduction, a facility that allows you to parse parts of a source.
If you, for example, design a schema language like the DDL part of SQL with multiple start symbols for the database and table definition sections, you can apply early reduction to retrieve each table definition as a separate derivation tree, even if they are combined in one source file.
A second example refers to the calculator from Example 01 and demonstrates the partial parsing of the following expression list in the file 'explist.calc'.
1+2
3-1
6*6
9/3
When applying the command 'pim_test calc -early explist.calc' you will receive the following result:
Derivation Tree from Source : explist.calc

[calc.Start_Command (1,1) [Command.exp (1,1) [Exp.add (1,1) [Exp.ign0 (1,1) [Exp1.ign0 (1,1) [Exp2.int (1,1) [Int (1,1) "1"]]]] [Keyword (1,2) "+"] [Exp1.ign0 (1,3) [Exp2.int (1,3) [Int (1,3) "2"]]]]]]

Derivation Tree from Source : explist.calc

[calc.Start_Command (2,1) [Command.exp (2,1) [Exp.sub (2,1) [Exp.ign0 (2,1) [Exp1.ign0 (2,1) [Exp2.int (2,1) [Int (2,1) "3"]]]] [Keyword (2,2) "-"] [Exp1.ign0 (2,3) [Exp2.int (2,3) [Int (2,3) "1"]]]]]]

Derivation Tree from Source : explist.calc

[calc.Start_Command (3,1) [Command.exp (3,1) [Exp.ign0 (3,1) [Exp1.mlt (3,1) [Exp1.ign0 (3,1) [Exp2.int (3,1) [Int (3,1) "6"]]] [Keyword (3,2) "*"] [Exp2.int (3,3) [Int (3,3) "6"]]]]]]

Derivation Tree from Source : explist.calc

[calc.Start_Command (4,1) [Command.exp (4,1) [Exp.ign0 (4,1) [Exp1.div (4,1) [Exp1.ign0 (4,1) [Exp2.int (4,1) [Int (4,1) "9"]]] [Keyword (4,2) "/"] [Exp2.int (4,3) [Int (4,3) "3"]]]]]]
This ability enables, for example, a background service to parse client requests "on-the-fly".
The function 'Stream_string' of the scan stream interface lets you define strings as scan streams, while the function 'Stream_line' is suitable for parsing from special files like pipes.
Look at Examples 01 and 04 for examples of parsing from strings and from special files.
There isn't much to say here besides that it's possible. The following code fragment demonstrates how to do it.
First load the scanner, either with the function 'Scn_get_<language>' from the generated source file '<language>_lim.c' or from the corresponding binary image '<language>.lim' with the function 'Scn_get' of the scanner primitives interface. After that create a scan stream using, for example, the function 'Stream_file' of the scan stream interface.
// ...
#include "scn_base.h"
#include "scn_io.h"
// ...
Scn_T      Scn;        // Scanner
Scn_Stream cStream;    // Scan stream
int        i;
// ...
// scanner & scan stream creation (see above)
// ...
// define EOF, error and token id's
Stream_defEofId(cStream,-1);
Stream_defErrId(cStream, 0);

for (i = 1; i < Scn_tokens(Scn); i++)
{
  string TokenName = Scn_tokid(Scn,i);

  Stream_defTokId( cStream, TokenName, (short)i );

  FreeMem(TokenName);
}
// ...
// scan loop
for ( Stream_next(cStream); Stream_ctid(cStream) >= 0; Stream_next(cStream) )
{
  string FileName   = symbolToString(Stream_cfil(cStream)),
         TokenName  = Scn_tokid(Scn,Stream_ctid(cStream)),
         TokenValue = symbolToString(Stream_csym(cStream));
  long   Line       = Stream_clin(cStream),
         Column     = Stream_ccol(cStream);

  // ...
}
// ...
// scanner & scan stream disposal
Stream_close(cStream);
Stream_free(cStream);
Scn_free(Scn);
Finally, look at the source file [lim_test.c] of the scanner test program for an application.
Basically, it's possible. The (low-level) parser interface is flexibly configurable via appropriate user-defined handlers for the retrieval of the next token, for the shift and reduce operations and for the reporting of syntax errors.
The user-defined function 'get next token', which represents the interface between scanner and parser, must return -1 in the case of EOF, -2 or less in the case of an error or an unknown token, and otherwise the correct token index from the parse table.
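As a rough sketch, such a handler could look like the following. The external scanner functions and the handler's exact signature are assumptions for illustration; only the return convention (-1, -2 or less, token index) is taken from the text above.

/* Hypothetical 'get next token' handler wrapping an external scanner.
 * ext_eof, ext_scan and map_to_parse_table stand for your own scanner;
 * the real handler signature expected by the parser interface may differ. */
extern int ext_eof(void);             /* placeholder: end of input?        */
extern int ext_scan(void);            /* placeholder: read next raw token  */
extern int map_to_parse_table(int t); /* placeholder: map to table index   */

int get_next_token(void)
{
  int tok;

  if (ext_eof())
    return -1;                        /* EOF                               */

  tok = ext_scan();
  if (tok < 0)
    return -2;                        /* error or unknown token            */

  return map_to_parse_table(tok);     /* correct index from the parse table */
}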
From version 1.5 on it's possible to combine an externally defined scanner not only with the Styx parser but also with the term generation facility. All you have to do is provide the corresponding external scanner interface. Use the function 'PT_init_extscn' of the term generation interface to initialize parsing and term construction.
If you plan to use the term generation facility with an external scanner and parser, which is possible from version 1.5 on, you must provide both the external scanner interface and the external parser interface before calling the function 'PT_init_ext' of the term generation interface to initialize parsing and term construction.
As already described in the previous chapters, the derivation tree with the complete source information is automatically constructed during the parse process with the help of the term generation interface. The Styx compiler produces a C interface '<language>_int.c' to the abstract syntax tree, which in most cases will be the preferred access method, along with the generic language support.
In order to perform meta-operations on arbitrary derivation trees corresponding to different languages, you'll need dynamic access to the concrete syntax tree. The Styx framework comes with generic methods for these purposes. The term interface provides the basic operations needed to construct and access a derivation tree. Besides functions for the retrieval of node information, it contains a "depth first" and a "breadth first" tree iterator, useful for iterating over concrete syntax trees.
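To recall the two iteration orders, here is a generic sketch over a simple node type. It does not use the actual term interface, whose node type and accessor functions differ; it only illustrates the traversal strategies mentioned above.

/* Generic illustration of depth-first and breadth-first iteration.
 * The node type and its fields are hypothetical; the Styx term
 * interface provides its own functions for this purpose.            */
typedef struct Node {
  const char   *label;               /* e.g. token or nonterminal name    */
  struct Node **children;            /* NULL-terminated array of children */
} Node;

void depthFirst(Node *n, void visit(Node *))
{
  Node **c;

  visit(n);                          /* visit the node itself first ...   */
  for (c = n->children; *c; c++)     /* ... then descend into each child  */
    depthFirst(*c, visit);
}

void breadthFirst(Node *root, void visit(Node *))
{
  Node *queue[1024];                 /* fixed-size queue keeps it short   */
  int   head = 0, tail = 0;
  Node **c;

  queue[tail++] = root;
  while (head < tail)
  {
    Node *n = queue[head++];

    visit(n);                        /* nodes are visited level by level  */
    for (c = n->children; *c; c++)
      queue[tail++] = *c;
  }
}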
Styx supports the implementation of generic, language independent services. Examples are the Styx compiler itself as well as the scanner and parser test programs.
The scanner primitives interface provides several functions to load a scan table from a binary image. Once loaded, you can create scan streams on it and perform scan operations. Analogously, a parse table can be loaded from a binary image (see the parse table load & unload interface) and then used, for example, to initialize parsing and term construction.
The Styx compiler is an example of such an application. In version 1.5 a new interface, the Styx translation library, was added which provides high-level access to this functionality.
When a grammar specification evolves, there will be a need for the automatic conversion of "older" source files. The pretty printing facility allows a user to translate source code from one grammar specification to another and then to print the translated source code. The print function can even be useful without a previous transformation, for example when a source tree was constructed dynamically and consequently lacks any positional information.
This feature is "ready to apply", but the applied layout mechanism in particular has some disadvantages and will be revised in a later version (see also the next chapter).
The transformation facility was designed with regard to the above-mentioned application; it does not (yet) support tree transformations in a general sense. A tree-to-tree transformation is performed according to the underlying source and target grammar specifications, which must satisfy the following restrictions. Comments are handled, too, which is a must with regard to source-to-source transformations.
                              specification of CFG 2
source tree based on CFG 1 -------------------------------------------------> target tree based on CFG 2

    abstraction(CFG 1)   = abstraction(CFG 2)
    regexp(token(CFG 1)) = regexp(token(CFG 2))   [*]
[*] From version 1.5 on user-defined hooks can be specified in order to handle different token representations. This is useful, for example, to cope with different comments.
The indentation of the symbols and productions in a grammar definition is taken as a set of layout hints. Take, for example, the start production of the Styx command line specification language, below. Its format advises the layouter to start the arguments, options and environment sections on new lines and to output the following definitions with an indent. The format of the list production below is suitable if each element of a list should start on a new line, too.
start [err] Source
        :root: "Program" Fid1 Dol Doc
               "Arguments"
                  Dfns OptDoc
               "Options"
                  Dfns OptDoc
               "Environment"
                  Dfns OptDoc

let [err] Dfns          ; Definitions
        :nil :
        :cons :
                  Dfn
                  Dfns
The layout option in the Styx grammar allows the user to force an extra new line (option "!") or to overrule the indentation-based layout (option "?"). The latter tells the layouter that the whole grammar phrase should be printed on one line if possible, which is useful in the case of expressions.
Another topic is the separation of the tokens. A space is taken as the default separation character. If this is not suitable, the user must explicitly specify which separation rule should be applied for two tokens: a separation can be prevented, or the tokens can be forced to be separated by a newline or according to the current indentation, the latter being the default rule in the case of comments. Obviously, these separation rules also influence the layout.
Printing is done in two steps. First, the appropriate positions of the grammar phrases are determined by applying the above-mentioned layout rules; after that, the actual printing takes place.
Basically, Styx is available for C/C++ development platforms. Future releases will come with small runtime libraries for a wider range of popular programming languages like C# and Java. The source distribution of version 1.7.5 contains a pure STL-based C++ runtime library.
The C++ library contains some (template) classes for the construction of scanners and parsers. If you plan to use them with a Styx grammar, you first have to specify the grammar and export the scanner and parse tables with the 'styx' program. The C++ runtime scanner and parse table classes provide methods to import the exported tables.
Note that there are some restrictions regarding the Styx grammar specification. For now, the following features are not supported by the C++ runtime system:
Further, there is no generated C++ interface to the abstract syntax tree. You have to use the generic C++ parse tree interface instead, which provides the user with the necessary (token and nonterminal) class information. Besides that, depending on the task you want to perform, you can construct parse trees with full source information or with reduced abstractions.
Similar to the basic Styx system, the C++ runtime system provides a scanner test program 'StyxScannerTest' and a parser test program 'StyxParseTreeTest'. They are suitable for testing exported Styx grammars and, secondly, they are good examples of how to use the C++ library. For more information, please have a look at the online source documentation.
The source distribution of version 1.7.6 comes with a C# runtime library. For now, the C# library contains the module 'StyxScanner.cs' for the construction of scanners. In order to use it with a Styx grammar, you first have to specify the grammar and export the scanner with the 'styx' program. The C# runtime scanner classes provide methods to import the exported scanner table.
Similar to the basic Styx system the C# runtime system provides a scanner test program 'StyxScannerTest', suitable to test exported Styx grammars and demonstrate the usage of the C# scanner module. For more information, please have a look at the online source documentation.