The Styx Handbook: Introduction and Overview

1. Introduction and Overview

This section contains an overview on the Styx framework and a comparison with lex/yacc combination.

1.1 The supported translation framework

To give an idea what Styx is about, we first look on the overall compilation process (see diagram 1) as supported by this tool:

  Source     - - - - - - - - - >    Object
   Code         Translate            Code
     |                                ^
     |                                |
     |                                |
   Parse              =            Unparse
     |                                |
     |                                |
     V                                |
  Source          Rewrite           Object
Derivation   ------------------>  Derivation
   Tree                              Tree

The diagram commutes, i.e. the following equation holds:

Translate(Source) = Unparse(Rewrite(Parse(Source)))

By this equation, the translation falls apart into a combination of three steps,

Parsing the source to a term
Rewriting the source term to an object term
Unparsing the object term

While both source and object are (textual or binary) strings, intermediate forms within the compilation are trees, i.e. hierarchical representations of their respective linear counterparts.

The first and the last step of a translation are string-to-tree and tree-to-string transformations, while the inner step is a tree-to-tree transformation, the heart-piece of the translation.

To simplify the inner translation, we are interested to have a simple structure, if possible, thereby not only omitting keywords but also other language features often referred to as "syntactic sugar".

Not only the internal representation of both the source and the object distinguishes itself from the external one. Referring to their grammars, we call the grammars of the string representations the surface grammar of the languages, while the simpler, internal grammar is called the depth grammar.

Also, while syntactical notions may apply well to the external strings, the rewriting can be further simplified if the trees are treated as terms, thereby allowing algebraic means and notions to be applied. By this, the depth grammar becomes a recursive system of the types of the terms.

In practice, having done the step from and to the linear forms is a substantial gain. By having abstracted from the surface properties of a concrete code to be translated, and having arranged things to be manageable as separate parts, we achieved a reduction of the translation processes complexity.

1.2 Reasons for this framework

The advantage of this construction may not be apparent to someone with few experience in compiler writing. Especially, having a derivation tree may not appear as a gain with regard to the "semantic actions" of yacc. Yacc's semantic actions have been designed to support one pass compilation, meaning the code generation intentionally happens during parsing and is in fact controlled by the parser.

While this approach was necessary in times when memory was extremely short, it has consequences on the language design. As soon as definitions become recursive, which is the case for every non-trivial language, a one pass compilation is not longer possible. This has lead to work-around constructions like forward-declarations and header-files.

Additionally, many constructions can not be compiled into code immediately or need to produce entries in so-called symbol tables. Typical examples are all sorts of definitions. The texts that constitutes these definitions has to be translated to some internal form anyway.

As soon as a derivation tree of the source becomes available, these complications and hand-crafted representations are not longer necessary. Because representations of the definitions are already present, one could simply refer to the definition's derivation tree. Dealing with recursive constructions, any number of passes can both easily and efficiently been done over the tree, i.e. collecting the recursive definitions in a first go only considering the headers and using them in the second, diving into to bodies.

The more complicated the compilation process would be with yacc the more substantial becomes the gain from using this framework. Practically, a compiler writer will therefore construct a derivation tree within yacc's semantic actions anyway. Generating this from the grammar, is the most substantial advantage of Styx over the lex/yacc combination.

The advantage of this framework becomes even more clear, if one thinks of an interpreted language. Because Styx also provides means to make the derivation tree persistent, the complete "compilation" of such languages may eventually been done by it. After writing down the grammar, the language designer can immediately concentrate on writing the interpreter, instead of being bothered with making an internal representation of the source, which is already provided by the Styx framework. Even in making such an interpreter, Styx gives some support by offering an appropriate interface to the source tree.

1.3 The role of Styx within this framework

Styx is designed to handle the first and the last step of the translation, that is, both the parsing and unparsing, or the string-to-tree and tree-to-string transformations, where source and object can be represented textually or binary.

Given a surface grammar, it does not only produce a scanner and a parser, but also automatically derives the depth grammar, creates the derivation tree (term) and provides a versatile interface oriented on the abstract language.

The unparsing is done by means of a pretty printer, or, if a binary format is of interest, it supports making the tree persistent. This is especially useful for interpreted languages when not instruction code is produced, but the derivation tree is evaluated directly.

Additionally, the result of the parsing process still maintains all source information including keywords and comments. By this, source-source translations are easily implemented with this tool, too.

1.4 The supported language model

While lex and yacc were designed to cope with the many abnormalities used in (surface) language design, Styx is somewhat more restrictive. It preferably supports languages designed by a canonical, two level model.

By this, the tokens separated by the scanner have to be in regular or ( since version 1.6 ) dyck grammars, and the context-free language has to be deterministic (LALR(1)).

This excludes versatile support of some constructions that we considered to be weird. Examples are languages that have no context-free grammar (like C, which comes with the context-dependent "typedef" construction).

This decision was originally made because Styx has been used for internal language design, therefore it did not have to cope with all the oddities out. Now that we have released Styx to public use, this design decision may cause troubles to other people, although one can work around these restrictions.

With the introduction of "dynamic" tokens in version 2.0.1 Styx is able to handle context-dependent constructs like C's "typedef". Beside that users will be able to work around ambiguities in the syntax of programming languages. One example for such an ambiguity is the "prefixexp" and "functioncall" definition in Lua's syntax. (see also Lua reference manual 5.1, chapter 2.5.8) Reduce-reduce conflicts can be solved with the help of explicit reduction rules.

1.5 Comparison to the lex/yacc combination

lex and yacc follow a different concept. Basically, they do not even provide the first step of this framework. Instead of providing the generation of a derivation tree, the allow to add "semantic actions", which means in most cases, that the tree has to be hand-crafted and constructed manually within the action slots.

The central disadvantage of lex/yacc's approach is, that the burden of designing and making this structures is placed on the shoulders of the developers. Not only, that this costs quite a lot of time, it also typically leads to unsystematically constructed internal structures and a non-comformant interface to them, if any. Often, the raw implementation of the tree structure is used instead, making changes in it's representation a subtle task.

Styx's restriction mentioned in the previous section are not substantial. Styx provides means to cope with abnormalities, but the slots where one can handle them are as weird as these language features are themselves. Since Styx is available in source form, one may additionally adjust it to strange needs. So, comparing to lex/yacc, the restrictions are minor. It only means that when one does weird things, the implementation may become weird in these places and the support less handy.

More substantial restrictions are in the parser and - originally - the scanner. The parser is LALR(1), which means that a look-ahead of only one token is supported. Likely the scanner, in versions prior to 1.7, does a look-ahead of only one character. This later restriction has been introduced to guarantee an effective scanner, which is linear by the length of the string, while lex provides an arbitrary look ahead, meaning that lex complexity can become quadratic for abnormal token designs. With version 1.7 it is possible to force the scanner to do a look-ahead of more than the default case of one character.

Additionally, lex allows to switch between different sets of tokens (languages) to be able to do even more weird things. Since version 1.6 the Styx scanner is able to handle different token sets, too.

Like the lex/yacc combination, the scanner and parser are separated within Styx (also much more integrated), so one can plug-in any other, even a hand-crafted scanner, if nothing else helps. The disadvantage of doing so is again, that the many supports that Styx offers for the canonical design do not apply anymore. One has to write additional code, about to the amount that people applying the lex/yacc combination are already used to.

Next Previous Contents