The Styx Handbook: Mapping Trees to Terms

5. Mapping Trees to Terms

As outlined in the introduction, it is not only an advantage for a proper interface on a derivation tree to use an (abstract) depth grammar, but also to switch from grammatical to algebraic notions. Basically by this, non-terminals are mapped onto (abstract data-)types and productions are mapped onto functions. Words in the domain of languages became terms. Grammars are treated as signatures (loosely speaking "header files") of term algebras, then. This is far more than an overall renaming, but a transition to a different more appropriate concept with different tools and properties.

Within Styx, the transition from the concrete (surface) grammar to the corresponding term algebra is done in one step, and one final outcome is the C interface of a concrete language. The abstract grammar is somewhat bypassed, see the notes at the end of the section.

This section mainly defines how we derive the term algebra from the concrete grammar. Having this, the C interface can be explained in more detail.

5.1 Well-formed productions

Before we can do our transformation, we have to place some requirements onto the concrete grammar first. These conditions are tested by the Styx system and non-well-formed productions are diagnosed.

Within this and the following subsections, we ignore any keyword members in the production bodies. This may or may not be indicated. Further, we treat the individual production rules (with keywords ignored) as predicates.

A production is well-formed if it belongs to one of the following groups:

let X :nil#*: where the production name starts with "nil", optionally followed by a natural number, and the production contains no (identifier) members.
let X :cons#*: Y Z where the production name is "cons", optionally numbered, and contains exactly two (identifier) members.
let X :ign#+ : Y where the production name starts with "ign"i, followed by a natural number and the production contains exactly one (identifier) member.
let X :name : X1 .. Xn where "name" does not start with "nil", "cons" or "ign". No restriction apply to the production members.
EBNF extension (styx version >= 1.8):
let X :name : X1 .. Xn where "name" is not "none", "some" or starts with "nil", "cons", "ign". No restriction apply to the production members.
let X :none : where the production contains no (identifier) members.
let X :some : Ywhere the production contains exactly one (identifier) member.

This grouping serves two purposes. The first two groups will be used to derive list-like productions, while the "ign" production is used to define identity productions. The later typically occur with expressions that have different levels binding strength or when likely classes of productions are excluded or included into certain contexts. When producing the abstract grammar, we consider these non-terminals to be equivalent.

As examples, see the definitions of Exp, Exp1-2 in the introduction and Exp, ExpDyck, ExpQuot, Exp0-4 in the lexical Styx grammar itself. For lists, the context free grammar of Styx (Dfns, Prds, Mbrs) are proper examples.

The last two groups will be used to derive option-like productions.

5.2 An induced congruence relation

We get rid of the superficial distinction between the different non-terminals by means of a congruence relation over the non-terminal names induced by the special production names "cons#*","nil#*" and "ign#+".

The congruence relation is defined as follows:

X <=> X
X <=> Y --> Y <=> X
X <=> Y && Y <=> Z --> X <=> Z
let X :ign#+: Y --> X <=> Y
let X :cons#*: Y Z --> X <=> Z
X <=> Y && let X :id: X1 .. Xn && let Y :id: Y1 .. Yn && 1 <= i <= n --> Xi <=> Yi

While the first three formulas define an equality by stating the properties of the relation (reflexive,symmetric,transitive), the next two specify the equations induced by the concrete grammar. By this, two non-terminals are treated as equivalent when they appear both on the left and right side of an "ign" production or on the left side and on the end of a "cons" production (in which case they both mean lists of the same type later).

The final rule makes a congruence from this equality. It states, that if we have two equivalent non-terminals, that both contain productions with the same name, then the equality is extended over the bodies of that productions by pairing each identifier successively and concluding the equality of the so-yielded pairs (ignoring keyword members).

5.3 Classes and Representatives

What we have gained so far is that we have evtl. grouped different (terminal and non-terminal) identifiers into the classes introduced by the above congruence relation. Using this relation each identifier corresponds to a set of its equivalents. As an example, "Exp2" in the introduction example expands to the set [Exp2] = {Exp,Exp1,Exp2}. These classes will later be mapped to the abstract types of the term algebra to be produced.

[X] = { Y | Y <=> X }

We assign to each of these classes a unique name by picking the lexically smallest identifier as the representative of the class. In our example, this is "Exp". We denote the so chosen representative of [X] by X^.

5.4 Compatibility Conditions

Having set up equivalent identifiers, we now come to the productions. Basically, all we have to do is to merge the productions of the different equivalent non-terminals and to drop the "ign" productions. But this is only possible under additional conditions. Basically, what can go wrong is, that by the congruence terminal and non-terminals have been concluded to be equivalent, that we cannot merge productions with same names and different numbers of (identifier) members, and that lists would contain additional non-list productions.

This leads to the following conditions:

X <=> Y --> Type(X) = Type(Y), where Type(Z) = { terminal, nonterminal }
let X :id: X1 .. Xm && let Y :id: Y1 .. Yn && X <=> Y --> m = n
(let X :nil#*: || let X :cons#*: A B) --> not exists P,prod: P <=> X && let P :prod: c && prod not in { ign#+, nil#*, cons#* }
EBNF extension (styx version >= 1.8):
(let X^ :none: a || let X^ :some: b) --> not exists P: P = let X^ :id: c && id =/= { ign#+, none, some }
(let X^ :none: a) --> exists P: P = let X^ :some: b
(let X^ :some: a) --> exists P: P = let X^ :none: b

While generating the abstract grammar, Styx will validate these compatibility conditions.

5.5 Conversion to term algebras

After all this preparation and conditions, we can finally convert the concrete grammar to a signature.

To do this, we map all non-terminals NT which does not have list-productions (those named "cons#*" or "nil#*") or - optionally - option-productions (those named "none" or "some") to their representative names NT^. Likely, all terminal names T are mapped to their representatives T^. Collectively, these form the types of the algebra.

Every non-list and - optionally - non-option production (ignoring keywords) of the form "let X :prod: X1 .. Xn" is mapped to a function "prod : |X1| .. |Xn| -> X^". "|Xi|" denotes here the right hand side translation of the (non-)terminal names to types. The difference is, that we have to cope with list-production and - optionally - option-production, which have been omitted earlier. |X| is X^ if we have a non-list and - optionally - non-terminal or a terminal X. If X is a non-terminal with a production "let A :cons#*: B C" and X <=> A, |X| is List(|B|). (If we only have nil-productions, the translation is List(void)). Optionally: If X is a non-terminal with a production "let A :some: B" and X <=> A, |X| is Option(|B|).

The set of the so-yielded functions forms the signature of the derived term algebra and what we finally get as a data model for the (abstract) derivation tree is the initial term algebra that corresponds to this signature.

To give another example for this derivation, here is the abstract of the Styx grammar itself:

/* ------------------------------------------------------------------------ */
/*                                                                          */
/* [styx.abs]                  Abstract Grammar                             */
/*                                                                          */
/* ------------------------------------------------------------------------ */

LANGUAGE styx

TOKENS

  Ide, Nat, Set, Seq

TYPES

  styx          = Start_Source(Source)

  Source        = root(OptNat, Ide, QlxDfn*, OptCfg)

  OptCfg        = non;
                  cfg(Dfn*, Conflict*)

  QlxDfn        = defd(Ide);
                  defn(QlxCat, QlxOpt, QlxGrp, Ide, QlxGrp, Exp);
                  igrp(Ide);
                  tgrp(Ide);
                  mgrp(Ide, Ide*);
                  xgrp(Ide)

  QlxCat        = comC;
                  indC;
                  letC;
                  tokC;
                  lanC;
                  ignC

  QlxGrp        = non;
                  pigrp;
                  pop;
                  igrp;
                  pgrp(Ide);
                  grp(Ide)

  QlxOpt        = ignca;
                  non

  Exp           = conc(Exp, Exp);
                  diff(Exp, Exp);
                  sequ(Seq);
                  plusn(Exp, Limit);
                  plus0(Exp);
                  dyck(Exp, Exp, Exp);
                  non;
                  opt(Exp);
                  range(Exp, Exp);
                  plus(Exp);
                  epat(Exp, Ide, Exp);
                  set(Set);
                  union(Exp, Exp);
                  quot(Exp, Exp);
                  ident(Ide);
                  star(Exp);
                  spat(Exp, Set, Exp)

  OptNat        = non;
                  nat(Nat)

  Limit         = range(Nat, OptNat);
                  ntime(Nat)

  Dfn           = defn(Cat, DfnOpt, Ide, Prd*)

  Cat           = letC;
                  bgnC

  DfnOpt        = non;
                  errnt

  Lay           = grp;
                  rec;
                  dft

  Prd           = prod(Lay, Ide, Mbr*)

  Mbr           = opt(Seq*, Mbr, Seq*);
                  dtok(Ide, Ide);
                  klst1(Seq*, Mbr, Seq*, Seq*);
                  tkm(Seq);
                  ntm(Ide);
                  klst0(Seq*, Mbr, Seq*, Seq*);
                  else

  Conflict      = defn(State, Token, Rule*)

  State         = nat(Nat);
                  ide(Ide);
                  seq(Seq)

  Token         = seq(Seq);
                  ide(Ide)

  Rule          = red(Ide, Ide)

Two notes on notation: The form List(X) is denoted as X*, the form Option(X) as X?. The functions are abbreviated for convenience extracting the result type, so Exp = ... star(Exp) denotes the function star: Exp -> Exp. For constants, i.e. functions with no parameters, the argument parenthesis are omitted, so "QlxOpt = non; ignca" spells the two functions non: -> QlxOpt and ignca: -> QlxOpt.

There's an extra rule for the start production(s) one may deduce from the examples.

5.6 A note on the implementation

After all these definitions, we only have the mapping from a grammar to a signature. The mapping from the concrete derivation tree onto corresponding terms is straight forward and will be informally explained by having a look on the implementation.

One might expect that the derivation tree is copied to yield a term. But this is not the case. Instead, the above introduced mapping has been carefully chosen to be done on the fly. So, what the Styx parser produces, is the concrete derivation tree as shown in the introducing example. With delivering it, it's job is done. The conventions defined in these section are implemented only within the C interface, which permits an abstract access to the concrete derivation tree.

All the necessary normalization is done within the access functions, the "term destructors" of the C interface. Looking closer at the structure of the derivation tree, one can already imagine what these functions have to do. Provided with a reference to the tree, they decent into it skipping every "ign" production they pass. After this, they end at a normal production and have to check for the production identifier. If this is OK, they start decomposing the children of the node into the result slot, thereby skipping all keywords and comments they meet. That's all.

The advantage of doing so is, that while having a rather compact view on the derivation tree, complete source information is still preserved and can be accessed from this abstract view whenever needed.

In practice, this interface have both been proven to be efficient as well during language design and application. Often the design starts out with the abstract grammar finding an appropriate surface grammar, or having the surface grammar already given, concentrates on extracting a proper abstract grammar, which can be easily done just by assigning proper production names.

5.7 Relation between the abstract grammar and the algebra.

Only to prevent confusion between terms and words in abstract grammar which are sometimes loosely treated as synonyms in the text, a few additional notes apply here.

When talking about terms, we're talking about values having specific types. These values are abstract in that we do not offer details of their implementation through their interface. In fact, Styx can produce different implementation of terms with an identical interface. Abstract data types are abstract with respect to their implementation.

In contrary, abstract grammars are abstractions of the surface details of the concrete grammar. An abstract grammar preserves the depth structure of the language, but simplifies the derivation tree by dropping unnecessary details as done above, for example. One can easily see that the structure is preserved by recognizing, that the mapping from the concrete derivation tree to it's abstract correspondent is a homomorphism.

The ".abs" file generated by Styx has both possible readings. On can read it as the abstract grammar as well as the signature of the term algebra.

Referring to the first, we can write down words of the abstract grammar, too. The word "1+2*(3-4)/5" of the introduction example would spell "add(1,div(mul(2,sub(3,4)),5))" in the abstract grammar. Superficially, this looks precisely like a denoted term of the corresponding algebra.

Next Previous Contents