
3. The Styx Language Specification

This section explains how to write a language specification for Styx. Contrary to yacc, Styx is implemented reflectively, meaning it is written with its own help. Thus, a proper Styx definition of the Styx language exists within the Styx source distribution. For omitted details you may want to refer to this source (styx.sty), from which we often cite in this part of the document. It not only provides a proper definition, but also gives a plethora of examples.

3.1 The Overall Source

Referring back to the above walk-through, a specification of a language is written down within one file consisting of three sections:

Note: Since version 1.3 the preprocessing facility #include can be used to modularize the specification in order to re-use parts of it.
  start Source
  :root : "Language" OptNat Ide 
          QlxDfns0
          OptCfg
 
  let OptCfg
  :non  :
  :cfg  : "Context" "Free" "Grammar" 
            Dfns
            Conflicts

  let QlxDfns0
  :nil  :
  :ign0 : "Regular" "Grammar"
            QlxDfns

An extra twist is implemented within the Styx generators: as a naming convention, the Styx source files must be named after the corresponding languages and have the extension ".sty". Thus, if you specify a language named "calc", you have to name the language definition file "calc.sty".
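
To give an impression of the overall shape, here is a minimal sketch of what a "calc.sty" could look like; the token and production names are made up for illustration and do not stem from the Styx distribution:

  Language calc

  Regular Grammar

    let Digit = '0'..'9'            ; auxiliary regular set
    tok Nat   = Digit+              ; natural numbers
    tok Opr   = '+-'                ; operator characters, usable as keywords below
    ign Space = " " | "\n"          ; formatters, completely ignored

  Context Free Grammar

    start Expr                      ; can be parsed individually
    :num : Nat
    :add : Expr "+" Nat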

3.2 Lexical Conventions

Character Set, Formatters and Comments

The character set for the source is ASCII. For later reference, we distinguish between printable characters and control characters:

; Character Set

  let Byte        = '\00' .. '\ff' ; all extended ASCII
  let Control     = '\00' .. '\1f' ; control
                  |          '\7f' ; DEL
                  |          '\ff' ; space-like extended ASCII

  let Printable   = Byte - Control

Space, newline, return and form feed are used to separate tokens and are otherwise completely ignored. The source itself is format-free. Note that the page separator character may also be used; we never refer to the source by pages. Additionally, no tabulator characters may be used within the source. We had so many problems with different programs having different ideas about how to expand them that we dropped them from this specification.

  ign Space       = " "                 ; ASCII - Space
  ign Line        = "\n" | "\r\n" | "\r" ; UNIX / DOS / Mac
  ign Page        = "\p"                ; weak separation convention

Comments start with a semicolon and extend to the end of the line.

; Comments et al

  com Comment     = ';' {Printable}

Identifiers, Literals and Operators

The regular tokens are identifiers (consisting of letters and digits, starting with a letter), three sorts of literals and a set of operators.

; complex tokens

  tok Ide = Letter {Letter} {Digit} ; Identifier
  tok Nat = Digit+                  ; Natural
  tok Set = '\'' {LitChar} '\''     ; CharacterSet
  tok Seq = '\"' {LitChar} '\"'     ; CharacterSequence (String)
  tok Opr = (Special - ';=<>')+ | '=<>' ; Operator

Besides the natural numbers, which are later used to denote characters by their ASCII code, we distinguish two sorts of strings among the literals. Single-quoted strings denote sets of characters, double-quoted strings denote sequences of characters. When containing only a single character, their meaning is of course identical.

Contrary to C syntax, both the single and the double quote have to be escaped when used inside these literals. Additionally, a hexadecimal notation for (Unicode) characters is provided within the character literals. Some control characters (form feed, return, newline, tabulator) can also be denoted within the quotation by a single character after the backslash.
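
For illustration, here are a few literals written with these conventions (the names are made up for this example):

  let SQuote  = '\''      ; a set containing the single quote itself
  let Newline = "\n"      ; control character denoted by an escape
  let Del     = '\7f'     ; character given by its hexadecimal code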

For completeness, here are the remaining definitions for the literals:

; Definitions

  let Letter      = 'A'..'Z' | 'a'..'z'
  let HexDigit    = '0'..'9' | 'a'..'f'
  let Digit       = '0'..'9'
  let Normal      = Letter | Digit | Space

  let Quote       = '\'\"\`\\'
  tok Parenthesis = '()[]{}'       ; one character tokens

  let Special     = Printable - Normal - Parenthesis - Quote

  let LitChar     = Printable - Quote
                  | '\\' (Quote | 'prnt' | HexDigit HexDigit)
                  | '\\' 'xX' HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit

The remaining tokens are operators and parentheses. Neither token class has a meaning of its own; both are used to form reserved words later in the grammar. Operators are made up of special characters.

Preprocessing macros

The tokens in this section have a special meaning for the Styx preprocessor. They were introduced to provide modularization, macro expansion and conditional compilation.

; Preprocessing tokens

  let White   = Space | Line | Page
  let Name    = (Letter | "_") { Letter | Digit | "_" } 
  let MPar    = ( Printable - ( White | ',' | ')' | '=' ) )
                { Printable - ( White | ',' | ')' | '=' ) }

  tok MacInc  = "#include" White {White} (Printable-White) {Printable-White} ; Include
  tok MacDfn  = "#macro" White {White} Name                                  ; Macro definition
                  {White} [ "(" {White} MPar 
                  { {White} "," {White} MPar } {White} ")" {White} ]
                  [ "=" 
                    ({Byte} - ({Byte} ("#macro"|"#end") {Byte})) 
                    "#end" ]

  tok MacSep  = '\'' (Byte-'\'') [ '-' ]                                     ; End of parameter

  tok MacCond = ( ( "#ifdef" | "#ifndef" ) White {White} Name )              ; Conditionals
              | "#else" | "#end"

The reserved words are "Language", "Regular", "Grammar", "Context", "Free", "let", "tok", "ign", "com", "ica", "ind", "lan", "InGroup", "ExGroup", "Group", "other", "start", "err" and "reduce". Further, "cons", "nil" and words starting with "ign" have a special meaning when used as production names. Since version 1.8 the production names "none" and "some" also have a special meaning in all grammar definitions with a version number (>=1).

3.3 The Regular Grammar

After the introducing "Regular Grammar" keywords, the regular grammar is specified as a collection of equations. Following a leading keyword, which gives some hints on how to treat the equation, and possibly some option and group information (see below), a name is assigned to a regular expression. Have a look at the preceding definitions to get an idea of how this looks.

As with modern (f)lex implementations, Styx (since version 1.6) also supports the definition of inclusive or exclusive groups of regular expressions. They are useful for switching between different regular languages. ( see Example05 for a demonstration )
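
A hedged sketch in the spirit of Example05 (all names are made up): recognizing an opening quote switches the scanner to the exclusive group Str, whose tokens are active until the closing quote switches back to the initial group.

  ExGroup Str

  tok StrBeg        ! Str = '\"'                ; enter group Str
  tok :Str: StrChr        = Printable - '\"\\'  ; only active within group Str
  tok :Str: StrEnd  !     = '\"'                ; back to the initial group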

Version 2.0.1 introduces "dynamic" tokens. They are declared to the parser with the QlxDfn production "defd" and are initially unknown to the scanner, i.e. they have no regular expressions. With the help of a special member production, token values are assigned to them during the parse process. ( see Example08 for a demonstration )

; REG-Section

  let QlxDfns ; Qlx-Definitions
  :nil  :
  :cons : QlxDfn 
          QlxDfns

  let QlxDfn  ; Qlx-Definition
  :defn : QlxCat QlxOpt QlxGrp0 Ide QlxGrp1 "=" ExpQuot ; regular expression definition
  :defd : "tok" "<" Ide ">"                             ; dynamic token
  :igrp : "InGroup" Ide                                 ; inclusive group definition
  :xgrp : "ExGroup" Ide                                 ; exclusive group definition
  !tgrp : "ExGroup" Ide "[" "tok" "]"                   ; and ignore keywords in group (version >= 2.0.1)
  :mgrp : "Group"   Ide = Ids                           ; group list definition, introduced in version 1.7

  let Ids  ; Group list members
  :cons: Ide Ids0 ; 'Ide' can refer to a QlxGroup or the language as initial QlxGroup.

  let Ids0
  :nil :
  :cons: Ide Ids0

  let QlxGrp
  :non  :                ; no group information

  let QlxGrp0  
  :grp  : ":" Ide ":"    ; The regular expression belongs to QlxGroup or group list 'Ide'.
  :ign0 : QlxGrp         ; The regular expression belongs to the initial QlxGroup.

  let QlxGrp1  
  :grp  : "!"  Ide       ; The recognition of the regular expression activates QlxGroup 'Ide'.
  :igrp : "!"            ; The recognition of the regular expression activates initial QlxGroup.
                         ; The following definitions have been added in version 1.7:
  :pgrp : "!+" Ide       ; The recognition of the regular expression pushes the current QlxGroup onto the stack
                         ; and activates QlxGroup 'Ide'.
  :pigrp: "!+"           ; The recognition of the regular expression pushes the current QlxGroup onto the stack
                         ; and activates initial QlxGroup.
  :pop  : "!-"           ; The recognition of the regular expression pops and activates the top QlxGroup on the stack.  
  :ign0 : QlxGrp

  let QlxCat ; QlxCategory
  :letC : "let"
  :tokC : "tok"
  :indC : "ind"
  :lanC : "lan" ; Embedded language: 
                ; lan :Ide_Language: Ide_Startsymbol ! Ide_EofOrFollowTokenLanguage = Ide_EofOrFollowToken+
  :ignC : "ign"
  :comC : "com"

Groups and group lists must be defined before they can be referenced. With this exception, the definitions can come in any order; an applied occurrence does not need to follow its definition textually. It is only required that no recursion is used, so you can order the definitions according to other criteria. Note that, contrary to the lex program, no implicit semantics is placed on the order of the definitions.

Categories

The leading keyword in the equations (see QlxCat) describes the usage of a token. First, the equations introduced using "let" are auxiliary. They do not specify tokens but only regular sets possibly used in them. As a consequence, they can neither be grouped nor switch to some group. See the above section for typical applications of this feature.

The next keyword "tok" introduces regular tokens. The identifier following this keyword and the optional option and group information are the only ones that can later be used within the context free grammar. Also, when implicitly used there as keywords, only these regular sets will be considered.

In order to support the specification of indented languages (since version 1.6), the keyword "ind" introduces so-called indent tokens as regular tokens. In the context free grammar they will be referenced by the corresponding minimal indent and dedent keywords. ( see Example06 for a demonstration )

Another regular token variant (since version 1.6) is the so-called embedded language token. Such tokens are introduced by the keyword "lan" and their values are not just strings but trees, i.e. terms. Within such a definition, the initial keyword must be followed by the name of the embedded language, the start symbol from which the parsing process should start, and either the name of the "hosting" language along with a list of follow tokens or the name of the embedded language along with a list of so-called eof tokens.

In the context free grammar they are referenced by concatenating the names of the embedded language and the start symbol. ( look at the XML language definition in the reference section for a demonstration )

Last, the remaining keywords ("ign" and "com") introduce tokens that will be more or less ignored. "com" is for comments; the semantics is that they will be stored in the derivation tree (for possible source-to-source translation), but will not be accessible through the language-specific interface. Also, both "com" and "ign" tokens can be inserted at any place within the language sources.

"ign" tokens are completely ignored and never even leave the scanner. Conceptually, they do their duty as formatting character. Because the scanner knows about the newline character and provides line and column position with each token, these classes of characters may (somehow indirect) be accessible in the source tree later. If no strange things are done with the control characters (i.e. only uses space and newline as formatters), on can fully reproduce the source from the derivation tree modulo trailing spaces and empty lines.

Collectively, all definitions besides the "let" ones are considered to form the tokens of the language. Styx's lexical analyzer requires these token definitions to be pairwise disjoint, so no two of them may contain the same word. While the lex program resolves possible non-empty intersections by an implicit "priority", one has to make this explicit when using Styx. There are many ways to do this; one possibility is to use the difference operator ("-") to clarify the situation. Styx will issue errors as soon as non-empty intersections are detected.
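
As a hypothetical illustration of the difference operator, consider a word token that would otherwise overlap with a separately defined keyword token:

  tok End  = "end"
  tok Word = Letter+ - "end"    ; 'Word' and 'End' are now disjoint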

In the language interface, the tokens will be offered as symbols. Basically, these are unique strings, allowing them to be compared with the C identity predicate ("=="). By this means, string-equal tokens are also stored only once.

Normalizing Tokens

This can become a disadvantage when tokens are defined abnormally within the language. Although we consider this a weak design anyway, a few means are provided to introduce a normalizer for such tokens. In order to support case-insensitive languages, a normalizer for these is built into Styx.

  let QlxOpt  ; QlxOption
  :non  :
  :ignca: "[" "ica" "]"
The "[ica]" option before the defined identifier indicates that the case has to be ignored. As a result all of the corresponding token values will be normalized to small letters. Note that using abnormal tokens has many disadvantages, among others a possible lost of source information. People who define such abnormalities are typically unable to decide whether they really mean what they do. I've seen, for example, PASCAL implementations, which were case insensitive but identifiers like "FileRead" being defined in them. This certainly means asking for trouble. We cannot help bad design and strongly suggest not to use normalizers on tokens.

Regular Expressions

Here we finally come to the right-hand sides of the regular equations.

  let Exp4     ; Expression prio 4
  :sequ : Seq
  :set  : Set
  :ident: Ide

The meaning of the set and sequence literals has already been defined when these tokens were introduced. An identifier denotes the regular set corresponding to some other equation.

  let Exp3     ; Expression prio 3
  :ign1 : Exp4
  :range: Exp4 ".." Exp4
  :ign2 : "(" Exp ")"

Round parentheses may be used to group expressions; the double-dot operator ".." can be applied to construct character ranges. Both of its arguments have to be single characters.

  let Exp2     ; Expression prio 2
  :opt  : "[" Exp "]"
  :star : "{" Exp "}"
  :puls0: Exp "*"
  :plus : Exp3 "+"
  :plusn: Exp3 Limit
  :ign1 : Exp3

  let Limit              ; occurrence limit
  :ntime: Nat            ; exact ntimes
  :range: Nat "," OptNat ; minimum and optional maximum

  let OptNat
  :non : 
  :nat : Nat

Next in binding strength come the different sorts of monoids and options. The "+" suffix means one or more occurrences. The curly brackets or, since version 1.7, the "*" suffix stand for zero or more occurrences, and the square brackets mean zero or one occurrence. Limited occurrences ( introduced in version 1.7 ) can be expressed by specifying a number or a range.
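
A few hypothetical definitions using these forms:

  let Sign     = '+-'
  let Hex4     = HexDigit 4       ; exactly four hexadecimal digits
  let HexShort = HexDigit 2,4     ; between two and four hexadecimal digits
  tok Int      = [Sign] Digit+    ; optional sign followed by one or more digits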

  let Exp1     ; Expression prio 1
  :conc : Exp1 Exp2
  :ign1 : Exp2

The concatenation is denoted by simply concatenating expressions. The corresponding operator is omitted.

  let Exp     ; Expression prio 0
  :union: Exp "|" Exp1
  :diff : Exp "-" Exp1
  :ign1 : Exp1

Finally, and weakest in binding strength, we have the set union ("|") and difference ("-") operations.

The following definitions refer to versions 1.6 and 1.7 and above. They introduce Dyck, pattern and quotient expressions in order to cope with nested and heredoc comments, among other things.

Quotient expressions are introduced in version 1.7. The quotient part of the expression can be a non-Unicode character set or a sequence. ".." within the character set denotes a character range. In the first case, all characters belonging to the set will be stripped from the end of the token and re-scanned. In the second case, the defined sequence will be stripped from the end of the token and re-scanned. The scanner reports an error if the token doesn't end with that sequence.
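
A hedged sketch using the set form (the name and the intended use are made up): a cell token whose trailing blanks are not part of the token value but are handed back to the scanner.

  tok Cell = (Printable - '|')+ / ' '   ; trailing blanks are stripped and re-scanned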

  let ExpQuot ; quotient 
  :quot : ExpDyck "/" Exp4 ; Exp4 = Set or Seq
  :ign0 : ExpDyck

Dyck expressions, introduced in version 1.6, specify the left parenthesis, the inner part and the right parenthesis. To give an example, < /* > < */ > describes recursive C-like comments.
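
Written out as a complete definition, this could read as follows (a sketch, with the parentheses given as quoted sequences):

  com CComment = < "/*" > < "*/" >   ; nested comments like /* a /* b */ c */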

Pattern expressions are introduced in version 1.7. The production 'spat' specifies a start pattern token. The non-Unicode character set argument constitutes the pattern, while the expression arguments constitute the pattern prefix and suffix. ".." within the character set denotes a character range. There must always be a corresponding end pattern token of production 'epat' which tries to match the current start pattern. The ID of the end pattern token refers to the non-matching case.

  let ExpDyck ; dyck ( Exp )
  :dyck : "<" Exp ">" Exp0 "<" Exp ">"
  :spat : "<" "=" Exp0 ">" Set "<" Exp0 ">"
  :epat : "<" "?" Exp0 ">" Ide "<" Exp0 ">" ; Ide = start token ID
  :ign0 : Exp

  let Exp0
  :non  : ; no restriction on the inner part
  :ign0 : Exp

3.4 The Context-Free Grammar

Here we deal with the definition of the context free grammar section in the Styx sources. This is straightforward and basically a triply nested list.

On the top level we have a list of definitions (Dfns) of non-terminal identifiers, whose bodies consist of a list of productions (Prds) for these non-terminals, again each identified by a name. The body of an individual production is formed by a list of members (Mbrs), which are either identifiers denoting terminals or non-terminals, strings denoting keywords, or a non-specified sequence denoted by the keyword "[other]".

Since version 1.8 Styx supports an EBNF-like notation for list and optional members. (see the member syntax below and Example02 for an application)
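
A hypothetical production using the optional list notation, assuming that the regular grammar of the language covers "(", ")" and "," as tokens:

  let Args                           ; argument list
  :args : "(" [ Expr "," ... ] ")"   ; zero or more 'Expr', intended as a ","-separated list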

With the help of the member production "dtok" it is possible to introduce a value V, parsed as token X, as the dynamic token Y, where X and Y are denoted by the left and right token identifiers. From that position within the parse process on, the scanner always reports V as Y. (Styx version >= 2.0.1)

The non-terminal names defined have to be unique within the scope of the source and distinct from the names of the regular sets defined in the previous section. The production names have to be unique within each non-terminal definition.

The keywords (string members) have to belong to one of the defined regular sets of tokens.

; CFG-Section

  let Dfns    ; Definitions
  :nil  : 
  :cons : Dfn 
          Dfns

  let Dfn     ; Definition
  :defn : Cat DfnOpt Ide 
          Prds

  let Prds    ; Productions
  :nil  : 
  :cons : Prd 
          Prds

  let Prd     ; Production
  :prod : Lay Ide ":" 
            Mbrs

  let Mbrs    ; Members
  :nil  : 
  :cons : Mbr 
          Mbrs


  let Mbr     ; Member
  ; keywords
  :tkm  : Seq
  ; fallback for a non-grammatical source sequence
  :else : "[" "other" "]"
  :ign0 : Mbr1

  let Mbr1    ; Member
  ; (dynamic) token or nonterminal
  :ntm  : Ide

  ; EBNF extension: optional list member
  :klst0: "[" OptKey Mbr1 OptKey "..." OptKey "]" 
  ; EBNF extension: mandatory list member
  :klst1: "(" OptKey Mbr1 OptKey "..." OptKey ")"
  ; EBNF extension: optional member
  :opt  : "[" OptKey Mbr1 OptKey "]"

  ; introduction of dynamic token
  :dtok : Ide "<" Ide ">"

  let OptKey
  :nil  :
  :cons : Seq OptKey

Some options extend this construction, of which the most important is the distinction between start and inner productions. Start productions indicate those non-terminals which can later be parsed individually, while inner productions can only be parsed as part of a start production. Referring back to the regular grammar specification, this distinction is much like the one between the "let" and "tok" categories. We use a similar syntactic construction for the distinction, a leading keyword: start productions are indicated by a leading "start" and inner productions by a leading "let".

  let Cat     ; Category
  :letC : "let"
  :bgnC : "start"

The remaining options deal with error recovery and pretty printing.

Use the error option to specify a non-terminal as a resumption point within the implemented panic-mode error recovery. To enforce the default error handling, where the parse process stops when a syntax error occurs, you should omit the error option.

  let DfnOpt  ; DfnOption
  :non  :
  :errnt: "[" "err" "]"

The layout option gives the pretty printer some hints for the layout of the corresponding grammar phrases. Choose the colon (":") as the default, or if you aren't interested in that facility. ( look at the layout specification for details )

  let Lay
  :reg  : ":"
  :line : "?"
  :nof  : "!"

Explicit rules to solve reduce-reduce conflicts

Often language definitions are ambiguous w.r.t. LALR(1) parsing. As stated above, one example is the function call and expression syntax of Lua. (see the Styx grammar lua.sty, which is part of Example07, for an application)

  let Conflicts    ; Conflict rules
  :nil  : 
  :cons : Conflict 
          Conflicts

  let Conflict     ; Conflict rule (according diagnostic output)
  :defn : "Context" State "." Token ":"
          Rules

  let State        ; Conflict state
  :nat : Nat       ; state index
  :seq : Seq       ; state symbol (keyword)
  :ide : Ide       ; state symbol (token, nonterminal)

  let Token        ; Conflict token
  :seq : Seq
  :ide : Ide

  let Rules0 
  :nil  : 
  :cons : Rule 
          Rules0

  let Rules        ; Conflicting reduce productions, explicitly ordered
  :cons : Rule 
          Rules0

  let Rule
  :red : "reduce" Ide "." Ide  ; Conflicting reduce production

Normalizing Productions

Some of the identifiers for production names are reserved for normalization. These are "cons(0-9)*", "nil(0-9)*" and "ign(0-9)+". Apart from these and other keywords used in the Styx grammar, you are otherwise free to choose the names. The mentioned identifiers serve as indications of how to make up the depth grammar. A separate section below is devoted to this topic.

As stated above, since version 1.8 the production names "none" and "some" are also reserved for normalization, provided the language section of the grammar definition carries a version number.

