The anatomy of a programming language

TIES542 Principles of Programming Languages, Spring 2017
Antti-Juhani Kaijanaho

Programming languages are conventionally described and analyzed by discussing syntax (in Finnish kielioppi or syntaksi) and semantics (in Finnish merkitysoppi or semantiikka) separately. This division goes back to the late 1950s when the first modern programming languages were being designed and implemented.

The precise boundary between syntax and semantics is fuzzy. Many practitioners and researchers define these two terms in the following manner:

  • syntax is whatever can be described with a context-free grammar whose terminal symbols are characters in the system character set; and
  • semantics is everything else.

In this document, I will be building a simple computer language and an interpreter for it (the full source code for the interpreter is available on Yousource). This language allows one to write arithmetic expressions with local variables (e.g. 1+(2*3)/4 or let x = 4 in x * 2), but nothing more; it thus is not really a programming language. It – or something like it – is, however, a part of every modern programming language, and it does illustrate all the fundamentals of programming language anatomy.

Note that since the course TIEA241 Automatons and Formal Languages is a prerequisite of this course, I will not discuss context-free grammars or their associated concepts in much detail here.

Syntax

As discussed above, conventionally programming language syntax encompasses everything that can be described using a context-free grammar. This means that the following questions belong (with some exceptions) in the area of syntax:

  1. What symbols (letters, digits, punctuation, something more exotic?) are used to write programs in this language?
  2. How may those symbols be combined to form programs?
  3. What symbol strings are syntactically valid programs?
  4. What is the hierarchical structure of programs?

The answer to the first question is in principle determined by the terminal symbols of the grammar. In practice though, this is a bit more complicated: the context-free grammar is usually preceded by a lexical syntax, which defines a mapping from concrete character strings to the terminal symbols of the grammar. This is a matter of convenience only: separating lexical syntax from the context-free syntax simplifies the grammar and makes it easier to understand. The same thing could be written directly as a single huge context-free grammar, though.

This sort of syntax is sometimes called concrete syntax, because it specifies the concrete language as a set of character strings. Its dual is abstract syntax, which I will discuss later in this document.

Lexical syntax

In a typical programming language, lexical syntax specifies how a long sequence of (usually ASCII or Unicode) characters is split into lexemes (in Finnish lekseemi), which are akin to words and punctuation in natural languages. In Java, for example, possible lexemes include variable names like i, type names like Object, punctuation like { and }, literal constants like "hello", and operator names like %. Usually the program must consist of a sequence of lexemes, possibly separated by whitespace. Whitespace is typically ignored except insofar as it delimits lexemes.

Lexical syntax also defines a categorization of lexemes into token types such as identifier, string literal, opening curly bracket, and percent sign. Some token types encompass a single lexeme; for example, opening curly bracket comprises the single lexeme {. Other token types encompass many different lexemes; for example, string literal comprises "", "hello", "\t%s", and many others. In the latter case, lexical syntax also defines a semantic value for the lexeme; for example, "hello" has the semantic value hello and "\t%s" has the semantic value of a three-character string consisting of tab, percent sign, and the letter s. For each lexeme, the combination of its token type and semantic value (if any) is the token (in Finnish sananen) corresponding to the lexeme.

For a simple arithmetic calculator with local variables and parentheses, we might define a lexical syntax as follows:

  • There are 11 token types: variable name, numeric constant, let, in, =, +, -, *, /, (, and ).
  • A variable name consists of at least one character, the first being a letter or underscore (_) and the rest being letters, underscores or digits. However, let and in are not variable names.
  • A numeric constant starts with zero or more digits, optionally followed by a period (.) and zero or more digits. There must be at least one digit in a numeric constant.
  • Whitespace is ignored except so far as it separates lexemes.
  • The input is split into lexemes from left to right with no backtracking; when multiple lexemes are possible, the longest wins.

The last rule deals with inputs like fine + 1: without the rule, it could be split into f, in, e, +, 1. With the rule, we start from the left and look for the longest initial substring that is a lexeme; it is fine (a variable name). We then find + and 1; yielding the lexeme sequence fine, +, 1.

It is quite common to specify lexical syntax using this sort of semiformal English prose. Another common method is to write a context-free grammar for the set of possible lexemes. One might write the following grammar for the above lexical syntax (using the classic Backus-Naur Form, BNF, as the metalanguage):

<lexeme> ::= <variable name> | <numeric constant> | let | in | = | + | - | * | / | ( | )
<variable name> ::= <first character> <rest of the variable name>
<first character> ::= <letter> | _
<rest of the variable name> ::= <empty> | <name character> <rest of the variable name>
<name character> ::= <letter> | <digit> | _
<numeric constant> ::= <integer constant> | <float constant>
<integer constant> ::= <digits>
<digits> ::= <digit> | <digit> <digits>
<float constant> ::= <digits> . | . <digits> | <digits> . <digits>

Actually, the set of lexemes is usually a regular language, and as such we could also use regular expressions (the names are simply abbreviations; I use ${...} to reference abbreviations inside a regex):

LEXEME = (${NUMERIC_CONSTANT})|(${VARIABLE_NAME})|let|in|-|\+|\*|/|\(|\)|=
NUMERIC_CONSTANT = ([0-9]+|[0-9]+\.[0-9]*|[0-9]*\.[0-9]+)
VARIABLE_NAME = [A-Za-z_][A-Za-z_0-9]*

Both the regular expression and context-free grammar methods require a rule to specify how a sequence of lexemes is recognized from a long string of characters: the last rule above works for them too. Both also require a disambiguation rule that says that let and in are reserved names and that if a lexeme is a reserved name, it will not be allowed as a variable name.

It is also possible to write a program that takes a string as input and outputs a sequence of tokens. This program is called a lexer or a scanner (in Finnish selaaja or lekseri). If it is written very clearly, it too can function as a specification of the lexical syntax.

[Code listing lexer2: see the interpreter source on Yousource.]
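
Since the actual listing lives in the repository, here is a minimal sketch of such a lexer written for this document; the class and member names (Lexer, Token, Type, and so on) are my own and need not match the Yousource code. It implements the five lexical rules above directly: maximal munch falls out of the greedy character loops, and the reserved names let and in are handled as a special case of variable names.

import java.util.ArrayList;
import java.util.List;

public class Lexer {
    public enum Type { VARIABLE, NUMBER, LET, IN, EQUALS, PLUS, MINUS, STAR, SLASH, LPAREN, RPAREN }

    public static final class Token {
        public final Type type;
        public final String name;   // semantic value of a variable name, otherwise null
        public final double value;  // semantic value of a numeric constant, otherwise 0
        Token(Type type, String name, double value) {
            this.type = type; this.name = name; this.value = value;
        }
    }

    public static List<Token> lex(String input) {
        List<Token> result = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }  // whitespace only separates lexemes
            if (Character.isLetter(c) || c == '_') {
                // maximal munch: take the longest possible name
                int start = i;
                while (i < input.length()
                       && (Character.isLetterOrDigit(input.charAt(i)) || input.charAt(i) == '_')) i++;
                String word = input.substring(start, i);
                switch (word) {  // disambiguation: let and in are reserved names
                    case "let": result.add(new Token(Type.LET, null, 0)); break;
                    case "in":  result.add(new Token(Type.IN, null, 0)); break;
                    default:    result.add(new Token(Type.VARIABLE, word, 0));
                }
            } else if (Character.isDigit(c) || c == '.') {
                // digits, optionally a period and more digits; at least one digit required
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                if (i < input.length() && input.charAt(i) == '.') {
                    i++;
                    while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                }
                String text = input.substring(start, i);
                if (text.equals(".")) throw new RuntimeException("lone period at position " + start);
                result.add(new Token(Type.NUMBER, null, Double.parseDouble(text)));
            } else {
                switch (c) {
                    case '=': result.add(new Token(Type.EQUALS, null, 0)); break;
                    case '+': result.add(new Token(Type.PLUS, null, 0)); break;
                    case '-': result.add(new Token(Type.MINUS, null, 0)); break;
                    case '*': result.add(new Token(Type.STAR, null, 0)); break;
                    case '/': result.add(new Token(Type.SLASH, null, 0)); break;
                    case '(': result.add(new Token(Type.LPAREN, null, 0)); break;
                    case ')': result.add(new Token(Type.RPAREN, null, 0)); break;
                    default: throw new RuntimeException("unexpected character " + c);
                }
                i++;
            }
        }
        return result;
    }
}

On the input fine + 1, lex returns the three tokens variable name (fine), +, and numeric constant (1), just as the maximal-munch rule requires.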

Phrase-structure syntax

There is no established term for that part of language syntax which excludes lexical syntax; I choose to call it here phrase-structure syntax (in Finnish ilmaisurakennekielioppi). It is almost always defined using a context-free grammar, usually one that is LALR or sometimes even LL(1); it is desirable but not essential that the grammar is unambiguous.

The phrase-structure syntax uses the token stream defined by the lexical syntax as its starting point. The token types function as the terminal symbols of the formal grammar. The grammar itself ignores the semantic values of the tokens, but they are carried over to the parse tree (in Finnish jäsennyspuu) that the grammar defines for each syntactically valid program.

For the calculator with local variables, we might make the following first attempt at a grammar, using classic BNF as the metalanguage:

<expression> ::= <expression> + <expression>
               | <expression> - <expression>
               | <expression> * <expression>
               | <expression> / <expression>
               | + <expression>
               | - <expression>
               | <numeric constant>
               | <variable name>
               | let <variable name> = <expression> in <expression>
               | ( <expression> )

In BNF, nonterminals and those terminals that carry semantic values are given names that are put inside angle brackets; terminals that stand for themselves occur by themselves. Alternate right-hand sides of a production are separated by |.

This grammar is adequate in that it specifies what token sequences are syntactically valid. However, it falls short in another task that the phrase-structure syntax has: it needs to specify the hierarchical structure of the program. In this case, we need a grammar that tells us whether in 1 + 2 * 3 it is the addition or the multiplication that must be performed first. The first-attempt grammar is ambiguous on exactly this point: it derives 1 + 2 * 3 in two ways, one corresponding to (1 + 2) * 3 and the other to 1 + (2 * 3).

Here is another grammar, one that is unambiguous and, in fact, LALR. The method used to produce it is taught in TIEA241 (at least as I teach that course; see lecture 12 of 2016).

<expression> ::= <term>
               | let <variable name> = <expression> in <expression>

<term> ::= <factor>
         | <term> + <factor>
         | <term> - <factor>
         
<factor> ::= <unary expression>
           | <factor> * <unary expression>
           | <factor> / <unary expression>

<unary expression> ::= <primary expression>
                     | + <unary expression>
                     | - <unary expression>

<primary expression> ::= <numeric constant>
                       | <variable name>
                       | ( <expression> )

As is taught in TIEA241, there are well defined procedures for turning context-free grammars into parsers. In the following code, I have followed an informal version of predictive recursive-descent parsing: I have not computed FIRST or FOLLOW sets, but the parser is still free of backtracking. The main disadvantage is that without FIRST and FOLLOW sets, the error messages are much poorer. (Note that the parser also constructs a data structure of type AST.Expression as it functions; I will discuss that part of it later.) A parser should always be based on an explicitly stated context-free grammar; program code is never a good definition of syntax.

[Code listing parser: see the interpreter source on Yousource.]
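
As the actual parser is in the repository, here is a minimal sketch of the same approach, assuming the Lexer.Token type from the lexer sketch above and the AST classes sketched in the section on abstract syntax below. Each nonterminal of the unambiguous grammar becomes a method; the left-recursive productions for <term> and <factor> become while loops, which also yields the left associativity of the binary operators.

import java.util.List;

public class Parser {
    private final List<Lexer.Token> tokens;
    private int pos = 0;
    public Parser(List<Lexer.Token> tokens) { this.tokens = tokens; }

    private Lexer.Token peek() { return pos < tokens.size() ? tokens.get(pos) : null; }
    private boolean at(Lexer.Type t) { return peek() != null && peek().type == t; }
    private Lexer.Token expect(Lexer.Type t) {
        if (!at(t)) throw new RuntimeException("syntax error at token " + pos);
        return tokens.get(pos++);
    }

    public AST.Expression parseAll() {
        AST.Expression e = parseExpression();
        if (peek() != null) throw new RuntimeException("unexpected trailing tokens");
        return e;
    }

    // <expression> ::= <term> | let <variable name> = <expression> in <expression>
    private AST.Expression parseExpression() {
        if (at(Lexer.Type.LET)) {
            expect(Lexer.Type.LET);
            String name = expect(Lexer.Type.VARIABLE).name;
            expect(Lexer.Type.EQUALS);
            AST.Expression def = parseExpression();
            expect(Lexer.Type.IN);
            AST.Expression body = parseExpression();
            return new AST.Let(name, def, body);
        }
        return parseTerm();
    }

    // <term> ::= <factor> | <term> + <factor> | <term> - <factor>
    private AST.Expression parseTerm() {
        AST.Expression e = parseFactor();
        while (at(Lexer.Type.PLUS) || at(Lexer.Type.MINUS)) {
            char op = at(Lexer.Type.PLUS) ? '+' : '-';
            pos++;
            e = new AST.Binary(op, e, parseFactor());  // loop gives left associativity
        }
        return e;
    }

    // <factor> ::= <unary expression> | <factor> * <unary expression> | <factor> / <unary expression>
    private AST.Expression parseFactor() {
        AST.Expression e = parseUnary();
        while (at(Lexer.Type.STAR) || at(Lexer.Type.SLASH)) {
            char op = at(Lexer.Type.STAR) ? '*' : '/';
            pos++;
            e = new AST.Binary(op, e, parseUnary());
        }
        return e;
    }

    // <unary expression> ::= <primary expression> | + <unary expression> | - <unary expression>
    private AST.Expression parseUnary() {
        if (at(Lexer.Type.PLUS)) { pos++; return parseUnary(); }  // unary plus is a no-op; not in the AST
        if (at(Lexer.Type.MINUS)) { pos++; return new AST.Negation(parseUnary()); }
        return parsePrimary();
    }

    // <primary expression> ::= <numeric constant> | <variable name> | ( <expression> )
    private AST.Expression parsePrimary() {
        if (at(Lexer.Type.NUMBER)) return new AST.Constant(tokens.get(pos++).value);
        if (at(Lexer.Type.VARIABLE)) return new AST.Variable(tokens.get(pos++).name);
        expect(Lexer.Type.LPAREN);
        AST.Expression e = parseExpression();
        expect(Lexer.Type.RPAREN);
        return e;  // parentheses leave no trace in the AST
    }
}

Note how poor the error messages are without FIRST and FOLLOW sets, exactly as discussed above.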

Semantics

Although semantics as a word means the theory of meaning in a language, in the programming language context it also includes a lot that is more about legitimate usage than about meaning. This follows from the conventional definition of syntax as whatever context-free grammars are able to describe; there are many issues of valid usage that go beyond the capabilities of context-free grammars, and as such they are typically counted as belonging under semantics.

Here I wish to introduce two very important words: static and dynamic.

  • Static (adjective) means anything about a program that does not depend on a particular execution of the program, or applies to all executions equally.
  • Dynamic (adjective) means things that do depend on which execution we are talking about.

Hence, it is conventional to talk about static semantics and dynamic semantics. I will discuss each separately. Before that, I will introduce you to abstract syntax, which the study of programming language semantics is based on.

Abstract syntax

While phrase-structure syntax is precise and useful for writing implementations and for determining correct syntax, it is usually much too cluttered for discussing semantics. Semantically, there is no difference between <term> and <factor>; they exist in the grammar solely for disambiguation purposes. Similarly, we usually do not care (when talking semantics) where in an expression there are parentheses, so long as we can tell which operations must be performed before which others.

The abstract syntax of a programming language specifies the semantically relevant constructs and how they may be combined to form programs, but it ignores details that are required for correct parsing. We often write an abstract syntax using a notation that looks superficially like a context-free grammar, like this:

\[\begin{array}{rcl}
E, F &\in& \mathbf{Exp}\\
E, F &::=& E + F\\
     &\mid& E - F\\
     &\mid& E \mathbin{*} F\\
     &\mid& E \mathbin{/} F\\
     &\mid& -E\\
     &\mid& c\\
     &\mid& x\\
     &\mid& \mathbf{let}~x = E~\mathbf{in}~F\\
c, d &\in& \mathbf{Const}\\
x, y, z &\in& \mathbf{Var}
\end{array}\]

There are a couple of noteworthy features of an abstract syntax compared to a context-free grammar:

  • Both nonterminals and those terminals that have semantic values are defined to belong to sets; in the above abstract syntax, we have the sets \(\mathbf{Var}\) of variable names, \(\mathbf{Const}\) of numeric constants, and \(\mathbf{Exp}\) of expressions.
  • The same nonterminal or the same terminal may have multiple names (for example, \(E\) and \(F\) are the same nonterminal, and \(x\), \(y\), and \(z\) are the same terminal).
  • There are no parentheses nor any unary plus in this abstract syntax.
  • If one were to read this abstract syntax as a context-free grammar, it would be very ambiguous.

An abstract syntax is most usefully viewed as a definition of (often mutually) recursive tree data types. The correspondence between an abstract syntax and a (set of) data type declaration(s) in Haskell is almost trivial:

data E = E :+: E
       | E :-: E
       | E :*: E
       | E :/: E
       | Neg E
       | Let String E E
       | Const Double
       | Var String
       deriving Show
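
For example, the abstract syntax tree of 1 + 2 * 3 is, in this representation, the Haskell value Const 1 :+: (Const 2 :*: Const 3), and that of let x = 4 in x * 2 is Let "x" (Const 4) (Var "x" :*: Const 2).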

The translation of the abstract syntax into Java is much more verbose but not difficult. Each nonterminal becomes an abstract class and each production becomes a final subclass:

[Code listing ast: see the interpreter source on Yousource.]
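
Since the actual classes are in the repository, here is a minimal sketch of the same idea with hypothetical names. To keep it short, I collapse the four binary operator productions into a single Binary class that carries the operator as a field; a faithful translation would give each production its own final subclass, and the pattern is identical, only longer. The generic Visitor interface is what makes the visitor pattern work.

public abstract class AST {
    public interface Visitor<T> {
        T visitBinary(Binary e);
        T visitNegation(Negation e);
        T visitConstant(Constant e);
        T visitVariable(Variable e);
        T visitLet(Let e);
    }

    // the nonterminal Exp becomes an abstract class
    public static abstract class Expression {
        public abstract <T> T accept(Visitor<T> v);
    }

    // E + F, E - F, E * F, E / F (four productions collapsed into one class)
    public static final class Binary extends Expression {
        public final char operator;              // '+', '-', '*' or '/'
        public final Expression left, right;
        public Binary(char operator, Expression left, Expression right) {
            this.operator = operator; this.left = left; this.right = right;
        }
        public <T> T accept(Visitor<T> v) { return v.visitBinary(this); }
    }

    // -E
    public static final class Negation extends Expression {
        public final Expression operand;
        public Negation(Expression operand) { this.operand = operand; }
        public <T> T accept(Visitor<T> v) { return v.visitNegation(this); }
    }

    // c (the semantic value of a numeric constant becomes a field)
    public static final class Constant extends Expression {
        public final double value;
        public Constant(double value) { this.value = value; }
        public <T> T accept(Visitor<T> v) { return v.visitConstant(this); }
    }

    // x (the semantic value of a variable name becomes a field)
    public static final class Variable extends Expression {
        public final String name;
        public Variable(String name) { this.name = name; }
        public <T> T accept(Visitor<T> v) { return v.visitVariable(this); }
    }

    // let x = E in F
    public static final class Let extends Expression {
        public final String variable;
        public final Expression definition, body;
        public Let(String variable, Expression definition, Expression body) {
            this.variable = variable; this.definition = definition; this.body = body;
        }
        public <T> T accept(Visitor<T> v) { return v.visitLet(this); }
    }
}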

This Java data structure definition is equipped to handle the visitor pattern so that operations on the AST can be defined separately from the AST itself.

Once the data structure has been defined, the correspondence between the parse tree defined by the context-free grammar and the abstract syntax tree (AST) is easily determined. The parser shown above performs this translation on the fly, never constructing the parse tree and going straight to the AST.

Most research in the theoretical principles of programming languages has since the early 1980s focused on semantics, starting from an abstract syntax and leaving the concrete syntax aside as obvious and trivial, or at least uninteresting. This makes the concept of abstract syntax extraordinarily important.

Static semantics

Static semantics involves all parts of the language that are not described using the context-free grammar but which are still independent of any execution or common to all executions. Usually it involves various correctness checks, such as type checking. In the case of the calculator, the only static check that is needed is verifying that variables are used only if defined.

Most practical programming languages define their static semantics using semiformal English prose. For the calculator language, there are three static semantics rules:

  • A variable is defined only by a let...in construct. The variable definition is visible only in the subexpression that follows in.
  • A variable may not be defined again if there is a visible definition for that variable.
  • A variable can only be used if there is a visible definition.

These rules are fairly simple to write into a visitor. The visitor itself can function as a definition of the static semantics, if it is clearly written.

[Code listing scopecheck: see the interpreter source on Yousource.]
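
Here is a minimal sketch of such a visitor, assuming the AST classes sketched above (the actual Yousource class differs in detail). It walks the tree carrying the set of visible variable names and throws an exception when a rule is violated.

import java.util.HashSet;
import java.util.Set;

public class ScopeChecker implements AST.Visitor<Void> {
    private final Set<String> visible = new HashSet<>();

    public Void visitBinary(AST.Binary e) {
        e.left.accept(this);
        e.right.accept(this);
        return null;
    }
    public Void visitNegation(AST.Negation e) {
        e.operand.accept(this);
        return null;
    }
    public Void visitConstant(AST.Constant e) { return null; }
    public Void visitVariable(AST.Variable e) {
        // rule 3: a variable may be used only if a definition is visible
        if (!visible.contains(e.name))
            throw new RuntimeException("undefined variable " + e.name);
        return null;
    }
    public Void visitLet(AST.Let e) {
        // rule 2: no redefinition while a definition is visible
        if (visible.contains(e.variable))
            throw new RuntimeException("variable " + e.variable + " defined twice");
        // rule 1: the definition is visible only in the body after in,
        // so the defining expression is checked in the old scope
        e.definition.accept(this);
        visible.add(e.variable);
        e.body.accept(this);
        visible.remove(e.variable);
        return null;
    }
}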

Dynamic semantics

Dynamic semantics defines how a program actually behaves once it is executed. We will discuss this much more later in this course. For now, it suffices to say that we can write a definitional interpreter as a visitor to the AST:

[Code listing interpreter: see the interpreter source on Yousource.]
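
Here is a minimal sketch of such a definitional interpreter, again assuming the AST classes sketched above rather than the actual Yousource Interpreter class. The environment maps each visible variable to its value, and each visit method returns the value of its subtree.

import java.util.HashMap;
import java.util.Map;

public class Evaluator implements AST.Visitor<Double> {
    private final Map<String, Double> environment = new HashMap<>();

    public Double visitBinary(AST.Binary e) {
        double l = e.left.accept(this), r = e.right.accept(this);
        switch (e.operator) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            default:  return l / r;
        }
    }
    public Double visitNegation(AST.Negation e) { return -e.operand.accept(this); }
    public Double visitConstant(AST.Constant e) { return e.value; }
    public Double visitVariable(AST.Variable e) {
        // the scope checker guarantees the variable is defined
        return environment.get(e.name);
    }
    public Double visitLet(AST.Let e) {
        // the definition is evaluated in the old environment, then bound;
        // since the scope checker forbids redefinition, plain put/remove suffices
        double value = e.definition.accept(this);
        environment.put(e.variable, value);
        double result = e.body.accept(this);
        environment.remove(e.variable);
        return result;
    }
}

On let x = 4 in x * 2, for example, the definition 4 is evaluated first, x is bound to 4.0 in the environment, and the body x * 2 then evaluates to 8.0.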

Exercise

TASK: Reimplement the calculator using other technologies (1–2 points)

Deadline February 7, 2017. Submit under the source code submission rules.

Write an interpreter for the calculator language specified in this document using some other technology. For example, use Haskell instead of Java.

Your interpreter must be compilable and runnable either using freely available tools under Ubuntu 16.10 or in Agora computer classrooms using only software already installed there.

Include in your submission a README file giving compilation and usage instructions, including what compilers and libraries must be installed.

Added to clarify on 2017-01-17: The task is to reimplement the whole interpreter, capable of reading the calculator language as a character input and outputting the result (or diagnosing a static semantics violation). Thus, you will likely need a lexer, a parser, an AST, a static checker, and an evaluator (corresponding to the Interpreter class of the example interpreter).

Assessment:

  • 1 point if the interpreter looks correct (even if seriously buggy) and mostly does not include any code borrowed directly from me or others
  • 2 points if the 1 point criterion is fulfilled and the interpreter implements exactly the same language as defined in this document with no significant bugs

Literature

I discuss some of these concepts more briefly, but with references to the literature, in Section 2.3 (pages 29–31) of my doctoral dissertation:
Antti-Juhani Kaijanaho: Evidence-based programming language design: a philosophical and methodological exploration. Dissertation, University of Jyväskylä, Jyväskylä Studies in Computing 222, 2015; Available in JYX.
