Shift-Reduce Parsers

What Parsers Do

A compiler is a program that translates source code written in a programming language into machine language. The process of compilation consists of two main stages, parsing and code generation. In the parsing stage the compiler reads the source file and analyzes its structure. The output of the parsing stage is typically an intermediate data structure that reflects the structure of the code in the source file. That data structure then acts as input for the algorithms in the code generation stage.

Virtually all modern programming languages are described by context-free grammars. A source code file can be thought of as a single long string in the corresponding context-free language. As we have already seen, a string is a member of a context-free language if there is a derivation for that string in the grammar. We have also seen that derivations are essentially equivalent to parse trees. The purpose of a parsing algorithm, therefore, is to construct a parse tree from an input string.

Shift-Reduce Parsers

In these lectures we are going to study a particular kind of parser called a shift-reduce parser. This kind of parser consists of an input stream, a parser stack, and some kind of control device. The parser shifts symbols from the input onto the stack. When the symbols on the top of the stack form what looks like the right hand side of some grammar rule, the parser reduces that right hand side to the left hand side of the rule by popping the symbols that make up the right hand side off the stack and pushing the variable on the left hand side in their place.

The purpose of the control is to keep track of what sits on the stack and to decide when to shift new symbols to the stack and when to reduce. The control is essentially a modified finite state machine that acts as a recognizer for right hand sides of grammar rules. Below I will describe the construction of the states and transitions of this machine. The control differs from a traditional finite state machine in one important respect. As terminals and variables get pushed on the stack we also push information about the current state on the stack along with them. When the control enters a state that indicates that a reduction should take place we pop the appropriate number of elements from the stack and then jump to the state sitting on the top of the stack. From that state we take the transition labeled by the variable on the left hand side of the rule we are reducing.

Configurations

The primary task that the control carries out is the task of recognizing right hand sides of grammar rules. The states of the control track progress in recognizing a right hand side through the use of configurations. A configuration looks like a grammar rule augmented with an additional symbol that tracks our progress in identifying the right hand side. For example, consider a grammar rule

A → B C

At the start of the recognition process, before we have seen any of the symbols on the right hand side, we are in this configuration:

A → • B C

After we have seen the variable B we are in configuration

A → B • C

When we have seen the entire right hand side of the rule we are in the configuration

A → B C •

Configurations with • at the end of the right hand side signal reductions.

Preparing the Grammar

Shift-reduce parsers work with most reasonable grammars, although a small amount of preparation is necessary. The first thing we have to do is to add an additional terminal symbol, called the end marker, to the terminal symbols of the grammar. The end marker will be the last symbol in the input stream. In every example below we will use the symbol $ as the end marker.

We also have to set up the grammar with one additional grammar rule, called the augmenting production. This rule introduces a new start symbol (usually S) and has a right hand side consisting of the old start variable followed by the end marker.

S → E $

LR(0)

The first and simplest kind of shift-reduce parser we are going to study is the LR(0) parser. Control states in the LR(0) control are constructed by carrying out a closure operation on configurations. The closure of a set of configurations adds new configurations by expanding variables that appear to the right of the •.

Here is the outline of an algorithm to compute the closure of a set of configurations:

Set CLOSURE(I) {
  J = I;
  repeat
    for( each item A → α • B β in J )
      for( each production B → γ of G )
        if( B → • γ is not in J )
          add B → • γ to J;
  until no more items are added to J in one round;
  return J;
}

For example, consider the grammar G1:

S → E $

E → T | E + T

T → Id | ( E )

To form the closure of the set containing the configuration

S → • E $

we first add the configurations

E → • T

E → • E + T

Taking the closure of these additional configurations introduces two more configurations.

T → • Id

T → • ( E )
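The hand computation above can be checked with a short program. Here is a minimal Python sketch of the LR(0) closure operation; the encoding of the grammar as a list of (left hand side, right hand side) pairs and of configurations as (rule index, dot position) pairs is my own choice for illustration, not anything fixed by the algorithm.

```python
# The grammar G1 above. A rule is a (lhs, rhs) pair; a configuration
# (item) is a (rule_index, dot_position) pair.
GRAMMAR = [
    ("S", ("E", "$")),       # 0: S -> E $   (augmenting production)
    ("E", ("T",)),           # 1: E -> T
    ("E", ("E", "+", "T")),  # 2: E -> E + T
    ("T", ("Id",)),          # 3: T -> Id
    ("T", ("(", "E", ")")),  # 4: T -> ( E )
]
VARIABLES = {"S", "E", "T"}

def closure(items):
    """Expand every variable that appears directly after the dot."""
    J = set(items)
    while True:
        new = set()
        for rule, dot in J:
            rhs = GRAMMAR[rule][1]
            if dot < len(rhs) and rhs[dot] in VARIABLES:
                B = rhs[dot]
                for i, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == B:
                        new.add((i, 0))   # add B -> . gamma
        if new <= J:                      # nothing added this round
            return J
        J |= new

# State 0: the closure of { S -> . E $ } contains the five
# configurations computed by hand above.
state0 = closure({(0, 0)})
```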

The configuration sets formed by taking the closures of configurations identify the states of the finite state machine that drives the LR(0) parser. The start state of this machine, state 0, is formed by taking the closure of the configuration

S → • E $

Thus, the configuration set for the start state of the machine for G1 is

S → • E $

E → • T

E → • E + T

T → • Id

T → • ( E )

If a configuration set contains a configuration with the • to the left of some grammar symbol we add a transition from that state labeled with that grammar symbol. The transition takes us to a new state whose configuration set is the closure of the original configuration with the • moved past that grammar symbol. For example, state 0 contains a configuration

T → • ( E )

This configuration allows us to add a transition labeled with the symbol ( that takes us to a new state whose configuration set is the closure of

T → ( • E )

or

T → ( • E )

E → • T

E → • E + T

T → • Id

T → • ( E )

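The transition construction can be sketched the same way. A minimal Python illustration (using the same hypothetical encoding as the closure sketch earlier): move the dot past the chosen symbol in every configuration that has that symbol after its dot, then close the result.

```python
# Grammar G1, same encoding as the closure sketch: rules are (lhs, rhs)
# pairs, configurations are (rule_index, dot_position) pairs.
GRAMMAR = [
    ("S", ("E", "$")), ("E", ("T",)), ("E", ("E", "+", "T")),
    ("T", ("Id",)), ("T", ("(", "E", ")")),
]
VARIABLES = {"S", "E", "T"}

def closure(items):
    J = set(items)
    while True:
        new = {(i, 0)
               for rule, dot in J
               if dot < len(GRAMMAR[rule][1]) and GRAMMAR[rule][1][dot] in VARIABLES
               for i, (lhs, _) in enumerate(GRAMMAR)
               if lhs == GRAMMAR[rule][1][dot]}
        if new <= J:
            return J
        J |= new

def goto(items, X):
    """Move the dot past X in every configuration that allows it,
    then take the closure of the result."""
    moved = {(rule, dot + 1)
             for rule, dot in items
             if dot < len(GRAMMAR[rule][1]) and GRAMMAR[rule][1][dot] == X}
    return closure(moved)

state0 = closure({(0, 0)})
# The transition on "(" from state 0 yields the configuration set
# computed above: T -> ( . E ) plus its closure.
paren_state = goto(state0, "(")
```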
The full control

Running the parser

The parser uses the control in combination with a state stack. Here is the algorithm the parser uses to process the input.

  1. Start by putting the start state on the state stack.
  2. Repeat the following steps until you arrive at the accept state:
    1. If the current state calls for a reduction, pop as many states off the stack as the rule has symbols on its right hand side, return to the state that is now on the top of the stack, and take the transition labeled by the variable on the left hand side of the rule you just used in the reduction. Several reductions may fire in a row before the next symbol gets shifted.
    2. Otherwise, take the transition labeled by the next available symbol in the input. Remove that symbol from the input and push the state you moved to onto the stack.
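The steps above can be assembled into a small working LR(0) parser. The sketch below is illustrative rather than production code: it builds the states and transitions for G1 on the fly, using the closure and transition constructions from earlier, and assumes the grammar is LR(0), so every state either shifts or fires a single reduction.

```python
# Grammar G1, same encoding as the earlier sketches.
GRAMMAR = [
    ("S", ("E", "$")), ("E", ("T",)), ("E", ("E", "+", "T")),
    ("T", ("Id",)), ("T", ("(", "E", ")")),
]
VARIABLES = {"S", "E", "T"}

def closure(items):
    J = set(items)
    while True:
        new = {(i, 0)
               for rule, dot in J
               if dot < len(GRAMMAR[rule][1]) and GRAMMAR[rule][1][dot] in VARIABLES
               for i, (lhs, _) in enumerate(GRAMMAR)
               if lhs == GRAMMAR[rule][1][dot]}
        if new <= J:
            return J
        J |= new

def goto(items, X):
    moved = {(r, d + 1) for r, d in items
             if d < len(GRAMMAR[r][1]) and GRAMMAR[r][1][d] == X}
    return closure(moved)

def build_control():
    """Build the list of states and the transition table."""
    states = [frozenset(closure({(0, 0)}))]
    trans, work = {}, [0]
    while work:
        s = work.pop()
        symbols = {GRAMMAR[r][1][d] for r, d in states[s] if d < len(GRAMMAR[r][1])}
        for X in symbols:
            t = frozenset(goto(states[s], X))
            if t not in states:
                states.append(t)
                work.append(len(states) - 1)
            trans[(s, X)] = states.index(t)
    return states, trans

def parse(tokens):
    """Run the stack algorithm; return the sequence of reductions fired."""
    states, trans = build_control()
    stack, tokens, fired = [0], list(tokens), []
    while True:
        complete = [r for r, d in states[stack[-1]] if d == len(GRAMMAR[r][1])]
        if complete:                       # a state that calls for a reduction
            lhs, rhs = GRAMMAR[complete[0]]
            fired.append((lhs, rhs))
            if complete[0] == 0:           # reduced by S -> E $: accept
                return fired
            del stack[-len(rhs):]          # pop one state per rhs symbol
            stack.append(trans[(stack[-1], lhs)])
        else:                              # otherwise shift the next symbol
            stack.append(trans[(stack[-1], tokens.pop(0))])

reductions = parse(["Id", "+", "(", "Id", ")", "$"])
```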

Building the parse tree

It is relatively easy to construct a parse tree as a side effect of running a shift-reduce parser. The first step is to keep a record of the reductions that fire as the parser runs. Reversing that list of reductions provides a recipe for building a rightmost derivation for the input string.

For example, using the grammar above with the input string Id + (Id) fires a sequence of reductions:

T → Id

E → T

T → Id

E → T

T → ( E )

E → E + T

S → E $

Reversing this sequence leads to the rightmost derivation

S ⇒ E $ ⇒ E + T $ ⇒ E + ( E ) $ ⇒ E + ( T ) $ ⇒ E + ( Id ) $ ⇒

T + ( Id ) $ ⇒ Id + ( Id ) $

The usual methods can then be used to convert this derivation to a parse tree.

In practice, shift-reduce parsers fire parser actions whenever a reduction takes place. A parser action is a short piece of code that gets run whenever the reduction takes place. Typically parser actions construct fragments of the parse tree and store these in a parallel stack called the value stack. At the end of the parsing process the parse tree appears at the top of the value stack.
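The value stack mechanism can be illustrated by replaying a recorded action sequence by hand. The sequence below is consistent with the reduction sequence for Id + ( Id ) shown above, with the shifts interleaved where the parser would perform them; each reduction pops one value per right hand side symbol and pushes a new tree node.

```python
# Grammar G1; rule indices are used by the reduce actions below.
GRAMMAR = [
    ("S", ("E", "$")), ("E", ("T",)), ("E", ("E", "+", "T")),
    ("T", ("Id",)), ("T", ("(", "E", ")")),
]

# Shift/reduce actions for the input "Id + ( Id )".
ACTIONS = [("shift", "Id"), ("reduce", 3), ("reduce", 1),
           ("shift", "+"), ("shift", "("), ("shift", "Id"),
           ("reduce", 3), ("reduce", 1), ("shift", ")"),
           ("reduce", 4), ("reduce", 2)]

values = []                                # the value stack
for kind, arg in ACTIONS:
    if kind == "shift":
        values.append(arg)                 # a shifted terminal is its own value
    else:
        lhs, rhs = GRAMMAR[arg]
        children = values[-len(rhs):]      # pop one value per rhs symbol
        del values[-len(rhs):]
        values.append((lhs, tuple(children)))  # push the new tree node

tree = values[-1]                          # the parse tree for the E before $
```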

Shift-reduce and reduce-reduce conflicts

For the LR(0) control to be effective it has to provide unambiguous directions at all times. Unfortunately, two different kinds of ambiguities often arise in practice. To see how these ambiguities might arise, consider a slightly more complex grammar, G2

S → E $

E → T | E + T

T → P | T * P

P → Id | ( E )

State 0 for the LR(0) control for this grammar has the following configuration set

S → • E $

E → • T

E → • E + T

T → • P

T → • T * P

P → • Id

P → • ( E )

The transition driven by T coming from state 0 takes us to a state whose configuration set is

E → T •

T → T • * P

This state is an example of a state with a shift-reduce conflict. The configuration E → T • wants to fire a reduction, while the configuration T → T • * P prefers a shift action that will push the rest of its right hand side onto the stack.

Another kind of conflict that can occur is a reduce-reduce conflict in which a single state contains two or more configurations that want to trigger reductions using different rules.

SLR(1)

Although LR(0) parser states can have conflicts, in many cases we can apply a simple policy to resolve conflicts. The simple policy is to prefer shifting if the configuration set contains a configuration like T → T • * P and the next symbol in the input is *. If no such shift is possible, the next priority is given to reductions. In the case of reduce-reduce conflicts we try to determine which terminals may follow the variable on the left hand side of each candidate rule. If one of those terminals appears as the next symbol in the input, that reduction gets fired.

To implement this algorithm we make use of two additional functions called the First and Follow functions:

Set FIRST(X) {
  if( X is a terminal )
    return {X};
  I = {};
  for( each production X → Y1 Y2 … Yk in G )
    for( i = 1 to k ) {
      I = I ⋃ (FIRST(Yi) - {ε});
      if( ε is not in FIRST(Yi) )
        break;
    }
  if( X → ε is in G )
    I = I ⋃ {ε};
  return I;
}

Set FOLLOW(X) {
  I = {};
  for( each production A → α X β in G )
    I = I ⋃ (FIRST(β) - {ε});
  for( each production A → α X β in G where β is empty or ε is in FIRST(β) )
    I = I ⋃ FOLLOW(A);
  return I;
}
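One caution about the pseudocode above: called naively on a left recursive rule such as E → E + T, FIRST would recurse forever. Practical implementations therefore compute the First and Follow sets as a fixpoint over all variables at once. Here is a Python sketch for the grammar G1; since G1 has no ε rules, the ε bookkeeping drops out.

```python
# Grammar G1 encoded as (lhs, rhs) pairs.
GRAMMAR = [("S", ("E", "$")), ("E", ("T",)), ("E", ("E", "+", "T")),
           ("T", ("Id",)), ("T", ("(", "E", ")"))]
VARS = {"S", "E", "T"}

# FIRST: iterate until no set grows. With no ε-productions, only the
# first right hand side symbol of each rule contributes.
FIRST = {v: set() for v in VARS}
changed = True
while changed:
    changed = False
    for lhs, rhs in GRAMMAR:
        Y = rhs[0]
        add = FIRST[Y] if Y in VARS else {Y}
        if not add <= FIRST[lhs]:
            FIRST[lhs] |= add
            changed = True

# FOLLOW: a variable is followed by FIRST of whatever comes next in a
# rule, or by FOLLOW of the rule's left hand side when it sits at the end.
FOLLOW = {v: set() for v in VARS}
changed = True
while changed:
    changed = False
    for lhs, rhs in GRAMMAR:
        for i, X in enumerate(rhs):
            if X not in VARS:
                continue
            if i + 1 < len(rhs):
                nxt = rhs[i + 1]
                add = FIRST[nxt] if nxt in VARS else {nxt}
            else:
                add = FOLLOW[lhs]
            if not add <= FOLLOW[X]:
                FOLLOW[X] |= add
                changed = True
```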

Using these functions we can resolve shift-reduce and reduce-reduce conflicts by trying these actions in this order.

  1. If the configuration set contains a configuration A → α • β and the next input symbol is in FIRST(β), shift that symbol.
  2. If the configuration set contains a configuration A → α • and the next input symbol is in FOLLOW(A), fire that reduction.

The LR(0) algorithm augmented with these rules to resolve conflicts forms the simple LR, or SLR(1) algorithm.

LR(1)

Unfortunately, many grammars yield configuration sets that cannot be fully disambiguated with these additional rules. In some cases it is possible to eliminate shift-reduce or reduce-reduce conflicts by rewriting the grammar. In other cases, the conflicts form an inherent feature of the language the grammar describes, and no amount of rewriting can fix the problem. In those cases we have to switch to a more powerful parsing algorithm.

The SLR algorithm described above makes limited use of the concept of lookahead. A shift-reduce parser that uses lookahead makes decisions about what to do next based on a combination of the current state and one or more symbols of lookahead taken from the front of the input stream. SLR uses lookahead only to resolve conflicts as they arise. A more powerful approach is to factor lookahead into the construction of all of the configuration sets in the parser's control. The simplest shift-reduce parser that makes full use of lookahead is the LR(1) parser, which uses a single symbol of lookahead to help determine parser actions.

Like the LR(0) parser, the LR(1) parser is based on configurations. The one new feature in LR(1) configurations is a lookahead component that lists the symbols the configuration expects to see at the front of the input stream after shifting all the elements of the right hand side. In addition, we have to modify the closure algorithm to include lookaheads.

LR(1) closure

State 0 of the LR(1) parser is formed by taking the closure of the configuration

S → • E $, {ε}

To form the closure of an LR(1) configuration we add configurations for rules that expand any variables found after the • in a configuration in the set. The lookahead component of any new configuration we add consists of any terminal symbols that might appear after the variable being expanded.

Here is the code for the LR(1) closure algorithm:

Set CLOSURE(I) {
  repeat
    for( each item A → α • B β, {a} in I )
      for( each production B → γ of G )
        for( each terminal b in FIRST(β a) )
          add B → • γ, {b} to I;
  until no more items are added to I in one round;
  return I;
}
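Here is a Python sketch of the LR(1) closure, run on the grammar G2. Configurations carry a rule index, a dot position, and a single lookahead terminal; the First sets are precomputed by hand, which is safe here because G2 has no ε rules. The encoding is my own choice for illustration.

```python
# Grammar G2 as (lhs, rhs) pairs.
GRAMMAR = [("S", ("E", "$")), ("E", ("T",)), ("E", ("E", "+", "T")),
           ("T", ("P",)), ("T", ("T", "*", "P")),
           ("P", ("Id",)), ("P", ("(", "E", ")"))]
VARS = {"S", "E", "T", "P"}
FIRST = {"E": {"Id", "("}, "T": {"Id", "("}, "P": {"Id", "("}}

def first_of(beta, a):
    """FIRST(beta a). With no ε-productions the first symbol of beta
    decides; an empty beta passes the lookahead a through."""
    if beta:
        s = beta[0]
        return FIRST[s] if s in VARS else {s}
    return {a}

def closure1(items):
    I = set(items)
    while True:
        new = set()
        for rule, dot, a in I:
            rhs = GRAMMAR[rule][1]
            if dot < len(rhs) and rhs[dot] in VARS:
                B, beta = rhs[dot], rhs[dot + 1:]
                for i, (lhs, _) in enumerate(GRAMMAR):
                    if lhs == B:
                        for b in first_of(beta, a):
                            new.add((i, 0, b))  # B -> . gamma, {b}
        if new <= I:
            return I
        I |= new

# State 0: the closure of S -> . E $, {ε}.
state0 = closure1({(0, 0, "ε")})
```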

For example, here is the closure of the configuration shown above, computed for the grammar G2:

S → • E $, {ε}

E → • T, {$, +}

E → • E + T, {$, +}

T → • P, {$, +, *}

T → • T * P, {$, +, *}

P → • Id, {$, +, *}

P → • ( E ), {$, +, *}

Some explanation is in order here. Starting with the configuration

S → • E $, {ε}

we first add the configurations

E → • T, {$}

E → • E + T, {$}

These configurations have a lookahead of {$} because we added them by expanding the variable E found in S → • E $, {ε}. The terminal symbol $ appears after the E in that configuration, so it becomes the lookahead for the two new configurations. The first of these two new configurations brings in the configurations

T → • P, {$}

T → • T * P, {$}

The lookahead for these two configurations is also {$}, because nothing appears after the T in E → • T, {$}, so that configuration's own lookahead, $, carries over. Expanding the second of the two new configurations leads to the configurations

E → • T, {+}

E → • E + T, {+}

because the symbol that appears after the E in that second configuration is the +. Expanding T and P in the same way with these lookaheads, and merging configurations that differ only in their lookahead into single configurations, gives the final configuration set

S → • E $, {ε}

E → • T, {$, +}

E → • E + T, {$, +}

T → • P, {$, +, *}

T → • T * P, {$, +, *}

P → • Id, {$, +, *}

P → • ( E ), {$, +, *}

Transitions are formed in the same way as before. Once again, after making the transition we form the closure of the configurations that participated in the transition. For example, the transition driven by T from state 0 above takes us to a state with the configuration set

E → T •, {$, +}

T → T • * P, {$, +, *}

The potential shift-reduce conflict in this state is resolved by lookahead. If the lookahead is either $ or +, we resolve the conflict in favor of the reduction. For any other lookahead we attempt the shift instead (which succeeds only if the next symbol in the input is a *).

LALR(1)

LR(1) is sufficiently powerful to serve as a parsing algorithm for most grammars of practical interest. It has one flaw, though. The addition of lookahead information increases the number of possible configuration sets, so LR(1) controls tend to have many states. For a language with about 100 grammar rules, the LR(1) control machine can have many thousands of states. Most software implementations of finite state machines are driven by tables. In particular, the transition table is a two dimensional table with one row per state and one column per grammar symbol. For a control with thousands of states and about a hundred grammar symbols this can produce a transition table that takes up close to a megabyte of space.

One solution to the bloat in LR(1) parsers is to merge configuration sets whose configurations differ only in their lookahead. In many practical grammars this can reduce the number of states back down to a manageable number. The only downside is that merging states occasionally reintroduces shift-reduce or reduce-reduce conflicts. In most such cases the problem can be fixed by rewriting the grammar slightly. The LALR(1) parsing algorithm uses a control formed by merging LR(1) states in this way. The full details of this algorithm are beyond the scope of these lecture notes; for them you should consult one of the two references below.
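The merging step itself is easy to sketch. Grouping LR(1) states by their core (the configuration set with the lookaheads stripped off) and unioning the lookaheads within each group captures the idea; detecting the conflicts that merging may reintroduce is omitted. The item encoding follows the LR(1) sketch above, and the rule numbers refer to G2.

```python
def core(state):
    """The LR(0) core of an LR(1) state: drop the lookahead component."""
    return frozenset((rule, dot) for rule, dot, lookahead in state)

def merge_states(states):
    """Union the items of all states that share a core."""
    merged = {}
    for state in states:
        merged.setdefault(core(state), set()).update(state)
    return list(merged.values())

# Two states that differ only in lookahead collapse into one:
s1 = {(1, 1, "$")}    # E -> T . , {$}
s2 = {(1, 1, ")")}    # E -> T . , {)}   (same core as s1)
s3 = {(3, 1, "$")}    # T -> P . , {$}   (different core)
merged = merge_states([s1, s2, s3])
```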

References

The material in these lecture notes comes from two of the standard texts in compiler design:

  1. Compilers: Principles, Techniques, & Tools, Second Edition by Aho, Lam, Sethi, and Ullman.
  2. Crafting a Compiler by Fischer and LeBlanc.