Quantcast
Channel: Eric White's Blog
Viewing all articles
Browse latest Browse all 35

Building a Simple Recursive Descent Parser

$
0
0

[Blog Map]  This blog is inactive.  New blog: EricWhite.com/blog

In this post, I present the start of a recursive descent parser that will parse the simple grammar that I presented previously in this series.  I’ll point out some key features of the code so that it is easy to see how the code works.

This post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.  You can find the complete list of posts here.

The Symbol Class

We need to write a class for each symbol (production) in the grammar.  Most of these classes have a method named Produce that given a collection of Symbol objects, produces an instance of that symbol.  Each Produce method in the various classes is free to return null if the collection that is passed to the method can’t produce that symbol.

We need to define an abstract base class for each of these classes.  Most instances of classes that derive from the abstract base class will contain a list of constituent symbols, so the abstract base class includes a public property, ConstituentSymbols, of type List<Symbol>.

The ToString method for each of these classes returns the string representation of the symbol.  If a symbol is comprised of a list of constituent symbols (i.e. is not a terminal), then the ToString method returns the concatenated strings of the constituent symbols.

There is a constructor for Symbol that takes a params array of Object.  Each element in the params array can either be an instance of Symbol, or an object that implements IEnumerable<Symbol>.  Those are the only valid types in the params array.  The constructor throws an internal ParserException if anything other than one of those types is passed as an argument.  I explained this idiom in the previous post in this series, Creating a Collection from Singletons and Collections using LINQ.  This method also uses the StringConcatenate extension method, which I discussed in this topic in the LINQ to XML documentation.

publicabstractclassSymbol
{
    publicList<Symbol> ConstituentSymbols { get; set; }
    publicoverridestring ToString() {
        string s = ConstituentSymbols.Select(ct => ct.ToString()).StringConcatenate();
        return s;
    }
    public Symbol(paramsObject[] symbols)
    {
        List<Symbol> ls = newList<Symbol>();
        foreach (var item in symbols)
        {
            if (item isSymbol)
                ls.Add((Symbol)item);
            elseif (item isIEnumerable<Symbol>)
                foreach (var item2 in (IEnumerable<Symbol>)item)
                    ls.Add(item2);
            else
                // If this error is thrown, the parser is coded incorrectly.
                thrownewParserException("Internal error");
        }
        ConstituentSymbols = ls;
    }
    public Symbol() { }
}
 

The OpenParenthesis Class

There are two varieties of symbols – terminal, and non-terminal.  I discussed both types of symbols in the post on the Augmented Backus-Naur Form Grammar.  A subclass of the Symbol class for terminal symbols is super simple.  It contains an override of the ToString method, which returns the string of the terminal.  Due to the semantics of C#, it also needs to include a constructor with no parameters:

publicclassOpenParenthesis : Symbol
{
    publicoverridestring ToString() { return"("; }
    public OpenParenthesis() { }
}
 

All of the terminals except for DecimalDigit look like the OpenParenthesis class.  DecimalDigit is similar, except that its constructor takes a character, and the ToString method returns that character.  To make the code as succinct as possible, digits are stored as strings instead of characters.

publicclassDecimalDigit : Symbol
{
    privatestring CharacterValue;
    publicoverridestring ToString() { return CharacterValue; }
    public DecimalDigit(char c) { CharacterValue = c.ToString(); }
}

The Formula Class

A derived class for non-terminal symbols is also pretty simple.  There is one public method, Produce, which takes a collection of Symbol objects, and then produces an instance of the class if it can; otherwise it returns null.

Recursive descent parsers are ‘top-down’ parsers, so it makes sense to define the Formula class first.  If you examine the simple grammar that I defined, you can see that a formula is comprised of an expression, so the Formula.Produce method simply passes on the collection of Symbol objects to the Expression.Produce method.  Again, due to the semantics of C#, it is necessary to define the constructor that takes a params array of Object, but it doesn’t need to do anything other than call the constructor in the base class.

classFormula : Symbol
{
    publicstaticFormula Produce(IEnumerable<Symbol> symbols)
    {
        // formula = expression
 
        Expression e = Expression.Produce(symbols);
        return e == null ? null : newFormula(e);
    }
    public Formula(paramsObject[] symbols) : base(symbols) { }
}

The Expression Class

The Expression class is more interesting.  The grammar production for Expression is as follows:

expression = *whitespace nospace-expression *whitespace
 

This means that an expression consists of zero or more WhiteSpace symbols, followed by one and only one NospaceExpression, followed by zero or more WhiteSpace symbols.  Using LINQ, it is super easy to count the WhiteSpace symbols at the beginning and end of the collection of symbols.  The Expression.Produce method can then pass the symbols in the middle (between the white space) to NospaceExpression.Produce.  If that method returns a NospaceExpression object, then the method can instantiate an Expression object and return it.  The Expression.Process method makes use of the SkipLast extension method.  The Expression class looks like this:

publicclassExpression : Symbol
{
    publicstaticExpression Produce(IEnumerable<Symbol> symbols)
    {
        // expression = *whitespace nospace-expression *whitespace
 
        int whiteSpaceBefore = symbols.TakeWhile(s => s isWhiteSpace).Count();
        int whiteSpaceAfter = symbols.Reverse().TakeWhile(s => s isWhiteSpace).Count();
        IEnumerable<Symbol> noSpaceSymbolList = symbols
            .Skip(whiteSpaceBefore)
            .SkipLast(whiteSpaceAfter)
            .ToList();
        NospaceExpression n = NospaceExpression.Produce(noSpaceSymbolList);
        if (n != null)
            returnnewExpression(
                Enumerable.Repeat(newWhiteSpace(), whiteSpaceBefore),
                n,
                Enumerable.Repeat(newWhiteSpace(), whiteSpaceAfter));
        returnnull;
    }
 
    public Expression (paramsObject[] symbols) : base(symbols) { }
}
 

I follow the same pattern for every class – the grammar is included in a comment, followed by a bit of LINQ code to produce an instance of the class, returning null if necessary.

Let’s look at the code to instantiate the Expression object:

returnnewExpression(
    Enumerable.Repeat(newWhiteSpace(), whiteSpaceBefore),
    n,
    Enumerable.Repeat(newWhiteSpace(), whiteSpaceAfter));

The arguments in yellow are collections of Symbol.  The parameter n, which is of type NospaceExpression, is a singleton, so we’re taking advantage of the ability to pass either singletons or collections to the Expression constructor.

The NospaceExpression Class

To keep this first example as simple as possible, we’ll implement a dummy NospaceExpression class.  We’ll pretend that any collection of symbols is a valid NospaceExpression symbol.  In the next post, I’ll examine how we need to code the NospaceExpression class, as well as classes for some of the other symbols.

publicclassNospaceExpression : Symbol
{
    publicstaticNospaceExpression Produce(IEnumerable<Symbol> symbols)
    {
        returnnewNospaceExpression(symbols);
    }
 
    public NospaceExpression(paramsObject[] symbols) : base(symbols) { }
}

Projecting a Collection of Symbols from a String

One aspect of the approach that I took is that I first projected a collection of terminal Symbol objects from the collection of characters that make up the string being parsed.  The string class implements IEnumerable<char> so we can do a simple select, as follows:

IEnumerable<Symbol> symbols = s.Select(c =>
{
    switch (c)
    {
        case'0':
        case'1':
        case'2':
        case'3':
        case'4':
        case'5':
        case'6':
        case'7':
        case'8':
        case'9':
            returnnewDecimalDigit(c);
        case' ':
            returnnewWhiteSpace();
        case'+':
            returnnewPlus();
        case'-':
            returnnewMinus();
        case'*':
            returnnewAsterisk();
        case'/':
            returnnewForwardSlash();
        case'^':
            returnnewCaret();
        case'.':
            returnnewFullStop();
        case'(':
            returnnewOpenParenthesis();
        case')':
            returnnewCloseParenthesis();
        default:
            return (Symbol)null;
    }
});
 

If we parse the formula " (1+3)  ", and dump out the terminal symbols, we see this:

Terminal Symbols
================
WhiteSpace > <
OpenParenthesis >(<
DecimalDigit >1<
Plus >+<
DecimalDigit >3<
CloseParenthesis >)<
WhiteSpace > <
WhiteSpace > <
 

By first transforming the string into a collection of terminals, it allows us to write more expressive code.  It is easier to read this code:

If (t is WhiteSpace)
 

It is a bit harder to read this:

If (t is Character && t.ToString() == " " )
 

The DumpSymbolRecursive Method

As you can see, Symbol objects form a hierarchical tree – each symbol (except for terminals) has constituent symbols.  This allows us to write a recursive method to dump symbols to a StringBuilder object.

publicstaticvoid DumpSymbolRecursive(StringBuilder sb, Symbol symbol, int depth)
{
    sb.Append(string.Format("{0}{1} >{2}<",
        "".PadRight(depth * 2),
        symbol.GetType().Name.ToString(),
        symbol.ToString())).Append(Environment.NewLine);
    if (symbol.ConstituentSymbols != null)
        foreach (var childSymbol in symbol.ConstituentSymbols)
            DumpSymbolRecursive(sb, childSymbol, depth + 1);
}
 

If we parse the formula “ (1+3)  ” and dump it, we see:

Formula > (1+3)  <
  Expression > (1+3)  <
    WhiteSpace > <
    NospaceExpression >(1+3)<
      OpenParenthesis >(<
      DecimalDigit >1<
      Plus >+<
      DecimalDigit >3<
      CloseParenthesis >)<
    WhiteSpace > <
    WhiteSpace > <
 

We can see the instances of the Formula, Expression, and NospaceExpression classes, as well as the various terminals that make up the expression.

Complete Listing

Here is the complete listing of the first version of the SimpleFormulaParser example.  You can simply cut and paste this code into Visual Studio, and run it.  It doesn’t have any dependencies.  It transforms a formula to a collection of terminals, prints the terminals, and then uses the Formula, Expression, and NospaceExpression classes to produce a parse tree (which is incomplete because NospaceExpression is a dummy implementation).

In the next post in this series, I’ll continue to develop this simple parser.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
 
namespace SimpleParser
{
    publicabstractclassSymbol
    {
        publicList<Symbol> ConstituentSymbols { get; set; }
        publicoverridestring ToString() {
            string s = ConstituentSymbols.Select(ct => ct.ToString()).StringConcatenate();
            return s;
        }
        public Symbol(paramsObject[] symbols)
        {
            List<Symbol> ls = newList<Symbol>();
            foreach (var item in symbols)
            {
                if (item isSymbol)
                    ls.Add((Symbol)item);
                elseif (item isIEnumerable<Symbol>)
                    foreach (var item2 in (IEnumerable<Symbol>)item)
                        ls.Add(item2);
                else
                    // If this error is thrown, the parser is coded incorrectly.
                    thrownewParserException("Internal error");
            }
            ConstituentSymbols = ls;
        }
        public Symbol() { }
    }
 
    publicclassFormula : Symbol
    {
        publicstaticFormula Produce(IEnumerable<Symbol> symbols)
        {
            // formula = expression
 
            Expression e = Expression.Produce(symbols);
            return e == null ? null : newFormula(e);
        }
        public Formula(paramsObject[] symbols) : base(symbols) { }
    }
 
    publicclassExpression : Symbol
    {
        publicstaticExpression Produce(IEnumerable<Symbol> symbols)
        {
            // expression = *whitespace nospace-expression *whitespace
 
            int whiteSpaceBefore = symbols.TakeWhile(s => s isWhiteSpace).Count();
            int whiteSpaceAfter = symbols.Reverse().TakeWhile(s => s isWhiteSpace).Count();
            IEnumerable<Symbol> noSpaceSymbolList = symbols
                .Skip(whiteSpaceBefore)
                .SkipLast(whiteSpaceAfter)
                .ToList();
            NospaceExpression n = NospaceExpression.Produce(noSpaceSymbolList);
            if (n != null)
                returnnewExpression(
                    Enumerable.Repeat(newWhiteSpace(), whiteSpaceBefore),
                    n,
                    Enumerable.Repeat(newWhiteSpace(), whiteSpaceAfter));
            returnnull;
        }
 
        public Expression (paramsObject[] symbols) : base(symbols) { }
    }
 
    publicclassNospaceExpression : Symbol
    {
        publicstaticNospaceExpression Produce(IEnumerable<Symbol> symbols)
        {
            returnnewNospaceExpression(symbols);
        }
 
        public NospaceExpression(paramsObject[] symbols) : base(symbols) { }
    }
 
    publicclassDecimalDigit : Symbol
    {
        privatestring CharacterValue;
        publicoverridestring ToString() { return CharacterValue; }
        public DecimalDigit(char c) { CharacterValue = c.ToString(); }
    }
 
    publicclassWhiteSpace : Symbol
    {
        publicoverridestring ToString() { return" "; }
        public WhiteSpace() { }
    }
 
    publicclassPlus : Symbol
    {
        publicoverridestring ToString() { return"+"; }
        public Plus() { }
    }
 
    publicclassMinus : Symbol
    {
        publicoverridestring ToString() { return"-"; }
        public Minus() { }
    }
 
    publicclassAsterisk : Symbol
    {
        publicoverridestring ToString() { return"*"; }
        public Asterisk() { }
    }
 
    publicclassForwardSlash : Symbol
    {
        publicoverridestring ToString() { return"/"; }
        public ForwardSlash() { }
    }
 
    publicclassCaret : Symbol
    {
        publicoverridestring ToString() { return"^"; }
        public Caret() { }
    }
 
    publicclassFullStop : Symbol
    {
        publicoverridestring ToString() { return"."; }
        public FullStop() { }
    }
 
    publicclassOpenParenthesis : Symbol
    {
        publicoverridestring ToString() { return"("; }
        public OpenParenthesis() { }
    }
 
    publicclassCloseParenthesis : Symbol
    {
        publicoverridestring ToString() { return")"; }
        public CloseParenthesis() { }
    }
 
    publicclassSimpleFormulaParser
    {
        publicstaticSymbol ParseFormula(string s)
        {
            IEnumerable<Symbol> symbols = s.Select(c =>
            {
                switch (c)
                {
                    case'0':
                    case'1':
                    case'2':
                    case'3':
                    case'4':
                    case'5':
                    case'6':
                    case'7':
                    case'8':
                    case'9':
                        returnnewDecimalDigit(c);
                    case' ':
                        returnnewWhiteSpace();
                    case'+':
                        returnnewPlus();
                    case'-':
                        returnnewMinus();
                    case'*':
                        returnnewAsterisk();
                    case'/':
                        returnnewForwardSlash();
                    case'^':
                        returnnewCaret();
                    case'.':
                        returnnewFullStop();
                    case'(':
                        returnnewOpenParenthesis();
                    case')':
                        returnnewCloseParenthesis();
                    default:
                        return (Symbol)null;
                }
            });
#if true
            if (symbols.Any())
            {
                Console.WriteLine("Terminal Symbols");
                Console.WriteLine("================");
                foreach (var terminal in symbols)
                    Console.WriteLine("{0} >{1}<", terminal.GetType().Name.ToString(),
                        terminal.ToString());
                Console.WriteLine();
            }
#endif
            Formula formula = Formula.Produce(symbols);
            if (formula == null)
                thrownewParserException("Invalid formula");
            return formula;
        }
 
        publicstaticvoid DumpSymbolRecursive(StringBuilder sb, Symbol symbol, int depth)
        {
            sb.Append(string.Format("{0}{1} >{2}<",
                "".PadRight(depth * 2),
                symbol.GetType().Name.ToString(),
                symbol.ToString())).Append(Environment.NewLine);
            if (symbol.ConstituentSymbols != null)
                foreach (var childSymbol in symbol.ConstituentSymbols)
                    DumpSymbolRecursive(sb, childSymbol, depth + 1);
        }
    }
 
    classProgram
    {
        staticvoid Main(string[] args)
        {
            StringBuilder sb = newStringBuilder();
            Symbol f = SimpleFormulaParser.ParseFormula(" (1+3)  ");
            SimpleFormulaParser.DumpSymbolRecursive(sb, f, 0);
            Console.WriteLine(sb.ToString());
        }
    }
 
    publicclassParserException : Exception
    {
        public ParserException(string message) : base(message) { }
    }
 
    publicstaticclassMyExtensions
    {
        publicstaticIEnumerable<T> SkipLast<T>(thisIEnumerable<T> source,
            int count)
        {
            Queue<T> saveList = newQueue<T>();
            int saved = 0;
            foreach (T item in source)
            {
                if (saved < count)
                {
                    saveList.Enqueue(item);
                    ++saved;
                    continue;
                }
                saveList.Enqueue(item);
                yieldreturn saveList.Dequeue();
            }
            yieldbreak;
        }
 
        publicstaticstring StringConcatenate(thisIEnumerable<string> source)
        {
            StringBuilder sb = newStringBuilder();
            foreach (string s in source)
                sb.Append(s);
            return sb.ToString();
        }
 
        publicstaticstring StringConcatenate<T>(
            thisIEnumerable<T> source,
            Func<T, string> projectionFunc)
        {
            return source.Aggregate(newStringBuilder(),
                (s, i) => s.Append(projectionFunc(i)),
                s => s.ToString());
        }
    }
}
 


Viewing all articles
Browse latest Browse all 35

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>