Tuesday, June 26, 2012

Recursive descent parsers using active patterns, part 2

In the previous post, I presented a method to implement a parser relying solely on core F# features, namely active patterns.
In this post, I'll touch on some of the advantages and limits of this approach.
Recursive descent parsers are limited to LL(k) grammars, which means that left-recursive grammars cannot be directly handled. It is necessary to rewrite such grammars to remove left recursion.

Consider the common problem of parsing expressions, e.g. Boolean expressions. A typical formulation of the grammar could be:

Expr := AndExpr | OrExpr | NegExpr | Variable | True | False | "(" Expr ")"
AndExpr := Expr "AND" Expr
OrExpr := Expr "OR" Expr
NegExpr := "~" Expr | Expr

Note there is an ambiguity, as "A AND B OR C" could be parsed as (A AND B) OR C or as A AND (B OR C). In order to lift this ambiguity, priorities can be assigned to rules. Associativity is handled similarly, deciding whether "A AND B AND C" parses as (A AND B) AND C or as A AND (B AND C).

Such a grammar, if implemented naively using a recursive descent parser, would loop endlessly on some inputs: The parser would try to match rule AndExpr, which would try to match Expr, which would again try to match AndExpr...

An alternative formulation of the grammar can be used that generates semantically equivalent results:

Expr := OrExpr
OrExpr := AndExpr "OR" OrExpr | AndExpr
AndExpr := NegExpr "AND" AndExpr | NegExpr
NegExpr := "~" AtomExpr | AtomExpr
AtomExpr := Variable | True | False | "(" Expr ")"

Note the hierarchy between Expr, OrExpr, AndExpr, NegExpr and AtomExpr. The left-most rule appearing in each alternative on the right-hand side of a definition is always a "lower" rule: Expr refers to OrExpr, OrExpr to AndExpr, AndExpr to NegExpr, NegExpr to AtomExpr.
AtomExpr refers to the higher rule Expr in "(" Expr ")", but that's OK as Expr does not appear as the left-most rule (the token for a left parenthesis does).

What I like with this formulation is that it captures both associativity and operator precedence within the system of rules.
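To make this concrete, here is one possible way to encode the rewritten grammar directly as active patterns. This is an illustrative sketch, not the exact code from the previous post: the plain token-list input, the Expr type, the ExprRule name and the simplified Token pattern are all assumptions made for the example.

type Expr =
    | Or of Expr * Expr
    | And of Expr * Expr
    | Neg of Expr
    | Var of string
    | True
    | False

let keywords = set ["AND"; "OR"; "~"; "("; ")"; "TRUE"; "FALSE"]

// Split off the head token of the input, if any.
let (|Token|_|) = function
    | tok :: rest -> Some (tok, rest)
    | [] -> None

// Each rule only calls "lower" rules in left-most position, so the descent terminates.
let rec (|ExprRule|_|) s =
    match s with
    | OrExpr (e, s) -> Some (e, s)
    | _ -> None

// OrExpr := AndExpr "OR" OrExpr | AndExpr
// (if the "OR" branch fails, AndExpr is parsed again; see the performance discussion below)
and (|OrExpr|_|) s =
    match s with
    | AndExpr (e1, Token ("OR", OrExpr (e2, s))) -> Some (Or (e1, e2), s)
    | AndExpr (e, s) -> Some (e, s)
    | _ -> None

// AndExpr := NegExpr "AND" AndExpr | NegExpr
and (|AndExpr|_|) s =
    match s with
    | NegExpr (e1, Token ("AND", AndExpr (e2, s))) -> Some (And (e1, e2), s)
    | NegExpr (e, s) -> Some (e, s)
    | _ -> None

// NegExpr := "~" AtomExpr | AtomExpr
and (|NegExpr|_|) s =
    match s with
    | Token ("~", AtomExpr (e, s)) -> Some (Neg e, s)
    | AtomExpr (e, s) -> Some (e, s)
    | _ -> None

// AtomExpr := Variable | True | False | "(" Expr ")"
and (|AtomExpr|_|) s =
    match s with
    | Token ("TRUE", s) -> Some (True, s)
    | Token ("FALSE", s) -> Some (False, s)
    | Token ("(", ExprRule (e, Token (")", s))) -> Some (e, s)
    | Token (v, s) when not (keywords.Contains v) -> Some (Var v, s)
    | _ -> None

// "A AND B OR C" parses as (A AND B) OR C:
// ExprRule ["A"; "AND"; "B"; "OR"; "C"]
//   = Some (Or (And (Var "A", Var "B"), Var "C"), [])

Note that because the "OR" and "AND" rules are right-recursive, this sketch makes both operators right-associative.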

It is common to split parsers into two stages, using a lexer that processes the input stream in linear time, producing a pre-digested stream of so-called tokens. While this has undeniable performance advantages, it prevents composing grammars and rules out rule-specific tokenization. This is why most languages reserve keywords for future use, and it also explains why you can't use identifiers that clash with keywords, even in situations where no ambiguity arises.

For these reasons, I personally prefer avoiding the separate lexing stage if performance constraints allow for it.

Regarding performance, it's important to write a recursive descent parser in such a way that decisions about which rule to try can always be made by looking at a bounded number of tokens.

In the example below, the second implementation performs significantly better:

// Naive version: when the "OR" branch fails, the AndExpr prefix is parsed again.
let (|OrExpr|_|) s =
  match s with
  | AndExpr (e1, Token ("OR", OrExpr(es, s))) -> Some (e1 :: es, s)
  | AndExpr (e, s) -> Some ([e], s)
  | _ -> None

// Faster version: parse the AndExpr prefix once, then consume "OR" terms in a loop.
let (|OrExpr|_|) s =
  match s with
  | AndExpr (e1, s) ->
    let rec work s es =
      match s with
      | Token ("OR", AndExpr(e, s)) -> work s (e :: es)
      | _ -> (List.rev es, s)
    let es, s = work s []
    Some(e1 :: es, s)
  | _ -> None

To see what happens, consider the input "~X". This is a "degenerate" OrExpr with a single term. The first implementation will go all the way down through each rule to identify a NegExpr, promoting it to an AndExpr, then will look for "OR", which isn't there. The first match case will therefore fail, and the second will be tried, redoing the work. The same problem exists at all levels in AndExpr, NegExpr and AtomExpr. What makes matters worse is that this kind of "degenerate" case is actually the normal common case in typical expressions.

The second implementation uses the functional equivalent of a while loop, consuming as many "OR" terms as possible. The trick here is to factor the common AndExpr prefix into a single branch, thus avoiding redoing the work when a match fails.

An important problem with my simple example code is the lack of error messages. Those are important from a usability point of view, but I'm currently not too sure how to do this with manually-written parsers.

To summarize, using this technique requires some amount of manual work, which is the price to pay for not using smarter parsing frameworks. I wouldn't advise it over using parser generators, but it's interesting nonetheless. I can see it having uses for extremely small grammars, for instance when parsing fragments in a document. Otherwise, go for a parser generator such as fslex+fsyacc or ANTLR (which generates recursive descent parsers), or combinator-based parsers such as FParsec.

Friday, May 18, 2012

Recursive descent parser using active patterns

Today's post isn't specifically about games, but about parsing, a task that recurs in many programming projects, including games.

In F#, the most popular methods for writing parsers are FParsec and fslex/fsyacc. Although parser generators are very useful, I'm always a bit reluctant to depend on third-party software.

In the past I have worked on a development tool for safety-critical systems, which was itself subject to some of the limitations of software for safety-critical systems. In particular, the use of lex and yacc was not accepted.

I fell back on a technique suitable for hand-written parsers, namely recursive descent parsing. I was positively surprised by the inherent simplicity of the method and the clarity of the resulting code.

I was first exposed to the idea of using active patterns for parsing when I read a blog post by Jon Harrop.
Active patterns are a feature specific to F#, and the F# wikibook has a chapter dedicated to them. I'll be using parameterized partial active patterns.

We start by defining a type for the input. The integer denotes the number of lines read so far; the list of strings is the input split at line-breaks.
open System.Text.RegularExpressions

type Stream = Stream of int * string list

A pattern to detect the end of the input, when the list of strings is empty.
let (|EOF|_|) = function
    | Stream(_, []) -> Some()
    | _ -> None

A pattern to detect end-of-lines, by recognizing lists of strings that start with an empty string.
let (|EOL|_|) = function
    | Stream(n, "" :: rest) ->
        Some (Stream(n + 1, rest))
    | _ -> None
This pattern eats leading white space on the current line.
let (|OneWS|_|) = function
    | Stream(n, line :: rest) ->
        let trimmed = line.TrimStart()
        if trimmed <> line then
            Some (Stream(n, trimmed :: rest))
        else
            None
    | _ -> None
A convenient pattern to eat sequences of white space and newlines. Note the use of the rec keyword, which allows the pattern to refer to itself in its own implementation.
let rec (|WS|_|) = function
    | OneWS (WS s)
    | EOL (WS s) -> Some s
    | s -> Some s
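As an illustrative check (this snippet is not part of the original code), WS can be exercised on a small hand-made stream:

match Stream(0, ["   "; ""; "hello"]) with
| WS (Stream(n, rest)) -> printfn "Resumed at line %d with %A" n rest
| _ -> ()

This prints "Resumed at line 2 with ["hello"]": the line of spaces and the empty line have both been consumed.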
We could also have attempted another variant, where WS uses itself for the first part, which would implement a left-recursive grammar. Unfortunately, this pattern would be useless, as all it would do is raise stack-overflow exceptions.
// Do not use: matching WS1 immediately re-enters WS1 before consuming
// anything, so this pattern overflows the stack on any input.
let rec (|WS1|_|) = function
    | WS1 (OneWS s) -> Some s
    | s -> Some s
My variant of the Regex pattern differs a bit from the one in the wikibook, to avoid re-building Regex objects, which has a non-negligible cost.
let (|Regex|_|) (regex : Regex) = function
    | Stream(n, line :: rest) ->
        let m = regex.Match(line)
        // Only accept matches anchored at the start of the line, and keep
        // whatever follows the match as the remainder of the line.
        if m.Success && m.Index = 0 then
            let lineRest = line.Substring(m.Length)
            let values =
                [ for gr in m.Groups -> gr.Value ]
            Some (values, Stream(n, lineRest :: rest))
        else None
    | _ -> None
A sample pattern to illustrate the use of the Regex pattern.
let personRegex = Regex(@"NAME: (\w+) AGE: (\d+)")
let (|Person|_|) = function
    | WS (Regex personRegex ([_ ; name ; age], s)) -> Some ((name, age), s)
    | _ -> None
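As a quick, illustrative check (again, not part of the original code):

match Stream(0, ["  NAME: Alice AGE: 30"]) with
| Person ((name, age), _) -> printfn "Parsed name=%s age=%s" name age
| _ -> printfn "No match"

This prints "Parsed name=Alice age=30"; note that age is captured as a string, straight from the regex group.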
A pattern to parse the data of a large number of persons. I could have used recursion at the level of the pattern itself, similarly to WS. However, such a pattern would not be tail-recursive, and would cause stack overflows for large numbers of entries.
let (|Persons|_|) s =
    let rec work s persons =
        match s with
        | Person (p, s) -> work s (p :: persons)
        | _ -> (persons |> List.rev, s)
    let persons, s = work s []
    Some (persons, s)
Finally, some utility code to initialize streams and generate one million lines to parse, followed by the parsing itself.
let streamOfLines lines =
    Stream(0, lines |> List.ofSeq)

let streamOfFile path =
    Stream(0, System.IO.File.ReadAllLines(path) |> List.ofArray)

let stream =
    seq { for n in 0..1000000 do
            yield sprintf "  NAME: Anonymous AGE: %d              " (n % 99) }
    |> streamOfLines

#time "on"
match stream with
| Persons (p, WS EOF) -> printfn "Success %A" p
| _ -> printfn "Failed"
#time "off"
Parsing takes about 9 seconds on my PC (18s if I recreate the Regex object repeatedly). I think that's very decent performance.
In a future post, I'll describe how to parse Boolean formulas and expand a bit on the power and performance of this method.

Friday, December 26, 2008

Parsing DirectX .X files

This has been by far the most demanding episode to write in this series of tutorials, but it's finally there. See http://code.google.com/p/fs-gamedev/wiki/Tutorial10 for the text and source code.

This tutorial shows how to parse .X files, which are used for 3d models and scenes. Note that DirectX offers convenient functions to manipulate these files, meaning that the practical usefulness of the code in this tutorial is limited, unless you don't have access to DirectX.

This tutorial demonstrates the following concepts:

* Discriminated unions and simple pattern matching
* Stateless lexing
* Hiding side effects in IO operations
* Error handling using the option type
* Defining mutually and self dependent types
* Building functions at run-time by combining basic functions

The content of this tutorial is significantly denser than usual. The most interesting part, in my opinion, lies in the design decision to separate parsing into two stages. During the first stage, second-stage parsers are dynamically created by combining simple parsers (such as a semi-colon parser, an int parser, and even a "zero" parser that parses nothing). The second stage simply consists of calling the result of the first stage, which produces hierarchical data representing the faces, lights and materials that compose a 3d scene.
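To give a flavor of the idea, here is a minimal sketch of such combinable parsers in F#. The names (zero, token, pint, .>>.) and the Parser type are illustrative assumptions for this example, not the tutorial's actual API:

open System

// A parser consumes a list of tokens and, on success, returns a value
// together with the remaining tokens.
type Parser<'T> = string list -> ('T * string list) option

// The "zero" parser: parses nothing, always succeeds.
let zero : Parser<unit> = fun toks -> Some ((), toks)

// Parse one expected token, e.g. a semi-colon.
let token expected : Parser<unit> = fun toks ->
    match toks with
    | t :: rest when t = expected -> Some ((), rest)
    | _ -> None

// Parse an integer token.
let pint : Parser<int> = fun toks ->
    match toks with
    | t :: rest ->
        match Int32.TryParse t with
        | true, n -> Some (n, rest)
        | _ -> None
    | [] -> None

// Combine two parsers in sequence, pairing their results.
let (.>>.) (p1 : Parser<'A>) (p2 : Parser<'B>) : Parser<'A * 'B> =
    fun toks ->
        match p1 toks with
        | Some (a, rest) ->
            match p2 rest with
            | Some (b, rest') -> Some ((a, b), rest')
            | None -> None
        | None -> None

// A second-stage parser built at run time by combining basic parsers:
// an integer followed by a semi-colon.
let intSemi = pint .>>. token ";"
// intSemi ["42"; ";"] evaluates to Some ((42, ()), [])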