Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2012/07/16 05:13:14 UTC

[lucy-dev] CFC's lexer and reserved words

Greets,

I've opened up an issue about making a minor mod to
the Flex file clownfish/src/CFCLexHeader.l:

    CFC lexer should treat reserved words as special cases of identifier
    https://issues.apache.org/jira/browse/LUCY-241

The rationale for this change can be found in PLP3[1], section 2.2.2
"Scanner Code":

    ...keywords in most languages look just like identifiers, but are reserved
    for a special purpose... It is possible to write a finite automaton that
    distinguishes between keywords and identifiers but it requires a *lot* of
    states... Most scanners, both handwritten and automatically generated,
    therefore treat keywords as "exceptions" to the rule for identifiers.
    Before returning an identifier to the parser, the scanner looks it up in a
    hash table or trie (a tree of branching paths) to make sure it isn't
    really a keyword.

A similar recommendation (with sample code) can be found in "Introduction to
Compiler Construction with Unix" by Axel T. Schreiner and H. George Friedman,
Jr., Section 2.6 "Testing a Lexical Analyzer":

    One last advice: the reserved words of a programming language are
    particularly easy to write down as patterns.  Unfortunately, a long list
    of such self-representing patterns dramatically increases the size of the
    program generated by `lex`.  It is much more efficient to collect reserved
    words and identifiers with the same, single pattern and to screen the
    results with a small C function.
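
As a rough illustration of the "small C function" both passages describe,
here is a minimal sketch. The token codes, the keyword list, and the name
screen_identifier() are all hypothetical and are not taken from
CFCLexHeader.l; the lookup uses a sorted table with bsearch() rather than
the hash table or trie that PLP3 mentions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real parser would supply its own. */
enum { TOK_IDENTIFIER = 1, TOK_CLASS, TOK_FINAL, TOK_INERT, TOK_PUBLIC };

typedef struct {
    const char *word;
    int         token;
} Keyword;

/* Illustrative keyword list, kept sorted in strcmp() order for bsearch(). */
static const Keyword keywords[] = {
    { "class",  TOK_CLASS  },
    { "final",  TOK_FINAL  },
    { "inert",  TOK_INERT  },
    { "public", TOK_PUBLIC },
};

/* bsearch() comparator: the key is a C string, the element a Keyword. */
static int
compare_keyword(const void *key, const void *elem) {
    return strcmp((const char*)key, ((const Keyword*)elem)->word);
}

/* Return the keyword's token code, or TOK_IDENTIFIER if the text is not
 * a reserved word. */
int
screen_identifier(const char *text) {
    const Keyword *hit = bsearch(text, keywords,
                                 sizeof(keywords) / sizeof(keywords[0]),
                                 sizeof(keywords[0]), compare_keyword);
    return hit ? hit->token : TOK_IDENTIFIER;
}
```

With something like this in place, a Flex file would need only one
identifier rule whose action is along the lines of
"return screen_identifier(yytext);" instead of one pattern per reserved
word.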

IMO, memory usage is not a concern for us (Schreiner's book is from 1985, a
different era with regard to memory).  If we really start caring about
RAM footprint or performance of the CFC lexer, we'll write a hand-coded
lexer like one of the others we have in the Lucy codebase:

    core/Lucy/Search/QueryParser/QueryLexer.c  // QueryLexer_tokenize()
    core/Lucy/Util/Json.c                      // S_do_parse_json()

But this is a simple change, and finishing out the patch allows somebody else
to get their hands dirty on actual generated lex code (a topic of this week's
chapter from PLP3 in the Lucy Book Club).

Marvin Humphrey

[1] http://wiki.apache.org/lucy/LucyBookClub
    The book is Programming Language Pragmatics, 3rd edition by Michael L.
    Scott.