Posted to fop-dev@xmlgraphics.apache.org by Jonathan Levinson <Jo...@intersystems.com> on 2009/10/06 21:46:28 UTC

Regular expression use

I noticed that if one is not careful in one's regular expression use,
compiling a regular expression can take minutes.  I'm not talking about
applying the pattern, just compiling it!

Should regular expressions be avoided altogether in favour of
hand-crafted state machines for parsing and tokenizing, or can regular
expressions be used as long as one is careful?

Best Regards,

Jonathan S. Levinson
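
As a rough illustration of what "being careful" could look like, the sketch below keeps the pattern small, compiles it once into a static field, and reuses it for every input. The class name RegexTokenizer, the token grammar, and the unit list are assumptions made for this example; none of it is FOP code. The main method merely prints a measured compile time so that a slow pattern would be noticed early.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of FOP: one small pattern, compiled once.
final class RegexTokenizer {

    // Compiled a single time when the class is loaded. Pattern objects are
    // immutable, so the compiled form can be shared and reused freely.
    private static final Pattern TOKEN = Pattern.compile(
            "(?<number>[0-9]+(?:\\.[0-9]+)?)(?<unit>pt|mm|cm|in|em|%)?"
            + "|(?<ident>[A-Za-z][A-Za-z0-9-]*)");

    // Splits the input into tokens, skipping whitespace and rejecting
    // any character that cannot start a token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException(
                        "Unexpected character at offset " + pos);
            }
            tokens.add(m.group());
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Time a compilation so that an unexpectedly slow pattern shows up.
        long start = System.nanoTime();
        Pattern.compile("[A-Za-z][A-Za-z0-9-]*");
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("compile took " + micros + " us");
        System.out.println(tokenize("1.5em solid black"));
    }
}
```

Whether such a pattern stays fast enough for real property values would have to be measured; the point is only that compilation cost is paid once rather than on every call.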

RE: Regular expression use

Posted by Jonathan Levinson <Jo...@intersystems.com>.

Judging from the following link, it looks like we can call the lexer to get tokens independently of the parser.

http://www.antlr.org/wiki/display/ANTLR3/1.+Lexer

Here is the example from that page which gives me that hope:

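import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class MainLexer {
    public static void main(String[] args) {
        // The try block was missing in the posted version, so the catch
        // clause below had nothing to attach to.
        try {
            CharStream input = new ANTLRFileStream(args[0]);
            // XMLLexer is the lexer class ANTLR generates from the XML grammar.
            XMLLexer lexer = new XMLLexer(input);
            Token token;
            // Compare token types rather than object identity, so the loop
            // does not rely on nextToken() returning the shared EOF_TOKEN.
            while ((token = lexer.nextToken()).getType() != Token.EOF) {
                System.out.println("Token: " + token.getText());
            }
        } catch (Throwable t) {
            System.out.println("Exception: " + t);
            t.printStackTrace();
        }
    }
}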

I don't know whether CharStream or XMLLexer has a constructor or factory method that accepts a String, which is what we'd probably use within FOP.

Best Regards,
Jonathan S. Levinson

-----Original Message-----
From: Vincent Hennebert [mailto:vhennebert@gmail.com] 
Sent: Thursday, October 08, 2009 5:15 AM
To: fop-dev@xmlgraphics.apache.org
Subject: Re: Regular expression use

Hi Jonathan,

Jonathan Levinson wrote:
> I'm sure someone has mentioned it already, but what about the lexer support in ANTLR?
> 
> http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Lexical+analysis
> 
> ANTLR is available under the BSD license, which seems to be one with no strings attached:
> 
> http://www.antlr.org/license.html

Basically we’re back to the same discussion as the one about the parser
generator, this time at the lexer level.
http://markmail.org/thread/64rmyl7x4nyoxhh3

Among the tools mentioned in the above thread, it would be good to know
which ones allow the lexer to be used independently of the parser. Unless
we decide to use both the lexer and parser anyway...


Vincent



Re: Regular expression use

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Jonathan,

Jonathan Levinson wrote:
> I'm sure someone has mentioned it already, but what about the lexer support in ANTLR?
> 
> http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Lexical+analysis
> 
> ANTLR is available under the BSD license, which seems to be one with no strings attached:
> 
> http://www.antlr.org/license.html

Basically we’re back to the same discussion as the one about the parser
generator, this time at the lexer level.
http://markmail.org/thread/64rmyl7x4nyoxhh3

Among the tools mentioned in the above thread, it would be good to know
which ones allow the lexer to be used independently of the parser. Unless
we decide to use both the lexer and parser anyway...


Vincent



RE: Regular expression use

Posted by Jonathan Levinson <Jo...@intersystems.com>.

I'm sure someone has mentioned it already, but what about the lexer support in ANTLR?

http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Lexical+analysis

ANTLR is available under the BSD license, which seems to be one with no strings attached:

http://www.antlr.org/license.html

Best Regards,
Jonathan S. Levinson

-----Original Message-----
From: Vincent Hennebert [mailto:vhennebert@gmail.com] 
Sent: Wednesday, October 07, 2009 6:51 AM
To: fop-dev@xmlgraphics.apache.org
Subject: Re: Regular expression use

Hi Jonathan,

Jonathan Levinson wrote:
> I noticed that if one is not careful in one's regular expression use,
> compiling a regular expression can take minutes.  I'm not talking about
> applying the pattern, just compiling it!
> 
> Should regular expressions be avoided altogether in favour of
> hand-crafted state machines for parsing and tokenizing, or can regular
> expressions be used as long as one is careful?

I’d say, use regular expressions as long as they are not too complex.
But I guess you’re mentioning that in the context of property parsing,
in which case I don’t think regular expressions are the ultimate answer.
A proper lexer is likely to be needed, either generated or written by
hand. As the latter solution quickly becomes a maintenance nightmare,
some lexer generator will probably be needed. The question remains which
one, and I’m not even sure one exists whose license is ASLv2-compatible.
Plus there are some issues specific to property
parsing, like shorthands (which should ideally re-use the parsers of the
individual properties), sub-properties, etc.

Vincent
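
To make the "written by hand" option mentioned above more concrete, here is a minimal sketch of a hand-written lexer for a tiny subset of property values. The class PropertyLexer, its three token kinds, and the accepted character classes are all assumptions chosen for illustration; real property expressions are considerably richer, which is where the maintenance concern comes from.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not FOP code: a hand-written lexer that recognises
// numbers, identifiers and commas, and skips whitespace.
final class PropertyLexer {

    enum Kind { NUMBER, IDENT, COMMA }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + "(" + text + ")";
        }
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static List<Token> tokenize(String s) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == ',') {
                out.add(new Token(Kind.COMMA, ","));
                i++;
            } else if (isDigit(c)) {
                int start = i;
                while (i < s.length() && isDigit(s.charAt(i))) {
                    i++;
                }
                // Optional fraction: a dot must be followed by a digit.
                if (i + 1 < s.length() && s.charAt(i) == '.'
                        && isDigit(s.charAt(i + 1))) {
                    i++;
                    while (i < s.length() && isDigit(s.charAt(i))) {
                        i++;
                    }
                }
                out.add(new Token(Kind.NUMBER, s.substring(start, i)));
            } else if (isLetter(c)) {
                int start = i;
                while (i < s.length() && (isLetter(s.charAt(i))
                        || isDigit(s.charAt(i)) || s.charAt(i) == '-')) {
                    i++;
                }
                out.add(new Token(Kind.IDENT, s.substring(start, i)));
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at offset " + i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("1.5pt solid, black"));
    }
}
```

Every new token kind adds another branch to the loop, which gives a feel for how quickly such code grows compared with a grammar fed to a lexer generator.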

Re: Regular expression use

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Jonathan,

Jonathan Levinson wrote:
> I noticed that if one is not careful in one's regular expression use,
> compiling a regular expression can take minutes.  I'm not talking about
> applying the pattern, just compiling it!
> 
> Should regular expressions be avoided altogether in favour of
> hand-crafted state machines for parsing and tokenizing, or can regular
> expressions be used as long as one is careful?

I’d say, use regular expressions as long as they are not too complex.
But I guess you’re mentioning that in the context of property parsing,
in which case I don’t think regular expressions are the ultimate answer.
A proper lexer is likely to be needed, either generated or written by
hand. As the latter solution quickly becomes a maintenance nightmare,
some lexer generator will probably be needed. The question remains which
one, and I’m not even sure one exists whose license is ASLv2-compatible.
Plus there are some issues specific to property
parsing, like shorthands (which should ideally re-use the parsers of the
individual properties), sub-properties, etc.

Vincent
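
The remark about shorthands re-using the parsers of the individual properties could be sketched roughly as follows. This is a simplified assumption of how such re-use might be arranged, not how FOP does it: the class BorderShorthand, the three per-property parsers, and the keyword sets are invented for the example, and the shorthand simply offers each whitespace-separated token to the individual parsers in turn.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical sketch, not FOP code: a shorthand parser that delegates to
// the parsers of the individual properties instead of duplicating them.
final class BorderShorthand {

    private static final Pattern LENGTH =
            Pattern.compile("[0-9]+(\\.[0-9]+)?(pt|mm|cm|in|em|px)");
    private static final Set<String> STYLES = new HashSet<>(
            Arrays.asList("none", "solid", "dashed", "dotted", "double"));
    private static final Set<String> COLORS = new HashSet<>(
            Arrays.asList("black", "white", "red", "green", "blue"));

    // Individual property parsers: each returns the accepted value, or
    // empty when the token is not valid for that property.
    static Optional<String> parseWidth(String tok) {
        return LENGTH.matcher(tok).matches()
                ? Optional.of(tok) : Optional.<String>empty();
    }

    static Optional<String> parseStyle(String tok) {
        return STYLES.contains(tok)
                ? Optional.of(tok) : Optional.<String>empty();
    }

    static Optional<String> parseColor(String tok) {
        return COLORS.contains(tok)
                ? Optional.of(tok) : Optional.<String>empty();
    }

    // Shorthand parser: tries every token against the individual parsers
    // and assigns it to the first property that accepts it and is unset.
    static Map<String, String> parse(String value) {
        Map<String, Function<String, Optional<String>>> parsers =
                new LinkedHashMap<>();
        parsers.put("border-width", BorderShorthand::parseWidth);
        parsers.put("border-style", BorderShorthand::parseStyle);
        parsers.put("border-color", BorderShorthand::parseColor);

        Map<String, String> result = new LinkedHashMap<>();
        for (String tok : value.trim().split("\\s+")) {
            boolean matched = false;
            for (Map.Entry<String, Function<String, Optional<String>>> e
                    : parsers.entrySet()) {
                if (result.containsKey(e.getKey())) {
                    continue; // this property already has a value
                }
                Optional<String> v = e.getValue().apply(tok);
                if (v.isPresent()) {
                    result.put(e.getKey(), v.get());
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException(
                        "Unrecognized token: " + tok);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("1pt solid black"));
    }
}
```

Because the shorthand only orchestrates, a fix to one of the individual parsers is picked up by the shorthand automatically, which is the benefit the re-use argument is after.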