You are viewing a plain text version of this content. The canonical link for it is here.

Posted to fop-dev@xmlgraphics.apache.org by Jonathan Levinson <Jo...@intersystems.com> on 2009/09/27 22:48:28 UTC

Questionable whether font-shorthand grammar LL(1)

Hi Vincent,

 

I dusted off my books on parsing and compiling (also using some
Web-sites to do research) and looked at writing a formal grammar for
font-shorthand.  

 

Because font-variant font-style and font-weight can occur in any order,
I could not (currently) come up with a grammar in which the directing
sets were disjoint for each non-terminal.  So I was unable to come up
with an LL(1) grammar.

 

For instance, here are two productions of my attempt at a grammar: 

 

<style-variant-weight> -> <variant-weight>

<style-variant-weight> -> <variant-style>

 

In each case, the first set of <style-variant-weight> shares a common
element in two different productions, the literal values for variant.
One needs to look ahead one more token to see if one has a
<variant-weight> or a <variant-style>.

 

According to Gough's "Syntax Analysis and Software Tools" (1988) "For
every production of the augmented grammar we derive a set of possible
1-lookahead symbols, which we call the director set for that production.
If and only if the director sets for different productions of the same
non-terminal are disjoint, i.e. have no common elements, is the grammar
LL(1)."

 

Also the grammar is ambiguous as we've discussed. 

 

<font> -> <style-variant-weight> <size> [ / <line-height>] <family>

 

 If the string starts with 'normal' and then goes on to define <size>
and <family> then one isn't sure whether style or variant or weight are
being specified.

 

Somehow one needs to special case 'normal' so that when the string
begins with normal - one value (say font-weight is set) and the other
two are not set which according to the spec means they are reset to
normal as well.

 

The books and web articles I read only discussed using recursive descent
when the grammar is LL(1).  I have the feeling that despite the
ambiguities in the grammar it is almost LL(k) because font-variant and
font-style and font-weight almost have disjoint values.   It is at least
LL(3) and I suspect it is LL(6).

 

Given your greater knowledge of parsing, do you know if an LL(k) parser
can always be implemented as recursive descent if one looks k tokens
ahead in one's parsing routine?

 

I also noticed that the fact that space separates the tokens must be in
an important part of any solution to the problem and that the
font-shorthand is more easily parsed (by any software) from
right-to-left than left-to-right.  This is because font-family is not
nullable and in a right-to-left parsing is the first element
encountered.    A non-terminal symbol is nullable if null can be validly
derived from it in terms of the grammar.

 

I'm not as convinced as you are that recursive descent parsing or a
formal bottom-up-parser will make the code simpler rather than more
complex because of the complexities of a formal grammar.   Of course,
however complex the grammar, a table-generating tool - like ANTLR - will
generate code, however complex, which will faithfully reflect the
inputted grammar.  However, none of the other properties in FOP use a
table-generating tool like ANTLR - and I'm not sure what the
consequences would be to FOP of introducing such a tool.  Given the
complexities of the grammar, I'm sure that a recursive descent parser
will be quite complex, and if we are going to use a grammar driven
approach we would be better off with a tool that generates parsers from
grammars rather than the recursive descent approach.  Also an advantage
of parser generators is that one doesn't have to rewrite so much code to
correct a mistake in one's grammar, if one makes a mistake, or if the
grammar changes.  Recursive descent parsing can pose its own maintenance
nightmares.

 

The current approach in FOP for font-shorthand is obscurely written but
strikes me as basically sound.

 

1)      One parses from right-to-left using the fact that spaces divide
tokens

2)      One lets property makers determine whether they apply to a
token.  Each property maker is a little parser of the token one feeds
it.  Because the property makers determine whether they apply to a
token, one can handle the fact that variant, weight and style can occur
in any order by feeding the current token to each of the property makers
for font-variant, font-weight, and font-style in turn.  Whatever they
accept is ipso-facto a font-variant or a font-weight or font-style.

 

Just want to let you know I take the problem seriously, and I'm not
trying to duck the responsibility of coming up with an adequate
solution.  I'm not sure what I did fits into a "job priority" which is
why I spent many hours this weekend on this research.

 

You are free to disagree with my observations and I notice that on
fop-dev forums you challenge us to make the code simpler, more reusable,
and better structured.

 

Best Regards,

Jonathan S. Levinson

Senior Software Developer

Object Group

InterSystems

617-621-0600

Re: Questionable whether font-shorthand grammar LL(1)

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Jonathan,

Jonathan Levinson wrote:
> Hi Vincent,
> 
> Excellent ideas!  
> 
> The diagram you drew is extremely useful!
> 
> If the font shorthand sub-language has a grammar that is regular then it also has a grammar that is LL(1).  So recursive descent parsing will work, if there is a regular grammar.
> 
> I think the best way of getting font shorthand to work would proceed in stages:
> 
> 1) First get the current code to properly parse and accept valid font shorthand expressions.  This should be very easy.  The one remaining problem (AFAIK) is the parsing of font-size/line-height where /line-height is optional.   Currently spaces are not allowed around the slash "/" and they should be.  I'm going to try to get to this problem as soon as I have time, probably in a day or so.

The current code predates the switch to Java 1.4 as a minimum
requirement, so couldn’t use the java.util.regex package. Feel free to
make use of regular expressions if you think that will make the job
easier.


> 2) Evaluate which parser or automaton approach is the simplest and produces better error states than the current approach.  
> 3) Implement the approach one has chosen in (2).

Good luck!

<snip/>

Vincent

RE: Questionable whether font-shorthand grammar LL(1)

Posted by Jonathan Levinson <Jo...@intersystems.com>.

Hi Vincent,

Excellent ideas!  

The diagram you drew is extremely useful!

If the font shorthand sub-language has a grammar that is regular then it also has a grammar that is LL(1).  So recursive descent parsing will work, if there is a regular grammar.

I think the best way of getting font shorthand to work would proceed in stages:

1) First get the current code to properly parse and accept valid font shorthand expressions.  This should be very easy.  The one remaining problem (AFAIK) is the parsing of font-size/line-height where /line-height is optional.   Currently spaces are not allowed around the slash "/" and they should be.  I'm going to try to get to this problem as soon as I have time, probably in a day or so.
2) Evaluate which parser or automaton approach is the simplest and produces better error states than the current approach.  
3) Implement the approach one has chosen in (2).

Best Regards,
Jonathan S. Levinson
Senior Software Developer
Object Group
InterSystems
617-621-0600

-----Original Message-----
From: Vincent Hennebert [mailto:vhennebert@gmail.com] 
Sent: Monday, September 28, 2009 8:13 AM
To: fop-dev@xmlgraphics.apache.org
Subject: Re: Questionable whether font-shorthand grammar LL(1)

Hi Jonathan,

Interesting stuff!

Jonathan Levinson wrote:
> Hi Vincent,
> 
<snip/>
> 
> Because font-variant font-style and font-weight can occur in any 
> order, I could not (currently) come up with a grammar in which the 
> directing sets were disjoint for each non-terminal.  So I was unable 
> to come up with an LL(1) grammar.
> 
> For instance, here are two productions of my attempt at a grammar: 
> 
> <style-variant-weight> -> <variant-weight>
> 
> <style-variant-weight> -> <variant-style>
> 
> In each case, the first set of <style-variant-weight> shares a common 
> element in two different productions, the literal values for variant.
> One needs to look ahead one more token to see if one has a 
> <variant-weight> or a <variant-style>.

(I’ll call “modifier” any of the three style, variant, weight
properties.)
Taking the ‘normal’ case apart, and since ‘inherit’ is not allowed in the shorthand, I think the values for all modifiers are distinct:
‘italic’, ‘oblique’, ‘backslant’ for font-style, ‘small-caps’ for font-variant, and the various weight values for font-weight.

Since all modifiers are set to their initial values prior to the shorthand parsing, which is ‘normal’ for all three of them, I think we can simply ignore any ‘normal’ value found in the string. That is, accept it as a legal terminal but not do anything.

So I don’t think there is any ambiguity any more. What remains to be done is to check that the same modifier is not specified more than once (that includes checking that ‘normal’ is not specified more than
3 times). And it’s probably easier to check that at the semantic level instead of crafting special grammar rules.

<snip/>
> The books and web articles I read only discussed using recursive 
> descent when the grammar is LL(1).  I have the feeling that despite 
> the ambiguities in the grammar it is almost LL(k) because font-variant and
> font-style and font-weight almost have disjoint values.   It is at least
> LL(3) and I suspect it is LL(6).

The font-size property has the good idea of not allowing ‘normal’ as a value. The ‘normal’ case for modifiers can be ignored as explained above. So I think the grammar still is LL(1)

<snip/>
> I'm not as convinced as you are that recursive descent parsing or a 
> formal bottom-up-parser will make the code simpler rather than more
> complex because of the complexities of a formal grammar.   Of course,
> however complex the grammar, a table-generating tool - like ANTLR - 
> will generate code, however complex, which will faithfully reflect the 
> inputted grammar.  However, none of the other properties in FOP use a 
> table-generating tool like ANTLR - and I'm not sure what the 
> consequences would be to FOP of introducing such a tool.  Given the 
> complexities of the grammar, I'm sure that a recursive descent parser 
> will be quite complex, and if we are going to use a grammar driven 
> approach we would be better off with a tool that generates parsers 
> from grammars rather than the recursive descent approach.  Also an 
> advantage of parser generators is that one doesn't have to rewrite so 
> much code to correct a mistake in one's grammar, if one makes a 
> mistake, or if the grammar changes.  Recursive descent parsing can 
> pose its own maintenance nightmares.

Using a grammar tool like ANTLR is probably overkill to parse just a shorthand property. Moreover the grammar is not likely to change, so that reduces its usefulness even more. That said, most properties can accept expressions, where such a tool might actually be interesting.
I don’t know how far FOP goes to supporting expressions in other properties.

> The current approach in FOP for font-shorthand is obscurely written 
> but strikes me as basically sound.
> 
> 1)      One parses from right-to-left using the fact that spaces divide
> tokens

The problem is that font families can be specified with strings containing whitespace, that must be handled in a specific manner and not as a terminal delimitation. Otherwise parsing from right to left would indeed probably be relatively easy.

> 2)      One lets property makers determine whether they apply to a
> token.  Each property maker is a little parser of the token one feeds 
> it.  Because the property makers determine whether they apply to a 
> token, one can handle the fact that variant, weight and style can 
> occur in any order by feeding the current token to each of the 
> property makers for font-variant, font-weight, and font-style in turn.  
> Whatever they accept is ipso-facto a font-variant or a font-weight or font-style.
> 
> Just want to let you know I take the problem seriously, and I'm not 
> trying to duck the responsibility of coming up with an adequate 
> solution.  I'm not sure what I did fits into a "job priority" which is 
> why I spent many hours this weekend on this research.

Thanks for looking into this issue. I hope my comments above can help.

As an alternative: the fact that the shorthand property can be parsed using a regular expression shows that its grammar is a regular language [1], thus can also be parsed using an automaton. I’ve quickly sketched such an automaton that I’ve attached to this message. The error handling probably needs to be refined but otherwise it should already provide a good basis for a parser.

[1] http://en.wikipedia.org/wiki/Regular_language

> You are free to disagree with my observations and I notice that on 
> fop-dev forums you challenge us to make the code simpler, more 
> reusable, and better structured.

That’s indeed an important thing to keep in mind. I believe an LL parser would be relatively easy to implement. If you feel that the current approach could be simplified and improved to be as powerful as an LL parser, by all means, go for it.

Thanks,
Vincent

RE: Questionable whether font-shorthand grammar LL(1)

Posted by Jonathan Levinson <Jo...@intersystems.com>.

I agree - in this case - tokenizing - lexical analysis - is more difficult than parsing.

Best Regards,
Jonathan 

-----Original Message-----
From: Vincent Hennebert [mailto:vhennebert@gmail.com] 
Sent: Wednesday, September 30, 2009 6:25 AM
To: fop-dev@xmlgraphics.apache.org
Subject: Re: Questionable whether font-shorthand grammar LL(1)

Thanks everyone for your parser suggestions. I believe we should be able
to do without one for the font shorthand, but this is definitely
something to keep in mind if we want to improve the parsing of other
properties.

I’m starting to realise that the most difficult part is probably not so
much the grammar parsing as the lexical analysis. To be continued,
I guess...

Vincent


Laurent Caillette wrote:
> Hi all,
> 
> I've never used SableCC or JavaCC so I cannot compare, but I'm using ANTLR a lot. ANTLR is highly customizable and has a very strong community. It's integrated development environment offers a debugger and visualization of grammar ambiguities. It's not only simple to setup and use, it also offers all the comfort you can reasonably dream of when developing grammars.
> 
> Maybe that a tool like JarJar could reduce the pain of depending on one more library (with all possible conflicts that could happen to FOP users).
> 
> Because code generation has some drawbacks (at least in terms of build complexity) you may be interested by JParsec, which creates parsers dynamically from pure Java code. Disclaimer: never used it.
> http://jparsec.codehaus.org
> 
> Hope this will help you to do a reasonable choice.
> 
> c.
> 
> 
> -----Message d'origine-----
> De : berger.max@gmail.com [mailto:berger.max@gmail.com] De la part de Max Berger
> Envoyé : mardi 29 septembre 2009 13:00
> À : fop-dev@xmlgraphics.apache.org
> Objet : Re: Questionable whether font-shorthand grammar LL(1)
> 
> Hi Vincent,
> 
> 
> 2009/9/29 Vincent Hennebert <vh...@gmail.com>:
>>> How about specifing the grammer and using a tool such as JavaCC to
>>> generate the actual parser? This way you could focus more complete
>>> grammer and have to spend less time writing the parser.
>> That would be the same as using ANTLR. I feel that this is a bit
>> overkill for just parsing the font shorthand property, although that may
>> prove to be useful for other properties that can accept complex
>> expressions.
>> That said, JavaCC is an interesting suggestion, I didn’t think of it. If
>> a choice had to be made between ANTLR and JavaCC, which one would win?
> 
> ANTLR:
> - easy to use
> - requires runtime linking of jar [1] (a *huge* disadvantage imo)
> 
> JavaCC:
> - very sparse documentation
> - generates standalone java classes
> 
> SableCC:
> - better documentation
> - LGPL (And therefore maybe not feasible, although it would only be
> used at compile time and not runtime)
> 
> [1] http://beust.com/weblog/archives/000145.html
> 
> 
> Max

Re: Questionable whether font-shorthand grammar LL(1)

Posted by Vincent Hennebert <vh...@gmail.com>.

Thanks everyone for your parser suggestions. I believe we should be able
to do without one for the font shorthand, but this is definitely
something to keep in mind if we want to improve the parsing of other
properties.

I’m starting to realise that the most difficult part is probably not so
much the grammar parsing as the lexical analysis. To be continued,
I guess...

Vincent


Laurent Caillette wrote:
> Hi all,
> 
> I've never used SableCC or JavaCC so I cannot compare, but I'm using ANTLR a lot. ANTLR is highly customizable and has a very strong community. It's integrated development environment offers a debugger and visualization of grammar ambiguities. It's not only simple to setup and use, it also offers all the comfort you can reasonably dream of when developing grammars.
> 
> Maybe that a tool like JarJar could reduce the pain of depending on one more library (with all possible conflicts that could happen to FOP users).
> 
> Because code generation has some drawbacks (at least in terms of build complexity) you may be interested by JParsec, which creates parsers dynamically from pure Java code. Disclaimer: never used it.
> http://jparsec.codehaus.org
> 
> Hope this will help you to do a reasonable choice.
> 
> c.
> 
> 
> -----Message d'origine-----
> De : berger.max@gmail.com [mailto:berger.max@gmail.com] De la part de Max Berger
> Envoyé : mardi 29 septembre 2009 13:00
> À : fop-dev@xmlgraphics.apache.org
> Objet : Re: Questionable whether font-shorthand grammar LL(1)
> 
> Hi Vincent,
> 
> 
> 2009/9/29 Vincent Hennebert <vh...@gmail.com>:
>>> How about specifing the grammer and using a tool such as JavaCC to
>>> generate the actual parser? This way you could focus more complete
>>> grammer and have to spend less time writing the parser.
>> That would be the same as using ANTLR. I feel that this is a bit
>> overkill for just parsing the font shorthand property, although that may
>> prove to be useful for other properties that can accept complex
>> expressions.
>> That said, JavaCC is an interesting suggestion, I didn’t think of it. If
>> a choice had to be made between ANTLR and JavaCC, which one would win?
> 
> ANTLR:
> - easy to use
> - requires runtime linking of jar [1] (a *huge* disadvantage imo)
> 
> JavaCC:
> - very sparse documentation
> - generates standalone java classes
> 
> SableCC:
> - better documentation
> - LGPL (And therefore maybe not feasible, although it would only be
> used at compile time and not runtime)
> 
> [1] http://beust.com/weblog/archives/000145.html
> 
> 
> Max

RE: Questionable whether font-shorthand grammar LL(1)

Posted by Laurent Caillette <La...@ullink.com>.

Hi all,

I've never used SableCC or JavaCC so I cannot compare, but I'm using ANTLR a lot. ANTLR is highly customizable and has a very strong community. It's integrated development environment offers a debugger and visualization of grammar ambiguities. It's not only simple to setup and use, it also offers all the comfort you can reasonably dream of when developing grammars.

Maybe that a tool like JarJar could reduce the pain of depending on one more library (with all possible conflicts that could happen to FOP users).

Because code generation has some drawbacks (at least in terms of build complexity) you may be interested by JParsec, which creates parsers dynamically from pure Java code. Disclaimer: never used it.
http://jparsec.codehaus.org

Hope this will help you to do a reasonable choice.

c.


-----Message d'origine-----
De : berger.max@gmail.com [mailto:berger.max@gmail.com] De la part de Max Berger
Envoyé : mardi 29 septembre 2009 13:00
À : fop-dev@xmlgraphics.apache.org
Objet : Re: Questionable whether font-shorthand grammar LL(1)

Hi Vincent,


2009/9/29 Vincent Hennebert <vh...@gmail.com>:
>> How about specifing the grammer and using a tool such as JavaCC to
>> generate the actual parser? This way you could focus more complete
>> grammer and have to spend less time writing the parser.
> That would be the same as using ANTLR. I feel that this is a bit
> overkill for just parsing the font shorthand property, although that may
> prove to be useful for other properties that can accept complex
> expressions.
> That said, JavaCC is an interesting suggestion, I didn’t think of it. If
> a choice had to be made between ANTLR and JavaCC, which one would win?

ANTLR:
- easy to use
- requires runtime linking of jar [1] (a *huge* disadvantage imo)

JavaCC:
- very sparse documentation
- generates standalone java classes

SableCC:
- better documentation
- LGPL (And therefore maybe not feasible, although it would only be
used at compile time and not runtime)

[1] http://beust.com/weblog/archives/000145.html


Max


__________ Information provenant d'ESET NOD32 Antivirus, version de la base des signatures de virus 4466 (20090929) __________

Le message a été vérifié par ESET NOD32 Antivirus.

http://www.eset.com

Re: Questionable whether font-shorthand grammar LL(1)

Posted by Max Berger <ma...@berger.name>.

Hi Vincent,


2009/9/29 Vincent Hennebert <vh...@gmail.com>:
>> How about specifing the grammer and using a tool such as JavaCC to
>> generate the actual parser? This way you could focus more complete
>> grammer and have to spend less time writing the parser.
> That would be the same as using ANTLR. I feel that this is a bit
> overkill for just parsing the font shorthand property, although that may
> prove to be useful for other properties that can accept complex
> expressions.
> That said, JavaCC is an interesting suggestion, I didn’t think of it. If
> a choice had to be made between ANTLR and JavaCC, which one would win?

ANTLR:
- easy to use
- requires runtime linking of jar [1] (a *huge* disadvantage imo)

JavaCC:
- very sparse documentation
- generates standalone java classes

SableCC:
- better documentation
- LGPL (And therefore maybe not feasible, although it would only be
used at compile time and not runtime)

[1] http://beust.com/weblog/archives/000145.html



Max

Re: Questionable whether font-shorthand grammar LL(1)

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Max,

Max Berger wrote:
> Hi *,
> 
> I just want to throw in a different idea (you may ignore it if you like):
> 
> How about specifing the grammer and using a tool such as JavaCC to
> generate the actual parser? This way you could focus more complete
> grammer and have to spend less time writing the parser.

That would be the same as using ANTLR. I feel that this is a bit
overkill for just parsing the font shorthand property, although that may
prove to be useful for other properties that can accept complex
expressions.
That said, JavaCC is an interesting suggestion, I didn’t think of it. If
a choice had to be made between ANTLR and JavaCC, which one would win?


> JavaCC is BSD license, so we could easily integrate it in the fop build.
> 
> Max

Thanks,
Vincent



> 2009/9/28 Vincent Hennebert:
>> Hi Jonathan,
>>
>> Interesting stuff!
>>
>> Jonathan Levinson wrote:
>>> Hi Vincent,
>>>
>> <snip/>
>>> Because font-variant font-style and font-weight can occur in any order,
>>> I could not (currently) come up with a grammar in which the directing
>>> sets were disjoint for each non-terminal.  So I was unable to come up
>>> with an LL(1) grammar.
>>>
>>> For instance, here are two productions of my attempt at a grammar:
>>>
>>> <style-variant-weight> -> <variant-weight>
>>>
>>> <style-variant-weight> -> <variant-style>
>>>
>>> In each case, the first set of <style-variant-weight> shares a common
>>> element in two different productions, the literal values for variant.
>>> One needs to look ahead one more token to see if one has a
>>> <variant-weight> or a <variant-style>.
>> (I’ll call “modifier” any of the three style, variant, weight
>> properties.)
>> Taking the ‘normal’ case apart, and since ‘inherit’ is not allowed in
>> the shorthand, I think the values for all modifiers are distinct:
>> ‘italic’, ‘oblique’, ‘backslant’ for font-style, ‘small-caps’ for
>> font-variant, and the various weight values for font-weight.
>>
>> Since all modifiers are set to their initial values prior to the
>> shorthand parsing, which is ‘normal’ for all three of them, I think we
>> can simply ignore any ‘normal’ value found in the string. That is,
>> accept it as a legal terminal but not do anything.
>>
>> So I don’t think there is any ambiguity any more. What remains to be
>> done is to check that the same modifier is not specified more than once
>> (that includes checking that ‘normal’ is not specified more than
>> 3 times). And it’s probably easier to check that at the semantic level
>> instead of crafting special grammar rules.
>>
>>
>> <snip/>
>>> The books and web articles I read only discussed using recursive descent
>>> when the grammar is LL(1).  I have the feeling that despite the
>>> ambiguities in the grammar it is almost LL(k) because font-variant and
>>> font-style and font-weight almost have disjoint values.   It is at least
>>> LL(3) and I suspect it is LL(6).
>> The font-size property has the good idea of not allowing ‘normal’ as
>> a value. The ‘normal’ case for modifiers can be ignored as explained
>> above. So I think the grammar still is LL(1)
>>
>>
>> <snip/>
>>> I'm not as convinced as you are that recursive descent parsing or a
>>> formal bottom-up-parser will make the code simpler rather than more
>>> complex because of the complexities of a formal grammar.   Of course,
>>> however complex the grammar, a table-generating tool - like ANTLR - will
>>> generate code, however complex, which will faithfully reflect the
>>> inputted grammar.  However, none of the other properties in FOP use a
>>> table-generating tool like ANTLR - and I'm not sure what the
>>> consequences would be to FOP of introducing such a tool.  Given the
>>> complexities of the grammar, I'm sure that a recursive descent parser
>>> will be quite complex, and if we are going to use a grammar driven
>>> approach we would be better off with a tool that generates parsers from
>>> grammars rather than the recursive descent approach.  Also an advantage
>>> of parser generators is that one doesn't have to rewrite so much code to
>>> correct a mistake in one's grammar, if one makes a mistake, or if the
>>> grammar changes.  Recursive descent parsing can pose its own maintenance
>>> nightmares.
>> Using a grammar tool like ANTLR is probably overkill to parse just
>> a shorthand property. Moreover the grammar is not likely to change, so
>> that reduces its usefulness even more. That said, most properties can
>> accept expressions, where such a tool might actually be interesting.
>> I don’t know how far FOP goes to supporting expressions in other
>> properties.
>>
>>
>>> The current approach in FOP for font-shorthand is obscurely written but
>>> strikes me as basically sound.
>>>
>>> 1)      One parses from right-to-left using the fact that spaces divide
>>> tokens
>> The problem is that font families can be specified with strings
>> containing whitespace, that must be handled in a specific manner and not
>> as a terminal delimitation. Otherwise parsing from right to left would
>> indeed probably be relatively easy.
>>
>>
>>> 2)      One lets property makers determine whether they apply to a
>>> token.  Each property maker is a little parser of the token one feeds
>>> it.  Because the property makers determine whether they apply to a
>>> token, one can handle the fact that variant, weight and style can occur
>>> in any order by feeding the current token to each of the property makers
>>> for font-variant, font-weight, and font-style in turn.  Whatever they
>>> accept is ipso-facto a font-variant or a font-weight or font-style.
>>>
>>> Just want to let you know I take the problem seriously, and I'm not
>>> trying to duck the responsibility of coming up with an adequate
>>> solution.  I'm not sure what I did fits into a "job priority" which is
>>> why I spent many hours this weekend on this research.
>> Thanks for looking into this issue. I hope my comments above can help.
>>
>> As an alternative: the fact that the shorthand property can be parsed
>> using a regular expression shows that its grammar is a regular
>> language [1], thus can also be parsed using an automaton. I’ve quickly
>> sketched such an automaton that I’ve attached to this message. The error
>> handling probably needs to be refined but otherwise it should already
>> provide a good basis for a parser.
>>
>> [1] http://en.wikipedia.org/wiki/Regular_language
>>
>>
>>> You are free to disagree with my observations and I notice that on
>>> fop-dev forums you challenge us to make the code simpler, more reusable,
>>> and better structured.
>> That’s indeed an important thing to keep in mind. I believe an LL parser
>> would be relatively easy to implement. If you feel that the current
>> approach could be simplified and improved to be as powerful as an LL
>> parser, by all means, go for it.
>>
>>
>> Thanks,
>> Vincent
>>

Re: Questionable whether font-shorthand grammar LL(1)

Posted by Max Berger <ma...@berger.name>.

Hi *,

I just want to throw in a different idea (you may ignore it if you like):

How about specifing the grammer and using a tool such as JavaCC to
generate the actual parser? This way you could focus more complete
grammer and have to spend less time writing the parser.

JavaCC is BSD license, so we could easily integrate it in the fop build.

Max

2009/9/28 Vincent Hennebert <vh...@gmail.com>:
> Hi Jonathan,
>
> Interesting stuff!
>
> Jonathan Levinson wrote:
>> Hi Vincent,
>>
> <snip/>
>>
>> Because font-variant font-style and font-weight can occur in any order,
>> I could not (currently) come up with a grammar in which the directing
>> sets were disjoint for each non-terminal.  So I was unable to come up
>> with an LL(1) grammar.
>>
>> For instance, here are two productions of my attempt at a grammar:
>>
>> <style-variant-weight> -> <variant-weight>
>>
>> <style-variant-weight> -> <variant-style>
>>
>> In each case, the first set of <style-variant-weight> shares a common
>> element in two different productions, the literal values for variant.
>> One needs to look ahead one more token to see if one has a
>> <variant-weight> or a <variant-style>.
>
> (I’ll call “modifier” any of the three style, variant, weight
> properties.)
> Taking the ‘normal’ case apart, and since ‘inherit’ is not allowed in
> the shorthand, I think the values for all modifiers are distinct:
> ‘italic’, ‘oblique’, ‘backslant’ for font-style, ‘small-caps’ for
> font-variant, and the various weight values for font-weight.
>
> Since all modifiers are set to their initial values prior to the
> shorthand parsing, which is ‘normal’ for all three of them, I think we
> can simply ignore any ‘normal’ value found in the string. That is,
> accept it as a legal terminal but not do anything.
>
> So I don’t think there is any ambiguity any more. What remains to be
> done is to check that the same modifier is not specified more than once
> (that includes checking that ‘normal’ is not specified more than
> 3 times). And it’s probably easier to check that at the semantic level
> instead of crafting special grammar rules.
>
>
> <snip/>
>> The books and web articles I read only discussed using recursive descent
>> when the grammar is LL(1).  I have the feeling that despite the
>> ambiguities in the grammar it is almost LL(k) because font-variant and
>> font-style and font-weight almost have disjoint values.   It is at least
>> LL(3) and I suspect it is LL(6).
>
> The font-size property has the good idea of not allowing ‘normal’ as
> a value. The ‘normal’ case for modifiers can be ignored as explained
> above. So I think the grammar still is LL(1)
>
>
> <snip/>
>> I'm not as convinced as you are that recursive descent parsing or a
>> formal bottom-up-parser will make the code simpler rather than more
>> complex because of the complexities of a formal grammar.   Of course,
>> however complex the grammar, a table-generating tool - like ANTLR - will
>> generate code, however complex, which will faithfully reflect the
>> inputted grammar.  However, none of the other properties in FOP use a
>> table-generating tool like ANTLR - and I'm not sure what the
>> consequences would be to FOP of introducing such a tool.  Given the
>> complexities of the grammar, I'm sure that a recursive descent parser
>> will be quite complex, and if we are going to use a grammar driven
>> approach we would be better off with a tool that generates parsers from
>> grammars rather than the recursive descent approach.  Also an advantage
>> of parser generators is that one doesn't have to rewrite so much code to
>> correct a mistake in one's grammar, if one makes a mistake, or if the
>> grammar changes.  Recursive descent parsing can pose its own maintenance
>> nightmares.
>
> Using a grammar tool like ANTLR is probably overkill to parse just
> a shorthand property. Moreover the grammar is not likely to change, so
> that reduces its usefulness even more. That said, most properties can
> accept expressions, where such a tool might actually be interesting.
> I don’t know how far FOP goes to supporting expressions in other
> properties.
>
>
>> The current approach in FOP for font-shorthand is obscurely written but
>> strikes me as basically sound.
>>
>> 1)      One parses from right-to-left using the fact that spaces divide
>> tokens
>
> The problem is that font families can be specified with strings
> containing whitespace, that must be handled in a specific manner and not
> as a terminal delimitation. Otherwise parsing from right to left would
> indeed probably be relatively easy.
>
>
>> 2)      One lets property makers determine whether they apply to a
>> token.  Each property maker is a little parser of the token one feeds
>> it.  Because the property makers determine whether they apply to a
>> token, one can handle the fact that variant, weight and style can occur
>> in any order by feeding the current token to each of the property makers
>> for font-variant, font-weight, and font-style in turn.  Whatever they
>> accept is ipso-facto a font-variant or a font-weight or font-style.
>>
>> Just want to let you know I take the problem seriously, and I'm not
>> trying to duck the responsibility of coming up with an adequate
>> solution.  I'm not sure what I did fits into a "job priority" which is
>> why I spent many hours this weekend on this research.
>
> Thanks for looking into this issue. I hope my comments above can help.
>
> As an alternative: the fact that the shorthand property can be parsed
> using a regular expression shows that its grammar is a regular
> language [1], thus can also be parsed using an automaton. I’ve quickly
> sketched such an automaton that I’ve attached to this message. The error
> handling probably needs to be refined but otherwise it should already
> provide a good basis for a parser.
>
> [1] http://en.wikipedia.org/wiki/Regular_language
>
>
>> You are free to disagree with my observations and I notice that on
>> fop-dev forums you challenge us to make the code simpler, more reusable,
>> and better structured.
>
> That’s indeed an important thing to keep in mind. I believe an LL parser
> would be relatively easy to implement. If you feel that the current
> approach could be simplified and improved to be as powerful as an LL
> parser, by all means, go for it.
>
>
> Thanks,
> Vincent
>

Re: Questionable whether font-shorthand grammar LL(1)

Posted by Vincent Hennebert <vh...@gmail.com>.

Hi Jonathan,

Interesting stuff!

Jonathan Levinson wrote:
> Hi Vincent,
> 
<snip/>
> 
> Because font-variant font-style and font-weight can occur in any order,
> I could not (currently) come up with a grammar in which the directing
> sets were disjoint for each non-terminal.  So I was unable to come up
> with an LL(1) grammar.
> 
> For instance, here are two productions of my attempt at a grammar: 
> 
> <style-variant-weight> -> <variant-weight>
> 
> <style-variant-weight> -> <variant-style>
> 
> In each case, the first set of <style-variant-weight> shares a common
> element in two different productions, the literal values for variant.
> One needs to look ahead one more token to see if one has a
> <variant-weight> or a <variant-style>.

(I’ll call “modifier” any of the three style, variant, weight
properties.)
Taking the ‘normal’ case apart, and since ‘inherit’ is not allowed in
the shorthand, I think the values for all modifiers are distinct:
‘italic’, ‘oblique’, ‘backslant’ for font-style, ‘small-caps’ for
font-variant, and the various weight values for font-weight.

Since all modifiers are set to their initial values prior to the
shorthand parsing, which is ‘normal’ for all three of them, I think we
can simply ignore any ‘normal’ value found in the string. That is,
accept it as a legal terminal but not do anything.

So I don’t think there is any ambiguity any more. What remains to be
done is to check that the same modifier is not specified more than once
(that includes checking that ‘normal’ is not specified more than
3 times). And it’s probably easier to check that at the semantic level
instead of crafting special grammar rules.

<snip/>
> The books and web articles I read only discussed using recursive descent
> when the grammar is LL(1).  I have the feeling that despite the
> ambiguities in the grammar it is almost LL(k) because font-variant and
> font-style and font-weight almost have disjoint values.   It is at least
> LL(3) and I suspect it is LL(6).

The font-size property has the good idea of not allowing ‘normal’ as
a value. The ‘normal’ case for modifiers can be ignored as explained
above. So I think the grammar still is LL(1)

<snip/>
> I'm not as convinced as you are that recursive descent parsing or a
> formal bottom-up-parser will make the code simpler rather than more
> complex because of the complexities of a formal grammar.   Of course,
> however complex the grammar, a table-generating tool - like ANTLR - will
> generate code, however complex, which will faithfully reflect the
> inputted grammar.  However, none of the other properties in FOP use a
> table-generating tool like ANTLR - and I'm not sure what the
> consequences would be to FOP of introducing such a tool.  Given the
> complexities of the grammar, I'm sure that a recursive descent parser
> will be quite complex, and if we are going to use a grammar driven
> approach we would be better off with a tool that generates parsers from
> grammars rather than the recursive descent approach.  Also an advantage
> of parser generators is that one doesn't have to rewrite so much code to
> correct a mistake in one's grammar, if one makes a mistake, or if the
> grammar changes.  Recursive descent parsing can pose its own maintenance
> nightmares.

Using a grammar tool like ANTLR is probably overkill to parse just
a shorthand property. Moreover the grammar is not likely to change, so
that reduces its usefulness even more. That said, most properties can
accept expressions, where such a tool might actually be interesting.
I don’t know how far FOP goes to supporting expressions in other
properties.

> The current approach in FOP for font-shorthand is obscurely written but
> strikes me as basically sound.
> 
> 1)      One parses from right-to-left using the fact that spaces divide
> tokens

The problem is that font families can be specified with strings
containing whitespace, that must be handled in a specific manner and not
as a terminal delimitation. Otherwise parsing from right to left would
indeed probably be relatively easy.

> 2)      One lets property makers determine whether they apply to a
> token.  Each property maker is a little parser of the token one feeds
> it.  Because the property makers determine whether they apply to a
> token, one can handle the fact that variant, weight and style can occur
> in any order by feeding the current token to each of the property makers
> for font-variant, font-weight, and font-style in turn.  Whatever they
> accept is ipso-facto a font-variant or a font-weight or font-style.
> 
> Just want to let you know I take the problem seriously, and I'm not
> trying to duck the responsibility of coming up with an adequate
> solution.  I'm not sure what I did fits into a "job priority" which is
> why I spent many hours this weekend on this research.

Thanks for looking into this issue. I hope my comments above can help.

As an alternative: the fact that the shorthand property can be parsed
using a regular expression shows that its grammar is a regular
language [1], thus can also be parsed using an automaton. I’ve quickly
sketched such an automaton that I’ve attached to this message. The error
handling probably needs to be refined but otherwise it should already
provide a good basis for a parser.

[1] http://en.wikipedia.org/wiki/Regular_language

> You are free to disagree with my observations and I notice that on
> fop-dev forums you challenge us to make the code simpler, more reusable,
> and better structured.

That’s indeed an important thing to keep in mind. I believe an LL parser
would be relatively easy to implement. If you feel that the current
approach could be simplified and improved to be as powerful as an LL
parser, by all means, go for it.

Thanks,
Vincent