Posted to doxia-dev@maven.apache.org by Vincent Massol <vi...@massol.net> on 2007/12/19 11:13:32 UTC

Doxia Parsing API?

Hi,

I'd like to implement a Doxia parser for XWiki. However, I've noticed
there's no standard in Doxia yet for parsing. Looking at the Doxia
Confluence, TWiki and APT modules, I see that each does it with its own
code. However, the Confluence and TWiki implementations are very similar,
each defining Block, BlockParser, etc.

Is someone working on offering a parsing API in Doxia so that:
1) duplications can be removed
2) people can more easily write parsers for different syntaxes

Since my goal is to use Doxia as the rendering mechanism for XWiki, I'd
need the parser to be very fast (each request will lead to parsing the
content). Does anyone have any idea how the Confluence parser compares
with, say, a JavaCC-generated parser?

Thanks
-Vincent

Re: Doxia Parsing API?

Posted by Lukas Theussl <lt...@apache.org>.

Hi Vincent!

The Confluence and TWiki modules are similar to each other, and different
from the rest, because they were supplied by the same (external) people.
The only standard in Doxia is the Sink API; parsing is done essentially
independently in each module. I have reduced the code duplication between
the xdoc and xhtml modules by introducing some Xhtml base classes [1], but
there is nothing like that for text parsers (yet).

I'm looking forward to your contributions! ;)

Cheers,
-Lukas

[1] https://svn.apache.org/viewvc?view=rev&revision=591684

Vincent Massol wrote:
> Hi,
> 
> I'd like to implement a Doxia parser for XWiki. However I've noticed  
> there's no standard in Doxia yet for parsing. Actually looking at  Doxia 
> confluence, twiki and Apt I see each does it with his own code.  However 
> the Confluence and TWiki implementations are very similar,  each 
> defining Block, BlockParser, etc.
> 
> Is someone working on offering a parsing api in doxia so that:
> 1) duplications can be removed
> 2) people can more easily write parsers for different syntaxes
> 
> Since my goal is to use Doxia as the rendering mechanism for XWiki I'd  
> need the parser to be very fast (each request will lead to parsing the  
> content). Does anyone have any idea how the Confluence parser compares  
> for example with, say, a JavaCC-generated parser?
> 
> Thanks
> -Vincent
> 
>

Re: Doxia Parsing API?

Posted by Jason van Zyl <ja...@maven.org>.

On 31 Dec 07, at 2:45 AM, Vincent Massol wrote:

>
> On Dec 27, 2007, at 11:20 AM, Vincent Massol wrote:
>
>> Hi Juan,
>>
>> Thanks for your email and sorry for my late answer, I've just seen  
>> the mails now.
>>
>> I've started using the confluence parser as a starting point for  
>> writing the XWiki parser. Re the speed, the confluence parser also  
>> generates a Block Tree but I'm not sure how this affects  
>> performance negatively.
>
> I can answer that... It'll matter for large documents since users  
> would not start to see anything output before the end of the  
> parsing. Modifying the parser to call traverse() whenever a block is  
> created would be very easy to do though. I think I might add a flag  
> for the xwiki parser to decide what to do.
>

Yeah, modifying the parser to be more efficient would not be a problem.
I was never dealing with anything more than a couple of k.

> However there are some cases where the full parsing is required. For  
> example the XWiki TOC macro requires the full parsing to be done  
> since it needs to know all the section headers. Of course a second  
> level parsing could also be done, looking only for headers but that  
> would affect the performance a bit. So for all macros that require  
> the document structure for rendering we need the full parsing to be  
> done first. However it's hard to know quickly if the document  
> contains macros macros that work on the document structure and thus  
> we might have to parse the whole doc anyway...
>
> Thanks
> -Vincent
>
>> FWIW I've run some quick tests between the JavaCC-generated parser  
>> for XWiki that is in the wikimodel parser vs the "hand-written"  
>> Confluence parser in Doxia (since confluence and xwiki are of  
>> similar complexity for their syntaxes) and the result I got so far  
>> is that the "hand-written" parser is faster so I've gone ahead and  
>> used the "hand-written" confluence parser as a starting point.
>>
>> Thanks again
>> -Vincent
>>
>> On Dec 19, 2007, at 5:01 PM, Juan F. Codagnone wrote:
>>
>>> Hi Vicent,
>>>
>>> On Wednesday 19 December 2007, Vincent Massol wrote:
>>> ...
>>>> I'd like to implement a Doxia parser for XWiki. However I've  
>>>> noticed
>>>> there's no standard in Doxia yet for parsing. Actually looking at
>>>> Doxia confluence, twiki and Apt I see each does it with his own  
>>>> code.
>>>> However the Confluence and TWiki implementations are very similar,
>>>> each defining Block, BlockParser, etc.
>>> ...
>>>> content). Does anyone have any idea how the Confluence parser  
>>>> compares
>>>> for example with, say, a JavaCC-generated parser?
>>>
>>> The confluence parser was made after the twiki parser by Jason.
>>>
>>> When i first wrote the twiki parser i felt that it was easier to  
>>> make an adhoc
>>> parser instead of a generated one for a language that has many  
>>> exceptions.
>>> (Also i was also reading a TDD book at that time, and i wanted to  
>>> make some
>>> practice, and the adhoc parser was perfect)
>>>
>>> Here is the original post
>>> http://mail-archives.apache.org/mod_mbox/maven-doxia-dev/200511.mbox/%3c200511161959.22110.juam@users.sourceforge.net%3e
>>>
>>> Two years later i think it was a good decision. One developer that  
>>> never saw
>>> the original code was conforable adding new language feature and  
>>> bugfixes.
>>>
>>> In terms of of fast rendering mechanism, the twiki parser has a  
>>> draback: it
>>> first builds a block tree (like a DOM tree), and then the block  
>>> generates the
>>> events for the Sink.
>>>
>>> Juan.
>>>
>>> -- 
>>> Buenos Aires, Argentina                            22°C with winds  
>>> at 9 km/h E
>>
>

Thanks,

Jason

----------------------------------------------------------
Jason van Zyl
Founder,  Apache Maven
jason at sonatype dot com
----------------------------------------------------------

believe nothing, no matter where you read it,
or who has said it,
not even if i have said it,
unless it agrees with your own reason
and your own common sense.

-- Buddha

Re: Doxia Parsing API?

Posted by Vincent Massol <vi...@massol.net>.

On Dec 27, 2007, at 11:20 AM, Vincent Massol wrote:

> Hi Juan,
>
> Thanks for your email and sorry for my late answer, I've just seen  
> the mails now.
>
> I've started using the confluence parser as a starting point for  
> writing the XWiki parser. Re the speed, the confluence parser also  
> generates a Block Tree but I'm not sure how this affects performance  
> negatively.

I can answer that... It'll matter for large documents, since users
would not see any output until parsing has finished. Modifying the
parser to call traverse() whenever a block is created would be very
easy to do, though. I think I might add a flag to the XWiki parser to
choose between the two behaviours.
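
To make the two behaviours concrete, here is a minimal Java sketch of such a flag. Everything in it is hypothetical and for illustration only: the Block class, the one-block-per-line "syntax", and the use of a Consumer in place of a Sink are assumptions, not the actual Doxia or XWiki code. It only shows the ordering difference between traversing each block as it is created and traversing after the whole input has been parsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class StreamingFlagSketch {
    /** Hypothetical block type; traverse() stands in for emitting Sink events. */
    static final class Block {
        final String text;

        Block(String text) {
            this.text = text;
        }

        void traverse(Consumer<String> sink) {
            sink.accept("block:" + text);
        }
    }

    /**
     * Creates one block per non-empty line. When streaming is true, each block
     * is traversed as soon as it is created; otherwise all blocks are collected
     * first and traversed only after parsing has finished.
     */
    static List<String> parse(String source, boolean streaming) {
        List<String> log = new ArrayList<>();
        List<Block> pending = new ArrayList<>();
        for (String line : source.split("\n")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            Block block = new Block(line.trim());
            log.add("parsed:" + block.text);
            if (streaming) {
                block.traverse(log::add);   // output starts before parsing ends
            } else {
                pending.add(block);         // output deferred until the end
            }
        }
        for (Block block : pending) {
            block.traverse(log::add);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(parse("a\nb", true));
        System.out.println(parse("a\nb", false));
    }
}
```

With streaming enabled the log interleaves "parsed" and "block" entries; with it disabled all "parsed" entries come first, which is the delay a user would notice on a large document.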

However, there are some cases where full parsing is required. For
example, the XWiki TOC macro requires the full parse to be done, since
it needs to know all the section headers. Of course, a second, lighter
parsing pass could also be done, looking only for headers, but that
would affect performance a bit. So for all macros that require the
document structure for rendering, we need the full parse to be done
first. However, it's hard to know quickly whether the document contains
macros that work on the document structure, and thus we might have to
parse the whole doc anyway...
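
As an illustration of the "looking only for headers" idea, a header-only pass could be sketched as below. This is an assumption-laden example, not XWiki or Doxia code: it uses Confluence-style "h1. Title" headings because the Confluence parser is the starting point, and the class and method names are made up. A plain line scan like this also ignores context (for example, a heading-like line inside a verbatim block), so it is a sketch of the cost trade-off rather than a correct TOC builder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderScanSketch {
    // Confluence-style heading line: "h1. Title" up to "h6. Title".
    private static final Pattern HEADING =
            Pattern.compile("^h([1-6])\\.[ \\t]+(.+)$", Pattern.MULTILINE);

    /** Returns one "level:title" entry per heading line, in document order. */
    static List<String> collectHeadings(String source) {
        List<String> headings = new ArrayList<>();
        Matcher matcher = HEADING.matcher(source);
        while (matcher.find()) {
            headings.add(matcher.group(1) + ":" + matcher.group(2).trim());
        }
        return headings;
    }

    public static void main(String[] args) {
        String page = "h1. Intro\nsome text\nh2. Details\nmore text\n";
        System.out.println(collectHeadings(page));
    }
}
```

The extra pass reads the whole source once more, which is the "affect performance a bit" cost mentioned above, but it avoids building the full block tree just to render a table of contents.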

Thanks
-Vincent

> FWIW I've run some quick tests between the JavaCC-generated parser  
> for XWiki that is in the wikimodel parser vs the "hand-written"  
> Confluence parser in Doxia (since confluence and xwiki are of  
> similar complexity for their syntaxes) and the result I got so far  
> is that the "hand-written" parser is faster so I've gone ahead and  
> used the "hand-written" confluence parser as a starting point.
>
> Thanks again
> -Vincent
>
> On Dec 19, 2007, at 5:01 PM, Juan F. Codagnone wrote:
>
>> Hi Vicent,
>>
>> On Wednesday 19 December 2007, Vincent Massol wrote:
>> ...
>>> I'd like to implement a Doxia parser for XWiki. However I've noticed
>>> there's no standard in Doxia yet for parsing. Actually looking at
>>> Doxia confluence, twiki and Apt I see each does it with his own  
>>> code.
>>> However the Confluence and TWiki implementations are very similar,
>>> each defining Block, BlockParser, etc.
>> ...
>>> content). Does anyone have any idea how the Confluence parser  
>>> compares
>>> for example with, say, a JavaCC-generated parser?
>>
>> The confluence parser was made after the twiki parser by Jason.
>>
>> When i first wrote the twiki parser i felt that it was easier to  
>> make an adhoc
>> parser instead of a generated one for a language that has many  
>> exceptions.
>> (Also i was also reading a TDD book at that time, and i wanted to  
>> make some
>> practice, and the adhoc parser was perfect)
>>
>> Here is the original post
>> http://mail-archives.apache.org/mod_mbox/maven-doxia-dev/200511.mbox/%3c200511161959.22110.juam@users.sourceforge.net%3e
>>
>> Two years later i think it was a good decision. One developer that  
>> never saw
>> the original code was conforable adding new language feature and  
>> bugfixes.
>>
>> In terms of of fast rendering mechanism, the twiki parser has a  
>> draback: it
>> first builds a block tree (like a DOM tree), and then the block  
>> generates the
>> events for the Sink.
>>
>> Juan.
>>
>> -- 
>> Buenos Aires, Argentina                            22°C with winds  
>> at 9 km/h E
>

Re: Doxia Parsing API?

Posted by Vincent Massol <vi...@massol.net>.

On Dec 27, 2007, at 11:20 AM, Vincent Massol wrote:

> Hi Juan,
>
> Thanks for your email and sorry for my late answer, I've just seen  
> the mails now.
>
> I've started using the confluence parser as a starting point for  
> writing the XWiki parser. Re the speed, the confluence parser also  
> generates a Block Tree but I'm not sure how this affects performance  
> negatively.
>
> FWIW I've run some quick tests between the JavaCC-generated parser  
> for XWiki that is in the wikimodel parser vs the "hand-written"  
> Confluence parser in Doxia (since confluence and xwiki are of  
> similar complexity for their syntaxes) and the result I got so far  
> is that the "hand-written" parser is faster so I've gone ahead and  
> used the "hand-written" confluence parser as a starting point.

Just to qualify this: the main reason for the 10-fold speed
difference between WikiModel and Doxia is probably that WikiModel
generates far more events. It generates events for words, spaces,
special characters, etc.
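
To give a feel for the difference in event volume, the following Java sketch counts events for the same text under two granularities. Both counting rules are simplified assumptions made up for this example; they do not reproduce the real WikiModel or Doxia event models, and the sketch measures event counts only, not parsing time.

```java
final class EventGranularitySketch {
    /** Coarse granularity: one text event per non-empty line. */
    static int coarseEvents(String text) {
        int count = 0;
        for (String line : text.split("\n")) {
            if (!line.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** Fine granularity: one event per word, per run of spaces, per other character. */
    static int fineEvents(String text) {
        int count = 0;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Consume a whole word as a single event.
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
            } else if (c == ' ') {
                // Consume a run of spaces as a single event.
                while (i < text.length() && text.charAt(i) == ' ') {
                    i++;
                }
            } else {
                // Special character or newline: one event each.
                i++;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "Hello, wiki world!";
        System.out.println(coarseEvents(line)); // prints 1
        System.out.println(fineEvents(line));   // prints 7
    }
}
```

Even on one short line the fine-grained rule produces several times as many events, and each event is typically a method call on the receiving side.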

-Vincent

> On Dec 19, 2007, at 5:01 PM, Juan F. Codagnone wrote:
>
>> Hi Vicent,
>>
>> On Wednesday 19 December 2007, Vincent Massol wrote:
>> ...
>>> I'd like to implement a Doxia parser for XWiki. However I've noticed
>>> there's no standard in Doxia yet for parsing. Actually looking at
>>> Doxia confluence, twiki and Apt I see each does it with his own  
>>> code.
>>> However the Confluence and TWiki implementations are very similar,
>>> each defining Block, BlockParser, etc.
>> ...
>>> content). Does anyone have any idea how the Confluence parser  
>>> compares
>>> for example with, say, a JavaCC-generated parser?
>>
>> The confluence parser was made after the twiki parser by Jason.
>>
>> When i first wrote the twiki parser i felt that it was easier to  
>> make an adhoc
>> parser instead of a generated one for a language that has many  
>> exceptions.
>> (Also i was also reading a TDD book at that time, and i wanted to  
>> make some
>> practice, and the adhoc parser was perfect)
>>
>> Here is the original post
>> http://mail-archives.apache.org/mod_mbox/maven-doxia-dev/200511.mbox/%3c200511161959.22110.juam@users.sourceforge.net%3e
>>
>> Two years later i think it was a good decision. One developer that  
>> never saw
>> the original code was conforable adding new language feature and  
>> bugfixes.
>>
>> In terms of of fast rendering mechanism, the twiki parser has a  
>> draback: it
>> first builds a block tree (like a DOM tree), and then the block  
>> generates the
>> events for the Sink.
>>
>> Juan.
>>
>> -- 
>> Buenos Aires, Argentina                            22°C with winds  
>> at 9 km/h E
>

Re: Doxia Parsing API?

Posted by Vincent Massol <vi...@massol.net>.

Hi Juan,

Thanks for your email and sorry for my late answer, I've just seen the  
mails now.

I've started using the confluence parser as a starting point for  
writing the XWiki parser. Re the speed, the confluence parser also  
generates a Block Tree but I'm not sure how this affects performance  
negatively.

FWIW, I've run some quick tests comparing the JavaCC-generated parser for
XWiki that is in the WikiModel parser with the "hand-written" Confluence
parser in Doxia (since the Confluence and XWiki syntaxes are of similar
complexity). The result I got so far is that the "hand-written" parser
is faster, so I've gone ahead and used it as a starting point.

Thanks again
-Vincent

On Dec 19, 2007, at 5:01 PM, Juan F. Codagnone wrote:

> Hi Vicent,
>
> On Wednesday 19 December 2007, Vincent Massol wrote:
> ...
>> I'd like to implement a Doxia parser for XWiki. However I've noticed
>> there's no standard in Doxia yet for parsing. Actually looking at
>> Doxia confluence, twiki and Apt I see each does it with his own code.
>> However the Confluence and TWiki implementations are very similar,
>> each defining Block, BlockParser, etc.
> ...
>> content). Does anyone have any idea how the Confluence parser  
>> compares
>> for example with, say, a JavaCC-generated parser?
>
> The confluence parser was made after the twiki parser by Jason.
>
> When i first wrote the twiki parser i felt that it was easier to  
> make an adhoc
> parser instead of a generated one for a language that has many  
> exceptions.
> (Also i was also reading a TDD book at that time, and i wanted to  
> make some
> practice, and the adhoc parser was perfect)
>
> Here is the original post
> http://mail-archives.apache.org/mod_mbox/maven-doxia-dev/200511.mbox/%3c200511161959.22110.juam@users.sourceforge.net%3e
>
> Two years later i think it was a good decision. One developer that  
> never saw
> the original code was conforable adding new language feature and  
> bugfixes.
>
> In terms of of fast rendering mechanism, the twiki parser has a  
> draback: it
> first builds a block tree (like a DOM tree), and then the block  
> generates the
> events for the Sink.
>
> Juan.
>
> -- 
> Buenos Aires, Argentina                            22°C with winds  
> at 9 km/h E

Re: Doxia Parsing API?

Posted by "Juan F. Codagnone" <ju...@zauber.com.ar>.

Hi Vincent,

On Wednesday 19 December 2007, Vincent Massol wrote:
...
> I'd like to implement a Doxia parser for XWiki. However I've noticed
> there's no standard in Doxia yet for parsing. Actually looking at
> Doxia confluence, twiki and Apt I see each does it with his own code.
> However the Confluence and TWiki implementations are very similar,
> each defining Block, BlockParser, etc.
...
> content). Does anyone have any idea how the Confluence parser compares
> for example with, say, a JavaCC-generated parser?

The Confluence parser was written by Jason, after the TWiki parser.

When I first wrote the TWiki parser I felt that it was easier to write an
ad hoc parser than a generated one for a language that has many exceptions.
(I was also reading a TDD book at the time and wanted some practice, and
the ad hoc parser was perfect for that.)

Here is the original post
http://mail-archives.apache.org/mod_mbox/maven-doxia-dev/200511.mbox/%3c200511161959.22110.juam@users.sourceforge.net%3e

Two years later I think it was a good decision. One developer who had never
seen the original code was comfortable adding new language features and bugfixes.

In terms of a fast rendering mechanism, the TWiki parser has a drawback: it
first builds a block tree (like a DOM tree), and only then do the blocks
generate the events for the Sink.
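
The two-phase shape described above can be sketched in Java roughly as follows. This is a simplified illustration under stated assumptions: EventSink is a made-up stand-in and not the real Doxia Sink interface, the block classes are hypothetical, and the "syntax" is just paragraphs separated by blank lines.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockTreeSketch {
    /** Simplified stand-in for an event sink; NOT the real Doxia Sink interface. */
    interface EventSink {
        void startParagraph();
        void text(String text);
        void endParagraph();
    }

    /** A node of the intermediate tree that can replay itself as sink events. */
    interface Block {
        void traverse(EventSink sink);
    }

    static final class TextBlock implements Block {
        private final String text;

        TextBlock(String text) {
            this.text = text;
        }

        @Override
        public void traverse(EventSink sink) {
            sink.text(text);
        }
    }

    static final class ParagraphBlock implements Block {
        private final List<Block> children;

        ParagraphBlock(List<Block> children) {
            this.children = children;
        }

        @Override
        public void traverse(EventSink sink) {
            sink.startParagraph();
            for (Block child : children) {
                child.traverse(sink);
            }
            sink.endParagraph();
        }
    }

    /** Phase 1: build the whole tree; paragraphs are separated by blank lines. */
    static List<Block> parse(String source) {
        List<Block> blocks = new ArrayList<>();
        for (String chunk : source.split("\\n\\s*\\n")) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                List<Block> children = new ArrayList<>();
                children.add(new TextBlock(trimmed));
                blocks.add(new ParagraphBlock(children));
            }
        }
        return blocks;
    }

    /** Phase 2: events reach the sink only after the tree is complete. */
    static String render(String source) {
        StringBuilder out = new StringBuilder();
        EventSink sink = new EventSink() {
            @Override public void startParagraph() { out.append("<p>"); }
            @Override public void text(String text) { out.append(text); }
            @Override public void endParagraph() { out.append("</p>"); }
        };
        for (Block block : parse(source)) {
            block.traverse(sink);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("first paragraph\n\nsecond paragraph"));
    }
}
```

Because render() calls parse() to completion before the first traverse(), no output is produced until the whole input has been read, which is the drawback in question.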

Juan.

-- 
Buenos Aires, Argentina                            22°C with winds at 9 km/h E

Re: Doxia Parsing API?

Posted by Jason van Zyl <ja...@maven.org>.

On 19 Dec 07, at 2:13 AM, Vincent Massol wrote:

> Hi,
>
> I'd like to implement a Doxia parser for XWiki. However I've noticed  
> there's no standard in Doxia yet for parsing. Actually looking at  
> Doxia confluence, twiki and Apt I see each does it with his own  
> code. However the Confluence and TWiki implementations are very  
> similar, each defining Block, BlockParser, etc.
>
> Is someone working on offering a parsing api in doxia so that:
> 1) duplications can be removed
> 2) people can more easily write parsers for different syntaxes
>
> Since my goal is to use Doxia as the rendering mechanism for XWiki  
> I'd need the parser to be very fast (each request will lead to  
> parsing the content). Does anyone have any idea how the Confluence  
> parser compares for example with, say, a JavaCC-generated parser?
>

If you've written any recursive descent parsers then you can probably
make one that is faster than a generated one. If you have tons of
lookahead or context switching, a generator might be better, especially
if you don't like writing parsers. But the wiki stuff is pretty simple,
and the hand-written parsers are probably pretty performant and can be
tuned. It can be hard to tune generated parsers; you pretty much
get what the generator spits out.
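
For readers who have not written one, a hand-written recursive descent parser for a tiny piece of wiki-like inline markup might look like the Java sketch below. The grammar (`*bold*` and `_italic_`, nestable) and all names are assumptions chosen for illustration; this is not the Doxia Confluence parser, it does no HTML escaping, and it says nothing by itself about speed relative to a generated parser.

```java
final class InlineParserSketch {
    private final String src;
    private int pos;

    private InlineParserSketch(String src) {
        this.src = src;
    }

    /** Converts *bold* and _italic_ spans (which may nest) to HTML tags. */
    static String toHtml(String src) {
        return new InlineParserSketch(src).parseInline(-1);
    }

    /**
     * inline := ( plain-char | '*' inline '*' | '_' inline '_' )*
     * Stops, without consuming it, at the given closing delimiter (-1 for none).
     */
    private String parseInline(int closer) {
        StringBuilder out = new StringBuilder();
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == closer) {
                return out.toString();      // the caller consumes the closer
            }
            if (c == '*' || c == '_') {
                pos++;                      // consume the opening delimiter
                String inner = parseInline(c);
                if (pos < src.length()) {   // stopped at the matching closer
                    pos++;
                    String tag = (c == '*') ? "b" : "i";
                    out.append('<').append(tag).append('>')
                       .append(inner)
                       .append("</").append(tag).append('>');
                } else {                    // unterminated: keep the delimiter literal
                    out.append(c).append(inner);
                }
            } else {
                out.append(c);
                pos++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("a *bold _both_* z"));
    }
}
```

Each grammar rule maps to ordinary code that can be profiled and adjusted directly, which is the sense in which a hand-written parser can be tuned.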

> Thanks
> -Vincent
>

Thanks,

Jason

----------------------------------------------------------
Jason van Zyl
Founder,  Apache Maven
jason at sonatype dot com
----------------------------------------------------------

In short, man creates for himself a new religion of a rational
and technical order to justify his work and to be justified in it.

-- Jacques Ellul, The Technological Society