Posted to dev@cocoon.apache.org by Bertrand Delacretaz <bd...@codeconsult.ch> on 2003/08/06 15:07:09 UTC

FYI: SlopGenerator added (as the unstable "slop" block)

SlopGenerator (Simple Line Oriented Parser) parses text files using 
very simple rules, where lines starting with a name and a colon are 
converted to XML elements.

It is usable for parsing RFC822 messages, with some limitations 
mentioned on the samples page, which shouldn't be hard to remove if 
someone has an itch to scratch.

The simplistic "parsing" algorithm should make the SlopGenerator quite 
fast.
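
To illustrate the idea only, here is a minimal Python sketch of line-oriented parsing in this spirit. It is not the SlopGenerator code (which is Java, inside Cocoon); the `lines_to_xml` function, the `parsed-text` root element, the `line` fallback element and the exact field-name rule are all assumptions made for this example.

```python
import re
from xml.sax.saxutils import escape

# Assumed rule: a field line is "name: value", where the name starts with a
# letter and contains only letters, digits, "_" or "-" (so it is a valid
# XML element name).
FIELD = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*):\s*(.*)$")


def lines_to_xml(text, root="parsed-text"):
    """Convert each "name: value" line to <name>value</name>.

    Lines that do not match the field rule are wrapped in a generic
    <line> element; empty lines are skipped.
    """
    out = [f"<{root}>"]
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        match = FIELD.match(line)
        if match:
            name, value = match.groups()
            out.append(f"  <{name}>{escape(value)}</{name}>")
        else:
            out.append(f"  <line>{escape(line)}</line>")
    out.append(f"</{root}>")
    return "\n".join(out)


if __name__ == "__main__":
    print(lines_to_xml("From: someone@example.org\nSubject: Hi & bye\n\nfree text\n"))
```

Each input line is examined once and independently of the others, which is why an approach like this can be expected to be fast, and also why it cannot handle values that span several lines.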

-Bertrand


Re: FYI: SlopGenerator added (as the unstable "slop" block)

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Wednesday, 6 Aug 2003, at 16:06 Europe/Zurich, Vadim Gritsenko wrote:
>
> ...Let me add one more question: can we have both in the same block?

There's no common code between SlopGenerator and Chaperon, and the very 
different ways of using them might create some confusion if they share 
the same block.

I didn't want to "pollute" Chaperon with experimental stuff, and having 
a separate slop block makes it easier to see if there's interest in it 
IMHO.

-Bertrand

Re: FYI: SlopGenerator added (as the unstable "slop" block)

Posted by Vadim Gritsenko <va...@verizon.net>.
Bruno Dumon wrote:

>On Wed, 2003-08-06 at 15:07, Bertrand Delacretaz wrote:
>
>>SlopGenerator (Simple Line Oriented Parser)
>
>I see a class named "SimpleSlopParser", which after expansion becomes
>"SimpleSimpleLineOperatorParserParser" :-)
>
>>parses text files using
>>very simple rules, where lines starting with a name and a colon are
>>converted to XML elements.
>>
>>It is usable for parsing RFC822 messages, with some limitations
>>mentioned on the samples page, which shouldn't be hard to remove if
>>someone has an itch to scratch.
>
>I think the logical (and annoying) question now is: couldn't chaperon be
>used?

Let me add one more question: can we have both in the same block?

Vadim



Re: FYI: SlopGenerator added (as the unstable "slop" block)

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Wednesday, 6 Aug 2003, at 16:55 Europe/Zurich, Stephan Michels wrote:
> ...You're right, the Chaperon parser suffers from the fact that it
> expects well-structured input, such as programming languages.
> Nevertheless, it works in some cases of semi-structured input...

I'm glad to hear you confirm my feelings - and you're right, Chaperon's  
handling of wiki text works.

> ...In the past, when I was sometimes asked for a better solution for
> semi-structured text, I recommended 'regular fragmentations'
> (http://www.simonstl.com/projects/fragment/)

Sounds good.
Adding regexp-based line splitting to Slop should be easy, as long as  
you don't need elements that span multiple input lines.

> ...I would be happier to see an implementation of 'regular  
> fragmentations'....

Yes, this would be nice.
Maybe Slop's limitations will bug someone enough so that they start  
implementing it ;-)

> ...BTW, I'm working intensively on the Chaperon codebase to
> change the implementation to Tomita's algorithm, which should avoid
> problems with parser conflicts, inspired by the following blog entry:
> http://blogs.gotdotnet.com/emeijer/permalink.aspx/793d40fe-47b5-4ab7-a73c-53b50851a8ee

Sounds very interesting!

> ...But all this takes time, and I'm not very fast.

same here ;-)

Thanks for your comments,
-Bertrand

Re: FYI: SlopGenerator added (as the unstable "slop" block)

Posted by Stephan Michels <st...@apache.org>.

On Wed, 6 Aug 2003, Bertrand Delacretaz wrote:

> > ...I think the logical (and annoying) question now is: couldn't
> > chaperon be used?
>
> Most probably, but after looking at (and doing some work on) the wiki
> grammar of Chaperon I have a feeling that Chaperon isn't ideal for
> semi-structured line-oriented stuff.
>
> Stephan did a great job with the Chaperon wiki grammar, but if you look
> at it closely there seems to be a fight between the structure that
> Chaperon expects from its input and the more free-form wiki input.

You're right, the Chaperon parser suffers from the fact that it expects
well-structured input, such as programming languages. Nevertheless,
it works in some cases of semi-structured input.

In the past, when I was sometimes asked for a better solution for
semi-structured text, I recommended 'regular fragmentations'
(http://www.simonstl.com/projects/fragment/)

> The
> fact that fairly extensive XSLT postprocessing of Chaperon's output for
> wiki text is required also shows this "impedance mismatch" IMHO.

The postprocessing is in most cases a 1:1 projection if the output
has the same structure as the AST (Abstract Syntax Tree), as is the case
for the wiki.

> I wasn't sure whether to add SlopGenerator in the scratchpad or as an
> unstable block, but as several blocks are clearly experimental I think
> it can't hurt, and our virtual Darwin should take care of it eventually.

I would be happier to see an implementation of 'regular fragmentations'.
This should also work in Slop's case, for example:

^([^:]+) : ([^:]+)$

<line>
 <name>\1</name>
 <value>\2</value>
</line>

I tried this with the PatternTransformer, but did not get very far.
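
As a rough illustration of what the pattern and template above would do to a single line, here is a small Python sketch. It is not an implementation of 'regular fragmentations' and has nothing to do with the PatternTransformer; the `fragment` function is a hypothetical helper, and the `\1` / `\2` back-references of the template are written as named placeholders.

```python
import re
from xml.sax.saxutils import escape

# The pattern from the message above, unchanged. Note that it requires a
# space on both sides of the colon and allows no further colon in the value.
PATTERN = re.compile(r"^([^:]+) : ([^:]+)$")

# The template from the message above, with \1 and \2 as named placeholders.
TEMPLATE = "<line>\n <name>{name}</name>\n <value>{value}</value>\n</line>"


def fragment(line):
    """Return the filled-in template for a matching line, else None."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return TEMPLATE.format(name=escape(match.group(1)),
                           value=escape(match.group(2)))


if __name__ == "__main__":
    print(fragment("Subject : Hello"))
    print(fragment("Subject: Hello"))   # no space before the colon: no match
```

Taken literally, the pattern rejects typical RFC822 headers such as "Subject: Hello" or any value containing a colon (a time of day, for instance), so a real rule set would need a more permissive expression.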

BTW, I'm working intensively on the Chaperon codebase to
change the implementation to Tomita's algorithm, which should avoid
problems with parser conflicts, inspired by the following blog entry:
http://blogs.gotdotnet.com/emeijer/permalink.aspx/793d40fe-47b5-4ab7-a73c-53b50851a8ee

I am also adding error productions, error recovery algorithms, etc. I also
plan to make the grammar format simpler.

But all this takes time, and I'm not very fast.

Stephan


Re: FYI: SlopGenerator added (as the unstable "slop" block)

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Wednesday, 6 Aug 2003, at 15:31 Europe/Zurich, Bruno Dumon wrote:

> I see a class named "SimpleSlopParser", which after expansion becomes
> "SimpleSimpleLineOperatorParserParser" :-)

But SimpleSimple is BeautifulBeautiful, isn't it ;-)
I get your point though: this thing doesn't do much, but I think it 
would be easy to have it handle RFC822 properly and efficiently.

> ...I think the logical (and annoying) question now is: couldn't
> chaperon be used?

Most probably, but after looking at (and doing some work on) the wiki 
grammar of Chaperon I have a feeling that Chaperon isn't ideal for 
semi-structured line-oriented stuff.

Stephan did a great job with the Chaperon wiki grammar, but if you look 
at it closely there seems to be a fight between the structure that 
Chaperon expects from its input and the more free-form wiki input. The 
fact that fairly extensive XSLT postprocessing of Chaperon's output for 
wiki text is required also shows this "impedance mismatch" IMHO.

I wasn't sure whether to add SlopGenerator in the scratchpad or as an 
unstable block, but as several blocks are clearly experimental I think 
it can't hurt, and our virtual Darwin should take care of it eventually.

-Bertrand

Re: FYI: SlopGenerator added (as the unstable "slop" block)

Posted by Bruno Dumon <br...@outerthought.org>.
On Wed, 2003-08-06 at 15:07, Bertrand Delacretaz wrote:
> SlopGenerator (Simple Line Oriented Parser)

I see a class named "SimpleSlopParser", which after expansion becomes
"SimpleSimpleLineOperatorParserParser" :-)

>  parses text files using 
> very simple rules, where lines starting with a name and a colon are 
> converted to XML elements.
> 
> It is usable for parsing RFC822 messages, with some limitations 
> mentioned on the samples page, which shouldn't be hard to remove if 
> someone has an itch to scratch.
> 

I think the logical (and annoying) question now is: couldn't chaperon be
used?

(disclaimer: I don't know anything about either chaperon or slop)

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org                          bruno@apache.org


Re: FYI: SlopGenerator added (as the unstable "slop" block)

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Thursday, 7 Aug 2003, at 16:22 Europe/Zurich, Upayavira wrote:

> Can your SlopGenerator handle multiple mail messages per file? ...

Not as is: you'd need to recognize the "From" line that begins each 
message to separate them, but that should be fairly easy to implement.

If you want to implement this, I'd suggest creating a SlopRfc822Parser 
that implements SlopParser, and instantiating it instead of 
SimpleSlopParser based on a sitemap parameter ("mode=rfc822" or 
something). Actually this would be more of an "mbox format" mode than 
just rfc822.
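
For illustration, the message-splitting step could look roughly like the Python sketch below. This is not the suggested SlopRfc822Parser (which would be a Java class in the slop block); `split_mbox` is a hypothetical helper, and it assumes that every line starting with "From " followed by a space begins a new message, which only holds if such lines inside message bodies have been escaped (commonly as ">From ").

```python
def split_mbox(text):
    """Split mbox-style text into a list of message strings.

    Assumption: any line starting with "From " is a message separator.
    The separator line itself is dropped; text before the first
    separator is ignored.
    """
    messages = []
    current = None
    for line in text.splitlines():
        if line.startswith("From "):
            current = []              # start collecting a new message
            messages.append(current)
            continue
        if current is not None:
            current.append(line)
    return ["\n".join(lines) for lines in messages]


if __name__ == "__main__":
    sample = (
        "From a@example.org Wed Aug  6 15:07:09 2003\n"
        "Subject: one\n"
        "\n"
        "body one\n"
        "From b@example.org Thu Aug  7 16:22:00 2003\n"
        "Subject: two\n"
        "\n"
        "body two\n"
    )
    for number, message in enumerate(split_mbox(sample), start=1):
        print(f"--- message {number} ---")
        print(message)
```

Each resulting message could then be handed to the existing line-oriented parsing, one message at a time.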

-Bertrand

Re: FYI: SlopGenerator added (as the unstable "slop" block)

Posted by Upayavira <uv...@upaya.co.uk>.
Bertrand,

Can your SlopGenerator handle multiple mail messages per file? If so, that would be 
excellent, as it would make it possible to build a poor man's mailing list archive: 
subscribe a mailbox to the list and point the SlopGenerator at its Unix mailbox file. 

This is something I could really do with.

Regards, 

Upayavira

> SlopGenerator (Simple Line Oriented Parser) parses text files using
> very simple rules, where lines starting with a name and a colon are
> converted to XML elements.
> 
> It is usable for parsing RFC822 messages, with some limitations 
> mentioned on the samples page, which shouldn't be hard to remove if
> someone has an itch to scratch.
> 
> The simplistic "parsing" algorithm should make the SlopGenerator quite
> fast.
> 
> -Bertrand
>