Posted to dev@jspwiki.apache.org by Andrew Jaquith <an...@gmail.com> on 2009/10/13 01:26:23 UTC

Parsing fun

This is fairly amusing...

It turns out that one of the failing JSPWikiMarkupParser unit tests,
testHyperlinks3, fails because MarkupParser.cleanLink() isn't
cleaning.

The link markup "hyperlink too" should result in "HyperlinkToo", no?  
It results instead in "Hyperlink too".

What's even more amusing is that MarkupParserTest doesn't even test  
for this type of (common?) whitespace condition. Clearly we need to  
test for this -- and I'll fix the bug too while I'm at it. :)

Andrew 

Re: Parsing fun

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
> As I've mentioned before, I for one would dearly love to eliminate
> the possibility of whitespace in page names, as that has clearly
> been the biggest confusion/problem for our users.

If it's done in a PageNameResolver, you should be able to just take it  
out of the resolver loop and be done with it.
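
For what it's worth, the resolver loop is really just an ordered list
that WikiEngine walks until something matches.  A minimal sketch of
the idea (the ResolverChain class and the exact method shapes here are
illustrative, not the actual 3.0 API):

    import java.util.ArrayList;
    import java.util.List;

    // Each resolver gets a chance to map a requested name onto an
    // existing page.  Dropping a behavior (e.g. whitespace munging)
    // then just means not registering that resolver.
    interface PageNameResolver
    {
        /** Returns the resolved name, or null if this resolver finds nothing. */
        String resolve( String pageName );
    }

    class ResolverChain
    {
        private final List<PageNameResolver> m_resolvers = new ArrayList<PageNameResolver>();

        public void addResolver( PageNameResolver r ) { m_resolvers.add( r ); }

        /** Tries each registered resolver in order; falls back to the name as given. */
        public String getFinalPageName( String page )
        {
            for( PageNameResolver r : m_resolvers )
            {
                String resolved = r.resolve( page );
                if( resolved != null ) return resolved;
            }
            return page;
        }
    }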

/Janne

Re: Parsing fun

Posted by Murray Altheim <mu...@altheim.com>.
Andrew Jaquith <an...@gmail.com> writes:
> Ok -- I think I am actually CONFUSED. The unit tests for MarkupParser
> show a test where the output of a cleaned-link is expected to have
> spaces.
>
> So, which is it? Do we allow spaces in wiki page names? If so, that
> would suggest the failing JSPWikiMarkupParser test is wrong.
>
> Also, if spaces are allowed, how do you resolve page names? Are pages
> "Test Page" and "TestPage" the same page?

Hi Andrew,

Since it's clear that different installations may have differing
requirements on page names (e.g., I want ours to be valid XML Names),
might it be possible to create a page name filter mechanism that acts
to enforce a specific contract on names?

As I've mentioned before, I for one would dearly love to eliminate
the possibility of whitespace in page names, as that has clearly
been the biggest confusion/problem for our users. Beyond that, for
projects that use the Topic Map functionality I developed, I'd need
to additionally constrain the page names to XML Names. To do this
it would be nice to not have to do a great deal of coding, or at
least be able to do that coding in one place (i.e., some kind of
page name filter class). I don't know if this would be considered
over-engineering or not, but it would be tremendously valuable for
our installations.
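
To make the shape concrete, something like this is all I have in mind
(the names here are hypothetical -- none of this exists in JSPWiki
today):

    // Hypothetical page name filter contract.
    interface PageNameFilter
    {
        /** Returns true if the proposed name satisfies this wiki's contract. */
        boolean isValid( String pageName );

        /** Rewrites a proposed name into one that satisfies the contract. */
        String normalize( String pageName );
    }

    // An installation like ours could then plug in, say:
    class XmlNameFilter implements PageNameFilter
    {
        public boolean isValid( String pageName )
        {
            // Simplified check: XML Names contain no spaces and must
            // not start with a digit.  (The real XML Name production
            // is stricter than this.)
            return pageName.length() > 0
                && !Character.isDigit( pageName.charAt( 0 ) )
                && pageName.indexOf( ' ' ) < 0;
        }

        public String normalize( String pageName )
        {
            String result = pageName.replaceAll( "\\s+", "" );
            if( result.length() > 0 && Character.isDigit( result.charAt( 0 ) ) )
            {
                result = "_" + result;
            }
            return result;
        }
    }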

Cheers,

Murray

...........................................................................
Murray Altheim <murray09 at altheim dot com>                       ===  = =
http://www.altheim.com/murray/                                     = =  ===
SGML Grease Monkey, Banjo Player, Wantanabe Zen Monk               = =  = =

        In the evening
        The rice leaves in the garden
        Rustle in the autumn wind
        That blows through my reed hut.  -- Minamoto no Tsunenobu

Re: Parsing fun

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
On Oct 14, 2009, at 02:57 , Andrew Jaquith wrote:

> So, I want to understand this, and I think I'm close.
>
> If page "HyperlinkToo" exists, JSPWikiMarkupParser will render
> "hyperlink   too" as:
>
> "HyperlinkToo"
>
> ...but if "HyperlinkToo" does not exist, it will render as:
>
> "Hyperlink too"
>
> Is that right? It is, in essence, producing different canonical page
> names depending on whether the page exists. That is strange.

Close, but not quite.

[hyperlink too] gets rendered as "Hyperlink too", iff page "Hyperlink
too" exists (in a case-insensitive manner).  If it does not exist, we
check for the existence of "HyperlinkToo", and if that exists, we use
it.  To grab the relevant code from 2.8:

        // Straight-away check for "Hyperlink too"
        boolean isThere = simplePageExists( page );
        String  finalName = page;

        // "hyperlink too" && "hyperlink toos"
        if( !isThere && m_matchEnglishPlurals )
        {
            if( page.endsWith( "s" ) )
            {
                finalName = page.substring( 0, page.length() - 1 );
            }
            else
            {
                finalName += "s";
            }

            isThere = simplePageExists( finalName );
        }

        // This is where we check for the alternate name.
        // "HyperlinkToo" & "HyperlinkToos"
        if( !isThere )
        {
            finalName = MarkupParser.wikifyLink( page );
            isThere = simplePageExists( finalName );

            if( !isThere && m_matchEnglishPlurals )
            {
                if( finalName.endsWith( "s" ) )
                {
                    finalName = finalName.substring( 0, finalName.length() - 1 );
                }
                else
                {
                    finalName += "s";
                }

                isThere = simplePageExists( finalName );
            }
        }

        return isThere ? finalName : null;

> It also makes creating (say) a singleton caching WikiPath factory
> difficult because you'd need to know whether the page exists (and
> hence hold a reference to the ContentManager) before you could figure
> out what to canonicalize the path to.

I would hesitate to do that in the first place.  I learned this the
hard way doing Priha - there are some fairly interesting and hairy
issues about when to empty the cache, resulting in really
hard-to-debug problems - and the performance gain is probably not
that big anyway.  Let's not do it now.

> I expect this is all stuff you considered before -- but want to make
> sure I understand before I correct any more unit tests.


The reason why this is done is quite simple: we don't want to break
old pages or old URLs.  This allows you to simply grab your old
wikimarkup without having to change anything.  However, new pages will
be created with spaces in their names if you use the [ ]-linking style
and include a space.  But of course [HyperlinkToo] and HyperlinkToo
still link to a page called "HyperlinkToo", and would not match
"Hyperlink too".  But [Hyperlink too] would first check for "Hyperlink
too" and after that check for "HyperlinkToo".

The 2.6-2.8 way is very literal and quite easy to remember.  It just  
works :-).  I haven't had complaints (other than from Murray, but his  
requirement is for XML names, so it's not something most people would  
really ask for).

/Janne

Re: Parsing fun

Posted by Andrew Jaquith <an...@gmail.com>.
All good points, and after thinking about it for 30 seconds I agreed.
I've added a fairly tactical set of patches to JSPWikiMarkupParser and
MarkupParser.cleanLink() (which had a bug in it!). They fix the broken
unit tests and sidestep the larger refactoring issue. Let's get 3.0
stable before we revisit.

Andrew

Re: Parsing fun

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
On Oct 14, 2009, at 03:11 , Andrew Jaquith wrote:

> Oh, on a side note: getFinalPageName() moved to WikiEngine. That's
> where it calls the various PageNameResolvers.

Yeah, but it lost the old-style links on the way. Probably my bad.

>
> Would it be better to move the wikifyLink and cleanLink code into the
> resolver code, perhaps make these additional resolvers? I'd have to do
> a bunch of changes to caller code: about 25 references, not too bad.

I would rather put them in a utility class, like TextUtil
(PageNameUtil?).

> Also, we could add a "CacheResolver" that simply keeps a cache of all
> of the strings that have been requested, and the WikiPaths they
> resolve to. Just a thought.

Unless this can be shown to be an actual performance bottleneck, I
wouldn't.  For example, that CacheResolver would need to keep track of
all page renames and changes, and there are some edge cases creating
some hard-to-find bugs...  CachingProvider, for instance, looks like a
fairly straightforward thing, and it's been one of the biggest sources
of problems for people in 2.x :-)

So let's optimize that once we know with confidence that 3.0 works.  
Then we can break it again.

/Janne

Re: Parsing fun

Posted by Andrew Jaquith <an...@gmail.com>.
Oh, on a side note: getFinalPageName() moved to WikiEngine. That's
where it calls the various PageNameResolvers.

Would it be better to move the wikifyLink and cleanLink code into the
resolver code, perhaps make these additional resolvers? I'd have to do
a bunch of changes to caller code: about 25 references, not too bad.
Also, we could add a "CacheResolver" that simply keeps a cache of all
of the strings that have been requested, and the WikiPaths they
resolve to. Just a thought.

Andrew

Re: Parsing fun

Posted by Andrew Jaquith <an...@gmail.com>.
So, I want to understand this, and I think I'm close.

If page "HyperlinkToo" exists, JSPWikiMarkupParser will render
"hyperlink   too" as:

"HyperlinkToo"

...but if "HyperlinkToo" does not exist, it will render as:

"Hyperlink too"

Is that right? It is, in essence, producing different canonical page
names depending on whether the page exists. That is strange.

It also makes creating (say) a singleton caching WikiPath factory
difficult because you'd need to know whether the page exists (and
hence hold a reference to the ContentManager) before you could figure
out what to canonicalize the path to.

I expect this is all stuff you considered before -- but want to make
sure I understand before I correct any more unit tests.

Andrew


Re: Parsing fun

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
>> A few more clarifying questions:
>>
>> - Are page names normalized when stored? I.e., when persisted we
>> always use the MashedTogetherName?
>
> No, we use the result of MarkupParser.cleanLink() [which turns [foo  
> bar] into [Foo bar].  It essentially just capitalizes it and removes  
> illegal characters.]

To clarify: upon store, page names are normalized with cleanLink()  
(which retains spaces).

Upon render, we first check for the new-style-normalized page name  
(cleanLink()), then the old-style normalized page name (wikifyLink()).
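
A quick illustration of what those two normalizations do to the same
input -- expected values per the description above, not output from an
actual run:

    String typed  = "foo  bar";                          // what the user wrote inside [ ]
    String stored = MarkupParser.cleanLink( typed );     // "Foo bar"  -- spaces kept, store form
    String mashed = MarkupParser.wikifyLink( stored );   // "FooBar"   -- old-style fallback form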

/Janne

Re: Parsing fun

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
On Oct 13, 2009, at 14:18 , Andrew Jaquith wrote:

> A few more clarifying questions:
>
> - Are page names normalized when stored? I.e., when persisted we
> always use the MashedTogetherName?

No, we use the result of MarkupParser.cleanLink() [which turns [foo  
bar] into [Foo bar].  It essentially just capitalizes it and removes  
illegal characters.]

> - When WikiPaths are resolved, do we resolve such that the WikiPath
> returned always uses the MashedTogetherName?

Good question.  2.8 resolves them in  
CommandResolver.getFinalPageName() but it appears that this code has  
been accidentally left out in 3.0...

The correct response would be to write an
org.apache.wiki.content.resolver.OldStyleCamelCasePageNameResolver, I
think.  That should make it work fairly transparently.
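
Something along these lines, I mean -- a sketch only, since I'm
guessing at the resolver API's shape; only MarkupParser.wikifyLink()
is real here:

    // Hypothetical seam for asking the repository about a page.
    interface PageExistenceCheck
    {
        boolean pageExists( String name );
    }

    // If the name as stored doesn't exist, fall back to the old-style
    // mashed-together (CamelCase) form.
    class OldStyleCamelCasePageNameResolver
    {
        private final PageExistenceCheck m_check;

        public OldStyleCamelCasePageNameResolver( PageExistenceCheck check )
        {
            m_check = check;
        }

        /** Returns the name to use, or null if neither form exists. */
        public String resolve( String pageName )
        {
            if( m_check.pageExists( pageName ) ) return pageName;

            String mashed = MarkupParser.wikifyLink( pageName );
            return m_check.pageExists( mashed ) ? mashed : null;
        }
    }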

/Janne

> (1) JSPWikiMarkupParser.testHyperlinks3():
>
>        newPage("HyperlinkToo");
>        String src = "This should be a [hyperlink too]";
>        assertEquals( "This should be a <a class=\"wikipage\"
> href=\"/Wiki.jsp?page=HyperlinkToo\">hyperlink too</a>",
>                      translate(src) );

Is correct.  There only exists an old-style page name
("HyperlinkToo").  This is the pre-2.6 model.

> (2) JSPWikiMarkupParser.testSpacesInLinks1()
>
>        newPage("Foo bar");
>        String src = "[Foo bar]";
>        assertEquals( "<a class=\"wikipage\"
> href=\"/Wiki.jsp?page=Foo%20bar\">Foo bar</a>", translate(src) );

Correct too.  There is a page called "Foo bar", so linking to it with  
[Foo bar] should obviously work.

>
> (3) MarkupParserTest.testCleanLinkWithSpaces()
>
>        assertEquals( "HyperlinkToo",
> MarkupParser.cleanLink("hyperlink  too") );

Mmm... Incorrect.  It should actually return "Hyperlink too".
MarkupParser.wikifyLink() should then return "HyperlinkToo".
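
In other words, the corrected test would read something like this
(JUnit 3 style, to match the existing suite):

    public void testCleanLinkWithSpaces()
    {
        // cleanLink() keeps the spaces (collapsing runs of whitespace);
        // wikifyLink() then produces the old-style mashed-together name.
        String cleaned = MarkupParser.cleanLink( "hyperlink  too" );
        assertEquals( "Hyperlink too", cleaned );
        assertEquals( "HyperlinkToo", MarkupParser.wikifyLink( cleaned ) );
    }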

/Janne

Re: Parsing fun

Posted by Andrew Jaquith <an...@gmail.com>.
A few more clarifying questions:

- Are page names normalized when stored? I.e., when persisted we
always use the MashedTogetherName?
- When WikiPaths are resolved, do we resolve such that the WikiPath
returned always uses the MashedTogetherName?

Lastly, how do you reconcile these three tests:

(1) JSPWikiMarkupParser.testHyperlinks3():

        newPage("HyperlinkToo");
        String src = "This should be a [hyperlink too]";
        assertEquals( "This should be a <a class=\"wikipage\"
href=\"/Wiki.jsp?page=HyperlinkToo\">hyperlink too</a>",
                      translate(src) );

(2) JSPWikiMarkupParser.testSpacesInLinks1()

        newPage("Foo bar");
        String src = "[Foo bar]";
        assertEquals( "<a class=\"wikipage\"
href=\"/Wiki.jsp?page=Foo%20bar\">Foo bar</a>", translate(src) );

(3) MarkupParserTest.testCleanLinkWithSpaces()

        assertEquals( "HyperlinkToo",
MarkupParser.cleanLink("hyperlink  too") );


Andrew

Re: Parsing fun

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
Yes, we've allowed spaces in wiki page names since 2.6.

However, for backwards compatibility, we also check for the
mashedtogethername if the wiki page with spaces does not exist. So

[Test page] => first check for the existence of "Test page" => if that
does not exist, then try to find "TestPage".

There's a fairly hairy problem, which is that we kind of assume that
the comparison can be done in a case-insensitive manner.  So on some
operating systems (like Linux) [test page] might actually not match
"Test page", but it would match "TestPage" (because mashing also
normalizes the page name).  For 3.0 this is not going to be a problem,
since all pages - also the ones with spaces - are normalized (or
should be normalized; if they're not, then there's a bug).

This is probably why you have not noticed it: the hop to page names
with spaces was fairly transparent.

/Janne

Re: Parsing fun

Posted by Andrew Jaquith <an...@gmail.com>.
Ok -- I think I am actually CONFUSED. The unit tests for MarkupParser
show a test where the output of a cleaned-link is expected to have
spaces.

So, which is it? Do we allow spaces in wiki page names? If so, that
would suggest the failing JSPWikiMarkupParser test is wrong.

Also, if spaces are allowed, how do you resolve page names? Are pages
"Test Page" and "TestPage" the same page?

Maybe I need to RTFM... just not sure where the M is...

Andrew
