You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cocoon.apache.org by Vadim Gritsenko <va...@reverycodes.com> on 2004/03/06 17:53:34 UTC

Re: cvs commit: cocoon-2.1 status.xml

joerg@apache.org wrote:

>       public void characters(char[] ch, int start, int length) {
>   
>           if (ch.length > 0 && start >= 0 && length > 1) {
>  -            String text = new String(ch, start, length);
>               if (elementStack.size() > 0) {
>                   IndexHelperField tos = (IndexHelperField) elementStack.peek();
>  -                tos.appendText(text);
>  +                tos.appendText(ch, start, length);
>               }
>  -            bodyText.append(text);
>  +            bodyText.append(' ');
>  +            bodyText.append(ch, start, length);
>           }
>       }
>

What will happen when "keyword" text is streamed as two characters 
events, "key" and "word"? I think it will become "key word", and 
indexing will break.

IIUC, idea was to add a space in between tags, i.e. so 
<p>some</p><p>text</p> is not indexed as "sometext". If that's correct, 
then better fix would be to add space only if boolean flag 
had_start_or_end_element_in_between_char_events set.


Vadim


Re: LuceneIndexContentHandler

Posted by Joerg Heinicke <jo...@gmx.de>.
On 11.03.2004 20:50, Joerg Heinicke wrote:

> Replace the startElements in your sample with endElements and it won't 
> work. But adding if(flag) like above additionally to endElement should 
> fix this. Did we get it now? :)

I have it done like the following

characters
     flag = true

endElement
    if(flag)
       append(' ');
       flag = false;

startElement
    if(flag)
       append(' ');
       flag = false;

But after having the closer look on the code I really wonder why my 
original proposal doing it in endElement depending on the text test 
should not work. The text only contains child::text() (to say it in 
XPath) and through the stack the text is completely reordered depending 
on the order of the endElement tags in the file:

<element>
     text1
     <element>
         text2
     </element>
     text3
</element>

should end in

"text2 text1 text3", shouldn't it? And this should be doable through the 
text test in endElement.

Joerg

Re: LuceneIndexContentHandler

Posted by Joerg Heinicke <jo...@gmx.de>.
On 12.03.2004 00:18, Vadim Gritsenko wrote:

> Yes, note that I wrote "addition" above - in addition to your suggestion:

Sorry, missed that obviously.

Joerg

Re: LuceneIndexContentHandler

Posted by Vadim Gritsenko <va...@reverycodes.com>.
Joerg Heinicke wrote:

> On 11.03.2004 13:23, Vadim Gritsenko wrote:
>
>> startElement a
>> character 'key'
>> startElement b
>> character 'word'
>>
>> Will become "keyword" instead of "key word". No, this won't work, 
>> again :-)
>> Addition of
>

^^^^

>> startElement:
>>    if (flag)
>>        x.append(' ');
>>        flag = false;
>>
>> Should fix it, shouldn't it?
>
>
> This game is becoming funny :)
>
> Replace the startElements in your sample with endElements and it won't 
> work. But adding if(flag) like above additionally to endElement should 
> fix this. Did we get it now? :)


Yes, note that I wrote "addition" above - in addition to your suggestion:

characters:
    flag = true;

endElement:
    if (flag)
        x.append(' ');
        flag = false;

So yes, I think now it will work :-)

Vadim


Re: LuceneIndexContentHandler

Posted by Joerg Heinicke <jo...@gmx.de>.
On 11.03.2004 13:23, Vadim Gritsenko wrote:

> startElement a
> character 'key'
> startElement b
> character 'word'
> 
> Will become "keyword" instead of "key word". No, this won't work, again :-)
> Addition of
> 
> startElement:
>    if (flag)
>        x.append(' ');
>        flag = false;
> 
> Should fix it, shouldn't it?

This game is becoming funny :)

Replace the startElements in your sample with endElements and it won't 
work. But adding if(flag) like above additionally to endElement should 
fix this. Did we get it now? :)

Joerg

Re: LuceneIndexContentHandler

Posted by Vadim Gritsenko <va...@reverycodes.com>.
Joerg Heinicke wrote:

> On 09.03.2004 13:43, Vadim Gritsenko wrote:
>
>>> Yes, I see your objection - and asked for them already in the bug 
>>> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)
>>>
>>> So what are the practical use cases this might occure? Maybe it's 
>>> only a theoretical problem depending on the "thing" the index is 
>>> created from? On which SAX stream the LuceneIndexHandler operates?
>>
>>
>> I remember there were issues already in other components with text 
>> being splitted up onto multiple character events. So, think of this 
>> as of preventive maintenance.
>
>
> Yes, for example this bug: 
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=26219. Two character 
> events following eachother come out of an XSLT process using 
> <xsl:value> twice following eachother. And the AbstractDOMTransformer 
> or more probable one of the component it uses drops the second and 
> following text events.
>
>>> I also don't get your implications for 
>>> "had_start_or_end_element_in_between_char_events". But I had a look 
>>> on the endElement(). It gets the elements from a stack and already 
>>> tests for text:
>>>     if (text != null && text.length() > 0) {
>>> Would it make sense to add the space in endElement, if the element 
>>> contains text, i.e. the above is true?
>>
>>
>> This was my first though... But then, multiple closing tags will 
>> cause multiple spaces...
>
>
> Ok, if this disturbs.
>
>> So, I thought, this should work:
>>
>> startElement:
>>    flag = true;
>>
>> endElement:
>>    flag = true;
>>
>> characters:
>>    if (flag)
>>        x.append(' ');
>>        flag = false;
>>
>> Does it solves the problem?
>
>
> Unfortunately not:
>
> startElement event
> character event 'key'
> character event 'word'
> character event 'test'
> endElement event
>
> So you would have 'key wordtest'.
>
> What about
>
> characters:
>     flag = true;
>
> endElement:
>     if (flag)
>         x.append(' ');
>         flag = false;
>
> This is similar like the above mentioned endElement text check, but 
> would prevent multiple spaces from output, wouldn't it?


startElement a
character 'key'
startElement b
character 'word'

Will become "keyword" instead of "key word". No, this won't work, again :-)
Addition of

startElement:
    if (flag)
        x.append(' ');
        flag = false;

Should fix it, shouldn't it?

Vadim



LuceneIndexContentHandler (was: cvs commit: cocoon-2.1 status.xml)

Posted by Joerg Heinicke <jo...@gmx.de>.
On 09.03.2004 13:43, Vadim Gritsenko wrote:

>> Yes, I see your objection - and asked for them already in the bug 
>> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)
>>
>> So what are the practical use cases this might occure? Maybe it's only 
>> a theoretical problem depending on the "thing" the index is created 
>> from? On which SAX stream the LuceneIndexHandler operates?
> 
> I remember there were issues already in other components with text being 
> splitted up onto multiple character events. So, think of this as of 
> preventive maintenance.

Yes, for example this bug: 
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=26219. Two character 
events following eachother come out of an XSLT process using <xsl:value> 
twice following eachother. And the AbstractDOMTransformer or more 
probable one of the component it uses drops the second and following 
text events.

>> I also don't get your implications for 
>> "had_start_or_end_element_in_between_char_events". But I had a look on 
>> the endElement(). It gets the elements from a stack and already tests 
>> for text:
>>     if (text != null && text.length() > 0) {
>> Would it make sense to add the space in endElement, if the element 
>> contains text, i.e. the above is true?
> 
> This was my first though... But then, multiple closing tags will cause 
> multiple spaces...

Ok, if this disturbs.

> So, I thought, this should work:
> 
> startElement:
>    flag = true;
> 
> endElement:
>    flag = true;
> 
> characters:
>    if (flag)
>        x.append(' ');
>        flag = false;
> 
> Does it solves the problem?

Unfortunately not:

startElement event
character event 'key'
character event 'word'
character event 'test'
endElement event

So you would have 'key wordtest'.

What about

characters:
     flag = true;

endElement:
     if (flag)
         x.append(' ');
         flag = false;

This is similar like the above mentioned endElement text check, but 
would prevent multiple spaces from output, wouldn't it?

Joerg

Re: cvs commit: cocoon-2.1 status.xml

Posted by Vadim Gritsenko <va...@reverycodes.com>.
Joerg Heinicke wrote:

> On 09.03.2004 02:39, Vadim Gritsenko wrote:
>
>>>>       public void characters(char[] ch, int start, int length) {
>>>>             if (ch.length > 0 && start >= 0 && length > 1) {
>>>>  -            String text = new String(ch, start, length);
>>>>               if (elementStack.size() > 0) {
>>>>                   IndexHelperField tos = (IndexHelperField) 
>>>> elementStack.peek();
>>>>  -                tos.appendText(text);
>>>>  +                tos.appendText(ch, start, length);
>>>>               }
>>>>  -            bodyText.append(text);
>>>>  +            bodyText.append(' ');
>>>>  +            bodyText.append(ch, start, length);
>>>>           }
>>>>       }
>>>>
>>>
>>> What will happen when "keyword" text is streamed as two characters 
>>> events, "key" and "word"? I think it will become "key word", and 
>>> indexing will break.
>>>
>>> IIUC, idea was to add a space in between tags, i.e. so 
>>> <p>some</p><p>text</p> is not indexed as "sometext". If that's 
>>> correct, then better fix would be to add space only if boolean flag 
>>> had_start_or_end_element_in_between_char_events set.
>>
>>
>> Joerg?
>
>
> Your mail was neither ignored nor accidently deleted - I just didn't 
> know what really to write, but marked it as important in nice red 
> color in Mozilla :)


:-)


> Yes, I see your objection - and asked for them already in the bug 
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)
>
> So what are the practical use cases this might occure? Maybe it's only 
> a theoretical problem depending on the "thing" the index is created 
> from? On which SAX stream the LuceneIndexHandler operates?


I remember there were issues already in other components with text being 
splitted up onto multiple character events. So, think of this as of 
preventive maintenance.


> I also don't get your implications for 
> "had_start_or_end_element_in_between_char_events". But I had a look on 
> the endElement(). It gets the elements from a stack and already tests 
> for text:
>     if (text != null && text.length() > 0) {
> Would it make sense to add the space in endElement, if the element 
> contains text, i.e. the above is true?


This was my first though... But then, multiple closing tags will cause 
multiple spaces... So, I thought, this should work:

startElement:
    flag = true;

endElement:
    flag = true;

characters:
    if (flag)
        x.append(' ');
        flag = false;

Does it solves the problem?

Vadim



Re: cvs commit: cocoon-2.1 status.xml

Posted by Joerg Heinicke <jo...@gmx.de>.
On 09.03.2004 02:39, Vadim Gritsenko wrote:

>>>       public void characters(char[] ch, int start, int length) {
>>>             if (ch.length > 0 && start >= 0 && length > 1) {
>>>  -            String text = new String(ch, start, length);
>>>               if (elementStack.size() > 0) {
>>>                   IndexHelperField tos = (IndexHelperField) 
>>> elementStack.peek();
>>>  -                tos.appendText(text);
>>>  +                tos.appendText(ch, start, length);
>>>               }
>>>  -            bodyText.append(text);
>>>  +            bodyText.append(' ');
>>>  +            bodyText.append(ch, start, length);
>>>           }
>>>       }
>>>
>>
>> What will happen when "keyword" text is streamed as two characters 
>> events, "key" and "word"? I think it will become "key word", and 
>> indexing will break.
>>
>> IIUC, idea was to add a space in between tags, i.e. so 
>> <p>some</p><p>text</p> is not indexed as "sometext". If that's 
>> correct, then better fix would be to add space only if boolean flag 
>> had_start_or_end_element_in_between_char_events set.
> 
> Joerg?

Your mail was neither ignored nor accidently deleted - I just didn't 
know what really to write, but marked it as important in nice red color 
in Mozilla :)

Yes, I see your objection - and asked for them already in the bug 
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)

So what are the practical use cases this might occure? Maybe it's only a 
theoretical problem depending on the "thing" the index is created from? 
On which SAX stream the LuceneIndexHandler operates?

I also don't get your implications for 
"had_start_or_end_element_in_between_char_events". But I had a look on 
the endElement(). It gets the elements from a stack and already tests 
for text:
     if (text != null && text.length() > 0) {
Would it make sense to add the space in endElement, if the element 
contains text, i.e. the above is true?

Joerg

Re: cvs commit: cocoon-2.1 status.xml

Posted by Vadim Gritsenko <va...@reverycodes.com>.
Vadim Gritsenko wrote:

> joerg@apache.org wrote:
>
>>       public void characters(char[] ch, int start, int length) {
>>             if (ch.length > 0 && start >= 0 && length > 1) {
>>  -            String text = new String(ch, start, length);
>>               if (elementStack.size() > 0) {
>>                   IndexHelperField tos = (IndexHelperField) 
>> elementStack.peek();
>>  -                tos.appendText(text);
>>  +                tos.appendText(ch, start, length);
>>               }
>>  -            bodyText.append(text);
>>  +            bodyText.append(' ');
>>  +            bodyText.append(ch, start, length);
>>           }
>>       }
>>
>
> What will happen when "keyword" text is streamed as two characters 
> events, "key" and "word"? I think it will become "key word", and 
> indexing will break.
>
> IIUC, idea was to add a space in between tags, i.e. so 
> <p>some</p><p>text</p> is not indexed as "sometext". If that's 
> correct, then better fix would be to add space only if boolean flag 
> had_start_or_end_element_in_between_char_events set.


Joerg?

Vadim