You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cocoon.apache.org by Vadim Gritsenko <va...@reverycodes.com> on 2004/03/06 17:53:34 UTC
Re: cvs commit: cocoon-2.1 status.xml
joerg@apache.org wrote:
> public void characters(char[] ch, int start, int length) {
>
> if (ch.length > 0 && start >= 0 && length > 1) {
> - String text = new String(ch, start, length);
> if (elementStack.size() > 0) {
> IndexHelperField tos = (IndexHelperField) elementStack.peek();
> - tos.appendText(text);
> + tos.appendText(ch, start, length);
> }
> - bodyText.append(text);
> + bodyText.append(' ');
> + bodyText.append(ch, start, length);
> }
> }
>
What will happen when "keyword" text is streamed as two characters
events, "key" and "word"? I think it will become "key word", and
indexing will break.
IIUC, idea was to add a space in between tags, i.e. so
<p>some</p><p>text</p> is not indexed as "sometext". If that's correct,
then better fix would be to add space only if boolean flag
had_start_or_end_element_in_between_char_events set.
Vadim
Re: LuceneIndexContentHandler
Posted by Joerg Heinicke <jo...@gmx.de>.
On 11.03.2004 20:50, Joerg Heinicke wrote:
> Replace the startElements in your sample with endElements and it won't
> work. But adding if(flag) like above additionally to endElement should
> fix this. Did we get it now? :)
I have it done like the following
characters
flag = true
endElement
if(flag)
append(' ');
flag = false;
startElement
if(flag)
append(' ');
flag = false;
But after having the closer look on the code I really wonder why my
original proposal doing it in endElement depending on the text test
should not work. The text only contains child::text() (to say it in
XPath) and through the stack the text is completely reordered depending
on the order of the endElement tags in the file:
<element>
text1
<element>
text2
</element>
text3
</element>
should end in
"text2 text1 text3", shouldn't it? And this should be doable through the
text test in endElement.
Joerg
Re: LuceneIndexContentHandler
Posted by Joerg Heinicke <jo...@gmx.de>.
On 12.03.2004 00:18, Vadim Gritsenko wrote:
> Yes, note that I wrote "addition" above - in addition to your suggestion:
Sorry, missed that obviously.
Joerg
Re: LuceneIndexContentHandler
Posted by Vadim Gritsenko <va...@reverycodes.com>.
Joerg Heinicke wrote:
> On 11.03.2004 13:23, Vadim Gritsenko wrote:
>
>> startElement a
>> character 'key'
>> startElement b
>> character 'word'
>>
>> Will become "keyword" instead of "key word". No, this won't work,
>> again :-)
>> Addition of
>
^^^^
>> startElement:
>> if (flag)
>> x.append(' ');
>> flag = false;
>>
>> Should fix it, shouldn't it?
>
>
> This game is becoming funny :)
>
> Replace the startElements in your sample with endElements and it won't
> work. But adding if(flag) like above additionally to endElement should
> fix this. Did we get it now? :)
Yes, note that I wrote "addition" above - in addition to your suggestion:
characters:
flag = true;
endElement:
if (flag)
x.append(' ');
flag = false;
So yes, I think now it will work :-)
Vadim
Re: LuceneIndexContentHandler
Posted by Joerg Heinicke <jo...@gmx.de>.
On 11.03.2004 13:23, Vadim Gritsenko wrote:
> startElement a
> character 'key'
> startElement b
> character 'word'
>
> Will become "keyword" instead of "key word". No, this won't work, again :-)
> Addition of
>
> startElement:
> if (flag)
> x.append(' ');
> flag = false;
>
> Should fix it, shouldn't it?
This game is becoming funny :)
Replace the startElements in your sample with endElements and it won't
work. But adding if(flag) like above additionally to endElement should
fix this. Did we get it now? :)
Joerg
Re: LuceneIndexContentHandler
Posted by Vadim Gritsenko <va...@reverycodes.com>.
Joerg Heinicke wrote:
> On 09.03.2004 13:43, Vadim Gritsenko wrote:
>
>>> Yes, I see your objection - and asked for them already in the bug
>>> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)
>>>
>>> So what are the practical use cases this might occure? Maybe it's
>>> only a theoretical problem depending on the "thing" the index is
>>> created from? On which SAX stream the LuceneIndexHandler operates?
>>
>>
>> I remember there were issues already in other components with text
>> being splitted up onto multiple character events. So, think of this
>> as of preventive maintenance.
>
>
> Yes, for example this bug:
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=26219. Two character
> events following eachother come out of an XSLT process using
> <xsl:value> twice following eachother. And the AbstractDOMTransformer
> or more probable one of the component it uses drops the second and
> following text events.
>
>>> I also don't get your implications for
>>> "had_start_or_end_element_in_between_char_events". But I had a look
>>> on the endElement(). It gets the elements from a stack and already
>>> tests for text:
>>> if (text != null && text.length() > 0) {
>>> Would it make sense to add the space in endElement, if the element
>>> contains text, i.e. the above is true?
>>
>>
>> This was my first though... But then, multiple closing tags will
>> cause multiple spaces...
>
>
> Ok, if this disturbs.
>
>> So, I thought, this should work:
>>
>> startElement:
>> flag = true;
>>
>> endElement:
>> flag = true;
>>
>> characters:
>> if (flag)
>> x.append(' ');
>> flag = false;
>>
>> Does it solves the problem?
>
>
> Unfortunately not:
>
> startElement event
> character event 'key'
> character event 'word'
> character event 'test'
> endElement event
>
> So you would have 'key wordtest'.
>
> What about
>
> characters:
> flag = true;
>
> endElement:
> if (flag)
> x.append(' ');
> flag = false;
>
> This is similar like the above mentioned endElement text check, but
> would prevent multiple spaces from output, wouldn't it?
startElement a
character 'key'
startElement b
character 'word'
Will become "keyword" instead of "key word". No, this won't work, again :-)
Addition of
startElement:
if (flag)
x.append(' ');
flag = false;
Should fix it, shouldn't it?
Vadim
LuceneIndexContentHandler (was: cvs commit: cocoon-2.1 status.xml)
Posted by Joerg Heinicke <jo...@gmx.de>.
On 09.03.2004 13:43, Vadim Gritsenko wrote:
>> Yes, I see your objection - and asked for them already in the bug
>> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)
>>
>> So what are the practical use cases this might occure? Maybe it's only
>> a theoretical problem depending on the "thing" the index is created
>> from? On which SAX stream the LuceneIndexHandler operates?
>
> I remember there were issues already in other components with text being
> splitted up onto multiple character events. So, think of this as of
> preventive maintenance.
Yes, for example this bug:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=26219. Two character
events following eachother come out of an XSLT process using <xsl:value>
twice following eachother. And the AbstractDOMTransformer or more
probable one of the component it uses drops the second and following
text events.
>> I also don't get your implications for
>> "had_start_or_end_element_in_between_char_events". But I had a look on
>> the endElement(). It gets the elements from a stack and already tests
>> for text:
>> if (text != null && text.length() > 0) {
>> Would it make sense to add the space in endElement, if the element
>> contains text, i.e. the above is true?
>
> This was my first though... But then, multiple closing tags will cause
> multiple spaces...
Ok, if this disturbs.
> So, I thought, this should work:
>
> startElement:
> flag = true;
>
> endElement:
> flag = true;
>
> characters:
> if (flag)
> x.append(' ');
> flag = false;
>
> Does it solves the problem?
Unfortunately not:
startElement event
character event 'key'
character event 'word'
character event 'test'
endElement event
So you would have 'key wordtest'.
What about
characters:
flag = true;
endElement:
if (flag)
x.append(' ');
flag = false;
This is similar like the above mentioned endElement text check, but
would prevent multiple spaces from output, wouldn't it?
Joerg
Re: cvs commit: cocoon-2.1 status.xml
Posted by Vadim Gritsenko <va...@reverycodes.com>.
Joerg Heinicke wrote:
> On 09.03.2004 02:39, Vadim Gritsenko wrote:
>
>>>> public void characters(char[] ch, int start, int length) {
>>>> if (ch.length > 0 && start >= 0 && length > 1) {
>>>> - String text = new String(ch, start, length);
>>>> if (elementStack.size() > 0) {
>>>> IndexHelperField tos = (IndexHelperField)
>>>> elementStack.peek();
>>>> - tos.appendText(text);
>>>> + tos.appendText(ch, start, length);
>>>> }
>>>> - bodyText.append(text);
>>>> + bodyText.append(' ');
>>>> + bodyText.append(ch, start, length);
>>>> }
>>>> }
>>>>
>>>
>>> What will happen when "keyword" text is streamed as two characters
>>> events, "key" and "word"? I think it will become "key word", and
>>> indexing will break.
>>>
>>> IIUC, idea was to add a space in between tags, i.e. so
>>> <p>some</p><p>text</p> is not indexed as "sometext". If that's
>>> correct, then better fix would be to add space only if boolean flag
>>> had_start_or_end_element_in_between_char_events set.
>>
>>
>> Joerg?
>
>
> Your mail was neither ignored nor accidently deleted - I just didn't
> know what really to write, but marked it as important in nice red
> color in Mozilla :)
:-)
> Yes, I see your objection - and asked for them already in the bug
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)
>
> So what are the practical use cases this might occure? Maybe it's only
> a theoretical problem depending on the "thing" the index is created
> from? On which SAX stream the LuceneIndexHandler operates?
I remember there were issues already in other components with text being
splitted up onto multiple character events. So, think of this as of
preventive maintenance.
> I also don't get your implications for
> "had_start_or_end_element_in_between_char_events". But I had a look on
> the endElement(). It gets the elements from a stack and already tests
> for text:
> if (text != null && text.length() > 0) {
> Would it make sense to add the space in endElement, if the element
> contains text, i.e. the above is true?
This was my first though... But then, multiple closing tags will cause
multiple spaces... So, I thought, this should work:
startElement:
flag = true;
endElement:
flag = true;
characters:
if (flag)
x.append(' ');
flag = false;
Does it solves the problem?
Vadim
Re: cvs commit: cocoon-2.1 status.xml
Posted by Joerg Heinicke <jo...@gmx.de>.
On 09.03.2004 02:39, Vadim Gritsenko wrote:
>>> public void characters(char[] ch, int start, int length) {
>>> if (ch.length > 0 && start >= 0 && length > 1) {
>>> - String text = new String(ch, start, length);
>>> if (elementStack.size() > 0) {
>>> IndexHelperField tos = (IndexHelperField)
>>> elementStack.peek();
>>> - tos.appendText(text);
>>> + tos.appendText(ch, start, length);
>>> }
>>> - bodyText.append(text);
>>> + bodyText.append(' ');
>>> + bodyText.append(ch, start, length);
>>> }
>>> }
>>>
>>
>> What will happen when "keyword" text is streamed as two characters
>> events, "key" and "word"? I think it will become "key word", and
>> indexing will break.
>>
>> IIUC, idea was to add a space in between tags, i.e. so
>> <p>some</p><p>text</p> is not indexed as "sometext". If that's
>> correct, then better fix would be to add space only if boolean flag
>> had_start_or_end_element_in_between_char_events set.
>
> Joerg?
Your mail was neither ignored nor accidently deleted - I just didn't
know what really to write, but marked it as important in nice red color
in Mozilla :)
Yes, I see your objection - and asked for them already in the bug
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25934 ;)
So what are the practical use cases this might occure? Maybe it's only a
theoretical problem depending on the "thing" the index is created from?
On which SAX stream the LuceneIndexHandler operates?
I also don't get your implications for
"had_start_or_end_element_in_between_char_events". But I had a look on
the endElement(). It gets the elements from a stack and already tests
for text:
if (text != null && text.length() > 0) {
Would it make sense to add the space in endElement, if the element
contains text, i.e. the above is true?
Joerg
Re: cvs commit: cocoon-2.1 status.xml
Posted by Vadim Gritsenko <va...@reverycodes.com>.
Vadim Gritsenko wrote:
> joerg@apache.org wrote:
>
>> public void characters(char[] ch, int start, int length) {
>> if (ch.length > 0 && start >= 0 && length > 1) {
>> - String text = new String(ch, start, length);
>> if (elementStack.size() > 0) {
>> IndexHelperField tos = (IndexHelperField)
>> elementStack.peek();
>> - tos.appendText(text);
>> + tos.appendText(ch, start, length);
>> }
>> - bodyText.append(text);
>> + bodyText.append(' ');
>> + bodyText.append(ch, start, length);
>> }
>> }
>>
>
> What will happen when "keyword" text is streamed as two characters
> events, "key" and "word"? I think it will become "key word", and
> indexing will break.
>
> IIUC, idea was to add a space in between tags, i.e. so
> <p>some</p><p>text</p> is not indexed as "sometext". If that's
> correct, then better fix would be to add space only if boolean flag
> had_start_or_end_element_in_between_char_events set.
Joerg?
Vadim