Posted to solr-user@lucene.apache.org by Markus Jelsma <ma...@openindex.io> on 2015/03/06 19:16:45 UTC

RE: Delimited payloads input issue

Well, the only work-around we found that actually works properly is to override the problem-causing tokenizer implementations one by one. Regarding the WordDelimiterFilter, the quickest fix is enabling keepOriginal; if you don't want the original to stick around, the filter implementation must be modified to carry the original PayloadAttribute over to the tokens derived from it.
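
A minimal sketch of that quick fix, assuming a Lucene 5.x-era analysis chain and assuming the keepOriginal mentioned above maps to WordDelimiterFilter's PRESERVE_ORIGINAL flag (the factory option preserveOriginal); the chain below is illustrative, not the exact schema in use:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IntegerEncoder;

public class PayloadPreservingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    // Strip the "|5" suffix and attach it as a payload while the token is still whole.
    TokenStream stream = new DelimitedPayloadTokenFilter(source, '|', new IntegerEncoder());
    // PRESERVE_ORIGINAL keeps "Hello," (payload intact) alongside the generated word parts.
    int flags = WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.PRESERVE_ORIGINAL;
    stream = new WordDelimiterFilter(stream, flags, null);
    return new TokenStreamComponents(source, stream);
  }
}

Note that the generated word parts still come out without the payload; only the preserved original keeps it, which is exactly the limitation described above.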

Markus
 
 
-----Original message-----
> From:Markus Jelsma <ma...@openindex.io>
> Sent: Friday 27th February 2015 17:28
> To: solr-user <so...@lucene.apache.org>
> Subject: Delimited payloads input issue
> 
> Hi - we are trying to use payloads to identify different parts of extracted HTML pages, using the DelimitedPayloadTokenFilter to assign the correct payload to the tokens. However, we are running into issues with some language analyzers, and with some types of content for most regular analyzers.
> 
> If we, for example, want to assign payloads to the text within an H1 field that contains non-alphanumerics, such as `Hello, i am a heading!`, and use |5 as the delimiter and payload, we send the following to Solr: `Hello,|5 i|5 am|5 a|5 heading!|5`.
> This is not going to work because, due to the WordDelimiterFilter, the tokens Hello and heading obviously lose their payload. We also cannot put the payload between the last alphanumeric and the following comma or exclamation mark, because those characters would then become part of the payload if we use the identity encoder, or the parse would fail if we use another encoder. We could solve this with a custom encoder that only takes the first character and ignores the rest, but that seems rather ugly.
> 
> On the other hand, we have issues using language-specific tokenizers such as Kuromoji, which will immediately dump the delimited payload so it never reaches the DelimitedPayloadTokenFilter. And if we try Chinese and have the StandardTokenizer enabled, we also lose the delimited payload.
> 
> Have any of you dealt with this before? Any hints to share?
> 
> Many thanks,
> Markus
> 
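
The "custom encoder that only takes the first character" mentioned in the quoted message could look roughly like this; a hypothetical sketch against Lucene's PayloadEncoder API (the class name is invented), assuming single-character numeric payloads such as |5:

import org.apache.lucene.analysis.payloads.AbstractEncoder;
import org.apache.lucene.analysis.payloads.PayloadEncoder;
import org.apache.lucene.util.BytesRef;

// Hypothetical: only the first character after the delimiter counts, so a token
// sent as "heading|5!" yields the payload for '5' and the trailing punctuation
// is ignored instead of breaking the parse.
public class FirstCharEncoder extends AbstractEncoder implements PayloadEncoder {
  @Override
  public BytesRef encode(char[] buffer, int offset, int length) {
    if (length == 0) {
      return new BytesRef(); // nothing after the delimiter, no payload
    }
    byte payload = (byte) Character.getNumericValue(buffer[offset]);
    return new BytesRef(new byte[] { payload });
  }
}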

Re: Delimited payloads input issue

Posted by "david.w.smiley@gmail.com" <da...@gmail.com>.
Hi Markus,

I’ve found this problem too. I’ve worked around it:
* Write a custom attribute that holds the data you want to carry forward, one
with a no-op clear().  The no-op clear() defeats WDF and other ill-behaved
filters (e.g. common-grams).
* Use a custom tokenizer that populates the attribute, and which can truly
clear the custom attribute.
* Write a custom filter at the end of the chain that actually encodes the
attribute data into the payload.
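
A rough sketch of the first step, assuming the Lucene 5.x-era attribute API (names here are made up for illustration; the interface and the impl would normally live in separate source files):

// --- SectionIdAttribute.java ---
import org.apache.lucene.util.Attribute;

public interface SectionIdAttribute extends Attribute {
  void setSectionId(int id);
  int getSectionId();
}

// --- SectionIdAttributeImpl.java ---
import org.apache.lucene.util.AttributeImpl;

public class SectionIdAttributeImpl extends AttributeImpl implements SectionIdAttribute {
  private int sectionId;

  @Override
  public void setSectionId(int id) {
    this.sectionId = id;
  }

  @Override
  public int getSectionId() {
    return sectionId;
  }

  @Override
  public void clear() {
    // Intentionally a no-op: filters that call clearAttributes() when they inject
    // new tokens (WordDelimiterFilter, CommonGramsFilter, ...) cannot wipe the value.
  }

  // A real reset, called only by the custom tokenizer between values/documents.
  public void reallyClear() {
    sectionId = 0;
  }

  @Override
  public void copyTo(AttributeImpl target) {
    ((SectionIdAttribute) target).setSectionId(sectionId);
  }
}

The custom filter at the end of the chain then reads getSectionId() and writes it into the PayloadAttribute of every token, whatever splitting happened in between.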

This scheme has worked for me for sentence/paragraph IDs, which effectively
hold constant throughout the sentence.  It may be more complicated when the
data varies word-by-word since some Filters won’t work well.

I suppose the real solution is better Filters that use captureState and/or
an improved tokenStream design to clone attributes.  It’s better to clone a
state when introducing a new token than to clear it!
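
To illustrate that last point, here is a sketch (not an existing Lucene filter; the class name and the injected term are invented) of a filter that captures and restores state before injecting a synthetic token, so payloads and custom attributes are cloned rather than cleared:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class StateCloningFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State pending;

  public StateCloningFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {
      // Restore everything captured from the previous token, then overwrite only
      // the term: payloads and custom attributes carry over instead of being cleared.
      restoreState(pending);
      pending = null;
      termAtt.setEmpty().append("synthetic");
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    pending = captureState();
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}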

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley
