Posted to dev@lucene.apache.org by "david.w.smiley@gmail.com" <da...@gmail.com> on 2014/11/05 19:03:32 UTC

Multi-valued fields and TokenStream

Several times now, I’ve had to come up with work-arounds for a TokenStream
not knowing whether it’s processing the first value or a subsequent value of a
multi-valued field.  Two of these times, the use-case was ensuring the
first position of each value started at a multiple of 1000 (or some other
configurable value), and the third was encoding sentence paragraph counters
(similar to a do-it-yourself position increment).
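
For illustration, the position arithmetic behind the first use-case can be
sketched in plain Java.  Nothing here is Lucene API: the class name, the
method incrementToNextMultiple, and the block size of 1000 are hypothetical
and chosen only for the example.  It assumes zero-based positions and that
the first token of a subsequent value should land on the next multiple of
the block size after the previous value's last token.

```java
final class PositionAlignmentSketch {
    /**
     * Returns the increment that moves from lastPosition (position of the last
     * token of the previous value) to the next multiple of blockSize after it.
     */
    static int incrementToNextMultiple(int lastPosition, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        if (lastPosition < 0) {
            throw new IllegalArgumentException("lastPosition must be non-negative");
        }
        // Integer division finds the current block; the next block starts one later.
        int nextStart = (lastPosition / blockSize + 1) * blockSize;
        return nextStart - lastPosition;
    }

    public static void main(String[] args) {
        // Previous value ended at position 41, so the next value should start at 1000.
        System.out.println(incrementToNextMultiple(41, 1000));   // 959
        // Previous value ended at position 1003, so the next value should start at 2000.
        System.out.println(incrementToNextMultiple(1003, 1000)); // 997
    }
}
```

The hard part discussed in this thread is not the arithmetic but knowing,
inside the analysis chain, that a new value has started.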

The work-arounds are awkward and hacky.  For example, if you’re in control
of your Tokenizer, you can prefix subsequent values with a special flag
and then do the right thing in reset().  But then the highlighter, or value
retrieval in general, is impacted.  It’s also possible to create the fields
with the constructor that accepts a TokenStream, one that you’ve told whether
it’s the first or a subsequent value, but it’s awkward going that route, and
sometimes (e.g. Solr) it’s hard to know all the values you have up-front to
even do that.

It would be nice if TokenStream.reset() took a boolean ‘first’ argument.
Such a change would obviously be backwards incompatible.  Simply
overloading the method to call the no-arg version is problematic because
TokenStreams are a chain, and it would likely result in the chain getting
doubly-reset.
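
To make the double-reset concern concrete, here is a toy model in plain
Java.  It does not use Lucene: Stage, FilterStage, and the reset(boolean)
overload are hypothetical stand-ins for a TokenStream chain, and the
delegation strategy shown is only one assumption about how a
backwards-compatible overload might be wired.

```java
final class DoubleResetSketch {
    /** Hypothetical stand-in for the stream at the bottom of a chain. */
    static class Stage {
        int resetCount = 0;

        void reset() {
            resetCount++;
        }

        /** Hypothetical overload that falls back to the no-arg version. */
        void reset(boolean first) {
            reset();
        }
    }

    /** Hypothetical stand-in for a filter that wraps another stage. */
    static class FilterStage extends Stage {
        final Stage input;

        FilterStage(Stage input) {
            this.input = input;
        }

        @Override
        void reset() {
            super.reset();
            input.reset(); // existing behaviour: a filter resets its input
        }

        @Override
        void reset(boolean first) {
            input.reset(first); // pass the flag down the chain
            reset();            // keep legacy reset() logic running; resets the input again
        }
    }

    /** Builds a two-stage chain, resets it once, and reports the source's reset count. */
    static int sourceResetsAfterOneCall() {
        Stage source = new Stage();
        FilterStage filter = new FilterStage(source);
        filter.reset(false);
        return source.resetCount;
    }

    public static void main(String[] args) {
        System.out.println(sourceResetsAfterOneCall()); // 2
    }
}
```

In this model a single reset(false) on the filter resets the source twice:
once through the new overload and once through the legacy no-arg path.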

Any ideas?

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

Re: Multi-valued fields and TokenStream

Posted by Steve Rowe <sa...@gmail.com>.

> On Nov 6, 2014, at 3:13 PM, david.w.smiley@gmail.com wrote:
> 
> Are you suggesting that DefaultIndexingChain.PerField.invert(boolean firstValue) would, prior to calling reset(), call setPositionIncrement(Integer.MAX_VALUE), but only when ‘firstValue’ is false?  Hmmmm.  I guess that would work, although it seems a bit hacky and it’s tying this to a specific attribute when ideally we notify the chain as a whole what’s going on.  But it doesn’t require any new API, save for some javadocs.  And it’s extremely unlikely there would be a backwards-incompatible problem, so that’s good.  And I find this use is related to positions so it’s not so bad to abuse the position increment for this.  Nice idea Steve; this works for me.

Um, I meant something much simpler (but wrong): use the existing Analyzer.getPositionIncrementGap() to allow analysis components to infer whether a value was first.  I can see now from DefaultIndexingChain.PerField.invert(), though, that this info isn’t available to analysis components, but is only used to adjust the FieldInvertState’s position.  Sorry for the noise.

Steve
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Multi-valued fields and TokenStream

Posted by Robert Muir <rc...@gmail.com>.

On Thu, Nov 6, 2014 at 3:41 PM, david.w.smiley@gmail.com
<da...@gmail.com> wrote:
> On Thu, Nov 6, 2014 at 3:19 PM, Robert Muir <rc...@gmail.com> wrote:
>>
>> Do the concatenation yourself with your own TokenStream. You can index
>> a field with a tokenstream for expert cases (the individual stored
>> values can be added separately)
>
>
> Yes, but that’s quite awkward and a fair amount of surrounding code when, in
> the end, it could be so much simpler if somehow the TokenStream could be
> notified.  I’d feel a little better about it if Lucene included the
> tokenStream concatenating code (I’ve done a prototype for this, I could work
> on it more and contribute) and if the Solr layer had a nice way of
> presenting all the values to the Solr FieldType at once instead of
> separately — SOLR-4329.
>
>>
>> No need to make the tokenstream API more complicated: it's already very
>> complicated.
>
>
> Ehh, that’s arguable.  Steve’s suggestion amounts to one line of production
> code (javadoc & test is separate).  If that’s too much then adding a boolean
> argument to reset() would feel cleaner, be 0 lines of new code, but would be
> backwards-incompatible.  Shrug.

That's just not true, and that's why I am against such a change. It is not
one line; it makes the "protocol" of TokenStream a lot more complex:
we have to ensure the correct values are passed by all consumers
(including IndexWriter), etc. The same goes regardless of whether it is
extra parameters or strange values.

It's also bogus to add such stuff when it's specific to IndexWriter
concatenating multiple fields, which anyone can do themselves with a
TokenStream. It's unnecessary.

Instead, we already provide an expert API (index a TokenStream) for you
to do whatever it is you want, without dirtying up Lucene's API.

Sorry, I don't think we should hack booleans into IndexWriter or
TokenStream or Analyzer for this expert use case, when we already
supply you an API to do it and you just "don't wanna".
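
As a rough illustration of the do-it-yourself concatenation, the sketch
below joins the tokens of several values into one sequence and gives the
first token of each value an increment that lands it on a multiple of a
block size.  It is plain Java, not Lucene: Token, concatenate, and
blockSize are hypothetical names, the values are assumed to be already
tokenized, and each value is assumed to have fewer tokens than blockSize.
A real implementation would be a TokenStream that sets the position
increment attribute and is handed to a Field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConcatenationSketch {
    /** Hypothetical token: a term plus its increment relative to the previous token. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Concatenates per-value token lists so that value v starts at position v * blockSize. */
    static List<Token> concatenate(List<List<String>> values, int blockSize) {
        List<Token> out = new ArrayList<>();
        int position = -1; // no token emitted yet
        for (int v = 0; v < values.size(); v++) {
            int target = v * blockSize; // where this value's first token should land
            boolean firstTokenOfValue = true;
            for (String term : values.get(v)) {
                int increment = firstTokenOfValue ? target - position : 1;
                if (increment < 1) {
                    throw new IllegalArgumentException("an earlier value has too many tokens for blockSize");
                }
                out.add(new Token(term, increment));
                position += increment;
                firstTokenOfValue = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = concatenate(
                Arrays.asList(Arrays.asList("quick", "fox"), Arrays.asList("lazy", "dog")), 1000);
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            System.out.println(t.term + " @ " + position); // positions 0, 1, 1000, 1001
        }
    }
}
```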


Re: Multi-valued fields and TokenStream

Posted by "david.w.smiley@gmail.com" <da...@gmail.com>.

On Thu, Nov 6, 2014 at 3:19 PM, Robert Muir <rc...@gmail.com> wrote:

> Do the concatenation yourself with your own TokenStream. You can index
> a field with a tokenstream for expert cases (the individual stored
> values can be added separately)
>

Yes, but that’s quite awkward and a fair amount of surrounding code when,
in the end, it could be so much simpler if somehow the TokenStream could be
notified.  I’d feel a little better about it if Lucene included the
tokenStream concatenating code (I’ve done a prototype for this, I could
work on it more and contribute) and if the Solr layer had a nice way of
presenting all the values to the Solr FieldType at once instead of
separately — SOLR-4329.


> No need to make the tokenstream API more complicated: it's already very
> complicated.
>

Ehh, that’s arguable.  Steve’s suggestion amounts to one line of production
code (javadoc & test is separate).  If that’s too much then adding a
boolean argument to reset() would feel cleaner, be 0 lines of new code, but
would be backwards-incompatible.  Shrug.

Another idea is if Field.tokenStream(Analyzer analyzer, TokenStream reuse)
had another boolean to indicate first value or not.  I think I like the
other ideas better though.



Re: Multi-valued fields and TokenStream

Posted by Robert Muir <rc...@gmail.com>.

Do the concatenation yourself with your own TokenStream. You can index
a field with a tokenstream for expert cases (the individual stored
values can be added separately)

No need to make the tokenstream API more complicated: it's already very
complicated.

On Thu, Nov 6, 2014 at 3:13 PM, david.w.smiley@gmail.com
<da...@gmail.com> wrote:
> Are you suggesting that DefaultIndexingChain.PerField.invert(boolean
> firstValue) would, prior to calling reset(), call
> setPositionIncrement(Integer.MAX_VALUE), but only when ‘firstValue’ is
> false?  Hmmmm.  I guess that would work, although it seems a bit hacky and
> it’s tying this to a specific attribute when ideally we notify the chain as
> a whole what’s going on.  But it doesn’t require any new API, save for some
> javadocs.  And it’s extremely unlikely there would be a
> backwards-incompatible problem, so that’s good.  And I find this use is
> related to positions so it’s not so bad to abuse the position increment for
> this.  Nice idea Steve; this works for me.
>
> Does anyone else have an opinion before I create an issue?
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley

Re: Multi-valued fields and TokenStream

Posted by "david.w.smiley@gmail.com" <da...@gmail.com>.

Are you suggesting that DefaultIndexingChain.PerField.invert(boolean
firstValue) would, prior to calling reset(), call
setPositionIncrement(Integer.MAX_VALUE), but only when ‘firstValue’ is
false?  Hmmmm.  I guess that would work, although it seems a bit hacky and
it’s tying this to a specific attribute when ideally we notify the chain as
a whole what’s going on.  But it doesn’t require any new API, save for some
javadocs.  And it’s extremely unlikely there would be a
backwards-incompatible problem, so that’s good.  And I find this use is
related to positions so it’s not so bad to abuse the position increment for
this.  Nice idea Steve; this works for me.
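
A toy model of the sentinel idea described above, in plain Java.  None of
this is Lucene code: IncrementAttribute, ValueAwareStage, and this
invert() are hypothetical stand-ins, and the assumption that the component
restores a default increment of 1 after reading the sentinel is made up
for the example.

```java
final class SentinelIncrementSketch {
    /** Hypothetical shared attribute, loosely modelled on a position-increment attribute. */
    static final class IncrementAttribute {
        int positionIncrement = 1;
    }

    /** Hypothetical analysis component that wants to know whether it sees the first value. */
    static final class ValueAwareStage {
        final IncrementAttribute attribute;
        boolean sawFirstValue;

        ValueAwareStage(IncrementAttribute attribute) {
            this.attribute = attribute;
        }

        void reset() {
            // Read the sentinel left by the caller, then restore the assumed default.
            sawFirstValue = attribute.positionIncrement != Integer.MAX_VALUE;
            attribute.positionIncrement = 1;
        }
    }

    /** Hypothetical stand-in for the indexing code that drives the chain once per value. */
    static boolean invert(ValueAwareStage stage, boolean firstValue) {
        if (!firstValue) {
            stage.attribute.positionIncrement = Integer.MAX_VALUE; // sentinel set before reset()
        }
        stage.reset();
        return stage.sawFirstValue;
    }

    public static void main(String[] args) {
        ValueAwareStage stage = new ValueAwareStage(new IncrementAttribute());
        System.out.println(invert(stage, true));  // true
        System.out.println(invert(stage, false)); // false
    }
}
```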

Does anyone else have an opinion before I create an issue?

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Thu, Nov 6, 2014 at 2:13 PM, Steve Rowe <sa...@gmail.com> wrote:

> Maybe the position increment gap would be useful?  If set to a value
> larger than likely max position for any individual value, it could be used
> to infer (non-)first-value-ness.

Re: Multi-valued fields and TokenStream

Posted by Steve Rowe <sa...@gmail.com>.

Maybe the position increment gap would be useful?  If set to a value larger than likely max position for any individual value, it could be used to infer (non-)first-value-ness.
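
A small plain-Java sketch of that inference, for a consumer that can see
absolute positions.  The names are hypothetical, and it assumes the gap is
larger than the position span of any single value, so that a jump of at
least the gap can only occur at a value boundary.  As noted elsewhere in
this thread, the gap is only used to adjust the FieldInvertState's
position and is not visible to analysis components during indexing, so
this would not solve the problem inside a TokenStream.

```java
import java.util.Arrays;

final class GapInferenceSketch {
    /**
     * Maps each absolute position to the index of the value it came from,
     * treating any jump of at least gap between consecutive positions as a boundary.
     */
    static int[] valueIndexes(int[] positions, int gap) {
        int[] result = new int[positions.length];
        int value = 0;
        for (int i = 0; i < positions.length; i++) {
            if (i > 0 && positions[i] - positions[i - 1] >= gap) {
                value++; // large jump: a new value has started
            }
            result[i] = value;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] positions = {0, 1, 2, 1003, 1004, 2005};
        System.out.println(Arrays.toString(valueIndexes(positions, 1000))); // [0, 0, 0, 1, 1, 2]
    }
}
```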
