Posted to dev@lucene.apache.org by Nadav Har'El <ny...@math.technion.ac.il> on 2007/01/03 14:46:13 UTC
Re: Payloads
On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
>..
> Some weeks ago I started working on an improved design which I would
> like to propose now. The new design simplifies the API extensions (the
> Field API remains unchanged) and uses less disk space in most use cases.
> Now there are only two classes that get new methods:
> - Token.setPayload()
> Use this method to add arbitrary metadata to a Token in the form of a
> byte[] array.
>...
Hi Michael,
For some uses (e.g., faceted search), one wants to add a payload to each
document, not per position for some text field. In the faceted search example,
we could use payloads to encode the list of facets that each document
belongs to. For this, with the old API, you could have added a fixed term
to an untokenized field, and add a payload to that entire untokenized field.
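To make the idea concrete, here is a small self-contained sketch (illustrative names only, no Lucene dependency) of how a document's facet list could be packed into, and recovered from, the byte[] payload of a single marker token. The delta-coded VInt encoding mirrors the variable-length integers Lucene uses in its postings files:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: pack a per-document facet list into the byte[] payload of a
 * single marker token. Class and method names are illustrative, not
 * part of any Lucene API.
 */
public class FacetPayload {

    /** Encode facet ordinals (assumed sorted ascending) as delta-coded VInts. */
    public static byte[] encode(int[] facetIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : facetIds) {
            int delta = id - prev;
            prev = id;
            while ((delta & ~0x7F) != 0) {          // VInt: 7 bits per byte,
                out.write((delta & 0x7F) | 0x80);   // high bit = "more bytes follow"
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    /** Decode the payload back into the original facet ordinals. */
    public static int[] decode(byte[] payload) {
        List<Integer> ids = new ArrayList<Integer>();
        int prev = 0, pos = 0;
        while (pos < payload.length) {
            int value = 0, shift = 0;
            byte b;
            do {
                b = payload[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += value;          // undo the delta coding
            ids.add(prev);
        }
        int[] result = new int[ids.size()];
        for (int i = 0; i < result.length; i++) result[i] = ids.get(i);
        return result;
    }
}
```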
With the new API, it seems doing this is much more difficult and requires
writing some sort of new Analyzer - one that will do the regular analysis
that I want for the regular fields, and add the payload to the one specific
field that lists the facets.
Am I understanding correctly? Or am I missing a better way to do this?
Thanks,
Nadav.
--
Nadav Har'El | Wednesday, Jan 3 2007, 13 Tevet 5767
IBM Haifa Research Lab |-----------------------------------------
|If you lost your left arm, your right arm
http://nadav.harel.org.il |would be left.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Couldn't agree more. This is good progress.
>
> I like the payloads patch, but I would like to see the lazy prox
> stream (Lucene 761) stuff done (or at least details given on it) so
> that we can hook this into Similarity so that it can be hooked into
> scoring. For 761 and the payload stuff, we need to make sure we do
> some benchmarking tests (see Doron's latest contribution under
> contrib/Benchmark for some cool tools to help w/ benchmarking)
>
> If you can do 761, I can then merge the two and then I can put up a
> patch for review that hooks in the scoring/Similarity idea that I
> _think_ will work and will allow a payload scoring factor to be
> calculated into the TermScorer and will be backward compatible and
> would allow people to score payloads w/o having to change very much.
>
> -Grant
Yep, makes sense, Grant. I'm going to work on 761 over the next few days...
Re: Payloads
Posted by Grant Ingersoll <gs...@apache.org>.
Couldn't agree more. This is good progress.
I like the payloads patch, but I would like to see the lazy prox
stream (Lucene 761) stuff done (or at least details given on it) so
that we can hook this into Similarity so that it can be hooked into
scoring. For 761 and the payload stuff, we need to make sure we do
some benchmarking tests (see Doron's latest contribution under
contrib/Benchmark for some cool tools to help w/ benchmarking)
If you can do 761, I can then merge the two and then I can put up a
patch for review that hooks in the scoring/Similarity idea that I
_think_ will work and will allow a payload scoring factor to be
calculated into the TermScorer and will be backward compatible and
would allow people to score payloads w/o having to change very much.
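The kind of hook described here could look roughly like the following sketch: a payload-derived factor multiplied into a TermScorer-style term score. Everything in it (the class name, the scorePayload signature, and the one-byte boost convention) is a hypothetical illustration, not the actual patch:

```java
/**
 * Hypothetical sketch of a Similarity-style payload hook folded into a
 * term score. All names and the encoding convention are assumptions
 * made for illustration only.
 */
public class PayloadSimilaritySketch {

    /** Map a per-position payload to a score multiplier; defaults to 1. */
    public static float scorePayload(byte[] payload) {
        if (payload == null || payload.length == 0) {
            return 1.0f;   // no payload: score exactly as before (backward compatible)
        }
        // Example convention: one unsigned byte holding a boost in [0, 255],
        // scaled so that 100 means "neutral".
        return (payload[0] & 0xFF) / 100.0f;
    }

    /** Term score with the payload factor multiplied in. */
    public static float termScore(float tf, float idf, byte[] payload) {
        return tf * idf * scorePayload(payload);
    }
}
```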
-Grant
On Jan 18, 2007, at 11:31 AM, Michael Busch wrote:
> Grant Ingersoll wrote:
>> Just to put in two cents: the Flexible Indexing thread has also
>> talked about the notion of being able to store arbitrary data at:
>> token, field, doc and Index level.
>>
>> -Grant
>>
>
> Yes I agree that this should be the long-term goal. The payload
> feature is just a first step in the direction of a flexible index
> format. I think it makes sense to add new functions incrementally,
> as long as we try to only extend the API in a way, so that it is
> compatible with the long-term goal, as Doug suggested already.
> After the payload patch is committed we can work on a more
> sophisticated per-doc-metadata solution. Until then we can use
> payloads for that use case. Flexible indexing is very complex and
> progress is progress... :-)
>
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: Payloads
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 18, 2007, at 8:59 AM, Grant Ingersoll wrote:
> I think one thing that would really bolster the flex. indexing
> format changes would be to have someone write another
> implementation for it so that we can iron out any interface details
> that may be needed. For instance, maybe the Kino merge model?
Workin' on it. Subversion trunk for KS uses a unified postings
format, including per-position boost. Today I'm attempting to adapt
PostingsWriter, SegTermDocs (soon to be renamed PostingList) and the
scorers to deal with any logical combination of store_field_boost,
store_freq, store_position, and store_boost. I may need to work up
the position-aware coordinator for BooleanScorer before long, because
that will also have to tolerate multiple postings formats.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Payloads
Posted by Grant Ingersoll <gs...@apache.org>.
I agree (and this has been discussed on this very thread in the past,
see Doug's comments). I would love to have someone take a look at
the flexible indexing patch that was submitted. I have looked a
little at it, but it is going to need more than just me, since it is
a big change (although it is backward compatible, I believe). It
needs to be benchmarked, tested under multiple threads, etc., so it
may be a while before we get to the flexible format. Thus, it _may_
make sense to put in payloads first, mark them as "developer beware"
in the comments, and let them be tested in the real world.
I think one thing that would really bolster the flex. indexing format
changes would be to have someone write another implementation for it
so that we can iron out any interface details that may be needed.
For instance, maybe the Kino merge model?
-Grant
On Jan 18, 2007, at 11:45 AM, Marvin Humphrey wrote:
>
> On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
>
>> I think it makes sense to add new functions incrementally, as long
>> as we try to only extend the API in a way, so that it is
>> compatible with the long-term goal, as Doug suggested already.
>> After the payload patch is committed we can work on a more
>> sophisticated per-doc-metadata solution. Until then we can use
>> payloads for that use case.
>
> I respectfully disagree with this plan.
>
> APIs are forever, implementations are ephemeral.
>
> By making a public API available for one aspect of the flexible
> indexing format, we limit our ability to change our minds about how
> that API should look later when we discover a more harmonious
> solution.
>
> If we're going to go the incremental route, IMO any API should be
> marked as experimental, or better, made private so that we can toy
> with it "in-house" on Lucene's innards, auditioning the changes
> before finalizing the API.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Marvin Humphrey wrote:
>
> On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
>
>> I think it makes sense to add new functions incrementally, as long as
>> we try to only extend the API in a way, so that it is compatible with
>> the long-term goal, as Doug suggested already. After the payload
>> patch is committed we can work on a more sophisticated
>> per-doc-metadata solution. Until then we can use payloads for that
>> use case.
>
I think my comment was a bit confusing. The main intention of
payloads is to store per-term metadata. However, with the workaround
Nadav suggested it is also possible to use them for per-doc metadata,
by simply storing only one token per document in a special field.
This solution works but is probably not the nicest. Still, why not
use this workaround, as long as the payloads patch does not introduce
a per-doc-metadata API that would have to be removed or changed once
we come up with a dedicated implementation for that use case? With
the payloads patch I tried to keep the API changes as simple as
possible (changes are only made to Token and TermPositions). These
changes are under discussion in this thread with the intention of
making them compatible with the flexible-indexing API. I couldn't
agree more that the API has to be well planned, and I'd love to see
your comments about the API extensions I suggested, Marvin.
> I respectfully disagree with this plan.
>
> APIs are forever, implementations are ephemeral.
>
> By making a public API available for one aspect of the flexible
> indexing format, we limit our ability to change our minds about how
> that API should look later when we discover a more harmonious solution.
>
> If we're going to go the incremental route, IMO any API should be
> marked as experimental, or better, made private so that we can toy
> with it "in-house" on Lucene's innards, auditioning the changes before
> finalizing the API.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
I certainly agree with your suggestion to mark new APIs as
experimental. Then people would know that the API may change in the
future and could use it in their apps at their own risk. At the same
time we would benefit from valuable feedback from those users, which
would help us perfect the API. The idea of having a flexible index
format is already a year old, I think, and at least in Java-Lucene no
progress has been made on it yet. So I'm all for the incremental
approach, while carefully marking new APIs as experimental.
Re: Payloads
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
> I think it makes sense to add new functions incrementally, as long
> as we try to only extend the API in a way, so that it is compatible
> with the long-term goal, as Doug suggested already. After the
> payload patch is committed we can work on a more sophisticated per-
> doc-metadata solution. Until then we can use payloads for that use
> case.
I respectfully disagree with this plan.
APIs are forever, implementations are ephemeral.
By making a public API available for one aspect of the flexible
indexing format, we limit our ability to change our minds about how
that API should look later when we discover a more harmonious solution.
If we're going to go the incremental route, IMO any API should be
marked as experimental, or better, made private so that we can toy
with it "in-house" on Lucene's innards, auditioning the changes
before finalizing the API.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Just to put in two cents: the Flexible Indexing thread has also talked
> about the notion of being able to store arbitrary data at: token,
> field, doc and Index level.
>
> -Grant
>
Yes I agree that this should be the long-term goal. The payload feature
is just a first step in the direction of a flexible index format. I
think it makes sense to add new functions incrementally, as long as we
try to only extend the API in a way, so that it is compatible with the
long-term goal, as Doug suggested already. After the payload patch is
committed we can work on a more sophisticated per-doc-metadata solution.
Until then we can use payloads for that use case. Flexible indexing is
very complex and progress is progress... :-)
Re: Payloads
Posted by Grant Ingersoll <gs...@apache.org>.
Just to put in two cents: the Flexible Indexing thread has also
talked about the notion of being able to store arbitrary data at:
token, field, doc and Index level.
-Grant
On Jan 18, 2007, at 11:01 AM, Nadav Har'El wrote:
> On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
>> As you pointed out it is still possible to have per-doc payloads. You
>> need an analyzer which adds just one Token with payload to a specific
>> field for each doc. I understand that this code would be quite
>> ugly on
>> the app side. A more elegant solution might be LUCENE-580. With that
>> patch you are able to add pre-analyzed fields (i. e. TokenStreams)
>> to a
>> Document without having to use an analyzer. You could use a
>> TokenStream
>
> Thanks, this sounds like a good idea.
>
> In fact, I could live with something even simpler: I want to be able
> to create a Field with a single token (with its payload). If I need more
> than one of these tokens with payloads, I can just add several fields with
> the same name (this should work, although the description of LUCENE-580
> suggests that it might have a bug in this area).
>
> I'll add a comment about this use-case to LUCENE-580.
>
> --
> Nadav Har'El | Thursday, Jan 18 2007, 28 Tevet 5767
> IBM Haifa Research Lab |-----------------------------------------
> |If glory comes after death, I'm not in a
> http://nadav.harel.org.il |hurry. (Latin proverb)
>
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Nadav Har'El wrote:
> On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
>
>> As you pointed out it is still possible to have per-doc payloads. You
>> need an analyzer which adds just one Token with payload to a specific
>> field for each doc. I understand that this code would be quite ugly on
>> the app side. A more elegant solution might be LUCENE-580. With that
>> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a
>> Document without having to use an analyzer. You could use a TokenStream
>>
>
> Thanks, this sounds like a good idea.
>
> In fact, I could live with something even simpler: I want to be able
> to create a Field with a single token (with its payload). If I need more
> than one of these tokens with payloads, I can just add several fields with
> the same name (this should work, although the description of LUCENE-580
> suggests that it might have a bug in this area).
>
> I'll add a comment about this use-case to LUCENE-580.
>
>
Yes, for your use case it would indeed make sense to just add a
single Token to a field. But there are other use cases that would
benefit from 580, e.g. when using UIMA as a parser. UIMA does not
work per field; it materializes the tokens of all fields in a CAS.
So the indexer can't call the parser per field; the parsing has to be
done before indexing. It would then make sense to do the parsing once
and add TokenStreams for the different fields to the Document, where
each stream only iterates through the CAS. This is of course also
possible by adding multiple Field instances containing single Tokens
to a Document, but performance would suffer: each Token would be
wrapped in a Field object and then held in a list in the Document.
So I think being able to add TokenStreams to a Document makes sense.
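The parse-once, per-field-view design described above can be sketched like this. A plain List stands in for UIMA's CAS, and all class and method names are illustrative, not Lucene or UIMA APIs:

```java
import java.util.Iterator;
import java.util.List;

/** A token carrying the name of the field it belongs to; a List of
 *  these stands in for UIMA's CAS in this sketch. */
class AnnotatedToken {
    final String field;
    final String text;
    AnnotatedToken(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

/** A lightweight per-field view over the shared token list: the
 *  Document would get one of these per field instead of one Field
 *  object per token. Illustrative names only. */
class FieldTokenStream {
    private final Iterator<AnnotatedToken> it;
    private final String field;

    FieldTokenStream(List<AnnotatedToken> cas, String field) {
        this.it = cas.iterator();
        this.field = field;
    }

    /** Returns the next token of this stream's field, or null when exhausted. */
    AnnotatedToken next() {
        while (it.hasNext()) {
            AnnotatedToken t = it.next();
            if (t.field.equals(field)) {
                return t;
            }
        }
        return null;
    }
}
```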
Re: Payloads
Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
> As you pointed out it is still possible to have per-doc payloads. You
> need an analyzer which adds just one Token with payload to a specific
> field for each doc. I understand that this code would be quite ugly on
> the app side. A more elegant solution might be LUCENE-580. With that
> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a
> Document without having to use an analyzer. You could use a TokenStream
Thanks, this sounds like a good idea.
In fact, I could live with something even simpler: I want to be able
to create a Field with a single token (with its payload). If I need more
than one of these tokens with payloads, I can just add several fields with
the same name (this should work, although the description of LUCENE-580
suggests that it might have a bug in this area).
I'll add a comment about this use-case to LUCENE-580.
--
Nadav Har'El | Thursday, Jan 18 2007, 28 Tevet 5767
IBM Haifa Research Lab |-----------------------------------------
|If glory comes after death, I'm not in a
http://nadav.harel.org.il |hurry. (Latin proverb)
Re: Payloads
Posted by Michael Busch <bu...@gmail.com>.
Nadav Har'El wrote:
>
> Hi Michael,
>
> For some uses (e.g., faceted search), one wants to add a payload to each
> document, not per position for some text field. In the faceted search example,
> we could use payloads to encode the list of facets that each document
> belongs to. For this, with the old API, you could have added a fixed term
> to an untokenized field, and add a payload to that entire untokenized field.
>
> With the new API, it seems doing this is much more difficult and requires
> writing some sort of new Analyzer - one that will do the regular analysis
> that I want for the regular fields, and add the payload to the one specific
> field that lists the facets.
> Am I understanding correctly? Or am I missing a better way to do this?
>
> Thanks,
> Nadav.
>
>
Hi Nadav,
you are referring to the first design I proposed in
http://www.gossamer-threads.com/lists/lucene/java-dev/37409
In that design I indeed had a method
public Field(String name, String value, Store store, Index index,
TermVector termVector, Payload payload);
which makes it easy to add a Payload without having to implement an
Analyzer. The Field API is already complex; that's the reason why I
removed this method in the new payloads version. And in this thread
we're also discussing making the Token API more flexible, so that it
will be easier to add more functionality in the future.
As you pointed out it is still possible to have per-doc payloads. You
need an analyzer which adds just one Token with payload to a specific
field for each doc. I understand that this code would be quite ugly on
the app side. A more elegant solution might be LUCENE-580. With that
patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a
Document without having to use an analyzer. You could use a TokenStream
implementation that emits only one Token. That would be a very simple
class. Another benefit is that whenever we add more functionality to
Token, we would not have to also provide another Field constructor. Do
you think this makes sense? I haven't looked at the LUCENE-580 code and
probably it needs to be updated since it is some months old, but I like
the idea.
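The "very simple class" mentioned above might look like the sketch below. Minimal stand-ins for Lucene's Token and TokenStream are included so the sketch compiles on its own, and the setPayload method shown is the one proposed in this thread, not a released API:

```java
/** Minimal stand-ins for Lucene's Token/TokenStream, included only so
 *  the sketch below is self-contained. */
class Token {
    final String text;
    byte[] payload;
    Token(String text) { this.text = text; }
    void setPayload(byte[] payload) { this.payload = payload; } // proposed API
}

abstract class TokenStream {
    abstract Token next();   // returns null when the stream is exhausted
}

/** A stream that emits a single pre-built Token (carrying the
 *  per-document payload) and then signals end-of-stream. */
class SingleTokenStream extends TokenStream {
    private Token token;

    SingleTokenStream(String text, byte[] payload) {
        token = new Token(text);
        token.setPayload(payload);
    }

    Token next() {
        Token t = token;
        token = null;    // emit the token exactly once
        return t;
    }
}
```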
Re: Payloads
Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Mon, Jan 08, 2007, Nicolas Lalevée wrote about "Re: Payloads":
> I have looked more closely at how Lucene indexes, and I realized that the
> kind of payload handling in Michael's patch is not designed for the facet
> feature. In this patch, the payloads are in the postings, i.e. in the tis,
> frq, prx files. Payloads at the document level, which would be accessed in
> a scorer, would be better placed in the TermVector files, which are ordered
> by doc and not by term.
Well, it's sort of the same thing... Michael's patch allows putting payloads
at each position in a posting list; if you create a posting list which has
just one position per doc, you have basically created a per-doc payload, ordered
by doc (like all posting lists).
And creating this posting list is easy: just pick an arbitrary field name
F and an arbitrary word W, and index the term (F,W) with the payload you want
for each document (basically, the list of categories that this document
belongs to).
I'm not saying this is the best way to do it, and certainly not the cleanest,
but it's just one of the things that payloads enable you to do.
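As a toy model of why this works (purely illustrative, no Lucene code): the posting list for the single marker term (F, W) has exactly one entry per document, in doc order, so its payloads behave like a per-document metadata store keyed by docId:

```java
import java.util.TreeMap;

/** Toy in-memory model of the marker-term trick: one posting per
 *  document, kept in docId order (a TreeMap, mirroring the doc order
 *  of a real posting list). Illustrative only. */
class MarkerTermPostings {
    private final TreeMap<Integer, byte[]> payloadByDoc = new TreeMap<Integer, byte[]>();

    /** At index time: each document adds the marker term once, with
     *  its category list encoded in the payload. */
    void add(int docId, byte[] categoriesPayload) {
        payloadByDoc.put(docId, categoriesPayload);
    }

    /** At search time: a scorer walking this posting list can read the
     *  per-document payload directly (null if the doc has none). */
    byte[] payloadFor(int docId) {
        return payloadByDoc.get(docId);
    }
}
```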
--
Nadav Har'El | Wednesday, Jan 10 2007, 20 Tevet 5767
IBM Haifa Research Lab |-----------------------------------------
|Lumber Cartel member #2224.
http://nadav.harel.org.il |http://lumbercartel.freeyellow.com/
Re: Payloads
Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Wednesday, January 3, 2007, at 14:46, Nadav Har'El wrote:
> On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
> >..
> > Some weeks ago I started working on an improved design which I would
> > like to propose now. The new design simplifies the API extensions (the
> > Field API remains unchanged) and uses less disk space in most use cases.
> > Now there are only two classes that get new methods:
> > - Token.setPayload()
> > Use this method to add arbitrary metadata to a Token in the form of a
> > byte[] array.
> >...
>
> Hi Michael,
>
> For some uses (e.g., faceted search), one wants to add a payload to each
> document, not per position for some text field. In the faceted search
> example, we could use payloads to encode the list of facets that each
> document belongs to. For this, with the old API, you could have added a
> fixed term to an untokenized field, and add a payload to that entire
> untokenized field.
>
> With the new API, it seems doing this is much more difficult and requires
> writing some sort of new Analyzer - one that will do the regular analysis
> that I want for the regular fields, and add the payload to the one specific
> field that lists the facets.
> Am I understanding correctly? Or am I missing a better way to do this?
I have looked more closely at how Lucene indexes, and I realized that the
kind of payload handling in Michael's patch is not designed for the facet
feature. In this patch, the payloads are in the postings, i.e. in the tis,
frq, prx files. Payloads at the document level, which would be accessed in
a scorer, would be better placed in the TermVector files, which are ordered
by doc and not by term.
--
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com