Posted to dev@lucene.apache.org by Nadav Har'El <ny...@math.technion.ac.il> on 2007/01/03 14:46:13 UTC

Re: Payloads

On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
>..
> Some weeks ago I started working on an improved design which I would 
> like to propose now. The new design simplifies the API extensions (the 
> Field API remains unchanged) and uses less disk space in most use cases. 
> Now there are only two classes that get new methods:
> - Token.setPayload()
>  Use this method to add arbitrary metadata to a Token in the form of a 
> byte[] array.
>...

Hi Michael,

For some uses (e.g., faceted search), one wants to add a payload to each
document, not per position for some text field. In the faceted search example,
we could use payloads to encode the list of facets that each document
belongs to. For this, with the old API, you could have added a fixed term
to an untokenized field, and added a payload to that entire untokenized field.
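
To make the facet idea above concrete, here is a rough sketch (not code from the patch) of how a document's facet list might be packed into the byte[] that a payload carries. It assumes facets are identified by non-negative integers and uses a variable-length encoding; the class FacetPayloadCodec and its methods are hypothetical names invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: packs a document's facet ids
// into a byte[] suitable for use as a payload, and unpacks them again.
class FacetPayloadCodec {

    // Encode each non-negative id as a variable-length integer:
    // 7 data bits per byte, high bit set when more bytes follow.
    static byte[] encode(int[] facetIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int id : facetIds) {
            if (id < 0) {
                throw new IllegalArgumentException("facet id must be >= 0: " + id);
            }
            while ((id & ~0x7F) != 0) {
                out.write((id & 0x7F) | 0x80);
                id >>>= 7;
            }
            out.write(id);
        }
        return out.toByteArray();
    }

    // Decode a byte[] produced by encode() back into the list of facet ids.
    static List<Integer> decode(byte[] payload) {
        List<Integer> ids = new ArrayList<>();
        int i = 0;
        while (i < payload.length) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = payload[i++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            ids.add(value);
        }
        return ids;
    }
}
```

Small ids take one byte each, so a document with a handful of facets would need only a few bytes of payload under this assumed encoding.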

With the new API, it seems doing this is much more difficult and requires
writing some sort of new Analyzer - one that will do the regular analysis
that I want for the regular fields, and add the payload to the one specific
field that lists the facets.
Am I understanding correctly? Or am I missing a better way to do this?

Thanks,
Nadav.

-- 
Nadav Har'El                        |    Wednesday, Jan  3 2007, 13 Tevet 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |If you lost your left arm, your right arm
http://nadav.harel.org.il           |would be left.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Couldn't agree more.  This is good progress.
>
> I like the payloads patch, but I would like to see the lazy prox 
> stream (Lucene 761) stuff done (or at least details given on it) so 
> that we can hook this into Similarity so that it can be hooked into 
> scoring.  For 761 and the payload stuff, we need to make sure we do 
> some benchmarking tests (see Doron's latest contribution under 
> contrib/Benchmark for some cool tools to help w/ benchmarking)
>
> If you can do 761, I can then merge the two and then I can put up a 
> patch for review that hooks in the scoring/Similarity idea that I 
> _think_ will work and will allow a payload scoring factor to be 
> calculated into the TermScorer and will be backward compatible and 
> would allow people to score payloads w/o having to change very much.
>
> -Grant
Yep, makes sense, Grant. I'm going to work on 761 over the next few days...



Re: Payloads

Posted by Grant Ingersoll <gs...@apache.org>.
Couldn't agree more.  This is good progress.

I like the payloads patch, but I would like to see the lazy prox  
stream (Lucene 761) stuff done (or at least details given on it) so  
that we can hook this into Similarity so that it can be hooked into  
scoring.  For 761 and the payload stuff, we need to make sure we do  
some benchmarking tests (see Doron's latest contribution under  
contrib/Benchmark for some cool tools to help w/ benchmarking)

If you can do 761, I can then merge the two and then I can put up a  
patch for review that hooks in the scoring/Similarity idea that I  
_think_ will work and will allow a payload scoring factor to be  
calculated into the TermScorer and will be backward compatible and  
would allow people to score payloads w/o having to change very much.
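
One way to picture the backward-compatible hook described above is a scoring method whose default is a neutral factor. This is only a sketch under that assumption, not the patch Grant has in mind; SimilaritySketch, BoostByteSimilarity and TermScorerSketch are hypothetical stand-ins, not the Lucene Similarity or TermScorer classes, and the payload layout is invented for the example.

```java
// Hypothetical sketch, not the Lucene Similarity or TermScorer API.
class SimilaritySketch {
    // The default factor is neutral, so scoring is unchanged for
    // applications that never look at payloads.
    float scorePayload(byte[] payload) {
        return 1.0f;
    }
}

// An application opts in by overriding the hook.
class BoostByteSimilarity extends SimilaritySketch {
    // Assumption for this example only: the first payload byte is a
    // boost in tenths, e.g. 25 means a factor of 2.5.
    @Override
    float scorePayload(byte[] payload) {
        if (payload == null || payload.length == 0) {
            return 1.0f;
        }
        return (payload[0] & 0xFF) / 10.0f;
    }
}

class TermScorerSketch {
    private final SimilaritySketch similarity;

    TermScorerSketch(SimilaritySketch similarity) {
        this.similarity = similarity;
    }

    // baseScore stands in for whatever the scorer already computes;
    // the payload factor is multiplied in on top of it.
    float score(float baseScore, byte[] payload) {
        return baseScore * similarity.scorePayload(payload);
    }
}
```

With the default class the payload is ignored and the base score passes through untouched; only code that installs an overriding subclass sees any change.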

-Grant

On Jan 18, 2007, at 11:31 AM, Michael Busch wrote:

> Grant Ingersoll wrote:
>> Just to put in two cents: the Flexible Indexing thread has also  
>> talked about the notion of being able to store arbitrary data at:  
>> token, field, doc and Index level.
>>
>> -Grant
>>
>
> Yes I agree that this should be the long-term goal. The payload  
> feature is just a first step in the direction of a flexible index  
> format. I think it makes sense to add new functions incrementally,  
> as long as we try to only extend the API in a way, so that it is  
> compatible with the long-term goal, as Doug suggested already.  
> After the payload patch is committed we can work on a more  
> sophisticated per-doc-metadata solution. Until then we can use  
> payloads for that use case. Flexible indexing is very complex and  
> progress is progress... :-)

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 18, 2007, at 8:59 AM, Grant Ingersoll wrote:

> I think one thing that would really bolster the flex. indexing  
> format changes would be to have someone write another  
> implementation for it so that we can iron out any interface details  
> that may be needed.  For instance, maybe the Kino merge model?

Workin' on it.  Subversion trunk for KS uses a unified postings  
format, including per-position boost.  Today I'm attempting to adapt  
PostingsWriter, SegTermDocs (soon to be renamed PostingList) and the  
scorers to deal with any logical combination of store_field_boost,  
store_freq, store_position, and store_boost.  I may need to work up  
the position-aware coordinator for BooleanScorer before long, because  
that will also have to tolerate multiple postings formats.
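
For readers following along, a writer that tolerates any combination of those four flags might look roughly like the sketch below. It is written in Java for illustration only and is not KinoSearch or Lucene code; the enum, the classes and the flat list-of-ints output are all hypothetical, and the choice to write the occurrence count whenever per-position data is stored is an assumption made so the output stays decodable.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

// Hypothetical names modelled on the flags mentioned above.
enum PostingFeature { FIELD_BOOST, FREQ, POSITION, BOOST }

// One term occurrence record for one document.
class PostingSketch {
    final int docId;
    final int fieldBoost;
    final int[] positions; // one entry per occurrence
    final int[] boosts;    // per-position boost, parallel to positions

    PostingSketch(int docId, int fieldBoost, int[] positions, int[] boosts) {
        this.docId = docId;
        this.fieldBoost = fieldBoost;
        this.positions = positions;
        this.boosts = boosts;
    }
}

class PostingEncoder {
    // Serialize one posting, emitting only the parts the format stores.
    static List<Integer> encode(PostingSketch p, EnumSet<PostingFeature> features) {
        boolean perPosition = features.contains(PostingFeature.POSITION)
                || features.contains(PostingFeature.BOOST);
        List<Integer> out = new ArrayList<>();
        out.add(p.docId);
        if (features.contains(PostingFeature.FIELD_BOOST)) {
            out.add(p.fieldBoost);
        }
        // Assumption: the count is also written when per-position data
        // follows, so a reader knows how many entries to expect.
        if (features.contains(PostingFeature.FREQ) || perPosition) {
            out.add(p.positions.length);
        }
        if (perPosition) {
            for (int i = 0; i < p.positions.length; i++) {
                if (features.contains(PostingFeature.POSITION)) {
                    out.add(p.positions[i]);
                }
                if (features.contains(PostingFeature.BOOST)) {
                    out.add(p.boosts[i]);
                }
            }
        }
        return out;
    }
}
```

A matching reader and the scorers would have to branch on the same feature set, which is the part that makes supporting multiple postings formats laborious.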

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Posted by Grant Ingersoll <gs...@apache.org>.
I agree (and this has been discussed on this very thread in the past;
see Doug's comments).  I would love to have someone take a look at
the flexible indexing patch that was submitted.  I have looked at it a
little, but it is going to need more than just me since it is a
big change, although it is backward compatible, I believe.  It needs to be
benchmarked, tested in threads, etc., so it may be a while before we get to
the Flex. format.   Thus, it _may_ make sense to put in payloads
first and mark them as "developer beware" in the comments and let
them be tested in the real world.

I think one thing that would really bolster the flex. indexing format  
changes would be to have someone write another implementation for it  
so that we can iron out any interface details that may be needed.   
For instance, maybe the Kino merge model?

-Grant

On Jan 18, 2007, at 11:45 AM, Marvin Humphrey wrote:

>
> On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
>
>> I think it makes sense to add new functions incrementally, as long  
>> as we try to only extend the API in a way, so that it is  
>> compatible with the long-term goal, as Doug suggested already.  
>> After the payload patch is committed we can work on a more  
>> sophisticated per-doc-metadata solution. Until then we can use  
>> payloads for that use case.
>
> I respectfully disagree with this plan.
>
> APIs are forever, implementations are ephemeral.
>
> By making a public API available for one aspect of the flexible  
> indexing format, we limit our ability to change our minds about how  
> that API should look later when we discover a more harmonious  
> solution.
>
> If we're going to go the incremental route, IMO any API should be  
> marked as experimental, or better, made private so that we can toy  
> with it "in-house" on Lucene's innards, auditioning the changes  
> before finalizing the API.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Marvin Humphrey wrote:
>
> On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
>
>> I think it makes sense to add new functions incrementally, as long as 
>> we try to only extend the API in a way, so that it is compatible with 
>> the long-term goal, as Doug suggested already. After the payload 
>> patch is committed we can work on a more sophisticated 
>> per-doc-metadata solution. Until then we can use payloads for that 
>> use case.
>
I think my comment was a bit confusing. The main intention of 
payloads is to store per-term metadata. However, with the 
workaround Nadav suggested it is also possible to use them for per-doc 
metadata, by simply storing only one token per document in a special 
field. This solution works but is probably not the nicest. But why not 
use this workaround, as long as the payloads patch does not introduce an 
API for the per-doc metadata that has to be removed or changed when we come 
up with a dedicated implementation for that use case? With the payloads 
patch I tried to keep the API changes as simple as possible (changes are 
only made to Token and TermPositions). These changes are under 
discussion in this thread with the intention of making them compatible 
with the flexible-indexing API. I couldn't agree more that the API has 
to be well-planned and I'd love to see your comments about the API 
extensions I suggested, Marvin.

> I respectfully disagree with this plan.
>
> APIs are forever, implementations are ephemeral.
>
> By making a public API available for one aspect of the flexible 
> indexing format, we limit our ability to change our minds about how 
> that API should look later when we discover a more harmonious solution.
>
> If we're going to go the incremental route, IMO any API should be 
> marked as experimental, or better, made private so that we can toy 
> with it "in-house" on Lucene's innards, auditioning the changes before 
> finalizing the API.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>

I certainly agree with your suggestion to mark new APIs as experimental. 
Then people would know that the API may change in the future and could 
use it in their apps at their own risk. At the same time we would benefit from 
valuable feedback from those users that would help us perfect the 
API. The idea of having a flexible index format is already a year old, I 
think, and at least in Java Lucene no progress has been made 
yet. So I'm all for the incremental approach, while marking new APIs 
carefully as experimental.



Re: Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:

> I think it makes sense to add new functions incrementally, as long  
> as we try to only extend the API in a way, so that it is compatible  
> with the long-term goal, as Doug suggested already. After the  
> payload patch is committed we can work on a more sophisticated per- 
> doc-metadata solution. Until then we can use payloads for that use  
> case.

I respectfully disagree with this plan.

APIs are forever, implementations are ephemeral.

By making a public API available for one aspect of the flexible  
indexing format, we limit our ability to change our minds about how  
that API should look later when we discover a more harmonious solution.

If we're going to go the incremental route, IMO any API should be  
marked as experimental, or better, made private so that we can toy  
with it "in-house" on Lucene's innards, auditioning the changes  
before finalizing the API.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Just to put in two cents: the Flexible Indexing thread has also talked 
> about the notion of being able to store arbitrary data at: token, 
> field, doc and Index level.
>
> -Grant
>

Yes I agree that this should be the long-term goal. The payload feature 
is just a first step in the direction of a flexible index format. I 
think it makes sense to add new functions incrementally, as long as we 
try to only extend the API in a way, so that it is compatible with the 
long-term goal, as Doug suggested already. After the payload patch is 
committed we can work on a more sophisticated per-doc-metadata solution. 
Until then we can use payloads for that use case. Flexible indexing is 
very complex and progress is progress... :-)



Re: Payloads

Posted by Grant Ingersoll <gs...@apache.org>.
Just to put in two cents: the Flexible Indexing thread has also  
talked about the notion of being able to store arbitrary data at:  
token, field, doc and Index level.

-Grant

On Jan 18, 2007, at 11:01 AM, Nadav Har'El wrote:

> On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
>> As you pointed out it is still possible to have per-doc payloads. You
>> need an analyzer which adds just one Token with payload to a specific
>> field for each doc. I understand that this code would be quite ugly on
>> the app side. A more elegant solution might be LUCENE-580. With that
>> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a
>> Document without having to use an analyzer. You could use a TokenStream
>
> Thanks, this sounds like a good idea.
>
> In fact, I could live with something even simpler: I want to be able
> to create a Field with a single token (with its payload). If I need more
> than one of these tokens with payloads, I can just add several fields with
> the same name (this should work, although the description of LUCENE-580
> suggests that it might have a bug in this area).
>
> I'll add a comment about this use-case to LUCENE-580.
>
> -- 
> Nadav Har'El                        |     Thursday, Jan 18 2007, 28 Tevet 5767
> IBM Haifa Research Lab              |-----------------------------------------
>                                     |If glory comes after death, I'm not in a
> http://nadav.harel.org.il           |hurry. (Latin proverb)

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Nadav Har'El wrote:
> On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
>   
>> As you pointed out it is still possible to have per-doc payloads. You 
>> need an analyzer which adds just one Token with payload to a specific 
>> field for each doc. I understand that this code would be quite ugly on 
>> the app side. A more elegant solution might be LUCENE-580. With that 
>> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a 
>> Document without having to use an analyzer. You could use a TokenStream 
>>     
>
> Thanks, this sounds like a good idea.
>
> In fact, I could live with something even simpler: I want to be able
> to create a Field with a single token (with its payload). If I need more
> than one of these tokens with payloads, I can just add several fields with
> the same name (this should work, although the description of LUCENE-580
> suggests that it might have a bug in this area).
>
> I'll add a comment about this use-case to LUCENE-580.
>
>   
Yes, for your use case it would indeed make sense to just add a single 
Token to a field. But there are other use cases that would benefit from 
580, e.g. when using UIMA as a parser. UIMA does not work per field; it 
materializes the tokens of all fields in a CAS. So the indexer can't 
call the parser per field; the parsing has to be done before indexing. 
It would therefore make sense to do the parsing and then add TokenStreams 
for the different fields to the Document that only iterate through the CAS.
This is of course also possible by adding multiple Field instances 
containing single Tokens to a Document, but the performance would 
suffer: each Token would be wrapped in a Field object and then held in a 
list in Document.

So I think being able to add TokenStreams to a Document makes sense.
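
As a rough sketch of that idea, the per-field stream below walks a shared, already-parsed token list and hands out only the tokens of its own field, so nothing is copied or wrapped per token. It assumes the parser result can be viewed as a flat list; ParsedToken and FieldTokenStream are hypothetical stand-ins and do not use the UIMA CAS or Lucene TokenStream APIs.

```java
import java.util.Iterator;
import java.util.List;

// Stand-in for one token in the parser's shared result (e.g. a CAS).
class ParsedToken {
    final String field;
    final String text;

    ParsedToken(String field, String text) {
        this.field = field;
        this.text = text;
    }
}

// Hypothetical per-field stream: iterates over the shared list and
// returns only tokens of its own field, without copying them.
class FieldTokenStream {
    private final Iterator<ParsedToken> source;
    private final String field;

    FieldTokenStream(List<ParsedToken> allTokens, String field) {
        this.source = allTokens.iterator();
        this.field = field;
    }

    // Returns the next token text for this field, or null when exhausted.
    String next() {
        while (source.hasNext()) {
            ParsedToken t = source.next();
            if (t.field.equals(field)) {
                return t.text;
            }
        }
        return null;
    }
}
```

One such stream per field would be attached to the Document, and the indexer would simply pull tokens from it instead of calling an analyzer.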




Re: Payloads

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
> As you pointed out it is still possible to have per-doc payloads. You 
> need an analyzer which adds just one Token with payload to a specific 
> field for each doc. I understand that this code would be quite ugly on 
> the app side. A more elegant solution might be LUCENE-580. With that 
> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a 
> Document without having to use an analyzer. You could use a TokenStream 

Thanks, this sounds like a good idea.

In fact, I could live with something even simpler: I want to be able
to create a Field with a single token (with its payload). If I need more
than one of these tokens with payloads, I can just add several fields with
the same name (this should work, although the description of LUCENE-580
suggests that it might have a bug in this area).

I'll add a comment about this use-case to LUCENE-580.

-- 
Nadav Har'El                        |     Thursday, Jan 18 2007, 28 Tevet 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |If glory comes after death, I'm not in a
http://nadav.harel.org.il           |hurry. (Latin proverb)



Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Nadav Har'El wrote:
>
> Hi Michael,
>
> For some uses (e.g., faceted search), one wants to add a payload to each
> document, not per position for some text field. In the faceted search example,
> we could use payloads to encode the list of facets that each document
> belongs to. For this, with the old API, you could have added a fixed term
> to an untokenized field, and added a payload to that entire untokenized field.
>
> With the new API, it seems doing this is much more difficult and requires
> writing some sort of new Analyzer - one that will do the regular analysis
> that I want for the regular fields, and add the payload to the one specific
> field that lists the facets.
> Am I understanding correctly? Or am I missing a better way to do this?
>
> Thanks,
> Nadav.
>
>   
Hi Nadav,

you are referring to the first design I proposed in
http://www.gossamer-threads.com/lists/lucene/java-dev/37409

In that design I indeed had a method
public Field(String name, String value, Store store, Index index, 
TermVector termVector, Payload payload);

which makes it easy to add a Payload without having to 
implement an Analyzer. The Field API is already complex; that's the 
reason why I removed this method in the new payloads version. And in 
this thread we're also discussing making the Token API more flexible, 
so that it will be easier in the future to add more functionality.

As you pointed out it is still possible to have per-doc payloads. You 
need an analyzer which adds just one Token with payload to a specific 
field for each doc. I understand that this code would be quite ugly on 
the app side. A more elegant solution might be LUCENE-580. With that 
patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a 
Document without having to use an analyzer. You could use a TokenStream 
implementation that emits only one Token. That would be a very simple 
class. Another benefit is that whenever we add more functionality to 
Token, we would not have to also provide another Field constructor. Do 
you think this makes sense? I haven't looked at the LUCENE-580 code and 
probably it needs to be updated since it is some months old, but I like 
the idea.
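
To show how small such a class could be, here is a sketch of a stream that emits exactly one token carrying a payload. It deliberately uses self-contained, hypothetical stand-ins (PayloadToken, SingleTokenStream) rather than the real Token and TokenStream classes, and the convention of returning null at the end of the stream is an assumption of this example.

```java
// Hypothetical stand-in for a token that carries a payload.
class PayloadToken {
    final String text;
    final byte[] payload;

    PayloadToken(String text, byte[] payload) {
        this.text = text;
        this.payload = payload;
    }
}

// Hypothetical stream that emits one token and then ends.
class SingleTokenStream {
    private PayloadToken token;

    SingleTokenStream(String text, byte[] payload) {
        this.token = new PayloadToken(text, payload);
    }

    // Returns the token on the first call and null on every later call.
    PayloadToken next() {
        PayloadToken t = token;
        token = null;
        return t;
    }
}
```

The application would create one such stream per document for the special field and put the per-doc metadata in the payload bytes.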




Re: Payloads

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Mon, Jan 08, 2007, Nicolas Lalevée wrote about "Re: Payloads":
> I have looked more closely at how Lucene indexes, and I realized that the 
> kind of payload handling in Michael's patch is not designed for the facet 
> feature. In this patch, the payloads are in the postings, i.e. in the tis, frq, prx 
> files. Payloads at the document level, which would be accessed in a scorer, 
> would be better placed in the TermVector files, which are ordered by doc and not 
> by term.

Well, it's sort of the same thing... Michael's patch allows putting payloads
at each position in a posting list; If you create a posting list which has
just one position per doc, you basically created a per-doc payload, ordered
by doc (like all posting lists).
And creating this posting list is easy: just pick an arbitrary field name
F and an arbitrary word W, and index the term (F,W) with the payload you want
for each document (basically, the list of categories that this document
belongs to).

I'm not saying this is the best way to do it, and certainly not the cleanest,
but it's just one of the things that payloads enable you to do.
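
A toy model of that trick, for illustration only: a single posting list for the fixed term (F,W) with one entry, and hence one payload, per document. The class PerDocPayloadPostings is hypothetical and lives entirely in memory; it is not Lucene's index format, and the binary-search lookup is just a stand-in for walking or skipping through the list in doc order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy in-memory posting list for one fixed term (F, W); hypothetical.
class PerDocPayloadPostings {
    private final List<Integer> docIds = new ArrayList<>();
    private final List<byte[]> payloads = new ArrayList<>();

    // Documents must be added in increasing docId order, which keeps
    // the list ordered by doc like any posting list.
    void add(int docId, byte[] payload) {
        if (!docIds.isEmpty() && docId <= docIds.get(docIds.size() - 1)) {
            throw new IllegalArgumentException("docIds must be increasing: " + docId);
        }
        docIds.add(docId);
        payloads.add(payload);
    }

    // Returns the payload stored for docId, or null if the document
    // has no entry for this term.
    byte[] payloadFor(int docId) {
        int idx = Collections.binarySearch(docIds, docId);
        return idx >= 0 ? payloads.get(idx) : null;
    }
}
```

Since each document contributes exactly one position, the payload at that position effectively becomes the per-doc payload.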

-- 
Nadav Har'El                        |    Wednesday, Jan 10 2007, 20 Tevet 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |Lumber Cartel member #2224.
http://nadav.harel.org.il           |http://lumbercartel.freeyellow.com/



Re: Payloads

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Le Mercredi 3 Janvier 2007 14:46, Nadav Har'El a écrit :
> On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
> >..
> > Some weeks ago I started working on an improved design which I would
> > like to propose now. The new design simplifies the API extensions (the
> > Field API remains unchanged) and uses less disk space in most use cases.
> > Now there are only two classes that get new methods:
> > - Token.setPayload()
> >  Use this method to add arbitrary metadata to a Token in the form of a
> > byte[] array.
> >...
>
> Hi Michael,
>
> For some uses (e.g., faceted search), one wants to add a payload to each
> document, not per position for some text field. In the faceted search
> example, we could use payloads to encode the list of facets that each
> document belongs to. For this, with the old API, you could have added a
> fixed term to an untokenized field, and added a payload to that entire
> untokenized field.
>
> With the new API, it seems doing this is much more difficult and requires
> writing some sort of new Analyzer - one that will do the regular analysis
> that I want for the regular fields, and add the payload to the one specific
> field that lists the facets.
> Am I understanding correctly? Or am I missing a better way to do this?

I have looked more closely at how Lucene indexes, and I realized that the 
kind of payload handling in Michael's patch is not designed for the facet 
feature. In this patch, the payloads are in the postings, i.e. in the tis, frq, prx 
files. Payloads at the document level, which would be accessed in a scorer, 
would be better placed in the TermVector files, which are ordered by doc and not 
by term.

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com
