Posted to dev@lucene.apache.org by Michael Busch <bu...@gmail.com> on 2006/12/20 15:19:18 UTC

Payloads

Hi all,

currently it is not possible to add generic payloads to a posting list. 
However, this feature would be useful for various use cases. Some examples:
- XML search
  to index XML documents and allow structured search (e.g. XPath) it is 
necessary to store the depths of the terms
- part-of-speech
  payloads can be used to store the part of speech of a term occurrence
- term boost
  for terms that occur e.g. in bold font a payload containing a boost 
value can be stored
- ...

The payloads feature has been requested and discussed a couple of times, 
e.g. in
- http://www.gossamer-threads.com/lists/lucene/java-dev/29465
- http://www.gossamer-threads.com/lists/lucene/java-dev/37409

In the latter thread I proposed a design a couple of months ago that 
adds to Lucene the ability to store variable-length payloads inline 
in the posting list of a term. However, this design had some drawbacks: 
the already complex Field API was extended, and the payloads encoding 
was not optimal in terms of disk space. Furthermore, the overall Lucene 
runtime performance suffered due to the growth of the .prx file. In the 
meantime the patch LUCENE-687 (Lazy skipping on proximity file) was 
committed, which reduces the number of reads and seeks on the .prx file. 
This minimizes the performance degradation of a bigger .prx file. Also, 
LUCENE-695 (Improve BufferedIndexInput.readBytes() performance) was 
committed, which speeds up reading mid-size chunks of bytes; this is 
beneficial for payloads that are bigger than just a few bytes.

Some weeks ago I started working on an improved design which I would 
like to propose now. The new design simplifies the API extensions (the 
Field API remains unchanged) and uses less disk space in most use cases. 
Now there are only two classes that get new methods:
- Token.setPayload()
  Use this method to add arbitrary metadata to a Token in the form of a 
byte[] array.
 
- TermPositions.getPayload()
  Use this method to retrieve the payload of a term occurrence.
 
The implementation is very flexible: the user does not have to enable 
payloads explicitly for a field and can add payloads to all, some, or no 
Tokens. Due to the improved encoding, those use cases are handled 
efficiently in terms of disk space.
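To make the byte[] contract concrete, here is a minimal, self-contained sketch of the term-boost use case from the list above. The encoding (the four big-endian bytes of the float's IEEE-754 bits) and the BoostPayload helper are my own illustrative assumptions; the patch deliberately leaves the payload format entirely to the application.

```java
// Illustrative sketch only: a payload is an opaque byte[], and this class
// shows one possible application-chosen encoding (a per-position boost).
class BoostPayload {

    // Encode a float boost into a 4-byte payload (big-endian IEEE-754 bits).
    static byte[] encode(float boost) {
        int bits = Float.floatToIntBits(boost);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8),  (byte) bits
        };
    }

    // Decode the payload back into the boost, e.g. from the bytes an app
    // would obtain via the proposed TermPositions.getPayload().
    static float decode(byte[] payload) {
        int bits = ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
                 | ((payload[2] & 0xFF) << 8)  |  (payload[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(1.5f))); // prints 1.5
    }
}
```

At indexing time such bytes would be attached with the proposed Token.setPayload(); at search time the bytes returned for a position would be fed back through decode().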

Another thing I would like to point out is that this feature is 
backwards compatible, meaning that the file format only changes if the 
user explicitly adds payloads to the index. If no payloads are used, all 
data structures remain unchanged.

I'm going to open a new JIRA issue soon containing the patch and details 
about implementation and file format changes.

One more comment: this is a rather big patch and an initial version, 
so I'm sure there will be a lot of discussion. I would like to 
encourage people who consider this feature useful to try it out and 
give me feedback about possible improvements.

Best regards,
- Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Couldn't agree more.  This is good progress.
>
> I like the payloads patch, but I would like to see the lazy prox 
> stream (Lucene 761) stuff done (or at least details given on it) so 
> that we can hook this into Similarity so that it can be hooked into 
> scoring.  For 761 and the payload stuff, we need to make sure we do 
> some benchmarking tests (see Doron's latest contribution under 
> contrib/Benchmark for some cool tools to help w/ benchmarking)
>
> If you can do 761, I can then merge the two and then I can put up a 
> patch for review that hooks in the scoring/Similarity idea that I 
> _think_ will work and will allow a payload scoring factor to be 
> calculated into the TermScorer and will be backward compatible and 
> would allow people to score payloads w/o having to change very much.
>
> -Grant
Yep, makes sense, Grant. I'm going to work on 761 over the next few days...



Re: Payloads

Posted by Grant Ingersoll <gs...@apache.org>.
Couldn't agree more.  This is good progress.

I like the payloads patch, but I would like to see the lazy prox  
stream (Lucene 761) stuff done (or at least details given on it) so  
that we can hook this into Similarity so that it can be hooked into  
scoring.  For 761 and the payload stuff, we need to make sure we do  
some benchmarking tests (see Doron's latest contribution under  
contrib/Benchmark for some cool tools to help w/ benchmarking)

If you can do 761, I can then merge the two and then I can put up a  
patch for review that hooks in the scoring/Similarity idea that I  
_think_ will work and will allow a payload scoring factor to be  
calculated into the TermScorer and will be backward compatible and  
would allow people to score payloads w/o having to change very much.

-Grant

On Jan 18, 2007, at 11:31 AM, Michael Busch wrote:

> Grant Ingersoll wrote:
>> Just to put in two cents: the Flexible Indexing thread has also  
>> talked about the notion of being able to store arbitrary data at:  
>> token, field, doc and Index level.
>>
>> -Grant
>>
>
> Yes I agree that this should be the long-term goal. The payload  
> feature is just a first step in the direction of a flexible index  
> format. I think it makes sense to add new functions incrementally,  
> as long as we try to only extend the API in a way, so that it is  
> compatible with the long-term goal, as Doug suggested already.  
> After the payload patch is committed we can work on a more  
> sophisticated per-doc-metadata solution. Until then we can use  
> payloads for that use case. Flexible indexing is very complex and  
> progress is progress... :-)
>
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 18, 2007, at 8:59 AM, Grant Ingersoll wrote:

> I think one thing that would really bolster the flex. indexing  
> format changes would be to have someone write another  
> implementation for it so that we can iron out any interface details  
> that may be needed.  For instance, maybe the Kino merge model?

Workin' on it.  Subversion trunk for KS uses a unified postings  
format, including per-position boost.  Today I'm attempting to adapt  
PostingsWriter, SegTermDocs (soon to be renamed PostingList) and the  
scorers to deal with any logical combination of store_field_boost,  
store_freq, store_position, and store_boost.  I may need to work up  
the position-aware coordinator for BooleanScorer before long, because  
that will also have to tolerate multiple postings formats.
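For readers unfamiliar with the KS flags named above, a rough sketch of how "any logical combination" of them might be described by a bitmask; the class and constants here are illustrative stand-ins, not actual KinoSearch code:

```java
// Illustrative only: a bitmask describing which components a unified
// postings format stores, mirroring the four flags named above.
class PostingsFlags {
    static final int STORE_FIELD_BOOST = 1;      // per-field boost
    static final int STORE_FREQ        = 1 << 1; // term frequency per doc
    static final int STORE_POSITION    = 1 << 2; // position per occurrence
    static final int STORE_BOOST       = 1 << 3; // per-position boost

    // Writers and scorers can branch on the combination they are given.
    static boolean has(int format, int flag) {
        return (format & flag) != 0;
    }

    public static void main(String[] args) {
        int format = STORE_FREQ | STORE_POSITION;
        System.out.println(has(format, STORE_POSITION)); // prints true
        System.out.println(has(format, STORE_BOOST));    // prints false
    }
}
```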

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Posted by Grant Ingersoll <gs...@apache.org>.
I agree (and this has been discussed on this very thread in the past; 
see Doug's comments). I would love to have someone take a look at the 
flexible indexing patch that was submitted. I have looked a little at 
it, but it is going to need more than just me since it is a big change, 
although it is backward compatible, I believe. It needs to be 
benchmarked, tested in threads, etc., so it may be a while before we 
get to the flexible format. Thus, it _may_ make sense to put in 
payloads first, mark them as "developer beware" in the comments, and 
let them be tested in the real world.

I think one thing that would really bolster the flex. indexing format  
changes would be to have someone write another implementation for it  
so that we can iron out any interface details that may be needed.   
For instance, maybe the Kino merge model?

-Grant

On Jan 18, 2007, at 11:45 AM, Marvin Humphrey wrote:

>
> On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
>
>> I think it makes sense to add new functions incrementally, as long  
>> as we try to only extend the API in a way, so that it is  
>> compatible with the long-term goal, as Doug suggested already.  
>> After the payload patch is committed we can work on a more  
>> sophisticated per-doc-metadata solution. Until then we can use  
>> payloads for that use case.
>
> I respectfully disagree with this plan.
>
> APIs are forever, implementations are ephemeral.
>
> By making a public API available for one aspect of the flexible  
> indexing format, we limit our ability to change our minds about how  
> that API should look later when we discover a more harmonious  
> solution.
>
> If we're going to go the incremental route, IMO any API should be  
> marked as experimental, or better, made private so that we can toy  
> with it "in-house" on Lucene's innards, auditioning the changes  
> before finalizing the API.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Marvin Humphrey wrote:
>
> On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:
>
>> I think it makes sense to add new functions incrementally, as long as 
>> we try to only extend the API in a way, so that it is compatible with 
>> the long-term goal, as Doug suggested already. After the payload 
>> patch is committed we can work on a more sophisticated 
>> per-doc-metadata solution. Until then we can use payloads for that 
>> use case.
>
I think my comment was a bit confusing. The main intention of payloads 
is to store per-term metadata. However, with the workaround Nadav 
suggested they can also be used for per-doc metadata, by simply storing 
only one token per document in a special field. This solution works but 
is probably not the nicest. Still, why not use this workaround, as long 
as the payloads patch does not introduce a per-doc-metadata API that 
would have to be removed or changed once we come up with a dedicated 
implementation for that use case? With the payloads patch I tried to 
keep the API changes as simple as possible (changes are only made to 
Token and TermPositions). These changes are under discussion in this 
thread with the intention of making them compatible with the 
flexible-indexing API. I couldn't agree more that the API has to be 
well-planned, and I'd love to see your comments about the API 
extensions I suggested, Marvin.

> I respectfully disagree with this plan.
>
> APIs are forever, implementations are ephemeral.
>
> By making a public API available for one aspect of the flexible 
> indexing format, we limit our ability to change our minds about how 
> that API should look later when we discover a more harmonious solution.
>
> If we're going to go the incremental route, IMO any API should be 
> marked as experimental, or better, made private so that we can toy 
> with it "in-house" on Lucene's innards, auditioning the changes before 
> finalizing the API.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>

I certainly agree with your suggestion to mark new APIs as 
experimental. Then people would know that the API may change in the 
future and could use it in their apps at their own risk. At the same 
time we would benefit from valuable feedback from those users, which 
would help us perfect the API. The idea of a flexible index format is 
already about a year old, I think, and at least in Java-Lucene no 
progress has been made yet. So I'm all for the incremental approach, 
while carefully marking new APIs as experimental.



Re: Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jan 18, 2007, at 8:31 AM, Michael Busch wrote:

> I think it makes sense to add new functions incrementally, as long  
> as we try to only extend the API in a way, so that it is compatible  
> with the long-term goal, as Doug suggested already. After the  
> payload patch is committed we can work on a more sophisticated per- 
> doc-metadata solution. Until then we can use payloads for that use  
> case.

I respectfully disagree with this plan.

APIs are forever, implementations are ephemeral.

By making a public API available for one aspect of the flexible  
indexing format, we limit our ability to change our minds about how  
that API should look later when we discover a more harmonious solution.

If we're going to go the incremental route, IMO any API should be  
marked as experimental, or better, made private so that we can toy  
with it "in-house" on Lucene's innards, auditioning the changes  
before finalizing the API.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Just to put in two cents: the Flexible Indexing thread has also talked 
> about the notion of being able to store arbitrary data at: token, 
> field, doc and Index level.
>
> -Grant
>

Yes, I agree that this should be the long-term goal. The payload 
feature is just a first step in the direction of a flexible index 
format. I think it makes sense to add new functions incrementally, as 
long as we try to extend the API only in ways that are compatible with 
the long-term goal, as Doug already suggested. After the payload patch 
is committed we can work on a more sophisticated per-doc-metadata 
solution. Until then we can use payloads for that use case. Flexible 
indexing is very complex and progress is progress... :-)



Re: Payloads

Posted by Grant Ingersoll <gs...@apache.org>.
Just to put in two cents: the Flexible Indexing thread has also  
talked about the notion of being able to store arbitrary data at:  
token, field, doc and Index level.

-Grant

On Jan 18, 2007, at 11:01 AM, Nadav Har'El wrote:

> On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
>> As you pointed out it is still possible to have per-doc payloads. You
>> need an analyzer which adds just one Token with payload to a specific
>> field for each doc. I understand that this code would be quite ugly on
>> the app side. A more elegant solution might be LUCENE-580. With that
>> patch you are able to add pre-analyzed fields (i.e. TokenStreams) to a
>> Document without having to use an analyzer. You could use a TokenStream
>
> Thanks, this sounds like a good idea.
>
> In fact, I could live with something even simpler: I want to be able
> to create a Field with a single token (with its payload). If I need more
> than one of these tokens with payloads, I can just add several fields with
> the same name (this should work, although the description of LUCENE-580
> suggests that it might have a bug in this area).
>
> I'll add a comment about this use-case to LUCENE-580.
>
> -- 
> Nadav Har'El                        |     Thursday, Jan 18 2007, 28 Tevet 5767
> IBM Haifa Research Lab              |-----------------------------------------
>                                     |If glory comes after death, I'm not in a
> http://nadav.harel.org.il           |hurry. (Latin proverb)
>
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Nadav Har'El wrote:
> On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
>   
>> As you pointed out it is still possible to have per-doc payloads. You 
>> need an analyzer which adds just one Token with payload to a specific 
>> field for each doc. I understand that this code would be quite ugly on 
>> the app side. A more elegant solution might be LUCENE-580. With that 
>> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a 
>> Document without having to use an analyzer. You could use a TokenStream 
>>     
>
> Thanks, this sounds like a good idea.
>
> In fact, I could live with something even simpler: I want to be able
> to create a Field with a single token (with its payload). If I need more
> than one of these tokens with payloads, I can just add several fields with
> the same name (this should work, although the description of LUCENE-580
> suggests that it might have a bug in this area).
>
> I'll add a comment about this use-case to LUCENE-580.
>
>   
Yes, for your use case it would indeed make sense to just add a single 
Token to a field. But there are other use cases that would benefit from 
580, e.g. using UIMA as a parser. UIMA does not work per-field; it 
materializes the tokens of all fields in a CAS. So the indexer can't 
call the parser per field; the parsing has to be done before indexing. 
It would therefore make sense to do the parsing and then add 
TokenStreams for the different fields to the Document that simply 
iterate through the CAS. This is of course also possible by adding 
multiple Field instances containing single Tokens to a Document, but 
the performance would suffer: each Token would be wrapped in a Field 
object and then held in a list in the Document.

So I think being able to add TokenStreams to a Document makes sense.
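As a rough illustration of the single-Token idea discussed here, a self-contained mock-up; these Token/TokenStream classes are deliberately simplified stand-ins for sketching, not the real org.apache.lucene.analysis classes:

```java
// Simplified stand-in for Lucene's Token, with the proposed payload slot.
class Token {
    final String termText;
    byte[] payload;
    Token(String termText) { this.termText = termText; }
    void setPayload(byte[] payload) { this.payload = payload; }
    byte[] getPayload() { return payload; }
}

// Simplified stand-in for Lucene's TokenStream contract:
// next() returns tokens one by one, then null when exhausted.
abstract class TokenStream {
    abstract Token next();
}

// The "very simple class" suggested above: a stream emitting exactly one
// pre-built Token, enough to carry a per-document payload in one field.
class SingleTokenStream extends TokenStream {
    private Token token;
    SingleTokenStream(Token token) { this.token = token; }
    Token next() {
        Token t = token;
        token = null; // emit once, then signal end of stream
        return t;
    }
}

class SingleTokenDemo {
    public static void main(String[] args) {
        Token t = new Token("facets");
        t.setPayload(new byte[] { 3, 7 });
        TokenStream stream = new SingleTokenStream(t);
        System.out.println(stream.next().getPayload().length); // prints 2
        System.out.println(stream.next());                     // prints null
    }
}
```

With a LUCENE-580-style constructor, such a stream could be handed directly to a Document field instead of going through an Analyzer.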




Re: Payloads

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Thu, Jan 18, 2007, Michael Busch wrote about "Re: Payloads":
> As you pointed out it is still possible to have per-doc payloads. You 
> need an analyzer which adds just one Token with payload to a specific 
> field for each doc. I understand that this code would be quite ugly on 
> the app side. A more elegant solution might be LUCENE-580. With that 
> patch you are able to add pre-analyzed fields (i. e. TokenStreams) to a 
> Document without having to use an analyzer. You could use a TokenStream 

Thanks, this sounds like a good idea.

In fact, I could live with something even simpler: I want to be able
to create a Field with a single token (with its payload). If I need more
than one of these tokens with payloads, I can just add several fields with
the same name (this should work, although the description of LUCENE-580
suggests that it might have a bug in this area).

I'll add a comment about this use-case to LUCENE-580.

-- 
Nadav Har'El                        |     Thursday, Jan 18 2007, 28 Tevet 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |If glory comes after death, I'm not in a
http://nadav.harel.org.il           |hurry. (Latin proverb)



Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Nadav Har'El wrote:
>
> Hi Michael,
>
> For some uses (e.g., faceted search), one wants to add a payload to each
> document, not per position for some text field. In the faceted search example,
> we could use payloads to encode the list of facets that each document
> belongs to. For this, with the old API, you could have added a fixed term
> to an untokenized field, and add a payload to that entire untokenized field.
>
> With the new API, it seems doing this is much more difficult and requires
> writing some sort of new Analyzer - one that will do the regular analysis
> that I want for the regular fields, and add the payload to the one specific
> field that lists the facets.
> Am I understanding correctly? Or am I missing a better way to do this?
>
> Thanks,
> Nadav.
>
>   
Hi Nadav,

you are referring to the first design I proposed in
http://www.gossamer-threads.com/lists/lucene/java-dev/37409

In that design I indeed had a method

public Field(String name, String value, Store store, Index index, 
TermVector termVector, Payload payload);

which makes it easy to add a Payload without having to implement an 
Analyzer. The Field API is already complex; that's why I removed this 
method in the new payloads version. And in this thread we're also 
discussing making the Token API more flexible, so that it will be 
easier to add more functionality in the future.

As you pointed out, it is still possible to have per-doc payloads: you 
need an analyzer which adds just one Token with a payload to a specific 
field for each doc. I understand that this code would be quite ugly on 
the app side. A more elegant solution might be LUCENE-580. With that 
patch you are able to add pre-analyzed fields (i.e. TokenStreams) to a 
Document without having to use an analyzer. You could use a TokenStream 
implementation that emits only one Token. That would be a very simple 
class. Another benefit is that whenever we add more functionality to 
Token, we would not have to provide yet another Field constructor. Do 
you think this makes sense? I haven't looked at the LUCENE-580 code, 
and it probably needs to be updated since it is some months old, but I 
like the idea.




Re: Payloads

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Mon, Jan 08, 2007, Nicolas Lalevée wrote about "Re: Payloads":
> I have looked closer at how Lucene indexes, and I realized that the 
> kind of payload handled by Michael's patch is not designed for the 
> facet feature. In this patch, the payloads are in the postings, i.e. 
> in the .tis, .frq, .prx files. A payload at the document level, which 
> would be accessed in a scorer, would be better placed in the 
> TermVector files, which are ordered by doc and not by term.

Well, it's sort of the same thing... Michael's patch allows putting payloads
at each position in a posting list; if you create a posting list which has
just one position per doc, you have basically created a per-doc payload,
ordered by doc (like all posting lists).
And creating this posting list is easy: just pick an arbitrary field name
F and an arbitrary word W, and index the term (F,W) with the payload you want
for each document (basically, the list of categories that this document
belongs to).

I'm not saying this is the best way to do it, and certainly not the cleanest,
but it's just one of the things that payloads enable you to do.
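A self-contained sketch of the payload side of this trick: packing a document's category ids into one byte[] using Lucene-style VInts (7 bits per byte, low bits first, high bit as continuation flag). The FacetPayload helper and the choice of VInt encoding are illustrative assumptions, not part of the patch:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: the single (F,W) posting per document carries this
// byte[] as its payload, encoding the doc's category ids as VInts.
class FacetPayload {

    static byte[] encode(int[] categories) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int c : categories) {
            while ((c & ~0x7F) != 0) {        // more than 7 bits remain
                out.write((c & 0x7F) | 0x80); // low 7 bits + continuation
                c >>>= 7;
            }
            out.write(c);                     // final byte, no continuation
        }
        return out.toByteArray();
    }

    static List<Integer> decode(byte[] payload) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < payload.length) {
            int value = 0, shift = 0;
            byte b;
            do {
                b = payload[i++];
                value |= (b & 0x7F) << shift; // accumulate 7-bit groups
                shift += 7;
            } while ((b & 0x80) != 0);        // continuation bit set?
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] p = encode(new int[] { 3, 200, 70000 });
        System.out.println(decode(p)); // prints [3, 200, 70000]
    }
}
```

Small ids cost one byte each, so a typical facet list stays compact in the .prx file.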

-- 
Nadav Har'El                        |    Wednesday, Jan 10 2007, 20 Tevet 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |Lumber Cartel member #2224.
http://nadav.harel.org.il           |http://lumbercartel.freeyellow.com/



Re: Payloads

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Wednesday, 3 January 2007 at 14:46, Nadav Har'El wrote:
> On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
> >..
> > Some weeks ago I started working on an improved design which I would
> > like to propose now. The new design simplifies the API extensions (the
> > Field API remains unchanged) and uses less disk space in most use cases.
> > Now there are only two classes that get new methods:
> > - Token.setPayload()
> >  Use this method to add arbitrary metadata to a Token in the form of a
> > byte[] array.
> >...
>
> Hi Michael,
>
> For some uses (e.g., faceted search), one wants to add a payload to each
> document, not per position for some text field. In the faceted search
> example, we could use payloads to encode the list of facets that each
> document belongs to. For this, with the old API, you could have added a
> fixed term to an untokenized field, and add a payload to that entire
> untokenized field.
>
> With the new API, it seems doing this is much more difficult and requires
> writing some sort of new Analyzer - one that will do the regular analysis
> that I want for the regular fields, and add the payload to the one specific
> field that lists the facets.
> Am I understanding correctly? Or am I missing a better way to do this?

I have looked closer at how Lucene indexes, and I realized that the 
kind of payload handled by Michael's patch is not designed for the 
facet feature. In this patch, the payloads are in the postings, i.e. in 
the .tis, .frq, .prx files. A payload at the document level, which 
would be accessed in a scorer, would be better placed in the TermVector 
files, which are ordered by doc and not by term.

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com



Re: Payloads

Posted by Nadav Har'El <ny...@math.technion.ac.il>.
On Wed, Dec 20, 2006, Michael Busch wrote about "Payloads":
>..
> Some weeks ago I started working on an improved design which I would 
> like to propose now. The new design simplifies the API extensions (the 
> Field API remains unchanged) and uses less disk space in most use cases. 
> Now there are only two classes that get new methods:
> - Token.setPayload()
>  Use this method to add arbitrary metadata to a Token in the form of a 
> byte[] array.
>...

Hi Michael,

For some uses (e.g., faceted search), one wants to add a payload to each
document, not per position in some text field. In the faceted search example,
we could use payloads to encode the list of facets that each document
belongs to. For this, with the old API, you could have added a fixed term
to an untokenized field and added a payload to that entire untokenized field.

With the new API, it seems doing this is much more difficult and requires
writing some sort of new Analyzer: one that will do the regular analysis
that I want for the regular fields and add the payload to the one specific
field that lists the facets.
Am I understanding correctly? Or am I missing a better way to do this?

Thanks,
Nadav.

-- 
Nadav Har'El                        |    Wednesday, Jan  3 2007, 13 Tevet 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    |If you lost your left arm, your right arm
http://nadav.harel.org.il           |would be left.



Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Nicolas Lalevée wrote:
> On Wednesday, 20 December 2006 at 15:31, Grant Ingersoll wrote:
>   
>> Hi Michael,
>>
>> Have a look at https://issues.apache.org/jira/browse/LUCENE-662
>>
>> I am planning on starting on this soon (I know, I have been saying
>> that for a while, but I really am.)  At any rate, another set of eyes
>> would be good and I would be interested in hearing how your version
>> compares/works with this patch from Nicolas.
>>     
>
> In fact the work I have done is more about the storing part of Lucene than the 
> indexing part. But I think that the mechanism of defining an "IndexFormat" 
> in Java, which I introduced in my patch, will be useful in defining how the 
> payload should be read and written.
>
> About my patch, it needs to be synchronized with the current trunk. I will 
> update it soon. It just needs some cleanup.
>
> Nicolas
>
>   

That's right, Nicolas' patch makes the Lucene *store* more flexible, 
whereas my payloads patch extends the *index* data structures.

Nicolas, I'm aware of your patch but haven't looked at it completely 
yet. I think it would be great if our patches worked together. And with 
Doug's suggestions (see his response) we would be on the right track to 
the flexible indexing format! I would love to work with you to achieve 
this goal. I will look at your patch more closely in the next few days.

- Michael





Re: Payloads

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Wednesday, 20 December 2006 at 15:31, Grant Ingersoll wrote:
> Hi Michael,
>
> Have a look at https://issues.apache.org/jira/browse/LUCENE-662
>
> I am planning on starting on this soon (I know, I have been saying
> that for a while, but I really am.)  At any rate, another set of eyes
> would be good and I would be interested in hearing how your version
> compares/works with this patch from Nicolas.

In fact the work I have done is more about the storing part of Lucene than the 
indexing part. But I think that the mechanism of defining an "IndexFormat" in 
Java, which I introduced in my patch, will be useful in defining how the 
payload should be read and written.

About my patch, it needs to be synchronized with the current trunk. I will 
update it soon. It just needs some cleanup.

Nicolas

>
> -Grant
>
> On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:
> > [...]

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com



Re: Payloads

Posted by Grant Ingersoll <gr...@gmail.com>.
Hi Michael,

Have a look at https://issues.apache.org/jira/browse/LUCENE-662

I am planning on starting on this soon (I know, I have been saying  
that for a while, but I really am.)  At any rate, another set of eyes  
would be good and I would be interested in hearing how your version  
compares/works with this patch from Nicolas.

-Grant

On Dec 20, 2006, at 9:19 AM, Michael Busch wrote:

> [...]

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/





Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Doug,

sorry for the late response. I was on vacation after New Year's... oh, 
btw, Happy New Year to everyone! :-)

Doug Cutting wrote:
> Michael Busch wrote:
>> Yes I could introduce a new class called e.g. PayloadToken that 
>> extends Token (good that it is not final anymore). Not sure if I 
>> understand your mixin interface idea... could you elaborate, please?
>
> I'm not entirely sure I understand it either!
>
> If Payload is an interface that tokens might implement, then some 
> posting implementations would treat tokens that implement Payload 
> specially.  And there might be other interfaces, say, PartOfSpeech, or 
> Emphasis, that tokens might implement, and that might also be handled 
> by some posting implementations.  A particular analyzer could emit 
> tokens that implement several of these interfaces, e.g., both 
> PartOfSpeech and Emphasis.  So these interfaces would be mixins.  But, 
> of course, they'd also have to each be implemented by the Token 
> subclass, since Java doesn't support multi-inheritance of implementation.
>
> I'm not sure this is the best approach: it's just the first one that 
> comes to my mind.  Perhaps instead Tokens should have a list of 
> aspects, each of which implement a TokenAspect interface, or somesuch.
>
> It would be best to have an idea of how we'd like to be able to 
> flexibly add token features like text-emphasis and part-of-speech that 
> are handled specially by posting implementations before we add the 
> Payload feature.  So if the "mixin" approach is not a good idea, then 
> we should try to think of a better one.  If we can't think of a good 
> approach, then we can always punt, add Payloads now, and deal with the 
> consequences later.  But it's worth trying first.  Working through a 
> few examples in pseudo code is perhaps a worthwhile task.
>
> Doug
Having a list of aspects for each Token really seems tempting. Something 
like:

public interface TokenAspect {
  String getAspectName();
}

Token gets new methods:

public void addTokenAspect(TokenAspect aspect);
public TokenAspect getTokenAspect(String name);
public List getTokenAspects();

Then Payload would implement TokenAspect and DocumentWriter (and maybe 
PostingWriter in the future) can check if a Token has that aspect.
And Ning pointed out that this approach is also nice for chaining of 
Analyzers or Filters. Different analyzers can simply add different 
aspects to a Token. The only concern that I have is performance. With 
this approach we would have to initialize a Map for every Token that has 
one or more aspects. Can we afford this, or would indexing speed suffer?

A solution with different mixin interfaces would not have this 
performance overhead. However, chaining of Analyzers is not easily 
possible. E.g., if an Analyzer emits a Token subclass which implements 
Payload and a TokenFilter wants to add another mixin interface, let's say 
PartOfSpeech, then the Filter would have to instantiate another Token 
subclass that implements both Payload and PartOfSpeech and either copy the 
data from the first Token subclass or decorate it. The latter would 
result in rather long and not very nice-looking code for Token subclasses.

So besides the performance overhead I like the aspect approach. But 
maybe there are other solutions we didn't think about yet, or I got you 
wrong Doug and you had something different in mind? Thoughts?
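
To make the aspect idea concrete, here is a minimal, self-contained sketch. 
TokenAspect, Payload, and AspectToken are my own illustration of the names 
proposed in this thread, not actual Lucene code, and the map is created 
lazily as one way to soften the performance concern above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface TokenAspect {
    String getAspectName();
}

// A payload carrying raw bytes, modeled as just one aspect among others.
class Payload implements TokenAspect {
    final byte[] data;
    Payload(byte[] data) { this.data = data; }
    public String getAspectName() { return "payload"; }
}

class AspectToken {
    // Created lazily, so tokens without any aspects pay no allocation cost.
    private Map<String, TokenAspect> aspects;

    public void addTokenAspect(TokenAspect aspect) {
        if (aspects == null) aspects = new HashMap<String, TokenAspect>();
        aspects.put(aspect.getAspectName(), aspect);
    }

    public TokenAspect getTokenAspect(String name) {
        return aspects == null ? null : aspects.get(name);
    }

    public List<TokenAspect> getTokenAspects() {
        return aspects == null
            ? Collections.<TokenAspect>emptyList()
            : new ArrayList<TokenAspect>(aspects.values());
    }
}
```

A DocumentWriter could then simply check getTokenAspect("payload") != null 
to decide whether a posting needs payload encoding.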



Re: Payloads

Posted by Ning Li <ni...@gmail.com>.
On 12/22/06, Doug Cutting <cu...@apache.org> wrote:
> Ning Li wrote:
> > The draft proposal seems to suggest the following (roughly):
> >  A dictionary entry is <Term, FilePointer>.
>
> Perhaps this ought to be <Term, TermInfo>, where TermInfo contains a
> FilePointer and perhaps other information (e.g., frequency data).

Yes. Another example is skip data.

> So the ideal solution would permit both different formats to either
> share a file, or to use their own file(s).

Agree.

> Is it worth the complexity
> this would add to the API?  Or should we jettison the notion of multiple
> posting files per segment?

+1 for a single posting file per segment. I was wondering if we wanted
to provide all the flexibility possible. Things will be much simpler
with a single posting file per segment... :-)

Ning



Re: Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 22, 2006, at 10:36 AM, Doug Cutting wrote:

> The easiest way to do this would be to have separate files in each  
> segment for each PostingFormat.  It would be better if different  
> posting formats could share files, but that's harder to coordinate.

The approach I'm taking in KinoSearch 0.20 is for each field to get  
its own postings file, named _XXX.pYYY, where "_XXX" is the segment  
name and "YYY" is the field number.  That allows a single decoder to  
be pointed at each file.  _XXX.frq and _XXX.prx have been eliminated.

One file per format would also work.

> Alternately we could force all postings into a single file per  
> segment.  That would simplify the APIs, but prohibit certain file  
> formats, like the one Lucene uses currently.

In theory, we could also have one file per property: doc num, freq,  
positions, boost, payload.  The base Posting object would have only  
document number, and each subclass would add a new property, and a  
new file.

I'm not sure that's better, as it precludes optimizations such as the  
even/odd trick currently used in _XXX.frq, but it merits mention as  
the conceptual opposite of having one file per format.

Matchers would be happy with that scheme no matter what.
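
For readers unfamiliar with it, the even/odd trick works like this in the 
.frq file: DocDelta is written shifted left one bit, and an odd value means 
the freq is 1 and is omitted. A small sketch (my own illustration, not 
Lucene's actual code):

```java
public class FreqTrick {
    // Encode one (docDelta, freq) pair as the ints that would be written.
    static int[] encode(int docDelta, int freq) {
        return freq == 1
            ? new int[] { (docDelta << 1) | 1 }   // odd: freq of 1 implied
            : new int[] { docDelta << 1, freq };  // even: explicit freq next
    }

    // Decode the first int; returns { docDelta, freqOrZero } where a zero
    // freq means "read the next int as the freq".
    static int[] decode(int first) {
        return (first & 1) != 0
            ? new int[] { first >>> 1, 1 }
            : new int[] { first >>> 1, 0 };
    }
}
```

Splitting doc nums and freqs into separate per-property files would make 
this kind of cross-property bit packing impossible.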

> So the ideal solution would permit both different formats to either  
> share a file, or to use their own file(s).  Is it worth the  
> complexity this would add to the API?  Or should we jettison the  
> notion of multiple posting files per segment?

Does punting on this issue have any drawbacks other than an unknown  
performance impact?  Can we design the API so that we leave open the  
option of allowing the user to spec multiple files if that proves  
advantageous later?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Posted by Doug Cutting <cu...@apache.org>.
Ning Li wrote:
> The draft proposal seems to suggest the following (roughly):
>  A dictionary entry is <Term, FilePointer>.

Perhaps this ought to be <Term, TermInfo>, where TermInfo contains a 
FilePointer and perhaps other information (e.g., frequency data).

>  A posting entry for a term in a document is <Doc, PostingContent>.
> Classes which implement PostingFormat decide the format of PostingContent.

Yes.

> Is it a good idea to allow PostingFormat to decide whether and how to
> store posting content in multiple files?

Ideally, yes.  The easiest way to do this would be to have separate 
files in each segment for each PostingFormat.  It would be better if 
different posting formats could share files, but that's harder to 
coordinate.

Alternately we could force all postings into a single file per segment. 
  That would simplify the APIs, but prohibit certain file formats, like 
the one Lucene uses currently.

So the ideal solution would permit both different formats to either 
share a file, or to use their own file(s).  Is it worth the complexity 
this would add to the API?  Or should we jettison the notion of multiple 
posting files per segment?

Doug



Re: Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 22, 2006, at 9:17 AM, Ning Li wrote:

> The question is, should the number of
> files used to store postings be customizable?

I think it ought to remain an implementation detail for now.  Using  
multiple files is an optimization of unknown advantage.   
Optimizations have to work very hard to justify being put into public  
APIs because they constrain later refactoring and may in fact prevent  
better optimizations from being implemented later.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Posted by Doug Cutting <cu...@apache.org>.
Ning Li wrote:
> I'm aware of this design. Boolean and phrase queries are an example.
> The point is, there are different queries whose processing will
> (continue to) require different information of terms, especially when
> flexible posting is allowed. The question is, should the number of
> files used to store postings be customizable?

If one needs to search the same data with both unranked boolean 
operators and with ranked proximity, one could use different fields.  If 
that's an acceptable answer, then we might get away with a single 
posting file per segment.  Back-compatibility will be a pain, but we 
probably shouldn't let that drive the design.

Doug



Re: Payloads

Posted by Ning Li <ni...@gmail.com>.
On 12/22/06, Marvin Humphrey <ma...@rectangular.com> wrote:
> Precision would be enhanced if boolean scoring took position into
> account, and could be further enhanced if each position were assigned
> a boost.  For that purpose, having everything in one file is an
> advantage, as it cuts down disk seeks.  Turn off freqs, positions,
> and boosts, and you have only doc_nums, which is ideal for matching
> rather than scoring, yielding a performance gain.

I'm aware of this design. Boolean and phrase queries are an example.
The point is, there are different queries whose processing will
(continue to) require different information of terms, especially when
flexible posting is allowed. The question is, should the number of
files used to store postings be customizable?

Cheers,
Ning



Re: Payloads

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Dec 21, 2006, at 1:58 PM, Ning Li wrote:

> Storing all the posting content, e.g. frequencies and positions, in a
> single file greatly simplifies things. However, this could cause some
> performance penalty. For example, boolean query 'Apache AND Lucene'
> would have to paw through positions. But position indexing for Apache
> and Lucene is necessary to support phrase query '"Apache Lucene"'.

Precision would be enhanced if boolean scoring took position into  
account, and could be further enhanced if each position were assigned  
a boost.  For that purpose, having everything in one file is an  
advantage, as it cuts down disk seeks.  Turn off freqs, positions,  
and boosts, and you have only doc_nums, which is ideal for matching  
rather than scoring, yielding a performance gain.

What's being considered doesn't really speak to the motivation of  
improving existing core functionality, though.  It's more about  
expanding the API to allow new applications.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Payloads

Posted by Ning Li <ni...@gmail.com>.
> 1. Make the index format extensible by adding user-implementable reader
> and writer interfaces for postings.
> ...
> Here's a very rough, sketchy, first draft of a type (1) proposal.

Nice!

In approach 1, what is the best abstraction of a flexible index format
for Lucene?

The draft proposal seems to suggest the following (roughly):
  A dictionary entry is <Term, FilePointer>.
  A posting entry for a term in a document is <Doc, PostingContent>.
Classes which implement PostingFormat decide the format of PostingContent.

Storing all the posting content, e.g. frequencies and positions, in a
single file greatly simplifies things. However, this could cause some
performance penalty. For example, boolean query 'Apache AND Lucene'
would have to paw through positions. But position indexing for Apache
and Lucene is necessary to support phrase query '"Apache Lucene"'.

Is it a good idea to allow PostingFormat to decide whether and how to
store posting content in multiple files?
  A dictionary entry is <Term, <FilePointer>+>.
  A posting entry for a term in a document is <Doc, <PostingContent>+>.
Each PostingContent is stored in a separate file.

Or is a two-file abstraction good enough? It supports all formats in
approaches 2 and 3.
  A dictionary entry is <Term, FreqPointer, ProxPointer>.
  A posting entry for a term in a document is <Doc,
PerDocPostingContent, <Position, PerPositionPostingContent>+>.
Doc and PerDocPostingContent are stored in a .frq file.
Position and PerPositionPostingContent are stored in a .prx file.

What Michael called Payload can be viewed as PerPositionPostingContent here.


> I'm not sure this is the best approach: it's just the first one that
> comes to my mind.  Perhaps instead Tokens should have a list of aspects,
> each of which implement a TokenAspect interface, or somesuch.

Making Token have a list of aspects would work. A particular analyzer
would add certain types of aspects to the tokens it emits. For
example, one analyzer adds a TextEmphasis aspect to a token. Another
analyzer adds a PartOfSpeech aspect to the same token. A particular
posting implementation would expect certain types of aspects. For
example, one may require a TextEmphasis aspect and a PartOfSpeech
aspect. The posting implementation generates posting content (payload)
by encoding the values of both aspects.


Ning



Re: Payloads

Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
> Yes I could introduce a new class called e.g. PayloadToken that extends 
> Token (good that it is not final anymore). Not sure if I understand your 
> mixin interface idea... could you elaborate, please?

I'm not entirely sure I understand it either!

If Payload is an interface that tokens might implement, then some 
posting implementations would treat tokens that implement Payload 
specially.  And there might be other interfaces, say, PartOfSpeech, or 
Emphasis, that tokens might implement, and that might also be handled by 
some posting implementations.  A particular analyzer could emit tokens 
that implement several of these interfaces, e.g., both PartOfSpeech and 
Emphasis.  So these interfaces would be mixins.  But, of course, they'd 
also have to each be implemented by the Token subclass, since Java 
doesn't support multi-inheritance of implementation.

I'm not sure this is the best approach: it's just the first one that 
comes to my mind.  Perhaps instead Tokens should have a list of aspects, 
each of which implement a TokenAspect interface, or somesuch.

It would be best to have an idea of how we'd like to be able to flexibly 
add token features like text-emphasis and part-of-speech that are 
handled specially by posting implementations before we add the Payload 
feature.  So if the "mixin" approach is not a good idea, then we should 
try to think of a better one.  If we can't think of a good approach, 
then we can always punt, add Payloads now, and deal with the 
consequences later.  But it's worth trying first.  Working through a few 
examples in pseudo code is perhaps a worthwhile task.

Doug



Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Doug Cutting wrote:
>
> A reason not to commit something like this now would be if it 
> complicates the effort to make the format extensible.  Each index 
> feature we add now will require back-compatibility in the future, and 
> we should be hesitant to add features that might be difficult to 
> support in the future.
Yes, I agree.

I had the idea of defining Payload as an interface:

public interface Payload {
    void serialize(IndexOutput out) throws IOException;
    int serializedLength();
    void deserialize(IndexInput in, int length) throws IOException;
}

and to have a default implementation ByteArrayPayload that works like my 
current patch. Then people could write their own implementation of 
Payload and define how to serialize the content.
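
As a rough sketch of how such a ByteArrayPayload default implementation 
could look: MiniOutput and MiniInput below are simplified stand-ins for 
Lucene's IndexOutput/IndexInput (which provide writeBytes(byte[], int) and 
readBytes(byte[], int, int)), used here only to keep the example 
self-contained:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Stand-in for IndexOutput, reduced to the one call this sketch needs.
class MiniOutput {
    final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    void writeBytes(byte[] b, int len) { buf.write(b, 0, len); }
}

// Stand-in for IndexInput.
class MiniInput {
    final ByteArrayInputStream buf;
    MiniInput(byte[] data) { buf = new ByteArrayInputStream(data); }
    void readBytes(byte[] b, int offset, int len) { buf.read(b, offset, len); }
}

interface Payload {
    void serialize(MiniOutput out) throws IOException;
    int serializedLength();
    void deserialize(MiniInput in, int length) throws IOException;
}

class ByteArrayPayload implements Payload {
    byte[] data;
    ByteArrayPayload() {}
    ByteArrayPayload(byte[] data) { this.data = data; }

    public void serialize(MiniOutput out) { out.writeBytes(data, data.length); }
    public int serializedLength() { return data.length; }
    public void deserialize(MiniInput in, int length) {
        data = new byte[length];     // length is stored in the posting list
        in.readBytes(data, 0, length);
    }
}
```

Custom Payload implementations would follow the same shape, encoding their 
own fields instead of an opaque byte array.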
>
> For example, this modifies the Token API.  If, long-term, we think 
> that Token should be extensible, then perhaps we should make it 
> extensible now, and add this through a subclass of Token (perhaps a 
> mixin interface that Tokens can implement).
>
Yes I could introduce a new class called e.g. PayloadToken that extends 
Token (good that it is not final anymore). Not sure if I understand your 
mixin interface idea... could you elaborate, please?

> I like the Payload feature, and think it should probably be added.  I 
> just want to make sure that we've first thought a bit about its 
> future-compatibility.
>
> Doug
>




Re: Payloads

Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
> the other hand, if people would like to use payloads soon, I guess due 
> to the backwards compatibility it would be low-risk to add it to the 
> current index format to provide this feature until we can finish the 
> flexible format?

A reason not to commit something like this now would be if it 
complicates the effort to make the format extensible.  Each index 
feature we add now will require back-compatibility in the future, and we 
should be hesitant to add features that might be difficult to support in 
the future.

For example, this modifies the Token API.  If, long-term, we think that 
Token should be extensible, then perhaps we should make it extensible 
now, and add this through a subclass of Token (perhaps a mixin interface 
that Tokens can implement).

I like the Payload feature, and think it should probably be added.  I 
just want to make sure that we've first thought a bit about its 
future-compatibility.

Doug



Re: Payloads

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Saturday, 23 December 2006 at 00:32, Michael Busch wrote:
> Nicolas Lalevée wrote:
> > I have just looked at it. It looks great :)
>
> Thanks! :-)
>
> > But I still don't understand why a new entry in the fieldinfo is
> > needed.
>
> The entry is not really *needed*, but I use it for
> backwards-compatibility and as an optimization for fields that don't
> have any tokens with payloads. For fields with payloads the
> PositionDelta is shifted one bit, so for certain values this means that
> the VInt needs an extra byte. I have an index with about 500k web
> documents and measured, that about 8% of all PositionDelta values would
> need one extra byte in case PositionDelta is shifted. For my index that
> means roughly 4% growth of the total index size. With using a fieldbit,
> payloads can be disabled for a field and therefore the shifting of
> PositionDelta can be avoided. Furthermore, if the payload-fieldbit is
> not enabled, then the index format does not change at all.
>
> > There is the same issue for TermVector. And code like that fails for no
> > obvious reason:
> >
> > Document doc = new Document();
> > doc.add(new Field("f1", "v1", Store.YES, Index.TOKENIZED,
> > TermVector.WITH_POSITIONS_OFFSETS));
> > doc.add(new Field("f1", "v2", Store.YES, Index.TOKENIZED,
> > TermVector.NO));
> >
> > RAMDirectory ram = new RAMDirectory();
> > IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
> > writer.addDocument(doc);
> > writer.close();
> >
> > Knowing a little bit about how Lucene works, I have an idea why this
> > fails, but can we avoid this?
> >
> > Nicolas
>
> In the payload case there is no problem like this one. There is no new
> Field option that can be used to set the fieldbit explicitly. The bit is
> set automatically for a field as soon as the first Token of that field
> that carries a payload is encountered.

OK, thanks for the explanation. I looked more closely at how indexing works, 
and in fact the issue I was talking about was, I think, a bug. I'm filing a 
JIRA issue.

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com



Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Nicolas Lalevée wrote:
>
> I have just looked at it. It looks great :)
>   
Thanks! :-)

> But I still don't understand why a new entry in the fieldinfo is needed. 
>   

The entry is not really *needed*, but I use it for 
backwards-compatibility and as an optimization for fields that don't 
have any tokens with payloads. For fields with payloads the 
PositionDelta is shifted one bit, so for certain values this means that 
the VInt needs an extra byte. I have an index with about 500k web 
documents and measured that about 8% of all PositionDelta values would 
need one extra byte in case PositionDelta is shifted. For my index that 
means roughly 4% growth of the total index size. With using a fieldbit, 
payloads can be disabled for a field and therefore the shifting of 
PositionDelta can be avoided. Furthermore, if the payload-fieldbit is 
not enabled, then the index format does not change at all.
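
To make the cost of the shift concrete: Lucene's VInt format stores 7 bits 
per byte, so any PositionDelta in [64, 127] fits in one byte unshifted but 
needs two bytes once shifted left. A small sketch (the freed low bit 
flagging payload information is an assumption of this illustration, not 
necessarily the patch's exact bit layout):

```java
public class VIntShift {
    // Number of bytes Lucene's writeVInt emits for a non-negative value:
    // 7 payload bits per byte, high bit marks continuation.
    static int vIntLength(int value) {
        int len = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            len++;
        }
        return len;
    }

    public static void main(String[] args) {
        int delta = 100;                  // fits in one VInt byte
        int shifted = (delta << 1) | 1;   // low bit flags "payload present"
        System.out.println(vIntLength(delta));    // 1
        System.out.println(vIntLength(shifted));  // 2
    }
}
```

This is why the per-field bit matters: fields without payloads keep the 
unshifted encoding and pay nothing.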

> There is the same issue for TermVector. And code like that fails for no 
> obvious reason:
>
> Document doc = new Document();
> doc.add(new Field("f1", "v1", Store.YES, Index.TOKENIZED, 
> TermVector.WITH_POSITIONS_OFFSETS));
> doc.add(new Field("f1", "v2", Store.YES, Index.TOKENIZED, TermVector.NO));
>
> RAMDirectory ram = new RAMDirectory();
> IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
> writer.addDocument(doc);
> writer.close();
>
> Knowing a little bit about how Lucene works, I have an idea why this fails, 
> but can we avoid this?
>
> Nicolas
>   
In the payload case there is no problem like this one. There is no new 
Field option that can be used to set the fieldbit explicitly. The bit is 
set automatically for a field as soon as the first Token of that field 
that carries a payload is encountered.
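The lazy bit-setting described above can be sketched with a toy class 
(hypothetical names and structure, not Lucene's actual FieldInfos API):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch: the per-field "store payloads" flag starts false
// and flips to true only when the first token carrying a payload is
// seen for that field. Once set, it stays set.
public class FieldInfosSketch {
    private final Map<String, Boolean> storePayloads = new HashMap<>();

    public void addToken(String field, byte[] payload) {
        storePayloads.putIfAbsent(field, false);
        if (payload != null) {
            storePayloads.put(field, true); // sticky: stays enabled
        }
    }

    public boolean hasPayloads(String field) {
        return storePayloads.getOrDefault(field, false);
    }

    public static void main(String[] args) {
        FieldInfosSketch infos = new FieldInfosSketch();
        infos.addToken("body", null);
        System.out.println(infos.hasPayloads("body")); // false
        infos.addToken("body", new byte[]{42});
        System.out.println(infos.hasPayloads("body")); // true
    }
}
```

With this behavior a field that never receives a payload keeps the old 
on-disk encoding unchanged, which is what makes the patch 
backwards-compatible.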

Michael



Re: Payloads

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Wednesday 20 December 2006 at 20:42, Michael Busch wrote:
> Doug Cutting wrote:
> > Michael,
> >
> > This sounds like very good work.  The back-compatibility of this
> > approach is great.  But we should also consider this in the broader
> > context of index-format flexibility.
> >
> > Three general approaches have been proposed.  They are not exclusive.
> >
> > 1. Make the index format extensible by adding user-implementable
> > reader and writer interfaces for postings.
> >
> > 2. Add a richer set of standard index formats, including things like
> > compressed fields, no-positions, per-position weights, etc.
> >
> > 3. Provide hooks for including arbitrary binary data.
> >
> > Your proposal is of type (3).  LUCENE-662 is a (1).  Approaches of
> > type (2) are most friendly to non-Java implementations, since the
> > semantics of each variation are well-defined.
> >
> > I don't see a reason not to pursue all three, but in a coordinated
> > manner.  In particular, we don't want to add a feature of type (3)
> > that would make it harder to add type (1) APIs.  It would thus be best
> > if we had a rough specification of type (1) and type (2).  A proposal
> > of type (2) is at:
> >
> > http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
> >
> > But I'm not sure that we yet have any proposed designs for an
> > extensible posting API.  (Is anyone aware of one?)  This payload
> > proposal can probably be easily incorporated into such a design, but I
> > would have more confidence if we had one.  I guess I should attempt one!
>
> Doug,
>
> thanks for your detailed response. I'm aware that the long-term goal is
> the flexible index format and I see the payloads patch only as a part of
> it. The patch focuses on extending the index data structures and on a
> possible payload encoding. It doesn't focus yet on a flexible API, it
> only offers the two mentioned low-level methods to add and retrieve byte
> arrays.
>
> I would love to work with you guys on the flexible index format and to
> combine my patch with your suggestions and the patch from Nicolas! I
> will look at your proposal and Nicolas' patch tomorrow (have to go now).
> I just attached my patch (LUCENE-755), so if you get a chance you could
> take a look at it.

I have just looked at it. It looks great :)
But I still don't understand why a new entry in the fieldinfo is needed. 
The same applies to TermVector. And code like this fails for no obvious 
reason:

Document doc = new Document();
doc.add(new Field("f1", "v1", Store.YES, Index.TOKENIZED, 
TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field("f1", "v2", Store.YES, Index.TOKENIZED, TermVector.NO));

RAMDirectory ram = new RAMDirectory();
IndexWriter writer = new IndexWriter(ram, new StandardAnalyzer(), true);
writer.addDocument(doc);
writer.close();

Knowing a little bit about how Lucene works, I have an idea why this fails, 
but can we avoid it?

Nicolas

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com


Re: Payloads

Posted by Michael Busch <bu...@gmail.com>.
Doug Cutting wrote:
> Michael,
>
> This sounds like very good work.  The back-compatibility of this 
> approach is great.  But we should also consider this in the broader 
> context of index-format flexibility.
>
> Three general approaches have been proposed.  They are not exclusive.
>
> 1. Make the index format extensible by adding user-implementable 
> reader and writer interfaces for postings.
>
> 2. Add a richer set of standard index formats, including things like 
> compressed fields, no-positions, per-position weights, etc.
>
> 3. Provide hooks for including arbitrary binary data.
>
> Your proposal is of type (3).  LUCENE-662 is a (1).  Approaches of 
> type (2) are most friendly to non-Java implementations, since the 
> semantics of each variation are well-defined.
>
> I don't see a reason not to pursue all three, but in a coordinated 
> manner.  In particular, we don't want to add a feature of type (3) 
> that would make it harder to add type (1) APIs.  It would thus be best 
> if we had a rough specification of type (1) and type (2).  A proposal 
> of type (2) is at:
>
> http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
>
> But I'm not sure that we yet have any proposed designs for an 
> extensible posting API.  (Is anyone aware of one?)  This payload 
> proposal can probably be easily incorporated into such a design, but I 
> would have more confidence if we had one.  I guess I should attempt one!
>

Doug,

thanks for your detailed response. I'm aware that the long-term goal is 
the flexible index format and I see the payloads patch only as a part of 
it. The patch focuses on extending the index data structures and on a 
possible payload encoding. It doesn't focus yet on a flexible API, it 
only offers the two mentioned low-level methods to add and retrieve byte 
arrays.

I would love to work with you guys on the flexible index format and to 
combine my patch with your suggestions and the patch from Nicolas! I 
will look at your proposal and Nicolas' patch tomorrow (have to go now). 
I just attached my patch (LUCENE-755), so if you get a chance you could 
take a look at it.

Maybe it would make sense now to follow the suggestion you made earlier 
this year and start a new package to work on the new index format? On 
the other hand, if people would like to use payloads soon, then thanks 
to the backwards compatibility it would be low risk to add them to the 
current index format to provide this feature until we can finish the 
flexible format?

- Michael



Re: Payloads

Posted by Doug Cutting <cu...@apache.org>.
Michael Busch wrote:
 > Some weeks ago I started working on an improved design which I would
 > like to propose now. The new design simplifies the API extensions (the
 > Field API remains unchanged) and uses less disk space in most use cases.
 > Now there are only two classes that get new methods:
 > - Token.setPayload()
 >  Use this method to add arbitrary metadata to a Token in the form of a
 > byte[] array.
 >
 > - TermPositions.getPayload()
 >  Use this method to retrieve the payload of a term occurrence.

Michael,

This sounds like very good work.  The back-compatibility of this 
approach is great.  But we should also consider this in the broader 
context of index-format flexibility.

Three general approaches have been proposed.  They are not exclusive.

1. Make the index format extensible by adding user-implementable reader 
and writer interfaces for postings.

2. Add a richer set of standard index formats, including things like 
compressed fields, no-positions, per-position weights, etc.

3. Provide hooks for including arbitrary binary data.

Your proposal is of type (3).  LUCENE-662 is a (1).  Approaches of type 
(2) are most friendly to non-Java implementations, since the semantics 
of each variation are well-defined.

I don't see a reason not to pursue all three, but in a coordinated 
manner.  In particular, we don't want to add a feature of type (3) that 
would make it harder to add type (1) APIs.  It would thus be best if we 
had a rough specification of type (1) and type (2).  A proposal of type 
(2) is at:

http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

But I'm not sure that we yet have any proposed designs for an extensible 
posting API.  (Is anyone aware of one?)  This payload proposal can 
probably be easily incorporated into such a design, but I would have 
more confidence if we had one.  I guess I should attempt one!

Here's a very rough, sketchy, first draft of a type (1) proposal.

IndexWriter#setPostingFormat(PostingFormat)
IndexWriter#setDictionaryFormat(DictionaryFormat)

interface PostingFormat {
   PostingInverter getInverter(FieldInfo, Segment, Directory);
   PostingReader getReader(FieldInfo, Segment, Directory);
   PostingWriter getWriter(FieldInfo, Segment, Directory);
}

interface PostingPointer {} ???

interface DictionaryFormat {
   DictionaryWriter getWriter(FieldInfo, Segment, Directory);
   DictionaryReader getReader(FieldInfo, Segment, Directory);
}

IndexWriter#addDocument(Document doc)
   loop over doc.fields
     call PostingFormat#getPostingInverter(FieldInfo, Segment, Directory)
       to create a PostingInverter
     if field is analyzed
       call Analyzer#tokenStream() to get TokenStream
       loop over tokens
         PostingInverter#collectToken(Token, Field);
     else
       PostingInverter#collectToken(Field);

   call DictionaryFormat#getWriter(FieldInfo, Segment, Directory)
     to create a DictionaryWriter
   Iterator<Term> terms = PostingInverter#getTerms();
   loop over terms
     PostingPointer p = PostingInverter#getPointer();
     PostingInverter#write(term);
     DictionaryWriter#addTerm(term, p);

IndexMerger#mergePostings()
   call DictionaryFormat#getReader(FieldInfo, Segment, Directory)
     to create a DictionaryReader
   loop over fields
     call PostingFormat#getWriter(FieldInfo, Segment, Directory)
       to create a PostingWriter
     loop over segments
       call PostingFormat#getReader(FieldInfo, Segment, Directory)
	to create a PostingReader
       loop over dictionary.terms
         PostingPointer p = PostingWriter#getPointer();
         DictionaryWriter#addTerm(Term, p);
	loop over docs
	  int doc = PostingReader#readPostings();
           PostingWriter#writePostings(doc);

So the question is, does something like this conflict with your 
proposal?  Should Term and/or Token be extensible?  If so, what should 
their interfaces look like?

Doug
