You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2007/03/12 01:35:09 UTC

[jira] Updated: (LUCENE-755) Payloads

     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-755:
---------------------------------

    Attachment: payloads.patch

I'm attaching the new patch with the following changes:
- applies cleanly on the current trunk
- fixed a bug in FSDirectory which affected payloads with length greater than 1024 bytes and extended testcase TestPayloads to test this fix
- added the following warning comments to the new APIs:

  *  Warning: The status of the Payloads feature is experimental. The APIs
  *  introduced here might change in the future and will not be supported anymore
  *  in such a case. If you want to use this feature in a production environment
  *  you should wait for an official release.


Another comment about an API change: In BufferedIndexOutput I changed the method 
  protected abstract void flushBuffer(byte[] b, int len) throws IOException;
to
  protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;

which means that subclasses of BufferedIndexOutput won't compile anymore. I made this change for performance reasons: If a payload is longer than 1024 bytes (standard buffer size of BufferedIndexOutput) then it can be flushed efficiently to disk without having to perform array copies. 

Is this API change acceptable? Users who have custom subclasses of BufferedIndexOutput would have to change their classes in order to work.

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications, that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> ------------------------------   
> The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    * 
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    * 
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose. 
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch indroduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes()-method without an offset argument. 
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field, this is done automatically:
>    * The DocumentWriter enables payloads for a field, if one ore more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i. e. the bit is false, then the payload has the same length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition() then only the position and the payload length is loaded from the ProxFile. If the user calls getPayload() then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
>   
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field. 
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt 
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first   occurrence in this document).
>   
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval th  document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length-compression since we only have to store the length zero for the empty payloads once per term.
> All unit tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-755) Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> I haven't looked at your latest patch yet, so this is just guesswork, 
> but was thinking in TermScorer, around line 75 or so, we could add:
>
> score *= similarity.scorePayload(payloadBuffer);
>
TermScorer currently doesn't iterate over the positions. It uses a 
buffer to load 32 doc/freq pairs from TermDocs using the read() method. 
In order to use per-term boosts you would have to change the TermScorer 
to not use a buffer anymore and use TermDocs.next() instead. Then you 
can iterate over the positions and get the payloads. This is a 
significant change to TermScorer and performance would probably suffer 
for indexes that don't have payloads. I actually admit that I had the 
same in mind (I mentioned that in LUCENE-761), but after looking closer 
at TermScorer I changed my mind here.

I believe the better option is to create a new scorer subclass like 
WeightedTermScorer which should be used if payloads containing per-term 
boosts are stored in the index.

- Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-755) Payloads

Posted by Grant Ingersoll <gs...@apache.org>.
I haven't looked at your latest patch yet, so this is just guesswork,  
but was thinking in TermScorer, around line 75 or so, we could add:

score *= similarity.scorePayload(payloadBuffer);

The default Similarity would just return 1.  This would allow people  
to incorporate a score based on what is in the payload, per their  
application needs and would be completely backward-compatible.  We  
may even want to postpone the decoding of the payload to inside the  
Similarity for performance reasons, but that should be tested, since  
that could be cause for confusion for people overriding Similarity.   
I will have to look at some of the other Scorers to see if there is a  
way to incorporate into some of them.

None of this would prevent using payloads for other things as well,  
such as the XPath query example.

Doing this would involve switching over to using TermPositions like  
we talked about.  Like I said, I will take a look at it and see if  
anything resonates.

-Grant

On Mar 11, 2007, at 11:26 PM, Michael Busch wrote:

> Grant Ingersoll wrote:
>> Cool.  I will try and take a look at it tomorrow.  Since we have  
>> the lazy SegTermPos thing in now, we should be able to integrate  
>> this into scoring via the Similarity and merge TermDocs and  
>> TermPositions like you suggested.
>>
>> If I can get the Scoring piece in and people are fine w/ the  
>> flushBuffer change then hopefully we can get this in this week.  I  
>> will try to post a patch that includes your patch and the scoring  
>> integration by tomorrow or Tuesday if that is fine with you.
>>
> I'm not completely sure how you want to integrate this in the  
> Similarity class. Payloads can not only be used for scoring.  
> Consider for example XML search: the payloads can be used here to  
> store in which element a term occurs. During search (e. g. an XPath  
> query) the payloads would be used then to find hits, not for scoring.
>
> On the other hand if you want to store e. g. per-postions boosts in  
> the payloads, you could use the norm en/decoding methods that are  
> already in Similarity. You could use the following code in a  
> TokenStream:
>  byte[] payload = new byte[1];
>  payload[0] = Similari.encodeNorm(boost);
>  token.setPayload(payload);
>
> and in a scorer you could get the boost then with:
>  termPositions.getPayload(payloadBuffer);
>  float boost = Similarity.decodeNorm(payloadBuffer[0]);
>
> But maybe you have something different in mind? Could you  
> elaborate, please?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-755) Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Cool.  I will try and take a look at it tomorrow.  Since we have the 
> lazy SegTermPos thing in now, we should be able to integrate this into 
> scoring via the Similarity and merge TermDocs and TermPositions like 
> you suggested.
>
> If I can get the Scoring piece in and people are fine w/ the 
> flushBuffer change then hopefully we can get this in this week.  I 
> will try to post a patch that includes your patch and the scoring 
> integration by tomorrow or Tuesday if that is fine with you.
>
I'm not completely sure how you want to integrate this in the Similarity 
class. Payloads can not only be used for scoring. Consider for example 
XML search: the payloads can be used here to store in which element a 
term occurs. During search (e. g. an XPath query) the payloads would be 
used then to find hits, not for scoring.

On the other hand if you want to store e. g. per-postions boosts in the 
payloads, you could use the norm en/decoding methods that are already in 
Similarity. You could use the following code in a TokenStream:
  byte[] payload = new byte[1];
  payload[0] = Similari.encodeNorm(boost);
  token.setPayload(payload);

and in a scorer you could get the boost then with:
  termPositions.getPayload(payloadBuffer);
  float boost = Similarity.decodeNorm(payloadBuffer[0]);

But maybe you have something different in mind? Could you elaborate, please?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-755) Payloads

Posted by Grant Ingersoll <gr...@gmail.com>.
Cool.  I will try and take a look at it tomorrow.  Since we have the  
lazy SegTermPos thing in now, we should be able to integrate this  
into scoring via the Similarity and merge TermDocs and TermPositions  
like you suggested.

If I can get the Scoring piece in and people are fine w/ the  
flushBuffer change then hopefully we can get this in this week.  I  
will try to post a patch that includes your patch and the scoring  
integration by tomorrow or Tuesday if that is fine with you.

-Grant

On Mar 11, 2007, at 8:35 PM, Michael Busch (JIRA) wrote:

>
>      [ https://issues.apache.org/jira/browse/LUCENE-755? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Michael Busch updated LUCENE-755:
> ---------------------------------
>
>     Attachment: payloads.patch
>
> I'm attaching the new patch with the following changes:
> - applies cleanly on the current trunk
> - fixed a bug in FSDirectory which affected payloads with length  
> greater than 1024 bytes and extended testcase TestPayloads to test  
> this fix
> - added the following warning comments to the new APIs:
>
>   *  Warning: The status of the Payloads feature is experimental.  
> The APIs
>   *  introduced here might change in the future and will not be  
> supported anymore
>   *  in such a case. If you want to use this feature in a  
> production environment
>   *  you should wait for an official release.
>
>
> Another comment about an API change: In BufferedIndexOutput I  
> changed the method
>   protected abstract void flushBuffer(byte[] b, int len) throws  
> IOException;
> to
>   protected abstract void flushBuffer(byte[] b, int offset, int  
> len) throws IOException;
>
> which means that subclasses of BufferedIndexOutput won't compile  
> anymore. I made this change for performance reasons: If a payload  
> is longer than 1024 bytes (standard buffer size of  
> BufferedIndexOutput) then it can be flushed efficiently to disk  
> without having to perform array copies.
>
> Is this API change acceptable? Users who have custom subclasses of  
> BufferedIndexOutput would have to change their classes in order to  
> work.
>
>> Payloads
>> --------
>>
>>                 Key: LUCENE-755
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: Index
>>            Reporter: Michael Busch
>>         Assigned To: Michael Busch
>>         Attachments: payload.patch, payloads.patch, payloads.patch
>>
>>
>> This patch adds the possibility to store arbitrary metadata  
>> (payloads) together with each position of a term in its posting  
>> lists. A while ago this was discussed on the dev mailing list,  
>> where I proposed an initial design. This patch has a much improved  
>> design with modifications, that make this new feature easier to  
>> use and more efficient.
>> A payload is an array of bytes that can be stored inline in the  
>> ProxFile (.prx). Therefore this patch provides low-level APIs to  
>> simply store and retrieve byte arrays in the posting lists in an  
>> efficient way.
>> API and Usage
>> ------------------------------
>> The new class index.Payload is basically just a wrapper around a  
>> byte[] array together with int variables for offset and length. So  
>> a user does not have to create a byte array for every payload, but  
>> can rather allocate one array for all payloads of a document and  
>> provide offset and length information. This reduces object  
>> allocations on the application side.
>> In order to store payloads in the posting lists one has to provide  
>> a TokenStream or TokenFilter that produces Tokens with payloads. I  
>> added the following two methods to the Token class:
>>   /** Sets this Token's payload. */
>>   public void setPayload(Payload payload);
>>
>>   /** Returns this Token's payload. */
>>   public Payload getPayload();
>> In order to retrieve the data from the index the interface  
>> TermPositions now offers two new methods:
>>   /** Returns the payload length of the current term position.
>>    *  This is invalid until {@link #nextPosition()} is called for
>>    *  the first time.
>>    *
>>    * @return length of the current payload in number of bytes
>>    */
>>   int getPayloadLength();
>>
>>   /** Returns the payload data of the current term position.
>>    * This is invalid until {@link #nextPosition()} is called for
>>    * the first time.
>>    * This method must not be called more than once after each call
>>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>>    * so if the payload data for the current position is not needed,
>>    * this method may not be called at all for performance reasons.
>>    *
>>    * @param data the array into which the data of this payload is  
>> to be
>>    *             stored, if it is big enough; otherwise, a new byte 
>> [] array
>>    *             is allocated for this purpose.
>>    * @param offset the offset in the array into which the data of  
>> this payload
>>    *               is to be stored.
>>    * @return a byte[] array containing the data of this payload
>>    * @throws IOException
>>    */
>>   byte[] getPayload(byte[] data, int offset) throws IOException;
>> Furthermore, this patch indroduces the new method  
>> IndexOutput.writeBytes(byte[] b, int offset, int length). So far  
>> there was only a writeBytes()-method without an offset argument.
>> Implementation details
>> ------------------------------
>> - One field bit in FieldInfos is used to indicate if payloads are  
>> enabled for a field. The user does not have to enable payloads for  
>> a field, this is done automatically:
>>    * The DocumentWriter enables payloads for a field, if one ore  
>> more Tokens carry payloads.
>>    * The SegmentMerger enables payloads for a field during a  
>> merge, if payloads are enabled for that field in one or more  
>> segments.
>> - Backwards compatible: If payloads are not used, then the formats  
>> of the ProxFile and FreqFile don't change
>> - Payloads are stored inline in the posting list of a term in the  
>> ProxFile. A payload of a term occurrence is stored right after its  
>> PositionDelta.
>> - Same-length compression: If payloads are enabled for a field,  
>> then the PositionDelta is shifted one bit. The lowest bit is used  
>> to indicate whether the length of the following payload is stored  
>> explicitly. If not, i. e. the bit is false, then the payload has  
>> the same length as the payload of the previous term occurrence.
>> - In order to support skipping on the ProxFile the length of the  
>> payload at every skip point has to be known. Therefore the payload  
>> length is also stored in the skip list located in the FreqFile.  
>> Here the same-length compression is also used: The lowest bit of  
>> DocSkip is used to indicate if the payload length is stored for a  
>> SkipDatum or if the length is the same as in the last SkipDatum.
>> - Payloads are loaded lazily. When a user calls  
>> TermPositions.nextPosition() then only the position and the  
>> payload length is loaded from the ProxFile. If the user calls  
>> getPayload() then the payload is actually loaded. If getPayload()  
>> is not called before nextPosition() is called again, then the  
>> payload data is just skipped.
>>
>> Changes of file formats
>> ------------------------------
>> - FieldInfos (.fnm)
>> The format of the .fnm file does not change. The only change is  
>> the use of the sixth lowest-order bit (0x20) of the FieldBits. If  
>> this bit is set, then payloads are enabled for the corresponding  
>> field.
>> - ProxFile (.prx)
>> ProxFile (.prx) -->  <TermPositions>^TermCount
>> TermPositions   --> <Positions>^DocFreq
>> Positions       --> <PositionDelta, Payload?>^Freq
>> Payload         --> <PayloadLength?, PayloadData>
>> PositionDelta   --> VInt
>> PayloadLength   --> VInt
>> PayloadData     --> byte^PayloadLength
>> For payloads disabled (unchanged):
>> PositionDelta is the difference between the position of the  
>> current occurrence in the document and the previous occurrence (or  
>> zero, if this is the first   occurrence in this document).
>>
>> For Payloads enabled:
>> PositionDelta/2 is the difference between the position of the  
>> current occurrence in the document and the previous occurrence. If  
>> PositionDelta is odd, then PayloadLength is stored. If  
>> PositionDelta is even, then the length of the current payload  
>> equals the length of the previous payload and thus PayloadLength  
>> is omitted.
>> - FreqFile (.frq)
>> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
>> PayloadLength --> VInt
>> For payloads disabled (unchanged):
>> DocSkip records the document number before every SkipInterval th  
>> document in TermFreqs. Document numbers are represented as  
>> differences from the previous value in the sequence.
>> For payloads enabled:
>> DocSkip/2 records the document number before every SkipInterval  
>> th  document in TermFreqs. If DocSkip is odd, then PayloadLength  
>> follows. If DocSkip is even, then the length of the payload at the  
>> current skip point equals the length of the payload at the last  
>> skip point and thus PayloadLength is omitted.
>> This encoding is space efficient for different use cases:
>>    * If only some fields of an index have payloads, then there's  
>> no space overhead for the fields with payloads disabled.
>>    * If the payloads of consecutive term positions have the same  
>> length, then the length only has to be stored once for every term.  
>> This should be a common case, because users probably use the same  
>> format for all payloads.
>>    * If only a few terms of a field have payloads, then we don't  
>> waste much space because we benefit again from the same-length- 
>> compression since we only have to store the length zero for the  
>> empty payloads once per term.
>> All unit tests pass.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org