Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2006/12/20 20:10:21 UTC

[jira] Created: (LUCENE-755) Payloads

Payloads
--------

                 Key: LUCENE-755
                 URL: http://issues.apache.org/jira/browse/LUCENE-755
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Index
            Reporter: Michael Busch
         Assigned To: Michael Busch


This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications that make this new feature easier to use and more efficient.

A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Accordingly, this patch provides low-level APIs to store and retrieve byte arrays in the posting lists in an efficient way.

API and Usage
------------------------------   
The new class index.Payload is basically just a wrapper around a byte[] together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can instead allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
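
For illustration, here is a minimal sketch of that usage pattern. It assumes a Payload(byte[] data, int offset, int length) constructor and a hypothetical one-byte piece of metadata per position; class and method names are illustrative only:

  import org.apache.lucene.index.Payload;

  /** Illustrative only: one shared buffer backs all payloads of a document. */
  class SharedBufferExample {
    static Payload[] buildPayloads(byte[] flags) {
      byte[] buffer = new byte[flags.length];      // one allocation for the whole document
      Payload[] payloads = new Payload[flags.length];
      for (int i = 0; i < flags.length; i++) {
        buffer[i] = flags[i];                      // hypothetical one-byte metadata per position
        payloads[i] = new Payload(buffer, i, 1);   // wraps a slice, no per-payload byte[]
      }
      return payloads;
    }
  }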

In order to store payloads in the posting lists, one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
  /** Sets this Token's payload. */
  public void setPayload(Payload payload);
  
  /** Returns this Token's payload. */
  public Payload getPayload();
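
For example, a simple TokenFilter could attach a payload to every token it passes through. This is only a sketch against the current TokenStream API (Token next()); the class name and the marker byte are hypothetical:

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.index.Payload;

  /** Attaches a fixed one-byte payload to every token (illustrative only). */
  public class MarkerPayloadFilter extends TokenFilter {
    private final byte marker;

    public MarkerPayloadFilter(TokenStream input, byte marker) {
      super(input);
      this.marker = marker;
    }

    public Token next() throws IOException {
      Token token = input.next();
      if (token != null) {
        token.setPayload(new Payload(new byte[] { marker }, 0, 1));
      }
      return token;
    }
  }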

In order to retrieve the data from the index, the interface TermPositions now offers two new methods:
  /** Returns the payload length of the current term position.
   *  This is invalid until {@link #nextPosition()} is called for
   *  the first time.
   * 
   * @return length of the current payload in number of bytes
   */
  int getPayloadLength();
  
  /** Returns the payload data of the current term position.
   * This is invalid until {@link #nextPosition()} is called for
   * the first time.
   * This method must not be called more than once after each call
   * of {@link #nextPosition()}. However, payloads are loaded lazily,
   * so if the payload data for the current position is not needed,
   * this method may not be called at all for performance reasons.
   * 
   * @param data the array into which the data of this payload is to be
   *             stored, if it is big enough; otherwise, a new byte[] array
   *             is allocated for this purpose. 
   * @param offset the offset in the array into which the data of this payload
   *               is to be stored.
   * @return a byte[] array containing the data of this payload
   * @throws IOException
   */
  byte[] getPayload(byte[] data, int offset) throws IOException;
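
A reading-side sketch: the reader setup is standard Lucene, but the exact payload handling shown here is illustrative only and the class name is hypothetical:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermPositions;

  /** Iterates the postings of one term and reads each position's payload lazily. */
  public class PayloadDumper {
    public static void dump(IndexReader reader, Term term) throws IOException {
      TermPositions tp = reader.termPositions(term);
      byte[] buffer = new byte[16];                  // reused across positions
      while (tp.next()) {
        int freq = tp.freq();
        for (int i = 0; i < freq; i++) {
          int position = tp.nextPosition();          // must be called first
          int length = tp.getPayloadLength();        // valid after nextPosition()
          if (length > 0) {
            byte[] data = tp.getPayload(buffer, 0);  // may return a new array if buffer is too small
            System.out.println("doc=" + tp.doc() + " pos=" + position
                + " payloadLength=" + length + " firstByte=" + data[0]);
          }
        }
      }
      tp.close();
    }
  }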

Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes() method without an offset argument.

Implementation details
------------------------------
- One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field; this is done automatically:
   * The DocumentWriter enables payloads for a field, if one or more Tokens carry payloads.
   * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
- Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change.
- Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
- Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i.e. the bit is false, then the payload has the same length as the payload of the previous term occurrence. (See the write-side sketch after this list.)
- In order to support skipping on the ProxFile, the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
- Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only the position and the payload length are loaded from the ProxFile. If the user calls getPayload() then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
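
To make the same-length compression concrete, here is a hedged write-side sketch for one position of a payload-enabled field (class, field, and variable names are illustrative, not taken from the patch):

  import java.io.IOException;
  import org.apache.lucene.store.IndexOutput;

  /** Illustrative encoding of one position entry for a payload-enabled field. */
  class PositionWriterSketch {
    private int lastPosition = 0;
    private int lastPayloadLength = -1;  // forces an explicit length for the first payload

    void writePosition(IndexOutput out, int position,
                       byte[] payload, int offset, int payloadLength) throws IOException {
      int delta = position - lastPosition;
      lastPosition = position;
      if (payloadLength == lastPayloadLength) {
        out.writeVInt(delta << 1);            // low bit 0: same length as previous payload
      } else {
        out.writeVInt((delta << 1) | 1);      // low bit 1: explicit PayloadLength follows
        out.writeVInt(payloadLength);
        lastPayloadLength = payloadLength;
      }
      if (payloadLength > 0) {
        out.writeBytes(payload, offset, payloadLength);  // uses the new 3-arg writeBytes
      }
    }
  }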
  
Changes of file formats
------------------------------
- FieldInfos (.fnm)
The format of the .fnm file itself does not change; the only difference is that the sixth lowest-order bit (0x20) of the FieldBits is now used. If this bit is set, then payloads are enabled for the corresponding field.

- ProxFile (.prx)
ProxFile (.prx) -->  <TermPositions>^TermCount
TermPositions   --> <Positions>^DocFreq
Positions       --> <PositionDelta, Payload?>^Freq
Payload         --> <PayloadLength?, PayloadData>
PositionDelta   --> VInt
PayloadLength   --> VInt 
PayloadData     --> byte^PayloadLength

For payloads disabled (unchanged):
PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
  
For payloads enabled:
PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
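
For example (hypothetical numbers): if the previous occurrence was at position 4 and the current one is at position 7 with a 3-byte payload whose length differs from the previous one, then PositionDelta = (7 - 4) * 2 + 1 = 7 and PayloadLength = 3 is written. If the next occurrence at position 9 also carries a 3-byte payload, then PositionDelta = (9 - 7) * 2 = 4 and PayloadLength is omitted.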

- FreqFile (.frq)

SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
PayloadLength --> VInt

For payloads disabled (unchanged):
DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.

For payloads enabled:
DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
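
For example (hypothetical numbers): if the previous skip entry pointed at document 3 and the current one points at document 35, and the payload length at this skip point differs from the one at the previous skip point, then DocSkip = (35 - 3) * 2 + 1 = 65 and PayloadLength follows. If the length is unchanged, DocSkip = (35 - 3) * 2 = 64 and PayloadLength is omitted.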


This encoding is space efficient for different use cases:
   * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
   * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
   * If only a few terms of a field have payloads, then we don't waste much space, because we benefit again from the same-length compression: we only have to store the length zero once per term for the positions without payloads.

All unit tests pass.


[jira] Updated: (LUCENE-755) Payloads

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-755:
---------------------------------

    Attachment: payloads.patch

Another one! (previous version didn't apply cleanly anymore after committing LUCENE-818, Mike is keeping me busy ;-) ).

Grant, did you get a chance to review the patch? I would like to go ahead and commit it soon with the API warnings if nobody objects...



[jira] Updated: (LUCENE-755) Payloads

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-755?page=all ]

Michael Busch updated LUCENE-755:
---------------------------------

    Attachment: payloads.patch



[jira] Resolved: (LUCENE-755) Payloads

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch resolved LUCENE-755.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.2

I just committed this. Payload is serializable now.



[jira] Commented: (LUCENE-755) Payloads

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479781 ] 

Grant Ingersoll commented on LUCENE-755:
----------------------------------------

Nicolas,

Are you implying that your patch fits in with 662 (and needs to be applied after it), or is it just in the style of 662 but not dependent on it?

Thanks,
Grant



[jira] Commented: (LUCENE-755) Payloads

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481014 ] 

Michael Busch commented on LUCENE-755:
--------------------------------------

Grant Ingersoll commented on LUCENE-755:
----------------------------------------

> OK, I've applied the patch.  All tests pass for me.  I think it looks  
> good.  Have you run any benchmarks on it?  I ran the standard one on  
> the patched version and on trunk, in a totally unscientific test.  In  
> theory, the case with no payloads should perform very closely to the  
> existing code, and this seems to be born out by me running the micro- 
> standard (ant run-task in contrib/benchmark).   Once we have this  

Grant, thank you for running the benchmarks!
If no payloads are used, there is indeed no performance decrease to be expected,
because the file format does not change at all in that case.

> committed someone can take a crack at adding support to the  
> benchmarker for payloads.

Good point! This will help us find possible optimizations.

> Payload should probably be serializable.

Agreed. Will do ...

> All in all, I think we could commit this, then adding the search/ 
> scoring capabilities like we've talked about.  I like the  
> documentation/comments you have added, very useful.  (One of these  
> days I will take on documenting the index package like I intend to,  
> so what you've added will be quite helpful!)   We will/may want to  

That's what I was planning to do as well... haven't had time yet. But 
good that there's another volunteer, so we can split the work ;-)

> add in, for example, a PayloadQuery and derivatives and a QueryParser  
> operator that supported searching in the payload, or possibly  
> boosting if a certain term has a certain type of payload (not that I  
> want anything to do with the QueryParser).  Even beyond that,  
> SpanPayloadQuery, etc.  I will possibly have some cycles to actually  
> write some code for these next week.

Yes, there are lots of things we could do. I was also thinking about
providing a demo that uses payloads. Let's commit this first, then
we can start working on these items...

> Just throwing this out there, I'm not sure I really mean it or  
> not  :-) , but:
> do you think it would be useful to consider restricting the size of  
> the payload?  I know, I know, as soon as we put a limit on it,  
> someone will want to expand it, but I was thinking if we knew the  
> size had a limit we could better control the performance and caching,  
> etc. on the scoring/search side.    I guess it is buyer beware, maybe  
> we put some javadocs on this.

Hmm, I'm not sure if we should limit the size... since there are
so many different use cases, I wouldn't even know how to pick such
a limit. However, if we discover later that a limit would be helpful
to optimize things on the search side, we could think about a limit
parameter at the field level, which would be easy to add if we introduce
a schema and global field semantics with FI.

> Also, I started http://wiki.apache.org/lucene-java/Payloads as I  
> think we will want to have some docs explaining why Payloads are  
> useful in non-javadoc format.

Cool, that will be helpful!

> On a side note, have a look at http://wiki.apache.org/lucene-java/ 
> PatchCheckList to see if there is anything you feel you can add.

Thanks for reviewing this so thoroughly, Grant! I will commit it soon!



[jira] Updated: (LUCENE-755) Payloads

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-755:
---------------------------------

    Attachment: payloads.patch

Attaching a new patch. The previous one didn't apply cleanly anymore after LUCENE-710 was committed.

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch, payloads.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications, that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> ------------------------------   
> The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    * 
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    * 
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose. 
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch indroduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes()-method without an offset argument. 
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field, this is done automatically:
>    * The DocumentWriter enables payloads for a field, if one ore more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i. e. the bit is false, then the payload has the same length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile, the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. The same-length compression is used here as well: the lowest bit of DocSkip indicates whether the payload length is stored for a SkipDatum or whether the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only the position and the payload length are loaded from the ProxFile. If the user calls getPayload(), the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is simply skipped.
>   
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The layout of the .fnm file does not change; the only difference is that the sixth lowest-order bit (0x20) of the FieldBits is now used. If this bit is set, then payloads are enabled for the corresponding field. 
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt 
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
>   
> For payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
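To make the bit trick concrete, here is a write-side sketch of the same-length compression (illustrative only, not the actual DocumentWriter code; the class, method and variable names are made up):

  import java.io.IOException;
  import org.apache.lucene.store.IndexOutput;

  /** Illustrative sketch: writes one position plus payload for a field with payloads enabled. */
  class PositionPayloadWriter {
    private int lastPayloadLength = -1;          // reset for every new term

    void writePosition(IndexOutput prox, int delta,
                       byte[] payload, int offset, int length) throws IOException {
      if (length == lastPayloadLength) {
        prox.writeVInt(delta << 1);              // low bit 0: same length as the previous payload
      } else {
        prox.writeVInt((delta << 1) | 1);        // low bit 1: explicit PayloadLength follows
        prox.writeVInt(length);
        lastPayloadLength = length;
      }
      prox.writeBytes(payload, offset, length);  // the new offset-aware writeBytes
    }
  }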
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space, because we benefit again from the same-length compression: we only have to store the length zero for the empty payloads once per term.
> All unit tests pass.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-755) Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> I haven't looked at your latest patch yet, so this is just guesswork, 
> but was thinking in TermScorer, around line 75 or so, we could add:
>
> score *= similarity.scorePayload(payloadBuffer);
>
TermScorer currently doesn't iterate over the positions. It uses a 
buffer to load 32 doc/freq pairs from TermDocs using the read() method. 
In order to use per-term boosts you would have to change TermScorer 
to stop using that buffer and to use TermDocs.next() instead. Then you 
could iterate over the positions and get the payloads. This is a 
significant change to TermScorer, and performance would probably suffer 
for indexes that don't have payloads. I admit that I had the same idea 
in mind (I mentioned it in LUCENE-761), but after looking closer 
at TermScorer I changed my mind here.

I believe the better option is to create a new scorer subclass, such as 
WeightedTermScorer, to be used when payloads containing per-term 
boosts are stored in the index.

- Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-755) Payloads

Posted by Grant Ingersoll <gs...@apache.org>.
I haven't looked at your latest patch yet, so this is just guesswork,  
but was thinking in TermScorer, around line 75 or so, we could add:

score *= similarity.scorePayload(payloadBuffer);

The default Similarity would just return 1.  This would allow people  
to incorporate a score based on what is in the payload, per their  
application needs, and would be completely backward-compatible.  We  
may even want to postpone the decoding of the payload to inside the  
Similarity for performance reasons, but that should be tested, since  
it could be a cause of confusion for people overriding Similarity.   
I will have to look at some of the other Scorers to see if there is a  
way to incorporate it into some of them.
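For illustration only, the proposed hook might be filled in roughly as below. scorePayload() is not part of Similarity at this point, so this is a sketch of the proposal, reusing the encodeNorm()/decodeNorm() idea Michael mentions elsewhere in this thread.

  import org.apache.lucene.search.DefaultSimilarity;
  import org.apache.lucene.search.Similarity;

  /** Hypothetical subclass sketching the proposed scorePayload() hook. */
  public class BoostingSimilarity extends DefaultSimilarity {
    // Proposed method, not an existing Similarity override: turn a payload into a score factor.
    public float scorePayload(byte[] payload) {
      if (payload == null || payload.length == 0) {
        return 1.0f;                             // no payload: neutral factor
      }
      return Similarity.decodeNorm(payload[0]);  // boost stored via Similarity.encodeNorm()
    }
  }

TermScorer (or a subclass) would then multiply this factor into the raw term score, as in the score *= similarity.scorePayload(payloadBuffer) line above.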

None of this would prevent using payloads for other things as well,  
such as the XPath query example.

Doing this would involve switching over to using TermPositions like  
we talked about.  Like I said, I will take a look at it and see if  
anything resonates.

-Grant

On Mar 11, 2007, at 11:26 PM, Michael Busch wrote:

> Grant Ingersoll wrote:
>> Cool.  I will try and take a look at it tomorrow.  Since we have  
>> the lazy SegTermPos thing in now, we should be able to integrate  
>> this into scoring via the Similarity and merge TermDocs and  
>> TermPositions like you suggested.
>>
>> If I can get the Scoring piece in and people are fine w/ the  
>> flushBuffer change then hopefully we can get this in this week.  I  
>> will try to post a patch that includes your patch and the scoring  
>> integration by tomorrow or Tuesday if that is fine with you.
>>
> I'm not completely sure how you want to integrate this in the  
> Similarity class. Payloads can not only be used for scoring.  
> Consider for example XML search: the payloads can be used here to  
> store in which element a term occurs. During search (e. g. an XPath  
> query) the payloads would be used then to find hits, not for scoring.
>
> On the other hand, if you want to store e.g. per-position boosts in  
> the payloads, you could use the norm en/decoding methods that are  
> already in Similarity. You could use the following code in a  
> TokenStream:
>  byte[] payload = new byte[1];
>  payload[0] = Similarity.encodeNorm(boost);
>  token.setPayload(new Payload(payload, 0, 1));
>
> and in a scorer you could then get the boost with:
>  byte[] payloadBuffer = termPositions.getPayload(new byte[1], 0);
>  float boost = Similarity.decodeNorm(payloadBuffer[0]);
>
> But maybe you have something different in mind? Could you  
> elaborate, please?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-755) Payloads

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
> Cool.  I will try and take a look at it tomorrow.  Since we have the 
> lazy SegTermPos thing in now, we should be able to integrate this into 
> scoring via the Similarity and merge TermDocs and TermPositions like 
> you suggested.
>
> If I can get the Scoring piece in and people are fine w/ the 
> flushBuffer change then hopefully we can get this in this week.  I 
> will try to post a patch that includes your patch and the scoring 
> integration by tomorrow or Tuesday if that is fine with you.
>
I'm not completely sure how you want to integrate this into the Similarity 
class. Payloads are not only useful for scoring. Consider for example 
XML search: here the payloads can be used to store in which element a 
term occurs. During search (e.g. an XPath query) the payloads would then 
be used to find hits, not for scoring.

On the other hand, if you want to store e.g. per-position boosts in the 
payloads, you could use the norm en/decoding methods that are already in 
Similarity. You could use the following code in a TokenStream:
  byte[] payload = new byte[1];
  payload[0] = Similarity.encodeNorm(boost);
  token.setPayload(new Payload(payload, 0, 1));

and in a scorer you could then get the boost with:
  byte[] payloadBuffer = termPositions.getPayload(new byte[1], 0);
  float boost = Similarity.decodeNorm(payloadBuffer[0]);

But maybe you have something different in mind? Could you elaborate, please?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-755) Payloads

Posted by Grant Ingersoll <gr...@gmail.com>.
Cool.  I will try and take a look at it tomorrow.  Since we have the  
lazy SegTermPos thing in now, we should be able to integrate this  
into scoring via the Similarity and merge TermDocs and TermPositions  
like you suggested.

If I can get the Scoring piece in and people are fine w/ the  
flushBuffer change then hopefully we can get this in this week.  I  
will try to post a patch that includes your patch and the scoring  
integration by tomorrow or Tuesday if that is fine with you.

-Grant

On Mar 11, 2007, at 8:35 PM, Michael Busch (JIRA) wrote:

>
>      [ https://issues.apache.org/jira/browse/LUCENE-755? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Michael Busch updated LUCENE-755:
> ---------------------------------
>
>     Attachment: payloads.patch
>
> I'm attaching the new patch with the following changes:
> - applies cleanly on the current trunk
> - fixed a bug in FSDirectory which affected payloads with length  
> greater than 1024 bytes and extended testcase TestPayloads to test  
> this fix
> - added the following warning comments to the new APIs:
>
>   *  Warning: The status of the Payloads feature is experimental.  
> The APIs
>   *  introduced here might change in the future and will not be  
> supported anymore
>   *  in such a case. If you want to use this feature in a  
> production environment
>   *  you should wait for an official release.
>
>
> Another comment about an API change: In BufferedIndexOutput I  
> changed the method
>   protected abstract void flushBuffer(byte[] b, int len) throws  
> IOException;
> to
>   protected abstract void flushBuffer(byte[] b, int offset, int  
> len) throws IOException;
>
> which means that subclasses of BufferedIndexOutput won't compile  
> anymore. I made this change for performance reasons: If a payload  
> is longer than 1024 bytes (standard buffer size of  
> BufferedIndexOutput) then it can be flushed efficiently to disk  
> without having to perform array copies.
>
> Is this API change acceptable? Users who have custom subclasses of  
> BufferedIndexOutput would have to change their classes in order to  
> work.
>
>> Payloads
>> --------
>>
>>                 Key: LUCENE-755
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: Index
>>            Reporter: Michael Busch
>>         Assigned To: Michael Busch
>>         Attachments: payload.patch, payloads.patch, payloads.patch
>>
>>
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-755) Payloads

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-755:
---------------------------------

    Attachment: payloads.patch

I'm attaching the new patch with the following changes:
- applies cleanly on the current trunk
- fixed a bug in FSDirectory which affected payloads longer than 1024 bytes, and extended the TestPayloads test case to cover this fix
- added the following warning comments to the new APIs:

  *  Warning: The status of the Payloads feature is experimental. The APIs
  *  introduced here might change in the future and will not be supported anymore
  *  in such a case. If you want to use this feature in a production environment
  *  you should wait for an official release.


Another comment about an API change: In BufferedIndexOutput I changed the method 
  protected abstract void flushBuffer(byte[] b, int len) throws IOException;
to
  protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;

which means that existing subclasses of BufferedIndexOutput won't compile anymore. I made this change for performance reasons: if a payload is longer than 1024 bytes (the standard buffer size of BufferedIndexOutput), it can be flushed to disk efficiently without having to perform array copies. 

Is this API change acceptable? Users who have custom subclasses of BufferedIndexOutput would have to adjust their classes for them to keep working; a sketch of the required change follows.
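For anyone maintaining a custom subclass, the adjustment would look roughly like the sketch below. This is a made-up, stream-backed example, not code from the patch:

  import java.io.IOException;
  import java.io.OutputStream;
  import org.apache.lucene.store.BufferedIndexOutput;

  /** Made-up example of adapting a custom output to the new flushBuffer signature. */
  class StreamIndexOutput extends BufferedIndexOutput {
    private final OutputStream out;

    StreamIndexOutput(OutputStream out) {
      this.out = out;
    }

    // Before the patch this was: protected void flushBuffer(byte[] b, int len)
    protected void flushBuffer(byte[] b, int offset, int len) throws IOException {
      out.write(b, offset, len);   // large payloads can now be written without an extra copy
    }

    public long length() throws IOException {
      return getFilePointer();     // good enough for a write-only sketch
    }

    public void close() throws IOException {
      super.close();               // flushes the buffer
      out.close();
    }
  }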

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch, payloads.patch
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-755) Payloads

Posted by "Nicolas Lalevée (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463414 ] 

Nicolas Lalevée commented on LUCENE-755:
----------------------------------------

The patch I have just uploaded (payload.patch) is Michael's one (payloads.patch) plus the customization of how payloads are written and read, exactly as I did for LUCENE-662. An IndexFormat is in fact a factory of PayloadWriter and PayloadReader, and this index format is stored in the Directory instance.

Note that I haven't changed the javadoc or the comments included in Michael's patch; it needs some cleanup if somebody is interested in committing it.
And sorry for the name of the patch I have uploaded; it is a bit confusing now, and I can't change its name. I will be more careful next time when naming my patch files.
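Purely as a guess at the shape being described (the real interfaces are in the payload.patch attachment and may well differ), such a factory might look something like:

  import java.io.IOException;
  import org.apache.lucene.store.IndexInput;
  import org.apache.lucene.store.IndexOutput;

  // Hypothetical shapes only; see payload.patch for the actual definitions.
  interface PayloadWriter {
    void writePayload(byte[] data, int offset, int length, IndexOutput out) throws IOException;
  }

  interface PayloadReader {
    byte[] readPayload(IndexInput in, byte[] buffer, int offset) throws IOException;
  }

  interface IndexFormat {
    PayloadWriter getPayloadWriter();
    PayloadReader getPayloadReader();
  }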

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch
>
>

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-755) Payloads

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-755?page=comments#action_12460496 ] 
            
Grant Ingersoll commented on LUCENE-755:
----------------------------------------

Great patch, Michael, and something that will come in handy for a lot of people.  I can vouch that it applies cleanly and that all the tests pass.

Now, I am not sure I totally understand everything just yet, so the following is thinking aloud, but bear with me.

One of the big unanswered questions (besides how this fits into the whole flexible indexing scheme as discussed in the Payloads and Flexible Indexing threads on java-dev) at this point for me is: how do we expose/integrate this into the scoring side of the equation?  It seems we would need some interfaces that hook into the scoring mechanism so that people can define what all these payloads are actually used for, or am I missing something?  Yet the TermScorer takes in the TermDocs, so it doesn't yet have access to the payloads (although this is easily remedied, since we have access to the TermPositions when we construct TermScorer).  Span queries could easily be extended to include payload information, since they use the TermPositions, which would be useful for post-processing algorithms.

I can imagine an interface that you would have to set on the Query/Scorer (and that would be inherited unless otherwise set?).  The default implementation would be to ignore any payload, I suppose.  We could also add a callback in the Similarity mechanism, something like:

float calculatePayloadFactor(byte[] payload);
or 
float calculatePayloadFactor(Term term, byte[] payload);

Then this factor could be added/multiplied into the term score, or into whatever other scorers use it?

Is this making any sense?


> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: http://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payloads.patch
>
>

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-755) Payloads

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-755?page=comments#action_12460647 ] 
            
Michael Busch commented on LUCENE-755:
--------------------------------------

> Great patch, Michael, and something that will come in handy for a lot of people. I can vouch it applies cleanly and all the tests pass.

Cool, thanks for trying it out, Grant! :-)

> Now I am not sure I am totally understanding everything just yet so the following is thinking aloud, but bear with me.

> One of the big unanswered questions (besides how this fits into the whole flexible indexing scheme as discussed on the Payloads and 
> Flexible indexing threads on java-dev) at this point for me is: how do we expose/integrate this into the scoring side of the equation? It seems 
> we would need some interfaces that hook into the scoring mechanism so that people can define what all these payloads are actually used 
> for, or am I missing something? Yet the TermScorer takes in the TermDocs, so it doesn't yet have access to the payloads (although this is 
> easily remedied since we have access to the TermPositions when we construct TermScorer.) Span Queries could easily be extended to 
> include payload information since they use the TermPositions, which would be useful for post-processing algorithms.

I would say it really depends on the use case of the payloads. For example XML search: here payloads can be used to store depth information for terms. An extended Span class could then take the depth information into account for query evaluation. As you pointed out, the span classes already have easy access to the payloads since they use TermPositions, so implementing such a subclass should be fairly simple.

> I can imagine an interface that you would have to be set on the Query/Scorer (and inherited unless otherwise set???). The default 
> implementation would be to ignore any payload, I suppose. We could also add a callback in the Similarity mechanism, something like:
>
> float calculatePayloadFactor(byte[] payload);
> or
> float calculatePayloadFactor(Term term, byte[] payload);
>
> Then this factor could be added/multiplied into the term score or whatever other scorers use it??????
> 
> Is this making any sense?

I believe the case you're describing here is per-term norms/boosts? Yes, I think this would work, and you are right: the Scorers have to have access to TermPositions; TermDocs is not sufficient. So yes, it would be nice if TermScorer used TermPositions instead of TermDocs. I just opened LUCENE-761, which changes SegmentTermPositions to clone the proxStream lazily the first time nextPosition() is called. Then the cost of creating TermDocs and TermPositions is the same, and together with lazy prox skipping (LUCENE-687) there's no reason anymore not to use TermPositions.

However, as currently discussed on java-dev, per-term boosts could also be part of a new posting format in the flexible index scheme and thus not stored in the payloads.

So in general this patch doesn't yet add a new search feature to Lucene; rather, it opens the door for new features in the future. The way to add such a feature is then:
1) Write an analyzer that provides the data necessary for the new feature and produces Tokens with payloads containing that data.
2) Write or extend a Scorer that has access to TermPositions and makes use of the data in the payloads for matching, scoring, or both.


> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: http://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications, that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> ------------------------------   
> The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    * 
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    * 
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose. 
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch indroduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes()-method without an offset argument. 
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field, this is done automatically:
>    * The DocumentWriter enables payloads for a field, if one ore more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i. e. the bit is false, then the payload has the same length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition() then only the position and the payload length is loaded from the ProxFile. If the user calls getPayload() then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
>   
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field. 
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt 
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first   occurrence in this document).
>   
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
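To make the bit-twiddling concrete, here is a small self-contained sketch of the write side of this encoding; the VInt helper and the ByteArrayOutputStream are stand-ins for Lucene's IndexOutput and are not taken from the patch:

import java.io.ByteArrayOutputStream;

public class ProxEncodingSketch {
  // Lucene-style VInt: 7 data bits per byte, high bit set while more bytes follow.
  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }

  // Writes one position entry for a field with payloads enabled.
  // Returns the payload length so the caller can pass it back in as lastPayloadLength.
  static int writePosition(ByteArrayOutputStream out, int positionDelta,
                           byte[] payload, int lastPayloadLength) {
    if (payload.length != lastPayloadLength) {
      writeVInt(out, (positionDelta << 1) | 1);  // odd: an explicit PayloadLength follows
      writeVInt(out, payload.length);
    } else {
      writeVInt(out, positionDelta << 1);        // even: same length as the previous payload
    }
    out.write(payload, 0, payload.length);       // PayloadData
    return payload.length;
  }
}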
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
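A matching sketch of the read side for a single SkipDatum, again with a stand-in VInt reader over a plain byte stream instead of Lucene's IndexInput:

import java.io.ByteArrayInputStream;

public class SkipDatumReadSketch {
  static int readVInt(ByteArrayInputStream in) {
    int b = in.read();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in.read();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  int lastDoc = 0;
  int lastPayloadLength = 0;

  // Reads one SkipDatum of a field with payloads enabled.
  void readSkipDatum(ByteArrayInputStream in) {
    int docSkip = readVInt(in);
    if ((docSkip & 1) != 0) {
      lastPayloadLength = readVInt(in);  // odd: an explicit PayloadLength is stored
    }                                    // even: reuse the previous payload length
    lastDoc += docSkip >>> 1;            // the actual document delta is DocSkip/2
    int freqSkip = readVInt(in);         // remaining SkipDatum entries are unchanged
    int proxSkip = readVInt(in);
  }
}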
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length-compression since we only have to store the length zero for the empty payloads once per term.
> All unit tests pass.



[jira] Updated: (LUCENE-755) Payloads

Posted by "Nicolas Lalevée (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Lalevée updated LUCENE-755:
-----------------------------------

    Attachment: payload.patch

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch
>
>



Re: Flexible indexing

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:

> - Introduce index-level metadata. Preferable in XML format, so it  
> will be human readable. Later on, we can store information about  
> the index format in this file, like the codecs that are used to  
> store the data.

To provoke thought about what index-level metadata might go in this  
file, the contents of a KS "segments_2.yaml" file immediately after  
indexing an HTML presentation of the US Constitution are below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


slothbear:~/projects/ks/perl marvin$ cat uscon_invindex/segments_2.yaml
ks_version: 0.20_02
fields:
   title: 'KinoSearch::Schema::FieldSpec'
   url: 'USConSchema::UnIndexedField'
   content: 'KinoSearch::Schema::FieldSpec'
format: 1
generation: 2
seg_counter: 1
segments:
   _1:
     term_list_index:
       skip_interval: 16
       format: 1
       index_interval: 128
       size: 8
       counts:
         title: 1
         content: 8
     posting_list:
       format: 1
     compound_file:
       format: 1
       sub_files:
         _1.tlx2:
           offset: 138575
           length: 93
         _1.p0:
           offset: 138134
           length: 441
         _1.tvx:
           offset: 137718
           length: 416
         _1.tv:
           offset: 73487
           length: 64231
         _1.tl0:
           offset: 73259
           length: 228
         _1.p2:
           offset: 56393
           length: 16866
         _1.ds:
           offset: 7015
           length: 49378
         _1.tl2:
           offset: 421
           length: 6594
         _1.dsx:
           offset: 5
           length: 416
         _1.tlx0:
           offset: 0
           length: 5
     term_vectors:
       format: 1
     term_list:
       skip_interval: 16
       format: 1
       index_interval: 128
       size: 923
       counts:
         title: 41
         content: 923
     doc_storage:
       format: 1
     seg_info:
       seg_name: _1
       doc_count: 52
       field_names:
         - title
         - url
         - content
version: 1173732193033





Re: Flexible indexing

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
On Sunday, 11 March 2007 at 22:41, Michael Busch wrote:
> Hi Grant,
>
> I certainly agree that it would be great if we could make some progress
> and commit the payloads patch soon. I think it is quite independent from
> FI. FI will introduce different posting formats (see Wiki:
> http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be
> part of some of those formats, but not all (i. e. per-position payloads
> only make sense if positions are stored).
>
> The only concern some people had was about the API the patch introduces.
> It extends Token and TermPositions. Doug's argument was, that if we
> introduce new APIs now but want to change them with FI, then it will be
> hard to support those APIs. I think that is a valid point, but at the
> same time it slows down progress to have to plan ahead in too many
> directions. That's why I'd vote for marking the new APIs as experimental
> so that people can try them out at own risk.
> If we could agree on that approach then I'd go ahead and submit an
> updated payloads patch in the next days, that applies cleanly on the
> current trunk and contains the additional warnings in the javadocs.
>
>
> In regard of FI and 662 however I really believe we should split it up
> and plan ahead (in a way I mentioned already), so that we have more
> isolated patches. It is really great that we have 662 already (Nicolas,
> thank you so much for your hard work, I hope you'll keep working with us
> on FI!!). We'll probably use some of that code, and it will definitely
> be helpful.

thanks ! :)

About the code split you are talking about, I definitely agree. Here is what 
the three parts will contain:
1) index format concept:
- there is an interface defining it, which for now just handles the filename 
extensions.
- modify the Directory abstract class and its implementations to be the 
container of the index format.
- modify the SegmentInfos class to check the opened index format against 
the index format defined in the Directory class.
- modify the writer to make it check for format conflicts while adding raw indexes
2) extensibility of the store reader/writer:
- add some new entry points to the previous interface: a FieldsReader and a 
FieldsWriter.
- split the current FieldsReader and FieldsWriter into two parts: the part 
which will still be handled by Lucene, and the extendable ones which will be 
instantiated by a DefaultIndexFormat.
- split the implementation of Field into two parts: the Field and a FieldData, 
so the user will be able to define a custom field-data Java object.
3) New: extensibility of the posting reader/writer
this is just a draft for now, but here is what was done:
- move Posting from an inner class to a public class
- make TermInfo handle a pool of "pointers": the default implementation has 
two, the frq one and the prx one.
- extract the posting writing from DocumentWriter into a DefaultPostingWriter.

I can provide a patch for the first step.
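To picture what these three parts might look like as code, here is a very rough Java sketch; every name in it (IndexFormat and the *Like interfaces) is invented for illustration and is not taken from the LUCENE-662 patch:

import java.io.IOException;

// Illustrative only: an "index format" object owned by the Directory,
// exposing the extension points listed above.
interface IndexFormat {
  // part 1: the file extensions belonging to this format, used for consistency checks
  String[] fileExtensions();

  // part 2: pluggable stored-fields codec
  FieldsReaderLike openFieldsReader(String segmentName) throws IOException;
  FieldsWriterLike openFieldsWriter(String segmentName) throws IOException;

  // part 3 (draft): pluggable posting writer
  PostingWriterLike openPostingWriter(String segmentName) throws IOException;
}

interface FieldsReaderLike  { /* reads the stored fields of one document */ }
interface FieldsWriterLike  { /* writes the stored fields of one document */ }
interface PostingWriterLike { /* writes one term's posting list */ }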

cheers,
Nicolas

>
> Michael
>
> Grant Ingersoll wrote:
> > Hi Michael,
> >
> > This is very good.  I know 662 is different, just wasn't sure if
> > Nicolas patch was meant to be applied after 662, b/c I know we had
> > discussed this before.
> >
> > I do agree with you about planning this out, but I also know that
> > patches seem to motivate people the best and provide a certain
> > concreteness to it all.  I mostly started asking questions on these
> > two issues b/c I wanted to spur some more discussion and see if we can
> > get people motivated to move on it.
> >
> > I was hoping that I would be able to apply each patch to two different
> > checkouts so I could start seeing where the overlap is and how they
> > could fit together (I also admit I was procrastinating on my ApacheCon
> > talk...).  In the new, flexible world, the payloads implementation
> > could be a separate implementation of the indexing or it could be part
> > of the core/existing file format implementation.  Sometimes I just
> > need to get my hands on the code to get a real feel for what I feel is
> > the best way to do it.
> >
> > I agree about the XML storage for Index information.  We do that in
> > our in-house wrapper around Lucene, storing info about the language,
> > analyzer used, etc.  We may also want a binary index-level storage
> > capability.  I know most people just create a single document usually
> > to store binary info about the index, but an binary storage might be
> > good too.
> >
> > Part of me says to apply the Payloads patch now, as it provides a lot
> > of bang for the buck and I think the FI is going to take a lot longer
> > to hash out.  However, I know that it may pin us in or force us to
> > change things for FI.  Ultimately, I would love to see both these
> > features for the next release, but that isn't a requirement.  Also, on
> > FI, I would love to see two different implementations of whatever API
> > we choose before releasing it, as I always find two implementations of
> > an Interface really work out the API details.
> >
> > -Grant
>

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com



Re: Flexible indexing

Posted by Michael Busch <bu...@gmail.com>.
Grant Ingersoll wrote:
>
>> In regard of FI and 662 however I really believe we should split it 
>> up and plan ahead (in a way I mentioned already), so that we have 
>> more isolated patches. It is really great that we have 662 already 
>> (Nicolas, thank you so much for your hard work, I hope you'll keep 
>> working with us on FI!!). We'll probably use some of that code, and 
>> it will definitely be helpful.
>>
>
> +1  I think this makes a lot of sense.  We have been deliberating 
> these changes for some time, so no reason to hurry.  I don't think 
> they are urgent, yet they really will give us more flexibility and 
> more capabilities for more people, so it will be a good thing to have.
>

Right, we don't have to hurry. But it would still be cool to have some 
of the FI features in the next release, and once we start (now!) we 
should try to keep the momentum going!



Re: Flexible indexing

Posted by Grant Ingersoll <gr...@gmail.com>.
On Mar 11, 2007, at 5:41 PM, Michael Busch wrote:

> Hi Grant,
>
> I certainly agree that it would be great if we could make some  
> progress and commit the payloads patch soon. I think it is quite  
> independent from FI. FI will introduce different posting formats  
> (see Wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing).  
> Payloads will be part of some of those formats, but not all (i. e.  
> per-position payloads only make sense if positions are stored).
>

Yep, I agree.

> The only concern some people had was about the API the patch  
> introduces. It extends Token and TermPositions. Doug's argument  
> was, that if we introduce new APIs now but want to change them with  
> FI, then it will be hard to support those APIs. I think that is a  
> valid point, but at the same time it slows down progress to have to  
> plan ahead in too many directions. That's why I'd vote for marking  
> the new APIs as experimental so that people can try them out at own  
> risk.
> If we could agree on that approach then I'd go ahead and submit an  
> updated payloads patch in the next days, that applies cleanly on  
> the current trunk and contains the additional warnings in the  
> javadocs.
>

+1.

>
> In regard of FI and 662 however I really believe we should split it  
> up and plan ahead (in a way I mentioned already), so that we have  
> more isolated patches. It is really great that we have 662 already  
> (Nicolas, thank you so much for your hard work, I hope you'll keep  
> working with us on FI!!). We'll probably use some of that code, and  
> it will definitely be helpful.
>

+1  I think this makes a lot of sense.  We have been deliberating  
these changes for some time, so no reason to hurry.  I don't think  
they are urgent, yet they really will give us more flexibility and  
more capabilities for more people, so it will be a good thing to have.


> Michael
>
> Grant Ingersoll wrote:
>> Hi Michael,
>>
>> This is very good.  I know 662 is different, just wasn't sure if  
>> Nicolas patch was meant to be applied after 662, b/c I know we had  
>> discussed this before.
>>
>> I do agree with you about planning this out, but I also know that  
>> patches seem to motivate people the best and provide a certain  
>> concreteness to it all.  I mostly started asking questions on  
>> these two issues b/c I wanted to spur some more discussion and see  
>> if we can get people motivated to move on it.
>>
>> I was hoping that I would be able to apply each patch to two  
>> different checkouts so I could start seeing where the overlap is  
>> and how they could fit together (I also admit I was  
>> procrastinating on my ApacheCon talk...).  In the new, flexible  
>> world, the payloads implementation could be a separate  
>> implementation of the indexing or it could be part of the core/ 
>> existing file format implementation.  Sometimes I just need to get  
>> my hands on the code to get a real feel for what I feel is the  
>> best way to do it.
>>
>> I agree about the XML storage for Index information.  We do that  
>> in our in-house wrapper around Lucene, storing info about the  
>> language, analyzer used, etc.  We may also want a binary index- 
>> level storage capability.  I know most people just create a single  
>> document usually to store binary info about the index, but an  
>> binary storage might be good too.
>>
>> Part of me says to apply the Payloads patch now, as it provides a  
>> lot of bang for the buck and I think the FI is going to take a lot  
>> longer to hash out.  However, I know that it may pin us in or  
>> force us to change things for FI.  Ultimately, I would love to see  
>> both these features for the next release, but that isn't a  
>> requirement.  Also, on FI, I would love to see two different  
>> implementations of whatever API we choose before releasing it, as  
>> I always find two implementations of an Interface really work out  
>> the API details.
>>
>> -Grant
>
>
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





Re: Flexible indexing

Posted by Michael Busch <bu...@gmail.com>.
Hi Grant,

I certainly agree that it would be great if we could make some progress 
and commit the payloads patch soon. I think it is quite independent of 
FI. FI will introduce different posting formats (see Wiki: 
http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be 
part of some of those formats, but not all (i.e. per-position payloads 
only make sense if positions are stored).

The only concern some people had was about the API the patch introduces. 
It extends Token and TermPositions. Doug's argument was that if we 
introduce new APIs now but want to change them with FI, then it will be 
hard to support those APIs. I think that is a valid point, but at the 
same time it slows down progress to have to plan ahead in too many 
directions. That's why I'd vote for marking the new APIs as experimental 
so that people can try them out at their own risk.
If we could agree on that approach, then I'd go ahead and submit an 
updated payloads patch in the next few days that applies cleanly to the 
current trunk and contains the additional warnings in the javadocs.


In regard to FI and 662, however, I really believe we should split it up 
and plan ahead (in a way I already mentioned), so that we have more 
isolated patches. It is really great that we have 662 already (Nicolas, 
thank you so much for your hard work, I hope you'll keep working with us 
on FI!!). We'll probably use some of that code, and it will definitely 
be helpful.

Michael

Grant Ingersoll wrote:
> Hi Michael,
>
> This is very good.  I know 662 is different, just wasn't sure if 
> Nicolas patch was meant to be applied after 662, b/c I know we had 
> discussed this before.
>
> I do agree with you about planning this out, but I also know that 
> patches seem to motivate people the best and provide a certain 
> concreteness to it all.  I mostly started asking questions on these 
> two issues b/c I wanted to spur some more discussion and see if we can 
> get people motivated to move on it.
>
> I was hoping that I would be able to apply each patch to two different 
> checkouts so I could start seeing where the overlap is and how they 
> could fit together (I also admit I was procrastinating on my ApacheCon 
> talk...).  In the new, flexible world, the payloads implementation 
> could be a separate implementation of the indexing or it could be part 
> of the core/existing file format implementation.  Sometimes I just 
> need to get my hands on the code to get a real feel for what I feel is 
> the best way to do it.
>
> I agree about the XML storage for Index information.  We do that in 
> our in-house wrapper around Lucene, storing info about the language, 
> analyzer used, etc.  We may also want a binary index-level storage 
> capability.  I know most people just create a single document usually 
> to store binary info about the index, but an binary storage might be 
> good too.
>
> Part of me says to apply the Payloads patch now, as it provides a lot 
> of bang for the buck and I think the FI is going to take a lot longer 
> to hash out.  However, I know that it may pin us in or force us to 
> change things for FI.  Ultimately, I would love to see both these 
> features for the next release, but that isn't a requirement.  Also, on 
> FI, I would love to see two different implementations of whatever API 
> we choose before releasing it, as I always find two implementations of 
> an Interface really work out the API details.
>
> -Grant




Re: Flexible indexing (was: Re: [jira] Commented: (LUCENE-755) Payloads)

Posted by Grant Ingersoll <gr...@gmail.com>.
Hi Michael,

This is very good.  I know 662 is different, just wasn't sure if  
Nicolas' patch was meant to be applied after 662, b/c I know we had  
discussed this before.

I do agree with you about planning this out, but I also know that  
patches seem to motivate people the best and provide a certain  
concreteness to it all.  I mostly started asking questions on these  
two issues b/c I wanted to spur some more discussion and see if we  
can get people motivated to move on it.

I was hoping that I would be able to apply each patch to two  
different checkouts so I could start seeing where the overlap is and  
how they could fit together (I also admit I was procrastinating on my  
ApacheCon talk...).  In the new, flexible world, the payloads  
implementation could be a separate implementation of the indexing or  
it could be part of the core/existing file format implementation.   
Sometimes I just need to get my hands on the code to get a real feel  
for the best way to do it.

I agree about the XML storage for Index information.  We do that in  
our in-house wrapper around Lucene, storing info about the language,  
analyzer used, etc.  We may also want a binary index-level storage  
capability.  I know most people usually just create a single document  
to store binary info about the index, but a binary storage option might  
be good too.

Part of me says to apply the Payloads patch now, as it provides a lot  
of bang for the buck and I think the FI is going to take a lot longer  
to hash out.  However, I know that it may pin us in or force us to  
change things for FI.  Ultimately, I would love to see both these  
features for the next release, but that isn't a requirement.  Also,  
on FI, I would love to see two different implementations of whatever  
API we choose before releasing it, as I always find two  
implementations of an Interface really work out the API details.

-Grant


On Mar 10, 2007, at 6:27 PM, Michael Busch wrote:

> Hi Grant,
>
> LUCENE-662 contains different ideas:
> 1) introduction of an index format concept
> 2) extensibility of the store reader/writer
> 3) New: extensibility of the posting reader/writer
>
> IMO we should split this up, that way it will be easier to develop  
> smaller patches that focus on adding one particular feature.  
> However, it is important to plan the API, so that different patches  
> (like payloads) fit in. On the other hand it will be nearly  
> impossible to plan an API that is perfect and won't change anymore  
> without having the actual implementations. Therefore I suggest the  
> following steps:
> a) define the different work items of flexible indexing
> b) plan an API roughly that fits with all items
> c) develop the different items, commit them but with APIs either  
> protected or marked as experimental
> d) after all items are completed and committed (and hopefully  
> tested by some brave community members ;)) finalize the API and  
> remove experimental comments (or make public)
>
> Let's start with a):
>
> The following items come to my mind (please feel free to add/remove/ 
> complain):
> - Introduce index-level metadata. Preferable in XML format, so it  
> will be human readable. Later on, we can store information about  
> the index format in this file, like the codecs that are used to  
> store the data. We should also make this public, so that users can  
> store their own index metadata. (Remark: LUCENE-783 is also a neat  
> idea, we can write one xml parser for both items)
>
> - Introduce index format. Nicolas has already written a lot of code  
> in this regard! It will include different interfaces for the  
> different extension points (FieldsFormat, PostingFormat,  
> DictionaryFormat). We can use the xml file to store which actual  
> formats are used in the corresponding index.
>
> - Implement the different extensions. LUCENE-662 includes an  
> extensible FieldsWriter, LUCENE-755 the payloads feature. Doug and  
> Ning suggested already nice interfaces for PostingFormat and  
> DictionaryFormat in the payloads thread on java-dev.
>
> - Write standard implementations for the different formats. In the  
> wiki is already a list of desired posting formats.
>
>
> I suggest we should finalize this list first. Then I will add this  
> list to the wiki under Flexible indexing and gather information  
> from the different discussions on java-dev which I already  
> mentioned. Then we should discuss the different items of this list  
> in greater depth and plan the APIs (step b) ).  And then we're  
> already ready for step c) and the fun starts :-).
>
> Michael
>
>
> Grant Ingersoll wrote:
>> I think it makes the most sense to get flexible indexing in first,  
>> and then make payloads work with it.  On the other hand, payloads  
>> looked pretty straightforward to me, whereas FI is much more  
>> involved (or at least it feels that way).
>>
>> As it is right now, I would like to at least review the two  
>> patches and start thinking about them in greater depth.  The  
>> payloads patch needs a little more work in that I want to  
>> integrate it with the Similarity class so people can customize  
>> their scoring.
>>
>> -Grant
>>
>> On Mar 10, 2007, at 9:30 AM, Nicolas Lalevée (JIRA) wrote:
>>
>>>
>>>     [ https://issues.apache.org/jira/browse/LUCENE-755? 
>>> page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
>>> tabpanel#action_12479841 ]
>>>
>>> Nicolas Lalevée commented on LUCENE-755:
>>> ----------------------------------------
>>>
>>> Grant>
>>> The patch I have proposed here has no dependency on LUCENE-662; I  
>>> just "imported" some ideas from it and put them there. Since  
>>> LUCENE-662 has evolved, the patches will probably  
>>> conflict. The best one to use here is Michael's. I think it  
>>> won't conflict with LUCENE-662. And if both are intended to be  
>>> committed, then the best is to commit both separately and redo  
>>> the work I have done with the provided patch (I remember that it  
>>> was quite easy).
>>>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/





Re: Flexible indexing

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 12, 2007, at 5:08 PM, Grant Ingersoll wrote:

> I can see having storage at:
> Index
> Document/Field  //already exists
> Token

I hadn't thought of it that way, as a logical extension outwards at  
all levels.

If I understand you correctly, it's a clever point, but the thing is,  
it's cake for someone to add arbitrary index-level data on their own,  
just by adding their own file.  We'd have to come up with and support  
an infrastructure for handling this kind of data, and whatever we  
invented would be unlikely to suit all needs.  Ergo, I think it makes  
sense for us to focus on the Token and Document/Field levels.

I think we can do much better with regards to opening up Document/ 
Field retrieval.  Under global field semantics, the fieldbits Byte is  
no longer needed.  Go one step beyond that, and change the field  
number to a field name string, and documents can be handled as  
monolithic blobs when merging segments.  Document storage becomes  
simply a combination of fixed width storage and (optional) variable  
width storage, and the possibilities for subclassing break wide  
open.  Extended thoughts below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Begin forwarded message:
From: Marvin Humphrey <ma...@rectangular.com>
Date: February 26, 2007 1:26:00 PM PST
To: KinoSearch discussion forum <ki...@rectangular.com>
Subject: [KinoSearch] Subclassing DocWriter/DocReader
Reply-To: KinoSearch discussion forum <ki...@rectangular.com>

Greets,

The file format changes in the new KS have opened up possibilities  
for subclassing DocWriter/DocReader, the classes responsible for  
storage/retrieval of serialized documents.

Here are some potential features that subclasses could implement:

   * storage of arbitrary data (e.g. arrayref values)
   * different field values for display and searching
   * complete document recovery
   * arbitrary compression algo choice
   * lazy loading
   * optimized external document storage (e.g. in SQL DB)

Anything else?  The more ideas we dream up now and consider how to  
support, the better the design will be.

Right now, there are two files, _XXX.ds and _XXX.dsx, with .ds being  
"document storage", and .dsx being "document storage index".  .ds is  
a stack of variable width records -- serialized documents -- stored  
end to end.  .dsx is a stack of fixed width records: 64-bit pointers  
into the variable-width .ds file.  (For a more extensive explanation,  
see <http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/ 
FileFormat.html>)
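A minimal Java sketch of that layout, assuming each .dsx entry is just an 8-byte start offset (i.e. before the planned per-document boost is added); the file handling and names are illustrative, not KinoSearch code:

import java.io.IOException;
import java.io.RandomAccessFile;

public class DocStorageSketch {
  // Fetches the serialized bytes of document docNum: the fixed-width .dsx file
  // is used purely as an offset index into the variable-width .ds file.
  static byte[] fetchDoc(RandomAccessFile dsx, RandomAccessFile ds, int docNum)
      throws IOException {
    dsx.seek((long) docNum * 8);
    long start = dsx.readLong();                  // start of this record
    long end = dsx.length() >= (docNum + 2) * 8L  // end = start of the next record,
        ? dsx.readLong()                          // or end of file for the last doc
        : ds.length();
    byte[] record = new byte[(int) (end - start)];
    ds.seek(start);
    ds.readFully(record);
    return record;
  }
}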

The fixed width file, I intend to monkey with myself, because I'm  
going to start storing document boost as a 32-bit float within it.  
(That's what's driving this development track -- I need a place to  
put these doc boosts.)

My thinking is, why not add more than that?  So long as the  
additional data is fixed width, we can still index into the .dsx file  
quickly.

The variable width .ds file is up for grabs.  Right now, docs are  
serialized using a scheme derived from Lucene which isn't really  
optimal for KS and doesn't need to be as complicated as it is.  So  
long as we can recover a hash from the serialized data, we're fine.

Rough sketch example subclasses implementing storage of arbitrary  
data and external storage in a DB are below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

#--------------------------------------------------------------------

package ArbitraryDataDocWriter;
use base qw( KinoSearch::Index::DocWriter );
use Storable qw( nfreeze );

sub store_doc {
     my ( $self, $doc ) = @_;
     my %ret_hash = ( var_width_data => nfreeze($doc) );
     return \%ret_hash;
}

package ArbitraryDataDocReader;
use base qw( KinoSearch::Index::DocReader );
use Storable qw( thaw );

sub fetch_doc {
     my ( $self, %args ) = @_;
     my $serialized;
     $self->read_var_width( \$serialized, $args{var_width_bytes} );
     return thaw($$serialized);
}


#--------------------------------------------------------------------

package DBDocWriter;
use base qw( KinoSearch::Index::DocWriter );
use DBI;

sub fixed_width_data_size { 8 }

sub store_doc {
     my ( $self, $doc ) = @_;
     $self->store_in_db($doc);
     my %ret_hash = ( fixed_width_data => $doc->{primary_key} );
     return \%ret_hash;
}

package DBDocReader;
use base qw( KinoSearch::Index::DocReader );
use DBI;

sub fixed_width_data_size { 8 }

sub fetch_doc {
     my ( $self, %args ) = @_;
     return $self->fetch_from_db( $args{fixed_width_data} );
}



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch




Re: Flexible indexing

Posted by Grant Ingersoll <gr...@gmail.com>.
On Mar 12, 2007, at 6:54 PM, Michael Busch wrote:

> Marvin Humphrey wrote:
>>
>> On Mar 12, 2007, at 2:11 PM, Michael Busch wrote:
>>
>>> I think our best option here is to have a closed XML file for the  
>>> index format/configuration (something like you sent in your other  
>>> mail) plus a binary file for custom index-level metadata like  
>>> Grant suggested.
>>
>> Why the binary file?
>>
> Well, it's not needed for FI and storing the index configuration.  
> Grant mentioned though, that he knows users who would like to have  
> such a feature:
>
> Grant Ingersoll wrote:
> > We may also want a binary index-level storage capability.  I know  
> most people just create a single document usually to store binary  
> info about the index, but an binary storage might be good too.
>

I'm just thinking there might be info you want to store at the index  
level that could be significant and you don't want to write it out in  
XML, kind of like we have binary stored fields now.

I can see having storage at:
Index
Document/Field  //already exists
Token

The key to Index and Token storage is that it not affect performance;  
either that, or it is an alternate implementation.



Re: Flexible indexing

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 12, 2007, at 3:54 PM, Michael Busch wrote:

> Sounds interesting! I will take a closer look at it...

Here's an introduction courtesy of JYaml, a YAML library for Java:

   http://jyaml.sourceforge.net/tutorial.html

For an example of how YAML is well suited to the task of serializing  
index metadata, consider how the association of field names to field  
numbers in KS's segments file gets expressed:

       field_names:
         - title
         - url
         - content

It's an ordered list, so title is 0, url is 1, and content is 2.  You  
can hack ordered lists into XML, but they won't be as clear, terse,  
or standard as that.

To be fair, YAML is a lousy markup language.  :)  But we don't need  
markup, we need data serialization.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Flexible indexing

Posted by Michael Busch <bu...@gmail.com>.
Marvin Humphrey wrote:
>
> On Mar 12, 2007, at 2:11 PM, Michael Busch wrote:
>
>> I think our best option here is to have a closed XML file for the 
>> index format/configuration (something like you sent in your other 
>> mail) plus a binary file for custom index-level metadata like Grant 
>> suggested.
>
> Why the binary file?
>
Well, it's not needed for FI and storing the index configuration. Grant 
mentioned, though, that he knows users who would like to have such a feature:

Grant Ingersoll wrote:
 > We may also want a binary index-level storage capability.  I know 
most people just create a single document usually to store binary info 
about the index, but an binary storage might be good too.



>> Btw, I'm not really familiar with YAML. Maybe you could explain 
>> briefly why you chose YAML over XML in KinoSearch?
>
> First, it's more readable.
>
> Second, it's designed for exactly this purpose.  It's a data 
> serialization language.  (YAML officially stands for "YAML ain't 
> markup language".)  XML can handle this task, but not as elegantly.
>
> Third, it's ubiquitous in both the Perl and the Ruby communities. It's 
> very close as JSON as well, so anybody who's done Javascript/AJAX 
> programming can grok it -- but even if you have no experience with it, 
> the fundamentals are easily grasped.  It's got sufficient market 
> penetration that it isn't going anywhere, so there's nothing to be 
> gained by going with the relatively more-established XML.  It has its 
> flaws, chiefly having to do with how it handles very complex data, but 
> XML has the same problem, and the kind of data we're talking about is 
> pretty simple.
>
> Last, it's a bit more compact, though that wasn't a major consideration.
Sounds interesting! I will take a closer look at it...

- Michael



Re: Flexible indexing

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 12, 2007, at 2:11 PM, Michael Busch wrote:

> I think our best option here is to have a closed XML file for the  
> index format/configuration (something like you sent in your other  
> mail) plus a binary file for custom index-level metadata like Grant  
> suggested.

Why the binary file?

> Btw, I'm not really familiar with YAML. Maybe you could explain  
> briefly why you chose YAML over XML in KinoSearch?

First, it's more readable.

Second, it's designed for exactly this purpose.  It's a data  
serialization language.  (YAML officially stands for "YAML ain't  
markup language".)  XML can handle this task, but not as elegantly.

Third, it's ubiquitous in both the Perl and the Ruby communities.  
It's very close to JSON as well, so anybody who's done Javascript/ 
AJAX programming can grok it -- but even if you have no experience  
with it, the fundamentals are easily grasped.  It's got sufficient  
market penetration that it isn't going anywhere, so there's nothing  
to be gained by going with the relatively more-established XML.  It  
has its flaws, chiefly having to do with how it handles very complex  
data, but XML has the same problem, and the kind of data we're  
talking about is pretty simple.

Last, it's a bit more compact, though that wasn't a major consideration.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/





Re: Flexible indexing

Posted by Michael Busch <bu...@gmail.com>.
Marvin Humphrey wrote:
>
> On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:
>
> I'm going to respond to this over several mails (: and possibly days 
> :) because there's an awful lot here, and I've already implemented a 
> lot of it in KS.
>
>> We should also make this public, so that users can store their own 
>> index metadata.
>> (Remark: LUCENE-783 is also a neat idea, we can write one xml parser 
>> for both items)
>
> There's a significant downside to allowing users to store arbitrary 
> data in an XML index file: you can't use a bare-bones parser, 
> hand-coded for a tiny, controlled subset of XML syntax and a limited 
> set of data structures.  You'd need a full-on XML encoder/decoder, 
> presumably an existing one that would be added as a dependency.
>
> The only reason that the KinoSearch's YAML codec requires only 600 
> lines of C is that it's a closed system.  No multi-line strings.  No 
> objects.  No nulls.  You get the picture.
>
That's a good point, Marvin. The parser would be much simpler if the XML 
file were a closed format. I think our best option here is to have a closed XML 
file for the index format/configuration (something like what you sent in your 
other mail) plus a binary file for custom index-level metadata, like 
Grant suggested.

Btw, I'm not really familiar with YAML. Maybe you could explain briefly 
why you chose YAML over XML in KinoSearch?

- Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Flexible indexing

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:

I'm going to respond to this over several mails (: and possibly  
days :) because there's an awful lot here, and I've already  
implemented a lot of it in KS.

> We should also make this public, so that users can store their own  
> index metadata.
> (Remark: LUCENE-783 is also a neat idea, we can write one xml  
> parser for both items)

There's a significant downside to allowing users to store arbitrary  
data in an XML index file: you can't use a bare-bones parser, hand- 
coded for a tiny, controlled subset of XML syntax and a limited set  
of data structures.  You'd need a full-on XML encoder/decoder,  
presumably an existing one that would be added as a dependency.

The only reason that KinoSearch's YAML codec requires only 600  
lines of C is that it's a closed system.  No multi-line strings.  No  
objects.  No nulls.  You get the picture.

Is there anything you're envisioning that can't be done using a  
wrapper class and auxiliary/external files?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Flexible indexing

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 13, 2007, at 2:03 AM, Nicolas Lalevée wrote:

>> At present KS allows you to attach both a Similarity and an Analyzer
>> to a field name via a FieldSpec subclass.  I haven't quite figured
>> out how to attach a posting format.  Should it return an object, like
>> FieldSpec's similarity() method does?  Should it actually implement a
>> codec?  Not sure yet.  What do you think?
>
> The posting format defines how you want to store the terms data, so  
> defines
> how to search.

Hmm.  I'm talking about the stuff currently held in .frq, .prx,  
and .fNNN in Lucene.  That's not the terms data.  I think we're  
miscommunicating.

KinoSearch 0.20_01 and forward move the postings data  
from .frq, .prx, and .fNNN to a single file per field, with the  
extension .pNNN.  The philosophy of KS 0.20 is to have all binary  
"files" be decodable by launching a single iterator at the front of  
the file and having it read to the end.  (They're actually virtual  
files within the compound file -- KS only supports the compound  
format.)  That translates to one posting format per file.

> I don't think it is a good idea to mix different kinds of
> posting formats in the same index.

Allowing different fields to use different posting formats is very  
important.

When matching a value in a "category" field, all you might care about  
is whether the doc hits or not -- you don't care about freq, boost,  
per-position boost, any of that.  The posting format for "category"  
would thus specify "doc num only", and the .pNNN file would consist  
entirely of a sequence of delta-doc_num VInts.

In contrast, a "content" field scoring HTML source material might  
specify a posting format that includes boost-per-position.  Each  
record would have one doc_num, one freq, several positions, and  
several boosts.  The file would be much more complex.

If you want to score based on "content", but constrain results based  
on "category", you want to allow the simpler format for the  
"category" field, or you'll be wasting both disk and CPU.

It's actually possible to make multiple posting formats  
work within a single monolithic postings file, but I opted to avoid  
that for the sake of simplicity and ease of debugging.
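
To make the "doc num only" case above concrete, here is a minimal Java sketch of
what iterating such a per-field posting list could look like, assuming nothing but
a run of delta-encoded doc numbers. It is purely illustrative and not code from
KinoSearch or Lucene; only IndexInput.readVInt() is an existing Lucene call.

  import java.io.IOException;
  import org.apache.lucene.store.IndexInput;

  // Hypothetical reader for a "doc num only" posting format: each entry is a
  // single delta-encoded VInt, so iteration is just cumulative decoding.
  class DocOnlyPostings {
    private final IndexInput in;   // positioned at the start of the term's postings
    private final int docFreq;     // number of documents containing the term
    private int doc = 0;
    private int read = 0;

    DocOnlyPostings(IndexInput in, int docFreq) {
      this.in = in;
      this.docFreq = docFreq;
    }

    /** Returns the next document number, or -1 when the postings are exhausted. */
    int nextDoc() throws IOException {
      if (read == docFreq) {
        return -1;
      }
      doc += in.readVInt();  // delta-doc_num is the only datum per entry
      read++;
      return doc;
    }
  }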

> It will give Lucene the responsibility to
> manage different kinds of readers instantiating different kinds of
> termEnums
> and so on.

I've actually chosen to break up the term list into two separate  
files per field as well.  This was a more costly and dubious choice,  
but was harmonious with KinoSearch's expansion of field semantics.

KS will soon allow users to determine sort order of term texts within  
each field.  Keeping separate TermLists for each field means that I  
don't need to worry about either tracking field numbers/names or  
switching up comparators -- the TermList iterator terminates rather  
than proceeding on to another field like TermEnum does.

> I don't really know what the different kinds of impact of
> such a feature would be, but it might be quite difficult to manage it  
> correctly. But as
> the posting format can be redefined by the user, he can implement a  
> custom
> format which internally handles different kinds of data  
> associated with
> terms.

If you guarantee that the posting format for a given field can never  
change by imposing global field semantics, it's not a big deal.  If  
you break things up by field at both the file and the data structure  
level, it gets even easier.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Flexible indexing

Posted by Nicolas Lalevée <ni...@anyware-tech.com>.
Le Lundi 12 Mars 2007 21:34, Marvin Humphrey a écrit :
> On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:
> > - Introduce index format. Nicolas has already written a lot of code
> > in this regard!
>
> I worry that going the interface route is going to be too
> restrictive.  When I looked at Nicholas's index format spec, I
> immediately wanted to add an Analyzer and a bunch of other stuff to
> it.  Other people are going to want to add their own stuff.
>
> My suggestion is that the top-level plan for the index be called
> Schema, and that it be an abstract class.  An email to the KS list
> explaining the rationale behind KinoSearch's current version of this
> is below my sig.  Here are the API docs:
>
>    http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/
> Schema.html
>    http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema/
> FieldSpec.html
>
> It uses global field semantics, which Hoss won't be happy about.  ;)
> However, I'm grateful to Hoss for past critiques, as they've helped
> me to refine and improve how Schema works.  For instance, as of KS
> 0.20_02 you can introduce new field_name => FieldSpec associations to
> KS at any time during indexing.
>
> It may be that adapting Lucene to use something like what KS uses
> would be too radical a change.  However, I believe that one reason
> that flexible indexing has been in incubation so long is that the
> current mechanism for attaching semantics to field names does not
> scale as well as it might.
>
> For instance, the logical extension of the current FieldInfos system
> is to add booleans as described at <http://wiki.apache.org/lucene-
> java/FlexibleIndexing>.  However, conflict resolution during segment
> merging is going to present challenges.  What happens when in one
> segment 'content' has freq and in another segment it doesn't?  Things
> are so much easier if the posting format, once set, never changes.

Here you raise another issue. The "IndexFormat" of my submitted patch only 
talks about how data is stored: the field data and the terms/postings data. 
Here you are talking about how the terms/postings are created before storing 
them in the index. I agree with you that the behaviour is not clearly 
defined when there are different kinds of indexing options for the same field. 
This produces bugs like LUCENE-766. And I think I am still confused about it, 
because rethinking about the attached patch, the term vector data will be 
computed even if the user has specified TermVector.NO.

This issue needs to be discussed of course, but this is related to the 
implementation of a specific new format proposed here 
<http://wiki.apache.org/lucene-java/FlexibleIndexing> and the design of the 
Field constructor.

> > It will include different interfaces for the different extension
> > points (FieldsFormat, PostingFormat, DictionaryFormat).
>
> KS still uses TermDocs and its children, but I'm about to go in and
> replace them with PostingList.  What subclass of Posting the
> PostingList returns would be controlled by the FieldSpec.
>
> At present KS allows you to attach both a Similarity and an Analyzer
> to a field name via a FieldSpec subclass.  I haven't quite figured
> out how to attach a posting format.  Should it return an object, like
> FieldSpec's similarity() method does?  Should it actually implement a
> codec?  Not sure yet.  What do you think?

The posting format defines how you want to store the terms data, and so it 
defines how to search. I don't think it is a good idea to mix different kinds of 
posting formats in the same index. It would give Lucene the responsibility of 
managing different kinds of readers instantiating different kinds of TermEnums, 
and so on. I don't really know what the different kinds of impact of such a 
feature would be, but it might be quite difficult to manage correctly. But since 
the posting format can be redefined by the user, he can implement a custom 
format which internally handles different kinds of data associated with 
terms.

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com

>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
> --------------------------------------------------------------------
>
> Begin forwarded message:
> From: Marvin Humphrey <ma...@rectangular.com>
> Date: February 27, 2007 1:08:33 AM PST
> To: KinoSearch discussion forum <ki...@rectangular.com>
> Subject: [KinoSearch] KinoSearch::Schema - Rationale
> Reply-To: KinoSearch discussion forum <ki...@rectangular.com>
>
> Greets,
>
> The thing about Lucene/KS indexes is that all the information you
> need to read them can never be stored in the index files alone
> because there's always that bleedin' Analyzer.  You can look at a
> Lucene index and see that it has fields with certain names that are
> indexed, stored, etc, but you can't actually make sense of the
> index's content unless you know everything about all Analyzers used
> at index-time.
>
> Since the Analyzer is not hooked to the index file, but has to be
> created anew in every app that interacts with the index, it's often
> wrong, and analyzer mismatches are a constant source of confusion,
> frustration, and error for users.
>
> KinoSearch::Schema solves the Analyzer problem.  Not only that, but
> it sets the stage for attaching ever more semantic meaning to field
> names.  Not just booleans like "I'm indexed" and "I'm stored", but
> behaviors, objects...  For example, each field may now be associated
> with its own Similarity implementation, which affects scoring.  In
> the reasonably near future, the plan is to allow each FieldSpec to
> define a comparison sub which determines the sort order of terms.
> And so on.
>
> Schema is somewhat akin to SWISH's index configuration file, which
> can hold regexes, stoplists, and so on.  In fact, an earlier
> incarnation of Schema was primarily concerned with reading/writing a
> configuration file.  It attempted to solve the Lucene Analyzer
> problem by storing EVERYTHING, including a class name for the
> Analyzer; at search-time, the Analyzer object was created by calling
> a no-arg constructor.
>
> I ash-canned that design after trying to write docs explaining the
> bit about the no-arg constructor -- too confusing, not Perlish, and
> ultimately, less direct than allowing the user to write arbitrary
> code.  It's hard to maintain security, though, when you allow data
> files to contain code.  (I'm sure SWISH manages it, I just don't want
> the same headache).
>
> The thinking behind KinoSearch::Schema is, if you're going to create
> an index configuration file that has code in it, why not go all the  
> way, and make it a Perl module?  It's the best of all worlds.  You
> get to leverage the power of the language itself when defining your
> index structure, but it's also a self-contained, complete spec that
> both your indexing app and your search app can load.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
> _______________________________________________
> KinoSearch mailing list
> KinoSearch@rectangular.com
> http://www.rectangular.com/mailman/listinfo/kinosearch
>
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Flexible indexing

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 13, 2007, at 2:38 AM, Michael Busch wrote:

> Global field semantics make our life with FI much easier in a  
> single index. But even with global field semantics we would have  
> the same problem with the IndexWriter.addIndexes() method, no? I'm  
> curious about how you solved that conflict in KinoSearch?

I didn't.

The KinoSearch equivalent of IndexWriter.addIndexes() fails if either  
you attempt to add an index created using a different subclass of  
Schema, or if any mismatches are detected when comparing field name  
=> spec pairings.  No conflict resolution is attempted -- only  
validation.
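
As a minimal, self-contained illustration of that validation-only policy in Java
terms (FieldProps and the field maps below are hypothetical stand-ins, not
Lucene or KinoSearch classes):

  import java.util.Map;

  // Hypothetical sketch: instead of resolving conflicts during addIndexes(),
  // refuse the operation when two indexes disagree about a field's properties.
  class AddIndexesValidator {
    static final class FieldProps {
      final boolean indexed, stored, storePayloads;
      FieldProps(boolean indexed, boolean stored, boolean storePayloads) {
        this.indexed = indexed;
        this.stored = stored;
        this.storePayloads = storePayloads;
      }
      boolean sameAs(FieldProps other) {
        return indexed == other.indexed
            && stored == other.stored
            && storePayloads == other.storePayloads;
      }
    }

    /** Throws if any incoming field conflicts with the target index's definition. */
    static void validate(Map<String, FieldProps> target, Map<String, FieldProps> incoming) {
      for (Map.Entry<String, FieldProps> e : incoming.entrySet()) {
        FieldProps existing = target.get(e.getKey());
        if (existing != null && !existing.sameAs(e.getValue())) {
          throw new IllegalArgumentException(
              "field '" + e.getKey() + "' has conflicting properties; refusing to merge");
        }
      }
    }
  }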

By committing to resolving all field property conflicts, Lucene  
creates two problems for itself.

First, there's the burden of writing, maintaining, and using the  
conflict resolution code for each property.  Sometimes this code is  
problematic, as illustrated by a Michael McCandless post to java-user  
from this morning:

   Note, however, that you must do this for all Field instances by that
   same field name because whenever Lucene merges segments, if even one
   Document did not disable norms then this will "spread" so that all
   documents keep their norms, for the same field name.

Second, Lucene limits the kinds of properties that may be attached to  
field names to those where conflict resolution is possible, and which  
may be expressed entirely via a single boolean value.  If you want to  
hang more sophisticated semantics off of field names, it is necessary  
to apply ad-hoc solutions outside the system:  
PerFieldAnalyzerWrapper, subclassing Similarity and making lengthNorm()  
polymorphic depending on field name, etc.
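
For readers unfamiliar with that last workaround, a minimal sketch follows. It
uses Lucene's existing DefaultSimilarity and lengthNorm(String, int); the
special-cased "category" field name is just an arbitrary example, not something
prescribed anywhere.

  import org.apache.lucene.search.DefaultSimilarity;

  // Sketch of the "polymorphic lengthNorm()" workaround: per-field behavior is
  // wedged into a single Similarity subclass keyed on the field name.
  public class PerFieldLengthNormSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTokens) {
      if ("category".equals(fieldName)) {
        return 1.0f;  // flat norm: ignore field length for keyword-like fields
      }
      return super.lengthNorm(fieldName, numTokens);  // default behavior elsewhere
    }
  }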

Things get easier to control, grok, and extend if all per-field  
behaviors are determined by a single class rather than spread out.   
An Analyzer spec can be associated with a field name permanently,  
eliminating analyzer mismatches.  So can a Similarity  
implementation... soon, a posting format.

Every feature that accumulates adds to the pressure on Lucene's  
conflict resolution system and acts as a drag on innovation (because  
we are reluctant to complicate the interface further, as Yonik was  
with segOmitNorms).  By trading away a certain amount of flexibility  
with regards to what properties may be hung off of individual field  
values, that pressure is released, and we get a simplified code base  
and increased freedom to hang a greater diversity of properties off  
of individual field names.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Flexible indexing

Posted by Michael Busch <bu...@gmail.com>.
Marvin Humphrey wrote:
>
> It uses global field semantics, which Hoss won't be happy about.  ;)  
> However, I'm grateful to Hoss for past critiques, as they've helped me 
> to refine and improve how Schema works.  For instance, as of KS 
> 0.20_02 you can introduce new field_name => FieldSpec associations to 
> KS at any time during indexing.
> It may be that adapting Lucene to use something like what KS uses 
> would be too radical a change.  However, I believe that one reason 
> that flexible indexing has been in incubation so long is that the 
> current mechanism for attaching semantics to field names does not 
> scale as well as it might.
>
> For instance, the logical extension of the current FieldInfos system 
> is to add booleans as described at 
> <http://wiki.apache.org/lucene-java/FlexibleIndexing>.  However, 
> conflict resolution during segment merging is going to present 
> challenges.  What happens when in one segment 'content' has freq and 
> in another segment it doesn't?  Things are so much easier if the 
> posting format, once set, never changes.
>
I was thinking in the same direction. Global field semantics make our 
life with FI much easier in a single index. But even with global field 
semantics we would have the same problem with the 
IndexWriter.addIndexes() method, no? I'm curious about how you solved 
that conflict in KinoSearch? Btw, I like it that you don't force the 
user to define all fields up front but rather allow fields to be added at any 
time. I think if we implement global field semantics in Lucene we should 
go the same way.

I'm going to respond in more detail to your other points in this email 
tomorrow. I want to read the KinoSearch specs first but it's already 
kind of late....

- Michael


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Flexible indexing

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mar 10, 2007, at 3:27 PM, Michael Busch wrote:

> - Introduce index format. Nicolas has already written a lot of code  
> in this regard!

I worry that going the interface route is going to be too  
restrictive.  When I looked at Nicolas's index format spec, I  
immediately wanted to add an Analyzer and a bunch of other stuff to  
it.  Other people are going to want to add their own stuff.

My suggestion is that the top-level plan for the index be called  
Schema, and that it be an abstract class.  An email to the KS list  
explaining the rationale behind KinoSearch's current version of this  
is below my sig.  Here are the API docs:

   http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema.html
   http://www.rectangular.com/kinosearch/docs/devel/KinoSearch/Schema/FieldSpec.html

It uses global field semantics, which Hoss won't be happy about.  ;)   
However, I'm grateful to Hoss for past critiques, as they've helped  
me to refine and improve how Schema works.  For instance, as of KS  
0.20_02 you can introduce new field_name => FieldSpec associations to  
KS at any time during indexing.

It may be that adapting Lucene to use something like what KS uses  
would be too radical a change.  However, I believe that one reason  
that flexible indexing has been in incubation so long is that the  
current mechanism for attaching semantics to field names does not  
scale as well as it might.

For instance, the logical extension of the current FieldInfos system  
is to add booleans as described at 
<http://wiki.apache.org/lucene-java/FlexibleIndexing>.  However, conflict resolution during segment  
merging is going to present challenges.  What happens when in one  
segment 'content' has freq and in another segment it doesn't?  Things  
are so much easier if the posting format, once set, never changes.

> It will include different interfaces for the different extension  
> points (FieldsFormat, PostingFormat, DictionaryFormat).

KS still uses TermDocs and its children, but I'm about to go in and  
replace them with PostingList.  What subclass of Posting the  
PostingList returns would be controlled by the FieldSpec.

At present KS allows you to attach both a Similarity and an Analyzer  
to a field name via a FieldSpec subclass.  I haven't quite figured  
out how to attach a posting format.  Should it return an object, like  
FieldSpec's similarity() method does?  Should it actually implement a  
codec?  Not sure yet.  What do you think?
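
As a rough, purely illustrative Java sketch of the Schema/FieldSpec idea (none of
these classes exist in Lucene, and the method names are only guesses at what a
port might look like; Analyzer and Similarity are the real Lucene types):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.search.Similarity;

  // Illustrative only: a Lucene-side analogue of KinoSearch's Schema/FieldSpec,
  // where per-field behavior hangs off the spec instead of per-Field booleans.
  abstract class FieldSpec {
    abstract boolean indexed();
    abstract boolean stored();
    Similarity similarity() { return Similarity.getDefault(); }  // per-field scoring
    Analyzer analyzer() { return null; }                         // null = use the index default
    // an attachment point for a posting format could live here as well
  }

  abstract class Schema {
    /** Binds a field name to its spec; new bindings may be added at any time during indexing. */
    abstract void addField(String name, FieldSpec spec);

    /** Returns the spec bound to a field name, or null if the field is unknown. */
    abstract FieldSpec fieldSpec(String name);
  }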

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

--------------------------------------------------------------------

Begin forwarded message:
From: Marvin Humphrey <ma...@rectangular.com>
Date: February 27, 2007 1:08:33 AM PST
To: KinoSearch discussion forum <ki...@rectangular.com>
Subject: [KinoSearch] KinoSearch::Schema - Rationale
Reply-To: KinoSearch discussion forum <ki...@rectangular.com>

Greets,

The thing about Lucene/KS indexes is that all the information you  
need to read them can never be stored in the index files alone  
because there's always that bleedin' Analyzer.  You can look at a  
Lucene index and see that it has fields with certain names that are  
indexed, stored, etc, but you can't actually make sense of the  
index's content unless you know everything about all Analyzers used  
at index-time.

Since the Analyzer is not hooked to the index file, but has to be  
created anew in every app that interacts with the index, it's often  
wrong, and analyzer mismatches are a constant source of confusion,  
frustration, and error for users.

KinoSearch::Schema solves the Analyzer problem.  Not only that, but  
it sets the stage for attaching ever more semantic meaning to field  
names.  Not just booleans like "I'm indexed" and "I'm stored", but  
behaviors, objects...  For example, each field may now be associated  
with its own Similarity implementation, which affects scoring.  In  
the reasonably near future, the plan is to allow each FieldSpec to  
define a comparison sub which determines the sort order of terms.   
And so on.

Schema is somewhat akin to SWISH's index configuration file, which  
can hold regexes, stoplists, and so on.  In fact, an earlier  
incarnation of Schema was primarily concerned with reading/writing a  
configuration file.  It attempted to solve the Lucene Analyzer  
problem by storing EVERYTHING, including a class name for the  
Analyzer; at search-time, the Analyzer object was created by calling  
a no-arg constructor.

I ash-canned that design after trying to write docs explaining the  
bit about the no-arg constructor -- too confusing, not Perlish, and  
ultimately, less direct than allowing the user to write arbitrary  
code.  It's hard to maintain security, though, when you allow data  
files to contain code.  (I'm sure SWISH manages it, I just don't want  
the same headache).

The thinking behind KinoSearch::Schema is, if you're going to create  
an index configuration file that has code in it, why not go all the  
way, and make it a Perl module?  It's the best of all worlds.  You  
get to leverage the power of the language itself when defining your  
index structure, but it's also a self-contained, complete spec that  
both your indexing app and your search app can load.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch






---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Flexible indexing (was: Re: [jira] Commented: (LUCENE-755) Payloads)

Posted by Michael Busch <bu...@gmail.com>.
Hi Grant,

LUCENE-662 contains different ideas:
1) introduction of an index format concept
2) extensibility of the store reader/writer
3) New: extensibility of the posting reader/writer

IMO we should split this up; that way it will be easier to develop 
smaller patches that focus on adding one particular feature. However, it 
is important to plan the API so that different patches (like payloads) 
fit in. On the other hand, it will be nearly impossible to plan an API 
that is perfect and won't change anymore without having the actual 
implementations. Therefore I suggest the following steps:
a) define the different work items of flexible indexing
b) roughly plan an API that fits all items
c) develop the different items and commit them, but with APIs that are 
either protected or marked as experimental
d) after all items are completed and committed (and hopefully tested by 
some brave community members ;)), finalize the API and remove the 
experimental markers (or make the APIs public)

Let's start with a):

The following items come to my mind (please feel free to 
add/remove/complain):
- Introduce index-level metadata. Preferably in XML format, so it will 
be human readable. Later on, we can store information about the index 
format in this file, like the codecs that are used to store the data. We 
should also make this public, so that users can store their own index 
metadata. (Remark: LUCENE-783 is also a neat idea, we can write one xml 
parser for both items)

- Introduce index format. Nicolas has already written a lot of code in 
this regard! It will include different interfaces for the different 
extension points (FieldsFormat, PostingFormat, DictionaryFormat). We can 
use the xml file to store which actual formats are used in the 
corresponding index.

- Implement the different extensions. LUCENE-662 includes an extensible 
FieldsWriter, LUCENE-755 the payloads feature. Doug and Ning suggested 
already nice interfaces for PostingFormat and DictionaryFormat in the 
payloads thread on java-dev.

- Write standard implementations for the different formats. In the wiki 
is already a list of desired posting formats.


I suggest we should finalize this list first. Then I will add this list 
to the wiki under Flexible indexing and gather information from the 
different discussions on java-dev which I already mentioned. Then we 
should discuss the different items of this list in greater depth and 
plan the APIs (step b)).  And then we're ready for step c), and 
the fun starts :-).

Michael


Grant Ingersoll wrote:
> I think it makes the most sense to get flexible indexing in first, and 
> then make payloads work with it.  On the other hand, payloads looked 
> pretty straightforward to me, whereas FI is much more involved (or at 
> least it feels that way).
>
> As it is right now, I would like to at least review the two patches 
> and start thinking about them in greater depth.  The payloads patch 
> needs a little more work in that I want to integrate it with the 
> Similarity class so people can customize their scoring.
>
> -Grant
>
> On Mar 10, 2007, at 9:30 AM, Nicolas Lalevée (JIRA) wrote:
>
>>
>>     [ 
>> https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479841 
>> ]
>>
>> Nicolas Lalevée commented on LUCENE-755:
>> ----------------------------------------
>>
>> Grant>
>> The patch I have proposed here has no dependency on LUCENE-662, I just 
>> "imported" some ideas from it and put them there. Since 
>> LUCENE-662 has evolved, the patches will probably conflict. 
>> The best one to use here is Michael's. I think it won't conflict with 
>> LUCENE-662. And if both are intended to be committed, then the best is 
>> to commit them both separately and redo the work I have done with the 
>> provided patch (I remember that it was quite easy).
>>
>>
>>> Payloads
>>> --------
>>>
>>>                 Key: LUCENE-755
>>>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>>>             Project: Lucene - Java
>>>          Issue Type: New Feature
>>>          Components: Index
>>>            Reporter: Michael Busch
>>>         Assigned To: Michael Busch
>>>         Attachments: payload.patch, payloads.patch
>>>
>>>
>>> This patch adds the possibility to store arbitrary metadata 
>>> (payloads) together with each position of a term in its posting 
>>> lists. A while ago this was discussed on the dev mailing list, where 
>>> I proposed an initial design. This patch has a much improved design 
>>> with modifications, that make this new feature easier to use and 
>>> more efficient.
>>> A payload is an array of bytes that can be stored inline in the 
>>> ProxFile (.prx). Therefore this patch provides low-level APIs to 
>>> simply store and retrieve byte arrays in the posting lists in an 
>>> efficient way.
>>> API and Usage
>>> ------------------------------
>>> The new class index.Payload is basically just a wrapper around a 
>>> byte[] array together with int variables for offset and length. So a 
>>> user does not have to create a byte array for every payload, but can 
>>> rather allocate one array for all payloads of a document and provide 
>>> offset and length information. This reduces object allocations on 
>>> the application side.
>>> In order to store payloads in the posting lists one has to provide a 
>>> TokenStream or TokenFilter that produces Tokens with payloads. I 
>>> added the following two methods to the Token class:
>>>   /** Sets this Token's payload. */
>>>   public void setPayload(Payload payload);
>>>
>>>   /** Returns this Token's payload. */
>>>   public Payload getPayload();
>>> In order to retrieve the data from the index the interface 
>>> TermPositions now offers two new methods:
>>>   /** Returns the payload length of the current term position.
>>>    *  This is invalid until {@link #nextPosition()} is called for
>>>    *  the first time.
>>>    *
>>>    * @return length of the current payload in number of bytes
>>>    */
>>>   int getPayloadLength();
>>>
>>>   /** Returns the payload data of the current term position.
>>>    * This is invalid until {@link #nextPosition()} is called for
>>>    * the first time.
>>>    * This method must not be called more than once after each call
>>>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>>>    * so if the payload data for the current position is not needed,
>>>    * this method may not be called at all for performance reasons.
>>>    *
>>>    * @param data the array into which the data of this payload is to be
>>>    *             stored, if it is big enough; otherwise, a new 
>>> byte[] array
>>>    *             is allocated for this purpose.
>>>    * @param offset the offset in the array into which the data of 
>>> this payload
>>>    *               is to be stored.
>>>    * @return a byte[] array containing the data of this payload
>>>    * @throws IOException
>>>    */
>>>   byte[] getPayload(byte[] data, int offset) throws IOException;
>>> Furthermore, this patch indroduces the new method 
>>> IndexOutput.writeBytes(byte[] b, int offset, int length). So far 
>>> there was only a writeBytes()-method without an offset argument.
>>> Implementation details
>>> ------------------------------
>>> - One field bit in FieldInfos is used to indicate if payloads are 
>>> enabled for a field. The user does not have to enable payloads for a 
>>> field, this is done automatically:
>>>    * The DocumentWriter enables payloads for a field, if one ore 
>>> more Tokens carry payloads.
>>>    * The SegmentMerger enables payloads for a field during a merge, 
>>> if payloads are enabled for that field in one or more segments.
>>> - Backwards compatible: If payloads are not used, then the formats 
>>> of the ProxFile and FreqFile don't change
>>> - Payloads are stored inline in the posting list of a term in the 
>>> ProxFile. A payload of a term occurrence is stored right after its 
>>> PositionDelta.
>>> - Same-length compression: If payloads are enabled for a field, then 
>>> the PositionDelta is shifted one bit. The lowest bit is used to 
>>> indicate whether the length of the following payload is stored 
>>> explicitly. If not, i. e. the bit is false, then the payload has the 
>>> same length as the payload of the previous term occurrence.
>>> - In order to support skipping on the ProxFile the length of the 
>>> payload at every skip point has to be known. Therefore the payload 
>>> length is also stored in the skip list located in the FreqFile. Here 
>>> the same-length compression is also used: The lowest bit of DocSkip 
>>> is used to indicate if the payload length is stored for a SkipDatum 
>>> or if the length is the same as in the last SkipDatum.
>>> - Payloads are loaded lazily. When a user calls 
>>> TermPositions.nextPosition() then only the position and the payload 
>>> length is loaded from the ProxFile. If the user calls getPayload() 
>>> then the payload is actually loaded. If getPayload() is not called 
>>> before nextPosition() is called again, then the payload data is just 
>>> skipped.
>>>
>>> Changes of file formats
>>> ------------------------------
>>> - FieldInfos (.fnm)
>>> The format of the .fnm file does not change. The only change is the 
>>> use of the sixth lowest-order bit (0x20) of the FieldBits. If this 
>>> bit is set, then payloads are enabled for the corresponding field.
>>> - ProxFile (.prx)
>>> ProxFile (.prx) -->  <TermPositions>^TermCount
>>> TermPositions   --> <Positions>^DocFreq
>>> Positions       --> <PositionDelta, Payload?>^Freq
>>> Payload         --> <PayloadLength?, PayloadData>
>>> PositionDelta   --> VInt
>>> PayloadLength   --> VInt
>>> PayloadData     --> byte^PayloadLength
>>> For payloads disabled (unchanged):
>>> PositionDelta is the difference between the position of the current 
>>> occurrence in the document and the previous occurrence (or zero, if 
>>> this is the first   occurrence in this document).
>>>
>>> For Payloads enabled:
>>> PositionDelta/2 is the difference between the position of the 
>>> current occurrence in the document and the previous occurrence. If 
>>> PositionDelta is odd, then PayloadLength is stored. If PositionDelta 
>>> is even, then the length of the current payload equals the length of 
>>> the previous payload and thus PayloadLength is omitted.
>>> - FreqFile (.frq)
>>> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
>>> PayloadLength --> VInt
>>> For payloads disabled (unchanged):
>>> DocSkip records the document number before every SkipInterval th 
>>> document in TermFreqs. Document numbers are represented as 
>>> differences from the previous value in the sequence.
>>> For payloads enabled:
>>> DocSkip/2 records the document number before every SkipInterval th  
>>> document in TermFreqs. If DocSkip is odd, then PayloadLength 
>>> follows. If DocSkip is even, then the length of the payload at the 
>>> current skip point equals the length of the payload at the last skip 
>>> point and thus PayloadLength is omitted.
>>> This encoding is space efficient for different use cases:
>>>    * If only some fields of an index have payloads, then there's no 
>>> space overhead for the fields with payloads disabled.
>>>    * If the payloads of consecutive term positions have the same 
>>> length, then the length only has to be stored once for every term. 
>>> This should be a common case, because users probably use the same 
>>> format for all payloads.
>>>    * If only a few terms of a field have payloads, then we don't 
>>> waste much space because we benefit again from the 
>>> same-length-compression since we only have to store the length zero 
>>> for the empty payloads once per term.
>>> All unit tests pass.
>>
>> --This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-755) Payloads

Posted by Grant Ingersoll <gr...@gmail.com>.
I think it makes the most sense to get flexible indexing in first,  
and then make payloads work with it.  On the other hand, payloads  
looked pretty straightforward to me, whereas FI is much more involved  
(or at least it feels that way).

As it is right now, I would like to at least review the two patches  
and start thinking about them in greater depth.  The payloads patch  
needs a little more work in that I want to integrate it with the  
Similarity class so people can customize their scoring.

-Grant

On Mar 10, 2007, at 9:30 AM, Nicolas Lalevée (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-755? 
> page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
> tabpanel#action_12479841 ]
>
> Nicolas Lalevée commented on LUCENE-755:
> ----------------------------------------
>
> Grant>
> The patch I have proposed here has no dependency on LUCENE-662, I  
> just "imported" some ideas from it and put them there. Since  
> LUCENE-662 has evolved, the patches will probably conflict.  
> The best one to use here is Michael's. I think it won't conflict  
> with LUCENE-662. And if both are intended to be committed, then the  
> best is to commit them both separately and redo the work I have done  
> with the provided patch (I remember that it was quite easy).
>
>
>> Payloads
>> --------
>>
>>                 Key: LUCENE-755
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>>             Project: Lucene - Java
>>          Issue Type: New Feature
>>          Components: Index
>>            Reporter: Michael Busch
>>         Assigned To: Michael Busch
>>         Attachments: payload.patch, payloads.patch
>>
>>
>> This patch adds the possibility to store arbitrary metadata  
>> (payloads) together with each position of a term in its posting  
>> lists. A while ago this was discussed on the dev mailing list,  
>> where I proposed an initial design. This patch has a much improved  
>> design with modifications, that make this new feature easier to  
>> use and more efficient.
>> A payload is an array of bytes that can be stored inline in the  
>> ProxFile (.prx). Therefore this patch provides low-level APIs to  
>> simply store and retrieve byte arrays in the posting lists in an  
>> efficient way.
>> API and Usage
>> ------------------------------
>> The new class index.Payload is basically just a wrapper around a  
>> byte[] array together with int variables for offset and length. So  
>> a user does not have to create a byte array for every payload, but  
>> can rather allocate one array for all payloads of a document and  
>> provide offset and length information. This reduces object  
>> allocations on the application side.
>> In order to store payloads in the posting lists one has to provide  
>> a TokenStream or TokenFilter that produces Tokens with payloads. I  
>> added the following two methods to the Token class:
>>   /** Sets this Token's payload. */
>>   public void setPayload(Payload payload);
>>
>>   /** Returns this Token's payload. */
>>   public Payload getPayload();
>> In order to retrieve the data from the index the interface  
>> TermPositions now offers two new methods:
>>   /** Returns the payload length of the current term position.
>>    *  This is invalid until {@link #nextPosition()} is called for
>>    *  the first time.
>>    *
>>    * @return length of the current payload in number of bytes
>>    */
>>   int getPayloadLength();
>>
>>   /** Returns the payload data of the current term position.
>>    * This is invalid until {@link #nextPosition()} is called for
>>    * the first time.
>>    * This method must not be called more than once after each call
>>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>>    * so if the payload data for the current position is not needed,
>>    * this method may not be called at all for performance reasons.
>>    *
>>    * @param data the array into which the data of this payload is  
>> to be
>>    *             stored, if it is big enough; otherwise, a new byte 
>> [] array
>>    *             is allocated for this purpose.
>>    * @param offset the offset in the array into which the data of  
>> this payload
>>    *               is to be stored.
>>    * @return a byte[] array containing the data of this payload
>>    * @throws IOException
>>    */
>>   byte[] getPayload(byte[] data, int offset) throws IOException;
>> Furthermore, this patch indroduces the new method  
>> IndexOutput.writeBytes(byte[] b, int offset, int length). So far  
>> there was only a writeBytes()-method without an offset argument.
>> Implementation details
>> ------------------------------
>> - One field bit in FieldInfos is used to indicate if payloads are  
>> enabled for a field. The user does not have to enable payloads for  
>> a field, this is done automatically:
>>    * The DocumentWriter enables payloads for a field, if one ore  
>> more Tokens carry payloads.
>>    * The SegmentMerger enables payloads for a field during a  
>> merge, if payloads are enabled for that field in one or more  
>> segments.
>> - Backwards compatible: If payloads are not used, then the formats  
>> of the ProxFile and FreqFile don't change
>> - Payloads are stored inline in the posting list of a term in the  
>> ProxFile. A payload of a term occurrence is stored right after its  
>> PositionDelta.
>> - Same-length compression: If payloads are enabled for a field,  
>> then the PositionDelta is shifted one bit. The lowest bit is used  
>> to indicate whether the length of the following payload is stored  
>> explicitly. If not, i. e. the bit is false, then the payload has  
>> the same length as the payload of the previous term occurrence.
>> - In order to support skipping on the ProxFile the length of the  
>> payload at every skip point has to be known. Therefore the payload  
>> length is also stored in the skip list located in the FreqFile.  
>> Here the same-length compression is also used: The lowest bit of  
>> DocSkip is used to indicate if the payload length is stored for a  
>> SkipDatum or if the length is the same as in the last SkipDatum.
>> - Payloads are loaded lazily. When a user calls  
>> TermPositions.nextPosition() then only the position and the  
>> payload length is loaded from the ProxFile. If the user calls  
>> getPayload() then the payload is actually loaded. If getPayload()  
>> is not called before nextPosition() is called again, then the  
>> payload data is just skipped.
>>
>> Changes of file formats
>> ------------------------------
>> - FieldInfos (.fnm)
>> The format of the .fnm file does not change. The only change is  
>> the use of the sixth lowest-order bit (0x20) of the FieldBits. If  
>> this bit is set, then payloads are enabled for the corresponding  
>> field.
>> - ProxFile (.prx)
>> ProxFile (.prx) -->  <TermPositions>^TermCount
>> TermPositions   --> <Positions>^DocFreq
>> Positions       --> <PositionDelta, Payload?>^Freq
>> Payload         --> <PayloadLength?, PayloadData>
>> PositionDelta   --> VInt
>> PayloadLength   --> VInt
>> PayloadData     --> byte^PayloadLength
>> For payloads disabled (unchanged):
>> PositionDelta is the difference between the position of the  
>> current occurrence in the document and the previous occurrence (or  
>> zero, if this is the first   occurrence in this document).
>>
>> For Payloads enabled:
>> PositionDelta/2 is the difference between the position of the  
>> current occurrence in the document and the previous occurrence. If  
>> PositionDelta is odd, then PayloadLength is stored. If  
>> PositionDelta is even, then the length of the current payload  
>> equals the length of the previous payload and thus PayloadLength  
>> is omitted.
>> - FreqFile (.frq)
>> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
>> PayloadLength --> VInt
>> For payloads disabled (unchanged):
>> DocSkip records the document number before every SkipInterval th  
>> document in TermFreqs. Document numbers are represented as  
>> differences from the previous value in the sequence.
>> For payloads enabled:
>> DocSkip/2 records the document number before every SkipInterval  
>> th  document in TermFreqs. If DocSkip is odd, then PayloadLength  
>> follows. If DocSkip is even, then the length of the payload at the  
>> current skip point equals the length of the payload at the last  
>> skip point and thus PayloadLength is omitted.
>> This encoding is space efficient for different use cases:
>>    * If only some fields of an index have payloads, then there's  
>> no space overhead for the fields with payloads disabled.
>>    * If the payloads of consecutive term positions have the same  
>> length, then the length only has to be stored once for every term.  
>> This should be a common case, because users probably use the same  
>> format for all payloads.
>>    * If only a few terms of a field have payloads, then we don't  
>> waste much space because we benefit again from the same-length- 
>> compression since we only have to store the length zero for the  
>> empty payloads once per term.
>> All unit tests pass.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-755) Payloads

Posted by "Nicolas Lalevée (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479841 ] 

Nicolas Lalevée commented on LUCENE-755:
----------------------------------------

Grant>
The patch I have proposed here has no dependency on LUCENE-662, I just "imported" some ideas from it and put them there. Since LUCENE-662 has evolved, the patches will probably conflict. The best one to use here is Michael's. I think it won't conflict with LUCENE-662. And if both are intended to be committed, then the best is to commit them both separately and redo the work I have done with the provided patch (I remember that it was quite easy).


> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications, that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> ------------------------------   
> The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    * 
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    * 
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose. 
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch indroduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes()-method without an offset argument. 
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field, this is done automatically:
>    * The DocumentWriter enables payloads for a field, if one ore more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i. e. the bit is false, then the payload has the same length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition() then only the position and the payload length is loaded from the ProxFile. If the user calls getPayload() then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
>   
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file does not change. The only change is the use of the sixth lowest-order bit (0x20) of the FieldBits. If this bit is set, then payloads are enabled for the corresponding field. 
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt 
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first   occurrence in this document).
>   
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval th  document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
> This encoding is space efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space because we benefit again from the same-length-compression since we only have to store the length zero for the empty payloads once per term.
> All unit tests pass.
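
To make the same-length compression described in the quoted description above
concrete, here is a small writer-side sketch. It is an illustration, not code from
the patch; it assumes Lucene's IndexOutput plus the writeBytes(byte[], int, int)
overload the patch introduces, and the caller would track lastPayloadLength.

  import java.io.IOException;
  import org.apache.lucene.store.IndexOutput;

  // Illustration of the encoding: PositionDelta is shifted left one bit and the
  // low bit signals whether an explicit PayloadLength follows.
  class PositionWriterSketch {
    void writePosition(IndexOutput prox, int positionDelta,
                       byte[] payload, int payloadOffset, int payloadLength,
                       int lastPayloadLength) throws IOException {
      if (payloadLength != lastPayloadLength) {
        prox.writeVInt((positionDelta << 1) | 1);  // odd: PayloadLength is stored
        prox.writeVInt(payloadLength);
      } else {
        prox.writeVInt(positionDelta << 1);        // even: reuse the previous length
      }
      prox.writeBytes(payload, payloadOffset, payloadLength);  // PayloadData
    }
  }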

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-755) Payloads

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481003 ] 

Grant Ingersoll commented on LUCENE-755:
----------------------------------------

OK, I've applied the patch.  All tests pass for me.  I think it looks  
good.  Have you run any benchmarks on it?  I ran the standard one on  
the patched version and on trunk, in a totally unscientific test.  In  
theory, the case with no payloads should perform very close to the  
existing code, and this seems to be borne out by my running the micro- 
standard (ant run-task in contrib/benchmark).   Once we have this  
committed someone can take a crack at adding support to the  
benchmarker for payloads.

Payload should probably be serializable.

All in all, I think we could commit this, then add the search/ 
scoring capabilities like we've talked about.  I like the  
documentation/comments you have added, very useful.  (One of these  
days I will take on documenting the index package like I intend to,  
so what you've added will be quite helpful!)   We will/may want to  
add in, for example, a PayloadQuery and derivatives and a QueryParser  
operator that supports searching in the payload, or possibly  
boosting if a certain term has a certain type of payload (not that I  
want anything to do with the QueryParser).  Even beyond that,  
SpanPayloadQuery, etc.  I will possibly have some cycles to actually  
write some code for these next week.
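
As a purely hypothetical illustration of the boost-by-payload idea (PayloadQuery, SpanPayloadQuery and the QueryParser operator mentioned above do not exist yet), a scoring-side helper could read the payloads of a term through the new TermPositions methods and map, say, the first payload byte to a per-document boost. The class, the method and the byte-to-boost mapping below are invented for the sketch.

  // Hypothetical sketch of payload-based boosting; not part of this patch.
  import java.io.IOException;
  import org.apache.lucene.index.TermPositions;

  class PayloadBoostSketch {
    /** Returns a boost for the document the TermPositions is currently on,
     *  derived from the first byte of each payload (assumed to hold a small
     *  application-defined "term type"). */
    static float payloadBoost(TermPositions tp) throws IOException {
      float boost = 1.0f;
      byte[] buffer = new byte[16];            // reused across positions
      int freq = tp.freq();
      for (int i = 0; i < freq; i++) {
        tp.nextPosition();
        if (tp.getPayloadLength() > 0) {
          buffer = tp.getPayload(buffer, 0);   // payloads are loaded lazily
          boost += buffer[0] / 128.0f;         // toy mapping from payload to boost
        }
      }
      return boost;
    }
  }

A caller would obtain the TermPositions from IndexReader.termPositions(term), advance it to a document with next() or skipTo(), and then apply the helper while scoring.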

Just throwing this out there, and I'm not sure whether I really mean
it :-), but: do you think it would be useful to consider restricting
the size of the payload?  I know, I know, as soon as we put a limit
on it someone will want to expand it, but I was thinking that if we
knew the size had a limit, we could better control the performance
and caching, etc. on the scoring/search side.  I guess it is buyer
beware; maybe we put some javadocs on this.

Also, I started http://wiki.apache.org/lucene-java/Payloads, as I
think we will want some docs, outside of the javadocs, explaining why
Payloads are useful.

On a side note, have a look at
http://wiki.apache.org/lucene-java/PatchCheckList to see if there is
anything you feel you can add.



--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch, payloads.patch, payloads.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design. This patch has a much improved design with modifications that make this new feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> ------------------------------   
> The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    * 
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    * 
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose. 
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). Until now there was only a writeBytes() method without an offset argument. 
> Implementation details
> ------------------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field; this is done automatically:
>    * The DocumentWriter enables payloads for a field, if one or more Tokens carry payloads.
>    * The SegmentMerger enables payloads for a field during a merge, if payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the PositionDelta is shifted one bit. The lowest bit is used to indicate whether the length of the following payload is stored explicitly. If not, i.e. if the bit is not set, then the payload has the same length as the payload of the previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at every skip point has to be known. Therefore the payload length is also stored in the skip list located in the FreqFile. Here the same-length compression is also used: The lowest bit of DocSkip is used to indicate if the payload length is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition(), only the position and the payload length are loaded from the ProxFile. If the user calls getPayload() then the payload is actually loaded. If getPayload() is not called before nextPosition() is called again, then the payload data is just skipped.
>   
> Changes of file formats
> ------------------------------
> - FieldInfos (.fnm)
> The format of the .fnm file itself does not change; the only difference is that the sixth lowest-order bit (0x20) of the FieldBits is now used. If this bit is set, then payloads are enabled for the corresponding field. 
> - ProxFile (.prx)
> ProxFile (.prx) -->  <TermPositions>^TermCount
> TermPositions   --> <Positions>^DocFreq
> Positions       --> <PositionDelta, Payload?>^Freq
> Payload         --> <PayloadLength?, PayloadData>
> PositionDelta   --> VInt
> PayloadLength   --> VInt 
> PayloadData     --> byte^PayloadLength
> For payloads disabled (unchanged):
> PositionDelta is the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document).
>   
> For Payloads enabled:
> PositionDelta/2 is the difference between the position of the current occurrence in the document and the previous occurrence. If PositionDelta is odd, then PayloadLength is stored. If PositionDelta is even, then the length of the current payload equals the length of the previous payload and thus PayloadLength is omitted.
> - FreqFile (.frq)
> SkipDatum     --> DocSkip, PayloadLength?, FreqSkip, ProxSkip
> PayloadLength --> VInt
> For payloads disabled (unchanged):
> DocSkip records the document number before every SkipInterval-th document in TermFreqs. Document numbers are represented as differences from the previous value in the sequence.
> For payloads enabled:
> DocSkip/2 records the document number before every SkipInterval-th document in TermFreqs. If DocSkip is odd, then PayloadLength follows. If DocSkip is even, then the length of the payload at the current skip point equals the length of the payload at the last skip point and thus PayloadLength is omitted.
> This encoding is space-efficient for different use cases:
>    * If only some fields of an index have payloads, then there's no space overhead for the fields with payloads disabled.
>    * If the payloads of consecutive term positions have the same length, then the length only has to be stored once for every term. This should be a common case, because users probably use the same format for all payloads.
>    * If only a few terms of a field have payloads, then we don't waste much space, because the same-length compression means we only have to store a length of zero once per term for the positions without payloads.
> All unit tests pass.
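
For completeness, a minimal sketch of the write side described above: a TokenFilter that attaches a one-byte payload to every token. Token.setPayload()/getPayload() are the methods this patch adds; the assumption that Payload can be constructed from a byte[] plus offset and length follows the description of the class, but the exact constructor signature is not shown in this issue, and the "type" byte itself is just an example.

  // Sketch of an indexing-side filter that attaches a one-byte payload to
  // every token; assumes a Payload(byte[], offset, length) style constructor.
  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.index.Payload;

  class TypeBytePayloadFilter extends TokenFilter {
    TypeBytePayloadFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      Token token = input.next();
      if (token != null) {
        byte[] data = new byte[] { 1 };        // application-defined "type" byte
        token.setPayload(new Payload(data, 0, 1));
      }
      return token;
    }
  }

Analyzing a field with such a filter is all that is needed to turn payloads on for that field: as described above, the DocumentWriter enables the payload bit automatically when one or more Tokens carry payloads.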
