Posted to issues@lucene.apache.org by "kkewwei (Jira)" <ji...@apache.org> on 2019/12/19 12:49:00 UTC

[jira] [Updated] (LUCENE-9101) lengthsBuf in CompressingTermVectorsWriter is a bit redundant

     [ https://issues.apache.org/jira/browse/LUCENE-9101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kkewwei updated LUCENE-9101:
----------------------------
    Description: 
In CompressingTermVectorsWriter, lengthsBuf stores the length of every term occurrence. For example, for "a a a b1 b1 b1 b1", lengthsBuf=[1,1,1,2,2,2,2]: "a" appears three times, so its length is stored three times, which seems a bit redundant.
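To make the redundancy concrete, here is a standalone sketch (hypothetical class and method names, not Lucene code) that builds both representations for the example stream above: one length entry per occurrence versus one entry per distinct term.

```java
import java.util.Arrays;

public class LengthsBufDemo {
    // Current behaviour: one length entry per occurrence.
    static int[] perOccurrenceLengths(String[] terms, int[] freqs) {
        int total = Arrays.stream(freqs).sum();
        int[] out = new int[total];
        int pos = 0;
        for (int i = 0; i < terms.length; ++i) {
            for (int j = 0; j < freqs[i]; ++j) {
                out[pos++] = terms[i].length();
            }
        }
        return out;
    }

    // Proposed behaviour: one length entry per distinct term.
    static int[] perTermLengths(String[] terms) {
        int[] out = new int[terms.length];
        for (int i = 0; i < terms.length; ++i) {
            out[i] = terms[i].length();
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b1"};
        int[] freqs = {3, 4}; // "a" occurs 3 times, "b1" occurs 4 times
        System.out.println(Arrays.toString(perOccurrenceLengths(terms, freqs))); // [1, 1, 1, 2, 2, 2, 2]
        System.out.println(Arrays.toString(perTermLengths(terms)));              // [1, 2]
    }
}
```

No information is lost by the smaller array: the per-occurrence values can always be recovered from the per-term lengths plus the frequencies.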

It is used in CompressingTermVectorsWriter.flushOffsets:

 
{code:java}
private void flushOffsets(int[] fieldNums) throws IOException {
  ......
  // lengths
  writer.reset(vectorsStream);
  for (DocData dd : pendingDocs) {
    for (FieldData fd : dd.fields) {
      if ((fd.flags & OFFSETS) != 0) {
        int pos = 0;
        for (int i = 0; i < fd.numTerms; ++i) {
          for (int j = 0; j < fd.freqs[i]; ++j) { 
            writer.add(lengthsBuf[fd.offStart + pos++] - fd.prefixLengths[i] - fd.suffixLengths[i]);
          }
        }
        assert pos == fd.totalPositions;
      }
    }
  }
  writer.finish();
}{code}
 

We can simplify it to lengthsBuf=[1,2]: each distinct term's length is stored only once. We could then use an `int count` to track the index of the term currently being processed, for example:

 
{code:java}
private void flushOffsets(int[] fieldNums) throws IOException {
  ......
  // lengths
  writer.reset(vectorsStream);
  int count = 0; // index of the current distinct term in the deduplicated lengthsBuf
  for (DocData dd : pendingDocs) {
    for (FieldData fd : dd.fields) {
      if ((fd.flags & OFFSETS) != 0) {
        int pos = 0;
        for (int i = 0; i < fd.numTerms; ++i) {
          for (int j = 0; j < fd.freqs[i]; ++j) {
            // one stored length per distinct term, written once per occurrence
            writer.add(lengthsBuf[count] - fd.prefixLengths[i] - fd.suffixLengths[i]);
            ++pos;
          }
          ++count;
        }
        assert pos == fd.totalPositions;
      }
    }
  }
  writer.finish();
}{code}
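To check that the proposed indexing writes the same sequence of values as the current one, here is a standalone sketch (hypothetical class and method names, not Lucene code) that simulates both loops over the example data and compares their output, with prefix/suffix adjustments omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class FlushEquivalenceDemo {
    // Values the current loop emits: lengthsBuf holds one entry per occurrence.
    static List<Integer> flushPerOccurrence(int[] lengthsBuf, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < freqs.length; ++i) {
            for (int j = 0; j < freqs[i]; ++j) {
                out.add(lengthsBuf[pos++]);
            }
        }
        return out;
    }

    // Values the proposed loop emits: lengthsBuf holds one entry per distinct
    // term, indexed by a running term counter.
    static List<Integer> flushPerTerm(int[] lengthsBuf, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int count = 0;
        for (int i = 0; i < freqs.length; ++i) {
            for (int j = 0; j < freqs[i]; ++j) {
                out.add(lengthsBuf[count]);
            }
            ++count;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] freqs = {3, 4};
        List<Integer> current  = flushPerOccurrence(new int[]{1, 1, 1, 2, 2, 2, 2}, freqs);
        List<Integer> proposed = flushPerTerm(new int[]{1, 2}, freqs);
        System.out.println(current.equals(proposed)); // true
    }
}
```

Both loops emit the same sequence, so the change only shrinks the buffer; the bytes written to vectorsStream are unaffected.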
 

 


> lengthsBuf  in CompressingTermVectorsWriter is a bit redundant
> --------------------------------------------------------------
>
>                 Key: LUCENE-9101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9101
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.2
>            Reporter: kkewwei
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org