You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "kkewwei (Jira)" <ji...@apache.org> on 2019/12/19 12:48:00 UTC
[jira] [Created] (LUCENE-9101) lengthsBuf in
CompressingTermVectorsWriter is a bit redundant
kkewwei created LUCENE-9101:
-------------------------------
Summary: lengthsBuf in CompressingTermVectorsWriter is a bit redundant
Key: LUCENE-9101
URL: https://issues.apache.org/jira/browse/LUCENE-9101
Project: Lucene - Core
Issue Type: Improvement
Components: core/codecs
Affects Versions: 8.2
Reporter: kkewwei
In CompressingTermVectorsWriter, We use lengthsBuf to save the length of every terms, for example: "a a a b1 b1 b1 b1", the lengthsBuf=[1,1,1,2,2,2,2], a appeared three times, we count three time, it seems a bit redundant.
We use it in CompressingTermVectorsWriter.flushOffsets:
{code:java}
private void flushOffsets(int[] fieldNums) throws IOException {
......
// lengths
writer.reset(vectorsStream);
for (DocData dd : pendingDocs) {
for (FieldData fd : dd.fields) {
if ((fd.flags & OFFSETS) != 0) {
int pos = 0;
for (int i = 0; i < fd.numTerms; ++i) {
for (int j = 0; j < fd.freqs[i]; ++j) {
writer.add(lengthsBuf[fd.offStart + pos++] - fd.prefixLengths[i] - fd.suffixLengths[i]);
}
}
assert pos == fd.totalPositions;
}
}
}
writer.finish();
}{code}
we can simply it: lengthsBuf=[1,2], the same term just count one time. we could use `int count;` to count which current term we are process, for example:
{code:java}
private void flushOffsets(int[] fieldNums) throws IOException {
......
// lengths
writer.reset(vectorsStream);
for (DocData dd : pendingDocs) {
for (FieldData fd : dd.fields) {
if ((fd.flags & OFFSETS) != 0) {
int pos = 0;
for (int i = 0; i < fd.numTerms; ++i) {
count ++;
for (int j = 0; j < fd.freqs[i]; ++j) { // 每个域的每个distinct(词)
writer.add(lengthsBuf[count] - fd.prefixLengths[i] - fd.suffixLengths[i]);
}
}
assert pos == fd.totalPositions;
}
}
}
writer.finish();
}{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org