Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/09/19 17:09:32 UTC

[jira] Created: (LUCENE-2654) bulk-code each chunk b/w indexed terms in the terms dict

bulk-code each chunk b/w indexed terms in the terms dict
--------------------------------------------------------

Key: LUCENE-2654
URL: https://issues.apache.org/jira/browse/LUCENE-2654
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 4.0
Reporter: Michael McCandless
Priority: Minor

This is an idea for exploration that came up w/ Robert...

In PrefixCodedTermsDict (used by the default Standard codec), we encode each term entry "standalone", using vInts. We store the changed suffix (start, end, bytes), then the metadata for the term: docFreq, frq start, prx start, skip start. Each of these ints is a vInt, which is relatively costly to decode.
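
To make the per-int cost concrete, here is a minimal sketch of the vInt scheme: 7 payload bits per byte, low-order bits first, with the high bit set when another byte follows. The class and method names (VIntSketch, writeVInt, readVInt) are hypothetical and this is not the actual Lucene code; it only illustrates that decoding one value means a data-dependent loop with a branch per byte.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class, not part of Lucene.
class VIntSketch {
    // Writes a non-negative int, 7 bits per byte, low-order bits first.
    // The high bit of each byte signals that another byte follows.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Reads one vInt starting at pos[0] and advances pos[0] past it.
    static int readVInt(byte[] buf, int[] pos) {
        int b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Pretend metadata for one term: docFreq, frq delta, prx delta.
        int[] metadata = {5, 300, 70000};
        for (int v : metadata) {
            writeVInt(out, v);
        }
        byte[] buf = out.toByteArray();
        int[] pos = {0};
        for (int i = 0; i < metadata.length; i++) {
            System.out.println(readVInt(buf, pos));
        }
    }
}
```

Every term visited during a scan pays for this loop once per metadata field, whether or not the caller ever looks at the metadata.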

If instead we store the N terms between indexed terms "column-stride", using a bulk codec like FOR/PFOR, so that (say, with N = 32) the 32 docFreqs are stored as one block, the 32 frq deltas as another, etc., then seek and next should be faster. I.e., we could make decode of the metadata lazy, so that a seek to a term that does not exist may be able to avoid any metadata decode entirely. Sequential scanning (lots of .next in a row) would also be faster, even if it needs the metadata, since one bulk decode should be faster than multiple vInt decodes.
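
A rough sketch of the column-stride idea, under several simplifying assumptions: the class name ColumnStrideBlockSketch and all its members are hypothetical, terms are kept as plain Strings rather than suffix-coded bytes, only the docFreq column is shown, and the "bulk codec" is plain fixed-width bit packing (in the spirit of FOR, without a reference value or PFOR-style exceptions). The point is only that seek never touches the metadata column, and that the whole column is decoded in one pass the first time any docFreq is requested.

```java
// Hypothetical illustration class, not part of Lucene.
class ColumnStrideBlockSketch {
    final String[] terms;          // sorted terms of one block (suffix coding omitted)
    final long[] packedDocFreqs;   // docFreq column, bit-packed at a fixed width
    final int bitsPerValue;
    private int[] docFreqs;        // filled lazily on first docFreq() call
    int decodeCount = 0;           // how many times the column was bulk-decoded

    // Assumes docFreqs are non-negative and parallel to terms.
    ColumnStrideBlockSketch(String[] terms, int[] docFreqs) {
        this.terms = terms;
        int max = 0;
        for (int v : docFreqs) max = Math.max(max, v);
        bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
        packedDocFreqs = new long[(docFreqs.length * bitsPerValue + 63) / 64];
        for (int i = 0; i < docFreqs.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
            packedDocFreqs[word] |= ((long) docFreqs[i]) << shift;
            if (shift + bitsPerValue > 64) {  // value straddles two longs
                packedDocFreqs[word + 1] |= ((long) docFreqs[i]) >>> (64 - shift);
            }
        }
    }

    // Scans the term column only. Returns the term's ordinal in the block, or -1.
    int seek(String target) {
        for (int i = 0; i < terms.length; i++) {
            int cmp = terms[i].compareTo(target);
            if (cmp == 0) return i;
            if (cmp > 0) break;  // passed the spot where the term would be
        }
        return -1;
    }

    // Decodes the whole docFreq column on first use, then serves from the array.
    int docFreq(int ord) {
        if (docFreqs == null) decodeDocFreqs();
        return docFreqs[ord];
    }

    private void decodeDocFreqs() {
        docFreqs = new int[terms.length];
        long mask = (1L << bitsPerValue) - 1;
        for (int i = 0; i < docFreqs.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
            long v = packedDocFreqs[word] >>> shift;
            if (shift + bitsPerValue > 64) {
                v |= packedDocFreqs[word + 1] << (64 - shift);
            }
            docFreqs[i] = (int) (v & mask);
        }
        decodeCount++;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "banana", "cherry", "date"};
        int[] dfs = {3, 1000, 7, 42};
        ColumnStrideBlockSketch block = new ColumnStrideBlockSketch(terms, dfs);
        System.out.println(block.seek("blueberry") + " decodes=" + block.decodeCount);
        System.out.println(block.docFreq(block.seek("cherry")) + " decodes=" + block.decodeCount);
    }
}
```

In this sketch a miss costs only term comparisons, and a run of hits shares a single tight decode loop with no per-value branching on continuation bits. Whether that wins in practice is exactly what the issue proposes to explore.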
