Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2008/10/20 14:39:44 UTC

[jira] Created: (LUCENE-1426) Next steps towards flexible indexing

Next steps towards flexible indexing
------------------------------------

                 Key: LUCENE-1426
                 URL: https://issues.apache.org/jira/browse/LUCENE-1426
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 2.9


In working on LUCENE-1410 (PFOR compression) I tried to prototype
switching the postings files to use PFOR instead of vInts for
encoding.
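
For readers unfamiliar with the encoding mentioned above, here is a minimal sketch of vInt-style variable-length encoding: seven data bits per byte, with the high bit set when more bytes follow. The class and method names are illustrative assumptions, not Lucene's own API.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only: class and method names are hypothetical.
final class VIntSketch {

    // Encode a non-negative int, low-order 7 bits first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte, high bit clear
        return out.toByteArray();
    }

    // Decode one value starting at bytes[pos].
    static int decode(byte[] bytes, int pos) {
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] small = encode(5);
        byte[] large = encode(300);
        System.out.println(small.length + " byte(s) for 5, "
                + large.length + " byte(s) for 300");
        System.out.println("round trip: " + decode(large, 0));
    }
}
```

Because each value occupies a variable number of bytes, the i-th value can only be found by decoding everything before it, which is one way in which a byte-oriented stream differs from fixed-size int blocks.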

But it quickly became difficult.  EG we currently mux the skip data
into the .frq file, which messes up the int blocks.  We inline
payloads with positions which would also mess up the int blocks.
Skipping offsets and TermInfo offsets hardwire the file pointers of
frq & prox files yet I need to change these to block + offset, etc.

Separately this thread also started up, on how to customize how Lucene
stores positional information in the index:

  http://www.gossamer-threads.com/lists/lucene/java-user/66264

So I decided to make a bit more progress towards "flexible indexing"
by first modularizing/isolating the classes that actually write the
index format.  The idea is to capture the logic of each (terms, freq,
positions/payloads) into separate interfaces and switch the flushing
of a new segment as well as writing the segment during merging to use
the same APIs.
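
To make the idea concrete, here is a hypothetical sketch of what such layered write interfaces could look like. None of these names are taken from the patch; they are assumptions chosen only to show how flushing and merging could both drive one API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interfaces, not the API in the attached patch.
interface PositionsWriterSketch {
    void addPosition(int position, byte[] payload); // payload may be null
}

interface DocsWriterSketch {
    // Starts a document for the current term; returns where its positions go.
    PositionsWriterSketch startDoc(int docId, int termFreq);
}

interface TermsWriterSketch {
    DocsWriterSketch startTerm(String term);
    void finishTerm(String term, int docFreq);
}

// A toy implementation that records the calls it receives as text.
final class RecordingTermsWriter implements TermsWriterSketch {
    final List<String> log = new ArrayList<>();

    public DocsWriterSketch startTerm(String term) {
        log.add("term " + term);
        return (docId, termFreq) -> {
            log.add("doc " + docId + " freq " + termFreq);
            return (position, payload) -> log.add("pos " + position);
        };
    }

    public void finishTerm(String term, int docFreq) {
        log.add("end " + term + " df " + docFreq);
    }
}

final class WriterSketchDemo {
    // A flush and a merge could both call a routine like this one.
    static void writeOneTerm(TermsWriterSketch writer) {
        DocsWriterSketch docs = writer.startTerm("lucene");
        PositionsWriterSketch positions = docs.startDoc(7, 2);
        positions.addPosition(3, null);
        positions.addPosition(11, null);
        writer.finishTerm("lucene", 1);
    }

    public static void main(String[] args) {
        RecordingTermsWriter writer = new RecordingTermsWriter();
        writeOneTerm(writer);
        writer.log.forEach(System.out::println);
    }
}
```

With this shape, a different on-disk format would only need a different implementation of the three interfaces; the callers would stay the same.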


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641574#action_12641574 ] 

Michael Busch commented on LUCENE-1426:
---------------------------------------

{quote}
+1 This sounds like a great way to approach flexible indexing: incrementally. 
{quote}

Couldn't agree more. This is great!

{quote}
The next step, which is trickier, is to modularize/genericize the
classes that read from the index, and then refactor
SegmentTerm(Enum,Docs,Positions) to use that codec API.
{quote}

Yes, this is definitely the tricky part. I've been thinking a bit about this and was wondering if, for the read APIs, we could do something similar to the new Token API (LUCENE-1422). TermDocs could have a list of Attributes that the posting list offers. If, for example, no payloads are stored in the posting list, then TermDocs should not offer the corresponding Attribute.
This approach should be just as fast as the current API. When the application opens a TermDocs, it could check for the offered Attributes before it starts iterating the posting list, and keep references to the Attributes. (In fact, that's exactly the same approach as the TokenStream/Token/Consumer approach in LUCENE-1422.)

Thoughts?

> Next steps towards flexible indexing
> ------------------------------------
>
>                 Key: LUCENE-1426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1426.patch
>
>
> In working on LUCENE-1410 (PFOR compression) I tried to prototype
> switching the postings files to use PFOR instead of vInts for
> encoding.
> But it quickly became difficult.  EG we currently mux the skip data
> into the .frq file, which messes up the int blocks.  We inline
> payloads with positions which would also mess up the int blocks.
> Skipping offsets and TermInfo offsets hardwire the file pointers of
> frq & prox files yet I need to change these to block + offset, etc.
> Separately this thread also started up, on how to customize how Lucene
> stores positional information in the index:
>   http://www.gossamer-threads.com/lists/lucene/java-user/66264
> So I decided to make a bit more progress towards "flexible indexing"
> by first modularizing/isolating the classes that actually write the
> index format.  The idea is to capture the logic of each (terms, freq,
> positions/payloads) into separate interfaces and switch the flushing
> of a new segment as well as writing the segment during merging to use
> the same APIs.



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641121#action_12641121 ] 

Paul Elschot commented on LUCENE-1426:
--------------------------------------

bq. We inline payloads with positions which would also mess up the int blocks.

Which begs the question whether we should also allow compression of these payloads.
I think we should do that because normally only one or two bytes will be used as payload per position.
Thinking about this: position+payload actually looks a lot like docId+freq; could that
be used to simplify future index formats for inverted terms?
Btw. allowing a payload to accompany the field norms would allow storing a kind of
dictionary for the position payloads. This could help to keep the position payloads small
so they would compress nicely.

bq. Both SegmentMerger & FreqProxTermsWriter now use the same codec API to write postings.

That is indeed a big step.

bq. It's all package private.

Good for now, making it public might actually reduce flexibility for new index formats.





[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641137#action_12641137 ] 

Michael McCandless commented on LUCENE-1426:
--------------------------------------------

bq. During the omitTf() discussion, we came up with a cool idea: inline very short postings into the term dict instead of storing an offset.

Yes, there's this issue:

  https://issues.apache.org/jira/browse/LUCENE-1278

And you had found this one:

  http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf

And then Doug referenced this:

  http://citeseer.ist.psu.edu/cutting90optimizations.html

I think the idea makes tons of sense (saving a seek) and one of my
goals in phase 2 (genericizing the reading of an index) is to make
pulsing a drop-in codec as an example & litmus test.  Terms iteration
may suffer, though, unless we put this in a separate file.

I also think, at the opposite end of the spectrum, it would make sense
for very common terms to use simple n-bit packing (PFOR minus the
exceptions).  For massive terms we need the fastest search we can
get, since that gates when you have to start sharding.
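
As a rough illustration of "simple n-bit packing", the sketch below stores every value with the same bit width, so any value can be read directly by index. It is an assumption-laden toy (the names, the long[] backing store, and the requirement that values be non-negative and fit the chosen width), not code from LUCENE-1410.

```java
// Illustrative fixed-width bit packing; not code from LUCENE-1410.
final class NBitPackingSketch {

    // Pack values, each using exactly bitsPerValue bits (1..31), into longs.
    // Values are assumed to be non-negative and to fit in bitsPerValue bits.
    static long[] pack(int[] values, int bitsPerValue) {
        long totalBits = (long) values.length * bitsPerValue;
        long[] packed = new long[(int) ((totalBits + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ((long) values[i]) << shift;
            if (shift + bitsPerValue > 64) {
                // Value straddles two words: high bits go into the next word.
                packed[word + 1] |= ((long) values[i]) >>> (64 - shift);
            }
        }
        return packed;
    }

    // Random access: read value i without decoding its predecessors.
    static int get(long[] packed, int bitsPerValue, int i) {
        long bitPos = (long) i * bitsPerValue;
        int word = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = packed[word] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= packed[word + 1] << (64 - shift);
        }
        return (int) (value & ((1L << bitsPerValue) - 1));
    }

    public static void main(String[] args) {
        int[] docDeltas = {1, 3, 2, 7, 1, 1, 4, 2};
        long[] packed = pack(docDeltas, 3);
        System.out.println("value at index 3: " + get(packed, 3, 3));
    }
}
```

The lack of exceptions is what keeps `get` a constant-time computation here; the price is that one large value forces a wider bit width for the whole block.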

bq. I am sorry to miss the party here with PFOR, but let us hope this credit crunch gets over soon so that I could dedicate some time to fun things like this

Well the stock market seems to think the credit crunch is improving,
today... of course who knows what'll happen tomorrow!  Good luck :)

Also, I'd like to explore improving the terms dict indexing -- I don't
think we need to load a TermInfo instance for every indexed term, into
RAM.  I think we just need the term & seek data (into the tis file),
then you seek there and skip to the TermInfo you need.  This should
save a good amount of RAM for large indices with odd terms, since each
TermInfo instance requires a pointer to it (4 or 8 bytes), an object
header (8 bytes at least) then 20 bytes for the members.
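
The per-instance overhead quoted above can be tallied as follows. The byte counts are the ones given in this comment; the number of indexed terms used in main is a made-up illustration, and real JVM object layouts vary.

```java
// Back-of-the-envelope arithmetic only; the term count is hypothetical.
final class TermInfoRamSketch {

    // Bytes per in-RAM TermInfo: pointer + object header + members.
    static long bytesPerTermInfo(int pointerBytes) {
        int objectHeaderBytes = 8; // "8 bytes at least", per the comment above
        int memberBytes = 20;      // per the comment above
        return pointerBytes + objectHeaderBytes + memberBytes;
    }

    static long totalBytes(long indexedTerms, int pointerBytes) {
        return indexedTerms * bytesPerTermInfo(pointerBytes);
    }

    public static void main(String[] args) {
        long indexedTerms = 1_000_000L; // hypothetical number of indexed terms
        System.out.println("32-bit pointers: " + totalBytes(indexedTerms, 4) + " bytes");
        System.out.println("64-bit pointers: " + totalBytes(indexedTerms, 8) + " bytes");
    }
}
```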

All these explorations should become simple drop-in codecs, once I can
finish phase 2.




[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641132#action_12641132 ] 

Doug Cutting commented on LUCENE-1426:
--------------------------------------

+1 This sounds like a great way to approach flexible indexing: incrementally.



[jira] Resolved: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1426.
----------------------------------------

    Resolution: Fixed



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641599#action_12641599 ] 

Paul Elschot commented on LUCENE-1426:
--------------------------------------

bq. ... it would make sense to use VInts for very short postings and PFOR for the rest. I just do not remember rationale behind it.
bq. ... cool idea to actually inline very short postings into term dict instead of storing offset.

Iirc the rationale was that PFOR shows most of its performance benefit on integer arrays of more than 100 elements.
Shorter lists of numbers might also benefit from using (P)FOR instead of VInt; I don't know what the break-even size is.

bq. for starters (we) could simply implement random access as "load & decode the entire block, then look at the part you want" and then assess the cost.

I've just started some performance tests on PFOR patching (i.e. filling in the exceptions), and I'm not happy with what I'm seeing. More on this later at 1410.


On allowing a payload to accompany the field norms:
bq. Couldn't stored fields, once they are faster (with column-stride fields, LUCENE-1231) solve this?

Yes.




[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641125#action_12641125 ] 

Paul Elschot commented on LUCENE-1426:
--------------------------------------

bq. Skipping offsets and TermInfo offsets hardwire the file pointers of  frq & prox files yet I need to change these to block + offset, etc.

Does the offset imply that there is also a need for random access into each block?
For such blocks PFOR patching might better be avoided.
Even with patching random access is possible, but it is not available yet at LUCENE-1410.




[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641140#action_12641140 ] 

Michael McCandless commented on LUCENE-1426:
--------------------------------------------


bq. Which begs the question whether we should also allow compression of these payloads.

I think that's interesting, but would probably be rather application dependent.

{quote}
Btw. allowing a payload to accompany the field norms would allow storing a kind of
dictionary for the position payloads. This could help to keep the position payloads small
so they would compress nicely.
{quote}

Couldn't stored fields, once they are faster (with column-stride
fields, LUCENE-1231) solve this?




[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641139#action_12641139 ] 

Michael McCandless commented on LUCENE-1426:
--------------------------------------------


{quote}
Does the offset imply that there is also a need for random access into each block?
For such blocks PFOR patching might better be avoided.
Even with patching random access is possible, but it is not available yet at LUCENE-1410.
{quote}

Yeah this is one of the reasons why I'm thinking for frequent terms we
may want to fall back to pure nbit packing (which would make random
access simple).

But, for starters we could simply implement random access as "load
& decode the entire block, then look at the part you want" and then
assess the cost.  While it will clearly increase the cost of queries
that do a lot of skipping (eg AND query of N terms), it may not matter
so much since these queries should be fairly fast now.  It's the OR of
frequent term queries that we need to improve since that limits how
big an index you can put on one box.
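
A toy sketch of the "load & decode the entire block, then look at the part you want" approach, addressed by block number plus offset rather than by file pointer. The block size, the class, and the decode counter are all assumptions made for illustration; a real codec would decompress bytes rather than copy an int array.

```java
import java.util.Arrays;

// Illustrative only: a toy stand-in for block-encoded postings.
final class BlockAccessSketch {
    static final int BLOCK_SIZE = 4; // tiny, for illustration

    private final int[][] blocks; // stand-in for compressed blocks in a file
    int blocksDecoded = 0;        // crude counter for assessing the cost

    BlockAccessSketch(int[] values) {
        int blockCount = (values.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        blocks = new int[blockCount][];
        for (int b = 0; b < blockCount; b++) {
            int from = b * BLOCK_SIZE;
            int to = Math.min(from + BLOCK_SIZE, values.length);
            blocks[b] = Arrays.copyOfRange(values, from, to);
        }
    }

    // Stand-in for real decompression: always produces the whole block.
    private int[] decodeBlock(int blockIndex) {
        blocksDecoded++;
        return blocks[blockIndex].clone();
    }

    // Address = block number + offset within the block.
    int get(int blockIndex, int offsetInBlock) {
        return decodeBlock(blockIndex)[offsetInBlock];
    }

    public static void main(String[] args) {
        int[] values = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        BlockAccessSketch postings = new BlockAccessSketch(values);
        System.out.println(postings.get(1, 2));
        System.out.println("blocks decoded: " + postings.blocksDecoded);
    }
}
```

Counting decoded blocks per query, as the counter does here, is one simple way to measure how much the skipping-heavy queries would actually pay.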




[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Eks Dev (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641128#action_12641128 ] 

Eks Dev commented on LUCENE-1426:
---------------------------------

Just a few random thoughts on this topic

- I am sure I read somewhere in these pdfs that were floating around that it would make sense to use VInts for very short postings and PFOR for the rest. I just do not remember the rationale behind it.

- During the omitTf() discussion, we came up with a cool idea: inline very short postings into the term dict instead of storing an offset. This way we spare one seek per term in many cases, as well as some space for storing the offset. I do not know if this is a problem, but it sounds reasonable. With a standard Zipfian distribution, a lot of postings should get inlined. Use cases where we have query expansion on many terms (think spell checker, synonyms ...) should benefit heavily from that. These postings are small, but there are a lot of them, so it adds up... seek is deadly :)
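
A toy sketch of the inlining idea in the second point: a term dictionary entry either carries its (very short) postings directly or points into a separate postings store. The threshold, the class names, and the in-memory stand-ins for the files are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: in-memory stand-ins for the term dict and postings file.
final class InlinePostingsSketch {
    static final int INLINE_MAX_DOCS = 2; // hypothetical threshold

    static final class TermEntry {
        final int[] inlinedDocs;  // non-null when postings live in the term dict
        final int postingsOffset; // used only when inlinedDocs is null

        TermEntry(int[] inlinedDocs, int postingsOffset) {
            this.inlinedDocs = inlinedDocs;
            this.postingsOffset = postingsOffset;
        }
    }

    private final Map<String, TermEntry> termDict = new HashMap<>();
    private final List<int[]> postingsFile = new ArrayList<>();
    int seeks = 0; // counts lookups that had to leave the term dict

    void addTerm(String term, int[] docs) {
        if (docs.length <= INLINE_MAX_DOCS) {
            termDict.put(term, new TermEntry(docs.clone(), -1));
        } else {
            postingsFile.add(docs.clone());
            termDict.put(term, new TermEntry(null, postingsFile.size() - 1));
        }
    }

    int[] docs(String term) {
        TermEntry entry = termDict.get(term);
        if (entry == null) {
            return new int[0];
        }
        if (entry.inlinedDocs != null) {
            return entry.inlinedDocs; // answered from the term dict, no seek
        }
        seeks++; // simulated seek into the postings file
        return postingsFile.get(entry.postingsOffset);
    }

    public static void main(String[] args) {
        InlinePostingsSketch index = new InlinePostingsSketch();
        index.addTerm("rare", new int[] {42});
        index.addTerm("common", new int[] {1, 2, 3, 4});
        index.docs("rare");
        index.docs("common");
        System.out.println("seeks: " + index.seeks);
    }
}
```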

I am sorry to miss the party here with PFOR, but let us hope this credit crunch gets over soon so that I could dedicate some time to fun things like this :)

cheers, eks 


  



[jira] Commented: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641747#action_12641747 ] 

Michael McCandless commented on LUCENE-1426:
--------------------------------------------

bq. TermDocs could have a list of Attributes that the posting list offers.

I like this approach.

Though unlike LUCENE-1422, where Token remains separate from
TokenStream (and I'm still not sure it should be...?), I think for
TermDocs there would be no analog of a separate Token.
I.e., it would look something like this:

  myPerDocAttr = termDocs.getAttribute(MyPerDoc.class);

  while(termDocs.next()) {
    x = myPerDocAttr.getValue(...);
  }
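
To make the lookup-once, read-per-doc pattern concrete, here is a minimal self-contained sketch. Everything in it is hypothetical: AttributeTermDocs, MyPerDoc and the float value it carries are made-up stand-ins, not existing Lucene classes, and the real API could look quite different.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Lucene under these names.
// The enumerator offers typed attributes; the caller looks one up once, then
// reads its current value after each call to next().

/** Hypothetical per-document attribute; the float payload is an assumption. */
final class MyPerDoc {
  private float value;
  void setValue(float value) { this.value = value; }
  float getValue() { return value; }
}

/** Hypothetical stand-in for a TermDocs that offers attributes. */
final class AttributeTermDocs {
  private final Map<Class<?>, Object> attributes = new HashMap<>();
  private final int[] docs;
  private final float[] values;
  private int upto = -1;

  AttributeTermDocs(int[] docs, float[] values) {
    this.docs = docs;
    this.values = values;
    attributes.put(MyPerDoc.class, new MyPerDoc());
  }

  /** Returns the shared instance for this attribute type, or null if not offered. */
  <T> T getAttribute(Class<T> type) {
    return type.cast(attributes.get(type));
  }

  /** Advances to the next doc and refreshes the attribute in place. */
  boolean next() {
    if (++upto >= docs.length) return false;
    getAttribute(MyPerDoc.class).setValue(values[upto]);
    return true;
  }

  int doc() { return docs[upto]; }
}

final class AttributeDemo {
  /** Sums the per-doc values over the whole posting list. */
  static float sum(AttributeTermDocs termDocs) {
    MyPerDoc attr = termDocs.getAttribute(MyPerDoc.class); // looked up once
    float total = 0f;
    while (termDocs.next()) {
      total += attr.getValue(); // same instance, new value for each doc
    }
    return total;
  }
}
```

The point of the design is that there is no per-doc object allocation: the attribute instance is reused and only its contents change as the enumerator advances.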

However, this form of flexibility is actually beyond what I'm aiming
for in the first step of reader flexibility (there are so many
facets of "flexible indexing"!).

For starters I'd like to allow flexibility in how you encode the
existing postings (doc/freq/positions/payloads), whereas the attribute
approach extends what is actually stored into & read from the index.
I think we should do both, but my focus now is on the first one,
specifically to be able to drop in a codec that uses pulsing, a less
RAM-intensive terms dict index, and/or PFOR, etc.
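
As a rough illustration of what "how you encode the existing postings" means, the sketch below puts the doc/freq encoding behind a small interface and implements it with delta + vInt coding, modelled on the documented .frq layout (doc delta shifted left by one, low bit set when freq is 1). The interface and class names are assumptions made up for this example; they are not part of Lucene or of any patch here.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: the interface and class names are made up.
interface DocFreqEncoder {
  /** Encodes parallel arrays of ascending doc IDs and their term frequencies. */
  byte[] encode(int[] docs, int[] freqs);
}

/** Delta + vInt encoding, modelled on the documented .frq layout. */
final class VIntDocFreqEncoder implements DocFreqEncoder {
  @Override
  public byte[] encode(int[] docs, int[] freqs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int lastDoc = 0;
    for (int i = 0; i < docs.length; i++) {
      int delta = docs[i] - lastDoc;
      lastDoc = docs[i];
      if (freqs[i] == 1) {
        writeVInt(out, (delta << 1) | 1); // low bit set: freq is 1, nothing follows
      } else {
        writeVInt(out, delta << 1);       // low bit clear: freq follows as a vInt
        writeVInt(out, freqs[i]);
      }
    }
    return out.toByteArray();
  }

  /** Writes 7 bits per byte, low-order first; the high bit means "more bytes follow". */
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }
}
```

A block-based encoder such as PFOR would be a second implementation of the same interface, which is the kind of swap the codec work is meant to allow.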




[jira] Updated: (LUCENE-1426) Next steps towards flexible indexing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1426:
---------------------------------------

    Attachment: LUCENE-1426.patch

Attached patch.  I think it's ready to commit... I'll wait a few days.

This factors the writing of postings into separate Format* classes.
The approach I took is similar to what I did for DocumentsWriter,
where there is a hierarchical consumer interface (abstract class) for
each of fields, terms, docs, and positions writing.  Then there's a
corresponding set of concrete classes (the "codec chain") that write
today's index format.  There is no change to the index format.

Here are the details:

  * This only applies to postings (not stored fields, term vectors,
    norms, field infos)

  * Both SegmentMerger & FreqProxTermsWriter now use the same codec
    API to write postings.  I think this is a big step forward: we now
    have a single set of classes that writes all postings.

  * You can't yet customize this codec chain; we can add that at some
    point.  It's all package private.

  * I don't yet allow the codec to override SegmentInfo.files(); at
    some point (when I first try to make a codec that uses different
    files) I will add this.
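
The hierarchical consumer chain described above might look roughly like the following sketch. The class names, method names and signatures here are illustrative assumptions, not the ones in the attached patch, and the concrete chain just logs events where a real one would write index files.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these names and signatures are assumptions,
// not the classes in the attached patch.

abstract class FieldsConsumer {
  /** Starts a new field and returns the consumer for its terms. */
  abstract TermsConsumer addField(String field);
  abstract void finish();
}

abstract class TermsConsumer {
  abstract DocsConsumer addTerm(String term);
  abstract void finish();
}

abstract class DocsConsumer {
  abstract PositionsConsumer addDoc(int docID, int freq);
  abstract void finish();
}

abstract class PositionsConsumer {
  abstract void addPosition(int position);
  abstract void finish();
}

/** One concrete chain; it logs events where a real one would write files. */
final class LoggingFieldsConsumer extends FieldsConsumer {
  final List<String> log = new ArrayList<>();

  private final PositionsConsumer positions = new PositionsConsumer() {
    @Override void addPosition(int position) { log.add("pos " + position); }
    @Override void finish() {}
  };
  private final DocsConsumer docs = new DocsConsumer() {
    @Override PositionsConsumer addDoc(int docID, int freq) {
      log.add("doc " + docID + " freq " + freq);
      return positions;
    }
    @Override void finish() {}
  };
  private final TermsConsumer terms = new TermsConsumer() {
    @Override DocsConsumer addTerm(String term) {
      log.add("term " + term);
      return docs;
    }
    @Override void finish() {}
  };

  @Override TermsConsumer addField(String field) {
    log.add("field " + field);
    return terms;
  }
  @Override void finish() { log.add("done"); }
}

final class PostingsProducer {
  /** Stands in for either flush or merge: both would drive the same calls. */
  static void write(FieldsConsumer out) {
    TermsConsumer terms = out.addField("body");
    DocsConsumer docs = terms.addTerm("lucene");
    PositionsConsumer positions = docs.addDoc(7, 2);
    positions.addPosition(3);
    positions.addPosition(11);
    positions.finish();
    docs.finish();
    terms.finish();
    out.finish();
  }
}
```

Because the producer only sees the abstract classes, swapping in a different concrete chain changes the on-disk encoding without touching the flush or merge code.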

I ran a quick performance test, indexing Wikipedia, and found the
performance cost of this change negligible.

The next step, which is trickier, is to modularize/genericize the
classes that read from the index, and then refactor
SegmentTerm{Enum,Docs,Positions} to use that codec API.

Then, finally, I want to make a codec that uses PFOR to encode
postings.
