Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/01/18 01:20:43 UTC

[jira] Created: (LUCENE-2872) Terms dict should block-encode terms

Terms dict should block-encode terms
------------------------------------

                 Key: LUCENE-2872
                 URL: https://issues.apache.org/jira/browse/LUCENE-2872
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 4.0
         Attachments: LUCENE-2872.patch

With PrefixCodedTermsReader/Writer we now encode each term standalone,
i.e. its bytes, metadata, details for postings (frq/prox file pointers),
etc.

But this is costly when something wants to visit many terms but pull
metadata for only a few (e.g. respelling, certain MTQs).  This is
particularly costly for the sep codec because it has more metadata to
store per term.

So instead I think we should block-encode all terms between indexed
terms, so that the metadata is stored "column stride" instead.  This
makes it faster to enum just the terms.
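
To make the "column stride" idea concrete, here is a minimal sketch in Java. It is not the code from the attached patch and not Lucene's file format: the class name ColumnStrideBlockSketch, the block layout, and the two metadata fields (doc freq plus one postings pointer, assumed ascending within a block) are all assumptions made for illustration. The term bytes for a block are written as one column and the metadata as a second column, so an enum that only wants terms never decodes metadata.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative only: a hypothetical block layout, not Lucene's actual terms dict format. */
final class ColumnStrideBlockSketch {

  /** Writes a non-negative int as a variable-length int, 7 bits per byte, low bits first. */
  static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {
      out.write((v & 0x7F) | 0x80);
      v >>>= 7;
    }
    out.write(v);
  }

  /** Assumed layout: [term count][term column length][term column][metadata column]. */
  static byte[] encode(String[] terms, int[] docFreqs, int[] postingsPointers) {
    ByteArrayOutputStream termCol = new ByteArrayOutputStream();
    ByteArrayOutputStream metaCol = new ByteArrayOutputStream();
    int lastPointer = 0;
    for (int i = 0; i < terms.length; i++) {
      byte[] b = terms[i].getBytes(StandardCharsets.UTF_8);
      writeVInt(termCol, b.length);
      termCol.write(b, 0, b.length);
      writeVInt(metaCol, docFreqs[i]);
      // Pointers are assumed ascending within a block, so store the delta.
      writeVInt(metaCol, postingsPointers[i] - lastPointer);
      lastPointer = postingsPointers[i];
    }
    ByteArrayOutputStream block = new ByteArrayOutputStream();
    writeVInt(block, terms.length);
    writeVInt(block, termCol.size());
    block.write(termCol.toByteArray(), 0, termCol.size());
    block.write(metaCol.toByteArray(), 0, metaCol.size());
    return block.toByteArray();
  }

  /** A read position over a byte array. */
  static final class Cursor {
    final byte[] buf;
    int pos;

    Cursor(byte[] buf, int pos) {
      this.buf = buf;
      this.pos = pos;
    }

    int readVInt() {
      int b = buf[pos++];
      int v = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = buf[pos++];
        v |= (b & 0x7F) << shift;
      }
      return v;
    }
  }

  /** Enumerates terms from the term column; metadata is decoded only when asked for. */
  static final class Reader {
    private final Cursor terms;
    private final Cursor meta;
    final int termCount;
    private int termsRead = 0;
    private int metaRead = 0;
    private int docFreq = 0;
    private int pointer = 0;

    Reader(byte[] block) {
      terms = new Cursor(block, 0);
      termCount = terms.readVInt();
      int termColLength = terms.readVInt();
      meta = new Cursor(block, terms.pos + termColLength);
    }

    /** Advances to the next term, touching only the term column; null at end of block. */
    String next() {
      if (termsRead == termCount) return null;
      int len = terms.readVInt();
      String term = new String(terms.buf, terms.pos, len, StandardCharsets.UTF_8);
      terms.pos += len;
      termsRead++;
      return term;
    }

    /** Catches the metadata column up to the current term. */
    private void decodeMeta() {
      while (metaRead < termsRead) {
        docFreq = meta.readVInt();
        pointer += meta.readVInt();
        metaRead++;
      }
    }

    /** Doc freq of the current term (0 before the first call to next()). */
    int docFreq() { decodeMeta(); return docFreq; }

    /** Postings pointer of the current term (0 before the first call to next()). */
    int postingsPointer() { decodeMeta(); return pointer; }

    /** How many metadata entries have been decoded so far. */
    int metaEntriesDecoded() { return metaRead; }
  }
}
```

With this layout, calling next() many times and docFreq() rarely means the metadata column is scanned only on those rare calls, which is the access pattern described above for respelling and some MTQs. The catch-up loop is one possible design; a real implementation could choose differently.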


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2872) Terms dict should block-encode terms

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984589#action_12984589 ] 

Simon Willnauer commented on LUCENE-2872:
-----------------------------------------

Wow, nice, Mike! Do you have benchmark numbers here by any chance? After all those improvements (FST, TermState, BlockCoded TermDict, etc.) I wonder if we reached the 10k% in the 3.0 vs. 4.0 united~2.0 benchmark...


[jira] Updated: (LUCENE-2872) Terms dict should block-encode terms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2872:
---------------------------------------

    Attachment: LUCENE-2872.patch

New patch, specializes read* in ByteArrayDataInput (poached from LUCENE-2824).


[jira] Resolved: (LUCENE-2872) Terms dict should block-encode terms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-2872.
----------------------------------------

    Resolution: Fixed


[jira] Updated: (LUCENE-2872) Terms dict should block-encode terms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2872:
---------------------------------------

    Attachment: LUCENE-2872.patch

Patch.

I think it's basically working, but there are still a bunch of nocommits.


[jira] Updated: (LUCENE-2872) Terms dict should block-encode terms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2872:
---------------------------------------

    Attachment: LUCENE-2872.patch

New patch -- cleaned up all the nocommits, and cut over to a common prefix for all terms in the block.
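
As a rough illustration of what a common prefix per block can look like, here is a small Java sketch. It is not taken from the patch: the class CommonPrefixSketch and its method names are hypothetical, and it assumes the terms in a block are already sorted in byte order. Under that assumption the prefix shared by the first and last term is shared by every term in between, so it can be stored once and only the suffixes written per term.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper, not a class from Lucene or the patch. */
final class CommonPrefixSketch {

  /** Number of leading bytes that a and b have in common. */
  static int sharedLength(byte[] a, byte[] b) {
    int max = Math.min(a.length, b.length);
    int i = 0;
    while (i < max && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  /**
   * Length of the prefix shared by all terms in the block. Because the terms are
   * assumed sorted, comparing the first and last term is enough.
   */
  static int blockPrefixLength(byte[][] sortedTerms) {
    if (sortedTerms.length == 0) return 0;
    return sharedLength(sortedTerms[0], sortedTerms[sortedTerms.length - 1]);
  }

  /** Drops the first prefixLength bytes of every term, leaving only the suffixes. */
  static byte[][] stripPrefix(byte[][] terms, int prefixLength) {
    byte[][] suffixes = new byte[terms.length][];
    for (int i = 0; i < terms.length; i++) {
      suffixes[i] = Arrays.copyOfRange(terms[i], prefixLength, terms[i].length);
    }
    return suffixes;
  }

  /** Rebuilds a full term from the block prefix and one stored suffix. */
  static byte[] restore(byte[] prefix, byte[] suffix) {
    byte[] term = Arrays.copyOf(prefix, prefix.length + suffix.length);
    System.arraycopy(suffix, 0, term, prefix.length, suffix.length);
    return term;
  }
}
```

For a block holding "search", "searched" and "searching", this stores the six-byte prefix "search" once and the suffixes "", "ed" and "ing".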


[jira] Commented: (LUCENE-2872) Terms dict should block-encode terms

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984703#action_12984703 ] 

Michael McCandless commented on LUCENE-2872:
--------------------------------------------

I did run the benchmark -- but lost the output :(  I compared standard on trunk vs standard w/ block terms dict.

There were solid gains, especially for the MTQs that visit many terms but few docs (like the respelling case).


[jira] Commented: (LUCENE-2872) Terms dict should block-encode terms

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984306#action_12984306 ] 

Robert Muir commented on LUCENE-2872:
-------------------------------------

+1 to commit, the last specialization made all the difference on my benchmarks.

I think this will pave the way for us to fix Sep codec in the branch...

