Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2010/10/25 11:39:19 UTC
[jira] Created: (LUCENE-2722) Sep codec should store less in terms dict
Sep codec should store less in terms dict
-----------------------------------------
Key: LUCENE-2722
URL: https://issues.apache.org/jira/browse/LUCENE-2722
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 4.0
I'm working on improving Lucene's performance with int block codecs
(FOR/PFOR), but in early perf testing I found that these codecs cause
a big perf hit to those MTQs that need to scan many terms but don't
end up accepting many of those terms (e.g. fuzzy, wildcard, regexp).
This is because the sep codec stores much more in the terms dict: since
each file is separate, every term carries seek points for each of the
doc, frq, pos, pyl, and skp files.
So I'd like to shift these seek points to instead be stored in the doc
file, except for the doc seek point itself. Since a given query will
always need to seek to the doc file, this does not add an extra seek.
But it saves tons of vInt decodes for the next/seek-intensive MTQs...
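To illustrate the idea, here is a minimal sketch, not the actual patch. It assumes each seek point is a single vInt (the real sep codec may store more per seek point), and the class and method names (SepSeekPointSketch, countDecodes) are hypothetical. It writes a toy terms dict and doc file in both layouts, then counts how many vInts a scan has to decode when it visits every term but only opens the postings of the accepted ones.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration only; these are not Lucene's sep codec classes. */
final class SepSeekPointSketch {

    /** Writes a non-negative int as a vInt: 7 bits per byte, high bit set when more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Reads one vInt from buf starting at pos[0] and advances pos[0] past it. */
    static int readVInt(byte[] buf, int[] pos) {
        byte b = buf[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /**
     * others[t] holds the frq, pos, pyl and skp seek points of term t (assumed: one vInt each).
     * shifted == false: all five seek points live in the terms dict.
     * shifted == true:  only the doc seek point lives in the terms dict; the other four
     *                   are written at the start of the term's data in the doc file.
     * Returns the number of vInts decoded while scanning all terms.
     */
    static int countDecodes(int[][] others, boolean[] accepted, boolean shifted) {
        ByteArrayOutputStream dict = new ByteArrayOutputStream();
        ByteArrayOutputStream docFile = new ByteArrayOutputStream();
        for (int[] term : others) {
            writeVInt(dict, docFile.size());           // doc seek point always stays in the terms dict
            ByteArrayOutputStream target = shifted ? docFile : dict;
            for (int seekPoint : term) {
                writeVInt(target, seekPoint);
            }
            docFile.write(0);                          // placeholder for the term's doc postings
        }

        byte[] dictBytes = dict.toByteArray();
        byte[] docBytes = docFile.toByteArray();
        int decodes = 0;
        int[] dictPos = {0};
        for (int t = 0; t < others.length; t++) {
            int docSeek = readVInt(dictBytes, dictPos);
            decodes++;
            if (!shifted) {
                // vInts are variable length, so even a rejected term must decode them to skip them.
                for (int i = 0; i < others[t].length; i++) {
                    readVInt(dictBytes, dictPos);
                    decodes++;
                }
            } else if (accepted[t]) {
                // Only accepted terms seek into the doc file and pay for the other seek points.
                int[] docPos = {docSeek};
                for (int i = 0; i < others[t].length; i++) {
                    int value = readVInt(docBytes, docPos);
                    decodes++;
                    if (value != others[t][i]) {
                        throw new IllegalStateException("round-trip mismatch");
                    }
                }
            }
        }
        return decodes;
    }
}
```

Under these assumptions, scanning three terms and accepting one costs 15 vInt decodes with everything in the terms dict and 7 with the shifted layout. The gap grows with the number of terms a query scans but rejects.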
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[jira] Commented: (LUCENE-2722) Sep codec should store less in terms dict
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924515#action_12924515 ]
Michael McCandless commented on LUCENE-2722:
--------------------------------------------
Note that this is a change to the index format, but no core codecs actually use the [abstract] sep codec yet (LUCENE-1410 will change that).
So I ran a simple perf test using MockSep, which naively writes each vInt separately:
||Query||QPS trunk||QPS patch||Pct diff||
|"unit state"|7.74|7.78|{color:green}0.4%{color}|
|uni*|17.12|17.31|{color:green}1.1%{color}|
|+unit +state|10.75|10.95|{color:green}1.9%{color}|
|unit*|29.71|30.38|{color:green}2.3%{color}|
|unit state|11.99|12.36|{color:green}3.1%{color}|
|un*d|65.00|67.85|{color:green}4.4%{color}|
|state|38.43|40.16|{color:green}4.5%{color}|
|u*d|19.92|21.94|{color:green}10.1%{color}|
|united~0.7|24.02|30.30|{color:green}26.2%{color}|
|united~0.6|5.26|7.00|{color:green}33.2%{color}|
It's a good speedup for the two fuzzy queries and also u*d...
[jira] Resolved: (LUCENE-2722) Sep codec should store less in terms dict
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless resolved LUCENE-2722.
----------------------------------------
Resolution: Fixed
[jira] Updated: (LUCENE-2722) Sep codec should store less in terms dict
Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-2722:
---------------------------------------
Attachment: LUCENE-2722.patch
Patch.