Posted to dev@lucene.apache.org by "Sergey Vladimirov (JIRA)" <ji...@apache.org> on 2010/04/02 21:34:27 UTC

[jira] Created: (LUCENE-2362) Add support for slow filters with batch processing

Add support for slow filters with batch processing
--------------------------------------------------

                 Key: LUCENE-2362
                 URL: https://issues.apache.org/jira/browse/LUCENE-2362
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
    Affects Versions: 3.0.1
            Reporter: Sergey Vladimirov


The internal implementation of IndexSearcher assumes that the Filter and the scorer have roughly equal performance. But in our environment we have a Filter implementation that is very expensive compared to the scorer.

If we have, let's say, 2k termdocs selected by the scorer (each ~250 docs) and 2k selected by the filter, then 250k docs will be checked quickly (and filtered out) by the scorer, and 250k docs will be checked slowly by our filter.

Using a straightforward implementation pushes search past our limit of 60 seconds per query, because each next() or advance() requires N queries to the database PER CHECKED DOC. Using a read-ahead technique allows us to optimize it to 35 seconds per query. Still too slow.

The solution to the problem is to first select all documents with the scorer and then filter them in a batch with our filter. An example implementation (with BitSet) is in the attachment. Currently it takes only ~300 milliseconds per query.
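
To make the batch idea concrete, here is a minimal sketch in plain Java. It is not the attached BatchFilter.java; the BatchFilter interface, the class name and the method names below are assumptions made up for illustration, and the scorer is reduced to an array of matching doc ids.

```java
import java.util.BitSet;

/** Hypothetical batch filter: decides on a whole set of candidate doc ids in one call. */
interface BatchFilter {
    /** Returns the subset of candidates that pass the filter. */
    BitSet accept(BitSet candidates);
}

final class BatchFilterSketch {

    /** Pass 1: collect every doc id the (cheap) scorer matches into a BitSet. */
    static BitSet collectCandidates(int[] scorerDocs, int maxDoc) {
        BitSet candidates = new BitSet(maxDoc);
        for (int doc : scorerDocs) {
            candidates.set(doc);
        }
        return candidates;
    }

    /** Pass 2: hand the whole candidate set to the expensive filter at once. */
    static BitSet search(int[] scorerDocs, int maxDoc, BatchFilter filter) {
        BitSet candidates = collectCandidates(scorerDocs, maxDoc);
        BitSet accepted = filter.accept(candidates);
        accepted.and(candidates); // never return docs the scorer did not match
        return accepted;
    }

    public static void main(String[] args) {
        int[] scorerDocs = {1, 4, 6, 9, 12};
        // Stand-in for one batched database round trip: keep even doc ids only.
        BatchFilter evenOnly = candidates -> {
            BitSet out = new BitSet();
            for (int d = candidates.nextSetBit(0); d >= 0; d = candidates.nextSetBit(d + 1)) {
                if (d % 2 == 0) {
                    out.set(d);
                }
            }
            return out;
        };
        System.out.println(search(scorerDocs, 16, evenOnly)); // prints {4, 6, 12}
    }
}
```

The point of the shape is that the expensive side sees all candidates in a single call, so it can issue one bulk lookup instead of N lookups per checked doc. Note that this sketch only returns doc ids and does no scoring.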

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853111#action_12853111 ] 

Michael McCandless commented on LUCENE-2362:
--------------------------------------------

I think in general Lucene should do a better job managing whether the filter is cheap or expensive, random access or not (LUCENE-1536), and tune the matching/scoring appropriately.

But one issue with this patch: how is scoring done?  It looks like in the first pass you gather a bit set, then you filter it w/ the batch filter, then you iterate again in a 2nd pass to collect the docs.  But that 2nd pass won't in general have enough info to do scoring?



[jira] Updated: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Sergey Vladimirov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Vladimirov updated LUCENE-2362:
--------------------------------------

    Attachment:     (was: IndexSearcherImpl.java)



[jira] Commented: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Sergey Vladimirov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853114#action_12853114 ] 

Sergey Vladimirov commented on LUCENE-2362:
-------------------------------------------

Michael,

I'm sorry, I don't understand the question/problem.

Scoring is done exactly the same way as it is done in IndexSearcher with a standard filter; the only difference is that it's done after filtering, not at the same time.



[jira] Updated: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Sergey Vladimirov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Vladimirov updated LUCENE-2362:
--------------------------------------

    Attachment:     (was: ScorerProxy.java)



[jira] Updated: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Sergey Vladimirov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Vladimirov updated LUCENE-2362:
--------------------------------------

    Attachment: BatchFilter.java
                IndexSearcherImpl.java

Example of a batch slow filter interface and an IndexSearcher implementation.

Maybe it is possible to split Filter into several interfaces and allow the user to select a concrete way to implement it. Like:

- Filter (interface)
  -- Fast Filter (current one)
  -- Slow Filter (new one, like the one in attachment)
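
A rough sketch of how such a split could look in plain Java. The names DocFilter, FastFilter, SlowFilter and FilterDispatch are hypothetical and are not part of Lucene or of the attachment; the searcher is reduced to a helper that picks a strategy based on the filter type.

```java
import java.util.BitSet;

/** Hypothetical common parent for the proposed split; not Lucene's Filter class. */
interface DocFilter { }

/** Cheap filter: can be asked about docs one at a time while the scorer advances. */
interface FastFilter extends DocFilter {
    boolean accept(int doc);
}

/** Expensive filter: wants all candidates at once so it can batch its lookups. */
interface SlowFilter extends DocFilter {
    BitSet acceptAll(BitSet candidates);
}

final class FilterDispatch {

    /** Applies either kind of filter to the docs matched by the scorer. */
    static BitSet apply(DocFilter filter, BitSet scorerMatches) {
        if (filter instanceof SlowFilter) {
            // One call for the whole candidate set.
            BitSet accepted = ((SlowFilter) filter).acceptAll(scorerMatches);
            accepted.and(scorerMatches); // keep only docs the scorer matched
            return accepted;
        }
        if (filter instanceof FastFilter) {
            // One cheap check per matched doc.
            FastFilter fast = (FastFilter) filter;
            BitSet accepted = new BitSet();
            for (int d = scorerMatches.nextSetBit(0); d >= 0; d = scorerMatches.nextSetBit(d + 1)) {
                if (fast.accept(d)) {
                    accepted.set(d);
                }
            }
            return accepted;
        }
        throw new IllegalArgumentException("unknown filter type: " + filter);
    }
}
```

With this shape the user opts in to batching by implementing SlowFilter; existing per-doc filters keep the current behaviour.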



[jira] Commented: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Sergey Vladimirov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853122#action_12853122 ] 

Sergey Vladimirov commented on LUCENE-2362:
-------------------------------------------

Michael,

I got it, thanks for pointing it out! I'll prepare another version right now. Too bad DocIdIterator doesn't have a reset() method; it would help a lot.



[jira] Commented: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853119#action_12853119 ] 

Michael McCandless commented on LUCENE-2362:
--------------------------------------------

But that's a big problem -- most scorers can't score "after the fact".  They need to access things they have loaded for the one document being scored.

EG try running a TermQuery and compare the scores you get for docs with Lucene's normal search vs with your patch.
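
A toy model of the constraint, in plain Java. ForwardOnlyScorer is a hypothetical stand-in, not Lucene's Scorer, and sqrt(freq) is only a placeholder for whatever per-document data a real scorer reads. It shows that once the first pass has drained the iterator, there is no current document left to score.

```java
/** Toy forward-only scorer: score() is only meaningful for the current doc. */
final class ForwardOnlyScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;   // matching doc ids, ascending
    private final int[] freqs;  // per-doc data loaded as the scorer advances
    private int pos = -1;

    ForwardOnlyScorer(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    /** Moves to the next matching doc; there is no way to go back. */
    int nextDoc() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    /** Scores the doc the scorer is currently positioned on. */
    float score() {
        if (pos < 0 || pos >= docs.length) {
            throw new IllegalStateException("not positioned on a doc");
        }
        return (float) Math.sqrt(freqs[pos]); // placeholder formula
    }

    public static void main(String[] args) {
        ForwardOnlyScorer s = new ForwardOnlyScorer(new int[]{3, 7}, new int[]{4, 9});
        while (s.nextDoc() != NO_MORE_DOCS) {
            // first pass: only doc ids are gathered, no scores
        }
        try {
            s.score();
        } catch (IllegalStateException e) {
            System.out.println("cannot score after the fact: " + e.getMessage());
        }
    }
}
```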



[jira] Updated: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Sergey Vladimirov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Vladimirov updated LUCENE-2362:
--------------------------------------

    Attachment:     (was: IndexSearcherImpl.java)



[jira] Updated: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Sergey Vladimirov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Vladimirov updated LUCENE-2362:
--------------------------------------

    Attachment: IndexSearcherImpl.java
                ScorerProxy.java

Updated the example.



[jira] Updated: (LUCENE-2362) Add support for slow filters with batch processing

Posted by "Sergey Vladimirov (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Vladimirov updated LUCENE-2362:
--------------------------------------

    Attachment: IndexSearcherImpl.java

We can't reset a Scorer, but we can obtain a new one. Updated the example.
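
The "obtain a new one" idea can be sketched roughly as follows, in plain Java. This is not the attached IndexSearcherImpl.java; ToyScorer, TwoPassSearch and the Supplier-based factory are assumptions for illustration only. The first scorer is drained to gather candidates, the batch filter prunes them, and a second, fresh scorer is walked so that each surviving doc is scored while the scorer is positioned on it.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

/** Toy forward-only scorer; a stand-in for illustration, not Lucene's Scorer. */
final class ToyScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;
    private final float[] scores;
    private int pos = -1;

    ToyScorer(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    int nextDoc() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    /** Valid only while positioned on a doc. */
    float score() {
        return scores[pos];
    }
}

final class TwoPassSearch {

    /** Returns doc id -> score for docs matched by the scorer and kept by the batch filter. */
    static Map<Integer, Float> search(Supplier<ToyScorer> newScorer,
                                      UnaryOperator<BitSet> batchFilter) {
        // Pass 1: drain one scorer just to learn which docs match.
        BitSet candidates = new BitSet();
        ToyScorer first = newScorer.get();
        for (int d = first.nextDoc(); d != ToyScorer.NO_MORE_DOCS; d = first.nextDoc()) {
            candidates.set(d);
        }

        // One batched call to the expensive filter.
        BitSet accepted = batchFilter.apply(candidates);

        // Pass 2: the first scorer cannot be rewound, so ask for a new one
        // and score each accepted doc while positioned on it.
        Map<Integer, Float> hits = new LinkedHashMap<>();
        ToyScorer second = newScorer.get();
        for (int d = second.nextDoc(); d != ToyScorer.NO_MORE_DOCS; d = second.nextDoc()) {
            if (accepted.get(d)) {
                hits.put(d, second.score());
            }
        }
        return hits;
    }
}
```

The trade-off is that the query is iterated twice, which should only pay off when the filter is far more expensive than the scorer, as in the case described in this issue.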
