You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Artem Vasiliev (JIRA)" <ji...@apache.org> on 2007/01/09 23:27:27 UTC

[jira] Created: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

[PATCH] Performance improvement for some cases of sorted search
---------------------------------------------------------------

                 Key: LUCENE-769
                 URL: https://issues.apache.org/jira/browse/LUCENE-769
             Project: Lucene - Java
          Issue Type: Improvement
    Affects Versions: 2.0.0
            Reporter: Artem Vasiliev


It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.

The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re[2]: [jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by Artem <ab...@gmail.com>.

Hello Karl, Yonik and all,

Wednesday, February 21, 2007, 11:13:05 PM, you wrote:

>> I think some of the attachments that were deleted were uploaded by
>> robert engels
>> not just Artem Vasiliev (the person who requested deletion).

I'm also the creator of this ticket, and basically I created it to contribute my
patch that's currently in it.
I beleive Robert Engels' patch is another story and definitely not a part of the
patch for the ticket.

kw> Well I feel silly now.

Thank you for clearing a mess.

-- 
Best regards,
 Artem                            mailto:abublic@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by karl wettin <ka...@gmail.com>.

21 feb 2007 kl. 20.55 skrev Yonik Seeley:

> I think some of the attachments that were deleted were uploaded by  
> robert engels
> not just Artem Vasiliev (the person who requested deletion).

Well I feel silly now.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by Yonik Seeley <yo...@apache.org>.

On 2/21/07, Karl Wettin (JIRA) <ji...@apache.org> wrote:
[ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
>
> Karl Wettin updated LUCENE-769:
> -------------------------------
>
>     Attachment:     (was: IndexReaderUtils.java)

I think some of the attachments that were deleted were uploaded by robert engels
not just Artem Vasiliev (the person who requested deletion).

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464015 ] 

Artem Vasiliev commented on LUCENE-769:
---------------------------------------

Ok guys, I think I'm finished on this. Feel free to include it in Lucene or not. 

I'm quite happy already using it in my app (Sharehound), it does solve the problem for me.

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463678 ] 

Artem Vasiliev commented on LUCENE-769:
---------------------------------------

Also tried this with DOCS_NUM == 1,000,000. Right after index creation it was 
    [junit] 1687ms elapsed for DocCaching sort
    [junit] 1922ms elapsed for FieldCache'd sort

But next 2 time I ran it (without index set up) the timings were near this:
    [junit] 94ms elapsed for DocCaching sort
    [junit] 1797ms elapsed for FieldCache'd sort

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Vasiliev updated LUCENE-769:
----------------------------------

    Description: 
The attachment №5 from 31 Jan is the last and the whole patch for this issue. Guys with permissions, please remove all the others.

It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.

The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

  was:
It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.

The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.


> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, IndexReaderUtils.java, QueryFilter.java, selfContained.patch, selfContained.patch, selfContained.patch, selfContained.patch, StoredFieldSorting.patch
>
>
> The attachment №5 from 31 Jan is the last and the whole patch for this issue. Guys with permissions, please remove all the others.
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Vasiliev updated LUCENE-769:
----------------------------------

    Attachment: selfContained.patch

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, IndexReaderUtils.java, QueryFilter.java, selfContained.patch, selfContained.patch, selfContained.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-769:
-------------------------------

    Attachment:     (was: selfContained.patch)

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: selfContained.patch
>
>
> The attachment №5 from 31 Jan is the last and the whole patch for this issue. Guys with permissions, please remove all the others.
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "robert engels (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464056 ] 

robert engels commented on LUCENE-769:
--------------------------------------

The IndexReaderUtils I posted is not compilable - there are a few more classes needed. These are unnecessary to understand the technique.

It was written this way to minimize the dependencies with Lucene, and not have to apply patches for my local codebase.

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, IndexReaderUtils.java, QueryFilter.java, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-769:
-------------------------------

    Attachment:     (was: IndexReaderUtils.java)

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: selfContained.patch
>
>
> The attachment №5 from 31 Jan is the last and the whole patch for this issue. Guys with permissions, please remove all the others.
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464038 ] 

Hoss Man commented on LUCENE-769:
---------------------------------

Artem: while i agree with Yonik/Chuck's comments about your performance tests probably not being realistic in the general case, what i really like about your patch is that it makes no attempt to change the default behavior of sorting in a way that would hurt users by default -- users would only get this behavior if they choose to use it, and while the "typical" case may not bnefit from it, i'm sure there are plenty of situations where people *know* their index is big, and *know* that they are doing a search that should have a small number of results.  adding something like this doesn't proclude future work on making sorting using FieldCache's less prohibitive (ie: an IndexReader.reopen approach)

what does concern me about this patch is that without better javadocs explaining exactly what it does and when it's usefull, it could easily be missued by people who stumble upon it.

I also don't understand why in your updated version of the patch, you aren't making an attempt to use the FieldSelector version of IndexReader.document(), since it should *allways* be faster in this use case, and would result in your memory cache talking up less space.

I also don't understand your "IndexReader members access rights problems raised" ... a subclass of IndexReader should be able to live freely in any package -- including as a private static class inside of another class.  perhaps you ran into problems because you are attempting to subclass methods you don't really need to worry about subclassing? ... yet another reason to subclass FilterIndexReader and save yourself some headaches.

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, QueryFilter.java, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463969 ] 

Artem Vasiliev commented on LUCENE-769:
---------------------------------------

Hi guys!

Thanks for value comments. What a feedback! :)

I'd like to stress the point of my fix - to avoid costly FieldCache population with field values from the whole index.

Your point that it will be slower for cases when filtered sets be nearly as large as the whole index is valid. But is it a practical point? Lucene shines on big indexes and queries resulting with full index are not very useful I guess.

I think it's good idea to hide the caching reader class and utilize FieldSelector mechanism to make the fix more effective. However do you think this improvement worth doing? You are strong opposition and I'm not feeling up to an endless fight :) I'm serious, let me know what you think. This fix will have its limitations by no means but I think the above OutOfMemory scenario with current sorting mechanism alone makes this fix legitimate.

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by Chris Hostetter <ho...@fucit.org>.

: Chuck Williams commented on LUCENE-769:
: ---------------------------------------
:
: Robert,
:
: Could you attach your current implementation of reopen() as well?  The
: attachment did not come through in your java-dev message today, or the
: one from 12/11.  I'd like to look at an incremental implementation of
: reopen() for FieldCache.

while i'm also interested, and the issue is related, it's still seperate
from the notion of sorting on Stored fields .. can we move this to a
seperate jira issue?


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Chuck Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464055 ] 

Chuck Williams commented on LUCENE-769:
---------------------------------------

Robert,

Could you attach your current implementation of reopen() as well?  The attachment did not come through in your java-dev message today, or the one from 12/11.  I'd like to look at an incremental implementation of reopen() for FieldCache.

Thanks


> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, QueryFilter.java, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-769:
-------------------------------

    Attachment:     (was: StoredFieldSorting.patch)

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: selfContained.patch
>
>
> The attachment №5 from 31 Jan is the last and the whole patch for this issue. Guys with permissions, please remove all the others.
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Vasiliev updated LUCENE-769:
----------------------------------

    Attachment: DocCachingSorting.patch

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464319 ] 

Artem Vasiliev commented on LUCENE-769:
---------------------------------------

Refactored the fix according to Hoss's recomendations. Now only StoredFieldSortFactory class is left public; FieldSelector is always used to fetch the only field from documents. Guys please remove all the attachments except #7 - things get messy..

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, IndexReaderUtils.java, QueryFilter.java, selfContained.patch, selfContained.patch, selfContained.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463682 ] 

Artem Vasiliev commented on LUCENE-769:
---------------------------------------

fork="true" maxmemory="105m" attributes need to be added to <junit> task  for the test to be runnable   with DOCS_NUM == 1,000,000

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "robert engels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

robert engels updated LUCENE-769:
---------------------------------

    Attachment: QueryFilter.java

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, QueryFilter.java, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Vasiliev updated LUCENE-769:
----------------------------------

    Attachment: DocCachingSorting.patch

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Vasiliev updated LUCENE-769:
----------------------------------

    Attachment: selfContained.patch

Removed several inner classes and documents cache from StoredFieldSortFactory. Now the whole class is pretty clean and simple. Checked the timings in the test - they remain pretty much the same.  

Checked it in Sharehound with more large index (~1mln documents), it sorts resultset with 4000 docs in ~0,2s now taking 70M RAM, that's fine to me. 

With standart new Sort(field, false) it takes (for the first search on a field) about 30-40s and quite a lot of memory (after several sorted searches different fields it took about 500M).

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, IndexReaderUtils.java, QueryFilter.java, selfContained.patch, selfContained.patch, selfContained.patch, selfContained.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-769:
-------------------------------

    Attachment:     (was: QueryFilter.java)

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: selfContained.patch
>
>
> The attachment №5 from 31 Jan is the last and the whole patch for this issue. Guys with permissions, please remove all the others.
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463996 ] 

Artem Vasiliev commented on LUCENE-769:
---------------------------------------

Renamed classes as Hoss proposed. Tried to hide DocFieldCachingIndexReader, no luck - IndexReader members access rights problems raised. FieldSelector is now verified to be the same and is used by StoredFieldComparatorSource for DocFieldCachingIndexReader creation. 
Timings didn't change much - they probably would if documents in index were larger.

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Vasiliev updated LUCENE-769:
----------------------------------

    Attachment: selfContained.patch

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, IndexReaderUtils.java, QueryFilter.java, selfContained.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Chuck Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463729 ] 

Chuck Williams commented on LUCENE-769:
---------------------------------------

The test case uses only tiny documents, and the reported timings for multiple searches with FieldCache make it appear that the version of lucene used contains the bug that caused FieldCaches to be frequently recomputed unnecessarily.

I suggest trying the test with much larger documents, of realistic size, and using current Lucene source.  I'm sure the patch will make things much slower with the current implementation.  As Hoss suggests, performance would be improved considerably by using a FieldSelector to obtain just the sort field, but even so will be slow unless the sort field is arranged to be early on the documents, ideally the first field, and a LOAD_AND_BREAK FieldSelector is used.

Another important performance variable will be the number of documents retrieved in the test query.  If the number of documents satisfying the query is a sizable percentage of the total collection size, I'm pretty sure the patch will be much slower than using FieldCache.


> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463761 ] 

Yonik Seeley commented on LUCENE-769:
-------------------------------------

Those performance numbers don't make sense to me.
Why would DocCaching sort be so much faster than FieldCache sort the second time on the same IndexReader?
Using a cached FieldCache entry for sorting involves an array lookup... how do you improve on that?
Or do you open a new reader for each test?

Also, you specify the size of the index, but not the size of the number of documents to be sorted (that match the query).  DocCacheSorting should use much more memory than the FieldCache (and be slower) if the number of documents to be sorted is large, right?


> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Vasiliev updated LUCENE-769:
----------------------------------

    Attachment: StoredFieldSorting.patch

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by robert engels <re...@ix.netcom.com>.

I would assume the "incremental" field cache would be very similar to  
my "incremental" query filter. I have found this to be the #1  
performance improvement I've been able to make with Lucene -  
especially for highly dynamic indexes.

I have attached again to this email the code:

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Chuck Williams (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464012 ] 

Chuck Williams commented on LUCENE-769:
---------------------------------------

I have this same issue with a constantly changing large index where users needs a current view.  The frist search after each frequent IndexReader reopen is slow due primarily to the requirement to rebuild the FieldCache for sort fields.

I don't believe this patch, or any continuation along these lines, will help my issue.  Documents are lage and queries frequently return large results sets, say 20% of the entire multi-million document index or more.  Hundreds of thousands of document() retrievals, even with a fast LOAD_AND_BREAK FieldSelector finding sort fields at the beginning of each Document, is not going to beat FieldCache's single traversal of the postings for the sort fieds.

Another approach I've looked at is Robert Engel's IndexReader.reopen().  I think this direction is more promising.  Artem, you might want to look at this.  At least the version I've seen is not integrated with FieldCache, but it seems this would be feasible.  Segments to the left of the first changed segment maintain their doc-ids, so an improved FieldCache could iterate just the postings in the first changed segment and those to the right.  Unless somebody else does this first, it's on my list to improve IndexReader.reopen() with this optimization and to make other enhancements my app needs (e.g., support for ParallelReader -- the current implementation fails in this case).

A specific comment on the new patch:  the introduction of FieldSelectors is too restrictive.  The same doc-id may be retrieved using multiple FieldSelectors in different calls to IndexReader.document().  Any implementation of the cache needs to support this.


> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463727 ] 

Hoss Man commented on LUCENE-769:
---------------------------------

Artem: I've only skimmed your patch breifly, but i have a few comments:

1) since Lucene sorting has historicly been based on *indexed* fields, and this new patch results in fields being sorted on the *stored* values, you should definitely point this out in the javadocs of your DocCachingSortFactory and DocCachingFieldComparatorSource classes .. in big bold letters .. i would go even so far as to name the classes StoredFieldComparatorSrouce and StoredFieldSortFactory instead of the names you currently use.

2) your use of WeakHashMap with keys that are *never* refrenced elsewhere to ensure the cache is purged on every GC is not something i've ever seen before...

+public class WeakDocumentsCache {
+    private Map cache = Collections.synchronizedMap(new WeakHashMap());
+
+    public Document getById(int docId) {
+        return (Document) cache.get(new Integer(docId));
+    }
+
+    public void put(int docId, Document document) {
+        cache.put(new Integer(docId), document);
+    }

...have you tested out the performacne of this approach with various GC implimentations?

skimming the javadocs for WeakReference/WeakHashMap it seems like perhaps SoftRefrence would be better suited for your purposes.

3) making your new DocCachingIndexReader and WeakDocumentsCache clases part of the Lucene public API seems to be a little outside of the scope of this change ... perhapsthey should be left as a private static inner class es inside of the individual classes they are used by (DocCachingFieldComparatorSource and DocCachingIndexReader respectively) .... even if these classes are left public, DocCachingIndexReader should probably subclass FilterIndexReader to reduce the amount of duplication.

4) your check for FieldSelector usage in DocCachingIndexReader doesn't check if the FieldSelector used is the same every time -- which means you can't trust your cache ... fixing this could be complicated, and serves as another reason why it would be easier if DocCachingIndexReader was made a private inner class of DocCachingFieldComparatorSource where you know exactly how it's going to be used.

5) speaking of FieldSelector, your use case seems like a perfect example of when a FieldSelector would make sense to only read the field(s) that are needed for sorting.


> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "robert engels (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

robert engels updated LUCENE-769:
---------------------------------

    Attachment: IndexReaderUtils.java

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, IndexReaderUtils.java, QueryFilter.java, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464330 ] 

Artem Vasiliev commented on LUCENE-769:
---------------------------------------

Btw I've integrated the modified fix into sharhound, 4000 documented sorted search improved from 0,4s to 0,1s. That's great, thanks guys for you time and consideration!

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch, IndexReaderUtils.java, QueryFilter.java, selfContained.patch, selfContained.patch, selfContained.patch, StoredFieldSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463787 ] 

Yonik Seeley commented on LUCENE-769:
-------------------------------------

Sorry for some of the redundant comments... Chucks comment wasn't visible to me for some strange reason when I left mine.

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch, DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463652 ] 

Artem Vasiliev commented on LUCENE-769:
---------------------------------------

I've tried the test with DOCS_NUM == 10,000,000. DocCaching sort took about 1s with standart amount of memory (-Xmx80m) while FieldCache'd trapped with OutOfMemoryError even with 1G (-Xmx1000m). Resulting index size was 640M, its creation took at least 7hrs.

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-769:
-------------------------------

    Attachment:     (was: DocCachingSorting.patch)

> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: selfContained.patch
>
>
> The attachment №5 from 31 Jan is the last and the whole patch for this issue. Guys with permissions, please remove all the others.
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

Posted by "Artem Vasiliev (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Vasiliev updated LUCENE-769:
----------------------------------

    Description: 
It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.

The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

  was:
The attachment №5 from 31 Jan is the last and the whole patch for this issue. Guys with permissions, please remove all the others.

It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.

The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.


> [PATCH] Performance improvement for some cases of sorted search
> ---------------------------------------------------------------
>
>                 Key: LUCENE-769
>                 URL: https://issues.apache.org/jira/browse/LUCENE-769
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Artem Vasiliev
>         Attachments: selfContained.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption and improves performance for sorted searches with frequent index updates and relatively big indexes (>1mln docs) scenario. This solution supports only single-field sorting currently (which seem to be quite popular use case). Multiple fields support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of given field's values from the whole index - current FieldCache approach) are cached in a WeakHashMap so the cached items are candidates for GC.  Their fields values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org