Posted to issues@lucene.apache.org by "Jason Gerlowski (Jira)" <ji...@apache.org> on 2019/11/20 14:25:00 UTC

[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

    [ https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978467#comment-16978467 ] 

Jason Gerlowski commented on SOLR-13013:
----------------------------------------

Chiming in late to the discussion so far.

bq. I expect that it needs at least a full re-implementation of the MapWriter-parts
I actually thought the FieldWriter stuff looked pretty well done.  It cuts a decent bit of duplication out of those implementation classes.


bq. Should there be a setting for max memory usage and if violated adjust the window size or fallback to old logic?
We should definitely expose some knobs here so users can tweak performance/memory-usage for their use case.  But I think adding smart-picking or auto-failover etc. is something that should be deferred to a second pass.  I'm all for it; it just seems like something that'll be easier to get right once this optimization has been out there for a bit and we start getting feedback on where it does/doesn't work well.

> Change export to extract DocValues in docID order
> -------------------------------------------------
>
>                 Key: SOLR-13013
>                 URL: https://issues.apache.org/jira/browse/SOLR-13013
>             Project: Solr
>          Issue Type: Improvement
>          Components: Export Writer
>    Affects Versions: 7.5, 8.0
>            Reporter: Toke Eskildsen
>            Priority: Major
>             Fix For: master (9.0), 8.2
>
>         Attachments: SOLR-13013.patch, SOLR-13013_proof_of_concept.patch, SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for paging through the result set in a given sort order. Each time a window has been calculated, the values for the export fields are retrieved from the underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support random access. The current export implementation works around this by creating a new DocValues-iterator for each individual value to retrieve. This slows down export, as the iterator has to seek to the given docID from the start for each value. The slowdown scales with shard size (see LUCENE-8374 for details). An alternative is to extract the DocValues in docID-order, with re-use of DocValues-iterators. The idea is as follows:
>  # Change the FieldWriters for export to re-use the DocValues-iterators if subsequent requests are for docIDs higher than the previous ones
>  # Calculate the sliding window of SortDocs as usual
>  # Take a note of the order of the SortDocs in the sliding window
>  # Re-sort the SortDocs in docID-order
>  # Extract the DocValues to a temporary on-heap structure
>  # Re-sort the extracted values to the original sliding window order
>  # Deliver the values
> One big difference from the current export code is of course the need to hold the whole sliding window scaled result set in memory. This might well be a showstopper as there is no real limit to how large this partial result set can be. Maybe such an optimization could be requested explicitly if the user knows that there is enough memory?
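
For anyone skimming the description above, here is a rough, self-contained sketch of the re-sort idea. It is not the patch and does not use the Lucene or Solr APIs: ForwardOnlyValues is a hypothetical stand-in for a DocValues-iterator that can only move forward, and extractInWindowOrder, the sample column and the sample window are made up for illustration. The sketch merges the "extract" and "re-sort to window order" steps by writing each value straight into its original window slot, which is one possible way to do it.

```java
import java.util.Arrays;
import java.util.Comparator;

class DocIdOrderExportSketch {

    /** Hypothetical stand-in for a forward-only DocValues-iterator (not a Lucene class). */
    static final class ForwardOnlyValues {
        private final long[] valuesByDocId;
        private int current = -1;

        ForwardOnlyValues(long[] valuesByDocId) {
            this.valuesByDocId = valuesByDocId;
        }

        /** Returns the value for docId; refuses to move backwards, like an iterator would. */
        long valueFor(int docId) {
            if (docId < current) {
                throw new IllegalStateException("iterator cannot move backwards");
            }
            current = docId;
            return valuesByDocId[docId];
        }
    }

    /**
     * windowDocIds holds the docIDs of one sliding window, already in sort order.
     * Returns the field values in that same window order while visiting the
     * iterator in ascending docID order, so a single iterator is re-used.
     */
    static long[] extractInWindowOrder(int[] windowDocIds, ForwardOnlyValues values) {
        int n = windowDocIds.length;

        // Note the original window position of every entry.
        Integer[] positions = new Integer[n];
        for (int i = 0; i < n; i++) {
            positions[i] = i;
        }

        // Re-sort the positions by the docID they point to.
        Arrays.sort(positions, Comparator.comparingInt(p -> windowDocIds[p]));

        // Extract in docID order into a temporary on-heap array, placing each
        // value directly at its original window position.
        long[] out = new long[n];
        for (int p : positions) {
            out[p] = values.valueFor(windowDocIds[p]);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] column = {100, 101, 102, 103, 104, 105}; // value for docID i is column[i]
        int[] window = {4, 1, 5, 0};                    // docIDs in sort order
        long[] result = extractInWindowOrder(window, new ForwardOnlyValues(column));
        System.out.println(Arrays.toString(result));    // prints [104, 101, 105, 100]
    }
}
```

The temporary array is the memory cost the description worries about: it has to hold one extracted value per window entry for every exported field before anything can be delivered.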



--
This message was sent by Atlassian Jira
(v8.3.4#803005)