Posted to issues@solr.apache.org by "Joel Bernstein (Jira)" <ji...@apache.org> on 2022/11/07 14:23:00 UTC

[jira] [Comment Edited] (SOLR-16524) Index time hash partitioning

    [ https://issues.apache.org/jira/browse/SOLR-16524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629848#comment-17629848 ] 

Joel Bernstein edited comment on SOLR-16524 at 11/7/22 2:22 PM:
----------------------------------------------------------------

The main reason for a separate file is performance, for both writing and reading. The idea of creating separate, optimized on-disk files on commit is new. It allows us to break out of the constraints of the Lucene index and build on-disk structures that meet specific use cases. In this particular case it allows us to inline the bytes of a field rather than having to fetch the bytes using a docValues ordinal. In my testing, reading the inlined bytes is significantly faster, and it would allow the HashQParserPlugin to build filters much faster than it currently does. The tradeoff is the cost of writing the files on commit.
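
For illustration only (this is not code from the ticket or from Solr): a minimal in-memory sketch of the two access patterns described above. All class and method names are hypothetical, CRC32 is just a stand-in hash function, and plain arrays stand in for the on-disk structures. It shows only where the extra ordinal lookup sits when building a per-worker filter; it does not measure or demonstrate any performance difference.

```java
import java.util.BitSet;
import java.util.zip.CRC32;

/** Hypothetical sketch; none of these names are Solr or Lucene APIs. */
final class HashPartitionSketch {

    /** Maps a key's bytes to a worker in [0, numWorkers). CRC32 is an assumed stand-in hash. */
    static int partitionOf(byte[] keyBytes, int numWorkers) {
        CRC32 crc = new CRC32();
        crc.update(keyBytes, 0, keyBytes.length);
        // getValue() is a non-negative long, so the remainder is non-negative too.
        return (int) (crc.getValue() % numWorkers);
    }

    /** Ordinal-style layout: each doc stores an ordinal, and the bytes live in a dictionary. */
    static BitSet filterViaOrdinals(int[] docToOrd, byte[][] ordToBytes,
                                    int workerId, int numWorkers) {
        BitSet filter = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            byte[] key = ordToBytes[docToOrd[doc]]; // extra indirection per document
            if (partitionOf(key, numWorkers) == workerId) {
                filter.set(doc);
            }
        }
        return filter;
    }

    /** Inlined layout: the key bytes are stored per document, in document order. */
    static BitSet filterViaInlinedBytes(byte[][] docToBytes, int workerId, int numWorkers) {
        BitSet filter = new BitSet(docToBytes.length);
        for (int doc = 0; doc < docToBytes.length; doc++) {
            // Bytes are read directly in document order; no ordinal lookup is needed.
            if (partitionOf(docToBytes[doc], numWorkers) == workerId) {
                filter.set(doc);
            }
        }
        return filter;
    }
}
```

Both methods return the same filter for a given worker; they differ only in how the key bytes are reached for each document. Each document lands in exactly one worker's filter, which is the property that hash partitioning relies on.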





> Index time hash partitioning
> ----------------------------
>
>                 Key: SOLR-16524
>                 URL: https://issues.apache.org/jira/browse/SOLR-16524
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Joel Bernstein
>            Priority: Major
>
> Both Streaming Expressions and Spark-Solr currently rely on query-time hash partitioning using the HashQParserPlugin. Query-time hash partitioning, although extremely flexible, is very slow when it builds its initial filters.
> This ticket will add an index-time hash partitioner that both Streaming Expressions and Spark-Solr will be able to use.
> When this ticket is complete, I'll also update ParallelStream and Spark-Solr to use index-time partitioning rather than the HashQParserPlugin.
> This is a stepping stone towards much more performant parallel distributed joins.


