Posted to solr-user@lucene.apache.org by Jeff Wartes <jw...@whitepages.com> on 2016/08/11 17:39:50 UTC

Effects of insert order on query performance

This isn’t really a question, although some validation would be nice. It’s more of a warning.

Tldr is that the insert order of documents in my collection appears to have had a huge effect on my query speed.


I have a very large (sharded) SolrCloud 5.4 index. One aspect of this index is a multi-valued field (“permissions”) that for 90% of docs contains one particular value (“A”), and for 10% of docs contains another distinct value (“B”). It’s intended to represent something like permissions, so more values are possible in the future, but none are present currently. In fact, the addition of docs with value “B” to this index was very recent; previously all docs had value “A”. All queries, in addition to various other Boolean-query restrictions, include a terms query on this field, like {!terms f=permissions v=A} or {!terms f=permissions v=A,B}.
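
For illustration only (the field name is real, the rest of the request is made up), such a terms clause is typically embedded in a request via Solr's nested-query syntax, roughly:

q=name:smith AND _query_:"{!terms f=permissions v=A,B}"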

Last week, I tried to re-index the whole collection from scratch, using source data. Query performance on the resulting re-index proved to be abysmal: I could get barely 10% of my previous query throughput, and even that was at latencies orders of magnitude higher than what I had in production.

I hooked up some CPU profiling to a server that had shards from both the old and new versions of the collection, and eventually it looked like the significant difference in processing the two collections was coming from ConstantWeight.scorer().
Specifically, this line:
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/solr/core/src/java/org/apache/solr/search/SolrConstantScoreQuery.java#L102
was far more expensive in my re-indexed collection. From there, the call chain goes through an LRUQueryCache, down to a BulkScorer, and ends up with the extra work happening here:
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/lucene/core/src/java/org/apache/lucene/search/Weight.java#L169

I don’t pretend to understand all that code, but the difference in my re-index appears to have something to do either with that cache, or with the aggregate docIdSets that need weights generated simply being much bigger in my re-index.


But the queries didn’t change, and the data is basically the same, so what else could have changed?

The documents with the distinct “B” value were added to the high-performing collection only recently, but the A’s and the B’s were all mixed up in the source data dump I used to re-index. On a hunch, I manually ordered the docs so that the A’s all came first and re-indexed again, and performance is great!

Here’s my theory: with TieredMergePolicy, the vast majority of the documents in an index are contained in the largest segments. I’m guessing there’s an optimization somewhere that says something like “this segment only has A’s”. By indexing all the A’s first, the biggest segments contain only A’s, and only the smallest, newest segments are unable to make use of that optimization.
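
To actually test this theory, one could check each segment directly for any “B” postings with a small standalone Lucene program. This is just a rough sketch against the Lucene 5.x API (the index path is a placeholder; the field and term are the ones from above):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class SegmentTermCheck {
  public static void main(String[] args) throws Exception {
    // Point this at a copy of one core's index directory (placeholder path).
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/core/data/index")))) {
      for (LeafReaderContext leaf : reader.leaves()) {
        // Per segment: does the "permissions" field have any postings for the term "B"?
        Terms terms = leaf.reader().terms("permissions");
        TermsEnum te = (terms == null) ? null : terms.iterator();
        boolean hasB = (te != null) && te.seekExact(new BytesRef("B"));
        System.out.println("segment=" + leaf.ord
            + " maxDoc=" + leaf.reader().maxDoc()
            + " hasB=" + hasB);
      }
    }
  }
}

If the theory holds, the large segments of the fast collection should report hasB=false, while the large segments of the slow re-index should report hasB=true.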

Here’s the scary part: although my re-index is now performing well, if this theory is right, some random insert (or a deliberate optimize) at some random point in the future could trigger a cascade of segment merges such that the largest segment(s) contain both A’s and B’s, and performance suddenly goes over a cliff. I have no way to prevent that except to stop doing inserts.

My current thinking is that I need to pull the terms-query part out of the query and do a filter query for it instead. Probably as a post-filter, since I’ve had bad luck with very large filter queries and the filter cache. I’d tested this originally (when I only had A’s), but found the performance was a bit worse than just leaving it in the query. I’ll take a bit worse and predictability over a bit better and a time bomb though, if those are my choices.
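
Roughly, the change would look like the following (values purely illustrative). One caveat: as far as I know it only behaves as a true post-filter if the underlying query implements Solr's PostFilter interface; otherwise cache=false and cost just control caching and evaluation order.

before:  q=name:smith AND _query_:"{!terms f=permissions v=A,B}"
after:   q=name:smith&fq={!terms f=permissions v=A,B cache=false cost=150}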


If anyone has any comments refuting or supporting this theory, I’d certainly like to hear them. This is the first time I’ve encountered insert order mattering from a performance perspective, and it raises a more general question about how to handle low-cardinality fields.


Re: Effects of insert order on query performance

Posted by Jeff Wartes <jw...@whitepages.com>.
Thanks Emir. I’m unfortunately already using a routing key that needs to be at the top level, since I’m collapsing on that field. 

Adding a sub-key won’t help much if my theory is correct, as even a single shard (distrib=false) showed serious performance degradation, and query latency is the max(shard latency). I’d need a routing scheme that assured that a given shard has *only* A’s, or *only* B’s.

Even if I could use “permissions” as the top-level routing key though, this is a very low cardinality field, so I’d expect to end up with very large differences between the sizes of the shards in that case. That’s fine from a SolrCloud query perspective of course, but it makes for more difficult resource provisioning.


Re: Effects of insert order on query performance

Posted by Emir Arnautovic <em...@sematext.com>.
Hi Jeff,

I will not comment on your theory (I'll leave that to people more familiar 
with the Lucene code), but I will point to one alternative solution: routing. 
You can use routing to send documents with different permissions to 
different shards, and use composite hash routing to spread the "A" (and maybe 
"B" as well) documents across multiple shards. That makes sure all docs 
with the same permission are on the same shard, and at query time only 
those shards will be queried (fewer shards to query), so there is no need to 
include the terms query or filter query at all.
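
For illustration (doc ids and the bit count are made up), composite document ids and the matching query hint would look something like this:

indexing:  id = A/2!12345   (an "A" doc, spread over up to 1/4 of the shards)
           id = B!67890     (a "B" doc, hashed into the range for key "B")
querying:  _route_=A/2!     (query only the shards that can contain "A" docs)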

Here is a blog post explaining the benefits of composite hash routing:
https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/

Regards,
Emir

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
