You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@geode.apache.org by Mario Kevo <ma...@est.tech> on 2022/06/07 09:55:56 UTC

Odg: Question about INDEX_THRESHOLD_SIZE

Hi all,

I dig it more in this problem and saw that if we have some range query like this:


query --query="<trace>SELECT  e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] LIKE 'abc%'"

It will turn it into two comparisons. First, it will check all that are < 'abd' and store them into intermediate results and then the second comparison will find all entries that have this attribute >= 'abc'. And their intersection can be null, when doing the first comparison(LT) it will store all entries that have this attribute lower than 'abd' which can also be something like '1234', 'aab',... as all of them are lower by ASCII table. The problem is when adding entries to results it has a limit checked after every iteration. https://github.com/apache/geode/blob/develop/geode-core/src/main/java/org/apache/geode/cache/query/internal/index/CompactRangeIndex.java#L859
This limit is set in https://github.com/apache/geode/blob/develop/geode-core/src/main/java/org/apache/geode/cache/query/internal/index/CompactRangeIndex.java#L485
In that case, it stores only the first 100 entries that it found and there is the possibility that many of them are not starting with 'abc'.
This limit is different from the one added in the query to limit printing results.

If we just go with checking >= 'abc' this will include also entries whose attribute looks like 'bcd', ...

If we change indexThresholdSize while starting servers this limit can be changed and can get correct results if the indexThresholdSize is high as the region has entries (The user in many cases couldn't know how many entries it will have in the region).

I tried to change it by default to Integer.MAX_VALUE, but then have some test failing, so think it is not the best solution.
The test that reproduces the problem is available on #7754<https://github.com/apache/geode/pull/7754>.

Does anyone have some idea what is the best solution for this issue?

Thanks and BR,
Mario

________________________________
Šalje: Jason Huynh <jh...@vmware.com>
Poslano: 15. ožujka 2022. 21:11
Prima: dev@geode.apache.org <de...@geode.apache.org>
Predmet: Re: Question about INDEX_THRESHOLD_SIZE

Additional thought:
It would be nice to set/check CAN_APPLY_LIMIT_AT_INDEX on a per node basis or at least a finer grained setting.


  *   AND junctions probably it probably shouldn’t be applied at an index level
  *   OR junctions could as it’s a union
Combinations of the two or ancestor nodes should be smarter but that requires additional changes to make elegant…

From: Jason Huynh <jh...@vmware.com>
Date: Tuesday, March 15, 2022 at 1:03 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about INDEX_THRESHOLD_SIZE
Hi Mario,



Digging a little bit more, I’m assuming the CompiledLike probably should reassess whether limits should be applied at the index level.  I imagine the “someth%” is currently being turned into two compiled comparisons joined by an AND?
The first comparison being >= “someth” and the second being < MAX_CHAR. If so, we are applying limit at both executions and applying and intersecting them(?). The first comparison returning 1 value and the second only a subset of the entire index, so when intersected will possibly not have the correct value.
I think there are two changes that we might be able to make:
1.) recalculate whether limit should be applied at the index level for like queries that get reorganized into a junction

This will slow down your current query because indexes will no longer be able to be applied.  However :

2.) for a like query where the “%” is trailing.. do we need to do an AND junction?  Can it just be a >= “someth” ?





In the PR https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F7010&amp;data=04%7C01%7Cjhuynh%40vmware.com%7Cfc3f9301ac074a3e694808da06beeee7%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637829714375501933%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=%2FZRbcTiJzvRXrM9TXN3hTcsS%2BXr51ZAXfiuRu4FNpxo%3D&amp;reserved=0 :
A.) We now allow expanding the limit to the INDEX_THRESHOLD_SIZE - the concern being less efficiency if the limit can be applied at the index and only 1 value matches.

B.). Did we miss a limit != -1 check CompactRangeIndex (https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fblob%2F41eb49989f25607acfcbf9ac5afe3d4c0721bb35%2Fgeode-core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fgeode%2Fcache%2Fquery%2Finternal%2Findex%2FCompactRangeIndex.java%23L487&amp;data=04%7C01%7Cjhuynh%40vmware.com%7Cfc3f9301ac074a3e694808da06beeee7%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637829714375501933%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=nr7WhCan6uysp%2Buz78q2QdqApBJ0zDPCFiU4fa%2FfVCE%3D&amp;reserved=0). It looks like all other times we added it - the concern being that we would now apply a limit even if it were -1 (no limit intended)

The PR and questions surrounding it, aren’t the cause of what you are seeing now.



I’ll jump into debugging this a bit when I get the chance – haven’t had a chance to sit down with a debugger just yet.



Regards,

-Jason




From: Mario Kevo <ma...@est.tech>
Date: Monday, March 14, 2022 at 8:17 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Odg: Question about INDEX_THRESHOLD_SIZE
Hi,

Regarding Anil's answer:
The test will pass as we have set INDEX_THRESHOLD_SIZE to 1000 (that is how many entries are in the region). If you deleted that line in startServer it will fail asit used default value of indexThresholdSize(100).

Regarding Jason's answers:
Only change what my PR introduced to get availability to change INDEX_THRESHOLD_SIZE by the user as before it is hardcoded to 100, and nothing can change it.
The intermediate result will have all entries on which index can be applied and if there are more entries that have that attribute on which index is created it will store them all in the results(if INDEX_THRESHOLD_SIZE is changed). By default, it has a limit of 100 entries so it can happen that an entry that matches the condition will not be in the results.
Regarding this applyCondition in CompactRangeIndex.addToResultsFromEntries, it is never applied as iterOps is always null, as it is set in AbstractGroupOrRangeJunction.auxFilterEvaluate:

filterResults = filter.filterEvaluate(context,
!isConditioningNeeded ? intermediateResults : null, completeExpansion,
null/*
* Asif * Asif :The iter operands passed are null, as a not null value can exists only
* if there exists a single Filter operand in original GroupJunction
*/, indpndntItr, _operator == LITERAL_and, isConditioningNeeded,
false /* do not evaluate projection */);
BR,
Mario
________________________________
Šalje: Jason Huynh <jh...@vmware.com>
Poslano: 11. ožujka 2022. 22:06
Prima: dev@geode.apache.org <de...@geode.apache.org>
Predmet: Re: Question about INDEX_THRESHOLD_SIZE

As an fyi, in the past we disabled applying limits at the index level for range indexes.

I’m surprised in this case that we would add all the entries to the intermediate results instead of applying the filter first and checking the condition before adding to the intermediate results..

I would have thought it would have to apply the condition as seen in CompactRangeIndex.addToResultsFromEntries :

          if (ok && runtimeItr != null && iterOps != null) {
              ok = QueryUtils.applyCondition(iterOps, context);
            }

I haven’t walked this query through the code, perhaps it’s hitting a different index type (I’d think a map index but that probably is backed by CompactRangeIndexes for each key…)



From: Jason Huynh <jh...@vmware.com>
Date: Friday, March 11, 2022 at 12:47 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about INDEX_THRESHOLD_SIZE
I think https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F7010&amp;data=04%7C01%7Cjhuynh%40vmware.com%7Cfc3f9301ac074a3e694808da06beeee7%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637829714375501933%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=%2FZRbcTiJzvRXrM9TXN3hTcsS%2BXr51ZAXfiuRu4FNpxo%3D&amp;reserved=0 may have changed what that property represented.  I believe it was some arbitrary threshold to abort using index look ups (if the intermediate results were small, it was deemed faster to just iterate through and not do a lookup – at least from my interpretation of the code)
It looks like with the change, it now munges it with limit.. so now limit is applied to that value.. gfsh happens to always pass in a limit too, so there is possibly additional confusion

From the diff there is also one spot where a limit != -1 had not been added..  In CompactRangeIndex line 489:

      if (limit < indexThresholdSize) {
        limit = indexThresholdSize;
      }

This might be affecting the usage of limit at the index level?


From: Anilkumar Gingade <ag...@vmware.com>
Date: Friday, March 11, 2022 at 12:11 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about INDEX_THRESHOLD_SIZE
Mario,

There is similar test/example added by you in QueryWithRangeIndexDUnitTest. testQueryWithWildcardAndIndexOnAttributeFromHashMap()
When I run that test (on develop); I see the results as expected:
*************
Command result for <query --query="<trace> select e.key, e.value from /exampleRegion.entrySet e where e.value.positions['SUN'] like 'somethin%'">:
Result      : true
Limit       : 100
Rows        : 1
Query Trace : Query Executed in 85.1964 ms; indexesUsed(1):IdIndex(Results: 10000)
************
Are you running your test with any additional change like as you are saying :
>> I was working on allowing INDEX_THRESHOLD_SIZE System property to override CompiledValue.RESULT_LIMIT.

If so , you need to look at the change and see why its impacting that way.
If not, please let me know what change can be made in that test to reproduce the issue you are seeing; that will help to debug/analyze the issue.

-Anil.




On 3/11/22, 12:18 AM, "Mario Kevo" <ma...@est.tech> wrote:

    Hi,

    It works without an index but it doesn't work with an index.
    When I revert changes, it takes INDEX_THRESHOLD_SIZE default value(100). And if the entry that matches the condition is not in that resultset it will not be printed.
    Without index:
    gfsh>query --query="<trace>SELECT e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
    Result      : true
    Limit       : 100
    Rows        : 1
    Query Trace : Query Executed in 11.502283 ms; indexesUsed(0)

    key | value
    --- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    300 | {"ID":300,"indexKey":0,"pkid":"300","shortID":null,"position1":{"mktValue":1945.0,"secId":"ORCL","secIdIndexed":"ORCL","secType":null,"sharesOutstanding":1944000.0,"underlyer":null,"pid":1944,"portfolioId":300,..
    With index:
    gfsh>query --query="<trace>SELECT e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
    Result      : true
    Limit       : 100
    Rows        : 0
    Query Trace : Query Executed in 8.784831 ms; indexesUsed(1):index1(Results: 100)
    BR,
    Mario
    ________________________________
    Šalje: Anilkumar Gingade <ag...@vmware.com>
    Poslano: 10. ožujka 2022. 23:16
    Prima: dev@geode.apache.org <de...@geode.apache.org>
    Predmet: Re: Question about INDEX_THRESHOLD_SIZE

    Mario,

    There are few changes happened around this area as part of GEODE-9632 fix; can you please revert that change and see if the query works both with and without index.
    Looking at the code; it seems to restrict the number index look up that needs to be performed; certain latency/throughput sensitive queries that or not expecting exact result may use this (my guess) but by default it should not be resulting in unexpected results.

    -Anil.


    On 3/10/22, 6:50 AM, "Mario Kevo" <ma...@est.tech> wrote:

        Hi geode-dev,

        Some time ago I was working on allowing INDEX_THRESHOLD_SIZE System property to override CompiledValue.RESULT_LIMIT.
        After this change, adding this attribute will take into a count if you set it.
        But I need some clarification of this INDEX_THRESHOLD_SIZE attribute. Why is this set by default to 100?
        The main problem with this attribute is that if you want to get the correct result, you need to know how many entries will be in the region while starting servers and set it on that value or higher. Sometimes it is too hard to know how many entries will be in the region, so maybe better will be to set it by default to some higher number, something like Integer.MAX_VALUE.

        Where this attribute is used?
        It is used to get index results while doing queries.

        What is the problem?
        If we have INDEX_THRESHOLD_SIZE set to 500, and we have 1k entries it can happen that while doing a query it will get only 500 entries and where clause cannot be fulfilled and we got no results.
        Let's see it by an example!

        We have only one entry that matches the condition from the query, INDEX_THRESHOLD_SIZE set to 500, and 1k entries in the region.
        If we run the query without an index we got the result.
        gfsh>query --query="<trace>SELECT e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
        Result      : true
        Limit       : 100
        Rows        : 1
        Query Trace : Query Executed in 10.750238 ms; indexesUsed(0)

        key | value
        --- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        700 | {"ID":700,"indexKey":0,"pkid":"700","shortID":null,"position1":{"mktValue":1945.0,"secId":"ORCL","secIdIndexed":"ORCL","secType":null,"sharesOutstanding":1944000.0,"underlyer":null,"pid":1944,"portfolioId":700,..
        If we create an index and then run again this query there is no result.
        gfsh>query --query="<trace>SELECT e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
        Result      : true
        Limit       : 100
        Rows        : 0
        Query Trace : Query Executed in 22.079016 ms; indexesUsed(1):index1(Results: 500)
        This happened because we have no luck getting that entry that matches the condition in the intermediate results for the index.
        So the questions are:
        What if more entries enter the region that will make the index return more entries than this threshold sets? Then we're again in jeopardy that the query condition will not match.
        Why is this attribute set by default to 100?
        Can we change it to the Integer.MAX_VALUE by default to be sure that we have the correct result? What are the consequences?

        BR,
        Mario

Re: Question about INDEX_THRESHOLD_SIZE

Posted by Nabarun Nag <nn...@vmware.com.INVALID>.

Hi Mario,

Regarding this issue, we are investigating how we can move forward with this issue. After discussing with the team, and with the 1.15 release schedule coming up this month, we have planned to move this solution to 1.16.0 release. We have planned to go forward with this plan.

  *   Revert GEODE-9632 from develop and backport the revert to 1.15.0 ( This needs to be done to follow protocol) to get 1.15.0 release task complete.
  *   Work with you and the team to find a complete solution/improvement and push the solution to develop again and release in 1.16.0
  *   Document the additional purpose of INDEX_THRESHOLD_SIZE in the Apache Geode docs.

Please do reach us if you have any questions.

Regards
Nabarun Nag

________________________________
From: Mario Kevo <ma...@est.tech>
Sent: Tuesday, June 7, 2022 2:55 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Odg: Question about INDEX_THRESHOLD_SIZE

⚠ External Email

Hi all,

I dig it more in this problem and saw that if we have some range query like this:


query --query="<trace>SELECT  e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] LIKE 'abc%'"

It will turn it into two comparisons. First, it will check all that are < 'abd' and store them into intermediate results and then the second comparison will find all entries that have this attribute >= 'abc'. And their intersection can be null, when doing the first comparison(LT) it will store all entries that have this attribute lower than 'abd' which can also be something like '1234', 'aab',... as all of them are lower by ASCII table. The problem is when adding entries to results it has a limit checked after every iteration. https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fblob%2Fdevelop%2Fgeode-core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fgeode%2Fcache%2Fquery%2Finternal%2Findex%2FCompactRangeIndex.java%23L859&amp;data=05%7C01%7Cnnag%40vmware.com%7C521ece5f5f324b0c361d08da486bff8e%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637901925947631598%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=mfvHgPewuNmWFurIcyHr7zC0xrGwFjbvFt2LRYs2lGE%3D&amp;reserved=0
This limit is set in https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fblob%2Fdevelop%2Fgeode-core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fgeode%2Fcache%2Fquery%2Finternal%2Findex%2FCompactRangeIndex.java%23L485&amp;data=05%7C01%7Cnnag%40vmware.com%7C521ece5f5f324b0c361d08da486bff8e%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637901925947631598%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=OubRAWJ2S%2Fi4Dubh2wATXvdyhdIEoFOPKgHINk7yq6c%3D&amp;reserved=0
In that case, it stores only the first 100 entries that it found and there is the possibility that many of them are not starting with 'abc'.
This limit is different from the one added in the query to limit printing results.

If we just go with checking >= 'abc' this will include also entries whose attribute looks like 'bcd', ...

If we change indexThresholdSize while starting servers this limit can be changed and can get correct results if the indexThresholdSize is high as the region has entries (The user in many cases couldn't know how many entries it will have in the region).

I tried to change it by default to Integer.MAX_VALUE, but then have some test failing, so think it is not the best solution.
The test that reproduces the problem is available on #7754<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F7754&amp;data=05%7C01%7Cnnag%40vmware.com%7C521ece5f5f324b0c361d08da486bff8e%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637901925947631598%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=ZDZkPK6PGZw%2F0uXvjLc9M%2BClBK75wSfyDUbH%2FReblus%3D&amp;reserved=0>.

Does anyone have some idea what is the best solution for this issue?

Thanks and BR,
Mario

________________________________
Šalje: Jason Huynh <jh...@vmware.com>
Poslano: 15. ožujka 2022. 21:11
Prima: dev@geode.apache.org <de...@geode.apache.org>
Predmet: Re: Question about INDEX_THRESHOLD_SIZE

Additional thought:
It would be nice to set/check CAN_APPLY_LIMIT_AT_INDEX on a per node basis or at least a finer grained setting.


  *   AND junctions probably it probably shouldn’t be applied at an index level
  *   OR junctions could as it’s a union
Combinations of the two or ancestor nodes should be smarter but that requires additional changes to make elegant…

From: Jason Huynh <jh...@vmware.com>
Date: Tuesday, March 15, 2022 at 1:03 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about INDEX_THRESHOLD_SIZE
Hi Mario,



Digging a little bit more, I’m assuming the CompiledLike probably should reassess whether limits should be applied at the index level.  I imagine the “someth%” is currently being turned into two compiled comparisons joined by an AND?
The first comparison being >= “someth” and the second being < MAX_CHAR. If so, we are applying limit at both executions and applying and intersecting them(?). The first comparison returning 1 value and the second only a subset of the entire index, so when intersected will possibly not have the correct value.
I think there are two changes that we might be able to make:
1.) recalculate whether limit should be applied at the index level for like queries that get reorganized into a junction

This will slow down your current query because indexes will no longer be able to be applied.  However :

2.) for a like query where the “%” is trailing.. do we need to do an AND junction?  Can it just be a >= “someth” ?





In the PR https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F7010&amp;data=05%7C01%7Cnnag%40vmware.com%7C521ece5f5f324b0c361d08da486bff8e%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637901925947631598%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=iTfLgoqyZCgO2oRDlsDREE0txRXMIWSG29sZj2O7rJg%3D&amp;reserved=0 :
A.) We now allow expanding the limit to the INDEX_THRESHOLD_SIZE - the concern being less efficiency if the limit can be applied at the index and only 1 value matches.

B.). Did we miss a limit != -1 check CompactRangeIndex (https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fblob%2F41eb49989f25607acfcbf9ac5afe3d4c0721bb35%2Fgeode-core%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fgeode%2Fcache%2Fquery%2Finternal%2Findex%2FCompactRangeIndex.java%23L487&amp;data=05%7C01%7Cnnag%40vmware.com%7C521ece5f5f324b0c361d08da486bff8e%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637901925947631598%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=51qX5ICPL5LwCpvs%2F%2F0uodP9n7ZV5Y2zOFabmiEOtsM%3D&amp;reserved=0). It looks like all other times we added it - the concern being that we would now apply a limit even if it were -1 (no limit intended)

The PR and questions surrounding it, aren’t the cause of what you are seeing now.



I’ll jump into debugging this a bit when I get the chance – haven’t had a chance to sit down with a debugger just yet.



Regards,

-Jason




From: Mario Kevo <ma...@est.tech>
Date: Monday, March 14, 2022 at 8:17 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Odg: Question about INDEX_THRESHOLD_SIZE
Hi,

Regarding Anil's answer:
The test will pass as we have set INDEX_THRESHOLD_SIZE to 1000 (that is how many entries are in the region). If you deleted that line in startServer it will fail asit used default value of indexThresholdSize(100).

Regarding Jason's answers:
Only change what my PR introduced to get availability to change INDEX_THRESHOLD_SIZE by the user as before it is hardcoded to 100, and nothing can change it.
The intermediate result will have all entries on which index can be applied and if there are more entries that have that attribute on which index is created it will store them all in the results(if INDEX_THRESHOLD_SIZE is changed). By default, it has a limit of 100 entries so it can happen that an entry that matches the condition will not be in the results.
Regarding this applyCondition in CompactRangeIndex.addToResultsFromEntries, it is never applied as iterOps is always null, as it is set in AbstractGroupOrRangeJunction.auxFilterEvaluate:

filterResults = filter.filterEvaluate(context,
!isConditioningNeeded ? intermediateResults : null, completeExpansion,
null/*
* Asif * Asif :The iter operands passed are null, as a not null value can exists only
* if there exists a single Filter operand in original GroupJunction
*/, indpndntItr, _operator == LITERAL_and, isConditioningNeeded,
false /* do not evaluate projection */);
BR,
Mario
________________________________
Šalje: Jason Huynh <jh...@vmware.com>
Poslano: 11. ožujka 2022. 22:06
Prima: dev@geode.apache.org <de...@geode.apache.org>
Predmet: Re: Question about INDEX_THRESHOLD_SIZE

As an fyi, in the past we disabled applying limits at the index level for range indexes.

I’m surprised in this case that we would add all the entries to the intermediate results instead of applying the filter first and checking the condition before adding to the intermediate results..

I would have thought it would have to apply the condition as seen in CompactRangeIndex.addToResultsFromEntries :

          if (ok && runtimeItr != null && iterOps != null) {
              ok = QueryUtils.applyCondition(iterOps, context);
            }

I haven’t walked this query through the code, perhaps it’s hitting a different index type (I’d think a map index but that probably is backed by CompactRangeIndexes for each key…)



From: Jason Huynh <jh...@vmware.com>
Date: Friday, March 11, 2022 at 12:47 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about INDEX_THRESHOLD_SIZE
I think https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F7010&amp;data=05%7C01%7Cnnag%40vmware.com%7C521ece5f5f324b0c361d08da486bff8e%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637901925947631598%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&amp;sdata=iTfLgoqyZCgO2oRDlsDREE0txRXMIWSG29sZj2O7rJg%3D&amp;reserved=0 may have changed what that property represented.  I believe it was some arbitrary threshold to abort using index look ups (if the intermediate results were small, it was deemed faster to just iterate through and not do a lookup – at least from my interpretation of the code)
It looks like with the change, it now munges it with limit.. so now limit is applied to that value.. gfsh happens to always pass in a limit too, so there is possibly additional confusion

From the diff there is also one spot where a limit != -1 had not been added..  In CompactRangeIndex line 489:

      if (limit < indexThresholdSize) {
        limit = indexThresholdSize;
      }

This might be affecting the usage of limit at the index level?


From: Anilkumar Gingade <ag...@vmware.com>
Date: Friday, March 11, 2022 at 12:11 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about INDEX_THRESHOLD_SIZE
Mario,

There is similar test/example added by you in QueryWithRangeIndexDUnitTest. testQueryWithWildcardAndIndexOnAttributeFromHashMap()
When I run that test (on develop); I see the results as expected:
*************
Command result for <query --query="<trace> select e.key, e.value from /exampleRegion.entrySet e where e.value.positions['SUN'] like 'somethin%'">:
Result      : true
Limit       : 100
Rows        : 1
Query Trace : Query Executed in 85.1964 ms; indexesUsed(1):IdIndex(Results: 10000)
************
Are you running your test with any additional change like as you are saying :
>> I was working on allowing INDEX_THRESHOLD_SIZE System property to override CompiledValue.RESULT_LIMIT.

If so , you need to look at the change and see why its impacting that way.
If not, please let me know what change can be made in that test to reproduce the issue you are seeing; that will help to debug/analyze the issue.

-Anil.




On 3/11/22, 12:18 AM, "Mario Kevo" <ma...@est.tech> wrote:

    Hi,

    It works without an index but it doesn't work with an index.
    When I revert changes, it takes INDEX_THRESHOLD_SIZE default value(100). And if the entry that matches the condition is not in that resultset it will not be printed.
    Without index:
    gfsh>query --query="<trace>SELECT e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
    Result      : true
    Limit       : 100
    Rows        : 1
    Query Trace : Query Executed in 11.502283 ms; indexesUsed(0)

    key | value
    --- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    300 | {"ID":300,"indexKey":0,"pkid":"300","shortID":null,"position1":{"mktValue":1945.0,"secId":"ORCL","secIdIndexed":"ORCL","secType":null,"sharesOutstanding":1944000.0,"underlyer":null,"pid":1944,"portfolioId":300,..
    With index:
    gfsh>query --query="<trace>SELECT e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
    Result      : true
    Limit       : 100
    Rows        : 0
    Query Trace : Query Executed in 8.784831 ms; indexesUsed(1):index1(Results: 100)
    BR,
    Mario
    ________________________________
    Šalje: Anilkumar Gingade <ag...@vmware.com>
    Poslano: 10. ožujka 2022. 23:16
    Prima: dev@geode.apache.org <de...@geode.apache.org>
    Predmet: Re: Question about INDEX_THRESHOLD_SIZE

    Mario,

    There are few changes happened around this area as part of GEODE-9632 fix; can you please revert that change and see if the query works both with and without index.
    Looking at the code; it seems to restrict the number index look up that needs to be performed; certain latency/throughput sensitive queries that or not expecting exact result may use this (my guess) but by default it should not be resulting in unexpected results.

    -Anil.


    On 3/10/22, 6:50 AM, "Mario Kevo" <ma...@est.tech> wrote:

        Hi geode-dev,

        Some time ago I was working on allowing INDEX_THRESHOLD_SIZE System property to override CompiledValue.RESULT_LIMIT.
        After this change, adding this attribute will take into a count if you set it.
        But I need some clarification of this INDEX_THRESHOLD_SIZE attribute. Why is this set by default to 100?
        The main problem with this attribute is that if you want to get the correct result, you need to know how many entries will be in the region while starting servers and set it on that value or higher. Sometimes it is too hard to know how many entries will be in the region, so maybe better will be to set it by default to some higher number, something like Integer.MAX_VALUE.

        Where this attribute is used?
        It is used to get index results while doing queries.

        What is the problem?
        If we have INDEX_THRESHOLD_SIZE set to 500, and we have 1k entries it can happen that while doing a query it will get only 500 entries and where clause cannot be fulfilled and we got no results.
        Let's see it by an example!

        We have only one entry that matches the condition from the query, INDEX_THRESHOLD_SIZE set to 500, and 1k entries in the region.
        If we run the query without an index we got the result.
        gfsh>query --query="<trace>SELECT e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
        Result      : true
        Limit       : 100
        Rows        : 1
        Query Trace : Query Executed in 10.750238 ms; indexesUsed(0)

        key | value
        --- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        700 | {"ID":700,"indexKey":0,"pkid":"700","shortID":null,"position1":{"mktValue":1945.0,"secId":"ORCL","secIdIndexed":"ORCL","secType":null,"sharesOutstanding":1944000.0,"underlyer":null,"pid":1944,"portfolioId":700,..
        If we create an index and then run again this query there is no result.
        gfsh>query --query="<trace>SELECT e.key, e.value from /example-region.entrySet e where e.value.positions['SUN'] like 'someth%'"
        Result      : true
        Limit       : 100
        Rows        : 0
        Query Trace : Query Executed in 22.079016 ms; indexesUsed(1):index1(Results: 500)
        This happened because we have no luck getting that entry that matches the condition in the intermediate results for the index.
        So the questions are:
        What if more entries enter the region that will make the index return more entries than this threshold sets? Then we're again in jeopardy that the query condition will not match.
        Why is this attribute set by default to 100?
        Can we change it to the Integer.MAX_VALUE by default to be sure that we have the correct result? What are the consequences?

        BR,
        Mario


________________________________

⚠ External Email: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender.