You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@geode.apache.org by Alberto Gomez <al...@est.tech> on 2021/02/11 16:28:05 UTC

Question about Map indexes

Hi,

We have observed that creating an index on a Map field causes the creation of an index entry for every entry created in the region containing the Map, no matter if the Map field contained the key used in the index.
Nevertheless, we would expect that only entries whose Map field contain the key used in the index would have the corresponding index entry. With this behavior, the memory consumed by the index could be much higher than needed depending on the percentage of entries whose Map field contained the key in the index.

---------------------------------------------------
Example:
We have a region with entries whose key type is a String and the value type is an object with a field called "field1" of Map type.

We expect to run queries on the region like the following:

SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

We create a Map index to speed up the above queries:

gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" --region="/example-region1 r"

We do the following puts:
- Put entry with key="key1" and with value=<Object whose field "field1" is a Map that contains the key 'mapkey1'>
- Put entry with key="key2" and with value=<Object whose field "field1" is a Map that does not contain the key 'mapkey1'>

The observation is that Geode creates two index entries for each entry. For the first entry, the internal indexKey is "key1" and for the second one, the internal indexKey is null.

These are the stats shown by gfsh after doing the above puts:

gfsh>list indexes --with-stats=yes
Member Name |                Member ID                |   Region Path    |   Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses | Updates | Update Time | Keys | Values
----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
server1     | 192.168.0.26(server1:1109606)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
server2     | 192.168.0.26(server2:1109695)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
---------------------------------------------------

Is there any reason why Geode would create an index entry for the second entry given that the Map field does not contain the key in the Map index?

I have created a draft pull request changing the behavior of Geode to not create the index entry when the Map field does not contain the key used in the index. Only two Unit test cases had to be adjusted. Please see: https://github.com/apache/geode/pull/6028

With this change and the same scenario as the one in the example, only one index entry is created. The stats shown by gfsh after the change are the following:

gfsh>list indexes --with-stats=yes
Member Name |                Member ID                |   Region Path    |   Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses | Updates | Update Time | Keys | Values
----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
server1     | 192.168.0.26(server1:1102192)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 0    | 0
server2     | 192.168.0.26(server2:1102279)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 1    | 1


Could someone tell if the current behavior is not correct or if I am missing something and with the change I am proposing something else will stop working?

Thanks in advance,

/Alberto G.

Re: Question about Map indexes

Posted by Alberto Gomez <al...@est.tech>.

Hi again,

After investigating a bit more how Map indexes work, I have seen that they are also used to support queries that use "!=" as Jason pointed out.

I would have expected that queries using != or NOT would not make use of indexes as it is the common practice in databases ([1]) and also as the Geode documentation seems to suggest ([2]):

"Indexes are not used in expressions that contain NOT, so in a WHERE clause of a query, qty >= 10 could have an index on qty applied for efficiency. However, NOT(qty < 10) could not have the same index applied."

Could somebody please confirm or deny if what the documentation states above is true or false and also if the conclusion can also be extended to the use of the != operator?

I also think that the documentation about indexes could be improved at least in two areas:

  *   Information about range indexes. While there is a section for the deprecated Hash Indexes, there is no specific section for Range Indexes.
  *   Information about Map indexes. The information about these indexes lacks a bit of detail. For example, how does the index work when the entry does not contain the Map field for which there is an index? How does it behave when the Map field does not have the key in the index? How does it behave when the key is null or when the value is null?

Does anyone have plans to extend the information about indexes in Geode?

Thanks,

Alberto G.

[1] https://stackoverflow.com/questions/1759476/database-index-not-used-if-the-where-criteria-is
[2] https://geode.apache.org/docs/guide/19/developing/query_index/indexing_guidelines.html


________________________________
From: Alberto Gomez <al...@est.tech>
Sent: Saturday, February 13, 2021 5:40 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about Map indexes

Jason, thanks for the help.

I added a new commit to the pull request that solves the issue without (apparently) breaking anything.

The problem was that when adding an index entry we need to distinguish between the case where the Map does not contain the key from the case where the Map contains the key but the value for the key is null. If we use Map.get() we get in both cases null but we should only add the index entry in the latter case (when the map contains the key but the value corresponding to it is null).

I am not particularly proud of the solution because I use of an arbitrary exception to be able to distinguish both cases. Anyway, could you please check if we are in the right direction?

Thanks,

Alberto


________________________________
From: Jason Huynh <jh...@vmware.com>
Sent: Thursday, February 11, 2021 10:57 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about Map indexes

Hi Alberto,

I haven't checked the PR yet, just read through the email.  The first thought that comes to mind is when someone does a != query.  The index still has to supply the correct answer to the query (all entries with null or undefined values possibly)

I'll try to think of other cases where it might matter.  There may be other ways to execute the query but it would probably take a bit of reworking.. (ill check your pr to see if this is already addressed.   Sorry if it is!)

-Jason

On 2/11/21, 8:28 AM, "Alberto Gomez" <al...@est.tech> wrote:

    Hi,

    We have observed that creating an index on a Map field causes the creation of an index entry for every entry created in the region containing the Map, no matter if the Map field contained the key used in the index.
    Nevertheless, we would expect that only entries whose Map field contain the key used in the index would have the corresponding index entry. With this behavior, the memory consumed by the index could be much higher than needed depending on the percentage of entries whose Map field contained the key in the index.

    ---------------------------------------------------
    Example:
    We have a region with entries whose key type is a String and the value type is an object with a field called "field1" of Map type.

    We expect to run queries on the region like the following:

    SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

    We create a Map index to speed up the above queries:

    gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" --region="/example-region1 r"

    We do the following puts:
    - Put entry with key="key1" and with value=<Object whose field "field1" is a Map that contains the key 'mapkey1'>
    - Put entry with key="key2" and with value=<Object whose field "field1" is a Map that does not contain the key 'mapkey1'>

    The observation is that Geode creates two index entries for each entry. For the first entry, the internal indexKey is "key1" and for the second one, the internal indexKey is null.

    These are the stats shown by gfsh after doing the above puts:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |   Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses | Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1109606)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
    server2     | 192.168.0.26(server2:1109695)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
    ---------------------------------------------------

    Is there any reason why Geode would create an index entry for the second entry given that the Map field does not contain the key in the Map index?

    I have created a draft pull request changing the behavior of Geode to not create the index entry when the Map field does not contain the key used in the index. Only two Unit test cases had to be adjusted. Please see: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6028&amp;data=04%7C01%7Cjhuynh%40vmware.com%7C0957cc0ef91b4b23116408d8ceaa0a8d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637486577011301177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2WDUj6NPEnfX3AXH72MTZYx%2FbXMPJQlVZeKq7KsJDTw%3D&amp;reserved=0

    With this change and the same scenario as the one in the example, only one index entry is created. The stats shown by gfsh after the change are the following:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |   Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses | Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1102192)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 0    | 0
    server2     | 192.168.0.26(server2:1102279)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 1    | 1


    Could someone tell if the current behavior is not correct or if I am missing something and with the change I am proposing something else will stop working?

    Thanks in advance,

    /Alberto G.

Re: Question about Map indexes

Posted by Alberto Gomez <al...@est.tech>.

Jason, thanks for the help.

I added a new commit to the pull request that solves the issue without (apparently) breaking anything.

The problem was that when adding an index entry we need to distinguish between the case where the Map does not contain the key from the case where the Map contains the key but the value for the key is null. If we use Map.get() we get in both cases null but we should only add the index entry in the latter case (when the map contains the key but the value corresponding to it is null).

I am not particularly proud of the solution because I use of an arbitrary exception to be able to distinguish both cases. Anyway, could you please check if we are in the right direction?

Thanks,

Alberto


________________________________
From: Jason Huynh <jh...@vmware.com>
Sent: Thursday, February 11, 2021 10:57 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: Question about Map indexes

Hi Alberto,

I haven't checked the PR yet, just read through the email.  The first thought that comes to mind is when someone does a != query.  The index still has to supply the correct answer to the query (all entries with null or undefined values possibly)

I'll try to think of other cases where it might matter.  There may be other ways to execute the query but it would probably take a bit of reworking.. (ill check your pr to see if this is already addressed.   Sorry if it is!)

-Jason

On 2/11/21, 8:28 AM, "Alberto Gomez" <al...@est.tech> wrote:

    Hi,

    We have observed that creating an index on a Map field causes the creation of an index entry for every entry created in the region containing the Map, no matter if the Map field contained the key used in the index.
    Nevertheless, we would expect that only entries whose Map field contain the key used in the index would have the corresponding index entry. With this behavior, the memory consumed by the index could be much higher than needed depending on the percentage of entries whose Map field contained the key in the index.

    ---------------------------------------------------
    Example:
    We have a region with entries whose key type is a String and the value type is an object with a field called "field1" of Map type.

    We expect to run queries on the region like the following:

    SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

    We create a Map index to speed up the above queries:

    gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" --region="/example-region1 r"

    We do the following puts:
    - Put entry with key="key1" and with value=<Object whose field "field1" is a Map that contains the key 'mapkey1'>
    - Put entry with key="key2" and with value=<Object whose field "field1" is a Map that does not contain the key 'mapkey1'>

    The observation is that Geode creates two index entries for each entry. For the first entry, the internal indexKey is "key1" and for the second one, the internal indexKey is null.

    These are the stats shown by gfsh after doing the above puts:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |   Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses | Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1109606)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
    server2     | 192.168.0.26(server2:1109695)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
    ---------------------------------------------------

    Is there any reason why Geode would create an index entry for the second entry given that the Map field does not contain the key in the Map index?

    I have created a draft pull request changing the behavior of Geode to not create the index entry when the Map field does not contain the key used in the index. Only two Unit test cases had to be adjusted. Please see: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6028&amp;data=04%7C01%7Cjhuynh%40vmware.com%7C0957cc0ef91b4b23116408d8ceaa0a8d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637486577011301177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2WDUj6NPEnfX3AXH72MTZYx%2FbXMPJQlVZeKq7KsJDTw%3D&amp;reserved=0

    With this change and the same scenario as the one in the example, only one index entry is created. The stats shown by gfsh after the change are the following:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |   Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses | Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1102192)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 0    | 0
    server2     | 192.168.0.26(server2:1102279)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 1    | 1


    Could someone tell if the current behavior is not correct or if I am missing something and with the change I am proposing something else will stop working?

    Thanks in advance,

    /Alberto G.

Re: Question about Map indexes

Posted by Jason Huynh <jh...@vmware.com>.

Hi Alberto,

I haven't checked the PR yet, just read through the email.  The first thought that comes to mind is when someone does a != query.  The index still has to supply the correct answer to the query (all entries with null or undefined values possibly)

I'll try to think of other cases where it might matter.  There may be other ways to execute the query but it would probably take a bit of reworking.. (ill check your pr to see if this is already addressed.   Sorry if it is!)

-Jason

On 2/11/21, 8:28 AM, "Alberto Gomez" <al...@est.tech> wrote:

    Hi,

    We have observed that creating an index on a Map field causes the creation of an index entry for every entry created in the region containing the Map, no matter if the Map field contained the key used in the index.
    Nevertheless, we would expect that only entries whose Map field contain the key used in the index would have the corresponding index entry. With this behavior, the memory consumed by the index could be much higher than needed depending on the percentage of entries whose Map field contained the key in the index.

    ---------------------------------------------------
    Example:
    We have a region with entries whose key type is a String and the value type is an object with a field called "field1" of Map type.

    We expect to run queries on the region like the following:

    SELECT * from /example-region1 p WHERE p.field1['mapkey1']=$1"

    We create a Map index to speed up the above queries:

    gfsh> create index --name=myIndex --expression="r.field1['mapkey1']" --region="/example-region1 r"

    We do the following puts:
    - Put entry with key="key1" and with value=<Object whose field "field1" is a Map that contains the key 'mapkey1'>
    - Put entry with key="key2" and with value=<Object whose field "field1" is a Map that does not contain the key 'mapkey1'>

    The observation is that Geode creates two index entries for each entry. For the first entry, the internal indexKey is "key1" and for the second one, the internal indexKey is null.

    These are the stats shown by gfsh after doing the above puts:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |   Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses | Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1109606)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
    server2     | 192.168.0.26(server2:1109695)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 1    | 1       | 0           | 1    | 1
    ---------------------------------------------------

    Is there any reason why Geode would create an index entry for the second entry given that the Map field does not contain the key in the Map index?

    I have created a draft pull request changing the behavior of Geode to not create the index entry when the Map field does not contain the key used in the index. Only two Unit test cases had to be adjusted. Please see: https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6028&amp;data=04%7C01%7Cjhuynh%40vmware.com%7C0957cc0ef91b4b23116408d8ceaa0a8d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637486577011301177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2WDUj6NPEnfX3AXH72MTZYx%2FbXMPJQlVZeKq7KsJDTw%3D&amp;reserved=0

    With this change and the same scenario as the one in the example, only one index entry is created. The stats shown by gfsh after the change are the following:

    gfsh>list indexes --with-stats=yes
    Member Name |                Member ID                |   Region Path    |   Name   | Type  | Indexed Expression  |    From Clause     | Valid Index | Uses | Updates | Update Time | Keys | Values
    ----------- | --------------------------------------- | ---------------- | -------- | ----- | --------------------------------- | ------------------ | ----------- | ---- | ------- | ----------- | ---- | ------
    server1     | 192.168.0.26(server1:1102192)<v1>:41000 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 0    | 0
    server2     | 192.168.0.26(server2:1102279)<v2>:41001 | /example-region1 | mapIndex | RANGE | r.field1['mapkey1'] | /example-region1 r | true        | 2    | 1       | 0           | 1    | 1


    Could someone tell if the current behavior is not correct or if I am missing something and with the change I am proposing something else will stop working?

    Thanks in advance,

    /Alberto G.