Posted to user@geode.apache.org by Mageswaran Muthukumar <ma...@thoughtworks.com> on 2018/04/28 06:16:05 UTC

GEODE joining two big tables and getting aggregation - Suggestions

Hi,

I am new to Geode and am trying to implement a use case that aggregates
data from two regions.

One region holds data received from source system one; it is updated on a
daily basis, has around 2.7 million records, and is organized by market. I
am planning to set up this region partitioned by market across two
servers.

The other region holds data received from source system two; it has around
1 million records and is also partitioned by market.

The requirement is to compute an aggregate by joining both regions on a
key common to both, with a where clause:

Region1 Attributes

productId
marketId
price
quantity

Region2 Attributes

productId
marketId
eventId

select sum(r.price.multiply(r.quantity)) from Region1 r1, Region2 r2
where r1.productId = r2.productId and r2.eventId = '123'

Currently I am running the above query in a server-side function with only
50K records ingested into each region, yet it takes a very long time and I
can see a spike in CPU usage. I am already using PDX serialization and
have tried creating an index on productId in both regions.

Any suggestions to improve the performance of this query? Also, please
advise whether we should use a join here, or instead fetch Region2 records
by eventId, look up the matching Region1 entries, and do the aggregation
in a server function.

Thanks!
Mags

Re: GEODE joining two big tables and getting aggregation - Suggestions

Posted by Jason Huynh <jh...@pivotal.io>.
Great news!  I am still wondering how the query works when the alias r
isn't defined.  Is there a field named r in the object?  I'm probably
misreading something but that part of the query still confuses me.

Re: GEODE joining two big tables and getting aggregation - Suggestions

Posted by Mageswaran Muthukumar <ma...@thoughtworks.com>.
Hi Wes / Jason,

Thanks for getting back. This was sorted once we changed the index on the
productId column from a key index to a range index. We also queried both
tables with just the join, and did the computation of price * quantity and
the other columns in Java after fetching the results in the server
function. We now get the query back for 100K records in less than a
second.
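
For anyone following along, a range index like the one we switched to can
be created along these lines in gfsh (index and region names here are
examples; check the option names against your Geode version's docs):

```
gfsh> create index --name=productIdIdx1 --expression=productId --region=/Region1 --type=range
gfsh> create index --name=productIdIdx2 --expression=productId --region=/Region2 --type=range
```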

Regards,
Mags

Re: GEODE joining two big tables and getting aggregation - Suggestions

Posted by Jason Huynh <jh...@pivotal.io>.
Just to be sure, are the two regions colocated by productId?

Looking at the provided query, what is the r alias? Is it supposed to be
r1 or r2?
select sum(r.price.multiply(r.quantity)) from Region1 r1, Region2 r2
where r1.productId = r2.productId and r2.eventId = '123'

The index on productId for both regions is really important and probably
needs to be there for fastest execution.

If you continue to get this problem after fixing the alias, you can break
up the query as you described, by querying against r2.eventId and running a
separate query against r1 if that works out better.  You should be able to
do that in a function and target only local data by running the
query.execute(context) method and passing in the function context.

I would be interested in knowing if fixing the alias fixes the problem or
not.
Using r.price.multiply you are calling a method on the object.  I think
this causes a deserialization of the object.  I would look there
afterwards, if the change to the alias and index do not solve the issue.


Re: GEODE joining two big tables and getting aggregation - Suggestions

Posted by Real Wes <Th...@outlook.com>.
Based on your description the query should be fast, but obviously something is wrong, and from the facts you conveyed there is not enough information to determine the problem. It could be GCs, etc.; who knows. However, here is the general way I approach this scenario:

Put <trace> and <hint …> before the “select” in your query to guide it to use the index. It will tell you whether it used the index or not and exactly how long it took.
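
For example, with a hypothetical index named productIdIndex, the
directives go right before the select:

```
<trace> <hint 'productIdIndex'> select r1.productId, r1.price, r1.quantity
from /Region1 r1, /Region2 r2
where r1.productId = r2.productId and r2.eventId = '123'
```

The trace output in the server logs then reports the execution time and
which indexes, if any, were used.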

Make productId + marketId the key of Region1, separated by a “|”.
Colocate Region2 with Region1 based on productId + “|” + marketId. Consider a Region2 key of productId + “|” + marketId + “|” + counter.
Index productId.
That will work.
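
A plain-Java sketch of the composite-key idea, assuming String ids (the
separator choice and helper names are just illustrations; in Geode the
routing logic would live in a PartitionResolver):

```java
public class CompositeKeys {
    // Region1 key as suggested: productId + "|" + marketId.
    static String region1Key(String productId, String marketId) {
        return productId + "|" + marketId;
    }

    // Region2 key adds a counter so multiple events per product/market don't collide.
    static String region2Key(String productId, String marketId, long counter) {
        return productId + "|" + marketId + "|" + counter;
    }

    // For colocation, both regions must route on the same portion of the key,
    // e.g. the productId prefix before the first "|".
    static String routingPart(String compositeKey) {
        int bar = compositeKey.indexOf('|');
        return bar < 0 ? compositeKey : compositeKey.substring(0, bar);
    }

    public static void main(String[] args) {
        String k1 = region1Key("P1", "M7");
        String k2 = region2Key("P1", "M7", 42);
        // Entries for the same product resolve to the same routing value.
        System.out.println(routingPart(k1).equals(routingPart(k2))); // true
    }
}
```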

If you don’t want to colocate, consider making your smaller region2 a replicate region with Region1 partitioned.  Index on productId.

Wes Williams
