You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@ignite.apache.org by eugene miretsky <eu...@gmail.com> on 2018/08/22 00:18:48 UTC

Slow SQL query uses only a single CPU

Hi,

We have a cache called GAL3EC1, it has

   1. A composite pKey consisting of customer_id and date
   2. An Index on the date column
   3. 300 sparse columns

We are running a single EC2 4x8xlarge node.

The following query takes 8min to finish
Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
'2018-05-12' GROUP BY customer_id having
SUM(ec1_bknt_total_product_views_app) > 2 AND
MAX(ec1_hnk_total_product_clicks_app) < 1)

I have a few questions:

   1. 'top' command shows %100 cpu utilization (i.e only one of the 32 CPUs
   is used). How can I get the query to use all 32 CPUs? I have tried setting
   Query Parallelism to 32, but it didn't help,
   2. Adding the index on date column seems to have slowed down the query.
   The 8min time from above was without the index, with the index the query
   doesn't finish (I gave up after 30min). A similar query on a smaller date
   range showed a 10x slow down with the index. Why?
   3. Our loads from Spark are very slow as well, and also seem to not use
   the system resource properly, can that be related?
   4. What are some good tools and techniques to troubleshoot these
   problems in Ignite?


All the relevant info is attached (configs, cache stats, node stats, etc.).

Cheers,
Eugene

Re: Slow SQL query uses only a single CPU

Posted by aealexsandrov <ae...@gmail.com>.
Hi,

I think you can try to investigate the articles from the next wiki:

https://cwiki.apache.org/confluence/display/IGNITE/Design+Documents

Next blog contains the interesting information (possible some will be out of
date):

http://gridgain.blogspot.com

It contains a lot of information about how Ignite works under the hood. 

According to indexes and h2. Ignite used the h2 for query parsing, execution
planning, and indexing but looks like there is no detailed documentation
about it. Only official information:

https://apacheignite-sql.readme.io/docs/how-ignite-sql-works

BR,
Andrei



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Slow SQL query uses only a single CPU

Posted by eugene miretsky <eu...@gmail.com>.
Thanks,

I tried the composite key and provided the index tip in the query but the
query plan and execution time stayed the same. So really I am back to where
we started.

Right now I suspect that I am not setting the affinity key properly - I
intend the key to be customer_id, but maybe Ignite sets it to the _key
(composite of customer_id, dt). Is there a way to check it?

Also, is there documentation of how exactly data is stored, and how SQL
queries are performed? Where is the data stored (off-heap?)? When is the
data loaded to heap?  How do the ignite pages/segments get to H2?
Are Ignite and H2 indexes the same thing?

Cheers,
Eugene

On Wed, Aug 22, 2018 at 10:36 AM, Andrey Mashenkov <
andrey.mashenkov@gmail.com> wrote:

> 1. /* PUBLIC.AFFINITY_KEY */ means index on affinity column is used. Full
> index will be scanned against date condition.
> As I wrote you can create composite index to speedup index scan.
> 2. "group sorted" means index is used for grouping. Looks like H2 have
> optimization for this and grouping can applied on fly.
> Unsorted grouping would means that we have to fetch full dataset and only
> then grouping.
>
> On Wed, Aug 22, 2018 at 5:21 PM eugene miretsky <eu...@gmail.com>
> wrote:
>
>> Just as a reference, bellow are 2 execution plans with and without the
>> index on a very similar table.
>>
>> Adding the index remove /* PUBLIC.AFFINITY_KEY */ and /* group sorted
>> */.
>> 1) Does PUBLIC.AFFINITY_KEY mean that DT is the affinity key. We are
>> setting customer_id as an affinity key. Is there a way to verify that?
>> 2) Is it possible that the removal of /* group sorted */ indicates that
>> the result of group_by must be sorted? (hence taking a long time)
>>
>> *Query*
>> Select COUNT (*) FROM (SELECT customer_id FROM GAL2RU where dt >
>> '2018-06-12' GROUP BY customer_id having SUM(ru_total_app_sessions_count)
>> > 2 AND MAX(ru_total_web_sessions_count) < 1)
>>
>> *Without an index*
>>
>> SELECT
>>
>> __Z0.CUSTOMER_ID AS __C0_0,
>>
>> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
>>
>> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
>>
>> FROM PUBLIC.GAL2RU __Z0
>>
>> /* PUBLIC.AFFINITY_KEY */
>>
>> WHERE __Z0.DT > '2018-06-12'
>>
>> GROUP BY __Z0.CUSTOMER_ID
>>
>> /* group sorted */
>>
>>
>> SELECT
>>
>> COUNT(*)
>>
>> FROM (
>>
>> SELECT
>>
>> __C0_0 AS CUSTOMER_ID
>>
>> FROM PUBLIC.__T0
>>
>> GROUP BY __C0_0
>>
>> HAVING (SUM(__C0_1) > 2)
>>
>> AND (MAX(__C0_2) < 1)
>>
>>
>> *With an index*
>>
>> SELECT
>>
>> __Z0.CUSTOMER_ID AS __C0_0,
>>
>> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
>>
>> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
>>
>> FROM PUBLIC.GAL2RU __Z0
>>
>> /* PUBLIC.DT_IDX2: DT > '2018-06-12' */
>>
>> WHERE __Z0.DT > '2018-06-12'
>>
>> GROUP BY __Z0.CUSTOMER_ID
>>
>>
>> SELECT
>>
>> COUNT(*)
>>
>> FROM (
>>
>> SELECT
>>
>> __C0_0 AS CUSTOMER_ID
>>
>> FROM PUBLIC.__T0
>>
>> GROUP BY __C0_0
>>
>> HAVING (SUM(__C0_1) > 2)
>>
>> AND (MAX(__C0_2) < 1)
>>
>> ) _0__Z1
>>
>>
>> On Wed, Aug 22, 2018 at 9:43 AM, eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Thanks Andrey,
>>>
>>> We are using the Ignite notebook, any idea if there is a way to provide
>>> these flags and hints directly from SQL?
>>>
>>> From your description, it seems like the query is executed in the
>>> following order
>>> 1) Group by customer_id
>>> 2) For each group, perform the filtering on date using the index and
>>> aggregates
>>>
>>> My impressions was that the order is
>>> 1)  On each node, filter rows by date (using the index)
>>> 2)  On each node, group by the remaining rows by customer id, and then
>>> perform the aggrate
>>>
>>> That's why we created the index on the dt field, as opposed to
>>> customer_id field.
>>>
>>> Cheers,
>>> Eugene
>>>
>>>
>>> On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <
>>> andrey.mashenkov@gmail.com> wrote:
>>>
>>>> Eugene,
>>>>
>>>> 1. Note that queryParallelism splits indices and Ignite work similar
>>>> way as if index data resides on several nodes. These index part can be
>>>> looked up in parallel threads.
>>>> 2. It is not a simple query as you data distributed among partitions
>>>> and is not collocated and aggregate function are used.
>>>> HAVING clause here is a reason, Ignite can apply it on reduce phase
>>>> only as HAVING requires aggregate value from all index parts.
>>>> 3. If you data already collocated on customer_id then you can hit
>>>> Ignite with set SqlFieldsQuery.setCollocated(true). This should force
>>>> Ignite to optimize grouping and push down aggregates to map phase.
>>>> 4. In query plan you attached you can see H2 uses DT_IDX
>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>> It is not effective. With this index H2 have to process all data to
>>>> calculate aggregate for group. Index on affinity field may be more
>>>> effective as data can be processed group by group.
>>>> once all group data is process then result can be passed to reducer.
>>>> Hope, H2 is smart enough to do such streaming.
>>>>
>>>> Also, you can try to use composite index on (customer_id, date)
>>>> columns. Most likely. hint will needed [2].
>>>>
>>>> See also about collocated flag [1] and Hits [2]
>>>>
>>>> [1] https://ignite.apache.org/releases/latest/javadoc/org/
>>>> apache/ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
>>>> [2] https://apacheignite.readme.io/v2.0/docs/sql-performance-
>>>> and-debugging#index-hints
>>>>
>>>>
>>>> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Thanks Andrey,
>>>>>
>>>>> Right now we are testing with only one big node, so the reduce step
>>>>> should not take any time.
>>>>>
>>>>> 1) We already set parallelism to 32, and I can still see only 1 core
>>>>> working. Anything else could be preventing multiple cores from working on
>>>>> the job?
>>>>> 2) Why would the reduce phase need to look at all the data? It seems
>>>>> like a fairly simple query
>>>>> 3) We are already collocating data  by customer_id (though as I
>>>>> mentioned, right now there is only one node)
>>>>> 4) We already using collocation and tried using an index, and other
>>>>> advice? Is there a way to check what Ignite is actually doing? How are
>>>>> indexs used (by Ignite or H2)?
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
>>>>> andrey.mashenkov@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> 1. Possible there are too much data should be looked for the query.
>>>>>> With single node and parallelism=1 query will always run in single thread.
>>>>>>  You can try to add more nodes or increase query parallelism to
>>>>>> utilize more CPU cores.
>>>>>>
>>>>>> 2. Index on date field may be not effective as reduce phase should
>>>>>> look all the data for further grouping.
>>>>>> Try add index on customer_id or use collocation in customer_id
>>>>>> (usually more preferable way).
>>>>>>
>>>>>> Also it is possible the bottleneck is the reduce phase.
>>>>>> Is it possible to collocate data by group by column  (customer_id)?
>>>>>> This collocation will allow you use collocated flag [1] and Ignite will use
>>>>>> more optimal plan.
>>>>>>
>>>>>> 4. The main techniques is trying to reduce amount to data to be
>>>>>> looked up on every phase with using data collocation and indices
>>>>>> Ignite provide 2 plans for distributed queries: map and reduce. You
>>>>>> can analyse and check these queries separately to understand how much data
>>>>>> are processed on map phase and on reduce.
>>>>>> Map query process node local data (until distributed joins on), while
>>>>>> reduce fetch data from remote node that may costs. .
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <
>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>
>>>>>>> Here is the result of EXPLAIN for the afermantioned query:
>>>>>>>
>>>>>>> SELECT
>>>>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>>>>> WHERE __Z0.DT > '2018-05-12'
>>>>>>> GROUP BY __Z0.CUSTOMER_ID
>>>>>>> SELECT
>>>>>>> COUNT(*)
>>>>>>> FROM (
>>>>>>> SELECT
>>>>>>> __C0_0 AS CUSTOMER_ID
>>>>>>> FROM PUBLIC.__T0
>>>>>>> GROUP BY __C0_0
>>>>>>> HAVING (SUM(__C0_1) > 2)
>>>>>>> AND (MAX(__C0_2) < 1)
>>>>>>> ) _0__Z1
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have a cache called GAL3EC1, it has
>>>>>>>>
>>>>>>>>    1. A composite pKey consisting of customer_id and date
>>>>>>>>    2. An Index on the date column
>>>>>>>>    3. 300 sparse columns
>>>>>>>>
>>>>>>>> We are running a single EC2 4x8xlarge node.
>>>>>>>>
>>>>>>>> The following query takes 8min to finish
>>>>>>>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
>>>>>>>> '2018-05-12' GROUP BY customer_id having SUM(ec1_bknt_total_product_views_app)
>>>>>>>> > 2 AND MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>>>>
>>>>>>>> I have a few questions:
>>>>>>>>
>>>>>>>>    1. 'top' command shows %100 cpu utilization (i.e only one of
>>>>>>>>    the 32 CPUs is used). How can I get the query to use all 32 CPUs? I have
>>>>>>>>    tried setting Query Parallelism to 32, but it didn't help,
>>>>>>>>    2. Adding the index on date column seems to have slowed down
>>>>>>>>    the query. The 8min time from above was without the index, with the index
>>>>>>>>    the query doesn't finish (I gave up after 30min). A similar query on a
>>>>>>>>    smaller date range showed a 10x slow down with the index. Why?
>>>>>>>>    3. Our loads from Spark are very slow as well, and also seem to
>>>>>>>>    not use the system resource properly, can that be related?
>>>>>>>>    4. What are some good tools and techniques to troubleshoot
>>>>>>>>    these problems in Ignite?
>>>>>>>>
>>>>>>>>
>>>>>>>> All the relevant info is attached (configs, cache stats, node
>>>>>>>> stats, etc.).
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Eugene
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Andrey V. Mashenkov
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Andrey V. Mashenkov
>>>>
>>>
>>>
>>
>
> --
> Best regards,
> Andrey V. Mashenkov
>

Re: Slow SQL query uses only a single CPU

Posted by Andrey Mashenkov <an...@gmail.com>.
1. /* PUBLIC.AFFINITY_KEY */ means index on affinity column is used. Full
index will be scanned against date condition.
As I wrote you can create composite index to speedup index scan.
2. "group sorted" means index is used for grouping. Looks like H2 have
optimization for this and grouping can applied on fly.
Unsorted grouping would means that we have to fetch full dataset and only
then grouping.

On Wed, Aug 22, 2018 at 5:21 PM eugene miretsky <eu...@gmail.com>
wrote:

> Just as a reference, bellow are 2 execution plans with and without the
> index on a very similar table.
>
> Adding the index remove /* PUBLIC.AFFINITY_KEY */ and /* group sorted */.
> 1) Does PUBLIC.AFFINITY_KEY mean that DT is the affinity key. We are
> setting customer_id as an affinity key. Is there a way to verify that?
> 2) Is it possible that the removal of /* group sorted */ indicates that
> the result of group_by must be sorted? (hence taking a long time)
>
> *Query*
> Select COUNT (*) FROM (SELECT customer_id FROM GAL2RU where dt >
> '2018-06-12' GROUP BY customer_id having SUM(ru_total_app_sessions_count) >
> 2 AND MAX(ru_total_web_sessions_count) < 1)
>
> *Without an index*
>
> SELECT
>
> __Z0.CUSTOMER_ID AS __C0_0,
>
> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
>
> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
>
> FROM PUBLIC.GAL2RU __Z0
>
> /* PUBLIC.AFFINITY_KEY */
>
> WHERE __Z0.DT > '2018-06-12'
>
> GROUP BY __Z0.CUSTOMER_ID
>
> /* group sorted */
>
>
> SELECT
>
> COUNT(*)
>
> FROM (
>
> SELECT
>
> __C0_0 AS CUSTOMER_ID
>
> FROM PUBLIC.__T0
>
> GROUP BY __C0_0
>
> HAVING (SUM(__C0_1) > 2)
>
> AND (MAX(__C0_2) < 1)
>
>
> *With an index*
>
> SELECT
>
> __Z0.CUSTOMER_ID AS __C0_0,
>
> SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,
>
> MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2
>
> FROM PUBLIC.GAL2RU __Z0
>
> /* PUBLIC.DT_IDX2: DT > '2018-06-12' */
>
> WHERE __Z0.DT > '2018-06-12'
>
> GROUP BY __Z0.CUSTOMER_ID
>
>
> SELECT
>
> COUNT(*)
>
> FROM (
>
> SELECT
>
> __C0_0 AS CUSTOMER_ID
>
> FROM PUBLIC.__T0
>
> GROUP BY __C0_0
>
> HAVING (SUM(__C0_1) > 2)
>
> AND (MAX(__C0_2) < 1)
>
> ) _0__Z1
>
>
> On Wed, Aug 22, 2018 at 9:43 AM, eugene miretsky <
> eugene.miretsky@gmail.com> wrote:
>
>> Thanks Andrey,
>>
>> We are using the Ignite notebook, any idea if there is a way to provide
>> these flags and hints directly from SQL?
>>
>> From your description, it seems like the query is executed in the
>> following order
>> 1) Group by customer_id
>> 2) For each group, perform the filtering on date using the index and
>> aggregates
>>
>> My impressions was that the order is
>> 1)  On each node, filter rows by date (using the index)
>> 2)  On each node, group by the remaining rows by customer id, and then
>> perform the aggrate
>>
>> That's why we created the index on the dt field, as opposed to
>> customer_id field.
>>
>> Cheers,
>> Eugene
>>
>>
>> On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <
>> andrey.mashenkov@gmail.com> wrote:
>>
>>> Eugene,
>>>
>>> 1. Note that queryParallelism splits indices and Ignite work similar way
>>> as if index data resides on several nodes. These index part can be looked
>>> up in parallel threads.
>>> 2. It is not a simple query as you data distributed among partitions and
>>> is not collocated and aggregate function are used.
>>> HAVING clause here is a reason, Ignite can apply it on reduce phase only
>>> as HAVING requires aggregate value from all index parts.
>>> 3. If you data already collocated on customer_id then you can hit Ignite
>>> with set SqlFieldsQuery.setCollocated(true). This should force Ignite to
>>> optimize grouping and push down aggregates to map phase.
>>> 4. In query plan you attached you can see H2 uses DT_IDX
>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>> It is not effective. With this index H2 have to process all data to
>>> calculate aggregate for group. Index on affinity field may be more
>>> effective as data can be processed group by group.
>>> once all group data is process then result can be passed to reducer.
>>> Hope, H2 is smart enough to do such streaming.
>>>
>>> Also, you can try to use composite index on (customer_id, date) columns.
>>> Most likely. hint will needed [2].
>>>
>>> See also about collocated flag [1] and Hits [2]
>>>
>>> [1]
>>> https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
>>> [2]
>>> https://apacheignite.readme.io/v2.0/docs/sql-performance-and-debugging#index-hints
>>>
>>>
>>> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Thanks Andrey,
>>>>
>>>> Right now we are testing with only one big node, so the reduce step
>>>> should not take any time.
>>>>
>>>> 1) We already set parallelism to 32, and I can still see only 1 core
>>>> working. Anything else could be preventing multiple cores from working on
>>>> the job?
>>>> 2) Why would the reduce phase need to look at all the data? It seems
>>>> like a fairly simple query
>>>> 3) We are already collocating data  by customer_id (though as I
>>>> mentioned, right now there is only one node)
>>>> 4) We already using collocation and tried using an index, and other
>>>> advice? Is there a way to check what Ignite is actually doing? How are
>>>> indexs used (by Ignite or H2)?
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
>>>> andrey.mashenkov@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> 1. Possible there are too much data should be looked for the query.
>>>>> With single node and parallelism=1 query will always run in single thread.
>>>>>  You can try to add more nodes or increase query parallelism to
>>>>> utilize more CPU cores.
>>>>>
>>>>> 2. Index on date field may be not effective as reduce phase should
>>>>> look all the data for further grouping.
>>>>> Try add index on customer_id or use collocation in customer_id
>>>>> (usually more preferable way).
>>>>>
>>>>> Also it is possible the bottleneck is the reduce phase.
>>>>> Is it possible to collocate data by group by column  (customer_id)?
>>>>> This collocation will allow you use collocated flag [1] and Ignite will use
>>>>> more optimal plan.
>>>>>
>>>>> 4. The main techniques is trying to reduce amount to data to be looked
>>>>> up on every phase with using data collocation and indices
>>>>> Ignite provide 2 plans for distributed queries: map and reduce. You
>>>>> can analyse and check these queries separately to understand how much data
>>>>> are processed on map phase and on reduce.
>>>>> Map query process node local data (until distributed joins on), while
>>>>> reduce fetch data from remote node that may costs. .
>>>>>
>>>>>
>>>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <
>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>
>>>>>> Here is the result of EXPLAIN for the afermantioned query:
>>>>>>
>>>>>> SELECT
>>>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>>>> WHERE __Z0.DT > '2018-05-12'
>>>>>> GROUP BY __Z0.CUSTOMER_ID
>>>>>> SELECT
>>>>>> COUNT(*)
>>>>>> FROM (
>>>>>> SELECT
>>>>>> __C0_0 AS CUSTOMER_ID
>>>>>> FROM PUBLIC.__T0
>>>>>> GROUP BY __C0_0
>>>>>> HAVING (SUM(__C0_1) > 2)
>>>>>> AND (MAX(__C0_2) < 1)
>>>>>> ) _0__Z1
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have a cache called GAL3EC1, it has
>>>>>>>
>>>>>>>    1. A composite pKey consisting of customer_id and date
>>>>>>>    2. An Index on the date column
>>>>>>>    3. 300 sparse columns
>>>>>>>
>>>>>>> We are running a single EC2 4x8xlarge node.
>>>>>>>
>>>>>>> The following query takes 8min to finish
>>>>>>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
>>>>>>> '2018-05-12' GROUP BY customer_id having
>>>>>>> SUM(ec1_bknt_total_product_views_app) > 2 AND
>>>>>>> MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>>>
>>>>>>> I have a few questions:
>>>>>>>
>>>>>>>    1. 'top' command shows %100 cpu utilization (i.e only one of the
>>>>>>>    32 CPUs is used). How can I get the query to use all 32 CPUs? I have tried
>>>>>>>    setting Query Parallelism to 32, but it didn't help,
>>>>>>>    2. Adding the index on date column seems to have slowed down the
>>>>>>>    query. The 8min time from above was without the index, with the index the
>>>>>>>    query doesn't finish (I gave up after 30min). A similar query on a
>>>>>>>    smaller date range showed a 10x slow down with the index. Why?
>>>>>>>    3. Our loads from Spark are very slow as well, and also seem to
>>>>>>>    not use the system resource properly, can that be related?
>>>>>>>    4. What are some good tools and techniques to troubleshoot these
>>>>>>>    problems in Ignite?
>>>>>>>
>>>>>>>
>>>>>>> All the relevant info is attached (configs, cache stats, node stats,
>>>>>>> etc.).
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Andrey V. Mashenkov
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Best regards,
>>> Andrey V. Mashenkov
>>>
>>
>>
>

-- 
Best regards,
Andrey V. Mashenkov

Re: Slow SQL query uses only a single CPU

Posted by eugene miretsky <eu...@gmail.com>.
Just as a reference, bellow are 2 execution plans with and without the
index on a very similar table.

Adding the index remove /* PUBLIC.AFFINITY_KEY */ and /* group sorted */.
1) Does PUBLIC.AFFINITY_KEY mean that DT is the affinity key. We are
setting customer_id as an affinity key. Is there a way to verify that?
2) Is it possible that the removal of /* group sorted */ indicates that the
result of group_by must be sorted? (hence taking a long time)

*Query*
Select COUNT (*) FROM (SELECT customer_id FROM GAL2RU where dt >
'2018-06-12' GROUP BY customer_id having SUM(ru_total_app_sessions_count) >
2 AND MAX(ru_total_web_sessions_count) < 1)

*Without an index*

SELECT

__Z0.CUSTOMER_ID AS __C0_0,

SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,

MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2

FROM PUBLIC.GAL2RU __Z0

/* PUBLIC.AFFINITY_KEY */

WHERE __Z0.DT > '2018-06-12'

GROUP BY __Z0.CUSTOMER_ID

/* group sorted */


SELECT

COUNT(*)

FROM (

SELECT

__C0_0 AS CUSTOMER_ID

FROM PUBLIC.__T0

GROUP BY __C0_0

HAVING (SUM(__C0_1) > 2)

AND (MAX(__C0_2) < 1)


*With an index*

SELECT

__Z0.CUSTOMER_ID AS __C0_0,

SUM(__Z0.RU_TOTAL_APP_SESSIONS_COUNT) AS __C0_1,

MAX(__Z0.RU_TOTAL_WEB_SESSIONS_COUNT) AS __C0_2

FROM PUBLIC.GAL2RU __Z0

/* PUBLIC.DT_IDX2: DT > '2018-06-12' */

WHERE __Z0.DT > '2018-06-12'

GROUP BY __Z0.CUSTOMER_ID


SELECT

COUNT(*)

FROM (

SELECT

__C0_0 AS CUSTOMER_ID

FROM PUBLIC.__T0

GROUP BY __C0_0

HAVING (SUM(__C0_1) > 2)

AND (MAX(__C0_2) < 1)

) _0__Z1


On Wed, Aug 22, 2018 at 9:43 AM, eugene miretsky <eu...@gmail.com>
wrote:

> Thanks Andrey,
>
> We are using the Ignite notebook, any idea if there is a way to provide
> these flags and hints directly from SQL?
>
> From your description, it seems like the query is executed in the
> following order
> 1) Group by customer_id
> 2) For each group, perform the filtering on date using the index and
> aggregates
>
> My impressions was that the order is
> 1)  On each node, filter rows by date (using the index)
> 2)  On each node, group by the remaining rows by customer id, and then
> perform the aggrate
>
> That's why we created the index on the dt field, as opposed to customer_id
> field.
>
> Cheers,
> Eugene
>
>
> On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <
> andrey.mashenkov@gmail.com> wrote:
>
>> Eugene,
>>
>> 1. Note that queryParallelism splits indices and Ignite work similar way
>> as if index data resides on several nodes. These index part can be looked
>> up in parallel threads.
>> 2. It is not a simple query as you data distributed among partitions and
>> is not collocated and aggregate function are used.
>> HAVING clause here is a reason, Ignite can apply it on reduce phase only
>> as HAVING requires aggregate value from all index parts.
>> 3. If you data already collocated on customer_id then you can hit Ignite
>> with set SqlFieldsQuery.setCollocated(true). This should force Ignite to
>> optimize grouping and push down aggregates to map phase.
>> 4. In query plan you attached you can see H2 uses DT_IDX
>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>> It is not effective. With this index H2 have to process all data to
>> calculate aggregate for group. Index on affinity field may be more
>> effective as data can be processed group by group.
>> once all group data is process then result can be passed to reducer.
>> Hope, H2 is smart enough to do such streaming.
>>
>> Also, you can try to use composite index on (customer_id, date) columns.
>> Most likely. hint will needed [2].
>>
>> See also about collocated flag [1] and Hits [2]
>>
>> [1] https://ignite.apache.org/releases/latest/javadoc/org/apache
>> /ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
>> [2] https://apacheignite.readme.io/v2.0/docs/sql-performance-and
>> -debugging#index-hints
>>
>>
>> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Thanks Andrey,
>>>
>>> Right now we are testing with only one big node, so the reduce step
>>> should not take any time.
>>>
>>> 1) We already set parallelism to 32, and I can still see only 1 core
>>> working. Anything else could be preventing multiple cores from working on
>>> the job?
>>> 2) Why would the reduce phase need to look at all the data? It seems
>>> like a fairly simple query
>>> 3) We are already collocating data  by customer_id (though as I
>>> mentioned, right now there is only one node)
>>> 4) We already using collocation and tried using an index, and other
>>> advice? Is there a way to check what Ignite is actually doing? How are
>>> indexs used (by Ignite or H2)?
>>>
>>> Cheers,
>>> Eugene
>>>
>>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
>>> andrey.mashenkov@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> 1. Possible there are too much data should be looked for the query.
>>>> With single node and parallelism=1 query will always run in single thread.
>>>>  You can try to add more nodes or increase query parallelism to utilize
>>>> more CPU cores.
>>>>
>>>> 2. Index on date field may be not effective as reduce phase should look
>>>> all the data for further grouping.
>>>> Try add index on customer_id or use collocation in customer_id (usually
>>>> more preferable way).
>>>>
>>>> Also it is possible the bottleneck is the reduce phase.
>>>> Is it possible to collocate data by group by column  (customer_id)?
>>>> This collocation will allow you use collocated flag [1] and Ignite will use
>>>> more optimal plan.
>>>>
>>>> 4. The main techniques is trying to reduce amount to data to be looked
>>>> up on every phase with using data collocation and indices
>>>> Ignite provide 2 plans for distributed queries: map and reduce. You can
>>>> analyse and check these queries separately to understand how much data are
>>>> processed on map phase and on reduce.
>>>> Map query process node local data (until distributed joins on), while
>>>> reduce fetch data from remote node that may costs. .
>>>>
>>>>
>>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Here is the result of EXPLAIN for the afermantioned query:
>>>>>
>>>>> SELECT
>>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>>> WHERE __Z0.DT > '2018-05-12'
>>>>> GROUP BY __Z0.CUSTOMER_ID
>>>>> SELECT
>>>>> COUNT(*)
>>>>> FROM (
>>>>> SELECT
>>>>> __C0_0 AS CUSTOMER_ID
>>>>> FROM PUBLIC.__T0
>>>>> GROUP BY __C0_0
>>>>> HAVING (SUM(__C0_1) > 2)
>>>>> AND (MAX(__C0_2) < 1)
>>>>> ) _0__Z1
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>>>>> eugene.miretsky@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We have a cache called GAL3EC1, it has
>>>>>>
>>>>>>    1. A composite pKey consisting of customer_id and date
>>>>>>    2. An Index on the date column
>>>>>>    3. 300 sparse columns
>>>>>>
>>>>>> We are running a single EC2 4x8xlarge node.
>>>>>>
>>>>>> The following query takes 8min to finish
>>>>>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
>>>>>> '2018-05-12' GROUP BY customer_id having SUM(ec1_bknt_total_product_views_app)
>>>>>> > 2 AND MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>>
>>>>>> I have a few questions:
>>>>>>
>>>>>>    1. 'top' command shows %100 cpu utilization (i.e only one of the
>>>>>>    32 CPUs is used). How can I get the query to use all 32 CPUs? I have tried
>>>>>>    setting Query Parallelism to 32, but it didn't help,
>>>>>>    2. Adding the index on date column seems to have slowed down the
>>>>>>    query. The 8min time from above was without the index, with the index the
>>>>>>    query doesn't finish (I gave up after 30min). A similar query on a
>>>>>>    smaller date range showed a 10x slow down with the index. Why?
>>>>>>    3. Our loads from Spark are very slow as well, and also seem to
>>>>>>    not use the system resource properly, can that be related?
>>>>>>    4. What are some good tools and techniques to troubleshoot these
>>>>>>    problems in Ignite?
>>>>>>
>>>>>>
>>>>>> All the relevant info is attached (configs, cache stats, node stats,
>>>>>> etc.).
>>>>>>
>>>>>> Cheers,
>>>>>> Eugene
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Andrey V. Mashenkov
>>>>
>>>
>>>
>>
>> --
>> Best regards,
>> Andrey V. Mashenkov
>>
>
>

Re: Slow SQL query uses only a single CPU

Posted by eugene miretsky <eu...@gmail.com>.
Thanks Andrey,

We are using the Ignite notebook, any idea if there is a way to provide
these flags and hints directly from SQL?

From your description, it seems like the query is executed in the following
order
1) Group by customer_id
2) For each group, perform the filtering on date using the index and
aggregates

My impressions was that the order is
1)  On each node, filter rows by date (using the index)
2)  On each node, group by the remaining rows by customer id, and then
perform the aggrate

That's why we created the index on the dt field, as opposed to customer_id
field.

Cheers,
Eugene


On Wed, Aug 22, 2018 at 8:44 AM, Andrey Mashenkov <
andrey.mashenkov@gmail.com> wrote:

> Eugene,
>
> 1. Note that queryParallelism splits indices and Ignite work similar way
> as if index data resides on several nodes. These index part can be looked
> up in parallel threads.
> 2. It is not a simple query as you data distributed among partitions and
> is not collocated and aggregate function are used.
> HAVING clause here is a reason, Ignite can apply it on reduce phase only
> as HAVING requires aggregate value from all index parts.
> 3. If you data already collocated on customer_id then you can hit Ignite
> with set SqlFieldsQuery.setCollocated(true). This should force Ignite to
> optimize grouping and push down aggregates to map phase.
> 4. In query plan you attached you can see H2 uses DT_IDX
> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
> It is not effective. With this index H2 have to process all data to
> calculate aggregate for group. Index on affinity field may be more
> effective as data can be processed group by group.
> once all group data is process then result can be passed to reducer. Hope,
> H2 is smart enough to do such streaming.
>
> Also, you can try to use composite index on (customer_id, date) columns.
> Most likely. hint will needed [2].
>
> See also about collocated flag [1] and Hits [2]
>
> [1] https://ignite.apache.org/releases/latest/javadoc/org/
> apache/ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
> [2] https://apacheignite.readme.io/v2.0/docs/sql-performance-
> and-debugging#index-hints
>
>
> On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <eu...@gmail.com>
> wrote:
>
>> Thanks Andrey,
>>
>> Right now we are testing with only one big node, so the reduce step
>> should not take any time.
>>
>> 1) We already set parallelism to 32, and I can still see only 1 core
>> working. Anything else could be preventing multiple cores from working on
>> the job?
>> 2) Why would the reduce phase need to look at all the data? It seems like
>> a fairly simple query
>> 3) We are already collocating data  by customer_id (though as I
>> mentioned, right now there is only one node)
>> 4) We already using collocation and tried using an index, and other
>> advice? Is there a way to check what Ignite is actually doing? How are
>> indexs used (by Ignite or H2)?
>>
>> Cheers,
>> Eugene
>>
>> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
>> andrey.mashenkov@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> 1. Possible there are too much data should be looked for the query. With
>>> single node and parallelism=1 query will always run in single thread.
>>>  You can try to add more nodes or increase query parallelism to utilize
>>> more CPU cores.
>>>
>>> 2. Index on date field may be not effective as reduce phase should look
>>> all the data for further grouping.
>>> Try add index on customer_id or use collocation in customer_id (usually
>>> more preferable way).
>>>
>>> Also it is possible the bottleneck is the reduce phase.
>>> Is it possible to collocate data by group by column  (customer_id)? This
>>> collocation will allow you use collocated flag [1] and Ignite will use more
>>> optimal plan.
>>>
>>> 4. The main techniques is trying to reduce amount to data to be looked
>>> up on every phase with using data collocation and indices
>>> Ignite provide 2 plans for distributed queries: map and reduce. You can
>>> analyse and check these queries separately to understand how much data are
>>> processed on map phase and on reduce.
>>> Map query process node local data (until distributed joins on), while
>>> reduce fetch data from remote node that may costs. .
>>>
>>>
>>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Here is the result of EXPLAIN for the afermantioned query:
>>>>
>>>> SELECT
>>>> __Z0.CUSTOMER_ID AS __C0_0,
>>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>>> FROM PUBLIC.GAL3EC1 __Z0
>>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>>> WHERE __Z0.DT > '2018-05-12'
>>>> GROUP BY __Z0.CUSTOMER_ID
>>>> SELECT
>>>> COUNT(*)
>>>> FROM (
>>>> SELECT
>>>> __C0_0 AS CUSTOMER_ID
>>>> FROM PUBLIC.__T0
>>>> GROUP BY __C0_0
>>>> HAVING (SUM(__C0_1) > 2)
>>>> AND (MAX(__C0_2) < 1)
>>>> ) _0__Z1
>>>>
>>>>
>>>>
>>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>>>> eugene.miretsky@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have a cache called GAL3EC1, it has
>>>>>
>>>>>    1. A composite pKey consisting of customer_id and date
>>>>>    2. An Index on the date column
>>>>>    3. 300 sparse columns
>>>>>
>>>>> We are running a single EC2 4x8xlarge node.
>>>>>
>>>>> The following query takes 8min to finish
>>>>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
>>>>> '2018-05-12' GROUP BY customer_id having SUM(ec1_bknt_total_product_views_app)
>>>>> > 2 AND MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>>
>>>>> I have a few questions:
>>>>>
>>>>>    1. 'top' command shows %100 cpu utilization (i.e only one of the
>>>>>    32 CPUs is used). How can I get the query to use all 32 CPUs? I have tried
>>>>>    setting Query Parallelism to 32, but it didn't help,
>>>>>    2. Adding the index on date column seems to have slowed down the
>>>>>    query. The 8min time from above was without the index, with the index the
>>>>>    query doesn't finish (I gave up after 30min). A similar query on a
>>>>>    smaller date range showed a 10x slow down with the index. Why?
>>>>>    3. Our loads from Spark are very slow as well, and also seem to
>>>>>    not use the system resource properly, can that be related?
>>>>>    4. What are some good tools and techniques to troubleshoot these
>>>>>    problems in Ignite?
>>>>>
>>>>>
>>>>> All the relevant info is attached (configs, cache stats, node stats,
>>>>> etc.).
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Best regards,
>>> Andrey V. Mashenkov
>>>
>>
>>
>
> --
> Best regards,
> Andrey V. Mashenkov
>

Re: Slow SQL query uses only a single CPU

Posted by Andrey Mashenkov <an...@gmail.com>.
Eugene,

1. Note that queryParallelism splits indices and Ignite work similar way as
if index data resides on several nodes. These index part can be looked up
in parallel threads.
2. It is not a simple query as you data distributed among partitions and is
not collocated and aggregate function are used.
HAVING clause here is a reason, Ignite can apply it on reduce phase only as
HAVING requires aggregate value from all index parts.
3. If you data already collocated on customer_id then you can hit Ignite
with set SqlFieldsQuery.setCollocated(true). This should force Ignite to
optimize grouping and push down aggregates to map phase.
4. In query plan you attached you can see H2 uses DT_IDX
/* PUBLIC.DT_IDX: DT > '2018-05-12' */
It is not effective. With this index H2 have to process all data to
calculate aggregate for group. Index on affinity field may be more
effective as data can be processed group by group.
once all group data is process then result can be passed to reducer. Hope,
H2 is smart enough to do such streaming.

Also, you can try to use composite index on (customer_id, date) columns.
Most likely. hint will needed [2].

See also about collocated flag [1] and Hits [2]

[1]
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/query/SqlFieldsQuery.html#setCollocated-boolean-
[2]
https://apacheignite.readme.io/v2.0/docs/sql-performance-and-debugging#index-hints


On Wed, Aug 22, 2018 at 3:10 PM eugene miretsky <eu...@gmail.com>
wrote:

> Thanks Andrey,
>
> Right now we are testing with only one big node, so the reduce step should
> not take any time.
>
> 1) We already set parallelism to 32, and I can still see only 1 core
> working. Anything else could be preventing multiple cores from working on
> the job?
> 2) Why would the reduce phase need to look at all the data? It seems like
> a fairly simple query
> 3) We are already collocating data  by customer_id (though as I mentioned,
> right now there is only one node)
> 4) We already using collocation and tried using an index, and other
> advice? Is there a way to check what Ignite is actually doing? How are
> indexs used (by Ignite or H2)?
>
> Cheers,
> Eugene
>
> On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
> andrey.mashenkov@gmail.com> wrote:
>
>> Hi,
>>
>> 1. Possible there are too much data should be looked for the query. With
>> single node and parallelism=1 query will always run in single thread.
>>  You can try to add more nodes or increase query parallelism to utilize
>> more CPU cores.
>>
>> 2. Index on date field may be not effective as reduce phase should look
>> all the data for further grouping.
>> Try add index on customer_id or use collocation in customer_id (usually
>> more preferable way).
>>
>> Also it is possible the bottleneck is the reduce phase.
>> Is it possible to collocate data by group by column  (customer_id)? This
>> collocation will allow you use collocated flag [1] and Ignite will use more
>> optimal plan.
>>
>> 4. The main techniques is trying to reduce amount to data to be looked up
>> on every phase with using data collocation and indices
>> Ignite provide 2 plans for distributed queries: map and reduce. You can
>> analyse and check these queries separately to understand how much data are
>> processed on map phase and on reduce.
>> Map query process node local data (until distributed joins on), while
>> reduce fetch data from remote node that may costs. .
>>
>>
>> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Here is the result of EXPLAIN for the afermantioned query:
>>>
>>> SELECT
>>> __Z0.CUSTOMER_ID AS __C0_0,
>>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>>> FROM PUBLIC.GAL3EC1 __Z0
>>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>>> WHERE __Z0.DT > '2018-05-12'
>>> GROUP BY __Z0.CUSTOMER_ID
>>> SELECT
>>> COUNT(*)
>>> FROM (
>>> SELECT
>>> __C0_0 AS CUSTOMER_ID
>>> FROM PUBLIC.__T0
>>> GROUP BY __C0_0
>>> HAVING (SUM(__C0_1) > 2)
>>> AND (MAX(__C0_2) < 1)
>>> ) _0__Z1
>>>
>>>
>>>
>>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>>> eugene.miretsky@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have a cache called GAL3EC1, it has
>>>>
>>>>    1. A composite pKey consisting of customer_id and date
>>>>    2. An Index on the date column
>>>>    3. 300 sparse columns
>>>>
>>>> We are running a single EC2 4x8xlarge node.
>>>>
>>>> The following query takes 8min to finish
>>>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
>>>> '2018-05-12' GROUP BY customer_id having
>>>> SUM(ec1_bknt_total_product_views_app) > 2 AND
>>>> MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>>
>>>> I have a few questions:
>>>>
>>>>    1. 'top' command shows %100 cpu utilization (i.e only one of the 32
>>>>    CPUs is used). How can I get the query to use all 32 CPUs? I have tried
>>>>    setting Query Parallelism to 32, but it didn't help,
>>>>    2. Adding the index on date column seems to have slowed down the
>>>>    query. The 8min time from above was without the index, with the index the
>>>>    query doesn't finish (I gave up after 30min). A similar query on a
>>>>    smaller date range showed a 10x slow down with the index. Why?
>>>>    3. Our loads from Spark are very slow as well, and also seem to not
>>>>    use the system resource properly, can that be related?
>>>>    4. What are some good tools and techniques to troubleshoot these
>>>>    problems in Ignite?
>>>>
>>>>
>>>> All the relevant info is attached (configs, cache stats, node stats,
>>>> etc.).
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>> --
>> Best regards,
>> Andrey V. Mashenkov
>>
>
>

-- 
Best regards,
Andrey V. Mashenkov

Re: Slow SQL query uses only a single CPU

Posted by eugene miretsky <eu...@gmail.com>.
Thanks Andrey,

Right now we are testing with only one big node, so the reduce step should
not take any time.

1) We already set parallelism to 32, and I can still see only 1 core
working. Anything else could be preventing multiple cores from working on
the job?
2) Why would the reduce phase need to look at all the data? It seems like a
fairly simple query
3) We are already collocating data  by customer_id (though as I mentioned,
right now there is only one node)
4) We already using collocation and tried using an index, and other advice?
Is there a way to check what Ignite is actually doing? How are indexs used
(by Ignite or H2)?

Cheers,
Eugene

On Wed, Aug 22, 2018 at 3:54 AM, Andrey Mashenkov <
andrey.mashenkov@gmail.com> wrote:

> Hi,
>
> 1. Possible there are too much data should be looked for the query. With
> single node and parallelism=1 query will always run in single thread.
>  You can try to add more nodes or increase query parallelism to utilize
> more CPU cores.
>
> 2. Index on date field may be not effective as reduce phase should look
> all the data for further grouping.
> Try add index on customer_id or use collocation in customer_id (usually
> more preferable way).
>
> Also it is possible the bottleneck is the reduce phase.
> Is it possible to collocate data by group by column  (customer_id)? This
> collocation will allow you use collocated flag [1] and Ignite will use more
> optimal plan.
>
> 4. The main techniques is trying to reduce amount to data to be looked up
> on every phase with using data collocation and indices
> Ignite provide 2 plans for distributed queries: map and reduce. You can
> analyse and check these queries separately to understand how much data are
> processed on map phase and on reduce.
> Map query process node local data (until distributed joins on), while
> reduce fetch data from remote node that may costs. .
>
>
> On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <eu...@gmail.com>
> wrote:
>
>> Here is the result of EXPLAIN for the afermantioned query:
>>
>> SELECT
>> __Z0.CUSTOMER_ID AS __C0_0,
>> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
>> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
>> FROM PUBLIC.GAL3EC1 __Z0
>> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
>> WHERE __Z0.DT > '2018-05-12'
>> GROUP BY __Z0.CUSTOMER_ID
>> SELECT
>> COUNT(*)
>> FROM (
>> SELECT
>> __C0_0 AS CUSTOMER_ID
>> FROM PUBLIC.__T0
>> GROUP BY __C0_0
>> HAVING (SUM(__C0_1) > 2)
>> AND (MAX(__C0_2) < 1)
>> ) _0__Z1
>>
>>
>>
>> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
>> eugene.miretsky@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have a cache called GAL3EC1, it has
>>>
>>>    1. A composite pKey consisting of customer_id and date
>>>    2. An Index on the date column
>>>    3. 300 sparse columns
>>>
>>> We are running a single EC2 4x8xlarge node.
>>>
>>> The following query takes 8min to finish
>>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
>>> '2018-05-12' GROUP BY customer_id having SUM(ec1_bknt_total_product_views_app)
>>> > 2 AND MAX(ec1_hnk_total_product_clicks_app) < 1)
>>>
>>> I have a few questions:
>>>
>>>    1. 'top' command shows %100 cpu utilization (i.e only one of the 32
>>>    CPUs is used). How can I get the query to use all 32 CPUs? I have tried
>>>    setting Query Parallelism to 32, but it didn't help,
>>>    2. Adding the index on date column seems to have slowed down the
>>>    query. The 8min time from above was without the index, with the index the
>>>    query doesn't finish (I gave up after 30min). A similar query on a
>>>    smaller date range showed a 10x slow down with the index. Why?
>>>    3. Our loads from Spark are very slow as well, and also seem to not
>>>    use the system resource properly, can that be related?
>>>    4. What are some good tools and techniques to troubleshoot these
>>>    problems in Ignite?
>>>
>>>
>>> All the relevant info is attached (configs, cache stats, node stats,
>>> etc.).
>>>
>>> Cheers,
>>> Eugene
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
> --
> Best regards,
> Andrey V. Mashenkov
>

Re: Slow SQL query uses only a single CPU

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

1. Possible there are too much data should be looked for the query. With
single node and parallelism=1 query will always run in single thread.
 You can try to add more nodes or increase query parallelism to utilize
more CPU cores.

2. Index on date field may be not effective as reduce phase should look all
the data for further grouping.
Try add index on customer_id or use collocation in customer_id (usually
more preferable way).

Also it is possible the bottleneck is the reduce phase.
Is it possible to collocate data by group by column  (customer_id)? This
collocation will allow you use collocated flag [1] and Ignite will use more
optimal plan.

4. The main techniques is trying to reduce amount to data to be looked up
on every phase with using data collocation and indices
Ignite provide 2 plans for distributed queries: map and reduce. You can
analyse and check these queries separately to understand how much data are
processed on map phase and on reduce.
Map query process node local data (until distributed joins on), while
reduce fetch data from remote node that may costs. .


On Wed, Aug 22, 2018 at 6:07 AM eugene miretsky <eu...@gmail.com>
wrote:

> Here is the result of EXPLAIN for the afermantioned query:
>
> SELECT
> __Z0.CUSTOMER_ID AS __C0_0,
> SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
> MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
> FROM PUBLIC.GAL3EC1 __Z0
> /* PUBLIC.DT_IDX: DT > '2018-05-12' */
> WHERE __Z0.DT > '2018-05-12'
> GROUP BY __Z0.CUSTOMER_ID
> SELECT
> COUNT(*)
> FROM (
> SELECT
> __C0_0 AS CUSTOMER_ID
> FROM PUBLIC.__T0
> GROUP BY __C0_0
> HAVING (SUM(__C0_1) > 2)
> AND (MAX(__C0_2) < 1)
> ) _0__Z1
>
>
>
> On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <
> eugene.miretsky@gmail.com> wrote:
>
>> Hi,
>>
>> We have a cache called GAL3EC1, it has
>>
>>    1. A composite pKey consisting of customer_id and date
>>    2. An Index on the date column
>>    3. 300 sparse columns
>>
>> We are running a single EC2 4x8xlarge node.
>>
>> The following query takes 8min to finish
>> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
>> '2018-05-12' GROUP BY customer_id having
>> SUM(ec1_bknt_total_product_views_app) > 2 AND
>> MAX(ec1_hnk_total_product_clicks_app) < 1)
>>
>> I have a few questions:
>>
>>    1. 'top' command shows %100 cpu utilization (i.e only one of the 32
>>    CPUs is used). How can I get the query to use all 32 CPUs? I have tried
>>    setting Query Parallelism to 32, but it didn't help,
>>    2. Adding the index on date column seems to have slowed down the
>>    query. The 8min time from above was without the index, with the index the
>>    query doesn't finish (I gave up after 30min). A similar query on a
>>    smaller date range showed a 10x slow down with the index. Why?
>>    3. Our loads from Spark are very slow as well, and also seem to not
>>    use the system resource properly, can that be related?
>>    4. What are some good tools and techniques to troubleshoot these
>>    problems in Ignite?
>>
>>
>> All the relevant info is attached (configs, cache stats, node stats,
>> etc.).
>>
>> Cheers,
>> Eugene
>>
>>
>>
>>
>>
>>
>>
>

-- 
Best regards,
Andrey V. Mashenkov

Re: Slow SQL query uses only a single CPU

Posted by eugene miretsky <eu...@gmail.com>.
Here is the result of EXPLAIN for the afermantioned query:

SELECT
__Z0.CUSTOMER_ID AS __C0_0,
SUM(__Z0.EC1_BKNT_TOTAL_PRODUCT_VIEWS_APP) AS __C0_1,
MAX(__Z0.EC1_HNK_TOTAL_PRODUCT_CLICKS_APP) AS __C0_2
FROM PUBLIC.GAL3EC1 __Z0
/* PUBLIC.DT_IDX: DT > '2018-05-12' */
WHERE __Z0.DT > '2018-05-12'
GROUP BY __Z0.CUSTOMER_ID
SELECT
COUNT(*)
FROM (
SELECT
__C0_0 AS CUSTOMER_ID
FROM PUBLIC.__T0
GROUP BY __C0_0
HAVING (SUM(__C0_1) > 2)
AND (MAX(__C0_2) < 1)
) _0__Z1



On Tue, Aug 21, 2018 at 8:18 PM, eugene miretsky <eu...@gmail.com>
wrote:

> Hi,
>
> We have a cache called GAL3EC1, it has
>
>    1. A composite pKey consisting of customer_id and date
>    2. An Index on the date column
>    3. 300 sparse columns
>
> We are running a single EC2 4x8xlarge node.
>
> The following query takes 8min to finish
> Select COUNT (*) FROM (SELECT customer_id FROM GAl3ec1 where dt >
> '2018-05-12' GROUP BY customer_id having SUM(ec1_bknt_total_product_views_app)
> > 2 AND MAX(ec1_hnk_total_product_clicks_app) < 1)
>
> I have a few questions:
>
>    1. 'top' command shows %100 cpu utilization (i.e only one of the 32
>    CPUs is used). How can I get the query to use all 32 CPUs? I have tried
>    setting Query Parallelism to 32, but it didn't help,
>    2. Adding the index on date column seems to have slowed down the
>    query. The 8min time from above was without the index, with the index the
>    query doesn't finish (I gave up after 30min). A similar query on a
>    smaller date range showed a 10x slow down with the index. Why?
>    3. Our loads from Spark are very slow as well, and also seem to not
>    use the system resource properly, can that be related?
>    4. What are some good tools and techniques to troubleshoot these
>    problems in Ignite?
>
>
> All the relevant info is attached (configs, cache stats, node stats,
> etc.).
>
> Cheers,
> Eugene
>
>
>
>
>
>
>