Posted to user@kudu.apache.org by Faraz Mateen <fm...@an10.io> on 2019/02/12 14:52:23 UTC

Inconsistent read performance with Spark

Hi all,

I am using Spark to pull data from my single-node testing Kudu setup and
publish it to Kafka. However, my query time is not consistent.
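
For reference, the Spark job is essentially equivalent to the sketch below.
The master address, table and column names, filter values and Kafka topic are
placeholders rather than my real ones, and the job runs with the kudu-spark2
and spark-sql-kafka-0-10 packages on the classpath:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("kudu-to-kafka").getOrCreate()

// Read the table from Kudu (placeholder master address and table name).
val df = spark.read
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", "kudu-master:7051")
  .option("kudu.table", "packets")
  .load()

// Filter on the primary key columns, as the real query does; simple
// comparison predicates like these are pushed down to the Kudu scanner.
val selected = df
  .filter(col("id") === 42)
  .filter(col("datetime") >= Timestamp.valueOf("2019-01-01 00:00:00"))
  .filter(col("datetime") < Timestamp.valueOf("2019-02-01 00:00:00"))

// Publish the selected rows to Kafka as JSON (Spark 2.3 batch Kafka sink).
selected
  .selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-broker:9092")
  .option("topic", "kudu-packets")
  .save()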

I am querying a table with around *1.1 million* packets. Initially my query
was taking *537 seconds to read 51042 records* from Kudu and write them to
Kafka. This rate was much lower than what I had expected. I had around 45
tables with little data in them that was not needed anymore. I deleted all
those tables, restarted the Spark session and attempted the same query. Now
the query completed in *5.3 seconds*.

I increased the number of rows to be fetched and tried the same query. The
row count was *118741*, but it took *1861 seconds* to complete. During the
query, resource utilization on my servers was very low. When I attempted
the same query again after a couple of hours, it took only *16 seconds*.

After this I kept increasing the number of rows to be fetched, and the time
kept increasing in a linear fashion.

What I want to ask is:

   - How can I debug why the time for these queries is varying so much? I
   am not able to get anything out of Kudu logs.
   - I am running Kudu with default configurations. Are there any tweaks I
   should perform to boost the performance of my setup?
   - Does having a lot of tables cause performance issues?
   - Will having more masters and tservers improve my query time?

*Environment Details:*

   - Single node Kudu 1.7 master and tserver. Server has 4 vCPUs and 16 GB
   RAM.
   - The table that I am querying is hash partitioned on an ID column into 3
   buckets. It is also range partitioned on datetime, with a new partition for
   each month.
   - Kafka version 1.1.
   - Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.

-- 
Faraz Mateen

Re: Inconsistent read performance with Spark

Posted by Hao Hao <ha...@cloudera.com>.
Hi Faraz,

Yes, the order can help with both write and scan performance in your case.
When the inserts are random (as you said the order of IDs is random), there
will be many rowsets that overlap in primary key bounds, which the
maintenance manager needs to allocate resources to compact. You will
eventually reach a point where the bloom filters and primary key indexes can
no longer stay resident in cache, and write performance will degrade. It also
affects your scans, since during key lookups you have to look through more
rowsets than you would with sequential (or nearly sequential) inserts.

Therefore, we generally recommend putting the timestamp first in the primary
key if data arrives mostly in timestamp order.

FYI, you can check the rowset layout for each tablet in the tablet server
web UI.
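
If you do end up re-creating the table with the timestamp leading the key,
here is a minimal sketch using the Kudu Java client from Scala. The column
names, types, table name, master address and replica count below are
placeholders for illustration only; keep your real schema and simply reorder
the key columns:

import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.client.{CreateTableOptions, KuduClient}
import scala.collection.JavaConverters._

val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

// Key columns come first, with the timestamp leading so that inserts
// arriving in time order land in mostly-sequential primary key order.
val columns = List(
  new ColumnSchema.ColumnSchemaBuilder("datetime", Type.UNIXTIME_MICROS).key(true).build(),
  new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
  new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).build()
).asJava
val schema = new Schema(columns)

// Same partitioning as your current table: hash on id, range on datetime.
// Monthly range partitions can be added with addRangePartition() (or later
// through alterTable); they are omitted here for brevity.
val options = new CreateTableOptions()
  .addHashPartitions(List("id").asJava, 3)
  .setRangePartitionColumns(List("datetime").asJava)
  .setNumReplicas(1)

client.createTable("packets_by_time", schema, options)
client.close()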

Best,
Hao

On Thu, Feb 14, 2019 at 4:09 PM Faraz Mateen <fm...@an10.io> wrote:

> Hao,
>
> The order of my primary key is (ID, datetime). My query had 'WHERE' clause
> for both these keys. How does the order exactly affect scan performance?
>
> I think restarting the tablet server removed all previous records on scan
> dashboard. I can't find any query that took too long to complete.
>
> On Thu, Feb 14, 2019 at 4:31 AM Hao Hao <ha...@cloudera.com> wrote:
>
>> Hi Faraz,
>>
>> What is the order of your primary key? Is it (datetime, ID) or (ID,
>> datetime)?
>>
>> On the contrary, I suspect your scan performance got better for the same
>> query because compaction happened in between, and thus there were fewer
>> blocks to scan. Also would you mind sharing the screen shot of the tablet
>> server web UI page when your scans took place (to do a comparison between
>> the 'good' and 'bad' scans) ?
>>
>> Best,
>> Hao
>>
>> On Wed, Feb 13, 2019 at 9:37 AM Faraz Mateen <fm...@an10.io> wrote:
>>
>>> By "not noticing any compaction" I meant I did not see any visible
>>> change in disk space. However, logs show that there were some compaction
>>> related operations happening during this whole time period. These
>>> statements appeared multiple times in tserver logs:
>>>
>>> W0211 13:44:10.991221 15822 tablet.cc:1679] T
>>> 00b8818d0713485b83982ac56d9e342a P 7b44fc5229fe43e190d4d6c1e8022988: Can't
>>> schedule compaction. Clean time has not been advanced past its initial
>>> value.
>>> ...
>>> ...
>>> I0211 14:36:33.883819 15822 maintenance_manager.cc:302] P
>>> 7b44fc5229fe43e190d4d6c1e8022988: Scheduling
>>> MajorDeltaCompactionOp(30c9aaadcb13460fab832bdea1104349): perf
>>> score=0.106957
>>> I0211 14:36:33.884233 13179 diskrowset.cc:560] T
>>> 30c9aaadcb13460fab832bdea1104349 P 7b44fc5229fe43e190d4d6c1e8022988:
>>> RowSet(3080): Major compacting REDO delta stores (cols: 2 3 4 5 6 7 9 10 11
>>> 13 14 15 16 20 22 29 31 33 36 38 39 41 42 47 49 51 52 56 57 58 64 67 68 71
>>> 75 77 78 79 80 81 109 128 137)
>>>
>>>
>>> Does compaction affect scan performance? And if it does, what can I do
>>> to limit this degradation?
>>>
>>>
>>> On Wed, Feb 13, 2019 at 7:24 PM Faraz Mateen <fm...@an10.io> wrote:
>>>
>>>> Thanks a lot for the help, Hao.
>>>>
>>>> Response Inline:
>>>>
>>>> You can use tablet server web UI scans dashboard (/scans) to get a
>>>>> better understanding of the ongoing/past queries. The flag
>>>>> 'scan_history_count' is used to configure the size of the buffer. From
>>>>> there, you can get information such as the applied predicates and column
>>>>> stats for the selected columns.
>>>>
>>>>
>>>> Thanks. I did not know about this.
>>>>
>>>> Did you notice any compactions in Kudu between you issued the two
>>>>> queries? What is your ingest pattern, are you inserting data in random
>>>>> primary key order?
>>>>
>>>>
>>>> The table has hash partitioning on an ID column that can have 15
>>>> different values and range partition on datetime which is split monthly.
>>>> Both ID and datetime are my primary keys. The data we ingest is in
>>>> increasing order of time (usually) but the order of IDs is random.
>>>>
>>>> However, ingestion into kudu was stopped while I was performing these
>>>> queries. I did not notice any compaction either.
>>>>
>>>> On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <ha...@cloudera.com> wrote:
>>>>
>>>>> Hi Faraz,
>>>>>
>>>>> Answered inline below.
>>>>>
>>>>> Best,
>>>>> Hao
>>>>>
>>>>> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fm...@an10.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am using spark to pull data from my single node testing kudu setup
>>>>>> and publish it to kafka. However, my query time is not consistent.
>>>>>>
>>>>>> I am querying a table with around *1.1 million *packets. Initially
>>>>>> my query was taking* 537 seconds to read 51042 records* from kudu
>>>>>> and write them to kafka. This rate was much lower than what I had expected.
>>>>>> I had around 45 tables with little data in them that was not needed
>>>>>> anymore. I deleted all those tables, restarted spark session and attempted
>>>>>> the same query. Now the query completed in* 5.3 seconds*.
>>>>>>
>>>>>> I increased the number of rows to be fetched and tried the same
>>>>>> query. Rows count was *118741* but it took *1861 seconds *to
>>>>>> complete. During the query, resource utilization of my servers was very
>>>>>> low. When I attempted the same query again after a couple of hours,
>>>>>> it took only* 16 secs*.
>>>>>>
>>>>>> After this I kept increasing the number of rows to be fetched and the
>>>>>> time kept increasing in linear fashion.
>>>>>>
>>>>>> What I want to ask is:
>>>>>>
>>>>>>    - How can I debug why the time for these queries is varying so
>>>>>>    much? I am not able to get anything out of Kudu logs.
>>>>>>
>>>>>> You can use tablet server web UI scans dashboard (/scans) to get a
>>>>> better understanding of the ongoing/past queries. The flag
>>>>> 'scan_history_count' is used to configure the size of the buffer. From
>>>>> there, you can get information such as the applied predicates and column
>>>>> stats for the selected columns.
>>>>>
>>>>>
>>>>>>
>>>>>>    - I am running kudu with default configurations. Are there any
>>>>>>    tweaks I should perform to boost the performance of my setup?
>>>>>>
>>>>>> Did you notice any compactions in Kudu between you issued the two
>>>>> queries? What is your ingest pattern, are you inserting data in random
>>>>> primary key order?
>>>>>
>>>>>>
>>>>>>    - Does having a lot of tables cause performance issues?
>>>>>>
>>>>>> If there is no hitting of resource limitation due to writes/scans to
>>>>> the other tables, they shouldn't affect the performance of your queries.
>>>>> Just FYI, this is the scale guide
>>>>> <https://kudu.apache.org/docs/scaling_guide.html> with respect to
>>>>> various system resources.
>>>>>
>>>>>>
>>>>>>    - Will having more masters and tservers improve my query time?
>>>>>>
>>>>>> Master is not likely to be the bottleneck, as client communicate
>>>>> directly to tserver for query once he/she knows which tserver to talk to.
>>>>> But separating master and tserver to be on the same node might help. This
>>>>> is the scale limitation
>>>>> <https://kudu.apache.org/docs/known_issues.html#_scale> guide for
>>>>> roughly estimation of number of tservers required for a given quantity of
>>>>> data.
>>>>>
>>>>> *Environment Details:*
>>>>>>
>>>>>>    - Single node Kudu 1.7 master and tserver. Server has 4 vCPUs and
>>>>>>    16 GB RAM.
>>>>>>    - Table that I am querying is hash partitioned on the basis of an
>>>>>>    ID with 3 buckets. It is also range partitioned on the basis of datetime
>>>>>>    with a new partition for each month.
>>>>>>    - Kafka version 1.1.
>>>>>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPU 4 GB
>>>>>>    RAM.
>>>>>>
>>>>>> --
>>>>>> Faraz Mateen
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Faraz Mateen
>>>>
>>>
>>>
>>> --
>>> Faraz Mateen
>>>
>>
>
> --
> Faraz Mateen
>

Re: Inconsistent read performance with Spark

Posted by Faraz Mateen <fm...@an10.io>.
Hao,

The order of my primary key is (ID, datetime). My query had a 'WHERE' clause
on both of these keys. How exactly does the order affect scan performance?

I think restarting the tablet server removed all previous records from the
scans dashboard. I can't find any query that took too long to complete.

On Thu, Feb 14, 2019 at 4:31 AM Hao Hao <ha...@cloudera.com> wrote:

> Hi Faraz,
>
> What is the order of your primary key? Is it (datetime, ID) or (ID,
> datetime)?
>
> On the contrary, I suspect your scan performance got better for the same
> query because compaction happened in between, and thus there were fewer
> blocks to scan. Also would you mind sharing the screen shot of the tablet
> server web UI page when your scans took place (to do a comparison between
> the 'good' and 'bad' scans) ?
>
> Best,
> Hao
>
> On Wed, Feb 13, 2019 at 9:37 AM Faraz Mateen <fm...@an10.io> wrote:
>
>> By "not noticing any compaction" I meant I did not see any visible change
>> in disk space. However, logs show that there were some compaction related
>> operations happening during this whole time period. These statements
>> appeared multiple times in tserver logs:
>>
>> W0211 13:44:10.991221 15822 tablet.cc:1679] T
>> 00b8818d0713485b83982ac56d9e342a P 7b44fc5229fe43e190d4d6c1e8022988: Can't
>> schedule compaction. Clean time has not been advanced past its initial
>> value.
>> ...
>> ...
>> I0211 14:36:33.883819 15822 maintenance_manager.cc:302] P
>> 7b44fc5229fe43e190d4d6c1e8022988: Scheduling
>> MajorDeltaCompactionOp(30c9aaadcb13460fab832bdea1104349): perf
>> score=0.106957
>> I0211 14:36:33.884233 13179 diskrowset.cc:560] T
>> 30c9aaadcb13460fab832bdea1104349 P 7b44fc5229fe43e190d4d6c1e8022988:
>> RowSet(3080): Major compacting REDO delta stores (cols: 2 3 4 5 6 7 9 10 11
>> 13 14 15 16 20 22 29 31 33 36 38 39 41 42 47 49 51 52 56 57 58 64 67 68 71
>> 75 77 78 79 80 81 109 128 137)
>>
>>
>> Does compaction affect scan performance? And if it does, what can I do to
>> limit this degradation?
>>
>>
>> On Wed, Feb 13, 2019 at 7:24 PM Faraz Mateen <fm...@an10.io> wrote:
>>
>>> Thanks a lot for the help, Hao.
>>>
>>> Response Inline:
>>>
>>> You can use tablet server web UI scans dashboard (/scans) to get a
>>>> better understanding of the ongoing/past queries. The flag
>>>> 'scan_history_count' is used to configure the size of the buffer. From
>>>> there, you can get information such as the applied predicates and column
>>>> stats for the selected columns.
>>>
>>>
>>> Thanks. I did not know about this.
>>>
>>> Did you notice any compactions in Kudu between you issued the two
>>>> queries? What is your ingest pattern, are you inserting data in random
>>>> primary key order?
>>>
>>>
>>> The table has hash partitioning on an ID column that can have 15
>>> different values and range partition on datetime which is split monthly.
>>> Both ID and datetime are my primary keys. The data we ingest is in
>>> increasing order of time (usually) but the order of IDs is random.
>>>
>>> However, ingestion into kudu was stopped while I was performing these
>>> queries. I did not notice any compaction either.
>>>
>>> On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <ha...@cloudera.com> wrote:
>>>
>>>> Hi Faraz,
>>>>
>>>> Answered inline below.
>>>>
>>>> Best,
>>>> Hao
>>>>
>>>> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fm...@an10.io> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am using spark to pull data from my single node testing kudu setup
>>>>> and publish it to kafka. However, my query time is not consistent.
>>>>>
>>>>> I am querying a table with around *1.1 million *packets. Initially my
>>>>> query was taking* 537 seconds to read 51042 records* from kudu and
>>>>> write them to kafka. This rate was much lower than what I had expected. I
>>>>> had around 45 tables with little data in them that was not needed anymore.
>>>>> I deleted all those tables, restarted spark session and attempted the same
>>>>> query. Now the query completed in* 5.3 seconds*.
>>>>>
>>>>> I increased the number of rows to be fetched and tried the same query.
>>>>> Rows count was *118741* but it took *1861 seconds *to complete.
>>>>> During the query, resource utilization of my servers was very low. When
>>>>> I attempted the same query again after a couple of hours, it took only*
>>>>> 16 secs*.
>>>>>
>>>>> After this I kept increasing the number of rows to be fetched and the
>>>>> time kept increasing in linear fashion.
>>>>>
>>>>> What I want to ask is:
>>>>>
>>>>>    - How can I debug why the time for these queries is varying so
>>>>>    much? I am not able to get anything out of Kudu logs.
>>>>>
>>>>> You can use tablet server web UI scans dashboard (/scans) to get a
>>>> better understanding of the ongoing/past queries. The flag
>>>> 'scan_history_count' is used to configure the size of the buffer. From
>>>> there, you can get information such as the applied predicates and column
>>>> stats for the selected columns.
>>>>
>>>>
>>>>>
>>>>>    - I am running kudu with default configurations. Are there any
>>>>>    tweaks I should perform to boost the performance of my setup?
>>>>>
>>>>> Did you notice any compactions in Kudu between you issued the two
>>>> queries? What is your ingest pattern, are you inserting data in random
>>>> primary key order?
>>>>
>>>>>
>>>>>    - Does having a lot of tables cause performance issues?
>>>>>
>>>>> If there is no hitting of resource limitation due to writes/scans to
>>>> the other tables, they shouldn't affect the performance of your queries.
>>>> Just FYI, this is the scale guide
>>>> <https://kudu.apache.org/docs/scaling_guide.html> with respect to
>>>> various system resources.
>>>>
>>>>>
>>>>>    - Will having more masters and tservers improve my query time?
>>>>>
>>>>> Master is not likely to be the bottleneck, as client communicate
>>>> directly to tserver for query once he/she knows which tserver to talk to.
>>>> But separating master and tserver to be on the same node might help. This
>>>> is the scale limitation
>>>> <https://kudu.apache.org/docs/known_issues.html#_scale> guide for
>>>> roughly estimation of number of tservers required for a given quantity of
>>>> data.
>>>>
>>>> *Environment Details:*
>>>>>
>>>>>    - Single node Kudu 1.7 master and tserver. Server has 4 vCPUs and
>>>>>    16 GB RAM.
>>>>>    - Table that I am querying is hash partitioned on the basis of an
>>>>>    ID with 3 buckets. It is also range partitioned on the basis of datetime
>>>>>    with a new partition for each month.
>>>>>    - Kafka version 1.1.
>>>>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPU 4 GB RAM.
>>>>>
>>>>> --
>>>>> Faraz Mateen
>>>>>
>>>>
>>>
>>> --
>>> Faraz Mateen
>>>
>>
>>
>> --
>> Faraz Mateen
>>
>

-- 
Faraz Mateen

Re: Inconsistent read performance with Spark

Posted by Hao Hao <ha...@cloudera.com>.
Hi Faraz,

What is the order of your primary key? Is it (datetime, ID) or (ID,
datetime)?

On the contrary, I suspect your scan performance got better for the same
query because compaction happened in between, and thus there were fewer
blocks to scan. Also, would you mind sharing a screenshot of the tablet
server web UI page from when your scans took place (to compare the 'good'
and 'bad' scans)?

Best,
Hao

On Wed, Feb 13, 2019 at 9:37 AM Faraz Mateen <fm...@an10.io> wrote:

> By "not noticing any compaction" I meant I did not see any visible change
> in disk space. However, logs show that there were some compaction related
> operations happening during this whole time period. These statements
> appeared multiple times in tserver logs:
>
> W0211 13:44:10.991221 15822 tablet.cc:1679] T
> 00b8818d0713485b83982ac56d9e342a P 7b44fc5229fe43e190d4d6c1e8022988: Can't
> schedule compaction. Clean time has not been advanced past its initial
> value.
> ...
> ...
> I0211 14:36:33.883819 15822 maintenance_manager.cc:302] P
> 7b44fc5229fe43e190d4d6c1e8022988: Scheduling
> MajorDeltaCompactionOp(30c9aaadcb13460fab832bdea1104349): perf
> score=0.106957
> I0211 14:36:33.884233 13179 diskrowset.cc:560] T
> 30c9aaadcb13460fab832bdea1104349 P 7b44fc5229fe43e190d4d6c1e8022988:
> RowSet(3080): Major compacting REDO delta stores (cols: 2 3 4 5 6 7 9 10 11
> 13 14 15 16 20 22 29 31 33 36 38 39 41 42 47 49 51 52 56 57 58 64 67 68 71
> 75 77 78 79 80 81 109 128 137)
>
>
> Does compaction affect scan performance? And if it does, what can I do to
> limit this degradation?
>
>
> On Wed, Feb 13, 2019 at 7:24 PM Faraz Mateen <fm...@an10.io> wrote:
>
>> Thanks a lot for the help, Hao.
>>
>> Response Inline:
>>
>> You can use tablet server web UI scans dashboard (/scans) to get a better
>>> understanding of the ongoing/past queries. The flag 'scan_history_count' is
>>> used to configure the size of the buffer. From there, you can get
>>> information such as the applied predicates and column stats for the
>>> selected columns.
>>
>>
>> Thanks. I did not know about this.
>>
>> Did you notice any compactions in Kudu between you issued the two
>>> queries? What is your ingest pattern, are you inserting data in random
>>> primary key order?
>>
>>
>> The table has hash partitioning on an ID column that can have 15 different
>> values and range partition on datetime which is split monthly. Both ID and
>> datetime are my primary keys. The data we ingest is in increasing order of
>> time (usually) but the order of IDs is random.
>>
>> However, ingestion into kudu was stopped while I was performing these
>> queries. I did not notice any compaction either.
>>
>> On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <ha...@cloudera.com> wrote:
>>
>>> Hi Faraz,
>>>
>>> Answered inline below.
>>>
>>> Best,
>>> Hao
>>>
>>> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fm...@an10.io> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am using spark to pull data from my single node testing kudu setup
>>>> and publish it to kafka. However, my query time is not consistent.
>>>>
>>>> I am querying a table with around *1.1 million *packets. Initially my
>>>> query was taking* 537 seconds to read 51042 records* from kudu and
>>>> write them to kafka. This rate was much lower than what I had expected. I
>>>> had around 45 tables with little data in them that was not needed anymore.
>>>> I deleted all those tables, restarted spark session and attempted the same
>>>> query. Now the query completed in* 5.3 seconds*.
>>>>
>>>> I increased the number of rows to be fetched and tried the same query.
>>>> Rows count was *118741* but it took *1861 seconds *to complete. During
>>>> the query, resource utilization of my servers was very low. When I
>>>> attempted the same query again after a couple of hours, it took only*
>>>> 16 secs*.
>>>>
>>>> After this I kept increasing the number of rows to be fetched and the
>>>> time kept increasing in linear fashion.
>>>>
>>>> What I want to ask is:
>>>>
>>>>    - How can I debug why the time for these queries is varying so
>>>>    much? I am not able to get anything out of Kudu logs.
>>>>
>>>> You can use tablet server web UI scans dashboard (/scans) to get a
>>> better understanding of the ongoing/past queries. The flag
>>> 'scan_history_count' is used to configure the size of the buffer. From
>>> there, you can get information such as the applied predicates and column
>>> stats for the selected columns.
>>>
>>>
>>>>
>>>>    - I am running kudu with default configurations. Are there any
>>>>    tweaks I should perform to boost the performance of my setup?
>>>>
>>>> Did you notice any compactions in Kudu between you issued the two
>>> queries? What is your ingest pattern, are you inserting data in random
>>> primary key order?
>>>
>>>>
>>>>    - Does having a lot of tables cause performance issues?
>>>>
>>>> If there is no hitting of resource limitation due to writes/scans to
>>> the other tables, they shouldn't affect the performance of your queries.
>>> Just FYI, this is the scale guide
>>> <https://kudu.apache.org/docs/scaling_guide.html> with respect to
>>> various system resources.
>>>
>>>>
>>>>    - Will having more masters and tservers improve my query time?
>>>>
>>>> Master is not likely to be the bottleneck, as client communicate
>>> directly to tserver for query once he/she knows which tserver to talk to.
>>> But separating master and tserver to be on the same node might help. This
>>> is the scale limitation
>>> <https://kudu.apache.org/docs/known_issues.html#_scale> guide for
>>> roughly estimation of number of tservers required for a given quantity of
>>> data.
>>>
>>> *Environment Details:*
>>>>
>>>>    - Single node Kudu 1.7 master and tserver. Server has 4 vCPUs and
>>>>    16 GB RAM.
>>>>    - Table that I am querying is hash partitioned on the basis of an
>>>>    ID with 3 buckets. It is also range partitioned on the basis of datetime
>>>>    with a new partition for each month.
>>>>    - Kafka version 1.1.
>>>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPU 4 GB RAM.
>>>>
>>>> --
>>>> Faraz Mateen
>>>>
>>>
>>
>> --
>> Faraz Mateen
>>
>
>
> --
> Faraz Mateen
>

Re: Inconsistent read performance with Spark

Posted by Faraz Mateen <fm...@an10.io>.
By "not noticing any compaction" I meant that I did not see any visible
change in disk space. However, the logs show that there were some
compaction-related operations happening during this whole time period. These
statements appeared multiple times in the tserver logs:

W0211 13:44:10.991221 15822 tablet.cc:1679] T
00b8818d0713485b83982ac56d9e342a P 7b44fc5229fe43e190d4d6c1e8022988: Can't
schedule compaction. Clean time has not been advanced past its initial
value.
...
...
I0211 14:36:33.883819 15822 maintenance_manager.cc:302] P
7b44fc5229fe43e190d4d6c1e8022988: Scheduling
MajorDeltaCompactionOp(30c9aaadcb13460fab832bdea1104349): perf
score=0.106957
I0211 14:36:33.884233 13179 diskrowset.cc:560] T
30c9aaadcb13460fab832bdea1104349 P 7b44fc5229fe43e190d4d6c1e8022988:
RowSet(3080): Major compacting REDO delta stores (cols: 2 3 4 5 6 7 9 10 11
13 14 15 16 20 22 29 31 33 36 38 39 41 42 47 49 51 52 56 57 58 64 67 68 71
75 77 78 79 80 81 109 128 137)


Does compaction affect scan performance? And if it does, what can I do to
limit this degradation?


On Wed, Feb 13, 2019 at 7:24 PM Faraz Mateen <fm...@an10.io> wrote:

> Thanks a lot for the help, Hao.
>
> Response Inline:
>
> You can use tablet server web UI scans dashboard (/scans) to get a better
>> understanding of the ongoing/past queries. The flag 'scan_history_count' is
>> used to configure the size of the buffer. From there, you can get
>> information such as the applied predicates and column stats for the
>> selected columns.
>
>
> Thanks. I did not know about this.
>
> Did you notice any compactions in Kudu between you issued the two queries?
>> What is your ingest pattern, are you inserting data in random primary key
>> order?
>
>
> The table has hash partitioning on an ID column that can have 15 different
> values and range partition on datetime which is split monthly. Both ID and
> datetime are my primary keys. The data we ingest is in increasing order of
> time (usually) but the order of IDs is random.
>
> However, ingestion into kudu was stopped while I was performing these
> queries. I did not notice any compaction either.
>
> On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <ha...@cloudera.com> wrote:
>
>> Hi Faraz,
>>
>> Answered inline below.
>>
>> Best,
>> Hao
>>
>> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fm...@an10.io> wrote:
>>
>>> Hi all,
>>>
>>> I am using spark to pull data from my single node testing kudu setup and
>>> publish it to kafka. However, my query time is not consistent.
>>>
>>> I am querying a table with around *1.1 million *packets. Initially my
>>> query was taking* 537 seconds to read 51042 records* from kudu and
>>> write them to kafka. This rate was much lower than what I had expected. I
>>> had around 45 tables with little data in them that was not needed anymore.
>>> I deleted all those tables, restarted spark session and attempted the same
>>> query. Now the query completed in* 5.3 seconds*.
>>>
>>> I increased the number of rows to be fetched and tried the same query.
>>> Rows count was *118741* but it took *1861 seconds *to complete. During
>>> the query, resource utilization of my servers was very low. When I
>>> attempted the same query again after a couple of hours, it took only*
>>> 16 secs*.
>>>
>>> After this I kept increasing the number of rows to be fetched and the
>>> time kept increasing in linear fashion.
>>>
>>> What I want to ask is:
>>>
>>>    - How can I debug why the time for these queries is varying so much?
>>>    I am not able to get anything out of Kudu logs.
>>>
>>> You can use tablet server web UI scans dashboard (/scans) to get a
>> better understanding of the ongoing/past queries. The flag
>> 'scan_history_count' is used to configure the size of the buffer. From
>> there, you can get information such as the applied predicates and column
>> stats for the selected columns.
>>
>>
>>>
>>>    - I am running kudu with default configurations. Are there any
>>>    tweaks I should perform to boost the performance of my setup?
>>>
>>> Did you notice any compactions in Kudu between you issued the two
>> queries? What is your ingest pattern, are you inserting data in random
>> primary key order?
>>
>>>
>>>    - Does having a lot of tables cause performance issues?
>>>
>>> If there is no hitting of resource limitation due to writes/scans to the
>> other tables, they shouldn't affect the performance of your queries. Just
>> FYI, this is the scale guide
>> <https://kudu.apache.org/docs/scaling_guide.html> with respect to
>> various system resources.
>>
>>>
>>>    - Will having more masters and tservers improve my query time?
>>>
>>> Master is not likely to be the bottleneck, as client communicate
>> directly to tserver for query once he/she knows which tserver to talk to.
>> But separating master and tserver to be on the same node might help. This
>> is the scale limitation
>> <https://kudu.apache.org/docs/known_issues.html#_scale> guide for
>> roughly estimation of number of tservers required for a given quantity of
>> data.
>>
>> *Environment Details:*
>>>
>>>    - Single node Kudu 1.7 master and tserver. Server has 4 vCPUs and 16
>>>    GB RAM.
>>>    - Table that I am querying is hash partitioned on the basis of an ID
>>>    with 3 buckets. It is also range partitioned on the basis of datetime with
>>>    a new partition for each month.
>>>    - Kafka version 1.1.
>>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPU 4 GB RAM.
>>>
>>> --
>>> Faraz Mateen
>>>
>>
>
> --
> Faraz Mateen
>


-- 
Faraz Mateen

Re: Inconsistent read performance with Spark

Posted by Faraz Mateen <fm...@an10.io>.
Thanks a lot for the help, Hao.

Response Inline:

You can use tablet server web UI scans dashboard (/scans) to get a better
> understanding of the ongoing/past queries. The flag 'scan_history_count' is
> used to configure the size of the buffer. From there, you can get
> information such as the applied predicates and column stats for the
> selected columns.


Thanks. I did not know about this.

Did you notice any compactions in Kudu between you issued the two queries?
> What is your ingest pattern, are you inserting data in random primary key
> order?


The table is hash partitioned on an ID column that can have 15 different
values and range partitioned on datetime, split monthly. Both ID and
datetime are primary key columns. The data we ingest is (usually) in
increasing order of time, but the order of IDs is random.

However, ingestion into Kudu was stopped while I was performing these
queries. I did not notice any compaction either.

On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <ha...@cloudera.com> wrote:

> Hi Faraz,
>
> Answered inline below.
>
> Best,
> Hao
>
> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fm...@an10.io> wrote:
>
>> Hi all,
>>
>> I am using spark to pull data from my single node testing kudu setup and
>> publish it to kafka. However, my query time is not consistent.
>>
>> I am querying a table with around *1.1 million *packets. Initially my
>> query was taking* 537 seconds to read 51042 records* from kudu and write
>> them to kafka. This rate was much lower than what I had expected. I had
>> around 45 tables with little data in them that was not needed anymore. I
>> deleted all those tables, restarted spark session and attempted the same
>> query. Now the query completed in* 5.3 seconds*.
>>
>> I increased the number of rows to be fetched and tried the same query.
>> Rows count was *118741* but it took *1861 seconds *to complete. During
>> the query, resource utilization of my servers was very low. When I
>> attempted the same query again after a couple of hours, it took only* 16
>> secs*.
>>
>> After this I kept increasing the number of rows to be fetched and the
>> time kept increasing in linear fashion.
>>
>> What I want to ask is:
>>
>>    - How can I debug why the time for these queries is varying so much?
>>    I am not able to get anything out of Kudu logs.
>>
>> You can use tablet server web UI scans dashboard (/scans) to get a better
> understanding of the ongoing/past queries. The flag 'scan_history_count' is
> used to configure the size of the buffer. From there, you can get
> information such as the applied predicates and column stats for the
> selected columns.
>
>
>>
>>    - I am running kudu with default configurations. Are there any tweaks
>>    I should perform to boost the performance of my setup?
>>
>> Did you notice any compactions in Kudu between you issued the two
> queries? What is your ingest pattern, are you inserting data in random
> primary key order?
>
>>
>>    - Does having a lot of tables cause performance issues?
>>
>> If there is no hitting of resource limitation due to writes/scans to the
> other tables, they shouldn't affect the performance of your queries. Just
> FYI, this is the scale guide
> <https://kudu.apache.org/docs/scaling_guide.html> with respect to various
> system resources.
>
>>
>>    - Will having more masters and tservers improve my query time?
>>
>> Master is not likely to be the bottleneck, as client communicate directly
> to tserver for query once he/she knows which tserver to talk to. But
> separating master and tserver to be on the same node might help. This is
> the scale limitation
> <https://kudu.apache.org/docs/known_issues.html#_scale> guide for roughly
> estimation of number of tservers required for a given quantity of data.
>
> *Environment Details:*
>>
>>    - Single node Kudu 1.7 master and tserver. Server has 4 vCPUs and 16
>>    GB RAM.
>>    - Table that I am querying is hash partitioned on the basis of an ID
>>    with 3 buckets. It is also range partitioned on the basis of datetime with
>>    a new partition for each month.
>>    - Kafka version 1.1.
>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPU 4 GB RAM.
>>
>> --
>> Faraz Mateen
>>
>

-- 
Faraz Mateen

Re: Inconsistent read performance with Spark

Posted by Hao Hao <ha...@cloudera.com>.
Hi Faraz,

Answered inline below.

Best,
Hao

On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fm...@an10.io> wrote:

> Hi all,
>
> I am using spark to pull data from my single node testing kudu setup and
> publish it to kafka. However, my query time is not consistent.
>
> I am querying a table with around *1.1 million *packets. Initially my
> query was taking* 537 seconds to read 51042 records* from kudu and write
> them to kafka. This rate was much lower than what I had expected. I had
> around 45 tables with little data in them that was not needed anymore. I
> deleted all those tables, restarted spark session and attempted the same
> query. Now the query completed in* 5.3 seconds*.
>
> I increased the number of rows to be fetched and tried the same query.
> Rows count was *118741* but it took *1861 seconds *to complete. During
> the query, resource utilization of my servers was very low. When I
> attempted the same query again after a couple of hours, it took only* 16
> secs*.
>
> After this I kept increasing the number of rows to be fetched and the time
> kept increasing in linear fashion.
>
> What I want to ask is:
>
>    - How can I debug why the time for these queries is varying so much? I
>    am not able to get anything out of Kudu logs.
>
You can use the tablet server web UI scans dashboard (/scans) to get a
better understanding of ongoing and past queries. The flag
'scan_history_count' is used to configure the size of the history buffer.
From there, you can get information such as the applied predicates and
column stats for the selected columns.
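
For example, starting the tablet server with --scan_history_count=100 (100 is
just an arbitrary value) keeps more scan history around, and the dashboard is
then available at http://<tserver-host>:8050/scans, assuming the default web
UI port.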


>
>    - I am running kudu with default configurations. Are there any tweaks
>    I should perform to boost the performance of my setup?
>
Did you notice any compactions in Kudu between issuing the two queries?
What is your ingest pattern? Are you inserting data in random primary key
order?

>
>    - Does having a lot of tables cause performance issues?
>
If writes/scans to the other tables are not hitting any resource limits,
they shouldn't affect the performance of your queries. Just FYI, this is
the scaling guide
<https://kudu.apache.org/docs/scaling_guide.html> with respect to various
system resources.

>
>    - Will having more masters and tservers improve my query time?
>
The master is not likely to be the bottleneck, as clients communicate
directly with the tserver for queries once they know which tserver to talk
to. But separating the master and tserver, which currently share the same
node, might help. This is the scale limitations
<https://kudu.apache.org/docs/known_issues.html#_scale> guide
for a rough estimate of the number of tservers required for a given quantity
of data.

*Environment Details:*
>
>    - Single node Kudu 1.7 master and tserver. Server has 4 vCPUs and 16
>    GB RAM.
>    - Table that I am querying is hash partitioned on the basis of an ID
>    with 3 buckets. It is also range partitioned on the basis of datetime with
>    a new partition for each month.
>    - Kafka version 1.1.
>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPU 4 GB RAM.
>
> --
> Faraz Mateen
>