Posted to user@kudu.apache.org by Quanlong Huang <hu...@126.com> on 2018/06/15 14:52:53 UTC

Why RowSet size is much smaller than flush_threshold_mb

Hi all,


I'm running Kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet server, I find that most compactions are compacting small files (~40MB each). For example:


I0615 07:22:42.637351 30614 tablet.cc:1661] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: stage 1 complete, picked 4 rowsets to compact
I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to compact:
I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size on disk: ~40666600 bytes)
I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size on disk: ~34720852 bytes)
I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size on disk: ~29914833 bytes)
I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size on disk: ~29007249 bytes)
I0615 07:22:42.637428 30614 tablet.cc:1447] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:54.762563 30614 tablet.cc:1532] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
I0615 07:22:54.773572 30614 tablet.cc:1587] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction Phase 2: carrying over any updates which arrived during Phase 1
I0615 07:22:54.773599 30614 tablet.cc:1589] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:55.189757 30614 tablet.cc:1631] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction successful on 82987 rows (123387929 bytes)
I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user 1.460s sys 0.410s
I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}


The flush_threshold_mb is set to the default value (1024). Wouldn't the flushed file size be ~1GB?


I think increasing the initial RowSet size can reduce compactions and thus reduce the impact on other ongoing operations. It may also improve flush performance. Is that right? If so, how can I increase the RowSet size?


I'd be grateful if someone could clarify these points for me!


Thanks,
Quanlong

Re:Re: Re: Re: Why RowSet size is much smaller than flush_threshold_mb

Posted by Quanlong Huang <hu...@126.com>.
No, I've had no success tuning other flags... That's why I started this thread...


I understand it's a trade-off whether to expose the design docs. Not exposing them keeps the documentation clearer. The downside is that users may bother you more when they encounter problems, since there are no answers they can find themselves. However, that's not a problem since you are all quite helpful :)


Thanks,
Quanlong


At 2018-08-02 10:18:00,"Todd Lipcon" <to...@cloudera.com> wrote:

On Wed, Aug 1, 2018 at 4:52 PM, Quanlong Huang <hu...@126.com> wrote:

In my experience, when I find performance is below my expectations, I like to tune the flags listed in https://kudu.apache.org/docs/configuration_reference.html , which requires a clear understanding of Kudu internals. Maybe we can add the link there?




Any particular flags that you found you had to tune? I almost never advise tuning anything other than the number of maintenance threads. If you have some good guidance on how tuning those flags can improve performance, maybe we can consider changing the defaults or giving some more prescriptive advice?


I'm a little nervous that saying "here are all the internals, and here are 100 config flags to study" will scare users more than help them :)


-Todd
 

At 2018-08-02 01:06:40,"Todd Lipcon" <to...@cloudera.com> wrote:

On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang <hu...@126.com> wrote:

Hi Todd and William,


I really appreciate your help, and I'm sorry for my late reply. I was going to respond with some follow-up questions but was assigned to focus on some other work... Now I'm back to this work.


The design docs are really helpful. Now I understand flush and compaction. I think we can add a link to these design docs on the Kudu documentation page, so users who want to dig deeper can learn more about Kudu internals.


Personally, since starting the project, I have had the philosophy that the user-facing documentation should remain simple and not discuss internals too much. I found in some other open source projects that there isn't a clear difference between user documentation and developer documentation, and users can easily get confused by all of the internal details. Or, users may start to believe that Kudu is very complex and they need to understand knapsack problem approximation algorithms in order to operate it. So, normally we try to avoid exposing too much of the details.


That said, I think it is a good idea to add a small note in the documentation somewhere that links to the design docs, maybe with some sentence explaining that understanding internals is not necessary to operate Kudu, but that expert users may find the internal design useful as a reference? I would be curious to hear what other users think about how best to make this trade-off.


-Todd
 
At 2018-06-15 23:41:17, "Todd Lipcon" <to...@cloudera.com> wrote:

Also, keep in mind that when the MRS flushes, it flushes into a bunch of separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by default). This is set by --budgeted_compaction_target_rowset_size
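
(As a minimal sketch: assuming the flag takes a size in bytes, consistent with the 32MB default mentioned above, overriding the roll size at tablet server startup might look like

    kudu-tserver --budgeted_compaction_target_rowset_size=67108864 ...

i.e. 64MB instead of the default 32MB. At the default, a ~1GB MemRowSet flush would roll into roughly 1024/32 = 32 DiskRowSets, which is consistent with the ~30-40MB rowsets in the logs above. Check the configuration reference for your version before relying on this.)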


However, increasing this size isn't likely to decrease the number of compactions, because each of these 32MB rowsets is non-overlapping. In other words, if your MRS contains rows A-Z, the output RowSets will include [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will never need to be compacted with each other. The net result, here, is that compaction becomes more fine-grained and only needs to operate on sub-ranges of the tablet where there is a lot of overlap.


You can read more about this in docs/design-docs/compaction-policy.md, in particular the section "Limiting RowSet Sizes"


Hope that helps
-Todd


On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wd...@gmail.com> wrote:

The op seen in the logs is a rowset compaction, which takes existing diskrowsets and rewrites them. It's not a flush, which writes data in memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset compaction is done to reduce the amount of overlap of rowsets in primary key space, i.e. reduce the number of rowsets that might need to be checked to enforce the primary key constraint or find a row. Having lots of rowset compaction indicates that rows are being written in a somewhat random order w.r.t the primary key order. Kudu will perform much better as writes scale when rows are inserted roughly in increasing order per tablet.


Also, because you are using the log block manager (the default and only one suitable for production deployments), there isn't a 1-1 relationship between cfiles or diskrowsets and files on the filesystem. Many cfiles and diskrowsets will be put together in a container file.


Config parameters that might be relevant here:
--maintenance_manager_num_threads
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)
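
(As a rough sketch of how these fit together, a hypothetical tablet server invocation might look like the following; the thread count and paths are placeholders, not recommendations:

    kudu-tserver \
      --maintenance_manager_num_threads=4 \
      --fs_data_dirs=/data/1/kudu,/data/2/kudu \
      --fs_wal_dir=/wal/kudu \
      ...

Putting --fs_wal_dir on a different device than --fs_data_dirs matters because, as the question above hints, WAL fsyncs and flush/compaction I/O can otherwise contend for the same disk.)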


The metrics from the compact row sets op indicate that the time is spent in fdatasync and in reading (likely reading the original rowsets). The overall compaction time is kinda long but not crazy long. What's the performance you are seeing and what is the performance you would like to see?
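
(Breaking those metrics down against the 12.628s wall time:

    fdatasync_us     = 9,617,174 us  ~ 9.6 s  (~76%)
    lbm_read_time_us = 1,288,971 us  ~ 1.3 s  (~10%)
    user + sys CPU   = 1.460 s + 0.410 s ~ 1.9 s

so the op spends most of its time waiting on fdatasync, i.e. on the storage device, rather than on CPU.)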


-Will


On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <hu...@126.com> wrote:

Hi all,


I'm running Kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet server, I find that most compactions are compacting small files (~40MB each). For example:


I0615 07:22:42.637351 30614 tablet.cc:1661] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: stage 1 complete, picked 4 rowsets to compact
I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to compact:
I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size on disk: ~40666600 bytes)
I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size on disk: ~34720852 bytes)
I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size on disk: ~29914833 bytes)
I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size on disk: ~29007249 bytes)
I0615 07:22:42.637428 30614 tablet.cc:1447] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:54.762563 30614 tablet.cc:1532] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
I0615 07:22:54.773572 30614 tablet.cc:1587] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction Phase 2: carrying over any updates which arrived during Phase 1
I0615 07:22:54.773599 30614 tablet.cc:1589] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:55.189757 30614 tablet.cc:1631] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction successful on 82987 rows (123387929 bytes)
I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user 1.460s sys 0.410s
I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}


The flush_threshold_mb is set to the default value (1024). Wouldn't the flushed file size be ~1GB?


I think increasing the initial RowSet size can reduce compactions and thus reduce the impact on other ongoing operations. It may also improve flush performance. Is that right? If so, how can I increase the RowSet size?


I'd be grateful if someone could clarify these points for me!


Thanks,
Quanlong







--

Todd Lipcon
Software Engineer, Cloudera





--

Todd Lipcon
Software Engineer, Cloudera





--

Todd Lipcon
Software Engineer, Cloudera

Re: Re: Re: Why RowSet size is much smaller than flush_threshold_mb

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Aug 1, 2018 at 4:52 PM, Quanlong Huang <hu...@126.com>
wrote:

> In my experience, when I find performance is below my expectations, I like
> to tune the flags listed in
> https://kudu.apache.org/docs/configuration_reference.html , which requires
> a clear understanding of Kudu internals. Maybe we can add the link there?
>
>
Any particular flags that you found you had to tune? I almost never advise
tuning anything other than the number of maintenance threads. If you have
some good guidance on how tuning those flags can improve performance, maybe
we can consider changing the defaults or giving some more prescriptive
advice?

I'm a little nervous that saying "here are all the internals, and here are
100 config flags to study" will scare users more than help them :)

-Todd


>
> At 2018-08-02 01:06:40,"Todd Lipcon" <to...@cloudera.com> wrote:
>
> On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang <hu...@126.com>
> wrote:
>
>> Hi Todd and William,
>>
>> I really appreciate your help, and I'm sorry for my late reply. I was
>> going to respond with some follow-up questions but was assigned to focus
>> on some other work... Now I'm back to this work.
>>
>> The design docs are really helpful. Now I understand flush and
>> compaction. I think we can add a link to these design docs on the Kudu
>> documentation page, so users who want to dig deeper can learn more about
>> Kudu internals.
>>
>
> Personally, since starting the project, I have had the philosophy that the
> user-facing documentation should remain simple and not discuss internals
> too much. I found in some other open source projects that there isn't a
> clear difference between user documentation and developer documentation,
> and users can easily get confused by all of the internal details. Or, users
> may start to believe that Kudu is very complex and they need to understand
> knapsack problem approximation algorithms in order to operate it. So,
> normally we try to avoid exposing too much of the details.
>
> That said, I think it is a good idea to add a small note in the
> documentation somewhere that links to the design docs, maybe with some
> sentence explaining that understanding internals is not necessary to
> operate Kudu, but that expert users may find the internal design useful as
> a reference? I would be curious to hear what other users think about how
> best to make this trade-off.
>
> -Todd
>
>
>> At 2018-06-15 23:41:17, "Todd Lipcon" <to...@cloudera.com> wrote:
>>
>> Also, keep in mind that when the MRS flushes, it flushes into a bunch of
>> separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by
>> default). This is set by --budgeted_compaction_target_rowset_size
>>
>> However, increasing this size isn't likely to decrease the number of
>> compactions, because each of these 32MB rowsets is non-overlapping. In
>> other words, if your MRS contains rows A-Z, the output RowSets will include
>> [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will
>> never need to be compacted with each other. The net result, here, is that
>> compaction becomes more fine-grained and only needs to operate on
>> sub-ranges of the tablet where there is a lot of overlap.
>>
>> You can read more about this in docs/design-docs/compaction-policy.md,
>> in particular the section "Limiting RowSet Sizes"
>>
>> Hope that helps
>> -Todd
>>
>> On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wd...@gmail.com>
>> wrote:
>>
>>> The op seen in the logs is a rowset compaction, which takes existing
>>> diskrowsets and rewrites them. It's not a flush, which writes data in
>>> memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset
>>> compaction is done to reduce the amount of overlap of rowsets in primary
>>> key space, i.e. reduce the number of rowsets that might need to be checked
>>> to enforce the primary key constraint or find a row. Having lots of rowset
>>> compaction indicates that rows are being written in a somewhat random order
>>> w.r.t the primary key order. Kudu will perform much better as writes scale
>>> when rows are inserted roughly in increasing order per tablet.
>>>
>>> Also, because you are using the log block manager (the default and only
>>> one suitable for production deployments), there isn't a 1-1 relationship
>>> between cfiles or diskrowsets and files on the filesystem. Many cfiles and
>>> diskrowsets will be put together in a container file.
>>>
>>> Config parameters that might be relevant here:
>>> --maintenance_manager_num_threads
>>> --fs_data_dirs (how many)
>>> --fs_wal_dir (is it shared on a device with the data dir?)
>>>
>>> The metrics from the compact row sets op indicate that the time is spent in
>>> fdatasync and in reading (likely reading the original rowsets). The overall
>>> compaction time is kinda long but not crazy long. What's the performance
>>> you are seeing and what is the performance you would like to see?
>>>
>>> -Will
>>>
>>> On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <hu...@126.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm running Kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet
>>>> server, I find that most compactions are compacting small files (~40MB
>>>> each). For example:
>>>>
>>>> I0615 07:22:42.637351 30614 tablet.cc:1661] T
>>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>>> Compaction: stage 1 complete, picked 4 rowsets to compact
>>>> I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to
>>>> compact:
>>>> I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current
>>>> size on disk: ~40666600 bytes)
>>>> I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current
>>>> size on disk: ~34720852 bytes)
>>>> I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current
>>>> size on disk: ~29914833 bytes)
>>>> I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current
>>>> size on disk: ~29007249 bytes)
>>>> I0615 07:22:42.637428 30614 tablet.cc:1447] T
>>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>>> Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot:
>>>> MvccSnapshot[committed={T|T < 6263071556616208384 or (T in
>>>> {6263071556616208384})}]
>>>> I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile
>>>> writers for 124 column(s)
>>>> I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile
>>>> writers for 124 column(s)
>>>> I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile
>>>> writers for 124 column(s)
>>>> I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile
>>>> writers for 124 column(s)
>>>> I0615 07:22:54.762563 30614 tablet.cc:1532] T
>>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>>> Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
>>>> I0615 07:22:54.773572 30614 tablet.cc:1587] T
>>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>>> Compaction Phase 2: carrying over any updates which arrived during Phase 1
>>>> I0615 07:22:54.773599 30614 tablet.cc:1589] T
>>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>>> Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T
>>>> in {6263071556616208384})}]
>>>> I0615 07:22:55.189757 30614 tablet.cc:1631] T
>>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>>> Compaction successful on 82987 rows (123387929 bytes)
>>>> I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent
>>>> running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real
>>>> 12.628s user 1.460s sys 0.410s
>>>> I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P
>>>> 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba)
>>>> metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfi
>>>> le_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data
>>>> dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data
>>>> dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us"
>>>> :9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms
>>>> <https://maps.google.com/?q=1-10_ms+:+32&entry=gmail&source=g>":32,"
>>>> lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_t
>>>> ime_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"
>>>> spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":76
>>>> 8,"thread_start_us":677,"threads_started":14,"wal-append.
>>>> queue_time_us":300}
>>>>
>>>> The flush_threshold_mb is set to the default value (1024). Wouldn't the
>>>> flushed file size be ~1GB?
>>>>
>>>> I think increasing the initial RowSet size can reduce compactions and
>>>> thus reduce the impact on other ongoing operations. It may also improve
>>>> flush performance. Is that right? If so, how can I increase the RowSet size?
>>>>
>>>> I'd be grateful if someone could clarify these points for me!
>>>>
>>>> Thanks,
>>>> Quanlong
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re:Re: Re: Why RowSet size is much smaller than flush_threshold_mb

Posted by Quanlong Huang <hu...@126.com>.
In my experience, when I find performance is below my expectations, I like to tune the flags listed in https://kudu.apache.org/docs/configuration_reference.html , which requires a clear understanding of Kudu internals. Maybe we can add the link there?


At 2018-08-02 01:06:40,"Todd Lipcon" <to...@cloudera.com> wrote:

On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang <hu...@126.com> wrote:

Hi Todd and William,


I really appreciate your help, and I'm sorry for my late reply. I was going to respond with some follow-up questions but was assigned to focus on some other work... Now I'm back to this work.


The design docs are really helpful. Now I understand flush and compaction. I think we can add a link to these design docs on the Kudu documentation page, so users who want to dig deeper can learn more about Kudu internals.


Personally, since starting the project, I have had the philosophy that the user-facing documentation should remain simple and not discuss internals too much. I found in some other open source projects that there isn't a clear difference between user documentation and developer documentation, and users can easily get confused by all of the internal details. Or, users may start to believe that Kudu is very complex and they need to understand knapsack problem approximation algorithms in order to operate it. So, normally we try to avoid exposing too much of the details.


That said, I think it is a good idea to add a small note in the documentation somewhere that links to the design docs, maybe with some sentence explaining that understanding internals is not necessary to operate Kudu, but that expert users may find the internal design useful as a reference? I would be curious to hear what other users think about how best to make this trade-off.


-Todd
 
At 2018-06-15 23:41:17, "Todd Lipcon" <to...@cloudera.com> wrote:

Also, keep in mind that when the MRS flushes, it flushes into a bunch of separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by default). This is set by --budgeted_compaction_target_rowset_size


However, increasing this size isn't likely to decrease the number of compactions, because each of these 32MB rowsets is non-overlapping. In other words, if your MRS contains rows A-Z, the output RowSets will include [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will never need to be compacted with each other. The net result, here, is that compaction becomes more fine-grained and only needs to operate on sub-ranges of the tablet where there is a lot of overlap.


You can read more about this in docs/design-docs/compaction-policy.md, in particular the section "Limiting RowSet Sizes"


Hope that helps
-Todd


On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wd...@gmail.com> wrote:

The op seen in the logs is a rowset compaction, which takes existing diskrowsets and rewrites them. It's not a flush, which writes data in memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset compaction is done to reduce the amount of overlap of rowsets in primary key space, i.e. reduce the number of rowsets that might need to be checked to enforce the primary key constraint or find a row. Having lots of rowset compaction indicates that rows are being written in a somewhat random order w.r.t the primary key order. Kudu will perform much better as writes scale when rows are inserted roughly in increasing order per tablet.


Also, because you are using the log block manager (the default and only one suitable for production deployments), there isn't a 1-1 relationship between cfiles or diskrowsets and files on the filesystem. Many cfiles and diskrowsets will be put together in a container file.


Config parameters that might be relevant here:
--maintenance_manager_num_threads
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)


The metrics from the compact row sets op indicate that the time is spent in fdatasync and in reading (likely reading the original rowsets). The overall compaction time is kinda long but not crazy long. What's the performance you are seeing and what is the performance you would like to see?


-Will


On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <hu...@126.com> wrote:

Hi all,


I'm running Kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet server, I find that most compactions are compacting small files (~40MB each). For example:


I0615 07:22:42.637351 30614 tablet.cc:1661] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: stage 1 complete, picked 4 rowsets to compact
I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to compact:
I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size on disk: ~40666600 bytes)
I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size on disk: ~34720852 bytes)
I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size on disk: ~29914833 bytes)
I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size on disk: ~29007249 bytes)
I0615 07:22:42.637428 30614 tablet.cc:1447] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:54.762563 30614 tablet.cc:1532] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
I0615 07:22:54.773572 30614 tablet.cc:1587] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction Phase 2: carrying over any updates which arrived during Phase 1
I0615 07:22:54.773599 30614 tablet.cc:1589] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:55.189757 30614 tablet.cc:1631] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction successful on 82987 rows (123387929 bytes)
I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user 1.460s sys 0.410s
I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}


The flush_threshold_mb is set to the default value (1024). Wouldn't the flushed file size be ~1GB?


I think increasing the initial RowSet size can reduce compactions and thus reduce the impact on other ongoing operations. It may also improve flush performance. Is that right? If so, how can I increase the RowSet size?


I'd be grateful if someone could clarify these points for me!


Thanks,
Quanlong







--

Todd Lipcon
Software Engineer, Cloudera





--

Todd Lipcon
Software Engineer, Cloudera

Re: Re: Why RowSet size is much smaller than flush_threshold_mb

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Aug 1, 2018 at 6:28 AM, Quanlong Huang <hu...@126.com>
wrote:

> Hi Todd and William,
>
> I really appreciate your help, and I'm sorry for my late reply. I was
> going to respond with some follow-up questions but was assigned to focus
> on some other work... Now I'm back to this work.
>
> The design docs are really helpful. Now I understand flush and
> compaction. I think we can add a link to these design docs on the Kudu
> documentation page, so users who want to dig deeper can learn more about
> Kudu internals.
>

Personally, since starting the project, I have had the philosophy that the
user-facing documentation should remain simple and not discuss internals
too much. I found in some other open source projects that there isn't a
clear difference between user documentation and developer documentation,
and users can easily get confused by all of the internal details. Or, users
may start to believe that Kudu is very complex and they need to understand
knapsack problem approximation algorithms in order to operate it. So,
normally we try to avoid exposing too much of the details.

That said, I think it is a good idea to add a small note in the
documentation somewhere that links to the design docs, maybe with some
sentence explaining that understanding internals is not necessary to
operate Kudu, but that expert users may find the internal design useful as
a reference? I would be curious to hear what other users think about how
best to make this trade-off.

-Todd


> At 2018-06-15 23:41:17, "Todd Lipcon" <to...@cloudera.com> wrote:
>
> Also, keep in mind that when the MRS flushes, it flushes into a bunch of
> separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by
> default). This is set by --budgeted_compaction_target_rowset_size
>
> However, increasing this size isn't likely to decrease the number of
> compactions, because each of these 32MB rowsets is non-overlapping. In
> other words, if your MRS contains rows A-Z, the output RowSets will include
> [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will
> never need to be compacted with each other. The net result, here, is that
> compaction becomes more fine-grained and only needs to operate on
> sub-ranges of the tablet where there is a lot of overlap.
>
> You can read more about this in docs/design-docs/compaction-policy.md, in
> particular the section "Limiting RowSet Sizes"
>
> Hope that helps
> -Todd
>
> On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wd...@gmail.com>
> wrote:
>
>> The op seen in the logs is a rowset compaction, which takes existing
>> diskrowsets and rewrites them. It's not a flush, which writes data in
>> memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset
>> compaction is done to reduce the amount of overlap of rowsets in primary
>> key space, i.e. reduce the number of rowsets that might need to be checked
>> to enforce the primary key constraint or find a row. Having lots of rowset
>> compaction indicates that rows are being written in a somewhat random order
>> w.r.t the primary key order. Kudu will perform much better as writes scale
>> when rows are inserted roughly in increasing order per tablet.
>>
>> Also, because you are using the log block manager (the default and only
>> one suitable for production deployments), there isn't a 1-1 relationship
>> between cfiles or diskrowsets and files on the filesystem. Many cfiles and
>> diskrowsets will be put together in a container file.
>>
>> Config parameters that might be relevant here:
>> --maintenance_manager_num_threads
>> --fs_data_dirs (how many)
>> --fs_wal_dir (is it shared on a device with the data dir?)
>>
>> The metrics from the compact row sets op indicate that the time is spent in
>> fdatasync and in reading (likely reading the original rowsets). The overall
>> compaction time is kinda long but not crazy long. What's the performance
>> you are seeing and what is the performance you would like to see?
>>
>> -Will
>>
>> On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <hu...@126.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm running Kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet
>>> server, I find that most compactions are compacting small files (~40MB
>>> each). For example:
>>>
>>> I0615 07:22:42.637351 30614 tablet.cc:1661] T
>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>> Compaction: stage 1 complete, picked 4 rowsets to compact
>>> I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to
>>> compact:
>>> I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size
>>> on disk: ~40666600 bytes)
>>> I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current
>>> size on disk: ~34720852 bytes)
>>> I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current
>>> size on disk: ~29914833 bytes)
>>> I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current
>>> size on disk: ~29007249 bytes)
>>> I0615 07:22:42.637428 30614 tablet.cc:1447] T
>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>> Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot:
>>> MvccSnapshot[committed={T|T < 6263071556616208384 or (T in
>>> {6263071556616208384})}]
>>> I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile
>>> writers for 124 column(s)
>>> I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile
>>> writers for 124 column(s)
>>> I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile
>>> writers for 124 column(s)
>>> I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile
>>> writers for 124 column(s)
>>> I0615 07:22:54.762563 30614 tablet.cc:1532] T
>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>> Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
>>> I0615 07:22:54.773572 30614 tablet.cc:1587] T
>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>> Compaction Phase 2: carrying over any updates which arrived during Phase 1
>>> I0615 07:22:54.773599 30614 tablet.cc:1589] T
>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>> Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T
>>> in {6263071556616208384})}]
>>> I0615 07:22:55.189757 30614 tablet.cc:1631] T
>>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>>> Compaction successful on 82987 rows (123387929 bytes)
>>> I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent
>>> running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user
>>> 1.460s sys 0.410s
>>> I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P
>>> 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba)
>>> metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfi
>>> le_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data
>>> dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data
>>> dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us"
>>> :9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms
>>> <https://maps.google.com/?q=1-10_ms+:+32&entry=gmail&source=g>":32,"
>>> lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_t
>>> ime_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"
>>> spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":
>>> 768,"thread_start_us":677,"threads_started":14,"wal-appen
>>> d.queue_time_us":300}
>>>
>>> The flush_threshold_mb is set to the default value (1024). Wouldn't the
>>> flushed file size be ~1GB?
>>>
>>> I think increasing the initial RowSet size can reduce compactions and
>>> thus reduce the impact on other ongoing operations. It may also improve
>>> flush performance. Is that right? If so, how can I increase the RowSet size?
>>>
>>> I'd be grateful if someone could clarify these points for me!
>>>
>>> Thanks,
>>> Quanlong
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re:Re: Why RowSet size is much smaller than flush_threshold_mb

Posted by Quanlong Huang <hu...@126.com>.
Hi Todd and William,


I really appreciate your help, and I'm sorry for my late reply. I was going to respond with some follow-up questions but was assigned to focus on some other work... Now I'm back to this work.


The design docs are really helpful. Now I understand flush and compaction. I think we can add a link to these design docs on the Kudu documentation page, so users who want to dig deeper can learn more about Kudu internals.


Thanks,
Quanlong

At 2018-06-15 23:41:17, "Todd Lipcon" <to...@cloudera.com> wrote:

Also, keep in mind that when the MRS flushes, it flushes into a bunch of separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by default). This is set by --budgeted_compaction_target_rowset_size


However, increasing this size isn't likely to decrease the number of compactions, because each of these 32MB rowsets is non-overlapping. In other words, if your MRS contains rows A-Z, the output RowSets will include [A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will never need to be compacted with each other. The net result, here, is that compaction becomes more fine-grained and only needs to operate on sub-ranges of the tablet where there is a lot of overlap.


You can read more about this in docs/design-docs/compaction-policy.md, in particular the section "Limiting RowSet Sizes"


Hope that helps
-Todd


On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wd...@gmail.com> wrote:

The op seen in the logs is a rowset compaction, which takes existing diskrowsets and rewrites them. It's not a flush, which writes data in memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset compaction is done to reduce the amount of overlap of rowsets in primary key space, i.e. reduce the number of rowsets that might need to be checked to enforce the primary key constraint or find a row. Having lots of rowset compaction indicates that rows are being written in a somewhat random order w.r.t the primary key order. Kudu will perform much better as writes scale when rows are inserted roughly in increasing order per tablet.


Also, because you are using the log block manager (the default and only one suitable for production deployments), there isn't a 1-1 relationship between cfiles or diskrowsets and files on the filesystem. Many cfiles and diskrowsets will be put together in a container file.


Config parameters that might be relevant here:
--maintenance_manager_num_threads
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)


The metrics from the compact row sets op indicate that the time is spent in fdatasync and in reading (likely reading the original rowsets). The overall compaction time is kinda long but not crazy long. What's the performance you are seeing and what is the performance you would like to see?


-Will


On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <hu...@126.com> wrote:

Hi all,


I'm running Kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet server, I find that most compactions are compacting small files (~40MB each). For example:


I0615 07:22:42.637351 30614 tablet.cc:1661] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: stage 1 complete, picked 4 rowsets to compact
I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to compact:
I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size on disk: ~40666600 bytes)
I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size on disk: ~34720852 bytes)
I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size on disk: ~29914833 bytes)
I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size on disk: ~29007249 bytes)
I0615 07:22:42.637428 30614 tablet.cc:1447] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile writers for 124 column(s)
I0615 07:22:54.762563 30614 tablet.cc:1532] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
I0615 07:22:54.773572 30614 tablet.cc:1587] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction Phase 2: carrying over any updates which arrived during Phase 1
I0615 07:22:54.773599 30614 tablet.cc:1589] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T in {6263071556616208384})}]
I0615 07:22:55.189757 30614 tablet.cc:1631] T 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7: Compaction successful on 82987 rows (123387929 bytes)
I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user 1.460s sys 0.410s
I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}


The flush_threshold_mb is set to the default value (1024). Wouldn't the flushed file size be ~1GB?


I think increasing the initial RowSet size can reduce compactions and thus reduce the impact on other ongoing operations. It may also improve flush performance. Is that right? If so, how can I increase the RowSet size?


I'd be grateful if someone could clarify these points for me!


Thanks,
Quanlong







--

Todd Lipcon
Software Engineer, Cloudera

Re: Why RowSet size is much smaller than flush_threshold_mb

Posted by Todd Lipcon <to...@cloudera.com>.
Also, keep in mind that when the MRS flushes, it flushes into a bunch of
separate RowSets, not 1:1. It "rolls" to a new RowSet every N MB (N=32 by
default). This is set by --budgeted_compaction_target_rowset_size

However, increasing this size isn't likely to decrease the number of
compactions, because each of these 32MB rowsets is non-overlapping. In
other words, if your MRS contains rows A-Z, the output RowSets will include
[A-C], [D-G], [H-P], [Q-Z]. Since these ranges do not overlap, they will
never need to be compacted with each other. The net result, here, is that
compaction becomes more fine-grained and only needs to operate on
sub-ranges of the tablet where there is a lot of overlap.

You can read more about this in docs/design-docs/compaction-policy.md, in
particular the section "Limiting RowSet Sizes"

Hope that helps
-Todd

On Fri, Jun 15, 2018 at 8:26 AM, William Berkeley <wd...@gmail.com>
wrote:

> The op seen in the logs is a rowset compaction, which takes existing
> diskrowsets and rewrites them. It's not a flush, which writes data in
> memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset
> compaction is done to reduce the amount of overlap of rowsets in primary
> key space, i.e. reduce the number of rowsets that might need to be checked
> to enforce the primary key constraint or find a row. Having lots of rowset
> compaction indicates that rows are being written in a somewhat random order
> w.r.t the primary key order. Kudu will perform much better as writes scale
> when rows are inserted roughly in increasing order per tablet.
>
> Also, because you are using the log block manager (the default and only
> one suitable for production deployments), there isn't a 1-1 relationship
> between cfiles or diskrowsets and files on the filesystem. Many cfiles and
> diskrowsets will be put together in a container file.
>
> Config parameters that might be relevant here:
> --maintenance_manager_num_threads
> --fs_data_dirs (how many)
> --fs_wal_dir (is it shared on a device with the data dir?)
>
> The metrics from the compact row sets op indicate that the time is spent in
> fdatasync and in reading (likely reading the original rowsets). The overall
> compaction time is kinda long but not crazy long. What's the performance
> you are seeing and what is the performance you would like to see?
>
> -Will
>
> On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <hu...@126.com>
> wrote:
>
>> Hi all,
>>
>> I'm running Kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet
>> server, I find that most compactions are compacting small files (~40MB
>> each). For example:
>>
>> I0615 07:22:42.637351 30614 tablet.cc:1661] T
>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>> Compaction: stage 1 complete, picked 4 rowsets to compact
>> I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to
>> compact:
>> I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size
>> on disk: ~40666600 bytes)
>> I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size
>> on disk: ~34720852 bytes)
>> I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size
>> on disk: ~29914833 bytes)
>> I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size
>> on disk: ~29007249 bytes)
>> I0615 07:22:42.637428 30614 tablet.cc:1447] T
>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>> Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot:
>> MvccSnapshot[committed={T|T < 6263071556616208384 or (T in
>> {6263071556616208384})}]
>> I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile
>> writers for 124 column(s)
>> I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile
>> writers for 124 column(s)
>> I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile
>> writers for 124 column(s)
>> I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile
>> writers for 124 column(s)
>> I0615 07:22:54.762563 30614 tablet.cc:1532] T
>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>> Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
>> I0615 07:22:54.773572 30614 tablet.cc:1587] T
>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>> Compaction Phase 2: carrying over any updates which arrived during Phase 1
>> I0615 07:22:54.773599 30614 tablet.cc:1589] T
>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>> Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T
>> in {6263071556616208384})}]
>> I0615 07:22:55.189757 30614 tablet.cc:1631] T
>> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
>> Compaction successful on 82987 rows (123387929 bytes)
>> I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent
>> running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user
>> 1.460s sys 0.410s
>> I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P
>> 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba)
>> metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfi
>> le_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data
>> dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data
>> dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us"
>> :9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms
>> <https://maps.google.com/?q=1-10_ms+:+32&entry=gmail&source=g>":32,"
>> lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_
>> time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":
>> 25,"spinlock_wait_cycles":155264,"tcmalloc_contention_
>> cycles":768,"thread_start_us":677,"threads_started":14,"wal-
>> append.queue_time_us":300}
>>
>> The flush_threshold_mb is set to the default value (1024). Wouldn't the
>> flushed file size be ~1GB?
>>
>> I think increasing the initial RowSet size can reduce compactions and
>> thus reduce the impact on other ongoing operations. It may also improve
>> flush performance. Is that right? If so, how can I increase the RowSet size?
>>
>> I'd be grateful if someone could clarify these points for me!
>>
>> Thanks,
>> Quanlong
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Why RowSet size is much smaller than flush_threshold_mb

Posted by William Berkeley <wd...@gmail.com>.
The op seen in the logs is a rowset compaction, which takes existing
diskrowsets and rewrites them. It's not a flush, which writes data in
memory to disk, so I don't think the flush_threshold_mb is relevant. Rowset
compaction is done to reduce the amount of overlap of rowsets in primary
key space, i.e. reduce the number of rowsets that might need to be checked
to enforce the primary key constraint or find a row. Having lots of rowset
compaction indicates that rows are being written in a somewhat random order
w.r.t the primary key order. Kudu will perform much better as writes scale
when rows are inserted roughly in increasing order per tablet.

Also, because you are using the log block manager (the default and only one
suitable for production deployments), there isn't a 1-1 relationship
between cfiles or diskrowsets and files on the filesystem. Many cfiles and
diskrowsets will be put together in a container file.

Config parameters that might be relevant here:
--maintenance_manager_num_threads
--fs_data_dirs (how many)
--fs_wal_dir (is it shared on a device with the data dir?)

The metrics from the compact row sets op indicate that the time is spent in
fdatasync and in reading (likely reading the original rowsets). The overall
compaction time is kinda long but not crazy long. What's the performance
you are seeing and what is the performance you would like to see?

-Will

On Fri, Jun 15, 2018 at 7:52 AM, Quanlong Huang <hu...@126.com>
wrote:

> Hi all,
>
> I'm running Kudu 1.6.0-cdh5.14.2. When looking into the logs of a tablet
> server, I find that most compactions are compacting small files (~40MB
> each). For example:
>
> I0615 07:22:42.637351 30614 tablet.cc:1661] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction: stage 1 complete, picked 4 rowsets to compact
> I0615 07:22:42.637385 30614 compaction.cc:903] Selected 4 rowsets to
> compact:
> I0615 07:22:42.637393 30614 compaction.cc:906] RowSet(343)(current size
> on disk: ~40666600 bytes)
> I0615 07:22:42.637401 30614 compaction.cc:906] RowSet(1563)(current size
> on disk: ~34720852 bytes)
> I0615 07:22:42.637408 30614 compaction.cc:906] RowSet(1645)(current size
> on disk: ~29914833 bytes)
> I0615 07:22:42.637415 30614 compaction.cc:906] RowSet(1870)(current size
> on disk: ~29007249 bytes)
> I0615 07:22:42.637428 30614 tablet.cc:1447] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction: entering phase 1 (flushing snapshot). Phase 1 snapshot:
> MvccSnapshot[committed={T|T < 6263071556616208384 or (T in
> {6263071556616208384})}]
> I0615 07:22:42.641582 30614 multi_column_writer.cc:103] Opened CFile
> writers for 124 column(s)
> I0615 07:22:43.875396 30614 multi_column_writer.cc:103] Opened CFile
> writers for 124 column(s)
> I0615 07:22:44.418421 30614 multi_column_writer.cc:103] Opened CFile
> writers for 124 column(s)
> I0615 07:22:45.114389 30614 multi_column_writer.cc:103] Opened CFile
> writers for 124 column(s)
> I0615 07:22:54.762563 30614 tablet.cc:1532] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction: entering phase 2 (starting to duplicate updates in new rowsets)
> I0615 07:22:54.773572 30614 tablet.cc:1587] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction Phase 2: carrying over any updates which arrived during Phase 1
> I0615 07:22:54.773599 30614 tablet.cc:1589] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Phase 2 snapshot: MvccSnapshot[committed={T|T < 6263071556616208384 or (T
> in {6263071556616208384})}]
> I0615 07:22:55.189757 30614 tablet.cc:1631] T
> 6bdefb8c27764a0597dcf98ee1b450ba P 70f3e54fe0f3490cbf0371a6830a33a7:
> Compaction successful on 82987 rows (123387929 bytes)
> I0615 07:22:55.191426 30614 maintenance_manager.cc:491] Time spent
> running CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba): real 12.628s user
> 1.460s sys 0.410s
> I0615 07:22:55.191484 30614 maintenance_manager.cc:497] P 70f3e54fe0f3490cbf0371a6830a33a7: CompactRowSetsOp(6bdefb8c27764a0597dcf98ee1b450ba) metrics: {"cfile_cache_hit":812,"cfile_cache_hit_bytes":16840376,"cfile_cache_miss":2730,"cfile_cache_miss_bytes":251298442,"cfile_init":496,"data dirs.queue_time_us":6646,"data dirs.run_cpu_time_us":2188,"data dirs.run_wall_time_us":101717,"fdatasync":315,"fdatasync_us":9617174,"lbm_read_time_us":1288971,"lbm_reads_1-10_ms":32,"lbm_reads_10-100_ms":41,"lbm_reads_lt_1ms":4641,"lbm_write_time_us":122520,"lbm_writes_lt_1ms":2799,"mutex_wait_us":25,"spinlock_wait_cycles":155264,"tcmalloc_contention_cycles":768,"thread_start_us":677,"threads_started":14,"wal-append.queue_time_us":300}
>
> The flush_threshold_mb is set to the default value (1024). Wouldn't the
> flushed file size be ~1GB?
>
> I think increasing the initial RowSet size can reduce compactions and thus
> reduce the impact on other ongoing operations. It may also improve flush
> performance. Is that right? If so, how can I increase the RowSet size?
>
> I'd be grateful if someone could clarify these points for me!
>
> Thanks,
> Quanlong
>