You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@kudu.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2020/03/07 14:52:50 UTC

Re: Partitioning Rules of Thumb

hey guys,

I asked the same question on Slack on and got no responses. I just went
through the docs and design doc and FAQ and still did not find an answer.

Can someone comment?

Maybe I was not asking a clear question. If my cluster is large enough in
my example above, should I go with 3, 9 or 18 tablets? or should I pick
tablets to be closer to 1Gb?

And a follow-up question, if I have tons of smaller tables under 5 million
rows, should I just use 1 partition or still break them on smaller tablets
for concurrency?

We cannot pick the partitioning strategy for each table as we need to
stream 100s of tables and we use PK from RBDMS and need to come with an
automated way to pick number of partitions/tablets. So far I was using 1Gb
rule but rethinking this now for another project.

On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> forgot to post results of my quick test:
>
>  Kudu 1.5
>
> Table takes 18Gb of disk space after 3x replication
>
> Tablets Tablet Size Query run time, sec
> 3 2Gb 65
> 9 700Mb 27
> 18 350Mb 17
> [image: image.png]
>
>
> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Hi guys,
>>
>> just want to clarify recommendations from the doc. It says:
>>
>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>
>> Partitioning Rules of Thumb
>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>
>>    -
>>
>>    For large tables, such as fact tables, aim for as many tablets as you
>>    have cores in the cluster.
>>    -
>>
>>    For small tables, such as dimension tables, ensure that each tablet
>>    is at least 1 GB in size.
>>
>> In general, be mindful the number of tablets limits the parallelism of
>> reads, in the current implementation. Increasing the number of tablets
>> significantly beyond the number of cores is likely to have diminishing
>> returns.
>>
>>
>> I've read this a few times but I am not sure I understand it correctly.
>> Let me use concrete example.
>>
>> If a table ends up taking 18Gb after replication (so with 3x replication
>> it is ~9Gb per tablet if I do not partition), should I aim for 1Gb tablets
>> (6 tablets before replication) or should I aim for 500Mb tablets if my
>> cluster capacity allows so (12 tablets before replication)? confused why
>> they say "at least" not "at most" - does it mean I should design it so a
>> tablet takes 2Gb or 3Gb in this example?
>>
>> Assume that I have tons of CPU cores on a cluster...
>> Based on my quick test, it seems that queries are faster if I have more
>> tablets/partitions...In this example, 18 tablets gave me the best timing
>> but tablet size was around 300-400Mb. But the doc says "at least 1Gb".
>>
>> Really confused what the doc is saying, please help
>>
>> Boris
>>
>>

Re: Partitioning Rules of Thumb

Posted by Cliff Resnick <cr...@gmail.com>.

My company actually uses CDP, but for the type of virtualized deployments
my team does a simple Apache build of Impala and Kudu works best. We have
an automated build script that we run every now and then to build on Amazon
Linux. We do however have to tweak it every time we build new versions.

On Sat, Apr 25, 2020, 9:10 PM Cliff Resnick <cr...@gmail.com> wrote:

> That's pretty much it. Every now and then we get notice an instance is
> scheduled for retirement, or maybe just goes flakey on its own, and we swap
> it out. 3x replication has been plenty for us, and though we regularly back
> up to S3 we've never had to rebuild due to failure. In my limited
> experience disk failure is not a much of a thing with NVMe, and one of the
> nice things about renting versus owning is we don't worry about things like
> write amplification. We also co-locate a couple of masters, but their tiny
> data goes to EBS and survives reboots. But then our cluster must be much
> smaller than yours because rebalancing takes about a half an hour at most.
>
> On Sat, Apr 25, 2020, 3:20 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> Cliff, I think i got the idea - you basically use a bunch of temp servers
>> that can be evicted any time and each has attached temporary nvme drives -
>> very clever! I am surprised it works well with Kudu as our experience with
>> our on-prem cluster that Kudu is extremely sensitive to disk failures and
>> random reboots. at one point we had a situation when 2 out of 3 replicas
>> were down for a while and it took us quite a bit of time to recover and we
>> had to use kudu cli to evict some tablets manually.
>>
>> Also manual rebalancing is in order often and it takes a lot of time
>> (many hours).
>>
>> On Sat, Apr 25, 2020 at 2:54 PM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> Cliff, i would be extremely interested to see a blog post to compare
>>> Snowflake, Redshift and Impala/Kudu since you tried all of them.
>>>
>>> would love to get some details how you set up Kudu/Impala cluster on AWS
>>> as well as my company might be heading the same direction. this does not
>>> mean much to me as we are not using cloud but I hope you can elaborate on
>>> your setup "12 hash x a12 (month) ranges in two 12-node clusters fronted by
>>> a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>> use "for free" instance-store NVMe. We also have associated
>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>> compute oriented queries. "
>>>
>>> I cannot really find anything on the web that would compare Impala/Kudu
>>> to Snowflake and Redshift. Everything I see is about Snowflake, Redshift
>>> and BigQuery.
>>>
>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>>>
>>>> This is a good conversation but I don't think the comparison with
>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>> closed source, it seems pretty clear the architectures are quite different.
>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>> make Kudu valuable for some use cases.
>>>>
>>>> It also seems to me that their intended workloads are quite different.
>>>> Snowflake is great for intensive analytics on demand, and can handle deeply
>>>> nested data very well, where Kudu can't handle that at all. Snowflake is
>>>> not designed for heavy concurrency, but complex query plans for a small
>>>> group of users. If you select an x-large Snowflake cluster it's probably
>>>> because you have a large amount of data to churn through, not because you
>>>> have a large number of users. Or, at least that's how we used it.
>>>>
>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>> use "for free" instance-store NVMe. We also have associated
>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>> compute oriented queries.
>>>>
>>>> The net result blows away what we had with RedShift at less than 1/3
>>>> the cost, with performance improvements mostly from better concurrency
>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>> also use streaming ingestion which, aside from being impossible with
>>>> RedShift, removes the added cost of staging.
>>>>
>>>> Getting back to Snowflake, there's no way we could use it the same way
>>>> we use Kudu, and even if we could, the cost would would probably put us out
>>>> of business!
>>>>
>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> thanks Andrew for taking your time responding to me. It seems that
>>>>> there are no exact recommendations.
>>>>>
>>>>> I did look at scaling recommendations but that math is extremely
>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>> into that calculation. We have no control really what users are doing, how
>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>
>>>>> I do like the Snowflake approach than the engine takes care of
>>>>> defaults and can estimate the number of micro-partitions and even
>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>> the design is very similar. I really do not like to pick random number of
>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>> one to make these decisions. Does it make sense?
>>>>>
>>>>>
>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>>>
>>>>>> Hey Boris,
>>>>>>
>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>
>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>> tablets for concurrency?
>>>>>>
>>>>>>
>>>>>> Per your numbers, this confirms that the partitions are the units of
>>>>>> concurrency here, and that therefore having more and having smaller
>>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>>> total number of partitions cluster-wide.
>>>>>>
>>>>>> There are some trade offs with the replica count per tablet server
>>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>
>>>>>> Your point about scan concurrency is certainly a valid one -- there
>>>>>> have been patches for other integrations that have tackled this to decouple
>>>>>> partitioning from scan concurrency (KUDU-2437
>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>>> its way into Impala yet). I filed KUDU-3071
>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>>> be worked into Impala.
>>>>>>
>>>>>> For now, I would try to take into account the total sum of resources
>>>>>> you have available to Kudu (including number of tablet servers, amount of
>>>>>> storage per node, number of disks per tablet server, type of disk for the
>>>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>>>> system can handle (the scaling guide
>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>>> guide how you partition your tables.
>>>>>>
>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>
>>>>>>
>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>>>> your workload. As you show and as pointed out in documentation, larger and
>>>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>
>>>>>> It is recommended that new tables which are expected to have heavy
>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>
>>>>>>
>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>> (divide by 3 because of replication)?
>>>>>>
>>>>>>
>>>>>> The recommendation here is to have at least 20 logical partitions per
>>>>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>>>>> scan would be broken up into 20 tablet scans, and each of those might land
>>>>>> on a different tablet server running on isolated hardware. For a
>>>>>> significantly larger table into which you expect highly concurrent
>>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>>> consider range-partitioning on timestamp.
>>>>>>
>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>> No Default Partitioning
>>>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>
>>>>>>>
>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>> (divide by 3 because of replication)?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hey guys,
>>>>>>>>
>>>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>>>> answer.
>>>>>>>>
>>>>>>>> Can someone comment?
>>>>>>>>
>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>
>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>>> tablets for concurrency?
>>>>>>>>
>>>>>>>> We cannot pick the partitioning strategy for each table as we need
>>>>>>>> to stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>>>> rule but rethinking this now for another project.
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>
>>>>>>>>>  Kudu 1.5
>>>>>>>>>
>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>
>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>> 3 2Gb 65
>>>>>>>>> 9 700Mb 27
>>>>>>>>> 18 350Mb 17
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>
>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>
>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>
>>>>>>>>>>    -
>>>>>>>>>>
>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>    -
>>>>>>>>>>
>>>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>>>
>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>> diminishing returns.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>
>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>
>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>> Based on my quick test, it seems that queries are faster if I
>>>>>>>>>> have more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>>> 1Gb".
>>>>>>>>>>
>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>
>>>>>>>>>> Boris
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Andrew Wong
>>>>>>
>>>>>

Re: Partitioning Rules of Thumb

Posted by Cliff Resnick <cr...@gmail.com>.

That's pretty much it. Every now and then we get notice an instance is
scheduled for retirement, or maybe just goes flakey on its own, and we swap
it out. 3x replication has been plenty for us, and though we regularly back
up to S3 we've never had to rebuild due to failure. In my limited
experience disk failure is not a much of a thing with NVMe, and one of the
nice things about renting versus owning is we don't worry about things like
write amplification. We also co-locate a couple of masters, but their tiny
data goes to EBS and survives reboots. But then our cluster must be much
smaller than yours because rebalancing takes about a half an hour at most.

On Sat, Apr 25, 2020, 3:20 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> Cliff, I think i got the idea - you basically use a bunch of temp servers
> that can be evicted any time and each has attached temporary nvme drives -
> very clever! I am surprised it works well with Kudu as our experience with
> our on-prem cluster that Kudu is extremely sensitive to disk failures and
> random reboots. at one point we had a situation when 2 out of 3 replicas
> were down for a while and it took us quite a bit of time to recover and we
> had to use kudu cli to evict some tablets manually.
>
> Also manual rebalancing is in order often and it takes a lot of time (many
> hours).
>
> On Sat, Apr 25, 2020 at 2:54 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Cliff, i would be extremely interested to see a blog post to compare
>> Snowflake, Redshift and Impala/Kudu since you tried all of them.
>>
>> would love to get some details how you set up Kudu/Impala cluster on AWS
>> as well as my company might be heading the same direction. this does not
>> mean much to me as we are not using cloud but I hope you can elaborate on
>> your setup "12 hash x a12 (month) ranges in two 12-node clusters fronted by
>> a load balancer. We're in AWS, and thanks to Kudu's replication we can
>> use "for free" instance-store NVMe. We also have associated
>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>> compute oriented queries. "
>>
>> I cannot really find anything on the web that would compare Impala/Kudu
>> to Snowflake and Redshift. Everything I see is about Snowflake, Redshift
>> and BigQuery.
>>
>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>>
>>> This is a good conversation but I don't think the comparison with
>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>> closed source, it seems pretty clear the architectures are quite different.
>>> Snowflake has no primary key index, no UPSERT capability, features that
>>> make Kudu valuable for some use cases.
>>>
>>> It also seems to me that their intended workloads are quite different.
>>> Snowflake is great for intensive analytics on demand, and can handle deeply
>>> nested data very well, where Kudu can't handle that at all. Snowflake is
>>> not designed for heavy concurrency, but complex query plans for a small
>>> group of users. If you select an x-large Snowflake cluster it's probably
>>> because you have a large amount of data to churn through, not because you
>>> have a large number of users. Or, at least that's how we used it.
>>>
>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>> concurrent queries. I agree that getting very fussy about partitioning can
>>> be a pain, but for the large fact tables we generally use a simple strategy
>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>> use "for free" instance-store NVMe. We also have associated
>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>> compute oriented queries.
>>>
>>> The net result blows away what we had with RedShift at less than 1/3 the
>>> cost, with performance improvements mostly from better concurrency
>>> handling. This is despite the fact that RedShift has built-in cache. We
>>> also use streaming ingestion which, aside from being impossible with
>>> RedShift, removes the added cost of staging.
>>>
>>> Getting back to Snowflake, there's no way we could use it the same way
>>> we use Kudu, and even if we could, the cost would would probably put us out
>>> of business!
>>>
>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> thanks Andrew for taking your time responding to me. It seems that
>>>> there are no exact recommendations.
>>>>
>>>> I did look at scaling recommendations but that math is extremely
>>>> complicated and I do not think anyone will know all the answers to plug
>>>> into that calculation. We have no control really what users are doing, how
>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>> IMHO to expect that knowledge of user query patterns.
>>>>
>>>> I do like the Snowflake approach than the engine takes care of defaults
>>>> and can estimate the number of micro-partitions and even repartition tables
>>>> as they grow. I feel Kudu has the same capabilities as the design is very
>>>> similar. I really do not like to pick random number of buckets. Also we
>>>> manager 100s of tables, I cannot look at them each one by one to make these
>>>> decisions. Does it make sense?
>>>>
>>>>
>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>>
>>>>> Hey Boris,
>>>>>
>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>
>>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>>>> tablets to be closer to 1Gb?
>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>> tablets for concurrency?
>>>>>
>>>>>
>>>>> Per your numbers, this confirms that the partitions are the units of
>>>>> concurrency here, and that therefore having more and having smaller
>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>> total number of partitions cluster-wide.
>>>>>
>>>>> There are some trade offs with the replica count per tablet server
>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>> operations in the pool of background operations to be run, etc.
>>>>>
>>>>> Your point about scan concurrency is certainly a valid one -- there
>>>>> have been patches for other integrations that have tackled this to decouple
>>>>> partitioning from scan concurrency (KUDU-2437
>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>> its way into Impala yet). I filed KUDU-3071
>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>> be worked into Impala.
>>>>>
>>>>> For now, I would try to take into account the total sum of resources
>>>>> you have available to Kudu (including number of tablet servers, amount of
>>>>> storage per node, number of disks per tablet server, type of disk for the
>>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>>> system can handle (the scaling guide
>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>> guide how you partition your tables.
>>>>>
>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>
>>>>>
>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>>> your workload. As you show and as pointed out in documentation, larger and
>>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>
>>>>> It is recommended that new tables which are expected to have heavy
>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>
>>>>>
>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>> (divide by 3 because of replication)?
>>>>>
>>>>>
>>>>> The recommendation here is to have at least 20 logical partitions per
>>>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>>>> scan would be broken up into 20 tablet scans, and each of those might land
>>>>> on a different tablet server running on isolated hardware. For a
>>>>> significantly larger table into which you expect highly concurrent
>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>> consider range-partitioning on timestamp.
>>>>>
>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> just saw this in the docs but it is still confusing statement
>>>>>> No Default Partitioning
>>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>
>>>>>>
>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>> (divide by 3 because of replication)?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> hey guys,
>>>>>>>
>>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>>> answer.
>>>>>>>
>>>>>>> Can someone comment?
>>>>>>>
>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>
>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>> tablets for concurrency?
>>>>>>>
>>>>>>> We cannot pick the partitioning strategy for each table as we need
>>>>>>> to stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>>> rule but rethinking this now for another project.
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> forgot to post results of my quick test:
>>>>>>>>
>>>>>>>>  Kudu 1.5
>>>>>>>>
>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>
>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>> 3 2Gb 65
>>>>>>>> 9 700Mb 27
>>>>>>>> 18 350Mb 17
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>
>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>
>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>
>>>>>>>>>    -
>>>>>>>>>
>>>>>>>>>    For large tables, such as fact tables, aim for as many tablets
>>>>>>>>>    as you have cores in the cluster.
>>>>>>>>>    -
>>>>>>>>>
>>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>>
>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>> diminishing returns.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>
>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>
>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>> 1Gb".
>>>>>>>>>
>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>
>>>>>>>>> Boris
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> --
>>>>> Andrew Wong
>>>>>
>>>>

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

Cliff, I think i got the idea - you basically use a bunch of temp servers
that can be evicted any time and each has attached temporary nvme drives -
very clever! I am surprised it works well with Kudu as our experience with
our on-prem cluster that Kudu is extremely sensitive to disk failures and
random reboots. at one point we had a situation when 2 out of 3 replicas
were down for a while and it took us quite a bit of time to recover and we
had to use kudu cli to evict some tablets manually.

Also manual rebalancing is in order often and it takes a lot of time (many
hours).

On Sat, Apr 25, 2020 at 2:54 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> Cliff, i would be extremely interested to see a blog post to compare
> Snowflake, Redshift and Impala/Kudu since you tried all of them.
>
> would love to get some details how you set up Kudu/Impala cluster on AWS
> as well as my company might be heading the same direction. this does not
> mean much to me as we are not using cloud but I hope you can elaborate on
> your setup "12 hash x a12 (month) ranges in two 12-node clusters fronted by
> a load balancer. We're in AWS, and thanks to Kudu's replication we can
> use "for free" instance-store NVMe. We also have associated
> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
> compute oriented queries. "
>
> I cannot really find anything on the web that would compare Impala/Kudu to
> Snowflake and Redshift. Everything I see is about Snowflake, Redshift and
> BigQuery.
>
> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>
>> This is a good conversation but I don't think the comparison with
>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>> last job, about 5 years ago, I pretty much single-handedly scale tested
>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>> closed source, it seems pretty clear the architectures are quite different.
>> Snowflake has no primary key index, no UPSERT capability, features that
>> make Kudu valuable for some use cases.
>>
>> It also seems to me that their intended workloads are quite different.
>> Snowflake is great for intensive analytics on demand, and can handle deeply
>> nested data very well, where Kudu can't handle that at all. Snowflake is
>> not designed for heavy concurrency, but complex query plans for a small
>> group of users. If you select an x-large Snowflake cluster it's probably
>> because you have a large amount of data to churn through, not because you
>> have a large number of users. Or, at least that's how we used it.
>>
>> At my current workplace we use Kudu/Impala to handle about 30-60
>> concurrent queries. I agree that getting very fussy about partitioning can
>> be a pain, but for the large fact tables we generally use a simple strategy
>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>> use "for free" instance-store NVMe. We also have associated
>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>> compute oriented queries.
>>
>> The net result blows away what we had with RedShift at less than 1/3 the
>> cost, with performance improvements mostly from better concurrency
>> handling. This is despite the fact that RedShift has built-in cache. We
>> also use streaming ingestion which, aside from being impossible with
>> RedShift, removes the added cost of staging.
>>
>> Getting back to Snowflake, there's no way we could use it the same way we
>> use Kudu, and even if we could, the cost would would probably put us out of
>> business!
>>
>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> thanks Andrew for taking your time responding to me. It seems that there
>>> are no exact recommendations.
>>>
>>> I did look at scaling recommendations but that math is extremely
>>> complicated and I do not think anyone will know all the answers to plug
>>> into that calculation. We have no control really what users are doing, how
>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>> IMHO to expect that knowledge of user query patterns.
>>>
>>> I do like the Snowflake approach than the engine takes care of defaults
>>> and can estimate the number of micro-partitions and even repartition tables
>>> as they grow. I feel Kudu has the same capabilities as the design is very
>>> similar. I really do not like to pick random number of buckets. Also we
>>> manager 100s of tables, I cannot look at them each one by one to make these
>>> decisions. Does it make sense?
>>>
>>>
>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>
>>>> Hey Boris,
>>>>
>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>
>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>>> tablets to be closer to 1Gb?
>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>> tablets for concurrency?
>>>>
>>>>
>>>> Per your numbers, this confirms that the partitions are the units of
>>>> concurrency here, and that therefore having more and having smaller
>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>> smaller partitions across all tables may not scale when thinking about the
>>>> total number of partitions cluster-wide.
>>>>
>>>> There are some trade offs with the replica count per tablet server here
>>>> -- generally, each tablet replica has a resource cost on tablet servers:
>>>> WALs and tablet-related metadata use a shared disk (if you can put this on
>>>> an SSD, I would recommend doing so), each tablet introduces some
>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>> operations in the pool of background operations to be run, etc.
>>>>
>>>> Your point about scan concurrency is certainly a valid one -- there
>>>> have been patches for other integrations that have tackled this to decouple
>>>> partitioning from scan concurrency (KUDU-2437
>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>> its way into Impala yet). I filed KUDU-3071
>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>> be worked into Impala.
>>>>
>>>> For now, I would try to take into account the total sum of resources
>>>> you have available to Kudu (including number of tablet servers, amount of
>>>> storage per node, number of disks per tablet server, type of disk for the
>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>> system can handle (the scaling guide
>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>> guide how you partition your tables.
>>>>
>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>
>>>>
>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>> your workload. As you show and as pointed out in documentation, larger and
>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>> though we've seen multiple GBs works relatively well for many use cases
>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>
>>>> It is recommended that new tables which are expected to have heavy read
>>>>> and write workloads have at least as many tablets as tablet servers.
>>>>
>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>> (divide by 3 because of replication)?
>>>>
>>>>
>>>> The recommendation here is to have at least 20 logical partitions per
>>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>>> scan would be broken up into 20 tablet scans, and each of those might land
>>>> on a different tablet server running on isolated hardware. For a
>>>> significantly larger table into which you expect highly concurrent
>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>> having more partitions, and if your data is naturally time-oriented,
>>>> consider range-partitioning on timestamp.
>>>>
>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> just saw this in the docs but it is still confusing statement
>>>>> No Default Partitioning
>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>
>>>>>
>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>> (divide by 3 because of replication)?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> hey guys,
>>>>>>
>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>> answer.
>>>>>>
>>>>>> Can someone comment?
>>>>>>
>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>
>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>> tablets for concurrency?
>>>>>>
>>>>>> We cannot pick the partitioning strategy for each table as we need to
>>>>>> stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>> rule but rethinking this now for another project.
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> forgot to post results of my quick test:
>>>>>>>
>>>>>>>  Kudu 1.5
>>>>>>>
>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>
>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>> 3 2Gb 65
>>>>>>> 9 700Mb 27
>>>>>>> 18 350Mb 17
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>
>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>
>>>>>>>> Partitioning Rules of Thumb
>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    For large tables, such as fact tables, aim for as many tablets
>>>>>>>>    as you have cores in the cluster.
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>
>>>>>>>> In general, be mindful the number of tablets limits the parallelism
>>>>>>>> of reads, in the current implementation. Increasing the number of tablets
>>>>>>>> significantly beyond the number of cores is likely to have diminishing
>>>>>>>> returns.
>>>>>>>>
>>>>>>>>
>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>
>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>
>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>> 1Gb".
>>>>>>>>
>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>
>>>>>>>> Boris
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> Andrew Wong
>>>>
>>>

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

Cliff, i would be extremely interested to see a blog post to compare
Snowflake, Redshift and Impala/Kudu since you tried all of them.

would love to get some details how you set up Kudu/Impala cluster on AWS as
well as my company might be heading the same direction. this does not mean
much to me as we are not using cloud but I hope you can elaborate on your
setup "12 hash x a12 (month) ranges in two 12-node clusters fronted by a
load balancer. We're in AWS, and thanks to Kudu's replication we can use
"for free" instance-store NVMe. We also have associated compute-oriented
stateless Impala Spot Fleet clusters for HLL and other compute oriented
queries. "

I cannot really find anything on the web that would compare Impala/Kudu to
Snowflake and Redshift. Everything I see is about Snowflake, Redshift and
BigQuery.

On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:

> This is a good conversation but I don't think the comparison with
> Snowflake is a fair one, at least from an older version of Snowflake (In my
> last job, about 5 years ago, I pretty much single-handedly scale tested
> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
> closed source, it seems pretty clear the architectures are quite different.
> Snowflake has no primary key index, no UPSERT capability, features that
> make Kudu valuable for some use cases.
>
> It also seems to me that their intended workloads are quite different.
> Snowflake is great for intensive analytics on demand, and can handle deeply
> nested data very well, where Kudu can't handle that at all. Snowflake is
> not designed for heavy concurrency, but complex query plans for a small
> group of users. If you select an x-large Snowflake cluster it's probably
> because you have a large amount of data to churn through, not because you
> have a large number of users. Or, at least that's how we used it.
>
> At my current workplace we use Kudu/Impala to handle about 30-60
> concurrent queries. I agree that getting very fussy about partitioning can
> be a pain, but for the large fact tables we generally use a simple strategy
> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
> use "for free" instance-store NVMe. We also have associated
> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
> compute oriented queries.
>
> The net result blows away what we had with RedShift at less than 1/3 the
> cost, with performance improvements mostly from better concurrency
> handling. This is despite the fact that RedShift has built-in cache. We
> also use streaming ingestion which, aside from being impossible with
> RedShift, removes the added cost of staging.
>
> Getting back to Snowflake, there's no way we could use it the same way we
> use Kudu, and even if we could, the cost would would probably put us out of
> business!
>
> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> thanks Andrew for taking your time responding to me. It seems that there
>> are no exact recommendations.
>>
>> I did look at scaling recommendations but that math is extremely
>> complicated and I do not think anyone will know all the answers to plug
>> into that calculation. We have no control really what users are doing, how
>> many queries they run, how many are hot vs cold etc. It is not realistic
>> IMHO to expect that knowledge of user query patterns.
>>
>> I do like the Snowflake approach than the engine takes care of defaults
>> and can estimate the number of micro-partitions and even repartition tables
>> as they grow. I feel Kudu has the same capabilities as the design is very
>> similar. I really do not like to pick random number of buckets. Also we
>> manager 100s of tables, I cannot look at them each one by one to make these
>> decisions. Does it make sense?
>>
>>
>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>
>>> Hey Boris,
>>>
>>> Sorry you didn't have much luck on Slack. I know partitioning in general
>>> can be tricky; thanks for the question. Left some thoughts below:
>>>
>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>> tablets to be closer to 1Gb?
>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>> million rows, should I just use 1 partition or still break them on smaller
>>>> tablets for concurrency?
>>>
>>>
>>> Per your numbers, this confirms that the partitions are the units of
>>> concurrency here, and that therefore having more and having smaller
>>> partitions yields a concurrency bump. That said, extending a scheme of
>>> smaller partitions across all tables may not scale when thinking about the
>>> total number of partitions cluster-wide.
>>>
>>> There are some trade offs with the replica count per tablet server here
>>> -- generally, each tablet replica has a resource cost on tablet servers:
>>> WALs and tablet-related metadata use a shared disk (if you can put this on
>>> an SSD, I would recommend doing so), each tablet introduces some
>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>> operations in the pool of background operations to be run, etc.
>>>
>>> Your point about scan concurrency is certainly a valid one -- there have
>>> been patches for other integrations that have tackled this to decouple
>>> partitioning from scan concurrency (KUDU-2437
>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example, where
>>> Kudu's Spark integration will split range scans into smaller-scoped scan
>>> tokens to be run concurrently, though this optimization hasn't made its way
>>> into Impala yet). I filed KUDU-3071
>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think
>>> is left on the Kudu-side to get this up and running, so that it can be
>>> worked into Impala.
>>>
>>> For now, I would try to take into account the total sum of resources you
>>> have available to Kudu (including number of tablet servers, amount of
>>> storage per node, number of disks per tablet server, type of disk for the
>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>> system can handle (the scaling guide
>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here),
>>> and hopefully that, along with your own SLAs per table, can help guide how
>>> you partition your tables.
>>>
>>> confused why they say "at least" not "at most" - does it mean I should
>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>
>>>
>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low
>>> tens of GB per tablet replica, though exact perf obviously depends on your
>>> workload. As you show and as pointed out in documentation, larger and fewer
>>> tablets can limit the amount of concurrency for writes and reads, though
>>> we've seen multiple GBs works relatively well for many use cases while
>>> weighing the above mentioned tradeoffs with replica count.
>>>
>>> It is recommended that new tables which are expected to have heavy read
>>>> and write workloads have at least as many tablets as tablet servers.
>>>
>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>> (divide by 3 because of replication)?
>>>
>>>
>>> The recommendation here is to have at least 20 logical partitions per
>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>> scan would be broken up into 20 tablet scans, and each of those might land
>>> on a different tablet server running on isolated hardware. For a
>>> significantly larger table into which you expect highly concurrent
>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>> having more partitions, and if your data is naturally time-oriented,
>>> consider range-partitioning on timestamp.
>>>
>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> just saw this in the docs but it is still confusing statement
>>>> No Default Partitioning
>>>> Kudu does not provide a default partitioning strategy when creating
>>>> tables. It is recommended that new tables which are expected to have heavy
>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>
>>>>
>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>> (divide by 3 because of replication)?
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> hey guys,
>>>>>
>>>>> I asked the same question on Slack on and got no responses. I just
>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>> answer.
>>>>>
>>>>> Can someone comment?
>>>>>
>>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>>> tablets to be closer to 1Gb?
>>>>>
>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>> tablets for concurrency?
>>>>>
>>>>> We cannot pick the partitioning strategy for each table as we need to
>>>>> stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>> rule but rethinking this now for another project.
>>>>>
>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> forgot to post results of my quick test:
>>>>>>
>>>>>>  Kudu 1.5
>>>>>>
>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>
>>>>>> Tablets Tablet Size Query run time, sec
>>>>>> 3 2Gb 65
>>>>>> 9 700Mb 27
>>>>>> 18 350Mb 17
>>>>>> [image: image.png]
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>
>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>
>>>>>>> Partitioning Rules of Thumb
>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    For large tables, such as fact tables, aim for as many tablets
>>>>>>>    as you have cores in the cluster.
>>>>>>>    -
>>>>>>>
>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>
>>>>>>> In general, be mindful the number of tablets limits the parallelism
>>>>>>> of reads, in the current implementation. Increasing the number of tablets
>>>>>>> significantly beyond the number of cores is likely to have diminishing
>>>>>>> returns.
>>>>>>>
>>>>>>>
>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>> correctly. Let me use concrete example.
>>>>>>>
>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>
>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>> 1Gb".
>>>>>>>
>>>>>>> Really confused what the doc is saying, please help
>>>>>>>
>>>>>>> Boris
>>>>>>>
>>>>>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>

RE: [EXTERNAL] Re: Partitioning Rules of Thumb

Posted by Jy...@contractor.amat.com.

Hi All,
How can I get the Hive Integration with Kudu Tables? I'm new to this, let me know what to install to achieve this.

Best Regards,
Jyothi Swaroop Bh.
HADOOP |BIDM |GIS |Applied Materials
EXTN: #9575-8876 | PH: +91-9051222662

From: Boris Tyukin <bo...@boristyukin.com>
Sent: Friday, March 13, 2020 8:15 PM
To: user@kudu.apache.org
Subject: [EXTERNAL] Re: Partitioning Rules of Thumb

CAUTION: EXTERNAL EMAIL. Verify before you click links or open attachments. Questions? Contact GIS.

very nice, Grant! this sounds really promising!

On Fri, Mar 13, 2020 at 9:52 AM Grant Henke <gh...@cloudera.com>> wrote:
Actually, I have a much longer wish list on the Impala side than Kudu - it feels that Kudu itself is very close to Snowflake but lacks these self-management features. While Impala is certainly one of the fastest SQL engines I've seen, we struggle with mutli-tenancy and complex queries, that Hive would run like a champ.

There is an experimental Hive integration you could try. You may need to compile it for your version of Hive depending on what version of Hive you are using. Feel free to reach out on Slack if you have issues. The details on the integration here: https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration<https://urldefense.com/v3/__https:/cwiki.apache.org/confluence/display/Hive/Kudu*Integration__;Kw!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYym92KIOY$>

On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin <bo...@boristyukin.com>> wrote:
thanks Adar, I am a big fan of Kudu and I see a lot of potential in Kudu, and I do not think snowflake deserves that much credit and noise. But they do have great ideas that take their engine to the next level. I do not know architecture-wise how it works but here is the best doc I found regarding micro partitions:
https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html<https://urldefense.com/v3/__https:/docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html__;!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYy_s86tNE$>

It is also appealing to me that they support nested data types with ease (and they are still very efficient to query/filter by), they scale workload easily (Kudu requires manual rebalancing and it is VERY slow as I've learned last weekend by doing it on our dev cluster). Also, they support very large blobs and text fields that Kudu does not.

Actually, I have a much longer wish list on the Impala side than Kudu - it feels that Kudu itself is very close to Snowflake but lacks these self-management features. While Impala is certainly one of the fastest SQL engines I've seen, we struggle with mutli-tenancy and complex queries, that Hive would run like a champ.

As for Kudu's compaction process - it was largely an issue with 1.5 and I believed was addressed in 1.9. We had a terrible time for a few weeks when frequent updates/delete almost froze all our queries to crawl. A lot of smaller tables had a huge number of tiny rowsets causing massive scans and freezing the entire Kudu cluster. We had a good discussion on slack and Todd Lipcon suggested a good workaround using flush_threshold_secs till we move to 1.9 and it worked fine. Nowhere in the documentation, it was suggested to set this flag and actually it was one of these "use it at your own risk" flags.



On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>> wrote:
This has been an excellent discussion to follow, with very useful feedback. Thank you for that.

Boris, if I can try to summarize your position, it's that manual partitioning doesn't scale when dealing with hundreds of (small) tables and when you don't control the PK of each table. The Kudu schema design guide would advise you to hash partition such tables, and, following Andrew's recommendation, have at least one partition per tserver. Except that given enough tables, you'll eventually overwhelm any cluster based on the sheer number of partitions per tserver. And if you reduce the number of partitions per table, you open yourself up to potential hotspotting if the per-table load isn't even. Is that correct?

I can think of several improvements we could make here:
1. Support splitting/merging of range partitions, first manually and later automatically based on load. This is tracked in KUDU-437 and KUDU-441. Both are complicated, and since we've gotten a lot of mileage out of static range/hash partitioning, there hasn't been much traction on those JIRAs.
2. Improve server efficiency when hosting large numbers of replicas, so that you can blanket your cluster with maximally hash-partitioned tables. We did some work here a couple years ago (see KUDU-1967) but there's clearly much more that we can do. Of note, we previously discussed implementing Multi-Raft (https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031<https://urldefense.com/v3/__https:/issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels*3Acomment-tabpanel*comment-15967031__;JSM!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYyQWo2sAc$>).
3. Add support for consistent hashing. I'm not super familiar with the details but YugabyteDB (based on Kudu) has seen success here. Their recent blog post on sharding is a good (and related) read: https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/<https://urldefense.com/v3/__https:/blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/__;!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYy3CfFDeA$>

Since you're clearly more familiar with Snowflake than I, can you comment more on how their partitioning system works? Sounds like it's quite automated; is that ever problematic when it decides to repartition during a workload? The YugabyteDB blog post talks about heuristics "reacting" to hotspotting by load balancing, and concludes that hotspots can move around too quickly for a load balancer to keep up.

Separately, you mentioned having to manage Kudu's compaction process. Could you go into more detail here?

On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>> wrote:
thanks Cliff, this is really good info. I am tempted to do the benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of traction lately, and based on a few conversations in slack, looks like Kudu team is not tracking competition that much and I think there are awesome features in Snowflake that would be great to have in Kudu.

I think they do support upserts now and back in-time queries - while they do not upsert/delete records, they just create new micro-partitions and it is a metadata operation - that's who they can see data in a given time. The idea of virtual warehouse is also very appealing especially in healthcare as we often need to share data with other departments or partners.

do not get me wrong, I decided to use Kudu for the same reasons and I really did not have a choice on CDH other than HBase but I get questioned lately why we should not just give up CDH and go all in on Cloud with Cloud native tech like Redshift and Snowflake. It is getting harder to explain why :)

My biggest gripe with Kudu besides known limitations is the management of partitions and compaction process. For our largest tables, we just max out the number of tablets as Kudu allows per cluster. Smaller tables though is a challenge and more like an art while I feel there should be an automated process with good defaults like in other data storage systems. No one expects to assign partitions manually in Snowflake, BigQuery or Redshift. No one is expected to tweak compaction parameters and deal with fast degraded performance over time on tables with a high number of deletes/updates. We are happy with performance but it is not all about performance.

We also struggle on Impala side as it feels it is limiting what we can do with Kudu and it feels like Impala team is always behind. pyspark is another problem - Kudu pyspark client is lagging behind java client and very difficult to install/deploy. Some api is not even available in pyspark.

And honestly, I do not like that Impala is the only viable option with Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a tank but we cannot do that anymore with Kudu tables and struggle with disk spilling issues, Impala choosing bad explain plans and forcing us to use straight_join hint many times etc.

Sorry if this post came a bit on a negative side, I really like Kudu and the dev team behind it rocks. But I do not like their industry is going and why Kudu is not getting the traction it deserves.

On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com>> wrote:
This is a good conversation but I don't think the comparison with Snowflake is a fair one, at least from an older version of Snowflake (In my last job, about 5 years ago, I pretty much single-handedly scale tested Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is closed source, it seems pretty clear the architectures are quite different. Snowflake has no primary key index, no UPSERT capability, features that make Kudu valuable for some use cases.

It also seems to me that their intended workloads are quite different. Snowflake is great for intensive analytics on demand, and can handle deeply nested data very well, where Kudu can't handle that at all. Snowflake is not designed for heavy concurrency, but complex query plans for a small group of users. If you select an x-large Snowflake cluster it's probably because you have a large amount of data to churn through, not because you have a large number of users. Or, at least that's how we used it.

At my current workplace we use Kudu/Impala to handle about 30-60 concurrent queries. I agree that getting very fussy about partitioning can be a pain, but for the large fact tables we generally use a simple strategy of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted by a load balancer. We're in AWS, and thanks to Kudu's replication we can use "for free" instance-store NVMe. We also have associated compute-oriented stateless Impala Spot Fleet clusters for HLL and other compute oriented queries.

The net result blows away what we had with RedShift at less than 1/3 the cost, with performance improvements mostly from better concurrency handling. This is despite the fact that RedShift has built-in cache. We also use streaming ingestion which, aside from being impossible with RedShift, removes the added cost of staging.

Getting back to Snowflake, there's no way we could use it the same way we use Kudu, and even if we could, the cost would would probably put us out of business!

On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>> wrote:
thanks Andrew for taking your time responding to me. It seems that there are no exact recommendations.

I did look at scaling recommendations but that math is extremely complicated and I do not think anyone will know all the answers to plug into that calculation. We have no control really what users are doing, how many queries they run, how many are hot vs cold etc. It is not realistic IMHO to expect that knowledge of user query patterns.

I do like the Snowflake approach than the engine takes care of defaults and can estimate the number of micro-partitions and even repartition tables as they grow. I feel Kudu has the same capabilities as the design is very similar. I really do not like to pick random number of buckets. Also we manager 100s of tables, I cannot look at them each one by one to make these decisions. Does it make sense?


On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com>> wrote:
Hey Boris,

Sorry you didn't have much luck on Slack. I know partitioning in general can be tricky; thanks for the question. Left some thoughts below:

Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?
And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them on smaller tablets for concurrency?

Per your numbers, this confirms that the partitions are the units of concurrency here, and that therefore having more and having smaller partitions yields a concurrency bump. That said, extending a scheme of smaller partitions across all tables may not scale when thinking about the total number of partitions cluster-wide.

There are some trade offs with the replica count per tablet server here -- generally, each tablet replica has a resource cost on tablet servers: WALs and tablet-related metadata use a shared disk (if you can put this on an SSD, I would recommend doing so), each tablet introduces some Raft-related RPC traffic, each tablet replica introduces some maintenance operations in the pool of background operations to be run, etc.

Your point about scan concurrency is certainly a valid one -- there have been patches for other integrations that have tackled this to decouple partitioning from scan concurrency (KUDU-2437<https://urldefense.com/v3/__https:/issues.apache.org/jira/browse/KUDU-2437__;!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYyv2_-DpY$> and KUDU-2670<https://urldefense.com/v3/__https:/issues.apache.org/jira/browse/KUDU-2670__;!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYytW0tFHc$> are an example, where Kudu's Spark integration will split range scans into smaller-scoped scan tokens to be run concurrently, though this optimization hasn't made its way into Impala yet). I filed KUDU-3071<https://urldefense.com/v3/__https:/issues.apache.org/jira/browse/KUDU-3071__;!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYyiAjniDw$> to track what I think is left on the Kudu-side to get this up and running, so that it can be worked into Impala.

For now, I would try to take into account the total sum of resources you have available to Kudu (including number of tablet servers, amount of storage per node, number of disks per tablet server, type of disk for the WAL/metadata disks), to settle on roughly how many tablet replicas your system can handle (the scaling guide<https://urldefense.com/v3/__https:/kudu.apache.org/docs/scaling_guide.html__;!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYyrIbQ8MQ$> may be helpful here), and hopefully that, along with your own SLAs per table, can help guide how you partition your tables.

confused why they say "at least" not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?

Aiming for 1GB seems a bit low; Kudu should be able to handle in the low tens of GB per tablet replica, though exact perf obviously depends on your workload. As you show and as pointed out in documentation, larger and fewer tablets can limit the amount of concurrency for writes and reads, though we've seen multiple GBs works relatively well for many use cases while weighing the above mentioned tradeoffs with replica count.

It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.
if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?

The recommendation here is to have at least 20 logical partitions per table. That way, a scan were to touch a table's entire keyspace, the table scan would be broken up into 20 tablet scans, and each of those might land on a different tablet server running on isolated hardware. For a significantly larger table into which you expect highly concurrent workloads, the recommendation serves as a lower bound -- I'd recommend having more partitions, and if your data is naturally time-oriented, consider range-partitioning on timestamp.

On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>> wrote:
just saw this in the docs but it is still confusing statement
No Default Partitioning
Kudu does not provide a default partitioning strategy when creating tables. It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.

if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?




On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>> wrote:
hey guys,

I asked the same question on Slack on and got no responses. I just went through the docs and design doc and FAQ and still did not find an answer.

Can someone comment?

Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?

And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them on smaller tablets for concurrency?

We cannot pick the partitioning strategy for each table as we need to stream 100s of tables and we use PK from RBDMS and need to come with an automated way to pick number of partitions/tablets. So far I was using 1Gb rule but rethinking this now for another project.

On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>> wrote:
forgot to post results of my quick test:

 Kudu 1.5

Table takes 18Gb of disk space after 3x replication

Tablets
Tablet Size
Query run time, sec
3
2Gb
65
9
700Mb
27
18
350Mb
17

[image.png]


On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>> wrote:
Hi guys,

just want to clarify recommendations from the doc. It says:
https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb<https://urldefense.com/v3/__https:/kudu.apache.org/docs/kudu_impala_integration.html*partitioning_rules_of_thumb__;Iw!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYyjLK3py0$>

Partitioning Rules of Thumb<https://urldefense.com/v3/__https:/kudu.apache.org/docs/kudu_impala_integration.html*partitioning_rules_of_thumb__;Iw!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYyjLK3py0$>

  *   For large tables, such as fact tables, aim for as many tablets as you have cores in the cluster.
  *   For small tables, such as dimension tables, ensure that each tablet is at least 1 GB in size.

In general, be mindful the number of tablets limits the parallelism of reads, in the current implementation. Increasing the number of tablets significantly beyond the number of cores is likely to have diminishing returns.



I've read this a few times but I am not sure I understand it correctly. Let me use concrete example.

If a table ends up taking 18Gb after replication (so with 3x replication it is ~9Gb per tablet if I do not partition), should I aim for 1Gb tablets (6 tablets before replication) or should I aim for 500Mb tablets if my cluster capacity allows so (12 tablets before replication)? confused why they say "at least" not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?

Assume that I have tons of CPU cores on a cluster...
Based on my quick test, it seems that queries are faster if I have more tablets/partitions...In this example, 18 tablets gave me the best timing but tablet size was around 300-400Mb. But the doc says "at least 1Gb".

Really confused what the doc is saying, please help

Boris



--
Andrew Wong


--
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com<ma...@cloudera.com> | twitter.com/gchenke<https://urldefense.com/v3/__http:/twitter.com/gchenke__;!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYyGRJzLIM$> | linkedin.com/in/granthenke<https://urldefense.com/v3/__http:/linkedin.com/in/granthenke__;!!NH8t9uXaRvxizNEf!GO6gkvBPkOahZEzrM-PqM3jwQjQuxkX3LucNJXQxCUfqlHuzTMY3N0nLRL4O3bPYVhWq9VYyk4V5SJY$>

The content of this message is APPLIED MATERIALS CONFIDENTIAL. If you are not the intended recipient, please notify me, delete this email and do not use or distribute this email.

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

very nice, Grant! this sounds really promising!

On Fri, Mar 13, 2020 at 9:52 AM Grant Henke <gh...@cloudera.com> wrote:

> Actually, I have a much longer wish list on the Impala side than Kudu - it
>> feels that Kudu itself is very close to Snowflake but lacks these
>> self-management features. While Impala is certainly one of the fastest SQL
>> engines I've seen, we struggle with mutli-tenancy and complex queries, that
>> Hive would run like a champ.
>>
>
> There is an experimental Hive integration you could try. You may need to
> compile it for your version of Hive depending on what version of Hive you
> are using. Feel free to reach out on Slack if you have issues. The details
> on the integration here:
> https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration
>
> On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in Kudu,
>> and I do not think snowflake deserves that much credit and noise. But they
>> do have great ideas that take their engine to the next level. I do not know
>> architecture-wise how it works but here is the best doc I found regarding
>> micro partitions:
>>
>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>
>>
>> It is also appealing to me that they support nested data types with ease
>> (and they are still very efficient to query/filter by), they scale workload
>> easily (Kudu requires manual rebalancing and it is VERY slow as I've
>> learned last weekend by doing it on our dev cluster). Also, they support
>> very large blobs and text fields that Kudu does not.
>>
>> Actually, I have a much longer wish list on the Impala side than Kudu -
>> it feels that Kudu itself is very close to Snowflake but lacks these
>> self-management features. While Impala is certainly one of the fastest SQL
>> engines I've seen, we struggle with mutli-tenancy and complex queries, that
>> Hive would run like a champ.
>>
>> As for Kudu's compaction process - it was largely an issue with 1.5 and I
>> believed was addressed in 1.9. We had a terrible time for a few weeks when
>> frequent updates/delete almost froze all our queries to crawl. A lot of
>> smaller tables had a huge number of tiny rowsets causing massive scans and
>> freezing the entire Kudu cluster. We had a good discussion on slack and
>> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
>> move to 1.9 and it worked fine. Nowhere in the documentation, it was
>> suggested to set this flag and actually it was one of these "use it at your
>> own risk" flags.
>>
>>
>>
>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
>> wrote:
>>
>>> This has been an excellent discussion to follow, with very useful
>>> feedback. Thank you for that.
>>>
>>> Boris, if I can try to summarize your position, it's that manual
>>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>>> when you don't control the PK of each table. The Kudu schema design guide
>>> would advise you to hash partition such tables, and, following Andrew's
>>> recommendation, have at least one partition per tserver. Except that given
>>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>>> number of partitions per tserver. And if you reduce the number of
>>> partitions per table, you open yourself up to potential hotspotting if the
>>> per-table load isn't even. Is that correct?
>>>
>>> I can think of several improvements we could make here:
>>> 1. Support splitting/merging of range partitions, first manually and
>>> later automatically based on load. This is tracked in KUDU-437 and
>>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>>> of static range/hash partitioning, there hasn't been much traction on those
>>> JIRAs.
>>> 2. Improve server efficiency when hosting large numbers of replicas, so
>>> that you can blanket your cluster with maximally hash-partitioned tables.
>>> We did some work here a couple years ago (see KUDU-1967) but there's
>>> clearly much more that we can do. Of note, we previously discussed
>>> implementing Multi-Raft (
>>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>>> ).
>>> 3. Add support for consistent hashing. I'm not super familiar with the
>>> details but YugabyteDB (based on Kudu) has seen success here. Their recent
>>> blog post on sharding is a good (and related) read:
>>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>
>>> Since you're clearly more familiar with Snowflake than I, can you
>>> comment more on how their partitioning system works? Sounds like it's quite
>>> automated; is that ever problematic when it decides to repartition during a
>>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>>> hotspotting by load balancing, and concludes that hotspots can move around
>>> too quickly for a load balancer to keep up.
>>>
>>> Separately, you mentioned having to manage Kudu's compaction process.
>>> Could you go into more detail here?
>>>
>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> thanks Cliff, this is really good info. I am tempted to do the
>>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>>> team is not tracking competition that much and I think there are awesome
>>>> features in Snowflake that would be great to have in Kudu.
>>>>
>>>> I think they do support upserts now and back in-time queries - while
>>>> they do not upsert/delete records, they just create new micro-partitions
>>>> and it is a metadata operation - that's who they can see data in a given
>>>> time. The idea of virtual warehouse is also very appealing especially in
>>>> healthcare as we often need to share data with other departments or
>>>> partners.
>>>>
>>>> do not get me wrong, I decided to use Kudu for the same reasons and I
>>>> really did not have a choice on CDH other than HBase but I get questioned
>>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>>> why :)
>>>>
>>>> My biggest gripe with Kudu besides known limitations is the management
>>>> of partitions and compaction process. For our largest tables, we just max
>>>> out the number of tablets as Kudu allows per cluster. Smaller tables though
>>>> is a challenge and more like an art while I feel there should be an
>>>> automated process with good defaults like in other data storage systems. No
>>>> one expects to assign partitions manually in Snowflake, BigQuery or
>>>> Redshift. No one is expected to tweak compaction parameters and deal with
>>>> fast degraded performance over time on tables with a high number of
>>>> deletes/updates. We are happy with performance but it is not all about
>>>> performance.
>>>>
>>>> We also struggle on Impala side as it feels it is limiting what we can
>>>> do with Kudu and it feels like Impala team is always behind. pyspark is
>>>> another problem - Kudu pyspark client is lagging behind java client and
>>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>>
>>>> And honestly, I do not like that Impala is the only viable option with
>>>> Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a
>>>> tank but we cannot do that anymore with Kudu tables and struggle with disk
>>>> spilling issues, Impala choosing bad explain plans and forcing us to use
>>>> straight_join hint many times etc.
>>>>
>>>> Sorry if this post came a bit on a negative side, I really like Kudu
>>>> and the dev team behind it rocks. But I do not like their industry is going
>>>> and why Kudu is not getting the traction it deserves.
>>>>
>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>>>>
>>>>> This is a good conversation but I don't think the comparison with
>>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>>> closed source, it seems pretty clear the architectures are quite different.
>>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>>> make Kudu valuable for some use cases.
>>>>>
>>>>> It also seems to me that their intended workloads are quite different.
>>>>> Snowflake is great for intensive analytics on demand, and can handle deeply
>>>>> nested data very well, where Kudu can't handle that at all. Snowflake is
>>>>> not designed for heavy concurrency, but complex query plans for a small
>>>>> group of users. If you select an x-large Snowflake cluster it's probably
>>>>> because you have a large amount of data to churn through, not because you
>>>>> have a large number of users. Or, at least that's how we used it.
>>>>>
>>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>>> use "for free" instance-store NVMe. We also have associated
>>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>>> compute oriented queries.
>>>>>
>>>>> The net result blows away what we had with RedShift at less than 1/3
>>>>> the cost, with performance improvements mostly from better concurrency
>>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>>> also use streaming ingestion which, aside from being impossible with
>>>>> RedShift, removes the added cost of staging.
>>>>>
>>>>> Getting back to Snowflake, there's no way we could use it the same way
>>>>> we use Kudu, and even if we could, the cost would would probably put us out
>>>>> of business!
>>>>>
>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> thanks Andrew for taking your time responding to me. It seems that
>>>>>> there are no exact recommendations.
>>>>>>
>>>>>> I did look at scaling recommendations but that math is extremely
>>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>>> into that calculation. We have no control really what users are doing, how
>>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>>
>>>>>> I do like the Snowflake approach than the engine takes care of
>>>>>> defaults and can estimate the number of micro-partitions and even
>>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>>> the design is very similar. I really do not like to pick random number of
>>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>>> one to make these decisions. Does it make sense?
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Boris,
>>>>>>>
>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>
>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>>> tablets for concurrency?
>>>>>>>
>>>>>>>
>>>>>>> Per your numbers, this confirms that the partitions are the units of
>>>>>>> concurrency here, and that therefore having more and having smaller
>>>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>>>> total number of partitions cluster-wide.
>>>>>>>
>>>>>>> There are some trade offs with the replica count per tablet server
>>>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>>
>>>>>>> Your point about scan concurrency is certainly a valid one -- there
>>>>>>> have been patches for other integrations that have tackled this to decouple
>>>>>>> partitioning from scan concurrency (KUDU-2437
>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>>>> its way into Impala yet). I filed KUDU-3071
>>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>>>> be worked into Impala.
>>>>>>>
>>>>>>> For now, I would try to take into account the total sum of resources
>>>>>>> you have available to Kudu (including number of tablet servers, amount of
>>>>>>> storage per node, number of disks per tablet server, type of disk for the
>>>>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>>>>> system can handle (the scaling guide
>>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>>>> guide how you partition your tables.
>>>>>>>
>>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>
>>>>>>>
>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>>>>> your workload. As you show and as pointed out in documentation, larger and
>>>>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>>
>>>>>>> It is recommended that new tables which are expected to have heavy
>>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>>
>>>>>>>
>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>> (divide by 3 because of replication)?
>>>>>>>
>>>>>>>
>>>>>>> The recommendation here is to have at least 20 logical partitions
>>>>>>> per table. That way, a scan were to touch a table's entire keyspace, the
>>>>>>> table scan would be broken up into 20 tablet scans, and each of those might
>>>>>>> land on a different tablet server running on isolated hardware. For a
>>>>>>> significantly larger table into which you expect highly concurrent
>>>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>>>> consider range-partitioning on timestamp.
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>>> No Default Partitioning
>>>>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>>
>>>>>>>>
>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> hey guys,
>>>>>>>>>
>>>>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>>>>> answer.
>>>>>>>>>
>>>>>>>>> Can someone comment?
>>>>>>>>>
>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>
>>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>>>> tablets for concurrency?
>>>>>>>>>
>>>>>>>>> We cannot pick the partitioning strategy for each table as we need
>>>>>>>>> to stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>>>>> rule but rethinking this now for another project.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <
>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>
>>>>>>>>>>  Kudu 1.5
>>>>>>>>>>
>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>
>>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>>> 3 2Gb 65
>>>>>>>>>> 9 700Mb 27
>>>>>>>>>> 18 350Mb 17
>>>>>>>>>> [image: image.png]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>>
>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>
>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>
>>>>>>>>>>>    -
>>>>>>>>>>>
>>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>>    -
>>>>>>>>>>>
>>>>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>>>>
>>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>>> diminishing returns.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>>
>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>
>>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>>> Based on my quick test, it seems that queries are faster if I
>>>>>>>>>>> have more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>>>> 1Gb".
>>>>>>>>>>>
>>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>>
>>>>>>>>>>> Boris
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Andrew Wong
>>>>>>>
>>>>>>
>
> --
> Grant Henke
> Software Engineer | Cloudera
> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>

Re: Partitioning Rules of Thumb

Posted by Grant Henke <gh...@cloudera.com>.

>
> Actually, I have a much longer wish list on the Impala side than Kudu - it
> feels that Kudu itself is very close to Snowflake but lacks these
> self-management features. While Impala is certainly one of the fastest SQL
> engines I've seen, we struggle with mutli-tenancy and complex queries, that
> Hive would run like a champ.
>

There is an experimental Hive integration you could try. You may need to
compile it for your version of Hive depending on what version of Hive you
are using. Feel free to reach out on Slack if you have issues. The details
on the integration here:
https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration

On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> thanks Adar, I am a big fan of Kudu and I see a lot of potential in Kudu,
> and I do not think snowflake deserves that much credit and noise. But they
> do have great ideas that take their engine to the next level. I do not know
> architecture-wise how it works but here is the best doc I found regarding
> micro partitions:
>
> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>
>
> It is also appealing to me that they support nested data types with ease
> (and they are still very efficient to query/filter by), they scale workload
> easily (Kudu requires manual rebalancing and it is VERY slow as I've
> learned last weekend by doing it on our dev cluster). Also, they support
> very large blobs and text fields that Kudu does not.
>
> Actually, I have a much longer wish list on the Impala side than Kudu - it
> feels that Kudu itself is very close to Snowflake but lacks these
> self-management features. While Impala is certainly one of the fastest SQL
> engines I've seen, we struggle with mutli-tenancy and complex queries, that
> Hive would run like a champ.
>
> As for Kudu's compaction process - it was largely an issue with 1.5 and I
> believed was addressed in 1.9. We had a terrible time for a few weeks when
> frequent updates/delete almost froze all our queries to crawl. A lot of
> smaller tables had a huge number of tiny rowsets causing massive scans and
> freezing the entire Kudu cluster. We had a good discussion on slack and
> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
> move to 1.9 and it worked fine. Nowhere in the documentation, it was
> suggested to set this flag and actually it was one of these "use it at your
> own risk" flags.
>
>
>
> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
> wrote:
>
>> This has been an excellent discussion to follow, with very useful
>> feedback. Thank you for that.
>>
>> Boris, if I can try to summarize your position, it's that manual
>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>> when you don't control the PK of each table. The Kudu schema design guide
>> would advise you to hash partition such tables, and, following Andrew's
>> recommendation, have at least one partition per tserver. Except that given
>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>> number of partitions per tserver. And if you reduce the number of
>> partitions per table, you open yourself up to potential hotspotting if the
>> per-table load isn't even. Is that correct?
>>
>> I can think of several improvements we could make here:
>> 1. Support splitting/merging of range partitions, first manually and
>> later automatically based on load. This is tracked in KUDU-437 and
>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>> of static range/hash partitioning, there hasn't been much traction on those
>> JIRAs.
>> 2. Improve server efficiency when hosting large numbers of replicas, so
>> that you can blanket your cluster with maximally hash-partitioned tables.
>> We did some work here a couple years ago (see KUDU-1967) but there's
>> clearly much more that we can do. Of note, we previously discussed
>> implementing Multi-Raft (
>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>> ).
>> 3. Add support for consistent hashing. I'm not super familiar with the
>> details but YugabyteDB (based on Kudu) has seen success here. Their recent
>> blog post on sharding is a good (and related) read:
>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>
>> Since you're clearly more familiar with Snowflake than I, can you comment
>> more on how their partitioning system works? Sounds like it's quite
>> automated; is that ever problematic when it decides to repartition during a
>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>> hotspotting by load balancing, and concludes that hotspots can move around
>> too quickly for a load balancer to keep up.
>>
>> Separately, you mentioned having to manage Kudu's compaction process.
>> Could you go into more detail here?
>>
>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> thanks Cliff, this is really good info. I am tempted to do the
>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>> team is not tracking competition that much and I think there are awesome
>>> features in Snowflake that would be great to have in Kudu.
>>>
>>> I think they do support upserts now and back in-time queries - while
>>> they do not upsert/delete records, they just create new micro-partitions
>>> and it is a metadata operation - that's who they can see data in a given
>>> time. The idea of virtual warehouse is also very appealing especially in
>>> healthcare as we often need to share data with other departments or
>>> partners.
>>>
>>> do not get me wrong, I decided to use Kudu for the same reasons and I
>>> really did not have a choice on CDH other than HBase but I get questioned
>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>> why :)
>>>
>>> My biggest gripe with Kudu besides known limitations is the management
>>> of partitions and compaction process. For our largest tables, we just max
>>> out the number of tablets as Kudu allows per cluster. Smaller tables though
>>> is a challenge and more like an art while I feel there should be an
>>> automated process with good defaults like in other data storage systems. No
>>> one expects to assign partitions manually in Snowflake, BigQuery or
>>> Redshift. No one is expected to tweak compaction parameters and deal with
>>> fast degraded performance over time on tables with a high number of
>>> deletes/updates. We are happy with performance but it is not all about
>>> performance.
>>>
>>> We also struggle on Impala side as it feels it is limiting what we can
>>> do with Kudu and it feels like Impala team is always behind. pyspark is
>>> another problem - Kudu pyspark client is lagging behind java client and
>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>
>>> And honestly, I do not like that Impala is the only viable option with
>>> Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a
>>> tank but we cannot do that anymore with Kudu tables and struggle with disk
>>> spilling issues, Impala choosing bad explain plans and forcing us to use
>>> straight_join hint many times etc.
>>>
>>> Sorry if this post came a bit on a negative side, I really like Kudu and
>>> the dev team behind it rocks. But I do not like their industry is going and
>>> why Kudu is not getting the traction it deserves.
>>>
>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>>>
>>>> This is a good conversation but I don't think the comparison with
>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>> closed source, it seems pretty clear the architectures are quite different.
>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>> make Kudu valuable for some use cases.
>>>>
>>>> It also seems to me that their intended workloads are quite different.
>>>> Snowflake is great for intensive analytics on demand, and can handle deeply
>>>> nested data very well, where Kudu can't handle that at all. Snowflake is
>>>> not designed for heavy concurrency, but complex query plans for a small
>>>> group of users. If you select an x-large Snowflake cluster it's probably
>>>> because you have a large amount of data to churn through, not because you
>>>> have a large number of users. Or, at least that's how we used it.
>>>>
>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>> use "for free" instance-store NVMe. We also have associated
>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>> compute oriented queries.
>>>>
>>>> The net result blows away what we had with RedShift at less than 1/3
>>>> the cost, with performance improvements mostly from better concurrency
>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>> also use streaming ingestion which, aside from being impossible with
>>>> RedShift, removes the added cost of staging.
>>>>
>>>> Getting back to Snowflake, there's no way we could use it the same way
>>>> we use Kudu, and even if we could, the cost would would probably put us out
>>>> of business!
>>>>
>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> thanks Andrew for taking your time responding to me. It seems that
>>>>> there are no exact recommendations.
>>>>>
>>>>> I did look at scaling recommendations but that math is extremely
>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>> into that calculation. We have no control really what users are doing, how
>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>
>>>>> I do like the Snowflake approach than the engine takes care of
>>>>> defaults and can estimate the number of micro-partitions and even
>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>> the design is very similar. I really do not like to pick random number of
>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>> one to make these decisions. Does it make sense?
>>>>>
>>>>>
>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>>>
>>>>>> Hey Boris,
>>>>>>
>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>
>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>> tablets for concurrency?
>>>>>>
>>>>>>
>>>>>> Per your numbers, this confirms that the partitions are the units of
>>>>>> concurrency here, and that therefore having more and having smaller
>>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>>> total number of partitions cluster-wide.
>>>>>>
>>>>>> There are some trade offs with the replica count per tablet server
>>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>
>>>>>> Your point about scan concurrency is certainly a valid one -- there
>>>>>> have been patches for other integrations that have tackled this to decouple
>>>>>> partitioning from scan concurrency (KUDU-2437
>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>>> its way into Impala yet). I filed KUDU-3071
>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>>> be worked into Impala.
>>>>>>
>>>>>> For now, I would try to take into account the total sum of resources
>>>>>> you have available to Kudu (including number of tablet servers, amount of
>>>>>> storage per node, number of disks per tablet server, type of disk for the
>>>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>>>> system can handle (the scaling guide
>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>>> guide how you partition your tables.
>>>>>>
>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>
>>>>>>
>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>>>> your workload. As you show and as pointed out in documentation, larger and
>>>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>
>>>>>> It is recommended that new tables which are expected to have heavy
>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>
>>>>>>
>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>> (divide by 3 because of replication)?
>>>>>>
>>>>>>
>>>>>> The recommendation here is to have at least 20 logical partitions per
>>>>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>>>>> scan would be broken up into 20 tablet scans, and each of those might land
>>>>>> on a different tablet server running on isolated hardware. For a
>>>>>> significantly larger table into which you expect highly concurrent
>>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>>> consider range-partitioning on timestamp.
>>>>>>
>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>> No Default Partitioning
>>>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>
>>>>>>>
>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>> (divide by 3 because of replication)?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hey guys,
>>>>>>>>
>>>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>>>> answer.
>>>>>>>>
>>>>>>>> Can someone comment?
>>>>>>>>
>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>
>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>>> tablets for concurrency?
>>>>>>>>
>>>>>>>> We cannot pick the partitioning strategy for each table as we need
>>>>>>>> to stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>>>> rule but rethinking this now for another project.
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>
>>>>>>>>>  Kudu 1.5
>>>>>>>>>
>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>
>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>> 3 2Gb 65
>>>>>>>>> 9 700Mb 27
>>>>>>>>> 18 350Mb 17
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>
>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>
>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>
>>>>>>>>>>    -
>>>>>>>>>>
>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>    -
>>>>>>>>>>
>>>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>>>
>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>> diminishing returns.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>
>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>
>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>> Based on my quick test, it seems that queries are faster if I
>>>>>>>>>> have more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>>> 1Gb".
>>>>>>>>>>
>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>
>>>>>>>>>> Boris
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Andrew Wong
>>>>>>
>>>>>

-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

I look forward to this enhancement, Tim! thanks for sharing

On Mon, Mar 16, 2020 at 7:54 PM Tim Armstrong <ta...@cloudera.com>
wrote:

> > I think your hardware situation hurts not only for the number of tablets
> but also for Kudu + Impala. Impala afaik will only use one core per host
> per query, so is a poor fit for large complex queries on vertical hardware.
>
> This is basically true as of current releases of Impala but I'm working on
> addressing this. It's been possible to set mt_dop per-query on queries
> without joins or table sinks for a long time now (the limitations make it
> of limited use). Rough join and table sink support is behind a hidden flag
> in 3.3 and 3.4.  I've been working on making it performant with joins (add
> doing all the requisite testing), which should land in master very soon and
> if things go remotely to plan, be a fully supported option in the next
> release after 3.4. We saw huge speedups on a lot of queries (like 10x or
> more). Some queries didn't benefit much, if they were limited by the scan
> perf (including if the runtime filters pushed into the scans were filtering
> most data before joins).
>
> On Mon, Mar 16, 2020 at 2:58 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> appreciate your thoughts, Cliff
>>
>> On Mon, Mar 16, 2020 at 11:18 AM Cliff Resnick <cr...@gmail.com> wrote:
>>
>>> Boris,
>>>
>>> I think the crux of the problem is that "real-time analytics" over
>>> deeply nested storage does not really exist. I'll qualify that statement
>>> with "real-time" meaning streaming per-record ingestion, and "analytics" as
>>> columnar query access. The closest thing I know of today is Google BigQuery
>>> or Snowflake, but those are actually micro-batch ingestion done well, not
>>> per-record like Kudu. The only real-time analytics solution I know of that
>>> has a modicum of nesting is Druid, with a single level of "GroupBy"
>>> dimensions that get flattened into virtual rows as part of its non-SQL
>>> rollup API. Columnar storage and real-time are probably aways going to be a
>>> difficult pairing and in fact Kudu's "real-time" storage format is
>>> row-based, and Druid requires a whole lambda architecture batch compaction.
>>> As we all know, nothing is for free, whether the trade-off for nested data
>>> be column-shredding into something like Kudu as Adar suggested, or
>>> micro-batching to Parquet or some combination of both. I won't even go into
>>> how nested analytics are handled on the SQL/query side because it gets
>>> weird there, too.
>>>
>>> I think your hardware situation hurts not only for the number of tablets
>>> but also for Kudu + Impala. Impala afaik will only use one core per host
>>> per query, so is a poor fit for large complex queries on vertical hardware.
>>>
>>> In conclusion, I don't know how far you've gone into your investigations
>>> but I can say that if your needs are to support "power users" at a premium
>>> then something like Snowflake is great for your problem space. But if
>>> analytics are also more centrally integrated in pipelines then parquet is
>>> hard to beat for the price and flexibility, as is Kudu for dashboards or
>>> other intelligence that leverages upsert/key semantics. Ultimately, like
>>> many of us, you may find that you'll need all of the above.
>>>
>>> Cliff
>>>
>>>
>>>
>>> On Sun, Mar 15, 2020 at 1:11 PM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> thanks Adar for your comments and thinking this through! I really think
>>>> Kudu has tons of potential and very happy we've made a decision to use it
>>>> for our real-time pipeline.
>>>>
>>>> you asked about use cases for support of nested and large binary/text
>>>> objects. I work for a large healthcare organization and there is a lot of
>>>> going with newer HL7 FHIR standard. FHIR documents are highly nested json
>>>> objects that might get as big as a few Gbs in size. We could not use Kudu
>>>> for storing FHIR bundles/documents due to the size and inability to store
>>>> FHIR objects natively. We ended up using Hive/Spark for that, but that
>>>> solution is not real-time. And it is not only about storing, but also about
>>>> fast search in those FHIR documents. I know we could use Hbase for that but
>>>> Hbase is really bad when it comes to analytics type of workloads.
>>>>
>>>> As for large text/binaries, the good and common example is
>>>> narratives notes (physician notes, progress notes, etc.) They can be in
>>>> form of RTF or PDF documents or images.
>>>>
>>>> 300 column limit was not a big deal so far but we have high-density
>>>> nodes with 2 44-core cpus and 12 12-Tb drives and we cannot use the full
>>>> power of them due to limitations of Kudu in terms of number of tablets per
>>>> node. This is more concerning than 300 column limit.
>>>>
>>>>
>>>>
>>>> On Sat, Mar 14, 2020 at 3:45 AM Adar Lieber-Dembo <ad...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
>>>>> 1. Both are created transparently and automatically as data is
>>>>> inserted.
>>>>> 2. Both may overlap with one another based on the sort column
>>>>> (Snowflake lets you choose the sort column; in Kudu this is always the PK).
>>>>> 3. Both are pruned at scan time if the scan's predicates allow it.
>>>>> 4. In Kudu, a tablet with a large clustering depth (known as "average
>>>>> rowset height") will be a likely compaction candidate. It's not clear to me
>>>>> if Snowflake has a process to automatically compact their micro-partitions.
>>>>>
>>>>> I don't understand how this solves the partitioning problem though:
>>>>> the doc doesn't explain how micro-partitions are mapped to nodes, or
>>>>> if/when micro-partitions are load balanced across the cluster. I'd be
>>>>> curious to learn more about this.
>>>>>
>>>>> Below you'll find some more info on the other issues you raised:
>>>>> - Nested data type support is tracked in KUDU-1261. We've had an
>>>>> on-again/off-again relationship with it: it's a significant amount of work
>>>>> to implement, so we're hesitant to commit unless there's also significant
>>>>> need. Many users have been successful in "shredding" their nested types
>>>>> across many Kudu columns, which is one of the reasons we're actively
>>>>> working on increasing the number of supported columns above 300.
>>>>> - We're also working on auto-rebalancing right now; see KUDU-2780 for
>>>>> more details.
>>>>> - Large cell support is tracked in KUDU-2874, but there doesn't seem
>>>>> to have been as much interest there.
>>>>> - The compaction issue you're describing sounds like KUDU-1400, which
>>>>> was indeed fixed in 1.9. This was our bad; for quite some time we were
>>>>> focused on optimizing for heavy workloads and didn't pay attention to Kudu
>>>>> performance in "trickling" scenarios, especially not over a long period of
>>>>> time. That's also why we didn't think to advertise --flush_threshold_secs,
>>>>> or even to change its default value (which is still 2 minutes: far too
>>>>> short if we hadn't fixed KUDU-1400); we just weren't very aware of how
>>>>> widespread of a problem it was.
>>>>>
>>>>>
>>>>> On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in
>>>>>> Kudu, and I do not think snowflake deserves that much credit and noise. But
>>>>>> they do have great ideas that take their engine to the next level. I do not
>>>>>> know architecture-wise how it works but here is the best doc I found
>>>>>> regarding micro partitions:
>>>>>>
>>>>>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>>>>>
>>>>>>
>>>>>> It is also appealing to me that they support nested data types with
>>>>>> ease (and they are still very efficient to query/filter by), they scale
>>>>>> workload easily (Kudu requires manual rebalancing and it is VERY slow as
>>>>>> I've learned last weekend by doing it on our dev cluster). Also, they
>>>>>> support very large blobs and text fields that Kudu does not.
>>>>>>
>>>>>> Actually, I have a much longer wish list on the Impala side than Kudu
>>>>>> - it feels that Kudu itself is very close to Snowflake but lacks these
>>>>>> self-management features. While Impala is certainly one of the fastest SQL
>>>>>> engines I've seen, we struggle with mutli-tenancy and complex queries, that
>>>>>> Hive would run like a champ.
>>>>>>
>>>>>> As for Kudu's compaction process - it was largely an issue with 1.5
>>>>>> and I believed was addressed in 1.9. We had a terrible time for a few weeks
>>>>>> when frequent updates/delete almost froze all our queries to crawl. A lot
>>>>>> of smaller tables had a huge number of tiny rowsets causing massive scans
>>>>>> and freezing the entire Kudu cluster. We had a good discussion on slack and
>>>>>> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
>>>>>> move to 1.9 and it worked fine. Nowhere in the documentation, it was
>>>>>> suggested to set this flag and actually it was one of these "use it at your
>>>>>> own risk" flags.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This has been an excellent discussion to follow, with very useful
>>>>>>> feedback. Thank you for that.
>>>>>>>
>>>>>>> Boris, if I can try to summarize your position, it's that manual
>>>>>>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>>>>>>> when you don't control the PK of each table. The Kudu schema design guide
>>>>>>> would advise you to hash partition such tables, and, following Andrew's
>>>>>>> recommendation, have at least one partition per tserver. Except that given
>>>>>>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>>>>>>> number of partitions per tserver. And if you reduce the number of
>>>>>>> partitions per table, you open yourself up to potential hotspotting if the
>>>>>>> per-table load isn't even. Is that correct?
>>>>>>>
>>>>>>> I can think of several improvements we could make here:
>>>>>>> 1. Support splitting/merging of range partitions, first manually and
>>>>>>> later automatically based on load. This is tracked in KUDU-437 and
>>>>>>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>>>>>>> of static range/hash partitioning, there hasn't been much traction on those
>>>>>>> JIRAs.
>>>>>>> 2. Improve server efficiency when hosting large numbers of replicas,
>>>>>>> so that you can blanket your cluster with maximally hash-partitioned
>>>>>>> tables. We did some work here a couple years ago (see KUDU-1967) but
>>>>>>> there's clearly much more that we can do. Of note, we previously discussed
>>>>>>> implementing Multi-Raft (
>>>>>>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>>>>>>> ).
>>>>>>> 3. Add support for consistent hashing. I'm not super familiar with
>>>>>>> the details but YugabyteDB (based on Kudu) has seen success here. Their
>>>>>>> recent blog post on sharding is a good (and related) read:
>>>>>>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>>>>>
>>>>>>> Since you're clearly more familiar with Snowflake than I, can you
>>>>>>> comment more on how their partitioning system works? Sounds like it's quite
>>>>>>> automated; is that ever problematic when it decides to repartition during a
>>>>>>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>>>>>>> hotspotting by load balancing, and concludes that hotspots can move around
>>>>>>> too quickly for a load balancer to keep up.
>>>>>>>
>>>>>>> Separately, you mentioned having to manage Kudu's compaction
>>>>>>> process. Could you go into more detail here?
>>>>>>>
>>>>>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> thanks Cliff, this is really good info. I am tempted to do the
>>>>>>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>>>>>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>>>>>>> team is not tracking competition that much and I think there are awesome
>>>>>>>> features in Snowflake that would be great to have in Kudu.
>>>>>>>>
>>>>>>>> I think they do support upserts now and back in-time queries -
>>>>>>>> while they do not upsert/delete records, they just create new
>>>>>>>> micro-partitions and it is a metadata operation - that's who they can see
>>>>>>>> data in a given time. The idea of virtual warehouse is also very appealing
>>>>>>>> especially in healthcare as we often need to share data with other
>>>>>>>> departments or partners.
>>>>>>>>
>>>>>>>> do not get me wrong, I decided to use Kudu for the same reasons and
>>>>>>>> I really did not have a choice on CDH other than HBase but I get questioned
>>>>>>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>>>>>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>>>>>>> why :)
>>>>>>>>
>>>>>>>> My biggest gripe with Kudu besides known limitations is the
>>>>>>>> management of partitions and compaction process. For our largest tables, we
>>>>>>>> just max out the number of tablets as Kudu allows per cluster. Smaller
>>>>>>>> tables though is a challenge and more like an art while I feel there should
>>>>>>>> be an automated process with good defaults like in other data storage
>>>>>>>> systems. No one expects to assign partitions manually in Snowflake,
>>>>>>>> BigQuery or Redshift. No one is expected to tweak compaction parameters and
>>>>>>>> deal with fast degraded performance over time on tables with a high number
>>>>>>>> of deletes/updates. We are happy with performance but it is not all about
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> We also struggle on Impala side as it feels it is limiting what we
>>>>>>>> can do with Kudu and it feels like Impala team is always behind. pyspark is
>>>>>>>> another problem - Kudu pyspark client is lagging behind java client and
>>>>>>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>>>>>>
>>>>>>>> And honestly, I do not like that Impala is the only viable option
>>>>>>>> with Kudu. We are forced a lot of time to do heavy ETL in Hive which is
>>>>>>>> like a tank but we cannot do that anymore with Kudu tables and struggle
>>>>>>>> with disk spilling issues, Impala choosing bad explain plans and forcing us
>>>>>>>> to use straight_join hint many times etc.
>>>>>>>>
>>>>>>>> Sorry if this post came a bit on a negative side, I really like
>>>>>>>> Kudu and the dev team behind it rocks. But I do not like their industry is
>>>>>>>> going and why Kudu is not getting the traction it deserves.
>>>>>>>>
>>>>>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is a good conversation but I don't think the comparison with
>>>>>>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>>>>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>>>>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>>>>>>> closed source, it seems pretty clear the architectures are quite different.
>>>>>>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>>>>>>> make Kudu valuable for some use cases.
>>>>>>>>>
>>>>>>>>> It also seems to me that their intended workloads are quite
>>>>>>>>> different. Snowflake is great for intensive analytics on demand, and can
>>>>>>>>> handle deeply nested data very well, where Kudu can't handle that at all.
>>>>>>>>> Snowflake is not designed for heavy concurrency, but complex query plans
>>>>>>>>> for a small group of users. If you select an x-large Snowflake cluster it's
>>>>>>>>> probably because you have a large amount of data to churn through, not
>>>>>>>>> because you have a large number of users. Or, at least that's how we used
>>>>>>>>> it.
>>>>>>>>>
>>>>>>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>>>>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>>>>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>>>>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>>>>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>>>>>>> use "for free" instance-store NVMe. We also have associated
>>>>>>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>>>>>>> compute oriented queries.
>>>>>>>>>
>>>>>>>>> The net result blows away what we had with RedShift at less than
>>>>>>>>> 1/3 the cost, with performance improvements mostly from better concurrency
>>>>>>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>>>>>>> also use streaming ingestion which, aside from being impossible with
>>>>>>>>> RedShift, removes the added cost of staging.
>>>>>>>>>
>>>>>>>>> Getting back to Snowflake, there's no way we could use it the same
>>>>>>>>> way we use Kudu, and even if we could, the cost would would probably put us
>>>>>>>>> out of business!
>>>>>>>>>
>>>>>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> thanks Andrew for taking your time responding to me. It seems
>>>>>>>>>> that there are no exact recommendations.
>>>>>>>>>>
>>>>>>>>>> I did look at scaling recommendations but that math is extremely
>>>>>>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>>>>>>> into that calculation. We have no control really what users are doing, how
>>>>>>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>>>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>>>>>>
>>>>>>>>>> I do like the Snowflake approach than the engine takes care of
>>>>>>>>>> defaults and can estimate the number of micro-partitions and even
>>>>>>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>>>>>>> the design is very similar. I really do not like to pick random number of
>>>>>>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>>>>>>> one to make these decisions. Does it make sense?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Boris,
>>>>>>>>>>>
>>>>>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>>>>>
>>>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>>>> And a follow-up question, if I have tons of smaller tables
>>>>>>>>>>>> under 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Per your numbers, this confirms that the partitions are the
>>>>>>>>>>> units of concurrency here, and that therefore having more and having
>>>>>>>>>>> smaller partitions yields a concurrency bump. That said, extending a scheme
>>>>>>>>>>> of smaller partitions across all tables may not scale when thinking about
>>>>>>>>>>> the total number of partitions cluster-wide.
>>>>>>>>>>>
>>>>>>>>>>> There are some trade offs with the replica count per tablet
>>>>>>>>>>> server here -- generally, each tablet replica has a resource cost on tablet
>>>>>>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>>>>>>
>>>>>>>>>>> Your point about scan concurrency is certainly a valid one --
>>>>>>>>>>> there have been patches for other integrations that have tackled this to
>>>>>>>>>>> decouple partitioning from scan concurrency (KUDU-2437
>>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an
>>>>>>>>>>> example, where Kudu's Spark integration will split range scans into
>>>>>>>>>>> smaller-scoped scan tokens to be run concurrently, though this optimization
>>>>>>>>>>> hasn't made its way into Impala yet). I filed KUDU-3071
>>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what
>>>>>>>>>>> I think is left on the Kudu-side to get this up and running, so that it can
>>>>>>>>>>> be worked into Impala.
>>>>>>>>>>>
>>>>>>>>>>> For now, I would try to take into account the total sum of
>>>>>>>>>>> resources you have available to Kudu (including number of tablet servers,
>>>>>>>>>>> amount of storage per node, number of disks per tablet server, type of disk
>>>>>>>>>>> for the WAL/metadata disks), to settle on roughly how many tablet replicas
>>>>>>>>>>> your system can handle (the scaling guide
>>>>>>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be
>>>>>>>>>>> helpful here), and hopefully that, along with your own SLAs per table, can
>>>>>>>>>>> help guide how you partition your tables.
>>>>>>>>>>>
>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in
>>>>>>>>>>> the low tens of GB per tablet replica, though exact perf obviously depends
>>>>>>>>>>> on your workload. As you show and as pointed out in documentation, larger
>>>>>>>>>>> and fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>>>>>>
>>>>>>>>>>> It is recommended that new tables which are expected to have
>>>>>>>>>>>> heavy read and write workloads have at least as many tablets as tablet
>>>>>>>>>>>> servers.
>>>>>>>>>>>
>>>>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The recommendation here is to have at least 20 logical
>>>>>>>>>>> partitions per table. That way, a scan were to touch a table's entire
>>>>>>>>>>> keyspace, the table scan would be broken up into 20 tablet scans, and each
>>>>>>>>>>> of those might land on a different tablet server running on isolated
>>>>>>>>>>> hardware. For a significantly larger table into which you expect highly
>>>>>>>>>>> concurrent workloads, the recommendation serves as a lower bound -- I'd
>>>>>>>>>>> recommend having more partitions, and if your data is naturally
>>>>>>>>>>> time-oriented, consider range-partitioning on timestamp.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <
>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>>>>>>> No Default Partitioning
>>>>>>>>>>>> Kudu does not provide a default partitioning strategy when
>>>>>>>>>>>> creating tables. It is recommended that new tables which are expected to
>>>>>>>>>>>> have heavy read and write workloads have at least as many tablets as tablet
>>>>>>>>>>>> servers.
>>>>>>>>>>>>
>>>>>>>>>>>> if I have 20 tablet servers and I have two tables - one with
>>>>>>>>>>>> 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for
>>>>>>>>>>>> both (divide by 3 because of replication)?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <
>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> hey guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I asked the same question on Slack on and got no responses. I
>>>>>>>>>>>>> just went through the docs and design doc and FAQ and still did not find an
>>>>>>>>>>>>> answer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can someone comment?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe I was not asking a clear question. If my cluster is
>>>>>>>>>>>>> large enough in my example above, should I go with 3, 9 or 18 tablets? or
>>>>>>>>>>>>> should I pick tablets to be closer to 1Gb?
>>>>>>>>>>>>>
>>>>>>>>>>>>> And a follow-up question, if I have tons of smaller tables
>>>>>>>>>>>>> under 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>>>>>
>>>>>>>>>>>>> We cannot pick the partitioning strategy for each table as we
>>>>>>>>>>>>> need to stream 100s of tables and we use PK from RBDMS and need to come
>>>>>>>>>>>>> with an automated way to pick number of partitions/tablets. So far I was
>>>>>>>>>>>>> using 1Gb rule but rethinking this now for another project.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <
>>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Kudu 1.5
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>>>>>>> 3 2Gb 65
>>>>>>>>>>>>>> 9 700Mb 27
>>>>>>>>>>>>>> 18 350Mb 17
>>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>>>>>>    -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    For small tables, such as dimension tables, ensure that
>>>>>>>>>>>>>>>    each tablet is at least 1 GB in size.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>>>>>>> diminishing returns.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>>>>>>> Based on my quick test, it seems that queries are faster if
>>>>>>>>>>>>>>> I have more tablets/partitions...In this example, 18 tablets gave me the
>>>>>>>>>>>>>>> best timing but tablet size was around 300-400Mb. But the doc says "at
>>>>>>>>>>>>>>> least 1Gb".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Boris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Andrew Wong
>>>>>>>>>>>
>>>>>>>>>>

Re: Partitioning Rules of Thumb

Posted by Cliff Resnick <cr...@gmail.com>.

Hey Tim,

That optimization does seem particularly well suited for handling nested
data since they are joins, and I'm curious about other parallizable
execution paths. Exciting news, thanks.

Cliff

On Mon, Mar 16, 2020, 7:55 PM Tim Armstrong <ta...@cloudera.com> wrote:

> > I think your hardware situation hurts not only for the number of tablets
> but also for Kudu + Impala. Impala afaik will only use one core per host
> per query, so is a poor fit for large complex queries on vertical hardware.
>
> This is basically true as of current releases of Impala but I'm working on
> addressing this. It's been possible to set mt_dop per-query on queries
> without joins or table sinks for a long time now (the limitations make it
> of limited use). Rough join and table sink support is behind a hidden flag
> in 3.3 and 3.4.  I've been working on making it performant with joins (add
> doing all the requisite testing), which should land in master very soon and
> if things go remotely to plan, be a fully supported option in the next
> release after 3.4. We saw huge speedups on a lot of queries (like 10x or
> more). Some queries didn't benefit much, if they were limited by the scan
> perf (including if the runtime filters pushed into the scans were filtering
> most data before joins).
>
> On Mon, Mar 16, 2020 at 2:58 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> appreciate your thoughts, Cliff
>>
>> On Mon, Mar 16, 2020 at 11:18 AM Cliff Resnick <cr...@gmail.com> wrote:
>>
>>> Boris,
>>>
>>> I think the crux of the problem is that "real-time analytics" over
>>> deeply nested storage does not really exist. I'll qualify that statement
>>> with "real-time" meaning streaming per-record ingestion, and "analytics" as
>>> columnar query access. The closest thing I know of today is Google BigQuery
>>> or Snowflake, but those are actually micro-batch ingestion done well, not
>>> per-record like Kudu. The only real-time analytics solution I know of that
>>> has a modicum of nesting is Druid, with a single level of "GroupBy"
>>> dimensions that get flattened into virtual rows as part of its non-SQL
>>> rollup API. Columnar storage and real-time are probably aways going to be a
>>> difficult pairing and in fact Kudu's "real-time" storage format is
>>> row-based, and Druid requires a whole lambda architecture batch compaction.
>>> As we all know, nothing is for free, whether the trade-off for nested data
>>> be column-shredding into something like Kudu as Adar suggested, or
>>> micro-batching to Parquet or some combination of both. I won't even go into
>>> how nested analytics are handled on the SQL/query side because it gets
>>> weird there, too.
>>>
>>> I think your hardware situation hurts not only for the number of tablets
>>> but also for Kudu + Impala. Impala afaik will only use one core per host
>>> per query, so is a poor fit for large complex queries on vertical hardware.
>>>
>>> In conclusion, I don't know how far you've gone into your investigations
>>> but I can say that if your needs are to support "power users" at a premium
>>> then something like Snowflake is great for your problem space. But if
>>> analytics are also more centrally integrated in pipelines then parquet is
>>> hard to beat for the price and flexibility, as is Kudu for dashboards or
>>> other intelligence that leverages upsert/key semantics. Ultimately, like
>>> many of us, you may find that you'll need all of the above.
>>>
>>> Cliff
>>>
>>>
>>>
>>> On Sun, Mar 15, 2020 at 1:11 PM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> thanks Adar for your comments and thinking this through! I really think
>>>> Kudu has tons of potential and very happy we've made a decision to use it
>>>> for our real-time pipeline.
>>>>
>>>> you asked about use cases for support of nested and large binary/text
>>>> objects. I work for a large healthcare organization and there is a lot of
>>>> going with newer HL7 FHIR standard. FHIR documents are highly nested json
>>>> objects that might get as big as a few Gbs in size. We could not use Kudu
>>>> for storing FHIR bundles/documents due to the size and inability to store
>>>> FHIR objects natively. We ended up using Hive/Spark for that, but that
>>>> solution is not real-time. And it is not only about storing, but also about
>>>> fast search in those FHIR documents. I know we could use Hbase for that but
>>>> Hbase is really bad when it comes to analytics type of workloads.
>>>>
>>>> As for large text/binaries, the good and common example is
>>>> narratives notes (physician notes, progress notes, etc.) They can be in
>>>> form of RTF or PDF documents or images.
>>>>
>>>> 300 column limit was not a big deal so far but we have high-density
>>>> nodes with 2 44-core cpus and 12 12-Tb drives and we cannot use the full
>>>> power of them due to limitations of Kudu in terms of number of tablets per
>>>> node. This is more concerning than 300 column limit.
>>>>
>>>>
>>>>
>>>> On Sat, Mar 14, 2020 at 3:45 AM Adar Lieber-Dembo <ad...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
>>>>> 1. Both are created transparently and automatically as data is
>>>>> inserted.
>>>>> 2. Both may overlap with one another based on the sort column
>>>>> (Snowflake lets you choose the sort column; in Kudu this is always the PK).
>>>>> 3. Both are pruned at scan time if the scan's predicates allow it.
>>>>> 4. In Kudu, a tablet with a large clustering depth (known as "average
>>>>> rowset height") will be a likely compaction candidate. It's not clear to me
>>>>> if Snowflake has a process to automatically compact their micro-partitions.
>>>>>
>>>>> I don't understand how this solves the partitioning problem though:
>>>>> the doc doesn't explain how micro-partitions are mapped to nodes, or
>>>>> if/when micro-partitions are load balanced across the cluster. I'd be
>>>>> curious to learn more about this.
>>>>>
>>>>> Below you'll find some more info on the other issues you raised:
>>>>> - Nested data type support is tracked in KUDU-1261. We've had an
>>>>> on-again/off-again relationship with it: it's a significant amount of work
>>>>> to implement, so we're hesitant to commit unless there's also significant
>>>>> need. Many users have been successful in "shredding" their nested types
>>>>> across many Kudu columns, which is one of the reasons we're actively
>>>>> working on increasing the number of supported columns above 300.
>>>>> - We're also working on auto-rebalancing right now; see KUDU-2780 for
>>>>> more details.
>>>>> - Large cell support is tracked in KUDU-2874, but there doesn't seem
>>>>> to have been as much interest there.
>>>>> - The compaction issue you're describing sounds like KUDU-1400, which
>>>>> was indeed fixed in 1.9. This was our bad; for quite some time we were
>>>>> focused on optimizing for heavy workloads and didn't pay attention to Kudu
>>>>> performance in "trickling" scenarios, especially not over a long period of
>>>>> time. That's also why we didn't think to advertise --flush_threshold_secs,
>>>>> or even to change its default value (which is still 2 minutes: far too
>>>>> short if we hadn't fixed KUDU-1400); we just weren't very aware of how
>>>>> widespread of a problem it was.
>>>>>
>>>>>
>>>>> On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in
>>>>>> Kudu, and I do not think snowflake deserves that much credit and noise. But
>>>>>> they do have great ideas that take their engine to the next level. I do not
>>>>>> know architecture-wise how it works but here is the best doc I found
>>>>>> regarding micro partitions:
>>>>>>
>>>>>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>>>>>
>>>>>>
>>>>>> It is also appealing to me that they support nested data types with
>>>>>> ease (and they are still very efficient to query/filter by), they scale
>>>>>> workload easily (Kudu requires manual rebalancing and it is VERY slow as
>>>>>> I've learned last weekend by doing it on our dev cluster). Also, they
>>>>>> support very large blobs and text fields that Kudu does not.
>>>>>>
>>>>>> Actually, I have a much longer wish list on the Impala side than Kudu
>>>>>> - it feels that Kudu itself is very close to Snowflake but lacks these
>>>>>> self-management features. While Impala is certainly one of the fastest SQL
>>>>>> engines I've seen, we struggle with mutli-tenancy and complex queries, that
>>>>>> Hive would run like a champ.
>>>>>>
>>>>>> As for Kudu's compaction process - it was largely an issue with 1.5
>>>>>> and I believed was addressed in 1.9. We had a terrible time for a few weeks
>>>>>> when frequent updates/delete almost froze all our queries to crawl. A lot
>>>>>> of smaller tables had a huge number of tiny rowsets causing massive scans
>>>>>> and freezing the entire Kudu cluster. We had a good discussion on slack and
>>>>>> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
>>>>>> move to 1.9 and it worked fine. Nowhere in the documentation, it was
>>>>>> suggested to set this flag and actually it was one of these "use it at your
>>>>>> own risk" flags.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This has been an excellent discussion to follow, with very useful
>>>>>>> feedback. Thank you for that.
>>>>>>>
>>>>>>> Boris, if I can try to summarize your position, it's that manual
>>>>>>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>>>>>>> when you don't control the PK of each table. The Kudu schema design guide
>>>>>>> would advise you to hash partition such tables, and, following Andrew's
>>>>>>> recommendation, have at least one partition per tserver. Except that given
>>>>>>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>>>>>>> number of partitions per tserver. And if you reduce the number of
>>>>>>> partitions per table, you open yourself up to potential hotspotting if the
>>>>>>> per-table load isn't even. Is that correct?
>>>>>>>
>>>>>>> I can think of several improvements we could make here:
>>>>>>> 1. Support splitting/merging of range partitions, first manually and
>>>>>>> later automatically based on load. This is tracked in KUDU-437 and
>>>>>>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>>>>>>> of static range/hash partitioning, there hasn't been much traction on those
>>>>>>> JIRAs.
>>>>>>> 2. Improve server efficiency when hosting large numbers of replicas,
>>>>>>> so that you can blanket your cluster with maximally hash-partitioned
>>>>>>> tables. We did some work here a couple years ago (see KUDU-1967) but
>>>>>>> there's clearly much more that we can do. Of note, we previously discussed
>>>>>>> implementing Multi-Raft (
>>>>>>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>>>>>>> ).
>>>>>>> 3. Add support for consistent hashing. I'm not super familiar with
>>>>>>> the details but YugabyteDB (based on Kudu) has seen success here. Their
>>>>>>> recent blog post on sharding is a good (and related) read:
>>>>>>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>>>>>
>>>>>>> Since you're clearly more familiar with Snowflake than I, can you
>>>>>>> comment more on how their partitioning system works? Sounds like it's quite
>>>>>>> automated; is that ever problematic when it decides to repartition during a
>>>>>>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>>>>>>> hotspotting by load balancing, and concludes that hotspots can move around
>>>>>>> too quickly for a load balancer to keep up.
>>>>>>>
>>>>>>> Separately, you mentioned having to manage Kudu's compaction
>>>>>>> process. Could you go into more detail here?
>>>>>>>
>>>>>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> thanks Cliff, this is really good info. I am tempted to do the
>>>>>>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>>>>>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>>>>>>> team is not tracking competition that much and I think there are awesome
>>>>>>>> features in Snowflake that would be great to have in Kudu.
>>>>>>>>
>>>>>>>> I think they do support upserts now and back in-time queries -
>>>>>>>> while they do not upsert/delete records, they just create new
>>>>>>>> micro-partitions and it is a metadata operation - that's who they can see
>>>>>>>> data in a given time. The idea of virtual warehouse is also very appealing
>>>>>>>> especially in healthcare as we often need to share data with other
>>>>>>>> departments or partners.
>>>>>>>>
>>>>>>>> do not get me wrong, I decided to use Kudu for the same reasons and
>>>>>>>> I really did not have a choice on CDH other than HBase but I get questioned
>>>>>>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>>>>>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>>>>>>> why :)
>>>>>>>>
>>>>>>>> My biggest gripe with Kudu besides known limitations is the
>>>>>>>> management of partitions and compaction process. For our largest tables, we
>>>>>>>> just max out the number of tablets as Kudu allows per cluster. Smaller
>>>>>>>> tables though is a challenge and more like an art while I feel there should
>>>>>>>> be an automated process with good defaults like in other data storage
>>>>>>>> systems. No one expects to assign partitions manually in Snowflake,
>>>>>>>> BigQuery or Redshift. No one is expected to tweak compaction parameters and
>>>>>>>> deal with fast degraded performance over time on tables with a high number
>>>>>>>> of deletes/updates. We are happy with performance but it is not all about
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> We also struggle on Impala side as it feels it is limiting what we
>>>>>>>> can do with Kudu and it feels like Impala team is always behind. pyspark is
>>>>>>>> another problem - Kudu pyspark client is lagging behind java client and
>>>>>>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>>>>>>
>>>>>>>> And honestly, I do not like that Impala is the only viable option
>>>>>>>> with Kudu. We are forced a lot of time to do heavy ETL in Hive which is
>>>>>>>> like a tank but we cannot do that anymore with Kudu tables and struggle
>>>>>>>> with disk spilling issues, Impala choosing bad explain plans and forcing us
>>>>>>>> to use straight_join hint many times etc.
>>>>>>>>
>>>>>>>> Sorry if this post came a bit on a negative side, I really like
>>>>>>>> Kudu and the dev team behind it rocks. But I do not like their industry is
>>>>>>>> going and why Kudu is not getting the traction it deserves.
>>>>>>>>
>>>>>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is a good conversation but I don't think the comparison with
>>>>>>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>>>>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>>>>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>>>>>>> closed source, it seems pretty clear the architectures are quite different.
>>>>>>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>>>>>>> make Kudu valuable for some use cases.
>>>>>>>>>
>>>>>>>>> It also seems to me that their intended workloads are quite
>>>>>>>>> different. Snowflake is great for intensive analytics on demand, and can
>>>>>>>>> handle deeply nested data very well, where Kudu can't handle that at all.
>>>>>>>>> Snowflake is not designed for heavy concurrency, but complex query plans
>>>>>>>>> for a small group of users. If you select an x-large Snowflake cluster it's
>>>>>>>>> probably because you have a large amount of data to churn through, not
>>>>>>>>> because you have a large number of users. Or, at least that's how we used
>>>>>>>>> it.
>>>>>>>>>
>>>>>>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>>>>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>>>>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>>>>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>>>>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>>>>>>> use "for free" instance-store NVMe. We also have associated
>>>>>>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>>>>>>> compute oriented queries.
>>>>>>>>>
>>>>>>>>> The net result blows away what we had with RedShift at less than
>>>>>>>>> 1/3 the cost, with performance improvements mostly from better concurrency
>>>>>>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>>>>>>> also use streaming ingestion which, aside from being impossible with
>>>>>>>>> RedShift, removes the added cost of staging.
>>>>>>>>>
>>>>>>>>> Getting back to Snowflake, there's no way we could use it the same
>>>>>>>>> way we use Kudu, and even if we could, the cost would would probably put us
>>>>>>>>> out of business!
>>>>>>>>>
>>>>>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> thanks Andrew for taking your time responding to me. It seems
>>>>>>>>>> that there are no exact recommendations.
>>>>>>>>>>
>>>>>>>>>> I did look at scaling recommendations but that math is extremely
>>>>>>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>>>>>>> into that calculation. We have no control really what users are doing, how
>>>>>>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>>>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>>>>>>
>>>>>>>>>> I do like the Snowflake approach than the engine takes care of
>>>>>>>>>> defaults and can estimate the number of micro-partitions and even
>>>>>>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>>>>>>> the design is very similar. I really do not like to pick random number of
>>>>>>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>>>>>>> one to make these decisions. Does it make sense?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Boris,
>>>>>>>>>>>
>>>>>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>>>>>
>>>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>>>> And a follow-up question, if I have tons of smaller tables
>>>>>>>>>>>> under 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Per your numbers, this confirms that the partitions are the
>>>>>>>>>>> units of concurrency here, and that therefore having more and having
>>>>>>>>>>> smaller partitions yields a concurrency bump. That said, extending a scheme
>>>>>>>>>>> of smaller partitions across all tables may not scale when thinking about
>>>>>>>>>>> the total number of partitions cluster-wide.
>>>>>>>>>>>
>>>>>>>>>>> There are some trade offs with the replica count per tablet
>>>>>>>>>>> server here -- generally, each tablet replica has a resource cost on tablet
>>>>>>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>>>>>>
>>>>>>>>>>> Your point about scan concurrency is certainly a valid one --
>>>>>>>>>>> there have been patches for other integrations that have tackled this to
>>>>>>>>>>> decouple partitioning from scan concurrency (KUDU-2437
>>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an
>>>>>>>>>>> example, where Kudu's Spark integration will split range scans into
>>>>>>>>>>> smaller-scoped scan tokens to be run concurrently, though this optimization
>>>>>>>>>>> hasn't made its way into Impala yet). I filed KUDU-3071
>>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what
>>>>>>>>>>> I think is left on the Kudu-side to get this up and running, so that it can
>>>>>>>>>>> be worked into Impala.
>>>>>>>>>>>
>>>>>>>>>>> For now, I would try to take into account the total sum of
>>>>>>>>>>> resources you have available to Kudu (including number of tablet servers,
>>>>>>>>>>> amount of storage per node, number of disks per tablet server, type of disk
>>>>>>>>>>> for the WAL/metadata disks), to settle on roughly how many tablet replicas
>>>>>>>>>>> your system can handle (the scaling guide
>>>>>>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be
>>>>>>>>>>> helpful here), and hopefully that, along with your own SLAs per table, can
>>>>>>>>>>> help guide how you partition your tables.
>>>>>>>>>>>
>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in
>>>>>>>>>>> the low tens of GB per tablet replica, though exact perf obviously depends
>>>>>>>>>>> on your workload. As you show and as pointed out in documentation, larger
>>>>>>>>>>> and fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>>>>>>
>>>>>>>>>>> It is recommended that new tables which are expected to have
>>>>>>>>>>>> heavy read and write workloads have at least as many tablets as tablet
>>>>>>>>>>>> servers.
>>>>>>>>>>>
>>>>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The recommendation here is to have at least 20 logical
>>>>>>>>>>> partitions per table. That way, a scan were to touch a table's entire
>>>>>>>>>>> keyspace, the table scan would be broken up into 20 tablet scans, and each
>>>>>>>>>>> of those might land on a different tablet server running on isolated
>>>>>>>>>>> hardware. For a significantly larger table into which you expect highly
>>>>>>>>>>> concurrent workloads, the recommendation serves as a lower bound -- I'd
>>>>>>>>>>> recommend having more partitions, and if your data is naturally
>>>>>>>>>>> time-oriented, consider range-partitioning on timestamp.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <
>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>>>>>>> No Default Partitioning
>>>>>>>>>>>> Kudu does not provide a default partitioning strategy when
>>>>>>>>>>>> creating tables. It is recommended that new tables which are expected to
>>>>>>>>>>>> have heavy read and write workloads have at least as many tablets as tablet
>>>>>>>>>>>> servers.
>>>>>>>>>>>>
>>>>>>>>>>>> if I have 20 tablet servers and I have two tables - one with
>>>>>>>>>>>> 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for
>>>>>>>>>>>> both (divide by 3 because of replication)?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <
>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> hey guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I asked the same question on Slack on and got no responses. I
>>>>>>>>>>>>> just went through the docs and design doc and FAQ and still did not find an
>>>>>>>>>>>>> answer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can someone comment?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe I was not asking a clear question. If my cluster is
>>>>>>>>>>>>> large enough in my example above, should I go with 3, 9 or 18 tablets? or
>>>>>>>>>>>>> should I pick tablets to be closer to 1Gb?
>>>>>>>>>>>>>
>>>>>>>>>>>>> And a follow-up question, if I have tons of smaller tables
>>>>>>>>>>>>> under 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>>>>>
>>>>>>>>>>>>> We cannot pick the partitioning strategy for each table as we
>>>>>>>>>>>>> need to stream 100s of tables and we use PK from RBDMS and need to come
>>>>>>>>>>>>> with an automated way to pick number of partitions/tablets. So far I was
>>>>>>>>>>>>> using 1Gb rule but rethinking this now for another project.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <
>>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Kudu 1.5
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>>>>>>> 3 2Gb 65
>>>>>>>>>>>>>> 9 700Mb 27
>>>>>>>>>>>>>> 18 350Mb 17
>>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>>>>>>    -
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    For small tables, such as dimension tables, ensure that
>>>>>>>>>>>>>>>    each tablet is at least 1 GB in size.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>>>>>>> diminishing returns.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>>>>>>> Based on my quick test, it seems that queries are faster if
>>>>>>>>>>>>>>> I have more tablets/partitions...In this example, 18 tablets gave me the
>>>>>>>>>>>>>>> best timing but tablet size was around 300-400Mb. But the doc says "at
>>>>>>>>>>>>>>> least 1Gb".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Boris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Andrew Wong
>>>>>>>>>>>
>>>>>>>>>>

Re: Partitioning Rules of Thumb

Posted by Tim Armstrong <ta...@cloudera.com>.

> I think your hardware situation hurts not only for the number of tablets
but also for Kudu + Impala. Impala afaik will only use one core per host
per query, so is a poor fit for large complex queries on vertical hardware.

This is basically true as of current releases of Impala but I'm working on
addressing this. It's been possible to set mt_dop per-query on queries
without joins or table sinks for a long time now (the limitations make it
of limited use). Rough join and table sink support is behind a hidden flag
in 3.3 and 3.4.  I've been working on making it performant with joins (add
doing all the requisite testing), which should land in master very soon and
if things go remotely to plan, be a fully supported option in the next
release after 3.4. We saw huge speedups on a lot of queries (like 10x or
more). Some queries didn't benefit much, if they were limited by the scan
perf (including if the runtime filters pushed into the scans were filtering
most data before joins).

On Mon, Mar 16, 2020 at 2:58 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> appreciate your thoughts, Cliff
>
> On Mon, Mar 16, 2020 at 11:18 AM Cliff Resnick <cr...@gmail.com> wrote:
>
>> Boris,
>>
>> I think the crux of the problem is that "real-time analytics" over deeply
>> nested storage does not really exist. I'll qualify that statement with
>> "real-time" meaning streaming per-record ingestion, and "analytics" as
>> columnar query access. The closest thing I know of today is Google BigQuery
>> or Snowflake, but those are actually micro-batch ingestion done well, not
>> per-record like Kudu. The only real-time analytics solution I know of that
>> has a modicum of nesting is Druid, with a single level of "GroupBy"
>> dimensions that get flattened into virtual rows as part of its non-SQL
>> rollup API. Columnar storage and real-time are probably aways going to be a
>> difficult pairing and in fact Kudu's "real-time" storage format is
>> row-based, and Druid requires a whole lambda architecture batch compaction.
>> As we all know, nothing is for free, whether the trade-off for nested data
>> be column-shredding into something like Kudu as Adar suggested, or
>> micro-batching to Parquet or some combination of both. I won't even go into
>> how nested analytics are handled on the SQL/query side because it gets
>> weird there, too.
>>
>> I think your hardware situation hurts not only for the number of tablets
>> but also for Kudu + Impala. Impala afaik will only use one core per host
>> per query, so is a poor fit for large complex queries on vertical hardware.
>>
>> In conclusion, I don't know how far you've gone into your investigations
>> but I can say that if your needs are to support "power users" at a premium
>> then something like Snowflake is great for your problem space. But if
>> analytics are also more centrally integrated in pipelines then parquet is
>> hard to beat for the price and flexibility, as is Kudu for dashboards or
>> other intelligence that leverages upsert/key semantics. Ultimately, like
>> many of us, you may find that you'll need all of the above.
>>
>> Cliff
>>
>>
>>
>> On Sun, Mar 15, 2020 at 1:11 PM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> thanks Adar for your comments and thinking this through! I really think
>>> Kudu has tons of potential and very happy we've made a decision to use it
>>> for our real-time pipeline.
>>>
>>> you asked about use cases for support of nested and large binary/text
>>> objects. I work for a large healthcare organization and there is a lot of
>>> going with newer HL7 FHIR standard. FHIR documents are highly nested json
>>> objects that might get as big as a few Gbs in size. We could not use Kudu
>>> for storing FHIR bundles/documents due to the size and inability to store
>>> FHIR objects natively. We ended up using Hive/Spark for that, but that
>>> solution is not real-time. And it is not only about storing, but also about
>>> fast search in those FHIR documents. I know we could use Hbase for that but
>>> Hbase is really bad when it comes to analytics type of workloads.
>>>
>>> As for large text/binaries, the good and common example is
>>> narratives notes (physician notes, progress notes, etc.) They can be in
>>> form of RTF or PDF documents or images.
>>>
>>> 300 column limit was not a big deal so far but we have high-density
>>> nodes with 2 44-core cpus and 12 12-Tb drives and we cannot use the full
>>> power of them due to limitations of Kudu in terms of number of tablets per
>>> node. This is more concerning than 300 column limit.
>>>
>>>
>>>
>>> On Sat, Mar 14, 2020 at 3:45 AM Adar Lieber-Dembo <ad...@cloudera.com>
>>> wrote:
>>>
>>>> Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
>>>> 1. Both are created transparently and automatically as data is inserted.
>>>> 2. Both may overlap with one another based on the sort column
>>>> (Snowflake lets you choose the sort column; in Kudu this is always the PK).
>>>> 3. Both are pruned at scan time if the scan's predicates allow it.
>>>> 4. In Kudu, a tablet with a large clustering depth (known as "average
>>>> rowset height") will be a likely compaction candidate. It's not clear to me
>>>> if Snowflake has a process to automatically compact their micro-partitions.
>>>>
>>>> I don't understand how this solves the partitioning problem though: the
>>>> doc doesn't explain how micro-partitions are mapped to nodes, or if/when
>>>> micro-partitions are load balanced across the cluster. I'd be curious to
>>>> learn more about this.
>>>>
>>>> Below you'll find some more info on the other issues you raised:
>>>> - Nested data type support is tracked in KUDU-1261. We've had an
>>>> on-again/off-again relationship with it: it's a significant amount of work
>>>> to implement, so we're hesitant to commit unless there's also significant
>>>> need. Many users have been successful in "shredding" their nested types
>>>> across many Kudu columns, which is one of the reasons we're actively
>>>> working on increasing the number of supported columns above 300.
>>>> - We're also working on auto-rebalancing right now; see KUDU-2780 for
>>>> more details.
>>>> - Large cell support is tracked in KUDU-2874, but there doesn't seem to
>>>> have been as much interest there.
>>>> - The compaction issue you're describing sounds like KUDU-1400, which
>>>> was indeed fixed in 1.9. This was our bad; for quite some time we were
>>>> focused on optimizing for heavy workloads and didn't pay attention to Kudu
>>>> performance in "trickling" scenarios, especially not over a long period of
>>>> time. That's also why we didn't think to advertise --flush_threshold_secs,
>>>> or even to change its default value (which is still 2 minutes: far too
>>>> short if we hadn't fixed KUDU-1400); we just weren't very aware of how
>>>> widespread of a problem it was.
>>>>
>>>>
>>>> On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in
>>>>> Kudu, and I do not think snowflake deserves that much credit and noise. But
>>>>> they do have great ideas that take their engine to the next level. I do not
>>>>> know architecture-wise how it works but here is the best doc I found
>>>>> regarding micro partitions:
>>>>>
>>>>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>>>>
>>>>>
>>>>> It is also appealing to me that they support nested data types with
>>>>> ease (and they are still very efficient to query/filter by), they scale
>>>>> workload easily (Kudu requires manual rebalancing and it is VERY slow as
>>>>> I've learned last weekend by doing it on our dev cluster). Also, they
>>>>> support very large blobs and text fields that Kudu does not.
>>>>>
>>>>> Actually, I have a much longer wish list on the Impala side than Kudu
>>>>> - it feels that Kudu itself is very close to Snowflake but lacks these
>>>>> self-management features. While Impala is certainly one of the fastest SQL
>>>>> engines I've seen, we struggle with mutli-tenancy and complex queries, that
>>>>> Hive would run like a champ.
>>>>>
>>>>> As for Kudu's compaction process - it was largely an issue with 1.5
>>>>> and I believed was addressed in 1.9. We had a terrible time for a few weeks
>>>>> when frequent updates/delete almost froze all our queries to crawl. A lot
>>>>> of smaller tables had a huge number of tiny rowsets causing massive scans
>>>>> and freezing the entire Kudu cluster. We had a good discussion on slack and
>>>>> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
>>>>> move to 1.9 and it worked fine. Nowhere in the documentation, it was
>>>>> suggested to set this flag and actually it was one of these "use it at your
>>>>> own risk" flags.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> This has been an excellent discussion to follow, with very useful
>>>>>> feedback. Thank you for that.
>>>>>>
>>>>>> Boris, if I can try to summarize your position, it's that manual
>>>>>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>>>>>> when you don't control the PK of each table. The Kudu schema design guide
>>>>>> would advise you to hash partition such tables, and, following Andrew's
>>>>>> recommendation, have at least one partition per tserver. Except that given
>>>>>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>>>>>> number of partitions per tserver. And if you reduce the number of
>>>>>> partitions per table, you open yourself up to potential hotspotting if the
>>>>>> per-table load isn't even. Is that correct?
>>>>>>
>>>>>> I can think of several improvements we could make here:
>>>>>> 1. Support splitting/merging of range partitions, first manually and
>>>>>> later automatically based on load. This is tracked in KUDU-437 and
>>>>>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>>>>>> of static range/hash partitioning, there hasn't been much traction on those
>>>>>> JIRAs.
>>>>>> 2. Improve server efficiency when hosting large numbers of replicas,
>>>>>> so that you can blanket your cluster with maximally hash-partitioned
>>>>>> tables. We did some work here a couple years ago (see KUDU-1967) but
>>>>>> there's clearly much more that we can do. Of note, we previously discussed
>>>>>> implementing Multi-Raft (
>>>>>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>>>>>> ).
>>>>>> 3. Add support for consistent hashing. I'm not super familiar with
>>>>>> the details but YugabyteDB (based on Kudu) has seen success here. Their
>>>>>> recent blog post on sharding is a good (and related) read:
>>>>>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>>>>
>>>>>> Since you're clearly more familiar with Snowflake than I, can you
>>>>>> comment more on how their partitioning system works? Sounds like it's quite
>>>>>> automated; is that ever problematic when it decides to repartition during a
>>>>>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>>>>>> hotspotting by load balancing, and concludes that hotspots can move around
>>>>>> too quickly for a load balancer to keep up.
>>>>>>
>>>>>> Separately, you mentioned having to manage Kudu's compaction process.
>>>>>> Could you go into more detail here?
>>>>>>
>>>>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> thanks Cliff, this is really good info. I am tempted to do the
>>>>>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>>>>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>>>>>> team is not tracking competition that much and I think there are awesome
>>>>>>> features in Snowflake that would be great to have in Kudu.
>>>>>>>
>>>>>>> I think they do support upserts now and back in-time queries - while
>>>>>>> they do not upsert/delete records, they just create new micro-partitions
>>>>>>> and it is a metadata operation - that's who they can see data in a given
>>>>>>> time. The idea of virtual warehouse is also very appealing especially in
>>>>>>> healthcare as we often need to share data with other departments or
>>>>>>> partners.
>>>>>>>
>>>>>>> do not get me wrong, I decided to use Kudu for the same reasons and
>>>>>>> I really did not have a choice on CDH other than HBase but I get questioned
>>>>>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>>>>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>>>>>> why :)
>>>>>>>
>>>>>>> My biggest gripe with Kudu besides known limitations is the
>>>>>>> management of partitions and compaction process. For our largest tables, we
>>>>>>> just max out the number of tablets as Kudu allows per cluster. Smaller
>>>>>>> tables though is a challenge and more like an art while I feel there should
>>>>>>> be an automated process with good defaults like in other data storage
>>>>>>> systems. No one expects to assign partitions manually in Snowflake,
>>>>>>> BigQuery or Redshift. No one is expected to tweak compaction parameters and
>>>>>>> deal with fast degraded performance over time on tables with a high number
>>>>>>> of deletes/updates. We are happy with performance but it is not all about
>>>>>>> performance.
>>>>>>>
>>>>>>> We also struggle on Impala side as it feels it is limiting what we
>>>>>>> can do with Kudu and it feels like Impala team is always behind. pyspark is
>>>>>>> another problem - Kudu pyspark client is lagging behind java client and
>>>>>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>>>>>
>>>>>>> And honestly, I do not like that Impala is the only viable option
>>>>>>> with Kudu. We are forced a lot of time to do heavy ETL in Hive which is
>>>>>>> like a tank but we cannot do that anymore with Kudu tables and struggle
>>>>>>> with disk spilling issues, Impala choosing bad explain plans and forcing us
>>>>>>> to use straight_join hint many times etc.
>>>>>>>
>>>>>>> Sorry if this post came a bit on a negative side, I really like Kudu
>>>>>>> and the dev team behind it rocks. But I do not like their industry is going
>>>>>>> and why Kudu is not getting the traction it deserves.
>>>>>>>
>>>>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> This is a good conversation but I don't think the comparison with
>>>>>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>>>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>>>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>>>>>> closed source, it seems pretty clear the architectures are quite different.
>>>>>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>>>>>> make Kudu valuable for some use cases.
>>>>>>>>
>>>>>>>> It also seems to me that their intended workloads are quite
>>>>>>>> different. Snowflake is great for intensive analytics on demand, and can
>>>>>>>> handle deeply nested data very well, where Kudu can't handle that at all.
>>>>>>>> Snowflake is not designed for heavy concurrency, but complex query plans
>>>>>>>> for a small group of users. If you select an x-large Snowflake cluster it's
>>>>>>>> probably because you have a large amount of data to churn through, not
>>>>>>>> because you have a large number of users. Or, at least that's how we used
>>>>>>>> it.
>>>>>>>>
>>>>>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>>>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>>>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>>>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>>>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>>>>>> use "for free" instance-store NVMe. We also have associated
>>>>>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>>>>>> compute oriented queries.
>>>>>>>>
>>>>>>>> The net result blows away what we had with RedShift at less than
>>>>>>>> 1/3 the cost, with performance improvements mostly from better concurrency
>>>>>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>>>>>> also use streaming ingestion which, aside from being impossible with
>>>>>>>> RedShift, removes the added cost of staging.
>>>>>>>>
>>>>>>>> Getting back to Snowflake, there's no way we could use it the same
>>>>>>>> way we use Kudu, and even if we could, the cost would would probably put us
>>>>>>>> out of business!
>>>>>>>>
>>>>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> thanks Andrew for taking your time responding to me. It seems that
>>>>>>>>> there are no exact recommendations.
>>>>>>>>>
>>>>>>>>> I did look at scaling recommendations but that math is extremely
>>>>>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>>>>>> into that calculation. We have no control really what users are doing, how
>>>>>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>>>>>
>>>>>>>>> I do like the Snowflake approach than the engine takes care of
>>>>>>>>> defaults and can estimate the number of micro-partitions and even
>>>>>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>>>>>> the design is very similar. I really do not like to pick random number of
>>>>>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>>>>>> one to make these decisions. Does it make sense?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Boris,
>>>>>>>>>>
>>>>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>>>>
>>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>>> And a follow-up question, if I have tons of smaller tables under
>>>>>>>>>>> 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Per your numbers, this confirms that the partitions are the units
>>>>>>>>>> of concurrency here, and that therefore having more and having smaller
>>>>>>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>>>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>>>>>>> total number of partitions cluster-wide.
>>>>>>>>>>
>>>>>>>>>> There are some trade offs with the replica count per tablet
>>>>>>>>>> server here -- generally, each tablet replica has a resource cost on tablet
>>>>>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>>>>>
>>>>>>>>>> Your point about scan concurrency is certainly a valid one --
>>>>>>>>>> there have been patches for other integrations that have tackled this to
>>>>>>>>>> decouple partitioning from scan concurrency (KUDU-2437
>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an
>>>>>>>>>> example, where Kudu's Spark integration will split range scans into
>>>>>>>>>> smaller-scoped scan tokens to be run concurrently, though this optimization
>>>>>>>>>> hasn't made its way into Impala yet). I filed KUDU-3071
>>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what
>>>>>>>>>> I think is left on the Kudu-side to get this up and running, so that it can
>>>>>>>>>> be worked into Impala.
>>>>>>>>>>
>>>>>>>>>> For now, I would try to take into account the total sum of
>>>>>>>>>> resources you have available to Kudu (including number of tablet servers,
>>>>>>>>>> amount of storage per node, number of disks per tablet server, type of disk
>>>>>>>>>> for the WAL/metadata disks), to settle on roughly how many tablet replicas
>>>>>>>>>> your system can handle (the scaling guide
>>>>>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>>>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>>>>>>> guide how you partition your tables.
>>>>>>>>>>
>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in
>>>>>>>>>> the low tens of GB per tablet replica, though exact perf obviously depends
>>>>>>>>>> on your workload. As you show and as pointed out in documentation, larger
>>>>>>>>>> and fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>>>>>
>>>>>>>>>> It is recommended that new tables which are expected to have
>>>>>>>>>>> heavy read and write workloads have at least as many tablets as tablet
>>>>>>>>>>> servers.
>>>>>>>>>>
>>>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The recommendation here is to have at least 20 logical partitions
>>>>>>>>>> per table. That way, a scan were to touch a table's entire keyspace, the
>>>>>>>>>> table scan would be broken up into 20 tablet scans, and each of those might
>>>>>>>>>> land on a different tablet server running on isolated hardware. For a
>>>>>>>>>> significantly larger table into which you expect highly concurrent
>>>>>>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>>>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>>>>>>> consider range-partitioning on timestamp.
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <
>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>>>>>> No Default Partitioning
>>>>>>>>>>> Kudu does not provide a default partitioning strategy when
>>>>>>>>>>> creating tables. It is recommended that new tables which are expected to
>>>>>>>>>>> have heavy read and write workloads have at least as many tablets as tablet
>>>>>>>>>>> servers.
>>>>>>>>>>>
>>>>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <
>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> hey guys,
>>>>>>>>>>>>
>>>>>>>>>>>> I asked the same question on Slack on and got no responses. I
>>>>>>>>>>>> just went through the docs and design doc and FAQ and still did not find an
>>>>>>>>>>>> answer.
>>>>>>>>>>>>
>>>>>>>>>>>> Can someone comment?
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>>>>
>>>>>>>>>>>> And a follow-up question, if I have tons of smaller tables
>>>>>>>>>>>> under 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>>>>
>>>>>>>>>>>> We cannot pick the partitioning strategy for each table as we
>>>>>>>>>>>> need to stream 100s of tables and we use PK from RBDMS and need to come
>>>>>>>>>>>> with an automated way to pick number of partitions/tablets. So far I was
>>>>>>>>>>>> using 1Gb rule but rethinking this now for another project.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <
>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Kudu 1.5
>>>>>>>>>>>>>
>>>>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>>>>>> 3 2Gb 65
>>>>>>>>>>>>> 9 700Mb 27
>>>>>>>>>>>>> 18 350Mb 17
>>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    -
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>>>>>    -
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    For small tables, such as dimension tables, ensure that
>>>>>>>>>>>>>>    each tablet is at least 1 GB in size.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>>>>>> diminishing returns.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>>>>>> Based on my quick test, it seems that queries are faster if I
>>>>>>>>>>>>>> have more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>>>>>>> 1Gb".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Boris
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Andrew Wong
>>>>>>>>>>
>>>>>>>>>

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

appreciate your thoughts, Cliff

On Mon, Mar 16, 2020 at 11:18 AM Cliff Resnick <cr...@gmail.com> wrote:

> Boris,
>
> I think the crux of the problem is that "real-time analytics" over deeply
> nested storage does not really exist. I'll qualify that statement with
> "real-time" meaning streaming per-record ingestion, and "analytics" as
> columnar query access. The closest thing I know of today is Google BigQuery
> or Snowflake, but those are actually micro-batch ingestion done well, not
> per-record like Kudu. The only real-time analytics solution I know of that
> has a modicum of nesting is Druid, with a single level of "GroupBy"
> dimensions that get flattened into virtual rows as part of its non-SQL
> rollup API. Columnar storage and real-time are probably aways going to be a
> difficult pairing and in fact Kudu's "real-time" storage format is
> row-based, and Druid requires a whole lambda architecture batch compaction.
> As we all know, nothing is for free, whether the trade-off for nested data
> be column-shredding into something like Kudu as Adar suggested, or
> micro-batching to Parquet or some combination of both. I won't even go into
> how nested analytics are handled on the SQL/query side because it gets
> weird there, too.
>
> I think your hardware situation hurts not only for the number of tablets
> but also for Kudu + Impala. Impala afaik will only use one core per host
> per query, so is a poor fit for large complex queries on vertical hardware.
>
> In conclusion, I don't know how far you've gone into your investigations
> but I can say that if your needs are to support "power users" at a premium
> then something like Snowflake is great for your problem space. But if
> analytics are also more centrally integrated in pipelines then parquet is
> hard to beat for the price and flexibility, as is Kudu for dashboards or
> other intelligence that leverages upsert/key semantics. Ultimately, like
> many of us, you may find that you'll need all of the above.
>
> Cliff
>
>
>
> On Sun, Mar 15, 2020 at 1:11 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> thanks Adar for your comments and thinking this through! I really think
>> Kudu has tons of potential and very happy we've made a decision to use it
>> for our real-time pipeline.
>>
>> you asked about use cases for support of nested and large binary/text
>> objects. I work for a large healthcare organization and there is a lot of
>> going with newer HL7 FHIR standard. FHIR documents are highly nested json
>> objects that might get as big as a few Gbs in size. We could not use Kudu
>> for storing FHIR bundles/documents due to the size and inability to store
>> FHIR objects natively. We ended up using Hive/Spark for that, but that
>> solution is not real-time. And it is not only about storing, but also about
>> fast search in those FHIR documents. I know we could use Hbase for that but
>> Hbase is really bad when it comes to analytics type of workloads.
>>
>> As for large text/binaries, the good and common example is
>> narratives notes (physician notes, progress notes, etc.) They can be in
>> form of RTF or PDF documents or images.
>>
>> 300 column limit was not a big deal so far but we have high-density nodes
>> with 2 44-core cpus and 12 12-Tb drives and we cannot use the full power of
>> them due to limitations of Kudu in terms of number of tablets per node.
>> This is more concerning than 300 column limit.
>>
>>
>>
>> On Sat, Mar 14, 2020 at 3:45 AM Adar Lieber-Dembo <ad...@cloudera.com>
>> wrote:
>>
>>> Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
>>> 1. Both are created transparently and automatically as data is inserted.
>>> 2. Both may overlap with one another based on the sort column (Snowflake
>>> lets you choose the sort column; in Kudu this is always the PK).
>>> 3. Both are pruned at scan time if the scan's predicates allow it.
>>> 4. In Kudu, a tablet with a large clustering depth (known as "average
>>> rowset height") will be a likely compaction candidate. It's not clear to me
>>> if Snowflake has a process to automatically compact their micro-partitions.
>>>
>>> I don't understand how this solves the partitioning problem though: the
>>> doc doesn't explain how micro-partitions are mapped to nodes, or if/when
>>> micro-partitions are load balanced across the cluster. I'd be curious to
>>> learn more about this.
>>>
>>> Below you'll find some more info on the other issues you raised:
>>> - Nested data type support is tracked in KUDU-1261. We've had an
>>> on-again/off-again relationship with it: it's a significant amount of work
>>> to implement, so we're hesitant to commit unless there's also significant
>>> need. Many users have been successful in "shredding" their nested types
>>> across many Kudu columns, which is one of the reasons we're actively
>>> working on increasing the number of supported columns above 300.
>>> - We're also working on auto-rebalancing right now; see KUDU-2780 for
>>> more details.
>>> - Large cell support is tracked in KUDU-2874, but there doesn't seem to
>>> have been as much interest there.
>>> - The compaction issue you're describing sounds like KUDU-1400, which
>>> was indeed fixed in 1.9. This was our bad; for quite some time we were
>>> focused on optimizing for heavy workloads and didn't pay attention to Kudu
>>> performance in "trickling" scenarios, especially not over a long period of
>>> time. That's also why we didn't think to advertise --flush_threshold_secs,
>>> or even to change its default value (which is still 2 minutes: far too
>>> short if we hadn't fixed KUDU-1400); we just weren't very aware of how
>>> widespread of a problem it was.
>>>
>>>
>>> On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in
>>>> Kudu, and I do not think snowflake deserves that much credit and noise. But
>>>> they do have great ideas that take their engine to the next level. I do not
>>>> know architecture-wise how it works but here is the best doc I found
>>>> regarding micro partitions:
>>>>
>>>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>>>
>>>>
>>>> It is also appealing to me that they support nested data types with
>>>> ease (and they are still very efficient to query/filter by), they scale
>>>> workload easily (Kudu requires manual rebalancing and it is VERY slow as
>>>> I've learned last weekend by doing it on our dev cluster). Also, they
>>>> support very large blobs and text fields that Kudu does not.
>>>>
>>>> Actually, I have a much longer wish list on the Impala side than Kudu -
>>>> it feels that Kudu itself is very close to Snowflake but lacks these
>>>> self-management features. While Impala is certainly one of the fastest SQL
>>>> engines I've seen, we struggle with mutli-tenancy and complex queries, that
>>>> Hive would run like a champ.
>>>>
>>>> As for Kudu's compaction process - it was largely an issue with 1.5 and
>>>> I believed was addressed in 1.9. We had a terrible time for a few weeks
>>>> when frequent updates/delete almost froze all our queries to crawl. A lot
>>>> of smaller tables had a huge number of tiny rowsets causing massive scans
>>>> and freezing the entire Kudu cluster. We had a good discussion on slack and
>>>> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
>>>> move to 1.9 and it worked fine. Nowhere in the documentation, it was
>>>> suggested to set this flag and actually it was one of these "use it at your
>>>> own risk" flags.
>>>>
>>>>
>>>>
>>>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
>>>> wrote:
>>>>
>>>>> This has been an excellent discussion to follow, with very useful
>>>>> feedback. Thank you for that.
>>>>>
>>>>> Boris, if I can try to summarize your position, it's that manual
>>>>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>>>>> when you don't control the PK of each table. The Kudu schema design guide
>>>>> would advise you to hash partition such tables, and, following Andrew's
>>>>> recommendation, have at least one partition per tserver. Except that given
>>>>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>>>>> number of partitions per tserver. And if you reduce the number of
>>>>> partitions per table, you open yourself up to potential hotspotting if the
>>>>> per-table load isn't even. Is that correct?
>>>>>
>>>>> I can think of several improvements we could make here:
>>>>> 1. Support splitting/merging of range partitions, first manually and
>>>>> later automatically based on load. This is tracked in KUDU-437 and
>>>>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>>>>> of static range/hash partitioning, there hasn't been much traction on those
>>>>> JIRAs.
>>>>> 2. Improve server efficiency when hosting large numbers of replicas,
>>>>> so that you can blanket your cluster with maximally hash-partitioned
>>>>> tables. We did some work here a couple years ago (see KUDU-1967) but
>>>>> there's clearly much more that we can do. Of note, we previously discussed
>>>>> implementing Multi-Raft (
>>>>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>>>>> ).
>>>>> 3. Add support for consistent hashing. I'm not super familiar with the
>>>>> details but YugabyteDB (based on Kudu) has seen success here. Their recent
>>>>> blog post on sharding is a good (and related) read:
>>>>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>>>
>>>>> Since you're clearly more familiar with Snowflake than I, can you
>>>>> comment more on how their partitioning system works? Sounds like it's quite
>>>>> automated; is that ever problematic when it decides to repartition during a
>>>>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>>>>> hotspotting by load balancing, and concludes that hotspots can move around
>>>>> too quickly for a load balancer to keep up.
>>>>>
>>>>> Separately, you mentioned having to manage Kudu's compaction process.
>>>>> Could you go into more detail here?
>>>>>
>>>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> thanks Cliff, this is really good info. I am tempted to do the
>>>>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>>>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>>>>> team is not tracking competition that much and I think there are awesome
>>>>>> features in Snowflake that would be great to have in Kudu.
>>>>>>
>>>>>> I think they do support upserts now and back in-time queries - while
>>>>>> they do not upsert/delete records, they just create new micro-partitions
>>>>>> and it is a metadata operation - that's who they can see data in a given
>>>>>> time. The idea of virtual warehouse is also very appealing especially in
>>>>>> healthcare as we often need to share data with other departments or
>>>>>> partners.
>>>>>>
>>>>>> do not get me wrong, I decided to use Kudu for the same reasons and I
>>>>>> really did not have a choice on CDH other than HBase but I get questioned
>>>>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>>>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>>>>> why :)
>>>>>>
>>>>>> My biggest gripe with Kudu besides known limitations is the
>>>>>> management of partitions and compaction process. For our largest tables, we
>>>>>> just max out the number of tablets as Kudu allows per cluster. Smaller
>>>>>> tables though is a challenge and more like an art while I feel there should
>>>>>> be an automated process with good defaults like in other data storage
>>>>>> systems. No one expects to assign partitions manually in Snowflake,
>>>>>> BigQuery or Redshift. No one is expected to tweak compaction parameters and
>>>>>> deal with fast degraded performance over time on tables with a high number
>>>>>> of deletes/updates. We are happy with performance but it is not all about
>>>>>> performance.
>>>>>>
>>>>>> We also struggle on Impala side as it feels it is limiting what we
>>>>>> can do with Kudu and it feels like Impala team is always behind. pyspark is
>>>>>> another problem - Kudu pyspark client is lagging behind java client and
>>>>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>>>>
>>>>>> And honestly, I do not like that Impala is the only viable option
>>>>>> with Kudu. We are forced a lot of time to do heavy ETL in Hive which is
>>>>>> like a tank but we cannot do that anymore with Kudu tables and struggle
>>>>>> with disk spilling issues, Impala choosing bad explain plans and forcing us
>>>>>> to use straight_join hint many times etc.
>>>>>>
>>>>>> Sorry if this post came a bit on a negative side, I really like Kudu
>>>>>> and the dev team behind it rocks. But I do not like their industry is going
>>>>>> and why Kudu is not getting the traction it deserves.
>>>>>>
>>>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This is a good conversation but I don't think the comparison with
>>>>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>>>>> closed source, it seems pretty clear the architectures are quite different.
>>>>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>>>>> make Kudu valuable for some use cases.
>>>>>>>
>>>>>>> It also seems to me that their intended workloads are quite
>>>>>>> different. Snowflake is great for intensive analytics on demand, and can
>>>>>>> handle deeply nested data very well, where Kudu can't handle that at all.
>>>>>>> Snowflake is not designed for heavy concurrency, but complex query plans
>>>>>>> for a small group of users. If you select an x-large Snowflake cluster it's
>>>>>>> probably because you have a large amount of data to churn through, not
>>>>>>> because you have a large number of users. Or, at least that's how we used
>>>>>>> it.
>>>>>>>
>>>>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>>>>> use "for free" instance-store NVMe. We also have associated
>>>>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>>>>> compute oriented queries.
>>>>>>>
>>>>>>> The net result blows away what we had with RedShift at less than 1/3
>>>>>>> the cost, with performance improvements mostly from better concurrency
>>>>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>>>>> also use streaming ingestion which, aside from being impossible with
>>>>>>> RedShift, removes the added cost of staging.
>>>>>>>
>>>>>>> Getting back to Snowflake, there's no way we could use it the same
>>>>>>> way we use Kudu, and even if we could, the cost would would probably put us
>>>>>>> out of business!
>>>>>>>
>>>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> thanks Andrew for taking your time responding to me. It seems that
>>>>>>>> there are no exact recommendations.
>>>>>>>>
>>>>>>>> I did look at scaling recommendations but that math is extremely
>>>>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>>>>> into that calculation. We have no control really what users are doing, how
>>>>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>>>>
>>>>>>>> I do like the Snowflake approach than the engine takes care of
>>>>>>>> defaults and can estimate the number of micro-partitions and even
>>>>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>>>>> the design is very similar. I really do not like to pick random number of
>>>>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>>>>> one to make these decisions. Does it make sense?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Boris,
>>>>>>>>>
>>>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>>>
>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>> And a follow-up question, if I have tons of smaller tables under
>>>>>>>>>> 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Per your numbers, this confirms that the partitions are the units
>>>>>>>>> of concurrency here, and that therefore having more and having smaller
>>>>>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>>>>>> total number of partitions cluster-wide.
>>>>>>>>>
>>>>>>>>> There are some trade offs with the replica count per tablet server
>>>>>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>>>>
>>>>>>>>> Your point about scan concurrency is certainly a valid one --
>>>>>>>>> there have been patches for other integrations that have tackled this to
>>>>>>>>> decouple partitioning from scan concurrency (KUDU-2437
>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>>>>>> its way into Impala yet). I filed KUDU-3071
>>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>>>>>> be worked into Impala.
>>>>>>>>>
>>>>>>>>> For now, I would try to take into account the total sum of
>>>>>>>>> resources you have available to Kudu (including number of tablet servers,
>>>>>>>>> amount of storage per node, number of disks per tablet server, type of disk
>>>>>>>>> for the WAL/metadata disks), to settle on roughly how many tablet replicas
>>>>>>>>> your system can handle (the scaling guide
>>>>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>>>>>> guide how you partition your tables.
>>>>>>>>>
>>>>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in
>>>>>>>>> the low tens of GB per tablet replica, though exact perf obviously depends
>>>>>>>>> on your workload. As you show and as pointed out in documentation, larger
>>>>>>>>> and fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>>>>
>>>>>>>>> It is recommended that new tables which are expected to have heavy
>>>>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The recommendation here is to have at least 20 logical partitions
>>>>>>>>> per table. That way, a scan were to touch a table's entire keyspace, the
>>>>>>>>> table scan would be broken up into 20 tablet scans, and each of those might
>>>>>>>>> land on a different tablet server running on isolated hardware. For a
>>>>>>>>> significantly larger table into which you expect highly concurrent
>>>>>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>>>>>> consider range-partitioning on timestamp.
>>>>>>>>>
>>>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>>>>> No Default Partitioning
>>>>>>>>>> Kudu does not provide a default partitioning strategy when
>>>>>>>>>> creating tables. It is recommended that new tables which are expected to
>>>>>>>>>> have heavy read and write workloads have at least as many tablets as tablet
>>>>>>>>>> servers.
>>>>>>>>>>
>>>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <
>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> hey guys,
>>>>>>>>>>>
>>>>>>>>>>> I asked the same question on Slack on and got no responses. I
>>>>>>>>>>> just went through the docs and design doc and FAQ and still did not find an
>>>>>>>>>>> answer.
>>>>>>>>>>>
>>>>>>>>>>> Can someone comment?
>>>>>>>>>>>
>>>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>>>
>>>>>>>>>>> And a follow-up question, if I have tons of smaller tables under
>>>>>>>>>>> 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>>>
>>>>>>>>>>> We cannot pick the partitioning strategy for each table as we
>>>>>>>>>>> need to stream 100s of tables and we use PK from RBDMS and need to come
>>>>>>>>>>> with an automated way to pick number of partitions/tablets. So far I was
>>>>>>>>>>> using 1Gb rule but rethinking this now for another project.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <
>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>>>
>>>>>>>>>>>>  Kudu 1.5
>>>>>>>>>>>>
>>>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>>>
>>>>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>>>>> 3 2Gb 65
>>>>>>>>>>>> 9 700Mb 27
>>>>>>>>>>>> 18 350Mb 17
>>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>>
>>>>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>>>
>>>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>>
>>>>>>>>>>>>>    -
>>>>>>>>>>>>>
>>>>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>>>>    -
>>>>>>>>>>>>>
>>>>>>>>>>>>>    For small tables, such as dimension tables, ensure that
>>>>>>>>>>>>>    each tablet is at least 1 GB in size.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>>>>> diminishing returns.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>>>>> Based on my quick test, it seems that queries are faster if I
>>>>>>>>>>>>> have more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>>>>>> 1Gb".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>>>>
>>>>>>>>>>>>> Boris
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Andrew Wong
>>>>>>>>>
>>>>>>>>

Re: Partitioning Rules of Thumb

Posted by Cliff Resnick <cr...@gmail.com>.

Boris,

I think the crux of the problem is that "real-time analytics" over deeply
nested storage does not really exist. I'll qualify that statement with
"real-time" meaning streaming per-record ingestion, and "analytics" as
columnar query access. The closest thing I know of today is Google BigQuery
or Snowflake, but those are actually micro-batch ingestion done well, not
per-record like Kudu. The only real-time analytics solution I know of that
has a modicum of nesting is Druid, with a single level of "GroupBy"
dimensions that get flattened into virtual rows as part of its non-SQL
rollup API. Columnar storage and real-time are probably aways going to be a
difficult pairing and in fact Kudu's "real-time" storage format is
row-based, and Druid requires a whole lambda architecture batch compaction.
As we all know, nothing is for free, whether the trade-off for nested data
be column-shredding into something like Kudu as Adar suggested, or
micro-batching to Parquet or some combination of both. I won't even go into
how nested analytics are handled on the SQL/query side because it gets
weird there, too.

I think your hardware situation hurts not only for the number of tablets
but also for Kudu + Impala. Impala afaik will only use one core per host
per query, so is a poor fit for large complex queries on vertical hardware.

In conclusion, I don't know how far you've gone into your investigations
but I can say that if your needs are to support "power users" at a premium
then something like Snowflake is great for your problem space. But if
analytics are also more centrally integrated in pipelines then parquet is
hard to beat for the price and flexibility, as is Kudu for dashboards or
other intelligence that leverages upsert/key semantics. Ultimately, like
many of us, you may find that you'll need all of the above.

Cliff



On Sun, Mar 15, 2020 at 1:11 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> thanks Adar for your comments and thinking this through! I really think
> Kudu has tons of potential and very happy we've made a decision to use it
> for our real-time pipeline.
>
> you asked about use cases for support of nested and large binary/text
> objects. I work for a large healthcare organization and there is a lot of
> going with newer HL7 FHIR standard. FHIR documents are highly nested json
> objects that might get as big as a few Gbs in size. We could not use Kudu
> for storing FHIR bundles/documents due to the size and inability to store
> FHIR objects natively. We ended up using Hive/Spark for that, but that
> solution is not real-time. And it is not only about storing, but also about
> fast search in those FHIR documents. I know we could use Hbase for that but
> Hbase is really bad when it comes to analytics type of workloads.
>
> As for large text/binaries, the good and common example is
> narratives notes (physician notes, progress notes, etc.) They can be in
> form of RTF or PDF documents or images.
>
> 300 column limit was not a big deal so far but we have high-density nodes
> with 2 44-core cpus and 12 12-Tb drives and we cannot use the full power of
> them due to limitations of Kudu in terms of number of tablets per node.
> This is more concerning than 300 column limit.
>
>
>
> On Sat, Mar 14, 2020 at 3:45 AM Adar Lieber-Dembo <ad...@cloudera.com>
> wrote:
>
>> Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
>> 1. Both are created transparently and automatically as data is inserted.
>> 2. Both may overlap with one another based on the sort column (Snowflake
>> lets you choose the sort column; in Kudu this is always the PK).
>> 3. Both are pruned at scan time if the scan's predicates allow it.
>> 4. In Kudu, a tablet with a large clustering depth (known as "average
>> rowset height") will be a likely compaction candidate. It's not clear to me
>> if Snowflake has a process to automatically compact their micro-partitions.
>>
>> I don't understand how this solves the partitioning problem though: the
>> doc doesn't explain how micro-partitions are mapped to nodes, or if/when
>> micro-partitions are load balanced across the cluster. I'd be curious to
>> learn more about this.
>>
>> Below you'll find some more info on the other issues you raised:
>> - Nested data type support is tracked in KUDU-1261. We've had an
>> on-again/off-again relationship with it: it's a significant amount of work
>> to implement, so we're hesitant to commit unless there's also significant
>> need. Many users have been successful in "shredding" their nested types
>> across many Kudu columns, which is one of the reasons we're actively
>> working on increasing the number of supported columns above 300.
>> - We're also working on auto-rebalancing right now; see KUDU-2780 for
>> more details.
>> - Large cell support is tracked in KUDU-2874, but there doesn't seem to
>> have been as much interest there.
>> - The compaction issue you're describing sounds like KUDU-1400, which was
>> indeed fixed in 1.9. This was our bad; for quite some time we were focused
>> on optimizing for heavy workloads and didn't pay attention to Kudu
>> performance in "trickling" scenarios, especially not over a long period of
>> time. That's also why we didn't think to advertise --flush_threshold_secs,
>> or even to change its default value (which is still 2 minutes: far too
>> short if we hadn't fixed KUDU-1400); we just weren't very aware of how
>> widespread of a problem it was.
>>
>>
>> On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in
>>> Kudu, and I do not think snowflake deserves that much credit and noise. But
>>> they do have great ideas that take their engine to the next level. I do not
>>> know architecture-wise how it works but here is the best doc I found
>>> regarding micro partitions:
>>>
>>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>>
>>>
>>> It is also appealing to me that they support nested data types with ease
>>> (and they are still very efficient to query/filter by), they scale workload
>>> easily (Kudu requires manual rebalancing and it is VERY slow as I've
>>> learned last weekend by doing it on our dev cluster). Also, they support
>>> very large blobs and text fields that Kudu does not.
>>>
>>> Actually, I have a much longer wish list on the Impala side than Kudu -
>>> it feels that Kudu itself is very close to Snowflake but lacks these
>>> self-management features. While Impala is certainly one of the fastest SQL
>>> engines I've seen, we struggle with mutli-tenancy and complex queries, that
>>> Hive would run like a champ.
>>>
>>> As for Kudu's compaction process - it was largely an issue with 1.5 and
>>> I believed was addressed in 1.9. We had a terrible time for a few weeks
>>> when frequent updates/delete almost froze all our queries to crawl. A lot
>>> of smaller tables had a huge number of tiny rowsets causing massive scans
>>> and freezing the entire Kudu cluster. We had a good discussion on slack and
>>> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
>>> move to 1.9 and it worked fine. Nowhere in the documentation, it was
>>> suggested to set this flag and actually it was one of these "use it at your
>>> own risk" flags.
>>>
>>>
>>>
>>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
>>> wrote:
>>>
>>>> This has been an excellent discussion to follow, with very useful
>>>> feedback. Thank you for that.
>>>>
>>>> Boris, if I can try to summarize your position, it's that manual
>>>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>>>> when you don't control the PK of each table. The Kudu schema design guide
>>>> would advise you to hash partition such tables, and, following Andrew's
>>>> recommendation, have at least one partition per tserver. Except that given
>>>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>>>> number of partitions per tserver. And if you reduce the number of
>>>> partitions per table, you open yourself up to potential hotspotting if the
>>>> per-table load isn't even. Is that correct?
>>>>
>>>> I can think of several improvements we could make here:
>>>> 1. Support splitting/merging of range partitions, first manually and
>>>> later automatically based on load. This is tracked in KUDU-437 and
>>>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>>>> of static range/hash partitioning, there hasn't been much traction on those
>>>> JIRAs.
>>>> 2. Improve server efficiency when hosting large numbers of replicas, so
>>>> that you can blanket your cluster with maximally hash-partitioned tables.
>>>> We did some work here a couple years ago (see KUDU-1967) but there's
>>>> clearly much more that we can do. Of note, we previously discussed
>>>> implementing Multi-Raft (
>>>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>>>> ).
>>>> 3. Add support for consistent hashing. I'm not super familiar with the
>>>> details but YugabyteDB (based on Kudu) has seen success here. Their recent
>>>> blog post on sharding is a good (and related) read:
>>>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>>
>>>> Since you're clearly more familiar with Snowflake than I, can you
>>>> comment more on how their partitioning system works? Sounds like it's quite
>>>> automated; is that ever problematic when it decides to repartition during a
>>>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>>>> hotspotting by load balancing, and concludes that hotspots can move around
>>>> too quickly for a load balancer to keep up.
>>>>
>>>> Separately, you mentioned having to manage Kudu's compaction process.
>>>> Could you go into more detail here?
>>>>
>>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> thanks Cliff, this is really good info. I am tempted to do the
>>>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>>>> team is not tracking competition that much and I think there are awesome
>>>>> features in Snowflake that would be great to have in Kudu.
>>>>>
>>>>> I think they do support upserts now and back in-time queries - while
>>>>> they do not upsert/delete records, they just create new micro-partitions
>>>>> and it is a metadata operation - that's who they can see data in a given
>>>>> time. The idea of virtual warehouse is also very appealing especially in
>>>>> healthcare as we often need to share data with other departments or
>>>>> partners.
>>>>>
>>>>> do not get me wrong, I decided to use Kudu for the same reasons and I
>>>>> really did not have a choice on CDH other than HBase but I get questioned
>>>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>>>> why :)
>>>>>
>>>>> My biggest gripe with Kudu besides known limitations is the management
>>>>> of partitions and compaction process. For our largest tables, we just max
>>>>> out the number of tablets as Kudu allows per cluster. Smaller tables though
>>>>> is a challenge and more like an art while I feel there should be an
>>>>> automated process with good defaults like in other data storage systems. No
>>>>> one expects to assign partitions manually in Snowflake, BigQuery or
>>>>> Redshift. No one is expected to tweak compaction parameters and deal with
>>>>> fast degraded performance over time on tables with a high number of
>>>>> deletes/updates. We are happy with performance but it is not all about
>>>>> performance.
>>>>>
>>>>> We also struggle on Impala side as it feels it is limiting what we can
>>>>> do with Kudu and it feels like Impala team is always behind. pyspark is
>>>>> another problem - Kudu pyspark client is lagging behind java client and
>>>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>>>
>>>>> And honestly, I do not like that Impala is the only viable option with
>>>>> Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a
>>>>> tank but we cannot do that anymore with Kudu tables and struggle with disk
>>>>> spilling issues, Impala choosing bad explain plans and forcing us to use
>>>>> straight_join hint many times etc.
>>>>>
>>>>> Sorry if this post came a bit on a negative side, I really like Kudu
>>>>> and the dev team behind it rocks. But I do not like their industry is going
>>>>> and why Kudu is not getting the traction it deserves.
>>>>>
>>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> This is a good conversation but I don't think the comparison with
>>>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>>>> closed source, it seems pretty clear the architectures are quite different.
>>>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>>>> make Kudu valuable for some use cases.
>>>>>>
>>>>>> It also seems to me that their intended workloads are quite
>>>>>> different. Snowflake is great for intensive analytics on demand, and can
>>>>>> handle deeply nested data very well, where Kudu can't handle that at all.
>>>>>> Snowflake is not designed for heavy concurrency, but complex query plans
>>>>>> for a small group of users. If you select an x-large Snowflake cluster it's
>>>>>> probably because you have a large amount of data to churn through, not
>>>>>> because you have a large number of users. Or, at least that's how we used
>>>>>> it.
>>>>>>
>>>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>>>> use "for free" instance-store NVMe. We also have associated
>>>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>>>> compute oriented queries.
>>>>>>
>>>>>> The net result blows away what we had with RedShift at less than 1/3
>>>>>> the cost, with performance improvements mostly from better concurrency
>>>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>>>> also use streaming ingestion which, aside from being impossible with
>>>>>> RedShift, removes the added cost of staging.
>>>>>>
>>>>>> Getting back to Snowflake, there's no way we could use it the same
>>>>>> way we use Kudu, and even if we could, the cost would would probably put us
>>>>>> out of business!
>>>>>>
>>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> thanks Andrew for taking your time responding to me. It seems that
>>>>>>> there are no exact recommendations.
>>>>>>>
>>>>>>> I did look at scaling recommendations but that math is extremely
>>>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>>>> into that calculation. We have no control really what users are doing, how
>>>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>>>
>>>>>>> I do like the Snowflake approach than the engine takes care of
>>>>>>> defaults and can estimate the number of micro-partitions and even
>>>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>>>> the design is very similar. I really do not like to pick random number of
>>>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>>>> one to make these decisions. Does it make sense?
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Boris,
>>>>>>>>
>>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>>
>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>>>> tablets for concurrency?
>>>>>>>>
>>>>>>>>
>>>>>>>> Per your numbers, this confirms that the partitions are the units
>>>>>>>> of concurrency here, and that therefore having more and having smaller
>>>>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>>>>> total number of partitions cluster-wide.
>>>>>>>>
>>>>>>>> There are some trade offs with the replica count per tablet server
>>>>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>>>
>>>>>>>> Your point about scan concurrency is certainly a valid one -- there
>>>>>>>> have been patches for other integrations that have tackled this to decouple
>>>>>>>> partitioning from scan concurrency (KUDU-2437
>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>>>>> its way into Impala yet). I filed KUDU-3071
>>>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>>>>> be worked into Impala.
>>>>>>>>
>>>>>>>> For now, I would try to take into account the total sum of
>>>>>>>> resources you have available to Kudu (including number of tablet servers,
>>>>>>>> amount of storage per node, number of disks per tablet server, type of disk
>>>>>>>> for the WAL/metadata disks), to settle on roughly how many tablet replicas
>>>>>>>> your system can handle (the scaling guide
>>>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>>>>> guide how you partition your tables.
>>>>>>>>
>>>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>
>>>>>>>>
>>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in
>>>>>>>> the low tens of GB per tablet replica, though exact perf obviously depends
>>>>>>>> on your workload. As you show and as pointed out in documentation, larger
>>>>>>>> and fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>>>
>>>>>>>> It is recommended that new tables which are expected to have heavy
>>>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>>>
>>>>>>>>
>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>
>>>>>>>>
>>>>>>>> The recommendation here is to have at least 20 logical partitions
>>>>>>>> per table. That way, a scan were to touch a table's entire keyspace, the
>>>>>>>> table scan would be broken up into 20 tablet scans, and each of those might
>>>>>>>> land on a different tablet server running on isolated hardware. For a
>>>>>>>> significantly larger table into which you expect highly concurrent
>>>>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>>>>> consider range-partitioning on timestamp.
>>>>>>>>
>>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>>>> No Default Partitioning
>>>>>>>>> Kudu does not provide a default partitioning strategy when
>>>>>>>>> creating tables. It is recommended that new tables which are expected to
>>>>>>>>> have heavy read and write workloads have at least as many tablets as tablet
>>>>>>>>> servers.
>>>>>>>>>
>>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> hey guys,
>>>>>>>>>>
>>>>>>>>>> I asked the same question on Slack on and got no responses. I
>>>>>>>>>> just went through the docs and design doc and FAQ and still did not find an
>>>>>>>>>> answer.
>>>>>>>>>>
>>>>>>>>>> Can someone comment?
>>>>>>>>>>
>>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>>
>>>>>>>>>> And a follow-up question, if I have tons of smaller tables under
>>>>>>>>>> 5 million rows, should I just use 1 partition or still break them on
>>>>>>>>>> smaller tablets for concurrency?
>>>>>>>>>>
>>>>>>>>>> We cannot pick the partitioning strategy for each table as we
>>>>>>>>>> need to stream 100s of tables and we use PK from RBDMS and need to come
>>>>>>>>>> with an automated way to pick number of partitions/tablets. So far I was
>>>>>>>>>> using 1Gb rule but rethinking this now for another project.
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <
>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>>
>>>>>>>>>>>  Kudu 1.5
>>>>>>>>>>>
>>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>>
>>>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>>>> 3 2Gb 65
>>>>>>>>>>> 9 700Mb 27
>>>>>>>>>>> 18 350Mb 17
>>>>>>>>>>> [image: image.png]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>
>>>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>>>
>>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>>
>>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>>
>>>>>>>>>>>>    -
>>>>>>>>>>>>
>>>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>>>    -
>>>>>>>>>>>>
>>>>>>>>>>>>    For small tables, such as dimension tables, ensure that
>>>>>>>>>>>>    each tablet is at least 1 GB in size.
>>>>>>>>>>>>
>>>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>>>> diminishing returns.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>>>
>>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>>
>>>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>>>> Based on my quick test, it seems that queries are faster if I
>>>>>>>>>>>> have more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>>>>> 1Gb".
>>>>>>>>>>>>
>>>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>>>
>>>>>>>>>>>> Boris
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrew Wong
>>>>>>>>
>>>>>>>

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

thanks Adar for your comments and thinking this through! I really think
Kudu has tons of potential and very happy we've made a decision to use it
for our real-time pipeline.

you asked about use cases for support of nested and large binary/text
objects. I work for a large healthcare organization and there is a lot of
going with newer HL7 FHIR standard. FHIR documents are highly nested json
objects that might get as big as a few Gbs in size. We could not use Kudu
for storing FHIR bundles/documents due to the size and inability to store
FHIR objects natively. We ended up using Hive/Spark for that, but that
solution is not real-time. And it is not only about storing, but also about
fast search in those FHIR documents. I know we could use Hbase for that but
Hbase is really bad when it comes to analytics type of workloads.

As for large text/binaries, the good and common example is narratives notes
(physician notes, progress notes, etc.) They can be in form of RTF or PDF
documents or images.

300 column limit was not a big deal so far but we have high-density nodes
with 2 44-core cpus and 12 12-Tb drives and we cannot use the full power of
them due to limitations of Kudu in terms of number of tablets per node.
This is more concerning than 300 column limit.



On Sat, Mar 14, 2020 at 3:45 AM Adar Lieber-Dembo <ad...@cloudera.com> wrote:

> Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
> 1. Both are created transparently and automatically as data is inserted.
> 2. Both may overlap with one another based on the sort column (Snowflake
> lets you choose the sort column; in Kudu this is always the PK).
> 3. Both are pruned at scan time if the scan's predicates allow it.
> 4. In Kudu, a tablet with a large clustering depth (known as "average
> rowset height") will be a likely compaction candidate. It's not clear to me
> if Snowflake has a process to automatically compact their micro-partitions.
>
> I don't understand how this solves the partitioning problem though: the
> doc doesn't explain how micro-partitions are mapped to nodes, or if/when
> micro-partitions are load balanced across the cluster. I'd be curious to
> learn more about this.
>
> Below you'll find some more info on the other issues you raised:
> - Nested data type support is tracked in KUDU-1261. We've had an
> on-again/off-again relationship with it: it's a significant amount of work
> to implement, so we're hesitant to commit unless there's also significant
> need. Many users have been successful in "shredding" their nested types
> across many Kudu columns, which is one of the reasons we're actively
> working on increasing the number of supported columns above 300.
> - We're also working on auto-rebalancing right now; see KUDU-2780 for more
> details.
> - Large cell support is tracked in KUDU-2874, but there doesn't seem to
> have been as much interest there.
> - The compaction issue you're describing sounds like KUDU-1400, which was
> indeed fixed in 1.9. This was our bad; for quite some time we were focused
> on optimizing for heavy workloads and didn't pay attention to Kudu
> performance in "trickling" scenarios, especially not over a long period of
> time. That's also why we didn't think to advertise --flush_threshold_secs,
> or even to change its default value (which is still 2 minutes: far too
> short if we hadn't fixed KUDU-1400); we just weren't very aware of how
> widespread of a problem it was.
>
>
> On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in Kudu,
>> and I do not think snowflake deserves that much credit and noise. But they
>> do have great ideas that take their engine to the next level. I do not know
>> architecture-wise how it works but here is the best doc I found regarding
>> micro partitions:
>>
>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>
>>
>> It is also appealing to me that they support nested data types with ease
>> (and they are still very efficient to query/filter by), they scale workload
>> easily (Kudu requires manual rebalancing and it is VERY slow as I've
>> learned last weekend by doing it on our dev cluster). Also, they support
>> very large blobs and text fields that Kudu does not.
>>
>> Actually, I have a much longer wish list on the Impala side than Kudu -
>> it feels that Kudu itself is very close to Snowflake but lacks these
>> self-management features. While Impala is certainly one of the fastest SQL
>> engines I've seen, we struggle with mutli-tenancy and complex queries, that
>> Hive would run like a champ.
>>
>> As for Kudu's compaction process - it was largely an issue with 1.5 and I
>> believed was addressed in 1.9. We had a terrible time for a few weeks when
>> frequent updates/delete almost froze all our queries to crawl. A lot of
>> smaller tables had a huge number of tiny rowsets causing massive scans and
>> freezing the entire Kudu cluster. We had a good discussion on slack and
>> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
>> move to 1.9 and it worked fine. Nowhere in the documentation, it was
>> suggested to set this flag and actually it was one of these "use it at your
>> own risk" flags.
>>
>>
>>
>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
>> wrote:
>>
>>> This has been an excellent discussion to follow, with very useful
>>> feedback. Thank you for that.
>>>
>>> Boris, if I can try to summarize your position, it's that manual
>>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>>> when you don't control the PK of each table. The Kudu schema design guide
>>> would advise you to hash partition such tables, and, following Andrew's
>>> recommendation, have at least one partition per tserver. Except that given
>>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>>> number of partitions per tserver. And if you reduce the number of
>>> partitions per table, you open yourself up to potential hotspotting if the
>>> per-table load isn't even. Is that correct?
>>>
>>> I can think of several improvements we could make here:
>>> 1. Support splitting/merging of range partitions, first manually and
>>> later automatically based on load. This is tracked in KUDU-437 and
>>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>>> of static range/hash partitioning, there hasn't been much traction on those
>>> JIRAs.
>>> 2. Improve server efficiency when hosting large numbers of replicas, so
>>> that you can blanket your cluster with maximally hash-partitioned tables.
>>> We did some work here a couple years ago (see KUDU-1967) but there's
>>> clearly much more that we can do. Of note, we previously discussed
>>> implementing Multi-Raft (
>>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>>> ).
>>> 3. Add support for consistent hashing. I'm not super familiar with the
>>> details but YugabyteDB (based on Kudu) has seen success here. Their recent
>>> blog post on sharding is a good (and related) read:
>>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>
>>> Since you're clearly more familiar with Snowflake than I, can you
>>> comment more on how their partitioning system works? Sounds like it's quite
>>> automated; is that ever problematic when it decides to repartition during a
>>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>>> hotspotting by load balancing, and concludes that hotspots can move around
>>> too quickly for a load balancer to keep up.
>>>
>>> Separately, you mentioned having to manage Kudu's compaction process.
>>> Could you go into more detail here?
>>>
>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> thanks Cliff, this is really good info. I am tempted to do the
>>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>>> team is not tracking competition that much and I think there are awesome
>>>> features in Snowflake that would be great to have in Kudu.
>>>>
>>>> I think they do support upserts now and back in-time queries - while
>>>> they do not upsert/delete records, they just create new micro-partitions
>>>> and it is a metadata operation - that's who they can see data in a given
>>>> time. The idea of virtual warehouse is also very appealing especially in
>>>> healthcare as we often need to share data with other departments or
>>>> partners.
>>>>
>>>> do not get me wrong, I decided to use Kudu for the same reasons and I
>>>> really did not have a choice on CDH other than HBase but I get questioned
>>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>>> why :)
>>>>
>>>> My biggest gripe with Kudu besides known limitations is the management
>>>> of partitions and compaction process. For our largest tables, we just max
>>>> out the number of tablets as Kudu allows per cluster. Smaller tables though
>>>> is a challenge and more like an art while I feel there should be an
>>>> automated process with good defaults like in other data storage systems. No
>>>> one expects to assign partitions manually in Snowflake, BigQuery or
>>>> Redshift. No one is expected to tweak compaction parameters and deal with
>>>> fast degraded performance over time on tables with a high number of
>>>> deletes/updates. We are happy with performance but it is not all about
>>>> performance.
>>>>
>>>> We also struggle on Impala side as it feels it is limiting what we can
>>>> do with Kudu and it feels like Impala team is always behind. pyspark is
>>>> another problem - Kudu pyspark client is lagging behind java client and
>>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>>
>>>> And honestly, I do not like that Impala is the only viable option with
>>>> Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a
>>>> tank but we cannot do that anymore with Kudu tables and struggle with disk
>>>> spilling issues, Impala choosing bad explain plans and forcing us to use
>>>> straight_join hint many times etc.
>>>>
>>>> Sorry if this post came a bit on a negative side, I really like Kudu
>>>> and the dev team behind it rocks. But I do not like their industry is going
>>>> and why Kudu is not getting the traction it deserves.
>>>>
>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>>>>
>>>>> This is a good conversation but I don't think the comparison with
>>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>>> closed source, it seems pretty clear the architectures are quite different.
>>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>>> make Kudu valuable for some use cases.
>>>>>
>>>>> It also seems to me that their intended workloads are quite different.
>>>>> Snowflake is great for intensive analytics on demand, and can handle deeply
>>>>> nested data very well, where Kudu can't handle that at all. Snowflake is
>>>>> not designed for heavy concurrency, but complex query plans for a small
>>>>> group of users. If you select an x-large Snowflake cluster it's probably
>>>>> because you have a large amount of data to churn through, not because you
>>>>> have a large number of users. Or, at least that's how we used it.
>>>>>
>>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>>> use "for free" instance-store NVMe. We also have associated
>>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>>> compute oriented queries.
>>>>>
>>>>> The net result blows away what we had with RedShift at less than 1/3
>>>>> the cost, with performance improvements mostly from better concurrency
>>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>>> also use streaming ingestion which, aside from being impossible with
>>>>> RedShift, removes the added cost of staging.
>>>>>
>>>>> Getting back to Snowflake, there's no way we could use it the same way
>>>>> we use Kudu, and even if we could, the cost would would probably put us out
>>>>> of business!
>>>>>
>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> thanks Andrew for taking your time responding to me. It seems that
>>>>>> there are no exact recommendations.
>>>>>>
>>>>>> I did look at scaling recommendations but that math is extremely
>>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>>> into that calculation. We have no control really what users are doing, how
>>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>>
>>>>>> I do like the Snowflake approach than the engine takes care of
>>>>>> defaults and can estimate the number of micro-partitions and even
>>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>>> the design is very similar. I really do not like to pick random number of
>>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>>> one to make these decisions. Does it make sense?
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Boris,
>>>>>>>
>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>
>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>>> tablets for concurrency?
>>>>>>>
>>>>>>>
>>>>>>> Per your numbers, this confirms that the partitions are the units of
>>>>>>> concurrency here, and that therefore having more and having smaller
>>>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>>>> total number of partitions cluster-wide.
>>>>>>>
>>>>>>> There are some trade offs with the replica count per tablet server
>>>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>>
>>>>>>> Your point about scan concurrency is certainly a valid one -- there
>>>>>>> have been patches for other integrations that have tackled this to decouple
>>>>>>> partitioning from scan concurrency (KUDU-2437
>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>>>> its way into Impala yet). I filed KUDU-3071
>>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>>>> be worked into Impala.
>>>>>>>
>>>>>>> For now, I would try to take into account the total sum of resources
>>>>>>> you have available to Kudu (including number of tablet servers, amount of
>>>>>>> storage per node, number of disks per tablet server, type of disk for the
>>>>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>>>>> system can handle (the scaling guide
>>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>>>> guide how you partition your tables.
>>>>>>>
>>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>
>>>>>>>
>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>>>>> your workload. As you show and as pointed out in documentation, larger and
>>>>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>>
>>>>>>> It is recommended that new tables which are expected to have heavy
>>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>>
>>>>>>>
>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>> (divide by 3 because of replication)?
>>>>>>>
>>>>>>>
>>>>>>> The recommendation here is to have at least 20 logical partitions
>>>>>>> per table. That way, a scan were to touch a table's entire keyspace, the
>>>>>>> table scan would be broken up into 20 tablet scans, and each of those might
>>>>>>> land on a different tablet server running on isolated hardware. For a
>>>>>>> significantly larger table into which you expect highly concurrent
>>>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>>>> consider range-partitioning on timestamp.
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>>> No Default Partitioning
>>>>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>>
>>>>>>>>
>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>>> (divide by 3 because of replication)?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> hey guys,
>>>>>>>>>
>>>>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>>>>> answer.
>>>>>>>>>
>>>>>>>>> Can someone comment?
>>>>>>>>>
>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>>
>>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>>>> tablets for concurrency?
>>>>>>>>>
>>>>>>>>> We cannot pick the partitioning strategy for each table as we need
>>>>>>>>> to stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>>>>> rule but rethinking this now for another project.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <
>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>
>>>>>>>>>>  Kudu 1.5
>>>>>>>>>>
>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>
>>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>>> 3 2Gb 65
>>>>>>>>>> 9 700Mb 27
>>>>>>>>>> 18 350Mb 17
>>>>>>>>>> [image: image.png]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>>
>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>
>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>>
>>>>>>>>>>>    -
>>>>>>>>>>>
>>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>>    -
>>>>>>>>>>>
>>>>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>>>>
>>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>>> diminishing returns.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>>
>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>
>>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>>> Based on my quick test, it seems that queries are faster if I
>>>>>>>>>>> have more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>>>> 1Gb".
>>>>>>>>>>>
>>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>>
>>>>>>>>>>> Boris
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Andrew Wong
>>>>>>>
>>>>>>

Re: Partitioning Rules of Thumb

Posted by Adar Lieber-Dembo <ad...@cloudera.com>.

Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
1. Both are created transparently and automatically as data is inserted.
2. Both may overlap with one another based on the sort column (Snowflake
lets you choose the sort column; in Kudu this is always the PK).
3. Both are pruned at scan time if the scan's predicates allow it.
4. In Kudu, a tablet with a large clustering depth (known as "average
rowset height") will be a likely compaction candidate. It's not clear to me
if Snowflake has a process to automatically compact their micro-partitions.

I don't understand how this solves the partitioning problem though: the doc
doesn't explain how micro-partitions are mapped to nodes, or if/when
micro-partitions are load balanced across the cluster. I'd be curious to
learn more about this.

Below you'll find some more info on the other issues you raised:
- Nested data type support is tracked in KUDU-1261. We've had an
on-again/off-again relationship with it: it's a significant amount of work
to implement, so we're hesitant to commit unless there's also significant
need. Many users have been successful in "shredding" their nested types
across many Kudu columns, which is one of the reasons we're actively
working on increasing the number of supported columns above 300.
- We're also working on auto-rebalancing right now; see KUDU-2780 for more
details.
- Large cell support is tracked in KUDU-2874, but there doesn't seem to
have been as much interest there.
- The compaction issue you're describing sounds like KUDU-1400, which was
indeed fixed in 1.9. This was our bad; for quite some time we were focused
on optimizing for heavy workloads and didn't pay attention to Kudu
performance in "trickling" scenarios, especially not over a long period of
time. That's also why we didn't think to advertise --flush_threshold_secs,
or even to change its default value (which is still 2 minutes: far too
short if we hadn't fixed KUDU-1400); we just weren't very aware of how
widespread of a problem it was.


On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> thanks Adar, I am a big fan of Kudu and I see a lot of potential in Kudu,
> and I do not think snowflake deserves that much credit and noise. But they
> do have great ideas that take their engine to the next level. I do not know
> architecture-wise how it works but here is the best doc I found regarding
> micro partitions:
>
> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>
>
> It is also appealing to me that they support nested data types with ease
> (and they are still very efficient to query/filter by), they scale workload
> easily (Kudu requires manual rebalancing and it is VERY slow as I've
> learned last weekend by doing it on our dev cluster). Also, they support
> very large blobs and text fields that Kudu does not.
>
> Actually, I have a much longer wish list on the Impala side than Kudu - it
> feels that Kudu itself is very close to Snowflake but lacks these
> self-management features. While Impala is certainly one of the fastest SQL
> engines I've seen, we struggle with mutli-tenancy and complex queries, that
> Hive would run like a champ.
>
> As for Kudu's compaction process - it was largely an issue with 1.5 and I
> believed was addressed in 1.9. We had a terrible time for a few weeks when
> frequent updates/delete almost froze all our queries to crawl. A lot of
> smaller tables had a huge number of tiny rowsets causing massive scans and
> freezing the entire Kudu cluster. We had a good discussion on slack and
> Todd Lipcon suggested a good workaround using flush_threshold_secs till we
> move to 1.9 and it worked fine. Nowhere in the documentation, it was
> suggested to set this flag and actually it was one of these "use it at your
> own risk" flags.
>
>
>
> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com>
> wrote:
>
>> This has been an excellent discussion to follow, with very useful
>> feedback. Thank you for that.
>>
>> Boris, if I can try to summarize your position, it's that manual
>> partitioning doesn't scale when dealing with hundreds of (small) tables and
>> when you don't control the PK of each table. The Kudu schema design guide
>> would advise you to hash partition such tables, and, following Andrew's
>> recommendation, have at least one partition per tserver. Except that given
>> enough tables, you'll eventually overwhelm any cluster based on the sheer
>> number of partitions per tserver. And if you reduce the number of
>> partitions per table, you open yourself up to potential hotspotting if the
>> per-table load isn't even. Is that correct?
>>
>> I can think of several improvements we could make here:
>> 1. Support splitting/merging of range partitions, first manually and
>> later automatically based on load. This is tracked in KUDU-437 and
>> KUDU-441. Both are complicated, and since we've gotten a lot of mileage out
>> of static range/hash partitioning, there hasn't been much traction on those
>> JIRAs.
>> 2. Improve server efficiency when hosting large numbers of replicas, so
>> that you can blanket your cluster with maximally hash-partitioned tables.
>> We did some work here a couple years ago (see KUDU-1967) but there's
>> clearly much more that we can do. Of note, we previously discussed
>> implementing Multi-Raft (
>> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
>> ).
>> 3. Add support for consistent hashing. I'm not super familiar with the
>> details but YugabyteDB (based on Kudu) has seen success here. Their recent
>> blog post on sharding is a good (and related) read:
>> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>
>> Since you're clearly more familiar with Snowflake than I, can you comment
>> more on how their partitioning system works? Sounds like it's quite
>> automated; is that ever problematic when it decides to repartition during a
>> workload? The YugabyteDB blog post talks about heuristics "reacting" to
>> hotspotting by load balancing, and concludes that hotspots can move around
>> too quickly for a load balancer to keep up.
>>
>> Separately, you mentioned having to manage Kudu's compaction process.
>> Could you go into more detail here?
>>
>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> thanks Cliff, this is really good info. I am tempted to do the
>>> benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of
>>> traction lately, and based on a few conversations in slack, looks like Kudu
>>> team is not tracking competition that much and I think there are awesome
>>> features in Snowflake that would be great to have in Kudu.
>>>
>>> I think they do support upserts now and back in-time queries - while
>>> they do not upsert/delete records, they just create new micro-partitions
>>> and it is a metadata operation - that's who they can see data in a given
>>> time. The idea of virtual warehouse is also very appealing especially in
>>> healthcare as we often need to share data with other departments or
>>> partners.
>>>
>>> do not get me wrong, I decided to use Kudu for the same reasons and I
>>> really did not have a choice on CDH other than HBase but I get questioned
>>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>>> native tech like Redshift and Snowflake. It is getting harder to explain
>>> why :)
>>>
>>> My biggest gripe with Kudu besides known limitations is the management
>>> of partitions and compaction process. For our largest tables, we just max
>>> out the number of tablets as Kudu allows per cluster. Smaller tables though
>>> is a challenge and more like an art while I feel there should be an
>>> automated process with good defaults like in other data storage systems. No
>>> one expects to assign partitions manually in Snowflake, BigQuery or
>>> Redshift. No one is expected to tweak compaction parameters and deal with
>>> fast degraded performance over time on tables with a high number of
>>> deletes/updates. We are happy with performance but it is not all about
>>> performance.
>>>
>>> We also struggle on Impala side as it feels it is limiting what we can
>>> do with Kudu and it feels like Impala team is always behind. pyspark is
>>> another problem - Kudu pyspark client is lagging behind java client and
>>> very difficult to install/deploy. Some api is not even available in pyspark.
>>>
>>> And honestly, I do not like that Impala is the only viable option with
>>> Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a
>>> tank but we cannot do that anymore with Kudu tables and struggle with disk
>>> spilling issues, Impala choosing bad explain plans and forcing us to use
>>> straight_join hint many times etc.
>>>
>>> Sorry if this post came a bit on a negative side, I really like Kudu and
>>> the dev team behind it rocks. But I do not like their industry is going and
>>> why Kudu is not getting the traction it deserves.
>>>
>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>>>
>>>> This is a good conversation but I don't think the comparison with
>>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>>> closed source, it seems pretty clear the architectures are quite different.
>>>> Snowflake has no primary key index, no UPSERT capability, features that
>>>> make Kudu valuable for some use cases.
>>>>
>>>> It also seems to me that their intended workloads are quite different.
>>>> Snowflake is great for intensive analytics on demand, and can handle deeply
>>>> nested data very well, where Kudu can't handle that at all. Snowflake is
>>>> not designed for heavy concurrency, but complex query plans for a small
>>>> group of users. If you select an x-large Snowflake cluster it's probably
>>>> because you have a large amount of data to churn through, not because you
>>>> have a large number of users. Or, at least that's how we used it.
>>>>
>>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>>> concurrent queries. I agree that getting very fussy about partitioning can
>>>> be a pain, but for the large fact tables we generally use a simple strategy
>>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>>> use "for free" instance-store NVMe. We also have associated
>>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>>> compute oriented queries.
>>>>
>>>> The net result blows away what we had with RedShift at less than 1/3
>>>> the cost, with performance improvements mostly from better concurrency
>>>> handling. This is despite the fact that RedShift has built-in cache. We
>>>> also use streaming ingestion which, aside from being impossible with
>>>> RedShift, removes the added cost of staging.
>>>>
>>>> Getting back to Snowflake, there's no way we could use it the same way
>>>> we use Kudu, and even if we could, the cost would would probably put us out
>>>> of business!
>>>>
>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> thanks Andrew for taking your time responding to me. It seems that
>>>>> there are no exact recommendations.
>>>>>
>>>>> I did look at scaling recommendations but that math is extremely
>>>>> complicated and I do not think anyone will know all the answers to plug
>>>>> into that calculation. We have no control really what users are doing, how
>>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>>> IMHO to expect that knowledge of user query patterns.
>>>>>
>>>>> I do like the Snowflake approach than the engine takes care of
>>>>> defaults and can estimate the number of micro-partitions and even
>>>>> repartition tables as they grow. I feel Kudu has the same capabilities as
>>>>> the design is very similar. I really do not like to pick random number of
>>>>> buckets. Also we manager 100s of tables, I cannot look at them each one by
>>>>> one to make these decisions. Does it make sense?
>>>>>
>>>>>
>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>>>
>>>>>> Hey Boris,
>>>>>>
>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>
>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>> tablets for concurrency?
>>>>>>
>>>>>>
>>>>>> Per your numbers, this confirms that the partitions are the units of
>>>>>> concurrency here, and that therefore having more and having smaller
>>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>>> total number of partitions cluster-wide.
>>>>>>
>>>>>> There are some trade offs with the replica count per tablet server
>>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>>> operations in the pool of background operations to be run, etc.
>>>>>>
>>>>>> Your point about scan concurrency is certainly a valid one -- there
>>>>>> have been patches for other integrations that have tackled this to decouple
>>>>>> partitioning from scan concurrency (KUDU-2437
>>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>>> its way into Impala yet). I filed KUDU-3071
>>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>>> be worked into Impala.
>>>>>>
>>>>>> For now, I would try to take into account the total sum of resources
>>>>>> you have available to Kudu (including number of tablet servers, amount of
>>>>>> storage per node, number of disks per tablet server, type of disk for the
>>>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>>>> system can handle (the scaling guide
>>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>>> guide how you partition your tables.
>>>>>>
>>>>>> confused why they say "at least" not "at most" - does it mean I
>>>>>>> should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>
>>>>>>
>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>>>> your workload. As you show and as pointed out in documentation, larger and
>>>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>>
>>>>>> It is recommended that new tables which are expected to have heavy
>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>
>>>>>>
>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>> (divide by 3 because of replication)?
>>>>>>
>>>>>>
>>>>>> The recommendation here is to have at least 20 logical partitions per
>>>>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>>>>> scan would be broken up into 20 tablet scans, and each of those might land
>>>>>> on a different tablet server running on isolated hardware. For a
>>>>>> significantly larger table into which you expect highly concurrent
>>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>>> consider range-partitioning on timestamp.
>>>>>>
>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> just saw this in the docs but it is still confusing statement
>>>>>>> No Default Partitioning
>>>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>>
>>>>>>>
>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM
>>>>>>> rows and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>>> (divide by 3 because of replication)?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hey guys,
>>>>>>>>
>>>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>>>> answer.
>>>>>>>>
>>>>>>>> Can someone comment?
>>>>>>>>
>>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>>
>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>>> tablets for concurrency?
>>>>>>>>
>>>>>>>> We cannot pick the partitioning strategy for each table as we need
>>>>>>>> to stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>>>> rule but rethinking this now for another project.
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>
>>>>>>>>>  Kudu 1.5
>>>>>>>>>
>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>
>>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>>> 3 2Gb 65
>>>>>>>>> 9 700Mb 27
>>>>>>>>> 18 350Mb 17
>>>>>>>>> [image: image.png]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <
>>>>>>>>> boris@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi guys,
>>>>>>>>>>
>>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>>
>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>
>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>>
>>>>>>>>>>    -
>>>>>>>>>>
>>>>>>>>>>    For large tables, such as fact tables, aim for as many
>>>>>>>>>>    tablets as you have cores in the cluster.
>>>>>>>>>>    -
>>>>>>>>>>
>>>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>>>
>>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>>> diminishing returns.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>>
>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>
>>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>>> Based on my quick test, it seems that queries are faster if I
>>>>>>>>>> have more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>>> 1Gb".
>>>>>>>>>>
>>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>>
>>>>>>>>>> Boris
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Andrew Wong
>>>>>>
>>>>>

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

thanks Adar, I am a big fan of Kudu and I see a lot of potential in Kudu,
and I do not think snowflake deserves that much credit and noise. But they
do have great ideas that take their engine to the next level. I do not know
architecture-wise how it works but here is the best doc I found regarding
micro partitions:
https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html


It is also appealing to me that they support nested data types with ease
(and they are still very efficient to query/filter by), they scale workload
easily (Kudu requires manual rebalancing and it is VERY slow as I've
learned last weekend by doing it on our dev cluster). Also, they support
very large blobs and text fields that Kudu does not.

Actually, I have a much longer wish list on the Impala side than Kudu - it
feels that Kudu itself is very close to Snowflake but lacks these
self-management features. While Impala is certainly one of the fastest SQL
engines I've seen, we struggle with mutli-tenancy and complex queries, that
Hive would run like a champ.

As for Kudu's compaction process - it was largely an issue with 1.5 and I
believed was addressed in 1.9. We had a terrible time for a few weeks when
frequent updates/delete almost froze all our queries to crawl. A lot of
smaller tables had a huge number of tiny rowsets causing massive scans and
freezing the entire Kudu cluster. We had a good discussion on slack and
Todd Lipcon suggested a good workaround using flush_threshold_secs till we
move to 1.9 and it worked fine. Nowhere in the documentation, it was
suggested to set this flag and actually it was one of these "use it at your
own risk" flags.



On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <ad...@cloudera.com> wrote:

> This has been an excellent discussion to follow, with very useful
> feedback. Thank you for that.
>
> Boris, if I can try to summarize your position, it's that manual
> partitioning doesn't scale when dealing with hundreds of (small) tables and
> when you don't control the PK of each table. The Kudu schema design guide
> would advise you to hash partition such tables, and, following Andrew's
> recommendation, have at least one partition per tserver. Except that given
> enough tables, you'll eventually overwhelm any cluster based on the sheer
> number of partitions per tserver. And if you reduce the number of
> partitions per table, you open yourself up to potential hotspotting if the
> per-table load isn't even. Is that correct?
>
> I can think of several improvements we could make here:
> 1. Support splitting/merging of range partitions, first manually and later
> automatically based on load. This is tracked in KUDU-437 and KUDU-441. Both
> are complicated, and since we've gotten a lot of mileage out of static
> range/hash partitioning, there hasn't been much traction on those JIRAs.
> 2. Improve server efficiency when hosting large numbers of replicas, so
> that you can blanket your cluster with maximally hash-partitioned tables.
> We did some work here a couple years ago (see KUDU-1967) but there's
> clearly much more that we can do. Of note, we previously discussed
> implementing Multi-Raft (
> https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
> ).
> 3. Add support for consistent hashing. I'm not super familiar with the
> details but YugabyteDB (based on Kudu) has seen success here. Their recent
> blog post on sharding is a good (and related) read:
> https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>
> Since you're clearly more familiar with Snowflake than I, can you comment
> more on how their partitioning system works? Sounds like it's quite
> automated; is that ever problematic when it decides to repartition during a
> workload? The YugabyteDB blog post talks about heuristics "reacting" to
> hotspotting by load balancing, and concludes that hotspots can move around
> too quickly for a load balancer to keep up.
>
> Separately, you mentioned having to manage Kudu's compaction process.
> Could you go into more detail here?
>
> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> thanks Cliff, this is really good info. I am tempted to do the benchmarks
>> myself but need to find a sponsor :) Snowflake gets A LOT of traction
>> lately, and based on a few conversations in slack, looks like Kudu team is
>> not tracking competition that much and I think there are awesome features
>> in Snowflake that would be great to have in Kudu.
>>
>> I think they do support upserts now and back in-time queries - while they
>> do not upsert/delete records, they just create new micro-partitions and it
>> is a metadata operation - that's who they can see data in a given time. The
>> idea of virtual warehouse is also very appealing especially in healthcare
>> as we often need to share data with other departments or partners.
>>
>> do not get me wrong, I decided to use Kudu for the same reasons and I
>> really did not have a choice on CDH other than HBase but I get questioned
>> lately why we should not just give up CDH and go all in on Cloud with Cloud
>> native tech like Redshift and Snowflake. It is getting harder to explain
>> why :)
>>
>> My biggest gripe with Kudu besides known limitations is the management of
>> partitions and compaction process. For our largest tables, we just max out
>> the number of tablets as Kudu allows per cluster. Smaller tables though is
>> a challenge and more like an art while I feel there should be an automated
>> process with good defaults like in other data storage systems. No one
>> expects to assign partitions manually in Snowflake, BigQuery or Redshift.
>> No one is expected to tweak compaction parameters and deal with fast
>> degraded performance over time on tables with a high number of
>> deletes/updates. We are happy with performance but it is not all about
>> performance.
>>
>> We also struggle on Impala side as it feels it is limiting what we can do
>> with Kudu and it feels like Impala team is always behind. pyspark is
>> another problem - Kudu pyspark client is lagging behind java client and
>> very difficult to install/deploy. Some api is not even available in pyspark.
>>
>> And honestly, I do not like that Impala is the only viable option with
>> Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a
>> tank but we cannot do that anymore with Kudu tables and struggle with disk
>> spilling issues, Impala choosing bad explain plans and forcing us to use
>> straight_join hint many times etc.
>>
>> Sorry if this post came a bit on a negative side, I really like Kudu and
>> the dev team behind it rocks. But I do not like their industry is going and
>> why Kudu is not getting the traction it deserves.
>>
>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>>
>>> This is a good conversation but I don't think the comparison with
>>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>>> last job, about 5 years ago, I pretty much single-handedly scale tested
>>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>>> closed source, it seems pretty clear the architectures are quite different.
>>> Snowflake has no primary key index, no UPSERT capability, features that
>>> make Kudu valuable for some use cases.
>>>
>>> It also seems to me that their intended workloads are quite different.
>>> Snowflake is great for intensive analytics on demand, and can handle deeply
>>> nested data very well, where Kudu can't handle that at all. Snowflake is
>>> not designed for heavy concurrency, but complex query plans for a small
>>> group of users. If you select an x-large Snowflake cluster it's probably
>>> because you have a large amount of data to churn through, not because you
>>> have a large number of users. Or, at least that's how we used it.
>>>
>>> At my current workplace we use Kudu/Impala to handle about 30-60
>>> concurrent queries. I agree that getting very fussy about partitioning can
>>> be a pain, but for the large fact tables we generally use a simple strategy
>>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>>> use "for free" instance-store NVMe. We also have associated
>>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>>> compute oriented queries.
>>>
>>> The net result blows away what we had with RedShift at less than 1/3 the
>>> cost, with performance improvements mostly from better concurrency
>>> handling. This is despite the fact that RedShift has built-in cache. We
>>> also use streaming ingestion which, aside from being impossible with
>>> RedShift, removes the added cost of staging.
>>>
>>> Getting back to Snowflake, there's no way we could use it the same way
>>> we use Kudu, and even if we could, the cost would would probably put us out
>>> of business!
>>>
>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> thanks Andrew for taking your time responding to me. It seems that
>>>> there are no exact recommendations.
>>>>
>>>> I did look at scaling recommendations but that math is extremely
>>>> complicated and I do not think anyone will know all the answers to plug
>>>> into that calculation. We have no control really what users are doing, how
>>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>>> IMHO to expect that knowledge of user query patterns.
>>>>
>>>> I do like the Snowflake approach than the engine takes care of defaults
>>>> and can estimate the number of micro-partitions and even repartition tables
>>>> as they grow. I feel Kudu has the same capabilities as the design is very
>>>> similar. I really do not like to pick random number of buckets. Also we
>>>> manager 100s of tables, I cannot look at them each one by one to make these
>>>> decisions. Does it make sense?
>>>>
>>>>
>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>>
>>>>> Hey Boris,
>>>>>
>>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>>
>>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>>>> tablets to be closer to 1Gb?
>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>> tablets for concurrency?
>>>>>
>>>>>
>>>>> Per your numbers, this confirms that the partitions are the units of
>>>>> concurrency here, and that therefore having more and having smaller
>>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>>> smaller partitions across all tables may not scale when thinking about the
>>>>> total number of partitions cluster-wide.
>>>>>
>>>>> There are some trade offs with the replica count per tablet server
>>>>> here -- generally, each tablet replica has a resource cost on tablet
>>>>> servers: WALs and tablet-related metadata use a shared disk (if you can put
>>>>> this on an SSD, I would recommend doing so), each tablet introduces some
>>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>>> operations in the pool of background operations to be run, etc.
>>>>>
>>>>> Your point about scan concurrency is certainly a valid one -- there
>>>>> have been patches for other integrations that have tackled this to decouple
>>>>> partitioning from scan concurrency (KUDU-2437
>>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>>> its way into Impala yet). I filed KUDU-3071
>>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>>> be worked into Impala.
>>>>>
>>>>> For now, I would try to take into account the total sum of resources
>>>>> you have available to Kudu (including number of tablet servers, amount of
>>>>> storage per node, number of disks per tablet server, type of disk for the
>>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>>> system can handle (the scaling guide
>>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>>> guide how you partition your tables.
>>>>>
>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>
>>>>>
>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>>> your workload. As you show and as pointed out in documentation, larger and
>>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>>> though we've seen multiple GBs works relatively well for many use cases
>>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>>
>>>>> It is recommended that new tables which are expected to have heavy
>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>
>>>>>
>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>> (divide by 3 because of replication)?
>>>>>
>>>>>
>>>>> The recommendation here is to have at least 20 logical partitions per
>>>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>>>> scan would be broken up into 20 tablet scans, and each of those might land
>>>>> on a different tablet server running on isolated hardware. For a
>>>>> significantly larger table into which you expect highly concurrent
>>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>>> having more partitions, and if your data is naturally time-oriented,
>>>>> consider range-partitioning on timestamp.
>>>>>
>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> just saw this in the docs but it is still confusing statement
>>>>>> No Default Partitioning
>>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>>
>>>>>>
>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>>> (divide by 3 because of replication)?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> hey guys,
>>>>>>>
>>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>>> answer.
>>>>>>>
>>>>>>> Can someone comment?
>>>>>>>
>>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>>
>>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>>> tablets for concurrency?
>>>>>>>
>>>>>>> We cannot pick the partitioning strategy for each table as we need
>>>>>>> to stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>>> rule but rethinking this now for another project.
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> forgot to post results of my quick test:
>>>>>>>>
>>>>>>>>  Kudu 1.5
>>>>>>>>
>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>
>>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>>> 3 2Gb 65
>>>>>>>> 9 700Mb 27
>>>>>>>> 18 350Mb 17
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi guys,
>>>>>>>>>
>>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>>
>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>
>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>>
>>>>>>>>>    -
>>>>>>>>>
>>>>>>>>>    For large tables, such as fact tables, aim for as many tablets
>>>>>>>>>    as you have cores in the cluster.
>>>>>>>>>    -
>>>>>>>>>
>>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>>
>>>>>>>>> In general, be mindful the number of tablets limits the
>>>>>>>>> parallelism of reads, in the current implementation. Increasing the number
>>>>>>>>> of tablets significantly beyond the number of cores is likely to have
>>>>>>>>> diminishing returns.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>>
>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>
>>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>>> 1Gb".
>>>>>>>>>
>>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>>
>>>>>>>>> Boris
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>> --
>>>>> Andrew Wong
>>>>>
>>>>

Re: Partitioning Rules of Thumb

Posted by Adar Lieber-Dembo <ad...@cloudera.com>.

This has been an excellent discussion to follow, with very useful feedback.
Thank you for that.

Boris, if I can try to summarize your position, it's that manual
partitioning doesn't scale when dealing with hundreds of (small) tables and
when you don't control the PK of each table. The Kudu schema design guide
would advise you to hash partition such tables, and, following Andrew's
recommendation, have at least one partition per tserver. Except that given
enough tables, you'll eventually overwhelm any cluster based on the sheer
number of partitions per tserver. And if you reduce the number of
partitions per table, you open yourself up to potential hotspotting if the
per-table load isn't even. Is that correct?

I can think of several improvements we could make here:
1. Support splitting/merging of range partitions, first manually and later
automatically based on load. This is tracked in KUDU-437 and KUDU-441. Both
are complicated, and since we've gotten a lot of mileage out of static
range/hash partitioning, there hasn't been much traction on those JIRAs.
2. Improve server efficiency when hosting large numbers of replicas, so
that you can blanket your cluster with maximally hash-partitioned tables.
We did some work here a couple years ago (see KUDU-1967) but there's
clearly much more that we can do. Of note, we previously discussed
implementing Multi-Raft (
https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031
).
3. Add support for consistent hashing. I'm not super familiar with the
details but YugabyteDB (based on Kudu) has seen success here. Their recent
blog post on sharding is a good (and related) read:
https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/

Since you're clearly more familiar with Snowflake than I, can you comment
more on how their partitioning system works? Sounds like it's quite
automated; is that ever problematic when it decides to repartition during a
workload? The YugabyteDB blog post talks about heuristics "reacting" to
hotspotting by load balancing, and concludes that hotspots can move around
too quickly for a load balancer to keep up.

Separately, you mentioned having to manage Kudu's compaction process. Could
you go into more detail here?

On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> thanks Cliff, this is really good info. I am tempted to do the benchmarks
> myself but need to find a sponsor :) Snowflake gets A LOT of traction
> lately, and based on a few conversations in slack, looks like Kudu team is
> not tracking competition that much and I think there are awesome features
> in Snowflake that would be great to have in Kudu.
>
> I think they do support upserts now and back in-time queries - while they
> do not upsert/delete records, they just create new micro-partitions and it
> is a metadata operation - that's who they can see data in a given time. The
> idea of virtual warehouse is also very appealing especially in healthcare
> as we often need to share data with other departments or partners.
>
> do not get me wrong, I decided to use Kudu for the same reasons and I
> really did not have a choice on CDH other than HBase but I get questioned
> lately why we should not just give up CDH and go all in on Cloud with Cloud
> native tech like Redshift and Snowflake. It is getting harder to explain
> why :)
>
> My biggest gripe with Kudu besides known limitations is the management of
> partitions and compaction process. For our largest tables, we just max out
> the number of tablets as Kudu allows per cluster. Smaller tables though is
> a challenge and more like an art while I feel there should be an automated
> process with good defaults like in other data storage systems. No one
> expects to assign partitions manually in Snowflake, BigQuery or Redshift.
> No one is expected to tweak compaction parameters and deal with fast
> degraded performance over time on tables with a high number of
> deletes/updates. We are happy with performance but it is not all about
> performance.
>
> We also struggle on Impala side as it feels it is limiting what we can do
> with Kudu and it feels like Impala team is always behind. pyspark is
> another problem - Kudu pyspark client is lagging behind java client and
> very difficult to install/deploy. Some api is not even available in pyspark.
>
> And honestly, I do not like that Impala is the only viable option with
> Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a
> tank but we cannot do that anymore with Kudu tables and struggle with disk
> spilling issues, Impala choosing bad explain plans and forcing us to use
> straight_join hint many times etc.
>
> Sorry if this post came a bit on a negative side, I really like Kudu and
> the dev team behind it rocks. But I do not like their industry is going and
> why Kudu is not getting the traction it deserves.
>
> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:
>
>> This is a good conversation but I don't think the comparison with
>> Snowflake is a fair one, at least from an older version of Snowflake (In my
>> last job, about 5 years ago, I pretty much single-handedly scale tested
>> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
>> closed source, it seems pretty clear the architectures are quite different.
>> Snowflake has no primary key index, no UPSERT capability, features that
>> make Kudu valuable for some use cases.
>>
>> It also seems to me that their intended workloads are quite different.
>> Snowflake is great for intensive analytics on demand, and can handle deeply
>> nested data very well, where Kudu can't handle that at all. Snowflake is
>> not designed for heavy concurrency, but complex query plans for a small
>> group of users. If you select an x-large Snowflake cluster it's probably
>> because you have a large amount of data to churn through, not because you
>> have a large number of users. Or, at least that's how we used it.
>>
>> At my current workplace we use Kudu/Impala to handle about 30-60
>> concurrent queries. I agree that getting very fussy about partitioning can
>> be a pain, but for the large fact tables we generally use a simple strategy
>> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
>> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
>> use "for free" instance-store NVMe. We also have associated
>> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
>> compute oriented queries.
>>
>> The net result blows away what we had with RedShift at less than 1/3 the
>> cost, with performance improvements mostly from better concurrency
>> handling. This is despite the fact that RedShift has built-in cache. We
>> also use streaming ingestion which, aside from being impossible with
>> RedShift, removes the added cost of staging.
>>
>> Getting back to Snowflake, there's no way we could use it the same way we
>> use Kudu, and even if we could, the cost would would probably put us out of
>> business!
>>
>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> thanks Andrew for taking your time responding to me. It seems that there
>>> are no exact recommendations.
>>>
>>> I did look at scaling recommendations but that math is extremely
>>> complicated and I do not think anyone will know all the answers to plug
>>> into that calculation. We have no control really what users are doing, how
>>> many queries they run, how many are hot vs cold etc. It is not realistic
>>> IMHO to expect that knowledge of user query patterns.
>>>
>>> I do like the Snowflake approach than the engine takes care of defaults
>>> and can estimate the number of micro-partitions and even repartition tables
>>> as they grow. I feel Kudu has the same capabilities as the design is very
>>> similar. I really do not like to pick random number of buckets. Also we
>>> manager 100s of tables, I cannot look at them each one by one to make these
>>> decisions. Does it make sense?
>>>
>>>
>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>
>>>> Hey Boris,
>>>>
>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>> general can be tricky; thanks for the question. Left some thoughts below:
>>>>
>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>>> tablets to be closer to 1Gb?
>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>> tablets for concurrency?
>>>>
>>>>
>>>> Per your numbers, this confirms that the partitions are the units of
>>>> concurrency here, and that therefore having more and having smaller
>>>> partitions yields a concurrency bump. That said, extending a scheme of
>>>> smaller partitions across all tables may not scale when thinking about the
>>>> total number of partitions cluster-wide.
>>>>
>>>> There are some trade offs with the replica count per tablet server here
>>>> -- generally, each tablet replica has a resource cost on tablet servers:
>>>> WALs and tablet-related metadata use a shared disk (if you can put this on
>>>> an SSD, I would recommend doing so), each tablet introduces some
>>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>>> operations in the pool of background operations to be run, etc.
>>>>
>>>> Your point about scan concurrency is certainly a valid one -- there
>>>> have been patches for other integrations that have tackled this to decouple
>>>> partitioning from scan concurrency (KUDU-2437
>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example,
>>>> where Kudu's Spark integration will split range scans into smaller-scoped
>>>> scan tokens to be run concurrently, though this optimization hasn't made
>>>> its way into Impala yet). I filed KUDU-3071
>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>> think is left on the Kudu-side to get this up and running, so that it can
>>>> be worked into Impala.
>>>>
>>>> For now, I would try to take into account the total sum of resources
>>>> you have available to Kudu (including number of tablet servers, amount of
>>>> storage per node, number of disks per tablet server, type of disk for the
>>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>>> system can handle (the scaling guide
>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful
>>>> here), and hopefully that, along with your own SLAs per table, can help
>>>> guide how you partition your tables.
>>>>
>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>
>>>>
>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the
>>>> low tens of GB per tablet replica, though exact perf obviously depends on
>>>> your workload. As you show and as pointed out in documentation, larger and
>>>> fewer tablets can limit the amount of concurrency for writes and reads,
>>>> though we've seen multiple GBs works relatively well for many use cases
>>>> while weighing the above mentioned tradeoffs with replica count.
>>>>
>>>> It is recommended that new tables which are expected to have heavy read
>>>>> and write workloads have at least as many tablets as tablet servers.
>>>>
>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>> (divide by 3 because of replication)?
>>>>
>>>>
>>>> The recommendation here is to have at least 20 logical partitions per
>>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>>> scan would be broken up into 20 tablet scans, and each of those might land
>>>> on a different tablet server running on isolated hardware. For a
>>>> significantly larger table into which you expect highly concurrent
>>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>>> having more partitions, and if your data is naturally time-oriented,
>>>> consider range-partitioning on timestamp.
>>>>
>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> just saw this in the docs but it is still confusing statement
>>>>> No Default Partitioning
>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>> tables. It is recommended that new tables which are expected to have heavy
>>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>>
>>>>>
>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>> (divide by 3 because of replication)?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> hey guys,
>>>>>>
>>>>>> I asked the same question on Slack on and got no responses. I just
>>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>>> answer.
>>>>>>
>>>>>> Can someone comment?
>>>>>>
>>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>>> enough in my example above, should I go with 3, 9 or 18 tablets? or should
>>>>>> I pick tablets to be closer to 1Gb?
>>>>>>
>>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>>> tablets for concurrency?
>>>>>>
>>>>>> We cannot pick the partitioning strategy for each table as we need to
>>>>>> stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>>> rule but rethinking this now for another project.
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> forgot to post results of my quick test:
>>>>>>>
>>>>>>>  Kudu 1.5
>>>>>>>
>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>
>>>>>>> Tablets Tablet Size Query run time, sec
>>>>>>> 3 2Gb 65
>>>>>>> 9 700Mb 27
>>>>>>> 18 350Mb 17
>>>>>>> [image: image.png]
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>>
>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>
>>>>>>>> Partitioning Rules of Thumb
>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>>
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    For large tables, such as fact tables, aim for as many tablets
>>>>>>>>    as you have cores in the cluster.
>>>>>>>>    -
>>>>>>>>
>>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>>
>>>>>>>> In general, be mindful the number of tablets limits the parallelism
>>>>>>>> of reads, in the current implementation. Increasing the number of tablets
>>>>>>>> significantly beyond the number of cores is likely to have diminishing
>>>>>>>> returns.
>>>>>>>>
>>>>>>>>
>>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>>> correctly. Let me use concrete example.
>>>>>>>>
>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>
>>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>>> 1Gb".
>>>>>>>>
>>>>>>>> Really confused what the doc is saying, please help
>>>>>>>>
>>>>>>>> Boris
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> Andrew Wong
>>>>
>>>

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

thanks Cliff, this is really good info. I am tempted to do the benchmarks
myself but need to find a sponsor :) Snowflake gets A LOT of traction
lately, and based on a few conversations in slack, looks like Kudu team is
not tracking competition that much and I think there are awesome features
in Snowflake that would be great to have in Kudu.

I think they do support upserts now and back in-time queries - while they
do not upsert/delete records, they just create new micro-partitions and it
is a metadata operation - that's who they can see data in a given time. The
idea of virtual warehouse is also very appealing especially in healthcare
as we often need to share data with other departments or partners.

do not get me wrong, I decided to use Kudu for the same reasons and I
really did not have a choice on CDH other than HBase but I get questioned
lately why we should not just give up CDH and go all in on Cloud with Cloud
native tech like Redshift and Snowflake. It is getting harder to explain
why :)

My biggest gripe with Kudu besides known limitations is the management of
partitions and compaction process. For our largest tables, we just max out
the number of tablets as Kudu allows per cluster. Smaller tables though is
a challenge and more like an art while I feel there should be an automated
process with good defaults like in other data storage systems. No one
expects to assign partitions manually in Snowflake, BigQuery or Redshift.
No one is expected to tweak compaction parameters and deal with fast
degraded performance over time on tables with a high number of
deletes/updates. We are happy with performance but it is not all about
performance.

We also struggle on Impala side as it feels it is limiting what we can do
with Kudu and it feels like Impala team is always behind. pyspark is
another problem - Kudu pyspark client is lagging behind java client and
very difficult to install/deploy. Some api is not even available in pyspark.

And honestly, I do not like that Impala is the only viable option with
Kudu. We are forced a lot of time to do heavy ETL in Hive which is like a
tank but we cannot do that anymore with Kudu tables and struggle with disk
spilling issues, Impala choosing bad explain plans and forcing us to use
straight_join hint many times etc.

Sorry if this post came a bit on a negative side, I really like Kudu and
the dev team behind it rocks. But I do not like their industry is going and
why Kudu is not getting the traction it deserves.

On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:

> This is a good conversation but I don't think the comparison with
> Snowflake is a fair one, at least from an older version of Snowflake (In my
> last job, about 5 years ago, I pretty much single-handedly scale tested
> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
> closed source, it seems pretty clear the architectures are quite different.
> Snowflake has no primary key index, no UPSERT capability, features that
> make Kudu valuable for some use cases.
>
> It also seems to me that their intended workloads are quite different.
> Snowflake is great for intensive analytics on demand, and can handle deeply
> nested data very well, where Kudu can't handle that at all. Snowflake is
> not designed for heavy concurrency, but complex query plans for a small
> group of users. If you select an x-large Snowflake cluster it's probably
> because you have a large amount of data to churn through, not because you
> have a large number of users. Or, at least that's how we used it.
>
> At my current workplace we use Kudu/Impala to handle about 30-60
> concurrent queries. I agree that getting very fussy about partitioning can
> be a pain, but for the large fact tables we generally use a simple strategy
> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
> use "for free" instance-store NVMe. We also have associated
> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
> compute oriented queries.
>
> The net result blows away what we had with RedShift at less than 1/3 the
> cost, with performance improvements mostly from better concurrency
> handling. This is despite the fact that RedShift has built-in cache. We
> also use streaming ingestion which, aside from being impossible with
> RedShift, removes the added cost of staging.
>
> Getting back to Snowflake, there's no way we could use it the same way we
> use Kudu, and even if we could, the cost would would probably put us out of
> business!
>
> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> thanks Andrew for taking your time responding to me. It seems that there
>> are no exact recommendations.
>>
>> I did look at scaling recommendations but that math is extremely
>> complicated and I do not think anyone will know all the answers to plug
>> into that calculation. We have no control really what users are doing, how
>> many queries they run, how many are hot vs cold etc. It is not realistic
>> IMHO to expect that knowledge of user query patterns.
>>
>> I do like the Snowflake approach than the engine takes care of defaults
>> and can estimate the number of micro-partitions and even repartition tables
>> as they grow. I feel Kudu has the same capabilities as the design is very
>> similar. I really do not like to pick random number of buckets. Also we
>> manager 100s of tables, I cannot look at them each one by one to make these
>> decisions. Does it make sense?
>>
>>
>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>
>>> Hey Boris,
>>>
>>> Sorry you didn't have much luck on Slack. I know partitioning in general
>>> can be tricky; thanks for the question. Left some thoughts below:
>>>
>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>> tablets to be closer to 1Gb?
>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>> million rows, should I just use 1 partition or still break them on smaller
>>>> tablets for concurrency?
>>>
>>>
>>> Per your numbers, this confirms that the partitions are the units of
>>> concurrency here, and that therefore having more and having smaller
>>> partitions yields a concurrency bump. That said, extending a scheme of
>>> smaller partitions across all tables may not scale when thinking about the
>>> total number of partitions cluster-wide.
>>>
>>> There are some trade offs with the replica count per tablet server here
>>> -- generally, each tablet replica has a resource cost on tablet servers:
>>> WALs and tablet-related metadata use a shared disk (if you can put this on
>>> an SSD, I would recommend doing so), each tablet introduces some
>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>> operations in the pool of background operations to be run, etc.
>>>
>>> Your point about scan concurrency is certainly a valid one -- there have
>>> been patches for other integrations that have tackled this to decouple
>>> partitioning from scan concurrency (KUDU-2437
>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example, where
>>> Kudu's Spark integration will split range scans into smaller-scoped scan
>>> tokens to be run concurrently, though this optimization hasn't made its way
>>> into Impala yet). I filed KUDU-3071
>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think
>>> is left on the Kudu-side to get this up and running, so that it can be
>>> worked into Impala.
>>>
>>> For now, I would try to take into account the total sum of resources you
>>> have available to Kudu (including number of tablet servers, amount of
>>> storage per node, number of disks per tablet server, type of disk for the
>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>> system can handle (the scaling guide
>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here),
>>> and hopefully that, along with your own SLAs per table, can help guide how
>>> you partition your tables.
>>>
>>> confused why they say "at least" not "at most" - does it mean I should
>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>
>>>
>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low
>>> tens of GB per tablet replica, though exact perf obviously depends on your
>>> workload. As you show and as pointed out in documentation, larger and fewer
>>> tablets can limit the amount of concurrency for writes and reads, though
>>> we've seen multiple GBs works relatively well for many use cases while
>>> weighing the above mentioned tradeoffs with replica count.
>>>
>>> It is recommended that new tables which are expected to have heavy read
>>>> and write workloads have at least as many tablets as tablet servers.
>>>
>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>> (divide by 3 because of replication)?
>>>
>>>
>>> The recommendation here is to have at least 20 logical partitions per
>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>> scan would be broken up into 20 tablet scans, and each of those might land
>>> on a different tablet server running on isolated hardware. For a
>>> significantly larger table into which you expect highly concurrent
>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>> having more partitions, and if your data is naturally time-oriented,
>>> consider range-partitioning on timestamp.
>>>
>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> just saw this in the docs but it is still confusing statement
>>>> No Default Partitioning
>>>> Kudu does not provide a default partitioning strategy when creating
>>>> tables. It is recommended that new tables which are expected to have heavy
>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>
>>>>
>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>> (divide by 3 because of replication)?
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> hey guys,
>>>>>
>>>>> I asked the same question on Slack on and got no responses. I just
>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>> answer.
>>>>>
>>>>> Can someone comment?
>>>>>
>>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>>> tablets to be closer to 1Gb?
>>>>>
>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>> tablets for concurrency?
>>>>>
>>>>> We cannot pick the partitioning strategy for each table as we need to
>>>>> stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>> rule but rethinking this now for another project.
>>>>>
>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> forgot to post results of my quick test:
>>>>>>
>>>>>>  Kudu 1.5
>>>>>>
>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>
>>>>>> Tablets Tablet Size Query run time, sec
>>>>>> 3 2Gb 65
>>>>>> 9 700Mb 27
>>>>>> 18 350Mb 17
>>>>>> [image: image.png]
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>
>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>
>>>>>>> Partitioning Rules of Thumb
>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    For large tables, such as fact tables, aim for as many tablets
>>>>>>>    as you have cores in the cluster.
>>>>>>>    -
>>>>>>>
>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>
>>>>>>> In general, be mindful the number of tablets limits the parallelism
>>>>>>> of reads, in the current implementation. Increasing the number of tablets
>>>>>>> significantly beyond the number of cores is likely to have diminishing
>>>>>>> returns.
>>>>>>>
>>>>>>>
>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>> correctly. Let me use concrete example.
>>>>>>>
>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>
>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>> 1Gb".
>>>>>>>
>>>>>>> Really confused what the doc is saying, please help
>>>>>>>
>>>>>>> Boris
>>>>>>>
>>>>>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>

Re: Partitioning Rules of Thumb

Posted by Cliff Resnick <cr...@gmail.com>.

This is a good conversation but I don't think the comparison with Snowflake
is a fair one, at least from an older version of Snowflake (In my last job,
about 5 years ago, I pretty much single-handedly scale tested Snowflake in
exchange for a sweetheart pricing deal) . Though Snowflake is closed
source, it seems pretty clear the architectures are quite different.
Snowflake has no primary key index, no UPSERT capability, features that
make Kudu valuable for some use cases.

It also seems to me that their intended workloads are quite different.
Snowflake is great for intensive analytics on demand, and can handle deeply
nested data very well, where Kudu can't handle that at all. Snowflake is
not designed for heavy concurrency, but complex query plans for a small
group of users. If you select an x-large Snowflake cluster it's probably
because you have a large amount of data to churn through, not because you
have a large number of users. Or, at least that's how we used it.

At my current workplace we use Kudu/Impala to handle about 30-60 concurrent
queries. I agree that getting very fussy about partitioning can be a pain,
but for the large fact tables we generally use a simple strategy of
twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted by a
load balancer. We're in AWS, and thanks to Kudu's replication we can use
"for free" instance-store NVMe. We also have associated compute-oriented
stateless Impala Spot Fleet clusters for HLL and other compute oriented
queries.

The net result blows away what we had with RedShift at less than 1/3 the
cost, with performance improvements mostly from better concurrency
handling. This is despite the fact that RedShift has built-in cache. We
also use streaming ingestion which, aside from being impossible with
RedShift, removes the added cost of staging.

Getting back to Snowflake, there's no way we could use it the same way we
use Kudu, and even if we could, the cost would would probably put us out of
business!

On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> thanks Andrew for taking your time responding to me. It seems that there
> are no exact recommendations.
>
> I did look at scaling recommendations but that math is extremely
> complicated and I do not think anyone will know all the answers to plug
> into that calculation. We have no control really what users are doing, how
> many queries they run, how many are hot vs cold etc. It is not realistic
> IMHO to expect that knowledge of user query patterns.
>
> I do like the Snowflake approach than the engine takes care of defaults
> and can estimate the number of micro-partitions and even repartition tables
> as they grow. I feel Kudu has the same capabilities as the design is very
> similar. I really do not like to pick random number of buckets. Also we
> manager 100s of tables, I cannot look at them each one by one to make these
> decisions. Does it make sense?
>
>
> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>
>> Hey Boris,
>>
>> Sorry you didn't have much luck on Slack. I know partitioning in general
>> can be tricky; thanks for the question. Left some thoughts below:
>>
>> Maybe I was not asking a clear question. If my cluster is large enough in
>>> my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>> tablets to be closer to 1Gb?
>>> And a follow-up question, if I have tons of smaller tables under 5
>>> million rows, should I just use 1 partition or still break them on smaller
>>> tablets for concurrency?
>>
>>
>> Per your numbers, this confirms that the partitions are the units of
>> concurrency here, and that therefore having more and having smaller
>> partitions yields a concurrency bump. That said, extending a scheme of
>> smaller partitions across all tables may not scale when thinking about the
>> total number of partitions cluster-wide.
>>
>> There are some trade offs with the replica count per tablet server here
>> -- generally, each tablet replica has a resource cost on tablet servers:
>> WALs and tablet-related metadata use a shared disk (if you can put this on
>> an SSD, I would recommend doing so), each tablet introduces some
>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>> operations in the pool of background operations to be run, etc.
>>
>> Your point about scan concurrency is certainly a valid one -- there have
>> been patches for other integrations that have tackled this to decouple
>> partitioning from scan concurrency (KUDU-2437
>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example, where
>> Kudu's Spark integration will split range scans into smaller-scoped scan
>> tokens to be run concurrently, though this optimization hasn't made its way
>> into Impala yet). I filed KUDU-3071
>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think
>> is left on the Kudu-side to get this up and running, so that it can be
>> worked into Impala.
>>
>> For now, I would try to take into account the total sum of resources you
>> have available to Kudu (including number of tablet servers, amount of
>> storage per node, number of disks per tablet server, type of disk for the
>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>> system can handle (the scaling guide
>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here),
>> and hopefully that, along with your own SLAs per table, can help guide how
>> you partition your tables.
>>
>> confused why they say "at least" not "at most" - does it mean I should
>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>
>>
>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low
>> tens of GB per tablet replica, though exact perf obviously depends on your
>> workload. As you show and as pointed out in documentation, larger and fewer
>> tablets can limit the amount of concurrency for writes and reads, though
>> we've seen multiple GBs works relatively well for many use cases while
>> weighing the above mentioned tradeoffs with replica count.
>>
>> It is recommended that new tables which are expected to have heavy read
>>> and write workloads have at least as many tablets as tablet servers.
>>
>> if I have 20 tablet servers and I have two tables - one with 1MM rows and
>>> another one with 100MM rows, do I pick 20 / 3 partitions for both (divide
>>> by 3 because of replication)?
>>
>>
>> The recommendation here is to have at least 20 logical partitions per
>> table. That way, a scan were to touch a table's entire keyspace, the table
>> scan would be broken up into 20 tablet scans, and each of those might land
>> on a different tablet server running on isolated hardware. For a
>> significantly larger table into which you expect highly concurrent
>> workloads, the recommendation serves as a lower bound -- I'd recommend
>> having more partitions, and if your data is naturally time-oriented,
>> consider range-partitioning on timestamp.
>>
>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> just saw this in the docs but it is still confusing statement
>>> No Default Partitioning
>>> Kudu does not provide a default partitioning strategy when creating
>>> tables. It is recommended that new tables which are expected to have heavy
>>> read and write workloads have at least as many tablets as tablet servers.
>>>
>>>
>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>> (divide by 3 because of replication)?
>>>
>>>
>>>
>>>
>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> hey guys,
>>>>
>>>> I asked the same question on Slack on and got no responses. I just went
>>>> through the docs and design doc and FAQ and still did not find an answer.
>>>>
>>>> Can someone comment?
>>>>
>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>> tablets to be closer to 1Gb?
>>>>
>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>> million rows, should I just use 1 partition or still break them on smaller
>>>> tablets for concurrency?
>>>>
>>>> We cannot pick the partitioning strategy for each table as we need to
>>>> stream 100s of tables and we use PK from RBDMS and need to come with an
>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>> rule but rethinking this now for another project.
>>>>
>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> forgot to post results of my quick test:
>>>>>
>>>>>  Kudu 1.5
>>>>>
>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>
>>>>> Tablets Tablet Size Query run time, sec
>>>>> 3 2Gb 65
>>>>> 9 700Mb 27
>>>>> 18 350Mb 17
>>>>> [image: image.png]
>>>>>
>>>>>
>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>
>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>
>>>>>> Partitioning Rules of Thumb
>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>
>>>>>>    -
>>>>>>
>>>>>>    For large tables, such as fact tables, aim for as many tablets as
>>>>>>    you have cores in the cluster.
>>>>>>    -
>>>>>>
>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>    tablet is at least 1 GB in size.
>>>>>>
>>>>>> In general, be mindful the number of tablets limits the parallelism
>>>>>> of reads, in the current implementation. Increasing the number of tablets
>>>>>> significantly beyond the number of cores is likely to have diminishing
>>>>>> returns.
>>>>>>
>>>>>>
>>>>>> I've read this a few times but I am not sure I understand it
>>>>>> correctly. Let me use concrete example.
>>>>>>
>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>
>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>> 1Gb".
>>>>>>
>>>>>> Really confused what the doc is saying, please help
>>>>>>
>>>>>> Boris
>>>>>>
>>>>>>
>>
>> --
>> Andrew Wong
>>
>

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

thanks Andrew for taking your time responding to me. It seems that there
are no exact recommendations.

I did look at scaling recommendations but that math is extremely
complicated and I do not think anyone will know all the answers to plug
into that calculation. We have no control really what users are doing, how
many queries they run, how many are hot vs cold etc. It is not realistic
IMHO to expect that knowledge of user query patterns.

I do like the Snowflake approach than the engine takes care of defaults and
can estimate the number of micro-partitions and even repartition tables as
they grow. I feel Kudu has the same capabilities as the design is very
similar. I really do not like to pick random number of buckets. Also we
manager 100s of tables, I cannot look at them each one by one to make these
decisions. Does it make sense?


On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:

> Hey Boris,
>
> Sorry you didn't have much luck on Slack. I know partitioning in general
> can be tricky; thanks for the question. Left some thoughts below:
>
> Maybe I was not asking a clear question. If my cluster is large enough in
>> my example above, should I go with 3, 9 or 18 tablets? or should I pick
>> tablets to be closer to 1Gb?
>> And a follow-up question, if I have tons of smaller tables under 5
>> million rows, should I just use 1 partition or still break them on smaller
>> tablets for concurrency?
>
>
> Per your numbers, this confirms that the partitions are the units of
> concurrency here, and that therefore having more and having smaller
> partitions yields a concurrency bump. That said, extending a scheme of
> smaller partitions across all tables may not scale when thinking about the
> total number of partitions cluster-wide.
>
> There are some trade offs with the replica count per tablet server here --
> generally, each tablet replica has a resource cost on tablet servers: WALs
> and tablet-related metadata use a shared disk (if you can put this on an
> SSD, I would recommend doing so), each tablet introduces some Raft-related
> RPC traffic, each tablet replica introduces some maintenance operations in
> the pool of background operations to be run, etc.
>
> Your point about scan concurrency is certainly a valid one -- there have
> been patches for other integrations that have tackled this to decouple
> partitioning from scan concurrency (KUDU-2437
> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
> <https://issues.apache.org/jira/browse/KUDU-2670> are an example, where
> Kudu's Spark integration will split range scans into smaller-scoped scan
> tokens to be run concurrently, though this optimization hasn't made its way
> into Impala yet). I filed KUDU-3071
> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think
> is left on the Kudu-side to get this up and running, so that it can be
> worked into Impala.
>
> For now, I would try to take into account the total sum of resources you
> have available to Kudu (including number of tablet servers, amount of
> storage per node, number of disks per tablet server, type of disk for the
> WAL/metadata disks), to settle on roughly how many tablet replicas your
> system can handle (the scaling guide
> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here),
> and hopefully that, along with your own SLAs per table, can help guide how
> you partition your tables.
>
> confused why they say "at least" not "at most" - does it mean I should
>> design it so a tablet takes 2Gb or 3Gb in this example?
>
>
> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low
> tens of GB per tablet replica, though exact perf obviously depends on your
> workload. As you show and as pointed out in documentation, larger and fewer
> tablets can limit the amount of concurrency for writes and reads, though
> we've seen multiple GBs works relatively well for many use cases while
> weighing the above mentioned tradeoffs with replica count.
>
> It is recommended that new tables which are expected to have heavy read
>> and write workloads have at least as many tablets as tablet servers.
>
> if I have 20 tablet servers and I have two tables - one with 1MM rows and
>> another one with 100MM rows, do I pick 20 / 3 partitions for both (divide
>> by 3 because of replication)?
>
>
> The recommendation here is to have at least 20 logical partitions per
> table. That way, a scan were to touch a table's entire keyspace, the table
> scan would be broken up into 20 tablet scans, and each of those might land
> on a different tablet server running on isolated hardware. For a
> significantly larger table into which you expect highly concurrent
> workloads, the recommendation serves as a lower bound -- I'd recommend
> having more partitions, and if your data is naturally time-oriented,
> consider range-partitioning on timestamp.
>
> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> just saw this in the docs but it is still confusing statement
>> No Default Partitioning
>> Kudu does not provide a default partitioning strategy when creating
>> tables. It is recommended that new tables which are expected to have heavy
>> read and write workloads have at least as many tablets as tablet servers.
>>
>>
>> if I have 20 tablet servers and I have two tables - one with 1MM rows and
>> another one with 100MM rows, do I pick 20 / 3 partitions for both (divide
>> by 3 because of replication)?
>>
>>
>>
>>
>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> hey guys,
>>>
>>> I asked the same question on Slack on and got no responses. I just went
>>> through the docs and design doc and FAQ and still did not find an answer.
>>>
>>> Can someone comment?
>>>
>>> Maybe I was not asking a clear question. If my cluster is large enough
>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>> tablets to be closer to 1Gb?
>>>
>>> And a follow-up question, if I have tons of smaller tables under 5
>>> million rows, should I just use 1 partition or still break them on smaller
>>> tablets for concurrency?
>>>
>>> We cannot pick the partitioning strategy for each table as we need to
>>> stream 100s of tables and we use PK from RBDMS and need to come with an
>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>> rule but rethinking this now for another project.
>>>
>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> forgot to post results of my quick test:
>>>>
>>>>  Kudu 1.5
>>>>
>>>> Table takes 18Gb of disk space after 3x replication
>>>>
>>>> Tablets Tablet Size Query run time, sec
>>>> 3 2Gb 65
>>>> 9 700Mb 27
>>>> 18 350Mb 17
>>>> [image: image.png]
>>>>
>>>>
>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> just want to clarify recommendations from the doc. It says:
>>>>>
>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>
>>>>> Partitioning Rules of Thumb
>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>
>>>>>    -
>>>>>
>>>>>    For large tables, such as fact tables, aim for as many tablets as
>>>>>    you have cores in the cluster.
>>>>>    -
>>>>>
>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>    tablet is at least 1 GB in size.
>>>>>
>>>>> In general, be mindful the number of tablets limits the parallelism of
>>>>> reads, in the current implementation. Increasing the number of tablets
>>>>> significantly beyond the number of cores is likely to have diminishing
>>>>> returns.
>>>>>
>>>>>
>>>>> I've read this a few times but I am not sure I understand it
>>>>> correctly. Let me use concrete example.
>>>>>
>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>
>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>> 1Gb".
>>>>>
>>>>> Really confused what the doc is saying, please help
>>>>>
>>>>> Boris
>>>>>
>>>>>
>
> --
> Andrew Wong
>

Re: Partitioning Rules of Thumb

Posted by Andrew Wong <aw...@cloudera.com>.

Hey Boris,

Sorry you didn't have much luck on Slack. I know partitioning in general
can be tricky; thanks for the question. Left some thoughts below:

Maybe I was not asking a clear question. If my cluster is large enough in
> my example above, should I go with 3, 9 or 18 tablets? or should I pick
> tablets to be closer to 1Gb?
> And a follow-up question, if I have tons of smaller tables under 5 million
> rows, should I just use 1 partition or still break them on smaller tablets
> for concurrency?

Per your numbers, this confirms that the partitions are the units of
concurrency here, and that therefore having more and having smaller
partitions yields a concurrency bump. That said, extending a scheme of
smaller partitions across all tables may not scale when thinking about the
total number of partitions cluster-wide.

There are some trade offs with the replica count per tablet server here --
generally, each tablet replica has a resource cost on tablet servers: WALs
and tablet-related metadata use a shared disk (if you can put this on an
SSD, I would recommend doing so), each tablet introduces some Raft-related
RPC traffic, each tablet replica introduces some maintenance operations in
the pool of background operations to be run, etc.

Your point about scan concurrency is certainly a valid one -- there have
been patches for other integrations that have tackled this to decouple
partitioning from scan concurrency (KUDU-2437
<https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
<https://issues.apache.org/jira/browse/KUDU-2670> are an example, where
Kudu's Spark integration will split range scans into smaller-scoped scan
tokens to be run concurrently, though this optimization hasn't made its way
into Impala yet). I filed KUDU-3071
<https://issues.apache.org/jira/browse/KUDU-3071> to track what I think is
left on the Kudu-side to get this up and running, so that it can be worked
into Impala.

For now, I would try to take into account the total sum of resources you
have available to Kudu (including number of tablet servers, amount of
storage per node, number of disks per tablet server, type of disk for the
WAL/metadata disks), to settle on roughly how many tablet replicas your
system can handle (the scaling guide
<https://kudu.apache.org/docs/scaling_guide.html> may be helpful here), and
hopefully that, along with your own SLAs per table, can help guide how you
partition your tables.

confused why they say "at least" not "at most" - does it mean I should
> design it so a tablet takes 2Gb or 3Gb in this example?

Aiming for 1GB seems a bit low; Kudu should be able to handle in the low
tens of GB per tablet replica, though exact perf obviously depends on your
workload. As you show and as pointed out in documentation, larger and fewer
tablets can limit the amount of concurrency for writes and reads, though
we've seen multiple GBs works relatively well for many use cases while
weighing the above mentioned tradeoffs with replica count.

It is recommended that new tables which are expected to have heavy read and
> write workloads have at least as many tablets as tablet servers.

if I have 20 tablet servers and I have two tables - one with 1MM rows and
> another one with 100MM rows, do I pick 20 / 3 partitions for both (divide
> by 3 because of replication)?

The recommendation here is to have at least 20 logical partitions per
table. That way, a scan were to touch a table's entire keyspace, the table
scan would be broken up into 20 tablet scans, and each of those might land
on a different tablet server running on isolated hardware. For a
significantly larger table into which you expect highly concurrent
workloads, the recommendation serves as a lower bound -- I'd recommend
having more partitions, and if your data is naturally time-oriented,
consider range-partitioning on timestamp.

On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> just saw this in the docs but it is still confusing statement
> No Default Partitioning
> Kudu does not provide a default partitioning strategy when creating
> tables. It is recommended that new tables which are expected to have heavy
> read and write workloads have at least as many tablets as tablet servers.
>
> if I have 20 tablet servers and I have two tables - one with 1MM rows and
> another one with 100MM rows, do I pick 20 / 3 partitions for both (divide
> by 3 because of replication)?
>
>
>
>
> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> hey guys,
>>
>> I asked the same question on Slack on and got no responses. I just went
>> through the docs and design doc and FAQ and still did not find an answer.
>>
>> Can someone comment?
>>
>> Maybe I was not asking a clear question. If my cluster is large enough in
>> my example above, should I go with 3, 9 or 18 tablets? or should I pick
>> tablets to be closer to 1Gb?
>>
>> And a follow-up question, if I have tons of smaller tables under 5
>> million rows, should I just use 1 partition or still break them on smaller
>> tablets for concurrency?
>>
>> We cannot pick the partitioning strategy for each table as we need to
>> stream 100s of tables and we use PK from RBDMS and need to come with an
>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>> rule but rethinking this now for another project.
>>
>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> forgot to post results of my quick test:
>>>
>>>  Kudu 1.5
>>>
>>> Table takes 18Gb of disk space after 3x replication
>>>
>>> Tablets Tablet Size Query run time, sec
>>> 3 2Gb 65
>>> 9 700Mb 27
>>> 18 350Mb 17
>>> [image: image.png]
>>>
>>>
>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> just want to clarify recommendations from the doc. It says:
>>>>
>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>
>>>> Partitioning Rules of Thumb
>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>
>>>>    -
>>>>
>>>>    For large tables, such as fact tables, aim for as many tablets as
>>>>    you have cores in the cluster.
>>>>    -
>>>>
>>>>    For small tables, such as dimension tables, ensure that each tablet
>>>>    is at least 1 GB in size.
>>>>
>>>> In general, be mindful the number of tablets limits the parallelism of
>>>> reads, in the current implementation. Increasing the number of tablets
>>>> significantly beyond the number of cores is likely to have diminishing
>>>> returns.
>>>>
>>>>
>>>> I've read this a few times but I am not sure I understand it correctly.
>>>> Let me use concrete example.
>>>>
>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>> replication it is ~9Gb per tablet if I do not partition), should I aim for
>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>> tablets if my cluster capacity allows so (12 tablets before replication)?
>>>> confused why they say "at least" not "at most" - does it mean I should
>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>
>>>> Assume that I have tons of CPU cores on a cluster...
>>>> Based on my quick test, it seems that queries are faster if I have more
>>>> tablets/partitions...In this example, 18 tablets gave me the best timing
>>>> but tablet size was around 300-400Mb. But the doc says "at least 1Gb".
>>>>
>>>> Really confused what the doc is saying, please help
>>>>
>>>> Boris
>>>>
>>>>

-- 
Andrew Wong

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.

just saw this in the docs but it is still confusing statement
No Default Partitioning
Kudu does not provide a default partitioning strategy when creating tables.
It is recommended that new tables which are expected to have heavy read and
write workloads have at least as many tablets as tablet servers.

if I have 20 tablet servers and I have two tables - one with 1MM rows and
another one with 100MM rows, do I pick 20 / 3 partitions for both (divide
by 3 because of replication)?




On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com> wrote:

> hey guys,
>
> I asked the same question on Slack on and got no responses. I just went
> through the docs and design doc and FAQ and still did not find an answer.
>
> Can someone comment?
>
> Maybe I was not asking a clear question. If my cluster is large enough in
> my example above, should I go with 3, 9 or 18 tablets? or should I pick
> tablets to be closer to 1Gb?
>
> And a follow-up question, if I have tons of smaller tables under 5 million
> rows, should I just use 1 partition or still break them on smaller tablets
> for concurrency?
>
> We cannot pick the partitioning strategy for each table as we need to
> stream 100s of tables and we use PK from RBDMS and need to come with an
> automated way to pick number of partitions/tablets. So far I was using 1Gb
> rule but rethinking this now for another project.
>
> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> forgot to post results of my quick test:
>>
>>  Kudu 1.5
>>
>> Table takes 18Gb of disk space after 3x replication
>>
>> Tablets Tablet Size Query run time, sec
>> 3 2Gb 65
>> 9 700Mb 27
>> 18 350Mb 17
>> [image: image.png]
>>
>>
>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> Hi guys,
>>>
>>> just want to clarify recommendations from the doc. It says:
>>>
>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>
>>> Partitioning Rules of Thumb
>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>
>>>    -
>>>
>>>    For large tables, such as fact tables, aim for as many tablets as
>>>    you have cores in the cluster.
>>>    -
>>>
>>>    For small tables, such as dimension tables, ensure that each tablet
>>>    is at least 1 GB in size.
>>>
>>> In general, be mindful the number of tablets limits the parallelism of
>>> reads, in the current implementation. Increasing the number of tablets
>>> significantly beyond the number of cores is likely to have diminishing
>>> returns.
>>>
>>>
>>> I've read this a few times but I am not sure I understand it correctly.
>>> Let me use concrete example.
>>>
>>> If a table ends up taking 18Gb after replication (so with 3x replication
>>> it is ~9Gb per tablet if I do not partition), should I aim for 1Gb tablets
>>> (6 tablets before replication) or should I aim for 500Mb tablets if my
>>> cluster capacity allows so (12 tablets before replication)? confused why
>>> they say "at least" not "at most" - does it mean I should design it so a
>>> tablet takes 2Gb or 3Gb in this example?
>>>
>>> Assume that I have tons of CPU cores on a cluster...
>>> Based on my quick test, it seems that queries are faster if I have more
>>> tablets/partitions...In this example, 18 tablets gave me the best timing
>>> but tablet size was around 300-400Mb. But the doc says "at least 1Gb".
>>>
>>> Really confused what the doc is saying, please help
>>>
>>> Boris
>>>
>>>