Posted to user@kudu.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2020/04/25 18:54:58 UTC

Re: Partitioning Rules of Thumb

Cliff, I would be extremely interested to see a blog post comparing
Snowflake, Redshift and Impala/Kudu, since you have tried all of them.

I would also love to get some details on how you set up your Kudu/Impala
cluster on AWS, as my company might be heading in the same direction. Your
description does not mean much to me yet since we are not using the cloud,
but I hope you can elaborate on your setup: "12 hash x 12 (month) ranges in
two 12-node clusters fronted by a load balancer. We're in AWS, and thanks
to Kudu's replication we can use "for free" instance-store NVMe. We also
have associated compute-oriented stateless Impala Spot Fleet clusters for
HLL and other compute-oriented queries."

I cannot really find anything on the web that would compare Impala/Kudu to
Snowflake and Redshift. Everything I see is about Snowflake, Redshift and
BigQuery.

On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cr...@gmail.com> wrote:

> This is a good conversation, but I don't think the comparison with
> Snowflake is a fair one, at least against an older version of Snowflake
> (in my last job, about 5 years ago, I pretty much single-handedly scale
> tested Snowflake in exchange for a sweetheart pricing deal). Though
> Snowflake is closed source, it seems pretty clear the architectures are
> quite different: Snowflake has no primary key index and no UPSERT
> capability, features that make Kudu valuable for some use cases.
>
> It also seems to me that their intended workloads are quite different.
> Snowflake is great for intensive analytics on demand and can handle deeply
> nested data very well, which Kudu can't handle at all. Snowflake is
> designed not for heavy concurrency but for complex query plans from a
> small group of users. If you select an x-large Snowflake cluster, it's
> probably because you have a large amount of data to churn through, not
> because you have a large number of users. Or at least that's how we used
> it.
>
> At my current workplace we use Kudu/Impala to handle about 30-60
> concurrent queries. I agree that getting very fussy about partitioning can
> be a pain, but for the large fact tables we generally use a simple
> strategy of twelves: 12 hash x 12 (month) ranges in two 12-node clusters
> fronted by a load balancer. We're in AWS, and thanks to Kudu's replication
> we can use "for free" instance-store NVMe. We also have associated
> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
> compute-oriented queries.
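>
> (To make that concrete, here is a rough sketch of what the DDL for such a
> scheme looks like in Impala. The table and column names are made up for
> illustration, and only the first two of the twelve monthly ranges are
> spelled out:
>
>   CREATE TABLE fact_events (
>     event_id BIGINT,
>     event_ts TIMESTAMP,
>     payload STRING,
>     PRIMARY KEY (event_id, event_ts)
>   )
>   PARTITION BY HASH (event_id) PARTITIONS 12,
>   RANGE (event_ts) (
>     PARTITION '2020-01-01' <= VALUES < '2020-02-01',
>     PARTITION '2020-02-01' <= VALUES < '2020-03-01'
>     -- ...and so on, one range partition per month
>   )
>   STORED AS KUDU;
>
> That works out to 12 x 12 = 144 tablets for the table, or 432 replicas at
> 3x replication, spread over the 12 nodes of each cluster.)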
>
> The net result blows away what we had with Redshift at less than 1/3 the
> cost, with performance improvements coming mostly from better concurrency
> handling. This is despite the fact that Redshift has a built-in cache. We
> also use streaming ingestion which, aside from being impossible with
> Redshift, removes the added cost of staging.
>
> Getting back to Snowflake, there's no way we could use it the same way we
> use Kudu, and even if we could, the cost would probably put us out of
> business!
>
> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> Thanks, Andrew, for taking the time to respond to me. It seems that
>> there are no exact recommendations.
>>
>> I did look at the scaling recommendations, but that math is extremely
>> complicated and I do not think anyone will know all the answers to plug
>> into that calculation. We really have no control over what users are
>> doing, how many queries they run, how many are hot vs. cold, etc. It is
>> not realistic IMHO to expect that knowledge of user query patterns.
>>
>> I do like the Snowflake approach, where the engine takes care of defaults
>> and can estimate the number of micro-partitions and even repartition
>> tables as they grow. I feel Kudu has the same capabilities, as the design
>> is very similar. I really do not like picking a random number of buckets.
>> Also, we manage 100s of tables; I cannot look at each one of them to make
>> these decisions. Does that make sense?
>>
>>
>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>
>>> Hey Boris,
>>>
>>> Sorry you didn't have much luck on Slack. I know partitioning in general
>>> can be tricky; thanks for the question. Left some thoughts below:
>>>
>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>> tablets to be closer to 1Gb?
>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>> million rows, should I just use 1 partition or still break them on smaller
>>>> tablets for concurrency?
>>>
>>>
>>> Per your numbers, this confirms that partitions are the units of
>>> concurrency here, and that having more, smaller partitions therefore
>>> yields a concurrency bump. That said, extending a scheme of smaller
>>> partitions across all tables may not scale when thinking about the
>>> total number of partitions cluster-wide.
>>>
>>> There are some trade-offs with the replica count per tablet server here
>>> -- generally, each tablet replica has a resource cost on tablet servers:
>>> WALs and tablet-related metadata use a shared disk (if you can put this on
>>> an SSD, I would recommend doing so), each tablet introduces some
>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>> operations in the pool of background operations to be run, etc.
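>>>
>>> (For reference, those locations are controlled by the tablet server's
>>> --fs_wal_dir, --fs_metadata_dir, and --fs_data_dirs flags, so putting
>>> the WAL and metadata on an SSD mount while leaving the data directories
>>> on larger disks is purely a configuration change.)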
>>>
>>> Your point about scan concurrency is certainly a valid one -- there have
>>> been patches for other integrations that have tackled this to decouple
>>> partitioning from scan concurrency (KUDU-2437
>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>> <https://issues.apache.org/jira/browse/KUDU-2670> are examples, where
>>> Kudu's Spark integration will split range scans into smaller-scoped scan
>>> tokens to be run concurrently, though this optimization hasn't made its way
>>> into Impala yet). I filed KUDU-3071
>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think
>>> is left on the Kudu side to get this up and running, so that it can be
>>> worked into Impala.
>>>
>>> For now, I would try to take into account the total sum of resources you
>>> have available to Kudu (including number of tablet servers, amount of
>>> storage per node, number of disks per tablet server, type of disk for the
>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>> system can handle (the scaling guide
>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here),
>>> and hopefully that, along with your own SLAs per table, can help guide how
>>> you partition your tables.
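>>>
>>> (As a rough yardstick: the scaling limits documented around this time
>>> were on the order of 2000 tablet replicas per tablet server,
>>> post-replication, so a 20-node cluster would have a budget of roughly
>>> 40,000 replicas to divide among all of its tables. Treat those numbers
>>> as illustrative; the exact limits depend on your version and hardware.)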
>>>
>>> confused why they say "at least" not "at most" - does it mean I should
>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>
>>>
>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low
>>> tens of GB per tablet replica, though exact perf obviously depends on your
>>> workload. As you show, and as pointed out in the documentation, fewer,
>>> larger tablets can limit the amount of concurrency for writes and reads,
>>> though we've seen multiple GBs work relatively well for many use cases
>>> while weighing the above-mentioned trade-offs with replica count.
>>>
>>> It is recommended that new tables which are expected to have heavy read
>>>> and write workloads have at least as many tablets as tablet servers.
>>>
>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>> (divide by 3 because of replication)?
>>>
>>>
>>> The recommendation here is to have at least 20 logical partitions per
>>> table. That way, if a scan were to touch a table's entire keyspace, it
>>> would be broken up into 20 tablet scans, and each of those might land
>>> on a different tablet server running on isolated hardware. For a
>>> significantly larger table on which you expect highly concurrent
>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>> having more partitions, and if your data is naturally time-oriented,
>>> consider range-partitioning on timestamp.
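>>>
>>> (One nice property of range-partitioning on a timestamp is that ranges
>>> can be added and dropped as data arrives and ages out. An illustrative
>>> sketch, using a hypothetical table name:
>>>
>>>   ALTER TABLE fact_events ADD RANGE PARTITION
>>>     '2020-05-01' <= VALUES < '2020-06-01';
>>>   ALTER TABLE fact_events DROP RANGE PARTITION
>>>     '2019-05-01' <= VALUES < '2019-06-01';
>>>
>>> This way the partition count can grow with the table instead of having
>>> to be fixed up front.)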
>>>
>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Just saw this in the docs, but it is still a confusing statement:
>>>> No Default Partitioning
>>>> Kudu does not provide a default partitioning strategy when creating
>>>> tables. It is recommended that new tables which are expected to have heavy
>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>
>>>>
>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>> (divide by 3 because of replication)?
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> Hey guys,
>>>>>
>>>>> I asked the same question on Slack and got no responses. I just
>>>>> went through the docs, the design doc, and the FAQ, and still did not
>>>>> find an answer.
>>>>>
>>>>> Can someone comment?
>>>>>
>>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>>> tablets to be closer to 1Gb?
>>>>>
>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>> tablets for concurrency?
>>>>>
>>>>> We cannot hand-pick the partitioning strategy for each table, as we
>>>>> need to stream 100s of tables; we use the PK from the RDBMS and need
>>>>> to come up with an automated way to pick the number of
>>>>> partitions/tablets. So far I have been using the 1Gb rule, but I am
>>>>> rethinking this now for another project.
>>>>>
>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> Forgot to post the results of my quick test:
>>>>>>
>>>>>>  Kudu 1.5
>>>>>>
>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>
>>>>>> Tablets | Tablet size | Query run time (sec)
>>>>>> --------+-------------+---------------------
>>>>>>       3 | 2Gb         | 65
>>>>>>       9 | 700Mb       | 27
>>>>>>      18 | 350Mb       | 17
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> Just want to clarify the recommendations from the doc. It says:
>>>>>>>
>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>
>>>>>>> Partitioning Rules of Thumb
>>>>>>>
>>>>>>>    - For large tables, such as fact tables, aim for as many tablets
>>>>>>>      as you have cores in the cluster.
>>>>>>>
>>>>>>>    - For small tables, such as dimension tables, ensure that each
>>>>>>>      tablet is at least 1 GB in size.
>>>>>>>
>>>>>>> In general, be mindful the number of tablets limits the parallelism
>>>>>>> of reads, in the current implementation. Increasing the number of tablets
>>>>>>> significantly beyond the number of cores is likely to have diminishing
>>>>>>> returns.
>>>>>>>
>>>>>>>
>>>>>>> I've read this a few times, but I am not sure I understand it
>>>>>>> correctly. Let me use a concrete example.
>>>>>>>
>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>> replication it is ~6Gb of data, i.e. ~6Gb per tablet if I do not
>>>>>>> partition), should I aim for 1Gb tablets (6 tablets before
>>>>>>> replication), or should I aim for 500Mb tablets if my cluster
>>>>>>> capacity allows it (12 tablets before replication)? I am confused
>>>>>>> why they say "at least" and not "at most" - does it mean I should
>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>
>>>>>>> Assume that I have tons of CPU cores on the cluster...
>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>> more tablets/partitions... In this example, 18 tablets gave me the
>>>>>>> best timing, but the tablet size was around 300-400Mb. Yet the doc
>>>>>>> says "at least 1Gb".
>>>>>>>
>>>>>>> I am really confused about what the doc is saying; please help.
>>>>>>>
>>>>>>> Boris
>>>>>>>
>>>>>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>

Re: Partitioning Rules of Thumb

Posted by Cliff Resnick <cr...@gmail.com>.
My company actually uses CDP, but for the type of virtualized deployments
my team does, a simple Apache build of Impala and Kudu works best. We have
an automated build script that we run every now and then to build on
Amazon Linux. We do, however, have to tweak it every time we build new
versions.

Re: Partitioning Rules of Thumb

Posted by Cliff Resnick <cr...@gmail.com>.
That's pretty much it. Every now and then we get notice that an instance
is scheduled for retirement, or maybe it just goes flaky on its own, and
we swap it out. 3x replication has been plenty for us, and though we
regularly back up to S3, we've never had to rebuild due to failure. In my
limited experience, disk failure is not much of a thing with NVMe, and one
of the nice things about renting versus owning is that we don't worry
about things like write amplification. We also co-locate a couple of
masters, but their tiny data goes to EBS and survives reboots. But then,
our cluster must be much smaller than yours, because rebalancing takes
about half an hour at most.

Re: Partitioning Rules of Thumb

Posted by Boris Tyukin <bo...@boristyukin.com>.
Cliff, I think I got the idea - you basically use a bunch of temp servers
that can be evicted at any time, each with attached temporary NVMe drives -
very clever! I am surprised it works well with Kudu, as our experience
with our on-prem cluster is that Kudu is extremely sensitive to disk
failures and random reboots. At one point we had a situation where 2 out
of 3 replicas were down for a while; it took us quite a bit of time to
recover, and we had to use the kudu CLI to evict some tablets manually.

Also, manual rebalancing is often in order, and it takes a lot of time
(many hours).
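
(For anyone who finds this thread later: the manual operations mentioned
above are exposed through the Kudu CLI, along the lines of "kudu cluster
rebalance <master addresses>" for rebalancing and "kudu tablet
change_config remove_replica <master addresses> <tablet id> <tserver
uuid>" for evicting a replica. Check "kudu --help" on your version for the
exact syntax.)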
