Posted to user@kudu.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2017/11/28 15:25:26 UTC

Help me understand Kudu scalability limitations

Hi guys,

I was really excited about Kudu until I saw this:

https://kudu.apache.org/docs/known_issues.html


   - Recommended maximum amount of stored data, post-replication and
     post-compression, per tablet server is 8TB.

   - Recommended maximum number of tablets per tablet server is 2000,
     post-replication.

   - Maximum number of tablets per table for each tablet server is 60,
     post-replication, at table-creation time.

These numbers are very concerning to me because the project I am working on
will have 300+ tables: 20 tables have over 1B rows, 50-100 tables average
around 200M rows, and the rest are below 50M rows. I want to see if I can
build a near real-time data lake, ingesting data from our source RDBMS
systems.

My cluster has 30 nodes, each with 12 spinning HDDs (each drive is 8TB), and
each node is a 2-CPU, 22-core beast with 512GB of DDR4 RAM.

Do these limitations still apply in my case? It looks like I can only have
24TB worth of data in Kudu, which is way below what I need. My modest
estimate is 80-100TB.

I am also concerned that I can only have 20,000 tablets after replication;
as I mentioned above, I am going to have a bunch of tables with lots of rows.

I do not have an option to pick a different hardware configuration for our
cluster.

thanks

Re: Help me understand Kudu scalability limitations

Posted by Boris Tyukin <bo...@boristyukin.com>.
Awesome, this is great to know! Thanks again, Andrew.


Re: Help me understand Kudu scalability limitations

Posted by Andrew Wong <aw...@cloudera.com>.
Right, I think you're interpreting that correctly. If you're feeling
adventurous, you could experiment with those limits even further :)

Node density is something we're tracking and hoping to improve in the near
future. There have already been some pretty drastic bumps in this area (see
here <https://issues.apache.org/jira/browse/KUDU-1967>), although I don't
think there's an exact timeline.


Andrew


-- 
Andrew Wong

Re: Help me understand Kudu scalability limitations

Posted by Boris Tyukin <bo...@boristyukin.com>.
Thanks for your response, Andrew. Every node has 12 8TB HDDs, so 96TB total
per node. Our production cluster will have 30 nodes, so roughly 2.9PB of
local HDD space in total. It looks like with Kudu we will only be able to use
8TB x 30 = 240TB total post-replication, so realistically it will be 80TB of
raw data tops. Can you confirm that?
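
To spell out the back-of-the-envelope math I'm doing, here is a rough Python
sketch (the 3x replication factor is my assumption; the 8TB and 2000-tablet
figures are the recommendations from the known issues page):

nodes = 30
max_stored_tb_per_ts = 8           # recommended max stored data per tablet server, post-replication
max_tablet_replicas_per_ts = 2000  # recommended max tablets per tablet server, post-replication
replication_factor = 3             # assumed 3x replication

cluster_stored_tb = nodes * max_stored_tb_per_ts          # 240 TB on disk, post-replication
cluster_raw_tb = cluster_stored_tb / replication_factor   # ~80 TB of unique data
cluster_replicas = nodes * max_tablet_replicas_per_ts     # 60,000 tablet replicas
cluster_tablets = cluster_replicas // replication_factor  # 20,000 logical tablets

print(cluster_stored_tb, cluster_raw_tb, cluster_tablets)  # 240 80.0 20000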

This is exactly my concern: a lot of space is wasted. We can use it for HDFS,
of course, or for Kafka or something else, but my concern is why Kudu cannot
use more than 8TB per node. Is that something that is going to change in the
future, maybe?


Re: Help me understand Kudu scalability limitations

Posted by Andrew Wong <aw...@cloudera.com>.
Hi Boris,

The recommendations listed indicate what has been tested. Going beyond that
is uncharted territory, although that isn't to say it can't be done!

This sort of planning depends on what your schemas look like. Without that,
it's hard to gauge how many tablets are needed for your tables, which in turn
determines the total number of tablets you'd be holding.
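
As a rough illustration of the kind of estimate I mean, here's a sketch in
Python; the partitioning below is a made-up placeholder, not a recommendation.
Kudu creates one tablet per hash bucket per range partition, and each tablet
is stored as replication_factor copies spread across the tablet servers:

hash_buckets = 8          # e.g. HASH(id) PARTITIONS 8        (placeholder)
range_partitions = 12     # e.g. one range partition per month (placeholder)
replication_factor = 3    # Kudu's usual default
tablet_servers = 30

tablets = hash_buckets * range_partitions         # 96 logical tablets for this table
replicas = tablets * replication_factor           # 288 tablet replicas cluster-wide
replicas_per_server = replicas / tablet_servers   # ~9.6 if spread evenly

# Compare against the documented recommendations: at most 60 tablets per
# table per tablet server at creation time, and at most 2000 tablets per
# tablet server overall.
print(tablets, replicas, round(replicas_per_server, 1))   # 96 288 9.6

Summing that kind of estimate over all 300+ tables is what tells you whether
the 2000-tablets-per-server recommendation comes into play.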

In terms of space, it seems like the number of nodes would provide ample
space (30 nodes * 8TB per node >> 80-100TB), unless I'm missing something.
Although given the number of HDDs per node, it sounds like a lot would go
unused. If you meant that you have 3 nodes, that's a different story. Would
you mind clarifying?


Andrew


-- 
Andrew Wong