Posted to user@kudu.apache.org by Alexey Serbin <as...@cloudera.com> on 2021/09/24 01:33:49 UTC

Re: Kudu cluster sizing questions

Hi Chetan,

Thank you for taking a look at Kudu!  Apache Kudu is designed to perform
well for OLAP workloads.

You can scale a Kudu cluster horizontally quite well, at least up to a few
hundred nodes.  Here you can find more information on recommended data
sizes per node, scaling limitations, and more:
https://kudu.apache.org/docs/known_issues.html#_scale

In the past, scans could become slower if data ingestion followed the
'trickling inserts' pattern (see
https://issues.apache.org/jira/browse/KUDU-1400), but that has been
addressed, and newer versions (1.10 and later) don't have the issue.

There isn't a limit on how many large tables you can host in a Kudu
cluster, assuming you partition those tables appropriately (see
https://kudu.apache.org/docs/schema_design.html#schema_design) and scale
the cluster as needed, especially if some of those tables contain 'cold'
data.
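
For illustration, below is a minimal sketch with the Kudu Java client of a
table that is hash-partitioned on its key column and range-partitioned on
a timestamp column (the master address, table name, and column names are
made up for the example):

  import java.util.Arrays;

  import org.apache.kudu.ColumnSchema;
  import org.apache.kudu.Schema;
  import org.apache.kudu.Type;
  import org.apache.kudu.client.CreateTableOptions;
  import org.apache.kudu.client.KuduClient;

  public class CreateEventsTable {
    public static void main(String[] args) throws Exception {
      try (KuduClient client =
               new KuduClient.KuduClientBuilder("master1:7051").build()) {
        Schema schema = new Schema(Arrays.asList(
            new ColumnSchema.ColumnSchemaBuilder("id", Type.STRING)
                .key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("event_time",
                Type.UNIXTIME_MICROS).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("value", Type.INT64)
                .build()));

        // Hash partitioning spreads concurrent writes across tablets;
        // range partitioning on time makes it cheap to expire old data
        // by dropping whole partitions.
        CreateTableOptions options = new CreateTableOptions()
            .addHashPartitions(Arrays.asList("id"), 8)
            .setRangePartitionColumns(Arrays.asList("event_time"))
            .setNumReplicas(3);

        client.createTable("events", schema, options);
      }
    }
  }

With time-based range partitions you can later drop old ranges instead of
deleting rows one by one, which matches TTL-style retention well.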

Random reads and updates are supported regardless of the scale of a Kudu
cluster.  What's more, starting with Kudu 1.15 there is experimental
support for multi-row transactions.  At this point the feature covers only
INSERT/INSERT_IGNORE operations, and it targets the 'bulk ingest' use case
rather than OLTP patterns with many small transactions.
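
To give a feel for what random reads and updates look like via the Java
client, here is a rough sketch using UPSERT and a predicate scan (table
and column names are the hypothetical ones from the sketch above):

  import org.apache.kudu.client.*;

  public class RandomReadWrite {
    public static void main(String[] args) throws Exception {
      try (KuduClient client =
               new KuduClient.KuduClientBuilder("master1:7051").build()) {
        KuduTable table = client.openTable("events");

        // Random update: UPSERT inserts the row, or replaces the
        // existing row with the same primary key.
        KuduSession session = client.newSession();
        session.setFlushMode(
            SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
        Upsert upsert = table.newUpsert();
        PartialRow row = upsert.getRow();
        row.addString("id", "device-42");
        row.addLong("event_time", System.currentTimeMillis() * 1000L);
        row.addLong("value", 1L);
        session.apply(upsert);
        session.flush();

        // Random read: scan with an equality predicate on a key column.
        KuduScanner scanner = client.newScannerBuilder(table)
            .addPredicate(KuduPredicate.newComparisonPredicate(
                table.getSchema().getColumn("id"),
                KuduPredicate.ComparisonOp.EQUAL, "device-42"))
            .build();
        while (scanner.hasMoreRows()) {
          for (RowResult result : scanner.nextRows()) {
            System.out.println(result.rowToString());
          }
        }
      }
    }
  }

AUTO_FLUSH_BACKGROUND batches operations behind the scenes, which is
usually what you want for high-rate ingestion jobs.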

The important points for running as many parallel workloads as possible
against a single Kudu cluster are: (a) choose the table schema properly,
(b) partition the tables accordingly, (c) use multiple data directories
backed by separate HDD/SSD devices per node, (d) use SSD or NVMe devices
for the WAL, and (e) allocate enough memory for the block cache.  I'd
recommend building a POC to get some real numbers, because workloads vary
and it's hard to provide exact numbers without knowing more of the
details.
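
To illustrate points (c), (d), and (e), a kudu-tserver flag file could
look roughly like this (the paths and sizes below are made-up examples
and need to be tuned per node):

  # tserver.gflagfile -- illustrative values only
  --fs_wal_dir=/data/ssd0/kudu/wal
  --fs_data_dirs=/data/disk1/kudu,/data/disk2/kudu,/data/disk3/kudu
  --memory_limit_hard_bytes=17179869184
  --block_cache_capacity_mb=4096

Here --fs_wal_dir puts the WAL on a dedicated SSD, --fs_data_dirs spreads
data across separate disks, and the memory limit and block cache capacity
depend on the hardware and the workload.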

As for related articles/blogs about using Kudu, I can recommend taking a
look at the following relatively recent posts:
  https://boristyukin.com/building-near-real-time-big-data-lake-part-i/
  https://boristyukin.com/building-near-real-time-big-data-lake-part-2/

Perhaps other people can chime in to provide more insights based on their
own experience running Kudu with their workloads.


Kind regards,

Alexey

On Tue, Sep 21, 2021 at 11:29 PM Chetan Rautela <ra...@gmail.com>
wrote:

> Hi team,
>
> I am looking for a storage solution that can give high ingestion/update
> rates and is able to run OLAP queries.  Apache Kudu looks like one
> promising solution.
> Please help me check if Apache Kudu is the correct fit.
>
> Use Case:
> ------------
>         I am receiving 40K records per second.  The record size is
> small: 5 fields max (2 strings, 2 timestamps, 1 number).
>         With the primary key, I will be getting ~2 billion unique
> records per day, and the rest will be updates.
>         With Apache Spark aggregation we can reduce the updates by 20%.
>         TTL of each record will be 30 days.
>
> How much data can we store in Kudu per node?
> With large updates, will get/scan requests become slow over time?
> How many large tables can we create in Kudu?
> Will random reads and updates be supported at this scale?
> How many parallel ingestion jobs can we run in Kudu, for different
> tables?
>
>
> Please suggest some articles related to Kudu sizing and performance.
>
> Regards,
> Chetan Rautela