Posted to user@kudu.apache.org by Chao Sun <su...@uber.com> on 2017/10/31 05:23:12 UTC

Low ingestion rate from Kafka

Hi,

We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on an 8-node cluster.
The data are coming from Kafka at a rate of around 30K rows/sec, hash
partitioned into 128 buckets. However, with default settings, Kudu can only
consume the topics at a rate of around 1.5K rows/sec. This is a direct
ingest with no transformation on the data.

Could this be because I was using the default configurations? Also, we are
using Kudu on HDD - could that be related?

Any help would be appreciated. Thanks.

Best,
Chao

Re: Low ingestion rate from Kafka

Posted by Janne Keskitalo <ja...@paf.com>.
2017-11-01 21:40 GMT+01:00 Todd Lipcon <to...@cloudera.com>:

>
> Great. Keep in mind that, since you have a UUID component at the front of
> your key, you are doing something like a random-write workload. So, as your
> data grows, if your PK column (and its bloom filters) ends up being larger
> than the available RAM for caching, each write may generate a disk seek
> which will make throughput plummet. This is unlike some other storage
> options like HBase which does "blind puts".
>

 Is this cache size configurable or just dependent on the available RAM on
the host? And how could I check the current sizes of the PK bloom filters
or detect when some of them don't fit the cache anymore?
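I don't know of a Kudu-specific formula for this, but standard Bloom-filter
sizing math gives a rough feel for when the PK bloom filters outgrow RAM.
This is generic back-of-the-envelope arithmetic, not the actual Kudu
implementation:

```java
public class BloomSizing {
    // Bits per key for a Bloom filter with target false-positive rate p:
    // m/n = -ln(p) / (ln 2)^2  (standard Bloom-filter sizing formula)
    static double bitsPerKey(double p) {
        return -Math.log(p) / (Math.log(2) * Math.log(2));
    }

    // Rough RAM needed, in bytes, to keep filters for `keys` keys cached.
    static long approxBytes(long keys, double p) {
        return (long) Math.ceil(keys * bitsPerKey(p) / 8.0);
    }

    public static void main(String[] args) {
        // At a 1% false-positive rate that's ~9.6 bits/key, so 1B keys need
        // ~1.2 GB of cache for the bloom filters alone, before the PK index.
        System.out.println(approxBytes(1_000_000_000L, 0.01));
    }
}
```

At the 30K rows/sec ingest rate discussed in this thread, that is roughly
2.6B new keys per day, so the filters could outgrow a default-sized cache
within days.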

-- 
Br.
Janne Keskitalo,

Re: Low ingestion rate from Kafka

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Nov 1, 2017 at 2:10 PM, Chao Sun <su...@uber.com> wrote:

> > Great. Keep in mind that, since you have a UUID component at the front
> of your key, you are doing something like a random-write workload. So, as
> your data grows, if your PK column (and its bloom filters) ends up being
> larger than the available RAM for caching, each write may generate a disk
> seek which will make throughput plummet. This is unlike some other storage
> options like HBase which does "blind puts".
>
> > Just something to be aware of, for performance planning.
>
> Thanks for letting me know. I'll keep a note.
>
> > I think in 1.3 it was called "kudu test loadgen" and may have fewer
> options available.
>
> Cool. I just run it on one of the TS node ('kudu test loadgen <hostname>
> --num-threads=8 --num-rows-per-thread=1000000 --table-num-buckets=32'), and
> got the following:
>
> Generator report
>   time total  : 5434.15 ms
>   time per row: 0.000679268 ms
>
> ~1.5M / sec? looks good.
>


Yep, sounds about right. The machines I was running on have relatively
old-spec CPUs and are also somewhat overloaded (it's a torture-test cluster
of sorts that is always way out of balance, re-replicating data, etc.)

-Todd


>
>
>
> On Wed, Nov 1, 2017 at 1:40 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun <su...@uber.com> wrote:
>>
>>> Thanks Todd! I improved my code to use multi Kudu clients for processing
>>> the Kafka messages and
>>> was able to improve the number to 250K - 300K per sec. Pretty happy with
>>> this now.
>>>
>>
>> Great. Keep in mind that, since you have a UUID component at the front of
>> your key, you are doing something like a random-write workload. So, as your
>> data grows, if your PK column (and its bloom filters) ends up being larger
>> than the available RAM for caching, each write may generate a disk seek
>> which will make throughput plummet. This is unlike some other storage
>> options like HBase which does "blind puts".
>>
>> Just something to be aware of, for performance planning.
>>
>>
>>>
>>> Will take a look at the perf tool - looks very nice. It seems it is not
>>> available on Kudu 1.3 though.
>>>
>>>
>> I think in 1.3 it was called "kudu test loadgen" and may have fewer
>> options available.
>>
>> -Todd
>>
>> On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>>>
>>>>> Sounds good.
>>>>>
>>>>> BTW, you can try a quick load test using the 'kudu perf loadgen'
>>>>> tool.  For example something like:
>>>>>
>>>>> kudu perf loadgen my-kudu-master.example.com --num-threads=8
>>>>> --num-rows-per-thread=1000000 --table-num-buckets=32
>>>>>
>>>>> There are also a bunch of options to tune buffer sizes, flush options,
>>>>> etc. But with the default settings above on an 8-node cluster I have, I was
>>>>> able to insert 8M rows in 44 seconds (180k/sec).
>>>>>
>>>>> Adding --buffer-size-bytes=10000000 almost doubled the above
>>>>> throughput (330k rows/sec)
>>>>>
>>>>
>>>> One more quick datapoint: I ran the above command simultaneously (in
>>>> parallel) four times. Despite running 4x as many clients,  they all
>>>> finished in the same time as a single client did (ie aggregate throughput
>>>> ~1.2M rows/sec).
>>>>
>>>> Again this isn't a scientific benchmark, and it's such a short burst of
>>>> activity that it doesn't represent a real workload, but 15k rows/sec is
>>>> definitely at least an order of magnitude lower than the peak throughput I
>>>> would expect.
>>>>
>>>> -Todd
>>>>
>>>>
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>
>>>>>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <to...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <su...@uber.com> wrote:
>>>>>>>
>>>>>>>> Thanks Zhen and Todd.
>>>>>>>>
>>>>>>>> Yes increasing the # of consumers will definitely help, but we also
>>>>>>>> want to test the best throughput we can get from Kudu.
>>>>>>>>
>>>>>>>
>>>>>>> Sure, but increasing the number of consumers can increase the
>>>>>>> throughput (without increasing the number of Kudu tablet servers).
>>>>>>>
>>>>>>> Currently, if you run 'top' on the TS nodes, do you see them using a
>>>>>>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
>>>>>>> utilization? My guess is that at 15k/sec you are hardly utilizing the
>>>>>>> nodes, and you're mostly bound by round trip latencies, etc.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I think the default batch size is 1000 rows?
>>>>>>>>
>>>>>>>
>>>>>>> In manual flush mode, it's up to you to determine how big your
>>>>>>> batches are. It will buffer until you call 'Flush()'. So you could wait
>>>>>>> until you've accumulated way more than 1000 to flush.
>>>>>>>
>>>>>>>
>>>>>>>> I tested with a few different options between 1000 and 200000, but
>>>>>>>> always got some number between 15K to 20K per sec. Also tried flush
>>>>>>>> background mode and 32 hash partitions but results are similar.
>>>>>>>>
>>>>>>>
>>>>>>> In your AUTO_FLUSH test, were you still calling Flush()?
>>>>>>>
>>>>>>>
>>>>>>>> The primary key is UUID + some string column though - they always
>>>>>>>> come in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2,
>>>>>>>> etc.
>>>>>>>>
>>>>>>>
>>>>>>> Given this, are you hash-partitioning on just the UUID portion of
>>>>>>> the PK? ie if your PK is (uuid, timestamp), you could hash-partitition on
>>>>>>> the UUID. This should ensure that you get pretty good batching of the
>>>>>>> writes.
>>>>>>>
>>>>>>> Todd
>>>>>>>
>>>>>>>
>>>>>>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> In addition to what Zhen suggests, I'm also curious how you are
>>>>>>>>> sizing your batches in manual-flush mode? With 128 hash partitions, each
>>>>>>>>> batch is generating 128 RPCs, so if for example you are only batching 1000
>>>>>>>>> rows at a time, you'll end up with a lot of fixed overhead in each RPC to
>>>>>>>>> insert just 1000/128 = ~8 rows.
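[Todd's overhead arithmetic, spelled out - assuming a uniform key
distribution so that every flushed batch touches all 128 partitions:]

```java
public class RpcBatching {
    // Average rows carried by each RPC when a flushed batch fans out into
    // one write RPC per hash partition it touches.
    static double rowsPerRpc(int batchRows, int partitionsTouched) {
        return (double) batchRows / partitionsTouched;
    }

    public static void main(String[] args) {
        System.out.println(rowsPerRpc(1_000, 128));    // ~8 rows/RPC: mostly fixed overhead
        System.out.println(rowsPerRpc(100_000, 128));  // ~781 rows/RPC: overhead amortized
    }
}
```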
>>>>>>>>>
>>>>>>>>> Generally I would expect an 8 node cluster (even with HDDs) to be
>>>>>>>>> able to sustain several hundred thousand rows/second insert rate. Of
>>>>>>>>> course, it depends on the size of the rows and also the primary key you've
>>>>>>>>> chosen. If your primary key is generally increasing (such as the kafka
>>>>>>>>> sequence number) then you should have very little compaction and good
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> -Todd
>>>>>>>>>
>>>>>>>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zh...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Maybe you can add your consumer number? In my opinion,
>>>>>>>>>> more threads to insert can give a better throughput.
>>>>>>>>>>
>>>>>>>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun <su...@uber.com>:
>>>>>>>>>>
>>>>>>>>>>> OK. Thanks! I changed to manual flush mode and it increased to
>>>>>>>>>>> ~15K / sec. :)
>>>>>>>>>>>
>>>>>>>>>>> Is there any other tuning I can do to further improve this? and
>>>>>>>>>>> also, how much would
>>>>>>>>>>> SSD help in this case (only upsert)?
>>>>>>>>>>>
>>>>>>>>>>> Thanks again,
>>>>>>>>>>> Chao
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <todd@cloudera.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If you want to manage batching yourself you can use the manual
>>>>>>>>>>>> flush mode. Easiest would be the auto flush background mode.
>>>>>>>>>>>>
>>>>>>>>>>>> Todd
>>>>>>>>>>>>
>>>>>>>>>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Todd,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the reply! I used a single Kafka consumer to pull
>>>>>>>>>>>>> the data.
>>>>>>>>>>>>> For Kudu, I was doing something very simple that basically
>>>>>>>>>>>>> just follow the example here
>>>>>>>>>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>>>>>>>>>>> .
>>>>>>>>>>>>> In specific:
>>>>>>>>>>>>>
>>>>>>>>>>>>> loop {
>>>>>>>>>>>>>   Insert insert = kuduTable.newInsert();
>>>>>>>>>>>>>   PartialRow row = insert.getRow();
>>>>>>>>>>>>>   // fill the columns
>>>>>>>>>>>>>   kuduSession.apply(insert)
>>>>>>>>>>>>> }
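[For reference, a manual-flush version of that loop might look roughly like
the following. This is a sketch against the Kudu Java client API as
discussed in this thread, not tested against a live cluster; the 50,000
batch size is an arbitrary starting point, and the flush-trigger logic is
factored out so it can be checked on its own:]

```java
public class ManualFlushSketch {
    // Arbitrary starting point; worth experimenting with larger batches.
    static final int BATCH_SIZE = 50_000;

    // True when the n-th applied row (1-based) should trigger a flush.
    static boolean shouldFlush(long rowsApplied) {
        return rowsApplied % BATCH_SIZE == 0;
    }

    /* Against a live cluster the loop would look roughly like:
     *
     *   KuduSession session = client.newSession();
     *   session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
     *   long n = 0;
     *   for (...) {               // each Kafka record
     *       Insert insert = kuduTable.newInsert();
     *       PartialRow row = insert.getRow();
     *       // fill the columns
     *       session.apply(insert);
     *       if (shouldFlush(++n)) session.flush();
     *   }
     *   session.flush();          // don't drop the final partial batch
     */
}
```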
>>>>>>>>>>>>>
>>>>>>>>>>>>> I didn't specify the flushing mode, so it will pick up the
>>>>>>>>>>>>> AUTO_FLUSH_SYNC as default?
>>>>>>>>>>>>> should I use MANUAL_FLUSH?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Chao
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <
>>>>>>>>>>>>> todd@cloudera.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Chao,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nice to hear you are checking out Kudu.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What are you using to consume from Kafka and write to Kudu?
>>>>>>>>>>>>>> Is it possible that it is Java code and you are using the SYNC flush mode?
>>>>>>>>>>>>>> That would result in a separate round trip for each record and thus very
>>>>>>>>>>>>>> low throughput.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Todd
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <su...@uber.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1,
>>>>>>>>>>>>>> revision af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a
>>>>>>>>>>>>>> 8-node cluster.
>>>>>>>>>>>>>> The data are coming from Kafka at a rate of around 30K / sec,
>>>>>>>>>>>>>> and hash partitioned into 128 buckets. However, with default settings, Kudu
>>>>>>>>>>>>>> can only consume the topics at a rate of around 1.5K / second. This is a
>>>>>>>>>>>>>> direct ingest with no transformation on the data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could this because I was using the default configurations?
>>>>>>>>>>>>>> also we are using Kudu on HDD - could that also be related?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any help would be appreciated. Thanks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Chao

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Low ingestion rate from Kafka

Posted by Chao Sun <su...@uber.com>.
> Great. Keep in mind that, since you have a UUID component at the front of
your key, you are doing something like a random-write workload. So, as your
data grows, if your PK column (and its bloom filters) ends up being larger
than the available RAM for caching, each write may generate a disk seek
which will make throughput plummet. This is unlike some other storage
options like HBase which does "blind puts".

> Just something to be aware of, for performance planning.

Thanks for letting me know. I'll keep a note.

> I think in 1.3 it was called "kudu test loadgen" and may have fewer
options available.

Cool. I just ran it on one of the TS nodes ('kudu test loadgen <hostname>
--num-threads=8 --num-rows-per-thread=1000000 --table-num-buckets=32') and
got the following:

Generator report
  time total  : 5434.15 ms
  time per row: 0.000679268 ms

~1.5M / sec? looks good.
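(That figure is consistent with the report: 8 threads x 1,000,000 rows each
in 5434.15 ms works out to roughly 1.47M rows/sec - simple arithmetic,
shown here for the record:)

```java
public class LoadgenRate {
    // Rows per second implied by a loadgen generator report.
    static double rowsPerSec(long totalRows, double totalMs) {
        return totalRows / (totalMs / 1000.0);
    }

    public static void main(String[] args) {
        // 8 threads x 1,000,000 rows each, in 5434.15 ms total
        System.out.println(rowsPerSec(8L * 1_000_000, 5434.15));
    }
}
```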

Best,
Chao





On Wed, Nov 1, 2017 at 1:40 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun <su...@uber.com> wrote:
>
>> Thanks Todd! I improved my code to use multi Kudu clients for processing
>> the Kafka messages and
>> was able to improve the number to 250K - 300K per sec. Pretty happy with
>> this now.
>>
>
> Great. Keep in mind that, since you have a UUID component at the front of
> your key, you are doing something like a random-write workload. So, as your
> data grows, if your PK column (and its bloom filters) ends up being larger
> than the available RAM for caching, each write may generate a disk seek
> which will make throughput plummet. This is unlike some other storage
> options like HBase which does "blind puts".
>
> Just something to be aware of, for performance planning.
>
>
>>
>> Will take a look at the perf tool - looks very nice. It seems it is not
>> available on Kudu 1.3 though.
>>
>>
> I think in 1.3 it was called "kudu test loadgen" and may have fewer
> options available.
>
> -Todd
>
> On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> Sounds good.
>>>>
>>>> BTW, you can try a quick load test using the 'kudu perf loadgen' tool.
>>>> For example something like:
>>>>
>>>> kudu perf loadgen my-kudu-master.example.com --num-threads=8
>>>> --num-rows-per-thread=1000000 --table-num-buckets=32
>>>>
>>>> There are also a bunch of options to tune buffer sizes, flush options,
>>>> etc. But with the default settings above on an 8-node cluster I have, I was
>>>> able to insert 8M rows in 44 seconds (180k/sec).
>>>>
>>>> Adding --buffer-size-bytes=10000000 almost doubled the above
>>>> throughput (330k rows/sec)
>>>>
>>>
>>> One more quick datapoint: I ran the above command simultaneously (in
>>> parallel) four times. Despite running 4x as many clients,  they all
>>> finished in the same time as a single client did (ie aggregate throughput
>>> ~1.2M rows/sec).
>>>
>>> Again this isn't a scientific benchmark, and it's such a short burst of
>>> activity that it doesn't represent a real workload, but 15k rows/sec is
>>> definitely at least an order of magnitude lower than the peak throughput I
>>> would expect.
>>>
>>> -Todd
>>>
>>>
>>>>
>>>> -Todd
>>>>
>>>>
>>>>
>>>>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <to...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <su...@uber.com> wrote:
>>>>>>
>>>>>>> Thanks Zhen and Todd.
>>>>>>>
>>>>>>> Yes increasing the # of consumers will definitely help, but we also
>>>>>>> want to test the best throughput we can get from Kudu.
>>>>>>>
>>>>>>
>>>>>> Sure, but increasing the number of consumers can increase the
>>>>>> throughput (without increasing the number of Kudu tablet servers).
>>>>>>
>>>>>> Currently, if you run 'top' on the TS nodes, do you see them using a
>>>>>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
>>>>>> utilization? My guess is that at 15k/sec you are hardly utilizing the
>>>>>> nodes, and you're mostly bound by round trip latencies, etc.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I think the default batch size is 1000 rows?
>>>>>>>
>>>>>>
>>>>>> In manual flush mode, it's up to you to determine how big your
>>>>>> batches are. It will buffer until you call 'Flush()'. So you could wait
>>>>>> until you've accumulated way more than 1000 to flush.
>>>>>>
>>>>>>
>>>>>>> I tested with a few different options between 1000 and 200000, but
>>>>>>> always got some number between 15K to 20K per sec. Also tried flush
>>>>>>> background mode and 32 hash partitions but results are similar.
>>>>>>>
>>>>>>
>>>>>> In your AUTO_FLUSH test, were you still calling Flush()?
>>>>>>
>>>>>>
>>>>>>> The primary key is UUID + some string column though - they always
>>>>>>> come in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2,
>>>>>>> etc.
>>>>>>>
>>>>>>
>>>>>> Given this, are you hash-partitioning on just the UUID portion of the
>>>>>> PK? ie if your PK is (uuid, timestamp), you could hash-partitition on the
>>>>>> UUID. This should ensure that you get pretty good batching of the writes.
>>>>>>
>>>>>> Todd
>>>>>>
>>>>>>
>>>>>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> In addition to what Zhen suggests, I'm also curious how you are
>>>>>>>> sizing your batches in manual-flush mode? With 128 hash partitions, each
>>>>>>>> batch is generating 128 RPCs, so if for example you are only batching 1000
>>>>>>>> rows at a time, you'll end up with a lot of fixed overhead in each RPC to
>>>>>>>> insert just 1000/128 = ~8 rows.
>>>>>>>>
>>>>>>>> Generally I would expect an 8 node cluster (even with HDDs) to be
>>>>>>>> able to sustain several hundred thousand rows/second insert rate. Of
>>>>>>>> course, it depends on the size of the rows and also the primary key you've
>>>>>>>> chosen. If your primary key is generally increasing (such as the kafka
>>>>>>>> sequence number) then you should have very little compaction and good
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zh...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Maybe you can add your consumer number? In my opinion,
>>>>>>>>> more threads to insert can give a better throughput.
>>>>>>>>>
>>>>>>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun <su...@uber.com>:
>>>>>>>>>
>>>>>>>>>> OK. Thanks! I changed to manual flush mode and it increased to
>>>>>>>>>> ~15K / sec. :)
>>>>>>>>>>
>>>>>>>>>> Is there any other tuning I can do to further improve this? and
>>>>>>>>>> also, how much would
>>>>>>>>>> SSD help in this case (only upsert)?
>>>>>>>>>>
>>>>>>>>>> Thanks again,
>>>>>>>>>> Chao
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> If you want to manage batching yourself you can use the manual
>>>>>>>>>>> flush mode. Easiest would be the auto flush background mode.
>>>>>>>>>>>
>>>>>>>>>>> Todd
>>>>>>>>>>>
>>>>>>>>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Todd,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the reply! I used a single Kafka consumer to pull
>>>>>>>>>>>> the data.
>>>>>>>>>>>> For Kudu, I was doing something very simple that basically just
>>>>>>>>>>>> follow the example here
>>>>>>>>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>>>>>>>>>> .
>>>>>>>>>>>> In specific:
>>>>>>>>>>>>
>>>>>>>>>>>> loop {
>>>>>>>>>>>>   Insert insert = kuduTable.newInsert();
>>>>>>>>>>>>   PartialRow row = insert.getRow();
>>>>>>>>>>>>   // fill the columns
>>>>>>>>>>>>   kuduSession.apply(insert)
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> I didn't specify the flushing mode, so it will pick up the
>>>>>>>>>>>> AUTO_FLUSH_SYNC as default?
>>>>>>>>>>>> should I use MANUAL_FLUSH?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Chao
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <
>>>>>>>>>>>> todd@cloudera.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Chao,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nice to hear you are checking out Kudu.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What are you using to consume from Kafka and write to Kudu? Is
>>>>>>>>>>>>> it possible that it is Java code and you are using the SYNC flush mode?
>>>>>>>>>>>>> That would result in a separate round trip for each record and thus very
>>>>>>>>>>>>> low throughput.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>>>>>>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a 8-node cluster.
>>>>>>>>>>>>> The data are coming from Kafka at a rate of around 30K / sec,
>>>>>>>>>>>>> and hash partitioned into 128 buckets. However, with default settings, Kudu
>>>>>>>>>>>>> can only consume the topics at a rate of around 1.5K / second. This is a
>>>>>>>>>>>>> direct ingest with no transformation on the data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Could this because I was using the default configurations?
>>>>>>>>>>>>> also we are using Kudu on HDD - could that also be related?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any help would be appreciated. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Chao

Re: Low ingestion rate from Kafka

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Nov 1, 2017 at 1:23 PM, Chao Sun <su...@uber.com> wrote:

> Thanks Todd! I improved my code to use multi Kudu clients for processing
> the Kafka messages and
> was able to improve the number to 250K - 300K per sec. Pretty happy with
> this now.
>

Great. Keep in mind that, since you have a UUID component at the front of
your key, you are doing something like a random-write workload. So, as your
data grows, if your PK column (and its bloom filters) ends up being larger
than the available RAM for caching, each write may generate a disk seek
which will make throughput plummet. This is unlike some other storage
options like HBase which does "blind puts".

Just something to be aware of, for performance planning.


>
> Will take a look at the perf tool - looks very nice. It seems it is not
> available on Kudu 1.3 though.
>
>
I think in 1.3 it was called "kudu test loadgen" and may have fewer options
available.

-Todd

On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Sounds good.
>>>
>>> BTW, you can try a quick load test using the 'kudu perf loadgen' tool.
>>> For example something like:
>>>
>>> kudu perf loadgen my-kudu-master.example.com --num-threads=8
>>> --num-rows-per-thread=1000000 --table-num-buckets=32
>>>
>>> There are also a bunch of options to tune buffer sizes, flush options,
>>> etc. But with the default settings above on an 8-node cluster I have, I was
>>> able to insert 8M rows in 44 seconds (180k/sec).
>>>
>>> Adding --buffer-size-bytes=10000000 almost doubled the above throughput
>>> (330k rows/sec)
>>>
>>
>> One more quick datapoint: I ran the above command simultaneously (in
>> parallel) four times. Despite running 4x as many clients,  they all
>> finished in the same time as a single client did (ie aggregate throughput
>> ~1.2M rows/sec).
>>
>> Again this isn't a scientific benchmark, and it's such a short burst of
>> activity that it doesn't represent a real workload, but 15k rows/sec is
>> definitely at least an order of magnitude lower than the peak throughput I
>> would expect.
>>
>> -Todd
>>
>>
>>>
>>> -Todd
>>>
>>>
>>>
>>>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <to...@cloudera.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <su...@uber.com> wrote:
>>>>>
>>>>>> Thanks Zhen and Todd.
>>>>>>
>>>>>> Yes increasing the # of consumers will definitely help, but we also
>>>>>> want to test the best throughput we can get from Kudu.
>>>>>>
>>>>>
>>>>> Sure, but increasing the number of consumers can increase the
>>>>> throughput (without increasing the number of Kudu tablet servers).
>>>>>
>>>>> Currently, if you run 'top' on the TS nodes, do you see them using a
>>>>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
>>>>> utilization? My guess is that at 15k/sec you are hardly utilizing the
>>>>> nodes, and you're mostly bound by round trip latencies, etc.
>>>>>
>>>>>
>>>>>>
>>>>>> I think the default batch size is 1000 rows?
>>>>>>
>>>>>
>>>>> In manual flush mode, it's up to you to determine how big your batches
>>>>> are. It will buffer until you call 'Flush()'. So you could wait until
>>>>> you've accumulated way more than 1000 to flush.
>>>>>
>>>>>
>>>>>> I tested with a few different options between 1000 and 200000, but
>>>>>> always got some number between 15K to 20K per sec. Also tried flush
>>>>>> background mode and 32 hash partitions but results are similar.
>>>>>>
>>>>>
>>>>> In your AUTO_FLUSH test, were you still calling Flush()?
>>>>>
>>>>>
>>>>>> The primary key is UUID + some string column though - they always
>>>>>> come in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2,
>>>>>> etc.
>>>>>>
>>>>>
>>>>> Given this, are you hash-partitioning on just the UUID portion of the
>>>>> PK? ie if your PK is (uuid, timestamp), you could hash-partitition on the
>>>>> UUID. This should ensure that you get pretty good batching of the writes.
>>>>>
>>>>> Todd
>>>>>
>>>>>
>>>>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <to...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> In addition to what Zhen suggests, I'm also curious how you are
>>>>>>> sizing your batches in manual-flush mode? With 128 hash partitions, each
>>>>>>> batch is generating 128 RPCs, so if for example you are only batching 1000
>>>>>>> rows at a time, you'll end up with a lot of fixed overhead in each RPC to
>>>>>>> insert just 1000/128 = ~8 rows.
>>>>>>>
>>>>>>> Generally I would expect an 8 node cluster (even with HDDs) to be
>>>>>>> able to sustain several hundred thousand rows/second insert rate. Of
>>>>>>> course, it depends on the size of the rows and also the primary key you've
>>>>>>> chosen. If your primary key is generally increasing (such as the kafka
>>>>>>> sequence number) then you should have very little compaction and good
>>>>>>> performance.
>>>>>>>
>>>>>>> -Todd
>>>>>>>
>>>>>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zh...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Maybe you can add your consumer number? In my opinion, more threads
>>>>>>>> to insert can give a better throughput.
>>>>>>>>
>>>>>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun <su...@uber.com>:
>>>>>>>>
>>>>>>>>> OK. Thanks! I changed to manual flush mode and it increased to
>>>>>>>>> ~15K / sec. :)
>>>>>>>>>
>>>>>>>>> Is there any other tuning I can do to further improve this? and
>>>>>>>>> also, how much would
>>>>>>>>> SSD help in this case (only upsert)?
>>>>>>>>>
>>>>>>>>> Thanks again,
>>>>>>>>> Chao
>>>>>>>>>
>>>>>>>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> If you want to manage batching yourself you can use the manual
>>>>>>>>>> flush mode. Easiest would be the auto flush background mode.
>>>>>>>>>>
>>>>>>>>>> Todd
>>>>>>>>>>
>>>>>>>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Todd,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply! I used a single Kafka consumer to pull the
>>>>>>>>>>> data.
>>>>>>>>>>> For Kudu, I was doing something very simple that basically just
>>>>>>>>>>> follow the example here
>>>>>>>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>>>>>>>>> .
>>>>>>>>>>> In specific:
>>>>>>>>>>>
>>>>>>>>>>> loop {
>>>>>>>>>>>   Insert insert = kuduTable.newInsert();
>>>>>>>>>>>   PartialRow row = insert.getRow();
>>>>>>>>>>>   // fill the columns
>>>>>>>>>>>   kuduSession.apply(insert)
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> I didn't specify the flushing mode, so it will pick up the
>>>>>>>>>>> AUTO_FLUSH_SYNC as default?
>>>>>>>>>>> should I use MANUAL_FLUSH?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Chao
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <todd@cloudera.com
>>>>>>>>>>> > wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Chao,
>>>>>>>>>>>>
>>>>>>>>>>>> Nice to hear you are checking out Kudu.
>>>>>>>>>>>>
>>>>>>>>>>>> What are you using to consume from Kafka and write to Kudu? Is
>>>>>>>>>>>> it possible that it is Java code and you are using the SYNC flush mode?
>>>>>>>>>>>> That would result in a separate round trip for each record and thus very
>>>>>>>>>>>> low throughput.
>>>>>>>>>>>>
>>>>>>>>>>>> Todd
>>>>>>>>>>>>
>>>>>>>>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>>>>>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on a 8-node cluster.
>>>>>>>>>>>> The data are coming from Kafka at a rate of around 30K / sec,
>>>>>>>>>>>> and hash partitioned into 128 buckets. However, with default settings, Kudu
>>>>>>>>>>>> can only consume the topics at a rate of around 1.5K / second. This is a
>>>>>>>>>>>> direct ingest with no transformation on the data.
>>>>>>>>>>>>
>>>>>>>>>>>> Could this because I was using the default configurations? also
>>>>>>>>>>>> we are using Kudu on HDD - could that also be related?
>>>>>>>>>>>>
>>>>>>>>>>>> Any help would be appreciated. Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Chao


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Low ingestion rate from Kafka

Posted by Chao Sun <su...@uber.com>.
Thanks Todd! I improved my code to use multiple Kudu clients for processing
the Kafka messages, and was able to improve the number to 250K-300K rows
per sec. Pretty happy with this now.

Will take a look at the perf tool - looks very nice. It seems it is not
available on Kudu 1.3 though.

Best,
Chao

On Wed, Nov 1, 2017 at 12:23 AM, Todd Lipcon <to...@cloudera.com> wrote:

> On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Sounds good.
>>
>> BTW, you can try a quick load test using the 'kudu perf loadgen' tool.
>> For example something like:
>>
>> kudu perf loadgen my-kudu-master.example.com --num-threads=8
>> --num-rows-per-thread=1000000 --table-num-buckets=32
>>
>> There are also a bunch of options to tune buffer sizes, flush options,
>> etc. But with the default settings above on an 8-node cluster I have, I was
>> able to insert 8M rows in 44 seconds (180k/sec).
>>
>> Adding --buffer-size-bytes=10000000 almost doubled the above throughput
>> (330k rows/sec)
>>
>
> One more quick datapoint: I ran the above command simultaneously (in
> parallel) four times. Despite running 4x as many clients,  they all
> finished in the same time as a single client did (ie aggregate throughput
> ~1.2M rows/sec).
>
> Again this isn't a scientific benchmark, and it's such a short burst of
> activity that it doesn't represent a real workload, but 15k rows/sec is
> definitely at least an order of magnitude lower than the peak throughput I
> would expect.
>
> -Todd
>
>
>>
>> -Todd
>>
>>
>>
>>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <su...@uber.com> wrote:
>>>>
>>>>> Thanks Zhen and Todd.
>>>>>
>>>>> Yes increasing the # of consumers will definitely help, but we also
>>>>> want to test the best throughput we can get from Kudu.
>>>>>
>>>>
>>>> Sure, but increasing the number of consumers can increase the
>>>> throughput (without increasing the number of Kudu tablet servers).
>>>>
>>>> Currently, if you run 'top' on the TS nodes, do you see them using a
>>>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
>>>> utilization? My guess is that at 15k/sec you are hardly utilizing the
>>>> nodes, and you're mostly bound by round trip latencies, etc.
>>>>
>>>>
>>>>>
>>>>> I think the default batch size is 1000 rows?
>>>>>
>>>>
>>>> In manual flush mode, it's up to you to determine how big your batches
>>>> are. It will buffer until you call 'Flush()'. So you could wait until
>>>> you've accumulated way more than 1000 to flush.
>>>>
>>>>
>>>>> I tested with a few different options between 1000 and 200000, but
>>>>> always got some number between 15K to 20K per sec. Also tried flush
>>>>> background mode and 32 hash partitions but results are similar.
>>>>>
>>>>
>>>> In your AUTO_FLUSH test, were you still calling Flush()?
>>>>
>>>>
>>>>> The primary key is UUID + some string column though - they always come
>>>>> in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc.
>>>>>
>>>>
>>>> Given this, are you hash-partitioning on just the UUID portion of the
>>>> PK? i.e. if your PK is (uuid, timestamp), you could hash-partition on the
>>>> UUID. This should ensure that you get pretty good batching of the writes.
>>>>
>>>> Todd
>>>>
>>>>
>>>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <to...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> In addition to what Zhen suggests, I'm also curious how you are
>>>>>> sizing your batches in manual-flush mode? With 128 hash partitions, each
>>>>>> batch is generating 128 RPCs, so if for example you are only batching 1000
>>>>>> rows at a time, you'll end up with a lot of fixed overhead in each RPC to
>>>>>> insert just 1000/128 = ~8 rows.
>>>>>>
>>>>>> Generally I would expect an 8 node cluster (even with HDDs) to be
>>>>>> able to sustain several hundred thousand rows/second insert rate. Of
>>>>>> course, it depends on the size of the rows and also the primary key you've
>>>>>> chosen. If your primary key is generally increasing (such as the kafka
>>>>>> sequence number) then you should have very little compaction and good
>>>>>> performance.
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zh...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Maybe you can add your consumer number? In my opinion, more threads
>>>>>>> to insert can give a better throughput.
>>>>>>>
>>>>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun <su...@uber.com>:
>>>>>>>
>>>>>>>> OK. Thanks! I changed to manual flush mode and it increased to ~15K
>>>>>>>> / sec. :)
>>>>>>>>
>>>>>>>> Is there any other tuning I can do to further improve this? and
>>>>>>>> also, how much would
>>>>>>>> SSD help in this case (only upsert)?
>>>>>>>>
>>>>>>>> Thanks again,
>>>>>>>> Chao
>>>>>>>>
>>>>>>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> If you want to manage batching yourself you can use the manual
>>>>>>>>> flush mode. Easiest would be the auto flush background mode.
>>>>>>>>>
>>>>>>>>> Todd
>>>>>>>>>
>>>>>>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Todd,
>>>>>>>>>>
>>>>>>>>>> Thanks for the reply! I used a single Kafka consumer to pull the
>>>>>>>>>> data.
>>>>>>>>>> For Kudu, I was doing something very simple that basically just
>>>>>>>>>> follow the example here
>>>>>>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>>>>>>>> .
>>>>>>>>>> In specific:
>>>>>>>>>>
>>>>>>>>>> loop {
>>>>>>>>>>   Insert insert = kuduTable.newInsert();
>>>>>>>>>>   PartialRow row = insert.getRow();
>>>>>>>>>>   // fill the columns
>>>>>>>>>>   kuduSession.apply(insert)
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> I didn't specify the flushing mode, so it will pick up the
>>>>>>>>>> AUTO_FLUSH_SYNC as default?
>>>>>>>>>> should I use MANUAL_FLUSH?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Chao
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Chao,
>>>>>>>>>>>
>>>>>>>>>>> Nice to hear you are checking out Kudu.
>>>>>>>>>>>
>>>>>>>>>>> What are you using to consume from Kafka and write to Kudu? Is
>>>>>>>>>>> it possible that it is Java code and you are using the SYNC flush mode?
>>>>>>>>>>> That would result in a separate round trip for each record and thus very
>>>>>>>>>>> low throughput.
>>>>>>>>>>>
>>>>>>>>>>> Todd
>>>>>>>>>>>
>>>>>>>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>>>>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on an 8-node cluster.
>>>>>>>>>>> The data are coming from Kafka at a rate of around 30K / sec,
>>>>>>>>>>> and hash partitioned into 128 buckets. However, with default settings, Kudu
>>>>>>>>>>> can only consume the topics at a rate of around 1.5K / second. This is a
>>>>>>>>>>> direct ingest with no transformation on the data.
>>>>>>>>>>>
>>>>>>>>>>> Could this be because I was using the default configurations? Also,
>>>>>>>>>>> we are using Kudu on HDD - could that also be related?
>>>>>>>>>>>
>>>>>>>>>>> Any help would be appreciated. Thanks.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Chao
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Low ingestion rate from Kafka

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Nov 1, 2017 at 12:20 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Sounds good.
>
> BTW, you can try a quick load test using the 'kudu perf loadgen' tool.
> For example something like:
>
> kudu perf loadgen my-kudu-master.example.com --num-threads=8
> --num-rows-per-thread=1000000 --table-num-buckets=32
>
> There are also a bunch of options to tune buffer sizes, flush options,
> etc. But with the default settings above on an 8-node cluster I have, I was
> able to insert 8M rows in 44 seconds (180k/sec).
>
> Adding --buffer-size-bytes=10000000 almost doubled the above throughput
> (330k rows/sec)
>

One more quick datapoint: I ran the above command simultaneously (in
parallel) four times. Despite running 4x as many clients, they all
finished in the same time as a single client did (i.e. aggregate throughput
~1.2M rows/sec).

Again this isn't a scientific benchmark, and it's such a short burst of
activity that it doesn't represent a real workload, but 15k rows/sec is
definitely at least an order of magnitude lower than the peak throughput I
would expect.

-Todd


>
> -Todd
>
>
>
>> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>>
>>>
>>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <su...@uber.com> wrote:
>>>
>>>> Thanks Zhen and Todd.
>>>>
>>>> Yes increasing the # of consumers will definitely help, but we also
>>>> want to test the best throughput we can get from Kudu.
>>>>
>>>
>>> Sure, but increasing the number of consumers can increase the throughput
>>> (without increasing the number of Kudu tablet servers).
>>>
>>> Currently, if you run 'top' on the TS nodes, do you see them using a
>>> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
>>> utilization? My guess is that at 15k/sec you are hardly utilizing the
>>> nodes, and you're mostly bound by round trip latencies, etc.
>>>
>>>
>>>>
>>>> I think the default batch size is 1000 rows?
>>>>
>>>
>>> In manual flush mode, it's up to you to determine how big your batches
>>> are. It will buffer until you call 'Flush()'. So you could wait until
>>> you've accumulated way more than 1000 to flush.
>>>
>>>
>>>> I tested with a few different options between 1000 and 200000, but
>>>> always got some number between 15K to 20K per sec. Also tried flush
>>>> background mode and 32 hash partitions but results are similar.
>>>>
>>>
>>> In your AUTO_FLUSH test, were you still calling Flush()?
>>>
>>>
>>>> The primary key is UUID + some string column though - they always come
>>>> in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc.
>>>>
>>>
>>> Given this, are you hash-partitioning on just the UUID portion of the
>>> PK? i.e. if your PK is (uuid, timestamp), you could hash-partition on the
>>> UUID. This should ensure that you get pretty good batching of the writes.
>>>
>>> Todd
>>>
>>>
>>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>>
>>>>> In addition to what Zhen suggests, I'm also curious how you are sizing
>>>>> your batches in manual-flush mode? With 128 hash partitions, each batch is
>>>>> generating 128 RPCs, so if for example you are only batching 1000 rows at a
>>>>> time, you'll end up with a lot of fixed overhead in each RPC to insert just
>>>>> 1000/128 = ~8 rows.
>>>>>
>>>>> Generally I would expect an 8 node cluster (even with HDDs) to be able
>>>>> to sustain several hundred thousand rows/second insert rate. Of course, it
>>>>> depends on the size of the rows and also the primary key you've chosen. If
>>>>> your primary key is generally increasing (such as the kafka sequence
>>>>> number) then you should have very little compaction and good performance.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zh...@gmail.com> wrote:
>>>>>
>>>>>> Maybe you can add your consumer number? In my opinion, more threads
>>>>>> to insert can give a better throughput.
>>>>>>
>>>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun <su...@uber.com>:
>>>>>>
>>>>>>> OK. Thanks! I changed to manual flush mode and it increased to ~15K
>>>>>>> / sec. :)
>>>>>>>
>>>>>>> Is there any other tuning I can do to further improve this? and
>>>>>>> also, how much would
>>>>>>> SSD help in this case (only upsert)?
>>>>>>>
>>>>>>> Thanks again,
>>>>>>> Chao
>>>>>>>
>>>>>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If you want to manage batching yourself you can use the manual
>>>>>>>> flush mode. Easiest would be the auto flush background mode.
>>>>>>>>
>>>>>>>> Todd
>>>>>>>>
>>>>>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Todd,
>>>>>>>>>
>>>>>>>>> Thanks for the reply! I used a single Kafka consumer to pull the
>>>>>>>>> data.
>>>>>>>>> For Kudu, I was doing something very simple that basically just
>>>>>>>>> follow the example here
>>>>>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>>>>>>> .
>>>>>>>>> In specific:
>>>>>>>>>
>>>>>>>>> loop {
>>>>>>>>>   Insert insert = kuduTable.newInsert();
>>>>>>>>>   PartialRow row = insert.getRow();
>>>>>>>>>   // fill the columns
>>>>>>>>>   kuduSession.apply(insert)
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> I didn't specify the flushing mode, so it will pick up the
>>>>>>>>> AUTO_FLUSH_SYNC as default?
>>>>>>>>> should I use MANUAL_FLUSH?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Chao
>>>>>>>>>
>>>>>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Chao,
>>>>>>>>>>
>>>>>>>>>> Nice to hear you are checking out Kudu.
>>>>>>>>>>
>>>>>>>>>> What are you using to consume from Kafka and write to Kudu? Is it
>>>>>>>>>> possible that it is Java code and you are using the SYNC flush mode? That
>>>>>>>>>> would result in a separate round trip for each record and thus very low
>>>>>>>>>> throughput.
>>>>>>>>>>
>>>>>>>>>> Todd
>>>>>>>>>>
>>>>>>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>>>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on an 8-node cluster.
>>>>>>>>>> The data are coming from Kafka at a rate of around 30K / sec, and
>>>>>>>>>> hash partitioned into 128 buckets. However, with default settings, Kudu can
>>>>>>>>>> only consume the topics at a rate of around 1.5K / second. This is a direct
>>>>>>>>>> ingest with no transformation on the data.
>>>>>>>>>>
>>>>>>>>>> Could this be because I was using the default configurations? Also,
>>>>>>>>>> we are using Kudu on HDD - could that also be related?
>>>>>>>>>>
>>>>>>>>>> Any help would be appreciated. Thanks.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Chao
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Low ingestion rate from Kafka

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Oct 31, 2017 at 11:56 PM, Chao Sun <su...@uber.com> wrote:

> > Sure, but increasing the number of consumers can increase the throughput
> (without increasing the number of Kudu tablet servers).
>
> I see, that makes sense. I'll test that later.
>
> > Currently, if you run 'top' on the TS nodes, do you see them using a
> high amount of CPU? Similar question for 'iostat -dxm 1' - high IO
> utilization? My guess is that at 15k/sec you are hardly utilizing the
> nodes, and you're mostly bound by round trip latencies, etc.
>
> From the top and iostat commands, the TS nodes seem pretty under-utilized.
> CPU usage is less than 10%.
>
> > In manual flush mode, it's up to you to determine how big your batches
> are. It will buffer until you call 'Flush()'. So you could wait until
> you've accumulated way more than 1000 to flush.
>
> Got it. I meant the default buffer size is 1000 - found out that I need to
> bump this up in order to bypass the "buffer is too big" error.
>
> > In your AUTO_FLUSH test, were you still calling Flush()?
>
> Yes.
>

OK, in that case, the "Flush()" call is still a synchronous flush. So you
may want to only call Flush() infrequently.


>
> > Given this, are you hash-partitioning on just the UUID portion of the
> PK? ie if your PK is (uuid, timestamp), you could hash-partitition on the
> UUID. This should ensure that you get pretty good batching of the writes.
>
> Yes, I only hash-partitioned on the UUID portion.
>

Sounds good.

BTW, you can try a quick load test using the 'kudu perf loadgen' tool.  For
example something like:

kudu perf loadgen my-kudu-master.example.com --num-threads=8
--num-rows-per-thread=1000000 --table-num-buckets=32

There are also a bunch of options to tune buffer sizes, flush options, etc.
But with the default settings above on an 8-node cluster I have, I was able
to insert 8M rows in 44 seconds (180k/sec).

Adding --buffer-size-bytes=10000000 almost doubled the above throughput
(330k rows/sec)

-Todd
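For reference, the figures in this message hang together, as a quick back-of-the-envelope check shows (illustrative arithmetic only; the numbers are taken from the thread, including the ~1.2M aggregate reported for four parallel clients):

```python
# Sanity-check the loadgen throughput figures quoted in the thread.
rows = 8_000_000
seconds = 44

single_client = rows / seconds        # ~181,818 rows/sec, i.e. the ~180k/sec figure
buffered_client = 330_000             # reported with --buffer-size-bytes=10000000
four_clients_aggregate = 1_200_000    # four loadgen runs in parallel

print(f"single client: {single_client:,.0f} rows/sec")
# Four clients finishing in the same wall time implies near-linear scaling:
print(f"scaling vs. one buffered client: {four_clients_aggregate / buffered_client:.2f}x")
```

The ~3.6x factor for four clients suggests the cluster, not the client, had plenty of headroom at the single-client rate.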



> On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>>
>>
>> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <su...@uber.com> wrote:
>>
>>> Thanks Zhen and Todd.
>>>
>>> Yes increasing the # of consumers will definitely help, but we also want
>>> to test the best throughput we can get from Kudu.
>>>
>>
>> Sure, but increasing the number of consumers can increase the throughput
>> (without increasing the number of Kudu tablet servers).
>>
>> Currently, if you run 'top' on the TS nodes, do you see them using a high
>> amount of CPU? Similar question for 'iostat -dxm 1' - high IO utilization?
>> My guess is that at 15k/sec you are hardly utilizing the nodes, and you're
>> mostly bound by round trip latencies, etc.
>>
>>
>>>
>>> I think the default batch size is 1000 rows?
>>>
>>
>> In manual flush mode, it's up to you to determine how big your batches
>> are. It will buffer until you call 'Flush()'. So you could wait until
>> you've accumulated way more than 1000 to flush.
>>
>>
>>> I tested with a few different options between 1000 and 200000, but
>>> always got some number between 15K to 20K per sec. Also tried flush
>>> background mode and 32 hash partitions but results are similar.
>>>
>>
>> In your AUTO_FLUSH test, were you still calling Flush()?
>>
>>
>>> The primary key is UUID + some string column though - they always come
>>> in batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc.
>>>
>>
>> Given this, are you hash-partitioning on just the UUID portion of the PK?
>> i.e. if your PK is (uuid, timestamp), you could hash-partition on the UUID.
>> This should ensure that you get pretty good batching of the writes.
>>
>> Todd
>>
>>
>>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> In addition to what Zhen suggests, I'm also curious how you are sizing
>>>> your batches in manual-flush mode? With 128 hash partitions, each batch is
>>>> generating 128 RPCs, so if for example you are only batching 1000 rows at a
>>>> time, you'll end up with a lot of fixed overhead in each RPC to insert just
>>>> 1000/128 = ~8 rows.
>>>>
>>>> Generally I would expect an 8 node cluster (even with HDDs) to be able
>>>> to sustain several hundred thousand rows/second insert rate. Of course, it
>>>> depends on the size of the rows and also the primary key you've chosen. If
>>>> your primary key is generally increasing (such as the kafka sequence
>>>> number) then you should have very little compaction and good performance.
>>>>
>>>> -Todd
>>>>
>>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zh...@gmail.com> wrote:
>>>>
>>>>> Maybe you can add your consumer number? In my opinion, more threads to
>>>>> insert can give a better throughput.
>>>>>
>>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun <su...@uber.com>:
>>>>>
>>>>>> OK. Thanks! I changed to manual flush mode and it increased to ~15K /
>>>>>> sec. :)
>>>>>>
>>>>>> Is there any other tuning I can do to further improve this? and also,
>>>>>> how much would
>>>>>> SSD help in this case (only upsert)?
>>>>>>
>>>>>> Thanks again,
>>>>>> Chao
>>>>>>
>>>>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <to...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> If you want to manage batching yourself you can use the manual flush
>>>>>>> mode. Easiest would be the auto flush background mode.
>>>>>>>
>>>>>>> Todd
>>>>>>>
>>>>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>
>>>>>>>> Hi Todd,
>>>>>>>>
>>>>>>>> Thanks for the reply! I used a single Kafka consumer to pull the
>>>>>>>> data.
>>>>>>>> For Kudu, I was doing something very simple that basically just
>>>>>>>> follow the example here
>>>>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>>>>>> .
>>>>>>>> In specific:
>>>>>>>>
>>>>>>>> loop {
>>>>>>>>   Insert insert = kuduTable.newInsert();
>>>>>>>>   PartialRow row = insert.getRow();
>>>>>>>>   // fill the columns
>>>>>>>>   kuduSession.apply(insert)
>>>>>>>> }
>>>>>>>>
>>>>>>>> I didn't specify the flushing mode, so it will pick up the
>>>>>>>> AUTO_FLUSH_SYNC as default?
>>>>>>>> should I use MANUAL_FLUSH?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Chao
>>>>>>>>
>>>>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Chao,
>>>>>>>>>
>>>>>>>>> Nice to hear you are checking out Kudu.
>>>>>>>>>
>>>>>>>>> What are you using to consume from Kafka and write to Kudu? Is it
>>>>>>>>> possible that it is Java code and you are using the SYNC flush mode? That
>>>>>>>>> would result in a separate round trip for each record and thus very low
>>>>>>>>> throughput.
>>>>>>>>>
>>>>>>>>> Todd
>>>>>>>>>
>>>>>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on an 8-node cluster.
>>>>>>>>> The data are coming from Kafka at a rate of around 30K / sec, and
>>>>>>>>> hash partitioned into 128 buckets. However, with default settings, Kudu can
>>>>>>>>> only consume the topics at a rate of around 1.5K / second. This is a direct
>>>>>>>>> ingest with no transformation on the data.
>>>>>>>>>
>>>>>>>>> Could this be because I was using the default configurations? Also, we
>>>>>>>>> are using Kudu on HDD - could that also be related?
>>>>>>>>>
>>>>>>>>> Any help would be appreciated. Thanks.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Chao
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Low ingestion rate from Kafka

Posted by Chao Sun <su...@uber.com>.
> Sure, but increasing the number of consumers can increase the throughput
(without increasing the number of Kudu tablet servers).

I see, that makes sense. I'll test that later.

> Currently, if you run 'top' on the TS nodes, do you see them using a high
amount of CPU? Similar question for 'iostat -dxm 1' - high IO utilization?
My guess is that at 15k/sec you are hardly utilizing the nodes, and you're
mostly bound by round trip latencies, etc.

From the top and iostat commands, the TS nodes seem pretty under-utilized.
CPU usage is less than 10%.

> In manual flush mode, it's up to you to determine how big your batches
are. It will buffer until you call 'Flush()'. So you could wait until
you've accumulated way more than 1000 to flush.

Got it. I meant the default buffer size is 1000 - found out that I need to
bump this up in order to bypass the "buffer is too big" error.

> In your AUTO_FLUSH test, were you still calling Flush()?

Yes.

> Given this, are you hash-partitioning on just the UUID portion of the PK?
i.e. if your PK is (uuid, timestamp), you could hash-partition on the UUID.
This should ensure that you get pretty good batching of the writes.

Yes, I only hash-partitioned on the UUID portion.

Best,
Chao
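The partitioning choice described above can be illustrated with a small sketch (pure Python; `bucket_for` uses crc32 as a stand-in for Kudu's actual hash function, and the 300/400-row same-uuid batches mirror the example earlier in the thread):

```python
import zlib

NUM_BUCKETS = 32

def bucket_for(uuid: str) -> int:
    # crc32 stands in for Kudu's internal hash; the point is that hashing
    # on the UUID column alone sends every row for a given uuid to the
    # same tablet, so a batch of same-uuid rows becomes one large RPC.
    return zlib.crc32(uuid.encode()) % NUM_BUCKETS

rows = [("uuid1", i) for i in range(300)] + [("uuid2", i) for i in range(400)]
touched = {bucket_for(u) for u, _ in rows}

print(len(touched))  # at most 2 tablets touched by these 700 rows
```

Had the hash included the second key column, the same 700 rows would scatter across nearly all buckets, producing many tiny per-tablet RPCs instead.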

On Tue, Oct 31, 2017 at 11:25 PM, Todd Lipcon <to...@cloudera.com> wrote:

>
>
> On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <su...@uber.com> wrote:
>
>> Thanks Zhen and Todd.
>>
>> Yes increasing the # of consumers will definitely help, but we also want
>> to test the best throughput we can get from Kudu.
>>
>
> Sure, but increasing the number of consumers can increase the throughput
> (without increasing the number of Kudu tablet servers).
>
> Currently, if you run 'top' on the TS nodes, do you see them using a high
> amount of CPU? Similar question for 'iostat -dxm 1' - high IO utilization?
> My guess is that at 15k/sec you are hardly utilizing the nodes, and you're
> mostly bound by round trip latencies, etc.
>
>
>>
>> I think the default batch size is 1000 rows?
>>
>
> In manual flush mode, it's up to you to determine how big your batches
> are. It will buffer until you call 'Flush()'. So you could wait until
> you've accumulated way more than 1000 to flush.
>
>
>> I tested with a few different options between 1000 and 200000, but always
>> got some number between 15K to 20K per sec. Also tried flush background
>> mode and 32 hash partitions but results are similar.
>>
>
> In your AUTO_FLUSH test, were you still calling Flush()?
>
>
>> The primary key is UUID + some string column though - they always come in
>> batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc.
>>
>
> Given this, are you hash-partitioning on just the UUID portion of the PK?
> i.e. if your PK is (uuid, timestamp), you could hash-partition on the UUID.
> This should ensure that you get pretty good batching of the writes.
>
> Todd
>
>
>> On Tue, Oct 31, 2017 at 6:25 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> In addition to what Zhen suggests, I'm also curious how you are sizing
>>> your batches in manual-flush mode? With 128 hash partitions, each batch is
>>> generating 128 RPCs, so if for example you are only batching 1000 rows at a
>>> time, you'll end up with a lot of fixed overhead in each RPC to insert just
>>> 1000/128 = ~8 rows.
>>>
>>> Generally I would expect an 8 node cluster (even with HDDs) to be able
>>> to sustain several hundred thousand rows/second insert rate. Of course, it
>>> depends on the size of the rows and also the primary key you've chosen. If
>>> your primary key is generally increasing (such as the kafka sequence
>>> number) then you should have very little compaction and good performance.
>>>
>>> -Todd
>>>
>>> On Tue, Oct 31, 2017 at 6:20 PM, Zhen Zhang <zh...@gmail.com> wrote:
>>>
>>>> Maybe you can add your consumer number? In my opinion, more threads to
>>>> insert can give a better throughput.
>>>>
>>>> 2017-10-31 15:07 GMT+08:00 Chao Sun <su...@uber.com>:
>>>>
>>>>> OK. Thanks! I changed to manual flush mode and it increased to ~15K /
>>>>> sec. :)
>>>>>
>>>>> Is there any other tuning I can do to further improve this? and also,
>>>>> how much would
>>>>> SSD help in this case (only upsert)?
>>>>>
>>>>> Thanks again,
>>>>> Chao
>>>>>
>>>>> On Mon, Oct 30, 2017 at 11:42 PM, Todd Lipcon <to...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> If you want to manage batching yourself you can use the manual flush
>>>>>> mode. Easiest would be the auto flush background mode.
>>>>>>
>>>>>> Todd
>>>>>>
>>>>>> On Oct 30, 2017 11:10 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>
>>>>>>> Hi Todd,
>>>>>>>
>>>>>>> Thanks for the reply! I used a single Kafka consumer to pull the
>>>>>>> data.
>>>>>>> For Kudu, I was doing something very simple that basically just
>>>>>>> follow the example here
>>>>>>> <https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>
>>>>>>> .
>>>>>>> In specific:
>>>>>>>
>>>>>>> loop {
>>>>>>>   Insert insert = kuduTable.newInsert();
>>>>>>>   PartialRow row = insert.getRow();
>>>>>>>   // fill the columns
>>>>>>>   kuduSession.apply(insert)
>>>>>>> }
>>>>>>>
>>>>>>> I didn't specify the flushing mode, so it will pick up the
>>>>>>> AUTO_FLUSH_SYNC as default?
>>>>>>> should I use MANUAL_FLUSH?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Chao
>>>>>>>
>>>>>>> On Mon, Oct 30, 2017 at 10:39 PM, Todd Lipcon <to...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Chao,
>>>>>>>>
>>>>>>>> Nice to hear you are checking out Kudu.
>>>>>>>>
>>>>>>>> What are you using to consume from Kafka and write to Kudu? Is it
>>>>>>>> possible that it is Java code and you are using the SYNC flush mode? That
>>>>>>>> would result in a separate round trip for each record and thus very low
>>>>>>>> throughput.
>>>>>>>>
>>>>>>>> Todd
>>>>>>>>
>>>>>>>> On Oct 30, 2017 10:23 PM, "Chao Sun" <su...@uber.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
>>>>>>>> af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on an 8-node cluster.
>>>>>>>> The data are coming from Kafka at a rate of around 30K / sec, and
>>>>>>>> hash partitioned into 128 buckets. However, with default settings, Kudu can
>>>>>>>> only consume the topics at a rate of around 1.5K / second. This is a direct
>>>>>>>> ingest with no transformation on the data.
>>>>>>>>
>>>>>>>> Could this be because I was using the default configurations? Also, we
>>>>>>>> are using Kudu on HDD - could that also be related?
>>>>>>>>
>>>>>>>> Any help would be appreciated. Thanks.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Chao
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Low ingestion rate from Kafka

Posted by Todd Lipcon <to...@cloudera.com>.
On Tue, Oct 31, 2017 at 11:14 PM, Chao Sun <su...@uber.com> wrote:

> Thanks Zhen and Todd.
>
> Yes increasing the # of consumers will definitely help, but we also want
> to test the best throughput we can get from Kudu.
>

Sure, but increasing the number of consumers can increase the throughput
(without increasing the number of Kudu tablet servers).

Currently, if you run 'top' on the TS nodes, do you see them using a high
amount of CPU? Similar question for 'iostat -dxm 1' - high IO utilization?
My guess is that at 15k/sec you are hardly utilizing the nodes, and you're
mostly bound by round trip latencies, etc.


>
> I think the default batch size is 1000 rows?
>

In manual flush mode, it's up to you to determine how big your batches are.
It will buffer until you call 'Flush()'. So you could wait until you've
accumulated way more than 1000 to flush.
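The batching math behind this advice, with the 128 hash buckets mentioned earlier in the thread, works out roughly as follows (illustrative arithmetic only):

```python
# Each Flush() fans out about one write RPC per hash bucket (tablet),
# so the rows carried per RPC grow linearly with the client-side batch size.
NUM_BUCKETS = 128

for batch_size in (1_000, 10_000, 200_000):
    rows_per_rpc = batch_size / NUM_BUCKETS
    print(f"batch={batch_size:>7,} rows -> ~{rows_per_rpc:,.1f} rows per RPC")
# batch=1,000 gives ~7.8 rows per RPC, i.e. the "1000/128 = ~8 rows"
# fixed-overhead case called out earlier in the thread.
```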


> I tested with a few different options between 1000 and 200000, but always
> got some number between 15K to 20K per sec. Also tried flush background
> mode and 32 hash partitions but results are similar.
>

In your AUTO_FLUSH test, were you still calling Flush()?


> The primary key is UUID + some string column though - they always come in
> batches, e.g., 300 rows for uuid1 followed by 400 rows for uuid2, etc.
>

Given this, are you hash-partitioning on just the UUID portion of the PK?
i.e. if your PK is (uuid, timestamp), you could hash-partition on the UUID.
This should ensure that you get pretty good batching of the writes.

Todd


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Low ingestion rate from Kafka

Posted by Chao Sun <su...@uber.com>.
Thanks Zhen and Todd.

Yes, increasing the # of consumers will definitely help, but we also want to
test the best throughput we can get from Kudu.

I think the default batch size is 1000 rows? I tested a few different
options between 1000 and 200000, but always got something between 15K and
20K rows per sec. I also tried the flush-background mode and 32 hash
partitions, but the results were similar. The primary key is UUID + some
string column though - they always come in batches, e.g., 300 rows for
uuid1 followed by 400 rows for uuid2, etc.

Best,
Chao



Re: Low ingestion rate from Kafka

Posted by Todd Lipcon <to...@cloudera.com>.
In addition to what Zhen suggests, I'm also curious how you are sizing your
batches in manual-flush mode? With 128 hash partitions, each batch is
generating 128 RPCs, so if for example you are only batching 1000 rows at a
time, you'll end up with a lot of fixed overhead in each RPC to insert just
1000/128 = ~8 rows.
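
A quick back-of-the-envelope check of that overhead (the partition count
and batch sizes below are just the numbers from this thread, not
measurements):

```java
public class RpcBatchMath {
    // Average rows carried by each per-tablet RPC when a batch of
    // `batchSize` rows is spread evenly across `numPartitions` hash buckets.
    static double rowsPerRpc(int batchSize, int numPartitions) {
        return (double) batchSize / numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(rowsPerRpc(1000, 128));    // 7.8125 rows/RPC: mostly fixed overhead
        System.out.println(rowsPerRpc(100000, 128));  // 781.25 rows/RPC: overhead well amortized
    }
}
```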

Generally I would expect an 8 node cluster (even with HDDs) to be able to
sustain several hundred thousand rows/second insert rate. Of course, it
depends on the size of the rows and also the primary key you've chosen. If
your primary key is generally increasing (such as the Kafka sequence
number), then you should have very little compaction and good performance.

-Todd



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Low ingestion rate from Kafka

Posted by Zhen Zhang <zh...@gmail.com>.
Maybe you can increase the number of consumers? In my opinion, more insert
threads can give better throughput.


Re: Low ingestion rate from Kafka

Posted by Chao Sun <su...@uber.com>.
OK. Thanks! I changed to manual flush mode and it increased to ~15K / sec.
:)

Is there any other tuning I can do to further improve this? And also, how
much would SSDs help in this case (only upserts)?

Thanks again,
Chao


Re: Low ingestion rate from Kafka

Posted by Todd Lipcon <to...@cloudera.com>.
If you want to manage batching yourself, you can use the manual flush mode.
The easiest option would be the auto-flush background mode.
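
For illustration, the manual-flush batching pattern looks roughly like the
sketch below. This is a self-contained stand-in, not the real Kudu client
API: the `flush()` helper and the string "rows" are placeholders for
`KuduSession.flush()` and applied `Insert` operations.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the MANUAL_FLUSH pattern: buffer operations locally
// and flush once per batch instead of once per row.
public class ManualFlushSketch {
    static final int BATCH_SIZE = 10_000;  // keep batchSize/numPartitions large

    static int flushCount = 0;

    static void flush(List<String> buffered) {
        // In real code this would be session.flush(), sending one RPC
        // per tablet touched by the buffered operations.
        flushCount++;
        buffered.clear();
    }

    public static void main(String[] args) {
        List<String> buffered = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {   // stand-in for the Kafka consumer loop
            buffered.add("row-" + i);        // stand-in for session.apply(insert)
            if (buffered.size() >= BATCH_SIZE) {
                flush(buffered);
            }
        }
        if (!buffered.isEmpty()) {
            flush(buffered);                 // flush the trailing partial batch
        }
        System.out.println(flushCount);      // 3 flushes for 25,000 rows
    }
}
```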

Todd


Re: Low ingestion rate from Kafka

Posted by Chao Sun <su...@uber.com>.
Hi Todd,

Thanks for the reply! I used a single Kafka consumer to pull the data.
For Kudu, I was doing something very simple that basically just follows the
example here
<https://github.com/cloudera/kudu-examples/blob/master/java/java-sample/src/main/java/org/kududb/examples/sample/Sample.java>.
Specifically:

loop {
  Insert insert = kuduTable.newInsert();
  PartialRow row = insert.getRow();
  // fill in the columns
  kuduSession.apply(insert);
}

I didn't specify the flushing mode, so it will pick up AUTO_FLUSH_SYNC as
the default? Should I use MANUAL_FLUSH?

Thanks,
Chao


Re: Low ingestion rate from Kafka

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Chao,

Nice to hear you are checking out Kudu.

What are you using to consume from Kafka and write to Kudu? Is it possible
that it is Java code and you are using the SYNC flush mode? That would
result in a separate round trip for each record and thus very low
throughput.
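
As a rough sanity check (the round-trip times below are assumed
datacenter-typical figures, not measurements): if every row waits for its
own synchronous round trip, throughput is capped at roughly 1/RTT rows per
second, which lands in the same ballpark as the ~1.5K/sec observed:

```java
public class SyncFlushCeiling {
    // Upper bound on rows/sec when each row pays one full round trip.
    static double maxRowsPerSec(double roundTripMillis) {
        return 1000.0 / roundTripMillis;
    }

    public static void main(String[] args) {
        // Assuming ~0.5-1 ms per round trip within a datacenter:
        System.out.println(maxRowsPerSec(1.0));  // 1000 rows/sec
        System.out.println(maxRowsPerSec(0.5));  // 2000 rows/sec
    }
}
```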

Todd

On Oct 30, 2017 10:23 PM, "Chao Sun" <su...@uber.com> wrote:

Hi,

We are evaluating Kudu (version kudu 1.3.0-cdh5.11.1, revision
af02f3ea6d9a1807dcac0ec75bfbca79a01a5cab) on an 8-node cluster.
The data is coming from Kafka at a rate of around 30K rows / sec, and is
hash-partitioned into 128 buckets. However, with default settings, Kudu can
only consume the topics at a rate of around 1.5K rows / second. This is a
direct ingest with no transformation of the data.

Could this be because I was using the default configurations? Also, we are
using Kudu on HDDs - could that also be related?

Any help would be appreciated. Thanks.

Best,
Chao