Posted to users@kafka.apache.org by Navneeth Krishnan <re...@gmail.com> on 2020/12/07 06:32:44 UTC

Kafka Streams Optimizations

Hi All,

I have been working on moving an application to kafka streams and I have
the following questions.

1. We were planning to use an EFS mount to share rocksdb data for KV store
and global state store with which we were hoping to minimize the state
restore time when new instances are brought up. But later we found that
global state stores require a lock so directories cannot be shared. Is
there some way around this? How is everyone minimizing the state
restoration time?

2. Topology optimization: We are using PAPI and as per the docs topology
optimization will have no effects on low level api. Is my understanding
correct?

3. There are about 5 KV stores in our stream application and for a few the
data size is a bit larger. Is there a config to write data to the changelog
topic only once a minute or something? I know it will be a problem in
maintaining the data integrity. Basically we want to reduce the amount of
changelog data written since we will have some updates for each user every
5 secs or so. Any suggestions on optimizations.

4. Compress data: Is there an option to compress the data being sent and
consumed from kafka only for the intermediate topics. The major reason is
we don't want to change the final sink because it's used by many
applications. If we can just compress and write the data only for the
intermediate topics and changelog that would be nice.

Thanks and appreciate all the help.

Regards,
Navneeth

Re: Kafka Streams Optimizations

Posted by Navneeth Krishnan <re...@gmail.com>.
Hi All,

Any advice would really help.

Thanks,
Navneeth

On Sun, Dec 6, 2020 at 10:32 PM Navneeth Krishnan <re...@gmail.com>
wrote:

> Hi All,
>
> I have been working on moving an application to kafka streams and I have
> the following questions.
>
> 1. We were planning to use an EFS mount to share rocksdb data for KV store
> and global state store with which we were hoping to minimize the state
> restore time when new instances are brought up. But later we found that
> global state stores require a lock so directories cannot be shared. Is
> there some way around this? How is everyone minimizing the state
> restoration time?
>
> 2. Topology optimization: We are using PAPI and as per the docs topology
> optimization will have no effects on low level api. Is my understanding
> correct?
>
> 3. There are about 5 KV stores in our stream application and for a few the
> data size is a bit larger. Is there a config to write data to the changelog
> topic only once a minute or something? I know it will be a problem in
> maintaining the data integrity. Basically we want to reduce the amount of
> changelog data written since we will have some updates for each user every
> 5 secs or so. Any suggestions on optimizations.
>
> 4. Compress data: Is there an option to compress the data being sent and
> consumed from kafka only for the intermediate topics. The major reason is
> we don't want to change the final sink because it's used by many
> applications. If we can just compress and write the data only for the
> intermediate topics and changelog that would be nice.
>
> Thanks and appreciate all the help.
>
> Regards,
> Navneeth
>

Re: Kafka Streams Optimizations

Posted by Navneeth Krishnan <re...@gmail.com>.
Thanks Guozhang for the response. Appreciate the help.

Regards,
Navneeth

On Mon, Dec 14, 2020 at 2:44 PM Guozhang Wang <wa...@gmail.com> wrote:

> Hello Navneeth,
>
> Please find answers to your questions below.
>
> On Sun, Dec 6, 2020 at 10:33 PM Navneeth Krishnan <
> reachnavneeth2@gmail.com>
> wrote:
>
> > Hi All,
> >
> > I have been working on moving an application to kafka streams and I have
> > the following questions.
> >
> > 1. We were planning to use an EFS mount to share rocksdb data for KV
> store
> > and global state store with which we were hoping to minimize the state
> > restore time when new instances are brought up. But later we found that
> > global state stores require a lock so directories cannot be shared. Is
> > there some way around this? How is everyone minimizing the state
> > restoration time?
> >
> >
> The common suggestion is to check whether, during restoration, the
> write-buffer is flushed too frequently so that a lot of IO is incurred for
> compaction: generally speaking, you'd want a few large SST files instead
> of a lot of small SST files, and the default config may lean towards the
> latter case, which would slow down restoration.
>
> We are currently working to make the following changes: use the
> addSSTFiles API of RocksDB to try to reduce the restoration IO cost; and
> consider moving restoration to a different thread so that the stream
> thread would not block on IO when this happens. Stay tuned for the next
> release.
>
>
> > 2. Topology optimization: We are using PAPI and as per the docs topology
> > optimization will have no effects on low level api. Is my understanding
> > correct?
> >
>
> Correct.
>
> >
> > 3. There are about 5 KV stores in our stream application and for a few
> the
> > data size is a bit larger. Is there a config to write data to the
> changelog
> > topic only once a minute or something? I know it will be a problem in
> > maintaining the data integrity. Basically we want to reduce the amount of
> > changelog data written since we will have some updates for each user
> every
> > 5 secs or so. Any suggestions on optimizations.
> >
> >
> Currently, increasing the total cache size may help (configured as
> "cache.max.bytes.buffering"): the caching layer is at the same time used
> for suppressing repeated updates to the same key, and hence it reduces
> writes to the changelogs as well.
>
>
> > 4. Compress data: Is there an option to compress the data being sent and
> > consumed from kafka only for the intermediate topics. The major reason is
> > we don't want to change the final sink because it's used by many
> > applications. If we can just compress and write the data only for the
> > intermediate topics and changelog that would be nice.
> >
> >
> I think you can set the compression codec on a per-topic basis in Kafka.
>
>
> > Thanks and appreciate all the help.
> >
> > Regards,
> > Navneeth
> >
>
>
> --
> -- Guozhang
>

Re: Kafka Streams Optimizations

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Navneeth,

Please find answers to your questions below.

On Sun, Dec 6, 2020 at 10:33 PM Navneeth Krishnan <re...@gmail.com>
wrote:

> Hi All,
>
> I have been working on moving an application to kafka streams and I have
> the following questions.
>
> 1. We were planning to use an EFS mount to share rocksdb data for KV store
> and global state store with which we were hoping to minimize the state
> restore time when new instances are brought up. But later we found that
> global state stores require a lock so directories cannot be shared. Is
> there some way around this? How is everyone minimizing the state
> restoration time?
>
>
The common suggestion is to check whether, during restoration, the
write-buffer is flushed too frequently so that a lot of IO is incurred for
compaction: generally speaking, you'd want a few large SST files instead of
a lot of small SST files, and the default config may lean towards the latter
case, which would slow down restoration.

We are currently working to make the following changes: use the addSSTFiles
API of RocksDB to try to reduce the restoration IO cost; and consider moving
restoration to a different thread so that the stream thread would not block
on IO when this happens. Stay tuned for the next release.
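[Editor's note: as a concrete illustration of the tuning described above, a
RocksDBConfigSetter along these lines biases RocksDB towards fewer, larger
SST files. The class name and the specific sizes are illustrative
assumptions to tune for your workload, not recommendations.]

```java
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

// Illustrative sketch: bias RocksDB towards fewer, larger SST files so that
// restoration incurs less compaction IO. The sizes are assumptions, not
// recommendations.
public class FewerLargerSstConfigSetter implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        // A larger write buffer flushes less often, producing larger L0 files.
        options.setWriteBufferSize(64 * 1024 * 1024L);
        options.setMaxWriteBufferNumber(4);
        // A larger target file size means fewer SST files per level.
        options.setTargetFileSizeBase(64 * 1024 * 1024L);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Nothing was allocated in setConfig, so nothing to close here.
    }
}
```

The setter is registered through the "rocksdb.config.setter" Streams config
(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG).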


> 2. Topology optimization: We are using PAPI and as per the docs topology
> optimization will have no effects on low level api. Is my understanding
> correct?
>

Correct.

>
> 3. There are about 5 KV stores in our stream application and for a few the
> data size is a bit larger. Is there a config to write data to the changelog
> topic only once a minute or something? I know it will be a problem in
> maintaining the data integrity. Basically we want to reduce the amount of
> changelog data written since we will have some updates for each user every
> 5 secs or so. Any suggestions on optimizations.
>
>
Currently, increasing the total cache size may help (configured as
"cache.max.bytes.buffering"): the caching layer is at the same time used
for suppressing repeated updates to the same key, and hence it reduces
writes to the changelogs as well.
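[Editor's note: in configuration terms, the knob mentioned above might look
like the sketch below. The commit interval also matters because the cache is
flushed on every commit; the values are illustrative.]

```java
import java.util.Properties;

public class CachingConfigSketch {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        // Raise the total record cache above the 10 MB default; a bigger
        // cache can absorb more repeated updates per key before they are
        // written to the changelog.
        props.put("cache.max.bytes.buffering", 100L * 1024 * 1024);
        // The cache is also flushed on every commit, so a short commit
        // interval limits how much suppression the cache can do.
        props.put("commit.interval.ms", 60_000);
        System.out.println(props);
    }
}
```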


> 4. Compress data: Is there an option to compress the data being sent and
> consumed from kafka only for the intermediate topics. The major reason is
> we don't want to change the final sink because it's used by many
> applications. If we can just compress and write the data only for the
> intermediate topics and changelog that would be nice.
>
>
I think you can set the compression codec on a per-topic basis in Kafka.
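[Editor's note: for the Streams-managed internal topics specifically
(repartition and changelog topics), one way to do this is the "topic."
config prefix, which Streams forwards to the topics it creates, leaving the
final sink topic untouched. The codec choice is illustrative.]

```java
import java.util.Properties;

public class InternalTopicCompressionSketch {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        // Configs with the "topic." prefix are applied as topic-level
        // configs to the internal topics that Streams creates.
        props.put("topic.compression.type", "lz4");
        System.out.println(props);
    }
}
```

For intermediate topics that already exist, the same topic-level
"compression.type" config can be altered with the kafka-configs tool.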


> Thanks and appreciate all the help.
>
> Regards,
> Navneeth
>


-- 
-- Guozhang