Posted to users@kafka.apache.org by Nitay Kufert <ni...@ironsrc.com> on 2020/12/07 14:59:15 UTC

Sub Tasks being processed only after restart

Hey,
We are running a Kafka Streams based app in production where the input,
intermediate and global topics have 36 partitions.
The topology has 17 sub-topologies (2 of them are for global stores, so they
don't generate tasks).
More technical details:
6 machines with 16 CPUs each, running 30 stream threads per machine: 6 * 30 = 180 stream threads
15 sub-topologies * 36 partitions = 540 tasks
3 tasks per thread
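As a quick sanity check, the arithmetic above as a sketch (counts only; the
actual Streams partition assignor is sticky and does not place tasks by flat
division):

```python
# Task-count arithmetic for the deployment described above.
# This only verifies the counts; the real StreamsPartitionAssignor
# is sticky and more involved than an even split across threads.

sub_topologies = 17
global_store_sub_topologies = 2      # global stores don't generate tasks
partitions = 36

task_generating = sub_topologies - global_store_sub_topologies  # 15
tasks = task_generating * partitions                            # 540

machines = 6
threads_per_machine = 30
stream_threads = machines * threads_per_machine                 # 180

print(tasks, stream_threads, tasks // stream_threads)  # 540 180 3
```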

Every once in a while, during our rush hours, some of the internal topics
start to lag on specific partitions. The lag usually keeps increasing until I
restart the application, after which it disappears very quickly.

It seems like there is some problem in the work allocation, since the
machines are not loaded at all and have plenty of threads (more than double
the number of CPUs).

Any idea what's going on there?

-- 

Nitay Kufert
Backend Team Leader
[image: ironSource] <http://www.ironsrc.com>

email nitay.k@ironsrc.com
mobile +972-54-5480021
fax +972-77-5448273
skype nitay.kufert.ssa
121 Menachem Begin St., Tel Aviv, Israel
ironsrc.com <http://www.ironsrc.com>
[image: linkedin] <https://www.linkedin.com/company/ironsource> [image:
twitter] <https://twitter.com/ironsource> [image: facebook]
<https://www.facebook.com/ironSource> [image: googleplus]
<https://plus.google.com/+ironsrc>
This email (including any attachments) is for the sole use of the intended
recipient and may contain confidential information which may be protected
by legal privilege. If you are not the intended recipient, or the employee
or agent responsible for delivering it to the intended recipient, you are
hereby notified that any use, dissemination, distribution or copying of
this communication and/or its content is strictly prohibited. If you are
not the intended recipient, please immediately notify us by reply email or
by telephone, delete this email and destroy any copies. Thank you.

Re: Sub Tasks being processed only after restart

Posted by Leah Thomas <lt...@confluent.io>.
Hey Nitay,

In terms of RocksDB metrics, 2.5.1 should have a number of debug-level
metrics that could shed some light on the situation. In particular, I'd
recommend looking at WRITE_STALL_DURATION_AVG / WRITE_STALL_DURATION_TOTAL,
as well as some of the compaction metrics such as COMPACTION_TIME_MAX,
BYTES_READ_DURING_COMPACTION or BYTES_WRITTEN_DURING_COMPACTION. The
compaction metrics, in particular, could alert you to RocksDB falling behind
on compaction, which could be what the restart you're doing is resolving.
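For example, a hypothetical monitoring sketch (not Kafka API code; the input
dicts stand in for two scrapes of the per-store write-stall totals, however
you collect them, e.g. via JMX or KafkaStreams#metrics()):

```python
def stalled_stores(before, after, threshold_ms=0.0):
    """Return names of stores whose total write-stall time grew past threshold.

    `before` and `after` map store name -> write-stall-duration-total (ms)
    sampled at two points in time; how the samples are collected is out of
    scope for this sketch.
    """
    return sorted(
        store
        for store, total in after.items()
        if total - before.get(store, 0.0) > threshold_ms
    )

# Hypothetical samples: store-a is stalling, store-b is healthy.
before = {"store-a": 120.0, "store-b": 0.0}
after = {"store-a": 950.0, "store-b": 0.0}
print(stalled_stores(before, after))  # ['store-a']
```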

I do think it *could* still be something in your topology. Definitely
confirm that your sub-topologies have a fairly even processing load;
overloaded tasks could definitely be impacting performance.

Good luck!
Leah





On Wed, Dec 9, 2020 at 3:00 PM Nitay Kufert <ni...@ironsrc.com> wrote:

> [...]

Re: Sub Tasks being processed only after restart

Posted by Nitay Kufert <ni...@ironsrc.com>.
Hey Leah, Thanks for the response.

We are running Kafka 2.5.1, and if the topology will still be useful after
the next few sentences, I will share it with you (it's messy!).
It happens on a few partitions and a few internal topics, and it seems to be
kind of random exactly which topics and which partitions.
The business logic is prone to having "hot" partitions, since the identifier
being used comes in at very different rates during different times of the
day.
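That kind of skew is easy to reproduce in a minimal simulation (hypothetical
numbers; CRC32 below is a stand-in for Kafka's murmur2-based partitioner),
where one identifier carries 30% of the traffic:

```python
import zlib
from collections import Counter

PARTITIONS = 36

# Hypothetical traffic: one identifier accounts for 30% of 100k records,
# the rest is spread over 70k distinct keys.
keys = ["hot-key"] * 30_000 + [f"key-{i}" for i in range(70_000)]

# crc32 stands in for Kafka's default murmur2-based partitioner.
counts = Counter(zlib.crc32(k.encode()) % PARTITIONS for k in keys)

avg = sum(counts.values()) / PARTITIONS
# The hottest partition ends up with >10x the average load.
print(f"avg records/partition: {avg:.0f}, hottest: {max(counts.values())}")
```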
We are using RocksDB, and I would like to know which metrics you think can
help us (I haven't exposed the metrics externally in a clever way yet :/).

Since the topics and partitions keep changing, and a restart usually fixes
the problem almost immediately, I find it hard to believe it has anything to
do with the topology or business logic, but I might be missing something
(since, after a restart, the lag disappears with no real effort).

Thanks




On Tue, Dec 8, 2020 at 9:35 PM Leah Thomas <lt...@confluent.io> wrote:

> [...]

Re: Sub Tasks being processed only after restart

Posted by Leah Thomas <lt...@confluent.io>.
Hi Nitay,

What version of Kafka are you running? If you could also share the topology
you're using, that would be great. Do you have a sense of whether the lag is
happening on all partitions or just a few? Also, if you're using RocksDB,
there are some RocksDB metrics in newer versions of Kafka that could be
helpful for diagnosing the issue.

Cheers,
Leah

On Mon, Dec 7, 2020 at 8:59 AM Nitay Kufert <ni...@ironsrc.com> wrote:

> [...]