You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@samza.apache.org by XiaoChuan Yu <xi...@kik.com> on 2017/10/19 19:14:11 UTC

Re: Samza Job Slow to Restart

Hi,

We were finally able to find out why the job takes so long to start.
There was higher than normal network IO during job startup and so we
checked size of the checkpoint topic on disk and it was ~21GB.
We then restarted the Kafka node who was the leader for the checkpoint
topic, the topic disk size went down to ~1.8GB and the job started up
fairly quickly.
Its probably due to a bug in Kafka where log cleaner died and we never
noticed: https://issues.apache.org/jira/browse/KAFKA-3894.
We have since been working on upgrading Kafka to avoid this bug.
Hope this helps if anyone else ever runs into it.

Thanks,
Xiaochuan Yu

On Sat, Sep 23, 2017 at 6:17 PM XiaoChuan Yu <xi...@kik.com> wrote:

> >> How long does it take?
> It took around 10 minute from "Got offset 0 for topic <checkpoint topic>
> ..." to init() being called on the Task.
>
> >> Have you measured which parts of the start up sequence take the most
> time?
> >> - is it checkpoint restoration, or restore of local state?
> Should be checkpoint restoration. There is no local state for this job.
>
> >> If reading from the checkpoint topic takes the most time, then I'd
> >> recommend reading from the beginning from that topic, and benchmarking
> how
> >> long it takes? It'll also help to verify if the checkpoint topic is
> >> actually log-compacted.
> I'm not sure how to verify how much the topic is compacted by Kafka.
> The cleanup policy is to compact though.
>
> >> Do containers eventually start? Or does the start-up hang?
> >> If so, a thread dump will be useful.
> It does eventually start up.
>
> >> Can you please link and attach the entire log file for us to take a
> look?
> Unfortunately there is too much stuff for me to redact from the log right
> now.
> However, I can tell you that the job has two input topics both with the
> following settings:
> systems.kafka.streams.my-special-topic.samza.reset.offset=true
> systems.kafka.streams.my-special-topic.samza.offset.default=upcoming
> It was thought that this would speedup startup of the job to no avail.
>
> On Wed, Sep 20, 2017 at 3:21 PM Jagadish Venkatraman <
> jagadish1989@gmail.com> wrote:
>
>> Hi Xiaochuan,
>>
>> >> What does that loop do exactly?
>>
>> Most of what the run-loop does is documented in
>> https://samza.apache.org/learn/documentation/0.9/container/event-loop.html
>>
>> >> We are running into a problem where it seems to take a very long time
>> to
>> restart a Samza job.
>>
>> Some follow-up questions,
>>
>> How long does it take?
>> Have you measured which parts of the start up sequence take the most time?
>> - is it checkpoint restoration, or restore of local state?
>> If reading from the checkpoint topic takes the most time, then I'd
>> recommend reading from the beginning from that topic, and benchmarking how
>> long it takes? It'll also help to verify if the checkpoint topic is
>> actually log-compacted.
>> Do containers eventually start? Or does the start-up hang? If so, a thread
>> dump will be useful.
>> Can you please link and attach the entire log file for us to take a look?
>>
>> >> 3. Any ideas on how to fix this?
>>
>> We can perhaps, try to narrow down where the time is spent in startup from
>> the logs? Depending on that, I can suggest a fix :-)
>>
>> Thanks,
>> Jagadish
>>
>> On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <xi...@kik.com>
>> wrote:
>>
>> > Hi,
>> >
>> > We are running into a problem where it seems to take a very long time to
>> > restart a Samza job.
>> > We are using Samza 0.9.1 at the moment.
>> >
>> > From the logs for a particular container it looks like it has something
>> to
>> > do with reading checkpoints from Kafka:
>> >
>> > 2017-09-20 03:21:02.060 INFO  o.a.s.c.kafka.KafkaCheckpointManager
>> [main]
>> > -
>> > Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and
>> > partition 0. Attempting to fetch messages for checkpoint log.
>> > 2017-09-20 03:21:02.072 INFO  o.a.s.c.kafka.KafkaCheckpointManager
>> [main]
>> > -
>> > Get latest offset 42890599 for topic
>> > __samza_checkpoint_ver_1_for_test-job_1 and partition 0.
>> >
>> > Looking at this line in KafkaCheckpointManager
>> > <https://github.com/apache/samza/blob/0.9.1/samza-kafka/
>> > src/main/scala/org/apache/samza/checkpoint/kafka/
>> > KafkaCheckpointManager.scala#L275>,
>> > it seems to indicate that the loop iterates from 0 to 42890599 and make
>> > requests for each.
>> >
>> > Questions:
>> > 1. What does that loop do exactly?
>> > 2. Is this an expected behaviour? Is "Got offset 0 for topic ..."
>> normal?
>> > 3. Any ideas on how to fix this?
>> >
>> > Thanks,
>> > Xiaochuan Yu
>> >
>>
>>
>>
>> --
>> Jagadish V,
>> Graduate Student,
>> Department of Computer Science,
>> Stanford University
>>
>

Re: Samza Job Slow to Restart

Posted by Liu Bo <di...@gmail.com>.

met the same problem before and resolved with Yi's help, xD

On 20 October 2017 at 06:10, Yi Pan <ni...@gmail.com> wrote:

> Awesome that you have figured it out! Just a general notice: any logcompact
> topic used in Samza may see this slow-down if the Kafka log cleaner thread
> dies, which include checkpoint, coordinator stream, and changelog topics.
>
> Best!
>
> -Yi
>
> On Thu, Oct 19, 2017 at 12:14 PM, XiaoChuan Yu <xi...@kik.com>
> wrote:
>
> > Hi,
> >
> > We were finally able to find out why the job takes so long to start.
> > There was higher than normal network IO during job startup and so we
> > checked size of the checkpoint topic on disk and it was ~21GB.
> > We then restarted the Kafka node who was the leader for the checkpoint
> > topic, the topic disk size went down to ~1.8GB and the job started up
> > fairly quickly.
> > Its probably due to a bug in Kafka where log cleaner died and we never
> > noticed: https://issues.apache.org/jira/browse/KAFKA-3894.
> > We have since been working on upgrading Kafka to avoid this bug.
> > Hope this helps if anyone else ever runs into it.
> >
> > Thanks,
> > Xiaochuan Yu
> >
> > On Sat, Sep 23, 2017 at 6:17 PM XiaoChuan Yu <xi...@kik.com>
> wrote:
> >
> > > >> How long does it take?
> > > It took around 10 minute from "Got offset 0 for topic <checkpoint
> topic>
> > > ..." to init() being called on the Task.
> > >
> > > >> Have you measured which parts of the start up sequence take the most
> > > time?
> > > >> - is it checkpoint restoration, or restore of local state?
> > > Should be checkpoint restoration. There is no local state for this job.
> > >
> > > >> If reading from the checkpoint topic takes the most time, then I'd
> > > >> recommend reading from the beginning from that topic, and
> benchmarking
> > > how
> > > >> long it takes? It'll also help to verify if the checkpoint topic is
> > > >> actually log-compacted.
> > > I'm not sure how to verify how much the topic is compacted by Kafka.
> > > The cleanup policy is to compact though.
> > >
> > > >> Do containers eventually start? Or does the start-up hang?
> > > >> If so, a thread dump will be useful.
> > > It does eventually start up.
> > >
> > > >> Can you please link and attach the entire log file for us to take a
> > > look?
> > > Unfortunately there is too much stuff for me to redact from the log
> right
> > > now.
> > > However, I can tell you that the job has two input topics both with the
> > > following settings:
> > > systems.kafka.streams.my-special-topic.samza.reset.offset=true
> > > systems.kafka.streams.my-special-topic.samza.offset.default=upcoming
> > > It was thought that this would speedup startup of the job to no avail.
> > >
> > > On Wed, Sep 20, 2017 at 3:21 PM Jagadish Venkatraman <
> > > jagadish1989@gmail.com> wrote:
> > >
> > >> Hi Xiaochuan,
> > >>
> > >> >> What does that loop do exactly?
> > >>
> > >> Most of what the run-loop does is documented in
> > >> https://samza.apache.org/learn/documentation/0.9/
> > container/event-loop.html
> > >>
> > >> >> We are running into a problem where it seems to take a very long
> time
> > >> to
> > >> restart a Samza job.
> > >>
> > >> Some follow-up questions,
> > >>
> > >> How long does it take?
> > >> Have you measured which parts of the start up sequence take the most
> > time?
> > >> - is it checkpoint restoration, or restore of local state?
> > >> If reading from the checkpoint topic takes the most time, then I'd
> > >> recommend reading from the beginning from that topic, and benchmarking
> > how
> > >> long it takes? It'll also help to verify if the checkpoint topic is
> > >> actually log-compacted.
> > >> Do containers eventually start? Or does the start-up hang? If so, a
> > thread
> > >> dump will be useful.
> > >> Can you please link and attach the entire log file for us to take a
> > look?
> > >>
> > >> >> 3. Any ideas on how to fix this?
> > >>
> > >> We can perhaps, try to narrow down where the time is spent in startup
> > from
> > >> the logs? Depending on that, I can suggest a fix :-)
> > >>
> > >> Thanks,
> > >> Jagadish
> > >>
> > >> On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <xi...@kik.com>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > We are running into a problem where it seems to take a very long
> time
> > to
> > >> > restart a Samza job.
> > >> > We are using Samza 0.9.1 at the moment.
> > >> >
> > >> > From the logs for a particular container it looks like it has
> > something
> > >> to
> > >> > do with reading checkpoints from Kafka:
> > >> >
> > >> > 2017-09-20 03:21:02.060 INFO  o.a.s.c.kafka.KafkaCheckpointManager
> > >> [main]
> > >> > -
> > >> > Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and
> > >> > partition 0. Attempting to fetch messages for checkpoint log.
> > >> > 2017-09-20 03:21:02.072 INFO  o.a.s.c.kafka.KafkaCheckpointManager
> > >> [main]
> > >> > -
> > >> > Get latest offset 42890599 for topic
> > >> > __samza_checkpoint_ver_1_for_test-job_1 and partition 0.
> > >> >
> > >> > Looking at this line in KafkaCheckpointManager
> > >> > <https://github.com/apache/samza/blob/0.9.1/samza-kafka/
> > >> > src/main/scala/org/apache/samza/checkpoint/kafka/
> > >> > KafkaCheckpointManager.scala#L275>,
> > >> > it seems to indicate that the loop iterates from 0 to 42890599 and
> > make
> > >> > requests for each.
> > >> >
> > >> > Questions:
> > >> > 1. What does that loop do exactly?
> > >> > 2. Is this an expected behaviour? Is "Got offset 0 for topic ..."
> > >> normal?
> > >> > 3. Any ideas on how to fix this?
> > >> >
> > >> > Thanks,
> > >> > Xiaochuan Yu
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Jagadish V,
> > >> Graduate Student,
> > >> Department of Computer Science,
> > >> Stanford University
> > >>
> > >
> >
>



-- 
All the best

Liu Bo

Re: Samza Job Slow to Restart

Posted by Yi Pan <ni...@gmail.com>.

Awesome that you have figured it out! Just a general notice: any logcompact
topic used in Samza may see this slow-down if the Kafka log cleaner thread
dies, which include checkpoint, coordinator stream, and changelog topics.

Best!

-Yi

On Thu, Oct 19, 2017 at 12:14 PM, XiaoChuan Yu <xi...@kik.com> wrote:

> Hi,
>
> We were finally able to find out why the job takes so long to start.
> There was higher than normal network IO during job startup and so we
> checked size of the checkpoint topic on disk and it was ~21GB.
> We then restarted the Kafka node who was the leader for the checkpoint
> topic, the topic disk size went down to ~1.8GB and the job started up
> fairly quickly.
> Its probably due to a bug in Kafka where log cleaner died and we never
> noticed: https://issues.apache.org/jira/browse/KAFKA-3894.
> We have since been working on upgrading Kafka to avoid this bug.
> Hope this helps if anyone else ever runs into it.
>
> Thanks,
> Xiaochuan Yu
>
> On Sat, Sep 23, 2017 at 6:17 PM XiaoChuan Yu <xi...@kik.com> wrote:
>
> > >> How long does it take?
> > It took around 10 minute from "Got offset 0 for topic <checkpoint topic>
> > ..." to init() being called on the Task.
> >
> > >> Have you measured which parts of the start up sequence take the most
> > time?
> > >> - is it checkpoint restoration, or restore of local state?
> > Should be checkpoint restoration. There is no local state for this job.
> >
> > >> If reading from the checkpoint topic takes the most time, then I'd
> > >> recommend reading from the beginning from that topic, and benchmarking
> > how
> > >> long it takes? It'll also help to verify if the checkpoint topic is
> > >> actually log-compacted.
> > I'm not sure how to verify how much the topic is compacted by Kafka.
> > The cleanup policy is to compact though.
> >
> > >> Do containers eventually start? Or does the start-up hang?
> > >> If so, a thread dump will be useful.
> > It does eventually start up.
> >
> > >> Can you please link and attach the entire log file for us to take a
> > look?
> > Unfortunately there is too much stuff for me to redact from the log right
> > now.
> > However, I can tell you that the job has two input topics both with the
> > following settings:
> > systems.kafka.streams.my-special-topic.samza.reset.offset=true
> > systems.kafka.streams.my-special-topic.samza.offset.default=upcoming
> > It was thought that this would speedup startup of the job to no avail.
> >
> > On Wed, Sep 20, 2017 at 3:21 PM Jagadish Venkatraman <
> > jagadish1989@gmail.com> wrote:
> >
> >> Hi Xiaochuan,
> >>
> >> >> What does that loop do exactly?
> >>
> >> Most of what the run-loop does is documented in
> >> https://samza.apache.org/learn/documentation/0.9/
> container/event-loop.html
> >>
> >> >> We are running into a problem where it seems to take a very long time
> >> to
> >> restart a Samza job.
> >>
> >> Some follow-up questions,
> >>
> >> How long does it take?
> >> Have you measured which parts of the start up sequence take the most
> time?
> >> - is it checkpoint restoration, or restore of local state?
> >> If reading from the checkpoint topic takes the most time, then I'd
> >> recommend reading from the beginning from that topic, and benchmarking
> how
> >> long it takes? It'll also help to verify if the checkpoint topic is
> >> actually log-compacted.
> >> Do containers eventually start? Or does the start-up hang? If so, a
> thread
> >> dump will be useful.
> >> Can you please link and attach the entire log file for us to take a
> look?
> >>
> >> >> 3. Any ideas on how to fix this?
> >>
> >> We can perhaps, try to narrow down where the time is spent in startup
> from
> >> the logs? Depending on that, I can suggest a fix :-)
> >>
> >> Thanks,
> >> Jagadish
> >>
> >> On Wed, Sep 20, 2017 at 11:21 AM, XiaoChuan Yu <xi...@kik.com>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > We are running into a problem where it seems to take a very long time
> to
> >> > restart a Samza job.
> >> > We are using Samza 0.9.1 at the moment.
> >> >
> >> > From the logs for a particular container it looks like it has
> something
> >> to
> >> > do with reading checkpoints from Kafka:
> >> >
> >> > 2017-09-20 03:21:02.060 INFO  o.a.s.c.kafka.KafkaCheckpointManager
> >> [main]
> >> > -
> >> > Got offset 0 for topic __samza_checkpoint_ver_1_for_test-job_1 and
> >> > partition 0. Attempting to fetch messages for checkpoint log.
> >> > 2017-09-20 03:21:02.072 INFO  o.a.s.c.kafka.KafkaCheckpointManager
> >> [main]
> >> > -
> >> > Get latest offset 42890599 for topic
> >> > __samza_checkpoint_ver_1_for_test-job_1 and partition 0.
> >> >
> >> > Looking at this line in KafkaCheckpointManager
> >> > <https://github.com/apache/samza/blob/0.9.1/samza-kafka/
> >> > src/main/scala/org/apache/samza/checkpoint/kafka/
> >> > KafkaCheckpointManager.scala#L275>,
> >> > it seems to indicate that the loop iterates from 0 to 42890599 and
> make
> >> > requests for each.
> >> >
> >> > Questions:
> >> > 1. What does that loop do exactly?
> >> > 2. Is this an expected behaviour? Is "Got offset 0 for topic ..."
> >> normal?
> >> > 3. Any ideas on how to fix this?
> >> >
> >> > Thanks,
> >> > Xiaochuan Yu
> >> >
> >>
> >>
> >>
> >> --
> >> Jagadish V,
> >> Graduate Student,
> >> Department of Computer Science,
> >> Stanford University
> >>
> >
>