Posted to user@helix.apache.org by "Krishnan Nair, Praveen" <Pr...@IGT.com> on 2019/12/27 20:30:41 UTC

apache gobblin on yarn using helix - job getting stuck without error message

Hi,

I am trying to configure Apache Gobblin on YARN to pull data from Postgres into HDFS and store it as Avro files in daily partitions. Gobblin uses Helix version 0.8.2 to manage the tasks.
I am facing an issue where the job gets stuck when the data volume is increased: some of the tasks are marked as completed (as per the debug logs), but their result files are missing.

There are 63 tasks, one for each partition of this job, and I can see from the logs that 4 task runners are initialized and assigned tasks.
After most of the task result files have been created in the task output dir, the job gets stuck, with no error message or exception.
With a reduced volume of data the same configuration works and the job finishes. When it does get stuck, that happens roughly 25 - 30 minutes in.

One such task, logged as COMPLETED but with its file missing from the output dir, is ..._1577133620749_3, as shown below.

2019-12-23 20:40:28 UTC WARN [GenericHelixController-event_process] org.apache.helix.task.assigner.AssignableInstance - AssignableInstance does not have enough capacity for quotaType: DEFAULT. Task: 23b17106-6e18-4516-9745-879a3f6a30b8, quotaType: DEFAULT, Instance name: GobblinYarnTaskRunner_2. Current capacity: 40 capacity needed to schedule: 40

2019-12-23 20:40:32 UTC DEBUG [GenericHelixController-event_process] org.apache.helix.task.AbstractTaskDispatcher - Setting task partition job_ArchiveJob16_1577133620749_job_ArchiveJob16_1577133620749_3 state to RUNNING on instance GobblinYarnTaskRunner_1

2019-12-23 20:48:38 UTC DEBUG [GenericHelixController-event_process] org.apache.helix.task.AbstractTaskDispatcher - Task partition job_ArchiveJob16_1577133620749_job_ArchiveJob16_1577133620749_3 has a pending state transition on instance GobblinYarnTaskRunner_4. Using the previous ideal state which was RUNNING.
2019-12-23 20:50:20 UTC DEBUG [GenericHelixController-event_process] org.apache.helix.task.AbstractTaskDispatcher - Setting task partition job_ArchiveJob16_1577133620749_job_ArchiveJob16_1577133620749_3 state to RUNNING on instance GobblinYarnTaskRunner_4.

2019-12-23 20:50:22 UTC DEBUG [GenericHelixController-event_process] org.apache.helix.task.AbstractTaskDispatcher - Task partition job_ArchiveJob16_1577133620749_job_ArchiveJob16_1577133620749_3 has completed with state COMPLETED. Marking as such in rebalancer context.

I am not sure how to figure out what is happening with the job. I would appreciate any advice or suggestions.

Thanks & Regards,
Praveen

CONFIDENTIALITY NOTICE: This message is the property of International Game Technology PLC and/or its subsidiaries and may contain proprietary, confidential or trade secret information. This message is intended solely for the use of the addressee. If you are not the intended recipient and have received this message in error, please delete this message from your system. Any unauthorized reading, distribution, copying, or other use of this message or its attachments is strictly prohibited.

Re: apache gobblin on yarn using helix - job getting stuck without error message

Posted by Hunter Lee <na...@gmail.com>.

As for why the output file is missing, you'll have to look at the
Participant (worker/task runner) log.
I am not too sure about what you mean by

"Here one thing puzzling me is, if I check the log each task of the job has
debug logs suggesting it changed state from RUNNING to COMPLETED at some
point but job is still unfinished."

Once something is COMPLETED, it should never change states because
COMPLETED is a terminal state.
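
To check this against the controller log, a minimal sketch (not part of
Helix or Gobblin) could pull out every state assignment logged for one
task partition and flag anything logged after COMPLETED. The regular
expressions assume the DEBUG message wording quoted earlier in this
thread, and the function names are hypothetical.

```python
import re

# Patterns assume the controller DEBUG wording quoted in this thread.
TS = r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC "
SET_STATE = re.compile(
    TS + r".*?Setting task partition (?P<part>\S+) "
    r"state to (?P<state>\w+) on instance (?P<inst>\w+)"
)
DONE = re.compile(
    TS + r".*?Task partition (?P<part>\S+) "
    r"has completed with state (?P<state>\w+)"
)


def timeline(lines, partition):
    """Return (timestamp, state, instance) tuples for one task partition."""
    events = []
    for line in lines:
        m = SET_STATE.search(line) or DONE.search(line)
        if m and m.group("part") == partition:
            # Completion lines carry no instance name, so this may be None.
            inst = m.groupdict().get("inst")
            events.append((m.group("ts"), m.group("state"), inst))
    return events


def events_after_completed(events):
    """Return events logged after the first COMPLETED (expected: none)."""
    for i, (_, state, _) in enumerate(events):
        if state == "COMPLETED":
            return events[i + 1:]
    return []
```

Fed the controller lines quoted in the original message, the timeline
for partition ..._3 is RUNNING on GobblinYarnTaskRunner_1, RUNNING on
GobblinYarnTaskRunner_4, then COMPLETED, with nothing logged after it.
This only covers the controller's view; the Participant log is still
needed to see what the task itself did.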

Hunter

RE: apache gobblin on yarn using helix - job getting stuck without error message

Posted by "Krishnan Nair, Praveen" <Pr...@IGT.com>.

Thanks Lee!


Yes, I posted a request in the Gobblin forum and am waiting for some advice.



Here one thing puzzling me is, if I check the log each task of the job has debug logs suggesting it changed state from RUNNING to COMPLETED at some point but job is still unfinished.

The capacity-full message was for task runner 2. But task _3 was first assigned to task runner 1, then moved to task runner 4, and transitioned from RUNNING to COMPLETED, yet there is still no result file.

I will try increasing the capacity from 40 or increasing the number of task runners, but I didn't see any timeout or abort logs.



“ArchiveJob16_1577133620749_3  has completed with state COMPLETED. Marking as such in rebalancer context.”



After that, there is no log suggesting that any task is still pending in RUNNING, yet there is no result file in the task output dir.



And then there are logs like:

2019-12-23 21:01:02 UTC DEBUG [GenericHelixController-event_process] org.apache.helix.task.JobRebalancer  - All partitions: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62] taskAssignment: {GobblinYarnTaskRunner_1=[], GobblinYarnTaskRunner_2=[], GobblinYarnTaskRunner_3=[], GobblinYarnTaskRunner_4=[]} excludedInstances: []

2019-12-23 21:01:02 UTC WARN  [GenericHelixController-event_process] org.apache.helix.task.assigner.ThreadCountBasedTaskAssigner  - No task to assign!

2019-12-23 21:01:02 UTC DEBUG [GenericHelixController-event_process] org.apache.helix.manager.zk.zookeeper.ZkClient  - Waiting for keeper state SyncConnected

2019-12-23 21:01:02 UTC DEBUG [GenericHelixController-event_process] org.apache.helix.manager.zk.zookeeper.ZkClient  - State is SyncConnected
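
A small sketch of how the JobRebalancer line above could be checked mechanically: it parses the partition list and the taskAssignment map and returns what each task runner currently has assigned. The regular expressions assume the exact message layout shown in this log excerpt; the function names are hypothetical and not part of Helix.

```python
import re

# Patterns assume the JobRebalancer DEBUG layout quoted in this message.
PARTITIONS = re.compile(r"All partitions: \[(?P<parts>[^\]]*)\]")
ASSIGNMENT = re.compile(r"taskAssignment: \{(?P<body>.*?)\} excludedInstances")
ENTRY = re.compile(r"(?P<inst>\w+)=\[(?P<parts>[^\]]*)\]")


def _ints(text):
    """Parse a comma-separated list of integers; empty text gives []."""
    return [int(p) for p in text.split(",") if p.strip()]


def parse_rebalancer_line(line):
    """Return (all_partitions, {instance: [assigned partitions]})."""
    parts = PARTITIONS.search(line)
    assign = ASSIGNMENT.search(line)
    if not parts or not assign:
        raise ValueError("not a JobRebalancer assignment line")
    per_instance = {
        m.group("inst"): _ints(m.group("parts"))
        for m in ENTRY.finditer(assign.group("body"))
    }
    return _ints(parts.group("parts")), per_instance
```

On the line quoted above this gives 63 partitions and an empty list for each of the four task runners, which is consistent with the "No task to assign!" warning that follows it.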



Thanks & Regards,

Praveen


Re: apache gobblin on yarn using helix - job getting stuck without error message

Posted by Hunter Lee <na...@gmail.com>.

2019-12-23 20:40:28 UTC WARN  [GenericHelixController-event_process] org.apache.helix.task.assigner.AssignableInstance  - AssignableInstance does not have enough capacity for quotaType: DEFAULT. Task: 23b17106-6e18-4516-9745-879a3f6a30b8, quotaType: DEFAULT, Instance name: GobblinYarnTaskRunner_2. Current capacity: 40 capacity needed to schedule: 40


The log message above is telling. Helix Task Framework has a statically
defined capacity per instance of 40 tasks. For some reason, you have all of
your capacity full for GobblinYarnTaskRunner_2 (which should be a Helix
Participant).


This may mean things like:

1. You should add more Helix Participants (task runners) to give the
cluster more capacity for tasks.

2. Your tasks are not completing properly - meaning they are stuck in
RUNNING state. This usually is due to the user-defined logic (either your
own or Apache Gobblin's task ingestion logic).
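
To see which participants hit the limit and how often, a rough sketch
like the following could tally the AssignableInstance capacity warnings
per instance from the controller log. The pattern assumes the WARN
wording quoted above, with each log entry on a single line;
`capacity_warnings` is a hypothetical helper, not a Helix API.

```python
import re
from collections import Counter

# Pattern assumes the AssignableInstance WARN wording quoted above.
CAPACITY_WARN = re.compile(
    r"does not have enough capacity for quotaType: (?P<quota>\w+)\. "
    r"Task: (?P<task>[\w-]+), quotaType: \w+, "
    r"Instance name: (?P<inst>\w+)\. "
    r"Current capacity: (?P<cur>\d+) "
    r"capacity needed to schedule: (?P<need>\d+)"
)


def capacity_warnings(lines):
    """Count capacity warnings per (instance, quota type)."""
    counts = Counter()
    for line in lines:
        m = CAPACITY_WARN.search(line)
        if m:
            counts[(m.group("inst"), m.group("quota"))] += 1
    return counts
```

If the count keeps growing for one participant while others stay at
zero, that points toward tasks piling up on that instance rather than a
cluster-wide shortage of capacity.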


We collaborate with Apache Gobblin quite often and they have been using
Helix Task Framework for data ingestion with success. I also suggest you
reach out to Gobblin's community to see if your configs are set correctly.


Hunter

