Posted to user@oozie.apache.org by Virag Kothari <vi...@yahoo-inc.com> on 2012/10/01 22:43:01 UTC

Re: Action stuck in PREP state.

Hey Eduardo,

No need to be sorry at all. I was just trying to get a better understanding
of the problem.

Usually after the workflow action submits a hadoop job, it moves from PREP
to RUNNING. But as the action is stuck in PREP, I don't think it was able
to submit a hadoop job. I am not sure why this happened. But does the
hadoop cluster have enough map slots?
Also, it would be helpful if you could paste the logs related to the stuck
Pig action (0000173-120927111953690-oozie-oozi-W@pig-node).
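
If it helps, you can usually pull the logs for a single workflow with the
Oozie CLI, something like this (adjust the -oozie URL to your server):

  oozie job -oozie http://localhost:11000/oozie -log 0000173-120927111953690-oozie-oozi-W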

Thanks,
Virag

On 9/29/12 4:50 PM, "Eduardo Afonso Ferreira" <ea...@yahoo.com> wrote:

>Hey, Virag.
>
>Sorry I probably did not explain the problem properly.
>Let me try to make it more clear :)
>
>I have a Coordinator with frequency=3 (minutes) and concurrency=2.
>My Coordinator launches one app Workflow.
>My Workflow has 2 actions, 1st is a Pig, 2nd is a Shell.
>The Workflow launches the Pig action.
>The Pig action completes and the Workflow launches the Shell action.
>When both actions complete, the Workflow completes.
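>
>For reference, a trimmed-down sketch of what the coordinator definition
>looks like (names, paths, and dates here are placeholders):
>
>  <coordinator-app name="my-coord" frequency="3"
>                   start="2012-09-01T00:00Z" end="2012-12-31T00:00Z"
>                   timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
>    <controls>
>      <concurrency>2</concurrency>
>    </controls>
>    <action>
>      <workflow>
>        <app-path>${workflowAppPath}</app-path>
>      </workflow>
>    </action>
>  </coordinator-app>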
>
>I'm sure you know how this works very well.
>
>Under high traffic, my pig actions have a lot of data to process and
>they take longer than 3 minutes to complete.
>So, the coordinator launches up to 2 workflows due to concurrency=2.
>As traffic stays heavy, several workflow instances are moved to the
>READY state to run when there's room.
>When traffic diminishes, the processing will catch up and eventually all
>READY workflows will run.
>Most of the time, everything works as expected.
>
>Here's when the problem happens every once in a while:
>The Pig action gets stuck in the PREP state for some reason I don't know.
>When that happens, here's what I have:
>
>Coordinator State: RUNNING
>Workflow State: RUNNING
>Action (Pig) State: PREP
>
>
>As the Workflow is RUNNING, the Coordinator won't launch a new Workflow
>instance to "occupy" that concurrency slot.
>When 2 Workflows get in that situation, both slots in my Coordinator will
>be occupied and no more Workflows will be launched.
>
>
>Let me know if this is enough information (sorry, I probably still wrote
>too much).
>Thank you.
>Eduardo.
>
>
>________________________________
> From: Virag Kothari <vi...@yahoo-inc.com>
>To: "user@oozie.apache.org" <us...@oozie.apache.org>; Eduardo Afonso
>Ferreira <ea...@yahoo.com>
>Sent: Friday, September 28, 2012 4:12 PM
>Subject: Re: Action stuck in PREP state.
>
>Hi Eduardo,
>
>I am a bit confused by this entire thread of discussion. From your initial
>question, I thought you meant your workflow action was stuck in 'PREP' and
>not coordinator action.
>If that is true, the coordinator is running fine (RUNNING state) and has
>already spawned the workflow.
>You don't need to modify any coordinator related knobs. Can you clarify
>what is getting stuck (workflow, coordinator) and in which state (PREP,
>WAITING, etc.)? Also, please paste the entire logs related to that action.
>
>Thanks,
>Virag
>
>
>
>On 9/28/12 11:53 AM, "Mona Chitnis" <ch...@yahoo-inc.com> wrote:
>
>>
>>See answers inline
>>
>>--
>>Mona Chitnis
>>
>>
>>On 9/28/12 9:22 AM, "Eduardo Afonso Ferreira"
>><ea...@yahoo.com> wrote:
>>
>>Hey, Mona,
>>
>>Hi Eduardo,
>>
>>Thanks for the deliberation on the suggestions. For max concurrent db
>>connections, we do not have any benchmarks currently around a recommended
>>number, but it can be an estimate of what you think your particular
>>database can handle without hitting network issues. For example, on some
>>prod Oracle servers I have seen it set to ~300.
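>>
>>For illustration, that setting goes in oozie-site.xml; the value below is
>>just an example, not a recommendation:
>>
>>  <property>
>>    <name>oozie.service.JPAService.pool.max.active.conn</name>
>>    <value>100</value>
>>  </property>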
>>
>>
>>The list of jobs in the PROD environment is basically the same as in REF.
>>
>>When you mentioned "more database-intensive" I assume you're referring to
>>the communication with the Oozie DB and in that case, nothing is
>>different. My jobs don't do any extra communication with the Oozie
>>DB. I just submit a handful of coordinators to Oozie server and let Oozie
>>take care of the rest.
>>
>>The main difference between REF and PROD is that in PROD we have much
>>more data to process and therefore the jobs that run on the hadoop
>>cluster take longer to complete and we may have a larger number of
>>concurrent oozie workflows running. So, I see the importance of setting
>>the max active conn accordingly
>>(oozie.service.JPAService.pool.max.active.conn). I currently have it at
>>50, but do you have a recommendation on the ideal value for that? How do
>>you determine what number is good, like not too small but not too large?
>>
>>If you want to configure your Oozie server to check for available input
>>dependencies more frequently, you can reduce the
>>"oozie.service.coord.input.check.requeue.interval" (default is 60 sec).
>>It will result in more memory usage (oozie queue) and network
>>usage (requests to the NN), but it might make your actions start sooner.
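>>
>>For example, in oozie-site.xml (the value is in milliseconds if I remember
>>correctly, so this would be 30 sec; illustrative only):
>>
>>  <property>
>>    <name>oozie.service.coord.input.check.requeue.interval</name>
>>    <value>30000</value>
>>  </property>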
>>
>>What does the timeout attribute do
>>(oozie.service.coord.normal.default.timeout)? It's currently set at 120
>>(minutes).
>>
>>This is the timeout for a coordinator action input check (in minutes) for
>>a normal job. But the default value is pretty high, so unless you set a
>>smaller non-default value, this is nothing to worry about.
>>
>>How do the throttle attributes affect the way coordinators launch new
>>workflows?
>>I'm talking about the attributes oozie.service.coord.default.throttle
>>(12) and oozie.service.coord.materialization.throttling.factor (0.05).
>>Are there any other attributes I could or should use that are related to
>>materialization?
>>
>>oozie.service.coord.default.throttle (12) controls how many actions per
>>coordinator can be in the WAITING state concurrently. WAITING is when the
>>input dependency checks occur. The "materialization.throttling.factor" is
>>similar, but it is a function (percentage) of your queue size. This
>>makes more sense in a multi-tenant environment where you don't want the
>>Oozie command queue getting filled up by only one user's jobs.
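>>
>>A rough worked example, assuming the default command queue size of 10000
>>(oozie.service.CallableQueueService.queue.size): a throttling factor of
>>0.05 works out to 0.05 * 10000 = 500 materialized actions at most, while
>>oozie.service.coord.default.throttle (12) caps each coordinator
>>individually.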
>>
>>
>>Also, would you elaborate a little more on how I can control the
>>materialization so that my coordinators won't get stuck like it's
>>happening every once in a while in my PROD environment?
>>
>>
>>How should I set those attributes to cause coordinators to do things like
>>these:
>>- Always launch new jobs (workflows) if their nominal time is reached and
>>there's room (max concurrency not reached yet).
>>- Launch most current jobs first (execution=LIFO)
>>
>>Use the coordinator <controls><execution> tag and set it to LIFO (see the
>>sketch after this list). Possible values are
>>
>>* FIFO (oldest first) default
>>* LIFO (newest first)
>>* LAST_ONLY (discards all older materializations)
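>>
>>A minimal sketch of where that goes in the coordinator XML (the rest of
>>the app definition is elided):
>>
>>  <coordinator-app ...>
>>    <controls>
>>      <execution>LIFO</execution>
>>    </controls>
>>    ...
>>  </coordinator-app>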
>>
>>- Don't create a bunch of WAITING jobs. I don't care about WAITING
>>(future) jobs, only about READY ones, i.e. ones whose nominal time is
>>reached (or about to be reached).
>>
>>Jobs will go to WAITING first, have their input dependencies checked, and
>>then become READY. For this case (to avoid stuck jobs: fewer jobs in
>>flight, but READY more quickly), decrease your throttle values and the
>>input check requeue interval (not by too much) to have actions addressed
>>faster.
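>>
>>E.g., in oozie-site.xml (values below are only illustrative):
>>
>>  <property>
>>    <name>oozie.service.coord.default.throttle</name>
>>    <value>6</value>
>>  </property>
>>  <property>
>>    <name>oozie.service.coord.materialization.throttling.factor</name>
>>    <value>0.02</value>
>>  </property>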
>>
>>- Don't leave actions in the PREP state forever. If they can't start for
>>any reason, either retry or get them out of the way (FAIL, KILL,
>>whatever). Whenever an action is left in the PREP state, one of the
>>concurrency slots is blocked and if several actions get in the same
>>scenario, pretty soon the coordinator will get stuck with no more room to
>>launch new tasks.
>>
>>Actions will not block concurrency slots in PREP (PREP just means the
>>action id has been persisted in the database).
>>
>>Let me know if the above steps work for you. Happy to help.
>>
>>
>>
>>Thank you.
>>Eduardo.
>>
>>
>>
>>________________________________
>>From: Mona Chitnis <ch...@yahoo-inc.com>
>>To: "user@oozie.apache.org" <us...@oozie.apache.org>; Eduardo Afonso
>>Ferreira <ea...@yahoo.com>
>>Sent: Thursday, September 27, 2012 6:25 PM
>>Subject: Re: Action stuck in PREP state.
>>Hi Eduardo,
>>
>>If your PROD environment has jobs that are more database-intensive, can
>>you check/increase your oozie server settings for the following
>>
>>*   oozie.service.JPAService.pool.max.active.conn
>>*   oozie.service.coord.normal.default.timeout
>>
>>Other properties to check
>>
>>*   oozie.command.default.lock.timeout
>>*   If materialization window value is large (you want more coord actions
>>to get materialized simultaneously), but the throttling factor is low,
>>then your actions will stay in PREP
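>>
>>For reference, these also live in oozie-site.xml; values below are only
>>illustrative:
>>
>>  <property>
>>    <name>oozie.service.coord.normal.default.timeout</name>
>>    <value>120</value> <!-- minutes -->
>>  </property>
>>  <property>
>>    <name>oozie.command.default.lock.timeout</name>
>>    <value>5000</value> <!-- milliseconds, if I remember right -->
>>  </property>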
>>
>>Your log errors are pointing towards transaction-level problems. Can you
>>elaborate a bit more on the difference between your REF and PROD
>>environments?
>>
>>--
>>Mona Chitnis
>>
>>
>>
>>On 9/27/12 3:05 PM, "Eduardo Afonso Ferreira" <ea...@yahoo.com> wrote:
>>
>>Hi there,
>>
>>I've seen some posts about this problem, but I still could not determine
>>what causes it or how to fix it.
>>
>>I had been running Oozie 2.3.2 from Cloudera's package (2.3.2-cdh3u3)
>>until this morning, on a VM running Ubuntu 10.04.3 LTS.
>>The database is MySQL running on another VM, same Ubuntu version, MySQL
>>server version 5.1.63-0ubuntu0.10.04.1-log (as displayed by mysql when I
>>connect).
>>
>>I have 5 to 10 coordinators, each launching workflows with frequency=3
>>and concurrency=2.
>>
>>My workflows run 2-3 actions each, java, pig, shell (python). Nothing too
>>heavy.
>>
>>
>>This morning I upgraded Oozie to version 3.2.0, built from the stable
>>branch I downloaded from
>>http://incubator.apache.org/oozie/Downloads.html (06-Jun-2012 18:43).
>>
>>I ran this version for at least one week, maybe two, in a REF environment
>>without any problems, but I'm having issues in PROD.
>>
>>I see connection issues to MySQL, timeouts, and workflow actions getting
>>stuck in the PREP state.
>>
>>Do you guys know what could be causing this problem? Anything I may have
>>missed in the PROD environment?
>>
>>Related to the problem, the oozie.log file displays the following:
>>
>>
>>2012-09-27 13:39:00,403 DEBUG ActionStartXCommand:545 - USER[aspen]
>>GROUP[-] TOKEN[] APP[ad_counts-wf]
>>JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Acquired lock for
>>[0000173-120927111953690-oozie-oozi-W] in [action.start]
>>2012-09-27 13:39:00,404 DEBUG ActionStartXCommand:545 - USER[aspen]
>>GROUP[-] TOKEN[] APP[ad_counts-wf]
>>JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Load state for
>>[0000173-120927111953690-oozie-oozi-W]
>>2012-09-27 13:39:00,403 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-]
>>APP[-] JOB[-] ACTION[-] Executing JPAExecutor
>>[WorkflowJobUpdateJPAExecutor]
>>2012-09-27 13:39:00,404 DEBUG JPAService:548 - USER[-] GROUP[-] TOKEN[-]
>>APP[-] JOB[-] ACTION[-] Executing JPAExecutor [WorkflowJobGetJPAExecutor]
>>2012-09-27 13:39:00,405  WARN JPAService:542 - USER[-] GROUP[-] TOKEN[-]
>>APP[-] JOB[-] ACTION[-] JPAExecutor [WorkflowJobGetJPAExecutor] ended
>>with an active transaction, rolling back
>>2012-09-27 13:39:00,405 DEBUG ActionStartXCommand:545 - USER[aspen]
>>GROUP[-] TOKEN[] APP[ad_counts-wf]
>>JOB[0000173-120927111953690-oozie-oozi-W] ACTION[-] Released lock for
>>[0000173-120927111953690-oozie-oozi-W] in [action.start]
>>
>>
>>Actions that complete their processing perform other operations before
>>releasing the lock shown on the last line above. I suspect something is
>>failing before that, but I don't see a log message indicating what
>>happened.
>>
>>If you have insights on this problem, please help me.
>>
>>Thank you.
>>Eduardo.